Good morning, good afternoon, good evening, good whatever time it is, wherever you are in our amazing world. Thanks for joining the CDO Matters podcast. I’m your host, and I’m also the CDO of Profisee.
If you’ve got questions about MDM, connect with me on LinkedIn. I’ll happily answer any questions about MDM strategy, data governance, you name it.
On today’s episode of the CDO Matters podcast, I’m thrilled to be joined by Junaid. We are going all in on unstructured data: governing it, managing it, classifying it, fixing it, you name it. If you’re interested in finding a way to better manage all of the unstructured data you’ve got, and all of it that is going to keep being produced, this episode is for you.
This is a topic very near and very dear to my heart.
According to Gartner, eighty to ninety percent of all data is unstructured.
And the funny thing is that generative forms of AI prefer unstructured data; they thrive on a diet of unstructured data.
We got all this data out there.
We’re gonna need to get our hands around it. We’re gonna need to govern it at scale.
We are going to need to manage it at scale, classify it at scale, all of these things. That’s today’s topic.
Junaid and I met in person at the DGIQ conference in Anaheim, where Junaid was giving a fantastic presentation that I attended. That was the genesis for today’s conversation.
He’s also the founder of Pegasus 19 Consulting, and a former SVP at Citizens Bank.
And fun fact, which we figured out during your presentation: we were at Gartner at the same time.
Yep.
I think you said, “Well, when I was at Gartner,” and I was in the audience thinking, hey, wait a minute. So was I.
And then we figured out that we were supposed to have met by that point already, because we were supposed to be introduced by our friend in common, Angelie Bansal.
Angelie had said, hey, you’ve got to connect when you’re at the event. I didn’t put two and two together, but the person I was supposed to connect with was you.
Yep. Well, it’s funny. I’m kinda glad we were introduced by Angelie.
I think it’s a fun fact that we worked at Gartner at the same time, and unfortunate that we didn’t run into each other directly back then. But I have been a follower of your podcast and your thought leadership, so I couldn’t be more excited to get connected with you and to be here with you today.
Well, it’s my honor. I learned from your presentation, and I’m looking forward to learning from you today.
So let’s dive into it. Unstructured data. Why should we be having this conversation today? I mean, I kinda teed it up, I talked all about AI. But from your perspective, why is unstructured data something that data leaders, data managers, and practitioners should be more focused on?
Yeah. You said it all, which is that eighty to ninety percent of the data that’s out there at any given company is unstructured data. The sheer volume of it is reason enough to go look at what’s happening. And I also agree with you that we previously never had the technology to go solve the problem, I feel.
AI has enabled that: natural language processing, better OCR technology.
The technology now exists. And aside from AI, I also think that we have more processing power than we have ever had previously.
We have more storage capacity than we’ve ever had previously.
And all of that has become cheaper.
So you have more space to store data, you have more processing power to go compute on it, and you have the technology. So I think the timing is now right to go look at it. And again, given the sheer volume of it, there’s undoubtedly, I feel, a gold mine, or pockets of gold, out there that we have to go look at. I think about unstructured data and structured data the same way I think about human exploration of the deep sea and outer space.
Right? I feel, and it’s a fact, that we know more about interstellar space. We know the age of the universe.
We have the vision to see thirteen point eight billion light years into the past.
We have this view of the universe. We know how it works. We know how stars are created. We know how black holes are created. We know so much about things that we can’t even think about getting to, yet we know very little about the ocean depths right here on Earth, because the technology just hasn’t existed.
We don’t have the ability to go down there. And I think every time there’s a breakthrough in deep-sea diving technology, hundreds or thousands of species of fish are discovered.
So I feel like there’s this parallel, where we know so much about some things and so little about the things that are all around us.
I love that metaphor of the deep sea. I really like that metaphor.
I think whether or not you are a believer in AI, and at this point, I mean, I don’t know who isn’t. Maybe there are still some who are, like, yelling at clouds and saying AI is just a passing fad and this too shall pass. I just can’t imagine there are many of those people left. But even putting the AI issue aside, to your point, there are unmined piles of gold sitting out there: all of the insights in chunks of text, Word docs, PDFs, even video files that are just sitting on SharePoint servers somewhere.
That in and of itself speaks to the explorer gene in me; I’d wanna go find it. Right? As a kid, I wanted to go collect all of my Lego and put it into one bucket. To me, just finding all that stuff out there would be more than enough from an insight perspective, because there’s got to be tons of gold out there. Now, there may be some of us listening to this and saying, well, you know, there’s structured, there’s unstructured, there’s semi-structured.
For me, I just kinda put semi-structured and unstructured into one bucket. Do you do the same? Is it useful to separate them?
I do too. I keep them together. And you know how you said that about going to find that gold? The key, I think, is putting semi-structure around your unstructured data, tagging it with a data classification.
And I think that is the key step to start digging for that value. I would also agree with your statement that AI is not a passing fad as such. There are things that AI is very, very good at. I had this interesting conversation with somebody yesterday, where they said there are some things they go to AI for, and then they realize, I should have just done this on my own, like creating a PowerPoint or creating a custom image. But that sort of search and analyze work, looking at large volumes, is what AI is good at. And here’s what it’s even better at, I think, from a comparative-advantage perspective.
Going to get the data is sort of step one. Where does it exist? Text files, audio, video, and putting some semi-structure around that. But the key to leveraging AI is that once you have some sense of an inventory of your unstructured data, AI will be very good at connecting the dots and finding the relationships among your unstructured data. Because with unstructured data, the data silo problem is, I think, even more extreme than with structured data. You have unstructured data that is audio and it sits in some system, then you have unstructured data that might be video and it sits in some other system. You might have unstructured data that is Word documents or Excel files on a SharePoint.
And so the silo problem for unstructured data, I think, is even more complex than for structured data. But if you can start putting some structure around it, making it semi-structured, tagging it, defining it, you can start leveraging it to find the relationships between your data silos. And I think that’s one of the best uses of AI for unstructured data.
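To make that concrete, here’s a minimal sketch of what that tag-and-link step might look like once you have even a rough inventory. All of the silo names, file names, and tags below are hypothetical, and a real catalog would carry far richer metadata:

```python
# Illustrative sketch: a lightweight inventory of unstructured assets from
# different silos, each tagged with classification metadata, so related
# items can be surfaced across systems. Every value here is made up.

inventory = [
    {"asset": "q3_call_0142.wav", "silo": "call-recording system", "domain": "customer"},
    {"asset": "renewal_deck.pptx", "silo": "SharePoint", "domain": "customer"},
    {"asset": "vendor_contract.pdf", "silo": "contract repository", "domain": "vendor"},
]

def link_across_silos(items, key="domain"):
    """Group assets by a shared tag to surface cross-silo relationships."""
    groups = {}
    for item in items:
        groups.setdefault(item[key], []).append((item["silo"], item["asset"]))
    # keep only tags whose assets actually span more than one silo
    return {tag: assets for tag, assets in groups.items()
            if len({silo for silo, _ in assets}) > 1}

linked = link_across_silos(inventory)
```

The point is simply that once every asset carries even one shared tag, relationships spanning silos fall out of a trivial grouping.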
So you really touched on two things in terms of an order of operations here.
Step two, and we’ll start with step two, is kind of classifying and tagging: all the metadata tags that you would put on a video file or a text file or something else. But step one would be just finding it. Right? Going out there and doing some sort of inventory, some sort of discovery, to figure out what’s out there. I’m not even sure most data leaders are fully aware of what’s out there. So there’s certainly a discovery process, and then there is some sort of classification process.
How would I, as a data leader... I mean, if eighty to ninety percent of the data is unstructured, there’s a lot of it out there. How do I prioritize?
Right? Where do I start? I mean, I’m an MDM guy. I talk all the time about starting with a use case. Right?
Starting with a specific problem that you’re trying to solve. Right? When you’ve got a well-defined KPI, say, the number of customers retained in a given year, and all of your data is structured, I can say, okay, I need some insight around sales and some insight around the customer to fulfill that need. I can kind of work backwards from the output.
Is there a similar model in unstructured data? How would I go about prioritizing and doing the discovery and classification of all these mountains of data?
Yeah. So maybe I’ll take a use case as an example. Let’s say you have a client retention problem. I’ll use a bank acquisition as an example. Let’s say a bank has failed or was acquired by another bank.
Usually, when that happens, the customers at the failing bank, or the bank that was acquired, start getting the jitters, and they start being afraid of what’s happening.
Is my money safe? So in that kind of use case, you’ll pick your customer data, but then within your customer data domain, the way you’d typically want to prioritize is to go after your high-net-worth customers.
Right? So you’re gonna have some combination of looking at your structured data to say, who are my high-net-worth clients?
And you can start developing a sense of who they are. And then what you do is you tap into any unstructured data that they created, and they created that unstructured data by sending an email, by calling in.
And so what you end up doing, let’s say you’re the acquiring bank and you’re worried about high-net-worth individuals leaving, is you can identify pretty quickly who they are through your structured data. And then the key is finding out: okay, have they called in? Have they reached out to relationship managers?
What are they saying? What’s the language that they’re using? What’s the sentiment they’re conveying? In terms of prioritizing your approach to unstructured data, that’s a very practical, real-life example in some ways.
And then further, you could combine what you’re hearing from them with an understanding of what products they currently have, and how you might have a relationship manager reassure them with a better product offering, and you can start creating playbooks to retain those clients.
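As a rough illustration of that playbook, the sketch below filters structured account data for high-net-worth clients and then scores their unstructured touchpoints with a naive keyword matcher. The names, balances, threshold, and word list are all invented, and a real system would use a proper sentiment model rather than keyword counting:

```python
# Illustrative sketch of the retention use case: structured data picks out
# high-net-worth clients, then unstructured call/email notes are scored for
# worried language to prioritize outreach. All records here are fabricated.

clients = [
    {"name": "A. Rivera", "balance": 2_400_000,
     "notes": "caller asked whether deposits are safe, sounded worried"},
    {"name": "B. Chen", "balance": 35_000,
     "notes": "routine address change"},
    {"name": "C. Okafor", "balance": 5_100_000,
     "notes": "happy with service, asked about new offerings"},
]

# naive stand-in for a sentiment model
NEGATIVE = {"worried", "afraid", "safe", "leaving", "complaint"}

def at_risk(clients, net_worth_floor=1_000_000):
    """High-net-worth clients whose recent language signals concern."""
    flagged = []
    for c in clients:
        if c["balance"] < net_worth_floor:
            continue  # structured-data filter first
        score = sum(w.strip(",.?") in NEGATIVE for w in c["notes"].lower().split())
        if score > 0:
            flagged.append((c["name"], score))
    return sorted(flagged, key=lambda pair: -pair[1])

priority = at_risk(clients)
```

The design point is the ordering: the cheap structured filter narrows the population before the (in practice expensive) unstructured analysis runs.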
So I love it. There is an analog here to the structured data world I described, which is: focus on a use case, focus on an outcome, focus on the problem that you’re trying to solve. So there’s most certainly an analog in the unstructured data world. But instead of maybe a classic analyst who’s very data-centric and very focused on the data, I see a growing need for people who are very much domain experts, who know those business processes and can help us go find the data.
As a VP of data and analytics, I may not even know, or my team may not even know, where that information about a high-value client could be. Right? Because let’s just assume I’ve historically been ignoring a lot of this unstructured data. Right?
It’s just been outside of my governance scope for a lot of different reasons. Let’s just assume I don’t even know that there’s a SharePoint server in my marketing team where there’s data related to that high-value client. That’s a problem. Right?
That’s a real problem. But if I have somebody on my team, or in some sort of federated team potentially, who is a business process expert and knows all of the processes related to those high-value clients, I’m gonna know where those nuggets of data are.
Yeah. Listen, you’re completely right. I think that whether it’s unstructured data or structured data, an enormous amount of institutional knowledge still resides inside people.
And a big part of getting to the problem that you solve is, well, you know what Rumsfeld very famously said: there are known knowns, known unknowns, and unknown unknowns. If you don’t know that something is out there, you don’t even know to go look for it, and that is probably the toughest scenario to solve for.
And I would say this: whether you’re working in data or AI, solving the data silo problem is actually easier than solving the people silo problem.
If you can solve the people silo problem, you will solve the data problems, period. And I’d advise most people who work in the data space, just given the ubiquitous nature of data, to go solve the people problem, go establish those relationships. To your point: who are the marketing people? Who are the relationship managers?
What do they know that I don’t, so I can go figure out, oh wow, they have their own app and systems for x. How do I go get access to it? There’s no easy answer for that one, and no purely technical one; nothing comes to mind immediately. But if you can solve the people problem in that scenario, you’ll get to the technical problem.
Well, yeah. And, you know, if that data is sitting on a private SharePoint server, you may not even have access to it. Right? Even just getting access to data is half the problem for a lot of data leaders out there.
I had that problem. I remember the first time I was ever given an MDM mandate, you know, go fix this, and we decided MDM was the right approach. I struggled just to get access to data. Right?
So because I didn’t have the relationships. I just went in guns blazing and said, hey. Give me give me root access to your database.
And people in marketing were like, no.
Yeah. I’ll tell you, I’ll share a very funny story with you.
I had a similar mandate, where we were told by leadership to go solve a client data problem.
At this specific institution, we had something like forty-four client masters. Right? It was insane.
And god forbid you asked for access to one of them, because people are very protective of their information.
So we were thrown into a workshop. I think we were there for two weeks, all-day workshops, and then we’d call down the executives who had put us on this sort of SWAT team to go solve the problem. One of the ways we demonstrated the problem to them was that we told them to come to the floor we were on, and we wouldn’t give any other instructions. We only told them: your presentation is in one of these conference rooms.
You have to go find it. That was to demonstrate how hard it is to go find information.
And literally, we kinda walked them through it. It was a whole floor of conference rooms. We got them to the room where the meeting was going to happen, and then we locked the door to demonstrate that now you can’t get in. Now that we know where something is, we can’t get in. So we were demonstrating it in this physical, literal way.
Finding the information was step one.
Now that you’re here, we don’t have the appropriate access to get in and go do anything about it. And without going through the entire thing, that really resonated: if you want us to make a change, we’re gonna need access, we’re gonna need information, we’re gonna need some sort of ability to compel others to do something for us in the data space.
But yeah, finding it and getting access are steps one and two. And then, as you said, understanding it: you almost need the SMEs’ engagement. And I feel, Malcolm, everywhere I’ve been in the last thirty years, there are, like, twenty percent of the folks who know eighty percent of the work.
Yes. And they’re those twenty percent are always the busiest.
Even if they wanna be collaborative, they’re just buried under keeping the railroad running.
So getting access to context is sort of the next phase. It just goes on.
Yeah. Either they’re the busiest or the least receptive.
Right? Or they’ve been around forever. They were the ones who wrote the original ETL script twenty-five years ago.
It’s still running, and everybody is like, don’t touch it, don’t look at it, don’t even breathe on it. Yeah, he’s the guy who wrote it, and, you know, fingers crossed. Don’t even look twice at it, because if it breaks, everything will stop.
But getting to your point about relationships and people: there’s another point to be made about a business case, about getting your ducks in a row when it comes to proving the value of what you’re doing, and that certainly also needs to go into this. But ultimately, this is about people and about building your relationships, and that is obviously a most critical aspect of what we wanna do here.
So: I’ve done my discovery, I’ve got the relationships, I’ve got access, and now I need to go classify all of this stuff. I need to go tag all of this stuff. Do you see AI itself playing a critical role here?
Oh, absolutely. AI in terms of reading the content, understanding transcripts, converting audio to text. You can train a model with your company’s data classification policy.
That is a, I would say, low-to-medium-effort task, and highly reliable after a few iterations.
But you could teach an LLM a policy on data classification, whether it’s classifying it by domain, classifying it by sensitivity.
And once your LLM is trained on your company’s policy, it is very reliable. And I’ll speak from personal experience.
Somewhere around eighty to ninety percent accurate. You have to have the right distribution of data, so you have the appropriate training data in place and the appropriate volume of it to train the model. It’s one of those low-effort, high-value things; if there’s such a thing as low-hanging fruit, data classification, I think, is a very impactful use case for AI.
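For a rough picture of what prompting a model with a classification policy could involve, here is a small sketch. The `classify` function is a deterministic keyword stand-in for the actual LLM call so the example runs on its own, and the policy labels and hint terms are invented for illustration:

```python
# Hypothetical sketch: classifying records against a company data
# classification policy. A real implementation would send `prompt` to an
# LLM API; here `classify` substitutes simple keyword matching.

POLICY = {
    "Confidential": ["ssn", "account number", "salary"],
    "Internal": ["project plan", "org chart", "meeting notes"],
    "Public": ["press release", "product brochure"],
}

def build_prompt(policy: dict, text: str) -> str:
    """Fold the policy into an instruction an LLM could follow."""
    rules = "\n".join(f"- {label}: e.g. {', '.join(hints)}"
                      for label, hints in policy.items())
    return (f"Classify the document below using these labels:\n{rules}\n"
            f"Reply with one label and a confidence from 0 to 1.\n"
            f"Document: {text}")

def classify(policy: dict, text: str) -> tuple[str, float]:
    """Stand-in for the LLM call: keyword overlap with the policy hints."""
    lowered = text.lower()
    best, hits = "Unknown", 0
    for label, hints in policy.items():
        n = sum(h in lowered for h in hints)
        if n > hits:
            best, hits = label, n
    # crude confidence: more matching hints -> higher score, capped at 0.99
    confidence = min(0.5 + 0.25 * hits, 0.99) if hits else 0.2
    return best, confidence

label, conf = classify(POLICY, "Employee salary and SSN records for Q3")
```

The shape mirrors what Junaid describes: the policy supplies the labels and context, the model supplies a label plus a confidence score per element.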
But let’s get specific on that. What would that typically look like? Does that look like a defined corporate ontology?
Is that what a classification policy would look like? Because that’s what I start thinking about, right: all of my key terms, my definitions, my thesauri, and some idea of an ontology. Or am I overthinking it?
No, you’re precisely right. It’s all of those things.
You have your ontology. You can have a business requirements document that has a grid that describes data.
You have a data catalog. You leverage your data catalog, which has definitions in it that feed context, and then you have the actual data samples that you’re training with.
And it’s highly reliable. In one case, we had something like seventy thousand data elements.
It was appropriately tagged for the most part, and it was tagged with, like, internal data, confidential data. It was skewed toward things that were internal and confidential versus public.
And so what we ended up doing was training it to be very good at recognizing certain patterns of data, like what is internal, what is confidential, so that when it saw something else, its confidence score would be lowered to reflect: I’m pretty sure this is something else. So let me use a parallel metaphor. You could train something on what an apple is, what an apple looks like.
It’s red and it’s round, and there’s a green apple too. This one’s a Fuji and this one’s a Gala, and you can teach it that. You can then teach it what a banana is: a banana is yellow, and they come in different types. And the moment it sees an orange, it can say, well, this doesn’t really fit the criteria of an apple or a banana; I’m not really sure what this is. Or take those apple pears.
One of those looks like an apple, but it doesn’t quite fit the size or other criteria, so the confidence score drops. And that’s another key thing: when you do data classification, the confidence score relative to your training material is something you pay very close attention to. So in our case, we were really confident it was going to be good with internal and confidential data, and less so with the public and sensitive classifications that we had. But it’s a very, very reliable use of AI when you use those things: an ontology, data definitions, and an appropriate set of training data.
So in those cases where the LLM, or the small model that has been custom-trained on your classification policies, doesn’t know, how do we inject a human in the loop at scale?
Oh, well, I would say this. We are still in the infancy of AI. Even in the most reliable uses of AI, I think you’re gonna have a human in the loop. And I think your priority is gonna be driven by whatever your confidence score is.
And I think that, without any doubt, there is a human in the loop accepting or rejecting every output, whether it’s from a small language model or a large language model. So I might look at the output of my data classifications and say, listen, anything that’s over ninety, I’m not even gonna look at; I’ll park it for now. Anything that’s below seventy or seventy-five, I’m gonna go look at and physically approve or reject, and provide a rationale for.
And that human in the loop just improves the model, right? Okay, I accept something; here’s my rationale for accepting. Or here’s my rejection, and here’s my rationale.
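That accept, park, or review workflow can be sketched as a simple triage over confidence scores. The thresholds mirror the ones mentioned in the conversation, and the records are fabricated:

```python
# Sketch of confidence-score triage: auto-accept what the model is very
# sure about, route low-confidence outputs to a human for approve/reject
# plus rationale, and spot-check the middle band. Thresholds follow the
# numbers mentioned above; tune them to your own model's behavior.

def triage(classifications, auto_at=0.90, review_below=0.75):
    """Split model outputs into auto-accepted, human-review, and spot-check."""
    auto, review, spot_check = [], [], []
    for element, label, score in classifications:
        if score >= auto_at:
            auto.append(element)
        elif score < review_below:
            review.append((element, label))  # human approves/rejects with rationale
        else:
            spot_check.append(element)
    return auto, review, spot_check

outputs = [
    ("cust_email", "Confidential", 0.97),
    ("blog_draft", "Public", 0.61),
    ("org_chart", "Internal", 0.82),
]
auto, review, spot_check = triage(outputs)
```

Each human decision (accept or reject, plus the rationale) then becomes new training signal, which is how the loop improves the model over time.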
So I know there’s a lot of talk about AI agents and agentic AI, these sort of self-governing, self-operating models.
But I think, and maybe this is just my perspective, we’re still a good twelve, twenty-four, thirty-six months from those truly being independent.
There are a few things that stick out to me.
One is that there are a lot of people who are really concerned about AI taking all the data management jobs away. Right? AI taking all the data stewardship jobs and all of the data modeling. And based on what you just said, if eighty to ninety percent of our data is unstructured, we necessarily need to start governing it, because everything we’ve just described is a governance process.
It’s governance. Right? If we need to start governing that data at scale, then in the short term I can see more of a need for data stewardship than there has ever been in the past, with a twist. And to me, it comes back to that business and domain knowledge.
Right? It comes back to that, and it also comes back to combining the data and analytics function with other functions that have existed completely outside the realm of data and analytics, largely the people doing what is known as knowledge management. Right? The people who may be managing your search infrastructure, from the perspective of all of the stuff you’re doing that is customer-facing.
And content management: content management for your website, content management for customer FAQs, you name it. Knowledge management, content management, the people who have been living and breathing in the world of ontologies for a while, often to support corporate search, whether that’s internally facing or externally facing.
Seems to me like those people and the traditional data stewards, data managers, data classifiers, we need to find a way to bring these groups together. What do you what do you think?
So I might say something provocative here. Please. And I may get a lot of hate mail from your listeners.
Here’s my personal view: I do think AI will move us into a phase where we will need fewer people doing data work. Longer term, for sure. Yeah.
For sure. Without a doubt. I think AI is coming to a point where... like, I have watched its evolution. I’ll say I’ve been deep into it for about two and a half, three years.
And it evolves every day, really every day, and there are people who’ve been into it much longer. Just the evolution of its capabilities, features, and operations in the last two and a half, three years is exponential. I think we will have fewer data stewards as such, and fewer data professionals.
I think there will be decreased demand for those roles, and I think there will be some new roles created. People talk about prompt engineers, all that kind of stuff. And, I don’t know if this is a new term or something people are already using, there’s the concept of AI stewards.
Rather than being data stewards, there will be these AI stewards who watch AI and AI agents doing the work.
I’m working at a place now, and I say this because that’s precisely what I’m doing: designing an AI agent meant to replace the mundane nature of what data professionals do today. And what we’ve shown is an org chart demonstrating that we are going to scale and increase the amount of governance and data management we do without increasing headcount.
What we’re doing is showing that we are bringing AI agents online in the coming quarters: an agent to do profiling, an agent to look at rules translation, an agent to create rules, and then a sort of parent agent that manages those agents.
And so the org chart shows a leader and their directs, and those directs are now effectively becoming AI stewards, watching AI agents do the work. And so, you know, I almost hate to say it, but I can see it. I’m training models today to do the work that an entry-level person would do, and the work I was doing when I was up-and-coming is now becoming obsolete. I wouldn’t even do it. So I think it is inevitable, and I think people are gonna have to upskill themselves to remain relevant.
More generalists, correct? Right? Instead of these data specialists, it sounds like there’s certainly a need here to be a little more T-shaped, where you can go an inch deep and three miles wide, really understand the business processes, how the business is consuming, digesting, and even creating data, and maybe be able to go deep on AI, how it works, and its capabilities?
What do you what do you say to that?
Is there is there a role for generalists here?
I think so. I don’t know if you follow basketball, but there was an age in basketball where you had defined positions. You had a center. That center was, like, seven feet tall.
And he was big and you couldn’t move him. He would go down to the post, and that’s where he would live. You have a point guard. Yeah.
There you go. Exactly. You have a point guard, and that point guard was six-one, six-two, athletic, and his function was to dribble the ball and make a play happen.
And I’ve seen this notion of positionless basketball come along, where you now have players who can play multiple positions, who are long and athletic, can dribble the ball, and can equally well shoot it. And I think post play is becoming more obsolete, because the game favors the three-point shot; the three-point shot is just worth more. So certain positions are becoming less relevant because of the value they might add. And when I think about data professionals: I came up at a time when you had data quality SMEs, you had a guy doing metadata management, you had a person writing policy.
And I think if I were designing a data organization, I’d be building one that was a team of, for lack of a better word, generalists or jacks-of-all-trades.
And I would fill in the gaps with either really good models, whether they’re small language models, or people who might have some nuanced experience that’s necessary for a person to bring. So that’s my view of the future.
It’s better to be a generalist and know, hey, this is what a data scientist needs. Here’s what good data quality looks like for them; here’s what our framework is. That’s the way I would build a team for the future, I think.
So let’s briefly go back to the idea of increasing automation of some of these data management roles, whether that’s stewardship, analysis, modeling, even engineering.
Do you see that being true for all use cases? Because, and I’m open to being wrong here, trust me, I’m wrong a lot...
The story that I’ve been telling is that there will always be some role for human oversight here. Let’s take a bank. You’ve worked in banking; you may even be working for a bank right now.
But the story that I’ve been telling is that explainability is critical to our success as data leaders. Right? If you put something in front of your CEO and your CEO asks, why am I seeing what I’m seeing?
I’m not entirely sure I would wanna be the CDO that said, you’re seeing it because that’s what the LLM told me to do.
Right?
Versus you’re seeing it because a human being decided that that was the right thing even if even if the AI is consistently more accurate than the human. Right? There is still an accountability aspect, still an explainability aspect of being able to say a human was in that loop and made a decision based on these criteria versus an AI was in the loop and made a decision based on these criteria. Do you see do am I am I just being too optimistic about the long term proposal of humans being involved here?
Yeah. I think there will always be humans involved, but in the next twelve, twenty four, thirty six, sixty months, that involvement will decrease at some steady rate. Yeah.
I can't imagine it ever going down to zero as such. There's always gonna be some level of human involvement. And here's why. What's not clear to me is what happens when the underlying data changes.
Like, what happens when the business changes? Yeah. It requires a context change. It requires a code change.
I'm not in a position to really know. It's not clear to me whether an AI model can recognize, oh, I have a fundamental shift in the business, and I have to teach myself a whole new context. I'm just not sure we're there yet, and I think we'll be in a tremendous amount of trouble if AI does get there. Because the data changes, because the business changes, because the environment changes, you'll need people to tweak it. The parallel would be when automation came to manufacturing.
Right?
You won’t you know, a lot of jobs were eliminated, you know, but there was you know, things change, styles change, materials change, Technology does improve. So I think you’re gonna have humans in a loop for for some time, at least for the for, I would say, foreseeable future.
Yeah. You know, I remember a time way back in the day when people were saying, you’ll never get a bank to put their data in the cloud.
Right? You’ll never it’ll never happen. Banks will never put their data in the cloud, and here we are.
Banks have put their data in the cloud.
You know? Yeah.
So my example here is, and I'm sorry, I'm being a little metaphorical,
that I can certainly see some people saying, oh, well, there will always be a need for humans in the loop. Right? There will always be that need.
But I’m not entirely sure if that’s always true, given history. But I do know that for the time being for the time being and I don’t know what the I don’t know what I don’t know what this time span is, whether it’s a year, two years, three years.
And that's even though full self driving cars are way safer than humans.
Right? The top five causes of car accidents are all human factors, whether that's distracted driving, drinking, whatever; it doesn't matter.
The top five causes of car accidents can all be mitigated through this technology, and self driving cars are consistently shown to be safer. However, when a self driving car makes a bad decision and a human is impacted, there is, one could argue, a disproportionately negative response that could end up shutting down the entire technology. And I see that being the case now with governance, with quality, with these LLMs making decisions about our analytics, where I could easily see C-level people saying, okay, this one report was wrong, and that bad apple is gonna spoil the entire bunch.
Forget about it. We're not doing this. What do you think?
I I couldn’t agree with you more. You know, I’m working on an AI agent.
I know, as we roll it out in its early phases, we’re going through, like, an alpha and a beta phase before we get to a, version one.
Expectation setting is the biggest part of my job. I'll tell you my experience when it comes to AI: I get two responses when I talk to leaders.
One response is pure skepticism. Oh, AI. To your point, people are yelling at the clouds; AI is a passing fad.
I can almost feel their eye rolls on the phone when I talk to them about what AI can do. Right. Then there's the other extreme, where leaders think, oh, it'll just do it. It'll do whatever you tell it to do, and there's nothing I have to do; I just have to plug it in.
And that's sometimes worse than the skepticism, because you can turn skeptics. You know what I mean? Bringing down people's expectations is sometimes a harder thing to do. But I think expectation setting is one of the hardest parts of working in data and analytics. Like, how do you sound optimistic and really positive about something without sounding like a crazy prophet?
Right?
Right. Right. Or a doomsayer claiming all the jobs are gonna be eliminated. Right.
Yeah. But I would say this: I agree, and I feel that pressure too. If our agent produces something less than, like, ninety percent accuracy, it will be deemed a failure.
Without a doubt, as a leader, I feel that pressure: okay, this wasn't a hundred percent accurate, it was ninety percent accurate. And if it's something like seventy, it would be, hey, we are wasting our time.
That's the response we might get, even when the work is meant to be incremental. So, hey, let's do something these two quarters, and let's target an increase in accuracy in the next two.
But it’s very much the same way.
Like, if a if a self driving car crashes, it’s like, oh, every car is now, like, unsafe and unreliable.
But I think it’s, I think there’s a a similar parallel to AI.
Well, and that touches on something very near and dear to my heart that I cover in my book.
Sorry.
I’m I’m selling the data here at Playbook, which is this idea of mindset.
And I think that we, as data and analytics practitioners, or just as business leaders, it doesn't even need to be CDOs, really need to rethink how we frame some of these issues. Right? Take the idea that if it's not ninety percent, it's junk.
Well, that’s just not the way the world works. We live in a world of nuance. We live in a world of context. What is good to marketing could be bad to finance. We’ve always recognized some of this tension. We’ve always recognized that quality exists on a spectrum.
And ninety percent for a marketing use case will be unbelievable.
Right? Like, fantastic.
Maybe not for an audit or compliance use case. But this idea that two things can be true at the same time, right, that quality exists on a spectrum and doesn't need to be binary, that to me is just so important. Thinking differently, nondeterministically, about how we manage these processes and how we manage AI seems to be a huge challenge, but also a huge opportunity for us.
What do you think?
You know, I like the way you said it, and I'm gonna copy you from now on when I talk. Okay.
The way you describe this notion of quality existing on a spectrum is really profound. I always talk about it as degrees of risk. Okay? Is your data quality risk high or low?
You have to live with a certain amount; you define what risk you can live with. But I like the way you said it a little better: there's a spectrum of quality, and on that spectrum, even sixty percent could be good, to your point, versus in finance or credit, where that would be terrible.
But I’m I’m in the, I’m in the the the agreement with you, which is and and especially kind of tying this back to unstructured data that the spectrum of quality is so so wide in unstructured data.
Yeah.
You know, when someone asks the question, hey,
what does good data quality look like for unstructured data? I typically answer with the same thing you hear in structured data: is it fit for purpose?
So is this unstructured data, whether it's an audio file, a text file, or an email, good for the use case at hand? Is it time relevant?
If this thing is ten years old, or if there isn't appropriate tagging or classification, is it even relevant? You can apply the principles of structured data thematically to unstructured data. But the biggest nuance is that the definition of quality exists on a spectrum.
And you don’t have to be in that ninety, ninety five, or hundred percent. Like, you know, we’re typically in finance and and risk. Ninety nine percent accuracy is like a threshold. You have a half percent or one percent threshold that we typically allow banks, especially when it comes to to reg reports.
That’s just not not feasible Right. In in unstructured data. But I’m gonna use that.
It’s yours. It’s it’s yours.
Feel free to use it. You know, we could be talking about this for hours and hours, and maybe we should. But, alas, our time is coming to an end. A few additional resources for folks who are still listening: we talked about the importance of ontologies as a means to classify data and to inform a data classification policy.
Check out the episode of CDO Matters with Jessica Talisman. Jessica is crazy smart and has lived and breathed the world of ontology. She's created a framework called the ontology pipeline.
I'd welcome you to check out that episode of the podcast with Jessica. I'd also invite you to check out a book written by Bill Inmon. Yes, that same Bill Inmon, the godfather of the data warehouse, called Turning Text Into Gold.
Bill actually goes into detail about how you can chunk out, and maybe even vectorize, large blocks of text to help you classify it and structure it at scale.
It's a really interesting read, and a quick one. It's a slim book, but it will help you start thinking about how these processes actually work and how you can start classifying data at scale.
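The chunk-and-vectorize idea can be sketched with nothing but the Python standard library: split a large block of text into overlapping chunks, turn each chunk into a term-frequency vector, and classify it by cosine similarity against keyword profiles. The chunk sizes, profiles, and labels below are made-up assumptions for illustration; this is not the exact approach Inmon's book prescribes:

```python
import math
import re
from collections import Counter

def chunk(text, size=50, overlap=10):
    """Split text into word chunks of `size` words, sharing `overlap` words."""
    words = re.findall(r"\w+", text.lower())
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify(chunk_text, profiles):
    """Assign the chunk to the most similar keyword profile."""
    vec = vectorize(chunk_text)
    return max(profiles, key=lambda label: cosine(vec, profiles[label]))

# Hypothetical category profiles built from a few seed keywords.
profiles = {
    "finance": vectorize("loan credit interest payment balance"),
    "support": vectorize("ticket issue help customer complaint"),
}
print(classify("the customer raised a complaint ticket about the issue",
               profiles))  # -> support
```

In practice you would swap the bag-of-words vectors for real embeddings, but the pipeline shape, chunk, vectorize, classify, stays the same.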
Junaid, where can people get in touch with you? How would somebody access this clearly deep knowledge that you have around AI and unstructured data?
What’s a good way to contact you?
I only have one social media presence, and that's LinkedIn. I think that ages me a little bit. You can find me on LinkedIn; reach out.
All of my contact information is there. Thanks so much, Malcolm.
Fantastic. If you're still with us, I would be absolutely thrilled if you subscribed, if you liked,
if you did all of the social things, thumbs up, all of that. Yes, we're showing our age. Yeah.
I’m I’m I’m LinkedIn is my my primary vehicle as well, Janet. So, thank you for tuning into this episode of the CDO Matters podcast. I hope you got value from this. That is my mission is to help prolong the tenure of CDOs.
And anybody that wants to be a CDO, check out previous episodes. We’ve got nearly a hundred now.
And if you found this useful, I'm sure you'll find previous episodes equally useful. Also, one last plug for the book: if you want to challenge the status quo in data and analytics, and if you need to figure out what you need to do differently and, more importantly, how you need to think differently about data, you should check out my book. Thanks again, Junaid, and thanks to everybody for listening today. We will see you on another episode of CDO Matters sometime very soon. Bye for now.
