
The CDO Matters Podcast Episode 83.5 – SPECIAL

The Evolution of Data for Quick-Serve Restaurants with Christopher Dwight


Episode Overview:

Quick-service brands face growing pressure to modernize data management, from ensuring menu and recipe consistency to scaling franchise operations. To dig deeper into these challenges, Profisee’s CDO Matters podcast released a special edition episode featuring Christopher Dwight, Profisee’s VP of Strategic Initiatives!

Episode Links & Resources:

Good morning, good afternoon, good evening, good whatever time it is, wherever you are on this amazing planet. I’m Malcolm. I’m your host for the CDO Matters podcast, and I’m also the CDO of Profisee. We make MDM, master data management software.

And if you’ve got questions about MDM, hit me up on LinkedIn. I will happily chat with you. Today, I’m joined by Ron Green. Ron is the chief technology officer of KungFu.ai.

I just wanna keep saying KungFu.ai over and over again. I love that. I love the name, KungFu.ai. I would describe you as an AI services company.

 

Is that a fair description, Ron?

 

Yeah. That’s exactly right. We are an AI strategy and engineering firm, and we help our clients with pretty much everything under the sun related to figuring out your AI strategy and building custom AI solutions.

 

So not only is Ron the CTO, but you are a cofounder. Yes?

 

And you started this in twenty seventeen. So this is before, you know, the OpenAI firestorm. You saw the writing on the wall and started pretty early.

 

Yeah. Well, not really early. Early early.

 

Yeah. Yeah. I actually did a master’s in artificial intelligence back in the late nineties. So, you know, it’s like one of those, you know, quote, overnight successes that only took twenty five years.

 

And so I’m glad we’re gonna talk today.

 

We’re gonna talk today about AI adoption. We’re gonna talk about what Ron is hearing in the market, what companies are working on, what they’re not working on. We’re gonna talk about the perception of data as an enabler or maybe a hindrance to some of those AI aspirations. We’re gonna talk about all things AI, the state of AI, through the lens of somebody that’s out there helping customers build this stuff. Right? Not through the lens of data engineering, which so many of the listeners of this podcast are in, but from the perspective of folks that are helping others get enabled with AI.

 

Ron and I actually have a little bit of a history. I met Ron, I wanna say, my guess, Ron, would be twenty eighteen, or maybe even before.

 

I was working for a startup called Quick Arrow, which was an Austin Ventures... Yep.

 

Startup. And I wanna say we had a connection through AV. I can’t remember what our connection was, but I’m pretty sure we had coffee in a Starbucks at Anderson Mill and one eighty three in Northwest Austin.

 

That totally rings a bell.

 

Yeah. I’m pretty sure we did. But then I know for sure that we had at least one meeting, maybe even more, when I was at Dun and Bradstreet up off of Palmer, and you were starting KungFu.ai, talking about, okay, how could we potentially work together? So I know that we had at least one meeting. This is a way of saying, friends, if you’re listening to this and you’re a younger professional, don’t ever burn a bridge.

 

Don’t ever burn your bridges, because you never know. Twenty years later, you could be on a podcast with the person whose bridge you burned.

 

So, you know, it’s such a small world that we live in. It’s such a small world.

 

Anyway, Ron, let’s first start with kind of a general flavor. What are some of the projects that you’re involved in that are really interesting, that are really challenging? What are some of the things that you’re seeing in the market from the demand side around custom AI development? What are some of the trends?

 

Yeah. You know, it has changed quite a bit, I would say, over the last eight years.

 

When we started KungFu.ai in twenty seventeen, if you were working with us, you were an early adopter. Like, you were a true believer. And so most of the clients that we worked with back then had very specific problems they were trying to solve, and they knew that they were, you know, not addressable with traditional software approaches. So, you know, a lot of computer vision, natural language processing, that type of stuff.

 

The biggest question back then was, you know, was AI real? Like, could it actually solve these types of problems?

 

And the amount of data you needed eight years ago was considerably higher than what you need now. So if we were doing a computer vision project, even in twenty eighteen, you weren’t flat-footed from a modeling perspective. You could come in, you could use open source weights to kind of jump-start the initiative. But on the natural language processing side, there was almost nothing.

 

You were almost just starting from zero every time. That’s completely changed across the board. So regardless of what you’re doing now, you do have that head start. What’s really changed... I kinda think there have been three phases over the last ten years.

 

There was the early adopter phase, and that was really narrow AI. We would build systems that could solve one problem.

 

You could maybe solve it at a superhuman level, but it was very narrow. Then the generative AI explosion took over, and there was, I would say, almost like an overcorrection. Everybody thought generative AI could solve any problem.

 

And they were frequently really blind to the complexities of deploying generative solutions because, you know, like we were talking about before we started recording, these AI systems are probabilistic.

 

It is very difficult to prevent generative systems from hallucinating, things like that.

 

And that intense focus has really subsided. And I think we’re entering a really strong phase where there’s a balance.

 

And where we’re seeing our clients get by far the biggest return on their investments is when you leverage your company’s proprietary data to build an AI capability that either automates some workflow that’s really intensive.

 

Or it’s a predictive system that’s got this historical data, and you can use it to identify outliers, and this can be used, you know, in just countless ways. Or you’re enabling some entire new capability within the product suite that you couldn’t do before because, you know, we just didn’t have those capacities within artificial intelligence. But the key thing, the key point I’m making here, is that it’s having proprietary data where you’ll see the bigger return on investment. I definitely wanna dig into that a little bit more because it’s where we’re encouraging our clients to look.

 

So before we dive in there, from your client perspective, where are you seeing the demand come from? Is this CDOs who are recognizing, hey, I’ve got the data, I just don’t have the resources to do it? Or is it business function heads? Like, is this the head of marketing wanting to do a propensity model? Where is the demand for these solutions coming from in your client base?

 

You know, that’s a really great question. It’s changed. I would say in the beginning, it was very, very, you know, focused. It might be somebody in product or it might be somebody in IT.

 

Last year, it was crazy. We would have people call us up and say, hey, can I have some AI? And we’d say, what do you need? And they would say, I don’t know, but I need some AI and I need it fast, because, you know, the board is gonna kill me. Now it is almost invariably C-suite driven, very often CEO driven, meaning there is a push to embrace AI, and the specific nature of that is sort of TBD, but we’ve gotta get on board.

 

And what’s really exciting about that for somebody like me who’s been doing this for so damn long is, I would have conversations, Malcolm, like, three or four years ago, and people would say, what’s the difference between AI and crypto? Like, aren’t they both kind of scams?

 

Yeah.

 

And so, you know, we’re really in a place now where we don’t have to convince our customers. And so we deal with everybody: CMOs, CIOs, CTOs, product owners, CEOs. It’s really across the board. We have people reaching out to us now.

 

Separate podcast, maybe, about the convergence of AI and blockchain. I think there could actually be something there from a blockchain perspective. Anyways, separate podcast, because I fell down the blockchain rabbit hole many years ago and have not managed to find my way out.

 

But interesting. So when you say the CEO, is the CEO driving and saying, get this done? And then is the CEO saying, hey, go talk to my CIO, go talk to my CDO, but just make this happen? So that’s interesting.

 

Are these folks, and I don’t know, this may be too probing a question, are you seeing a lot of people, like, okay, heck yeah.

 

Let’s go. Or is there reluctance? Do they feel like there’s been kind of a gun put to their head from an AI perspective? Are you talking to CIOs or CDOs that are saying, I don’t know why my CEO is pushing me on this.

 

I’ve already got this figured out. We turned on Copilot last week. Right? Is there any tension there that you’ve encountered?

 

That’s another great question. I would say a few years ago, yes. There was a sense of, like, hey, we have ChatGPT. We’re set.

 

What else would we need? What’s happening is, you know, it’s a really competitive market out there in the world. And so as companies have sort of maybe sat on the sidelines to wait and see what happens, and then their competitors release AI driven capabilities, nothing will light a fire quite like that. And so we are now seeing, I would say, more than half our engagements are essentially a combination of strategy and engineering, where it’s companies who don’t know how to get started.

 

They don’t know how to maybe find the talent. They don’t know how to vet the opportunities. They don’t know necessarily how difficult it would be to execute, and how to truly estimate the ROI around them.

 

And so they’re looking for help on the strategy side, on the roadmapping side, and on the execution side. And what we see a lot is that they’ll come to the table with really good ideas, but there is maybe a misunderstanding on the data side.

 

Data has always been important. It has never been more important. The, you know, the dominant approach within artificial intelligence right now is this technique called supervised learning, where you build models and you train them on a bunch of example data.

 

And the beautiful part about this is that these models can learn to generalize, and they can become really good. You almost can’t overload them with data. They’ll soak up all the data you give them. But the challenge is you have to have that data. So an example we’ll see a lot is there will be maybe some very manual process that humans do, and maybe it is sort of a decision process where there are many, many touch points, and you wanna automate that. And we’ll engage with clients, and a lot of the time they don’t have the data. You know, the humans have made the decisions. They’ve taken in the input.

 

They’ve built up heuristics and intuitions over many, many years, and they are making decisions. And then we come in to automate something like that, and almost without exception, it’s human augmentation. It’s not replacing the human entirely. It’s automating the more mundane aspects. But they haven’t captured the data. So that means it’s basically locked in the humans’ heads.
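The supervised learning setup Ron describes, a model trained to imitate historical human decisions, can be sketched in a few lines of Python. Everything below is invented for illustration: the two numeric features, the decision labels, and the toy nearest-centroid “model” standing in for a real one.

```python
from collections import defaultdict

def train(examples):
    """Learn one centroid per label from (features, label) pairs."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (x, y), label in examples:
        sums[label][0] += x
        sums[label][1] += y
        counts[label] += 1
    return {label: (s[0] / counts[label], s[1] / counts[label])
            for label, s in sums.items()}

def predict(centroids, point):
    """Classify a new case by its nearest learned centroid."""
    def sq_dist(c):
        return (c[0] - point[0]) ** 2 + (c[1] - point[1]) ** 2
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Historical human decisions, captured as (features, decision) pairs.
# Without this captured data, there is nothing to train on.
history = [((1.0, 1.2), "approve"), ((0.9, 1.0), "approve"),
           ((3.0, 3.1), "escalate"), ((3.2, 2.9), "escalate")]
model = train(history)
print(predict(model, (1.1, 0.9)))  # → approve
```

The point of the sketch is the dependency Ron calls out: the “model” is nothing but a summary of past labeled decisions, so if those decisions were never recorded, there is nothing to train.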

 

The good news is, if this is a critical path initiative, it’s better to find out now that you don’t have the data so you can start collecting it, than find out three years from now. So often, you know, bad news is good news in the long run.

 

And then the other really big aspect on the proprietary data side is quality, as we were talking about. You know, garbage in, garbage out has probably never been more true than for these supervised learning AI systems. You need high quality data. That doesn’t mean it has to be perfect, and it doesn’t mean it has to be structured. That’s one of the beautiful things, that you can just kind of dump a fire hose at it, but there are some minimum thresholds you need from a data distribution perspective for it to really deliver meaningful results.

 

Okay. So, three things that we touched on around the data. One was what you said earlier about the importance of having kind of proprietary datasets in order to do something really unique.

 

I wanna touch on that. The second thing you said is companies maybe not having the data. So they’ve decided, from a strategy perspective, or just kind of a fit from a use case perspective, or maybe it’s a customer demand perspective, they’ve decided, I want a model x y z.

 

Right? Like, this decision process, maybe it’s buying a pair of jeans or building a custom home, or who knows. Right? But for whatever reason, that company doesn’t have the data.

 

Like, there’s nothing to train on because, in your words, they’ve kind of relied on just gut, aka heuristics. I love how data scientists love, you know, fancy words for saying intuition.

 

Yeah.

 

Heuristics are a form of intuition, but, you know, arguably it’s an important evolutionary process that builds up heuristics over time, but... Right.

 

Less valuable in the data science world. The third thing that you touched on was the idea of data quality.

 

Let’s address those in reverse order. Let’s come back to data quality. Now, you described the problems of data quality from the perspective of supervised learning, which is not generative AI. Right?

 

These machine learning tactics are not generative AI. Is it correct, Ron, to say that you could kind of divide the world into two? Now, keep me honest here. Yes?

 

Where one of these worlds is the kind of traditional machine learning, supervised and reinforcement learning, you know, world, and then there’s the GenAI world. Is that a heuristically true perspective?

 

Not really. I probably wouldn’t divide it along those lines. Okay. Because even very large language models, for example, like ChatGPT and Claude, even though you wouldn’t strictly say the training methodology was supervised learning, it would be semi-supervised, with sort of a reinforcement learning with human feedback regimen, or something, after pre-training.

 

And semi-supervised, what that really means is, you know, people often talk about how large language models were trained on the entire Internet. Right? Yeah. Well, there’s no way people can label, you know, all that input.

 

Well, actually, the pre-training task is pretty simple.

 

You just take a corpus of documents, and there are open source training sets out there that have billions of tokens, trillions of tokens in fact, and you just simply put them in and you mask out future tokens. So I’ll show you the first token or word, you have to guess the second one. I’ll show you the first two, you have to guess the third one. You do that for the entire corpus.
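The masking procedure Ron walks through can be sketched directly. The toy corpus below is invented; real pre-training works on numeric token IDs over trillions of tokens, not a six-word list.

```python
# Self-supervised next-token objective: every prefix of the corpus
# becomes a training example, and the "label" is simply the token
# that comes next. No human labeling is needed.
def next_token_examples(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

corpus = ["the", "cat", "sat", "on", "the", "mat"]
for context, target in next_token_examples(corpus):
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ... and so on through the whole corpus
```

A model trained to score well on pairs like these is exactly the guess-the-next-token setup described above, applied at enormous scale.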

 

And you can do, like, intermediate masking and other things like that. But the important thing is, even though it’s not pure supervised learning in the traditional sense, it’s still mostly supervised even if it’s called semi-supervised. And so the real difference between generative AI and non-generative AI is that generative AI is designed to be able to generate new samples that mimic its data distribution, the data distribution it was trained on. So it can generate, essentially, from this sort of high dimensional manifold it learns.

 

And in non-generative AI, they’re sort of classifiers, and they can be continuous or discrete, but very often they’re predictive in some way.

 

I guess for me, I talked about this potential split because, as the data geek that I am, I look at things through the lens of structured data versus unstructured data. And structured data, from the view of the CDO, we know reasonably well because we’ve been doing classic analytics: reports, dashboards, rows and columns. We’ve been doing that reasonably well for a while, and we’ve figured out processes for how to govern it. We’ve figured out processes for how to assess quality. We’ve figured out how to even do things like data validation, right, and saying, is this correct? Is Ron Green actually really Ron Green? We’re reasonably good at that.

 

The unstructured stuff, that’s hard.

 

Right? Right.

 

Data quality, at least from the perspective of data validation and verification: is something assertively true? Is it correct? When it comes to unstructured data, that’s really hard. And I think some of the data professionals I know are really struggling with that, because that world is so subjective, so context driven, so nondeterministic.

 

Right. That’s why I drew the distinction between GenAI and other forms of AI, in that GenAI seems to really like all of that text.

 

And the other stuff that you’re talking about can actually handle, and is happy with, rows and columns. Is that a more correct assessment?

 

I hate to say it, Malcolm, but no. No.

 

The non-generative AI algorithms do really well on unstructured data as well. So, for example, images or just plain text files or, you know, rasterized content. I mean, just pretty much anything under the sun. The challenge we always had before was kinda multifold. One was, you’re dealing with discrete data types, and you need to have a way to represent them in a space that is continuous.

 

Okay? And if there was anything anybody took away from this podcast that I would be delighted about, there’s a concept called embedding, and it’s essentially where you can learn a continuous representation of discrete inputs. So I can take a photo and I can map it to some high dimensional space, and it’s a point in that space. Or I could take a word and represent it on some complex multidimensional manifold. That is common to both dealing with structured data and unstructured data, and to generative and non-generative AI solutions.

 

And to me, that’s actually kinda one of the slept-on breakthroughs of the twenty first century, this idea that we can take discrete domains and manipulate them in these high dimensional continuous spaces. And the beautiful part, one of the reasons that we’re seeing amazing sort of multimodal models now, like single models that understand audio and video and text, and can generate text and can generate images, etcetera, is because we have a unified architecture to represent these inputs in this combined embedding space. And to me, that’s one of the, again, most critical breakthroughs of the twenty first century.

 

So so let’s get a little more specific, and let’s assume that I’m ignorant in this space, and I appear to be showing myself that I am.

 

Let’s talk a little bit more about this idea of an embedding. Let’s start from the simple use case of, I have a file of data related to customers.

 

Right? And I’ve got this kind of time series data or transactional data that shows customer bought x at this date and time, and customer bought y at this date and time. Right? And I’ve got customer data sitting in relational data. For this idea of an embedding, let’s also say that I’ve got a big blob of text coming out of a CRM system where a salesperson or a customer service agent is describing a discussion that they had with that same customer.

 

Right? Maybe it’s you.

 

The embedded form of that text... are you saying that now these systems are basically able to say, okay, I can take that customer interaction, and I could vectorize that, I could embed that somehow in this array of numbers? And I can relate that customer interaction with this transaction in a relational table, and with a video of this person, because it happened to be tagged with that person’s name for whatever reason. And I can relate all those things together.

 

That’s a perfect example. That’s a perfect example, because essentially what you’re doing is, the embedding space that the model learns, you get to decide what it represents. And so with language models, when the tokens come in, the words come in, they’ve already been embedded. So they’ve already learned some embedding. And there’s a concept that you can hold in your head, everybody listening, that I think is just beautiful, which is these are highly, highly multidimensional spaces.

 

Well, that’s just a fancy way of saying, for every word, it’s an array of floating point numbers, and that could be thousands of floating point numbers. But conceptually, that’s just a point in some multidimensional space. And if we imagine embedding words in a three-dimensional space, then they could be, you know, words here in our world that we could see in three dimensions. And the beautiful part is this: the traditional approach is to have the semantic meaning of these words be represented in that embedding space.

 

And so if I gave you a word, like the word cat, and I located it in this three-dimensional space, if I drew a sphere around that word, all of the words closest to it will semantically be similar to it. And so I would see, like, cat would be closer to lion than cat would be to bear, or something like that. And that’s the beauty of this. And so, sticking with your customer data example, you could embed customers into a customer embedding space where they are grouped by buying behavior or buying propensity or style or knowledge we know about them.

 

Maybe it’s where they live, or, you know, it’s just an infinite number of things.

 

And it’s that trick which is how you can do things like recommendation engines, or you can do fraud analysis, because you’ll have these outliers that just stick out, or you can use it to do buying trend analysis and, like you mentioned, time series and things like that. But it’s all predicated on being able to take this discrete data, which again might be a photo, might be a paragraph describing an interaction, and mapping that into some embedding space that the AI model can operate on.
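A minimal sketch of both ideas: semantic neighbors inside the “sphere” around a word, and fraud-style outliers that stick out in a customer embedding space. All the vectors below are invented for illustration; real learned embeddings have hundreds or thousands of dimensions rather than two or three.

```python
import math

# Hypothetical 3-D word embeddings (invented values).
words = {
    "cat":  (0.9, 0.8, 0.1),
    "lion": (0.8, 0.9, 0.2),
    "bear": (0.2, 0.9, 0.7),
    "car":  (0.1, 0.1, 0.9),
}

def cosine(a, b):
    """Cosine similarity: closer to 1.0 means more semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def neighbors(word):
    """Everything around a word, ranked closest first."""
    return sorted((w for w in words if w != word),
                  key=lambda w: cosine(words[word], words[w]),
                  reverse=True)

print(neighbors("cat"))  # → ['lion', 'bear', 'car']

# The same distance trick flags outliers: a customer embedded far from
# everyone else is a candidate for fraud review. Points are invented.
customers = {"c1": (1.0, 1.1), "c2": (0.9, 1.0),
             "c3": (1.1, 0.9), "c4": (5.0, 5.2)}

def outliers(embs, threshold=2.0):
    xs, ys = zip(*embs.values())
    center = (sum(xs) / len(xs), sum(ys) / len(ys))
    return [cid for cid, p in embs.items()
            if math.dist(p, center) > threshold]

print(outliers(customers))  # → ['c4']
```

The first half is the cat/lion/bear sphere from the conversation; the second half is the “outliers that just stick out” intuition behind embedding-based fraud analysis.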

 

Well, sounds a bit like a knowledge graph to me.

 

You know, in some ways, it is. And in fact, graph neural networks are becoming quite popular. And the difference between sort of a graph neural network and a traditional approach is something called inductive bias, where you’re essentially describing how closely related items are within that space, using sort of graph techniques.

 

And maybe one of the best examples of that is with, like, protein fold prediction, where they know that there are relationships that are predicated on spatial proximity that need to be understood, and you can represent that with graph neural networks.

 

Alright. Very cool. So I wanna skip the second question about the data not being there, because maybe we could end on that, since that’s gonna get into these, like, pithy comments related to synthetic data that my brain may or may not be ready to consume. But let’s go back to, you know, the importance of this proprietary data, and arguably monetizing or operationalizing this proprietary data for use in AI.

 

Everything you just described, it sounds like that’s what would necessarily need to go into a solution to make sure that your foundation model, be it off the shelf or not, could be, for lack of a better word, grounded. Is that the right way to describe it? Yeah. You can control or optimize the behavior based on this unique dataset that you’ve got.

 

That’s right.

 

Is that primarily the work that you guys are doing day in, day out? Yes?

 

That’s exactly right. And you would not necessarily need to start even with a foundational model. I’ll give an example that I think will maybe clarify.

One of the projects we did last year was with a publicly traded company that does loan factoring. That’s essentially buying... this is within the trucking space. Truckers will deliver a load, maybe they drive from Los Angeles to New York. They deliver the load, but they won’t get paid for maybe net thirty.

 

So they can sell that invoice and get paid now, and then the company takes, you know, maybe one percent. There’s a lot of fraud in this space. This company is doing, I think, two point six billion in loans per year, and the average loan amount is six hundred dollars. So an enormous amount of volume, just an enormous amount of volume. They have hundreds and hundreds of people validating the paperwork every day, and they had built up a proprietary record set of transactions going back twenty years. And they were interested: hey, is there a way we could automate some of this decisioning?

 

And it took about a year, but we built a system that, and I just love this, took the loan decisioning process down from twenty four hours to nine seconds.

 

They moved from only operating during business hours to around the clock, twenty four seven. So, you know, if a trucker needed to get an invoice bought on Christmas Eve, that’s fine. They’re open.

 

Fraud levels dropped, chargeback levels dropped. And the beautiful thing is we designed the system purposely so that it was human augmentative.

 

All of those hundreds of people are still there, because they were able to essentially change their business stance and increase the volume that they can deal with. The humans now handle about forty percent of the cases, the ones where the AI says there’s something wrong here. Either the paperwork’s incorrect, or there’s something it suspects might be fraud, and the humans can get on the phone and call people and actually reach out. And the beautiful part about this is, this system was built on all of that proprietary data that none of their competitors have, and it’s wrapped with a policy shell. So the model, as you mentioned earlier, is probabilistic. It’s making probabilistic decisions about the likelihood of repayment. And then, based upon the model’s confidence, it goes through a series of sort of human-crafted policies that ultimately decide if that loan will be approved automatically.
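The policy-shell pattern Ron describes might look something like the sketch below. The thresholds, rules, and outcome names are all hypothetical; the point is only the shape of it: a probabilistic confidence comes out of the model, then deterministic, human-crafted policies make the final call.

```python
# Hypothetical policy thresholds, invented for illustration.
AUTO_APPROVE_MIN = 0.95   # confidence needed to approve with no human
AUTO_DECLINE_MAX = 0.10   # below this, decline outright

def decide(repayment_confidence, paperwork_complete):
    """Route one invoice-factoring request through the policy shell."""
    if not paperwork_complete:
        return "human_review"          # policy overrides the model
    if repayment_confidence >= AUTO_APPROVE_MIN:
        return "auto_approve"          # the fast, fully automated path
    if repayment_confidence <= AUTO_DECLINE_MAX:
        return "auto_decline"
    return "human_review"              # uncertain cases go to people

print(decide(0.98, True))   # → auto_approve
print(decide(0.60, True))   # → human_review
print(decide(0.99, False))  # → human_review
```

The middle band of uncertain cases falling to `human_review` is what keeps the system human-augmentative: the model handles the clear-cut volume, and people work the cases where something looks off.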

 

And here’s the beautiful part about all this. This new capability instantly resulted in a quarter billion dollar market cap bump when they announced it last October, because it’s gonna transform the... Regardless of whatever the actual business returns were, you’re talking about a market cap return just because they announced it to the market.

 

That’s right. Because they were able to announce that they were going to twenty four seven and customer-leading behavior. They were already the best in the business at twenty four hour turnaround.

 

Now it’s nine seconds, and incidentally, fraud and chargebacks are down. It’s just one of those win-win-win-wins. And so we are always preaching to our clients: don’t go build something that is based upon open data that any of your competitors could duplicate, and don’t go put a bunch of time and money into building something that doesn’t give you competitive market differentiation.

 

The gold standard, and this is just one example that I mentioned, is to take the legacy proprietary data that you’ve built up. And, you know, to your point about unstructured data, a lot of times they don’t even realize that all that unstructured data is just gold, just sitting there waiting to be leveraged. And if you can build new capabilities or predictive systems or whatever it may be based upon your proprietary data, the return on that investment will just be astounding.

 

Alright.

 

So, so much for there being no business case for investments in AI. I mean, if I had a dollar for every time that I’ve heard that over the last year and a half... I don’t understand it. Like, even if we’re just talking about GenAI stuff, I went to Gartner Summit earlier this year, and I would say that perhaps one of the prevailing winds I was hearing was this idea that productivity at the desktop? Nah, not really. But when we start to change business processes, that’s when the money’s really gonna flow.

 

I don’t know an engineering organization that isn’t twenty to thirty percent more productive from using copilots. I don’t get this. I don’t understand this.

 

You know? I don’t understand it either.

 

You know, I do a lot of public speaking, and I’ll ask people, you know, raise your hand if you’re using AI daily. It’s, like, ninety five percent. And I’ll say, keep your hand up, or raise your hand, if you just don’t use AI at all. And there’s always just a handful of people, and they’re like, I’m not impressed. I don’t get it.

 

It is transforming the way we work. We’re an AI company. We’re leveraging AI coding assistant tools like crazy. We use it for data analysis, the deep research capabilities. I mean, to me, it’s analogous to people in, like, nineteen ninety nine saying, the Internet? You know, that’s not really gonna affect my business.

 

Who’s who’s gonna buy clothing online?

 

Exactly. Exactly.

 

Right? Like, I remember that. Okay, they may buy a book. Sure. But they’ll never buy clothing online. They’ll never buy, like, durable goods online.

 

I literally had somebody tell me one time, I remember it was in the early two thousands, and they said something like, I would never give my credit card out on the Internet. You know, these are just mind shifts. It’s technology.

 

I think, you know, a lot of people, and probably a lot of people listening to this podcast, we like technology. We kind of embrace change. There’s a large segment of the population out there that is fine. They would, you know, wish that their user interfaces never shifted, never changed.

 

Just keep it as it is, I know how to use it now. Change is not something they’re happy about.

 

Well, I would argue, if you are one of those folks, if you are more of a late adopter, that’s okay, and that’s fine. But your business needs you to be an early adopter, needs you to start taking more of these risks. And if you don’t, maybe your CEO is just gonna make you do it.

 

I think that’s true. No, I think that’s exactly true.

 

Yeah. Or or you’ll get reorged under the CFO or you’ll get reorganized under the CTO.

 

What’s that old saying? It’s like, AI is not gonna take your job; somebody using AI is gonna take your job. And I think there’s a lot of truth to that saying.

 

So let’s let let’s get back to some some of these these these engagements that you’re seeing. Right? One of the prevailing wins, you know, above the fact that I’m hearing, you know, people yell at clouds related to the productivity levels and they’re not really real, which I which I just cannot believe.

 

Another thing that we hear often in the world of data is, oh, there are so many of these POCs, and most of these POCs are just flailing and not working, and that's a real problem. Yeah. I'm not convinced it's a problem. I think if you look at data science as an inherently R and D function, a thirty percent success rate, you could argue, is pretty good.

 

What do you think? Do you see failed POCs as a problem? Are you seeing a lot of that with your clients? What's the temperature out there in regards to this?

 

You know, I that’s it’s another great question. We really pride ourselves on getting things into production.

 

No joke. When we started the company, I was like, I'm not doing AI POCs. I am not interested in science projects, and we staffed accordingly. The stuff we build works and it goes into production. That said, we will often have to go through a POC phase. And the reason is we don't necessarily know beforehand if the data, from a quality and quantity perspective, is sufficient.

 

And, you know, I have a couple of thoughts here. If you understand that AI systems are more complicated to build, that they're probabilistic, that there will be challenges that are very organic in nature, that helps. Like, famously, you can have a computer program with a one-letter typo and the whole thing breaks. Well, AI doesn't work like that.

 

You could, you know, take out twenty percent of the neurons and it will degrade the way biological systems degrade. Right? So it has a lot of redundancy, but, unfortunately, it also has some of the negative sides of biological systems, where it's more probabilistic. So if you come in with the mindset of, we're gonna do some work, these POCs may not get into production, but we're gonna learn and we're gonna understand where our opportunities are, great.

 

What we often see is people come to us saying, we tried to build something and it didn't work and we don't know why we got stuck. And the number one thing we see is this: they'll do some type of prototype, some POC. They'll get to, like, seventy five percent accuracy pretty quickly, and they're like, oh, we're seventy five percent of the way there.

 

No. It is just incredibly difficult to eke out those additional percentage points of performance.

 

It’s very much like semi, conductor chip design. You know, they will they will have they’ll drive it to, like, ninety eight, ninety nine percent efficiency from a yield perspective, but to try to get it to a hundred would might double the investment. Right?

 

That’s not linear. It’s logarithmic. Yeah. Exactly.

 

The returns are on a log scale.

 

Yeah.

 

And so that’s the number one cause behind these POC failures is, you know, you can get to sixty, seventy percent like that, and then you get stuck. And the the the sad truth is these are complicated systems. They’re black boxes in a way.

 

We’re professionals. Even we get stuck sometimes building them, and we’ll go into hyper hyper parameter, tuning phases, and and and sometimes you’re just like, you know, why won’t the model learn? It’s just stuck at plateaued, and then you realize, well, we need to maybe change our loss function or or maybe change the way we’re initializing the weights or do cold restarts and things like that. And there are it it’s almost sometimes you feel like you’re dealing with the biological system. So again, I think it’s totally normal that a lot of these POCs don’t work. And if you have the mindset that you’re going to learn from them, great. But most companies think, hey, we’re gonna go do AI and our first try is gonna be a success.

 

So these are inherently iterative enterprises.

 

Really are. Really, really are.

 

Well, and so this would suggest that the traditional... or maybe not.

 

That the kind of more traditional approaches from a software development perspective may not be a great fit here.

 

I think that’s true. That’s that’s really true. We so we our methodology, which I’ll share with the world really quickly is start with EDA, exploration data analysis. Do not start building models.

 

Go look at the data. Go understand the signal. Go figure it out. You may have thousands of columns of data, let's say, and let's say you're trying to build some type of recommendation or prediction system.

 

Most of the signal might be coming from three or four or five of those columns. You might not need the bulk of that data. Only once you understand your data should you start prototyping, and it's very iterative. Even agile methodologies will sometimes fall down, because it will be like, well, we're gonna achieve goal x in this next sprint. Well, guess what?

 

These things are black boxes. Sometimes you will get stuck, and you have no idea what to do, and you're just iterating. I mean, my god, look at Meta.

 

Meta just released Llama 4 a couple months ago, and by all accounts, they had a bad training run and wasted billions of dollars on a model that just didn't come out correctly, because it is so complicated.

 

Alright.

 

I love the advice of: understand what you've got. But I would argue, obviously, that what you described earlier is the strategy part of all of this. Understanding the problem you're trying to solve, the use case, the desired outcome of the model is the place to start; then discover and probe your data to see what's needed and what's not. This is advice that I'm always giving on this podcast, which is: don't just start from greenfield.

 

Hey. We’ve got to go a lot of data. Let’s go figure it out. Understand the problem you’re trying to solve first and work your way backwards from there.

 

That’s the universal truth that we see, I think, with with everything. Okay. It’s a good starting point. You’d you’d mentioned earlier about the data not being there, and I I made this pithy comment about synthetic data.

 

But I’ve heard that being tossed around a lot in that, you know, okay. If we don’t have the data, we can just create a lot of this synthetic data, and this will solve a lot of the world’s problems. It’ll solve problems related to PII. It’ll solve problems related to data availability. Is that where does synthetic data fit in all of this, and and are are people kind of over romanticizing its capabilities?

 

Oh, that’s that is a tough question because I think it’s really context dependent. We’ve done projects where we’ve used synthetic data and it’s it’s really most most frequently needed in situations where you have a really skewed skewed, data distribution. So for example, you are trying to fraud’s a very common example. You’re trying to predict fraud.

 

Well, guess what? It’s a good thing. Fraud’s kinda rare. It’s like cancer. It’s kinda rare, which means almost by definition, you have a data distribution that is not gonna lend itself to training a fraud detection model because the bulk the vast majority of the data you have is the non fraud case.

 

Right? And that’s the that’s the whole reason you’re building the fraud detector in the first place is that it’s hard to detect, it’s rare, etcetera.

 

So synthetic data generation can happen, but there's a bit of a chicken-and-egg problem there: for you to be able to synthesize new data that correctly mirrors the real-world distribution you're missing, you kinda have to have enough data in the first place to bootstrap yourself. So I do think synthetic data and other training techniques like that have a big, meaningful role to play, but it's not a magic bullet. You can't go into a situation, like we were talking earlier about the experts, and say, hey.
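The bootstrapping point above can be made concrete with a minimal sketch: synthesizing extra minority-class (fraud) rows by jittering real ones. It only works because you already have some real fraud examples to start from, which is exactly the chicken-and-egg problem Ron describes. The function name, feature layout, and jitter scheme are illustrative assumptions, not a production technique:

```python
# Create synthetic minority-class rows by adding small Gaussian noise
# to real minority rows (a crude cousin of SMOTE-style oversampling).
import random

def synthesize_minority(rows: list[list[float]], n_new: int,
                        jitter: float = 0.05, seed: int = 0) -> list[list[float]]:
    """Return n_new synthetic rows, each a noisy copy of a real minority row."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)  # must start from a REAL fraud example
        synthetic.append([v + rng.gauss(0, jitter * abs(v)) for v in base])
    return synthetic

# Tiny real minority sample: [transaction_amount, transactions_per_hour]
fraud_rows = [[120.0, 3.0], [480.0, 7.0]]
extra = synthesize_minority(fraud_rows, n_new=100)
```

With zero real fraud rows, there is nothing to jitter, which is the "you can't create data out of thin air" point that follows.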

 

We have all these experts. I would like a system to replicate their predictive capability. And I say, okay, well, where's the training data?

 

And you say, well, it’s in their brain. Synthetic data generation is not gonna help you there.

 

Awesome. Well, that’s that’s good news. I’m glad to hear that. You can’t just create data out of thin air.

 

If you don’t have any data at all, you can’t just synthesize it, you know. Magic wand y. It still it still needs to be there and in a meaningful way. Alright.

 

Last question.

 

A big topic for data people like myself when it comes to AI is this idea of making your data AI ready. You’ve talked a lot about the importance of data, having the data. You’ve talked a bit about data quality.

 

Maybe that’s the answer that you’re about to give here. But to you, somebody that’s out there building these so a data scientist, somebody’s building these solutions day in, day out, every day. What does AI ready data mean to you, and how do you recommend the CDOs do a better job at making it?

 

Tough one.

 

I don’t think I don’t think there’s any blanket answer I can give that will apply to every situation, but there are general sort of best practices.

 

One is having your data siloed and distributed in ways that make accessing it difficult is a frequent bottleneck. That is a challenge.

 

Having data with semantic changes over time. Meaning, we'll see this a lot: well, this column used to mean this, but prior to twenty twelve it meant this other thing. So unifying your data dictionaries, for lack of a better term, and making sure you have consistent semantics is really, really important.
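The semantic-drift problem Ron describes, where the same column means different things before and after a cutover, is usually handled by normalizing records into one consistent vocabulary before any training. The field names, codes, and the 2012 cutover below are hypothetical examples, not from any real schema:

```python
# Normalize a "status" column whose codes changed semantics at a cutover date.
from datetime import date

# Pre-2012 records used legacy single-letter codes; post-2012 uses full words.
LEGACY_STATUS = {"A": "active", "C": "closed", "S": "suspended"}
CUTOVER = date(2012, 1, 1)

def normalize_status(record: dict) -> dict:
    """Map the status field onto a single, consistent vocabulary."""
    out = dict(record)
    if record["as_of"] < CUTOVER:
        out["status"] = LEGACY_STATUS[record["status"]]
    return out

old = normalize_status({"as_of": date(2010, 6, 1), "status": "A"})
new = normalize_status({"as_of": date(2015, 6, 1), "status": "active"})
```

After a pass like this, both eras of records speak the same dictionary, which is what "consistent semantics" buys the model.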

 

Dealing with bias, and understanding it. One of the top misconceptions is that you'll build these AI systems and they will understand what's important, or they won't be biased, that they'll have this sort of omniscient capability to disregard unimportant data. No. They will suck up and memorize anything you give them. So if you have a dataset and it has biases in it of any type, that model will replicate them. So, for example, let's stick with the loan example we were talking about earlier.

 

You build a system and it's got biases in its loan behaviors, but you act like it doesn't, and you're like, oh, this is an omniscient model, it's always making the right decision. You're fooling yourself.

 

So, basically, the answer is: consolidated, consistent data is great, but I would strongly recommend against going and doing a bunch of work on your data in the abstract, in a vacuum, to get ready for AI.

 

Make it initiative driven, and you'll probably have a lot less wasted time and a lot less heartache.

 

Oh, that is pure gold right there.

 

Right? Because I think that's one of the struggles that data people have, which is the nebulous, platitudinous nature of the statement "get AI ready." That sounds like I need to clean the entire house, and I'm not even entirely sure where to start.

 

Right.

 

And what if my guests are staying in the bedroom and not in the guest room? Right? Like, I don't think I need to clean the entire house, and I don't think you do either. And I know I said last question. I was lying.

 

But something you just talked about from a bias perspective. One of the things that I kinda struggle with, not necessarily just relevant to bias, is this idea of data quality as a reflection of the way the world is versus the way we want the world to be. Is it both in your world? Is it necessarily both, or is it strictly about the behavior that you want from the model itself? Right? Another way of saying it: is the only way we can really assess data quality through the performance of the model, or can we assess data quality in advance of having the model defined, in some sort of abstract, deterministic terms of, okay, it's good or it's bad?

 

Yeah, I can tackle that here. Yes, you can absolutely analyze the data in isolation, prior to doing any type of modeling. That's entirely possible.

 

One of the weird things, though, that I just can't help but comment on is: your models are always gonna be biased in some way, just the same way that humans are. You just have to decide what that bias is. Right? And that's one of the challenges I think we're all gonna experience with the introduction of artificial general intelligence: these are not gonna be omniscient, omnipotent systems.

 

They’re gonna be incredibly intelligent, maybe even superhuman intelligent, but they’re gonna come to the table with biases in one form or another because that’s the nature of the universe. And so, again, understand your domain, understand your data, you can do bias analysis in isolation, but it’s a business decision ultimately. What do you want and that will drive how you manipulate the data and and clean and debias or whatever it may be. And it will just and it will help you on the modeling side because you will have to make decisions on the modeling side that are, you know, tangible everyday decisions about confidence levels and things like that that only humans that understand this the the the nitty gritty details will be informed to make the model can’t do it themselves.

 

Not to mention just an overall kind of risk assessment.

 

Like, what are we ready to accept and what are we not ready to accept?

 

And that’s that’s going to vary by use case. And I’m not even entirely sure from a governance perspective, You know, we know how to assess quality today. We know how to wrangle. We know how to do ETL. We know how to help with some of the semantic consistency issues that you’re talking about. But the overall risk assessment from a business perspective, and are we ready to do this, Is it ninety percent or is it ninety five percent?

 

These are some questions I'm not entirely sure that we've done a lot of asking historically, because we really haven't had to.

 

I agree.

 

Yeah. I totally agree.

 

Alright. Fascinating conversation. We could keep going for hours. Ron, thank you so much for coming on the podcast, sharing all of your wisdom, helping educate me about some of the fallacies that I had long embraced around the differences between forms of AI. Really, thank you so much for your time.

 

Malcolm, thank you so much. This was a blast.

 

Hey. What’s a good way to get in touch with you, or or to to learn more about Kung Fu dot AI?

 

What what’s what’s what’s a good way to Yeah.

 

You can hit our website, Kung Fu dot AI. You can email me at Ron, r o n, at kung fu dot ai. And if you want, you know, sort of a deeper technical dive, we have a podcast called Hidden Layers. Check that out. We we cover, sort of, the latest breaking news in AI every month.

 

Awesome. Sounds like a podcast that I need to subscribe to. Alright, Ron. Thank you again so much.

 

And for our listeners, thank you for subscribing. Thank you for checking us out week after week after week. Hey. I’ve gotta say, if you haven’t already got one, pick up a copy of my book, The Data Hero Playbook.

 

It talks about the mindsets that we need in order to embrace this AI-enabled future state. It's all about mindset. Ron, thanks again, and thanks to our listeners. We will see you on another episode of CDO Matters sometime very soon.

 

Bye for now.

ABOUT THE SHOW

How can today’s Chief Data Officers help their organizations become more data-driven? Join former Gartner analyst Malcolm Hawker as he interviews thought leaders on all things data management – ranging from data fabrics to blockchain and more — and learns why they matter to today’s CDOs. If you want to dig deep into the CDO Matters that are top-of-mind for today’s modern data leaders, this show is for you.

Malcolm Hawker

Malcolm Hawker is an experienced thought leader in data management and governance and has consulted on thousands of software implementations in his years as a Gartner analyst, architect at Dun & Bradstreet and more. Now as an evangelist for helping companies become truly data-driven, he’s here to help CDOs understand how data can be a competitive advantage.
