CDO MATTERS WITH MALCOLM HAWKER

CDO Matters Ep. 43 | Practical AI and Data Science

February 8, 2024

Episode Overview:

In this 43rd episode of the CDO Matters Podcast, Malcolm interviews Santona Tuli, a former nuclear physicist turned data scientist. Santona shares provocative, actionable insights for data leaders looking to take a more practical approach to implementing AI and data science in their organizations.

If you're looking to show value from AI quickly, Santona shares her recommendations for everything data leaders should consider in their search for the optimal way to embrace these transformative new technologies.

Episode Links & Resources:

Transcript (automatically generated):

Alright. We're here. Good morning, good evening, good afternoon, whatever time it is, wherever you are in the world. I'm Malcolm Hawker, the host of the CDO Matters podcast.

I’m thrilled that you’ve decided to spend some time with us today.

I'm also thrilled to be joined by Santona Tuli, who is with Upsolver. We're gonna hear a little bit more about her current role.

Santona, you are a data scientist by trade. Yes?

Yes.

Wonderful.

Well, we had a prep call, I don't know, about a week ago, before we actually got a chance to meet each other this past weekend at Data Day Texas, which was an awesome day. I think everybody pretty much enjoyed it. I certainly did. But one of the coolest things about Santona's history is that you're a nuclear physicist. You've actually worked at CERN, at the Large Hadron Collider in Switzerland. How in the world does a data person end up smashing photons in Switzerland?

I think it's the other way around.

There was first a physicist who was smashing protons and ions together, and who then went into industry to do data. But it was really that physics was my introduction to working with data. So I don't really think of the two as being separate.

I mean, as you're getting trained as a physicist, through high school and undergrad and all of that, you learn a lot of data skills. You learn how to make statistically significant statements, how to make plots, how to analyze data.

So I had that, but it was really during the time that I worked with this massive particle-collision data that I got a true appreciation for how hard the data work can be, how many different aspects there are. Part of the work was just figuring out, at the hardware and software level, how to collect the interesting data from these massive particle collisions, where data is being created at petabytes per second.

And obviously, we can neither process nor store data at that rate. So one of the bigger challenges is: how do I even trigger on the interesting events? And from there, all the way through to how do I maximize the signal-to-noise ratio, how do I come up with the correct models that describe the signal and the background noise, and then eventually to extracting those signals through generalized linear models. All of that, and, of course, you're writing these programs. We use C++ in the particle physics community, and it's a lot. The collaboration is global; all of these experiments are, so your coworkers are literally in every part of the world. So there's managing that, and all the collaboration software.

There's so much that's similar to data work in industry, especially at the bigger companies with massive data, that I really don't think of that experience as being separate from what I did after that, when I switched over to industry.
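For readers who want to picture that last step, here is a minimal, hypothetical sketch of extracting a signal with a generalized linear model. The detector features, the simulated data, and the 0.8 threshold are all invented for illustration; real collider analyses are vastly more involved.

```python
# Hypothetical sketch: separating "signal" events from background with a GLM.
# The data here is simulated; this is illustrative, not an actual CERN analysis.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Simulate 10,000 events: two detector features plus a signal/background label.
n = 10_000
energy = rng.normal(50, 10, n)     # deposited energy (arbitrary units)
track_len = rng.normal(5, 2, n)    # reconstructed track length

# True (unknown) rule: higher energy and longer tracks => more likely signal.
logit = 0.08 * (energy - 50) + 0.5 * (track_len - 5) - 2.0
is_signal = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

# Fit a binomial GLM (logistic regression) to estimate signal probability.
X = sm.add_constant(np.column_stack([energy, track_len]))
model = sm.GLM(is_signal, X, family=sm.families.Binomial()).fit()

# Score events and keep those above a probability threshold, trading off
# signal efficiency against background contamination.
p_signal = model.predict(X)
selected = p_signal > 0.8
print(f"Selected {int(selected.sum())} candidate signal events out of {n}")
```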

Well, so you're measuring things that are not visible to the human eye. Correct?

Correct. We're talking at a subatomic level here. I mean, we're not seeing these things.

Mhmm.

So I would have to assume that a part of this is you're building models to measure things that cannot be seen, but the models are based on the expected behavior of how these things are actually acting. Yes?

I mean, that's a good question.

Yeah. Part of it is building models based on expected behavior. Part of it is building models based on observed data. So both of those angles come into play.

And we are measuring the aftermath of the collision, which still is, you know, the particles that are being created. We can't see them; they're very small.

But the effect that the particles leave behind in these detectors that we design is really the data that we're analyzing. So we're using that aftermath data to reconstruct what might have happened in the actual collision.

Okay. So for you data professionals out there who are challenged with data observability, try this on for size. This is observability at the next level. But also, what I'm hearing you say is... I don't know.

In my mind, you know, one of the big challenges for data leaders and data practitioners, and this was one of the big themes that we discussed in Austin, is this whole idea of measuring value. Right? And it's a conversation you and I were both involved in during our town hall. It got a little lively.

There, for a while, I went on a little bit of a rant, because that's kinda what I do.

But what you just described is, like, things that cannot even be seen, subatomic level, petabytes per second. Like, holy cow.

Right? We're measuring that. And we're measuring, as you said, you know, maximizing the signal-to-noise ratio. We're coming up with reasonable probabilities of things.

Not to say that measuring value is nuclear physics.

It sounds an awful lot simpler than nuclear physics, though. Right? I mean, I don't know. Maybe I'm going on another rant here. But if we can do the one, I have to think that we could do a before-and-after on the value of data. Or am I just oversimplifying things?

Like, before: here's where we are today. Here are our business KPIs. Here's how efficiently our organization runs.

Here's how quickly we send invoices, or how fast we recognize revenue, or the length of time it takes us to make something. That's the before. Then we do something to the data. We improve our quality. We build an integration.

Maybe we deploy a mesh. Who knows? Doesn't matter. But that's the after.

Gotta think that we could do a before and after.

Yes?

Well, the tricky thing about measuring the impact of anything is: how do you isolate that impact? You're talking about an entire business. Right? There are so many different factors that go into each of those KPIs.

I would be very skeptical of saying, over this two-month period, or even two-week period, these changes that we're seeing in our business KPIs are solely due to the changes that we made in our data quality enforcement or data observability. There are just too many factors that go into it.

Okay. So you're saying a direct one-to-one causal relationship would be impossible.

Okay. I get that, and I agree.

But we do live in a world of probabilities, where maybe it's not one-to-one. Maybe data is just one of many things that could potentially be influencing.

But that could still be meaningful, could it not? If you determine, okay, there are these eight variables that we have in our model. These are the eight things that we think could actually be impacting business performance. Maybe it's the day of the week. Maybe it's whether it's sunny outside. I don't know.

But data could be one of those. Could we get to that?

Yeah, I definitely agree with that. The other issue that I see, though, is that we're always trying to improve the business. So to do this experiment, right, this theoretical experiment that we're talking about, would I hold off on running new marketing initiatives?

Because I just want to measure the impact of, you know, something in my data pipeline? I mean, we definitely don't have that level of trust or collaboration across organizations to actually carve out that time, especially at startups. You know, I work at a startup.

I mean, I agree with you that, in principle, you could do a before-and-after on some changes that you've made, and that would at least provide what I would call an uncertainty band around the possible effect that this one variable could have had.
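A toy sketch of what that uncertainty band could look like in practice: bootstrap the before-versus-after difference in a KPI and report an interval rather than a single delta. The daily values below are invented for illustration.

```python
# Toy sketch: a bootstrap uncertainty band around a before/after KPI change.
# The daily KPI values are invented; in reality you'd pull them from your warehouse.
import numpy as np

rng = np.random.default_rng(7)

# 60 days of a KPI (say, invoices processed per day) before and after
# a data-quality initiative.
before = rng.normal(100, 15, 60)
after = rng.normal(106, 15, 60)

# Bootstrap the difference in means to get an uncertainty band.
diffs = np.empty(10_000)
for i in range(10_000):
    diffs[i] = rng.choice(after, 60).mean() - rng.choice(before, 60).mean()

low, high = np.percentile(diffs, [2.5, 97.5])
print(f"Estimated lift: {diffs.mean():.1f} (95% band: {low:.1f} to {high:.1f})")
# If the band comfortably excludes zero, the change is at least consistent
# with an improvement -- though, as Santona notes, this alone does not isolate
# the data initiative from everything else that changed in the same window.
```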

Alright. Well, I'm not arguing to be right. I'm just... well, maybe I am. But I do believe that either data plays a role or it doesn't. Right? And if it doesn't, I don't have a job.

So, all things being equal, I want to find it, even if it's just five percent. Right? Even if we said something like, well, we're eighty percent certain that there's a five percent influence from better data. Heck, I'll take it. Right? But I'm not gonna be the guy to tell a nuclear physicist that she's wrong.

So I think we can iterate on this.

After this, I think we should really sit down and try to hash out how to better measure the impact of a data initiative on the business.

I think we're agreed that there's no GUT, no grand unified theory of all things, in measuring data value. Right? We're not gonna figure out how to square the circle, but we can at least come up with some reasonable estimations that would help. Alright.

It's funny, GUT, because I think a lot of the time, it is our gut feelings that do the trick.

Well, let's pick on that a little bit. Right? Like, people in the data world like to say, oh, well, you're using your gut and you're not using data. But where does the gut come from?

Yeah. The gut comes from experiences.

Absolutely. Yeah. So I think...

No, I'm a big...

Yeah. Go ahead.

I'm a big proponent of heuristics.

I don't think, even from a science perspective, that we would be where we are today if we didn't have heuristics and didn't go out and test those theories. So, no, I've always said numbers are great; collect the data when possible. One thing is that it's hard to collect the relevant data to do whatever measurement you're trying to do. It's not always trivial. Sometimes it can be quite impossible. So you have to rely on heuristics at some level.

And I think the more you embrace that and work with the data and heuristics together, the gut feelings, or instincts and lived experiences, together, the better your holistic understanding is going to be.

And... I was gonna say something else, but I forgot.

Well, another... so, I mean, yes. All else being equal, we should be using data to make decisions and not just gut.

But, you know, I wonder. And I'm not trying to be contrarian here. I'm trying to put myself in the shoes of an average leader who has to make a decision, like, pretty quick. Right?

And she or he would be given some data. Maybe there's a dashboard. Maybe there isn't a dashboard. Maybe there's nothing, but you gotta make a decision now.

Maybe you go talk to two or three people, and maybe you didn’t actually defer to some known published dashboard, but you did your best in the amount of time that you had to do it.

I think you could argue that that's a data-driven decision.

But I think the real problem, and this is an interesting academic exercise, is that we're not tracking the decision, and we're certainly not tracking the outcome of the decision.

Right? Like, when you were smashing... I said photons, but you said it's not photons. Those are light. Right? What are we smashing?

Protons, and then also just ions. Yeah.

Okay. When you're smashing them, you know the before state and you know the after state. Right? And the decision in that case is the smash.

Right? And you're able to measure all of these states, which gives you confidence in your data. But in our case, we're often completely and totally unaware of the decision itself, that a decision has even been made.

And then we will see the net impacts of some sort of outcome, but we're not really modeling that. We're not following the life of the decision. So there's this kind of growing field of decision science that says that you can do all these things. What do you think about that? Are you familiar with decision science? I mean, do you think there's validity there?

Yeah. Yeah, I am familiar. So, okay. First thing, as a physicist, I have to say: we don't know the after state of the collision.

We are trying to probe the after state of the collision.

Okay. There are signals, though. You mentioned there are, like, signals, footprints or something, that we're getting. Okay.

Yeah. We take the footprints in order to analyze, but that's where the interesting unknowns are. That's what we're probing: what happens when you collide these two particles at this energy.

So, decisions. I think that there's a gap, and I think you'll agree with this, between data science and decision science today. And I think it's gonna take some time to close that gap. But even when you said a leader receives data to make a decision... right? I mean, I know that's the reality; as a leader, you probably don't have the time to go work with the data. But it's not enough to be given data, because what does that mean? If you're given a dashboard or a report, which is all you probably have time for, you probably don't get to spend more than five minutes reading up on it before you have to make the call.

Is it really enough? Because there's so much bias that is introduced by the person that's actually doing the analysis, that's gathering the data, that's giving you the insights. And you're not privy to any of that.

And by bias, I don't mean, like, they're ruining the data or ruining the results.

Work with it.

Exactly. Just working with the data, you introduce your own biases in there. And have you documented your assumptions? Have you documented the limitations of whatever analysis you ran? What's out of scope? All of that, you're not going to be able to communicate.

At least, you know, we humans generally are not able to communicate these things when we just provide a stakeholder or an executive with a number, or a set of numbers, that they have to go make this decision with. So it's hard at the core, this idea of being able to make good decisions using data, in a data-driven fashion.

Which is why I think, yes, partly the heuristics really have to come in. You have to go with what you've seen and what mistakes you've made in the past.

And then the other part is, you know, I think a good leader will have the curiosity to not just look at a report that someone's produced, but, if enough information is in there, to probe and understand, and maybe sit with this person and say: okay, you say you did this analysis and these are the results, but can you tell me more? What are the parameters? What did you have control over, and what didn't you?

And really try to get the whole picture before actually relying on that data to make decisions. And one last thing, this is something I was lamenting, I think, on our previous call: we don't do error bars in our line of work in industry. Right?

That's a big miss. We would make much better decisions if we did error bars, uncertainty ranges, on the results that we were determining and sharing. Because, you know, cutoffs are cutoff numbers, thresholds.

It's just, you know, point one under or point one over makes all the difference. So what is actually my confidence level on this number that I'm quoting?

That's what you mean by an error bar. Meaning, I'm seventy percent confident that this will drive twenty percent returns, plus or minus some sort of, you know, error rate.
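As a concrete, hypothetical illustration of quoting a number with an error bar, here is a minimal sketch; the counts and the threshold are invented.

```python
# Minimal sketch: quote a metric with an error bar instead of a bare number.
# Counts are invented; the point is the +/- band, not the values.
import math

conversions, visitors = 230, 1_000
rate = conversions / visitors

# Normal-approximation 95% interval for a proportion (fine at these counts).
se = math.sqrt(rate * (1 - rate) / visitors)
low, high = rate - 1.96 * se, rate + 1.96 * se

print(f"Conversion rate: {rate:.1%} (95% CI: {low:.1%} to {high:.1%})")

# Santona's threshold point: "point one under or point one over makes all
# the difference." Compare the whole band, not just the point estimate.
threshold = 0.25
if low > threshold:
    print("Confidently above threshold")
elif high < threshold:
    print("Confidently below threshold")
else:
    print("Too close to call -- the error bar straddles the threshold")
```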

Well, I wonder how people would react to that, because that seems logical to me. Right? Maybe I'm just a little bit too scientific, maybe a little bit too analytical, but that seems logical to me, because we live in a world of probabilities.

Nothing is one hundred percent. Right? We live in a world of probability. So this is one area where I think a lot of data leaders maybe struggle a little bit, in that, historically, we've been really deterministic about everything.

Right? Like, we've been really binary about everything. Either it's good data quality or it's bad data quality. Either we're data-driven or we're not very data-driven. I don't know if that's an indictment of society as a whole or just the way a lot of us think, but that's not really how reality works. Reality is all probabilities, and that's kinda how AI works as well.

Yeah. So I don't know. I'm thinking and talking at the same time, but I think there's an opportunity for a lot of data leaders to start thinking differently about their business. And one of those ways would be: okay, perfect doesn't exist.

And maybe sixty percent is good enough.

What do you think?

But I need to know... is it sixty percent?

Is it twenty percent? Like, what's good enough, and what sort of uncertainty does the number that you're providing me have? But, yeah, I mean, it is definitely very challenging. And one thing that doesn't help is the idea of move fast and break things, which is so much the ethos, especially at startups. And, I mean, it's always gonna be a trade-off.

So I certainly get it. I'm not saying throw away that mindset and slow down all of your work. But there has to be a voice of reason too, I think. So this is why I think, as a startup grows into the, like, thirty-to-fifty-person range, and, of course, it's just a generalization.

The growth can have different metrics.

At some point, you need to have, okay, the R&D team or the engineering team that's sprinting and churning out and putting out these features. And in that product org, you need the other side of it, the other angel that sits on the other shoulder, right, to say: let's measure. Let's measure whether we're doing the right thing. Let's measure whether we're moving too fast. Let's stop and think about whether that feature that we're so excited about is even necessary. Let's do the market research to figure it out. And good product orgs do have that.

And I think data folks, data teams, are in a position to be that voice of reason on the other side of "let's just run and produce, produce, produce."

Well, it's all about balance. Right? It's not just break things, but it's not just analyze forever. It's some happy medium, and that's gonna vary. Your mileage will vary based on your company. To your point, startups: yeah, let's go break some stuff.

Big, giant companies, Fortune one hundred companies: maybe it's a little bit too much of the analysis, right, and not enough of the action, and they need to find some sort of balance there. That's what I've been urging data leaders to do when it comes to AI. Because what I've seen is, for a lot of them, not all of them, AI is just kind of unknown. Right? It's kinda scary. It's unknown. I'm not even really sure how it works. It seems like a little bit of sorcery to me, and I need to slow down.

Right? That seems to be the reaction of a lot of folks, and it's a natural reaction. I get it. Right? So what would you say to that data leader who's thinking, okay, maybe I need to slow down? And maybe they're not reacting fast enough, or maybe they're not reacting as fast as their board would want them to. What do you say to that data leader?

Yeah, it's an interesting challenge. It's really hard to separate the hype from the reality, or the hype from the actual advancements that we've seen in the last year-ish. LLMs are fantastic. And, I mean, today we have multimodal models, too, that are generative.

And the promise and the potential is huge. And I think that's what brings on the hype. Right? There's promise, so we must all jump on this ship. But it's a hard decision for every data leader that's receiving pressure from their superiors, from the C-suite, or from the chief executive officer, to incorporate AI somehow into their mandate.

I definitely resonate with that. The pressure is real, and it's a difficult position to be in. However, it's also a time to really be contemplative about this. Because I think, at the end of the day, the question that we really need to be asking, as data leaders of data teams, is: what is the natural fit between my product and an AI augmentation, or an AI version of it?

Like, I can't really force AI into the product if there isn't a natural way to weave that story in. Now, one common... so I work in tooling, data science tooling, and one common way to augment the product with AI is a copilot. Right? Code generation or code completion, in some sense, or having a little assistant that sort of helps you with your workflow.

And that's potentially useful. With certain tools, at certain scales, it's certainly useful. But if, for example, you have a low-code interface already, and the AI can at best make, like, ten different suggestions, because that's literally the number of ways you can configure the task that you're trying to do, it's not so useful. Right?

So even before you get to the lift of trying to incorporate AI into my product: what is it really going to add, and is there a natural story? And if there isn't, then we have to push back. We have to push back on the data leaders and say: hey, this is hype, and I understand you want to do it.

And the pressure is probably coming from even higher up. You know, it's probably the VCs and whatnot. But there isn't really a natural story to tell here. Now, we could add something that's completely absent from the product today but would fit more into an AI story.

But let's recognize that that's a pivot, or at least a bifurcation, of the product.

So, yeah, I think for me the biggest thing is asking what the fit is and whether that fit adds enough value.

Once you get beyond that and you say, okay, there is some benefit to be unlocked here for my specific product, the next challenge, of course, is hiring the right folks to actually incorporate it, and that's also hard.

And this is where I say, one of the things that I've spoken about previously is role definition, defining different data roles.

There are lots of ML and AI professionals, folks that really know this space and have the experience.

But roles like data scientist, for example, are neither here nor there. Right? The title isn't self-explanatory.

So you really have to look at someone's resume and someone's experience and what they're interested in, in order to figure out if they will be able to incorporate an AI-based practice, or an AI project, or whatever it may be, into the product.

But, I mean, those professionals are certainly out there. And the other aspect of where we are today with AI is that a lot of it is becoming turnkey.

Right? For example, these large models: you're not gonna train and host them. You're probably going to ping their predict API to get an answer, and that's an easy integration. It's still new, and you won't find a lot of people who have done exactly that, but it's a relatively easy integration.

So, yeah, figure out the scope, figure out what value it adds. And once you have that, you just make sure to try to hire... and that's challenging too. But try to hire the right people in order to do that integration.
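The "ping their predict API" pattern can be as small as the sketch below, shown here with the OpenAI Python client as one example of a hosted model; the model name and prompts are placeholders, not a recommendation.

```python
# Sketch of the turnkey pattern: call a hosted LLM's API rather than training
# or hosting the model yourself. Model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever hosted model you've chosen
    messages=[
        {"role": "system", "content": "You summarize data-quality incidents."},
        {"role": "user", "content": "Summarize: 3% of orders missing region."},
    ],
)
print(response.choices[0].message.content)
```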

Well, I would argue, you know, does everybody need to run out and hire a three or four hundred thousand dollar a year data scientist?

Maybe not.

But what you just described is: use a turnkey solution. Right? Use something out of the box, like an LLM, whether that's open source, like a Bard, or whether that's an OpenAI.

Find ways to do that. But in what you described, I think there's still some unique work. When all this started, you know, the craze was prompt engineering, and nobody really knew what that meant. And people were like, oh, wait a minute. Hold on a second. Is this a really unique, standalone expertise?

But where we find ourselves now, I think a year later, what I think we're seeing is that it's not just prompting. Well, it may be prompt engineering, but it's really complex prompt engineering, where we are doing something to data in order to pass it over to an LLM. We're now passing maybe a known fact set, and maybe that's in a graph or in a vector store or in something else. But, you know, the latest version from OpenAI will take in, like, three hundred pages in a prompt. So you can pass an awful lot of information in as a known fact set. I think that task in and of itself, doing that, is what is known as a RAG pattern.

Right? Retrieval-augmented generation. Doing that work sounds more like a data engineering skill to me. What do you think?

That's interesting. I had never thought of it that way.

How have you thought about it?

I just... I've never thought about whether or not it's a data engineering skill.

Okay.

I'm not even a big fan of necessarily differentiating between data engineering and data science skills. I find more value in differentiating, like, analytics skills and ML skills, whatever your role is in either of those fields. But to come back to the question of prompt engineering versus data engineering versus all of this: I mean, we might not agree on definitions here, but I do think of data engineering as sort of integrating data, bringing in data from different sources.

Mhmm.

But with prompt engineering, or, in simpler words, knowing how to ask the right question with the relevant context to a generative model to get the right answer...

There might be some more engineering-heavy aspects of that. Like, how do you build out the retrieval infrastructure for your relevant data? In which case, it's closer to engineering.

But from the design aspect, like, how do you architect the pattern of providing data and retrieving information? I feel that's probably closer to a data scientist's job req. So, I mean, I think as with any sort of technology, it's gonna be a mix. Right? You're going to need different skills that come into play here. I will just briefly add that I was very skeptical of prompt engineering becoming a thing. I was like, well, you know?

Now that we've iterated on that a little bit and, as you say, RAG, retrieval-augmented generation, now that we've put more thought into what that really means and what that looks like, I think it makes a lot more sense.
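For readers who want to see the shape of the pattern being discussed, here is a minimal, self-contained RAG sketch. The facts are invented, and the embeddings are random stand-ins; a real system would call an embedding model and a vector or graph store.

```python
# Minimal RAG sketch: retrieve relevant facts, then prepend them to the prompt
# so the model answers from a known fact set instead of guessing.
import numpy as np

facts = [
    "Customer records are mastered in the CRM, refreshed nightly.",
    "Revenue is recognized when the invoice is settled.",
    "Region codes follow ISO 3166-2.",
]
# Stand-in embeddings (one vector per fact); normally from an embedding model.
fact_vecs = np.random.default_rng(0).normal(size=(len(facts), 8))
fact_vecs /= np.linalg.norm(fact_vecs, axis=1, keepdims=True)

def retrieve(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Return the k facts whose embeddings are most similar to the query."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = fact_vecs @ query_vec  # cosine similarity on unit vectors
    return [facts[i] for i in np.argsort(scores)[::-1][:k]]

question = "When do we recognize revenue?"
query_vec = np.random.default_rng(1).normal(size=8)  # stand-in query embedding

context = "\n".join(retrieve(query_vec))
prompt = f"Answer using only these facts:\n{context}\n\nQuestion: {question}"
print(prompt)  # this assembled prompt would then be sent to the LLM
```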

So I've had the privilege of working in NLP a little bit, and this is not exactly new. Right? The part that's new is that these models are massive, and the scope is massive.

But even the generative aspect isn't new. The way that we train models... one thing that I've done is a recency bias. Right? You're going to train your model on historical data over three, four, five years, however much you have.

But, obviously, the context that's gonna be most relevant to your model tomorrow is the data that's come in today. So you want to weight that data more. So putting in that recency bias, things like this, the architectural patterns that go into how to use NLP to best serve the use case: that's been there.
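One hedged way to picture the recency bias she describes: give newer training examples larger sample weights. A toy sketch with invented data; the half-life is arbitrary.

```python
# Sketch of recency weighting: let newer training examples count for more.
# Data is invented; the pattern is the exponentially decaying sample weight.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

n = 2_000
X = rng.normal(size=(n, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
age_days = rng.integers(0, 5 * 365, n)  # how old each example is

# Half-life of 180 days: a year-old example counts about 25% as much as today's.
half_life = 180
weights = 0.5 ** (age_days / half_life)

model = LogisticRegression()
model.fit(X, y, sample_weight=weights)
print("Coefficients fit with recency weighting:", model.coef_.round(2))
```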

And there are a lot of folks that have been doing this and know how to do it really well. And not all of them are gonna call themselves an AI engineer. So I guess I'm trying to say that there are plenty of skilled professionals in the space, and professionals that can quickly pick up either one set of skills, like integrating with the APIs, or the other side of it, like architecting how to do the retrieval. We're gonna have plenty of people that are going to be able to pick up all of those things. So, honestly, in some ways we're going back to this idea of prompt engineering being, like... I think of it as knowing how to Google well. Right? In some sense, right?

So yeah.

But also, I mean, I think it may go a little bit beyond that, at least for now. I think in the future a lot of this will be figured out. But, you know, our common friend Juan Sequeda of data.world gave a presentation at Data Day in Austin where he was talking about using knowledge graphs to drive a thirty percent improvement in the accuracy of LLMs.

So I think that's material, right, where you can pass in information and known facts, pass context, right, into an LLM and improve the quality of the responses. So I think for data leaders, that's attractive, because so many are really freaked out by hallucinations, right, or the creativity, or just blatant inaccuracies.

And given we do live in this deterministic world, I mean, I do worry that a lot of people are just seeing one or two bad apples spoiling the entire bunch.

Right? Where they've maybe got a negative experience, or they've seen, you know, an LLM hallucinate. Maybe they've put their own name in and asked about themselves, and what they see back isn't quite factual.

Right?

So I think there's promise on the horizon to mitigate that.

But let's shift into a little bit more of a thought exercise.

A lot of companies, particularly bigger companies, are sitting on a lot of data that is just sitting there fallow, not driving value at all. And I think, kind of buried in a lot of the transactional data and a lot of the metadata, is a lot of really valuable stuff about how companies perform, what works, what doesn't work.

And maybe this is too high-level and too academic a question, but how would a data leader go about utilizing AI, or utilizing anything, to start understanding that? Maybe it doesn't have to be AI. I'd just love your opinion. I do dashboards today. Right?

I do some reporting today. I give the business what it wants. I make sure that the dashboards reflect the KPIs that the business is running on. But, you know?

And I tinkered in the past with Hadoop and big data, and it was a little bit of a train wreck, because, you know, we couldn't find a use case. But I know in my heart there are unknowns out there, a whole bunch of unknowns out there that could transform my business. How do I go about finding them in a way that doesn't end up with me getting fired because I spent two million dollars on a boondoggle?

Well, that's a really deep question, and an excellent question. And I will start by saying I don't have the answers.

But I fully agree with you that there is... okay.

So, just breaking things down. Right? We collect data about our business to understand how the business is doing. The interesting part: the data is just the means. It's not the goal.

And we always have to make choices around what data we collect, because the business doesn't leave a perfect footprint. And even if it did, even the best version of the footprint is often inaccessible. So it's proxies upon proxies upon proxies, which is fine. Models and representations.

So that's the best we can do, and that's something we have to accept. However, I really like something you said. When you're sitting on so much data and you've done the exercises of trying to figure out which data sources are most important, which ones you combine to get what answers, you're inevitably leaving out some important information. It's just how it is. No matter how well you think you've seen through the data. Especially, like, metadata: I think folks are talking more and more about metadata these days, but I think for the longest time, we haven't been paying as much attention to our metadata.

So, yes, whether it's hidden in the transaction logs, whether it's hidden in metadata, whether it's hidden in, you know, deeply nested JSON structures, I feel fairly confident in saying that there is a lot of information that you haven't unlocked. Because we've also fallen into this pattern of, I like to say, there are, like, sixty questions to answer in analytics. I mean, it's obviously tongue-in-cheek. It's obviously a generalization.

But we've gotten so used to this pattern of: okay, I need my HubSpot data. I need my Salesforce data. I need my XYZ. A really formulaic way of answering those questions.

And as we were saying earlier, things are not deterministic like that. Right? So I think my answer to your question is: it's hard, but it's a worthwhile exercise for sure. I do think, with different kinds of information architectures, right, going beyond the enterprise data warehouse, or newer ways of doing data modeling, there's value in breaking out of those a little bit and thinking about information architecture anew, given the tools that we have today that we didn't used to have.

So, specialized data stores like graph databases; or inverted-index search databases like, you know, Elasticsearch; specialized data stores for images and language; or vector databases. We have all of these tools at our disposal now, and there are even companies trying to be the layer on top of all of them, so you don't have to go and integrate your vector database and your graph database yourself.

So if we can really leverage that, if we can dig into that and figure out how my information fits in, how I can represent the different aspects of the information that I have in the correct format that's most relevant, most optimized, I think there's a lot of promise there. But it's a lot of work, I would say. And I'm not sure that one would be able to go in tomorrow and say, hey, I need, you know, two years' worth of resources or something to really get a better representation of my business.

And, I mean, then you've got to show the value of that. Right? You try to project the estimated value, and maybe a lightweight model is good enough. Right?

You know, how much effort is it worth to get a slightly better model?

Right.
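As a toy illustration of the graph-database idea above, and of the relationship discovery Malcolm picks up on next, here is a hypothetical sketch of representing business entities as a graph and surfacing an indirect relationship; the entities and edges are invented.

```python
# Toy sketch: represent business entities as a graph and surface an
# indirect relationship you might not see in flat tables. Invented data.
import networkx as nx

g = nx.Graph()
g.add_edge("Customer A", "Order 1001")
g.add_edge("Order 1001", "Product X")
g.add_edge("Product X", "Supplier S")
g.add_edge("Supplier S", "Plant 7")
g.add_edge("Customer B", "Order 1002")
g.add_edge("Order 1002", "Product X")

# A path query: how does Customer A connect to Plant 7?
path = nx.shortest_path(g, "Customer A", "Plant 7")
print(" -> ".join(path))

# Shared connectivity reveals non-obvious links, e.g. every customer exposed
# to the same supplier -- tedious to find with SQL joins alone.
exposed = sorted(
    n for n in g.nodes
    if n.startswith("Customer") and nx.has_path(g, n, "Supplier S")
)
print("Customers exposed to Supplier S:", exposed)
```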

There's maybe a little bit of a paradox buried in here, in that, you know, we started the conversation talking about business value.

And if business value is articulated through kind of known KPIs, right, the way we measure the business, those are arguably trailing indicators.

Right? They're certainly not leading. Maybe sometimes they can be.

Maybe customer satisfaction. There are a few others. But they're generally measures of past performance. They're not indicators of future performance. They're trailing indicators.

And finding that unknown that could be transformative requires an undertaking that will consume resources, that will consume time, and that will require a lot of experimentation and a lot of R&D. And for a lot of companies, that sounds a bit like a lottery ticket. Right? But if you get it, if you land it, it could be like, holy cow.

But maybe the happy middle ground is somewhere in what you just suggested: maybe more of a knowledge graph, where you run a graph and you learn about a relationship that you didn't know about before, and you could use that as a way to focus some of your efforts, instead of just trying to come up with any correlation out there possible. And this is another reason why I actually kind of like the concept of master data, because it helps you focus your efforts. Right? You know customers are important.

Right? You know your suppliers are important. You know your materials are important. It at least gives you a place to start looking.

So instead of it just being a total science experiment where you're trying to, you know, solve for everything all at the same time. So, anyway, this is awesome stuff. Tell us a little bit about Upsolver really quickly. I mean, I'm a big fan because you bought me some beer at Data Day Texas, but quickly tell us what your company does.

Yeah. So at Upsolver, we do believe that a lot of your valuable information is locked up in your applications, in your product.

So we want to help folks bring that data into their data infrastructure, into analytics infrastructure, or whatever the use case may be. We specialize in ingesting streaming data sources such as message buses and event queues, like Kafka, Kinesis, etcetera.

So if your microservices are talking to each other, that's a queue that's generating a log. We help you ingest that.

And then databases. So Postgres, MySQL, all of these different databases that are powering the actual applications that are serving your end users. We're also very good at ingesting that data. You can land it in warehouses like Snowflake. You can land it in lakehouses.
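As a generic illustration of this kind of ingestion, not Upsolver's actual product or API, here is a minimal sketch using the kafka-python client; the topic, broker address, and batch size are placeholders.

```python
# Generic sketch of the ingestion pattern described above -- not Upsolver's
# API. Reads JSON events from a Kafka topic and batches them for loading.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "orders",                               # topic name: placeholder
    bootstrap_servers="localhost:9092",     # broker address: placeholder
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 500:
        # In a real pipeline, this is where you'd COPY the batch into
        # Snowflake, a lakehouse table, etc., then commit offsets.
        print(f"Would load {len(batch)} events to the warehouse")
        batch.clear()
```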

We actually have an event coming up next week, the Chill Data Summit, which is all about learning about Apache Iceberg, because we just did a...

I'm gonna interrupt, because... what day of the week is the Chill Data Summit?

It is on Tuesday.

And this airs on Thursday. So the Chill Data Summit will already have happened, unfortunately. But for next year, maybe twenty twenty five.

Yeah. And, also, you know, the talks are gonna be recorded, and...

Oh, good. Cool.

Good stuff.

So, yeah, we've incorporated Apache Iceberg into our product, so you can also write your data into a lakehouse. Really, however you're comfortable interacting with your data once it's in your systems, we've left all of that open. I think folks often struggle with just getting that data in, because folks often don't even think about those data sources if they're in the my-CRM-and-my-SaaS-applications world. So we help unlock the value of your production data.

I've finally just put two and two together.

Iceberg, chill. I finally put that together. We had talked about this a couple of times, and it just clicked: oh. Alright. So, my last question.

Is it safe to say you're generally optimistic about AI and the future of AI?

Yes.

Okay. What do you think about the concept of AGI?

You know, where potentially the machines get smarter than us? Do you see that future? Do you see it as imminent? And if so, what do you think of that future?

I don’t think it’s that imminent.

I think it's natural. I think we're headed that way.

But I don't think that we're really gonna get there in my lifetime.

Maybe.

There's a lot of science fiction on this. I don't really like to think about it, I guess, if I'm being honest. Because I love technology. I love advancement, and I love answering questions. Right? This is why I went into science. It's really important to know the fundamental truths about the universe, as best we can, and to model it.

So I'm pro.

But on the other side, it's going to be a different world if we get there. I mean, the main thing is the processing power. Right? We have our limitations as far as how fast we can do things.

But we've been using tools to augment the way in which we understand the world. I mean, even a calculator. Right? Instead of doing multiplication in my head, I'll use a calculator.

So the fundamental concept isn't new: being able to use a machine that is in some way superior to us, to me, in at least one particular area, to help me do something. Now you generalize that to every area, and it's natural, I think, to feel a little bit threatened.

But at the same time, I think we have to understand that if bad things happen, it's going to be the malicious intent of human beings that leads to the bad things happening. So I think we have a little bit of time to do this, but I think it's very important for us to start thinking about the laws and governance around how to build and how to use AGI, so that, you know, we don't end up in one of those dystopian futures.

That's a great way to end. The future is on us. It's on us to try to figure out all of these rules, these policies, these ethics, all of it. What a wonderful conversation. Santona, thank you so much for taking time out of your busy day. I look forward to seeing you. Will you go to Data Universe in New York?

Probably.

Alright. Cool. Well, I'll catch you there. To our viewers, please take the time to subscribe to CDO Matters. Take the time to like this if, in fact, you did.

I look forward to seeing many of you on the next episode of CDO Matters very soon. Thank you, Santona, again, so much.

Thank you so much. Thank you.

ABOUT THE SHOW

How can today’s Chief Data Officers help their organizations become more data-driven? Join former Gartner analyst Malcolm Hawker as he interviews thought leaders on all things data management – ranging from data fabrics to blockchain and more — and learns why they matter to today’s CDOs. If you want to dig deep into the CDO Matters that are top-of-mind for today’s modern data leaders, this show is for you.

Malcolm Hawker
Malcolm Hawker is an experienced thought leader in data management and governance and has consulted on thousands of software implementations in his years as a Gartner analyst, architect at Dun & Bradstreet and more. Now as an evangelist for helping companies become truly data-driven, he’s here to help CDOs understand how data can be a competitive advantage.
