Andy Dang

SE Radio 534: Andy Dang on AI / ML Observability

Andy Dang, Head of Engineering at WhyLabs, discusses observability and data ops for AI/ML applications and how that differs from traditional observability. SE Radio host Akshay Manchale speaks with Andy about running an AI/ML model in production and how observability is an important tool in diagnosing and detecting various failures in the application. They explore concept drift and data drift as indicators in assessing a model’s quality and what corrective actions to take. Andy describes the challenges arising from high dimensionality and data volume, as well as from organizational structures that manage and operate various aspects of the data infrastructure, and how observability can detect and solve problems in production. This episode also considers explainability from an observability perspective and how it helps stakeholders, including both builders and consumers of AI/ML applications, understand what they are seeing from AI/ML models.


Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Akshay Manchale 00:00:17 Andy, welcome to the show. To get started, let’s set some context about how this is different from data observability. So, can you tell us what observability means in the AI/ML space and how that is different from observability in general for generic software?

Andy Dang 00:00:29 So, when it comes to general software observability, people often think of companies like Datadog, New Relic, Splunk, and open-source technology like Prometheus and Grafana. Those pieces of the software observability stack are designed around application performance monitoring. And the reason that it’s very different from data observability and AI observability is that you’re focusing on different ways of measuring problems. A specific example is when you look at the time to process a web request. In the context of software monitoring, you probably want to monitor things like average latency or the P99 latency of your request processing pipeline. And it’s very specific to a particular metric.
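For contrast with what follows, here is a minimal sketch of the single-metric style of monitoring described above. The latency samples are made up for illustration, and the nearest-rank percentile is only one of several common percentile definitions.

```python
def percentile(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100  # ceil(q * n / 100) with integer math
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow request.
latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 14]

avg_latency = sum(latencies_ms) / len(latencies_ms)
p99_latency = percentile(latencies_ms, 99)

print(f"average: {avg_latency} ms, P99: {p99_latency} ms")
```

One metric, one or two summary numbers: the average hides the slow request, while the P99 surfaces it.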

Andy Dang 00:01:22 Whereas in the space of data and ML observability, you tend to deal with a lot more dimensions. And what I mean by dimensions: you could think of them as metrics, but of what we call “high cardinality” in this space. To make it more specific, when you look at a data set like a table in a SQL warehouse or the inputs into a machine learning model, you see a lot more fields. And each of these fields would have its own metrics. If they’re numerical features, you can look at the median value or you can also look at the distribution of these values. So, from the dimensionality angle, there are a lot more signals in the space of data and ML observability. The other angle is how these metrics change in terms of behavior. In the case of software, I would say it is a lot more predictable in terms of how metrics would change.

Andy Dang 00:02:26 Typically you can correlate it with traffic to the website or the size of your container in terms of CPU and memory usage. But in the context of data and ML observability, the data is actually coming from various sources: typically from a business source like your partner team, or it can be something human-incorporated, as in people write their own custom logic to, say, take the sales number and then do some transformation to make some prediction. Or it could be from the real world, like human behavior. And this is actually a big problem for the machine learning world, where people’s behavior changes in real life due to external factors like weather. On a hot day, maybe people will spend more time in the mall, and your camera feed will see a different pattern compared to a colder day, for example.

Andy Dang 00:03:25 So, those external factors are not managed by the people who deploy software or machine learning, and therefore the possibility of these signals misbehaving is, I would say, not higher but really hard to capture, especially when you deal with signals across a large number of features or columns. So that’s the biggest challenge that we see. As a specific example, think about when you make a purchase on a website and your credit card goes through a fraud detection model that gets fed a lot of signals, including your IP address, geo-location, and shipping address. All of that gets run through these models, and those signals can change underneath the hood because, say, COVID changes people’s behavior and people purchase a lot more toilet paper. And if the model has never seen that before, it probably thinks there’s some fraudulent activity happening. So, those are the examples that we’re trying to capture in this space: not just underlying software-driven behavior but also changes in extraneous factors outside of the machine learning or the software deployment space.

Akshay Manchale 00:04:44 That makes sense. In traditional software, I guess failures are predictable in that it’s either in your application or maybe it’s in your infrastructure, and maybe just turning your whole stack off and on again might fix the problem if you observe that it is failing. In machine learning, the way that you’ve just described it, it seems like there are more failure modes that are external factors than internal factors. Can you shed a little more light on what sort of failure modes you see in the AI/ML space, and how observability bridges the unknowns that you might see in this space so you can diagnose and debug problems?

Andy Dang 00:05:24 Yeah, so before we started WhyLabs, we did a big survey across various ML practitioners around machine learning model failures. And what we found was that 70% of production failures when it comes to machine learning revolve around data. And more specifically, you can think about the semantics of data. For example, take a particular feature like, in our case, the currency, the value of a purchase. The upstream system that produces this thinks about money as a straightforward thing. But across systems the values are coded in different forms: some systems code them as dollars and other systems code them as cents, and they typically have to be reconciled somehow. And these values can then disagree if you use the dollar value as cents and vice versa. And a lot of that is actually business logic incorporated somewhere by someone who is a business logic owner.

Andy Dang 00:06:31 They’re not necessarily software developers or machine learning engineers. And those things can change upstream without informing the team that actually owns the model that reads these data sources. So that is an internal business logic change. The other one is when external traffic changes. I gave the example of fraud detection in the context of COVID: data patterns change based on human factors. And those are really hard to capture in general. Even when that happens, the monitoring in this context is more like best effort. So, you don’t just look at the input data, the data that flows into the model, but you also want to look at the model behavior itself and monitor for not necessarily failures but shifts, because when the data coming in changes, the model behavior can shift as well.

Andy Dang 00:07:29 And it depends on whether these models are going to affect the business KPI. Especially in the AI space, it’s really hard to say your model is failing unless you’re talking about some very obvious metric. A lot of the time it’s more like your model is slowly drifting away from the expected behavior, and that slow drift over time is fundamentally hard and painful for humans to notice. Typically over one week it doesn’t look that bad. But if you look at that shift over time, it’s called model drift or just drift in general (concept drift and model drift). And when you look at the business KPI, that’s when you detect failures. And sadly these are not necessarily obvious to monitor until your bottom-line business KPIs are impacted. So, you want to put defenses at multiple layers: at the model behavior level, but you also probably want to do something with the business KPIs. And that makes this problem a little bit more complex, because these signals arrive at different times and then you have to correlate them over time, for example.
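The point that slow drift looks harmless week to week but adds up can be illustrated with a small sketch. The weekly values and the 5% threshold below are invented for illustration; a real system would track many signals, not a single mean.

```python
# Hypothetical weekly mean of one model output; each week moves by about 1%.
weekly_means = [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0]
THRESHOLD = 0.05  # assumed alerting threshold: 5% relative change

# Week-over-week comparison: every individual step looks small.
week_over_week = [
    abs(current - previous) / previous
    for previous, current in zip(weekly_means, weekly_means[1:])
]
any_weekly_alert = any(change > THRESHOLD for change in week_over_week)

# Comparison against a fixed baseline (week 0): the accumulated shift shows up.
baseline = weekly_means[0]
versus_baseline = [abs(mean - baseline) / baseline for mean in weekly_means]
first_alert_week = next(
    (week for week, change in enumerate(versus_baseline) if change > THRESHOLD),
    None,
)

print("any week-over-week alert:", any_weekly_alert)
print("first week over threshold vs. baseline:", first_alert_week)
```

Comparing only against the previous week never fires here, while comparing against a fixed baseline does, which is one reason the choice of baseline matters as much as the metric.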

Akshay Manchale 00:08:45 Yeah. Let’s dig into that concept drift a little bit. Can you give an example application, and how do you know that it is drifting from whatever is the ideal output? Because I guess maybe if you’re looking at something like a recommendation system, it’s really hard to define when your model is drifting. So, can you tell us what the methods to detect drift are? How do you define drift in different applications or different examples?

Andy Dang 00:09:14 So the way we think about drift, at least when it comes to data and machine learning monitoring from a generic, holistic angle (and there are model-specific techniques, by the way), is that we want to scan for drift signals across as many signals as possible. And again, this is very different from traditional APM, where you target a specific metric. Here you’re trying to look for drift in as many places as possible, and hopefully you can do some probabilistic ranking among them to say there’s a high chance of the model experiencing drift because of these signals. A more concrete example off the top of my head: in image processing, it’s very common for people to use a training data set that was collected under ideal conditions, with perfect lighting and on sunny days, for example.

Andy Dang 00:10:11 And then when they deploy that model to production, suddenly the factory changes the lighting and now you have lighting with a slightly different variation of brightness and contrast, creating different artifacts in the image. Or the data set that you collected is from the summer and now the model is running in the winter, so the lighting conditions also change significantly. And we’ve seen this sort of drift because the training data set for a model doesn’t cover the edge cases. It’s about finding out when these edge cases happen, basically pinpointing the possible drift events or what we call outliers to the data set that would cause the model to drift. And you watch for these outliers. There’s a term that a16z highlights, long-tail problems, where you have a lot of nice data but sometimes there are some bad apples mixed in.

Andy Dang 00:11:13 You really want the ability to pull out these bad apples and examine them separately. And that’s a big challenge in machine learning and data monitoring because these data sets are on the move; they don’t stay in a central warehouse for you to monitor all the time. Especially if you think about deployment to the edge, the data that these edge devices or fraud-detection models are collecting is massive compared to the data you typically train the model against. So, knowing when your model might fail, finding these outliers, and making sure they are captured in the training data set is how you prevent this kind of concept drift, because the training data set is showing something, but in reality you’re seeing something else that is not part of that training data set. With machine learning so far, we don’t have autonomous AI, so all we’ve got is really pattern matching, and if the model fails to match the pattern, that’s how you know it’s drifting.

Andy Dang 00:12:14 Of course, there are more challenging problems like ranking models. Those tend to be a bit harder to measure. Typically, you want to do it via a downstream KPI. At least, that’s my experience working with a lot of Amazon ranking systems: you don’t necessarily track the drift directly from the model because it’s really hard to say what the correct rank for a query result is. Like with Google, it’s all relative, how good a result is. So, the only way you can really measure the drift is customer engagement downstream from the result. Those are harder problems than, say, image monitoring in a factory. But they’re relevant in the sense that you need to put these guardrails, or what we call Fitbits or sensors for the data, across the decision-making process, from the model to any kind of business decision downstream.

Akshay Manchale 00:13:12 Yeah, I can see how it is hard to detect drift in certain types of problems, and it makes sense to look at business KPIs as a way to detect drift and possibly say that. And I suppose maybe that is stretched out, you don’t really have that information also readily available, perhaps, as compared to data. Let’s dig into that data drift aspect that you mentioned earlier. You know, you are looking at data from various sources, they’re all multidimensional as in you have row as many types, you have columns. How do you model what you see in training versus what you’re seeing in production across these systems? What sort of statistical things can you collect, and how do you detect data drifting?

Andy Dang 00:13:53 Yeah, that’s a very good question: how do you really detect and measure drift from the statistical standpoint? There are many techniques to do this. You can run a full analysis on the data set. At WhyLabs, we built an open-source library called whylogs. The goal of this library is to enable a standardized way of thinking about detecting drift at scale. And the way we do this is that we approach the problem as a statistics-gathering exercise. Again, you have a model with a lot of features, and you use those features probably for training first. So, what you can do is start to profile these training data sets for statistical fingerprints. That means, for example, building the distribution of numerical features as histograms, or collecting the cardinality of things, which you can do both for unique values like IDs and for categorical features.

Andy Dang 00:15:02 So that you know that, hey, today you have 50 states but somehow the data coming in the stream tomorrow only has 40 states, so something is off there, for example. So, we want to monitor those very basic signals. And then for more complex data problems you can also do fancier things. For example, in the case of images you can translate them into more structured information, because again you want to make these signals operational and actionable, so you can take actions from them and decide whether the issue has been resolved or not. So, for image data, a basic technique is to run it through various image processing pipelines and extract image signals like the RGB color channels, brightness, and the distribution of brightness in the image. That’s a very simple way to translate a complex object like an image into more human-understandable signals.
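As a rough illustration of the “statistical fingerprint” idea, the sketch below profiles one numeric and one categorical column in plain Python. It is not the whylogs API; the function names, bin edges, and sample values are all hypothetical, and exact counting stands in for whatever summaries a real library keeps.

```python
import bisect
from collections import Counter


def profile_numeric(values, bin_edges):
    """Summarize a numeric column: histogram counts, min, max, and null count."""
    present = [v for v in values if v is not None]
    counts = [0] * (len(bin_edges) - 1)
    for v in present:
        # Find the bin whose left edge is <= v; out-of-range values are
        # clamped into the first or last bin.
        index = bisect.bisect_right(bin_edges, v) - 1
        index = max(0, min(index, len(counts) - 1))
        counts[index] += 1
    return {
        "histogram": counts,
        "min": min(present),
        "max": max(present),
        "nulls": len(values) - len(present),
    }


def profile_categorical(values):
    """Summarize a categorical column: cardinality and most frequent items."""
    counts = Counter(v for v in values if v is not None)
    return {"cardinality": len(counts), "frequent_items": counts.most_common(3)}


# Hypothetical data: the baseline sees four states, today's stream only two.
baseline_states = ["CA", "NY", "TX", "WA", "CA", "NY"]
today_states = ["CA", "TX", "CA", "CA"]

baseline_profile = profile_categorical(baseline_states)
today_profile = profile_categorical(today_states)
missing_states = set(baseline_states) - set(today_states)

amount_profile = profile_numeric([5.0, 12.5, None, 40.0, 99.0], bin_edges=[0, 25, 50, 100])

print(baseline_profile["cardinality"], today_profile["cardinality"], sorted(missing_states))
print(amount_profile)
```

A drop in cardinality from the baseline to today’s profile is the “50 states versus 40 states” signal from the discussion, detected without keeping the raw rows around.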

Andy Dang 00:16:03 And if you take it to embedding space, you can then do things like monitor the distance between your production embeddings and the baseline embeddings from your training data. Again, it’s all relative. So you need to first identify what you’re monitoring or comparing against: the baseline and the target. And you can do this across different stages, right? You can compare your training data to make sure that, day to day, the training data sets don’t drift from each other too much. If there’s a big drift, you probably want somebody to come and take a look because semantically they might have changed. So, once these signals are collected, you can run various tests like KL divergence. Basically they take the distance between these statistical signals, and the distance can come from an algorithm like taking the histograms, overlaying them, and then looking at the differences in the area.
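A minimal sketch of the two distance ideas mentioned above, computed on histogram counts that share the same bins: KL divergence, and the overlap in area when the normalized histograms are laid on top of each other. The counts and the small smoothing constant are assumptions made for illustration.

```python
import math


def normalize(counts, eps=1e-9):
    """Turn raw bin counts into probabilities; eps avoids division by zero and log(0)."""
    total = sum(counts) + eps * len(counts)
    return [(count + eps) / total for count in counts]


def kl_divergence(baseline_counts, target_counts):
    """KL(baseline || target) over histograms that use identical bins."""
    p = normalize(baseline_counts)
    q = normalize(target_counts)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def overlap_area(baseline_counts, target_counts):
    """Shared area of the two normalized histograms: 1.0 means identical."""
    p = normalize(baseline_counts)
    q = normalize(target_counts)
    return sum(min(pi, qi) for pi, qi in zip(p, q))


# Hypothetical histograms of one feature in training and in production.
training_hist = [10, 40, 40, 10]
production_hist = [40, 40, 10, 10]

kl_same = kl_divergence(training_hist, training_hist)
kl_shifted = kl_divergence(training_hist, production_hist)
overlap = overlap_area(training_hist, production_hist)

print(f"KL (same): {kl_same:.3f}, KL (shifted): {kl_shifted:.3f}, overlap: {overlap:.2f}")
```

Note that KL divergence is not symmetric, so which profile is treated as the baseline and which as the target affects the number.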

Andy Dang 00:16:55 And you have these measurements, now what do you do next? You can then get a practitioner to come in and say whether the data really drifted. You do need a human in the loop, and this is what I believe strongly when it comes to machine learning, because at the end of the day the model can only capture a partial view of the world since the data set is always going to be limited. Therefore somebody has to come in and complement that knowledge by examining the data and making sure that the training data for the next iteration of that model captures these problems in an adequate manner.

Akshay Manchale 00:17:38 That makes sense. And statistical fingerprinting certainly is a very interesting way to reduce your problem space by effectively trying to match, you know, statistical fingerprints across two different data sets. That’s really cool. You mentioned earlier about guardrails and one example that you said stood out to me where maybe the one with currency where you’re talking about a hundred dollars or maybe a hundred dollars represented in cents, that’s going to be like 10,000. So, if you’re looking at two different data sources and one of those data sources suddenly starts reporting currency in cents and the other one continues to report in dollars, this in some way seems like a drift. The statistical models of what you’re seeing, your average mean, all of those will look different across these two. How do you plug guardrails? How do you get out of your model drifting because your currency is in two different forms right now, and how quickly can you act on that?

Andy Dang 00:18:34 So, it depends on where you are putting these guardrails. If you’re putting these guardrails in as part of your model training process, what you can do first of all is some sort of distance test like a KS test or a KL divergence test; that’s very data-oriented, a very common drift detection approach. You detect this currency issue very fast because the distributions of the two things will look similar in shape except that the value ranges are completely off. In one, purchase values would be a hundred dollars and less, so the distribution would fall between $1 and $100, and in the other the values on the X axis would run from 100 to 10,000. So, the distributions would look completely off and would probably have a very big distance in any kind of meaningful distance test.

Andy Dang 00:19:26 That’s one way. Or if you want to think of it in a more DevOps way, if people are not comfortable with the distance metric, you can even measure the gap between the p90 values, for example, and these two would have very different p90 values. Those are very common, but you do need to capture the distribution in order to calculate this. If you can do that at training, that’s the best place, because then you can know that the training data today looks very different from yesterday. Maybe there’s a legitimate reason: your feature store has changed, you’ve updated the training algorithm to reflect this reality, and you have a transformation in your Python code to handle it. But again, it’s important to have somebody come in and say yes, this is a legit change and I approve this version.
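To make the dollars-versus-cents case concrete, here is a small sketch of both checks: a hand-written two-sample KS statistic (the largest gap between the two empirical CDFs) and a simple p90 comparison. The purchase values are invented, and a production setup would more likely use a statistics library than this hand-rolled version.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def p90(values):
    """Nearest-rank 90th percentile."""
    ordered = sorted(values)
    rank = (90 * len(ordered) + 99) // 100  # ceil(0.9 * n) with integer math
    return ordered[rank - 1]


# Hypothetical purchase values: one source reports dollars, the other cents.
purchases_dollars = [5, 12, 20, 35, 48, 60, 75, 90, 99, 100]
purchases_cents = [value * 100 for value in purchases_dollars]

ks = ks_statistic(purchases_dollars, purchases_cents)
p90_ratio = p90(purchases_cents) / p90(purchases_dollars)

print(f"KS statistic: {ks}, p90 ratio: {p90_ratio}")
```

With these samples the two ranges barely touch, so the KS statistic reaches its maximum of 1.0, and the p90 ratio of 100 points directly at a unit mismatch.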

Andy Dang 00:20:14 Or hopefully, if that’s an unexpected change, your model never gets to production. But sometimes these things slip, and we’ve seen cases where the people who deploy the model are calling external third-party services to get more data for the actual live inference. An example is zip code: this customer trained the model on five-digit zip codes, which is great, right? Very standardized, except the third-party provider changed the API upstream and started returning the zip code with the hyphen, with the dash, and that broke the model because the model has never seen anything like that in the embedding space. It’s doing text processing, but it still doesn’t know what it is. And the way they caught it was very straightforward: it was just through monitoring the distribution of the input.
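One simple way to monitor the input in a case like the zip code story is to track what fraction of values match the format seen in training. The sketch below assumes a five-digit pattern as the baseline; the sample values and the alert threshold are hypothetical, and this is not a description of how the customer in the story actually detected the issue.

```python
import re

FIVE_DIGIT_ZIP = re.compile(r"\d{5}")
MAX_ALLOWED_DROP = 0.2  # assumed threshold on the drop in match rate


def format_match_rate(values):
    """Fraction of values that are exactly five digits."""
    matches = sum(1 for value in values if FIVE_DIGIT_ZIP.fullmatch(value))
    return matches / len(values)


# Hypothetical samples: training data versus what the upstream API now returns.
training_zips = ["98101", "10001", "94105", "60601"]
production_zips = ["98101-1234", "10001-0001", "94105", "60601-4321"]

training_rate = format_match_rate(training_zips)
production_rate = format_match_rate(production_zips)
format_alert = (training_rate - production_rate) > MAX_ALLOWED_DROP

print(f"training: {training_rate}, production: {production_rate}, alert: {format_alert}")
```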

Andy Dang 00:21:10 And so the ability to scan through these signals in production fast is important. You can also do things like a live check per data point, but that only allows you to detect outliers, not really drift. Or you accumulate enough data points, depending on how fast you can respond. The typical pattern we see is just daily: you take a day of data and you ask, does it drift from either the training data or the historical baseline? Because you do need to see enough data in production; you can’t really say, oh, ten decisions were made and it’s drifted. It’s really hard unless those decisions are really off. So, you do need to accumulate enough signal in production. And again this is different from DevOps monitoring, where you want to know immediately when the P99 moves.

Andy Dang 00:22:03 But in machine learning you probably want to wait a bit to see the full distribution in production (maybe six hours, maybe a day), depending on how fast you can roll out the next generation. Although we have seen use cases where people want to know about outliers immediately and fall back to another model that is the safe, fallback version. So, that’s a more complex use case, but in general it’s really hard to detect live drift fast because you have to have enough signal, and that’s the catch-22 of drift detection. You don’t want to have too many false positives.
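The per-data-point outlier check with a fallback model could look roughly like the sketch below. Everything here is hypothetical: the feature name, the training range, and both stand-in “models” are placeholders, and a range check is only one of many possible outlier tests.

```python
def is_outlier(features, training_ranges):
    """True if any monitored feature falls outside the range seen in training."""
    return any(
        not (low <= features[name] <= high)
        for name, (low, high) in training_ranges.items()
    )


def predict_with_fallback(features, primary_model, fallback_model, training_ranges):
    """Route outliers to the safe fallback model; everything else to the primary."""
    if is_outlier(features, training_ranges):
        return "fallback", fallback_model(features)
    return "primary", primary_model(features)


def primary_model(features):
    # Stand-in for a trained model: flags only large purchases for review.
    return features["purchase_usd"] > 90


def fallback_model(features):
    # Stand-in for a conservative rule: always send the request for review.
    return True


# Assumed range of purchase values observed in the training data.
training_ranges = {"purchase_usd": (1.0, 100.0)}

normal = predict_with_fallback({"purchase_usd": 42.0}, primary_model, fallback_model, training_ranges)
unusual = predict_with_fallback({"purchase_usd": 10000.0}, primary_model, fallback_model, training_ranges)

print(normal, unusual)
```

This catches individual outliers right away, but as noted above it says nothing about drift, which needs an accumulated window of data.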

Akshay Manchale 00:22:43 I think you touched upon a really interesting thing, which is that you can’t really respond to these incidents instantly as they happen in a typical DevOps SRE world. What does the alerting look like in the AI/ML observability space? You have observability tools; you’re collecting all of these signals. At what point do you raise alerts? What kind of alerts are raised, and can you dig a little deeper into what are corrective actions that people can take based on these alerts?

Andy Dang 00:23:12 So, this, interestingly enough, has a lot of overlap with the data ops world. So, things like null-ness of your data stream: if a value suddenly has a lot of nulls, it can really affect the model. But then you also want to monitor things like drift and uniqueness, or frequent items for what we call “categorical features.” If California, your most popular state, somehow becomes the least frequent item in your data stream, you probably want an alert. So, there are different ways of approaching these, but we typically think of them as data quality and data ingestion. So, data quality is the quality of the stream itself, and then there’s freshness: is the stream happening at the moment in production? Because your system might be down. That’s a bit of an overlap with DevOps there.

Andy Dang 00:24:06 But sometimes it’s also that you are seeing really old data. If you have a feature store that should send data from yesterday, but somehow you see a feature with data from last week, then you might want to raise an alert, so there’s a bit of freshness there. And then drift is a really hard problem. As I said before, drift is one of the things that you probably want to wait a bit to monitor, over an adequate time window, typically hourly if not daily depending on the model; most people deal with daily. But we’ve seen people who want to do weekly as well, because if you use day of the week as a feature, you don’t want to do it daily, since Monday would be very different from Tuesday. So, these are the kind of common issues that we see.
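The null-ness and freshness checks described above are simple enough to sketch directly. The timestamps, sample values, and thresholds below are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

MAX_NULL_RATE = 0.2            # assumed tolerance for missing values
MAX_AGE = timedelta(days=1)    # assumed freshness requirement


def null_rate(values):
    """Fraction of missing values in a batch."""
    return sum(value is None for value in values) / len(values)


def is_stale(latest_event_time, now, max_age):
    """True if the newest record is older than the allowed age."""
    return now - latest_event_time > max_age


now = datetime(2023, 1, 10, 12, 0)

# Hypothetical batch with many nulls, and a feature last updated a week ago.
batch = [3.2, None, None, 4.1, None]
last_week_feature_time = datetime(2023, 1, 3, 12, 0)
this_morning_feature_time = datetime(2023, 1, 10, 6, 0)

alerts = {
    "null_rate": null_rate(batch) > MAX_NULL_RATE,
    "stale_feature": is_stale(last_week_feature_time, now, MAX_AGE),
    "fresh_feature": is_stale(this_morning_feature_time, now, MAX_AGE),
}

print(alerts)
```

Checks like these can run per batch or per record, which is why they suit the immediate, DevOps-style alerting discussed here better than drift checks do.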

Andy Dang 00:24:53 And then distribution drift is very common as well. Each of these has different requirements around operational aspects. So, if you’re talking about data freshness, people want to know immediately. It’s very close to the DevOps angle: you detect it, and you alert the customer right away. Another one is outliers. I mentioned before that besides monitoring the holistic data stream, sometimes you really want to capture the outliers in the stream so that you can examine them in the future, and for those outliers you probably want to alert in a more real-time manner. But then when it comes to drift, it’s the other extreme. You’d want things to be batched daily, or sometimes weekly, or sometimes even monthly. And so these batches dictate the frequency of the alerts, and they tend to have a longer SLA compared to the freshness alert, for example.

Andy Dang 00:25:55 So those are different requirements of the data ops and MLOps monitoring system. We also see a lot of use cases where people deploy a new model architecture and then the latency increases, or sometimes the data upstream changes, especially when it comes to images: people start sending a larger image and suddenly the model takes a lot longer to respond. This kind of performance metric overlaps with DevOps but is tied to the ML operation aspect, so you probably want to flow it through the machine learning monitoring system as well, because you want to tie it to model deployments and data changes, since any of those can trigger these DevOps issues. So, that’s another angle, but those changes are critical, so you would want the alerts to be as fast as possible. The destination for these at the moment, not surprisingly, is very similar to DevOps: they go to Slack, PagerDuty, all the standard DevOps workflows.

Akshay Manchale 00:26:58 Are there different people in the SRE worlds that you see typically in organizations who are focused on the ML side of things, you have a machine learning model that doesn’t necessarily sit independently by itself. It may fit into a larger ecosystem of the product space that some company might offer. So, you might have an SRE team that’s fronting the application but you also have this machine learning model and this complex integration issue where you might see performance issues, so data drift et cetera. So, how do you see organizations typically dealing with that? Do you have different SRE teams? Can you shed some light on how different that is and how these escalations might work in the ML space?

Andy Dang 00:27:40 We actually don’t see an SRE role dedicated to the MLOps stack yet. Maybe there’s going to be a transition soon. However, we see a lot of overlap between the data engineering team and the ML engineering team, right? Those two teams tend to be co-dependent even though they belong to different organizations with different budgets, which is a bit strange, but maybe that’s where we are trying to figure out where we fit in this space (“we” as in the ML practitioners here), because we don’t really belong to the traditional SRE space. So, the people who receive these alerts and signals typically fall under three categories so far. One is data engineers. They care about things like null-ness and data quality, things like missing values, or the distributions being off, or a spike in min and max: very simple, very traditional data ops metrics.

Andy Dang 00:28:48 And then the machine learning folks care a lot more about things like uniqueness of the data stream, although that can be taken care of by data ops. So there’s a bit of an overlap there. And then drift is typically consumed by the ML team specifically, because for the data team there’s an expertise gap: they don’t necessarily understand drift and can’t really take action against it. And a lot of the time these drifts are natural and require retraining of the model. So, we would hook into the retraining pipeline and trigger a new version automatically. But that is, again, more in the realm of the ML engineering team. Some companies do vend out general metrics, like the overall health of the system or historical information such as the data drift count in general, to the SRE team to be part of the global system health dashboard. But they don’t really take immediate action; they just want to track the health over time. And then finally there are the business operators, who care about the KPIs, so they want to see the health of the ML system and the KPIs of the business and tie them together through some dashboarding or some kind of alerting mechanism.

Akshay Manchale 00:30:12 And I think maybe it’s fair to say that on the data ops side they might be serving different ML models with different drift-detection needs. So how do they respond to something where they might not detect a drift, but maybe there is one particular ML model that says, yeah, this is a totally unacceptable distribution of data, whereas another model might be okay with it? Does that cause problems for either of the models? So who deals with this sort of disparity in the statistical distribution needs of the underlying data, maybe the data engineering team?

Andy Dang 00:30:47 So again, it comes back to the question of whether the data stream is drifting naturally, where a natural drift is something like customer behavior changing upstream. In that case there’s nothing that the data stack can change; the goal of the data stack is to flow that drifted data correctly to the right ML teams. Or whether a business owner upstream is changing the semantics of an internal business field, like the currency handling logic. If that’s the case, then it falls back to the data engineering team to make sure that things are consistent for each of the consumers, either via building a new table or making sure that if this thing changes then the downstream teams get alerted accordingly. But again, you are right that drift problems are typically not handled by the data engineering team.

Andy Dang 00:31:42 Typically they go to the ML team, and then the ML team will decide what actions to take, whether it’s retraining the model or, sometimes, building a new model because, hey, our model is just not working with this data anymore and we need to do something else, like restructure the neural network. Or they go back to the upstream team. Basically, as in the DevOps world, you reassign that ticket to the DevOps team or the data ops team and say, hey, the semantics of this field have changed and you own the logic for this, so you need to help us address it.

Akshay Manchale 00:32:18 Let’s go back to the different phases of machine learning itself, right? In the MLOps space you have the training, you have what you run in production, and this is a continuous life cycle that evolves and changes based on feedback. So, coming from the observability space, how does it assist with the training set requirements? How do you capture observability metrics or statistical things during training, and how does that shift during production, or is it all just the same thing on both sides?

Andy Dang 00:32:50 In terms of the statistics, you would want them to be aligned because you really don’t want to compare an orange to an apple. So, if you’re collecting different statistics, different ways of summarizing things, you can’t really compare this model versus the other model or describe it over time. So, that’s one thing to call out in terms of what to collect. However, there are two big differences between training and production. One is the scale of the data. Typically training data sets are smaller, and most training is done on a single machine. Machines can be really beefy nowadays; with GPUs you get a lot of horsepower into that. So you can handle quite a sizable data set while still fitting within a single machine, and you can run a lot of statistics locally. Typically you pull them from data warehousing; again, with something like Snowflake, you can run statistical analysis there.

Andy Dang 00:33:48 But then once the data hits production, it’s a different game. You do batch processing that happens on multiple nodes, SageMaker batch inference for example, or even more complex things like real-time inference, where the data moves to a container, and in terms of both volume and throughput it’s a different game. You need to be faster. And then if you deploy this model to the edge, what we see is that a lot of the time people are pretty much flying blind, except that they monitor some basic signals. It’s really hard to monitor edge devices, and especially to even say whether the model is drifting for an edge deployment or not. So what WhyLabs aims to do is build this kind of statistical standardization, so you can use that as a way to standardize metrics across your teams and deployment steps.

Andy Dang 00:34:44 So you can do the same thing in training or deployment. The second thing, since it’s also designed for these challenges, is that it’s mergeable. So you can take a data profile from training and then you can take multiple profiles from production, and you can combine the production profiles and compare them with the training one. So, you don’t necessarily have to run it on a single machine or against a global system; you can run the statistics collection in a distributed manner and then run the final analysis somewhere else, because you’re not dealing with the raw data. It’s a lot faster there because you’re really looking at the statistical signature. Think about histograms and trying to match them. So that’s the technique that WhyLabs, for example, tries to incorporate. You can also do other things as well, like sampling, where you lose some accuracy, but it’s better than doing nothing, for example.
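
To illustrate the mergeable-profile idea described above, here is a minimal Python sketch. It is not the whylogs or Apache DataSketches API; the `Profile` class and the `track`, `merge`, and `profile_of` names are hypothetical, and it tracks only count, mean, and variance rather than full sketches or histograms. The point is only that partial summaries built on separate nodes can be combined without touching the raw data.

```python
from dataclasses import dataclass
import math


@dataclass
class Profile:
    """Hypothetical mergeable summary of one numeric feature."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def track(self, value: float) -> None:
        # Welford's online update: one pass, no raw values are stored.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def merge(self, other: "Profile") -> "Profile":
        # Pairwise combination of two partial summaries (Chan et al.).
        n = self.count + other.count
        if n == 0:
            return Profile()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return Profile(n, mean, m2)

    @property
    def stddev(self) -> float:
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


def profile_of(values):
    """Build a profile from an iterable of numbers."""
    profile = Profile()
    for value in values:
        profile.track(value)
    return profile


# One profile from training, two partial profiles from production nodes.
training = profile_of([10.0, 12.0, 11.0, 13.0])
node_a = profile_of([20.0, 22.0])
node_b = profile_of([21.0, 23.0])

# Combine the production profiles, then compare against training.
production = node_a.merge(node_b)
mean_shift = abs(production.mean - training.mean)
```

The merged profile matches what a single pass over all production values would have produced, which is what makes distributed collection followed by a central comparison possible.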

Akshay Manchale 00:35:38 So can you shed a little more light on how WhyLabs, your solution, fits into the product? Is it a sidecar that sits on your container that’s serving the model, or something that observes and then reports it back into one place? Or how do you integrate these signals from resources?

Andy Dang 00:35:54 So WhyLabs is meant to be really a data telemetry library. It is a Python library, but you can also run it in Java, and it is built on top of a technology called Apache DataSketches. So, this library was originally built by Yahoo for weblog analysis. They built it for the ability to analyze billions of records per day. So, it has built in these capabilities that I described around merging and lightweight statistics collection. What WhyLabs adds on top of that is the ability to do this across the standardized APIs for data science. Think about a pandas DataFrame, for example, or a Spark DataFrame. So, WhyLabs hooks into these interfaces and enables users to run the sketching across Python pipelines and data engineering pipelines like in Spark. And finally, people can opt to store these metric objects in multiple places.

Andy Dang 00:36:54 Things like MLflow are a very common way to integrate because you have an artifact store, so you can hook WhyLabs into that very easily, or you can also just store it in blob storage in case you are running in, like, Google Cloud or Amazon cloud. And you can also do it at the edge because these log objects are much more lightweight than traditional data. You can transform terabytes of data into hundreds of megabytes. So, that allows you to then collect statistics from the edge easily.

Akshay Manchale 00:37:30 And I guess it should be easier to process when you have manageable sizes, and to react to it too. That’s nice. I want to switch gears a little bit into explainability. This is something that people keep talking about in ML/AI. Can you first describe what explainability is in AI/ML, and then we’ll get into observability. How does that help with respect to explainability?

Andy Dang 00:37:52 So explainability, I think this goes back to the problem I stated at the beginning: when you have a lot of features and you’re feeding these features into a model and the model makes a prediction, you want to know which feature contributes most to the model decision. Think about a credit scoring model: is it the zip code, or the person’s education background, or their job that is the deciding factor? I think this explainability goes back into things like fairness and how we think about responsible AI, in the context of legal frameworks as well as decision-making from the business perspective. Like making sure everyone is served equally. So that’s the context of explainability. What it means in practice is that you assign the importance for your model’s decision as basically a number between zero and one to your specific features.

Andy Dang 00:38:48 Like, hey, maybe the person’s zip code is 0.2 and the person’s current job, maybe years of experience, is 0.5, something like that. You add them up to one, basically, across the model, and you can do this per prediction. So, every time you make a decision you can say, oh, for this particular example, this is the weight of these decisions. And then you can also do it at a global level to say, for this model, based on this testing data set, here’s the global weight; it’s basically trying to average them out globally. And the reason you’re doing this is that if you want to monitor the important features for fairness and biases, then you want to focus on these particular signals alone. Like maybe where you are from affects your life insurance. So for the state, for example, you want to monitor the distribution of people across the state values. So those are the examples that I can think of off the top of my head in terms of explainability and a bit about fairness as well. It’s about how you slice and dice and compare the model behavior across different populations. And once you have this framework you can talk about how to make sure your model doesn’t drift for one particular population or one particular segment of your data.
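
The per-prediction and global weights described above can be sketched roughly as follows. This is an illustration only: the raw contribution scores are made-up numbers standing in for the output of whatever attribution method is used, and `normalize` and `global_importance` are hypothetical helper names, not part of any library mentioned in the episode.

```python
def normalize(contributions):
    """Scale raw per-feature contributions so their magnitudes sum to one."""
    total = sum(abs(value) for value in contributions.values())
    if total == 0:
        return {name: 0.0 for name in contributions}
    return {name: abs(value) / total for name, value in contributions.items()}


def global_importance(per_prediction):
    """Average the normalized weights over a set of predictions."""
    if not per_prediction:
        return {}
    features = per_prediction[0].keys()
    count = len(per_prediction)
    return {name: sum(p[name] for p in per_prediction) / count for name in features}


# Made-up raw attribution scores for two predictions (assumed inputs).
raw_scores = [
    {"zip_code": 0.4, "years_experience": -1.0, "education": 0.6},
    {"zip_code": 1.0, "years_experience": 1.0, "education": 2.0},
]

local_weights = [normalize(scores) for scores in raw_scores]
global_weights = global_importance(local_weights)
```

For the first prediction this gives 0.2 for zip code and 0.5 for years of experience, the kind of split used in the example above, and the weights for each prediction add up to one.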

Akshay Manchale 00:40:09 Going by what you just said, I think there are different stakeholders, it sounds like, in the AI/ML space. We’ve talked a lot about how observability might help the data engineers and data scientists run their models and get that into production and run that. There’s also, when you talk about explainability, it seems like the customer is also a little bit of a stakeholder in this application. Does observability help surface some of these reasons why you are seeing something back to the customer or the end user?

Andy Dang 00:40:41 I think observability helps if done right. So, the problem with observability, especially in machine learning, is that it gets overwhelming if you’re not careful. If you’re just presenting practitioners with all the data, it becomes unmanageable. Humans are not designed to look at 7,000 features in a fraud detection model and try to figure out a way to really decipher the weights and importance. So I think a good observability system should be able to surface important signals among the noise and then highlight important things. For example, I could think of a segment of the data set that has experienced adverse decisions by the model, or the subset of customers that experience a bad customer experience, like delayed shipping, because the model is giving them bad predictions. So, those are examples of how observability can help, by just observing the model behavior and not necessarily trying to explain everything, because explainability, by the way, is very expensive.

Andy Dang 00:41:45 You can run techniques like Shapley values, but they are very expensive and slow. So, what you really want to observe is the KPIs with respect to these important segments, or automate that process. Like, how can we surface these problematic segments in the data stream in a scalable way, in an automated way, so that the people who are running the model can make a decision about that? Maybe that segment of customers is not as important to the business and it’s an explicit human decision, but it has to be a human decision helped by this observability system. So automating that discovery is, in my opinion, one of the most important parts of a good observability system.
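
One simple way to picture the automated segment discovery described above is to compute a KPI per segment and flag segments that fall well below the overall figure. The sketch below assumes accuracy as the KPI and a fixed gap threshold; the record layout, the `flag_segments` name, and the thresholds are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def flag_segments(records, segment_key, min_count=2, max_gap=0.2):
    """Return overall accuracy and the segments trailing it by more than max_gap."""
    if not records:
        return 0.0, {}
    totals = defaultdict(lambda: [0, 0])  # segment -> [correct, seen]
    for record in records:
        bucket = totals[record[segment_key]]
        bucket[0] += int(record["correct"])
        bucket[1] += 1
    overall = sum(c for c, _ in totals.values()) / sum(n for _, n in totals.values())
    flagged = {}
    for segment, (correct, seen) in totals.items():
        accuracy = correct / seen
        # Skip tiny segments, then flag the ones far below the overall KPI.
        if seen >= min_count and overall - accuracy > max_gap:
            flagged[segment] = accuracy
    return overall, flagged


# Made-up prediction outcomes, segmented by a "state" field.
records = (
    [{"state": "CA", "correct": True}] * 4
    + [{"state": "NY", "correct": True}] * 3
    + [{"state": "NY", "correct": False}]
    + [{"state": "TX", "correct": True}]
    + [{"state": "TX", "correct": False}] * 3
)

overall_accuracy, problem_segments = flag_segments(records, "state")
```

Here only the "TX" segment is surfaced; whether it matters to the business remains the human decision the speaker describes.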

Akshay Manchale 00:42:28 You mentioned scale as a general challenge and I think that is interesting when you are looking at models that have really high dimensional data that they’re using for training. Maybe you have single rows that contain hundreds of thousands or millions of features, and they all have their own distributions. How do you make that presentable for an ML engineer in a way that they can actually consume it? It might not be reasonable when you say 10,000 of your features are seeing a drift. So, where do you start when you start seeing that much noise? So, how do you find the signal from this noise when you’re just looking for say drift or statistical anomalies?

Andy Dang 00:43:07 So this is another real problem, and it’s why you can’t use DevOps tools for machine learning operations: if you’re not careful, you’re dealing with a lot of dimensionality noise. And how do you address this? This comes from our experience dealing with this sort of large-scale problem, and not just from the number of features but also the data volume; large can mean large dimensionality but can also mean large data volume, by the way. First of all, you want a good system that can quickly automate all these analyses around drift and data quality across these 7,000 features, with as little configuration as possible. You want to express what you’re looking for, and of course you’re going to get a lot of noise, because even if, like, 1% of 5,000 features are drifting, is it important?

Andy Dang 00:44:02 It’s hard. So you start with more of a DevOps angle, right? Kind of spread out the signals, analyze them fast, and you probably get some things that are alerting and some things that are not. And then you want to do more machine learning specific workflows where you feed in signals like important features and important segments, and then have the system really narrow down this amount of noise and reduce it to something more human-consumable. There are interesting techniques in this space, like even just monitoring the pure drift count over time. If your drift count suddenly increases among many features, there’s a big chance that your model is experiencing a sudden drift, for example. So there are proxy metrics that are designed for specific workflows that will enable you to get a fast, multi-level drill-down. So you want to start with the overall things like the alert count of your system, and then you can go further down into specific problems like drift or data quality, and then hopefully you can locate, maybe, that it’s the most important signals, or features based on feature importance in the training step, that are experiencing that, and you should do something.
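
The drift-count proxy metric and the drill-down by feature importance mentioned above can be sketched like this. It assumes per-feature drift flags have already been computed for each day by some other test; the function names, the doubling rule for a spike, and the sample data are hypothetical.

```python
def drift_count_by_day(drift_flags):
    """Count how many features are flagged as drifting on each day."""
    return {day: sum(flags.values()) for day, flags in drift_flags.items()}


def spike_days(counts, factor=2.0):
    """Days whose drift count is at least `factor` times the previous day's."""
    days = list(counts)
    return [
        day for prev, day in zip(days, days[1:])
        if counts[day] >= factor * max(counts[prev], 1)
    ]


def drill_down(flags, importance, top_k=3):
    """Drifting features for one day, most important first."""
    drifted = [name for name, is_drifted in flags.items() if is_drifted]
    return sorted(drifted, key=lambda name: importance.get(name, 0.0), reverse=True)[:top_k]


# Made-up drift flags per day and feature importances from training.
drift_flags = {
    "day1": {"a": False, "b": False, "c": True, "d": False},
    "day2": {"a": False, "b": True, "c": False, "d": False},
    "day3": {"a": True, "b": True, "c": True, "d": False},
}
importance = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.1}

counts = drift_count_by_day(drift_flags)
spikes = spike_days(counts)
suspects = drill_down(drift_flags["day3"], importance, top_k=2)
```

The overall count is the cheap top-level signal; the drill-down is only consulted for days where that count jumps.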

Akshay Manchale 00:45:17 Makes sense. I’ve heard this term “AI ops,” which is using AI for operations in order to reduce the signal to noise ratio or improve. And do you see that as a thing in the AI/ML space also where you’re using another model on top of it that runs through your observability, the signals or metrics that you’re capturing, in order to help you decide or cut down the noise and present what’s important?

Andy Dang 00:45:45 I mean, as somebody who works with a lot of AI models, I’m skeptical of that because it is so hard to generalize. The problem is that all the signals come from different domain problems. It’s not just about web application performance. The APM space deals with a lot fewer of these; the application of AI is just much wider than deploying services. Maybe I’m being biased here, but I don’t think using AI can solve this. And another reason why you can’t use another AI to solve this AI monitoring problem is that explaining why something is happening is really hard using an AI model. You really want to use standardized, simple statistical tests. You know, there’s a beauty in simplicity, and you really want to start with simple statistical signals that can be explained by, you know, take the Gaussian distribution.

Andy Dang 00:46:46 That’s a very common technique in operational monitoring. You can apply those and get a lot of value and reduce the noise already without having to rely on much more complex layers. And think about the people who are consuming these signals: they might be data engineers or ML engineers, and they’re not necessarily experts on the AI ops model that we might build. So, they’re not going to enjoy going through our documentation and trying to say, oh, the model is alerting because here’s our hyperparameter, and you probably can go and force us to retrain it. I don’t think that’s a good customer experience. From my experience working with machine learning teams, people already deal with a lot of probabilistic signals. They want that final model, that final layer, to be much more straightforward.
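
As a concrete instance of the simple Gaussian-style check described above, here is a minimal z-score alert over a monitored metric. It is one common choice rather than the specific test used by any product mentioned here; `z_score_alert`, the three-sigma threshold, and the sample values are assumptions for illustration.

```python
import statistics


def z_score_alert(history, latest, threshold=3.0):
    """Flag `latest` if it sits more than `threshold` standard deviations from the historical mean."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    if std == 0:
        # A flat history: any change at all counts as anomalous.
        return latest != mean
    z = (latest - mean) / std
    return abs(z) > threshold


# Made-up daily values of some monitored metric, such as a feature's mean.
history = [100, 102, 98, 101, 99]

normal_day = z_score_alert(history, 101)
anomalous_day = z_score_alert(history, 110)
```

The appeal is that the alert can be explained in one sentence to whoever receives it: the value is this many standard deviations away from what it has usually been.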

Akshay Manchale 00:47:32 Yeah, that makes sense. Otherwise, you have a problem of trying to figure out what’s happening with that AI model that’s trying to describe your AI model, and I guess that cycle will never, never end without a human somewhere. So, in terms of tools, how is it different compared to the SRE world in AI observability? Because you are monitoring different types of systems. As an example, let’s say you have a database or some storage tier that might have its own SRE team that has its own monitoring visualization tools, et cetera. You might have another one that’s responsible for streaming data and that team might have its own SRE team maybe and their own visualization tools. And then as machine learning practitioners you’re really consuming from several pieces, several layers in the ecosystem. So what does the tooling look like in terms of visualization? Is it different? Do you plug into all of them? What is typically something that you see in the industry now?

Andy Dang 00:48:26 So, I think you’re touching on a very interesting topic there where you talk about the existing tools and frameworks and processes people have. So, what we see is the trend of hooking into these existing systems. People do want signals from AI systems flowing into Datadog or New Relic, but only a subset of them because, again, otherwise you deal with dimensionality problems, and Datadog and New Relic are not designed for that. They can handle that in terms of scale. But in terms of the monitoring and the configuration experience, they’re not designed with that particular use case as a first-class concern. So, what we see is that the SRE folks or the operators tend to consume the right metrics. Things like, the total number of drifts happening in this particular model is X, Y, Z, or maybe somebody has to ticket the data team, something like that.

Andy Dang 00:49:23 But in terms of really observing and navigating the system of a machine learning data stream or feature store, you really need specialized tools. And this is why we have to build specialized tools, because you have to visualize things like data drift for numerical features or data drift for categorical features. And those are not first-class citizens in Datadog. So, it becomes really painful for the operator to go in there and do anything like that. And Datadog doesn’t deal with things like comparing training versus production, and the training process might be happening every night. So, I strongly believe in specialized tools for different purposes, and therefore you do need to have special tools to deal with, especially in ML Ops, the problems of machine learning. Also, you deal with model performance metrics, things like a confusion matrix, and that gets annoying. Maybe you can do it in Datadog, but yeah, it’s probably going to be a lot of reinventing the wheel there.

Akshay Manchale 00:50:25 We’re about at time. So, do you have any final thoughts about AI/ML observability to share? Otherwise, where do you see the whole field moving towards in the next couple of years?

Andy Dang 00:50:37 So I think that, I mean, in terms of machine learning, we should not try to fly blind. I think we should not reinvent the wheel. It should be on top of open source, existing, well-proven techniques. That’s why we don’t just reinvent something new. We take data sketching and adjust it for data science workflows in whylogs. And then the goal is to really make it kind of a logging standard among all the ML frameworks. So my hope is that people are at least putting some monitoring, maybe with whylogs, maybe without whylogs, into the ML system once they deploy those models and make critical business or customer-facing decisions. In terms of the field itself, I think it’s definitely interesting to see the move towards more, I wouldn’t call them real-time systems, but live stream-based processing and faster iteration of models and even things like automatic retraining.

Andy Dang 00:51:38 So, those are the trends I’m seeing nowadays from talking to customers: they need to be able to reflect the new state of the world faster, retrain the model faster. However, and I don’t think we have a good answer for this, we also need to really figure out how to incorporate humans as part of this automated workflow. Either as approvals or, you know, think about DevOps deployment. You always have a human in the loop. How do we make humans more effective as part of the AI deployment process? Similar to how DevOps folks have built around the principle of CI/CD. That’s something I don’t think we have a good philosophy around, because you do need standardized monitoring, standardized guard rails, to do that. And that’s something that is not there in the ML Ops space. So maybe we’ll get there.

Andy Dang 00:52:30 And then alongside things like more real-time or, yeah, streaming feature stores and streaming deployment systems, we’ll need those even more, because the faster you can deploy, the easier it is to break things. And so, I think there’s going to be some interesting space there. And then finally, I do think there’s going to be, I don’t know what it would look like, but a convergence or a lot of cross-pollination between the Data Ops and ML Ops spaces. Because at the end of the day, data is the source of machine learning, so it’s really hard to talk about one’s quality without talking about the other. And then the ML system feeds back into the data system. So you know, it’s kind of this circle of problems getting merged by this infinite loop if you’re not careful.

Akshay Manchale 00:53:18 Yeah, that sounds great. Andy, thank you so much for coming on the show and talking about observability and running AI/ML products in production. Thanks.

Andy Dang 00:53:24 Yeah, thank you. I really enjoyed the chat.

Akshay Manchale 00:53:26 This is Akshay Manchale, for Software Engineering Radio. Thank you for listening.

[End of Audio]

SE Radio theme: “Broken Reality” by Kevin MacLeod (Licensed under Creative Commons: By Attribution 3.0)
