SE Radio 610: Phillip Carter on Observability for Large Language Models

Phillip Carter, Principal Product Manager at Honeycomb and open source software developer, talks with host Giovanni Asproni about observability for large language models (LLMs). The episode explores similarities and differences for observability with LLMs versus more conventional systems. Key topics include: how observability helps in testing parts of LLMs that aren’t amenable to automated unit or integration testing; using observability to develop and refine the functionality provided by the LLM (observability-driven development); using observability to debug LLMs; and the importance of incremental development and delivery for LLMs and how observability facilitates both. Phillip also offers suggestions on how to get started with implementing observability for LLMs, as well as an overview of some of the technology’s current limitations.

This episode is sponsored by WorkOS.

Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Giovanni Asproni 00:00:18 Welcome to Software Engineering Radio. I’m your host Giovanni Asproni and today I will be discussing observability for large language models with Phillip Carter. Phillip is a product manager and open-source software developer, and he’s been working on developer tools and experiences his entire career, building everything from compilers to high-level IDE tooling. Now he’s working out how to give developers the best experience possible with observability tooling. Phillip is the author of Observability for Large Language Models, published by O’Reilly. Phillip, welcome to Software Engineering Radio. Is there anything I missed that you’d like to add?

Phillip Carter 00:00:53 No, I think that about covers it. Thanks for having me.

Giovanni Asproni 00:00:56 Thank you for joining us today. Let’s start with some terminology and context to introduce the subject. So first of all, can you give us a quick refresher on observability in general, not specifically for large language models?

Phillip Carter 00:01:10 Yeah, absolutely. So observability is, well, unfortunately in the market it’s kind of a word that every company that sells observability tools sort of has their own definition for, and it can be a little bit confusing. Observability can sort of mean anything that a given company says that it means, but there is actually a real definition and a real set of problems being solved, and I think it’s better to root a definition in those. So the general principle is that when you’re debugging code and it’s easy to reproduce something on your own local machine, that’s great. You just have the code there, you run the application, you have your debugger, maybe you have a fancy debugger in your IDE or something that helps you with that and gives you more information. But that’s sort of it. But what if you can’t do that?

Phillip Carter 00:01:58 Or what if the problem is because there’s some interconnectivity issue between other components of your systems and your own system? Or what if it is something that you could pull down on your machine, but you can’t necessarily debug it and reproduce the problem that you’re observing, because there are maybe 10 or 15 factors that are all going into a particular behavior that an end user is experiencing but that you can’t seem to actually reproduce yourself? How do you debug that? How do you actually make progress when you have that thing? Because you can’t just have that poor behavior exist in production in perpetuity; your business is probably just going to go away if that’s the case, because people are going to move on. So that’s what observability is trying to solve. It’s about being able to determine what is happening, like what is the ground truth of what is going on when your users are using things that are live, without needing to change that system or debug it in sort of a traditional sense.

Phillip Carter 00:02:51 And so the way that you accomplish that is by gathering signals or telemetry that capture important information at various stages of your application, and you have a tool that can then take that data and analyze it. And then you can say, okay, we are observing, let’s say, a spike in latency or something like that, but where is that coming from? What are the factors that go into that? What are the things that are happening on the output that can give us a little bit better signal as to why something is happening? And you’re really sort of answering two fundamental questions: where is something occurring and, to the extent that you can, why is it occurring in that way? And depending on the observability tool that you have and the richness of the data that you have, you may be able to get to a very fine-grained level of detail, like: this specific user ID in this specific region and this specific availability zone where you’ve deployed into the cloud is what is most correlated with the spike in latency.
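As an illustration of the narrowing-down described here, the following is a minimal Python sketch, not something from the episode. It assumes telemetry arrives as flat events with a duration; the field names (`user_id`, `region`, `zone`, `duration_ms`) and all values are invented, and ranking by mean latency is a simple stand-in for what a real observability tool does.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical telemetry events; field names and values are invented for illustration.
EVENTS = [
    {"user_id": "u1", "region": "us-east", "zone": "a", "duration_ms": 120},
    {"user_id": "u2", "region": "us-east", "zone": "b", "duration_ms": 110},
    {"user_id": "u3", "region": "eu-west", "zone": "a", "duration_ms": 130},
    {"user_id": "u4", "region": "eu-west", "zone": "b", "duration_ms": 2400},
    {"user_id": "u4", "region": "eu-west", "zone": "b", "duration_ms": 2600},
    {"user_id": "u5", "region": "us-east", "zone": "a", "duration_ms": 100},
]


def slowest_attributes(events, fields, top=3):
    """Rank (field, value) pairs by the mean latency of the events that carry them."""
    buckets = defaultdict(list)
    for event in events:
        for field in fields:
            buckets[(field, event[field])].append(event["duration_ms"])
    ranked = sorted(buckets.items(), key=lambda kv: mean(kv[1]), reverse=True)
    return [(field, value, mean(durations)) for (field, value), durations in ranked[:top]]


if __name__ == "__main__":
    for field, value, avg in slowest_attributes(EVENTS, ["user_id", "region", "zone"]):
        print(f"{field}={value}: mean {avg:.0f} ms")
```

With this invented data the single user `u4` ranks above its region and zone, which is the kind of "where is it occurring" answer the passage describes.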

Phillip Carter 00:03:46 And that allows you to really narrow down and isolate something that’s going on. There is a more academic definition of observability that comes from control theory, which is that you can understand the state of a system without having to change that system. I find that to be less helpful though, because most developers I think care about problems that they observe in the real world, sort of what I mentioned, and what they can do about those problems. And so that’s what I try to keep a definition of observability rooted in. It’s about asking questions about what’s going on and continually getting answers that help you narrow down behavior that you’re seeing, whether that’s an error or a spike in latency, or maybe something is actually fine but you’re just curious how things are actually performing and what healthy performance even means for your system.

Phillip Carter 00:04:29 Finding a way to quantify that, that’s sort of what the heart of observability is and what’s important is that it’s not just something that you do sort of on a reactive basis, like you get paged and you need to go do something, but you can also use it as one of your foundations for building your software. Because as we all know there, there’s things like unit testing and integration testing and things like that that help when you’re building software. And I think most software engineers would agree that you want to build modern software with those things. But there’s another component which is what if I want to deploy these changes that are going to impact a part of the system but it may not necessarily be a part of a feature or, we’re not ready to release the feature yet but we want that feature release to be stable and like easy and not a surprise and all of that from like a system behavior standpoint. How do I build with production in mind and use that to influence things before I like flip a feature flag that allows something to be exposed to a user.

Phillip Carter 00:05:24 Again, that’s sort of where observability can sort of fit in there. And so I think part of why this had such sort of a long-winded definition if you will or explanation is because it is a relatively new phenomenon. There have been organizations such as Google and Facebook and all of that who have been practicing these sorts of stuff for quite a while, these practices and building tools around them. But now we’re seeing a broader software industry adoption of this stuff because it’s needed to be able to go in the direction that people want to actually go. And so because of that definitions are sort of shifting and problems are shifting because not everybody has the exact same problems as your Googles or Facebooks or whatnot. And so it’s an exciting place to be in.

Giovanni Asproni 00:06:07 Okay, that’s fine then. Now let’s go to the next bit: LLMs, large language models. What is a large language model? I mean, everybody nowadays talks about ChatGPT, that seems to be all over the place, but I’m not sure that everybody understands, at least at a high level, what a large language model is. Can you tell us a bit?

Phillip Carter 00:06:27 So a large language model can be thought of in a couple different ways. I’ll say there’s a very easy way to think of them and then there’s a more fundamental way to think of them. So the easy way to think of them is from an end user perspective where you already have something that’s largely good enough for your task. It is a black box that you submit text to and then it has a lot of information compressed inside of it that allows it to analyze that text and then perform an action that you give it like a set of instructions such that it can emit text in a particular format that contains certain information that you’re looking for. And so there can be some interesting things that you can do with that. Different language models are better for like emitting code versus emitting poetry.

Phillip Carter 00:07:13 Some, like ChatGPT, are super large and they can do both very, very well, but there are specialized ones that can often be better for very specific things, and there are also ways to feed in data that was not a part of what this model was trained on to sort of ground a result in a particular set of information that you want an output to be in. And it’s basically just this engine that allows you to do these sorts of things, and it’s very general purpose. So if you need, for example, to emit JSON that you want to insert into another part of your application somewhere, it’s generally applicable whether you are building a healthcare app or a financial services app or if you’re in consumer technology or something like that. It’s broadly applicable, which is why it’s so interesting. Now there’s also a bit more of a fundamental definition of this stuff.

Phillip Carter 00:08:06 So the idea is language models, they’re not necessarily new, they’ve been around since at least 2017, arguably earlier than that, and they are based on what is called the transformer architecture and a principle, or a practice I guess you could say, in machine learning called attention. And so the idea, generally speaking, is that there were a lot of problems in processing text and natural language processing with previous machine learning model architectures. And the problem is that if you give it a sentence that contains several pieces of information, there may be a part of this sentence that refers to another part of the sentence, backwards or forwards, and the whole thing contains this strong semantic relevance that as humans we can understand, and we form those connections very naturally. But computationally speaking it’s an extremely complex problem, and there have been all these variations in trying to figure out how to efficiently do it.

Phillip Carter 00:09:05 And attention is this principle that allows you to say, well, we’re going to effectively hold in memory all of the permutations of the semantic meaning of a given sentence that we have, and we’re going to be able to pluck from that memory that we’ve developed at any given moment as we generate. So as we generate an output, we look at what the input was, and we basically hold in memory what all of those things were. Now that’s a gross oversimplification. There are piles and piles of engineering work to do that as efficiently as possible and implement all these shortcuts and all of that. But imagine you have a program that has no memory limitations, if you have, let’s say, an N² memory algorithm that allows you to hold everything in memory as much as you want and refer to anything at any point in time and refer to all the connections between all the different things. Then you can in theory output something that is much more useful than previous generations of models. And that’s sort of the principle that underlies large language models and why they work so well.
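To make the "hold every connection in memory" idea concrete, here is a minimal Python sketch of scaled dot-product attention, the standard formulation used in transformer models. It is an illustration under simplifying assumptions, not how any production model is implemented: the token vectors are invented, the same vectors serve as queries, keys, and values, and the learned projections, multiple heads, and masking used in real models are omitted.

```python
import math


def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over n tokens.

    The weights matrix has n rows and n columns, one score for every pair
    of tokens, which is where the quadratic memory cost comes from.
    """
    dim = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


if __name__ == "__main__":
    # Three toy 2-dimensional token vectors, invented for illustration.
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    outputs, weights = self_attention(tokens, tokens, tokens)
    print(len(weights), "x", len(weights[0]), "attention weights")
```

Each output vector is a weighted mix of every token's value vector, so any token can draw on any other token in the sequence, at the cost of storing one weight per pair.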

Giovanni Asproni 00:10:03 Referring to these models, now I’d like definitions of two more terms that we hear all the time. So the first one is fine tuning. I think you hinted at it before when you were explaining what to do with the model. So can you tell us what it means to fine tune a model?

Phillip Carter 00:10:20 Yes. So it’s important to understand the phases that a language model or a large language model goes through in sort of its productionizing, if you will. There is the initial training; sometimes it’s broken up into what’s called pre-training and training, but it’s basically you take your large corpus of text, let’s say a snapshot of the internet, and that is data that is fed in to create a very large model that operates on language, hence the name language model or large language model. Then there’s a phase that is often within the field of what’s called alignment, which is basically you have a goal, like you want this thing to be able to be good at certain things, or say you want it to minimize harm. You don’t want it to tell you how to create bombs, or that snapshot of the internet might contain some things that are frankly rather horrible and you don’t want that to be a part of the outputs of the system.

Phillip Carter 00:11:12 And so this sort of alignment thing is a form of tuning. It’s not quite fine tuning, but it’s sort of a way to tune it such that the outputs are going to be aligned with what your goals and principles are behind the system that you’re creating. Now, then you get into forms of specialization, which is where fine tuning comes in. And depending on the model architecture, it may be that once you’ve fine-tuned it in a particular way you can’t really fine tune it in another way; it’s sort of optimized for one particular kind of thing. So that’s why, if you’re curious and looking at all the different kinds of fine tuning that’s going on, there are so many different models that you could potentially fine tune. But fine tuning is that act of specialization. So it’s been trained, it’s been aligned to a particular general goal, but now you have a much more narrow set of problems that you want it to focus on.

Phillip Carter 00:12:03 And what’s critical about fine tuning is it allows you to bring your own data. So if you have a model that is good at outputting text in a JSON format, for example, well, it may not necessarily know about the specific domain that you want it to actually output within. Like, you care about emitting JSON, but it needs to have a particular structure, and maybe this field and this subfield need to have a particular association and they have some sort of underlying meaning behind them. Now if you have a corpus of textual knowledge that explains that, fine tuning allows you to specialize that model so it understands that corpus of knowledge and is almost in a way sort of overfitted on it, so that the output is a language model that is very, very good at understanding the data that you gave it and the tasks that you want it to perform, but it loses some of the ability, especially from an output standpoint, that it may have started from.

Phillip Carter 00:13:02 So you’ve basically overfit it towards a particular use case. And so the reason why this is interesting and potentially a tradeoff is you can in theory get much better outputs than if you were to not fine tune, but that often comes at an expense: if you didn’t quite fine tune it right, it can be overfit for a very specific kind of thing, and then a user might expect a slightly more general answer, and it may be incapable of producing such an answer. And so anyways, it’s kind of long-winded, but I think it’s important to understand that fine tuning fits into this pipeline, if you will, of different stages of producing a model. And the output itself is a language model. It’s really that the model is different depending on each phase that you’re in. And so that’s largely what fine tuning is and where it fits in.

Giovanni Asproni 00:13:48 And then the final term I’d like to define here, which we hear a lot, is prompt engineering. So what is it about? I mean, sometimes it looks like kind of sorcery: we have to be able to ask the right questions to get the answers we want. But what is a good definition for it?

Phillip Carter 00:14:06 So prompt engineering, I like to think of it by analogy and then with a very specific definition. So through analogy, when you want to get an answer out of a database that uses SQL as its input, you construct a SQL statement, a SQL expression, and you run that on the database, and it knows how to interpret that expression and optimize it and then pull out the data that you need. And maybe if you have different data in a different shape or you’re using different databases, you might have slightly different expressions that you give this database engine depending on which one you’re using. But that’s how you interact with that system. Language models are similar to the database, and the prompt, which is usually just English (though you can also do it in other languages), is sort of like your SQL statement that you’re giving it.

Phillip Carter 00:14:54 And so depending on the model you might have a different prompt that you need, because it may interpret things a little differently. And also, just like when you’re doing database work, it’s not just any SQL that you need to generate. Especially if you have a fairly complex task that you want it to do, you need to spend a lot of time really crafting good SQL, and you may get the right answer but it may be really inefficient, and so there’s a lot of work involved there and a lot of people who specialize in that field. It’s the exact same thing with language models, where you construct basically a set of instructions, and maybe you have some data that you pass in as well through a task called retrieval augmented generation, or RAG as it’s often called. But it’s all in service of getting this black box to emit what you want as effectively and efficiently as possible.

Phillip Carter 00:15:41 And instead of using a language like SQL to generate that stuff, you use English. And where it’s a little bit different, and I think where that analogy kind of breaks apart, is this: when you try to get a person, or let’s say a toddler, like a three- or four-year-old, to go and do something, you need to be very clear in your instructions. You might need to repeat yourself; you may have thought you were being clear but they did not interpret it in the way that you thought they were going to interpret it, and so on, right? That’s sort of what prompt engineering is. If you imagine this database that’s really smart at emitting certain things as also sort of like a little toddler, it may not be very good at following your instructions. So you need to get creative in how you are instructing it to do certain things. That’s sort of the field of prompt engineering and the act of prompt engineering, and it can involve a lot of different things, to the point where calling it an engineering discipline I think is quite valid. And I’ve come to prefer the term AI engineering instead of prompt engineering because it encompasses a lot of things that happen upstream before you submit a prompt to a language model to get an output. But that’s the way I like to think of it.
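The idea of a prompt as explicit instructions plus passed-in data can be sketched in Python as follows. This is an assumption-laden illustration, not a method described in the episode: the function names, the instruction wording, and the word-overlap ranking (a naive stand-in for the retrieval step of retrieval augmented generation) are all hypothetical, and no model is actually called.

```python
def retrieve(question, documents, limit=2):
    """Naive stand-in for retrieval: rank documents by words shared with the question."""
    words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:limit]


def build_prompt(question, documents):
    """Assemble instructions, retrieved context, and the user's input into one prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return (
        "You answer questions using only the context below.\n"
        "If the context is not enough, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    docs = [
        "latency is measured in milliseconds",
        "the cafeteria opens at noon",
        "traces record latency per span",
    ]
    print(build_prompt("how is latency measured", docs))
```

Everything that happens before the final string is built, such as which documents were retrieved and why, is part of the "upstream" work that the term AI engineering is meant to cover.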

Giovanni Asproni 00:16:48 What is observability in the context of large language models and why does it matter?

Phillip Carter 00:16:54 So if you recall when I was talking about observability, you may have a lot of things going on in production that are influencing the behavior of your system in a way that you can’t debug on your local machine, you can’t reproduce it, and so on. This is true for any modern software system. With large language models, it’s that exact same principle, except the pains are felt much more acutely. In practice with normal software, yes, you may not be able to debug this thing that’s happening right now, but you might be able to debug some of it in the traditional sense. Or maybe you actually can reproduce certain things. You may not be able to do it all the time, but maybe you can. With large language models, sort of everything is in a sense unreproducible, non-debuggable, and non-deterministic in its outputs.

Phillip Carter 00:17:46 And on the input side your users are doing things that are likely very, very different from how they would interact with normal software, right? If you consider a UI, there are only so many ways that you can click a button or select a dropdown in a UI. You can account for all of that in your test cases. But if you give someone a text box and you say, input whatever you like and we’re going to do our best to give you a reasonable answer from that input, you cannot possibly unit test for all the things your users are going to do. And in fact it’s a big disservice to the system that you’re building to try to understand what your users are going to do before you go live and give them the damn thing and let them bang around on it and see what actually comes out.

Phillip Carter 00:18:27 And so as it turns out, this way that these models behave is actually a perfect fit for observability, because if observability is about understanding why a system is behaving the way it is without needing to change that system, well, if you can’t change the language model, which you usually cannot, or if you can, it’s a very expensive and time-consuming process, how do you make progress? Because your users expect it to improve over time. What you release first is likely not going to be perfect; it may be better than you thought, but it may be worse than you thought. How do you do that? Observability, and gathering signals on what are all the factors going into this input, right? What are all the things that are meaningful upstream of my call to a large language model that potentially influence that call? And then what are all the things that happen downstream, and what do I do with that output?

Phillip Carter 00:19:15 What is that actual output? And gathering all those signals. So not just user input and large language model output, but if you made 10 decisions upstream in terms of gathering contextual information that you want to feed into the large language model, what were those decision points? Because if you made a wrong decision, that will influence the output. The model might have done the best job that it could, but you’ve fed it bad information. How do you know that you’re feeding it bad information? You capture the user input. What kind of inputs are people entering? Are there patterns in their input? Are they expecting it to do something even though they gave it vague instructions? Is that something you want to solve for, or is that something that you want to error out on? If you get the output and the output is what I like to call mostly correct, right?

Phillip Carter 00:19:57 You expect it to follow a particular structure, but one piece of it is a little bit wrong. Are there ways that you can correct that and make it seem as though the language model actually did produce the correct output, even if it didn’t quite give you the right thing that you were expecting? These are interesting questions that you need to explore, and really the only way that you can do that is by practicing good observability and capturing data about everything that happened upstream of your call to a language model and things that happen on the output side of it, so you can see what influences that output. And then, when you can isolate that with an observability tool, you can say, okay, when I have an input that looks like this and I have these kinds of decisions, then this output pretty reliably is bad in this particular way. Cool, this is a very specific bug that I can now go and try to fix. And my act for fixing that is frankly a whole other topic, but now I have something concrete that I can address rather than just throwing stuff at the wall and doing guesswork and hoping that I improve a system. So that’s why observability intersects with systems that use language models so well.
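One way to picture capturing the input, the upstream decisions, and the output together is a single structured event per request. The Python sketch below is an illustration only: `fake_model`, the event field names, and the expectation of a numeric `days` field are all hypothetical, and printing JSON stands in for sending the event to an observability backend.

```python
import json
import time


def fake_model(prompt):
    """Stand-in for a real model call; returns a 'mostly correct' JSON string."""
    return '{"query": "count errors", "time_range": "7 days"}'


def handle_request(user_input, decisions, model=fake_model, sink=print):
    """Call the model and emit one wide event describing the whole interaction."""
    event = {"user_input": user_input, "decisions": decisions}
    prompt = f"Context: {decisions.get('context_source')}\nInput: {user_input}"
    start = time.perf_counter()
    raw = model(prompt)
    event["model_duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
    event["raw_output"] = raw
    try:
        parsed = json.loads(raw)
        event["parse_ok"] = True
    except json.JSONDecodeError as err:
        parsed = None
        event["parse_ok"] = False
        event["parse_error"] = str(err)
    # Hypothetical schema check: this application expects a numeric "days" field.
    event["schema_ok"] = isinstance(parsed, dict) and isinstance(parsed.get("days"), int)
    sink(json.dumps(event))
    return parsed, event


if __name__ == "__main__":
    handle_request(
        "errors last week",
        {"context_source": "schema_v2", "examples_used": 3},
    )
```

Here the output parses as JSON but fails the schema check, which is the "mostly correct" case: with the input and the decisions recorded on the same event, that failure can be grouped by what led up to it.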

Giovanni Asproni 00:21:03 Are there any similarities of observability for large language models with observability for let’s say more well in quotes, conventional systems?

Phillip Carter 00:21:13 There certainly can be. So I’ll use the database analogy again. So imagine your system makes a call to a database and it gets back a result, and you transform that result in some way and feed it back to the user somehow. Well, you may be making decisions upstream of that database call that influence how you call the database, and the net result is a bad result for the user, even though your database query was not wrong. It was just the data that you parameterized into it or something like that, or the decision that you made to call it this way instead of that way. That’s the thing that’s wrong. And now you can go and fix that, and it may have manifested in a way that made it look like the database was at fault, but something else was at fault.

Phillip Carter 00:21:58 Another way that this can manifest is in latency. So language models, like frankly other things, have a latency component associated with them, and people don’t like it when stuff is slow. So you might think, oh well, the language model, we know that that has high latency, it’s being really slow, OpenAI is being really slow. And then you go and look at it and it’s actually not that slow, and you’re like, huh, well, this took five seconds but only two seconds was generation. Where the heck are those other three seconds coming from? Now swap out the language model for any other component where there’s potential for high latency, and you may think that that component is responsible, but it’s not. It’s like, oh, upstream we made five network calls when we thought we were only making one. Oops. Well, that’s great. We were able to fix the problem; it was actually us.

Phillip Carter 00:22:44 I’ve run into this several times. At Honeycomb, we have one of our customers who uses language models extensively in their applications. They had this exact workflow where their users were reporting that things were slow, and they were complaining to OpenAI about it. And OpenAI was telling them, like, we’re serving you fast requests. I don’t know what’s going on, but it’s your fault. And so they instrumented their systems with OpenTelemetry and tracing, and they found that they were making tons of network calls before they ever called the machine learning model. And they’re like, well, wait a minute, what the heck? And so they fixed that, and all of a sudden their user experience was way better.
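The "where are the other three seconds coming from" question can be illustrated with a small timing sketch in Python. This is deliberately not the OpenTelemetry API; it is a hand-rolled stand-in using only the standard library, and the function names, span names, and sleep durations are invented to mimic several upstream network calls followed by one model call.

```python
import time
from contextlib import contextmanager

SPANS = []  # collected (name, duration_ms) pairs; a real tracer would export these


@contextmanager
def span(name):
    """Record how long the wrapped block takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append((name, (time.perf_counter() - start) * 1000))


def fetch_context(calls=5):
    """Stand-in for upstream network calls made before the model is invoked."""
    for i in range(calls):
        with span(f"fetch_context.{i}"):
            time.sleep(0.01)


def call_model():
    """Stand-in for the model's generation step."""
    with span("llm.generate"):
        time.sleep(0.02)


def handle_request():
    """One request: upstream work, then the model call, all under a parent span."""
    with span("handle_request"):
        fetch_context()
        call_model()


if __name__ == "__main__":
    handle_request()
    for name, ms in SPANS:
        print(f"{name:20s} {ms:7.1f} ms")
```

Seeing the per-step durations next to the total makes it visible when time attributed to the model is actually being spent in the calls that precede it.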

Giovanni Asproni 00:23:19 Now about the challenges that observability for large language models helps to address. I think you mentioned before that with these models, unit testing, for example, or any kind of testing, has some strong limitations in what we can do. You cannot test a text box where users can put random questions; tests cannot cover those, so you cannot have a good set of tests for that. So there is that, but what other kinds of challenges does observability help address?

Phillip Carter 00:23:49 Two important ones come to mind. So the first is one of latency. So I kind of mentioned that before, but large language models have high latency and there’s a lot of work being done to improve that right now. But if you want to use them today, you’re going to have to introduce latency on the order of seconds into your system. And if your users are used to getting everything on the order of milliseconds, well, that could potentially be a problem. Now I would argue that if it’s clear that something is AI (and with large language models, most people usually associate it with AI), a lot of users now sort of expect that, okay, this might take a little while to get an answer. But still, if they’re sitting around tapping their feet waiting for this thing to finish, that’s not a good experience for someone.

Phillip Carter 00:24:36 And the right latency for your system is going to depend on what your users are actually trying to do and what they’re expecting and all of that. But what that means, kind of to that earlier point, is that you can be making a mistake unrelated to the language model that gives the impression of a higher latency, and that makes those problems more severe, because now you have created a step change in your latency on the order of seconds and you have other stuff layered on top of that. Your users might be like, wow, this AI feature sucks because it’s really slow. I don’t know if I like it very much. Getting a handle on that is very difficult. Now in addition to that, the way that a model is spoken to, right, the prompt that you feed it and the amount of output that it has to generate to be able to get a complete answer, greatly influences the latency as well.

Phillip Carter 00:25:24 So for example, there is a prompting technique called chain of thought prompting. Now chain of thought prompting, you can go look it up, but the idea is that it forces the model to, so-called, think step by step for every output that it produces. And so that’s great, because it can increase the accuracy of outputs and make it more reliable. But that comes at the cost of a lot more latency, because it does a lot more computational work to do that. Similarly, imagine you’re solving a math problem: if you think step by step instead of intuitively, it’s going to take you longer to get a final result. That’s exactly how these things work. And so you may perhaps want to A/B test because you’re trying to improve reliability. Okay, what if we do chain of thought prompting? Now our latency went up a whole lot.

Phillip Carter 00:26:08 Like, how do you systematically understand that impact? That’s where observability comes in. Also on the output side you need to be creative in terms of how it generates outputs, right? Things like ChatGPT and stuff, they will output a dump of text, but that’s usually not appropriate, especially for any kind of enterprise use case. And so there’s this question of, okay, how do we influence our prompting or perhaps our fine tuning such that we can get the most minimal output possible? Because that’s actually where the majority of latency comes in from a language model. Its generation task, depending on how it generates and how much it needs to generate, can introduce a large amount of latency into your system. So instead of a large language model, you have a large latency model, and nobody likes that. So again, how do you make sense of that?
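A minimal Python sketch of comparing two prompt variants on latency, output length, and accuracy follows. It is an illustration under stated assumptions: the variant names, the field names, and every number are invented rather than measured, and in practice these observations would come from production telemetry rather than a hard-coded list.

```python
from statistics import median

# Hypothetical observations from production; every number is invented.
OBSERVATIONS = [
    {"variant": "direct", "output_tokens": 40, "duration_ms": 900, "correct": True},
    {"variant": "direct", "output_tokens": 55, "duration_ms": 1100, "correct": False},
    {"variant": "direct", "output_tokens": 45, "duration_ms": 1000, "correct": True},
    {"variant": "chain_of_thought", "output_tokens": 210, "duration_ms": 3900, "correct": True},
    {"variant": "chain_of_thought", "output_tokens": 260, "duration_ms": 4600, "correct": True},
    {"variant": "chain_of_thought", "output_tokens": 230, "duration_ms": 4100, "correct": True},
]


def summarize(observations):
    """Per prompt variant: median latency, median output length, and accuracy."""
    by_variant = {}
    for obs in observations:
        by_variant.setdefault(obs["variant"], []).append(obs)
    return {
        variant: {
            "median_ms": median(o["duration_ms"] for o in rows),
            "median_tokens": median(o["output_tokens"] for o in rows),
            "accuracy": sum(o["correct"] for o in rows) / len(rows),
        }
        for variant, rows in by_variant.items()
    }


if __name__ == "__main__":
    for variant, stats in summarize(OBSERVATIONS).items():
        print(variant, stats)
```

Putting output length next to latency in the same summary is the point: it shows whether a slower variant is slower because it generates more, which is the tradeoff described for chain of thought prompting.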

Phillip Carter 00:26:55 The only way to do that is by gathering real-world data. This is what real people are entering. These are the real decisions that we made based off their interactions, and this is the real output that we got. This is how long it took. That’s a problem that needs solving, and observability is really the only way to get that. The second piece that this solves gets to the observability-driven development kind of thing. So observability-driven development is a practice that is fairly nascent, but the idea is that you break down the barrier between development and production, and you say that, okay, this software that I’m writing is not the code on my machine that I then push to something else and then it goes live somehow. Really, I am developing with a live system in mind, and that’s likely going to influence what I work on and make sure that I’m focusing on the right things and improving the right problems.

Phillip Carter 00:27:49 That’s something that large language models really sort of force an issue on because you have this live system that you’re probably pretty motivated to improve and it’s behaving in a way right now that is perhaps not necessarily good. And so how do I make sure that when I’m developing, I know that I’m focusing on things that are going to be impactful for people. That’s where observability comes in. I get those signals, I get sort of what I mentioned, that sort of way that I can isolate a very specific pattern of behavior and say okay, that’s a bug that I can work on. Getting that specificity and getting that clarity that this is what is occurring out in the world is crucial for any kind of development activity that you do because otherwise you’re just going to be improving things at the margins.

Giovanni Asproni 00:28:29 Is this related to, I read your book so it’s related to your book, to the early access program example you give where say with limited user testing, especially large language models, you cannot possibly get all the possible user behaviors because of the fact that it’s a large language model is not a standard application. So this seems like this case of observability driven development is you get to go out with something but then you check what the users do and somehow use that information to refine your system and make it better for the users. Am I understanding that correctly?

Phillip Carter 00:29:04 That’s correct. I think a lot of organizations in fact are used to the idea of an early access program like a closed beta or something like that as a way to reduce risk. And so that could in theory be helpful with large language models if it’s a large enough program with a diverse enough amount of users. But getting that degree of population like enough people with a diverse enough set of like interests and things that they’re trying to accomplish is often so difficult and time consuming that you might as well have just gone live and seen what people are doing and just acted on that right away. And what that, what that means though is that you need to commit to the fact that you are not done just because you’ve released something. And I think a lot of engineers right now are used to the idea that something goes live in production, the feature is launched.

Phillip Carter 00:29:53 Maybe there’s, you sprinkle a little bit of monitoring on that but that might be another team’s concern anyways, I can just move on to the next task. That is absolutely not what’s going on here. The real work actually begins once you are live in production because I would posit that I didn’t write this in the book but I would posit that it’s actually easy to bring something to market when you use large language models because they are so damn powerful for what they can do right now that for you to create even just a marginally better experience for people, you can do that in about a week with a bad UI and then expand that out to a month with an engineering team and you probably have a decent enough UI that that’s going to be acceptable for your users. So you have about a month that you can use to take something to market for. I would wager a large majority of the features that people use large language models for.

Giovanni Asproni 00:30:36 Actually I have a question related to this now that just came to my mind. So basically it seems that we need to replace the attitude of okay, we’ve done the feature, the feature is ready, somebody will test in QA, QA is happy you release it because for this, there is no real QA per se because we can’t really do a lot, I mean we can try a bit, we can play with the model a little bit and say okay seems to be good. But in reality until there are lots of people using it, we have no idea of how it performs.

Phillip Carter 00:31:07 Oh yeah, absolutely. And what you will find is that people are going to find use cases that work that you had no idea were going to work. We observe this a lot with our own feature at Honeycomb with our query assistant feature. That’s our natural language data querying. There are use cases that we did not possibly think of that apparently quite a few people are doing and it works just fine and there’s no way we would’ve figured that out unless we went live.

Giovanni Asproni 00:31:33 Have you come across, I don’t know, customers that had the more, let’s say, traditional mindset with the development and QA approach before going to production, then moving to these large language models and being maybe confused by not having the QA-accepted part before going to production? I don’t know, is that something that you experienced?

Phillip Carter 00:31:56 I’ve definitely experienced that. So there’s really two things that I’ve found. So first of all, for most larger enterprise organizations, there’s usually some degree of excitement at the higher level, like the executive staff level, to adopt this technology in a way that’s useful. But then there’s also sort of a pincer motion there. There’s usually some team at the bottom that wants to explore and wants to experiment anyways. And so what usually happens is they have that goal. And on the executive side, I think most technology executives have understood the fact that this software is fundamentally different from other software. And so teams may need to change their practices and they don’t really know how, but they’re willing to say, hey, we have this typical process that we follow, but we’re not going to follow that practice right now. We need to figure out what the right process is for this software.

Phillip Carter 00:32:44 And so we’re going to let a team go and figure that out. That team that goes and figures that out on the other end, I found when I went and did a bunch of user interviews, they find out very, very quickly that their tool set for making software more reliable practically needs to get thrown out the window. Now, not entirely. There are certain things that certainly are better. For example, with prompt engineering, source control is very important. It’s very important for software, it’s also very important for prompt engineering. GitOps-based workflows, that kind of stuff, are actually very good for prompt engineering workflows, and especially different kinds of tagging. Like you may have had a prompt that was a month old but it performs better than the thing that you’ve been working on, and how do you sort of systematically keep track of that?

Phillip Carter 00:33:25 So people are finding that out, but they’re finding out very, very quickly that they can’t meaningfully unit test, they can’t meaningfully do integration tests, they can’t rely on a QA thing. They need to have just a bunch of users come in and just do whatever they feel like with it and capture as much information as they can. And the way that they’re capturing that information may not be ideal. Some are actually realizing that. We’ve talked with one organization that was just logging everything and then finding out, sort of what I mentioned, that there’s often these upstream decisions that you make prior to a call that influence the output, and they would have to manually correlate this stuff. And eventually they realized, oh, this is actually a tracing use case, so let’s figure out what’s a good tracing framework where we can capture the same data, and they almost sort of stumbled their way into a best practice that some teams may know is appropriate. So there are these pains that people are feeling and a recognition that they have to do something different. That I think is really important because I don’t think it’s very often that software comes along and forces engineers and entire organizations to realize that their practices have to change to be successful in adopting this tech.

Giovanni Asproni 00:34:28 Yeah, because I can see that as a big change in attitude and mindset in how we approach releases to production. What about things like incremental development, incremental releases? Is the incremental bit still valid with large language models or?

Phillip Carter 00:34:44 I would say incrementality and fast releases are much more important when you have language models than they are when you don’t. In fact, I would say that if you are incapable of creating a release that can go live to all users on a daily basis, now you may not necessarily do this, but you need to be capable of doing that. If you’re incapable of doing that, then maybe language models are not the thing that you should adopt right now. And the reason why I say that is because you will literally get from day to day different patterns in user behavior and shifts in that user behavior and you need to be able to react to that and you can end up being frankly in a more proactive workflow eventually where you can proactively observe, okay, these are the past 24 hours of user interactions. We’re going to now look for any patterns that are different from the patterns that we observed in the past.

Phillip Carter 00:35:34 And we find one and we say, okay, cool, that’s a bug, file it away and keep repeating that. And then basically you get into a workflow where you analyze what’s going on, you figure out what your bugs are for that day, you then go and solve one of them, or maybe it was one from the other day, who cares. And then you deploy that change and now you’re not only checking to see what the new patterns are, you are monitoring for two things. You’re monitoring for, number one, did I solve the pattern and behavior that I wanted to solve for? And two, did my change accidentally regress something that was already working? And that’s I think is something that is kind of an existential problem that engineers need to be able to figure out. And that’s where observability tools like service level objectives really, really come in handy because if you have a way to describe what success means systematically and through data for this feature, you can then capture all of the signals that correlate with non-success with failing to meet that objective.

Phillip Carter 00:36:34 And then you can use that to monitor for regressions on things that were already working in the past. And so creating that flywheel of looking at data, isolating use cases, fixing a use case going in through the next day, ensuring that A, you fixed that use case but B, you did not break something that was already working. That’s something that’s really important because especially in the worlds of language models and prompt engineering, because there’s a lot of variability, there’s a lot of users doing weird things, there’s other parts of the system that are changing. The model itself is non-deterministic. It’s actually very easy to regress something that was previously working without you necessarily knowing it upfront. And so when you get that motion of releasing every day and being very incremental in your changes and proactively monitoring things and knowing what’s going on, that’s how you make progress where you can walk that balance between making something more reliable but not sort of hurting the creativity and the outputs that users expect from the system.
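As an illustration of the daily check described above (not something shown in the episode), here is a minimal Python sketch that compares success rates per use case before and after a deploy and tests them against a service level objective. The `Event` fields, the 5% tolerance, and the 90% objective are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One recorded LLM request; the field names are hypothetical."""
    use_case: str    # label for a pattern of behavior, e.g. "group_by"
    succeeded: bool  # did the request meet our definition of success?


def success_rate(events):
    """Fraction of events that succeeded (1.0 for an empty list)."""
    if not events:
        return 1.0
    return sum(e.succeeded for e in events) / len(events)


def meets_objective(events, objective=0.90):
    """True when the overall success rate is at or above the objective."""
    return success_rate(events) >= objective


def regressed_use_cases(before, after, tolerance=0.05):
    """Use cases whose success rate dropped by more than `tolerance`."""
    regressions = []
    for use_case in sorted({e.use_case for e in before}):
        old = success_rate([e for e in before if e.use_case == use_case])
        new_events = [e for e in after if e.use_case == use_case]
        if not new_events:
            continue  # no traffic after the deploy, nothing to compare
        if old - success_rate(new_events) > tolerance:
            regressions.append(use_case)
    return regressions
```

Running `regressed_use_cases` on yesterday's and today's events answers the second monitoring question (did the change break something that was working), while `meets_objective` covers the overall target.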

Giovanni Asproni 00:37:30 Okay. And observability and collecting and analyzing data seems to play quite a crucial role to be able to do that, to do these incremental steps, especially with large language models. Also, how do you use observability to feed this data back for product development, maybe product improvement, new features or something? So can you feed that data back also for that purpose? So far, we have been talking about making up for the fact that we cannot really test the system, or finding out if it is performing well in terms of expectations, but what about product development? So maybe new ideas, new needs, as users find ways of actually doing stuff with large language models that you didn’t even think of. So how can we use this information to improve the product?

Phillip Carter 00:38:20 So there’s really two ways that I’ve experienced that you can do this with our own large language model features in Honeycomb. So the first is that yes, what you release first is not going to solve everything that your users want. And so yes, you iterate and you iterate, you iterate, you iterate until you sort of reach I guess a steady state if you will, where the thing that you’ve built has some characteristics and it’s probably going to be pretty good at a lot of things, but there will likely be some fundamental limitations that you encounter along the way where somebody’s asking a question that is simply unanswerable with the system that you’ve built. Now in the case of Honeycomb, I’ll ground this in something real with our natural language querying feature. What people typically ask for is sort of like a starting point where they’ll say, oh well, show me the latency for this service.

Phillip Carter 00:39:17 What were these like slow requests or, what were the statements that led to slow database calls? And they often take it from there. Well they’ll manually manipulate the query because the AI feature sort of got them to that initial scaffolding. We do also allow you to modify with natural language. So they will often modify and say, oh now group by this thing or also show me this, or oh I’d like to see a P95 of durations or something like that. But sometimes people will ask a question where they’ll say, oh well why is it slow? Or what are the user IDs that most correlate with the slowness or something like that. And the thing that we built is just fundamentally incapable of answering that question. In fact, that question is very difficult to answer because first, you’re not going to be guaranteed an answer why?

Phillip Carter 00:40:08 And second of all, we do actually, as a part of our UI, have a way. There’s this feature called BubbleUp that will automatically scan all of the dimensions in your data and then pluck out, oh well, we’re holding this thing constant. Let’s say it’s error that is held constant. What are all the dimensions in your data and all the values of those dimensions that correlate the most with that? And it generates little histograms that sort of show you that, okay, yes, user ID correlates with error a whole lot, but it’s actually these four user IDs that are the ones that correlate the most, and that’s your signal that you should go debug a little bit further. That’s the sort of answer that a lot of people are asking for, some signal as to why. And what that implies from an AI system is not just generate a query, they may already have a query, but to sort of identify, based on this query, that somebody is looking to hold this dimension in the data constant. And what they want to do is they want to get this thing into BubbleUp mode and they want to execute that BubbleUp query against this dimension of the data and show those results in a useful way. And that is just a fundamentally different problem than create a query based off of somebody’s inputs, even though it’s the same text box that people are in.

Giovanni Asproni 00:41:19 Yeah. This seems to be more about guessing the goal of the user. So it is not about the means; the rest is the means to an end. Here we are talking about understanding the end they have and then working on that to give them the answer they’re looking for.

Phillip Carter 00:41:35 Right. That’s true. And so the two approaches that people generally fall under is they try to create an AI feature that’s like ChatGPT, but for their system, that can understand intent and knows how to figure out which part of the product to sort of activate based off of intent. All of those initiatives have failed so far, largely because it’s so hard to build and people don’t have the expertise for that.

Giovanni Asproni 00:41:57 So to me it looks like that particular feature requires a certain amount of context that can be slightly different from even person to person. So not everybody, different users are looking for something similar. Yeah. But the similarity means also that there is some difference anyway. And so creating a system that is able to do that probably is less obvious than what it seems.

Phillip Carter 00:42:22 Yes, it absolutely is. And so, back to this whole notion of incrementality, right? You do want to deliver some value, like you don’t want to solve every possible use case all upfront, but eventually you’re going to run into these use cases that you’re not solving for, and if there’s enough of them, through observability you can capture those signals. You can see what are the things that associate the most with somebody asking that kind of question that’s fundamentally unanswerable, and that gives you more information to feed into product development. Now the other way that this thing manifests as well is there’s this period of time when you release a new AI feature where it’s fancy and new and expectations are this weird mix of super high and also super low, kind of depending on who the user is, and you end up surprising your users in both directions. But eventually it becomes the new normal, right?

Phillip Carter 00:43:15 In the case of Honeycomb, we’ve had this natural language querying feature since May of 2023 and it’s just what users start out with querying their data with now, that’s just how they do it. And because of that there’s some limitations, right? Like there are other parts of the product where you can enter in and get a query into your data and this querying feature is not really integrated there. And some people, like for example, our homepage does not have the text box. You have to go into our querying UI to actually get that, even though the homepage does show some queries that you can interact with. We’ve had users say, hey, I want this here, but we don’t actually really know what the right design for that is. Like the homepage was not really built with anything like that in mind ever. And yet there actually is a need there.

Phillip Carter 00:43:59 And so this influences it because, I mean, in a way, this isn’t really any different from other product development, right? You release a new feature, it’s new, and eventually your product now has a slightly different characteristic about it. You’ve created a need because it’s not sufficient in some ways for some users and they want it to show up somewhere else. And that creates sort of a puzzle of how you figure out how that feature’s going to fit into these other places of your product. It’s the exact same principle with the AI stuff. I would just say the main thing that’s a little bit different is that instead of having very, very direct and often exact needs that people have, the needs that people have or questions that people want answered are going to have a lot more variability in them. And so that can sometimes increase the difficulty of how you choose to integrate it more deeply through other parts of your product.

Giovanni Asproni 00:44:46 Okay. And talking a bit more about prompt engineering. So as we said, at the moment it is probably more of an art than a science because of the models, but how can people use observability to actually improve their prompts?

Phillip Carter 00:45:03 So because observability involves capturing all of these signals that feed into an input to that system, one of those inputs is the entire prompt that you send, right? So for example, in a lot of systems, I would say probably most systems at this point that are being built, people dynamically generate a prompt, or they programmatically generate it. So what that means is, okay, for a given user, they may be part of an organization in your application, that organization may have certain data within it or like a schema for something or certain settings or things like that. All of those influence how a prompt gets generated because you want to have a prompt that is appropriate for the context in which a user is acting and, one user versus another user, they may have different contexts within your product and so you programmatically generate that thing.

Phillip Carter 00:45:54 So A, there are steps involved in programmatic generation that actually are prompt engineering, even though it’s not the literal text itself. Literally just selecting which sentence gets incorporated in the final prompt that we send off, that is an act of prompt engineering. And so you need to understand which one was picked for this user. Then the second thing though is when you have the final prompt, your input to a model is literally just one string. It’s a giant string, well not necessarily giant, but it’s a big string that contains the full set of instructions. Maybe there’s data that you’ve parameterized within, maybe there’s a bunch of special things. You might have examples as a part of this prompt, and you may have parameterized those examples because you may have a way to programmatically generate them based off of somebody’s context.
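To make the idea of programmatic prompt generation concrete, here is a small Python sketch, assuming a hypothetical bank of instruction sentences and a per-organization schema. It returns the final prompt string together with a record of which pieces were selected, which is the information the discussion suggests capturing. None of these names come from the episode or from any particular product.

```python
# Hypothetical bank of instruction sentences to choose from.
INSTRUCTION_BANK = {
    "default": "Translate the user's question into a query.",
    "has_schema": "Only use column names from the schema below.",
}


def build_prompt(user_input, schema_columns):
    """Return (prompt, decisions); decisions records what was selected."""
    parts = [INSTRUCTION_BANK["default"]]
    selected = ["default"]
    if schema_columns:
        # This selection step is itself part of prompt engineering.
        parts.append(INSTRUCTION_BANK["has_schema"])
        parts.append("Schema: " + ", ".join(schema_columns))
        selected.append("has_schema")
    parts.append("Question: " + user_input)
    prompt = "\n".join(parts)
    decisions = {
        "selected_instructions": selected,
        "schema_column_count": len(schema_columns),
        "prompt_chars": len(prompt),
    }
    return prompt, decisions
```

Storing `decisions` next to the full `prompt` makes it possible to replay the exact request later and to see which selection logic produced it.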

Phillip Carter 00:46:42 And so that right there is really important because how that got generated is what’s going to influence the end behavior that you get, and your act of prompt engineering is generating that thing a little bit better. But also when you have that full text, you now have a way to replay that specific request in your own environment. And so even though the system that you’re working with is non-deterministic, you might get the same result or a similar enough result to the point where you can say, okay, I’m maybe not necessarily reproducing this bug, but I am reproducing bad behavior with this thing consistently. And so how do I make this thing more consistently produce good behavior? Well you have the string itself, so you can literally just edit pieces of that right there in your environment as you’re developing it and you try this thing, okay, let’s see what the output is, I’m going to edit this one and so on.

Phillip Carter 00:47:35 And you get very systematic about that, and you understand what those changes are that you’re doing. If you’re good enough, which is most people in my experience, you will likely get it to improve in some way. And so then you need to say, okay, which parts of this prompt did we change? Did we change the parts that are static? Okay, we should version this thing and we should load that into our system now. Did we improve the parts that are dynamic? Okay, what did we change and why did we change it? Does that mean that we need to change how we select pieces of this prompt programmatically? That’s sort of what observability allows you to do because you capture all of that information, you can now ground whatever your hypotheses are in just sort of like the reality of how things are actually getting built.

Giovanni Asproni 00:48:16 Okay, now I’d like to talk a bit about how to get started with it, for developers that are maybe starting to work with large language models and want to implement observability or improve the observability they have in the systems they’re creating. So my first question is, what are the tools available to developers to implement observability for these large language models?

Phillip Carter 00:48:42 So it kind of depends on where you’re coming from. So frankly, a lot of organizations already have pretty decent instrumentation, usually in the form of structured logs or something like that. And so honestly, a very good first step is to create a structured log of: this is the input that I fed the model, this was the user’s input, this was the prompt. Here’s any additional information that I think is really important as metadata that goes into that request. And then here’s the output, here’s what the model did, here’s the full response from the model, including any other metadata that’s associated with that response, because the way that you call it will sort of influence that. So there are parameters that you pass in and it’ll tell you sort of what those parameters meant and things like that. Just those two log points, those two structured logs.
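The two log points described here could be sketched in Python roughly as follows, using only the standard library. The event names, the field names, and the `call_model` callable are assumptions for illustration; a real system would substitute its own model client and logging setup.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("llm")


def log_event(event, **fields):
    """Emit one structured (JSON) log line and return it."""
    line = json.dumps({"event": event, "ts": time.time(), **fields})
    logger.info(line)
    return line


def call_with_logging(call_model, user_input, prompt, params):
    """Wrap a model call with an input log and an output log.

    `call_model` is a hypothetical callable: (prompt, **params) -> str.
    """
    request_id = str(uuid.uuid4())  # ties the two log lines together
    log_event("llm.request", request_id=request_id,
              user_input=user_input, prompt=prompt, params=params)
    start = time.perf_counter()
    output = call_model(prompt, **params)
    log_event("llm.response", request_id=request_id, output=output,
              duration_ms=round((time.perf_counter() - start) * 1000, 2))
    return output
```

The shared `request_id` is what later lets the input and output be correlated without manual effort.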

Phillip Carter 00:49:28 This is not the most perfect observability, but this will get you a long way there because now you actually have real world inputs and outputs that you can base your decisions on. Now eventually, you are likely to get to the point where there are upstream decisions that influence how you build the prompt and thus how the model behaves. And there may be some downstream decisions that you make to act on the data, right? Like kind of that thing that I mentioned before where it may be mostly correct, it might be a correctable output. And so you may want to manually correct that thing through code somehow. And so now instead of just two log points that you can sort of look at, you now have this set of decisions that are all correlated with effectively a request, and that request to the model, and then its output, and some stuff you do on the backend, and some people call multiple language models through a composition framework of some kind.

Phillip Carter 00:50:19 And so you may want that full composition represented as sort of a trace through that stuff. And by golly there’s this thing called OpenTelemetry that allows you to create tracing instrumentation and gather metrics and gather those logs as well. And it’s an open standard that’s supported by almost every single observability tool. So you may not necessarily need to start with OpenTelemetry. I think especially if you have good logging, you can use what you have to a point and incrementally get there. But if you do have the time, or if you simply don’t have anything that you’re starting with at all, use OpenTelemetry, and critically you do two things. You install the automatic instrumentation. And so what that will do is it will track incoming requests and outgoing responses throughout your entire system. So you’ll be able to see, okay, not just the language model request that we made, but the actual full lifecycle of a request, from when a user interacted with the thing, everything that it talked to via HTTP or gRPC or something like that, up until the point it got to a response for the end user to look at.

Phillip Carter 00:51:20 That is very, very helpful. But then what you need to do is you need to go into your code, and you use the OpenTelemetry API, which is for the most part pretty simple to work with. And you create what are called spans. A span, in tracing terms, is just a structured log that contains a duration and causality by default. So basically you can have a hierarchy of, okay, this function calls this function which calls this function, and they’re all meaningfully important as this chain of functionality. So you can have a span in function one, a span in function two, and a span in function three, and functions two and three are children of number one. So it sort of nests them appropriately. So you can see that nested structure of how things are going. And then you capture all the important metadata with, like, this is the decision that we made.

Phillip Carter 00:52:04 If we’re selecting between this bank of sentences that we’re going to incorporate into our prompt, this is the one that was selected, and maybe these are the input parameters that are going into that function that are related to that selection. It’s basically an act of structured logging, except you’re doing it in the context of traces. And so that gets you really, really rich, detailed information. And what I would say is, you can go to OpenTelemetry, just the website, right now and install it. Most organizations are able to get something up and running within about 15 minutes, and then it becomes a little bit more work with the manual instrumentation because there’s an API to learn. So maybe it takes a whole day, but then you need to sort of make some decisions about what the right information to capture is. And so that may also take another day or so, depending on how much decision fatigue you end up with and if you’re trying to overthink it or something like that.
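The description of a span as a structured log with a duration and causality can be illustrated with a toy Python sketch. This is deliberately not the OpenTelemetry API: it is a standard-library stand-in, with hypothetical names throughout, that shows nesting and attribute capture. It is not thread-safe, and a real system would use an OpenTelemetry tracer and exporter instead.

```python
import time
from contextlib import contextmanager

finished_spans = []  # stand-in for an exporter
_stack = []          # currently open spans, innermost last


@contextmanager
def span(name, **attributes):
    """Record a named unit of work with a duration and a parent."""
    record = {
        "name": name,
        "parent": _stack[-1]["name"] if _stack else None,
        "attributes": dict(attributes),
    }
    _stack.append(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _stack.pop()
        finished_spans.append(record)


def select_sentence(bank, key):
    """Pick a sentence for the prompt and record which one was chosen."""
    with span("select_sentence", key=key) as s:
        sentence = bank.get(key, bank["default"])
        s["attributes"]["selected"] = sentence
        return sentence


def handle_request(question):
    """Parent span with two children: the selection and the model call."""
    with span("handle_request", question=question):
        bank = {"default": "Answer briefly.", "sql": "Return a query."}
        instruction = select_sentence(bank, "sql")
        with span("call_model", prompt_chars=len(instruction) + len(question)):
            return "stubbed output"  # a real model call would go here
```

After `handle_request` runs, `finished_spans` holds three records in which `select_sentence` and `call_model` both name `handle_request` as their parent, mirroring the function one, two, and three example.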

Giovanni Asproni 00:52:55 One thing also that I wanted to ask about the information to track, that I think we haven’t mentioned so far: you mentioned inputs and outputs, but reading your book, you also put a high emphasis on errors. So tracking them, in this case with OpenTelemetry, say, with your observability tool. So why are errors so important? Why do we need to track them?

Phillip Carter 00:53:19 So errors are critically important because in most enterprise use cases for large language models, the goal that they have is they want to output a JSON object. I mean it could be XML or YAML or whatever, but like, we’ll call it JSON for the sake of simplicity. It’s usually some act of a combination of smart search and useful data extraction and putting things together in a way such that it can fit into another part of your system. And hopefully like the idea is that that thing that you’ve extracted and put into a particular structure accomplishes the goal that the user had in mind. That’s I would say is like 90 plus percent of enterprise use cases right now and will likely always be that. So there are ways that things can fail. So first, your program could crash before it ever calls the language model.

Phillip Carter 00:54:15 Well yeah, you should probably fix that. The system could be down. OpenAI has been down in the past, people have incidents. Well if it cannot produce an output period, okay, you should probably know about that. It could be slow, and you could get a timeout. And so even though the system wasn’t down well, it’s effectively down as far as your users are concerned. Again, you should know about that. And the reason why you should know these kinds of failures right now is because some are actionable, and some are not actionable. So if say you get a timeout or the system is down, you get a 500, maybe there’s a retry or maybe there’s a second language model that you call as a backup. Maybe that model is not as good as the first one that you’re calling, but it might be more reliable or something like that.
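The retry and backup model idea mentioned here might look something like the following Python sketch. The `ModelUnavailable` exception and the `primary` and `backup` callables are hypothetical placeholders, not part of any real client library; the point is that every attempt is recorded so the pattern of failures can be analyzed later.

```python
class ModelUnavailable(Exception):
    """Hypothetical error for a timeout or 5xx from a model provider."""


def call_with_fallback(primary, backup, prompt, retries=1):
    """Try `primary` up to `retries + 1` times, then fall back to `backup`.

    Returns (output, attempts); `attempts` lists what happened so it can
    be attached to a log line or span.
    """
    attempts = []
    for _ in range(retries + 1):
        try:
            output = primary(prompt)
            attempts.append("primary:ok")
            return output, attempts
        except ModelUnavailable as exc:
            attempts.append(f"primary:error:{exc}")
    output = backup(prompt)  # may itself raise; the caller should record that
    attempts.append("backup:ok")
    return output, attempts
```

Recording `attempts` separates the actionable failures (a retry or a backup worked) from the ones where nothing could be done.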

Phillip Carter 00:54:55 There’s all these little puzzles that you can play there and so you need to understand which one is which and you need to track that in observability so you can understand if there’s any patterns that lead to some of those errors. But then you get to the most interesting one, which is what I call the correctable errors, which is that the system is working, it is outputting JSON, but maybe it didn’t output a full JSON object, right? Maybe for the sake of latency you are limiting the output amount to be a certain amount, but the model needed to output more than like what your limit was. And so it just stopped. Well that is an interesting problem to go and solve because maybe the answer is to increase the limit a little bit or maybe it’s that you have a bug in your prompt where you are causing the model somehow through some means to produce way more output than it should actually be outputting.

Phillip Carter 00:55:49 And so you need to systematically understand when that happens. You then need to also systematically understand when, okay, it did produce an object, but it needed to have like this name of a column and a schema somewhere or something like that. But it gave a name that was like not actually the same name or maybe this object structure had like this nested object inside of it that needs to have a particular substructure and maybe it’s missing one piece of that substructure for some reason. And like you could imagine if you look at the output, oh well if a human were tasked with creating this JSON, like maybe they would’ve missed that thing. And so you need to track when those errors happen because that could be, it’s valid JSON, so it parses, but it’s not actually valid as far as your system is concerned.

Phillip Carter 00:56:35 So what are those validity rules? What are the things that it fails on? How can you act on that? Is that something that you can improve via prompt engineering? Or if, when you’re validating it, you actually know what the structure should be and you have enough information to fill in that gap, can you actually just fill in that gap? And what we observed with Honeycomb in our query assistant feature is that we had none of these like correctable outputs at the beginning, or rather, we did not try to correct those outputs in any way at first. And so what we noticed is about 65 to 70% of the time it was correct, but then the rest of the time it would error, it would say can’t produce a query. And when we looked at those, it had valid JSON objects coming out, but they were just like slightly wrong.

Phillip Carter 00:57:20 And we then realized in that parsing thing, oh crap, we actually can’t, like if we just remove this thing, this may not be perfect, but it’s actually valid and maybe that’s good enough for the user or we know that it’s missing X, but we know what X is, so we’re just going to insert X because we know that like that needs to be there for this to work and boom, it’s good to go. And we were able to improve the overall like end user reliability of the thing from like a 65 to 70% of the time to like a 90% of the time. Like this is a massive, massive improvement that we were able to do just by fixing these things. Now, the remaining, like, 6-7% of reliability, that was through like really hardcore prompt engineering work that we had to do. That took a lot more time. But I think why that’s really important is we were able to get that 20% plus improvement within about two weeks. And so you can have that degree of improvement within about two weeks if you systematically track your errors and you differentiate between which one is which. And so this is kind of a long-winded answer, but I think it’s really important because the way that you act on errors matters so much in this world.
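The "remove this thing" and "insert X" corrections described above might look something like this in code. This is a hedged sketch: `ALLOWED_KEYS`, `DEFAULTS`, and the field names are invented for the example and are not Honeycomb's actual query schema or repair rules.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical schema: the only keys the downstream system accepts.
ALLOWED_KEYS = {"calculation", "column", "time_range"}
# Hypothetical values that are known in advance and safe to fill in.
DEFAULTS = {"time_range": 7200}


def repair(raw: str) -> Tuple[Dict[str, Any], List[str]]:
    """Parse a model response and apply cheap, well-understood corrections.

    Returns the corrected object plus a list of the fixes applied, so the
    corrections themselves can be tracked like any other event.
    """
    obj = json.loads(raw)
    fixes: List[str] = []
    # Drop keys the system does not understand; the rest may still be usable.
    for key in sorted(obj):
        if key not in ALLOWED_KEYS:
            del obj[key]
            fixes.append(f"removed:{key}")
    # Insert required pieces whose value is already known.
    for key, value in DEFAULTS.items():
        if key not in obj:
            obj[key] = value
            fixes.append(f"inserted:{key}")
    return obj, fixes
```

For example, `repair('{"calculation": "COUNT", "colour": "red"}')` returns `({"calculation": "COUNT", "time_range": 7200}, ["removed:colour", "inserted:time_range"])`. Recording the list of fixes matters as much as the fix itself, since it shows which corrections are carrying the reliability gain.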

Giovanni Asproni 00:58:23 Now I think we’re at the end of our time, so I’ve got maybe some final questions. So the first one is about the current limits of what we can do with observability for large language models. Are there any things that at the moment are not really possible but we wish they were?

Phillip Carter 00:58:44 I’ll say one thing that I really wish that I had that I did not have is a way to meaningfully apply other machine learning practices on this data. So not like AIOps or something like that, but pattern recognition. So these classes of inputs lead to these classes of outputs, that’s effectively a collection of use cases, if you will, that are like thematically similar. And we had to manually parse all that stuff out and like humans are good at pattern recognition, but it would’ve been so nice if our tool could recognize that kind of stuff. The second thing is that observability and getting good instrumentation to the point where you have good observability, it’s an iterative process. It’s not something you can just slap on one day and then you’re good to go. It takes time, it takes effort, and you don’t get it right often.

Phillip Carter 00:59:32 You need to constantly improve it and that’s frankly hard and I wish it was a lot easier and I’m not really sure I know how to make it a lot easier, but like what that means is you may think that you’re observing these user behaviors, but you’re not actually observing everything that you need to be observing to improve something. And so you will be doing a little bit of guesswork and then you have to go back and figure out what to re-instrument and improve and all that. And there are still no best practices around that, but also just from like a tool and API and SDK standpoint, I just wish it were a lot easier to sort of get like a one and done approach or like maybe I do iterate, but I iterate like on a monthly basis instead of on a daily basis until I feel like I have good data.

Giovanni Asproni 01:00:09 Will any of these current limitations that you mentioned be addressed in the next, say, few years? Or are there other things that you see happening in terms of observability engineering for LLMs, things that you think will improve, new things that we cannot do now? Is there any work in progress?

Phillip Carter 01:00:31 Yes, I would say there definitely is. On the instrumentation front right now, it’s not just language models, but there’s like vector databases and frameworks that people use and there’s sort of like a collection of tools and frameworks that are relevant in this space. None of those right now have automatic instrumentation in the same way that like HTTP servers or message queues have automatic instrumentation today. So the act of getting that auto instrumentation via OpenTelemetry is like you kind of have to do it yourself. That’s going to improve over time, I think. But that’s a real need because that sort of first pass at getting good data is harder to come to today than it should be. The second is that your analysis workflows and tools are a little bit different. Some tools, like for example, Honeycomb is actually very well suited to this.

Phillip Carter 01:01:18 And so what I mean by that is when you’re dealing with textual inputs and textual outputs, these values are not meaningfully pre-aggregable, meaning that you can’t like sort of just turn it into a metric like you can other data points and they tend to be high cardinality values. So like there’s likely a lot of unique inputs and a lot of unique outputs and a lot of observability systems today really struggle with high cardinality data because it’s not a fit for their backend. And so if you’re using one of those tools, then this might be a lot harder to actually analyze and it might also be more expensive to analyze than you would hope. And so I hope that like, I mean, high cardinality is a problem to solve, like independent of LLMs, it’s something that you need, period, because otherwise you just don’t have the best context for what’s going on in your system. But I think LLMs really force the issue on this one. And so I hope that this causes most observability tools to handle this shape of data a lot better than they do today.
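A toy illustration of why free-text inputs resist pre-aggregation: once events are collapsed into a counter, the prompt text that explains a failure is gone, whereas raw events keep it queryable. The event fields and values below are invented for the example.

```python
from collections import Counter

# Hypothetical raw events, one per model call. The prompt is free text,
# so nearly every value is unique (high cardinality).
events = [
    {"prompt": "slow requests by endpoint", "status": "ok", "duration_ms": 820},
    {"prompt": "errors in checkout last hour", "status": "invalid_output", "duration_ms": 1430},
    {"prompt": "p99 latency for api gateway", "status": "ok", "duration_ms": 910},
]

# Pre-aggregated view: cheap to store, but it can no longer answer
# "which prompts produced the invalid output?"
status_counts = Counter(e["status"] for e in events)

# Cardinality of a field is its number of distinct values.
prompt_cardinality = len({e["prompt"] for e in events})


def failing_prompts(raw_events):
    """With raw events retained, the failing inputs can still be listed."""
    return [e["prompt"] for e in raw_events if e["status"] != "ok"]
```

Here the prompt field has one distinct value per event, which is the shape of data that metric-oriented backends tend to handle poorly.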

Giovanni Asproni 01:02:17 Okay, thank you. Now we came to the end. I think we’ve done quite a good job of introducing observability for large language models, but is there anything that you’d like to mention? Anything else that maybe we forgot?

Phillip Carter 01:02:30 I would say that getting started with language models is super fun and it’s super weird and it’s super interesting and you’re going to have to throw a lot of things out of the window and that’s what makes them so exciting. And I think that like you should look at how your users are doing stuff and some things that they struggle with and just pick one of those and see if you can figure out a way to wrangle a language model to output like something useful. Like it doesn’t have to be perfect, but just kind of something. I think you’ll be surprised at how effective you can be at doing that and turn something from like a creative wish to like a real proof of concept that you might be able to productionize. And so I wish there were a lot more best practices around how to do this stuff, but that will likely come, I think, especially in 2024. There will be a lot of demand for that. And so I think you should get started right now and like spend an afternoon seeing what you can do and if you can’t get it done, like I don’t know, reach out to me and like maybe I’d be able to help you out.

Giovanni Asproni 01:03:26 Okay. Thank you, Phillip, for coming to the show. It has been a real pleasure. This is Giovanni Asproni for Software Engineering Radio. Thank you for listening.

[End of Audio]
