In this episode, Sahaj Garg, CTO of wispr.ai, joins SE Radio host Robert Blumen to talk about the challenges of building low-latency AI applications. They discuss latency's effect on consumer behavior as well as on interactive applications. The conversation explores how to measure latency and how scale affects it. Then Sahaj and Robert shift to themes around AI, including whether "AI" means LLMs or something broader, as they look at latency requirements and challenges around subtypes of AI applications. The final part of the episode explores techniques for managing latency in AI: speed vs. accuracy trade-offs; latency vs. cost; choosing the right model; quantization; distillation; and guessing plus validating.
Brought to you by IEEE Computer Society and IEEE Software magazine.
Show Notes
Related Links
- Sahaj Garg on LinkedIn: https://www.linkedin.com/in/sahajgarg/
- Sahaj Garg on Google Scholar: https://scholar.google.com/citations?user=Rhj1VuwAAAAJ&hl=en
Transcript
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Robert Blumen 00:00:19 For Software Engineering Radio, this is Robert Blumen. With me today is Sahaj Garg. Sahaj is a co-founder and CTO of wispr.ai. Prior to wispr, he was an AI engineer at Luminous Computing and is a graduate of Stanford University. He has published research on topics in machine learning. At Luminous he was the fifth employee and AI team lead. Sahaj, welcome to Software Engineering Radio.
Sahaj Garg 00:00:50 Thank you for having me here, Robert. It’s a pleasure to be on the show.
Robert Blumen 00:00:53 Would you like to say anything else about your background before we dive into the topic?
Sahaj Garg 00:00:58 No, I think that’s a great summary. Maybe the only thing that I’ll add is right now at wispr we’re working on low latency interactive voice AI applications for users and so that will shed a lot of light on how we think about latency, throughput and optimization when it comes to ML workloads.
Robert Blumen 00:01:16 That’s a great thing for you to be working on because today we will be talking about low latency AI applications. Before we get into latency in AI in particular, let’s talk about generally in applications. When we’re talking about latency, what are we talking about?
Sahaj Garg 00:01:35 So when we talk about latency, what we really care about is what the user experiences the moment they try to do something to the moment they actually experience the response. So, it’s entirely from the side of the person interacting with the system, not so much from the side of like the service or the background or how much we’re able to process all at once.
Robert Blumen 00:01:57 You could imagine a system in which another computer program is interacting with the system. Does your definition rule out calling it latency in those cases?
Sahaj Garg 00:02:08 No. You can think about latency in all sorts of cases, where all you're measuring is the time from when something starts to the time that something finishes. And so, it's basically just the elapsed time of some kind of operation. A lot of times I just tend to think about it from what the user experiences, because that is the type of latency that a lot of software applications care to optimize. Maybe one example that I think about is, people joke about the AI pause nowadays, when you're having a voice call with ChatGPT voice where it pauses for three or four seconds. That time span is what I think of as latency in a lot of user-reactive applications.
Robert Blumen 00:02:44 We’ll definitely talk about that in a bit. I want to do some more general-purpose talk concepts about latency. We did a podcast on this some years ago that was talking about consumer applications, and the guest presented some data about human behavior and response to latency so that people are highly sensitive to both average latency and tail latency even down into milliseconds. Now this was years ago. Do you have any updated research or your own ideas about what humans experience as a noticeable latency?
Sahaj Garg 00:03:20 Yeah, so it depends a lot on the application. There are some really prominent ones like Superhuman Mail, which is an email client where what they've optimized for is the latency of every interaction. They found that when responding to keystrokes, for example, when you click a shortcut in the application, the target for how long until the mail app updates was under a hundred milliseconds. Anywhere between 50 and a hundred milliseconds would feel good, and anything above a hundred started to feel bad very quickly. It's a little bit different for traditional software versus a lot of the AI applications, and it depends on what the specific interaction is. In voice cases, 500 to 600 milliseconds tends to be tolerable. In UIs, if you take more than 30 to 40 milliseconds to respond to a user hitting a button, that starts to feel laggy very, very fast, and people can perceive latency down to five or 10 milliseconds. Like if my keystrokes, when I input them into an app, lag by 10 milliseconds, I tend to use those kinds of notepad editors way less than other ones.
Robert Blumen 00:04:25 Do you have any thoughts on whether people are responding to average latency, long tail latency, worst case latency or some other aggregator?
Sahaj Garg 00:04:37 I think people tend to measure latency based on the 95th to 99th percentile experience that they have. Like whenever at wispr people have mentioned that things feel slow and we look at the distribution of times they get, it’s actually usually one or two outlier cases where it feels slow. And so, people tend to measure a latency based on their worst perceived experience, not based on their average perceived experience.
Robert Blumen 00:05:02 In trying to engineer this, the obvious cause of average latency is that the system has to do some work for you in order to give you a response, and that takes time. You can try to make it run faster, but long tail latency may have different causes. Do you have any experience in tracking down where that originates?
Sahaj Garg 00:05:24 So that's where the hardest challenges in latency come from, because when you think about P99 latency, it's like a hundred different things that can cause the issue, and it's about figuring out how to find those and diagnose those. So, for us, one of the ways that we think about and analyze P99 latency is looking at the impact of the network on the user. Networks are super stochastic, and so every once in a while a request just won't make it where it should get to. The way that we try to track these things down is by making sure that we have as much observability as possible into the breakdown of the different parts of the latency. So if we measure the time for the work, right, as you said, that's the time that every request is going to have to take, and the time for every single sequence of steps, what we can do is figure out, okay, this hop of network latency was the source of the issue, and then we can try to figure out how to get that 99th percentile latency to go down.
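The per-stage breakdown Sahaj describes can be sketched in a few lines. The following is a minimal simulation, not wispr's actual pipeline: the stage names, timing distributions, and the 2% heavy-tailed network stage are all made up for illustration, but the pattern (record each stage's elapsed time per request, then compare p50 against p99 per stage) is how you spot which hop drives the tail.

```python
import random
from collections import defaultdict

random.seed(0)

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(pct / 100 * len(ordered)))
    return ordered[idx]

def simulate_request():
    """Hypothetical per-stage timings in ms; the network stage has a
    heavy tail that fires on roughly 2% of requests."""
    return {
        "network": random.gauss(40, 5) + (random.random() < 0.02) * random.uniform(300, 800),
        "speech_model": random.gauss(120, 10),
        "llm": random.gauss(200, 20),
        "postprocess": random.gauss(15, 2),
    }

per_stage, totals = defaultdict(list), []
for _ in range(10_000):
    timings = simulate_request()
    for stage, ms in timings.items():
        per_stage[stage].append(ms)
    totals.append(sum(timings.values()))

print(f"{'total':13s} p50={percentile(totals, 50):7.1f} p99={percentile(totals, 99):7.1f}")
for stage, samples in per_stage.items():
    print(f"{stage:13s} p50={percentile(samples, 50):7.1f} p99={percentile(samples, 99):7.1f}")
```

In a run like this, the total p50 barely moves while the total p99 jumps by hundreds of milliseconds, and the per-stage rows make it obvious the network hop is responsible, which is exactly the kind of diagnosis described above.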
Robert Blumen 00:06:25 Do you have any stories of when you looked at this telemetry and you found something, maybe something you weren't expecting, that was taking up an abnormal amount of time?
Sahaj Garg 00:06:35 Absolutely. So, this is really common. So, we host models for voice dictation. So basically users will press a button on their computer, they’re going to speak, we send it to a server for transcription where we host all of the models ourselves and then we return a response to the user and we run both a speech model and an LLM as part of that pipeline. That’s to kind of give context on the application where we’ve seen it. The most unexpected one that has come up for us is we found that things were making it to the server fast. Things were getting processed by the server fast, but one day all of a sudden all of our response times went up by about 300 to 400 milliseconds. And what we found in that case was the time between the first model running and the second model running suddenly had a spike.
Sahaj Garg 00:07:20 And the reason that we eventually uncovered is that something in our deployment allowed one GPU to be in one region and the other GPU to be in a different region. So, we would go to California for processing our audio, then we would somehow send it to the Midwest for processing the LLM and all the post-processing. And this was one of those opaque, behind-the-scenes infrastructure things where we thought everything was getting deployed to the same place. Then we looked under the hood, found that, okay, there are network round trips going on here, and then we were able to diagnose it and bring it all back into the same region.
Robert Blumen 00:07:53 What’s interesting about that story, one of the things you wanted to talk about was latency and scale. That story illustrates how as you have a larger system and you either are scaling out for reasons of handling more traffic or other reasons, it also impacts your latency.
Sahaj Garg 00:08:10 Absolutely. Like every time you expand a system, the kinds of places where you can incur latency, where you route traffic, or you get to the right place or your databases have higher load on it start to very quickly crop up. And often what happens is they kind of hit a tipping point at different parts of scale. So, things will look fine for a really long time, then all of a sudden bam, you’ll expose a new source of latency and you’re going to have to spend a bunch of time figuring that one out, optimizing it and building the system for that growing scale.
Robert Blumen 00:08:40 Do you advise or at your company do you institute latency regression testing?
Sahaj Garg 00:08:47 The way that we have done it is primarily looking at analytics data from day to day. And so, we’ll have analytics on latency for every event that users experience with our application and we’ll of course test things before we promote the deployments, but we’ll basically do gradual rollouts of new infrastructure as we roll it out and we will measure impact on latency there. And so, we can see on any given day, on any hour of any day, did the latency suddenly spike for any reason? We can see when we do new rollouts and new deployments, did it bring down the latency as we would expect it to? But we haven’t done any specific kind of regression testing there.
Robert Blumen 00:09:30 Maybe the systems we build now are so complex and impossible to really simulate in staging that you can’t exactly test them before you roll them out, but you can detect when you have a problem quickly and resolve it.
Sahaj Garg 00:09:44 That's very much the approach that we take, and that I think a lot of fast-moving companies in the AI space take right now, because so many of these workloads are really stochastic. They're super random; different things happen on GPUs that are extremely unpredictable and non-deterministic. And so, making sure that you can immediately respond to any spike, that's the most important thing for us.
Robert Blumen 00:10:09 So Sahaj, we’ve been talking about latency for a bit converging on the intersection of latency and AI. Let’s talk about what AI is briefly, the way I observe this term, it’s a bit of a moving target where it tends to over index on what happened in the last five minutes, but we’ve had AI for decades now. If we’re going to be talking about AI and latency, what domains of AI are we talking about?
Sahaj Garg 00:10:38 I mean latency becomes important in any of these kinds of applications. So, whether it’s more classical techniques that are classifiers that are trained on certain kinds of data, which are often used in recommendation systems and ads models or whether it’s larger and larger ML workloads like large speech models or large language models, right? Which are models that have billions of parameters. And in those kinds of cases because they’re running so much computation for any given request, that’s the reason why latency has become such a hot topic in this kind of a community where before maybe we’re running 20,000 floating point operations for getting one result back. Now we might be running 10 billion or a hundred billion floating point operations to get one result back. And as soon as you start to have to do that much computation, that’s when optimizing these systems for a different type of latency constraint starts to become relevant.
Robert Blumen 00:11:38 Could you give an example of something using AI and what the latency expectations would be? Something that’s not a large language model.
Sahaj Garg 00:11:48 So for not a large language model tools using AI include basically all the recommendation systems that people use for the Twitter algorithm, the Facebook algorithm for showing you different things. And in those kinds of cases there’s actually usually a cascade of different models that are running. Like first you’ll use a model to retrieve a bunch of candidates of things that you might want to show the user. Then you actually rank those things. And so, each stage of each model has a different latency requirement to like retrieve them. You might be operating on saying, hey, which of these like 10,000 posts are even potentially relevant to the user? And so that model has to be super, super fast to be able to run on all those things. Then the next model might say, hey, of these 10 things, what’s the thing that I show in the top place to the user? And with a lot of these systems, the way people get creative is you can show people the first page of results really fast and then you can have more time before the user scrolls to compute things that are going to show up on the second page. And so that’s where like people really optimize for like first page load latency versus what you see after and can buy more time for running certain computations more efficiently.
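The retrieve-then-rank cascade described here can be sketched as follows. Everything in it is hypothetical (the catalog, the "cheap_score" feature, and the stand-in expensive scorer), but it shows the key latency property: the first stage must be cheap because it touches every candidate, while the second stage can afford heavy per-item computation because it only sees the shortlist.

```python
import random

random.seed(2)

# Hypothetical catalog of 10,000 posts, each with a precomputed cheap feature score.
POSTS = [{"id": i, "topic": random.choice(["ai", "sports", "music"]),
          "cheap_score": random.random()} for i in range(10_000)]

def retrieve(user_topic, k=100):
    """Stage 1: a cheap filter over all candidates. Because it runs on
    every post, the per-item work has to be tiny."""
    relevant = (p for p in POSTS if p["topic"] == user_topic)
    return sorted(relevant, key=lambda p: p["cheap_score"], reverse=True)[:k]

def rank(candidates, k=10):
    """Stage 2: a heavier scorer that only sees the shortlist, so it can
    afford far more computation per item. A stand-in here; a real system
    would run an ML ranking model."""
    def expensive_score(p):
        return p["cheap_score"] * 0.7 + random.random() * 0.3
    return sorted(candidates, key=expensive_score, reverse=True)[:k]

first_page = rank(retrieve("ai"))
print([p["id"] for p in first_page])
```

The same shape supports the first-page trick mentioned above: serve `rank(retrieve(...))` for page one immediately, then compute later pages in the background while the user reads.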
Robert Blumen 00:13:09 This reminds me of another question I had about latency, where you can define it as the time from when you request until you start getting results. Is the thinking behind this last point that if you can start showing the reader some results, whether it's LLM chat or recommendations, and we read pretty slowly, the reader's going to start reading, and you now have time where you're ahead of the reader and you can fetch more results, and the reader doesn't perceive it as lag? Is it something like that?
Sahaj Garg 00:13:40 That’s exactly right. Yeah. And that’s taking advantage of the fact that we’re engaging with the content and so we have plenty more time to run computation in the background while the user is actively engaged.
Robert Blumen 00:13:52 We discussed these findings. I recall a study that was covered in the news about Google if they introduced a hundred milliseconds delay in search that people almost stopped using Google for three months or something. And it wasn’t even a perceptible amount, it was something subconscious where the person feels like I’m having a bad experience. Yet with these LLMs people will sit there for seconds waiting for an answer. Why is this tolerated when people seem to have such a short attention span in other domains?
Sahaj Garg 00:14:28 That’s a great question. I think it comes back to two different things. So, one like the immediate value that I get out of a LLM query is so high in so many cases, like I’m immediately getting the final answer that I need. If you contrast it to Google, I’ve asked a question, I got a hundred milliseconds, now I have a bunch of search results that I now have to go click through to even see if I find the thing that I want to. And so, there’s so many steps involved there to get to the thing that I actually want to. Whereas with ChatGPT, like if it’s going to take two seconds for me to get a magical answer that answers the world’s most complex question that I previously didn’t have a way to get an answer to, I’ll wait. So that’s like one part of it.
Sahaj Garg 00:15:10 And the other part is I think the user expectation that has been set. So once people get used to a certain thing, it becomes very easy for them to continue working with that kind of model. But like I mentioned this earlier with the AI pause on voice mode for example, on ChatGPT where it used to pause for three or four seconds before it starts giving you an answer and they’ve reduced that to 500 milliseconds now. After using the version where it responds like an actual person conversational partner, if it went back to having the long pause, I would probably stop using it very fast. And so, I think there’s a bit of both going on here.
Robert Blumen 00:15:49 Yeah, I think you might be used to something and tolerate it, but you wouldn’t get used to it if you weren’t getting something out of it because you would simply stop using it.
Sahaj Garg 00:15:59 Yeah, and these systems are so magical in what they can provide a user so quickly. For me, for example, even if it gets half the things right in what it says in an answer, I'm still happy. I've gotten 50% magical things that I could never have had before, and that's enough for me to come back to a tool like ChatGPT.
Robert Blumen 00:16:20 We’ve done some setup on latency and AI, let’s delve into our core topic here, which is latency in AI applications. If you start for point of comparison, dealing with an old school web application that’s not hitting any kind of AI, what are some or all of the things that have to happen before a user gets a result?
Sahaj Garg 00:16:45 Yeah, so in kind of an old school web application, the user will interact with a UI or passively, they’ll land on a page, then a request will get sent to a backend. The backend will have to potentially authenticate a user query, some databases generate a response and then send the response back to the client. And then the client, their browser or whatever is showing it will actually have to render it. So, you’ll see different kind of chunks of time for sending the request, retrieving the response on the backend, getting it back to the user and then finally showing it to them.
Robert Blumen 00:17:19 I would imagine AI has to do pretty much the same things, but can you say what might be different for an AI application?
Sahaj Garg 00:17:28 Yeah, so the difference for the AI application is this idea of you can’t fetch the full response often all at once. So, in a lot of traditional software, you’re fetching the content that you want to show on a page. And so that’s going to be some kind of query that gets a lot of this information and returns it back to the user. In a lot of AI applications, what people do is what’s called auto regressive generation. So, if you think about ChatGPT, it’s generating one word at a time as you get an output. So, a request gets sent for processing, then you have to basically queue up all the stuff to get the first word of the response. Then you have to feed that back into the model to get the next word. Then you have to feed that back in to get the next word and so on.
Sahaj Garg 00:18:15 So in a lot of more traditional software, there's a lot less of this sequential acting. There might be three or four steps; maybe you retrieve something from a database, then you figure out, okay, how does that mean I should update the UI? But they're not heavy computations and they're usually discrete steps, right? You're following a linear code path. Here you're often just following a loop where you're decoding something and then eventually giving it back to the user. I'd say the other part that I'd add there is GPU computation. So, in most traditional software, most of the requests are handled by a CPU because you have a server somewhere that can do the database operations and return things. Whereas for a lot of machine learning workloads, you're going to have to have a GPU that's actually processing and providing a result.
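The autoregressive loop described above can be sketched like this. The model here is a toy stand-in that just walks through a canned reply (and, for simplicity, ignores the prompt), but the shape of the loop is the point: each output token is fed back in, the expensive forward pass runs once per token, and tokens can be streamed to the caller as they are produced.

```python
def toy_model(tokens):
    """Stand-in for an LLM forward pass: returns the next token given
    everything generated so far. A real model would run billions of
    floating-point operations here, typically on a GPU; that per-step
    cost is why decoding dominates latency."""
    reply = ["Hello", "from", "the", "model", "<eos>"]
    return reply[len(tokens)] if len(tokens) < len(reply) else "<eos>"

def generate(prompt_tokens, max_steps=32):
    """Autoregressive decoding: loop until a stop token or step limit,
    yielding each token so the caller can stream it to the user."""
    generated = []
    while len(generated) < max_steps:
        # Each step feeds everything produced so far back into the model.
        next_token = toy_model(generated)
        if next_token == "<eos>":
            break
        generated.append(next_token)
        yield next_token

print(" ".join(generate(["Hi"])))  # -> Hello from the model
```

Because `generate` is a generator, the first token reaches the caller before the rest of the response exists, which is exactly the partial-response behavior discussed next.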
Robert Blumen 00:19:02 That opens a lot of implications. The first thing I can think of is obviously you have some of the response before you have all of the response. So, you could start sending data back to the user before you have all of it, which is generally not the case in old school web apps.
Sahaj Garg 00:19:20 Exactly.
Robert Blumen 00:19:21 Now we talk a lot also about streaming applications where data is arriving. Does this make AI more like a streaming application?
Sahaj Garg 00:19:31 It does in a lot of ways. Streaming applications like audio chat, video chat and all of these kinds of things have to handle a lot of challenges like this. There’s a couple differences. I think one is a lot of streaming applications and traditional software that stream audio and video, they’re handling really big data requests when you’re streaming and that’s the thing that makes it super intensive. Whereas in a lot of these AI workloads, the streaming itself, it’s not much data. You send back a text response, it’s like there’s not that much there, but the computation on the server is the thing that’s super stochastic, super random and can take a long time and that’s what you’re trying to hide for the user.
Robert Blumen 00:20:13 I observe LLM apps seem to send me back roughly paragraph-sized chunks. You could send one word at a time. What's the thinking? Is it in words, or in batches of a certain number of milliseconds?
Sahaj Garg 00:20:28 It comes back to the user experience. The reason people optimize latency is so that it feels good to the user, and that's why people send back partial responses. As a user, whenever I've prototyped something for myself where I stream back one word at a time, it's completely impossible to read the output, because my screen is changing so fast that I'm losing track and not having a good time. And so that's part of the reason why people stream it in that way. People have actually played around with a lot of different streaming patterns. Like, Gemini used to fade in one line of text at a time, and they're not coming off the server faded in that way; that's primarily about how you display it to users so that they stay engaged. I do think they moved away from that, though.
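One plausible way to implement the buffering Sahaj describes is a small accumulator between the token stream and the display: flush when enough tokens have piled up or enough time has passed, whichever comes first. This is a sketch with made-up thresholds, not any particular product's behavior.

```python
import time

def stream_tokens():
    # Stand-in token source; a real app would read these off the model server.
    for tok in "The quick brown fox jumps over the lazy dog .".split():
        yield tok

def chunked_display(tokens, flush_every_s=0.05, min_tokens=4):
    """Buffer streamed tokens and emit them in readable chunks, so the
    screen is not repainting on every single word."""
    buffer, last_flush = [], time.monotonic()
    for tok in tokens:
        buffer.append(tok)
        now = time.monotonic()
        if len(buffer) >= min_tokens or (now - last_flush) >= flush_every_s:
            yield " ".join(buffer)
            buffer, last_flush = [], now
    if buffer:                 # flush whatever is left at end of stream
        yield " ".join(buffer)

for chunk in chunked_display(stream_tokens()):
    print(chunk)
```

Tuning `min_tokens` and `flush_every_s` is exactly the word-count vs. milliseconds question from the conversation; both knobs are purely presentational, since the server-side latency has already been paid.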
Robert Blumen 00:21:17 I have queued up some more questions about managing latency, but an obvious question I have here is, how much of a response does the user want from an LLM? I've noticed using different models that they seem to have a different view on how long the responses are, where certain models are consistently giving me bigger responses. Now an obvious way to control latency would be to send the user a smaller response. How do you approach that tradeoff?
Sahaj Garg 00:21:48 Are you curious how we approach it at wispr or how the industry approaches it for different things?
Robert Blumen 00:21:54 Well actually if it’s different, yeah, both.
Sahaj Garg 00:21:56 So for us, we are a voice dictation application. You speak to your computer and it will enter the text of what you said, clean it up for you so that you can actually use it for email, Slack messages, and so on. For that kind of application, people don't want it to go off the rails and start expanding on your behalf and deciding what you want to say on your behalf. We very much want the user to experience that they get to choose what shows up in the system. And so, we're really careful to make sure that the responses we produce, because we run an LLM to post-process things, are a very similar size to the input. It's actually a very strict requirement that we have. Now on other applications where it's more of a question-answering-style thing, or even for us when we start going voice-to-action and we're going to generate longer and longer responses, that's when this comes back to user preference.
Sahaj Garg 00:22:50 So the most important thing is that all of these different applications and services will adapt to what people want and what people seem to like. And I think the reason why they all produce different responses is because they've all been designed with a different goal in mind. For example, ChatGPT is largely a consumer application built with a lot of former Meta employees, and so they probably optimize for engagement and time spent in the application. You know, I'm speculating there, but I think that's part of the reason why it tends to produce longer responses. Claude and Anthropic's models have been optimized for coding and tasks where you're generating for a very precise purpose, which is why in my experience they've been more concise. But I've noticed in my own use, if I hit expand for more details on one answer, then subsequent answers all start to get longer and longer from that same service. And so that's about adapting to, hey, what does this user want in this moment? How can I make sure I give them what they as an individual are looking for?
Robert Blumen 00:23:53 Your response about wispr reminds me of a time I had a job condensing interviews from this podcast into written form. They were called transcripts, but I was approaching it as trying to preserve the information, and I found that people when they speak are highly verbose, about two and a half times more words than written English. Does your application do any condensation where you're trying to preserve the meaning and eliminate redundancy? I also found the redundancy is much more obvious when you read it.
Sahaj Garg 00:24:28 Yeah, for sure. What we found is there’s two types of things that people want. So, one is when people are speaking short messages, suppose I’m speaking in a slack or text message there actually a lot of my filler words, things like– like, though, and stuff like that are super important for me to preserve my tone. Because I want to be super soft, super casual in those kinds of places. But where this kind of condensing and rewriting really matters is like people will speak to their email for a minute, like nobody can write a precisely spoken email for a whole minute. That’s just a really hard task to be able to do. And so now we do basically some light editing. We’ll break it into paragraphs, we’ll structure it, we’ll make it look a little bit better for you. We’re about to launch a mode where it does some kind of heavier rewriting for you where you should still be able to scan the output so you know that we haven’t deleted any meaning from what you said, but the person receiving it can read it really easily. And different people will want different things there. Some people are super precise in how they speak; some people just want to speak how they would speak to a friend and have it so that the system understands them and cleans up all of the extra redundancy and makes it really, really intelligible.
Robert Blumen 00:25:39 We may have covered this to some extent, but I think you would have some new things to say about what are specific challenges in engineering for low latency around AI that we haven’t already covered.
Sahaj Garg 00:25:54 So I think the biggest one is this trade-off between latency and throughput. Let's put it this way: when people submit requests to ChatGPT, right, they're asking a question, the server can choose one of two things. It can process my request immediately, or it can wait for my request and your request to show up and then send them to the GPU at the same time. And there's actually a really big trade-off here. If you process my request immediately, I'll get my response quickly. But if I wait for yours, then often, because I'm running on a GPU, I can run those requests in parallel and make really good use of my hardware, and so I could serve twice as many requests on the same amount of compute capacity. So there becomes this big trade-off between, hey, how do we optimize latency, and how do we optimize serving as many users as we can with the same number of GPUs? And if you've ever heard about the GPU shortage, or if you've ever seen Nvidia stock go up, you know that people are really clinging to how much juice they can get out of every GPU, how they can let free users make more requests on the same hardware. And so, this trade-off is very specific to a lot of AI workloads and depends a lot on the application that you're building.
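The wait-and-batch trade-off can be made concrete with a toy simulation. The numbers are invented (a fixed 80 ms GPU pass regardless of batch size, 50 requests per second), and it deliberately ignores queueing while the GPU is busy, so it only illustrates the direction of the trade-off: a longer batching window uses far fewer GPU passes but every request waits longer.

```python
import random

random.seed(1)

GPU_TIME_MS = 80        # assume one forward pass takes ~80 ms regardless of batch size
ARRIVALS_PER_SEC = 50   # hypothetical request arrival rate

def simulate(wait_window_ms, n_requests=2000):
    """Batch all requests arriving within wait_window_ms of a batch's first
    request, run each batch as one GPU pass, and return (average latency
    in ms, number of GPU passes)."""
    t, arrivals = 0.0, []
    for _ in range(n_requests):
        t += random.expovariate(ARRIVALS_PER_SEC) * 1000  # inter-arrival gap in ms
        arrivals.append(t)

    latencies, batches, i = [], 0, 0
    while i < len(arrivals):
        window_end = arrivals[i] + wait_window_ms
        j = i
        while j < len(arrivals) and arrivals[j] <= window_end:
            j += 1                              # absorb arrivals into this batch
        finish = window_end + GPU_TIME_MS       # everyone in the batch finishes together
        latencies.extend(finish - a for a in arrivals[i:j])
        batches += 1
        i = j
    return sum(latencies) / len(latencies), batches

for window_ms in (0, 25, 100):
    avg, batches = simulate(window_ms)
    print(f"wait {window_ms:3d} ms -> avg latency {avg:6.1f} ms, GPU passes {batches}")
```

With a zero window every request gets its own GPU pass and the minimum 80 ms latency; at 100 ms, passes drop by roughly the average batch size while average latency climbs, which is the latency-for-throughput exchange Sahaj is describing.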
Robert Blumen 00:27:20 Are there any other trade-offs possibly involving accuracy?
Sahaj Garg 00:27:26 For sure. So, in general, bigger models will give you better answers and smaller models will give worse answers; it's a rough rule of thumb. And bigger models are slower; smaller models are fast. You can get really tiny models that run on, for example, a MacBook's Apple Silicon chip. Open-source models like Llama 8B, you can quantize and put on your device, and they will produce super-fast responses, but they won't be very good. And so, the challenge is figuring out, for any given application, for any given request, how do I route that request to a model with the right level of complexity? So, if you're building, for example, an end-user-facing application, some things that a user might do might require a really complicated model. Maybe you're building a medical health app, and processing certain data requires the latest and greatest GPT 5.1 models, and that's going to be really slow.
Sahaj Garg 00:28:26 But then other questions or other things that a user might be doing where you’re just trying to figure out basic things like is this person happy, sad, or what’s their tone right now that might be able to get by with a much smaller model. And so, for both users and developers, this becomes a really important consideration what size model you’re deploying, and do you use any of these kinds of new techniques to take the same models and make them faster? And I know you’re interested in some of these topics with respect to distillation, quantization using certain compute efficient architectures, ones that work faster for training versus faster at inference time. There’s like all sorts of knobs that you can tune that will allow you to serve different quality outputs to users in different ways.
Robert Blumen 00:29:14 Yeah, and we can talk about some of those things. Let me ask: AI is much more expensive to run, at least the larger AI, compared to conventional web applications. I remember seeing some data that an AI search, compared to a Google index search, is, I don't remember, 10 times more expensive, multiples more.
Sahaj Garg 00:29:39 Easily.
Robert Blumen 00:29:40 All these trades you're talking about really come down to resource tradeoffs, which in the end means you could give the customer whatever they want if they are willing to cover your costs. How do you approach making these trade-offs in a real application?
Sahaj Garg 00:29:55 Well, one, I would say that OpenAI and a lot of the biggest language model providers just don't even have enough GPUs to service the requests; if they did, they might actually be able to serve people more large-model requests. Now to your other question: suppose you are in the practical resource-constrained trade-off space, which is the vast majority of cases, where you're not literally running up against somebody not having enough GPUs in a data center right now to handle the request that a user has. It's very case by case. For example, for us, we're building voice dictation where the user expects a nearly instantaneous response, right? It's an interactive voice application, but if the user has to go back and fix up their transcript, or clean up a bunch of stuff afterwards, or edit a mistake in the speech recognition, that experience is no longer better than typing.
Sahaj Garg 00:30:51 And so we have basically evaluated models at different sizes, where we say we have a latency budget: our LLM, for 50 words or a hundred words, can only take 250 milliseconds to process. And then we try models of various classes in various families to see what is the most accurate one that we're able to train that gets the accuracy we want at the latency budget that we want. We found that going too small or quantizing models will make it worse. We found that going bigger and bigger at some point might not even help, or might exceed our latency threshold. So that's how we've done it for our application. A lot of other places that are doing things that are more agent-based, where the model is deciding what tool to use, you can even have a model make the decision about how to route the request. So, you could have a request get sent somewhere, and some medium or big model might decide, hey, is this an easy, medium, or hard question, and send it off to the right place after that.
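The evaluation Sahaj describes reduces to a simple selection rule: among the candidates whose measured latency fits the budget, take the most accurate. The candidate table below is entirely hypothetical (invented names, latencies, and accuracy scores); in practice those numbers would come from benchmarking each trained model against your own eval set.

```python
# Hypothetical candidates: (name, measured p99 latency in ms, eval accuracy).
CANDIDATES = [
    ("tiny-1b",    90, 0.86),
    ("small-3b",  160, 0.91),
    ("medium-8b", 240, 0.94),
    ("large-70b", 900, 0.97),
]

def pick_model(latency_budget_ms):
    """Return the most accurate candidate whose measured latency fits
    within the budget; fail loudly if nothing fits."""
    within = [m for m in CANDIDATES if m[1] <= latency_budget_ms]
    if not within:
        raise ValueError("no model fits the budget; relax it or optimize serving")
    return max(within, key=lambda m: m[2])

print(pick_model(250))   # -> ('medium-8b', 240, 0.94)
print(pick_model(100))   # -> ('tiny-1b', 90, 0.86)
```

The agent-style routing mentioned at the end is the dynamic version of this: instead of a fixed budget chosen at design time, a cheap classifier (or a medium model) estimates the request's difficulty and picks the row per request.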
Robert Blumen 00:31:54 You just now mentioned latency budgets, which is something I have on my list to talk to you about. Can you explain what that is?
Sahaj Garg 00:32:03 Yeah, so a latency budget is basically us saying: this is the maximum allowable time that a user can experience for a request. And we can set the budget either for the average request or for the extreme, like the 99th-percentile worst request. And so we set this as a kind of product-side design constraint on the engineering, and then try all possible solutions that we can to meet that constraint.
Robert Blumen 00:32:32 Is this constraint enforced on each and every request as it goes through the system or is it more of a goal that you’re trying to meet on an average or P99 basis?
Sahaj Garg 00:32:44 We do it as a soft target for something that we want to meet on an average or a P99 basis. So we’ll have a different target for our average latency and a different target for our P99 latency. And then when we evaluate new things we want to build, we see, hey, is this even going to be practical to meet that P50 or P99 latency target? Because in practice every request is so different in these AI workloads that it’s hard to guarantee that every single thing will complete under a very fixed threshold.
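The soft P50/P99 targets Sahaj describes can be checked with a simple percentile computation over a window of observed request latencies. This is a toy sketch for illustration; the function names and thresholds are made up, not wispr’s actual code:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples, in ms."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def meets_targets(latencies_ms, p50_target_ms, p99_target_ms):
    """Soft check: do observed latencies meet the average and tail budgets?"""
    return (percentile(latencies_ms, 50) <= p50_target_ms
            and percentile(latencies_ms, 99) <= p99_target_ms)

# 98 fast requests plus two slow outliers: the P50 looks fine,
# but a tight P99 target would flag the tail.
latencies = [100] * 98 + [400, 900]
```

Because individual AI requests vary so much, a check like this runs over recent traffic rather than rejecting any single slow request.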
Robert Blumen 00:33:14 We’ve talked a little while ago about all the things that have to happen for the user to get a completed request. When you look at the budget, are you using this as a planning tool or you say, okay, DNS takes two milliseconds and database takes three, or how do you allocate the budget among the different steps?
Sahaj Garg 00:33:35 The way that we’ve always approached it is we’ve started with the overall target, then we figure out, hey, what’s the breakdown right now? How much is each step taking right now? And how much could each of those be optimized? At the end of the day, all I care about is the end user’s experience, right? And then based on how much time each chunk is taking, we can say, okay, this chunk can be optimized further so we can make more time for this other chunk, right? Maybe if we make our network operations a hundred milliseconds faster, now we get a hundred milliseconds back to run a bigger LLM. That would be nice, wouldn’t it? And so that’s how we start. We always have the constraint on the end user experience, and then when we’re doing engineering, we’re trying to figure out, okay, right now, based on all the chunks of the pipeline, we only have this much time for this set of operations, and then we make it happen.
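The budgeting process Sahaj outlines, start from an end-to-end target, measure each stage, and reallocate any slack, can be sketched like this. The stage names and numbers are hypothetical:

```python
# Hypothetical end-to-end latency budget and measured stage timings (ms).
BUDGET_MS = 700

stage_ms = {
    "network": 150,
    "speech_recognition": 200,
    "llm_cleanup": 300,
}

def remaining_budget(stages, budget_ms=BUDGET_MS):
    """Slack left over once the measured stages are accounted for."""
    return budget_ms - sum(stages.values())

# Shaving 100 ms off network operations frees 100 ms of budget
# that could fund a bigger (slower but more accurate) LLM.
stage_ms["network"] -= 100
```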
Robert Blumen 00:34:26 Can you tell a story about a time where you said, we need to get this particular step to come in a hundred milliseconds faster, and what you had to do to make that happen?
Sahaj Garg 00:34:38 So a huge one for us was with respect to deploying our LLMs, right? Our LLMs, and other LLMs for things like coding agents, where they’re editing the output of what somebody has said, essentially take in all the words and output something that looks very similar but with edits applied. And so that whole step was taking a really long time initially. But there’s one category of techniques that people use across a lot of ML workloads called speculative decoding, where essentially what you’re doing is saying, hey, I’m going to guess a bunch of things about what the output should be and just verify my guess, more or less. So for example, if you’re in a coding tool and you’re editing a line, or you’re going to edit several lines, you can actually guess that, hey, this line is going to stay the same, and just re-output the same line that was already there, so that you only have to generate the lines that are going to get changed.
Sahaj Garg 00:35:40 And similarly for us, if you’re dictating and we’re going to remove a bunch of filler words, you can guess that, hey, if the sentence is looking fine, it’s probably going to be the same words up to a certain point before we have to start changing things. And so this is one of those kinds of techniques where, once we rolled that out, we were actually able to bring down the latency for LLMs by over half in a lot of our cases, sometimes even more. And that actually allowed us to budget for running even more models as part of our workflow, right? By doing that we were able to say, okay, how can we make the results even better by running a third model and a fourth model to do things that are even more useful for the user?
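The guess-and-verify idea behind speculative decoding can be shown with a toy token-level sketch. Real systems verify a whole batch of draft tokens in one forward pass, which is where the speedup comes from; here `verify_next_token` and `generate_from` are made-up stand-ins for model calls:

```python
def speculative_edit(draft_tokens, verify_next_token, generate_from):
    """Accept the draft for as long as the model agrees with it, then
    fall back to ordinary generation from the first point of divergence."""
    accepted = []
    for tok in draft_tokens:
        if verify_next_token(accepted) == tok:
            accepted.append(tok)   # guess confirmed: no generation needed
        else:
            break                  # draft diverges from the model here
    return accepted + generate_from(accepted)

# Toy "model" that drops the filler word "um" from dictated text.
target = ["i", "think", "so"]
verify = lambda prefix: target[len(prefix)] if len(prefix) < len(target) else None
generate = lambda prefix: target[len(prefix):]

# Draft = the raw transcript; most of it survives the edit unchanged.
result = speculative_edit(["i", "um", "think", "so"], verify, generate)
```

In the dictation case, the draft is simply the unedited transcript, so most tokens are accepted cheaply and only the edited span is generated.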
Robert Blumen 00:36:22 I’ve worked more in enterprise software, and I can’t say that companies don’t care at all about latency, but the approach is generally more building features that create some value within certain workflows, and then you’ll maybe look at latency, and if you have certain features that are way out of line, you try to figure out what’s going on and bring them back to maybe the average. But from this conversation it sounds like there are problem domains where latency is a key part of the product, and you just wouldn’t have a successful product without low latency. So it can’t be an afterthought. Is that an accurate description of either your business domain or some related domains?
Sahaj Garg 00:37:05 A hundred percent. I think anything that’s a user-interactive, sticky habit that’s being used directly by individuals, and not bought by a company, has this kind of attribute to it. So that’s things like Superhuman and other email clients, web browsers, note-taking and editing applications, things like Slack or messaging applications. Anything that’s individual-facing, all those things have extremely tight latency requirements.
Robert Blumen 00:37:32 You could say if you’re the user of enterprise software, you’re an employee of a company which has bought this software for you to use as part of your job. And so you’re not going to say, hey, I’m going to switch to this other enterprise software; you just have to put up with it. But might it also be true that companies don’t want their employees to get bored, distracted, or have their attention wander, and they want their employees to be productive? So do you have any thoughts on how important latency is even when you have a somewhat captive user base?
Sahaj Garg 00:38:03 The best consumer products often can also be some of the best enterprise products. We all still use things like iPhones or Android phones for our work applications, and they’re also individual products. And so companies that do this incredibly well can capture a lot of the market on both sides. I think the real trade-off comes from how many features you can ship quickly if you’re also trying to optimize the latency and performance of every feature. And so in a lot of enterprise cases there’s actual value unlocked by having a complex new workflow feature that integrates five different data sources and lets you do something new with it, right? And maybe only 5% of the users at the company are going to use that specific feature. And so in those kinds of cases, the reason why a lot of enterprise software focuses less on this by default is that you can’t do that and also ship all the features as quickly. That’s the real reason for the trade-off.
Robert Blumen 00:39:05 Yeah, that makes sense. Now we’ve covered some of the techniques for low-latency engineering in AI. You talked about choosing the right model, and you went by pretty quickly over reducing precision through quantization, distillation, and pruning. So let’s maybe dive into those. We could start with quantization.
Sahaj Garg 00:39:27 Yeah, so one of the big techniques that people use to actually make workloads really fast is instead of running all of your operations in 32 bit floating point arithmetic, which is pretty slow, you could run it in 16 bit floating point arithmetic or even better you could run it in eight bit integer arithmetic. Integer arithmetic is way, way faster to run on compute than floating point arithmetic because you don’t have to kind of carry over any exponent or do any complicated computations. The problem comes with the fact that a lot of these models were trained using floating point arithmetic. So now when you want to port them over to using integer arithmetic to serve them really fast, the accuracy starts to go down because they’re doing something slightly different. And so people do a lot of work basically to train models or fine tune models in a way that’s essentially quantization aware, that’s aware of the fact that in the future all the operations are going to be redone as integer operations that then lets you deploy these kinds of quantized models on edge devices or on servers as well and still get really good results.
Robert Blumen 00:40:33 Can you give an example of a problem domain where that works?
Sahaj Garg 00:40:38 I actually suspect almost all of the large language models that are deployed right now are operating on quantized arithmetic at inference time. Because you can get roughly a four-times speedup compared to 32-bit floating point, and still a significant one compared to 16-bit floating point. And so much of the inference for these models is actually optimized for running integer inference. And you’ll see, for example, plenty of posts where Elon talks about how the fact that these models are trained in floating point itself is an issue, and he’s trying to push for a lot of the xAI work to train in integer arithmetic as well, because then things can be even faster.
Robert Blumen 00:41:18 Do you know if some of the biometric IDs that run on our phone fit that category?
Sahaj Garg 00:41:26 I actually don’t know too much about how those are implemented or if they’re using some of these really big workloads. Quantization I’ve mostly seen people use for models that exceed, you know, a hundred million parameters, where you’re now trying to bring it back to the device to run it. But I don’t think the biometric ID models are actually really big models. I think they’re tuned in a different way for precision and recall.
Robert Blumen 00:41:52 Distillation. That’s the next thing on my list. How can that be used to impact latency?
Sahaj Garg 00:41:58 So distillation is a really huge technique, and it’s used extremely widely. The idea here is that small models are less smart than big models. Small models have less capacity to learn and understand things. However, if you’re telling a small model to do something specific, it can learn how to do that very specific thing. And one of the ways that somebody can teach a small model to do that very specific thing is to ask a large model to teach it, essentially. And the way that you get a large model to teach it is: maybe you have a million examples of the task that you want to do, and you take those million examples, have the large model try to perform the task on all million examples, get the answers that it produces, and use that as the data to say, hey, let me now train the small model to mimic what the big model would have done.
Sahaj Garg 00:42:49 And now all of a sudden, this small model might not have been able to do the task to begin with, because it doesn’t have the capacity to learn everything on its own. But now that it’s seen lots and lots of examples of how to do that, it can do that one thing really well, often just as well as the big model, and probably not much else anymore at that point. And this technique is super powerful. It’s used across the board. The way that I just described it was how it’s used in large language models, where people might distill a very large language model into a small language model, but it’s also used, for example, for computer vision models, speech models, or even recommender systems, where you can get really strong properties out of small models by training a much larger model first offline.
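The teacher-labels-the-data loop Sahaj describes reduces to a few lines. Here `teacher` and `student_fit` are placeholders for a large model and a small-model training routine; the toy usage just memorizes lowercased answers:

```python
def distill(teacher, student_fit, examples):
    """Label the task data with the big model offline, then train
    the small model to mimic those labels."""
    labeled = [(x, teacher(x)) for x in examples]  # one-time teacher pass
    return student_fit(labeled)

# Toy stand-ins: the "teacher" lowercases text; the trained "student"
# is just a lookup table over the labeled pairs.
student = distill(str.lower, dict, ["Hello", "WORLD"])
```

In practice `student_fit` would run gradient descent on the (input, teacher-output) pairs; the structure of the pipeline is the same.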
Robert Blumen 00:43:36 Given a large language model. What’s a good example of a use case where you want a small language model that’s good at one particular thing?
Sahaj Garg 00:43:45 Basically anything domain-specific is a good example of this. Let’s take your example that you mentioned, where you would often have recordings of interviews that you’d done where you were trying to get them to be shorter. And suppose you’re trying to do that at scale, right? For lots and lots of things. If you sent every single one of those requests to GPT 5.1, it might be really slow, might take forever, might cost you a lot of money. And as a company, or as an individual, if you could turn that into a task that could be accomplished by a small model, it solves for a lot of the resource-constraint trade-offs that you mentioned and that we were talking about earlier, right? Because now all you have to do is, first, either online or offline, generate all that data, then you train the small model once, and now every subsequent thing that you do is super fast, super cheap, and accomplishes the same result.
Sahaj Garg 00:44:40 And so like all sorts of domain specific things and domain specific applications, legal AI will have a lot of these kinds of things. A lot of medical AI tools will do this because it lets you serve a lot more requests at low latency, at low cost still with quality. The other one is also for on-device inference. So like if you want to bring something to work on your phone or on your laptop and you want to have a really high powered AI workload on that, sometimes the only way to get that to work on these really, really, you know, small computers that exist on a phone is to actually use this kind of a technique.
Robert Blumen 00:45:18 I would love to have ChatGPT on my laptop, but what’s an example of a limited scope application that’s very useful to be able to run offline?
Sahaj Garg 00:45:28 A good example of this, and I don’t know if Grammarly does this, is grammar checkers, or autocorrect built into your keyboard, or things like this, where if you have a good language model, it should be able to accomplish this task really well, and it just has to run fast and on my device, because I can’t send every single one of those requests to a server. So that’s one example. For a lot of vision models this is really important. Vision models will often want to run on my phone or on my device if, for example, I’m looking through a camera app and I want to tag the objects that I’m looking at. Or if Tesla is driving a car, the amount of compute that can actually run on the device in the car is going to be lower than what they can use when they’re training their models on their servers. So there are any of these kinds of things where you don’t want to be reliant on a network for being able to make a request, and then you don’t have to run servers as a developer yourself; you can have everything run on the user’s device. And then privacy-sensitive applications are kind of the final core one, where it’s, hey, no data will leave the device, it runs right there. And that’s super useful.
Robert Blumen 00:46:39 AI or LLMs in particular are different in the amount of randomness they introduce into the response. What are some techniques for managing that?
Sahaj Garg 00:46:49 This is one of the hardest parts about serving a lot of these applications and one of the ways in which that occurs is the kind of classic problem of hallucinations in AI. So maybe you ask a question and then all of a sudden, the model goes off in a crazy loop and then starts to even do bad things like repeat the same sentence over and over and over again in an answer. And then the request just never stops, never terminates and keeps consuming resources, never giving the user what they want. There’s a bunch of different pathologies like this that can show up when serving AI applications and there’s two or three important things to do to handle this. So, the first is like safeguards. If you know how long an answer should be, that at least lets you cut off how bad of an output you can generate.
Sahaj Garg 00:47:37 So that’s one thing that’s helpful. The second thing is catching it as the model is producing it. If the model starts hallucinating, going off the rails, or answering a different question from the one it was asked, that can be caught partway, because the output is being partially produced: heuristic algorithms can be used to catch that, and other LLMs can be used to catch that. All these kinds of things allow for handling those kinds of problems. And this becomes really important because even with normal questions that a user might ask, sometimes these systems just go off the rails. And as a developer, you’re the one responsible for making sure the system doesn’t go off the rails when a user is asking a normal question, let alone for the handful of users who will try to break your system and give it hard things that will cause those kinds of problems to occur.
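The two safeguards mentioned, a hard length cutoff and catching repetition mid-stream, can be sketched as a wrapper around a model’s streamed output. The thresholds here are made up for illustration:

```python
def guarded_output(chunks, max_chunks=256, max_repeats=3):
    """Stop consuming a model's streamed output when it runs too long
    or repeats the same chunk too many times (a hallucination loop)."""
    out, counts = [], {}
    for i, chunk in enumerate(chunks):
        if i >= max_chunks:
            break                                  # hard length cutoff
        counts[chunk] = counts.get(chunk, 0) + 1
        if counts[chunk] > max_repeats:
            break                                  # repetition loop detected
        out.append(chunk)
    return out
```

A production version might also compare the partial output against the question with a heuristic or a second, smaller LLM, as described above.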
Robert Blumen 00:48:32 Now in the last few minutes, I want to talk about the issues of scale. How does latency degrade or become increasingly challenging at scale?
Sahaj Garg 00:48:42 So scale has a tremendous amount of impact on latency, and I think a lot of it has to do with how GPU requests tend to get processed as more and more load is on the GPU. GPUs work by parallelizing lots of operations; they run lots of computations in parallel. But if two people, for example, submit a request at the same time, and mine has a 200-word question with a 10-word answer and yours has a 10-word question with a 200-word answer, and the GPU tries to process them simultaneously, it all of a sudden has to do two slightly different things for each of us. And so the way in which it can optimize the parallel computation changes, right? So there’s a lot that happens as systems and requests start to scale, where more and more different types of requests get batched. On the flip side, you have more opportunity to optimize, right? As somebody working on infrastructure, I can try to optimize so that similar kinds of requests go to the same server, right? And that gives me opportunities for optimizing latency in a way that wasn’t possible when there wasn’t scale. So scale both helps and hurts when it comes to latency. The place where it’ll probably hurt the most is if you ever run out of resources, and now you’re dealing with higher and higher load on the same set of machines and doing your best to make sure that you can actually service everybody.
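One concrete version of "similar requests go to the same server" is bucketing by prompt length, so that each GPU pool batches requests of similar shape. The bucket bounds here are invented for illustration:

```python
def route_by_length(prompt_tokens, buckets=(64, 256, 1024)):
    """Pick a server-pool index so similarly sized requests batch together."""
    for i, bound in enumerate(buckets):
        if prompt_tokens <= bound:
            return i
    return len(buckets)  # overflow pool for very long prompts
```

A real router would also consider expected output length and current pool load, but the principle is the same: keep the shapes within a batch similar so the GPU’s parallelism is not wasted.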
Robert Blumen 00:50:10 And how much of these behaviors that are more unique to GPUs did you know going into it, and how much have you discovered with observability and digging into things that were not behaving as you expected?
Sahaj Garg 00:50:27 I’d probably say 50/50. There’s a bunch of these kinds of problems that we know going in, where we haven’t solved them yet and we know that at the next level of scale it’s going to be important to solve them. And there’s a bunch of things that come up that are extremely hard to predict, because sometimes one flag that you set for an Nvidia CUDA kernel is the kind of thing that changes the performance of a system by 25% under a specific load pattern that we’re experiencing. We literally had that happen at one point, where one CUDA kernel flag improved performance by 25% due to a specific load pattern. And that’s the kind of thing where there’s no way we could have guessed that going into building the system, but by observing it, by seeing when latency spiked, and then by looking at profiling of which operations were taking what amount of time, we were able to figure out that that was the thing we needed to optimize in that situation.
Robert Blumen 00:51:26 Are there any other lessons learned or best practices around managing latency that you’d like to convey?
Sahaj Garg 00:51:33 I think the most important one is that for anything latency sensitive, everything that’s not critical should get removed from the critical path. And when I phrase it that way, it sounds obvious, because it’s called the critical path, but it’s really easy to have a database query here or a blocking operation there that is occurring in the primary path for a latency-sensitive application. But three milliseconds here, five milliseconds there, 10 milliseconds another place, and that’s under good conditions; it could become a hundred milliseconds here or there really fast. And it’s actually often the innocuous pieces of code that people write that can very easily just cause overhead that doesn’t need to be there, right? That’s the easy kind of overhead to avoid by being disciplined about programming. A lot of the other stuff, you’re going to have to whack away at it. It’s going to be a hard problem. You’re going to have to dive into logs, observe things, and understand how kernels work and things of that sort. And that’s the hard work that always needs to happen. But the best practice of making sure only the absolute bare minimum happens in any blocking operation on the critical path, that’s the kind of golden rule for me.
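The "only the bare minimum on the critical path" rule can be sketched with a background queue: the user waits only for the blocking work, and everything else is deferred to a worker thread. The names here are illustrative, not from wispr’s codebase:

```python
import queue
import threading

background = queue.Queue()

def _worker():
    """Drains deferred work (logging, analytics, etc.) off the critical path."""
    while True:
        task = background.get()
        task()
        background.task_done()

threading.Thread(target=_worker, daemon=True).start()

def handle_request(transcribe, audio, log_usage):
    """The user waits only for transcription; logging happens asynchronously."""
    text = transcribe(audio)                   # critical path: blocking
    background.put(lambda: log_usage(text))    # deferred: non-blocking
    return text
```

The same shape applies with async frameworks or a message broker; the point is that nothing the user is not waiting on ever blocks the response.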
Robert Blumen 00:52:54 It sounds very much like an idea that I ran across some years ago under the name of reactive applications, where the idea was: anything where you don’t need the response inline, put it in an async task and just let it happen on its own, and if you need to know when it finished, use an event.
Sahaj Garg 00:53:15 Exactly.
Robert Blumen 00:53:16 Okay. Well Sahaj, thank you so much for this conversation. Before we wrap up, would you like to direct listeners to anywhere on the internet where they can find either you or your company?
Sahaj Garg 00:53:28 Yeah, so I would just recommend going to the wispr flow website, so that’s where you can see some of our research and our blogs about how we think about optimizing and building voice AI models for improving things. And you can also download and play around with the application, which lets you speak instead of type anywhere on your computer. So that’s, you know, the best set of resources. And then I have a small lightweight personal blog that you can find online as well.
Robert Blumen 00:53:54 So Sahaj, thank you for speaking to Software Engineering Radio.
Sahaj Garg 00:53:58 Thank you for having me, Robert. It was a pleasure being here.
Robert Blumen 00:54:00 This has been Robert Blumen for Software Engineering Radio. Thank you for listening.
[End of Audio]


