Sunil Mallya, co-founder and CTO of Flip AI, discusses small language models with host Brijesh Ammanath. They begin by considering the technical distinctions between SLMs and large language models.
LLMs excel at generating complex outputs across a range of natural language processing tasks, leveraging extensive training datasets and massive GPU clusters. However, this capability comes with high computational costs and concerns about efficiency, particularly in applications that are specific to a given enterprise. To address this, many enterprises are turning to SLMs fine-tuned on domain-specific datasets. Their lower computational requirements and memory usage make SLMs suitable for real-time applications. By focusing on specific domains, SLMs can achieve greater accuracy and relevance aligned with specialized terminology.
The selection of SLMs depends on specific application requirements. Additional influencing factors include the availability of training data, implementation complexity, and adaptability to changing information, allowing organizations to align their choices with operational needs and constraints.
This episode is sponsored by Codegate.
Show Notes
Related Episodes
- SE Radio 582: Leo Porter and Daniel Zingaro on Learning to Program with LLMs
- SE Radio 648: Matthew Adams on AI Threat Modeling and Stride GPT
- SE Radio 611: Ines Montani on Natural Language Processing
- SE Radio 610: Phillip Carter on Observability for Large Language Models
Other References
- Flip AI Paper on System of Intelligent Actors
- Flip AI Blog
- Community Stories – Llama
- Speculative Decoding: Cost-Effective AI Inferencing
- OpenAI’s o-1 and Inference-Time Scaling Laws
Transcript
Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.
Brijesh Ammanath 00:00:18 Welcome to Software Engineering Radio. I'm your host Brijesh Ammanath. Today I will be discussing small language models with Sunil Mallya. Sunil is the co-founder and CTO of Flip AI. Prior to this, Sunil was the head of the AWS NLP service Comprehend and helped start AWS pet. He's the co-creator of AWS DeepRacer. He has over 25 patents filed in the areas of machine learning, reinforcement learning, NLP, and distributed systems. Sunil, welcome to Software Engineering Radio.
Sunil Mallya 00:00:49 Thank you Brijesh. So happy to be here and talk about this topic that’s near and dear to me.
Brijesh Ammanath 00:00:55 We have covered language models in some of our prior episodes, notably Episode 648, 611, 610, and 582. Let's start off, Sunil, by explaining what small language models are and how they differ from large language models, or LLMs.
Sunil Mallya 00:01:13 Yeah, this is a very interesting question because the term itself is sort of time bound, because what is large today can mean something else tomorrow as the underlying hardware gets better and bigger. So if I go back in time, to around 2020, that's when the LLM term starts to emerge, with the advent of people building billion-parameter models, and quickly after, OpenAI releases GPT-3, which is a 175 billion parameter model, and that sort of becomes the gold standard of what a true LLM means. But the number keeps changing. So I'd like to define SLMs in a slightly different way: not in terms of number of parameters, but in practical terms. What that means is something you can run with resources that are easily accessible. You're not constrained by GPU availability, and you don't need the biggest or best GPU. To distill all of this, I'd say as of today, early 2025, it's a 10 billion parameter model operating with, say, a max of 10K context length, which means you can give it an input of around 10K words maximum, and where the inference latency is around one second, so it's pretty fast overall. I would define SLMs in that context, which is a lot more practical.
Brijesh Ammanath 00:02:33 Makes sense. And I believe as the models become more memory intensive, the definition itself will change. I believe, when I was reading up, GPT-4 actually has about 1.76 trillion parameters.
Sunil Mallya 00:02:46 Yeah. Actually, with some of these closed-source models it's really hard when people talk about numbers, because what can happen is people nowadays use a mixture-of-experts architecture. What that means is they'll put together a really large model that has specialized parts to it. Again, I'm trying to explain in very easy language here. What that means is, when you run inference through these models, not all the parameters are activated. So you don't necessarily need 1.7 trillion parameters' worth of compute to actually run the model; you end up using some percentage of that. That makes it a little interesting when we say, oh, how big is the model, because you actually want to talk about the number of active parameters, since that really defines the underlying hardware and resources you need. So if we go back to something like GPT-3, when I say 175 billion parameters, all 175 billion parameters are involved in giving you that final answer.
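To illustrate the active-parameter idea, here is a minimal mixture-of-experts routing sketch in Python with PyTorch. It is a toy illustration under assumed sizes (8 experts, top-2 routing, tiny dimensions), not the architecture of GPT-4, DeepSeek, or any particular model.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: each token is routed to its top-2 experts."""
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        rows = []
        for t in range(x.shape[0]):  # route each token independently
            rows.append(sum(w * self.experts[int(e)](x[t])  # only the top-k experts run
                            for w, e in zip(weights[t], idx[t])))
        return torch.stack(rows)

layer = ToyMoELayer()
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
# 8 experts exist in the layer, but each token activates only 2 of them, so the
# "active" parameter count per token is a fraction of the total parameter count.
```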
Brijesh Ammanath 00:03:49 Right. So if I understood that correctly, only a subset of the parameters would be used for the inference in any particular use case.
Sunil Mallya 00:03:57 In a mixture-of-experts model, yes, in that architecture. And for the last maybe year and a half, that has been a popular way for people to build and train, because training these really, really large models is extremely hard, but training a mixture of experts, which is a collection of relatively smaller models, is much easier, and then you put them together, so to speak. That's an emerging trend even today; very popular and a very pragmatic way of going forward in training and then running inference.
Brijesh Ammanath 00:04:34 Okay. And what differentiates an SLM from an expert model? Or are they the same?
Sunil Mallya 00:04:39 Yeah, I'd say the way we've ended up training LLMs has produced general-purpose models, because these models are trained on an internet corpus and whatever data you could get your hands on. By its nature, the internet covers all the topics of the world you can think about, and that defines the characteristics of the model. Hence you would characterize them as general-purpose Large Language Models. Expert models are when the model has a certain expertise. Let's say you're building a coding model, an expert coding model; you don't necessarily care about it knowing anything about Napoleon or anything to do with history, because that's irrelevant to the topic of choice. So expert models are focused on one or two areas and go really deep. And SLM is just the term for a Smaller Language Model from a size and practicality perspective. But typically, what people end up doing is saying, hey, I don't care about history, so I only need this little part of the model, or I just need the model to be an expert in only one thing, so let me train a smaller model focused on just one topic, and then it becomes an expert. So they're interchangeable in some respects, but they needn't be.
Brijesh Ammanath 00:06:00 Right. I just want to deep dive into the differences and attributes between SLMs and LLMs. Before we go into the details, I’d like you to define what a parameter is in the context of a language model.
Sunil Mallya 00:06:12 So let's talk about where this actually comes from. If we go back, the whole concept of neural nets, as we called them in the early days, is modeled on the biological brain and how the animal nervous system and brain function. The fundamental unit is a neuron, and a neuron is a cell that has some sort of memory, some sort of specialization. The neuron connects to many other neurons to form your entire brain, and based on stimuli, certain sets of neurons activate and give you the final response. That's what is being modeled. So you can think of a parameter as roughly equivalent to a neuron, or a compute unit, and these parameters come together to synthesize the final response for you. Again, I'm giving a very high-level answer here. As for what that translates to from a practical point of view:
Sunil Mallya 00:07:08 When I say a 10 billion parameter model, that roughly translates into X number of gigabytes, and there's an approximate formula; it depends on the precision you want to use to represent your data. If you take a 32-bit floating-point representation, that's about four bytes per parameter. So you multiply 10 billion by four, and that is 40 gigs of memory you need to store these parameters in order to make them functional. And of course you can go to half precision, and then you're suddenly looking at only 20 gigs of memory to serve that 10 billion parameter model.
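As a rough sketch of that arithmetic (weights only; it ignores activations, the KV cache, and serving overhead, and uses decimal gigabytes):

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory needed just to hold the model weights."""
    total_bytes = params_billion * 1e9 * (bits_per_param / 8)
    return total_bytes / 1e9

for bits in (32, 16, 8):
    print(f"10B parameters at {bits}-bit precision: ~{weight_memory_gb(10, bits):.0f} GB")
# 10B parameters at 32-bit precision: ~40 GB
# 10B parameters at 16-bit precision: ~20 GB
# 10B parameters at 8-bit precision: ~10 GB
```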
Brijesh Ammanath 00:07:48 It’s a very good example comparing it to neurons. It brings to life what parameters are and why it’s important in the context of language models.
Sunil Mallya 00:07:56 Yeah, it's actually the origin itself: how people thought about this in the fifties, how they modeled it, and how it finally evolved. So rather than it being just an example, I would say people went and modeled real-life neurons to come up with the terminology and the design of these things. And to this day, people compare everything, rationalizing, reasoning, understanding, et cetera, very human-like concepts, to how these LLMs behave.
Brijesh Ammanath 00:08:26 Right. How does the computational footprint of an SLM compare to that of an LLM?
Sunil Mallya 00:08:33 Yeah, so computational footprint is directly proportional to the size. Size is the number one driver of the footprint, I would say maybe 90% of it. The rest of the 10% will be things like how long your input sequence is, and these models typically have a certain maximum range; back in the day, I would say roughly a thousand tokens. Let me take a little segue into how these models work, because I think that may be relevant as we dive in. These language models are essentially a prediction system. The output of the language model, when you go to ChatGPT or anywhere else, is beautiful blogs and sentences and so on, but the model doesn't necessarily understand sentences as a whole.
Sunil Mallya 00:09:23 It understands parts of it. It is made up of words and technically sub words, sub words are what we call as tokens. And the idea here is the model predicts a probability distribution on these sub word tokens that allows it to say, hey, the next word should be now with 99% probability should be this. And then you take the collection of the last N words you predicted, and then you predict the next word, N + one word, and so on. So it’s auto aggressive in nature. So this is how these language models work. So the token length as in how many words if you are predicting over a hundred words versus 10,000 words is a material difference because now you have to take, when you’re predicting the 10,000th word, you have to take all the 9999 words that you have previously as context into that model.
Sunil Mallya 00:10:16 So that has a sort of non-linear scaling effect on how you end up predicting your final output. That, along with the model size, has an effect, though not as much as the model size itself; they go hand in hand, because the larger the model, the slower it's going to be on each next token, and so on. So they add up. But fundamentally, when you look at the bottleneck, it is the size of the model that defines the compute footprint you need.
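A minimal sketch of that autoregressive loop, with a stub standing in for a real model; the vocabulary and probabilities here are made up purely to show the shape of the loop.

```python
import random
random.seed(7)

def next_token_distribution(context):
    """Stub for a real language model: return (token, probability) pairs."""
    vocab = ["the", "model", "reads", "every", "previous", "token", "."]
    return [(tok, 1.0 / len(vocab)) for tok in vocab]

def generate(prompt_tokens, max_new_tokens=8):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Each step conditions on the *entire* sequence so far, which is why
        # producing the 10,000th token costs far more than the 100th.
        words, probs = zip(*next_token_distribution(tokens))
        tokens.append(random.choices(words, weights=probs, k=1)[0])
    return " ".join(tokens)

print(generate(["Small", "language", "models"]))
```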
Brijesh Ammanath 00:10:47 Right. So to bring it to life, that would mean an SLM would have a smaller computational footprint, or is that not necessarily the case?
Sunil Mallya 00:10:55 Yeah, by definition it would. Since we're defining SLMs by a certain parameter threshold, they will almost always have a smaller footprint in terms of compute. And just to give you a comparison, if we compare the 10 billion parameter model I talked about versus something like a 175 billion parameter model, we're talking about two orders of magnitude difference in terms of actual speed, because again, things are not linear.
Brijesh Ammanath 00:11:26 Can you provide a comparison of the training data sizes typically used for SLMs compared to LLMs?
Sunil Mallya 00:11:32 Practically speaking, let me define different training strategies for SLMs. There's what we call training from scratch, wherein your model parameters, think of them as a giant matrix, all start at zero because the model hasn't learned anything. You start from that zero state, give the model a certain amount of data, and start training. Let's call that zero-weight training; that's one approach to training small language models. The other approach is you can take a big model and then go through different techniques like pruning, where you take certain parameters out, or distillation, which I can dive into later, or quantization, which means I go from a precision of 32 bits to eight bits or four bits.
Sunil Mallya 00:12:27 So I can take this hundred billion parameter model, which would be 400 gigs, and if I chop the precision by four, its footprint effectively becomes that of a 25 billion parameter model, because that's the amount of compute I would need. So there are different strategies for creating these small language models. Now to the question of training data: the larger the model, the hungrier it is and the more data you need to feed it; the smaller the model, the more you can get away with smaller amounts of data. But that doesn't mean the end result is going to be the same in terms of accuracy and so on. What we find practically is that given a fixed amount of data, the larger model is likely to do better, and the more data you feed into any kind of model, the more likely it is to do better as well.
Sunil Mallya 00:13:19 So the models are very hungry for data, good data, that you get to train on. But let me talk about the next approach, which is, rather than training SLMs from scratch, fine-tuning. What that means is, instead of the zero weights I talked about earlier, we use a base model, a model that has already been trained on a certain amount of training data, and the idea is steering that model toward a very specific task. Now this task can be building a financial analyst, or in the case of healthcare you can build healthcare models; in the case of Flip AI, we built models to understand observability data. So you can fine-tune and build these models. Now, to give you some real examples.
Sunil Mallya 00:14:13 Let's take some of the most popular open-source models. Llama-3 is the most popular open-source model out there, and that's trained on 14 trillion tokens of data. It has seen so much data already, but by no means is it an expert in healthcare or in observability and so on. What we can do is train on top of these models using the data that we have curated. If you look at Meditron, which is a healthcare model, they trained on roughly 50 billion tokens of data. Bloomberg trained a financial analyst model, and that was again in the hundreds of billions of tokens. And we have trained our models with around a hundred billion tokens of data. That's the contrast: we're talking about two orders of magnitude less data than what LLMs would need. The only reason this is possible is by using those base models; for the specialization part, you don't require as many tokens as you do for generalization.
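As a sketch of the fine-tuning idea, here is a toy next-token training loop in PyTorch that "continues" from a base model on a domain corpus. The model, data, and hyperparameters are stand-ins, not Flip AI's or Meditron's actual pipeline.

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 32
# Stand-in for a pretrained base model; imagine its weights already encode
# general-purpose knowledge from trillions of tokens.
base_model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))

# Fake "domain corpus" batches of token IDs (e.g., curated observability data).
domain_batches = [torch.randint(0, vocab_size, (8, 16)) for _ in range(5)]
optimizer = torch.optim.AdamW(base_model.parameters(), lr=1e-5)  # small LR: steer, don't overwrite
loss_fn = nn.CrossEntropyLoss()

for batch in domain_batches:
    inputs, targets = batch[:, :-1], batch[:, 1:]   # predict the next token
    logits = base_model(inputs)                     # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final fine-tuning loss: {loss.item():.2f}")
```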
Brijesh Ammanath 00:15:20 Alright, got it. And how do you ensure that SLMs maintain fairness and avoid domain-specific biases? Because SLMs are by nature very specialized for a specific domain.
Sunil Mallya 00:15:31 Yeah, very good question. It's actually a double-edged sword, because on the one hand, when you talk about expert models, you do want them biased toward the topic. When I talk about credit in the context of finance, it means a certain thing, and credit can mean something else in a different context. So you do want it biased toward your domain, in a way. That's how I think about bias in terms of functional capability. But let's talk about bias in terms of decision making. If the same model is being used to approve a loan or determine who gets a loan, that's a different kind of bias; that is more of an inherent decision-making bias, and that comes with data discipline.
Sunil Mallya 00:16:20 What you need to do is, you need to train the model or ensure the model has data on all the pragmatic things that you’re likely to see in the real world. What that means is if the model is being trained to make decisions on offering loans, we need to make sure that underrepresented people in society are being trained, trained in the model. So, the model, like if the model is only seen a certain demographic while training is going to say no to people who have not represented in that training data. So that curation of training data and evaluation data, I like to say this is the evaluation data. Your test data is, is far more important. Like that needs to be extremely thorough and a reflection of what is out there in the real world. So that whatever number you get is close to the number that happens when you deploy. There are so many blogs, so many people I talk to everybody’s concern as, hey, my test data says 90% accurate. When I deploy, I only see like 60-70% accuracy because, people didn’t spend the right amount of time in curating the right training data and more importantly, the right evaluation data to make sure the biases are taken care of or reflected that you would encounter in the real world. So to me it boils down to good data practices and good evaluation practices.
Brijesh Ammanath 00:17:50 For the benefit of our listeners, can you explain the difference between curation data and evaluation data?
Sunil Mallya 00:17:56 Yeah, yeah. When I say training data, these are the examples the model sees throughout its training process. The evaluation or test data is what we call a held-out data set, as in, this data is never shown to the model for training, so the model doesn't know this data exists. It is only shown during inference, and inference is a process where the model doesn't memorize anything. It is a static process; everything is frozen, the model is frozen at that time. It does not learn from that example; it just sees the data, gives you an output, and is done. It doesn't complete the feedback loop of whether that was correct or wrong.
Brijesh Ammanath 00:18:36 Got it. So to ensure that we don’t have unwanted biases, it’s important to ensure that we have curation data and evaluation data which are fit for purpose.
Sunil Mallya 00:18:47 Yeah. So again, curation, I call it training data; curation would be the process. Your training data is the examples the model will see, and the test data is what the model will never see during the training process. And just to add more color here, good organizations follow a completely blind process of training or annotating data. What that means is you give the same example to many people, and they don't know what they're labeling, and you may repeat the labeling of the data, et cetera. So you create this process where you are creating a diverse set of training data that is labeled by multiple people. And you can also ensure that the people labeling this data are not from a single demographic; you're taking a slice of real-life demographics into account. So you're getting diversity all around, and you're ensuring that biases don't creep through in your process. I would say 95% of mitigating bias is to do with how you curate your training data and evaluation data.
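A small sketch of the held-out-evaluation discipline described above, with entirely synthetic data; the group labels are placeholders for whatever slices matter in your domain.

```python
import random
random.seed(0)

# Synthetic labeled examples, each tagged with a demographic-style slice.
examples = [{"text": f"loan application {i}", "group": random.choice("ABC")}
            for i in range(1000)]
random.shuffle(examples)

split = int(0.9 * len(examples))
train_set, eval_set = examples[:split], examples[split:]  # eval_set is never trained on

for group in "ABC":
    share = sum(e["group"] == group for e in eval_set) / len(eval_set)
    print(f"group {group}: {share:.0%} of the held-out evaluation set")
# If a slice is missing or tiny here, the offline accuracy number will not
# predict what happens after deployment.
```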
Brijesh Ammanath 00:20:00 Got it. What about hallucinations in SLMs compared to LLMs?
Sunil Mallya 00:20:05 Yeah. So LLMs by nature, as I said, are general purpose, so they know as much about Napoleon as they do about other topics, like how to write a good program in Python. It's this extreme range, and that comes with a burden. Now let's go back to the inference process I talked about: the model is predicting one token at a time. Imagine for some reason somebody decided to name their variable Napoleon. The model decodes the variable name Napoleon, and suddenly, with Napoleon in its context, it thinks, oh, this must be history, and it goes off and writes about that. We asked it to develop a program, but it has written something about Napoleon; those are complete opposites in terms of output. And that's where hallucination comes from: the model is unsure which path it needs to go down to synthesize the output for the question you've asked.
Sunil Mallya 00:21:12 And by nature with s SLMs, there’s less things for it to think about so that the space that it needs to like think from is reduced. The second is because it is trained on a lot of coding data and so on, even when say Napoleon may come in as a decoded token, unlikely that the model is going to veer into a history topic because majority of the time the model is spent learning is only on coding. So it’s going to assume that’s a variable and decode. So yeah, that’s kind of the advantage of SLM because it’s an expert, it doesn’t know anything else. So it’s going to focus on just that topic or its expertise rather than think. So typically an order of magnitude difference in hallucination rates when you think about a good well-trained SLM versus an LLM.
Brijesh Ammanath 00:22:05 Right. Okay. Do you have any real-world example of any challenging problem which has been solved more efficiently with SLMs rather than LLMs?
Sunil Mallya 00:22:15 Interesting question, and it's going to be a long answer, so I think we'll go through a bunch of examples. Traditionally speaking, if you had the same amount of data and you want to compare an SLM versus an LLM, the LLM is more likely to win just because of the power; more parameters give you more flexibility, more creativity, and so on. But the reason you train an SLM is for more controllability, deployment, cost, accuracy, those kinds of reasons, and I'm happy to dive into that as well. So traditionally that has been the norm, but it is starting to change a bit. If I look at examples in healthcare, Meditron is a set of open-source healthcare models that they have trained, and if I recall the numbers, their version one, which was a couple of years ago, even their 70 billion model was outperforming a 540 billion model by Google.
Sunil Mallya 00:23:19 The Google had trained like these models called Palm, which were healthcare specific. So Mediron. And they recently retrained the models on Llama-3, 8 billion and that actually beats their own model, which is 70 billion from the previous year. So if you sort of compare in a timeline of these five 40 billion parameter models from Google, which is like a general-purpose sort of healthcare model versus a more specific healthcare SLM by Meditron and then an SLM version-2, by them it’s like a 10X improvement that has happened in the last two and a half years. So I would say, and if I recall even their hallucination rates are a lot less compared to what Google had. That’s one example. Another example I would say is again, in the healthcare space, it’s a radiology oncology report model. I think it’s called RAD-GPT or RAD Oncology GPT.
Sunil Mallya 00:24:18 And that was the output I remember was something like the Llama models would be at equivalent of 1% accuracy and these models were at 60-70% accuracy. That dramatic a jump that relates to training data and happy to dive in a little more. So now you see that difference. Like of like large models. And that’s because when you think about the general-purpose models, they have never seen like radiology, oncology, that kind of reports or data like that doesn’t exist on the internet. And now you have a model that is trained on these data that is very constrained to an organization and you start to see this amazing, almost crazy 1% versus 60% accuracy result and improvements. So I would say there are these examples where the data sets are very constrained to the environment that you operate in that gives the SLMs advantage and then something that is practical. So that’s something that is like open in the world. So hopefully I’m happy to double click. I know I’ve talked a lot here.
Brijesh Ammanath 00:25:24 No, good examples. That's a really big difference, from 1% to 60-70%, in terms of identification or inference.
Sunil Mallya 00:25:33 Yeah, actually I have something more to add there. This is hot off the press, just a couple of hours ago. There's a model series called DeepSeek R1 that just launched, and DeepSeek, if I recall, is somewhere around a 600 billion parameter model, but it's a mixture-of-experts model. So the activated parameters I talked about earlier are only about 32 or 35 billion. That's almost a 20x reduction in size when you talk practically in terms of the amount of compute, and that model is outperforming the latest of OpenAI's o1 and o3 series models and Claude from Anthropic and so on. And it's insane. Again, we don't know the size of, say, Claude 3.5 or GPT-4o; they don't publish it, but we do know those are probably in the hundreds of billions of parameters.
Sunil Mallya 00:26:35 But for a model that is effectively 35 billion parameters of activated size to actually be better than these models are just insane. And I think it deals, again, it deals with like how they train, et cetera and so on. But I think it comes back to the question of the mixture of expert model. When you take a bunch of small models and put them together, they’re likely to, as we see these numbers, they’re likely to perform better than like a huge model that has this one sort of giant computational footprint end to end. I do think this is a sign of more things to come where SLMs or collection of s SLMs are going to be way better than a single 1 trillion parameter or a 10 trillion parameter model. That’s where I would bet.
Brijesh Ammanath 00:27:22 Interesting times. I'd like to move to the next topic, which is around enterprise adoption. Can you tell us about a time when you gave specific advice to an enterprise deciding between SLMs and LLMs? What was the process, what questions did you ask them, and how did you help them decide?
Sunil Mallya 00:27:39 Yeah, I'd say enterprise is a very interesting case, and by my definition an enterprise has data that nobody's ever seen; it's data that is very unique to them. So I say enterprises have a last-mile problem, and this last-mile problem manifests in two ways. One is the data: the model has probably never seen the data you have in your enterprise. It better not, right? Because you have security guardrails in terms of data and so on. The second is making this model practical and deployed in your environment. So tackling the first part of it, which is data: because the model has never seen your data, you need to fine-tune the model on your own enterprise data corpus. So getting clean data, that's my first piece of advice, getting clean data.
Sunil Mallya 00:28:31 So sort of advice them on how to produce this good data. And then the second is evaluation data. How do you, to my earlier examples. Like I have people who say like, hey, I had 90% accuracy on my test set, but when I deploy, I only see 60% or 70% accuracy because your test set wasn’t a representative of what you get in the real world. And then you need to think about how to deploy the model because there’s a cost associated with it. So when you’re sort of thinking through SLMs, you’re always, there’s a trade-off that they’re always trying to do, which is accuracy versus cost. And then that becomes sort of like your main optimization point. Like you don’t want something that is cheap and does no work or you don’t want something that is good, but it’s too expensive for you to justify bringing it in. So finding that sweet spot is what I think is like extremely important for enterprises to do. I would say these are my general advice on how to sort of think through deploying in the enterprise, deploying SLMs in the enterprise.
Brijesh Ammanath 00:29:41 And do you have any stories around the challenges faced by enterprises when they adopted SLMs? How did they overcome them?
Sunil Mallya 00:29:48 Yeah. As we look through many of these open-source models that companies try to bring in-house, because the model has never seen the data, things keep changing. There are two reasons. One is you didn't train well, or you didn't evaluate well, so you don't have a good handle on the model. The second is that the underlying data, and how people use your product, keeps changing over time. So there's a drift: you're not able to capture all the use cases at a given static point in time, and as time goes along, people use your product or technology in a different way and you need to keep evolving. Again, it comes back to how you curate your data, how you train well, and then iterate on the model. So you need to bring observability into your model, which means that when the model is failing, you're capturing that; when a user is not happy about a certain output, you're capturing that, and why they're not happy; you're capturing those aspects.
Sunil Mallya 00:30:56 So bringing all of this in and then iterating on the model. There’s also one thing which we haven’t talked about, especially in the enterprise, like we’ve talked a lot about fine tuning. The other approach is called a Retrieval Augmented Generation or RAG, which is more commonly used. So what happens is when you bring a model in, it doesn’t have, it’s never seen your data. And what you can do is certain terminologies or technologies or something jargons or something specific that you have, let’s say in your company Wiki page or some sort of text spec that you’ve written, you can actually give the model a utility to say, hey, when somebody asks a question on this, retrieve this information from this Wikipedia or this indexed, storage that you can bring in as more context because you’re never seen, you don’t understand that data and you can use that as context to predict what the user asked for. So you’re augmenting the existing base model. So typically people like approach in two different ways as they deploy. So either you fine tune, which I talked about earlier, or you can use retrieval augmented generation to get better results. And it’s a pretty interesting, there’s a people that debate RAG is better than fine tuning or fine tuning is better than RAG. That’s a topic we can dive in if you’re interested.
Brijesh Ammanath 00:32:22 Maybe for another day. We'll stick to the enterprise theme and dig a bit deeper into the challenges. What are the common challenges enterprises face, not only in bringing the models in and training them, but also from a deployment perspective?
Sunil Mallya 00:32:36 Yeah, let me talk about deployment first; it's underrated. People focus on the training part and don't think about the pragmatic aspects. One is, how do you determine the right footprint of the resources you need, the right kind of GPUs? Your model can probably fit on multiple GPUs, but there's a cost-performance trade-off. If you take a huge GPU and you're underutilizing it, that's not practical; you're not going to get budget for that. So this becomes three axes rather than two. On the X axis you can think about cost, on the Y axis performance or latency, and on the Z axis accuracy. You're now trying to optimize along these three axes to find the sweet spot: well, I have budget approved for X number of dollars and I need a minimum of this accuracy.
Sunil Mallya 00:33:37 What is the trade-off I can make in terms of, well, if somebody gets the answer in 200 milliseconds versus a hundred milliseconds, that’s acceptable. So you start to like figure out this trade off that you can have to select the best sort of optimal setting that you can go deploy on. Now that requires you to have expertise in multiple things. It means that you need to know the model deployment frameworks or the underlying tools like TensorFlow, PyTorch. So those things are specialized skills. You need to know how to pick the right GPUs and create this trade off or these trade-offs that I talked about. And then you need to think about people are experts in DevOps when it’s said an organization, me experts in DevOps when it comes to CPU and traditional workloads, GPU workloads are different. Like now you need to train people on how to monitor GPUs, how to understand how to the observative part comes in. So all of that needs to be sort of packaged and tackled for you to deploy well on the enterprise. I know if you want to double click on anything on the deployment side,
Brijesh Ammanath 00:34:48 Maybe just quickly, can you touch on the key differences, or the trade-offs, between deploying on-prem and in the cloud?
Sunil Mallya 00:34:58 Yeah, I'm not sure what you mean by the cloud. Do you mean an API-based service, or...
Brijesh Ammanath 00:35:03 Yes.
Sunil Mallya 00:35:04 Yeah. I mean, with API-based services there is no difference between using a payments API and an ML API; as long as you can make a REST call, you can use them, which makes them extremely simple. But if you're deploying on-prem, or I'll make it more generic, if you're deploying in your VPC, then that comes with all the considerations I talked about, with the addition of compliance and data governance, because you want to deploy it within the right framework. As an example, at Flip AI we support our deployments in two modes: you can deploy it as SaaS, or you can deploy it on-prem, and this on-prem version is completely air-gapped. We have scripts, whether cloud-native scripts or Terraform and Helm charts and so on.
Sunil Mallya 00:35:59 So we make it easy for our customers to go deploy this basically with one click because everything is automated in terms of bringing up the infrastructure and so on. But in order to enable that, we have done those benchmarks, those cost accuracy, performance sort of trade-offs, all of that. We have packaged it, we’ve written a little bit about that in our blogs, and this is what an enterprise adopting any SLM would need to do themselves as well. But that comes with fair bit of investment because it is not commoditized yet in terms of deploying LLMs in-house or SLMs as well.
Brijesh Ammanath 00:36:38 Yeah. But to pick up on that Flip AI example, what drives a customer to pick either the SaaS model or the on-prem model? What are they looking for, or what do they gain, when they go for on-prem versus SaaS?
Sunil Mallya 00:36:50 Yeah, we work with highly regulated industries where the customer data cannot be processed by any third party and cannot leave their security boundaries. So it's primarily driven by compliance and data governance. There's another thing, which applies to Flip AI but also to a lot of enterprise adoption, that I didn't talk about: robustness, SLAs, and SLOs. When you rely on a third-party API, even OpenAI or Claude from Anthropic or any of those, they don't give you SLAs. They don't tell you, hey, my request is going to finish in X number of seconds; they don't give you availability guarantees and so on. So think about an enterprise that is building for five nines of availability, or even higher. They have no control; when they use a SaaS service, nobody is promising them the accuracy or the nines of availability that they need. But by bringing it in-house and deploying with best practices and redundancy and all of this, you can guarantee a certain level of availability as far as these models go. And then the robustness part: these in-house models tend to hallucinate less. If you're using an API-based service, which is a more general-purpose model, you can't afford those kinds of hallucination rates, because your performance is going to degrade.
Brijesh Ammanath 00:38:20 Hallucination wouldn’t be a factor for on-prem and SaaS, right? That would be the same.
Sunil Mallya 00:38:25 Well, it can be for general-purpose models, but if the same model is available as SaaS or on-prem, yes, then there's equivalency there. The other factor is in-house expertise: if a customer doesn't have the in-house expertise to manage this, or they don't want to take on that burden, they end up going SaaS rather than on-prem. The other factor, a general one, is availability, or rather, I take that back; I was going to talk about LLMs versus SLMs. But if the same model is offered as SaaS or on-prem, it basically comes down to compliance, data governance, the robustness aspect, in-house expertise, and the availability guarantees you can give. It typically comes down to these factors.
Brijesh Ammanath 00:39:13 Got it. Compliance, availability, in-house expertise. You touched on a few key skills required for deployment: model deployment frameworks, knowledge of GPUs, and how you observe the workload on a GPU. What other skills or knowledge areas should engineers focus on to effectively build and deploy SLMs?
Sunil Mallya 00:39:40 I think the factors I talked about should cover most of them. And I would suggest, if somebody wants to get their hands dirty, try deploying a model locally on your laptop. With the latest hardware you can easily deploy a billion-parameter model on a laptop. So I would kick the tires with these models. You don't even need a 1 billion parameter model; you can go with a hundred-million-parameter model to get an idea of what it takes. You'll get some experience diving into these frameworks, deployment frameworks and model frameworks, and as you run benchmarks on different kinds of hardware, you'll get a bit of a feel for those trade-offs I talked about. Ultimately what you're trying to build out is those axes I talked about: accuracy, performance, and cost. So the more pragmatic approach, I would say, is to start on your laptop or a small instance you can get on the cloud and kick the tires, because that really builds the experience. With DevOps and these sorts of technologies, I feel like the more you read, the more you get confused, and you can condense that learning by actually just doing it.
Brijesh Ammanath 00:41:00 Agreed. I want to move on to the next theme, which is around the architectural and technical distinctions of SLMs, though I think we have covered quite a few of those already, around training data and the trade-offs of model size and accuracy. But maybe a few more bits: what are the main security vulnerabilities in SLMs, and how can they be mitigated?
Sunil Mallya 00:41:25 I think, practically speaking, security vulnerabilities are not specific to SLMs or LLMs; it's not that one fares better than the other, and I don't think that's the right framework to think about it. Security vulnerabilities exist in any sort of language model; they just manifest in slightly different ways. What I mean by that is, you are either trying to retrieve data the model has seen, so you're tricking the model into giving up some data in the hope that it has seen some PII or something of interest that it's not supposed to tell you; you're trying to exfiltrate that data. Or the other is behavior modification: you're injecting something, sort of like SQL injection, where you get the database to do something by injecting something malicious. In the same way, you do that in the prompt and trick the model into doing something different and giving you the data. Those are the typical security vulnerabilities people tend to exploit, but they're not exclusive to an SLM or an LLM; they happen in both.
Brijesh Ammanath 00:42:34 Right. And what are the key architectural differences between SLMs and LLMs and is there any fundamental design philosophy which is different?
Sunil Mallya 00:42:42 Not really the same technique that you use to train a 10 billion parameter model can be done for a hundred billion or a 1 trillion. Architecturally, they are different on neither are the training techniques. I would say. Well, people do employ different techniques. It doesn’t mean that the techniques are not going to work on LLMs as well. Like, so it’s just a size equation. But what is interesting is how these SLMs get created. They can be trained from scratch or fine-tuned, but you can take an LLM and make them an SLM and that’s a very interesting topic. So couple of most common things that people do is quantization and distillation. Quantization is where you take a large model and you convert the model parameters and this can be done like statically, it doesn’t even need a whole process. What you’re basically doing is chopping off the bits.
Sunil Mallya 00:43:36 You take a 32-bit precision, and you make it a 16-bit precision or you can make an eight bit precision and you’re done. Like you’re basically changing the precision of those floats in your model weights, and you’re done. Now distillation is actually a very interesting, and there are different kinds of technique. Distillation at a high level is where you take a large model, and you take the outputs of those large models and use that to train a small model. So what that means is its sort of a teacher-student relationship, the teacher model that knows a lot and can produce high quality data, which a small model just can’t because it has creativity limitations and because the number of parameters is fewer. So you take this large model, you generate a lot of output from that, and you use that to then train your small language model, which then can see equivalent performances.
Sunil Mallya 00:44:32 And there are a lot of examples of this. So if we look at the, what I talked about the Meditron, even like these models called this Open bio, even multilingual models, like what I’ve seen, there was this Taiwanese Mandarin model, again, like they used like large models, took a lot of data, and then trained, and the model was doing better than like GPT-4 and Claude et cetera. All because it was trained through distillation and so on. That’s a really practical approach and a lot of fine-tuning happens through distillation, which is generate the data. And then there can be a more complex version of distillation where you are training both models in tandem, so to speak, and you are taking the signals that the larger model learns and giving that to the smaller model to adapt. So they’re very complex ways of training and distillation as well.
Brijesh Ammanath 00:45:25 Okay. So distillation is the teacher-student model; that brings it to life, and you can intuitively understand it. Whereas quantization is taking a large model and chopping off bits; I'm struggling to understand that. How does that make it specific to a domain, or is this not related to a domain?
Sunil Mallya 00:45:41 No, it doesn’t. It does not. It just makes it smaller for you to deploy and manage. So it’s more of a cost performance, trade-off, cost-performance-accuracy, trade-off. It doesn’t make you like an expert model by any means.
Brijesh Ammanath 00:45:56 So it’s still a general-purpose model.
Sunil Mallya 00:45:57 Correct. But what we see, and there's a lot of this trend, is: let's say I train a 10 billion parameter model with X amount of data, versus training a hundred billion parameter model on the same data and then quantizing it. There are a lot of examples where taking the hundred billion parameter model and quantizing it down to the size of your 10 billion parameter model gives you better results than training the small one directly. Same purpose, same data, except you trained a larger model and then quantized it. There are people who have done that with a lot of success.
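A sketch of post-training quantization as "chopping off bits": casting weights from 32-bit to 16-bit floats halves the memory with no retraining. The layer sizes are arbitrary, and 8-bit or 4-bit schemes in practice use library-specific methods rather than a plain cast.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))  # toy "model"

def weight_bytes(m):
    return sum(p.numel() * p.element_size() for p in m.parameters())

print(f"fp32 weights: {weight_bytes(model) / 1e6:.1f} MB")
model.half()  # 32-bit floats -> 16-bit floats, in place
print(f"fp16 weights: {weight_bytes(model) / 1e6:.1f} MB")
# The parameter count is unchanged; only the precision, and therefore the
# memory footprint, shrinks.
```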
Brijesh Ammanath 00:46:27 Right. You also briefly mentioned model pruning when we discussed the differentiation between SLM and LLM attributes. Can you expand on what pruning is and how it works?
Sunil Mallya 00:46:39 Yeah, so when I talk about 10, so one thing we have to understand fundamentally is when I say 10 billion parameters, it doesn’t mean that 10 billion parameters are all storing good amount of data. They’re all needed equally to produce the result. And this is actually analogous to the human brain. Like it is predicted that the human brain only uses 13% of its entire capacity. Like the other 87% is just there. So, same way these models are sparse in nature. By sparse, I mean the best way to understand is, remember when I talked about these matrixes having zero weights? And as you train a model, like these numbers change. Like these numbers change and let’s say they increment, you’ve learned something that that parameter is non-zero. So when you look at a trained model, it doesn’t mean that all the models have gone, all the parameters have gone from zero to something meaningful.
Sunil Mallya 00:47:32 Like there are still a lot of parameters that are close to zero. So those don’t necessarily add anything meaningful to your ultimate output. So you can start to prune those models. Again, I’m, I’m trying to explain practically that is more nuance to this, but effectively that’s what’s happening. You are just removing those parts of the model that have not been activated or don’t contribute to activations as you run inference. So now suddenly a 10 billion parameter model can be pruned to like a three billion parameter model by doing that. That’s the general idea of pruning. But I would say pruning has become extremely less common as a strategy these days. Rather mixture of experts, as I talked initially in the podcast, that’s a more pragmatic way in which the model itself is sort of creating these specialized parts. Like in your training process you have a huge model, let’s say a 10 billion parameter model, but you’re creating these experts, and the experts are actually defining these paths that are history expert, math expert, coding expert, and so on. Like so these effectively sort of utilizing the space better while you train. So that’s more of a state in which we are moving. Not to say you cannot prune mixture of expert model and so on, but it’s less common that people do that. And a factor of that is how much efficient and faster GPUs and the underlying frameworks have become that you don’t necessarily need to bother with pruning.
Brijesh Ammanath 00:49:04 Alright, we have covered a lot of ground here. We've covered the basics of what SLMs are, looked at SLM attributes compared to LLMs, looked at enterprise adoption, and also looked at the architectural and technical distinctions and the training differences between SLMs and LLMs. As we wrap up, just a final couple of questions, Sunil. What emerging research areas are you most excited about for advancing SLMs?
Sunil Mallya 00:49:30 Love this question. I’ll talk about a few things that people have worked on and something that exciting that is emerging as well. Speed is actually a very important thing. Like when you think about vast number of applications that exist on the internet or people use speed is key. Like just because somebody something is AI powered, you’re not going to say like, oh you can give me the response in 60 minutes or 60 seconds. Like people still want things fast. So people have spent a lot of time on inference and making inference faster. So a big emerging research area is how to scale things at inference. There’s a technique that people have sort of developed. It’s called speculative decoding. Now this is very similar to people who understand like compilers and so on. how you have a speculative branching where youíre trying to guess where the code is going to jump next and so on.
Sunil Mallya 00:50:24 Same way in inference, while predicting the current token, you are also trying to get the next token in a speculative manner. So you’re basically in a single pass, you’re producing multiple tokens. Like which means now you can take like half the amount of time or 25% of the time it would take to produce the entire inference. But again, it’s speculative. Which means the accuracy takes a bit of hit, but you are getting faster inference. So that is a very, very exciting area. The others I would say like a lot of work has been done on device, how to deploy these SLMs on your laptop, on your RaspberryPi. That’s an extremely exciting area. Privacy, preserving way of deploying these LLMs. That’s a pretty active area and exciting for me, I’ll keep the most exciting. Is a couple of things I would say, which has started in the last maybe six months since the One series of models that open AI release, which are where the model actually is thinking based on its own outputs.
Sunil Mallya 00:51:29 Now, the best way to explain this is, how you probably worked out math problems in school where you have a rough sheet on the right-hand side, you are doing the nitty gritty details and then you’re bringing that into, substituting into your equations and so on. So you have this scratch pad of a lot of thoughts and a lot of rough work that you’re using to bring into your answer. The same way that is happening is these models are generating all these intermediate outputs and ideas and things that it can use to generate the final output. And that is super exciting because you’re starting to see high accuracy in a lot of complex tasks. But on the flip side, it is something that used to take us like five seconds for inference, starting to take five minutes.
Sunil Mallya 00:52:20 Or 15 minutes and so on, because you’re starting to generate a lot of these intermediate outputs or tokens that the model has to use. Now this whole entire paradigm is called inference time scaling. Now the larger the model you can imagine, the more time it takes to generate these tokens, a more compute footprint and so on. The smaller the model, you can do it faster and which is why I was talking about all these faster inference, et cetera, those start to come into picture because now you can generate these tokens in a faster manner, and you can start to use them to get higher accuracy at the end. So inference time scaling is an extremely exciting area. There are a lot of open-source models now that have come out that are able to support this. Second is, which is again like fresh off the press, there has been a lot of speculation on using reinforcement learning to train the models from scratch.
Sunil Mallya 00:53:19 So typically speaking in a training process, reinforcement learning has been used. So just to explain the training process, we do what is called a pre-training, where the model learns on self-supervised data and then we can talk about instruction tuning where the model is given certain instructions or human curated data. They train that. And then there’s reinforcement learning where the model is given reinforcement learning signals to, well I prefer the output in a certain way. Or you give signals to the model, and you train using that. But reinforcement learning was never used to train a model from scratch. People speculated it and so on. But with this DeepSeek R1 model, they’ve used reinforcement learning to train from scratch. That’s a whole new, that opens a whole new possibility of on how you would train. This is completely new. I’m yet to read the entire paper just as I said, it released a couple hours ago and I skimmed through it and it’s been always speculated, but they have put into research paper, and they’ve produced the results. So to me this is going to open a whole new way of how people train these models. And reinforcement learning is good at finding hacks on its own. So I wouldn’t be surprised where it is going to reduce the model size and have a material impact on these SLMs being even better. I’m extremely excited with these things.
Brijesh Ammanath 00:54:53 Exciting space. So you have spoken about speculative decoding, on-device deployment, inference-time scaling, and using reinforcement learning to train from scratch. Quite a few emerging areas. Before we close, was there anything we missed that you'd like to mention?
Sunil Mallya 00:55:09 Yeah, maybe I can close with a practical example I've been working on for three years that puts together all the things I've talked about. At Flip AI we're really an enterprise-first company, and we wanted the model to be practical across all the trade-offs I mentioned and to deploy on-prem or as SaaS, whichever option our customers wanted to choose; we wanted to give customers that flexibility and all the data governance aspects. And as we trained these models, none of the LLMs had the capability of doing anything in the observability data space, and this observability data is very particular to what a company has; you don't necessarily find this data out in the wild. So to train these models we used many of the techniques I talked through at the start of this podcast. First, we do pre-training.
Sunil Mallya 00:56:00 So we collect a lot of data from the internet in terms of like say Stack overflow logs that are available, et cetera. And then we put them to a rigorous data cleaning pipeline because you need high-quality data. So we spend a lot of time there to get high-quality data, but there’s only so much data that’s available. Like, so we curate, data that are human labeled. And we also do synthetic data generation similar to that distillation process that I talked about earlier. And then finally, what I like to say is the model trains and gets really good but doesn’t have practical knowledge. And to gain practical knowledge, what we do is we have created this gym, I call it this chaos gym. Maybe we have an internal code name, called “Otrashi,” and if you’re a South Indian native speaker of any of those languages [[Konkani and Kannada]] you’ll appreciate, which basically means chaos.
Sunil Mallya 00:56:55 And the idea is this chaos framework goes in, breaks, all these things, and the Flip model predicts the output and then we use reinforcement learning to align the model better on, hey, you made a mistake here or, hey, that’s good, you predicted it correctly, and then it goes and improves the model. So all these techniques, there’s no one answer that gives you performance out of your SLMs. You must use a mix of these techniques to bring all of this together. So whoever is building enterprise grade SLMs, I would advise them to think in similar manner. We’ve got a paper as well that is out. You can check it on our website that walks us through all of these techniques that we’ve used and so on. Overall, I would say I remain bullish on the SLMs because those are practical in how enterprises can bring and give utility to their end customers and LLMs don’t necessarily give them that flexibility all the time, and especially in a regulated environment, LLMs are just not an option.
Brijesh Ammanath 00:58:01 I’ll make sure we link to that paper in our show notes. Thank you, Sunil, for coming on the show. It’s been a real pleasure. This is Brijesh Ammanath for Software Engineering Radio. Thank you for listening.
[End of Audio]