SE Radio 673: Abhinav Kimothi on Retrieval-Augmented Generation

In this episode of Software Engineering Radio, Abhinav Kimothi sits down with host Priyanka Raghavan to explore retrieval-augmented generation (RAG), drawing insights from Abhinav’s book, A Simple Guide to Retrieval-Augmented Generation.

The conversation begins with an introduction to key concepts, including large language models (LLMs), context windows, RAG, hallucinations, and real-world use cases. They then delve into the essential components and design considerations for building a RAG-enabled system, covering topics such as retrievers, prompt augmentation, indexing pipelines, retrieval strategies, and the generation process.

The discussion also touches on critical aspects like data chunking and the distinctions between open-source and pre-trained models. The episode concludes with a forward-looking perspective on the future of RAG and its evolving role in the industry.

Brought to you by IEEE Computer Society and IEEE Software magazine.

Show Notes

Related Episodes

Other References

Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Priyanka Raghavan 00:00:18 Hi everyone, I’m Priyanka Raghaven for Software Engineering Radio and I’m in conversation with Abhinav Kimothi on Retrieval Augmented Generation or RAG. Abhinav is the co-founder and VP at Yanet, an AI powered platform for content creation and he’s also the author of the book,† A Simple Guide to Retrieval Augmented Generation . He has more than 15 years of experience in building AI and ML solutions, and if you’ll see today Large Language Models are being used in numerous ways in various industries for automating tasks, using natural languages input. In this regard, RAG is something that is talked about to enhance performance of LLMs. So for this episode, we’ll be using the book from Abhinav to discuss RAG. Welcome to the show Abhinav.

Abhinav Kimothi 00:01:05 Hey, thank you so much Priyanka. It’s great to be here.

Priyanka Raghavan 00:01:09 Is there anything else in your bio that I missed out that you would like listeners to know about?

Abhinav Kimothi 00:01:13 Oh no, this is absolutely fine.

Priyanka Raghavan 00:01:16 Okay, great. So let’s jump right in. The first thing, when I gave the introduction, I talked about LLMs being used in a lot of industries, but the first section of the podcast, we could just go over some of these terms and so I’ll ask you to define a few of those things for us. So what is a Large Language Model?

Abhinav Kimothi 00:01:34 That’s a great question. That’s a great place to start the conversation also. Yeah, so Large Language Model’s very important in a way, LLM is the technology that assured in this new era of artificial intelligence and everybody’s talking about it. I’m sure by now everybody’s familiar with ChatGPT and the likes. So these applications, which everybody’s using for conversations, text generation, etc., the core technology that they are based on is a Large Language Model, an LLM as we call it.

Abhinav Kimothi 00:02:06 Technically LLMs are deep learning models. They have been trained on massive volumes of text and they’re based on a neural network architecture called the transformers architecture. And they’re so deep that they have billions and in some cases trillions of parameters and hence they’re called large models. What it does is that it gives them unprecedented ability to process text, understand text and generate text. So that’s sort of the technical definition of an LLM. But in layman terms, LLMs are sequence models, or we can say that they’re algorithms that look at a sequence of words and are trying to predict what the next word should be. And how they do it is based on a probability distribution that they have inferred from the data that they have been trained on. So think about it, you can predict the next word and then the word after that and the word after that.

Abhinav Kimothi 00:03:05 So that’s how they’re generating coherent text, which we also call natural language and health. They are generating natural language.

Priyanka Raghavan 00:03:15 That’s great. Another term that’s always used is prompt engineering. So we’ve always, a lot of us who go on ChatGPT or other kind of agents, you just type in normally, but then you see that there’s a lot of literature out there which says if you are good at prompt engineering, you can get better results. So what is prompt engineering?

Abhinav Kimothi 00:03:33 Yeah, that’s a good question. So LLMs differ from traditional algorithms in the sense that when you’re interacting with an LLM, you’re interacting not in code or not in numbers, but in natural language text. So this input that you’re giving to the LLM in form of natural language or natural text is called a prompt. So think of prompt as an instruction or a piece of input that you’re giving to this model.

Abhinav Kimothi 00:03:58 In fact, if you go back to early 2023, everybody was saying, hey, English is the new programming language because these AI models, you can just chat with them in English. And it may seem a bit banal if you look at it from a high level that hey, how can English now become a programming language? But it turns out the way you are structuring your instructions even in English language, has a significant effect of on the kind of output that this LLM will produce. I mean English may be the language, but the principles of logic reasoning they stay the same. So how you craft your instruction that becomes very important. And this ability or the process of crafting the right instruction even in English language is what we call prompt engineering.

Priyanka Raghavan 00:04:49 Great. And then obviously the other question I have to ask you is also there’s a lot of talk about this term called context window. What is that?

Abhinav Kimothi 00:04:56 As I said, LLMs are sequence models. They’ll look at a sequence of text and then they will generate some text after that. Now this sequence of text cannot be infinite and the reason why it can’t be infinite is because of how the algorithm is structured. So there is a limit to how much text can the model look at in terms of the instructions that you’re giving it and then how much text can it generate after that. So this constraint on the number of, well it’s technically called tokens, but we’ll use words. So number of words that the model can process in one go is called the context window of that model. And we started with very less context windows, but now they are models that have context window of two lacks and three lacks. So, can process two lack words at a time. So that’s what the context window term means.

Priyanka Raghavan 00:05:49 Okay. I think now would be a good time to also talk about what is hallucination and why does it happen in LLMs. And when I was reading your book, the first chapter, you give a very nice example if there are listeners on the show. We have a listenership from all over the world, but I had a very nice example on your book on what is hallucination and why it happens, and I was wondering if you could use that. It is with respect to trivia on Cricket, which is a game we play in the subcontinent, but maybe you could explain what is hallucination using that?

Abhinav Kimothi 00:06:23 Yeah, yeah. Thank you for bringing that up and appreciating that example. Let me first give the context of what hallucinations are. So hallucination means that whatever output the LLM is producing, it is actually incorrect and it has been observed that in a lot of cases when you ask an LLM a question, it’ll very confidently give you a reply.

Abhinav Kimothi 00:06:46 And if the reply consists of a factual information as a user, you will believe that factual information to be accurate, but it is not guaranteed and in some cases it might just be fabricated information and that is what we call hallucinations. Which is this characteristic of an LLM to sometimes respond confidently with inaccurate information. And like the example of the Cricket World Cup that you were mentioning is, so ChatGPT 3.5, or GPT 3.5 model was trained up till sometime in 2022. So that’s when the training of that model happened, which means that, all the information that was given to this model while training was only up till that point. So if I ask that model a question about the cricket World Cup that happened in 2023, it sometimes gave me incorrect response. It said India won the World Cup when in fact Australia had won it and it gave it very confidently, it gave the score saying India defeated England by so many runs, etc. which is absolutely not true, which is false information, which is an example of what hallucinations are and why do hallucinations happen.

Abhinav Kimothi 00:08:02 That is also a very important aspect to understand about LLMs. At the outset, I’d like to mention that LLMs are not trained to be factually accurate. As I said, they are just looking at the probability distribution, in very simplistic terms, they are looking at the probability distribution of words and then trying to predict what the next word in the sequence is going to be. So nowhere in this construct are we programming the LLM to also do a factual verification of the claims that it is making. So inherently that’s not how they have been trained, but the user expectation is that they should be factually accurate and that’s the reason why they’re criticized for these hallucinations. So if you ask an LLM a question about something that is not public information, some data that they might not be trained on, some confidential information about your organization or you as an individual, the LLM has not been trained on that data.

Abhinav Kimothi 00:09:03 So there is no way that it can know that particular snippet of information. So it’ll not be able to answer that. But what it does is it generates actually inaccurate answer. Similarly, these models take a lot of data and time to train. So it’s not that they’re real time, they’re updating in real time. So there is a knowledge cutoff date also with the LLM. But despite all of that, despite these characteristics of training an LLM, even if they have the data, they might still generate responses that are not even true to the training data because of the nature of training. They’re not trained to replicate information, they’re just trying to predict the next word. So these are the reasons why hallucinations happen and there has been a lot of criticism of LLMs and initially they were also dismissed saying, oh, this is not something that we can apply in real world.

Priyanka Raghavan 00:10:00 Wow, that’s interesting. I never expected that even when the data is available that it could also be factually incorrect. Okay, that’s interesting note. So, and this would be a perfect time to actually get into what is RAG. So can you explain that to us as what is RAG and why is there a need for RAG?

Abhinav Kimothi 00:10:20 Right. Let’s start with the need for RAG. We’ve talked about hallucinations. The responses may be suboptimal is in, they might not have the information or they might have incorrect information. In both cases the LLMs are not usable in a practical scenario, but it turns out that if you are able to provide some information in the prompt, the LLMS adhere to that information very well. So if I’m able to, again taking the Cricket example, say hey, who won the Cricket World Cup? And within that prompt I also paste the Wikipedia page of 2023 Cricket World Cup. The LLM will be able to process all that information and find out from that information that I’ve pasted in the prompt that Australia was the winner and hence it’ll be able to correctly give me the response so that maybe, a very naive example like pasting this information in the prompt and getting the result. But that is sort of the fundamental concept of RAG. The fundamental idea behind RAG is if the LLM is provided with the information in the prompt, it’ll be able to respond with a much higher accuracy. So what are the different steps that this is done in? If I were to kind of visualize a workflow, suppose you’re asking a question to the LLM now instead of sending this question directly to the LLM, if this question can search through a database or a knowledge base where information is stored and fetch the relevant documents, these documents can be word documents, JSON files, any text documents, even the internet, and fetch the right information from this knowledge base or database.

Abhinav Kimothi 00:12:12 Then along with this user question, send this information to the LLM. The LLM will then be able to generate a factually correct response. So these three steps of fetching and retrieving the correct information, augmenting this information with the user’s question and then sending it to the LLM for generation is what encompasses retrieval augmented generation in three steps.

Priyanka Raghavan 00:12:43 I think we’ll probably deep dive into this in the next section of the podcast, but before that, what I wanted to ask you was, would you be able to give us some examples in industries which are using RAG

Abhinav Kimothi 00:12:52 Almost everywhere that you are using LLM, an LLM where there is a requirement to be factually accurate. RAG is being employed in some shape and form something that you might be using in your daily life if you are using the search functionality on ChatGPT or if you’re uploading a document to ChatGPT and sort of conversing with that document.

Abhinav Kimothi 00:13:15 That’s an example of a RAG system. Similarly, today, if you go and ask for something on Google, you search something on Google, on the top of your page, you will get a summary, sort of a textual summary of the result, which is sort of an experimental feature that Google has launched. That is a prime example of RAG. It is looking at all the search results and then passing that search, those search results to the LLM and generating a summary out of that. So that’s an example of RAG. Apart from that, a lot of Chat bots today are based on that because if a customer is asking for some support, then the system can look at support documents and respond with the right item. Similarly, with virtual assistance like Siri have started using a lot of retrieval in their workflow. It’s being used for content generation, question answering system for enterprise knowledge management.

Abhinav Kimothi 00:14:09 If you have a lot of information on your SharePoint or in some collaborative workspace, then a RAG system can be built on this collaborative workspace so that users don’t have to search through and look for the right information, they can just ask a question and get that knowledge snippets. So it’s been used in healthcare, in finance, in legal, almost in all the industries, a very interesting use cases. Watson AI was using this for commentary during the US open tennis tournament because you can generate commentary, you have live scores coming in. So that is one thing that you can pass to the LLM. You have information about the player, about the match, what is happening in other matches, all of that. So there’s information you pass to the LLM and it’ll generate a coherent commentary, which then from text to speech models can also be converted into speech.

Abhinav Kimothi 00:15:01 So that’s where RAG systems are being used today.

Priyanka Raghavan 01:15:04 Great. So then I think that’s a perfect segue for me to also ask you one last question before we move to the RAG enabled design, which I want to talk about. The question I wanted to ask you is like is there a way humans can get involved to make the RAG perform better?

Abhinav Kimothi 00:15:19 That’s a great question. I feel the state of the technology as it stands today, there is a need of a lot of human intervention to build a good RAG system. Firstly, the RAG system is as good as your data. So the curation of data sources, like which data sources to look at, whether it’s your file systems, whether open internet access is allowed, which websites should be allowed over there, if is the data in the right as the garbage in the data, has it been processed correctly?

Abhinav Kimothi 00:15:49 All of that is one aspect in which human intervention becomes very important today. The other is in a degree of verification of the outputs. So RAG systems exist, but you can’t expect them to be a hundred percent foolproof. So until you have achieved that level of confidence that hey, your responses are fairly accurate, there is a certain degree of manual evaluation that is required of your RAG system. And then at every component of RAG, whether your queries are getting aligned with the system, you need a certain degree of evaluation. There is this whole idea of which is not specific to RAG, but reinforcement learning based on human feedback, which goes by the acronym RLHF. That is another important aspect that human intervention is required in RAG systems.

Priyanka Raghavan 00:16:47 Okay, great. So the humans can be used in both to find out how the data is going into the system as well as like verifying the output and also the RAG enabled design as well. You need the humans to actually create the thing.

Abhinav Kimothi 00:17:00 Oh, absolutely. It can’t be done by AI yet. You need human beings to build the system of course.

Priyanka Raghavan 00:17:05 Okay. So now I’d like to ask you about what the key components required to build a RAG? You talked about the retrieval part, the augmentation part and the generation part. Yeah, so maybe you could just paint a picture for us on that.

Abhinav Kimothi 00:17:17 Right. So like you said, these three components, like you need a component to retrieve the right information, which is done by a set of retrievers where is an innovative term, but it’s done by retrievers. Then once the documents are retrieved or information is retrieved, then there is a component of augmentation where you are putting the information in the right format. And we talked about prompt engineering. So there is a lot of aspect of prompt engineering in this augmentation step.

Abhinav Kimothi 00:17:44 And then finally it’s the generation component, which is the LLM. So you’re sending this information to the LLM that becomes your generation component and these three in combination form the generation pipeline. So this is how the user interacts with the system real time, this is that workflow. But if you think sort of one level deeper into this, there is this entire knowledge base that the retriever is going and searching through. So creation of this knowledge base also becomes an important component. So this knowledge base is a key component of your RAG system and creation of this knowledge base is done through another pipeline known as the indexing pipeline, which is sort of connecting to the source data systems and processing that information and storing it in a specialized database format called vector databases. This is largely an offline process, a non-real-time process. You curate this knowledge base.

Abhinav Kimothi 00:18:43 So that’s another component. These are the core components of this RAG system. But what is also important is evaluation, right? Is your system performing well or you put in all this effort created the system and is it still hallucinating? So you need to evaluate whether your responses are correct. So evaluation becomes that another component in your system. Apart from that security privacy, these are aspects that become even more important when it comes to LLMs because as we are entering this age of artificial intelligence, and more and more processes will start getting automated and reliant on AI systems and AI agents. Data privacy becomes a very important aspect. Your guard railing against attacks, malicious attacks, this becomes a very important context. And then to manage everything interacting with the user, there needs to be an orchestration layer, which is sort of playing the role of that conductor amongst all these different components.

Abhinav Kimothi 00:19:48 So these are the core components of our system, but there are other systems, other layers that can be part of the system, sort of experimentation and data training and other models. So those are more like software architecture layers that you can also build around this RAG system.

Priyanka Raghavan 00:20:07 One of the big things about the RAG system is of course the data. So tell us a little bit about the data, like you have multiple sources, does data have to be in a specific format and how are they ingested?

Abhinav Kimothi 00:20:21 Right. You need to first define what your RAG system is going to talk about, what your use case is. And based on the use case the first step is the curation of data sources, right? Which source systems should it connect to? Is it just a few PDF files? Is it your entire object store or your file sharing system? Is it the open internet? Is it like a third-party database? So first step is curation of these data sources, what all should be a part of your RAG system. And RAG works best or even like when we are using LLMs, the key use case of LLMs is unstructured data. For structured data you already have everything solved almost, right? Like in traditional data science you have solved for structured data. So works best for unstructured data. So unstructured data goes beyond just text is images and videos and audios and other files. But let me just for simplicity’s sake talk about text. So the first step would be when you are ingesting this data to store it in your knowledge base, you need to also do a lot of pre-processing saying okay, is all the information useful? Are we unnecessarily extracting information? Like for example, if you have a PDF file, what sections of the PDF file are you extracting?

Abhinav Kimothi 00:21:40 Or an HTML is a better example, like are you extracting the entire STML code or just the snippets of information that you really need. So another step that becomes really important is called chunking, chunking of the data. And what chunking means is that you might have documents that run into hundreds and thousands of pages, but for effective use in a RAG system, you need to sort of isolate information, or you need to break this information down into smaller pieces of text. And there are very many reasons why you need to do that. First is the context window that we talked about. You can’t fit a million words in the context window. The second is that search happens better if you have smaller pieces of text, right? Like you can more effectively search on a smaller piece of text than an entire document. So chunking becomes very important.

Abhinav Kimothi 00:22:34 Now all of this is text, but computers work on numerical data, right? They work on numbers. So this text has to be converted into a numerical format. And traditionally there have been very many ways of doing that. Text processing is being done since ages. But one particular data format that has gained prominence in the NLP domain is embeddings. It’s called embeddings. And embeddings are simply, it’s converting text into numbers, but embeddings are not just numbers, they’re storing text in a vector form. So it’s a series of numbers, it’s an area of numbers and why it becomes important, there are reasons for that is because it becomes very easy to calculate similarity between text when you’re using vectors and therefore embeddings become an important data format. So all your text needs to be first chunked and these chunks then need to be converted into embeddings and so that you don’t have to do it every time you are asking a question.

Abhinav Kimothi 00:23:41 You also need to store these embeddings. And these embeddings are then stored in specialized databases that have become popular now, which are called vector databases, which are sort of databases that are efficient in storing embeddings or vector form of data. So this entire flow of data from source system into your vector database forms the indexing pipeline. Okay. And this becomes a very crucial component of your RAG system because if this is not optimized and this is not performing well then, your RAG system cannot be, your generation pipeline cannot be expected to do well.

Priyanka Raghavan 01:24:18 Very interesting. So I wanted to ask you, I was just thinking about it was not my original list of questions. When you talk about this chunking, what happens is if the chunking, like suppose you, you’ve got a big sentence like Priyanka is intelligent and Priyanka is gets into one chunk and intelligent goes into another chunk. I don’t know, do you have like this distortion of the sentence because of chunking is?

Abhinav Kimothi 00:24:40 Yeah, I mean that’s a great question because it can happen. So there are different chunking strategies to deal with it, but I’ll talk about the simplest one that helps prevent this, helps maintain that context is that between two chunks you also maintain some degree of overlap. So it’s like if I say Priyanka is a good person and my chunk size is two words for example, so Priyanka is a good person, but if I maintain an overlap, so it’ll become Priyanka is a good person. So that ìaî is in both the chunks. So if I expand this idea then first of all I will chunk only at the end of sentence. So I don’t, I don’t break a sentence completely and then I can have overlapping sentences in adjacent chunk so that I don’t miss the context.

Priyanka Raghavan 00:25:36 Got it. So when you search, you’ll be like searching on both the places where like to your nearest neighbors, whatever would that be?

Abhinav Kimothi 00:25:45 Yeah. So even if I retrieve one chunk, the last sentences of the previous chunk will come. And the first few sentences of the next chunk will come. Even if I’m retrieving a single chunk.

Priyanka Raghavan 00:25:55 Okay, that’s interesting. So I think some of us who have been say software engineers for like quite some time, I think we’ve had a very similar concept also in terms of we’ve had this, like I used to work in the oil and gas industry. So we used to do these kinds of triangulations when we actually in graphics programming where you actually end up rendering a chunk of the earth’s surface, for example. So like there might be different types of rocks and so like this where one rock differs from another, like that will be shown in triangulation just as an example. And so what happens is that when you actually do the indexing for that data, when you’re actually rendering something on the screen, you actually have the previous surface as well as the next surface as well. So I was just seeing that just clicked.

Abhinav Kimothi 00:26:39 Something very similar very similar happens in chunking also. So you are maintaining context, right? You’re not losing information that was there in the previous part. You’re maintaining this overlap. So that context is sort of, it holds together.

Priyanka Raghavan 00:26:52 Okay, that’s very interesting to know. I wanted to ask you also in terms of, since you’re dealing with a lot of text, I’m assuming that performance is also a big issue. So do you have like caching? Is that something that’s also a big part of the RAG enabled design?

Abhinav Kimothi 00:27:07 Yeah. Caching is very important. What kind of vector database you are using becomes very important. What kind of, so when you are searching and retrieving information, what kind of retrieval methodology or retrieval algorithm you are using becomes very important and more so in case when we are dealing with LLMs, because every time you are going to the LLM, you’re incurring a cost. Because every time it is computing you’re using your resources. So chunk size also plays an important role. Like if I’m giving large chunks to the LLM, you are incurring more costs. So number of chunks you have to optimize. So there are several things that play a part to improve the performance of the system. So there’s a lot of experimentation that needs to be done vis-a-vis the user expectations costs. So you need, so users want answer immediately. So your system cannot have latency, but LLMs inherently introduce a latency to the system and if you are adding a layer of retrieval before going to LLM, that again increases the latency of the system. So you have to optimize all of this. So caching, as you said, has become an important part in all generative AI application. And it’s not just caching like regular caching, it’s something called semantic caching where you’re not just caching queries and searching for the exact queries, you are also going to the cache if the query is somewhat similar to the cached query. So if the semantic meaning of the two queries is the same, you go to the cache instead of going through the entire workflow.

Priyanka Raghavan 00:28:48 Actually. So we’ve looked at two different parts of like the data sources chunking and we talked about, caching. So let me now talk a little bit about the retrieval part. How do you do the retrieving? Is the indexing pipeline helping you with the retrieving?

Abhinav Kimothi 00:28:59 Right. Retrieval is the core component of RAG system. Like without retrieval there is no RAG. So how that happens, let’s talk about how you search things, right? Like the simplest form of searching text is your Boolean search. Like if I press Control F on my word processor and I type a word, the exact matches will get highlighted, right? But there is lack of context in that. So that’s the simplest form of searching. So think of it like if I am asking a query who won the 2023 Cricket World Cup and that exact phrase is present in a document, I can do a Control F search for that, fetch that and pass that to the LLM, right? Like that will be the simplest form of search. But practically that does not work because the question that the user is asking will not be present in any document. So what do we have to do now? We have to do like sort of a semantic search.

Abhinav Kimothi 00:29:58 We have to grasp the meaning of the question and then try to find out, okay, which documents might have the similar answer or which chunks might have the similar answer. Now that is done, the most popular way of doing that is through something called cosine similarity. Now how is that done is I talk about embeddings, right? Like your data, your text is converted into a vector. So vector is a series of numbers that can be plotted in an end dimensional space. Like if I look at a graph paper, a two-dimensional sort of X axis and Y axis, a vector will be X,Y. So my query also needs to be converted into a vector form. So the query goes to an embedding algorithm and is converted into a vector form. Now this query is then plotted on the same vector space in which all the chunks are also there.

Abhinav Kimothi 00:30:58 And now you are trying to calculate which chunk, the vector of which chunk is closest to this query. And that can be done through, that’s a distance calculation like in vector algebra or in coordinate geometry. That can be done through L1, L2, L3 distance calculations. But what is the most popular way of doing that today in RAG systems is through something called cosine similarity. So what you’re trying to do is between these two vectors, your query vectors and the document vectors, you are trying to calculate the cosine of the angle between them, angle from the origin. Like if I draw a line from the origin to the vector, what is the angle between? So if it’s zero means, if it’s exactly similar, cause zero will be one, right? If it’s perpendicular, orthogonal to your query, which means that there is absolutely no similarity cosine will be zero.

Abhinav Kimothi 00:31:53 And if it’s like exactly opposite, it’ll be minus one something, like that, right? So then this is the way how identify which documents or which chunks are similar to my query vector, similar to my question. So then I can retrieve one chunk, or I can retrieve top five chunks or top two chunks. I can also have a cutoff that, hey, if the cosine similarity is less than 0.7, then just say that I could not find anything that is similar and then I retrieve these chunks and then I can send it to the LLM for further processing. So this is how retrieval happens and there are different algorithms, but this embedding-based cosine similarity is one of the more popular ones, mostly used everywhere today in RAG systems.

Priyanka Raghavan 00:32:41 Okay. This is really good. And I think the question I had on how similarities calculated is answered now because you talked about using this cosine for actually doing the similarity. Now that we’ve talked about the retrieval, I want to dive a bit more into the augmentation part and here we talk briefly about prompt engineering when we did the introduction, but what are the different types of prompts that can be given to get better results? Can you maybe talk us through that? Because there’s a lot of literature in your book also where you talk about different types of prompt engineering.

Abhinav Kimothi 00:33:15 Yeah, so let me mention a few prompt engineering techniques because that’s what the augmentation step more commonly is about. It’s about prompt engineering, though there is also component of fine tuning, which, but that becomes really complex. So let’s just think of augmentation as putting the user query and the retrieve chunks or retrieve documents together. So simple way of doing that is, hey, this is the question answer only based on these chunks, and I paste that in the prompt, send that to the LLM and LLM response. So that’s the simplest way of doing it. Now sometimes let’s think about it, what happens if that answer to the question is not there in the chunks? The LLM might still hallucinate. So another way of dealing with that very intuitive way of dealing with that is saying, hey, if you can’t find the answer, just say, I don’t know, with the simple instruction, the LLM is able to process it and if it does not find the answer, then it’ll sort of generate that result. Now, if I want the answer to be in a certain format saying, what is the sentiment of this particular piece of chunk? And I don’t want positive, negative, I won’t say for example, angry, jealous, something like this, right? And if I have specific categorizations in my mind, let’s say I want to categorize sentiments into A, B and C, but the LLM does not know what A, B and C are, I can give examples in the prompt itself.

Abhinav Kimothi 00:34:45 So what I can say is identify the sentiment in this retrieved chunk and here are a few examples of what sentiments look like. So I paste a paragraph and then say sentiment is A, I paste another paragraph and I say sentiment is B. Turns out that language models are excellent at adhering to these examples. This is something that is called few short promptings, few short means that I am giving a few examples within the prompt so that the LLM responds in the similar manner as my examples. So that’s another way of sort of prompt augmentation. Now there are other techniques, something that has become very popular in reasoning models today, which is called chain of thought. It basically provides the LLM with the way it should reason through the context and provide an answer. Like for example, if I were to ask who the best team of the ODI World Cup and then I also give it a set of instructions saying hey, this is how you should reason step by step, that is prompting the LLM to sort of think like not generate answer at once but think about what the answer should be. That is something called a chain of thought reasoning. And there are several others, but these are the ones that are mostly popular and used in RAG system.

Priyanka Raghavan 00:36:06 Yeah, in fact I’ve been, doing this for course just to understand, get better prompt engineering. And one of the things I learned was also like we I working as an example of a data pipeline, you’re trying to use LLMs to produce SQL query for a database. And I found that exactly what you’re saying like if you had given like some example queries on how it should be given, this is the database, this is like the data model, these are the particular examples. Like if I ask you what is the product with the highest review rating and I give it an example of what the SQL query is, then I feel that the answers are much better than if I were to just ask the question like, can you please produce an SQL query for what is the highest rating of a product? So I think it’s quite fascinating to see this, the few shots prompting, which you talked about, but also the chain of thought reasoning. It also helps with debugging, right? To see how it’s working.

Abhinav Kimothi 00:36:55 Yeah, absolutely. And there’s several others that you can experiment with and see if it works for your use case. But prompt engineering is also not an exact science. It’s based on how well the LLM is responding in your particular use case.

Priyanka Raghavan 00:37:12 Okay, great. So the next thing which I want to talk about, which is also in your book, which is Chapter four, we talk about generation, how the responses are generated based on augmented prompts. And here you talk about the concept of the models which are used in the LLM s. So can you tell us what are these foundational models?

Abhinav Kimothi 00:37:29 Right, so as we said LLMS, they are models that are trained on massive amounts of data, billions of parameters, in some cases, trillions of parameters. They are not easy to train. So we know that OpenAI has trained their models, which is the GPT series of models. Meta has trained their own models, which are the LAMA series. Then there is Gemini, there is Mistral, these large models which have been trained on data. These are the foundation models, these are sort of the base models. These are called pre-trained models. Now, if you were to go to ChatGPT and see how the interaction happens, LLMS as we said are text prediction models. They are trying to predict the next words in a sequence, but that’s not how ChatGPT works, right? It’s not like you’re giving it an incomplete sentence and it is completing that sentence. It’s actually responding to the instruction that you have given to it. Now, how does that happen? Because technically LLMs are just next word prediction models.

Abhinav Kimothi 00:38:35 So how that is done is through something called fine tuning, which is instruction fine tuning. So how that happens is that you have a data set in which you have instructions or prompts and examples of what the responses should be. And then there is a supervised learning process that happens so that your foundation model now starts generating responses in this, in the format of the example data that you have provided. So these are fine-tuned models. So, what you can also do is if you have a very specific use case, for example complex things like medicine or law where the terminology is very specific is that you can take a foundation model and fine tune it for your specific use case. So this is a choice that you can make. Do you want to take a foundation model for your RAG system?

Abhinav Kimothi 00:39:31 Do you want to fine tune it with your own data? So that’s one way in which you can look at the generation component and the models. The other ways to look at also is whether you want a large model or a small model, whether you want to use a proprietary model, which is like OpenAI has not made their model public, so nobody knows what are the parameters of those models, but they provide it to you through an API. So, but the model is then managed by OpenAI. So that’s like a proprietary model, but there are also open-source models where everything is given to you, and you can host it on your system. So that’s like an open-source model that you can host it on your system or there are other providers that provide you with APIs for those open-source modelers. So that’s also a choice that you need to make. Do you want to go with a proprietary model or do you want to take an open source model and use it the way you want to use it. So that’s sort of the decision making that you have to do in the generation component.

Priyanka Raghavan 00:40:33 How do you decide whether you want to go for open source versus a proprietary model? Is it a similar decision like as software developers we also go between, sometimes you have these open-source libraries versus something that you can actually buy a product. Like you can use a bunch of open-source libraries and build a product yourself or just go and buy something and then use that to do your flow. How is that? Is it a very similar way that you would think as the decision making between a pre-trained model versus an open source?

Abhinav Kimothi 00:41:00 Yeah. I would think of it in a similar fashion. Whether you want to have that control of owning the entire thing, hosting that entire thing, or you want to outsource it to the provider, right? Like that is one way of looking at it, which is very similar to how you would make the decision for any software product that you’re developing. But there is another important aspect which is around data privacy. So if you are using a proprietary model that the prompt along with that prompt whatever you’re sending goes to their servers, right? They are going to do the inferencing and send the response back to you. But if you are not comfortable with that and you want everything to be in your environment, then there is no other option but for you to host that model yourself. And that is only possible for open-source models. Another way is that if you really want to have the control over fine tuning the model, because what happens in proprietary models is you just give them the data and they will do everything else, right? Like you give them the data that this is the data that needs to be, the model needs to be fine-tuned on and then open AI providers will do that for you. But if you really want to sort of customize even the fine-tuning process of the model, then you need to do it in-house. So that’s where open-source models become important. So those are the two caveats that I will put apart from all the regular software application development decision making that you do.

Priyanka Raghavan 00:42:31 I think that’s a brilliant answer. I mean I’ve understood it because it’s the privacy angle as well as the fine-tuning angle is a very good rule of thumb I think for people who want to decide on using Ether. Now that we’ve talked a little bit just dipped into like the RAG components, I wanted to ask you about how do you do monitoring of a RAG system that you would do in a normal system that you have, you have a lot of, anything goes wrong, you need to have the monitoring to the logging to find out. How does that happen with the RAG system? Is it pretty much the same thing that you would do as for normal software systems?

Abhinav Kimothi 00:43:01 Yeah, so all the components of monitoring that you would imagine in a regular software system, all of that hold true for a RAG system also. But there are also some additional components that we should be monitoring and that also takes me to the evaluation of the RAG system. So how do you evaluate a RAG system whether it is performing well and then how you do you monitor whether it continues to perform well or not? And when we talk about evaluation of RAG systems, let’s think of it in terms of three components. One is, component one is the user’s query, the question that is being asked. Component two is the answer that the system is generating. And component three is the documents or the chunks that the system is retrieving. Now let’s look at the interaction of these three components. Let’s look at the user query and the retrieved documents. So the question that I might ask is, are the documents that are being retrieved aligned to the query that the user is asking? So I will need to evaluate that and there are several metrics there. So my RAG system should actually be retrieving information that is as per the question that is being asked. If it is not, then I have to improve that. The second sort of dimension is the interaction between the retrieve documents and the answer that the system is generating.

Abhinav Kimothi 00:44:27 So when I pass these retrieve documents or retrieve chunks to the LLM, does it really generate the answers based on those documents or is it generating answers from elsewhere? That’s another dimension that needs to be evaluated. This is called the faithfulness of the system. Whether the generated answer is rooted in the documents that are being retrieved. And then the final component to evaluate is between the question and the answer, like is the answer really answering the question that was being asked? So is there relevance between the answer and the question that was being asked? So these are the three components of RAG evaluation and there are several metrics in each of these three dimensions and they need to be monitored, going forward. But also think about this, what happens if the nature of queries change? So I need to monitor if the queries that are now coming to the system, are the same or similar to the queries that the system was built on or built for.

Abhinav Kimothi 00:45:36 So that’s another thing that we need to monitor. Similarly, if I’m updating my knowledge base, right? So are the documents in the knowledge base similar to how it was initially created or do I need to go revisit that? So sort of as the time progresses, is there a shift in the query, is there a shift in the documents so that those are some additional components of observability and monitoring as we go into production. I think that was the part, which is I think Chapter five of your book, which I also found very interesting because you also talked a little bit about benchmarking there to see how the pipelines work better to see how the models perform, which was great. Unfortunately we are close to the end of the session, so I have to ask you a few more questions to sort of round off this and we’ll probably have to bring you back for more on the book.

Priyanka Raghavan 00:46:30 You talked a little bit about security in the introduction and I wanted to ask you, in terms of security, what needs to be done for a RAG system? What should you be thinking about when you are building it up?

Abhinav Kimothi 00:46;42 Oh yeah, that’s an important thing that we should discuss. And first of all, I’ll be very happy to come on again and talk more yeah about RAG. But when we talk about security and, the regular security, data security, software security, those things still hold for RAG systems also. But when it comes to LLMs, there is another component of prompt injections. What has been observed is that malicious actors can prompt the system in a way that the system starts behaving in an abnormal manner. The model itself starts behaving in an abnormal manner that we can think about it as a lot of different things that can be done, answering things that you’re not supposed to answer, revealing confidential data, start generating responses that are not safe for work, things like that.

Abhinav Kimothi 00:47:35 So the RAG system also needs to be protected against prompt injections. So one way in which prompt injections can be done is direct prompting. Like, in ChatGPT I can directly do some kind of prompting that will change the behavior of the system. In RAG it becomes more important because these prompt injections can be there in the data itself, the database that I’m looking for. So that’s like an indirect sort of injection. Now how to defend against them, there’s several ways of doing that. First is you build guardrails around what your system can and cannot do when the input is coming, when an input prompt is coming, you sort of don’t pass that directly to the LLM for generation, but you do a sanitization there, you do some checks there. Similarly for the data, you need to do that. So guard railing is one aspect. Then, there’s also processing of sometimes, there are some special characters that are added to the problems or the data which might makes the LLM behave in an undesired manner. So all this removal of, unwanted characters, unwanted spaces, that also becomes an important part. So that’s another layer of security that I would put in. But mostly all the things that you would put in a data system, a system that uses a lot of data, all that become very important in RAG systems also. And this defense against prompt injections is another aspect of security that should be cognizant of.

Priyanka Raghavan 00:49:09 I think the OASP organization has come up with this OASP Top 10 for LLMs. So they talk a lot bit about how do you mitigate against these attacks like prompt injection, like you said, input validation, data poisoning, how to mitigate against that. So that’s something I’ll add to the show notes so people can look at that. The last question I want to ask you is about the future of RAG. So it’s like two questions on that. One is, what do you think are the challenges that you see in RAG today and how will it improve? And when you talk about that, can also talk a little bit about what is Agentic RAG or A-G-E-N-T-I-C and RAG. So tell us about that.

Abhinav Kimothi 00:49:44 There are several challenges with RAG systems today. There are several kind of queries that that vanilla RAG systems are not able to solve. There is something called multi hop reasoning in which, you are not just retrieving a document and answer, you will find the answer there, but you have to go through several iterations of retrieval and generation. For example, if I were to ask the celebrities that endorse brand A, how many of them also endorse brand B? Now it’s unlikely that this information will be present in one document. So what the system will have to do is first of all infer that this will not be present in one document and then sort of establish the connections between documents to be able to answer a question like this. This is sort of a multi hop reasoning. So you first hop onto one document, find out information from there, go to another document and get the answer from there. This is sort of very effectively being done by another variant of RAG called Knowledge Graph Enhanced RAGs. So knowledge graphs are these storage patterns in which, you establish relationships between entities and so when it comes to answering related questions or questions that are related and not just present in one place, itís an area of deep exploration. So Knowledge Graph Enhanced RAG is one of the directions which RAG is moving.

Abhinav Kimothi 00:51:18 Another direction that RAG is moving in is taking in multimodal capabilities. So not just being able to process text, but also being able to process images. That’s where we are right now in processing images. But this will continue to expand to audio, video and other formats of unstructured data. So multimodal RAG becomes very important. And then like you said, agentic AI is sort of the buzzword and also the direction in which is a natural progression for all AI systems to move towards or LLM based systems to move towards and RAG is also going in that direction. But these are not competing things, these are complementary things. So what does agentic AI mean? In very simple words, and this is gross oversimplification of things, but if my LLM is given the capability of making decisions autonomously by providing it memory in some way and access to a lot of different tools like external APIs to take actions, that becomes an autonomous agent.

Abhinav Kimothi 00:52:29 So my LLM can reason, can plan, knows what has happened in the past and then can take an action through the use of some tools that is an AI agent very simplistically put. Now think about it in terms of RAG. So what can be done? So agents can be used at every step, right? For processing of data, whether my data has useful information or not, what kind of chunking needs to be done? I can store my information in different, not in just one knowledge base, but I can have several knowledge bases and depending on the question, I can pick and choose an agent can pick and choose which storage component should I fetch from. Then when it comes to retrieval, how many times should we retrieve? Do I need to retrieve more? Are there any additional things that I need to look at?

Abhinav Kimothi 00:53:23 All these decisions can be made by an agent. So at every step of my RAG workflow, what I was doing in a simplistic manner can be further enhanced by putting in an agent there, putting in an LLM agent. But then think about it again, it’ll increase the latency, it’ll increase the cost, that all has to be balanced. So that’s sort of the direction that RAG and all AI will take. Apart from that, there is also sort of something in popular discourse is that with the advent of LLMs that have long context windows, is RAG going to die and sort of funny discourse that goes on happening. So today there is limitation in which, how much information can I put in the prompt for that? I need this whole retrieval. What if there comes a time in which the entire database can be put into the prompt? There is no need for this retrieval component. So that one thing is that cost really increases, right? And so does latency when I’m processing so much information. But also in terms of accuracy, what we’ve observed is that as things stand of today, RAG system will perform sort of similar or better than, long context LLMs. But that’s also something to watch out for. Like how does this space evolve? Will the retrieval component be required? Will it go away? In what cases will it be needed? All that questions for us to wait and watch.

Priyanka Raghavan 00:54:46 This is great. I think it’s been very fascinating discussion and I learned a lot and I’m sure it’s the same with the listeners. So thank you for coming on the show, Abhinav.

Abhinav Kimothi 00:55:03 Oh my pleasure. It was a great conversation and thank you for having me.

Priyanka Raghavan 00:55:10 Great. This is Priyanka Raghaven for Software Engineering Radio. Thanks for listening.

[End of Audio]

SE Radio 673: Abhinav Kimothi on Retrieval-Augmented Generation

Show Notes

Related Episodes

Other References

Transcript

Join the discussion

More from this show

SE Radio 727: Jeroen Janssens and Thijs Nieuwdorp on Using Polars

SE Radio 726: Scott Kingsley on the Swagger Ecosystem

SE Radio 725: Danny Yang and Sam Goldman on the Pyrefly Type Checker

Menu

Recent posts

Search

Search

SE Radio 673: Abhinav Kimothi on Retrieval-Augmented Generation

Show Notes

Related Episodes

Other References

Transcript

Join the discussion

More from this show

SE Radio 727: Jeroen Janssens and Thijs Nieuwdorp on Using Polars

SE Radio 726: Scott Kingsley on the Swagger Ecosystem

SE Radio 725: Danny Yang and Sam Goldman on the Pyrefly Type Checker

Menu

Recent posts