Kacper Łukawski, a Senior Developer Advocate at Qdrant, speaks with host Gregory M. Kapfhammer about the Qdrant vector database and similarity search engine. After introducing vector databases and the foundational concepts undergirding similarity search, they dive deep into the Rust-based implementation of Qdrant. Along with comparing and contrasting different vector databases, they also explore best practices for evaluating the performance of systems like Qdrant. Kacper and Gregory also discuss the steps for using Python to build an AI-powered application that uses Qdrant.
Brought to you by IEEE Computer Society and IEEE Software magazine.
Show Notes
Related Episodes
- SE Radio 676: Samuel Colvin on the Pydantic Ecosystem
- SE Radio 673: Abhinav Kimothi on Retrieval-Augmented Generation
- SE Radio 666: Eran Yahav on the Tabnine AI Coding Assistant
- SE Radio 493: Ram Sriharsha on Vectors in Machine Learning
- SE Radio 490: Tim McNamara on Rust 2021 Edition
Other References
- Kacper Łukawski
- Qdrant
- Home – Qdrant
- Cloud Quickstart – Qdrant
- Vector Search Basics – Qdrant
- Advanced Retrieval – Qdrant
- Using the Database – Qdrant
Transcript
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Gregory Kapfhammer 00:00:18 Welcome to Software Engineering Radio. I’m your host Gregory Kapfhammer. Today’s guest is Kacper Lukawski. He’s a senior developer advocate at Qdrant. Qdrant is an open-source vector database and vector similarity search engine. Kacper, welcome to the show.
Kacper Lukawski 00:00:35 Hello Greg. Thanks for the invitation.
Gregory Kapfhammer 00:00:37 Hey, I’m really glad today that we get a chance to talk about Qdrant. It’s a vector database, and we’re going to learn more about how it helps us to solve a number of key problems. So are you ready to dive in?
Kacper Lukawski 00:00:48 Definitely.
Gregory Kapfhammer 00:00:49 Okay. So we’re going to start with an introduction to vector databases and we’re going to cover a couple high level concepts and then later dive into some additional details. So let’s start with the simple question of what is a vector database? Can you tell us more?
Kacper Lukawski 00:01:03 Yes, of course. First of all, I think vector search engine is a more appropriate term. Search is the main functionality that this kind of tool provides. Nevertheless, it’s a service that can efficiently store and handle high dimensional vectors for the purposes of similarity search, and the similarity of these vectors is defined by the closeness of the vectors in that space. So vector databases are built to make that process efficient.
Gregory Kapfhammer 00:01:29 Okay, so a vector database helps us to achieve vector search or vector similarity search. Is that the right way to think about it? Exactly. Okay. Now one of the things you mentioned was the word vector and then you said high dimensional. Can you briefly explain what high dimensional data is?
Kacper Lukawski 00:01:46 Yes. In the case of vector embeddings we describe them as high dimensional because they usually have at least a few hundred dimensions, typically no more than eight or nine thousand dimensions. And it’s definitely not high dimensional data if you are a seasoned data expert, but it’s relatively high because it’s hard to imagine, hard to interpret for a regular human. So this is the range that we are usually operating in.
Gregory Kapfhammer 00:02:11 Okay, that’s helpful. Now you mentioned the term embedding a moment ago. Can you talk briefly about the concept of a vector embedding?
Kacper Lukawski 00:02:19 Sure. So vector embeddings are just numerical representations of the input data, and the main idea is that they keep the semantic meaning of the input data that was used to generate them. And if we have two different vectors which are similar in some way, then we assume that the objects that were used to generate them are also similar in their nature. Vector embeddings actually enabled semantic search that can understand not only the presence of particular keywords but also user intent, and more importantly they enabled search on unstructured data that was impossible to process in the past.
Gregory Kapfhammer 00:02:59 So let me see if I’m understanding the workflow correctly. Is the idea that I take something like source code or images or documents and then I convert those to embeddings and then I store those in the vector database? Am I thinking about this the right way?
Kacper Lukawski 00:03:14 Yes, that’s the correct way. And the main idea is that these vector embeddings are generated by neural networks which were trained solely for that purpose. So that’s also why we quite often describe vector search as neural search, because it requires some sort of neural network to encode the data into these numerical representations.
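To make the workflow confirmed here concrete, the following is a minimal sketch in Python, assuming the sentence-transformers and qdrant-client packages and a locally running Qdrant instance; the model name, collection name, and documents are illustrative only, not something prescribed in the episode.

```python
# Minimal encode-and-store sketch: documents go through an embedding model,
# the resulting vectors are stored in Qdrant, and a query is answered by
# nearest neighbor search. All names here are illustrative.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

documents = ["How to reset a password", "Steps for configuring single sign-on"]

# A neural network encodes each document into a fixed-size vector.
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(documents)

# Store the vectors, plus the original text as payload, in a Qdrant collection.
client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=vectors.shape[1], distance=Distance.COSINE),
)
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": doc})
        for i, (doc, vec) in enumerate(zip(documents, vectors))
    ],
)

# A query goes through the same model, and Qdrant returns the closest points.
hits = client.search(
    collection_name="docs",
    query_vector=model.encode("forgot my password").tolist(),
    limit=3,
)
```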
Gregory Kapfhammer 00:03:34 Some of our listeners may not have previously used a vector database or done some type of vector similarity search. Can you tell us a little bit more about how you know when your project actually needs a vector database?
Kacper Lukawski 00:03:47 There are no strict criteria here of course, but generally if you build any kind of search mechanism and whenever you want to add these semantic search capabilities into it, then you should have a look at vector databases, because they just make the deployment and the maintenance of this kind of project easier. Also, when you want to implement search over some data modality that can’t be processed with traditional search means, such as images or audio, then you definitely need to use semantic search, because that’s probably the only approach to search on unstructured data like this. And of course if you have just a few examples of documents that never change, then vector databases might be just an additional overhead in your project. So then maybe implementing semantic search directly into your application and embedding those documents directly into the source code makes sense. But in general, if you have data that changes frequently, you should be using a vector database to implement semantic search. Especially nowadays, vector databases go along well with large language models, because in both cases we expect natural language like interactions and we are not necessarily looking only at the presence of keywords. So if you build a system that exposes a conversational-like interface, then vector databases might be really important to achieve that quickly.
Gregory Kapfhammer 00:05:15 So you mentioned the idea of keyword search engine and we’ve already talked about the concept of a similarity search engine. How are those two types of search engines similar to and different from each other?
Kacper Lukawski 00:05:26 So historically search was tied solely to textual data. We didn’t have any other means that would allow us to search over images or any different data modality. And since we were only focusing on text, we developed some specific methods that were dividing that text into meaningful pieces, not necessarily specific words, but we were also converting them into their root forms through stemming or some different lemmatization techniques. And then we were just building inverted indexes that were supporting lexical search, which was based on the presence of some specific keywords. And imagine you had a very specific use case in which two different words could describe the same object, the same phenomenon. Then you would need to manually maintain a list of synonyms, so this process would convert all the different synonyms into the same form. That means a lot of effort, maybe even building a whole team of people focusing on search and improving search relevance. Semantic search is slightly different because it is based on neural networks, and these neural networks are trained to understand the meaning of the words and whole sentences.
Kacper Lukawski 00:06:39 And that means you don’t necessarily need to use the same terminology as the people who created the documents you are searching over. But you can also express yourself however you want, assuming the model was trained properly for that particular language, and still get significantly better results even though you don’t really speak the same language as the domain experts who created the whole database. So that’s the main difference. And also historically we were using tools such as Elasticsearch or OpenSearch, or anything based on Lucene actually, to support lexical search. Right now vector databases are just another, different search paradigm.
Gregory Kapfhammer 00:07:20 Thanks. That response was really helpful. So I want to turn our attention now to Qdrant and then briefly discuss some of the types of applications people can build with Qdrant. So at the start of the show you said that Qdrant was a vector similarity search engine. Can you tell us a few of the key features that Qdrant provides and what developers can actually build with Qdrant?
Kacper Lukawski 00:07:40 Of course. So Qdrant provides a very efficient and lightweight search engine that can handle different types of vectors: dense, sparse and multi-vector representations. And we also support all the existing optimization techniques which are relevant for that space. Just to name a few, we support different kinds of quantization such as product, scalar and binary quantization, which helps to reduce the amount of memory required to run semantic search at scale. We can also store the data on disk if you prefer to reduce the cost of running semantic search and you are okay with higher latency, or GPU-based indexing if you really care about the time spent on building the helper data structures that we use to make that search efficient. And one important feature or functionality of Qdrant is that it allows you to keep multiple vectors per point along with some metadata that can also be used for filtering, which is kind of important because a typical use case requires you to search not only based on the semantics of the data. Imagine you are looking for the best restaurants nearby; you definitely don’t want to see restaurants from the other part of the globe.
Kacper Lukawski 00:08:54 You definitely want to restrict your search to a specific area so you don’t need to travel to have your dinner. And that’s exactly where our metadata filtering is important, and it’s implemented in a slightly unique way compared to the other vector databases. So I would say those are the main features of Qdrant. And when it comes to different applications, what Qdrant implements is actually an approximation of nearest neighbor search. KNN, K Nearest Neighbors, is a quite well known algorithm for those who have any kind of experience in machine learning; that’s probably the most basic ML algorithm that exists and it’s known for its versatility. However, it’s really hard to scale it up just because at inference time, KNN requires us to compare the distance to all the vectors we have within the system. So Qdrant, as well as all the other vector databases, just approximates nearest neighbor search.
Kacper Lukawski 00:09:52 So it can be implemented in sublinear time, but that also means that we can solve a variety of problems that pure KNN could also solve. Obviously semantic search: if you have an existing application and just want to enhance it with additional semantic search capabilities, that’s something you could definitely implement with Qdrant. However, vector search enables way more than just pure search, because since we have the similarity measure, we can also perform a very simple classification pipeline using the same approach. If we just select the top 10 closest documents, or the top N closest documents in general, we can run a simple voting procedure which just selects the most common class among all these closest neighbors and assigns that class to a new observation, just because the majority of observations in its neighborhood belong to it. And the similarity measure is also interesting on its own, because you can use it to detect anomalies. Let’s say you know the distribution of the queries you typically get into your system; then you can detect that a particular query is just way below the expected range of similarity, and then maybe add a human-in-the-loop component to react to that particular observation, because that may indicate that somebody is just trying to hack your system, for example. And last but not least, recommendation engines. If you have positive and negative examples, such as movies that somebody liked or disliked, you can also use multiple vectors and serve recommendations based on these multiple objects that that particular person has interacted with in the past.
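The classification-by-voting and anomaly-detection ideas described above can be sketched with a few lines on top of the Python client. This assumes an existing collection named "examples" whose payloads carry a "label" field and a cosine-style score where higher means closer; the names and the threshold are illustrative.

```python
# Voting classifier and simple anomaly check built on nearest neighbor search.
# Assumes an "examples" collection with a "label" payload field (illustrative).
from collections import Counter
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

def classify(vector, top_n=10):
    # Take the top-N closest stored observations and pick the most common class.
    neighbors = client.search(
        collection_name="examples",
        query_vector=vector,
        limit=top_n,
        with_payload=True,
    )
    votes = Counter(hit.payload["label"] for hit in neighbors)
    return votes.most_common(1)[0][0]

def looks_anomalous(vector, threshold=0.5):
    # If even the closest stored point is far away (low similarity score),
    # flag the query for human review.
    best = client.search(collection_name="examples", query_vector=vector, limit=1)
    return not best or best[0].score < threshold
```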
Gregory Kapfhammer 00:11:42 Thanks for that explanation. It was really helpful and I appreciate it, in particular your commenting on the idea of a recommendation engine. I wanted to think about recommendation maybe from a perspective that would be accessible for our listeners. So for example, if we’re thinking about software testing and I have a test case that fails, how could I use semantic search to find the other test cases that are like the failing test case, that might perhaps also fail, so that I don’t have to run the whole regression test suite? Can you walk us through that type of example?
Kacper Lukawski 00:12:14 Yeah, I can definitely try to describe the approach; however, I can’t promise that it’s going to work, but definitely there are some embedding models that can work with source code in different languages at the same time as well. And I can imagine that somebody could just encode all the test cases from their test suite to capture the meaning of a particular test. And then if you have, let’s say, a Qdrant instance and a collection with all these representations of all the test cases you have, then you can try to find the nearest neighbors of the failing test case that you just encountered and then try to run them first to evaluate whether they’re also failing. So that may be one of the approaches to that. And since you would be using an embedding model that was trained solely to support code search, then I would expect it to work properly, just because the nature of code is slightly different from natural language processing. Because here it’s not only about the convention of how we name our variables, methods and classes, but it’s more about the structure and the syntax of the code itself. So this kind of model should capture more nuances of the data and hopefully recognize these problematic test cases early on.
Gregory Kapfhammer 00:13:32 Okay, that makes sense. So in this case I have to find the source code of the test cases and then I have to produce an embedding of the source code of the test cases, store that inside a Qdrant and then use it to help me to find the K nearest neighbors associated with that test case. Am I thinking about that the right way?
Kacper Lukawski 00:13:50 Exactly. That’s the approach I would suggest to test. That’s an interesting use case, but unfortunately I haven’t tried it on my own yet.
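As a hypothetical sketch of this failing-test idea, and not something tried in the episode, one could encode each test’s source with a code-aware embedding model, store those vectors in a collection, and then query with the failing test’s source to pick which tests to run first; the model name and collection name are placeholders.

```python
# Hypothetical sketch: find the stored tests most similar to a failing one.
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient(url="http://localhost:6333")
# Placeholder: substitute an embedding model trained for code search.
code_encoder = SentenceTransformer("your-code-embedding-model")

def similar_tests(failing_test_source: str, top_n: int = 5) -> list[str]:
    hits = client.search(
        collection_name="test_cases",          # illustrative collection name
        query_vector=code_encoder.encode(failing_test_source).tolist(),
        limit=top_n,
        with_payload=True,
    )
    # Run these candidates before the rest of the regression suite.
    return [hit.payload["test_name"] for hit in hits]
```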
Gregory Kapfhammer 00:13:57 Okay. I wanted to talk briefly about some of the other use cases that you mentioned a moment ago. So you might want to do semantic search for like documentation, maybe markdown files or PDFs. What do you have to do to put the markdown file or the PDF in the right format before you embed it? Can you talk a little bit further about how to do semantic search for various types of documents?
Kacper Lukawski 00:14:21 Of course. So markdown files are actually the easiest case because here we have just text with some additional formatting applied on top of it. And the main challenge here is that we can’t easily just put a whole document into an embedding model and expect it to encode the whole meaning of that document within a few hundred dimensions. That would be like a perfect compression mechanism if you could just put a whole book inside such a short vector. So definitely what we need to do is to chunk it into meaningful pieces. And the way we chunk really depends on the data we have. As we are speaking about markdown files, you typically have some headers and paragraphs at least, maybe some lists, tables, et cetera. So if you want to chunk your markdown files properly to bring in all the context possible, then you probably need to take all the headers leading down to the particular paragraph you are encoding, just to keep track of all the headers that have appeared so far.
Kacper Lukawski 00:15:22 Then you are building more trust that the embedding will capture all the information that it has to capture in order to keep the meaning of that particular piece. However, it’s pretty tricky. There is no single method that you can use for chunking. The naive way of just using a fixed window length doesn’t usually work, because of the chunking itself: imagine you are reading a book, and if you just start from a random paragraph it’s really hard to say what the meaning of that paragraph was in the context of a whole book or just a chapter. So chunking usually requires some additional means and some knowledge about the data that we have in order to be done properly. And once you’ve chunked the document, and I assume you can tell what’s the best way to do that if you know the docs you are working with, then you need to pass all these chunks through the embedding model of your choice.
Kacper Lukawski 00:16:17 And there are plenty of open-source embedding models available. I really recommend having a look at Sentence Transformers, which is a Python library that exposes multiple open-source models, some of them even multilingual, so you can work with multiple languages at the same time. Or if you prefer SaaS, then OpenAI or Cohere are providing this kind of model too. And once you have the embeddings, you send these embeddings, along with the metadata, which is usually the input data that was used to generate this particular vector, to Qdrant. So that’s the typical approach, and once you have this ingestion pipeline in place, you can start searching over it.
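A rough sketch of the header-aware chunking described here might look like the following; it is purely illustrative, and real chunking usually needs tuning for the documents at hand. Each paragraph is prefixed with the trail of headers above it so the embedding keeps that context, and each resulting chunk would then be passed through the embedding model and sent to Qdrant along with its metadata, as in the earlier sketch.

```python
# Header-aware markdown chunking: keep the headers seen so far as context
# for each paragraph-level chunk. Illustrative only.
import re

def chunk_markdown(text: str) -> list[str]:
    header_trail: dict[int, str] = {}
    chunks: list[str] = []
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        match = re.match(r"^(#{1,6})\s+(.*)", block)
        if match:
            level = len(match.group(1))
            # Remember this header and drop deeper headers from earlier sections.
            header_trail = {l: h for l, h in header_trail.items() if l < level}
            header_trail[level] = match.group(2)
        else:
            context = " > ".join(h for _, h in sorted(header_trail.items()))
            chunks.append(f"{context}\n{block}" if context else block)
    return chunks
```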
Gregory Kapfhammer 00:16:56 Okay, that makes a lot of sense. I know a moment ago you mentioned the idea of using Sentence Transformers, and in my experience Sentence Transformers is something that I get from Hugging Face and download to my computer. Am I remembering that correctly?
Kacper Lukawski 00:17:11 Yes, it’s a typical approach, at least when you are experimenting. Of course you can use Hugging Face directly because they have these inference endpoints. So I can imagine that in some cases you can’t really run these models on your own infrastructure or just your own laptop because it won’t be that effective. In that case you can easily use, for example, Hugging Face inference endpoints to run them on their infrastructure. Or, more recently, we have introduced this kind of feature, Cloud Inference, into Qdrant Cloud, so you can also just send the raw data and encode it server-side.
Gregory Kapfhammer 00:17:48 Aha. Now in a moment I want to compare and contrast Qdrant to other types of databases, but before I do that, can you briefly comment on how Qdrant and the type of system that you build with it is similar to and different from retrieval augmented generation?
Kacper Lukawski 00:18:03 Qdrant might be a part of retrieval augmented generation pipelines. So retrieval augmented generation is all about bringing a relevant context into the prompt that we send to the LLM. Obviously LLMs have some disadvantages because they were trained on some specific data sets, and even though at first glance it may look like they know everything, they would definitely not know anything about the internal processes of your organization or maybe some personal data of yours. Well, they definitely shouldn’t know that. So the whole idea behind retrieval augmented generation is to use the retrieval component, which might be semantic search for example, to find some relevant information and to automatically add it to the prompt that you send to the LLM. So let’s say you start with a user’s question that was sent directly to your system; instead of using that prompt, that query, directly and sending it to the LLM, you use it as if it was a query to your retrieval system.
Kacper Lukawski 00:19:05 And that’s why semantic search makes a lot of sense, because we have these natural conversations with LLMs, and then Qdrant in that scenario would just find some relevant documents, or parts of the documents, that it finds to be important to answer that particular question. Then retrieval augmented generation would just build another prompt including your original question and the documents retrieved from the database and ask the model to answer based solely on those documents. So ideally it should reduce hallucinations and also make sure that the model relies on its language capabilities, not on the internal state or knowledge that it has.
Gregory Kapfhammer 00:19:47 Okay. So if I’m understanding you correctly, the idea is you can use Qdrant in order to find a document and similar documents that are important to you and then you can put that into the context window of the LLM which will then help the LLM do a better job at whatever task you’ve given it. Did I explain it in the right way?
Kacper Lukawski 00:20:07 Exactly. That’s the process of retrieval augmented generation and that also helps the LLM to rely on its summarization capabilities or information extraction capabilities, not using it as if it was a search engine on its own.
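The retrieval step of such a pipeline can be sketched as follows, assuming the same sentence-transformers model and "docs" collection used in the earlier sketch; the prompt template is illustrative and the call to an actual LLM is left out.

```python
# RAG retrieval sketch: embed the question, fetch relevant chunks from Qdrant,
# and stitch them into the prompt handed to the LLM. Names are illustrative.
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient(url="http://localhost:6333")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_prompt(question: str, top_k: int = 5) -> str:
    hits = client.search(
        collection_name="docs",
        query_vector=encoder.encode(question).tolist(),
        limit=top_k,
        with_payload=True,
    )
    context = "\n\n".join(hit.payload["text"] for hit in hits)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```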
Gregory Kapfhammer 00:20:23 Okay, thank you. That was fantastic. Now in a moment I want to begin our conversation about how Qdrant was implemented and then we’re going to spend some time talking about how you actually benchmark the performance of Qdrant. But before we do that, our listeners may be aware of the fact that we’re talking about databases and they may be familiar with other types of databases like a relational database or a NoSQL database or a document database. Could you overview the landscape of different types of databases and then tell us a little bit about how Qdrant fits into that landscape?
Kacper Lukawski 00:20:56 Of course. So I’ve already mentioned that, but I think adopting the term of a vector database was a mistake of the industry. Because when you think about databases, you think about atomicity, consistency, isolation and durability. And we tend to describe ourselves as a vector search engine because we prioritize scalability, search speed and availability over these four database principles. So that also requires different architectural decisions to be made, and these decisions can’t be easily reproduced in any relational or NoSQL database. So I would say that we should rather compare vector databases to Elasticsearch or OpenSearch or this kind of tool, because that’s actually what you are trying to replace.
Gregory Kapfhammer 00:21:45 Okay, that makes sense. If listeners are interested in learning more about other coverage about databases, they can check out Episodes 605, 484 and 199 of Software Engineering Radio. So now what I want to do is dive into the implementation details of Qdrant and how you benchmarked it. Are you ready to go?
Kacper Lukawski 00:22:04 Yeah.
Gregory Kapfhammer 00:22:05 All right. So one of the things that I noticed about Qdrant is that you’ve actually implemented it using the Rust programming language. Can you tell us a little bit about why you and your team chose Rust and what were some of the performance benefits that are associated with using Rust?
Kacper Lukawski 00:22:20 Of course. So definitely the biggest factor behind choosing Rust is its safety. We can achieve almost similar performance to C or C++, sometimes even better, while keeping the language safety, and the strong type system that Rust provides is very helpful in preventing us from making some mistakes in a highly concurrent system, like reading or writing some value from multiple threads concurrently, because that’s ultimately what you can expect from a search engine. And Rust has high quality building blocks that make building distributed systems work, and it would probably be impossible for us to achieve the same quality with the same sized team if we decided to use C or C++. In the case of building search engines or databases, these low-level languages such as C or Rust are just the best choices. And another fun side effect of that is also that it’s very easy for us to refactor the code.
Kacper Lukawski 00:23:20 Like if we change an interface or a data type, the Rust compiler will just point us to all the places that need to be adjusted. So it prevents some errors at runtime; we can catch them during the build time. And we can also trust our external contributors more. We are an open-source company, so definitely there are some external contributors, just because of the features of the language itself, and we couldn’t achieve that in languages such as Python, for example. That would be way more tricky because there is no such mechanism there. And last but not least, languages such as Java, Go or C# have a garbage collector, and that means there are some uncontrollable latency spikes which are just unacceptable in high performance search engines.
Gregory Kapfhammer 00:24:11 Okay, so what you’re saying is that first of all there’s the issue of memory safety and then type safety, there’s a performance benefit to using a low-level language, and then moreover you needed to pick a language that didn’t use garbage collection.
Kacper Lukawski 00:24:24 Yes, we believe that that’s the way to go.
Gregory Kapfhammer 00:24:26 Okay. Now one of the things that’s really impressive about Qdrant is that you have a whole website about how you do performance benchmarking and I know when you’re doing vector similarity search it’s really important to have the ability to do the K nearest neighbors as fast as is possible. So what I’d like to do now if it’s okay with you, is read out some of the key benchmarking principles that Qdrant has set forth and then I’m going to ask you to explain them and expand on them. Does that sound cool?
Kacper Lukawski 00:24:54 Yeah, of course. I was actually involved in creating these benchmarks at the very beginning so I’m happy to discuss them in details.
Gregory Kapfhammer 00:25:01 Alright, that sounds awesome. So the first thing I was going to say is that you wrote, we do comparative benchmarks which means we focus on relative numbers rather than absolute numbers. What does that mean?
Kacper Lukawski 00:25:12 So for a typical user who has no experience with vector search, it’s really hard to say whether let’s say a hundred milliseconds is a good latency for a query but everyone should easily understand that a particular system is just twice as fast as another one. So that’s why we focus on relative comparison to the other systems that exist on the market.
Gregory Kapfhammer 00:25:34 Cool, that makes sense. Let’s do the next one. You say we use affordable hardware so that you can reproduce the results easily. Tell us more about that.
Kacper Lukawski 00:25:42 Yeah, in our case it’s a Hetzner machine, and we decided to use the same machine for all the benchmarks, so we just run them in a queue. We realized that even if we take instances that look the same, that seem to have the same parameters and the same hardware, we still experience some different results from running the same benchmarks, and that might be caused by different hard drives or maybe a different type of memory or a different provider. We definitely wanted to evaluate the quality and performance of all the vector databases, not the quality of the hardware that we are getting. And we believe that running vector search shouldn’t be expensive, so we don’t really want to spin up the biggest Cloud instances that exist; we were focusing on a typical use case from our users, so they would typically run it on a separate VPS or just a regular instance from one of these providers.
Gregory Kapfhammer 00:26:41 Okay. And what you said actually connects to the next idea. So let me read it and then perhaps you can expand further. You said we run benchmarks on the same exact machines to avoid any possible hardware bias. Can you explain what hardware bias is in slightly greater detail?
Kacper Lukawski 00:26:57 Yeah, so that’s definitely related to the previous one as well. But we don’t want to include like the impact of the particular hardware and measure the latency that could have been caused by let’s say the hard drive that you have. Obviously, vector databases store some data on disk and it wouldn’t be fair to include that into the comparison and that could have happened if we just decided to use multiple instances at the same time. So that’s why we have the same exact machine for all the tests that we run sequentially and then we can compare the results in a proper way I would say.
Gregory Kapfhammer 00:27:33 Okay. And we’ll link listeners of our show in the show notes to details that are related to the benchmarking setup that you’ve used. You’ve already mentioned several performance evaluation metrics that you use in this benchmarking framework, but what I’d like to do is to list them off and then ask you to go into some additional details. So for example, the documentation for Qdrant references throughput, latency, memory usage, CPU usage, and indexing time. If you could go over those first four at a high level of detail and then in particular dive into indexing time, that would be greatly appreciated.
Kacper Lukawski 00:28:08 Of course. So depending on the specific use case you have, or maybe some budget constraints, you might prefer to optimize for a particular metric from those four. But we measure all of them and report them in our benchmarks just so you can have an understanding of what you can expect in a very specific setup. For example, low latency might be important if your users expect an immediate response, and we measure average latency, P95, and P99 so we can see what the majority of users can expect from the system and how fast it is going to be. Similarly, if you expect to have multiple concurrent users, then throughput might be the metric that you care about most. So we can’t really say what’s the perfect setup in all these cases; that’s why we report all of them. And when it comes to memory usage and CPU usage, since we run all the benchmarks on the same exact machine, there are some specific parameters of it that we don’t modify, and in certain cases we see that a particular system, a particular engine, just can’t work within this limit.
Kacper Lukawski 00:29:19 So definitely it just needs more memory to support the same use case, the same data set, because let’s say your million vectors just do not fit a particular instance. And when it comes to the indexing time, I think it’s an important topic that we haven’t discussed yet, but all the vector databases on the market use some sort of helper data structures to make this approximate nearest neighbor search efficient, and this indexing time is needed in order to build these data structures. It might be also crucial to know how much time it is going to take, especially if your data is changing frequently; in those cases indexing time might be just the most important metric for your particular system.
Gregory Kapfhammer 00:30:05 That was a helpful response. Thanks. What I want to talk about is whether or not you’re using the benchmarking framework to compare one version of Qdrant to another version of Qdrant or alternatively are you using it to compare Qdrant to some other type of tool or technology? Can you expand on that a little bit further?
Kacper Lukawski 00:30:23 Of course. So the benchmarks that you can see on our website compare different vector databases under the same test cases. So we use the same data sets and the same machine to just see what’s the performance according to all these metrics of Qdrant versus the other tools on the market. However, internally we also use the same benchmarks to compare different versions of Qdrant just to see what’s the improvement of a particular feature on search and also, we use it to test different configurations of the same version of Qdrant. So that serves multiple purposes.
Gregory Kapfhammer 00:30:57 Okay, that makes sense. Now I’m wondering if you could give a few concrete numerical performance results. So what I’m looking for here is some type of headline result that helps us to understand the performance say of one version of Qdrant to another or Qdrant compared to some other vector similarity search engine. Can you give us a few of those concrete numerical results?
Kacper Lukawski 00:31:18 Yeah, so maybe let me just start with the results of one of the tests that we did in the benchmarks. So we used the most popular embeddings that exist, from OpenAI. We took 1 million vectors created from some real-life dataset, and Qdrant was able to index that dataset within 24-25 minutes. We are not the fastest when it comes to the indexing time, I have to admit, but that was just somewhere in the middle. And for that dataset, if you decide to use Qdrant you can expect the latency of a single search operation to be as low as three to four milliseconds on average. And there shouldn’t be a problem to run like 1,200 queries per second with that particular configuration, while the search precision should be still around 0.99.
Gregory Kapfhammer 00:32:07 Aha. So you mentioned the idea of the precision of the search as well. Can you briefly talk more about what precision means in the context of vector similarity search?
Kacper Lukawski 00:32:16 Of course. I think we’ve mentioned that topic already, but since vector databases approximate the nearest neighbor search, you can’t expect them to always produce the same results as pure KNN would produce for the same query. So search precision is an important factor here, and it measures how many times we return the results that brute force KNN would produce for the same query. It’s quite easy to build a system that will be very fast but inaccurate. So the whole point of comparing the search engines is that we compare them at a very specific precision threshold. We only compare the quality of a particular system assuming that the minimum search precision is like 0.97 or 0.99. So this is a key factor here, because depending on the use case you may prefer to just reduce your requirements in terms of search precision. In many cases you don’t need to always get the top results because you prefer better latency, but in many cases, in very specific industries, you need to be as close to one as possible. So that’s why it makes a lot of sense to calculate that with a search precision threshold in mind.
Gregory Kapfhammer 00:33:30 So what you’re saying is there’s a tradeoff here between throughput and latency on one hand and then on the other hand the accuracy associated with vector similarity search. Did I catch the trade off the right way?
Kacper Lukawski 00:33:42 Yes. Exactly.
Gregory Kapfhammer 00:33:43 Okay, good. Now we mentioned indexing a moment ago and I wanted to talk briefly more about indexing and also again return to this idea of similarity. So if I want to know similarity between two source code segments or two documents, my understanding is that I have to have some kind of distance metric. So I’m familiar with distance metrics like cosine similarity or Euclidean distance. What does Qdrant use to actually calculate these types of similarities?
Kacper Lukawski 00:34:09 So that can be configured for your collection, or actually for your vector, because in a Qdrant collection you can have multiple vectors per point and each of these named vectors can have a different similarity measure. We support four different similarity measures here: dot product, cosine similarity, Euclidean distance and Manhattan distance. I would say in like 90% of the cases people use cosine similarity. It’s just easy to interpret, because the outputs of the cosine similarity come from a very specific range, from negative one to positive one. So it’s easy to interpret whether your points are really close to each other, and you can even use that measure directly to indicate the similarity of two objects in the UI of an application. For Euclidean distance, which is virtually unlimited, it’s hard to tell if it’s a good result or not. Close to zero is fine, but how do you interpret 20? Is that okay, or maybe it’s really far from each other?
Kacper Lukawski 00:35:07 So cosine similarity has that benefit of being easy to interpret even for non-technical people. However, it also depends on the model you choose. Assuming you have a model that was trained to support programming languages and source code, then you’d probably need to check the model card on Hugging Face or just verify that with the model provider, because the model was probably trained to optimize for a very specific metric, and that probably was either Euclidean distance or cosine similarity. So that’s how you choose the proper metric. It’s just a property of the model you use.
Gregory Kapfhammer 00:35:44 So if I’m understanding you correctly, I have to be careful when I’m creating the embeddings to make sure that I’m using a certain distance metric and then later when I’m running the querying I have to make sure I’m using the same distance metric.
Kacper Lukawski 00:35:58 Not exactly, actually. When you create your embeddings it doesn’t matter; you’ll just be running them through the model and you will receive these numerical representations of your data. But when you create the collection in Qdrant you need to specify the metric that has to be used to compare these vectors, and you can’t modify that metric later on. That’s also important to know, because we use that metric to build these helper data structures which are just used internally to speed up your search operations. So also when you search you don’t specify a particular metric, you just use the one that was configured on your collection.
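In the Python client this shows up as the distance being fixed at collection creation time and never passed at query time; a short sketch, with illustrative names and vector size:

```python
# The similarity metric is a property of the collection, chosen once at creation.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="code_chunks",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
# Later searches against "code_chunks" implicitly use cosine similarity;
# no metric is specified in the query itself.
```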
Gregory Kapfhammer 00:36:34 Thanks for that clarification, I appreciate it. You mentioned before the ACID properties that are associated with databases. I’m wondering if you could briefly comment on whether or not Qdrant provides things like isolation or durability, or is that not a focus of the system that you’ve built?
Kacper Lukawski 00:36:50 It’s definitely not a focus of the system. Qdrant shouldn’t be used as a regular database. There is no atomicity: if you just send an operation to Qdrant, if you ingest your data, you can expect eventual consistency of it, but it’s not guaranteed at any level, so we don’t really focus on all these properties of regular databases. So I wouldn’t really say that any of them is a particular property of Qdrant or vector databases in general.
Gregory Kapfhammer 00:37:19 Okay, that makes sense. And in fact since you just said the phrase vector databases in general, I think it might be appropriate for us to at least briefly compare Qdrant to some of the other vector databases that our listeners might be familiar with. So for example they might have heard of PG Vector or Pinecone or maybe they’re familiar with the fact that SQLite has a way to do vector extensions. Can you pick at least one of those and explain how Qdrant is similar to and different from the system that you picked?
Kacper Lukawski 00:37:48 Of course. I think PG Vector is the best example to choose here, just because that’s the most common question that I’m getting. The main concern that people have when we discuss vector databases is that if you just add a new system into your existing stack and you already have a relational database such as Postgres, then you need to keep those two systems in sync somehow. So the main benefit of using PG Vector in that case is that you don’t really need to copy your data anywhere else; there is just a single system that keeps everything in one place. That’s quite often a concern of the people that I’m speaking to. However, PG Vector is just an extension of Postgres, and Postgres is a relational database that takes care of all these properties we just discussed. Since it’s just an extension, it doesn’t modify the core of the system; it just acts as additional functionality of your relational database, which is okay if you just deal with thousands of examples; then you shouldn’t even notice any difference.
Kacper Lukawski 00:38:48 However, when we speak about higher scale systems dealing with millions or even billions of vectors, vector search easily becomes a bottleneck of your relational database. Imagine you have a system that has like a million documents in one of the tables. That’s not that big an amount of data if we speak about modern systems; there are so many transactional systems that can handle this kind of load and it’s not a big deal for Postgres, that’s for sure. However, if you decide to add these vector search capabilities, and if you decide to use OpenAI embeddings for example, then this million vectors will translate to six gigabytes of memory, and vectors are typically stored in memory for the search to be efficient. That means the vector search capability easily becomes the most demanding process inside your relational database, even though it was supposed to handle typical SQL queries.
Kacper Lukawski 00:39:48 Like you’ll be selecting points based on their IDs or maybe some other typical filtering criteria, but you are just generating an additional load on an existing system. And from my experience that rarely works that well if you really reach a certain scale. Moreover, there are also some other issues that you may encounter just because PG Vector is an extension. That means that if you want to search using vector search and at the same time perform filtering, coming back to the previous example where you want to filter items coming from a particular city, let’s say New York, then it doesn’t work that well on the semantic search level and that has to be expressed as a traditional workload in SQL. However, in the case of PG Vector you would be using either pre- or post-filtering, meaning that you either filter all the rows in your database that fulfill that criteria and then perform semantic search on top of them, which may end up as an almost linear scan at some point if, let’s say, 90% of your rows just match this criteria.
Kacper Lukawski 00:40:52 And on the other hand, if you use post-filtering you are running semantic search on all the rows you have and then filtering all these results. But that can also mean that you end up with no results at all, because the set of points that you selected with semantic search just doesn’t include any of the points from that particular city. In the case of Qdrant we have quite a unique approach to that, because semantic search and metadata filtering are performed in a single pass; they’re just incorporated into these helper data structures. So that’s a huge difference. But also, historically, if we speak about search, anyone treating search seriously would probably set up a separate system for that, such as Elasticsearch or OpenSearch, just because search requires different means than relational databases, and they are built to support different use cases. The same applies to vector databases. I totally get the point of just using a single system when we are just experimenting, and PG Vector and the SQLite vector extensions are actually okay if you’re just doing this kind of experiment. But in real production systems, having a separate system for search makes a lot of sense for these reasons.
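The single-pass filtering contrasted here with pre- and post-filtering looks roughly like this in the Python client: the payload condition is passed together with the query vector and applied while the index is traversed. The collection, field names, and the placeholder vector are illustrative.

```python
# Filtered vector search in one request: the "city" condition is evaluated
# alongside the nearest neighbor search rather than before or after it.
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

query_vector = [0.0] * 768  # in practice, the embedding of the user's query

hits = client.search(
    collection_name="restaurants",
    query_vector=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="city", match=MatchValue(value="New York"))]
    ),
    limit=10,
)
```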
Gregory Kapfhammer 00:42:09 Okay, thanks for that response. It was really thought provoking. I wanted to pick up on three different words that you said. So first of all you mentioned the idea of hitting a bottleneck and then one of the bottlenecks that I heard you mention was related to the gigabytes of memory use. Then the other limitation or bottleneck was related to the fact that you would have to do a linear scan of some of the data. Just briefly, are there other types of bottlenecks that a developer would bump into that would convince them hey I really need to use some type of vector similarity search engine?
Kacper Lukawski 00:42:40 Yeah, of course. So I’m glad you mentioned the SQLite vector extension, because actually this is something interesting. Many people use SQLite for their side projects, and also some mature projects still use SQLite even though it was supposed to be an embedded database rather for local usage. And this vector extension to SQLite is actually not an approximation of vector search; it’s actually a brute force KNN that just compares your query embedding to all the document embeddings you have. If you have a look at their benchmarks, you can expect the latency to be as high as nine seconds if you have just a million documents in your database. So that’s okay if you just deal with hundreds or thousands of rows, but at higher scale you can expect it to be the bottleneck of the whole system. And also, using this naive approach, the brute force scan, means that you don’t really build any data structures for that. You just store the vectors on disk and then just sequentially load them from there. That also means that you can see that the memory usage is not that high, but the latency will be a complete disaster. So these kinds of problems may occur when you choose something which is not built on purpose to support vector search.
Gregory Kapfhammer 00:43:59 Okay, that makes a lot of sense. What I’d like to do now is to transition our conversation to a new topic and I want to briefly discuss how someone would actually get started using Qdrant both from the perspective of running a Qdrant instance or accessing one of those instances and then also using one of the client libraries. So to get us started, when I’m using Qdrant, do I run it on my laptop or do I access a Cloud version of it or do I deploy my own version in the Cloud? Can you walk our listeners through some of the practical aspects of deploying Qdrant?
Kacper Lukawski 00:44:32 Of course. So there are different ways you can use Qdrant. We are an open-source engine, so you can definitely run it on your laptop, and that’s actually what I typically do when I just experiment with Qdrant. It’s as easy as just pulling our Docker container and running it on your machine, and functionality-wise you are getting the same functionalities as you would get in the managed Cloud. We are using the same containers in our Cloud. The main benefit of using our managed Cloud is that you get a really nice UI, and you can spin up your clusters through the API that we have. So when you start to scale a product this is great, because you don’t really need to worry much about your infrastructure. You can totally focus on your product and let us focus on making your Qdrant experience as seamless as possible.
Kacper Lukawski 00:45:21 And there is also a third option besides on-premise local usage or managed Cloud. We also have a hybrid Cloud offering. So hybrid Cloud allows you to run your Qdrant instances on your premises as long as you can provide us a Kubernetes cluster. It’s also a great idea to use it that way if you already have all your systems running on your own infrastructure, which might even be in the Cloud, and you just want to bring Qdrant as an additional component into an existing stack. We also provide a Helm chart if you would like to run the open-source version on your Kubernetes cluster. So there are different ways of how to use it, but ultimately all of them will bring you the same Qdrant experience, because the functionality is almost identical for all the possible modes of running it.
Gregory Kapfhammer 00:46:11 Okay, thank you for that. So let’s call the thing that we’ve just deployed the Qdrant server. Is that an okay phrase to use for now?
Kacper Lukawski 00:46:18 That’s the term we use to describe it.
Gregory Kapfhammer 00:46:20 Okay so now that I have the Qdrant server running which could be in a docker container on my laptop or hybrid or Cloud, I guess I need to run some type of Qdrant client which is going to allow me to like extract the data from my documents maybe using chunking like you mentioned a moment ago. And then I actually have to put it into the Qdrant vector similarity search engine that’s running in my server. So I know that there are Python, GO and Rust libraries that are helping people to build the clients. Can you talk a little bit more about how developers would use these client libraries to interact with the Qdrant server?
Kacper Lukawski 00:46:57 Of course. So all of these clients are actually interfaces built on top of HTTP and gRPC protocols, because that’s what Qdrant exposes in the first place. However, the most popular client is our Python SDK, and it comes with some additional benefits because you can interact with both protocols using the same interface. That’s fine if you, let’s say, have some restrictions, and at this point you can’t use the gRPC protocol because it’s just not allowed on the network you’re operating on; then you can still use Qdrant in the HTTP mode and eventually switch to gRPC, because it’s just a bit more efficient, once this is all solved. So those libraries are actually thin wrappers around our HTTP and gRPC APIs that just make things a bit easier. We take care of the batching: when you insert your data with our clients you can expect them to send it in batches, and that’s a good practice. Retries might be handled automatically, but overall you’ll be calling methods which are named similarly to the HTTP endpoints, for example. So that’s what you typically do, and that depends on the language that you choose, because some of them may have synchronous and asynchronous versions of the methods. That really depends on the platform, but eventually you can also use the HTTP or gRPC protocols directly, depending on the platform you are working with.
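With the Python SDK, switching between the two protocols is mostly a constructor choice; a small sketch, with the usual default ports as assumptions:

```python
# Same client interface over either transport; ports are the usual defaults.
from qdrant_client import QdrantClient

# REST/HTTP transport.
http_client = QdrantClient(url="http://localhost:6333")

# gRPC transport, where the network allows it; generally a bit more efficient.
grpc_client = QdrantClient(host="localhost", grpc_port=6334, prefer_grpc=True)
```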
Gregory Kapfhammer 00:48:21 Okay, so what you’re saying is that whether I’m using Python, GO or Rust, I have two protocol choices in terms of how I interact with the Qdrant server. Just very quickly, when I was using the Python client SDK myself, essentially what I did was I created a virtual environment and then used uv in order to install the Qdrant client as a dependency. What I also did next was actually use something like sentence transformers to create my embeddings, but just to make sure I’m clear, Qdrant doesn’t technically care whether I use sentence transformers or OpenAI or any other way to create my embeddings. Did I get that correct?
Kacper Lukawski 00:49:02 Yes, that’s totally right. Actually, we do not assume that you’ll be using a particular model to encode your data. Many of our users at some point decide to fine tune their own models so they reflect their domain a little bit better. So we can’t really be supporting a very particular set of models. Qdrant is model agnostic, so no matter how you create your vectors, as long as you can provide them as a list of floats it’s fine to use them with Qdrant.
Gregory Kapfhammer 00:49:30 Okay thanks that was awesome. So I can pick Python, GO or Rust, I can pick from a very wide variety of embedding libraries and then based on what you said just a moment ago, I can even do some fine tuning of the embedding library for the specific type of data that I care about like markdown files or a source code or PDFs or other things of that nature.
Kacper Lukawski 00:49:50 Yes, many of our users just start with some existing models and then at some point they decide to fine tune something for their own purposes. So yes, you can do it.
Gregory Kapfhammer 00:50:00 Alright, that’s awesome. What I want to do now since you mentioned the idea of fine tuning, I’d like to talk a little bit about some of your experiences that you and other members of the Qdrant team have had when it comes to things like building or testing or doing performance evaluation for Qdrant. So I wanted to start by asking you to share a story maybe of like a challenging bug or performance issue that you faced when you were developing Qdrant and then could you tell our listeners a little bit more about how you and the team solved that issue?
Kacper Lukawski 00:50:29 Of course. This is actually described on our website; we have a pretty nice article describing that. For those of you who have some Rust experience, you probably have heard about RocksDB, which is an embeddable key-value store, and it has been a key cornerstone for us to persist data on disk for a pretty long time. It has one major problem, though: it requires periodic compaction of data. That also means we need to do some sort of housekeeping to structure the data and to drop some old data, for example. So whenever this compaction job runs it can block everything else and cause some latency spikes, similarly to a garbage collector, and we had no control over it. So that’s also why we decided to implement something different: a custom key-value store which we called Gridstore. It’s also an open-source side project of Qdrant; you can find it in our GitHub repositories.
Kacper Lukawski 00:51:27 And although RocksDB is a fantastic general-purpose product, those latency spikes were unacceptable, so we had to do that. It has similar functionality to RocksDB, but it’s specialized for our specific use case. So that actually improved the latency perceived by the users significantly. You can also find some benchmarks on our website that prove that. So definitely that’s something that we are really proud of, and we also kept backwards compatibility, so even if you were using a version that was still using RocksDB, you can upgrade to the latest one and still expect it to work. So that was definitely a challenge that we successfully solved in the recent months.
Gregory Kapfhammer 00:52:12 Thanks for sharing that thought provoking example. That’s fascinating. We’ll make sure to link in the show notes the blog post that you mentioned a moment ago so others can learn about this challenging story and this successful outcome. As we draw our episode to a conclusion, I’m wondering if you could comment briefly on the ways in which building and testing and evaluating the performance of a vector similarity search engine is different from other types of software systems with which you have experience.
Kacper Lukawski 00:52:40 Yeah definitely. I think that might be also interesting for those who would like to join Qdrant core team. So I know that the core team works like they always try to keep the development momentum so they implement features in small steps from inside out so they can merge them into the main branch quickly without having many diverging versions. It also makes the reviews and collaboration easier. So in case of vector databases, I think that testing is key but we try not to overdo it. It’s not only to prove that if a particular feature works now, but we also try to prove that it won’t break in the future when we decide to change something. We also try to handle all the common cases in end-to-end tests and also try to keep the test code minimal so it’s not like 10 times bigger than the code for the feature itself.
Kacper Lukawski 00:53:36 And in our case, benchmarking is really hard because you can’t really benchmark an individual feature on some artificial data sets; we really need to think about real world use cases. The good thing is that we have lots of users already, so we can also build our test cases based on that, and our benchmarks too. We also started to recommend doing some custom benchmarks once the requirements are clear, because there are so many diverse ways people can use vector search, and that’s something that we’ve learned. And building distributed systems is really hard. That’s where we struggled a lot, and yeah, I think we are just getting better at building them even though our core team is really not that big.
Gregory Kapfhammer 00:54:21 Thanks for that response. Now we’ve talked a lot in this episode about different key concepts. So we talked about vector embeddings and similarity search and we’ve gone through many of the details about both how you use Qdrant but then also about how you actually have gone to build Qdrant or to do the benchmarking associated with Qdrant. At the very end of our discussion now, I’m wondering if you could comment briefly on what you see as like the future of vector databases and their overall role in what we might call the AI or Machine Learning Landscape.
Kacper Lukawski 00:54:52 Well, definitely vector databases are not dead, even though I’ve seen multiple posts on LinkedIn about the end of the whole industry. The main problem of AI or LLMs, and I know that we use these two terms as synonyms, is that even the latest LLM can suffer from knowledge cutoff, because they were trained on some specific data sets and definitely don’t know the most recent news, and none of them could have been trained on your own data. So definitely some sort of retrieval is needed, and vector databases will definitely serve that functionality for them, because semantic search is just so well suited for natural language like search or multimodal RAG, because that’s also something that started to be implemented recently. And vector search is not only important in terms of RAG or LLMs; nearest neighbor search is just such a versatile method that can solve multiple problems that I still feel it’s too early to say what the typical use cases will be in the upcoming two or three years. But I feel like many of us will start implementing something more than just retrieval augmented generation, and we will definitely see applications of vector search, for example, as some sort of guardrails before we input any data into an LLM, because we can perform anomaly detection with nearest neighbor search easily. And I’m looking forward to seeing some new use cases for that, and I’m pretty sure that’s definitely going to happen.
Gregory Kapfhammer 00:56:25 Okay, that makes sense. And I can say from my own experience associated with using Qdrant, it is flexible and it can handle a wide variety of different documents. So I do think it’s an area where there’s still a lot of growth. And the point that you made previously was a good one in regards to the fact that you often want your own standalone system for similarity search so that you can let the database that’s relational do what it’s good at and then have another system that can do what it’s good at. So with all of those points in mind and the thoughtful insights that you’ve shared so far, are there any additional topics that we didn’t cover that you think we should briefly discuss?
Kacper Lukawski 00:57:01 I think we should mention the importance of evaluation. That’s something that people tend to ignore when they build retrieval augmented generation or vector search in general. However, retrieval or search is not a new topic. We’ve been discussing proper ways to do retrieval and evaluate it for ages. And even though you may choose the best performing embedding model from the public leaderboards, or choose the best LLM, your system may just struggle with your specific data because none of these models was trained on something that would resemble it. So unless you are experimenting or, I don’t know, doing a side project over a weekend, it’s always a good idea to start your semantic search journey by building a well-curated ground truth dataset that will serve as a quality judge, so you can see whether your retrieval is really doing a great job.
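One simple way to turn such a ground truth set into a number is a recall-style metric over the curated queries; the sketch below is illustrative, and the metric and data shapes are assumptions rather than anything prescribed in the episode.

```python
# Recall@k over a curated evaluation set: for each query, what fraction of the
# known-relevant document IDs appear in the top-k retrieved results?
def recall_at_k(ground_truth: dict[str, set], retrieved: dict[str, list], k: int = 10) -> float:
    scores = []
    for query, relevant_ids in ground_truth.items():
        top_k = set(retrieved.get(query, [])[:k])
        scores.append(len(top_k & relevant_ids) / max(len(relevant_ids), 1))
    return sum(scores) / len(scores)

# Example shapes (illustrative):
# ground_truth = {"how do I reset my password?": {"doc-12", "doc-87"}}
# retrieved    = {"how do I reset my password?": ["doc-12", "doc-44", "doc-3"]}
```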
Gregory Kapfhammer 00:57:55 Thanks for that comment about evaluation. That makes a lot of sense. As we draw our episode to a conclusion, I’m wondering if you have a call to action for the listeners of Software Engineering Radio who want to learn more about Qdrant or get up and running and actually start to use it.
Kacper Lukawski 00:58:10 Definitely. I invite you to our regular webinars that we organize every single month, and please just check out the Qdrant Cloud offering, especially the Cloud Inference, which actually makes things a bit easier because you don’t really need to support and host your own embedding model. You can send your data directly, either text or images, and expect the server to create the vectors without you worrying about hosting a model, especially if you have no experience in that.
Gregory Kapfhammer 00:58:40 Thanks, Kacper. Hey, it’s really been fun to have this conversation on Software Engineering Radio. I really appreciate you being here and devoting all this time to tell us about the Qdrant database.
Kacper Lukawski 00:58:50 Thank you, Greg. That was a great pleasure to be here with you today.
Gregory Kapfhammer 00:58:54 All right, and if you’re a listener of Software Engineering Radio who wants to learn more about vector similarity search engines, I would encourage you to check the show notes for additional references and details. And now this is Gregory Kapfhammer signing off for Software Engineering Radio. Goodbye.
Kacper Lukawski 00:59:09 Goodbye.
[End of Audio]


