SE Radio 493: Ram Sriharsha on Vectors in Machine Learning

Ram Sriharsha of Pinecone discusses the role of vectors in machine learning, a technique that lies at the heart of many of the machine learning applications we use every day. Host Philip Winston spoke with Sriharsha about the basics of vectors, vector embeddings, feature engineering versus deep learning, hyperparameters, vector search, k-Nearest Neighbor search, alternative distance metrics, and Pinecone’s vector database as a service.

Show Notes

SE Radio

Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Philip Winston 00:00:21 This is Philip Winston with Software Engineering Radio. I’m here with Ram Sriharsha. Ram is Vice President of Engineering and Research and Development at Pinecone. Previously, he was Vice President of Engineering and Head of Machine Learning at Splunk. And before that, he worked at Databricks and Yahoo. He’s on the project management committee for Apache Spark. Ram has a PhD in physics from the University of Maryland. Is there anything else in your background you’d like to highlight?

Ram Sriharsha 00:00:51 No, I think that covers all of it. It’s great to be here and thanks for inviting me.

Philip Winston 00:00:56 Great. Before we get started, I wanted to list two previous episodes we’ve done on machine learning and there are others in the episode history. The first one is Episode 391, Jeremy Howard on Deep Learning, and Episode 286, Katie Malone, Introduction to Machine Learning. So, in this episode, we’re going to talk a lot about vectors. In the context of machine learning, what are vectors?

Ram Sriharsha 00:01:22 Big question. People are, I think, very familiar with the concept of vectors. You might have heard it in other contexts. I would say this appears a lot in physics and engineering. Often, they’re familiar with quantities that have a magnitude and maybe a sign; these are what we call scalars. A good example is temperature, right? Temperature is a scalar. Similarly, there are objects or quantities that have both a magnitude and a direction. You can think of velocity, displacement, acceleration. These are all good examples of vectors. One good way to visualize a vector would be to take a point in an n-dimensional space. If you take the origin as another point in that space, the line connecting the origin to that point has a direction. And that line you can think of as physically representing a vector. These are concepts that are very familiar to you from non-machine learning settings. They mean the same thing in machine learning.

Philip Winston 00:02:07 Okay. For physics I’ve done simulations maybe, or computer graphics and 2D and 3D. So in 3D, a velocity vector would have three components, X, Y, and Z. In machine learning we’re talking about higher dimensional vectors most of the time?

Ram Sriharsha 00:02:23 Yes, that is correct. So, the only big difference is the vectors could be 1024 dimensions, even bigger.
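
To make the magnitude-and-direction idea concrete, here is a minimal Python sketch. The helper names and the sample values are illustrative assumptions, not something from the episode; the point is only that a 1024-dimensional vector works the same way as a 3D one.

```python
import math

def magnitude(v):
    """Length of a vector: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

def direction(v):
    """Unit vector pointing the same way as v (v must be non-zero)."""
    m = magnitude(v)
    return [x / m for x in v]

velocity_3d = [3.0, 4.0, 0.0]    # X, Y, Z components
embedding_1024d = [0.5] * 1024   # same idea, just many more components

print(magnitude(velocity_3d))    # 5.0
print(direction(velocity_3d))    # [0.6, 0.8, 0.0]
print(len(embedding_1024d), magnitude(embedding_1024d))  # 1024 16.0
```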

Philip Winston 00:02:30 Okay. Even people who aren’t too involved in machine learning have heard of Google’s machine learning platform, TensorFlow. What is a tensor and does it relate to vectors?

Ram Sriharsha 00:02:40 That’s a great question. So again, a tensor is a concept that may be familiar to people who have been in fields like mechanics or fluid dynamics and things like that. They’ve heard about these sort of things. A good example is stress on a body, right, on an object. If you think about a small ball or a small kind of cube, you can imagine stretching it and compacting it in three independent directions. But you can also imagine shearing it, kind of applying a force at an angle to it. In a sense, to actually measure stress on a three-dimensional object, you can’t just have three numbers. You need nine numbers, right? Basically, three independent directions, times three. We call this a rank two tensor. So basically, the stress energy of a body is a rank two tensor, and higher dimensional examples of this exist. So basically, a tensor is nothing but a generalization of a vector, just like a vector is a generalization of a scalar. You can think of a rank zero tensor as a scalar, a rank one tensor as a vector, and an example of a rank two tensor is stress energy, and so on.
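
The scalar, vector, rank two progression can be sketched with nested Python lists. The `rank` helper and the numbers below are hypothetical and chosen only for illustration; real libraries use dedicated array types rather than lists.

```python
def rank(t):
    """Count nesting depth: 0 for a number, 1 for a list of numbers, and so on."""
    depth = 0
    while isinstance(t, list):
        depth += 1
        t = t[0]
    return depth

temperature = 21.5                 # rank 0: a scalar, one number
velocity = [1.0, 0.0, -2.0]        # rank 1: a vector, 3 numbers
stress = [[10.0, 2.0, 0.0],        # rank 2: 3 x 3 = 9 numbers
          [2.0, 5.0, 1.0],
          [0.0, 1.0, 8.0]]

print(rank(temperature), rank(velocity), rank(stress))  # 0 1 2
```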

Philip Winston 00:03:44 Great. So, since we’re emphasizing the software engineering aspect, I’d like to just talk briefly about specific data types. So, in regular programming, I’m usually choosing between floats and doubles for my floating point values. And those today, I think are generally 32 bit or 64 bit. I’ve heard in machine learning, they sometimes use 16 bit or even 8 bit floating point numbers. Why does machine learning want lower precision values? And how would I choose what precision to use?

Ram Sriharsha 00:04:15 We generally tend to use floats a lot in machine learning, but as you mentioned, 8 bit and 16 bit are also quite common. There are two big reasons to use these. The first is compactness: these 8 bit and 16 bit representations just produce more compact models. The other reason is sometimes they’re actually more amenable to taking better advantage of the CPU. So, they can actually lead to faster inference and so on. The main trade-off here is between compactness and speed versus accuracy. So, you’re losing a little bit of floating point precision using 8 bits and 16 bits and so on. So, then it’s a question of, are you accurate enough for the speed, the compactness, and the reduced storage requirements?
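
The compactness-versus-precision trade-off can be seen with Python’s standard `struct` module, which can store a number as a 64, 32, or 16 bit float. This is only a sketch: the sample value is arbitrary, and 8 bit floats are left out because `struct` has no format for them.

```python
import struct

def round_trip(value, fmt):
    """Store value in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

x = 0.1234567  # arbitrary sample value
formats = [("float64", "<d"), ("float32", "<f"), ("float16", "<e")]

errors = {}
for name, fmt in formats:
    stored = round_trip(x, fmt)
    errors[name] = abs(stored - x)
    # Fewer bytes per value means a more compact model but a larger error.
    print(name, struct.calcsize(fmt), "bytes, error", errors[name])
```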

Philip Winston 00:04:56 Okay, great. That gives us a little idea about vectors. Now let’s talk about vector embedding. What does vector embedding mean in the context of machine learning?

Ram Sriharsha 00:05:05 An embedding is nothing but a transformation, or a mapping, from one vector space to another. So, a good example could be, if you just take an image and look at the pixels of the image, the image could be, say, 28 by 28 pixels, right? So, it could be a grayscale image of 28 by 28 pixels, and I’m thinking of something like MNIST; that gives you a 784 dimensional kind of representation of an image. Now this may not be the best representation of the image. This is just capturing the pixelated version of the image. What you may want to do is to pass it through some kind of a machine learning model that learns a lower dimensional representation that may be better suited for judging semantic similarity between images. An embedding is nothing but that. Basically, an embedding is a mapping that takes some raw representation of your unstructured data, like images or text and so on, and produces a smaller, more compact representation that is a better fit for your downstream tasks.
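
A rough sketch of the shape of that mapping, in plain Python. The fixed random projection below is only a stand-in for a trained model (it learns nothing), and the output size of 32 is an arbitrary assumption; it shows how 28 by 28 pixels become a 784-dimensional raw vector and then a smaller embedding.

```python
import random

def flatten(image):
    """Turn a 2D grid of pixels into one long raw vector."""
    return [pixel for row in image for pixel in row]

def make_projection(in_dim, out_dim, seed=0):
    """Placeholder for learned weights: a fixed random matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

def embed(raw, projection):
    """Map the raw vector into the smaller space (one dot product per row)."""
    return [sum(w * x for w, x in zip(row, raw)) for row in projection]

image = [[0.0] * 28 for _ in range(28)]  # blank 28 x 28 grayscale image
image[14][14] = 1.0                      # one bright pixel

raw = flatten(image)
projection = make_projection(len(raw), 32)
embedding = embed(raw, projection)
print(len(raw), len(embedding))          # 784 32
```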

Philip Winston 00:05:59 Okay, great. I just wanted to flag one acronym or abbreviation. You said MNIST, what is it?

Ram Sriharsha 00:06:05 MNIST is a very popular image dataset. So, it’s a dataset of grayscale images of handwritten digits, and it’s very popular as a benchmark dataset.

Philip Winston 00:06:14 Okay. And I’m not sure if this falls directly from discussing embedding, but what is feature engineering and do the values in my vector correspond to features or is it not that simple?

Ram Sriharsha 00:06:27 Yes. Feature engineering is basically, again, any process that takes a raw representation of your data and produces a more compact representation that’s better suited for, say, classification, regression, or any of your downstream tasks. And yes, you’re right that you can imagine taking the specific weights of your vector, or the components of the vector, and treating them as features. Today a lot of feature engineering comes from deep learning models and so on, where there is not a direct tie-in between the coordinate of a vector, or the actual value of a coordinate, and any semantically meaningful feature. But yes, you’re right.

Philip Winston 00:07:03 Yeah. I was going to ask about deep learning next. When I’ve looked at a table of contents for a course on machine learning, I was surprised there are quite a few techniques before they even introduce neural nets, let alone deep learning. But certainly, it seems very popular today. How has the rise of deep learning affected the type of vector embeddings you see or how they’re created? Like what has been the impact of deep learning?

Ram Sriharsha 00:07:30 Yeah. So, feature engineering has existed for a long, long time before deep learning. Mostly people have been doing handcrafted features. They are pretty time-consuming to develop and they require a lot of domain expertise. And in areas like NLP, for example, feature engineering used to mean TF-IDF sort of transformations and so on, which basically generate sparse vectors. What deep learning I think has done, which is pretty remarkable, is it has made it extremely easy for anyone to take a pre-trained model and generate good enough embeddings. So, you can take these models that have been trained for other tasks and use them to generate features that are useful for training for your tasks. The commercial viability of this and the availability of high quality pre-trained models have completely unlocked this and kind of changed the game. So that is actually why you’re seeing a lot of feature engineering being done by deep learning models today: because of the availability of the technology and the ability of these models to produce very high quality embeddings that can transfer to your own tasks.
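
For readers who have not seen the handcrafted approach, here is a small sketch of a TF-IDF transformation producing sparse vectors. It uses one common variant of the formula (term frequency times the log of inverse document frequency); the tiny corpus and the function name are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one sparse vector (a dict of term -> weight) per tokenized document."""
    n = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "vectors power search".split(),
]
vectors = tf_idf(docs)

# Each vector stores only the terms its document contains, so it is sparse
# relative to the full vocabulary; rarer terms get higher weights.
print(vectors[0])
```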

Philip Winston 00:08:33 I was going to ask about this term pre-trained later on, but let’s talk about it a little now. So, the idea is a third party or an expert has done neural net training, done deep learning training, and you use the network they created, but then maybe you add your own special spin on that, and that’s made it easier for people to get started.

Ram Sriharsha 00:08:54 Yes, that is 100% correct. Often these are big companies like Google, Facebook, OpenAI, and so on, who have created these very large pre-trained models that have been trained on huge corpuses. And they cost a lot of money, quite frankly, to train. But once trained, these models are available for everyone. Even though these models have been trained on these large public datasets, they capture semantically interesting representations that can be useful for your own tasks. So, your task may not have anything to do with Wikipedia datasets, but these models are so good that you can actually use them as a starting point to fine-tune for your own tasks. And again, that has kind of been a game changer.

Philip Winston 00:09:15 That’s great. It sounds like sort of an evolution of open source in the terms of these are open models, I guess, that you’re able to use?

Ram Sriharsha 00:09:41 Yes.

Philip Winston 00:09:43 So going back to vector embedding in the details, you mentioned long vectors, I think 1024 elements. What determines the length? Is that something that I determine if I’m creating the model, or does the training actually tell me how long the vector needs to be?

Ram Sriharsha 00:10:01 It’s a great question. So, it is what is called a hyperparameter. So, it is one of the parameters that’s an input to your process for creating these models, as opposed to being a learned parameter. This basically roughly translates to the number of layers that you may want to use in your model. Again, every task is different. So, you may want to have different dimensional embeddings for different tasks, but often what happens is people take one of these pre-trained models and use it as a starting point. In which case you get the dimensionality of that model or that embedding as a starting point. And that is usually like 784 or 1024 dimensions as well.

Philip Winston 00:10:37 Okay. You mentioned sparse versus dense, I think it was for vector embedding. What properties in general does a good vector embedding have, and how am I able to judge that if I’m trying to come up with one?

Ram Sriharsha 00:10:50 A good vector representation has to be compact, and it also has to capture semantic similarity. So basically, the main reason to create these vector embeddings is that nearness in the embedding space tells you something about nearness in your images or in the things that you care about. So, they have to have semantically rich content, but they also have to be compact, right? So that you can store them easily, you can analyze them easily, and you can query them effectively.

Philip Winston 00:11:18 That makes sense. So, we’ve talked about vectors and vector embedding. Before we go into vector search and vector databases, I wanted to step way back and give some applications that people might be familiar with that are probably using vector search or vector databases under the hood. Just something very concrete that we can think about.

Ram Sriharsha 00:11:38 Perhaps a classic example would be Netflix’s recommendation sort of algorithms or, in general, any sort of retail product recommendation. The general idea is that if you have users and you know their user preference vectors, and if you have, say, movie ratings and you know their movie rating vectors, then essentially semantic similarity can be used. Vector search can be used to tell you, for any new user, what sort of movies to recommend, essentially by looking at similarity between users and movies. This sort of idea extends to any sort of product recommendation. On the other end of the spectrum, you can also do things like detecting near duplicates in images, right? So, if you have an image and if you have a corpus of images, you may want to find out, is this image a near duplicate of something that’s already in your corpus? Now it is easy to find exact duplicates, but images can be near duplicates in many ways, right? They can be slightly grayed out. They can have a small rotation in the image and things like that. It’s extremely hard for you to physically quantify what makes an image a near duplicate. But if you pass the images through an embedding model, then you can look at the nearness between the two embeddings. And that can be used to tell you whether the two images are near duplicates. So, these are a couple of examples.
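
A toy sketch of the recommendation idea, assuming users and movies already live in the same small vector space. The three "genre" dimensions, the titles, and every number are invented for illustration; this is not how any named service actually works.

```python
import math

def cosine_similarity(a, b):
    """Similarity based on the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical dimensions: (action, comedy, documentary)
movies = {
    "Explosions III": [0.9, 0.1, 0.0],
    "Laugh Track": [0.1, 0.9, 0.1],
    "Deep Ocean": [0.0, 0.1, 0.9],
}
user = [0.8, 0.3, 0.0]  # hypothetical user preference vector

def recommend(user_vector, catalog, top_n=2):
    """Rank titles by similarity to the user vector, most similar first."""
    ranked = sorted(
        catalog,
        key=lambda title: cosine_similarity(user_vector, catalog[title]),
        reverse=True,
    )
    return ranked[:top_n]

print(recommend(user, movies))  # ['Explosions III', 'Laugh Track']
```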

Philip Winston 00:12:52 Okay, great. That makes a lot of sense. It sounds like this is the type of thing we see all the time online, these days with services and websites. So yeah, let’s move into vector search. So, we have our vector embedding. We’ve created vectors, I guess, for millions of items or many, many items. Are we always searching for similarity or what are we searching for exactly?

Ram Sriharsha 00:13:16 There is both similarity as well as anomalies or dissimilarities. For example, you can look at similar items. You can also look at what is not similar to anything that’s in your corpus. And that’s usually what we call an anomaly. But you can also use this for classification sort of tasks, where you not only retrieve things that are similar to each other, but you also look at how those similar things have been labeled, and then use the majority label to predict something. This can be used for classification and regression, it can be used for similarity, and it can be used for anomaly detection. Interestingly enough, the set of use cases that vector search powers is growing by the day, and whatever you would have traditionally thought of as search can be thought of as a strict subset of what vector search can do.

Philip Winston 00:13:59 You mentioned classification. Is clustering the same thing? I saw that referenced on your website, I think.

Ram Sriharsha 00:14:05 Yeah. Yeah. Clustering is an unsupervised technique. Classification means you have labels, someone has labeled the data for you, and you want to take a new point and detect whether it has a certain label. In clustering, you’re just looking at things that are close to each other. It’s an unsupervised technique and it’s very common either as a pre-processing technique or just to identify patterns in your data.

Philip Winston 00:14:25 Okay. That’s interesting. So, there are many types of databases: SQL, key-value stores, graph databases, document databases. How is vector search similar to some of these existing database types and how is it different?

Ram Sriharsha 00:14:40 A vector database is in the category of, say, time series databases or spatial databases and so on, where you have all the traditional problems of a database, including data management, freshness, transactional updates, and all the things that you care about a database providing, but you’re dealing with a new type of data, right? In the case of time series data, it’s a new type of data; traditional relational databases aren’t optimized for dealing with time series data. Similarly, traditional databases are not quite optimized for dealing with geospatial data. What makes vector search even more interesting, I think, from my perspective, is that it’s not just a new type of data or a new representation of data. It’s that vector search is fundamentally computationally intensive in a way that not even geospatial, time series, or any of the other databases have to deal with. It’s just orders of magnitude more computationally intensive, especially because you’re dealing with this high dimensionality problem, and the sort of algorithms that have to do indexing tend to be computationally intensive. It stretches the boundary of what databases can do. So, it’s not easy for you to take an existing database and kind of provide this capability on top of it. But other than that, there’s a lot of commonality between traditional databases and what you’d get.

Philip Winston 00:15:52 Okay. And a typical database, we construct one or more indices to make searching tractable where an index is an additional data structure that’s constructed from the input data and it’s created and maintained by the database internally. Does vector search require an index? And is it a different type of index?

Ram Sriharsha 00:16:12 Yes, vector search requires a specialized index. A K-Nearest Neighbor search index is the terminology we use. In that sense, it is very similar to the sort of B-tree indexes or other indices that you would have seen in databases. The place where it is different is that this is basically still cutting-edge research, where it is a challenging problem to come up with indexing techniques for nearest neighbor search that are as up to date as your data is, while at the same time being extremely efficient to query. So that is really the challenge that we’re trying to solve. But other than that, you can logically think of it like B-trees or any other index.

Philip Winston 00:16:52 Okay. And just to guide us here, what range of scales do you see developers creating vector search for? What is typical, or what would be on the high end of scale for either the vectors you’re searching over or the number of queries or any dimensions?

Ram Sriharsha 00:17:08 Yeah, there are a few dimensions here. One is we see people creating millions of vectors. We also see people creating billions of vectors, billions of vectors being searched over. So, in terms of just volume, there is a pretty big variance in what we see. There are also other challenges here. For example, we see people who have to update their vectors quite often. These are use cases where vectors change very often. We also see people who may only create a million vectors, but they have to query them with like 100 queries per second, a thousand queries per second, and so on. We also see people who are perfectly fine with 100 milliseconds or 500 milliseconds, or even a one-second latency when doing queries, but they just want to be able to store and query 15 billion or 50 billion vectors. So, the set of use cases we see is quite varied across multiple dimensions, whether it is freshness of your index, query throughput, or ingestion throughput. So, you have to optimize across these dimensions.

Philip Winston 00:18:00 So just thinking about the implementation of that, are you detecting or inferring like what the usage pattern is and then steering that to a different implementation or do you try to have an implementation that can kind of roll with whatever is being sent?

Ram Sriharsha 00:18:17 So at a high level, we kind of tend to separate hybrid indexes from in-memory indexes. So basically, we have two types of indexes: one that is optimized for storing as much as possible, whereas the other one is optimized for low-latency queries. But that said, when we build indexing algorithms, we tend to build ones that are, first of all, highly accurate and have provable accuracy guarantees, but at the same time, they are actually optimized for freshness. Because a lot of our customers actually prefer freshness, we want to build things that are quick to update and quick to kind of react to changing data. So yes, at a high level, we kind of tend to have in-memory indexes that are more optimized towards query latency, whereas hybrid indexes are optimized towards storing massive amounts of data. But we are very careful in designing parts of our system so that they can be reused across these two as much as possible.

Philip Winston 00:19:09 Yeah, that makes sense. So, you mentioned K nearest neighbor search. Can you explain a little bit more about how that works and is the K the number of dimensions in the vector or is it something else?

Ram Sriharsha 00:19:22 K-nearest neighbor search is one of the earliest machine learning algorithms, and the algorithm works like this: it basically memorizes the training dataset. So basically, suppose you give it a training dataset; it simply stores it and memorizes it. Then, at the time when you give it a query, it looks for the K nearest points to that query vector. And it uses the majority label of those nearest points to say that is the label of this query point. The K there comes from the number of near vectors that it looks at. So, K=1 means it looks at the nearest neighbor and that’s it. K=3 means it looks at the three nearest neighbors, and so on. So that K does not have anything to do with the dimensionality of the vector space itself, but rather with how many near neighbors we query.
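
The algorithm as described fits in a few lines of Python. This is a minimal sketch with made-up two-dimensional points and labels; note that "training" is nothing more than keeping the list around.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """training is a list of (vector, label) pairs that is simply stored as-is."""
    # Sort the stored points by Euclidean distance to the query.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    # Take the labels of the k nearest points and return the majority label.
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

training = [
    ([1.0, 1.0], "cat"), ([1.2, 0.8], "cat"), ([0.9, 1.3], "cat"),
    ([5.0, 5.0], "dog"), ([5.2, 4.9], "dog"),
]

print(knn_predict(training, [1.1, 1.0], k=3))  # cat
print(knn_predict(training, [4.0, 4.0], k=1))  # dog
```

Here k controls how many neighbors vote, while the dimensionality is just the length of each stored vector (2 in this toy data).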

Philip Winston 00:20:02 Okay. That makes sense. Yeah. That’s good to disambiguate that. When thinking of the distance between two vectors, the most natural distance function is probably Euclidean, which is the straight line distance between two points. But in machine learning, there are a number of different distance functions or metrics. Can you explain what some of these are and why, or how you would know to use one of these alternate functions?

Ram Sriharsha 00:20:26 Yeah. Different distance measures are useful for different sorts of tasks. A good example is Manhattan distance. The name actually comes from the way the street system or the grid system is laid out in Manhattan. Everything is rectangles there. So, if you want to travel from one point to the other, you’re actually following one of the north-south or east-west grids. So, the distance between two points in a Manhattan grid is no longer the Euclidean distance, or the crow-flying distance. It is really the distance that you’ve traveled along the north-south edge and the east-west edge. So, in that case, it makes sense to use Manhattan distance, which is basically the sum of the distances along the axes. Similarly, we tend to see cosine distance being used quite often, especially around text similarity measures, and the main difference there is that cosine only cares about the angle between two vectors. It doesn’t care about whether one vector is really large or really small, but in Euclidean distance, it would matter whether the vectors are small or large, right? That will dramatically change the distance between the two vectors. Even if you double the length of a vector, as long as the angle with the other vector doesn’t change, the cosine doesn’t change. So, depending on the properties of what you care about in terms of similarity, you may tend to choose cosine distance in some places versus Euclidean in others.
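
The street-grid picture can be checked with a short sketch. The two points are arbitrary; the functions are the standard definitions of Euclidean and Manhattan distance.

```python
import math

def euclidean(a, b):
    """Straight-line ("crow-flying") distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the distances travelled along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

corner_a = [0.0, 0.0]
corner_b = [3.0, 4.0]   # 3 blocks east, 4 blocks north

print(euclidean(corner_a, corner_b))  # 5.0
print(manhattan(corner_a, corner_b))  # 7.0
```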

Philip Winston 00:21:38 So can you give an example of maybe where one of these distance functions was inappropriate or would be inappropriate versus one, which would be sort of a more natural fit?

Ram Sriharsha 00:21:48 Yeah. Text similarity is a good example. So, one of the reasons why people use cosine distance there is this: imagine you just had two words. These two words were the only two words in your corpus. And you can think of them as being laid out in two dimensions, one of the words being on one of the axes, the other one being on the other axis. When you have a document, you may have a document that has both of these words, and these words may appear 10 times each in one document. The other document may have the same two words appear once each. Now in the case of cosine, the cosine similarity between the two documents would be one, because the fact that you scaled everything by a factor of 10 or 100 doesn’t change the angle between the two documents. Which means the two documents are going to be very closely related to each other. In the case of Euclidean distance, that may not be true, right? Because the distance now is much larger simply because one of the documents had 100 occurrences of each of these words. So that’s an example where you don’t want to use Euclidean distance, because you care about just the fact that the two words co-occur; the fact that they co-occur twice or thrice doesn’t change that.
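
The two-word example can be worked through numerically. In this sketch each document is a vector of word counts, (10, 10) and (1, 1); the variable names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """1.0 when two vectors point in exactly the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

doc_long = [10.0, 10.0]   # both words appear 10 times
doc_short = [1.0, 1.0]    # both words appear once

similarity = cosine_similarity(doc_long, doc_short)
distance = math.dist(doc_long, doc_short)

# The direction is identical, so cosine treats the documents as the same,
# while Euclidean distance sees them as far apart (about 12.7).
print(round(similarity, 6), round(distance, 1))
```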

Philip Winston 00:22:52 Okay. So, I can see how it sort of bubbles up to the semantics or the meaning of what you’re actually implementing. One more distance function, I saw, I’m just curious if you know, offhand where this would be used was the Chebyshev?

Ram Sriharsha 00:23:06 Yes. We very rarely see it, to be honest. I think it’s also called the L-infinity distance or something. So basically, if you look at a distance in two dimensions, Euclidean, in a sense, would be looking at the shortest distance between the two points. Chebyshev distance looks at the longest distance you have to travel along an axis to get to that point. I think this comes across more in areas where, for example, imagine that you had a crane and you had to take some packages from one point to another, in two dimensions. But your crane could only move along two axes, right? It can only move north-south or east-west. And it moves at the same speed in both directions, which means if you want to measure how long it takes to get from one point to the other, you will use the Chebyshev distance. Because all that matters is the maximum distance along a particular axis that you have to travel. Because in that same time, you’ll be travelling along the other axis anyway. So that’s an example of that. So, it’s probably in, like, warehouse logistics and things like that that you would end up using something like this, but honestly, we don’t see that particular metric too often.
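
The crane example translates directly into code. The coordinates and the speed of one unit per second are assumptions made up for this sketch.

```python
def chebyshev(a, b):
    """Largest distance that has to be covered along any single axis."""
    return max(abs(x - y) for x, y in zip(a, b))

crane = [0.0, 0.0]
package = [7.0, 3.0]
speed = 1.0  # assumed: same speed on both axes, which move at the same time

# The 3 units along the second axis are covered while the crane is still
# travelling the 7 units along the first, so only the larger one matters.
travel_time = chebyshev(crane, package) / speed
print(travel_time)  # 7.0
```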

Philip Winston 00:24:07 Okay. That’s interesting. I can imagine some distance metrics are more popular or more common than other ones. So, you mentioned vector search is computationally intensive. Are these distance functions part of that difficulty? Is that a consideration? Are there more expensive or cheaper distance functions or is that not really where the cost is coming from?

Ram Sriharsha 00:24:30 There is a lot of cost that we have to pay before we even have to worry about distance functions. The biggest challenge with nearest neighbor search is that exact nearest neighbor search is just computationally infeasible, right? So exact nearest neighbor search, or the simple algorithm, would simply be: if I give you a query point, scan the entire database, compute exactly the distance between the query point and every other point in your database, and just sort and figure out what the top K are. That’s like the naive algorithm, and it simply doesn’t scale. It doesn’t scale because the corpus is huge. It also doesn’t scale because the number of dimensions is very high. The other challenge happens even if you say you don’t want exact search and you’re okay with approximation. Approximation itself has challenges in high dimensions. So, there’s something called the curse of dimensionality, because of which, once you go into higher dimensions, even designing approximate nearest neighbor algorithms becomes very challenging, right? And any sort of exact algorithm kind of isn’t any better than the naive algorithm. So, these are the bigger challenges we have to solve. But when designing such algorithms, sometimes some of them may not have good approximation guarantees for a generic metric. For example, they may only have really good approximation guarantees for specific metrics, right? These are the things we have to worry about, but usually the problems start even earlier than that.
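
The naive exact search described here is easy to write down, which makes its cost easy to see: one distance computation per stored vector, each touching every dimension. A minimal sketch with a tiny made-up database:

```python
import heapq
import math

def exact_top_k(database, query, k):
    """Scan every stored vector and return the indices of the k closest ones.

    Work grows with (number of vectors) x (number of dimensions) per query.
    """
    return heapq.nsmallest(
        k,
        range(len(database)),
        key=lambda i: math.dist(database[i], query),
    )

database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
print(exact_top_k(database, [1.0, 1.0], k=2))  # [1, 3]
```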

Philip Winston 00:25:45 Yeah, I was going to bring up approximate nearest neighbors, since it sounds like that’s what’s actually used at any appreciable scale. What determines just how approximate the results are? Is that a question of how long you spend evaluating the results? Or is it a question of the algorithm? If I have a knob between very approximate and less approximate, how would that work?

Ram Sriharsha 00:26:08 When people use the word approximate, it’s a little bit hand-wavy. Meaning there are algorithms out there that are called approximate nearest neighbor search algorithms, but the extent of approximation is unknown. A good example is things like HNSW. These are really great algorithms, by the way, but there are no guaranteed bounds on the approximation. So, it’s not like you can tune a knob, and the more you tune that knob, the better the approximation gets. It’s not like that. That is not the case for the algorithms that we developed at Pinecone. In fact, the algorithms we develop have a tunable knob. So, you can actually tune a knob to get higher accuracy, if you want it. The other thing is, when machine learning scientists talk about approximate nearest neighbor search, they usually mean something very precise. Which is, they mean that this algorithm is going to give you a point whose distance is within a factor of 1 + epsilon, a small fudge factor, of the distance to the exact nearest point. You can make this very precise. And those are the sort of algorithms that we developed. So, when the machine learning literature talks about approximate nearest neighbor algorithms, they have a very strong guarantee, which is that there is a trade-off between how close you get to the exact result versus how much memory and space and time that algorithm takes. But in practice, a lot of algorithms you see out there don’t have this property. Very few of them do.
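
The "1 + epsilon" guarantee mentioned here is usually stated as: the returned point is at most (1 + epsilon) times as far from the query as the true nearest neighbor. The sketch below only checks that textbook condition against a brute-force answer; it is not Pinecone’s algorithm, and the data and function name are hypothetical.

```python
import math

def satisfies_guarantee(database, query, returned, epsilon):
    """True if `returned` is within (1 + epsilon) of the true nearest distance."""
    best = min(math.dist(p, query) for p in database)
    return math.dist(returned, query) <= (1.0 + epsilon) * best

database = [[3.0, 4.0], [3.3, 4.4], [10.0, 0.0]]
query = [0.0, 0.0]   # true nearest neighbor is [3.0, 4.0], at distance 5.0

print(satisfies_guarantee(database, query, [3.3, 4.4], epsilon=0.2))   # True
print(satisfies_guarantee(database, query, [3.3, 4.4], epsilon=0.05))  # False
```

A smaller epsilon demands an answer closer to the exact one, which is the accuracy side of the trade-off against memory and time.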

Philip Winston 00:27:30 So the vector search portion of let’s say our machine learning application is not necessarily, or maybe it’s not evaluating a neural network at that point. Is that true? Or is it possible to have a distance function that itself is learned or itself is a neural network?

Ram Sriharsha 00:27:50 By the time we get vectors in a sense that training has already been done. So, the model has been trained and people are sending the unstructured data to the models to get an embedding, and we store those embeddings and query them. As part of training, you can also do what’s called metric learning, which is to actually learn the distance function itself. People can do that. Now, there is no way they can do that on top of Pinecone today, because for us, we get the embeddings, but it’s certainly something that people can do. There is a lot of literature on how to do this.

Philip Winston 00:28:21 That’s certainly a trend we see in software engineering: something that was complex and manual is becoming learned by deep learning or a similar technique. So, we’ve talked about vector search. We’ve kind of veered into vector databases also, but now I want to focus on vector databases and Pinecone. So, can you explain what the motivation was in creating the service? What need did you see in the industry that the service was trying to meet?

Ram Sriharsha 00:28:50 Big companies like Netflix, Google, and so on have been developing this sort of technology over the last few years. A lot of what we do is bringing what has only been available to a few big companies so far into the mass market and making it a SaaS product that anyone can use. So, I would say that’s what we are focused on today. We now know how to do this, and now it’s just a matter of scaling it and making that available to everyone. At the same time, embeddings have become very popular, and it’s become very easy for people to get millions and billions of embeddings and kind of work with them, which has unlocked a lot of new use cases for them. The main challenge for many of these customers is cost and scale. So, if you don’t make things a hundred times less costly, it simply does not unlock certain use cases, right? So big companies can afford to spend millions and millions of dollars on some of these technologies, but most companies cannot. What we are trying to do is to bring the cost down to a point where people can benefit from these emerging use cases that they wouldn’t otherwise. And since nothing existed that could do this, we started the company.

Philip Winston 00:29:55 Can I ask what Pinecone stands for or where the name Pinecone came from?

Ram Sriharsha 00:30:00 I think we had to change the name because there was some clash with the original name that we had picked, and we kind of brainstormed potential names, and Pinecone was the name that stood out. But as far as I can tell, there is no cool reason for choosing Pinecone.

Philip Winston 00:30:16 Okay. We talked a little bit about maybe the interface between what Pinecone does and what the developer or customer is doing. Can you talk a little bit more about that? What am I sending to you? Is the customer hitting your API, or is the end user hitting your API? How does that work?

Ram Sriharsha 00:30:33 We provide multiple types of APIs: HTTP, Python, and so on. So, all you do is use our APIs. You can spin up as many indices as you want. You can size them however you want. And you get these APIs through which you can submit vectors and metadata to us. You can do data management. You can also do queries. Again, some of this will evolve over time, but today we’re kind of laser focused on building the world’s best vector database, which means our focus is on people who already have embeddings. Once you have those embeddings, you kind of send them to us. You can use Pinecone to query them. Over time, we may also kind of start helping people who don’t have embeddings, but who have a lot of unstructured data. But today we expect people to have embeddings.
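
The workflow described here (submit vectors with metadata, manage the data, run queries) can be sketched with a toy in-memory index. This is not Pinecone's client or API: the class `ToyVectorIndex`, its method names, and the sample documents are all hypothetical, and the search is a brute-force cosine-similarity scan rather than a real index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorIndex:
    """Hypothetical in-memory stand-in for a vector database index."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._items = {}  # item_id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata=None):
        """Insert a vector, or overwrite it if the id already exists."""
        if len(vector) != self.dimension:
            raise ValueError("vector has the wrong dimension")
        self._items[item_id] = (list(vector), dict(metadata or {}))

    def delete(self, item_id):
        """Remove a vector if present (simple data management)."""
        self._items.pop(item_id, None)

    def query(self, vector, top_k=3):
        """Return up to top_k (id, score, metadata) tuples, most similar first."""
        scored = [
            (cosine_similarity(vector, stored), item_id, meta)
            for item_id, (stored, meta) in self._items.items()
        ]
        scored.sort(key=lambda entry: entry[0], reverse=True)
        return [(item_id, score, meta) for score, item_id, meta in scored[:top_k]]


index = ToyVectorIndex(dimension=2)
index.upsert("doc-1", [1.0, 0.0], {"title": "first"})
index.upsert("doc-2", [0.0, 1.0], {"title": "second"})
index.upsert("doc-3", [0.9, 0.1], {"title": "third"})

results = index.query([1.0, 0.0], top_k=2)
print([item_id for item_id, _, _ in results])
```

Calling `upsert` again with an existing id replaces the stored vector, which is the kind of operation a frequently refreshed embedding would rely on.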

Philip Winston 00:31:18 Yeah, that makes sense. So, you kind of alluded to this, but what are some operational challenges with running vector search at scale? What problems did you have to solve that maybe every developer does not want to solve?

Ram Sriharsha 00:31:33 All of the common problems of elastically scaling such a service exist. So, for example, embeddings essentially are coming from unstructured data, and unstructured data is growing exponentially for customers. Right off the bat, we have to build something that has the ability to elastically scale, because people are just throwing more and more data at it over time. So those cloud elasticity problems already exist for us. But on top of that, what’s really challenging is building the technology that makes efficient indexing possible. So, a lot of the time we spend is on how do you ingest data, like large volumes of data, fast, and how do you adapt your indices so that they can kind of get up to date fast. At the same time, how do you take advantage of modern hardware, whether it is SSDs, whether it is specialized CPU instructions available on these cloud machines, and so on, to drive down the cost of queries and the cost of storing all this data? So, most of our time is spent optimizing for cost, scale, and freshness. And these are all hard problems just because of the nature of it, vector search being a computationally intensive problem.

Philip Winston 00:32:39 Right, one aspect of scaling that doesn’t get talked about much is scaling down. I know you give out API keys for people to try the service. I imagine a lot of developers have uploaded test data. Maybe they only do a few test queries. Were there any engineering decisions that were specific to hosting this long tail of smaller databases?

Ram Sriharsha 00:32:59 Yes. So, we have a multi-tenant kind of service that people use. If you have a very small database, in fact, on our free tier, if you have one million vectors and so on, you can just spin up a very tiny pod. But then the small pod can kind of do a lot of queries. One of the things we are doing also is things like offloading data to block storage. So, you can spin down your instances and you won’t lose data. We are very mindful of those sorts of use cases, which may be smaller, and where you may not be querying all the time, but when you need to query, you need to quickly bring up an index and kind of do your work.

Philip Winston 00:33:31 So you kind of mentioned this, but I’m interested in how often the model gets updated for a typical machine learning application. I think you kind of discussed this, but can you give an example of an application where you might perform millions of searches with the same model, where the model is calculated just once, or an application where the model needs to be updated frequently? And how does Pinecone support both of these extremes?

Ram Sriharsha 00:33:56 Yeah, there are actually quite a few of these cases where you train a model once and you kind of create embeddings. And those embeddings are very stable. Often when we think about documents and searching documents, if these documents don’t change very often, you can treat these embeddings as being generated just once. And then you are more often searching than updating the vectors. On the other hand, if you look at use cases like advertising, product recommendations, and things like that, user preferences change quite rapidly, right? Especially in advertising, what a user did five minutes back, or even a few minutes back, impacts their decision now much more than what they did days or weeks ago. So, because of that, user preferences tend to change quite rapidly, and models that use user preferences to do semantic search or any sort of product recommendation style search need to deal with updated embeddings on a much shorter time scale. And we are talking about timescales of less than a minute. So, we see both of these use cases. The more common use case for updating vectors quite rapidly is advertising, retail recommendation, product recommendations, and so on. It is less common in the text search areas.

Philip Winston 00:35:08 Yeah. I can see that. I saw a lot of examples on your site with natural language processing and a few with images. How do the requirements differ in these two domains or in other domains that you could classify that you see?

Ram Sriharsha 00:35:24 Some differences could be as simple as dimensionality, right? So, we tend to see higher dimensional embeddings for images. For example, we do tend to see up to 2048 dimensions. Now, higher dimensions obviously means it’s computationally more challenging. For text, we usually see lower dimensions, even up to 120 dimensions and so on. We do see that. Other than that, there aren’t a whole lot of differences. Again, the size of the corpuses may be very different. Sometimes we don’t see that many images. In text search, if you think about transcripts of sales calls or transcripts of conversations and so on, they can get extremely big. You do tend to see some diversity in terms of just the number of vectors, the dimensionality of the vectors, and so on.
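
A quick back-of-the-envelope calculation shows why dimensionality matters for cost. The sketch below assumes one million vectors stored as uncompressed 32-bit floats. Both of those are illustrative assumptions, not figures from the conversation, and only the 2048 and 120 dimension counts come from the answer above.

```python
BYTES_PER_FLOAT32 = 4  # assumption: uncompressed 32-bit floats


def raw_storage_bytes(num_vectors, dimension, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw bytes needed to hold the vectors, ignoring index overhead and metadata."""
    return num_vectors * dimension * bytes_per_value


NUM_VECTORS = 1_000_000  # assumed corpus size, for illustration only

image_bytes = raw_storage_bytes(NUM_VECTORS, 2048)
text_bytes = raw_storage_bytes(NUM_VECTORS, 120)

print(f"2048 dimensions: {image_bytes / 1e9:.3f} GB")
print(f" 120 dimensions: {text_bytes / 1e9:.3f} GB")
print(f"ratio: {image_bytes / text_bytes:.1f}x")
```

A brute-force scan touches every stored value once per query, so the same ratio applies roughly to per-query compute as well as to storage.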

Philip Winston 00:36:05 Okay, let’s start wrapping up. Related to what you were just discussing, what are the most challenging parts for a developer interested in setting up a vector search application? What do you consider is still hard today?

Ram Sriharsha 00:36:19 I think we’re working on a lot of the hard parts. With Pinecone, I’m very comfortable and happy saying that we made it easy for developers to get started. We made it extremely easy for you to spin up a Pinecone instance and just ingest your data and query it. And people are able to build successful applications with nearly zero effort. And we strove for that from the beginning. We wanted to make it as simple as humanly possible. Where we are spending a lot of effort is, like I said, in making this as cost-effective as possible. I feel like there is still a lot we can do here to make this technology far more cost-effective than it is today, and also scale it far more than you can today. That is where we are focused very heavily, but I’m very happy with what we’ve done so far in terms of reducing the barrier to entry of this sort of technology for a developer. So, developers are able today to ingest data, get productive, and build applications extremely quickly. And we’ll continue to double down on that.

Philip Winston 00:37:20 Okay, you’re operating in a specific niche within the larger machine learning universe. Where do you see things evolving in general that’s going to impact you, or machine learning techniques, or just any trends that you see on the horizon that you’re watching?

Ram Sriharsha 00:37:37 The biggest trend, I think, which we also want to take advantage of, is that while we are working on a very specific niche, which is the vector database, there is this overall trend that data is becoming more and more unstructured, and unstructured data is exploding compared to what traditionally used to be relational data. So more and more of what people are going to be doing with unstructured data is going to be using machine learning, and it’s going to be using technologies like what we’re building. How do we take what we built today and close the gap with unstructured data? Today, we need people to already have embeddings. We need people to have a certain amount of sophistication with machine learning to be able to use Pinecone, and so on. But there is vastly more unstructured data than there are embeddings. And how do you reach people who have a lot of such data? They have really interesting business use cases, but they don’t have teams of data scientists and a lot of expertise in machine learning. How do you help them? This is something that we’d love to solve, because the amount of data we are talking about is exploding and the use cases are exploding.

Philip Winston 00:38:34 Okay. I think that’s a great place to end. Can you tell the listeners where to find out more information about Pinecone and then also about you specifically? And I can put both into the show notes for people to look up.

Ram Sriharsha 00:38:41 Our website, Pinecone.io, has a lot of information. We also have a lot of example notebooks, and there is a free tier. So, anyone can sign up and kind of play around with Pinecone and give us feedback. We’d love to hear from you. If anything here resonated with any of you, please feel free to reach out.

Philip Winston 00:38:56 I’ll say that I did look at the learn section of your website, and I was pretty impressed with the depth of some of the articles and I found it very interesting. Thanks for your time today, Ram. This is Philip Winston for Software Engineering Radio.

Ram Sriharsha 00:39:11 Thank you.

[End of Audio]

SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)
