Jure Leskovec, Professor of Computer Science at Stanford University and Chief Scientist at Kumo.ai, speaks with host Sriram Panyam about relational and graph language models and their transformative impact on enterprise decision-making and predictive modeling.
Jure begins by establishing the critical importance of predictive modeling across industries – from fraud detection in financial institutions to customer churn prediction, lifetime value estimation, product recommendations, and healthcare risk assessment. He notes that while AI has made remarkable advances in natural language understanding and computer vision, predictive modeling over enterprise operational data stored in relational databases has been largely left behind, still relying on 30-year-old machine learning approaches that are expensive, time-consuming, and require manual feature engineering.
His proposed solution to the fundamental problem with current approaches is relational deep learning and relational transformers. The discussion explores how this approach differs from traditional graph neural networks (GNNs), which Jure pioneered and deployed successfully at Pinterest. Jure concludes with practical guidance for software engineers and data scientists interested in exploring this technology.
Brought to you by IEEE Computer Society and IEEE Software magazine.
Show Notes
Related Episodes
Resources
Transcript
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Sri Panyam 00:00:18 Welcome to Software Engineering Radio. This is Sri Panyam, your host. Today we have Jure Leskovec. He’s a professor of Computer Science at Stanford University and Chief Scientist at Kumo.ai. He pioneered Graph Neural Networks and co-authored PyTorch Geometric, the most widely used GNN framework. At Pinterest, he served as the Chief Scientist for six years, deploying graph learning systems that contributed to 30 to 50% improvement in core metrics and helped the company go public. His CS 224W course on Machine Learning with Graphs has over 1 million views on YouTube. He’s now building relational foundation models at Kumo.ai. Welcome to the show, Jure.
Jure Leskovec 00:01:01 Great to be here.
Sri Panyam 00:01:02 We want to kind of establish some of the current patterns that lead up to what we’re going to talk about today. And I want to talk about predictive modeling, right? Just as an overview, what is it? How it powers fraud prediction? Where are organizations with it today?
Jure Leskovec 00:01:16 Yeah, so basically predictive modeling has been around for a very long time and making accurate predictions is basically the best way to make decisions, right? So how we make predictions is that we look at a bunch of operational historical data, behavioral data that every organization is storing, and then try to identify patterns in the past that predict the future. And if you say, what are the use cases for this, they’re very diverse, right? It can be about fraud detection in transactions in financial institutions, it can be about recommendations, product recommendations, in all kinds of settings. It might be about predicting the risk of churn of a customer, predicting what is the next best action to take, what is the lifetime value of a customer? All the way to healthcare where you could say, what’s the risk of readmission if I discharge this patient? What is the risk to life to this patient in the next 24 hours, right? These are all these kinds of decision-making, risk-assessment type questions where we humans are kind of inherently very biased and are very bad at estimating those probabilities. So that’s why over time we build models that are calibrated, that can estimate this type of risk probabilities, these kinds of forecasts, you know, how likely my machine is going to fail. Time series forecasting, predictive maintenance, these kinds of things, everywhere across industry.
Sri Panyam 00:02:42 And today, where are enterprises in terms of their maturity level in adopting and deploying these?
Jure Leskovec 00:02:48 Yeah, that’s a great point. And what I would claim is this, predictive modeling has been kind of left in the past, right? We have amazing advancements in AI. And when we talk about advancements in AI, we talk about advancements in, let’s say, natural language understanding, reasoning and things like that. And we have Claude and ChatGPT and things like that, right? And then we also have computer vision and video. We understand this type of, I would say, two data modalities, which is natural language and images and video. But if you think about these predictive models and what kind of data they learn over, it’s all about operational data that is usually stored in tables, it’s stored in databases, it’s stored in data warehouses. We have a product catalog, which is a big table, every row is a product. And then the columns describe what are the properties of the product.
Jure Leskovec 00:03:37 We have my customer catalog where for every customer I have their information, who they are, where they are, and things like that. And then I also have, let’s say, another table that’s like a transactions table that says, this customer purchased this particular product for this price at this time, right? So, I have a third table that kind of points to the customer ID and points to the product ID, right? That’s how this data is organized. And I would argue that today’s AI models cannot really reason well over this type of data. So, what are we left with, right? Like what I mean by this is you cannot put a table of transactions into ChatGPT and ask it, hey, how likely is this to be a fraud? Or, I don’t know, throw in some information about a customer and ask what the customer is going to buy next?
Jure Leskovec 00:04:21 You’ll get something that is kind of common sense, okay? But it’s very low performance. So what people are left with today is machine learning, right? Which is the old technology of manually building these models from data, training them and then deploying them. Building these models, you need to hire a bunch of data scientists. They need to clean the data. We can go through the process, but the point is, it’s very expensive. You need about two full-time employees per model that you’re building. And it takes about, I don’t know, half a year to build a model and put it in production. So, we can go through how complicated this is.
Sri Panyam 00:05:01 Yeah, I mean, I’m glad you mentioned that because what you said about putting all this data in relational databases, it’s a pretty rigorous or time-consuming task of normalization that somebody has to go through. And it sounds like even when they do that, you only know about the model, what you know, and you don’t know what you don’t know, right? So, what are some things that get lost and you wouldn’t even know about it when you do this kind of normalization and flattening, what gets lost?
Jure Leskovec 00:05:25 That’s a great point, right? So usually we put this data, let’s say this simple example of customers, products, and their purchase orders, transactions, right? Usually, we would put this into a database or into a data warehouse. And then we run SQL queries over that, and SQL queries, they’re good at telling me what happened last week, what happened last month. I can ask all kinds of historical questions, but prediction is not about understanding the history, it’s about understanding the future. So, what I have to do, let’s say as a data scientist, is I cannot just train a model over these three tables and say, just learn over these tables and tell me whether a customer will churn. What’s the risk of that? Right? What I have to do is I have to manually, for example, take the customer, take their transaction history, and then somehow I need to aggregate that transaction history into what is called a feature.
Jure Leskovec 00:06:22 I need to basically do the table flattening. I need to do feature engineering. I need to summarize your past purchases into some count, into some numbers. So, I can say, how much did you purchase last week? How much did you purchase last month? Then I can say, how much did you purchase on Mondays? How about Tuesdays? How about Wednesdays? How about mornings? How about afternoons? And then it’s maybe what’s the sum of purchase prices? What’s your median purchase price? What’s the most expensive product you bought? What’s the cheapest one? And all these variables, we could use two hours here for me to just go through this. They could be predictive of your likelihood of churning, right? And so far I have only connected the user, the customer, with their transaction history. How about the products, right? Now, if I go to the products, I’m like, what types of products?
Jure Leskovec 00:07:12 What categories, how well are those products rated? Do they have a high star rating? Do they have a low star rating? Are those products popular? And so on, right? So, it’s like this infinite possibility of signals that I as a human or my coding agent could engineer into this big training table. So, I could then train the model. And basically, training the model maybe is easy, but where it gets complicated is when this model is running, let’s say in production, when it’s making these predictions, all these signals need to be updated, right? If I make one more transaction, all my counts of transactions, all my aggregations of transactions, all of them need to be updated because everything changed because I added a new transaction, right? So, you start seeing how this is getting complicated. Because then on the fly, the signals need to be updated with every single transaction, with every single update to the database.
Jure Leskovec 00:08:11 All these downstream calculations need to be spun off because everything changes because there’s a new piece of information and there are other problems with this flattening where as you hinted, essentially you are losing information, right? There is more information in, let’s say in the raw sequence than there is in the count over the last seven days. Count over last seven days is some arbitrary construct that we, humans are like, okay, probably if I count how much you purchased last week, that will likely tell me something useful about you purchasing in the future as well. But it’s a guess. So, I have to first engineer the feature, put it in the model, retrain the model, and then see, oh, did it help or not? Yes, it helped a bit. Okay, fine, right? And then if you, for example, start thinking broader, right? If you think about it, let’s say fraud detection.
Jure Leskovec 00:09:02 Fraud detection is an adversarial game. So whatever model you are using to predict fraud, the other side figures that out and tries to game you, and because it games you, your model accuracy starts decreasing. And then you are like, okay, what signal am I missing to detect these new types of fraud? And you put it in, they discover it, and you are behind again, and you’re again scratching your head, what other signal do I need to add to do this? Right? So, it’s super painful, super manual, super expensive. And studies show that most of these models never see production because it’s so old school, it’s like 30-year-old technology. We’ve been doing this the same way for 30 years. Maybe we changed the architecture that we train on this data, but the architecture is not the problem. The problem is this manual process of flattening the tables, engineering the features, training these models, putting these models in production by needing to update all these features in real time as events are happening. And then you have problems with what is called information leakage or this time correctness because, you know, maybe I have made a mistake, and I also count the transaction you’ll make tomorrow. And then of course from data about tomorrow, I can predict tomorrow very accurately. But you don’t catch that and then your models don’t perform the way you expected and so on. So, I would say data science and this model building with machine learning is very, very hard. Very, very hard to get right.
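To make the flattening and time-correctness points above concrete, here is a minimal sketch of hand-built aggregate features computed "as of" a cutoff time. The data, the function name `features_as_of`, and the chosen aggregates are hypothetical illustrations, not anything described in the episode; the only idea taken from the discussion is that a feature must not count transactions that happen after the moment of prediction.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical transaction rows: (customer_id, timestamp, price).
TRANSACTIONS = [
    ("c1", datetime(2024, 1, 1, 9), 20.0),
    ("c1", datetime(2024, 1, 5, 15), 35.0),
    ("c1", datetime(2024, 1, 9, 10), 5.0),
    ("c1", datetime(2024, 1, 12, 11), 80.0),  # happens after the cutoff used below
    ("c2", datetime(2024, 1, 8, 8), 12.0),
]


def features_as_of(customer_id, cutoff, transactions=TRANSACTIONS):
    """Aggregate a customer's history using only rows strictly before `cutoff`."""
    past = [(ts, price) for cid, ts, price in transactions
            if cid == customer_id and ts < cutoff]
    prices = [price for _, price in past]
    week_ago = cutoff - timedelta(days=7)
    return {
        "count_last_7d": sum(1 for ts, _ in past if ts >= week_ago),
        "total_spend": sum(prices),
        "median_price": median(prices) if prices else None,
        "max_price": max(prices) if prices else None,
    }


feats = features_as_of("c1", cutoff=datetime(2024, 1, 10))
print(feats)
```

Dropping the `ts < cutoff` filter would let the 80.0 purchase from January 12 leak into features meant to describe January 10, which is the kind of mistake described above. Every one of these aggregates also has to be recomputed whenever a new transaction arrives.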
Sri Panyam 00:10:31 So in a way you can take a relational database, you can manually create the views and materialized views and triggers and all that, right? To make it look like a graph. But really the devil’s in the details because doing that is complicated and a lot of work, right? It sounds like your thesis is, if you can do all that work to take something that’s normalized and turn that into a graph after all this burning of energy, why not start from a graph?
Jure Leskovec 00:10:56 Exactly. So, my thesis is the following, right? The first thing is that tabular data, this relational data is the key to enterprise decision making,
Jure Leskovec 00:11:06 Right? We collected it, it’s the most valuable data because we want to make accurate decisions at all levels, right? And even in an agentic world, agents need to make decisions, and those decisions have to be driven by data and historical patterns. They’re not common-sense type things most often, right? So that’s the first thing, this is amazingly valuable. Second thing that I want to say is that today’s AI does not understand this type of data well. You cannot textify a database, put it as a prompt in ChatGPT and expect it to work. People have tried that. They burned themselves badly, right? So, it’s kind of a missing data modality in today’s AI. So how do you then solve the problem? That was the key, let’s say, question or insight that I asked myself and my students here at Stanford, and said we have to do something in this area or be left behind. So we developed this approach, we call it relational deep learning, that basically says we can take a database, a set of tables, we can represent it as a graph of linkages, and now we can develop special architectures that generalize this attention mechanism that is prevalent in text transformers, where we are attending over the tokens, the words from the past, to now generalize this, to attend over the tables in a database.
Jure Leskovec 00:12:24 And if we can now have a neural network that attends over the raw tables in a database, it basically figures out how to pull all this data together into an accurate prediction. There are two benefits. First benefit is that you get an amazing productivity gain because you don’t have to do this flattening of data into a single table, and you can build these models faster. The second improvement is that your models get superhuman accuracy. And the reason for that is that, as I mentioned earlier, when you take a sequence of transactions and aggregate it into a transaction count over the last seven days, you have thrown away a lot of information. So, an attention mechanism that actually attends over a sequence of transactions has much more fidelity, has much more finesse to it to actually learn how to combine this information, how to attend over these transactions.
Jure Leskovec 00:13:17 Is it seven days? Is it mornings? Is it evenings? Let the attention mechanism figure that out to combine the data in an optimal way that is most predictive downstream. And of course, it’s not only because it’s a graph, but also multiple tables. It’s not just, oh, it’s a sequence modeling. No, it’s more because it’s recursive, right? A sequence of transactions connects to a sequence of products. Those products connect to other transactions that connect to other users. So now you are attending through this graph essentially to understand and to reason over the data in a database. The exciting thing is that you go from a person to transaction to the product to another transaction to another person, all of a sudden you can be like, okay, what are the characteristics of people that buy the same products as I buy? There is useful information there, right? And so on and so forth. That’s the thesis.
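The multi-hop path described above (person to transaction to product to another transaction to another person) can be sketched with plain lookups. This is only an illustration of the traversal on made-up rows; the function name `co_purchasers` is hypothetical, and a learned model would attend along these links rather than count them.

```python
from collections import defaultdict

# Hypothetical transactions table: each row points to a customer and a product.
TRANSACTIONS = [
    {"customer_id": "alice", "product_id": "tent"},
    {"customer_id": "alice", "product_id": "stove"},
    {"customer_id": "bob", "product_id": "tent"},
    {"customer_id": "carol", "product_id": "stove"},
    {"customer_id": "carol", "product_id": "tent"},
    {"customer_id": "dave", "product_id": "lamp"},
]


def co_purchasers(customer_id, transactions=TRANSACTIONS):
    """Walk customer -> transaction -> product -> transaction -> customer.

    Returns a dict mapping each other customer to the number of products
    they share with `customer_id`.
    """
    products_of = defaultdict(set)
    buyers_of = defaultdict(set)
    for row in transactions:
        products_of[row["customer_id"]].add(row["product_id"])
        buyers_of[row["product_id"]].add(row["customer_id"])
    shared = defaultdict(int)
    for product in products_of[customer_id]:
        for other in buyers_of[product]:
            if other != customer_id:
                shared[other] += 1
    return dict(shared)


print(co_purchasers("alice"))
```

Here the customers reachable from "alice" are exactly those who bought at least one of the same products, which is the neighborhood whose characteristics the discussion suggests can be informative.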
Sri Panyam 00:14:13 That’s interesting because if you, again, going back to classical, I guess, transformers, you’re attending over a sequence of tokens in text, right? What does a token look like in this model? I mean, conceptually we’ve been thinking about tokens in text as, you know, three or four letters, three or four characters. There’s some kind of arbitrary boundary, right? What becomes your unit of attending over in this model? Like what is a token here?
Jure Leskovec 00:14:34 Great question. For a token here you have two options. One option is to think of every cell in a database as a token. So, every row-column combination, right? An age of a particular user, location of a particular user, gender of a particular user, and so on, right? Those could be the tokens, or you can think of an entire row as a token. So, a user is a token, a transaction is a token, a product is a token that then gets attended over. And that’s maybe the way to see it. In reality, it’s a bit more complicated. But probably the best way to think is that now we are attending over individual cells as well as how they point to each other. The key differentiator is that we’re not only attending over individual cells, but also over the relationships between the cells, right?
Jure Leskovec 00:15:19 That one row points to another row. So, I know that this particular user points to this particular transaction and that transaction points to a particular product that has a given set of entities. So, the attention mechanism is different. It’s not sequence based, but it’s graph based. So, we need to understand these relationships and pointers. Understanding this relational structure is super crucial, especially when the data is noisy or in cold start type regimes where you don’t have enough information. Then through this relational context, the model is able to home in and make much more accurate predictions than traditional approaches.
Sri Panyam 00:16:01 Interesting. I mean, my mind keeps getting blown away, going back to our basic transformer, right? Simple tokens, simple embeddings. So, what would then be your embedding model or your token model? Because you have a lot more volatility and variation in what a token is, right? In a simple text like the cat sat on the mat, right? You have a much narrower, limited token space, right? But here, it sounds like almost every combination, every entity, every entity relationship. You potentially have an unbounded token space then, don’t you?
Jure Leskovec 00:16:29 Great point. You have an unbounded token space. You don’t say, oh, you know, I have this fixed vocabulary of tokens as you do in LLMs, here you have in this respect an infinite number of tokens, right? Or token combinations, because cells can take arbitrary values, there can be text in there, there can be images in there. So, in this respect, you don’t have this kind of explicit tokens, but what you are attending over are the values of the cells. In this respect you are right. The token space is infinite because I’m attending over the values of individual cells.
Sri Panyam 00:17:03 I mean, technically even in English, you have a fairly vast token space. We just approximate it down, I’m guessing,
Jure Leskovec 00:17:08 For example, right? Or at the end we can say, okay, we have characters and there’s letters of the alphabet and that’s it, right?
Jure Leskovec 00:17:14 Everything is made out of that.
Sri Panyam 00:17:16 So does then the attention mechanism itself have a different meaning? Because again, going back to classical transformers, you have your N-squared attention for every token versus every token. Very, very conceptually and dumbed down, right? What then would be the time complexity or space complexity of this attention mechanism over these graph kind of tokens?
Jure Leskovec 00:17:35 Yeah, it’s different, right? Because you don’t think of this as, oh, I have now this sequence of tokens and I do everything to everything type attention. What you are doing is now that your attention mechanism is much more structured, maybe inside a given row of the table, you are attending everything to everything. So, you have, let’s say, a row-based attention. Then you have also a column-based attention where a given column is attending to other rows, other values of the same column, to get a sense of what is the distribution of values, right? So that’s also quadratic, but it’s kind of quadratic in some sense in the number of rows. The other one is kind of quadratic in, maybe, the number of columns per table, not the total data. And then you also have the attention across tables that, again, is highly structured: it is not everything to everything, but is just across the links that exist among the entities. So, you basically have three different types of attention that are, I would say, highly structured. So, you can process large amounts of tokens because it’s not that any token in a database can attend to any other token as in text, but you’re attending inside a row, inside a column, and across the tables.
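One way to picture the three structured attention patterns described above is as a mask over cell pairs. The sketch below is an assumption-laden toy, not the actual relational transformer: it treats each cell as a token, and the rule in `may_attend` (same row, same column of the same table, or rows joined by a foreign-key link) is one simple reading of the description. All table contents and names are made up.

```python
from itertools import product

# Hypothetical toy database: table -> {row_id: {column: value}}.
TABLES = {
    "customers": {"c1": {"age": 31, "city": "Oslo"},
                  "c2": {"age": 45, "city": "Lima"}},
    "transactions": {"t1": {"customer_id": "c1", "price": 20.0},
                     "t2": {"customer_id": "c1", "price": 35.0},
                     "t3": {"customer_id": "c2", "price": 12.0}},
}
# Foreign-key links between rows, stored as ((table, row), (table, row)).
LINKS = {(("transactions", t), ("customers", row["customer_id"]))
         for t, row in TABLES["transactions"].items()}

# Every cell is one token, identified by (table, row_id, column).
CELLS = [(table, row_id, col)
         for table, rows in TABLES.items()
         for row_id, row in rows.items()
         for col in row]


def may_attend(a, b):
    """True if cell `a` may attend to cell `b` under the three structured patterns."""
    row_a, row_b = a[:2], b[:2]
    in_same_row = row_a == row_b                   # row-based attention
    in_same_column = a[0] == b[0] and a[2] == b[2]  # column-based attention
    linked = (row_a, row_b) in LINKS or (row_b, row_a) in LINKS  # cross-table
    return in_same_row or in_same_column or linked


allowed = sum(may_attend(a, b) for a, b in product(CELLS, repeat=2))
print(f"{allowed} of {len(CELLS) ** 2} cell pairs are attended")
```

Even on this tiny example some pairs are excluded, for instance a transaction of one customer never attends directly to the row of an unrelated customer. The mask only allows pairs that the table layout or the foreign keys justify.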
Sri Panyam 00:18:50 Is this a design feature or a design tradeoff?
Jure Leskovec 00:18:53 Good question. I would say it’s a design feature because it allows you to scale, allows you to be computationally bounded, and it’s a good way to respect how the data is organized and how the data is linked.
Sri Panyam 00:19:10 Interesting. So, as I put a mental model to this, you may have hundreds of tables with thousands or millions of transactions and dependencies across these tables. You wouldn’t brute force your way by taking everything, but you would build out these aggregations, these relationships in a graph, again through the model or automatically, and then see which relationships or which connections have some kind of weighting, some kind of higher relevance, I guess, right? So, these aggregations are being built, these relationships being formed, these connections being actively computed. What is doing that? How does it happen? I’m assuming that’s not a manual process. Right?
Jure Leskovec 00:19:47 Great point. In our approach, this is not a manual process. Basically, we would start from a set of tables and a set of, let’s say, primary-foreign key type relationships or relationships between tables where we know that one column in one table points to another column in the other table. And from there on the process of training is essentially automatic because, as I said, there are basically these three types of attention mechanism that we have developed. One inside the row of a single table, the other one across the rows, so basically inside a single column of a table. And then the third type of attention is over the primary-foreign key relations pointing from one table to another table. And we basically now stack together multiple layers of this type of attention and then train this architecture, we call it a relational transformer, with some downstream loss.
Sri Panyam 00:20:42 Hmm, interesting. So just to summarize that, if you were to take your tables, your rows and your foreign keys, your table schemas become your node types or node parameters, I guess type parameters. Your rows become your actual nodes and your foreign keys become edges, right? Between these rows and a table. So, in a way, do you actually have parameterized nodes or are all nodes just nodes?
Jure Leskovec 00:21:02 Great point. I think you summarized it very nicely. Nodes are parameterized, right? Because you can think of all the column information, all the properties of a product, all the properties of a user to be attributes or data attached to this node. So, what is beautiful in this case is now that the attention mechanism is learning both from the properties as well as from the relationships, and that’s where the power of these methods comes in.
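The mapping summarized in this exchange (tables give node types, rows give nodes with their column values attached, foreign keys give edges) can be written down in a few lines. This is a minimal sketch on a made-up schema; `build_graph` and the dictionary layout are hypothetical and do not reflect any particular library's API.

```python
# Hypothetical toy database; table and column names are illustrative only.
TABLES = {
    "customers": {
        "primary_key": "customer_id",
        "foreign_keys": {},
        "rows": [
            {"customer_id": "c1", "city": "Oslo"},
            {"customer_id": "c2", "city": "Lima"},
        ],
    },
    "products": {
        "primary_key": "product_id",
        "foreign_keys": {},
        "rows": [
            {"product_id": "p1", "category": "camping", "stars": 4.5},
            {"product_id": "p2", "category": "lighting", "stars": 3.9},
        ],
    },
    "transactions": {
        "primary_key": "transaction_id",
        # foreign-key column -> table it points to
        "foreign_keys": {"customer_id": "customers", "product_id": "products"},
        "rows": [
            {"transaction_id": "t1", "customer_id": "c1", "product_id": "p1", "price": 20.0},
            {"transaction_id": "t2", "customer_id": "c1", "product_id": "p2", "price": 35.0},
            {"transaction_id": "t3", "customer_id": "c2", "product_id": "p1", "price": 12.0},
        ],
    },
}


def build_graph(tables):
    """Turn tables into a typed graph.

    Nodes are keyed by (table_name, primary_key_value); the table name acts
    as the node type and the remaining columns become node attributes.
    Each foreign-key value becomes one edge to the row it points to.
    """
    nodes, edges = {}, []
    for name, table in tables.items():
        pk, fks = table["primary_key"], table["foreign_keys"]
        for row in table["rows"]:
            node_id = (name, row[pk])
            nodes[node_id] = {k: v for k, v in row.items()
                              if k != pk and k not in fks}
            for column, target in fks.items():
                edges.append((node_id, (target, row[column])))
    return nodes, edges


nodes, edges = build_graph(TABLES)
print(len(nodes), "nodes,", len(edges), "edges")
```

In this sketch the transaction node keeps only `price` as an attribute, because its two key columns have been turned into edges instead.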
Sri Panyam 00:21:29 Nice. I want to slightly switch to something that was there before. I mean a precursor, I guess something in history, graph neural networks, right? I mean they are somehow tied to this because there’s graphs, there’s neural networks. What is the significance of those in this journey, in this transformation, in this evolution?
Jure Leskovec 00:21:44 That’s a great point. I would say graph neural networks were a very important first step in this area where we are talking about maybe smaller models with predefined aggregations. Rather than attending over the, let’s say, individual transactions, we would just have some transformation and then a pooling aggregator like a summation or an average, kind of a more old school type of neural network architecture. And those models were good, accurate, but of course with the new generation of models that are attention-based, kind of “attention is all you need”, scaling better with parameter size becomes important. And graph neural networks have this issue of what is called over-smoothing or over-squashing. If you think a bit about it from this kind of attention point of view, right? When I’m trying to make a prediction about the node, and if I go too far away in the graph in the neighborhood around that node, you can think of it kind of like a ball or a circle around the node. Then the problem with graph neural networks was that they were not rich enough in terms of their expressivity. So, they started just kind of averaging things together. If you average over two big circles of things, you kind of always get the same value, and that started to be a problem. It’s called technically over-smoothing, but with the attention-based architectures, the degrees of freedom and the finesse of the attention mechanism are so much larger that this over-smoothing effect is not a factor anymore.
Sri Panyam 00:23:12 Do you have a real-world example of what that looks like in terms of over-smoothing? I mean, I think you deployed at Pinterest, what might that have looked like?
Jure Leskovec 00:23:20 Yeah, so for example, in graph neural networks, right? We built a system and deployed this very successfully at Pinterest, and what that usually meant in graph neural networks is that if you think about how many hops away in the graph you go, you must not go too many hops away because if you kind of go too many hops away, then you reach the entire graph and it’s just too much information that is overpowering the model. You can think of it this way, right? Like imagine if I want to predict something about you or about myself, knowing information about my friends is very useful, knowing information about my friends of friends is still useful. But if you think about, should I now go 10 friendships apart, then basically I reach every person on the earth because of the six degrees of separation. And knowing information about every person on the earth to say something about me can kind of be over confusing or overpowering the model. That’s kind of maybe one intuition for this over-smoothing phenomenon of graph neural networks.
Sri Panyam 00:24:15 And we couldn’t just put a cap or put some kind of arbitrary limit on how deep you wanted to go or what was that?
Jure Leskovec 00:24:21 That was the solution is don’t go too deep, go two, three hops, get the information that is relevant and then build a model on that. That was kind of the attitude with graph neural networks. Now with attention-based architectures, we don’t have to do that.
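The over-smoothing intuition above can be reproduced with a toy example: repeatedly averaging each node with its neighbors, which stands in for a mean-pooling aggregator, pulls all node values toward the same number as the number of hops grows. The graph, the feature values, and the helper names are hypothetical, and a real GNN layer also applies learned transformations that this sketch leaves out.

```python
# Hypothetical path graph 0 - 1 - 2 - 3 - 4 with one scalar feature per node.
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
FEATURES = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 10.0}


def mean_aggregate(features, neighbors=NEIGHBORS):
    """One round of mean aggregation over each node and its neighbors."""
    return {
        node: (features[node] + sum(features[n] for n in nbrs)) / (1 + len(nbrs))
        for node, nbrs in neighbors.items()
    }


def spread(features):
    """Gap between the largest and smallest node value."""
    return max(features.values()) - min(features.values())


# history[k] holds the node values after k rounds, i.e. after looking k hops away.
history = [FEATURES]
for _ in range(20):
    history.append(mean_aggregate(history[-1]))

for hops in (0, 2, 20):
    print(hops, round(spread(history[hops]), 3))
```

After two rounds the nodes are still clearly distinguishable; after twenty they are nearly identical, which is why the practical advice with this style of aggregation was to stop at two or three hops.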
Sri Panyam 00:24:35 Interesting. Interesting. With attention-based architecture, right? How does that decide which neighbors to pay attention to, how deep to go? What kind of subgraphs to go over, or what is the deciding factor or, I guess, logic there?
Jure Leskovec 00:24:46 That is a great point. I would say you have kind of the, in some sense you can think of it as a context window size. You can make that large, put the data in and then let the training process figure out how to attend. There are still a couple of kind of hyper parameters. One can tune how wide to go, how deep to go. But what we see is that the attention mechanism is not prone to over smoothing and can deal with large context sizes quite effectively. I would also say that this large context size becomes very important because now the latest exciting thing in this area is this notion of a foundation model and this notion of in context learning where you need to be able to work with large context sizes.
Sri Panyam 00:25:29 Fair enough. I think you talked about this in terms of foundation models and attention, right? From a training perspective, one thing that both graph neural networks and relational foundation models share with your usual LLMs is data hungriness, right? But it sounds like with graph transformers, you don’t have to be as hungry. Why is that, what causes it?
Jure Leskovec 00:25:51 Yeah. I think you’re asking a great question, and I think here is where this tabular relational data and also the nature of prediction gets a bit different from the large language models where bigger is better, the more you can kind of memorize, the more of the world you know, the better things are. Here things are a bit different, right? Because depending on the amount of data, you may choose different approaches, right? If you have a larger amount of data, think about fraud detection or something like that, that is also running. Or think of a recommender system that is running at a speed of a million recommendations per second, right? So, you really need accurate recommendations. You know exactly the predictive question you’ll be asking, and you want it to be cost effective. In these types of cases, you would usually go and fine-tune a small model that is really good at that single task because I won’t go and ask that fraud detection model to do customer churn prediction.
Jure Leskovec 00:26:52 That’s a separate problem. So, you would say, okay, I have a single task, it’s a super high-value task. I want to have a small, dedicated, cheap-to-run model that does this very well for you, right? And the important point is that in these domains, take a 1-2% increase in accuracy of a decision-making model for fraud or for recommendations. I’ve seen clients where this means hundreds of millions of dollars in additional revenue just because you are doing it so often and the effect just adds up, right? So that’s one side. The other side is this notion of foundation models where what you can do now that is actually quite exciting and, in some sense, unbelievable, is that you get this ChatGPT type moment, but for predictive problems, where you can specify the predictive problem on the fly, the model goes and fetches your relational data and without any model training gives you an accurate prediction.
Jure Leskovec 00:27:47 Okay? So basically, now you have a pre-trained model that is agnostic of the database and is agnostic of the predictive task. So, you can ask it predict churn for me, and churn means no purchase for the next 30 days. A second later you get the prediction, you get the accuracy estimate, you get a natural language-based explanation. Then somebody says, no, no, for me churn means less than $10 of monthly spend. A second later you get now prediction for the probability of less than 10 dollar monthly spend. That’s a different capability where basically you can have a large pre-trained model, you don’t know the question ahead of time and you can just ask it. You don’t have to build now a prediction specific model. The pre-trained model can give you the answer immediately.
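The two churn definitions mentioned above differ only in how the target is computed from the same transaction data. The sketch below shows that difference as two labeling functions evaluated at a reference time; it does not show the pre-trained model itself. The data, the function names, and the choice to approximate "monthly spend" by a 30-day window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical transactions: (customer_id, timestamp, amount).
TRANSACTIONS = [
    ("c1", datetime(2024, 3, 5), 4.0),
    ("c1", datetime(2024, 3, 20), 3.0),
    ("c2", datetime(2024, 3, 10), 60.0),
]
AS_OF = datetime(2024, 3, 1)  # reference time from which "the next 30 days" is measured


def spend_in_window(customer_id, start, days, transactions=TRANSACTIONS):
    """Amounts spent by a customer in the window [start, start + days)."""
    end = start + timedelta(days=days)
    return [amt for cid, ts, amt in transactions
            if cid == customer_id and start <= ts < end]


def churn_no_purchase(customer_id, as_of=AS_OF):
    """Churn definition 1: no purchase in the next 30 days."""
    return len(spend_in_window(customer_id, as_of, 30)) == 0


def churn_low_spend(customer_id, as_of=AS_OF, threshold=10.0):
    """Churn definition 2: less than $10 of spend in the next 30 days."""
    return sum(spend_in_window(customer_id, as_of, 30)) < threshold


for customer in ("c1", "c2", "c3"):
    print(customer, churn_no_purchase(customer), churn_low_spend(customer))
```

Customer "c1" makes two small purchases, so they are not churned under the first definition but are under the second. With hand-built pipelines each definition would typically mean a new training table and a new model, which is the contrast being drawn with asking a pre-trained model directly.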
Sri Panyam 00:28:35 This is actually pretty wild. Now if you look at your, again, ChatGPT, the usual go-to example I guess, right? I think you train that model on a cluster of, I think, four to $5 billion worth of GPUs over a few months of run, and then using extra amounts of data, right? What is the training data size for a foundational graph model here? Are these actually on the same scale, different scales? What are you looking at?
Jure Leskovec 00:29:00 That’s a great point. I would say right now, let’s say the tabular relational side of the world is younger than LLM world. People are working on scaling things up, but so far, we’ve seen that you don’t need the scale as you have in large language models. You can train with a smaller amount of data with much less cost. And these models can be smaller, you know, like sub billion parameter models or something like that. So, they can be small. And the reason for that is because the information is in the data. So, you don’t have to memorize, you don’t have to learn the entire internet worth of knowledge. You just need to learn how to spot patterns, how to attend over them to give the prediction. And because of that, the models are smaller, right? For example, single table models, I think they’re around 25 million parameters.
Sri Panyam 00:29:51 Wow.
Jure Leskovec 00:29:52 Right? Which is tiny. It runs on my iPhone, right? So those are small. The relational ones, the ones that can do multiple tables at once, are bigger, but still, this is not a hundred billion or a trillion parameters.
Sri Panyam 00:30:05 What’s the biggest, most complex model that you might find out there?
Jure Leskovec 00:30:09 I would say the field is moving very fast. We are innovating and researching very fast. And there are also, I would say, a lot of different approaches. I don’t think the final word or the final solution has been converged on here. So, I would say right now I told you about the kind of sizes of these single-table models. Relational models are larger, by about an order of magnitude, and that is where it becomes interesting. The next-generation architecture that we’re exploring may go even bigger. But the point is you need to get a benefit from getting bigger. You need to start seeing scaling laws, and then it makes sense to scale up.
Sri Panyam 00:30:45 My point was kind of in reverse. It almost sounds like, today, the know-how is the limitation, not the capex. So, if I had the know-how, I could technically train a model on my MacBook. Again, I’m mathematically not close to it. I’m not one of those people. But it’s not beyond the realms of practicality to do so, right? And still have a state-of-the-art foundation model that is mine.
Jure Leskovec 00:31:10 Maybe that’s a bit too optimistic.
Sri Panyam 00:31:12 Okay, okay.
Jure Leskovec 00:31:12 You need to spend a good amount on proper latest-generation Nvidia hardware.
Sri Panyam 00:31:17 Right, but not five billion worth of GPUs.
Jure Leskovec 00:31:19 But not $5 billion worth of GPUs. Let’s say millions of dollars. Let’s say you need millions of dollars of investment.
Sri Panyam 00:31:25 Right? So technically a bank, for example, again, they wouldn’t do it, they shouldn’t do it, but there may be a reason for them to say, look, for a couple million dollars, I can build a fully optimized and personalized in-house foundation model from which we can further train our own use cases for competitive advantage.
Jure Leskovec 00:31:44 Exactly. And at my company Kumo.ai, we see a lot of traction for this.
Sri Panyam 00:31:50 Nice, nice.
Jure Leskovec 00:31:51 Exactly what you said.
Sri Panyam 00:32:21 I do want to touch back on one thing that I forgot to mention earlier. Again, with graphs, edges matter. What about history? What about time series data? I mean, isn’t there meant to be a marriage of the two in the future?
Jure Leskovec 00:32:35 Time series data is actually interesting, right? Like you can say, okay, if your series are, let’s say, independent of each other, then you could say a time series is a sequence, right? But reality gets more complicated very quickly because time series are correlated with each other. Time series are connected with each other, right? Two products in the same product category might be competing with each other. So, what I’m trying to tell you now is basically that the time series prediction problem is also a graph problem. Because if I only attend over a sequence of sales of a given product, I cannot, for example, learn that sales of this product are correlated with the sales of this other product, and in this way have more information to make more accurate predictions, right? So even with time series forecasting, when you look at how the data is organized and all that, it’s not that here is an individual time series, tell me what happens next. No, here is a set of time series about this set of, let’s say, products or assets. Here is how these assets are connected, here are all their properties, and again, it becomes a graph problem. So, these types of approaches work well for time series prediction as well. And the reason why we think of time series as an individual type of thing is because that’s what today’s technology allows us to do. We have a hammer and we’re looking for nails, but I’m saying that with a more general hammer, the nails also change.
Sri Panyam 00:33:57 Well, I think I was not clear there. I agree with you on that. I’m just saying it wasn’t clear to me how the graph or foundation models today would encode time series information. Because even with time series data, relationships change over time: at T0, product one will affect T0 product two, and then that in turn affects T1 product one, and so on, right? So, there are graph relationships even within time series data. And is that somehow being captured today or exploited today in this new paradigm?
Jure Leskovec 00:34:25 That is a great point. Today there are time series foundation models, which are basically sequence transformers that attend and predict on the individual time series. And I think we both agree that that is limiting because of relationships between time series. And if you can learn over those relationships, you just automatically get more signals and your accuracy increases. So, these graph-based approaches we’ve seen work really well with time series, supply chain type information and so on.
Sri Panyam 00:34:58 I want to kind of contrast some of the complexities, right? If you take graph learning or graph foundation models, right, and then compare that with, let’s say, your traditional distance. From a training time perspective, a resource perspective, how would you contrast those two? LLMs on the other hand are heavy on this side, but in the same space. How would you compare? What are the contrasts?
Jure Leskovec 00:35:17 Yeah, it’s interesting. I think today, for the majority of predictive modeling, it’s not the training time that’s the bottleneck. It’s all this data munging, the ETL pipelines that are running on CPUs and are very slow and sluggish. And the models that we train on top, those are quite quick to train. So, the difference here is that we say, skip the ETL, take a beefy GPU-based model and just let it train directly on the raw data. And I want to make a point, right? Earlier I made claims that you can now build models faster, you can get 10, 20% better accuracy. And people are like, why? How? And my point is we shouldn’t be surprised about this. The same steps have been carried out twice already. First, they were carried out in computer vision and then they were carried out in, let’s say, natural language understanding. And if you look at computer vision, I could say, if I joke a bit, right, I’m building a detector for whether there is a cow in the image or not. Of course you could say that if I, as a computer vision expert, can engineer a perfect feature for whether there’s a cow in the image or not, then my classifier will be amazing.
Jure Leskovec 00:36:35 We’re both smiling now because we both know that it’s kind of impossible for a human to engineer a perfect feature for whether there is a cow in the image or not, right? So, all these feature-based approaches that would do, I don’t know, SIFT features, Gabor filters and so on, where you would then train a neural network or a support vector machine or whatever it was to detect whether there is a cow in the image or not, you don’t do this today. You just have a vision transformer or, you know, in the old days you had a convolutional neural network that just attends over all the pixels and figures out how to combine all the scattered information spread across the pixels into: is it a cow or not? What I’m doing here, or proposing, is the same. Don’t try to engineer the perfect feature. Let the attention mechanism attend over your database the same way as it attends over the pixels and let it extract that information out.
Jure Leskovec 00:37:32 So in this respect, I’m kind of saying very obvious things that we’ve seen work in the past, and maybe the only thing I’m saying is as soon as you think of a database as a graph, you can do it. Before, we didn’t know how to represent the database to be able to kind of attend over it. We didn’t know what the pixels were. And now I’m saying think of it as a graph, nodes are your pixels and there are relationships between them. Just attend over that. It’s more complex than computer vision because there is not this spatial locality and things like that, but it’s very doable and very fruitful.
Sri Panyam 00:38:11 Again, go back to, I think when you said treat the database as a graph, that I guess triggered something for me in a good way. When you have a prediction query or prediction request or prediction session coming in, right? What does the data access pattern look like in your database from the mechanism, from the model perspective, from the flow perspective.
Jure Leskovec 00:38:29 That’s a great point. So, what the data access mechanism essentially is: as events are coming in, you need to keep the graph up to date. And then all you need to do is basically fetch a local neighborhood around the node of interest, right? So, you need to basically fetch a small subset of your database data from a couple of tables and send that through the neural network, through this transformer-based architecture, to get a prediction. And what is a huge difference here is that once the model is trained, putting it in production is trivial. You just refresh the raw data. There are no feature pipelines, no feature stores, no additional computation that needs to be done every time a transaction appears. You just take the latest data, pump it through the neural network and you get the latest prediction.
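Editor’s illustration (not part of the spoken exchange): a minimal sketch of "fetch a local neighborhood around the node of interest", treating rows as nodes and foreign keys as edges. The toy tables, the `FOREIGN_KEYS` list, the function names, and the `as_of` timestamp filter are all hypothetical. The time filter in particular is an added assumption meant to avoid reading rows newer than the prediction time, and a production system would keep the adjacency index up to date rather than rebuild it per request.

```python
from collections import deque

# Hypothetical toy database: table name -> {primary key -> row}.
TABLES = {
    "customers": {1: {"name": "Ann"}, 2: {"name": "Bo"}},
    "products": {10: {"title": "Mug"}, 11: {"title": "Pen"}},
    "orders": {
        100: {"customer_id": 1, "product_id": 10, "ts": 5},
        101: {"customer_id": 1, "product_id": 11, "ts": 9},
        102: {"customer_id": 2, "product_id": 10, "ts": 7},
    },
}
# (child table, foreign-key column, parent table)
FOREIGN_KEYS = [("orders", "customer_id", "customers"),
                ("orders", "product_id", "products")]

def build_adjacency():
    """Turn every foreign-key reference into an undirected edge between two rows."""
    adj = {}
    for child, col, parent in FOREIGN_KEYS:
        for pk, row in TABLES[child].items():
            a, b = (child, pk), (parent, row[col])
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def row_time(node):
    """Timestamp of a row, or None for rows without one (treated as always visible)."""
    table, pk = node
    return TABLES[table][pk].get("ts")

def fetch_neighborhood(seed, hops, as_of):
    """Breadth-first search up to `hops` edges from `seed`, skipping rows newer than `as_of`."""
    adj = build_adjacency()
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for nxt in adj.get(node, ()):
            ts = row_time(nxt)
            if nxt in seen or (ts is not None and ts > as_of):
                continue
            seen.add(nxt)
            frontier.append((nxt, depth + 1))
    return seen

subgraph = fetch_neighborhood(("customers", 1), hops=2, as_of=8)
```

The returned set of `(table, primary_key)` pairs is the small slice of the database that would be handed to the model for one prediction; no precomputed feature table is involved.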
Sri Panyam 00:39:19 So how do databases need to adapt and evolve to allow these patterns to be visible to the model or I guess the agent, right? We can go full graph DBs, right? Or we can have more specialized indexes on the current graphs, but something’s going to break.
Jure Leskovec 00:39:36 That is a great point. So, I would actually say there are two solutions here. If you don’t need too much QPS, then you can basically just keep the data in Postgres or MySQL or in Snowflake or wherever you have it. And we can basically do this pushdown to extract the data on the fly, essentially by running SQL statements. If you need performance, scale, and short latency, or if the volume of predictions you need to make is super large, then the current graph databases are not the right approach, because they’ve been built for a different type of workload. They’ve been built for these kind of SPARQL-like queries and for building the graphs and querying them; they haven’t been built for AI workloads. So, what we see is that today’s graph databases are about one to two orders of magnitude too slow to support this type of predictive AI use case.
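Editor’s illustration (not part of the spoken exchange): a minimal sketch of the low-QPS option, where the data stays in an ordinary SQL database and the context for one entity is pulled on the fly with plain SQL statements. It uses Python’s built-in `sqlite3` module as a stand-in for Postgres, MySQL, or Snowflake. The schema, the `fetch_customer_context` function, and the `max_orders` cap are hypothetical and are not a description of any vendor’s pushdown implementation.

```python
import sqlite3

# In-memory database standing in for an existing operational store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     product_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bo');
INSERT INTO products VALUES (10, 'Mug'), (11, 'Pen');
INSERT INTO orders VALUES (100, 1, 10, 12.5), (101, 1, 11, 3.0), (102, 2, 10, 12.5);
""")

def fetch_customer_context(conn, customer_id, max_orders=50):
    """Pull the rows linked to one customer: first their orders, then the products ordered."""
    orders = conn.execute(
        "SELECT id, product_id, amount FROM orders "
        "WHERE customer_id = ? ORDER BY id DESC LIMIT ?",
        (customer_id, max_orders),
    ).fetchall()
    product_ids = sorted({product_id for _, product_id, _ in orders})
    products = []
    if product_ids:
        # One placeholder per id keeps the query parameterized.
        placeholders = ",".join("?" for _ in product_ids)
        products = conn.execute(
            f"SELECT id, title FROM products WHERE id IN ({placeholders}) ORDER BY id",
            product_ids,
        ).fetchall()
    return {"orders": orders, "products": products}

context = fetch_customer_context(conn, customer_id=1)
```

Each prediction here costs a couple of indexed lookups per hop, which is workable at modest request rates; the speaker’s point is that at very high volume this read path becomes the bottleneck.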
Sri Panyam 00:40:40 On the read or write path, or both parts?
Jure Leskovec 00:40:42 Mostly on the read.
Sri Panyam 00:40:44 Okay.
Jure Leskovec 00:40:45 On just extracting these subgraphs out. So, what we built at Kumo is a specialized system built from the ground up. You can think of it as a graph database that is purpose-built for these AI workloads, for high-volume, large-scale predictions, that can be run at, and has been adopted by, some of the largest internet-scale enterprises here in Silicon Valley and beyond.
Sri Panyam 00:41:11 Nice, nice. I do want to talk about PyTorch Geometric. I know it’s a bit of a tangent, but how did it come about? What was the origin story? Where is it today?
Jure Leskovec 00:41:20 That’s a great point. So PyTorch Geometric basically came out of this idea, this was back in 2017, 2019, something like that, right? Like when deep learning was rising up, it was computer vision, transformers were kind of there, but not yet. And then there were a lot of problems that didn’t fit into this fixed grid of pixels. They don’t fit into a sequence, but they fit into a graph. And this is a lot of problems in, let’s say, spatial data, computer graphics, chemistry, social networks, graphs and so on. And there was really a need to build an open-source package where researchers could build the latest architectures and benchmark them against each other, so the community could make progress. So, we built PyG, or PyTorch Geometric, as the library. It has, I think, 20-plus thousand GitHub stars now and things like that, and it really catalyzed the research in this area of graph neural networks and graph transformers.
Jure Leskovec 00:42:17 We also built benchmarks. One was called OGB, Open Graph Benchmark as we called it, for large-scale graphs. And now for relational data, we are building what we call RelBench, which is a set of databases and a set of predictive tasks over them, and all are publicly accessible so that we can benchmark progress and see which methods work and which don’t. And I would say for PyG it was also great because we have had great collaborations with Nvidia, as well as Intel in the past, where we said, okay, this has to run efficiently and scalably on the latest hardware. Those partnerships and that support from Nvidia, from the PyTorch team and so on mean that the open-source library is actually useful and performant.
Sri Panyam 00:43:04 You mentioned Nvidia and hardware companies. Are you finding that there are other kinds of demands or expectations that would serve relational foundational models better from a hardware perspective than, let’s say, what they’re currently serving today, which is the LLM use case, right? Or are you making do with what’s coming out, the usual H100?
Jure Leskovec 00:43:24 Yeah, H200 and so on? Yeah, that’s a great point. Working with graphs, I’d say it’s an order of magnitude harder than working with these linear data types, right? Text is nice, it’s a sequence, you can chop it, you can just linearly scan it through a GPU. An image is essentially a fixed-size matrix. Each image is independent of the others. So again, you can kind of push them through, or video in the same way. But graphs are hard, right? Graphs have no up and down, they don’t have any left and right. There’s no way to chop them into pieces. Everything is kind of connected and interdependent. You don’t know what is going to connect with what. So, dealing with graphs is very, very hard. And this means that you have to be very careful how you design the systems to be scalable and to take advantage of today’s hardware. And I would say through Stanford, we are actually working with Nvidia quite a lot in terms of them understanding what the needs are for future chips as we move beyond sequential LLMs and memory access patterns become very different, because you’re essentially almost randomly accessing different nodes in memory, pulling those together, and you want to keep your chip utilized.
Sri Panyam 00:44:42 Can you give us a peek at what’s coming?
Jure Leskovec 00:44:44 I can say that we’ve been discussing and brainstorming what would allow us to process these types of interconnected data in a proper way. And it’s a lot about defining the benchmarks and running simulations, running measurements to understand how these data access patterns can be better supported by the underlying, I would say software as well as the hardware.
Sri Panyam 00:45:08 Interesting. And I guess in a way, given that we are not talking about exabytes of data, not data volume at the same scale as LLMs, there’s opportunity for even more specific hardware to come out of this. Is that fair?
Jure Leskovec 00:45:20 It’s interesting, right? I think these enterprise data sets get big very, very quickly, right? Maybe they are not internet scale, but they get big very, very quickly. If you start thinking about, I don’t know, all the transactions on the Bitcoin blockchain right now, all the clicks and all the comments made by all the users of Reddit, this amount of data gets big very quickly. Or if you start thinking about banks, financial institutions, transactions and everything that’s going on there, that’s huge velocity and volumes of data, easily petabytes and up.
Sri Panyam 00:45:53 Nice. You know, on the topic of foundation models, when we think LLMs, we are all now implicitly aware of, or have experienced, hallucinations, right? What does hallucination mean in the context of RFMs? Is it there? Is it not there? Does it look different? What does it mean?
Jure Leskovec 00:46:09 That’s a great point. I think what is different in relational foundation models is that, because they’re making a prediction, we make sure that the prediction is calibrated so that we properly give the estimate of uncertainty and you get back a sense of accuracy. So, in this respect, it’s not hallucinated; basically you get an accurate estimate of how sure we are, how accurate this prediction is. And then with that you can decide what to do. So, because we can train for prediction, we properly penalize these models, right? The problem with LLMs in some sense is that they don’t understand numbers. They just say, did I generate the next token correctly or not? So, if the next token is 200 or 2000, it is the same amount of penalty, but in relational foundation models, if my prediction is a thousand off, the model gets penalized much more than if the prediction is two or 1.5 off.
Sri Panyam 00:47:06 How do you do that penalization? What is the mechanism of penalizing a bad prediction?
Jure Leskovec 00:47:11 Yeah, because we have a loss function that is actually not about whether the next token was generated correctly, but we can kind of measure the distance, if you want to think of it this way, between the truth and the prediction. And we are penalized by the amount of distance, not by whether it was correct or not, right? And in text you cannot do that because there’s no similarity between tokens. Tokens are tokens and each one is either the correct one or the wrong one. There is nothing in between.
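Editor’s illustration (not part of the spoken exchange): a minimal sketch contrasting a right-or-wrong token penalty with a distance-based penalty, using the 200 versus 2000 example from the conversation. The function names are hypothetical. The 0/1 token penalty is a deliberate simplification of a next-token objective (real language models are trained with cross-entropy over token probabilities), and absolute error is just one possible distance; the transcript does not say which loss is actually used.

```python
def token_penalty(predicted: str, target: str) -> float:
    """Simplified stand-in for a token objective: correct or wrong, nothing in between."""
    return 0.0 if predicted == target else 1.0

def distance_penalty(predicted: float, target: float) -> float:
    """Absolute error: the penalty grows with how far the prediction is from the truth."""
    return abs(predicted - target)

target = 200.0
# Each entry: (guess, token-style penalty, distance-style penalty)
results = [
    (guess, token_penalty(str(guess), str(target)), distance_penalty(guess, target))
    for guess in (201.5, 2000.0)
]
```

Under the token-style penalty both guesses cost the same, while the distance-style penalty separates a near miss from a guess that is off by an order of magnitude, which is the extra training signal described above.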
Sri Panyam 00:47:39 In code, which is one major application area for LLMs, you do have some kind of penalization in the whole AI loop, where you might ask: does it compile? Does it do something? Does it execute? You’re right, it’s not accurate, it’s not perfect, but there’s some kind of guidance there, right?
Jure Leskovec 00:47:55 In these domains where it’s verifiable, of course RL has shown really good progress, but even there the verification signal is binary: it compiles or it doesn’t. It’s not that something is more accurate than something else, it’s just yes or no.
Sri Panyam 00:48:09 Right?
Jure Leskovec 00:48:10 Right, right. That’s kind of what I’m trying to say. But in these prediction use cases, you get more information: you kind of know how much you were wrong, and you can tell that to the model, and then the model starts to learn these nuances in a better way.
Sri Panyam 00:48:22 Nice. Is it also, again for the layman, is it also kind of a backprop mechanism, or is there a different mechanism for sending that feedback back?
Jure Leskovec 00:48:27 Again, it’s backprop; what differs is just the loss function.
Sri Panyam 00:48:31 Nice.
Jure Leskovec 00:48:31 The penalty function.
Sri Panyam 00:48:33 Right. From a calibration perspective, typically what is the proportion of your training budget or your training resource budget that goes into calibration versus your general training?
Jure Leskovec 00:48:43 Good question. I would say a large, large majority of resources go into training. So, kind of calibration comes in some sense for free. It’s built into the training process itself through the usage of proper loss functions. And also I think the point is that these post training phases in LLMs are the ones that kind of lobotomize the model here. We don’t have to post train and say, oh, does the human prefer this or not? And things like that. So that kind of starts skewing the confidence of LLMs. But here we don’t have to do that. The quality of the output can be objectively measured. And because of that, the model gets much more reliable, consistent signals during training.
Sri Panyam 00:49:27 What about in-context learning during a session? Since you don’t have a post-training kind of phase, right, could it mean that when you’re live in context, the predictions can somehow be fed back in that session for calibration?
Jure Leskovec 00:49:41 Yeah, so the way in-context learning would work in this case is that you would give the model a set of historical examples and then you would ask it about something that you don’t know. And now the model gets those historical in-context examples that are of course use-case specific, let’s say specific to the customer database, and then you give it something that you don’t know about and you ask it to make a prediction. And this pre-trained model is basically reasoning over those historical examples that you gave it to make you that prediction. And what is interesting here is that if I give you the historical examples, the model can take a couple of those away and pretend that it doesn’t know the outcome in the past, make a prediction there, and see how accurate it is, and that gives you an objective, non-hallucinated way to say, this is how accurate I am for this prediction, right? It cannot be overconfident because it can measure itself, let’s say, on the historic data.
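Editor’s illustration (not part of the spoken exchange): a minimal sketch of the self-check described above, where some historical in-context examples are hidden, predicted from the remaining ones, and the resulting error is reported alongside the real prediction. The nearest-neighbor averaging in `predict_from_context` is only a stand-in for a pre-trained model, and the data, function names, and the choice to hold out the most recent examples are assumptions.

```python
def predict_from_context(context, query_x, k=3):
    """Stand-in predictor: average outcome of the k context examples nearest to query_x."""
    nearest = sorted(context, key=lambda ex: abs(ex[0] - query_x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def holdout_error(context, holdout_size=2):
    """Hide the last `holdout_size` examples, predict them from the rest,
    and return the mean absolute error on the hidden outcomes."""
    visible, hidden = context[:-holdout_size], context[-holdout_size:]
    errors = [abs(predict_from_context(visible, x) - y) for x, y in hidden]
    return sum(errors) / len(errors)

# Hypothetical history of (feature value, observed outcome), oldest first.
history = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0),
           (4.0, 40.0), (5.0, 50.0), (6.0, 60.0)]

estimate = predict_from_context(history, query_x=4.5)  # prediction for the unknown case
error = holdout_error(history)                         # measured accuracy on hidden history
```

The error figure is measured rather than asserted by the predictor, which is the sense in which the speaker says the model cannot be overconfident. In this toy example the simple stand-in extrapolates poorly, and the held-out error exposes that.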
Sri Panyam 00:50:33 Now you also mentioned that this explanation, I guess, comes with a confidence score, right? And typically, is it a set of scores instead of likelihoods? Is it per prediction? I guess, how does the user of this prediction absorb that and, you know, practically use it?
Jure Leskovec 00:50:49 Yeah, that’s a great point. So, you would get two things, right? Or three things. You get a prediction, you get a confidence interval or accuracy estimate, and then you also get the explanation in natural language about why the model made that prediction. And the way we do this is actually super cool. It allows us to basically back-trace the model and see what data it is attending over, right? And then we can basically look at that attention mechanism, the tables, the columns, the rows that the model is attending over, and then synthesize that into an explanation. And you can say the prediction for this is like that because of the data in this table, the data in that table and so on. What I would say is the exciting part is that you can truly make these explanations not just from the signals or features that somebody thought to engineer, but directly from the raw data. So, the fidelity or richness of these explanations is actually quite impressive. You kind of learn something new because it’s hidden in the data and you didn’t know about it. You didn’t have to pre-engineer it.
Sri Panyam 00:51:51 Interesting. So, it’s almost like a debugging trace.
Jure Leskovec 00:51:54 Great point. We see many times that it’s a debugging trace, because what you can easily detect with this is, oh, I have some information leakage, I have some data in there that I shouldn’t have in there, and things like that. Exactly. So, you can think of it as a debug trace.
Sri Panyam 00:52:08 But instead of a trace hitting endpoints or methods or function calls, you’re talking about which piece of data in your database was used, what the weight was, what the process was.
Jure Leskovec 00:52:18 Exactly. It’s almost like debugging the data in a sense, right? Especially when things go wrong. It’s a super elegant way to debug what’s happening or when a prediction is made, you can use it for justification and things like that.
Sri Panyam 00:52:30 So how would you use this to effectively debug it? Let’s say I have a trace and I’ve noticed that (again, this is just a visual for my own mental model, right?) in this query, the prediction used table X and not Y, or it used relationship A and not B. Yeah, I’m assuming it looks like that, but maybe not. But if you had that, what would you do with it and who would do it?
Jure Leskovec 00:52:50 That’s great, right? So, there are several different ways. If you’re the one who’s developing the model, what you can learn from this is that maybe you have a table in there that is leaking information, or maybe you learn that some data is important for prediction, but actually that data is not available at the time of prediction, yet the model is going to use it. So, you can kind of take that data away. Another thing that can happen is that if you see that the model is attending to a particular table, a particular type of data, that tells you, oh, there is signal in here. You can bring in additional data, link other tables and in this way improve the accuracy of the model. So, it’s good both for improving the model and for finding, I would say, bugs in the data, like having columns that maybe, you know, somehow were in there but shouldn’t be there because they’re backfilled and things like that. So, both use cases would be solved by this, by providing ideas on what you can do to improve as well as debugging the data where the model is shockingly good and you’re like, hey, something seems fishy here.
Sri Panyam 00:53:54 Yeah, it’s almost like a good forensic investigative tool as well. You know, you don’t know what you don’t know and you’re finding out things that you didn’t know.
Jure Leskovec 00:54:00 Exactly. You can think of it as a forensic tool, and where it also becomes very useful is that you can start asking these models what are technically called counterfactual questions. You can say, if I do this, what will happen? If I send this offer to the customer, what will happen to the customer, right? So now you can kind of test all these alternative hypotheses and ask the model. Okay, like we had a use case for sales lead scoring, and then the person who was using the system was like, hey, why did you predict so low a probability of me closing this deal? And you can go back to the model and the model is like, look, I’m looking at the data, your deal size is very large compared to other deals. I’m looking here, you don’t have an executive sponsor. And also, by the way, this data hasn’t been updated for three months. Now the salesperson can be like, oh, okay, actually there is new information we forgot to update. Let’s put that data in. And you can query the model again. Or you can start by asking, okay, I don’t have an executive sponsor, let me figure that out. And then you can also be asking this counterfactual question, which is like, okay, so now if I offer the client a 10% discount, what happens to the probability of closing? This is kind of an example of how this capability could be used.
Sri Panyam 00:55:16 Yeah. That’s a really good example actually. So how much of this (I use the word “fine-tuning” loosely), how much of this adding more data, doing more fine-tuning on an RFM, can you do before you might say you’ve hit the limit and you must retrain or go with a newer model? Is there any rough rule of thumb there? Or is it just that one day we’ll get there?
Jure Leskovec 00:55:37 I would say generally more data that is useful and relevant allows you to train more accurate models. So, we want to use the data that is relevant and as much of it as possible, but also these methods, you know, are not that data hungry. We have use cases where maybe we have on the order of thousands of examples and we have use cases where we have on the order of tens of billions of examples. And the beautiful thing is that you can basically then choose the model size and architecture that fit your needs.
Sri Panyam 00:56:10 Okay. So that sounds good. One big hot topic is agentic systems, right? How do you see this kind of interdependence symbiosis between RFMs and agentic systems?
Jure Leskovec 00:56:21 Yeah, we see a lot of traction there because you know, for agents, especially in enterprise business settings, they need to be able to make decisions, right? And even if you say I have a customer support agent, that customer support agent actually has a lot of predictive decision-making tasks to make before it reacts, right? It’s about what’s the lifetime value of this customer? How likely is this customer going to churn? What is the next best action that I take to keep this customer? What is the offer or resolution I should offer to increase the probability of customer not churning and so on, right? And today I think we are early in agents, so we get excited about this almost like this kind of, oh let’s retrieve some knowledge from some knowledge base and reform it and things like that. But as these agents become more business critical, more autonomous, they need this decision-making power and reasoning ability over the enterprise structured operational data to make the correct decisions. Right? And if you now start thinking about a customer agent, one of course important thing is to talk, to understand and things like that. But at the same time is like having the right tone, giving the right offer and making sure that, you know, customers are satisfied, that directly affects their effectiveness. So, we see a lot of need for that.
Sri Panyam 00:57:40 Could you share examples of where a customer switched their agent to use an RFM versus the classical LLM and how it impacts the outcome?
Jure Leskovec 00:57:49 A great example is also in sales, right? If you say, let’s understand the probability of closing, let’s understand what’s the next best action to take here, let’s understand what’s the next product to upsell to that client so that they increase their spend, let’s understand what I as a salesperson should do next to be more effective. We see great results in these types of domains. LLMs cannot make, I mean, they make this kind of good human-level common-sense decision, but these dedicated predictive models, which have looked through all the patterns of the past, are just so much more accurate, right? They’re like 20, 30% more accurate than what this kind of common-sense LLM or a human can do. And we’ve done, for example with some sales teams, A/B tests where one part of the sales team would act based on these predictions and the other one would use LLMs and existing tools, and there was a huge difference. It was like a 30% difference in effectiveness of those teams because they had structured data understanding, good quantitative predictions, counterfactuals, and things like that.
Sri Panyam 00:58:57 As we wrap up, I wanted to get some practical guidance for software engineers or data scientists who are kind of in this space and want to explore this. What would be the easiest way to start exploring it? I mean, what is the hello world of this?
Jure Leskovec 00:59:08 Yeah, that’s a great point. I would say there are a couple of hello worlds, and it depends maybe on the flavor or role or technical sophistication of people, I think. For researchers, PyG, PyTorch Geometric, is a great starting point, along with RelBench and all this stuff. For people who just want to use this, I think these tabular foundation models and relational foundation models are great. And there are both open-source and basically public SDKs where people can start playing with this. The one I’d recommend is called KumoRFM. So, go to K-U-M-O-R-F-M, like relational foundation model, dot AI, if you want to start playing with this. And this technology is also going to be available soon in large data warehouses like Snowflake and so on, so that it’s kind of ready to use, pre-installed, and one doesn’t have to worry too much.
Sri Panyam 00:59:56 Sounds good. How can we learn more about this and how can we follow you? Are there conferences, will you be speaking at any of them?
Jure Leskovec 01:00:03 Yeah, that’s great. I try to publish our research mostly on LinkedIn as well as on X. So on X, I’m @jure, on LinkedIn as well. And yeah, I attend a lot of conferences. Of course, the top AI conferences, like NeurIPS, ICLR, and so on. And then also try to go to meetups, especially here in San Francisco where there is strong developer AI community hungry to kind of learn the next thing.
Sri Panyam 01:00:30 Thank you. Thank you. Any closing words of advice before we wrap up?
Jure Leskovec 01:00:34 It was a great conversation. I think for me to summarize, what’s the key? The key here is essentially that this structured relational enterprise data is the missing modality in today’s AI. And what I’m saying is the world of machine learning and predictive modeling is getting heavily disrupted by the technologies and approaches that I’m talking about: relational deep learning and relational foundation models. That’s the next frontier and that’s where the future is going. So, I’d encourage people to get familiar with these things, be it machine learning engineers, data scientists, as well as business units and so on.
Sri Panyam 01:01:09 Thank you. And software engineers.
Jure Leskovec 01:01:12 Software engineers. Thank you so much.
Sri Panyam 01:01:14 Thank you. This has been a very enlightening and very insightful chat. I’ve learned so much, as have our listeners. Thank you. This is Sri Panyam with Jure on relational foundation models. Thank you.
[End of Audio]



