Frank McSherry, chief scientist at Materialize, talks about the Materialize streaming database, which supports real-time analytics by maintaining incremental views over streaming data. Host Akshay Manchale spoke with Frank about various ways in which analytical systems are built over streaming services today, pitfalls associated with those solutions, and how Materialize simplifies both the expression of analytical questions through SQL and the correctness of the answers computed over multiple data sources. The conversation explores the differential/timely dataflow that powers the compute plane of Materialize, how it timestamps data from sources to allow for incremental view maintenance, as well as how it’s deployed, how it can be recovered, and several interesting use cases.
This episode sponsored by OpenZiti.
- Episode 393: Jay Kreps on Enterprise Integration Architecture with a Kafka Event Log
- Episode 219: Apache Kafka with Jun Rao
- Episode 162: Project Voldemort with Jay Kreps
- Episode 370: Chris Richardson on Microservice Patterns
- Episode 473: Mike Del Balso on Feature Stores
- Episode 456: Tomer Shiran on Data Lakes
- Episode 433: Jay Kreps on ksqlDB
- The Streaming Database for Real-time Analytics
- Streaming SQL: What is it, why is it useful? – Materialize
- Using Kafka as Your Primary Data Store? Here’s Why You Shouldn’t
- Eventual Consistency isn’t for Streaming – Materialize
- Home — Materialize Documentation
- Timely Dataflow: Timely dataflow in three easy steps!
- GitHub – TimelyDataflow/timely-dataflow: A modular implementation of timely dataflow in Rust
- GitHub – TimelyDataflow/differential-dataflow: An implementation of differential dataflow using timely dataflow in Rust.
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Akshay Manchale 00:01:03 Welcome to Software Engineering Radio. I’m your host, Akshay Manchale. My guest today is Frank McSherry and we will be talking about Materialize. Frank is the chief scientist at Materialize and prior to that, he did a fair bit of relatively public work on dataflow systems — first at Microsoft Research Silicon Valley, and most recently at ETH Zurich. He also did some work on differential privacy back in the day. Frank, welcome to the show.
Frank McSherry 00:01:27 Thanks very much, Akshay. I’m delighted to be here.
Akshay Manchale 00:01:29 Frank, let’s get started with Materialize and set the context for the show. Can you start by describing what is Materialize?
Frank McSherry 00:01:38 Certainly. Materialize, a great way to think about it is it’s an SQL database — the same sort of thing you’re used to thinking about when you pick up PostgreSQL or something like that — except that its implementation has been changed to excel really at maintaining views over data as the data change rapidly, right? Traditional databases are pretty good at holding a pile of data, and you ask a lot of questions rapid-fire at it. If you flip that around a little and say, what if I’ve got the same set of questions over time and the data are really what are changing? Materialize does a great job at doing that efficiently for you and reactively so that you get told as soon as there’s a change rather than having to sit around and poll and ask over and over again.
Akshay Manchale 00:02:14 So, something that sits on top of streaming data, I suppose, is the classic use case?
Frank McSherry 00:02:19 That’s a great way to think about it. Yeah. I mean, there are at least two positionings here. One is, okay, so streaming is very broad. Any data show up at all and Materialize absolutely will do some stuff with that. The model in that case is that your data, your table if you were thinking about it as a database, is full of all those events that have showed up. And we’ll absolutely do a thing for you in that case. But the place that Materialize really excels and distinguishes itself is when that stream that’s coming in is a change log coming out of some transactional source of truth: your upstream OLTP-style instance, which has very clear changes to the data that have to happen atomically at very specific moments. And you know, there’s a lot of streaming infrastructure that you could apply to this data, and maybe, maybe not, you actually get out exactly the correct SQL semantics from it. And Materialize is really, I would say, positioned for people who have a database in mind, like they have a collection of data that they’re thinking of, that they are changing, adding to, removing from, and they want the experience, the lived experience, of a transactionally consistent SQL database.
Akshay Manchale 00:03:20 So in a world where you have many different systems for data management and infrastructure, can you talk about the use cases that are solved today and where Materialize fits in? Where does it fill the gap in terms of fitting into the existing data infrastructure in an existing company? Maybe start by saying what sorts of systems are present and what’s lacking, and where does Materialize fit in in that ecosystem.
Frank McSherry 00:03:46 Certainly. This won’t be comprehensive; there’s a tremendous amount of exciting, interesting bits of data infrastructure out there. But in broad strokes, you often have a durable source of truth somewhere. This is your database, your OLTP instance; it’s holding onto your customer data, the purchases they’ve made, and the products you have in stock, and you don’t screw around with this. This is the correct source of truth. You could go to that and ask all of your questions, but these databases often aren’t designed to really survive heavy analytic load or continual querying to drive dashboards and stuff like that. So, a product that’s shown up in the past 20, 30 years or so has been the OLAP database, the online analytic processing database, which is a different take on the same data, laid out a little bit differently to make asking questions really efficient. That’s the sort of “get in there and grind over your data really quick” system for asking questions like: how many of my sales in this particular time period had some characteristic, so that I can learn about my business or my customers or whatever it is that I’m doing.
Frank McSherry 00:04:47 And that’s a pretty cool bit of technology that also often lives in a modern organization. However, they’re not usually designed for change. I mean, they sort of think about taking the data that is there and reorganizing it, laying it out carefully so that it’s fast to access, but the data are continually changing. That’s a little annoying for these sorts of systems and they’re not really optimized for freshness, let’s say. You know, they can do something like adding data into counts, not so hard, but modifying a record that used to be the maximum value, so you’ve got to find the second biggest one now, that sort of thing is annoying for them. Now with that, people have realized like, oh, okay, there are some use cases where we’d actually like to have really fresh results and we don’t want to have to go hit the source of truth again.
Frank McSherry 00:05:30 And so folks started to build streaming platforms, things like Confluent’s Kafka offerings and Ververica’s Flink. These are systems that are very much designed to take event streams of some sort, you know, they might just be raw data landing into Kafka, or they might be more meaningful change data capture coming out of these transactional processing databases, but pushing those through streaming systems where, to date, I would say most of them have been tools rather than products, right? So, they’re software libraries that you can start coding against. And if you get things right, you’ll get a result that you’re pretty proud of and that produces correct answers, but this is a little bit on you. And they’ve started to go up the stack a little bit to provide fully featured products where you’re actually seeing correct answers coming out consistently, though they’re not generally there yet.
Frank McSherry 00:06:20 I would say Materialize is trying to fit into that spot, to say: as you have come to expect for transactional databases and for analytic databases, if you’re trying to think about a stream database, not just a stream programming platform or stream processing toolkit, but a database that maintains consistency, maintains invariants for you, scales out horizontally, stuff like that, all of the things you expect a database to do for you, but for continually changing data, that’s where we’re sneaking in and hoping to get everyone to agree: oh, thank goodness you did this rather than me.
Akshay Manchale 00:06:52 Analytics on top of streaming data must be somewhat of a common use case now that streaming data, event data, is so common and pervasive in all kinds of technology stacks. How does someone support answering the analytical questions that you might say Materialize supports today, without Materialize?
Frank McSherry 00:07:12 Yeah, it’s a good question. I mean, I think there are a few different takes. Again, I don’t want to announce that I know all of the flavors of these things because it’s repeatedly surprising how creative and inventive people are. But generally the takes are: you have, always at your hands, various analytic tools that you can try to use, and they have knobs related to freshness. And some of them, you know, will quite happily let you append to data and get it involved in your aggregates very quickly. If you’re tracking maximum temperatures of a bunch of sensors, that’s fine; you know, it’ll be very fresh as long as you keep adding measurements. And, you know, things only go sideways in some of the maybe more niche cases for some people, like having to retract data or potentially having to do more complicated SQL-style joins. So a lot of these engines don’t quite excel at that. I would say the OLAP things either respond quickly to changes in data, or support complicated SQL expressions that have multi-way joins or multilevel aggregations and stuff like that, but not both.
Frank McSherry 00:08:08 So those tools exist. Other than that, your data infrastructure team skills up on something like Flink or KStreams and just starts to learn: how do I put these things together, if you ever need to do anything more exciting than just dashboards that count things? Counting is pretty easy; I think a lot of folks know there are a bunch of products that will handle counting for you. But if you needed to take events that come in and look them up in a customer database that’s supposed to be current and consistent, so you don’t accidentally ship things to the wrong address or something like that, you kind of either have to roll this on your own, or accept a certain bit of staleness in your data. And you know, it depends on who you are, whether this is okay or not.
Frank McSherry 00:08:48 I think people are realizing now that they can move along from just counting things, or getting information that’s an hour stale, to really current things. One of our users is currently using it for cart abandonment. They’re trying to sell things to people, and a person walks away from their shopping cart. You don’t want to know that tomorrow; in two minutes, even an hour, you have probably lost the customer at that point. And so trying to figure out that logic for determining: what’s going on with my business? I want to know it now rather than as a post-mortem. People are realizing that they can do more sophisticated things and their appetite has increased. I suppose I would say that’s part of what makes Materialize more interesting: people realize that they can do cool things if you give them the tools.
Akshay Manchale 00:09:29 And one way to circumvent that would be to write your own application-level logic, keep track of what’s flowing through and service the use cases that you want to serve. Maybe.
Frank McSherry 00:09:39 Absolutely. That’s a good point. This is another form of data infrastructure, which is really totally bespoke, right? Like put your data somewhere and write some more complicated pile of microservices and application logic that just sort of sniffs around in all of your data, and you cross your fingers and hope that your education in distributed systems isn’t going to cause you to show up as a cautionary tale about consistency or something like that.
Akshay Manchale 00:10:01 I think that makes it even harder. If you have like one-off queries that you want to ask one time, then spinning off a service and writing application-level code for that one-off is time consuming, and maybe not relevant by the time you actually have that answer. So, let’s talk about Materialize from a user’s perspective. How does someone interact with Materialize? What does that look like?
Frank McSherry 00:10:24 So the intent is, it’s meant to be as close as possible to a traditional SQL experience. You connect using PG wire, so in a sense it’s as if we were PostgreSQL. And really the goal is to look as much like SQL as possible, because there are lots of tools out there that aren’t going to get rewritten for Materialize, certainly not yet. And so they’re going to show up and say, I assume that you are, let’s say, PostgreSQL, and I’m going to say things that PostgreSQL is supposed to understand and hope it works. So, the experience is meant to be very similar. There are a few deviations; I’ll try to call those out. So, Materialize is very excited about the idea that in addition to creating tables and inserting things into tables and stuff like that, you’re also able to create what we call sources, which in SQL land are a lot like SQL foreign tables.
Frank McSherry 00:11:08 So this is data that we don’t have on hand at the moment; we’re happy to go get it for you and process it as it starts to arrive at Materialize, but we’re not sitting on it right now. You can’t insert into it or remove from it, but it’s enough of a description of the data for us to go and find it. This is like a Kafka topic or some S3 buckets or something like that. And with that in place, you’re able to then do a lot of standard stuff here. You’re going to select from blah, blah, blah. You’re able to create views. And probably the most exciting thing, and Materialize’s most differentiating thing, is creating Materialized views. So, when you create a view, you can put the Materialized modifier in front, and that tells us, it gives us permission basically, to go and build a dataflow that will not only determine those results, but maintain them for you, so that any subsequent selects from that view will essentially just be reading it out of memory. They will not redo any joins or aggregations or any complicated work like that.
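As a rough sketch of the source-plus-view flow Frank describes (broker, topic, and column names here are hypothetical, and exact syntax varies across Materialize versions; see the Materialize documentation for the current form):

```sql
-- A source describes external data that Materialize will go fetch and
-- process as it arrives; nothing is stored locally yet, much like a
-- foreign table:
CREATE SOURCE purchases
  FROM KAFKA BROKER 'kafka:9092' TOPIC 'purchases'
  FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://registry:8081';

-- The MATERIALIZED modifier gives Materialize permission to build a
-- dataflow that computes these results and keeps them up to date:
CREATE MATERIALIZED VIEW purchases_by_region AS
SELECT region, count(*) AS orders, sum(amount) AS revenue
FROM purchases
GROUP BY region;

-- Subsequent selects read the maintained results out of memory; no joins
-- or aggregations are redone:
SELECT * FROM purchases_by_region;
```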
Akshay Manchale 00:12:02 In a way you’re saying Materialized views are very similar to what databases do with Materialized views, except that the source data is not internal to the database itself, in some other tables on top of which you’re creating a view, but is actually from Kafka topics and other sources. So what other sources can you ingest data from, on top of which you can query using a SQL-like interface?
Frank McSherry 00:12:25 The most common one that we’ve had experience with has been pulling out, in one way or another, and I’ll explain a few, change data capture coming out of transactional sources of truth. So, for example, Materialize is more than happy to connect to PostgreSQL’s logical replication log and just pull out of a PostgreSQL instance and say, we’re going to replicate things up. Essentially, we simply are a PostgreSQL replica. There’s also an open-source project, Debezium, that is attempting to do change data capture for a lot of different databases, writing into Kafka. And we’re happy to pull Debezium data out of Kafka and have that populate various relations that we maintain and compute. But you can also just take Kafka, like records in Kafka with Avro schemas, there’s an ecosystem for this, pull them into Materialize, and they’ll be treated without the change data capture going on.
Frank McSherry 00:13:14 They’ll just be treated as append only. So, each new row that you get now, it’s as if you added that into the table you were writing, as if someone typed an insert statement with those contents, but you don’t actually have to be there typing insert statements; we’ll be watching the stream for you. And then you can feed that into these SQL views. There’s some cleverness that goes on. You might say, wait, append only, that’s going to be enormous. And there’s definitely some cleverness that goes on to make sure things don’t fall over. The intended experience, I suppose, is very naive SQL, as if you had just populated these tables with massive results. But behind the scenes, the cleverness is looking at your SQL query and saying, oh, we don’t actually need to do that, do we? If we can pull the data in and aggregate it as it arrives, we can retire data once certain things are known to be true about it. But the lived experience is very much meant to be SQL; you, the user, don’t need to, you know, there are like one or two new concepts, mostly about expectations, like which types of queries should go fast and which should go slow. But the tools that you’re using don’t need to suddenly speak new dialects of SQL or anything like that.
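The two change-data-capture paths mentioned above might look roughly like this (all connection strings, publication, and topic names are hypothetical, and exact syntax varies across Materialize versions):

```sql
-- Direct PostgreSQL replication: Materialize attaches to the logical
-- replication log and behaves essentially like a replica:
CREATE SOURCE pg_shop
  FROM POSTGRES CONNECTION 'host=pg-primary user=materialize dbname=shop'
  PUBLICATION 'mz_publication';

-- Debezium-via-Kafka: records carry CDC envelopes describing inserts,
-- updates, and deletes, which Materialize interprets rather than
-- treating the stream as append only:
CREATE SOURCE customers_cdc
  FROM KAFKA BROKER 'kafka:9092' TOPIC 'shop.public.customers'
  FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://registry:8081'
  ENVELOPE DEBEZIUM;
```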
Akshay Manchale 00:14:14 You can connect through JDBC or something to Materialize and just consume that information?
Frank McSherry 00:14:19 I believe so. Yeah. I think that I’m definitely not an expert on all of the quirks. So, someone could be listening going, oh no, Frank, don’t say that, don’t say that, it’s a trick. And I want to be careful about that, but absolutely, you know, with the appropriate amount of typing, PG wire is the thing that a hundred percent works, yes. And various JDBC drivers definitely work, though occasionally they need a little bit of help, some modifications to explain how a thing actually needs to happen, given that we are not literally PostgreSQL.
Akshay Manchale 00:14:44 So you said in some ways you’re similar, in what you just described, and in some ways you’re different from SQL, or you don’t support certain things that are in a traditional database. So, what are those things that are not like a traditional database in Materialize, or what do you not support from a SQL perspective?
Frank McSherry 00:14:59 Yeah, that’s a good question. So, I would say there are some things that are sort of subtle. So, for example, we’re not very happy to have you build a Materialized view that has non-deterministic functions in it. I don’t know if you were expecting to do that, but if you put something like Rand or Now in a Materialized view, we’re going to tell you no. I guess I would say modern SQL is something that we’re not racing towards at the moment. We started with SQL92 essentially. There are a lot of subqueries, joins, all sorts of correlation all over the place if you want, but not yet MATCH_RECOGNIZE and stuff like that, which was just SQL 2016 or something like that. There’s a rate at which we’re trying to bring things in. We’re trying to do a good job of being confident in what we put in there, versus racing forward with features that are mostly baked
Frank McSherry 00:15:44 or work 50% of the time. My take is that there’s an uncanny valley essentially between not really SQL systems and SQL systems. And if you show up and say we’re SQL compatible, but actually 10% of what you might type will be rejected. This is not nearly as useful as a 100% or 99.99%. That’s just no longer useful to pretend to be SQL compatible. At that point, someone has to rewrite their tools. That’s what makes a, it makes a difference. You mean, differences are performance related. You know, that if you try to use Materialize as an OTP source of truth, you’re going to find that it behaves a bit more like a batch process. If you try to see what is the peak insert throughput, sequential inserts, not batch inserts, the numbers there are going to be for sure, lower than something like PostgreSQL, which is really good at getting in and out as quickly as possible. Maybe I would say, or transaction support is not as exotic as opposed to the other transactions and Materialize, but the set of things that you can do in a transaction are more limited.
Akshay Manchale 00:16:39 What about something like triggers? Can you support triggers based upon
Frank McSherry 00:16:43 Absolutely not. No. So triggers are a declarative way to describe imperative behavior, right? Another example actually is window functions, which are a thing that technically we have support for, but no one’s going to be impressed. So window functions, similarly, are usually used as a declarative way to describe imperative programs. You like do some grouping this way and then walk one record at a time forward, maintaining the state, and the like. I suppose it’s declarative, but not in the sense that anyone really intended, and they’re, unfortunately, super hard to maintain efficiently. If you want to grab the median element out of a collection, there are algorithms that you can use that are smart to do that. But getting general SQL to update incrementally is a lot harder when you add certain constructs that absolutely people want. For sure. So that’s a bit of a challenge, actually, spanning that gap.
Akshay Manchale 00:17:31 When it comes to different sources, you have Kafka topics, you can connect to a change data capture stream. Can you join those two things together to create a Materialized view of sorts from multiple sources?
Frank McSherry 00:17:43 Absolutely. I totally forgot that this might be a surprise. Absolutely, of course. So, what happens in Materialize is the sources of data may come with their own views on transaction boundaries. They may have no opinions at all; the Kafka topics may just be like, hey, I’m just here. But you know, the PostgreSQL might have clear transaction boundaries. As they arrive at Materialize, they get translated to sort of Materialize-local timestamps that respect the transaction boundaries on the inputs, but are relatable to each other: essentially the first moment at which Materialize was aware of the existence of a particular record. And absolutely, you can just join these things together. You can take a dimension table that you maintain in PostgreSQL and join it with a fact table that’s spilling in through Kafka and get exactly consistent answers, as much as that makes sense. When you have Kafka and PostgreSQL in there, they’re uncoordinated, but we’ll be showing you an answer that actually corresponds to a moment in the Kafka topic and a specific moment in the PostgreSQL instance that were roughly contemporaneous.
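A cross-source join like the one described here might be sketched as follows, assuming a customers dimension replicated from PostgreSQL and an orders fact stream arriving via Kafka (all relation and column names are hypothetical):

```sql
-- customers: dimension table maintained in PostgreSQL, replicated in;
-- orders: fact stream spilling in through Kafka.
CREATE MATERIALIZED VIEW orders_enriched AS
SELECT o.order_id, o.amount, c.name, c.region
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```

Because both inputs have been translated onto Materialize-local timestamps, the maintained answer corresponds to roughly contemporaneous moments in the two sources.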
Akshay Manchale 00:18:37 You just said correctness was an important aspect of what you do with Materialize. So if you’re working with two different streams, maybe one is lagging behind, maybe the underlying infrastructure is just partitioned from your Materialize instance. Does that surface to the user in some way, or do you just provide an answer that’s somewhat correct, and also tell the user, yeah, we don’t know for sure what’s coming from the other topic?
Frank McSherry 00:19:02 That’s a great question. And this is one of the main pain points in stream processing systems: the tradeoff between availability and correctness. Basically, if the data are slow, what do you do? Do you hold back results, or do you show people sort of bogus results? The stream processing community, I think, has evolved to agree that you want correct results, because otherwise people don’t know how to use your tool properly. And Materialize will do the same, with a caveat, which is that, like I said, Materialize essentially re-timestamps the data as it arrives, into Materialize-local times, so that it is always able to provide a current view of what it’s received, but it will also surface that relationship, those bindings, essentially, between progress in the sources and timestamps that we’ve assigned.
Frank McSherry 00:19:45 So it will be able to tell you, as of now, what is the max offset that we’ve actually pulled out of Kafka? Maybe for some reason that isn’t what you want it to be; you happen to know that there’s a bunch more data ready to go. Or, what is the max transaction ID that we pulled out of PostgreSQL? You’re able to see that information. We’re not entirely sure what you will want to do at that point, though, and you might need to do a little bit of your own logic about, ooh, wait, I should wait. You know, if I want to provide an end-to-end, read-your-writes experience for someone putting data into Kafka, I might want to wait until I actually see the offset of the message I just wrote reflected in the output. But it’s a little tricky for Materialize to know exactly what you’re going to want ahead of time. So we give you the information, but don’t prescribe any behavior based on that.
Akshay Manchale 00:20:32 I’m missing something about understanding how Materialize understands the underlying data. So, you can connect to some Kafka topic, maybe, that has binary streams coming through. How do you understand what’s actually present in it? And how do you extract columns or type information in order to create a Materialized view?
Frank McSherry 00:20:52 It’s a great question. So, one of the things that’s helping us a lot here is that Confluent has the Confluent Schema Registry, which is a bit of the Kafka ecosystem that maintains associations between Kafka topics and Avro schemas that you should expect to be true of the binary payloads. And we’ll happily go and pull that information out of the schema registry so that you can automatically get a nice bunch of columns; basically we’ll map Avro into the sort of SQL-like relational model that’s going on. They don’t perfectly match, unfortunately, so we have sort of a superset of Avro’s and PostgreSQL’s data models, but we’ll use that information to properly turn these things into types that make sense to you. Otherwise, what you get is essentially one column that is a binary blob, and more than likely, step one for a lot of people is to convert that to text and use a CSV splitter on it to turn it into a bunch of different text columns, and then use SQL casting abilities to take the text into dates and times. So, we often see a first view that unpacks what we received as binary, as a blob of JSON maybe, using JSON operators to pop all these things open and turn that into a view that is now sensible with respect to properly typed columns and a well-defined schema, stuff like that. And then you build all of your logic based off of that larger view rather than off of the raw source.
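In the no-schema-registry case, that first “unpack” view might look roughly like this (source and field names are hypothetical; Materialize borrows PostgreSQL-style casts and jsonb operators, though exact function availability varies by version):

```sql
-- raw_events exposes a single binary column; convert it to text, treat
-- it as jsonb, and pull out properly typed fields:
CREATE VIEW events_typed AS
SELECT (data->>'customer_id')::int      AS customer_id,
       (data->>'amount')::numeric       AS amount,
       (data->>'created_at')::timestamp AS created_at
FROM (
  SELECT convert_from(payload, 'utf8')::jsonb AS data
  FROM raw_events
);
-- Downstream logic is then built on events_typed, not on the raw source.
```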
Akshay Manchale 00:22:15 Is that happening within Materialize when you’re trying to unpack the object in the absence of say a schema registry of sorts that describes the underlying data?
Frank McSherry 00:22:23 So what’ll happen is you write these views that say, okay, from binary, let me cast it to text. I’m going to treat it as JSON. I’m going to try to pick out the following fields. That’ll be a view. When you create that view, nothing actually happens in Materialize other than we write it down; we don’t start doing any work on account of that. We wait until you say something like, well, you know, okay, select this field as a key, join it with this other relation I have, do an aggregation, do some counting. We’ll then turn on Materialize’s machinery at that point, because we have to go and get you an answer now and start maintaining something. So, we’ll say, “Great, we’ve got to do these group bys, these joins; which columns do we actually need?”
Frank McSherry 00:23:02 We’ll push back as much of this logic as possible to the moment just after we pulled this out of Kafka, right? So we just got some bytes; step one is probably to cast it to JSON, because you can’t cunningly dive into the binary blob to find the fields that you need. But basically we will, as soon as possible, turn it into the fields that we need, throw away the fields we don’t need, and then flow it into the rest of the dataflow. This is one of the tricks for how we don’t use so much memory. You know, if you only need to do a group-by count on a certain number of columns, we’ll just keep those columns, just the distinct values of those columns. We’ll throw away all the other differentiating stuff that you might be wondering about: where is it? It evaporated into the ether; it’s still in Kafka, but it’s not in Materialize. So yeah, we’ll do that in Materialize as soon as possible when drawing the data into the system.
Akshay Manchale 00:23:48 About the underlying computing infrastructure that supports a Materialized view: if I have two Materialized views that are created on the same underlying topic, are you going to reuse that to compute outputs of those views? Or is it two separate compute pipelines for each of the views that you have on top of the underlying data?
Frank McSherry 00:24:09 That’s a great question. The thing that we’ve built at the moment does allow you to share, but requires you to be explicit about when you want the sharing. And the idea is that maybe we could build something on top of this that does the sharing for you automatically in some sort of clever way, but that’s not there yet. What happens under the covers is that each of these Materialized views that you’ve expressed, like, hey, please compute this for me and keep it up to date, we’re going to turn into a timely dataflow system underneath. And timely dataflows are sort of interesting in their architecture in that they allow sharing of state across dataflows. In particular, we’re going to share indexed representations of these collections across dataflows. So if you want to do a join, for example, between your customer relation and your orders relation by customer ID, and maybe, I don’t know, something else, you know, addresses with customers by customer ID, that customer collection indexed by customer ID can be used by both of those dataflows.
Frank McSherry 00:25:02 At the same time, we only need to maintain one copy of that, which saves a lot on memory and compute and communication and stuff like that. We don’t do this for you automatically because it introduces some dependencies. If we do it automatically, you might shut down one view and not all of it really shuts down, because some of it was needed to help out another view. We didn’t want to get ourselves into that situation. So, if you want the sharing at the moment, you need to, step one, create an index on customers in that example, and then step two, just issue queries. And we’ll pick up that shared index automatically at that point, but you have to have called it out ahead of time, as opposed to having us discover it as we just walk through your queries.
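The explicit two-step sharing recipe might be sketched like this (relation and index names hypothetical):

```sql
-- Step one: explicitly build the shared in-memory arrangement:
CREATE INDEX customers_by_id ON customers (customer_id);

-- Step two: just issue queries; dataflows joining on customer_id can now
-- reuse that one index rather than each building its own copy:
CREATE MATERIALIZED VIEW order_addresses AS
SELECT o.order_id, c.address
FROM orders o JOIN customers c ON o.customer_id = c.customer_id;

CREATE MATERIALIZED VIEW order_names AS
SELECT o.order_id, c.name
FROM orders o JOIN customers c ON o.customer_id = c.customer_id;
```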
Akshay Manchale 00:25:39 So you can create a Materialized view and you can create an index on those columns, and then you can issue a query that might use the index as opposed to the base table: classic SQL-like optimizations on top of the same data, maybe in different forms for better access, et cetera. Is that the idea behind creating an index?
Frank McSherry 00:26:00 Yeah, that’s a good point. Actually, to be totally honest, creating a Materialized view and creating an index are the same thing, it turns out, in Materialize. The Materialized view that we create is an indexed representation of the data, where if you just say create Materialized view, we’ll pick the columns to index on. Sometimes there are really good unique keys that we can use to index on, and we’ll use those. And sometimes there aren’t, and we’ll just essentially have a pile of data that is indexed on all of the columns of your data. But it’s really the same thing that’s going on: it’s us building a dataflow whose output is an indexed representation of the collection of data, a representation that is not only a big pile of the correct data, but also arranged in a form that allows us random access by whatever the key of the index is.
Frank McSherry 00:26:41 And you’re absolutely right, that’s very helpful for subsequent work. Like, if you want to do a join using those columns as the key — amazing, we’ll literally just use that in-memory asset for the join. We won’t need to allocate any more memory. If you want to do a select where you ask for some values equal to that key, that’ll come back in a millisecond or something; it will literally just do random access into that maintained arrangement and get you answers back. So, it’s the same intuition as an index. Why do you build an index? Both so that you yourself have fast access to that data, but also so that subsequent queries that you do — subsequent joins that can use the index — will be more efficient. Very much the same intuition as Materialize has at the moment. And I think it’s not a concept that a lot of the other stream processors have yet — hopefully that’s changing — but I think it’s a real point of distinction between them: that you can do this up-front work of index construction and expect a payoff in terms of performance and efficiency with the rest of your SQL workloads.
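The sharing idea Frank describes — build one arrangement, reuse it for point lookups and joins — can be sketched in plain Python. This is illustrative only (Materialize maintains arrangements incrementally inside Rust dataflows; the table and column names here are made up):

```python
# Illustrative sketch (not Materialize internals): an "arrangement" is a
# collection indexed by a key, shared by every query that needs that key.
from collections import defaultdict

def build_arrangement(rows, key):
    """Index rows by a key column once; reuse the structure everywhere."""
    arranged = defaultdict(list)
    for row in rows:
        arranged[row[key]].append(row)
    return arranged

customers = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
]

# Built once, like creating an index on customers(id).
by_id = build_arrangement(customers, "id")

# Query 1: a point lookup ("SELECT * FROM customers WHERE id = 2") is
# random access into the maintained arrangement.
lookup = by_id[2]

# Query 2: a join on the same key reuses the SAME in-memory arrangement
# instead of building a second copy of the index.
join = [(o, c) for o in orders for c in by_id[o["customer_id"]]]
```

The point of the sketch is only that both queries touch one shared structure, which is why creating the index up front pays off across a whole SQL workload.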
Akshay Manchale 00:27:36 That’s great. In SQL, sometimes you, as a user, don’t necessarily know what the best access pattern is for the underlying data, right? So maybe you’d run a query and you’ll say explain, and it gives you a query plan, and then you’ll realize, oh wait, I can actually do this much better if I just create an index on so-and-so columns. Is that kind of feedback available in Materialize? Because your data access pattern is not necessarily data at rest, right? It’s streaming data, so it looks different. Do you have that kind of feedback that goes back to the user, saying that I should actually create an index in order to get answers faster, or to understand why something is really slow?
Frank McSherry 00:28:11 I can tell you what we have at the moment, and where I’d love us to be is 20 years in the future from now. But at the moment you can do explain queries — explain plan for a query. We’ve got three different plans that you can check out, in terms of the pipeline from type checking down to optimization, down to the physical plan. What we don’t really have yet, I would say, is a good assistant — the equivalent of Clippy for data flow plans — to say, it looks like you’re using the same arrangement five times here; maybe you should create an index. Materialize does surface a lot of its exhaust as introspection data that you can then look at, and we will actually keep track of how many times you are arranging various bits of data in various ways.
Frank McSherry 00:28:53 So the interested person could go and look and say, oh, that’s weird, I’m making four copies of this particular index when instead I should be using it four times. They’ve got some homework to do at that point to figure out what that index is, but it’s absolutely the sort of thing that a fully featured product would want to have: help me make this query faster — have it look at your workload and say, ah, you know, we could take these five queries you have, jointly optimize them, and do something better. In database land, this is multi-query optimization, or something named for a thing like it anyhow. And it’s hard. Unfortunately, there’s not just an easy, oh yeah, this is a solved problem, just do it this way. It’s subtle, and you’re never always sure that you’re doing the right thing. What Materialize is trying to do is bring streaming performance to even more people, and any steps we can take to give it even better performance for people who aren’t nearly as excited about diving in and understanding how data flows work — people who just want a button that says think more and go faster — would be great. I mean, I’m all for that.
Akshay Manchale 00:30:44 Let’s talk a little bit about the correctness aspect of it, because that’s one of the key points for Materialize, right? You write a query and you’re getting correct answers, or you’re getting consistent views. Now, if I were to not use Materialize, maybe I’m going to use some hand-written code or application-level logic to process streaming data and compute stuff. What are the pitfalls in doing that? Do you have an example where you can say that certain things are never going to converge to an answer? I was particularly interested in something that I read on the website, where “never consistent” was the term that was used for when you try and solve it yourself. So, can you maybe give an example of what the pitfall is on the consistency side, and why Materialize gets it correct?
Frank McSherry 00:31:25 There’s a pile of pitfalls, absolutely. I’ll try to give a few examples. Just to call it out at the highest level, for those who are technically aware: cache invalidation is at the heart of all of these problems. You hold on to some data that was correct at one point, and you’re getting ready to use it again, and you’re not sure if it’s still correct. This is, in essence, the thing that the core of Materialize solves for you. It invalidates all of your caches for you, to make sure that you’re always being consistent, and you don’t have to worry about that question — is this really actually current for whatever I’m about to use it for? — the way you do when you’re rolling your own stuff. As for this “never consistent” thing: one way to think about it is that inconsistency very rarely composes properly.
Frank McSherry 00:32:05 So, if I have two sources of data and they’re both, you know, eventually consistent — let’s say they’ll each eventually get to the right answer, just not necessarily at the same time — you can get a whole bunch of really hilarious bits of behavior that you wouldn’t have thought, or at least I didn’t think, possible. An example I’ve worked through before: you’ve got some query where you’re trying to find the argmax — the row in some relation that has the maximum value of something. And often the way you write this in SQL is a view, or a query, that’s going to pick out the maximum value, and then a restriction that says, all right, now with that maximum value, pick out all of the rows from my input that have exactly that value.
Frank McSherry 00:32:46 And what’s sort of interesting here is that, depending on how promptly various things update, this may produce not just an incorrect answer, not just a stale version of the answer, but nothing, ever. This is going to sound silly, but it’s possible that your max gets updated faster than your base table does. And that kind of makes sense: the max is a lot smaller, potentially easier to maintain, than your base table. So, if the max is continually running ahead of what you’ve actually updated in your base table, and you’re continually doing these lookups saying, hey, find me the record that has this max value — it’s never there. And by the time you’ve put that record into the base table, the max has changed; you want a different thing now. So instead of what people might’ve thought they were getting, which is an eventually consistent view of their query from eventually consistent parts, they end up getting a never consistent view, on account of the fact that these weaker forms of consistency don’t compose the way that you might hope they would.
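This failure mode is easy to demonstrate with a toy simulation — not any particular system, just a max view that is always current while the base table lags one update behind:

```python
# Toy simulation of the "never consistent" argmax: the max view updates
# before the base table, so the lookup "rows WHERE value = max" comes up
# empty at every step, even though both parts are eventually consistent.
def eventually_consistent_argmax(updates, base_lag=1):
    results = []
    for i in range(1, len(updates) + 1):
        max_view = max(updates[:i])      # the max view is always current
        base = updates[:i - base_lag]    # the base table lags one update behind
        results.append([v for v in base if v == max_view])
    return results

# With a strictly increasing input, the argmax lookup never finds its row:
# by the time a value lands in the base table, the max has already moved on.
answers = eventually_consistent_argmax([1, 2, 3, 4, 5])
```

Each intermediate answer is the empty list: not wrong-but-stale, but never consistent, which is exactly the composition failure being described.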
Akshay Manchale 00:33:38 And if you have multiple sources of data, then it becomes all the more challenging to make sense of it?
Frank McSherry 00:33:43 Absolutely. I mean, to be totally honest and fair, if you have multiple sources of data, you probably have better-managed expectations about what consistency and correctness are — you might not have expected things to be correct. But it’s especially surprising when you have one source of data, and just because there are two different paths that the data take through your query, you start to get weird results that correspond to none of the inputs that you had. But yeah, it’s all a mess. And the more we can do to make sure that you, the user, don’t spend your time trying to debug consistency issues, the better, right? So, we’re going to try to give you these always consistent views. They always correspond to the correct answer for some state of your database that it transitioned through.
Frank McSherry 00:34:24 And for multi-input things, it’ll always correspond to a consistent moment in each of your inputs — exactly the correct answer for that. So, if you see a result that comes out of Materialize, it actually happened at some point. And if it’s wrong, something’s wrong. For me, at least — to be totally honest as a technologist — this is amazing, because it means that debugging is so much easier, right? If you see a wrong answer, something’s wrong and you’ve got to go fix it. Whereas in an eventually consistent system, when you see a wrong answer, you’re like, well, let’s give it five minutes. You never really know if it’s just late, or if there is actually a bug that is costing you money or time or something like that.
Akshay Manchale 00:34:59 I think that becomes especially hard when you’re looking at one-off queries. You have to make sure that what you’ve written with application code, for example, is going to be correct and consistent, as opposed to relying on a database or a system like this, where there are certain correctness guarantees that you can rely on based on what you ask.
Frank McSherry 00:35:17 So a lot of people reach for stream processing systems because they want to react quickly, right? Like, oh yeah, we need to have low latency because something important has to happen promptly. But then you have an eventually consistent system, and it comes back and tells you, all right, I got the answer for you: it’s seven. Oh, that’s amazing — seven. I should go sell all my stocks now, or something, I don’t know. And you say, you sure it’s seven? Well, it’s seven right now; it might change in a minute. Wait, hold on — no, no. So, what is the actual time to confident action? That’s a question you could often ask about these streaming systems. They’ll give you an answer real quick — it’s super easy to write an eventually consistent system with low latency.
Frank McSherry 00:35:55 It says zero, and then when you get the right answer, it tells you what the right answer was. And you’re like, well, sorry, I said zero first, and we know that I was a liar, so you should have waited. But actually getting the user to the moment where they can confidently transact — where they can take whatever action they need to take, whether that’s charging someone’s credit card or sending them an email, things they can’t quite as easily take back, or that are expensive to take back — that’s the big difference between these strongly consistent systems and the only eventually consistent systems.
Akshay Manchale 00:36:24 Yeah. And for sure, the ease with which you can declare these views certainly seems like a huge plus to me. As a system, what does Materialize look like? How do you deploy it? Is it a single binary? Can you describe what that is?
Frank McSherry 00:36:39 There are two different directions that things go in. There is a single binary that you can grab — Materialize is source available, so you can go grab it and use it. It’s built on the open-source timely data flow and differential data flow stuff. A very common way to try this out is you grab it and put it on your laptop. It’s one binary; it doesn’t require a stack of associated distributed-systems things in place to run. If you want to read out of Kafka, you have to have Kafka running somewhere, but you can just turn on Materialize with a single binary, psql into it — shell into it using your favorite pgwire client — and just start doing stuff at that point if you like. If you just want to try it out, read some local files or do some inserts and play around with it like that.
Frank McSherry 00:37:16 The direction that we’re headed, though, to be totally honest, is more of this cloud-based setting. A lot of people are very excited about not having to manage this on their own. A single binary is neat, but what folks actually want is a bit more of an elastic compute fabric and an elastic storage fabric underneath all of this. And there are limitations to how far you get with just one binary. The compute scales pretty well, to be totally candid, but it has limits, and people appreciate that. Like, yes, well, if I have several terabytes of data and you’re telling me you could put this in memory, I’m going to need a few more computers. Bringing people to a product where we can switch the implementation in the background and turn on 16 machines instead of just one is a bit more where our energy is at the moment. But we’re really committed to keeping the single-binary experience, so that you can grab Materialize and see what it’s like — you’re within license to do whatever you want with it, and it’s helpful and useful for people. But it’s also just good business, I suppose. You get people interested — this is amazing, I’d like more of it — and absolutely, if you want more of it, we’ll set you up with that. But we want people to be delighted with the single-machine version as well.
Akshay Manchale 00:38:17 Yeah, that makes sense. I mean, I don’t want to spin up a hundred machines to just try something out, just experiment and play with it. But on the other hand, you mentioned scaling compute, and when you’re operating on streaming data, you could have millions, billions of events flowing through different topics. Depending on the view that you write, what is the storage footprint that you have to maintain? Do you have to maintain a copy of everything that has happened and keep track of it like a data warehouse — maybe aggregate it and keep it in some form that you can use to serve queries — or do I get the sense that this is all done on the fly when you ask for the first time? So, what sort of data do you have to hold on to, in comparison to the underlying topic or other sources of data that you connect to?
Frank McSherry 00:39:05 The answer to this depends very much on the word you used, which is “have” — what do we have to do? And I can tell you the answer to both what we have to do and what we happen to do at the moment. So, at the moment, in the early days of Materialize, the intent was very much: let’s let people bring their own source of truth. You’ve got your data in Kafka. You’re going to be annoyed if the first thing we do is make a second copy of your data and keep it for you. So, if your data are in Kafka and you’ve got some key-based compaction going on, we’re more than happy to just leave it in Kafka for you, not make a second copy, and pull the data back in the second time you want to use it. So, if you have three different queries, and then you come up with a fourth one that you want to turn on over the same data, we’ll pull the data again from Kafka for you.
Frank McSherry 00:39:46 And this is meant to be friendly to people who don’t want to pay lots and lots of money for additional copies of Kafka topics and stuff like that. We’re definitely moving in the direction of bringing some of our own persistence into play as well, for a few reasons. One of them is that sometimes you have to do more than just reread someone’s Kafka topic. If it’s an append-only topic and there’s no compaction going on, we need to tighten up the representation there. There’s also the fact that when people sit down and type inserts into tables in Materialize, they expect those things to be there when they restart, so we need to have a persistence story for that as well. The main thing, though, that drives what we have to do is: how quickly can we get someone to agree that they will always do certain transformations to their data, right?
Frank McSherry 00:40:31 So if they create a table and just say, hey, it’s a table, we’ve got to write everything down, because we don’t know if the next thing they’re going to do is select star from that table — we’re out of luck in that case. What we’d like to get at — it’s a little awkward in SQL, unfortunately — is allowing people to specify sources, and then transformations on top of those sources, where they promise: hey, you know, I don’t need to see the raw data anymore; I only want to look at the result of the transformation. A classic one is: I’ve got some append-only data, but I only want to see the last hour’s worth of records, so feel free to retire data more than an hour old. It’s a little tricky to express this in SQL at the moment — to express the fact that you should not be able to look at the original source of data.
Frank McSherry 00:41:08 As soon as you create it as a foreign table, it’s there; someone can select star from it. And if we want to give them that experience, well, it requires a bit more cunning to figure out what we should persist and what we should default back to rereading the data from. It’s an active area for us, I would say: figuring out how little we can scribble down automatically, without explicit hints from you or without having you explicitly materialize things. Sorry, I didn’t say, but in Materialize you can sink your results out to external storage as well. And of course, you can always write views that say, here’s the summary of what I need to know, let me write that back out, and I’ll read that into another view and actually do my downstream analytics off of that more compact representation — so that on restart, I can come back up from that compact view. You can do a bunch of these things manually on your own, but that’s a bit more painful, and we’d love to make it a bit more smooth and elegant for you automatically.
Akshay Manchale 00:42:01 When it comes to the retention of data, suppose you have two different sources of data, where one of them has data going as far back as 30 days, and another has data going as far back as two hours. And you’re trying to write some query that joins these two sources of data together. Can you make sense of that? Do you know that you only have at most two hours’ worth of data that’s actually collectively consistent, and then you have extra data that you can’t really make sense of because you’re trying to join those two sources?
Frank McSherry 00:42:30 So we can contrast this, I guess, with what other systems might currently have you do. In a lot of other systems, you must explicitly construct a window of data that you want to look at — maybe two hours wide, or one hour, because you know it goes back two hours. And then when you join things, life is complicated if the two datasets don’t have the same windowing properties — if they’re different widths. A good classic one is: you’ve got some facts table coming in of things that happened, and you want to window that, because you don’t really care about sales from 10 years ago. But your customer relation — that’s not windowed. You don’t delete customers after an hour, right? They’ve been around as long as they’ve been around. You’d love to join those two things together, and Materialize is super happy to do this for you.
Frank McSherry 00:43:10 We do not oblige you to put windows into your query. Windows, essentially, are a change data capture pattern, right? If you want to have a one-hour-wide window on your data, then after you put every record in, one hour later you should delete it. That’s just a change that the data undergoes; it’s totally fine. And with that view on things, you can take a collection of data that is windowed — one hour after any record gets introduced, it gets retracted — and join that with a pile of data that’s never retracted, or that is experiencing different changes: only when a customer updates their information does that data change. These are just two collections that change, and there’s always a corresponding correct answer when you go into a join and try to figure out, where should we ship this package to? You don’t want to miss the fact that the customer’s address has been the same for the past month because they fell out of the window or something like that. That’s crazy; no one wants that.
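The windows-as-change-data-capture idea can be sketched under simplifying assumptions (integer timestamps in seconds, an in-memory list of changes; record names are made up):

```python
# Sketch: a time window expressed as change data capture. Inserting a fact
# schedules its retraction one hour later. A non-windowed collection (like
# customers) simply has no scheduled retractions -- same machinery, no window.
WINDOW = 3600  # one hour, in seconds

def windowed_changes(facts):
    """Turn (time, record) inserts into (time, record, +1/-1) changes."""
    changes = []
    for t, rec in facts:
        changes.append((t, rec, +1))           # record arrives
        changes.append((t + WINDOW, rec, -1))  # record retires an hour later
    return sorted(changes, key=lambda c: c[0])

def collection_at(changes, now):
    """The consistent contents of the collection at a given time."""
    counts = {}
    for t, rec, diff in changes:
        if t <= now:
            counts[rec] = counts.get(rec, 0) + diff
    return {rec for rec, n in counts.items() if n > 0}

sales = windowed_changes([(0, "sale-1"), (1800, "sale-2")])
# At t=2000 both sales are inside their windows; at t=4000 sale-1 has been
# retracted; a join against an unwindowed customers collection at any `now`
# always has a single corresponding correct answer.
```

Because both windowed and unwindowed data are just streams of insertions and retractions, a join between them has a well-defined correct answer at every time, which is the point Frank is making.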
Akshay Manchale 00:44:03 Definitely don’t want that kind of complexity showing up in how you write your SQL, too. Let’s talk a little bit about the data governance aspect. It’s a big topic. You have lots of regions that have different rules about the data rights that the consumer might have. So, I can exercise my right to say, I just want to be forgotten; I want to delete all traces of my data. Your data might be in Kafka, and now Materialize has kind of taken that data and transformed it into aggregates or other information. How do you handle the governance aspect when it comes to data deletions, maybe, or just audits and things like that?
Frank McSherry 00:44:42 To be totally clear, we don’t solve any of these problems for anyone. This is a serious sort of thing, and using Materialize does not magically absolve you of any of your responsibilities or anything like that. Though Materialize is nicely positioned to do something well here, for two reasons. One of them is that because it’s a declarative system with SQL behind it and stuff like this, as opposed to hand-rolled application code or tools, we’re in a really good position to look at the dependencies between various bits of data. If you want to know, where did this data come from? Was this an inappropriate use of certain data? — that type of thing, the information is, I think, very clear there. There’s really good debuggability. If you ask, why did I see this record? it’s not too hard to reason back and say, great, let’s write the SQL query that figures out which records contributed to this.
Frank McSherry 00:45:24 Materialize specifically also does a really nice thing, which is that because we are giving you always correct answers, as soon as you retract an input — if you go into your user profile somewhere and you update something, or you delete yourself, or you click, you know, hide from marketing or something like that — as soon as that information lands in Materialize, the correct answer has changed. And we will, absolutely, no joke, update the correct answer to be as if whatever your current settings are had been that way from the beginning. And this is very different. Sorry — I moonlighted as a privacy person in a past life, I suppose, and there are a lot of really interesting governance problems there, because a lot of machine learning models, for example, do a great job of just remembering your data. Like, you deleted it, but they remember — you were a great training example.
Frank McSherry 00:46:14 And so they basically wrote down your data. It’s tricky in some of these applications to figure out: am I really gone, or are there ghosts of my data still sort of echoing there? And Materialize is very clear about this. As soon as the data change, the output answers change. There’s a little bit more work to do on questions like, are you actually purged from various logs, various in-memory structures, stuff like that. But in terms of us serving up answers to users that still reflect invalidated data, the answer is going to be no, which is a really nice property, again, of strong consistency.
Akshay Manchale 00:46:47 Let’s talk a little bit about durability. You mentioned it’s currently like a single-system kind of deployment. So what does recovery look like if you were to nuke the machine and restart, and you have a couple of Materialized views? How do you recover that? Do you have to recompute?
Frank McSherry 00:47:04 Generally, you’re going to have to recompute. We’ve got some in-progress work on reducing this — on capturing source data as they come in and keeping them in more compact representations. But absolutely, at the moment, in the single-binary experience, if you’ve read in a terabyte of data from Kafka and then turn everything off and on again, you’re going to read a terabyte of data again. You can do it doing less work, in the sense that when you read that data back in, you no longer care about the historical distinctions. So, let’s say you’ve been watching your terabyte for a month and lots of things changed — you did a lot of work over that time. If you read it in at the end of the month, Materialize is at least bright enough to say, all right, all of the changes that these data reflect, they’re all happening at the same time.
Frank McSherry 00:47:45 So if any of them happen to cancel, we’ll just get rid of them. There are some other knobs that you can play with too. These are more pressure-release valves than they are anything else, but for any of these sources, you can say, start Kafka at such-and-such an offset. We’ve got folks who know that they’re going to do a one-hour window; they just recreate it from the source, saying start from two hours ago, and even if they have a terabyte going back in time, we’ll figure out the right offset that corresponds to the timestamp from two hours ago and start each of the Kafka readers at the right point. That requires a little bit of help from the user, to say it’s okay not to reread the data, because it’s something that they know to be true about it.
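The “start from two hours ago” step amounts to a timestamp-to-offset lookup. Real Kafka clients expose this as an offsets-for-times call against the broker; the list-based search below is purely a conceptual sketch (the timestamps are hypothetical):

```python
# Conceptual sketch: find the first Kafka offset whose timestamp is at or
# after the requested start time. Because offsets are appended in (roughly)
# timestamp order, this is a binary search over the per-offset timestamps.
import bisect

def offset_for_timestamp(message_timestamps, start_ts):
    """message_timestamps[i] is the timestamp of the message at offset i."""
    return bisect.bisect_left(message_timestamps, start_ts)

timestamps = [100, 200, 300, 400, 500]        # hypothetical per-offset timestamps
offset = offset_for_timestamp(timestamps, 250)  # first offset at/after ts 250
```

Starting every partition's reader at the offset this lookup returns is what lets the user skip the terabyte of history they have promised they don't need.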
Akshay Manchale 00:48:20 Can you replicate data from Materialize — what you actually build — into another system, or push that out to downstream systems in a different way?
Frank McSherry 00:48:30 Hopefully I don’t misspeak about exactly what we do at the moment, but all of the Materialized views that we produce, and the sinks that we write to, are getting very clear instructions about the changes the data undergo. We can output back into Debezium format, for example, which could then be presented to someone else who’s prepared to go and consume that. And in principle, in some cases, we can put these out with these nice, strongly consistent timestamps, so that you could pull it in somewhere else and basically keep this chain of consistency going, where your downstream system responds to these nice atomic transitions that correspond exactly to input data transitions as well. So we definitely can. I’ve got to say, a lot of the compute infrastructure in something like Materialize has sort of been there from the early days, but there’s a lot of adapter work around it. A lot of people are like, ah, you know, I’m using a different format, or can you do this in ORC instead of Parquet? Or can you push it out to Google Pub/Sub or Azure Event Hubs? An unlimited number of requests, and the answer is yes — with the little caveat of, here is the list of actually supported options.
Akshay Manchale 00:49:32 Or you could just write an adapter kind of thing, and then you can connect to whatever.
Frank McSherry 00:49:36 Yeah, that’s a great way to do it if you want to write your own thing. Because when you’re logged into a SQL connection, you can tail any view in the system: it will give you, first, a snapshot at a particular time, and then a strongly consistent change stream from that snapshot going forward. And your application logic can do whatever it needs to do with this — commit it to a database, say. This is you writing a little bit of code to do it, but we’re more than happy to help you out with that, in that sense.
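The consuming side of that snapshot-plus-change-stream pattern might look roughly like this in application code. This is a simplified sketch — the (timestamp, record, diff) shape and the record names are illustrative assumptions, not a specific Materialize wire format:

```python
# Sketch of an application consuming a snapshot plus a consistent change
# stream: start from the snapshot, then apply each (record, +1/-1) change
# in timestamp order, so local state tracks the view exactly.
def follow(snapshot, changes):
    state = dict(snapshot)  # record -> multiplicity
    for _ts, rec, diff in sorted(changes, key=lambda c: c[0]):
        state[rec] = state.get(rec, 0) + diff
        if state[rec] == 0:
            del state[rec]  # fully retracted records disappear
    return state

snapshot = {"alice": 1, "bob": 1}                 # initial contents of the view
changes = [(1, "bob", -1), (2, "carol", +1)]      # subsequent consistent changes
final = follow(snapshot, changes)
```

Because every change carries an explicit diff, the application can commit the result to another database after each timestamp and stay in lockstep with the view.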
Akshay Manchale 00:50:02 Let’s talk about some other use cases. Do you support something like tailing a log and then trying to extract certain things and building a query out of it? Which is not very easy to do right now, but can I just point you to a file that you might be able to ingest, as long as I can also describe what the format of the lines is, or something like that?
Frank McSherry 00:50:21 Yes. For a file, absolutely. I’d actually have to check what we support in terms of log rotation — that’s the harder problem. If you point us at a file, we will keep reading the file, and every time we get notified that it changed, we’ll go back and read some more. The idiom that a lot of people use, the more DevOps-y one, is you’ve got a place that the logs are going to go, and you make sure to cut the logs every hour or day or whatever, and rotate them, so that you’re not building one massive file. And at that point, I don’t know that we actually have — I should check — built-in support for sniffing a directory and watching for the arrival of new files, where we then seal the file we’re currently reading and pivot over, and stuff like that.
Frank McSherry 00:50:58 It seems like a very tasteful and not fundamentally challenging thing to do. Really, all the work goes into the bit of logic that is: what do I know about the operating system and what your plans are for the log rotation? All of the rest of the compute infrastructure — the SQL, the timely data flow, the incremental view maintenance, all that stuff — stays the same. It’s more a matter of getting some folks who are savvy with these patterns to sit down, type some code for a week or two, and figure out: how do I watch for new files in a directory, and what’s the naming idiom that I should use?
Akshay Manchale 00:51:33 I guess you could always go a very roundabout way and just push that into a Kafka topic and then consume it off of that. Then you get a continuous stream, and you don’t care about what the sources for the topic are.
Frank McSherry 00:51:43 Yeah, there are a lot of things that you definitely could do. And I have to restrain myself every time, because I would say something like, oh, you could just push it into Kafka — and then immediately everyone says, no, you can’t do that, and I don’t want to be too casual. But you’re absolutely right. If you have the information there, you could also have just a relatively small script that takes that information, watches it itself, and inserts it using a pgwire connection into Materialize. And then we’ll go into our own persistence representation, which is both good and bad, depending — maybe you were just hoping those files would be the only thing, but at least it works. We’ve seen a lot of really cool use cases where people have shown up and been more creative than I’ve been, for sure. They’ve put together a thing and you’re like, oh, that’s not going to work — oh, it works. Wait, how did you — and then they explain, oh, you know, I just had something watching here, and I’m writing to a FIFO here. I’m very impressed by the creativity and the new things that people can do with Materialize. It’s cool seeing that with a tool that opens up so many different new modes of working with data.
Akshay Manchale 00:52:44 Yeah, it’s always nice to build systems that you can compose with other systems to get what you want. I want to touch on performance for a bit. Compared to writing some application-level code yourself — maybe it’s not correct, but you write something to give you an output that is an aggregate grouped by something — versus doing the same thing in Materialize, what are the trade-offs? Do you have performance trade-offs because of the correctness aspects that you guarantee? Do you have any comments on that?
Frank McSherry 00:53:17 Yeah, there are definitely a bunch of trade-offs, of different flavors. Let me point out a few of the good things first, and I’ll see if I can remember any bad things afterwards. Because queries get expressed in SQL, they’re generally data parallel, which means Materialize is going to be pretty good at partitioning the work across multiple worker threads — potentially machines, if you’re using those options. And so for your query, which you might’ve just thought of as, okay, I’m going to do a group-by count, we will do these standard things of sharding the data out, doing aggregation, shuffling it, and taking as much advantage as we can of all of the cores that you’ve given us. The underlying data flow system has, performance-wise, the appealing property that it’s very clear internally about when things change and when we are certain that things have not changed, and it’s all event-based, so you learn as soon as the system knows that an answer is correct. You don’t have to roll that by hand or do some polling or any other funny business — that’s the thing that’s often very tricky to get right.
Frank McSherry 00:54:11 If you’re going to sit down and just hand-roll some code, people will often say, I’ll jam it in the database and I’ll ask the database every so often. The trade-offs in the other direction, to be honest, are mostly: if you happen to know something about your use case or your data that we don’t know, it’s often going to be a little better for you to implement things yourself. An example that was true in the early days of Materialize (we’ve since fixed it) is, if you happen to know that you’re maintaining a monotonic aggregate, something like max that only goes up the more data you see, you don’t need to worry about keeping the full collection of data around. Materialize, in its early days, if it was maintaining a max, worried about the fact that you might delete all of the data except for one record, and we’d need to find that one record for you, because that’s the correct answer now.
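[A minimal illustration of the distinction, not Materialize’s code: with deletions allowed, a MAX must retain every value (here in a heap with lazy deletion), because deleting the current max can make any older record the answer; on a provably append-only stream, a single running value suffices.]

```python
import heapq


class MaxWithDeletes:
    """MAX under insertions *and* deletions: all values are retained."""

    def __init__(self):
        self.heap = []     # max-heap via negated values
        self.deleted = {}  # value -> count of pending deletions

    def insert(self, v):
        heapq.heappush(self.heap, -v)

    def delete(self, v):
        self.deleted[v] = self.deleted.get(v, 0) + 1

    def max(self):
        # Discard any heap entries whose deletion is pending.
        while self.heap and self.deleted.get(-self.heap[0], 0) > 0:
            self.deleted[-self.heap[0]] -= 1
            heapq.heappop(self.heap)
        return -self.heap[0] if self.heap else None


class MaxAppendOnly:
    """MAX on an append-only stream: one number is all the state needed."""

    def __init__(self):
        self.current = None

    def insert(self, v):
        if self.current is None or v > self.current:
            self.current = v

    def max(self):
        return self.current
```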
Frank McSherry 00:54:52 We’ve since gotten smarter and have different implementations: once we can prove that a stream is append-only, we’ll use the different implementation. But that’s the type of thing. Another example: if you want to maintain the median incrementally, there’s a cute, really easy way to do this with an algorithm that we’re never going to implement for you, where you maintain two priority queues and are continually rebalancing them. It’s a cute programming-challenge type of question, but we’re not going to do this for you automatically. So, if you need to maintain the median or some other decile or something like that, rolling that yourself is almost certainly going to be a lot better.
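[The two-priority-queue trick Frank alludes to, as a small sketch: `lo` is a max-heap (values negated) holding the smaller half of the data, `hi` a min-heap holding the larger half, rebalanced after each insert so their sizes differ by at most one; the median is then read off the heap tops.]

```python
import heapq


class RunningMedian:
    """Incrementally maintained median over an append-only stream."""

    def __init__(self):
        self.lo = []  # max-heap of the smaller half (values negated)
        self.hi = []  # min-heap of the larger half

    def insert(self, v):
        if self.lo and v > -self.lo[0]:
            heapq.heappush(self.hi, v)
        else:
            heapq.heappush(self.lo, -v)
        # Rebalance so len(lo) == len(hi) or len(lo) == len(hi) + 1.
        if len(self.lo) > len(self.hi) + 1:
            heapq.heappush(self.hi, -heapq.heappop(self.lo))
        elif len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if not self.lo:
            return None
        if len(self.lo) == len(self.hi):
            return (-self.lo[0] + self.hi[0]) / 2
        return -self.lo[0]
```

[Each insert costs O(log n), versus recomputing the median from scratch; handling deletions or arbitrary deciles takes more machinery, which is why rolling this yourself for your specific case can beat a general-purpose system.]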
Akshay Manchale 00:55:25 I want to start wrapping things up with one last question. Where is Materialize going? What’s in the near future? What future do you see for the product and its users?
Frank McSherry 00:55:36 Yeah. So, this has a really easy answer, fortunately, because several other engineers at Materialize and I are typing furiously on it right now. The work that we’re doing now is transitioning from the single binary to a cloud-based solution that has an arbitrarily scalable storage and compute plane, so that folks can, while still having the experience of a single instance that they’re sitting in and looking around, spin up essentially arbitrarily many resources to maintain their views for them, so they’re not contending for resources. I mean, they do have to worry that the resources being used are going to cost money, but they don’t have to worry about the computer saying, no, I can’t do that. The intended experience, again, is to have folks show up and have the appearance or the feel of an arbitrarily scalable version of Materialize that, you know, costs a bit more if you try to ingest more or do more compute. But this is often what people will say: absolutely, I intend to pay you for access to these features; I don’t want you to tell me no, is the main thing that folks ask for. And that’s the direction we’re heading with this rearchitecting: to make sure that it is not just enterprise friendly, but essentially use-case-expansion friendly. As you think of more cool things to do with Materialize, we absolutely want you to be able to use Materialize for them.
Akshay Manchale 00:56:49 Yeah. That’s super exciting. Well, with that, I’d like to wrap up. Frank, thank you so much for coming on the show and talking about Materialize.
Frank McSherry 00:56:56 It’s my pleasure. I appreciate you having me. It’s been really cool getting thoughtful questions that really start to tease out some of the important distinctions between these things.
Akshay Manchale 00:57:03 Yeah. Thanks again. This is Akshay Manchale for Software Engineering Radio. Thank you for listening.
[End of Audio]