
SE Radio 605: Yingjun Wu on Streaming Databases

Yingjun Wu, founder of RisingWave Labs and previously a software engineer at Amazon Web Services and researcher at IBM Almaden Research Center, speaks with SE Radio host Brijesh Ammanath about streaming databases. After considering the benefits and unique challenges, they delve into the architecture and design patterns of streaming databases, as well as their evolution and security considerations. Yingjun also talks about the future of streaming databases, including the potential impact that Amazon S3 Express One Zone will have on the streaming landscape, and how unified batch and stream processing might evolve in the database world.




Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Brijesh Ammanath 00:00:18 Welcome to Software Engineering Radio. I’m your host, Brijesh Ammanath. Today’s session is about streaming databases: understanding what they are, the benefits of using them, and the challenges unique to them. Our guest today is Yingjun Wu, who is the founder of RisingWave Labs, building the product RisingWave, a distributed SQL database for stream processing. Prior to RisingWave Labs, Yingjun was a software engineer on the Redshift team at Amazon Web Services and a researcher in the database group at the IBM Almaden Research Center. Yingjun, welcome to the show.

Yingjun Wu 00:00:50 Hi, hello everyone. Thanks for having me here. I’m Yingjun; yes, I’m the founder of RisingWave Labs, and I’m a database guy and a stream processing guy. So really excited to join the show.

Brijesh Ammanath 00:01:02 Let’s start with the fundamentals. Can you provide a high-level overview of what a streaming database is?

Yingjun Wu 00:01:08 Well, a streaming database is about stream processing, but it’s not just a stream processing engine; it is a database system. People may have heard of, or probably already used, stream processing systems like Apache Flink or Spark Streaming, right? But RisingWave basically provides people with the database experience for processing streaming data. It allows people to process streaming data inside of a database. This is the fundamental difference from stream processing engines, because a streaming database actually stores the data, which means that we can persist the input and output of the stream processing jobs and then do data serving inside of a database system. And RisingWave specifically, as a streaming database, is Postgres-compatible, which means that if you have a system that can talk to Postgres, that system is highly likely to be able to talk to RisingWave. Which means that RisingWave can connect to a lot of systems in the Postgres ecosystem.

Brijesh Ammanath 00:02:20 Right. And what is the problem that we’re trying to solve by using streaming databases?

Yingjun Wu 00:02:25 Well, streaming databases definitely try to solve the stream processing problem. So what’s the problem with stream processing? Well, there are already a lot of stream processing systems, right? Like Apache Flink and Spark Streaming. But I think the key problem in today’s world is that these types of systems are, first, difficult to use, and second, not very cost efficient. So let’s talk about ease of use first. When we try to use a big data system like Spark or Flink, well, we need to think about how to write Java, right? And people may argue that these systems have already provided a Python API, probably a SQL API, so we can directly write SQL code or Python code. But fundamentally you actually need to manage a JVM-based system. And more importantly, you actually need to understand the fundamental concepts inside these stream processing systems.

Yingjun Wu 00:03:24 Like, you need to understand what checkpointing is, how we can do fault tolerance, and how we can do scaling, right? You need to know how the system works internally. But in RisingWave, people do not need to think about these things. Think about it: if you’re using Postgres, you never need to worry about how Postgres processes transactions, right? And if you use Postgres, you never need to worry about what a checkpoint is or what failure recovery is. You just use it and you just write SQL code. That’s fundamentally different from using a system like Spark, or probably Hadoop, or probably Flink, right? That’s fundamentally different. And the second thing here is cost efficiency. When we talk about Spark or Flink, in these kinds of big data systems people are talking about MapReduce, right?

Yingjun Wu 00:04:15 We try to shove the data, position data and shove the data into multiple machines and ask every single machine to their local accommodation. And then we do the application, right? Such kind of architecture is highly optimized, well performance, which means that I can give you 10 machines and this kind of system will achieve fast performance by running on top of these 10 machines. But what we see in today’s world is that people use Cloud. So we never need to worry about how many computer nodes we can get, right? Where as long as we pay money, we can get as many computer node as we want, right? But the thing here is more about cost efficiency, that is how I can reduce the cost. Why is that important because I mean, people don’t really want to spend, let’s say, always spend a lot of money on this data system and they just want to provision a small amount machine that’s just a fit to their workload, right?

Yingjun Wu 00:05:14 If let’s say that I have a small workload, I don’t really want to spend 10 machines, I just want to provision one single machine for the system. And in the stream processing world, the key thing here that the streaming workload can fluctuate, think about where if you’re using Uber, right, or think about it, where you using some, let’s say linking or some other Twitter, right? The workload can fluctuate, right? A lot of people will Vista website in the morning, but very few people wear Vista website at midnight, right? That’s traffic and that’s how wide the workload can fluctuate. In the stream processing model we really want to achieve, let’s say, oh, so called dynamic scaling, which means that we really want to just provision as much resource as needed, right? We do not need to provision a lot of computer nodes and we just want to provision computer node that can fit into my, that can support my workload. So yeah, to summarize RisingWaves trying to address the ease-of-use problem as whereas the cost efficiency problem.

Brijesh Ammanath 00:06:14 Right. And what’s the tradeoff for achieving cost efficiency?

Yingjun Wu 00:06:19 Well definitely there are some tradeoffs from the vendor’s perspective, right? Well, because when we build the RisingWave, RisingWave is open-source database and from the vendors’ perspective, we need to spend a lot of efforts in building both the combination layer and the storage layer. It takes a long time. We have already spent three years building a system and now it works, and it has been deployed in hundreds of clusters in protection. But we all know that it’s not being built in one day. I mean we have to spend a long time to build it. And second thing here that in terms of the tradeoff, we need, because we store data, we use the architecture called, so-called decoupled computer and storage architecture. And in this architecture, we need to store the data in remote object store. Let’s say S3 and S3 are kind of slow, right?

Yingjun Wu 00:07:14 So if we want to support latency sensitive applications like stream processing applications, then we need to think about how we can avoid accessing S3, right? So which means that we actually need to build pretty spot caching mechanism. But if you use, so-called computing and storage coupled architecture, you don’t need to worry about that because where everything is start in local machine, right? So that’s a tradeoff. So for big data systems like MapReduce basis systems like Hadoop, Sparks, Flink, they are actually optimized for performance. Well, for performance probably is secondary, but we always take care of the cost efficiency. We always need to think about how we can achieve the best in terms of cost efficiency. And that’s a tradeoff.

Brijesh Ammanath 00:08:03 So if I understood that correctly, RisingWave as a streaming database, which uses the Cloud as the data store in this case S3, and that allows you to scale in, you know, as per demand. And that’s basically where the cost efficiency comes in. And the tradeoff is obviously in being able to store data in S3 and that would be the latency or the speed of operations?

Yingjun Wu 00:08:29 Yes. Well, I mean, if you want to store data in S3, then definitely accessing data in S3 is kind of slow, right? But here I would like to mention that I just used S3 as an example. People can store data in GCP or in Azure, right? If they’re on-prem, they actually can store data in HDFS, or probably some other object store. We actually support all those object stores, but all these object stores have the tradeoff, right? They can store a lot of data without spending a lot of money, but they’re kind of slow. So RisingWave is built on top of these object stores, and we actually need to figure out how we can achieve high performance when using such object stores as the storage.

Brijesh Ammanath 00:09:14 Right. Taking a step back, obviously RisingWave is close to your heart because you founded the product, you’re the founder of the company. But what are the various streaming databases available in the market?

Yingjun Wu 00:09:27 Okay. So when we talk about this thing where people will say stream processing or streaming database is a new thing, right? Where it’s actually not, to be honest. Stream processing — the concept of stream processing — was first invented in university, in academia, and people in academia were thinking about how we can support monitoring applications using database systems. And that’s why they think about whether we can, instead of just processing batch data, whether we can process streaming data inside of the database and then that’s why they build a such kind of thing called a streaming SQL and allow database to manage streaming data. And that happened in 2002. So streaming database have already been around for 20 years, but at the very beginning it was for all these for academia products, like I can name a few like Stream, like Aurora, like Boreas. They’re all streaming databases built in academia and then IBM DB two and Oracle as well as Microsoft SQL Server all incorporate the concept of stream processing into their database systems.

Yingjun Wu 00:10:40 So I think where they actually have the product for like Microsoft Streaming Insight and they essentially a streaming database. But people talk a lot about stream processing engines like Spark streaming, like Samsa, like Flink I think for all this system was inspired by MapReduce, right? Which was a paper published by Google in 2004. And what they try to address is whether I can process streaming data on a big scale, right? I probably really want to process the streaming data in, let’s say we using a solid machine, right? A hundred machines or a solid machine, right? And they do not really want to manage storage. So they build such some things but think because of the cloud and because of the, yeah, all kinds of reasons, right? People come back and say look, well I probably really want to have a system that provides a simpler UI or simpler interface and all I can use a database system to process streaming data.

Yingjun Wu 00:11:40 So people just come back and build a system called I think for one of the famous systems is called a PipelineDB. And PipelineDB, it essentially provides a plugin for Postgres, which allows users to directly process streaming data inside of the Postgres. And that’s PipelineDB. And another notable system I would like to mention is CaseQDB. CaseQDB is actually a database built on top of Kafka. That is for if you are Kafka user, you can actually directly use CaseQDB to process your Kafka data using SQL. And that’s another pretty famous streaming database systems and there are some others like Materialize and recently I heard of some others stream database was embedded, but definitely Redwave is among them. Yeah, it’s one of the probably best, in my opinion, it’s definitely one of the best streaming databases nowadays.

Brijesh Ammanath 00:12:39 Yeah, absolutely. Considering it’s quite told, right? I didn’t realize that streaming databases were that old, but do you remember any use case, or you know, any stories where your users or clients have migrated from a traditional database to a streaming database?

Yingjun Wu 00:12:55 So if you talk about the users, we actually saw two types of users. One type is from the big data world and the other type is from the database world. So from the big data world, I mean that people probably use Flink, I probably use Spark streaming and then they move to RisingWave and in database world, I mean that people probably use Postgres and then all probably have some other database systems then move to RisingWave. So you talk about the second type, right? Moving from the existing database to streaming database. So we did see that. So I can give you a few examples. One of our customers actually used Postgres initially and they actually heavily used Postgres Materialize views to support Dashboard. But the problem here is that Postgres Materialize View is not incremental. That’s the first thing. And second thing here that you actually need to continually see Shaker refresh to refresh a Materialize View in Postgres.

Yingjun Wu 00:13:55 I know that’s where there is, improvement in the Postgres community. But for, I mean in many cloud products, they do not release a pause for the, let’s say the incremental Materialize Views are probably something like auto refresh. So they think, I mean it’s kind of difficult to use. So they move to RisingWave. Because RisingWave provides a really nice Materialize View that case consistent and always displays the recent data or the displays a fresh result. And that’s one of the reasons why they moved from Postgres to RisingWave. And second reason here that they actually don’t really want to maintain as far as use or secondary index in their operation databases such as let’s say MySQL or Postgres or MongoDB. The reason here is that if people use MySQL, Postgres, they are actually supporting their online traffic, right? Well they are supporting their online applications and this kind of location consumes a lot of resources, right?

Yingjun Wu 00:14:52 If they build a Materialize View inside of the operational database, then what it means that the Materialize View will actually also can consume a lot of resources and it will actually compete a resource, the online traffic. So people don’t really want to see that. So that’s why they need to have a dedicated system to do that. And actually we also saw some other reasons people move from the operational database to the streaming database. For example, one of our customers say that they think that Postgres is too slow in terms of when ingesting streaming data, right? Think about it, if you have a Kafka and then you consume data from Kafka and then ingest it into Postgres. Then Postgres may not have the capability to consume so much data within just a pretty short period of time, right? But for RisingWave, RisingWave is actually designed for consuming streaming data from let’s say systems like Kafka, like from Kinesis, right? So we actually can support Realtime ingestion and that’s definitely a big win compared to existing database systems like Postgres and MySQL.

Brijesh Ammanath 00:15:58 Right. Couple of questions basis that for the benefit of our listeners, can you explain what a Materialized View is?

Yingjun Wu 00:16:04 Well, Material View is definitely a pretty fundamental concept in a database system. When people send a query, people can write SQL, right? Like Slack star from table, right? But the problem here is that if you do this kind of thing, then every time a human need to trigger such kind of query, right? But sometimes we really want to, I mean we really want to ask a database to automatically to refresh the result for me, right? So I probably really want to see the, probably a result of a query, right? Then what I can do, I can just create a Materialize View inside of database system so that the, and then after you create a Materialize View, then the system can automatically consume data from probably your tables or probably your upstream services or probably some other data systems and then do the combination and then persist data inside of Materialize View. You can also think of Materials View as, let’s say so called a special table, a special data table inside of a database system which can do the combination and show you the result. And that is the Materialize View.

Brijesh Ammanath 00:17:14 Got it. You also mentioned that in some cases Postgres is too slow in ingesting data. Can you help quantify that? What is too slow? You know, how do we quantify that?

Yingjun Wu 00:17:25 So in the stream processing world, there’s a concept called backlog or probably back pressure. So what it means that let’s say that if we maintain a Kafka and then we want to use a Postgres to consume data from Kafka, then let’s say that we have already generated the Kafka service, service, have already generated, solved in two posts, but for Postgres it only consumes, let’s say a hundred two-folds, right? At the same time. So, which means that, there are 202 poles that remains in Kafka that is not consumed by Postgres. So as the time flies we can single that. Actually there are more and more data will be produced by Kafka, right? So if Postgres cannot catch up with Kafka, then it means that where there is a huge backlog inside of Kafka and then we’ll never catch up with the latest data and we’ll never see the latest result in the Postgres. So there’ll be a lagging inside of Kafka, ay inside of Postgres.

Brijesh Ammanath 00:18:26 Okay. We’ll move on to the next section where I want to focus on some of the architectural principles or thinking behind streaming databases. We’ll start with, what are the fundamental architectural differences between a streaming database and a traditional database?

Yingjun Wu 00:18:41 If you ask me about the difference between a streaming database and the database system then will also say that for streaming database, it’ll always do combination first and then store data. But for a database system, it always stores data first and then do a combination. So I can give you an example, let’s say in the streaming database like CaseQDB or like RisingWave, we actually ask people to create Materialize View first. And then once you create Materialize View, once new YouTube become seen, it’ll be consumed by the RisingWave where do directly de combina2POtion and refresh the Materialize View. And then the data will be persisted in RisingWave. Often, definitely I mean, which means that case the user can determine whether they want to store data or not. But in a conventional database, well it’s actually optimized the use case where the data is first installing the table, right? So if I insert a 2PO into a table, then the 2PO will be persisted. After I have already inserted a bunch of 2POs, then I can send ad hoc query like Select Star from P, to query the table inside of our database system. So that is, that’s the fundamental difference between the streaming database and the database system.

Brijesh Ammanath 00:19:53 Right. Got it. And how does a streaming database handle changes in the data structure or the underlying schema?

Yingjun Wu 00:20:01 Okay so now you’re talking about the schema change, right? Well, I think it definitely varies depending on which system you’re talking about. But if you’re going to ask me about RisingWave, then I have to say that RisingWave does a schema change in the same way as a Postgres, which means that I mean if you want to do, I mean schema change, drop a column or probably add a new column to a table, then it’ll just, I mean handle the data change or the schema change in the same way as Postgres does. There’s no difference at all.

Brijesh Ammanath 00:20:32 Right. And you specifically call out, that’s the case for RisingWave. So are there other ways for dealing with schema changes in the streaming world?

Yingjun Wu 00:20:41 Actually in the streaming world, when we talk about the data change and sometimes we refer to the data change or the schema change in the upstream services, right? The reason here that K4 streaming database, well one of the main data sources are actually databases, operational databases as I just mentioned earlier, right? So let’s say that you have a Postgres and MySQL and now you have a downstream streaming database like RisingWave to consume the data from your Postgres or MySQL. Then in this case, people can change the schema in MySQL and Postgres, right? So RisingWave will actually detect the schema change and adapt to it. Let’s say that if you add a new column in your Postgres system and RisingWave will continue process the data and then it will actually detect that there’s a data change, there’s a schema change in my upstream service and then I will add a column accordingly based on my upstream services. But again, we have to say that where that’s how RisingWave handles the schema change of the upstream services. But definitely there are some other systems streaming databases and which actually do not really support schema change. But if they want to support schema change, then I think where they likely support in a schema change in the same way as RisingWave does.

Brijesh Ammanath 00:21:59 Right. How is low latency achieved in a streaming database?

Yingjun Wu 00:22:04 Well, low latency is achieved by using incremental accommodation. That’s also a fundamental part between the streaming database and the conventional database system. You know, conventional database system, if you issue a query like select star from P or select star from T(?), right? You actually do full combination, which means you will actually need to scan the entire table and then do the combination and show the result to the users, right? But for RisingWave and other streaming databases, if you define a Materialize View like create Materialize view as select install from P, then this kind of streaming databases will always do incremental computation, which means that every time a new new will comes in and is consumed by the streaming database, it’ll modify the result by just applying a delta to the existing result. It will not do the computation from scratch, it will not do that. So if we do incremental combination, then it means that we only need to care about the new data, right? We do not need to care a lot about the historic data, right? Because for the data, the result has already been there, right? So if we only need to deal with, let’s say a small amount of data, small amount of newly insert data, definitely we can do the combination in very low latency,

Brijesh Ammanath 00:23:20 Right. And the way you know that it’s new data is by storing state. And because it’s a database, you’re able to do that.

Yingjun Wu 00:23:26 I know that case. So there are definitely multiple ways to detect whether it is a new data based on the database. For example, I used to work in RedShift and the Redshift that detects a new data by scanning a table and identify where the new data is. And for RisingWave it detects a new data differently because RisingWave will directly consume data from upstream services. It will, the newly inserted data will actually be buffered by RisingWave. So that I mean RisingWave will essentially know which new data has already been inserted or probably reach new data change, let’s say delete or update will need to be handled by RisingWave. So actually it detects as this is kind of a new data or new notifications using buffers in RisingWave.

Brijesh Ammanath 00:24:15 And while we’re talking about latency, is that a sliding scale of latency which drives the choice of database or rather the processing choice. So you know, at one end of the spectrum you would have a requirement that you know, you need the data to be refreshed or to be updated only, you know, the end of the day. And that’s why you have the patch processing. What does that scale look like and where does streaming databases fit in that sliding scale?

Yingjun Wu 00:24:42 Yeah, that’s definitely a great question. So in many use cases, people probably do not really care about the latency, right? People do not really care about the freshness. Let’s say that if I want to do financial reporting, right? People do financial reporting monthly, right? And or probably even quarterly, they do not need to see fresh results, right? If I made some changes, well I’ll probably purchase some new product. I probably do not need to report, do not show this purchase in my report instantly, right? So I can just show this kind of change in my monthly report or quarterly report and in this case people do not really care about the latency, right? Or people do not really care about the freshness. And in machine learning, I mean some people also do not really care about the freshness, right?

Yingjun Wu 00:25:28 Well, I can probably use lots of your data to change a data model, right? For sure I can do that. But definitely there are a lot of use cases where the freshness is required. So I can give you a couple of examples. One is let’s say Uber, right? Well, I mean Uber is definitely a great example, right? So if I want to call a taxi Uber, then I really want to have someone to take my order, right? Instantly, right? I don’t really want to say that I want to grab an Uber, but where it takes me an hour to grab a taxi, right? So it’s not a very good user experience and there are some many other use cases like stock dashboarding, right? So, if I want to show my stock show the changes in the stock market, I don’t really want to just show people the, I mean the last days results, right?

Yingjun Wu 00:26:13 I definitely want to show the result or all the stock market within the last one minute or probably even one second, right? So I don’t really want to see any lagging there, right? And definitely there are some other pretty interesting use cases like manufacturing. We actually have a pretty big manufacturing customer who produces the motherboard device, electronic device. They have a lot of factories across the world. And what happened previously was that before introducing stream processing into a company, was that they actually need to make phone calls. I mean that was probably 10 years ago and they actually need to make phone calls to talk to the people in the factory and check what happened? They make this kind of phone call every four or six hours to check the status, but they found that, no, it doesn’t really work.

Yingjun Wu 00:27:03 The reason here that I mean I really want to see the latest status, right? Let’s say that if there’s some anomaly occurred in one factory, then I probably really want to parcel in entire pipeline, right? Or parcel in production pipeline, right? Because why I want to check whether there’s some anomaly or some misfunction, right? Or malfunction, right? So yeah, and then use the stream processing system first actually PipelineDB and then they move to some other systems there and then recently move to RisingWave. And once they adopt from the streaming database, well what changes to them is that they can just sit in the office and then see the dashboard and to see whether there’s some, a novelty occurred within the last a few seconds. So that’s how stream processing changes work and then how streaming support helps them to better manage their product line.

Brijesh Ammanath 00:27:56 So if I try to quantify that, what you’re saying is if the requirement is to be able to process data from seconds to minutes, you will look at streaming databases and if it’s more than that an hourly or a daily job, that would be most probably patch. What about use cases where you want to process your data in nanoseconds or milliseconds? Is that something which would still be a streaming database use case?

Yingjun Wu 00:28:23 For nanoseconds or probably microseconds, we actually see some company do that without stream program. Streaming programs really want to process? Well I mean do transactions, right? So within let’s say nanoseconds or microseconds, but I don’t really think that’s a pretty good use case for streaming database because for streaming database, well it’s probably good for hand supporting applications that require, let’s say milliseconds it will work. But for nanoseconds, where I think, I would suggest that we, I mean build your own system, that’s how all these trading firms do. Actually all these trading firms have their own specialized trading platform to process the trading data, right? Or the, all the stock data in microseconds. But we do see that many trading firms also adopted the RisingWave. We have several clients in the trading world and what they do is that they actually use RisingWave to monitor the stock market for their code. They really want to see whether there’s some technical changes over the last few seconds or probably last one minute, right? They also want to check how many transactions they have already done within last seconds or within last few seconds or within last one minute, right? They do not need to build a specialized system for this kind of application and then, but they do not really want to use a batch system. So they choose to use let’s say the streaming database.

Brijesh Ammanath 00:29:46 Yeah. So for high frequency trading applications, you have seen forms use their own custom application. Whereas for tracking the stocks and other use cases like Uber, stock prices, manufacturing, that’s where streaming database is usually excellent.

Yingjun Wu 00:30:04 Yes, that’s right.

Brijesh Ammanath 00:30:05 Moving on to the next section, which is around data ingestion and processing. What are the common protocols or methods used in data ingestion?

Yingjun Wu 00:30:14 For data ingestion we typically ingest data from Kafka and all the messaging queues or CDC, the database CDC. So for Kafka, well it is kind of like natural because for, I mean the Kafka always store streaming data. So, but for RisingWave and all, for some other streaming databases where they can directly consume the streaming database from Kafka or from messaging queue and the other type is the CDC, the database CDC or the change log of the database system. So for CDC, I mean people use CDC because where they think that they really want to re-consume data from their databases. Let’s say that if we have, let’s say Postgres or MySQL, I really want to capture the data change and then consume these data changes into RisingWave, right? So RisingWave can do that, and we do that too called Visium, which is a two specific designed for CDC, database CDC.

Brijesh Ammanath 00:31:12 Right. And are there any specific techniques for handling high volume of data?

Yingjun Wu 00:31:18 If we want to handle high volume of data, then I would like to say that we definitely want to paralyze the data consumption. So I mean in RisingWave, what we do is, we can actually treat multiple data ingestion sources and these sources will listen to the upstream services and then consume data in parallel. Because if you just use one thread to per such, I mean the data from all these data sources, then it won’t work, right? Because where there may be a lot of data, right? Definitely. So to allow RisingWave to consume data, lost volume data from upstream services, then definitely we really want to paralyze the system and we actually support distribution or distributing the data ingesting in different machines and definitely also works.

Brijesh Ammanath 00:32:07 I’ve heard the term data connectors, what are they and what is the role they play in data quality and consistency?

Yingjun Wu 00:32:15 Okay so, actually different systems may use different connectors, right? I mean if you talk about data warehouses like Snowflake and Redshift, they actually have their own connector. I don’t know the fundamental technology of their systems, how they build an internet connector. I don’t know. I don’t know that part, but they actually have their own connector. Like let’s say in Snowflake they actually have a project called Snow Pipe, which allows people to do streaming ingestion directly from different systems into Snowflake. But in streaming databases like RisingWave, like probably CaseQDB, I’m not sure about that. But yeah, let’s say inside of a streaming databases then typically we can consume data from messaging queues directly, let’s say Kafka, let’s say Kinesis, right? Because in Kafka, they actually have Kafka connector, right? We do not need to match the connectors on our own, right? But if you want to consume data from let’s say database systems like MySQL and Postgres, stream database is typically used the Visium, a CDC tool that allows people to directly capture the data changes from the upstream database services,

Brijesh Ammanath 00:33:26 And as Realtime data enrichment handle, how do you manage to do that?

Yingjun Wu 00:33:31 Oh, data enrichment will, there were actually two types of data enrichment. One thing here that’s great is kind of simple that is probably do some basic transformation, right? Data transformation like filtering, like aggregation and like probably projection, right? So that’s kind of basic data enrichment. I think, well for all kinds of streaming databases, like CaseQDB, RisingWave, PipelineDB, it can be done easily because for these kind of like jobs will all probably queries are kind of simple and we do not need to match large states. But another very interesting type of data instrument is called JOINs, right? Let’s say that’s where if I have multiple data sources, Kafka, MySQL, and the Postgres, I really want to join the data from all these data services, right? Then how I can do that, I definitely want to do JOINs, right?

Yingjun Wu 00:34:26 If I only use JOINSs then in streaming databbases, in stream processing, we actually need to match large state, right? Large internal state. And if we want to match large internal state, then we actually need to have let’s say mechanics like checkpoint and the failure recovery, right? And definitely streaming databases do is for checkpoint and failure recovery differently. For the RisingWave, I mean we saw data in S3 so that we actually rely on S3 to do all kinds of the failure recovery and checkpoint. But in some other databases like streaming databases like ______DB, they actually have their local store, they actually store their data in the local machine. So they actually need to frequency do checkpointing and persist the data from the local state to the remote storage like S3. And every time a system crash, they actually need to think about how I can pull data from the remote storage like S3 to my local state.

Yingjun Wu 00:35:22 And that’s how the streaming database deal with the checkpoint and federal recovery. But back to the data enrichment, yes, the difficult part of data enrichment for how the streaming database can do JOINSs and the JOINSs is definitely a key challenge in all streaming devices. And if we want to join efficiently, then we need to think a lot about the failure carry and checkpoint and definitely people also need to think about the optimizers because the drawing ordering can also affect the performance. Anyways for data instrument is definitely a pretty challenging task. If you specifically refer to JOINs.

Brijesh Ammanath 00:35:59 I’d like to double back a bit on that particular topic about JOINs. Can you explain why JOINs are difficult in a streaming database? Maybe if you take a concrete example where you got a data source, which is coming from Kafka and from maybe a Postgres database.

Yingjun Wu 00:36:15 Yeah, we can actually use, let’s say the ad monetization example. Let’s say that we have clickstream, right? And the other one is an impression stream. Now we have two streams. One is clickstream and the other one is impression stream. And as a platform, I really want to join these two data streams together because I want to see which impression leads to which click, right? So that can, I can monetize the ad, right? So if I want to do such kind of drawing right at end, what I want to do is I need to maintain two hash tables. Actually I want to build a hash table for the impression stream and a hash table for the clickstream. And every time a new table comes in into let’s say the impression stream, then I need to check whether there’s a match in the hash table for the clickstream, if there’s a match, I know that there’s one impression that to lead to one click, right?

Yingjun Wu 00:37:12 And similarly, if the new tool comes in from the clickstream, I need to check where there’s some match in the impression stream. If there’s a match, then it means there’s an impression that leads to this click, right? So I have always need to maintain internal space, the internal, these two hash tables in my streaming device. The problem here is that how I can maintain these two hash tables, right? I probably can maintain them in memory, but the problem here that let’s say that my machine crashes, if my machine crashes, then all memory data will get lost. That’s a big problem, right? Because well, which means that if I want to reduce such kind of combination, then I need to rebuild these two hash tables, right? Let’s say that if I have, let’s say a seven days data, right?

Yingjun Wu 00:38:01 If I need to possess seven days data, it means that I actually need to recompute from scratch from seven days ago, right? From the data that was January seven days ago. And it will take a long time, I don’t know, but probably one hour, probably two hours, probably even one day, right? Because that may be future amount data. So the thing here that we need to think about is how we want to persist these hash tables or how we want to persist these internal states, and that’s a key challenge that the streaming database need to handle. And yeah, back to what I said earlier, and there are different ways people probably just use, let’s say the local disc to persist the data and probably do checkpoint into remote machines, but in some other systems they may directly persist the data or persist the state in remote disc, remote SSD, sorry, not remote SSD, but remote object stall. And then they use the local machine as a cache to serve as a cache to accelerate clear processing.

Brijesh Ammanath 00:39:01 Right. And I guess if you stored only the local machine challenges around horizontal scaling?

Yingjun Wu 00:39:07 Yeah, so that’s horizontal scaling is also quite relevant to how we want to do persistence. Because I mean if you want to store data in local machines, then let’s say that if we want to scale from one machine to three machines, now the problem becomes that, and now we want to do scaling, then the problem becomes that we actually need to partition the local state into let’s say three pieces and then ship these three pieces into the three machines, right? Three newly created machines and then to do scaling, right? So that definitely works, but you actually need to deal with let’s say the data consistency, right? And also you need to think about there are probably some downtime during the data migration, state migration, right? But the other way which is adopted by Rewa that is will persist the directly persist the data in remote storage. If we do that, then scaling becomes that, we can just create three new machines and ask these three new machines to directly reach to directly access data from the remote object stall. There will be no data migration, there has been no state migration say from one machine to three machines. These three newly created machines will directly access the remote object stall to fast the states. So yeah, these are two fundamentally different ways to do horizontal scaling.

Brijesh Ammanath 00:40:25 Yep. Makes sense. Just wanted to touch on indexing before we move into the next section. Is indexing any different in streaming databases and how does that impact the overall QE performance or data retrieval?

Yingjun Wu 00:40:40 Well, indexing actually works in a similar way as a traditional database system. Well I think what the indexing mechanism is also varies based on yeah, what kind of streaming data you are referring to. And even in traditional databases, well different databases may handle indexes differently, right? In some databases their indexes are synchronous index, which means that really indexes actually not consistent with the base table, but in some databases where they always guarantee the consistent or the synchronous index, which means that the index always reflects the latest data, right? So yeah, these are two different ways and in streaming databases where definitely this is also trade off in some streaming databases where the index is like synchronous but in databases like RisingWave is synchronous, but the trade of your that key, I mean if you want to maintain synchronous index, you probably really want to have more combination resources, right? So you need to consume more combination resources. Yeah, that’s the tradeoff.

Brijesh Ammanath 00:41:40 When you say it’s synchronous indexes, what you mean is basically the index gets updated as the data gets updated?

Yingjun Wu 00:41:46 Yeah, that’s right. Well let’s say that if I insert a tool into a base table, right? A table in a database, right? Then I should see this data in the index built on top of that table, right? I want to see that TPO. But the problem here is that in some databases, well, I mean they do not really have a consistent synchronous index and then the index may be kind of stale. If you insert a TPO into a database table, the TPO may not be directly reflected in that index. They may be after one minute you will see that TPO, but there is no guarantee at all.

Brijesh Ammanath 00:42:19 Got it. We’ll move on to the next section, which is around security. Are there any fundamental differences in how security concerns are addressed in a streaming database compared to conventional databases or conventional security models applied?

Yingjun Wu 00:42:34 I would say probably there’s not much difference because, well, yeah, again, where security thing where solve it varies based on the companies in different companies may have different policies for, you know, because some companies want to make sure that all data is encrypted and they always use TRS to guarantee the security of the data transformation and instrument ways. We all, yeah, definitely we all always need to obey that kind of regulations or compliance. Yeah. This kind of compliance. So I would say that’s where there’s no big difference.

Brijesh Ammanath 00:43:11 How do you handle out of order events or data coming in out of order in a streaming database?

Yingjun Wu 00:43:18 Yeah, that’s a very interesting question, right? Well because in the, in a commercial database, people do not really care about what is out of order, right? Well yeah I just store data and there’s no order at all. Right? The commercial databases deal with transactions, but in streaming databases we deal with orders which means that if TPO, it may result in different results, right? It means these two different results. So we use a technology called Watermark to guarantee that with the data, even if the data comes in different orders, we do see a consistent result. Yeah. Let me give you a very concrete example. Let’s say that if I want to count how many sensory events occurred within the last 10 seconds, right? If I connect RisingWave with different sensors, right? Sensors may send data in different orders, right? Because of the network issues because all kinds of issues, right?

Yingjun Wu 00:44:10 This data can be disordered but RisingWave or probably some other streaming databases, right? We always want to display the correct results to the users even if the data is disordered, right? Then all these streaming databases use a concept called Watermarked to process the disorder data. Basically what the strategy is that we will actually maintain a buffer to capture the data within the, that is ingested within a certain parallel time. And if a data comes late, let’s say that if I want to count how many triples occurred within the last ten seconds, but a new TPO just to come in, I probably insert this TPO back to the previous window, right? Then what will happen is that will modify the result maintained by the window, right? So that’s how it works. We use a mechanism called Watermark.

Brijesh Ammanath 00:45:05 Yeah. So what you’re saying is there’s a concept called Watermark, which basically allows you to say that for a certain period it will take in all the events. So even if an event comes in late, as long as it’s within that Watermark range, it’ll be included in that query result.

Yingjun Wu 00:45:25 Yes.

Brijesh Ammanath 00:45:26 Happens to the events which come after the Watermark period?

Yingjun Wu 00:45:30 So that event, well I think, well it’s kind of depending on the information but in some cases, well the late event will just be discarded. Yeah. Basically will be ignored, but in some cases it would probably buffer, yeah

Brijesh Ammanath 00:45:45 So you’ve got a buffer time where you wait for any events which are coming late, you still included if it’s outside of that buffer period, you don’t include it.

Yingjun Wu 00:45:54 That’s right.

Brijesh Ammanath 00:45:55 Right. How do you manage data de-duplication?

Yingjun Wu 00:45:58 Yeah. In some services where there are duplicated data, right? Well let’s say that case maintain Kafka, right? Well and there maybe there are some duplicated Kafka for various reason, right? Because probably I just rebuilt the Kafka, right? So I probably reset the offset, right? So basically RisingWave and some other streaming databases, a CaseQDB and many others, we actually were tracking the offset of the data. So basically, we’ll track which data I’m currently processing, am I processing the 100th data, right? Or 100th TPO, right? If I’m processing that, if I saw that the new TPO comes in and it is still a hundredth TPO, then I should discard it, right? That’s how the streaming database deal with the data duplication.

Brijesh Ammanath 00:46:42 Alright, so you use something called an offset, which allows you to have a view of what has already been processed.

Yingjun Wu 00:46:50 Oh yeah, yeah. I think we’re software. Yeah, that’s right. That’s right.

Brijesh Ammanath 00:46:53 So as an industry, if you move from traditional conventional databases onto stream processing or Realtime analytics and now we’ve got stream databases and I think RisingWave is where we have got stream databases evolve into using cloud or cloud-based streaming databases. What do you see in the future? How do you see streaming databases evolve going forward?

Yingjun Wu 00:47:17 Well if you’re talking about the streaming database, the future of streaming database, definitely I would like to say that first of all it is more powerful, right? Because well as I just mentioned, well very early in this talk, well there was a lot of stream processing engines, right? So for newly built streaming databases, no matter whether its case going be Materialized RisingWave, I believe there are still a long way to go because, well, I mean commission streaming processing engines have already been built for properties say there for 10 years, right? Or probably 10 plus years. And definitely we need to catch up with some functionalities like I don’t know, Java, API, probably some Python API, right? All these some things. But in terms of the architecture, I would like say that it’s pretty interesting to see that, there may be a big change in the object store if you pay attention to reinvent this year, 2023 you’ll know that actually AWS actually announced this new technology called S3 Express, right?

Yingjun Wu 00:48:16 So basically what it means is that S3 can provide a single digit millisecond latency and that will definitely a game changer for all the streaming databases because which means that key all, and not just the streaming data, I have to say that key, all these data systems because that means that, we can actually store data in cheap devices, object stores, and actually can allow us to handle low latency processing, right? Because I previously mentioned that if we store data in S3, then every time we access S3 then it’ll probably incur high latency, a hundred millisecond latency or probably 200 millisecond latency, which is not tolerable. And then we actually need to build a caching layer to guarantee that okay most of the requests will hit the query and can be responded to in pretty low latency. But if S3 can provide a single digit millisecond latency, then we actually can safely store data probably all my data in S3, right?

Yingjun Wu 00:49:15 And then I mean we can guarantee a user that we can always guarantee the user that they can see the freshest results, right? And answer the queries in very, very low latency without incurring high expense. So I would like to say that with the hardware change, so free, the S3 express may change the streaming database and definitely probably the entire database world and the search channel, I would like to say that case many people talk about unified batch and stream processing, right? Which means that case they probably don’t really want to purchase two different systems, one for handling streaming data and the other one for handling batch data. So they are advocating for a unified system that can process both streaming database streaming data and batch data. So I do think that that’s a direction that streaming database should go for, right?

Yingjun Wu 00:50:05 Because well look, nowadays streaming database are used to process written data, right? But more data is stored inside of streaming database. Then we need to think about how we can serve this data to the users, right? Let’s say that if a user really wants to check the historical results, we don’t really want to let the users to push this data into some Snowflake or probably Redshift because that’s not convenient, right? So we really want to have the users to do all this batch processing and stream processing in one single system.

Brijesh Ammanath 00:50:42 Isn’t Spark streaming unified framework for batch and stream processing. So are you saying that you’ll see more saving databases going in that direction?

Yingjun Wu 00:50:53 Yeah, something like that? Well, I mean that’s a good question. In Spark, we have both batch and stream processing because for Spark, ultimately is designed for batch processing. And it also has a component called Spark streaming, which is designed for stream processing, right? So nowadays people can think can consider, Spark has, let’s say a unified batch and stream processing engine, right? But I would like to remind that case Spark is an engine, it’s a stream processing engine, but the unified streaming and a batch concept does not just exist in the engine world, you know, or in the big data world. It also exists in the database world. Let’s take Snowflake as an example, right? With Snowflake, recently announced a SnowPipe and they also have the, say the live table, right? The live table is essentially a concept of the stream processing. It based because what it does incremental accommodation and it’ll refresh the results for the users, right, automatically for the users. So I believe that case, the unified stream processing, and batch processing was also happened in the streaming database. And I believe that some streaming databases have already started implementing that part.

Brijesh Ammanath 00:52:10 Righ. Got it. So, what we are saying is basically you’ve got unified batch and stream processing in the stream processing world, but not in the streaming databases world. So we’ll see that happening.

Yingjun Wu 00:52:24 Yeah, we’ll see that happening in the streaming database world.

Brijesh Ammanath 00:52:27 Excellent. We have covered a lot of ground over here. We have covered the fundamentals of streaming databases. We have understood how it differs from stream processing. We have delved into the architecture and design patterns used in streaming databases and we have also looked at some of the use cases for streaming databases. Thanks a lot for that Yingjun. If people want to find out more about what you’re up to, where can they go?

Yingjun Wu 00:52:50 Well, they can definitely visit our official website, RisingWave.com and definitely they can also go to GitHub and check out our GitHub repository. And definitely the easier way is to join our Slack community, RisingWave.comSlack. That’s a website and you can join our Slack community and directly ping me and then we can talk a lot about stream processing, stream database, the future of data engineering thing. Yeah.

Brijesh Ammanath 00:53:19 Yingjun, thank you for coming on the show. It’s been a real pleasure. This is Brijesh Ammanath for Software Engineering Radio. Thank you for listening.

[End of Audio]
