Stephen Ewen, one of the original creator of Apache Flink discusses streaming architecture. Streaming architecture has become more important because it enables real-time computation on big data. Edaena Salinas spoke with Stephen Ewen about the comparison between batch processing and stream processing. Stephen explained the architecture components and the types of applications that can be done with each one. Stephen talked about the challenges of streaming architectures and how they address them in Apache Flink. At the end Stephen talked about the future of stream processing.
Transcript brought to you by IEEE Software
Edaena Salinas 00:00:53 For software engineering radio, I’m Edaena Salinas. Stephen Ewen is a PMC member, and one of the original creators of Apache Flink and cofounder and CTO of data artisans previously, he worked at IBM and Microsoft on data processing technologies. He holds a PhD from the Berlin university of technology with data artisans and Apache Flink. Stephen works on pushing the capabilities and the adoption of stream processing technology. Stephen Ewen, welcome to software engineering radio.
Stephen Ewen 00:01:32 Thank you. Hello.
Edaena Salinas 00:01:34 There are systems with data constantly flowing through them. This data can come from different sources like telemetry or the internet of things as this amount of data keeps increasing challenges, a March on how to process it. Batch and stream are two ways in which we can process big data in a talk that you gave in San Francisco. You mentioned that we can understand these two by asking what changes faster your data or your query. Can you explain what this means?
Stephen Ewen 00:02:12 If you look at the leg of the nature of, of batch and streaming applications, betcha applications are, um, applications that take a fixed finite, um, bit of data and, um, like very typically our applications that take that. And you, you may want to run different types of, you know, different types of, uh, analysis over, over this particular static type of data. I think for example, of, you know, like a classical data warehousing use cases, you have, you have a static piece of data, which is what is coated in your warehouse, or even if your warehouse is continuously ingested, it’s like a snapshot view of it. And then you run different exploratory queries based on the result of the first, let’s say you drill down, um, certain dimensions and so, and run the next query. Um, another example is like machine learning, model training, where you take, you take a piece of input data, you may partition it in different ways as in like training and, um, and validation set.
Stephen Ewen 00:03:12 And then, um, and then you you’re on the training model, but it’s, it’s in some sense, always the same, same data that you base this on, even though you might split it differently between different trainings run different training runs. So in like these are kind of applications where the kind of work that you do changes, but it’s always on the, on the same, on the same static data set streamings is if you wish the, the, the opposite of that you have, you have an application that, you know, that continues to run and you can re really think very broadly of what kind of application that is. It can be something that compute some, you know, like some statistics over, over what is, you know, what is happening in, in the, in the data that is being produced. It can, it can be, you know, like an application, like a, like a web service that backs a, that begs an X, an online application.
Stephen Ewen 00:04:04 The, the logic there is, is the same kind of logic that runs all the time. But the data that it runs on is actually, it’s, it’s not finite. The data doesn’t really exist all before the application is launched. So while the application exists and stays the same more and more and more and more data comes and keeps changing over time. So that’s why it, why these two are almost the, you know, like the, the inverse of each other. And in some sense, that, that boils down to another interpretation of, of batch and streaming. You can, you can think of, of streaming or the most defining part of streaming is actually this aspect of, of continuity, of, of data being, being, um, being produced and being processed. A streaming application is an inherently an application that is not necessarily bounded towards the end. So while it starts, it doesn’t yet have an end of data where, where it’s done, where it actually shuts down and says, you know, my, my job has done it’s, it just stays there.
Stephen Ewen 00:05:06 And it continues to receive, to receive, um, receive events and, you know, events that can come from a message here that can come from user interaction with it versus a batch application is like a, it’s like a finite job or project. It has a fixed amount of data that, that you run it on. So the, like the most defining differences is that that streaming’s unbounded towards the end versus was a specialist not. And that, like with that comes, comes the fact that, that in batch, because the data aesthetic, usually the application changes more often and in streaming because the data is unbounded and the, like the streaming application exists before you even know the end of the year, the end of the data, the streaming, the application usually stays the same way that data changes.
Edaena Salinas 00:05:47 Are there characteristics of the applications that is better if we use the streaming approach, because in a sense, we can have this stream of data coming, but we can just decide to store up a big file and then just do batch processing. What is the advantage of having available this technology of stream processing for certain applications?
Stephen Ewen 00:06:13 I mean, stream processing has a bunch of very strong advantages, um, which is, I think, why it’s, why it’s actually creating such a, such a big interest at the moment. The first one is quite straightforward, right? If you, if you ingest the data first, um, wait until you filled up a certain chunk and then say, okay, let me create, like, let me take this static piece of data and then run a batch job. You introducing a delay until you actually run the computation. So for the, for the time that, you know, like this, this amount of data, this buffer, or this block fills up, um, so if you’re, you know, this runs this results, then often than this periodic, periodic batch jobs you ingest for a while. And every day, every hour, every few minutes, you kick off a certain batch job, but it inherently actually means that the data is already a little bit stale by the time the computation runs.
Stephen Ewen 00:07:06 And a lot of data is very valuable when it’s very fresh. So, so the, like getting this delay down is something people are very interested in and the best thing you can actually do is get it to zero and just process it as it comes. Right. Um, we we’ve, we’ve seen this trend of periodic batch jobs trying to become more and more and more frequent. So they act on, on fresher data all the time. And then streaming is like the extreme form of that. You don’t even, you don’t even try to chop it into finite sets and then run the stream, uh, sorry, run the batch processing application, but you just process the data exactly. Immediately as it comes. That is one part of it like spring processing really helps with, with like immediately reacting to data when it’s, when it’s still very fresh, when it’s really valuable and for a lot of application that doesn’t even make sense to react to certain events anymore.
Stephen Ewen 00:07:53 In hindsight, the other, the other aspect of that is that streaming is really a very natural way to represent a lot of problems. If you look at how most of the data is produced, it’s actually produced in a, like in a continuous fashion, in an somewhat unbounded fashion, right? Streaming data, like treating a problem as a streaming problem matches that very well. If you look at the data producer, like, you know, like a web service that, that just logs user interactions, and you want to process that in order to like, get to get insights and what is happening in your new service right now, or to build like a life recommender also like these, the interactions being continuous produced over time is really a stream. And if you try to apply a batch processing for that, you come to this to this point of saying, okay, this thing that is actually created as a stream in an unbounded fashion, and I’ll have to introduce like arbitrary outrage boundaries, and then, and then interpret each individual individual carved out piece as a, as a final dataset.
Stephen Ewen 00:08:52 It say it’s actually not as, as, as a thing to do as, as just treating it as a stream and processing it as a stream. And I think it’s mainly been done historically because of immaturity of stream processing technology. If there was just no other way than to say, let’s cut the bed, thus treatment of pieces and treat each, each piece as a batch, but a stream, a stream processing technology continues to mature. Um, we actually see that a lot of people are very interested in like, again, adopting this quite natural paradigm of processing. The
Edaena Salinas 00:09:23 Let’s talk a little bit about that. You mentioned batching was impart used because stream processing was still immature. So I just want to understand if there’s somebody that has this, all this data coming from the web service, like you mentioned, and they want to set up batch processing, what would be the components of this architecture, for example, where can the data be stored? Would it be like a distributed file system and just pretty much what would they need to do to set it up as closely as, as if it were stream processing processing, but not quite there yet.
Stephen Ewen 00:10:04 I think if you look at the somewhat historical evolution of technology, I think this, this actually goes, goes back quite a bit. You can see that, you know, with, with relational databases, people have tried to, to, to do the same thing like continuous ingestion, or, you know, as, as data is produced, you immediately inserted into the database and then your periodically run your analytical crews on the database. This is a form of like streaming ingestion into a store. And then running batch processing just within this, like all close monolithic piece of the database. And, you know, some companies have actually adopted that long ago and somewhat successfully, even if at, at quite significant costs in many cases. But, um, I guess if you, if you look at the, like at the, at the more modern trends, the big data way of doing this, it’s probably more like you, like you said, right, you have something like a continuous ingestion into a distributed file system, you know, with projects, I guess the first ones to do that were projects like Apache flume or so, and then he would have, he would have something else that kicks off the bet shop that analyzes the data in the, in the distributed file system.
Stephen Ewen 00:11:11 Yeah. So for, for streaming streaming ingestion, and then batch processing to approximate stream processing, I think that that would probably be the, the approach. And if you’re, if you’re trying to approximate stream processing more and more, um, you end up at trying to decrease the batch size, um, more and more until, you know, the batches become just very, very small and you might end up with something like mini batching, even.
Edaena Salinas 00:11:35 So you mentioned like in the very early days, it was just a matter of setting up a database related for this, for gathering the data and just adding the data to a table or something, and then processing that. And now we have distributed file systems with Apache flume. What are some the bottlenecks?
Stephen Ewen 00:11:55 So flume flume was one of the very early technologies. I think probably these days, if you, if you ingest into a FOD system, there’s, there’s very different ways to do this. Um, it’s either, you know, either you actually use a stream processor directly to ingest into, into the file system. And to some, if you were some preprocessing on the way, or you may use things like, like, uh, like tools that work with the, with a message queue and send directly, right. From message cues to the five systems like towards, around, um, you know, things like, like CAFCA, or also like event hub or pops up. And so on. A lot of them come with connectors that are like mainly intended tool to do ingestion and to batch processing, because it’s still is a big,
Edaena Salinas 00:12:35 I see. So what you’re saying is that the one where you mentioned that it’s just about your flow manage distributed file system, like it’s still pretty in the early days, if you tried to do that.
Stephen Ewen 00:12:44 No, I think these were the, like the first, this were, were the early days of yeah. Of taking streams, putting streams into a batch storage and then computing over the battery storage. Yes.
Edaena Salinas 00:12:56 What were some of the bottlenecks that could be hit back then? If, if you’re trying to go with this approach of doing batches more frequently as the system keeps growing,
Stephen Ewen 00:13:09 I mean, a bet shop, a bet shop, depending on how you set it up often has, has a problem of, I think you can call this like broken boundaries. So let’s say you have, you have a bad shop to interact SRE, true to analyze sessions with which users interact with, with a certain service. That’s, that’s very often a very often something you needed at the beginning, right? You just need to actually gather you the user sessions and gather like a user’s behavior within that session. And that is, that is input into some, some later processing step, like, you know, like training, training, a model to predict, um, what, what users would be interested in, interested in, depending on what they interact with. The problem is if you, if you actually chop your data stream, um, at some point you’re inevitably going to break a lot of these sessions, right?
Stephen Ewen 00:13:54 So you, you have to put some logic in place to kind of repair the boundaries, to carry some results from the previous batch, also into the next batch in, in, in that way, trying to, you know, carry the in progress sessions into the, into the next one. Um, if you start decreasing your, your batch sizes more and more and more, you can easily come to the point that this, that each batch doesn’t just this way more. It just either repeats a lot of work, or it has to spend a lot of work, um, like carrying over, um, information from the previous batch. And you almost come to the point that warming up the actual contribution, uh, computation is much more effort than the computation. And that is where, you know, if you actually get then come to the model and say, yeah, let’s actually just treat the computation as continuous. We would just, you know, like we keep the information, we keep the state in there all the time and just continuously compute on, on this just goes all of these problems.
Edaena Salinas 00:14:47 I see. And the ones that you highlighted is the batches, essentially just one piece, but because you’re doing them smaller and more frequent, there’s this additional steps of stitching together, previous information and using that for the current computation is what you’re saying, right?
Stephen Ewen 00:15:06 Yeah. This is just, um, in order to, let’s say, if you want to approximate like more closer, closer, real time computation with batch processing, there’s, um, there’s a lot of, of applications that, that have the, the requirement to choose operate in, uh, and with a very low latency where it’s, it’s completely completely infeasible to even start up the batch processing job within the time that they’re usually, um, needs to deliver the answer. And right. So a lot of, a lot of cross detection, anomaly detection, like a network intrusion detection, and so on where you really have to answer within, like sub-second often even sub a hundred milliseconds and so on. It’s completely infeasible to like first do the ingestion and then kick off the batch shot for these applications. It’s really essential that the computation is online and immediately reacts to the data as it comes in. But, um, even for applications that don’t have this like very hard, um, low latency requirements, if you start to, to make, make the batches shorter and shorter and shorter, at some point in time, you’re paying a lot of, a lot of overhead price in that, with that, with scheduling, with that, with, you know, loading pre warming pre-loading, um, like previous results and so on in order to continue on them,
Edaena Salinas 00:16:19 You mentioned if you have a system where you value data that is fresh, like for example, for anomaly detection or intrusion, that’s a point where you can switch to a stream processing architecture because the architecture with batching, even if you do it more frequent, we talked about some of the issues with that, but there’s a cost issue where cost can get very high. At what other point,
Stephen Ewen 00:16:46 I think there’s, there are many cases where you actually want to switch to that before, before actually coming to this like very, very low latency case. Like, uh, like I said before, like streaming is really a very good paradigm. Every time the data that you’re processing is really continuously produce data, and you have this, this kind of piece of work that is the same work that you apply to all the data, right. Be that routing it to a different point in your, um, in your infrastructure, like ingesting it into storage is one, one, one example often like a continuous piece of work that you do on changing data as it, as it changes over time, but also indexing it, filtering it. Uh, preprocessing it searching it for, uh, for anomalies continuously updating, um, certain intellect, certain like models or certain state that you keep around for, let’s say for users that interact with your, uh, with your website, um, or for other customers, um, all these things that you tend to do continuously all the time and that work on data that continuously produced like this, these are natural streaming applications actually. And even if they don’t require a super low latency case, there’s a, there’s really with maturing stream processing technology. There isn’t, there isn’t really a reason anymore to not treat them like that. The reason to treat them with batches. I think we’re mainly historic when batch processing was vastly more efficient than stream processing was, but that is not quite the case anymore. So there’s this really a very good point to start just treating streams of streams and modeling the modeling, the processing as streaming applications.
Edaena Salinas 00:18:20 I see. So you’re saying if you have a component that’s continuously generating data, that in itself is enough to switch to streaming. What about a scale? Is there a certain scale where we should start thinking about string processing or just with the idea of, we have something continuously being generated, let’s just wire up a stream processing architecture to it.
Stephen Ewen 00:18:43 I think in, in, in many ways, as soon as you have this continuous data, continuous processing continuous computation, this is, this is a good, good reason to actually start, start thinking about a streaming architecture. So we see, we see users like from our perspective users of Apache Flink that use stream processing on as many notes as like one, two, three notes for, for very low volume, for very volume, low volume data streams, or as many as 10,000 notes for extremely high volume data streams. So the technology really works on very small scale and very large scale, and it is really all about modeling your problem as a modeling application, as a stream processing application. That’s something that does continuous processing versus, you know, a certain scale that says, Oh, at this scale, I should really go rather for stream processing. Versus at that scale, I should rather go for batch processing
Edaena Salinas 00:19:37 Earlier when we were talking about very early days, architectures, you mentioned that one improvement in particularly in the batching one, but it can still be used for streaming architecture. Is this pubsub and message cues? Can you explain what pub solve is,
Stephen Ewen 00:19:58 If you look at that data being like streaming data being, being produced and being, being made available for, for processing it’s, it’s very often quite beneficial to like to decompose the stream processing part and the leg, the data producer part, and, and put, put the data somewhere where it can be consumed a miss industry make fashion, but, you know, it could also be, it can also be buffered there for a while. It can be reconsidered roomed in, in case we, you know, we have a failure in the application and need to like redo a little bit of work. And these systems that, that, um, that are somewhat there let’s, you can think of them as being the equivalent of the, of the distributed file system in the streaming space where the distributed file system or the storage for batch these systems are the storage for streaming those, yeah, those, those play the role of being the very often, like the first ones to, to receive the data and to buffer and store it at least for, for a certain while. And those are very often either pops up systems or
Edaena Salinas 00:21:02 I see. So what you’re saying is one way of structuring the streaming architecture is to have this section that handles data producing and this other piece, that’s more about the stream processing part. And that it, that in it says is the pops up, you’re publishing the data you’re producing it. And the sub, which is a subscriber, is just taking that data and doing some computation of it. So at a higher level, there’s two main pieces in a stream architecture, right?
Stephen Ewen 00:21:40 There are two main pieces, I would say probably more if you, if you look more towards the edges. And I think they’re, they’re interestingly very, very analogous to the pieces of a batch processing architecture in a, in a batch processing architecture, you have better storage, like a distributed file system. You have to batch processors to do the computation. You often have a piece that does the, like the data, the data ingestion, and you often have something that takes the result of the batch processor and does something with it either, you know, that continues to push it, push it down the like further downstream steps of the workflow, or are presented in the form of a dashboard or something like that. And in the, in the streaming architecture, that’s really very similar. They’re the producers, the producers typically write into some form of log or message queue, which is the storage piece.
Stephen Ewen 00:22:30 And at least the, let’s say short term storage piece, short term, mid term, depending on, on what exactly you use. And then there’s the computational part, the stream processor, which is the analog to the batch processor, and then usually have some form of, of, of system that takes the results. And, um, yeah. Does something with it, this system, you know, and the results can be really anything. The results can be actions that are triggered in the infrastructure by RPC or, or dashboards that present like real time analytics or, you know, just be storage systems or other streams for, for downstream consumption.
Edaena Salinas 00:23:06 I see. And you mentioned the message queue is short term storage, right?
Stephen Ewen 00:23:12 Um, yeah. So the way that most of these, most of these let’s say message cues or logs work these days is that they, they have a certain depth kind of like a limited lifetime. There’s some projects that try to work around that. But the, I think the, the, the state of the, of the art, like the popular projects all have a, have a limited, limited storage time or limited storage capacity. So when, when the data first goes into the log or into the message queue, it stays in there for, for a while for, for various applications to consume. And there’s very often a like one of these applications that consume the data is very often, you can think of it as some form of backup application that that takes it, um, maybe filters it, compresses it, um, for some form of longterm archival and in some form of bike storage, like again, a distributed file system. This makes, you know, this makes in many ways sense to use a distributed file system there because at this, at this time, you know, when the data actually is accessed from there, it’s no longer fresh data anyways. So it’s not really important to, to serve it from a message queue. And there are projects that try to combine these pieces message queues that age out data to distributed file systems automatically.
Edaena Salinas 00:24:32 And when data reaches his message queue, has it previously undergone a transformation or is it just directly putting the data the way it was formatted, for example, for a log, or is there up a step prior to,
Stephen Ewen 00:24:48 So when the, when the data hits the lager, the message queue like the logs and message cues are really the, or the star or the data storage both for, for the fresh, the, let’s say the initial raw data as it comes directly from the producing service, but very much also for the data that comes out of their first preprocessing steps. So, um, very many setups that we see have, have, um, have a, have a locker, a message queue just for the raw data stream that have a stream processing job that does initial, like initial filtering transformation, cleanup, and Richmond, and then writes it into multiple other message queues and logs for different downstream consumers. And then, you know, there may be different consumers that do do some other massaging of the data and write it again to a logger message queue for even further downstream consumers. So it it’s really, you know, in the same way as one batch processing job can write a result to the files distributed file system. And then the next batch processing job can pick it up from there. And the same way stream processing jobs can pull their source data from a message queue and write it down stream to a message queue. Again,
Edaena Salinas 00:25:50 What’s an example of a piece of data that reaches the message queue just to illustrate what are some of the things that are, can be logged by systems. We can talk about it in terms of that web service system or the anomaly detection or some other application.
Stephen Ewen 00:26:08 It, it can be really almost any event that, that reflects something happening, right? It can be machine generated events. So for, for some use cases that we see, it’s actually just servers reporting or microservices reporting what that just did. And it’s just, it’s just used for, you know, to, to maintain metrics of what is internally happening. It could be something like, you know, a credit card transaction that, that hits the message queue as in, okay, somebody is trying to do this transaction and the streaming application. Now computes a, let’s say a risk. Do we believe this is a fraudulent attempt? Or is it a valid attempt? It can be a user hovering over an item on the website, which, which feeds back into the message queue and into the stream processor that then, you know, updates the updates to the user’s behavior model, which is again, input to the website computing, what to show next, um, like which product to suggest next to. So it can be a, an event generated by, by an app, like a, you know, like a geocoordinates that goes to the message queue and the stream processor, um, consumes the stream of like, of, of pings with, uh, with geocoordinates to actually try to compute traffic models of a certain region. So with the, um, it’s really very, it’s very broad what it could be.
Edaena Salinas 00:27:25 Let’s talk about the next step in the architecture, which is a computational component when we’re dealing with streams, how can computation be done in this continuous data?
Stephen Ewen 00:27:39 So, in, in, in some sense, there’s like the, the first piece is, is very simple and application that just reacts to events as an application that just either receives, you know, receives events, being pushed their way, be that, you know, in, as being in reactive application as reacting to, um, to RPC calls even, or something that just, just posted on the message queue and does something with the event and like this, this kind of simple case I think, you know, has, has been around for a while, the, the, the tricky pieces, how do you implement a system that computes over these events? Just that in like in a, in a very consistent way, like in a photo, in a very full tolerant way, let’s, let’s assume the simple case we’re reading from a message queue, doing some computation and writing to a message, making sure that whenever something fails, we do not actually accidentally do repeat it to repeat it work.
Stephen Ewen 00:28:37 Like we’re not publishing an event while we recover from the failure that we have published before that can create wrong results downstream. And even more challenging part is once we’re not looking at this in a, in a stateless fashion anymore, once we’re looking at applications that take an event, but they can’t really look at the event in isolation, they need to understand the context. Let’s say the event represents a user interaction. They need to have the context of what were the other interactions that the user did before. Um, in order to, you know, model its, um, its session behavior, or if the, if the event is a, is a credit card transaction and we need to take other contextual data into account, like what, uh, what other transactions are ongoing at the same point in time, let’s say between the same source and target accounts or what was, what was the other, uh, previous behavior of the user, you know, for a certain type of, uh, for certain type of transfer whenever we need this contextual information.
Stephen Ewen 00:29:33 And it, it very often, it’s not just like reading this context or information from some auxiliary database, but taking the event, take processing it with the context in mind, also updating this contextual information. Whenever we have this kind of pattern, we, we, um, talk about like stateful processing, stateful stream processing here. And like, this is the, this is really the much more challenging port, because what you can think about is all of a sudden this computation now has the, has both the role of the computation and of, um, of data consistency in some way, right? Because they’re, they’re maintaining the state of the application, the contextual information for each individual requests that the application processes in conjunction with the computation and maintaining, maintaining a big consistent view on that without sacrificing performance and scalability that has been, if you wish like the grant challenge of stream processing over the last years and the like efficient solutions to that problem are what I think midstream processing technology really take off.
Edaena Salinas 00:30:37 So this computational component is the one that would have that logic for gathering the context. Is that correct?
Stephen Ewen 00:30:47 Um, I wouldn’t call it gathering the context. It, uh, if you do, if you do stateful stream processing, think of it as you’re just, you’re just some application that’s on program. I wrote in that, you know, Java C-sharp C Python or so, and you you’re, you’re handling a bunch of a bunch of events happening and on somebody, somebody sends you an event call it’s called a function and, and you’re just treating weird. You’re treating this and you’re, you’re trying to, to also maintain the context and you would, you’d probably, you know, have some local variables where you store, okay. I’ve seen so many events for that user. Let’s say in the last, in the last minute, I’ve seen so many events of that certain type in, in the last, uh, in the last day. And so on something that you would, you know, you’d store it in a local variable.
Stephen Ewen 00:31:37 Um, if you could just put a very simple standalone application, if, if you then of course, try to make this full tolerant, you know, like a, an application that, that can handle failures that can, that can, can recover. That is resilient. Then very often you don’t store this in, in local variables anymore, right? So what you do is you try to connect a database and you say all these things that, you know, I need actually to be able to recover, I store them in the database or whenever an event comes, um, um, you know, looking something up in a database, doing some computation and writing something back into the database, you may not do this explicitly. It may be hidden from you by frameworks that do that in the background. But in some sense, that is, that is what, what happened for most applications that just reacted to two events to, to real time data.
Stephen Ewen 00:32:21 And the big thing about stream processing is you don’t have this anymore, the separation of computing and storage, but you have actually a stateful computational component that, that does this computation and state management consistently together for you. So it, um, it’s really a, a of the, of the whole architecture of, of computation and state management. And this, yeah, this has, I think this has, um, this has both big implications on real time data processing, because it is much faster than talking to a database in the background. It’s much faster than if you talk to the database and the background, you’re, you’re almost back to the original architecture, you know, events come, you ingest them in the database, and then you try to compute something with either, you know, if you compute in real time, you may have to find some triggers on the database. So, or you, you frequently one curious, but instead of doing that, you really let the application directly handle the data and the, the streaming application that has to worry about the full tolerance and the recovery and consistency of the data and the stream processors, giving you the, giving you the, you know, like the building blocks that take this heart challenge, that soft, that her challenge for you to actually do the, the consistency and persistence, um, and recovery.
Stephen Ewen 00:33:36 And, you know, in the same way, also the scaling out and scaling up that is, that is really the core of, of stream processing technology. And we, we talked a lot about like, uh, applications and databases. And this, this is also where I think this is, you know, where this is starting to make a big impact. It has started to be something more than just a faster way or a more low latency way of data processing. As we do it with the batch processing, it’s actually started to become a different way of building data driven applications, rather than having like a status application server that talks to a database. Whenever an event comes whenever a user sends a request or something like that, you really fuse those two things together into a, into a state full stream processing application that, that reacts to, you know, like the events coming in the stream and, and manages the state itself.
Edaena Salinas 00:35:01 Yes, I really see the value of adding this state part to stream processing, particularly in the case that you mentioned about credit card transactions or taking into account, what else has the user done before for the, the stainless one? I just want to understand what would be an example where it wouldn’t really matter to take this context into account.
Stephen Ewen 00:35:24 I think very simple examples of status applications may be, you know, stream processing applications that just route data. So you receiving a raw data stream. And, um, let’s say for every, that comes in the raw stream, you’re, you’re, let’s say filter, filtering it, uh, re encoding it, and you’re riding it to another message queue downstream, and, you know, maybe writing a copy to the distributed file system storage for, for archival. In this case, you really treat each event individually. You don’t need to know any contextual information across events. Um, these applications are pretty much stainless. They may maintain, you know, like a minimal amount of state and the sources and sinks to, uh, to coordinate, you know, like avoiding duplicates and recovery. But that is more like implementation specifics on how, how the, how the consistent integration with the source and target systems works. The Epic, the core publication itself, doesn’t, doesn’t maintain contextual state. So things like, you know, streaming real time ETL where you don’t pre aggregate, but you really only, you really only filter on transform. That would be an example.
Edaena Salinas 00:36:32 What I’m trying to understand is does data from different parts of the application ended up in the same message queue for processing, even if it’s not very related to each other, for example, we have the credit card transactions information, but we can also have data about user behavior in, in our web application or something like that.
Stephen Ewen 00:37:01 I think in many, in many ways, um, yeah, the set ups, we see you as very similar, like it logs a message queues for, for these events because there’s, there’s not really a need to treat them to treat them differently. Right? And like in a first order approximation, they’re just all bytes that describe something that happened and that carry that information. And also their availability should trigger something, right? These are the two main components of an event carrier some information and a trigger, some action, and all of them exhibit that behavior. What, what exactly, you know, their, their purpose, their meaning is at this at the like four for the rough architecture that doesn’t really make any difference. No,
Edaena Salinas 00:37:51 You’re one of the founders and creators of Apache Flink, which is an open source stream processing framework. Can you explain in more detail what it consists of?
Stephen Ewen 00:38:07 Yeah. So Apache Flink is it’s a pretty broad system. These days it’s it’s has grown. It has grown a lot. It is in its core a system to do exactly what we said before to, to compute over data streams in a, in a stateful fashion. That’s, it’s, that’s basically, it’s its core mission. Now, Apache Flink takes a very broad interpretation to that. So over, over the course of this discussion, we, you know, we talked about batch processing, we talked about nuanced stream processing as fast, faster batch processing, faster analytics. We talked about stream processing, kind of being a model to try to create applications. And Apache Flink really tries to take this very, very broad interpretation to what stream processing is stream processing is the unifying paradigm to do, to do any form of like continuous analytics and to do like continuous applications. And it, it, for example, interprets batch processing as a special form of stream processing batch processing is our stream processing as computing over an unbounded amount of data with content with continuously existing logic.
Stephen Ewen 00:39:19 But in some sense, special assessing is, is a bit of a special case of that just right. So instead of having an infinite data stream, you have a finite data stream that is your finite data set. And whether the application actually continues to run for long or short, and then you start another application doesn’t really matter to Flink. So in that case, it actually does batch processing as well. And it, it actually, um, it doesn’t just look at batch processing as, as only doing stream processing with slightly different parameters, but it also has, has certain optimizations in the system, um, for, you know, like assumptions that you can make. If you, if you work on Bounder data versus unbounded data, for example, you can, you can do different forms of scheduling. You can use different forms of data processing algorithms. If you know that you’re working on unbounded data, it gives you, it gives you the abilities to Flink gives you very much the abilities to, to model stream processing in a, in a highly, in a highly flexible and expressive way.
Stephen Ewen 00:40:19 So one, um, one core part of stream processing of building stream processing applications is treating off data completeness and latency. I think you can think of it like that. If you only look at batch processing and vet processing have a fixed set of data, and that is all you ever have when you have it all at once, if you wish in stream processing data comes continuously. And for a lot of, a lot of results, there’s, there’s this, this trade off, like how long do I wait for my results? It might be a little more accurate if I wait a little longer, you know, if I’m, if I’m, if I want to, let’s take the example of some mobile application that, you know, sends sense event into our engine. And these, these events might, you know, they might be effected by some transport problems or so in our coverage issues, you know, they might not be sent all immediately, but they might be sending bundles in order to save some data.
Stephen Ewen 00:41:10 And then when you actually do analytics at any point in time, you can say, let me wait a little longer, and I can take a little more data into account to really understand what happened within this hour, because some data might still be late. On the other hand, you might not want to wait for long because that would hold back your results. So like, very like this is one of the, one of the absolute core elements of real time data processing in general, we’re stream processing, um, in Apache Flink gives you really good tools to, to model that for. So this is, this is the very, the very core of link, right? Continuous streaming really embrace the nature of all the streaming of all the streaming questions, giving you the right tools to, to, to model the application, doing batch processing as a special form of streaming drum stream processing, very robust, very scalable, um, Flink powers, some of the largest stream processing setups in the world that we personally actually know of.
Stephen Ewen 00:42:03 And it, it tries to, to also, um, innovate in like in other areas, like, for example, how do you, how do you specify stream processing applications, right? You can specify them as applications and Java, Scala and Flink, but there’s lots of reactive development on, on the layer for streaming SQL. So it kind of comes with a, with an interpretation of what SQL means in the streaming world and like very, um, like a very powerful implementation of, of, of SQL that actually has a very consistent meaning between the batch world and the streaming world, but still comes with like with certain extensions or built in functions that make a lot of streaming use cases very well representable in, in SQL. So it’s really a pasture of lingers. This is this big project con like computation over data streams as the heart at the heart. And then, then all the things around it that, that you need to, um, act to, to model the applications with different API APIs, um, with the right tools and, um, spending the whole breadth of the spectrum, batch processing, you know, continuous analytics, continuous data driven applications.
Edaena Salinas 00:43:17 What does the tooling look like for what’d you talked about just now where you’re trying to determine at one point, do you stop and provide our results when you, like you said, you might need just a little bit more data to produce something meaningful. I just want to understand, in what ways can users specify this, or does the system determine that part?
Stephen Ewen 00:43:46 Basically the core of the event, our model, that Flink implements, which is like highly inspired by the data flow model, um, from, from Google and Apache beam, the core part of that is first of all, making the switch from what is called like event processing time to proper event time. So we’re in like events are interpreted by the time that they were created or stored. They’re not interpreted as belonging to the time that they’re processed in the stream processor. And that, that is actually not, that’s actually not a big deal if you think about it, right? This is pretty much like, like in, in most batch processing systems and in SQL, you know, there’s a field in the event, that’s a timestamp. And in order compute what has happened per per hour, you sort of round the timestamp down to the full hour and grouped by it.
Stephen Ewen 00:44:39 So like having first class support for that in the stream processor, it just makes sense. Especially if you also look at batch processing as your use cases, right? The interesting part is when you have that, and the stream processor is very aware of this, this timestamp, that is, that is important. It’s not just something that’s hidden from the stream processor. It has to be aware of that time. Then you give the user the way to define a, um, like what do we call watermarking strategy? And that really encaps and encapsulates, um, this, this note, this trade off between latency and completeness, right? You can think of the, the watermark being the point when, when you say, okay, my data is now complete, Eucharistically complete. And I want to want to act on it now and how aggressive you define this to be depending on, you know, like the current characteristics of your data stream, this really encapsulates the trade off.
Stephen Ewen 00:45:33 That is one part, another part is, is for example, that, that, that Flink, um, offers like a very flexible programming abstractions to interact with. Um, like with timers that are, you know, triggered by either clocks or watermarks and so on in order to, to allow you to say, OK, I’m, I’m, I’m acting on data, I’m acting on, let’s say on what our market, but I’m also acting the latest at this particular point in time, you can take events, you can schedule them to be, um, reprocessed, uh, later, for example, when, when you have a little, a bit of a more complete view of your data, and you know, all those tools playing together is, is basically the tour chain that you have in order to, to model this behavior between latency and completeness.
Edaena Salinas 00:46:16 I see. And if I understand this correctly, you mentioned event time and event processing time event time would be the one, right? When the data is generated right by an IOT device or something,
Stephen Ewen 00:46:31 In some sense, it is whatever time you assign the event. Initially it’s quite often when the event was produced. It might also be the time when the event, you know, hit the gateway server. It might be the time when the event was first written into the message queue it’s, whatever time you assign the event, the important thing is it has to be a stable time and it, it should really, it should reflect something about that event, not about when you launched the job that processes the,
Edaena Salinas 00:46:59 And the processing time. Would that be when it reaches the computational component like Flink or it kind of, yeah,
Stephen Ewen 00:47:07 The processing time is really pretty much what clock time when it is the time when it, it reaches a specific operator. And, um, in, in many ways the like applications, let’s say, if you just want a simple application, you might be completely okay with just saying, you know, I’m aggregating my events in windows by event time, I have a watermark that is either perfect because, you know, my Windstream is decently well behaved, and I can encapsulate that in the watermark generator regeneration strategy, or I have a Eucharistic watermark and, and you’re done, you know, this is, this is what produces your, your output results. And, you know, if the watermark falls a little behind and the latency increases it’s okay. Um, you’re dealing with that. If you’re, if you’re doing very sophisticated applications that have like very hard reattempt requirements, you’re very often take processing time into account and say, okay, I also need to take into account that, you know, if, if event time, if the watermark Cigna me, signaling me reasonable completeness, if this actually takes too long, actually want to act anyways, just based on, on a local clock. When the event first comes, I may say, you know, defer to later to when the data is more complete, but also no longer than a hundred milliseconds in, in this way, you kind of combine both notions of time in order to create something that, that trades off completeness. And
Edaena Salinas 00:48:32 Earlier you mentioned that another part that you’re working on an Apache Flink is the finding what sea SQL means in this context. What I’m curious of is you work with, well, fleeing has been used by clients like Netflix, Alibaba, what are examples of queries that can be done to streams?
Stephen Ewen 00:48:57 So, so let, let me take, take back a step to answer that question. So the, yeah, the, the interpretation of, of SQL in the streaming space is actually is it’s a really kind of, um, I think a bit, a bit of a newer thing. It’s it’s um, so Flink Flink, the Flintco immunity is very much, uh, cooperating with the Apache beam community and with the Cal side community to define that. And in our model, there is SQL is inherently defined on tables. So let’s actually, you know, streaming SQL should also be defined on tables, but exactly in the way that, you know, classic SQL is more like batch workload, the table is, is viewed as a aesthetic, as a fixed thing. Streaming sequel takes a view of the table as a dynamic thing. It’s a, you know, it’s a table that changes over time. So the result of a SQL query that goes over a static table is a static table.
Stephen Ewen 00:49:52 And the result of SQL query that goes over a dynamic table is again, a dynamic table. And the semantics stay pretty much exactly the same. It’s actually a concept that is somewhat hidden in some databases in the form of materialized views and materials, few maintenance, and some of the more advanced than that warehousing to their business, especially. And we kind of took take this as a, as a, as a base semantical model for, for streaming SQL. The question then becomes for that kind of threefold, how do you derive the dynamic table from a of events? How do you turn the result dynamic table into a stream of events? This is kind of the question, how do you, how do you connect to the outset world? And this is where, you know, like users have quite a bit of flexibility to describe how to derive the table and recreate streams from the table.
Stephen Ewen 00:50:42 And then inside SQL, it’s all about which computations can actually work efficiently on changing tables to produce a changing table, right? You don’t want to recompute the SQL, the SQL queries work every time a change to the dynamic input table happens. We can theoretically do this, but it’s not going to be very usable at any non-trivial scale. So in some sense, all workloads that lend themselves to be to this incrementalization. So you can dynamically compute the results based on changes of the input. Those are, those are very good fits for stream processing concretely. There’s a lot of, a lot of crease for that, for which that makes sense. Some of them are not yet possible in Flink, especially if it comes to the point that they use very complex forms of let’s say, sub grays and correlated sub crews that, you know, the optimizer can not rewrite, then there’s no good efficient way to like incrementally update the results.
Stephen Ewen 00:51:42 So those, those don’t lend themselves for streaming sequel execution, but we’ve actually seen that like within, within the, the, some of the bigger users of link, like Alibaba, like Uber, um, like, uh, Yelp and Lyft there’s, um, like a lot of these, a lot of the use cases, staffers, stream processing actually match ask you all pretty, pretty decently there’s, there’s a, there’s a good mix. There’s some like very kind of advanced special applications that they, that they implement directly. Let’s say on the more low level streaming API is to, to really exploit very specific capabilities of link. Very often, these are the, also the heavy lifting jobs, the, um, the ones that run it, like a very large scale that do something like very specific and important, but there is, let’s say the there’s the bulk of, of applications that do some processing that contribute to, to some servers that actually are a pretty good match for SQL. So I wouldn’t go to the point to say, you know, SQL is all you need for streaming, but it’s probably a good 80 20 principle there. You know, you can probably with a good sequence of port capture, 80% of the, of the common use cases, and then 20% you offer to actually go down, go down the stack to like a more low level API, which gives you better control of what happens
Edaena Salinas 00:53:03 When I was researching for this interview. I saw that Apache fleeing has three abstractions. We’ve talked about two of them, the analytics component. We also talked about stateful applications. You mentioned right now, the, the other abstraction, which is more of the low, that explores specifics of Flink. Can you explain a little bit what this means, this low level abstraction?
Stephen Ewen 00:53:34 So one, one thing to maybe point out at the beginning is these are, these are in some sense you can think of them as slightly different abstractions, but they all kind of blend together or interact very well. Right? So a single program can actually use all of these together. You can start out in SQL, and then you can say the result of this dynamic table, I converted to a data stream. Then I do some operation on like classical data stream windowing, some computations, and then I’m using some more low level piece. Like that gives me a more raw access to state and time the, I guess, the, the APS that you were referring to were really the like streaming API versus the low level process API. And actually from Flink, if you look at the Flink code base, they’re really within the same API, the lower level piece is just something that exposes a little more, the raw ingredients is, um, on which other operations are built.
Stephen Ewen 00:54:30 So if you, if you think of streaming as, as a processing paradigm, that takes us stream of events and then treats a lot of events together, for example, by defining some things like, you know, give me the, give me the average time a user interacts with the service. And like you build a user session. You compute how long this is compute the average, let’s assume we’re computing this and the data stream API, not in SQL for now. This is really a bit of a computation where you don’t look all that much at individual events. You really looking at all these events together. You’re more thinking about the stream as a whole. The more you actually go to like a very complex application with, with sophisticated business logic. The more often it is actually the case that you receive an individual event and you’re doing very specific, very specific logic are depending on this individual event.
Stephen Ewen 00:55:29 So for example, based on, based on the type of the event, handling it differently based on some, some, some fields of the event, treating it immediately or deferring it to later or so on. And this is what this, what this lower level abstraction gives you red. Um, it gives you a nice way to say, give me an event, give me access to give me access to times or give me access to time. Meaning, let me actually understand what is current processing time. What does current event time? Let me schedule timers that call me back when a certain event time has been reached, um, even taking the event and putting it in such a timer and say, just give me the event back. Actually, when a certain event has been reached, I’m a very raw excess to, to interact with state and so on rather than state being implicitly handled by, you know, windows and window aggregates. And so on, this is more the low level abstraction. And if we actually look at the use cases of link, we’ve kind seen, seen things, um, kind of the more you are doing analytics, the more you gravitate to the let’s say data stream or SQL API, and the more you’re building data driven applications, the more people tend to gravitate to the, to the low level API, to the very raw, you know, access to event processing time and, and stage. That seems to be a pattern
Edaena Salinas 00:56:43 Customization in that case is one reason, right.
Stephen Ewen 00:56:47 To get more customization of like, of the lack of the pipeline as a whole. Yeah, exactly. But, um, yeah, the more you deal with applications, you’re more, more, you worry about the individual event. And I think the more you deal with analytics, the more you’ve really looking at groups of events, windows of events, you know, there are of course are individual steps like filtering and, and, you know, event transformations or so on that where events can be transformed into video. So like the big pictures you’re really interested in the sum of events rather than very custom logic per individually.
Edaena Salinas 00:57:25 The last question that I want to ask you is about the next wave of stream processing. What are some of the areas where you see architecture’s evolving stream computation evolving?
Stephen Ewen 00:57:39 I think there is various areas where stream processing stream processing is still evolving, like on the application API side, I think SQL streaming, SQL, um, still is, is actually still evolving in its in itself, even though the, the model, the semantics, that definition is actually quite like quite simple and quite complete in itself. If you actually look at how do you, how do you make this usable for a lot of common use cases, there are a bunch of, you know, constructs you may want to add, and you can actually add them, you know, without, possibly by extending the SQL standard, but possibly also by, you know, like using certain crafty ways of using the existing SQL standard, um, masking them as you, the essence on those, there’s a bit of work, for example, we’re doing in Flink in order to get, get typical, like join and enrichment semantics that are very common in stream processing field, like very natural in SQL.
Stephen Ewen 00:58:39 So on this, on this application API layer, there’s quite a bit of work, a piece of work that, that, um, somewhat related to that, that we did outside, um, Apache Flink, that the company that the origin of creators of Apache Flink founded that our artisans that we’re working on is actually, um, we’ve built a system for you. You can think of it like running distributed asset transactions over streaming data transactions that take, you know, not, not an individual, an individual key or so into account or an individual role, but really like a multi-role view. So like without diving into the details, it’s a little bit like opening up stateful stream processing from the capabilities, let’s say compute plus a key value store to compute plus a relational database, which is, we’re just quite a bit more powerful. Um, I think this is, this is a very interesting direction also in which, in which stream processing still can, can develop going back to the open source, um, to the work in, in open source in stream processing itself.
Stephen Ewen 00:59:40 One thing where stream processing is still where there can still be today, a reason to use batch processing rather than stream processing is if you wish resource efficiency, because batch processors still tend to be a bit more dynamic in their resource utilization. You know, they, they schedule more fine grant tasks on fine grinned, finite pieces of data, um, which, um, in, in many setups leads to fewer resource consumption compared to like long running streaming jobs. So there’s quite a bit of effort going in the Flint community to, to introduce more like automated scaling automated adjustments to Theresa’s needs. I think this is going to be, it’s going to be one of the last big things. We’re we’re batch processes have a certain like advantage over stream processors. There’s a whole lot of work going in providing a more unified view of debt batch and stream processing.
Stephen Ewen 01:00:35 Um, while the data flow model suggests really good way to do this in the API APIs, there’s still quite a bit of work needed, I think on the storage side and on the runtime side to, to get this, um, really unified behavior that, you know, stream processors when they work on historic data, even if they they’re not really put into a batch mode, they can still, they can still explore some of these advantages of bound the data. And then finally, there’s always, um, you know, the work to make, to make things like more scalable, faster recovery, zero downtime, upgrades, all this, all this work that, that becomes interesting. Once you actually say, Hey, we have the stream processing technology that, you know, you know, actually is a really cool way to build applications. Let’s now build really highly available, highly critical realtime applications with that, you know, and the characteristics that you actually need once you start doing applications that need, that need a lot of nines of availability. Stephen, thank you for taking the time to come on. The show has been great talking to you. Thank you very much.
[End of Audio]
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected].