Dhruba Borthakur

SE Radio 469: Dhruba Borthakur on Embedding Real-time Analytics in Applications

Dhruba Borthakur, CTO and co-founder of Rockset, discusses the core requirements of real-time analytics, how it differs from traditional analytics, and the use cases that benefit from it. Host Kanchan Shringi spoke with Dhruba about the evolution from batch to streaming to real-time analytics and the relation to big data. They also explored how people implement real-time analytics, as well as the limitations and hence the need for a new architecture. Dhruba also spoke of the aggregator-leaf-tailer (ALT) architecture and the why, what, and how of embedding real-time analytics in SaaS apps.

Transcript brought to you by IEEE Software
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected].

SE Radio 00:00:00 This is Software Engineering Radio, the podcast for professional developers. SE Radio is brought to you by the IEEE Computer Society and IEEE Software magazine.

Kanchan Shringi 00:00:16 Hello everyone. Welcome to this episode of Software Engineering Radio. This is your host Kanchan Shringi, and today I’m really happy to welcome Dhruba Borthakur to our podcast. We’re going to talk about embedding real-time analytics in applications. Dhruba is the CTO and co-founder at Rockset. He was the founding engineer of the RocksDB data store at Facebook. Earlier, at Yahoo, he was also one of the founding engineers of the Hadoop Distributed File System. Dhruba has also been a contributor to the open source Apache HBase project. He previously held various roles at Veritas Software, founded an e-commerce startup called Oreceipt.com, and contributed to the Andrew File System at IBM-Transarc Labs. You have quite an impressive resume, Dhruba. So happy to have you here, and really looking forward to today’s conversation. Is there anything you’d like to add to your bio?

Dhruba Borthakur 00:01:09 No, thank you. Thanks for inviting me. Really appreciate your invitation here. I’m very excited to be talking to you about real-time analytics.

Kanchan Shringi 00:01:18 Let’s get started. So before we jump into it, I do want to point out that we have done a few related episodes in the past, and there are about five or six episodes. We’ll have a link to those in our show notes and include any references that we discovered during our conversation as well. So why don’t we start with a very basic question? What is real-time analytics?

Dhruba Borthakur 00:01:43 Yeah, that’s a great question. I have been asked this question many times, and sometimes it’s difficult to explain as a definition, but what I can try is to give you examples to define what real-time analytics is. So real-time analytics is typically a way for somebody to look up very fresh data and make decisions based on this data. Take, for example, you are trying to make a prediction, or you are trying to understand relationships between different events happening in your system, and you want to do it in real time, in the sense that you want to do this decision-making as and when the data gets produced. So this is real-time analytics. It usually has two characteristics: the freshness of the data, and how quickly you can make decisions based on this data. Sometimes people also talk about real-time analytics as analytics that are very customer focused, because if you try to make decisions on behalf of a customer or user, these are people who are online, who are waiting for the results to appear.

Dhruba Borthakur 00:02:50 So the queries need to be really fast for these kinds of analytical applications. A very simple example of real-time analytics could be, take for example, Google Maps, right? You’re driving down the highway and there’s an accident. Now your app, in this case the Google Maps app, has to show you an alternative route, and that has to happen in real time. It cannot be based on traffic data that was created like 10 minutes ago, right? So this is a very simple example of how real-time analytics is built into some of these backend applications that are becoming more and more prevalent these days.

Kanchan Shringi 00:03:29 So the two things are, of course, someone’s waiting for the data. They’ve asked a question, they want a response right away, but also the decision, the response, has to be based on really fresh data. And you provided a great use case for that.

Dhruba Borthakur 00:03:43 Yeah, the freshness of data is super important for real-time analytics. There are many other systems in the analytical domain that are about making decisions, but a lot of those applications make decisions based on old data. For real-time analytics, the freshness of data is super important.

Kanchan Shringi 00:04:05 So you brought up other applications. So this is a good time. Maybe you can contrast with how traditionally OLAP and OLTP operations are understood.

Dhruba Borthakur 00:04:16 So if I contrast with traditional analytics, you brought up these terms, OLAP and, say, OLTP, right? So take, for example, OLAP. People refer to OLAP as online analytical processing. So these systems are online in nature, but again, the data is usually not as fresh as people would like it to be. Most of the applications that are out there build cubes, for example, or they build pre-aggregations. They try to summarize the data in a more coarse-grained format so that queries are fast on this data, which basically means that they curate the data and clean the data to put it in a queryable form. And after that, when they serve the queries, the data is not changing in nature. So by definition, most of these OLAP systems are ones where you take a bunch of data, you process it, and then you start to use it to make decisions.

Dhruba Borthakur 00:05:14 So by definition, it’s not real time, because by the time you took this data and made it queryable, you were doing some kind of batch processing. This is how traditional analytics happens. These OLAP applications typically need to scan through a lot of data, and they make something called cubes. A cube is more like a bucketization, where you aggregate, or rather pre-aggregate, along different dimensions. And because of that, the queries are fast. But the challenge again is that because it’s pre-aggregated, it doesn’t take any of your fresh data into account. And that’s how traditional analytics works.

Kanchan Shringi 00:05:54 And the processing takes time, as you mentioned.

Dhruba Borthakur 00:05:57 Processing takes time. Making the cubes takes time. The cubes don’t change once you make them. Let’s say you make a cube every day, or 10-minute cubes, right? Now you can’t really backfill data into these 10-minute cubes that you already made. So it makes your analytics not real-time in nature. In general, the only way to change this model to make it more real-time is to build a lot more real-time indices, or lots of pointers into this data, so that you can update things in place. People have built indexes on this data rather than cubes. Building an index is a great way to make data real-time, rather than building a cube, because indexes can be updated as and when new data comes in. But yeah, there’s a big difference between how traditional OLAP queries happen and how real-time analytics typically occurs in real life.
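
The cube-versus-index contrast above can be sketched in a few lines of Python. This is only an illustration under assumed names: the delivery events, `build_cube`, and `CityIndex` are hypothetical and do not come from the episode or from any particular product. The cube is a snapshot computed once from a batch, while the index accepts a late event and reflects it in the next query.

```python
from collections import defaultdict

# Hypothetical delivery events: (event_id, city, minutes_taken).
batch = [(1, "SF", 12), (2, "NYC", 20), (3, "SF", 18)]


def build_cube(events):
    """Pre-aggregate a batch once: (total minutes, event count) per city."""
    totals = defaultdict(lambda: [0, 0])
    for _, city, minutes in events:
        totals[city][0] += minutes
        totals[city][1] += 1
    # The result is a frozen snapshot; later events are not reflected in it.
    return {city: tuple(pair) for city, pair in totals.items()}


class CityIndex:
    """Index from city to events, updated as each new event arrives."""

    def __init__(self):
        self._by_city = defaultdict(list)

    def add(self, event):
        _, city, _ = event
        self._by_city[city].append(event)

    def average_minutes(self, city):
        events = self._by_city.get(city, [])
        if not events:
            return None
        return sum(minutes for _, _, minutes in events) / len(events)


cube = build_cube(batch)
index = CityIndex()
for event in batch:
    index.add(event)

# A fresh event arrives after the cube was built.
index.add((4, "SF", 30))

cube_avg = cube["SF"][0] / cube["SF"][1]   # still based on the old batch
index_avg = index.average_minutes("SF")    # includes the fresh event
print(cube_avg, index_avg)
```

The cube answer stays at the value computed from the original batch until the next rebuild, whereas the index answer moves as soon as the new event is added.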

Kanchan Shringi 00:06:52 Is there any relation to the amount of data? Is there, you know, where does big data come in?

Dhruba Borthakur 00:06:58 Big data is kind of a very ambiguous term these days, right? In 2005, I think, the big data revolution kind of started, or 2006, around that time. But what has happened right now is that almost all data sets are big data by definition, which basically means that if you have a system that deals only with small data sets, that is usually not interesting to most people. The other thing that I can add to your question is that big data itself has kind of two or three different dimensions. One is the velocity of data, which means how fast is this data coming in? One is the size of the data, which means how big is this data set? And the third one is the variety of data: how complex is this data?

Dhruba Borthakur 00:07:46 Is it all one format, or is it five different formats coming in? So for real-time analytics, I think it’s super important to be able to process different data formats. It’s super important to be able to deal with a high velocity of data. You might want to pin me down, and I would say that maybe somewhere between a few megabytes per second and a few gigabytes per second could be the velocity of data that’s coming in. This is new data produced by events, right? This is your streaming data set that you are capturing in your application, and you want to look at all this data to make decisions. So this is the challenge for real-time analytics. I keep going back to my thinking that real-time analytics is hard because of these reasons.

Kanchan Shringi 00:08:31 So, to summarize for me, you know, besides being fresh, it is typical that the data today will be of high velocity, changing, and of different variety, from a large number of data sources. And then you said streaming. So what is streaming analytics, and how does that contrast with real-time analytics?

Dhruba Borthakur 00:08:51 Yeah. So streaming analytics and real-time analytics are complementary in nature, I think. When we talk about streaming analytics, we usually refer to the fact that analytical processing is occurring when data is streaming into the system. I talked about, let’s say, a high velocity of data coming into the system. At that time itself, you could afford to do some processing: when new data arrives, it triggers a set of actions that generate insights from the data. This is streaming analytics, because it’s happening as part of the stream. It’s also sometimes called write-time analytics, where analytics happens when data arrives in the system. Once new data comes into the system, you can generate insights, and those insights can trigger more downstream analytical processing. This is all called streaming analytics. It’s like a data waterfall, like water falling from a higher level to a lower level.

Dhruba Borthakur 00:09:49 At every level, when data is flowing downstream, it’s doing some analytical decision-making. This is all streaming analytics. On the other hand, real-time analytics typically uses a combination of both write-time analytics and query-time analytics. So you could have some streaming analytics, which is building some write-time aggregations. But if you’re talking about real-time analytics, it also needs a lot of variability in the queries, a lot of complex power in the queries themselves, because your queries could be different from point to point. If my application is making a query to the database, that query could be completely different from another person making a query to the same backend source. So essentially, real-time analytics uses a combination of streaming analytics and query-time analytics to provide the real-timeliness of your application. So they’re two different, complementary things. If you have streaming analytics, it doesn’t need to be real time, but if you have real-time analytics, it’s usually the case that you need some form of streaming analytics and some form of query-time analytics built into the system so that you can respond to fresh data quickly, at high volume and at large scale.

Dhruba Borthakur 00:11:08 So they’re complementary to one another, I would say.
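
As a rough illustration of the two halves described above, the following Python sketch keeps a running aggregate that is updated when each event arrives (the streaming or write-time part) and also keeps the raw events so that a caller can run an arbitrary filter later (the query-time part). The class `OrderAnalytics`, its fields, and the sample events are assumptions made up for this example, not something described in the episode.

```python
from collections import Counter


class OrderAnalytics:
    def __init__(self):
        self.events = []                  # raw events kept for ad hoc queries
        self.orders_per_city = Counter()  # aggregate maintained at write time

    def ingest(self, event):
        """Streaming step: runs once for every arriving event."""
        self.events.append(event)
        self.orders_per_city[event["city"]] += 1

    def query(self, predicate):
        """Query-time step: the caller decides the filter when asking."""
        return [e for e in self.events if predicate(e)]


analytics = OrderAnalytics()
for event in [
    {"city": "SF", "amount": 30},
    {"city": "SF", "amount": 75},
    {"city": "NYC", "amount": 50},
]:
    analytics.ingest(event)

# Answered from the write-time aggregate, no scan needed.
sf_orders = analytics.orders_per_city["SF"]

# Answered at query time with a filter nobody planned for in advance.
big_orders = analytics.query(lambda e: e["amount"] >= 50)
print(sf_orders, big_orders)
```

A real system would use indexes rather than a list scan for the query-time part; the scan here only keeps the sketch short.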

Kanchan Shringi 00:11:11 Got it. So streaming is assumed in a sense, and on top of that you have real-time analytics, so there’s still work going on. There’s still processing going on. So how real-time is this? Or is this near real-time? Is it minutes, seconds? Can you talk about that?

Dhruba Borthakur 00:11:30 So again, I think when we talked about real-time analytics, we talked about freshness of data. And I think your question is, how fresh is this data? Is it one second of freshness, or is it one day of freshness? That’s an answer that depends on the application itself, but in general, when the industry talks about real-time analytics, they’re mostly talking about a data latency of, say, a few seconds to a few minutes. So let me backtrack and explain the difference between data latency and query latency. Data latency measures how real-time our data is. It is the latency from when the data was produced to the time when it is available to make decisions, right, when it’s available in the query engine. That’s the data latency. Query latency is, when you issue a query, how long does it take for the query to come back with results from your backend system?
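
The two latencies defined above can be written down directly. The sketch below is a minimal Python illustration with assumed names (`Event`, `data_latency`, `timed_query`) and made-up timestamps: data latency is the gap between when an event was produced and when it became queryable, and query latency is the wall-clock time a query takes to return.

```python
import time
from dataclasses import dataclass


@dataclass
class Event:
    produced_at: float   # seconds; when the source emitted the event
    queryable_at: float  # seconds; when the query engine could first see it


def data_latency(event: Event) -> float:
    """Time from production until the event is available for queries."""
    return event.queryable_at - event.produced_at


def timed_query(run_query):
    """Run a query callable and return (result, query latency in seconds)."""
    start = time.perf_counter()
    result = run_query()
    return result, time.perf_counter() - start


event = Event(produced_at=100.0, queryable_at=103.5)
lag = data_latency(event)

# Stand-in for a real query against a backend system.
result, query_latency = timed_query(lambda: sum(range(1000)))
print(lag, result, query_latency)
```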

Dhruba Borthakur 00:12:21 So for real-time analytics, the data latency is usually a few seconds to maybe a minute or two. Let me give you an example. Let’s say you are in a food delivery use case, right? You have an enterprise that is picking up food from restaurants and delivering it. Now there, let’s say the entire delivery cycle is 15 minutes, right? You need a backend system that can react to changes in delivery, like a delivery person’s vehicle breaking down, or the food not being ready or available. So there, the data latency should be less than a minute, because you need to react and your cycle is just 15 minutes, right, from pickup to delivery. So there the data latency that you need for real time could be really small. But on the other hand, if your logistics system is delivering luggage or goods from one place to another, from the East Coast to the West Coast, you might have only one train that comes every day.

Dhruba Borthakur 00:13:20 So there your data latency could be maybe minutes, or maybe an hour or so as well. The whole point is that real-time analytics essentially reduces the latency of your data. Sometimes people also ask me whether these systems are online or offline. That’s another way to look at how real-time this data latency can be. And my answer there is that real-time analytics systems are mostly online systems. They’re not offline backend batch systems. They’re systems where, if they don’t react quickly, your audience suffers an outage of the service, or suffers bad decisions that are not optimal for your business or for the users of your system. Yeah. So it’s not very clear-cut in my mind exactly what the number is, but it’s usually seconds to minutes.

Kanchan Shringi 00:14:13 You’ve given us a couple of examples of use cases. And in those there was already an action that had to be taken, like your Google Maps use case, for people to reroute. So essentially the goal is for people to perform an action, right? Not just being able to get the information, but being able to perform the action, in which case, of course, the user is performing the action. But there are other cases where, I guess, an automated action is also required. Would you be able to provide any such use case?

Dhruba Borthakur 00:14:45 So the real-time analytics systems that we see in real life are actually automated in nature. I took the Google Maps example just to give it a human color, but actually the decision-making process is very much automatic for most other systems. So take, for example, let’s say, a construction logistics delivery company, right? There’s an enterprise which is in the business of producing cement and delivering it to construction sites, places where big buildings are being constructed. You’d actually be amazed that there’s a strict latency between when the cement is mixed and when it needs to be put on the building. Otherwise it goes to waste. So what happens is that there is a complex process where, at the construction site, there are a lot of sensors which are figuring out how well is my hydraulic lift working?

Dhruba Borthakur 00:15:42 How fast is it able to do its work? How many steel pipes have I installed till now? Because based on all those sensors is when the cement needs to be mixed at a faraway place. And then the cement is loaded into the trucks and delivered to this construction site. Now this could all be automatic. This is a very real-time analytics situation, where you’re getting sensor data from many different objects at a construction site, and that’s being fed into a backend system. And then, based on irregularities or anomalies in this process, the cement needs to be mixed at a different time and then loaded into trucks and delivered here. So this is a very automatic process. There are no humans involved. I mean, it’s all software. And here the data latency is super important for these kinds of use cases, because if they miss deadlines, then the cement goes to waste.

Dhruba Borthakur 00:16:31 The driver who drives the truck gets to the construction site and finds out that all these three hours of driving have gone to waste, because now the cement cannot be used at the right place at the right time. So there are lots of industries where the cost of a non-optimized process for delivering their products is super impactful for their business. And this is where real-time analytics is really useful. Similarly, I think you might have also seen a flight tracker, sorry, flight paths that are being taken by airlines. A lot of that is also real-time analytics, because the machines on the ground are constantly telling the planes exactly what route to take. I mean, there’s a direction that’s set in the beginning, but there are so many small changes being made in the route as and when the plane is flying. And those are all automatic systems. It’s not people making decisions from minute to minute or second to second. This is all software making decisions based on fresh data.

Kanchan Shringi 00:17:34 Thank you for those two examples. They certainly make a lot of sense to me. And in both cases, you know, it was clear to see the fact that there is a lot of data. This is big data. The velocity is changing. The sensors are constantly generating data, and certainly the variety, probably many, many different kinds of sensors. So I think it matched all the metrics that you asked us to look out for, for real-time analytics. And clearly for these, the use case seems obvious. If you don’t do this, your flight path is not correctly computed, and there could be accidents. In the other case, the cement could go to waste as well. So that’s clear, but it may not always be the case that you need to have decisions right away. So I’m sure it’s based on the use case. So how do people come to a decision that they need real-time analytics? Because I’m sure there are costs associated with this. There must be some cons, because otherwise, you know, why wouldn’t you do this all the time? Maybe let’s start with talking about the costs involved, and then we can see what are the things you would ask people to look out for to know that yes, they do need real-time analytics.

Dhruba Borthakur 00:18:45 There’s definitely a cost to real-time analytics. Real-time analytics is a hard problem to solve. That’s why, in most current use cases, people use analytics on stale data. The sense is that, okay, I’m going to live with analytics where my data is one hour old or one day old, because I don’t have the resources to be able to do real-time analytics. That’s how the state of affairs was in the software industry in the last decade or so, but things are changing now. Coming back to your question, the question of cost is super important. The cost of real-time analytics is actually falling day by day. Not because some new invention has happened, but because more and more people find that this is an important problem to solve, so there are new techniques being developed. The biggest cost of real-time analytics is that first you have to plan for all types of data.

Dhruba Borthakur 00:19:40 This is not easy. In traditional ways, people will set up five different systems to deal with five different data sets, and then they have to manage all these five different systems by themselves. They might have a relational database, they might have some document database, they might have some other kind of backend system to deal with this variety of data. So that’s a big cost, in general, for people. The other is essentially the operational overhead of real-time analytics, the operational overhead of reacting. Let’s say you have a high volume of data coming in and suddenly something is not working. Now all that traffic needs to be rerouted to a different channel so that you don’t lose any of that data. You see what I’m saying? So real-time analytics is not just about getting data and processing it, but it’s about keeping all of these big pipes up all the time, 24 by seven, so that you never lose any pieces of data.

Dhruba Borthakur 00:20:36 Otherwise your system is going to have a problem. So there are big costs involved in it, but I can also explain how some of the newer technologies are helping people to surmount some of these challenges regarding costs. Take, for example, the public clouds that lots of people are leveraging to deploy these data systems. That’s one way, because there are a lot of automatic processes that help you do fault recovery, right? If a machine goes down, there are automatic ways to spin up new machines or new hardware and not lose any of the data that’s coming in at high volume. Similarly, I can go into some more depth about how systems have evolved from batch to real-time and how some software changes have happened to be able to manage all these costs. The costs essentially have come down drastically over the last maybe year or so, or maybe the last two or three years.

Dhruba Borthakur 00:21:35 But it is something that people should definitely think about, because the cost is not just about the streaming data, but also about the high volume of queries that you need. Most of these real-time analytics systems are used by other pieces of software, so they sometimes need to make queries at high volume and high QPS. That’s definitely one cost. And the last part of the cost is about matching demand. It’s likely that the workload of your system needs to go high during the daytime, and then maybe there’s no work for it in the nighttime. For example, at my cement construction company, most of the time in the night there is not much work going on in this business, because the physical workers are not constructing the buildings that much. So your system has to manage costs by being able to spin down unused resources when not in use, and then spin them back up again when daytime strikes. This has to be automatic. This is what I mean when I say there are lots of traditional costs, but a lot of the newer systems that are out there try to avoid these costs by building software that is more intelligent.

Kanchan Shringi 00:22:43 So my takeaway was really that the costs are maintaining the data systems themselves, then absorbing the operational overhead of making sure that the system is operating all the time, and matching demand, tuning it so that you are able to react when you do need to react. And what do you do when you don’t need to react? Are you ensuring that your systems are utilized for some other purpose, maybe, at that time? And public clouds have certainly helped in this area. That was my takeaway from this. So thank you. This was a good introduction to, you know, what real-time systems are and some basic use cases. And we also talked about the costs. So taking a step back, let’s talk about batch, because I assume that’s where people started, contrasting batch with real-time analytics a little bit more. What did people do earlier? What were the typical components used for batch mode?

Dhruba Borthakur 00:23:37 Analytics has always been batch for the last maybe 30 or 40 years now, since people invented computers to store data and do analytical processing on it. But if you look only at big data systems, most of big data started around 2005 or 2006, somewhere around the middle of that decade. And it all started with a paper from Google, which talked about the Google File System. Google produced a research paper saying that this is how large data sets can be stored in disk systems, right? And using the paper, there was an open source project that got built. I was part of the project, and this is a system that essentially lets you store a lot of large data sets. But the focus at that time was always, how can you store terabytes and petabytes of data in the system?

Dhruba Borthakur 00:24:34 The focus wasn’t, how quickly can you make decisions? You see, the design goals of these systems were very different in the last decade. In the last decade, the challenge was, how can I store petabytes of data cost-effectively and cheaply and not have to drain all my pockets? So this is where Hadoop became very popular. It was one of the first systems where you can store a lot of data and make queries. There was something called Hadoop MapReduce, which again is a very scan-based system, because these are all batch systems. The focus was queries that can do full scans of your data and produce results and reports. The focus was not on leveraging your most recent data set. The focus was not on low-latency queries. The focus was entirely on how can I store petabytes of data, and then make one gigantic scan of this data and get the results that I’m looking for.

Dhruba Borthakur 00:25:30 So that is how it all started in the beginning. But then another kind of evolution that happened is that SQL systems started to take a little bit of hold in this. In the beginning, Hadoop or any of these big data systems were very much key-value stores or had some custom APIs, but then people saw that using SQL is a great way to leverage the data in these systems, because SQL is easily understandable and many developers can adopt it very quickly. You can express very complex computations using very simple SQL statements. So the developer doesn’t have to think about a lot of complexity. They just specify that, hey, this is what I want to do with these two data sets. I want to join this large data set with this data set, and then I want to do some aggregations, and the software figures out how to execute all those commands.

Dhruba Borthakur 00:26:26 So the application developers really like to use SQL, as do the BI analysts. So SQL started to take hold on these systems, and now, for more and more big data systems, or any data systems, I feel like SQL has become kind of the lingua franca of all these applications. Not because SQL is some magic language. It’s just because it’s a very versatile language. It allows you to do all kinds of operations with the data in a very declarative way. You don’t have to write a program or a functional program to specify the steps. You just tell the system that I want to do these things, you figure out how to do it best and give me the results. So this is why SQL has become so popular on some of these big data systems. But it’s an evolution. As far as traditional data systems are concerned, I feel they’re very much batch systems and full-scan systems. And this is why real-time analytics wasn’t really possible on those kinds of systems. They were very batch oriented, and batch oriented means that you cannot take fresh data into account, because you create a batch of data to be processed, and then you keep processing it for the next hour or the next day.
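
The point about declarative SQL (join two data sets, then aggregate, and let the engine work out the steps) can be tried with Python’s built-in `sqlite3` module. SQLite is used here only because it ships with Python; it is not one of the big data systems discussed in the episode, and the `orders` and `customers` tables and their contents are invented for this sketch.

```python
import sqlite3

# In-memory database with two small, made-up data sets.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 20.0), (2, 10, 30.0), (3, 11, 5.0);
    INSERT INTO customers VALUES (10, 'west'), (11, 'east');
    """
)

# The statement says what is wanted (join, then total per region);
# the engine decides how to execute it.
rows = conn.execute(
    """
    SELECT c.region, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()
conn.close()

print(rows)
```

Nothing in the query spells out loops, join order, or how the grouping is done, which is the declarative property described above.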

Kanchan Shringi 00:27:42 So is HDFS no longer part of the ecosystem for building real-time analytics?

Dhruba Borthakur 00:27:47 Yeah, that’s a good question. I was the project leader of the Hadoop file system back in the days when it was a very cutting-edge software project. I feel like Hadoop has kind of lost its charm in the last maybe two or three years. Back in the day, it was a very powerful system. But what has happened now is that people have started to move to the cloud. And the cloud means, say, the Amazon AWS cloud, or Google Cloud, or Azure, or some other cloud system. And there, these cloud vendors already have an HDFS-equivalent system that is pre-installed for you. In Amazon, it’s called S3. So you can have an object store that scales infinitely. So there is no advantage or differentiator for an enterprise that is using Hadoop to be able to say, hey, I run my own Hadoop, because the cloud vendors run the equivalent for you.

Dhruba Borthakur 00:28:40 So HDFS, I think, is losing its charm. Similarly, MapReduce, which is very much a batch system, is not what people are looking for these days, because again, mostly people are thinking about real-time analytics, not batch analytics. So it’s changing. Also, if you use Hadoop, you still look at things in a very old-fashioned way. You have to look at capacity planning for your Hadoop cluster. You have to look at how much CPU you use and how much storage you combine together in an HDFS cluster. Whereas if you use some of these cloud vendors’ solutions, they’re actually just pay-as-you-use services. You don’t have to capacity plan any of these things. You don’t have to think in advance about how much data you’re going to use the next day or next month. So people have more or less migrated to a lot of cloud storage systems that are out there, but HDFS is still used by some enterprises.

Dhruba Borthakur 00:29:33 I’ll give you one or two examples without naming the people, because I worked closely with these enterprises. There is this company that deals with buying and selling used cars. They have a service that allows users to see what others paid for similar cars in their neighborhood or local area, and publishes this to the dealers of the cars. They store all this data on an on-prem Hadoop file system, right? And they use something called HBase, which is a database and a key-value store. But again, this is not a real-time system. This is the difference with those systems. They can look at used car prices of yesterday and earlier, but not the latest ones. If somebody is making any changes today, they don’t get reflected for 24 hours. Similarly, there are some big financial firms that continue to use Hadoop. First of all, they have far more restrictions and regulations, but they do store things like mortgage deeds or loan repayment history, these kinds of things, and run the risk assessment and analysis on the Hadoop file system. And these risk numbers don’t really change from minute to minute or hour to hour. So it’s okay for them to not have this real-time concept for these kinds of use cases on the Hadoop backend.

Kanchan Shringi 00:30:51 Okay. So the point is that the use case it was built for at that time was quite different, and a change is certainly needed. So moving on to that, you know, what’s new? What have you been working on? We’d like to focus on that. I read a little bit about Rockset and I came across the term aggregator-leaf-tailer architecture. So I do want to ask you about that, but before that, what is Lambda, and what are its shortcomings? Why did you have to create something brand new?

Dhruba Borthakur 00:31:20 So for people who are in the data landscape, or people who deal with data systems and software systems, I think the Lambda architecture has become very popular in the last few years. What the Lambda architecture says is that there is a batch pipeline to load data into your system, and then there’s also a streaming pipeline through which you load data into your system. So the Lambda architecture says that there are two pipelines to load data into your system. You cannot get rid of either of these, because you need real timeliness. It’s great to have these two pipelines built into your system. The focus of the Lambda architecture is: how can we get good efficiency and good throughput out of your system? That’s the design goal of the Lambda architecture. So let’s say you have a hundred terabytes of data, which means you need a bulk processing pipeline, but then you also have, say, one gigabyte of data per second coming in, which means that there’s a streaming component to it.

Dhruba Borthakur 00:32:22 And the Lambda architecture very clearly says that this is how you should build a system so that your system is optimized and you get the best value for your money compared to the amount of data that is coming into your system. But you would realize that it is not really talking about real-time analytics. It’s all talking about efficiency and throughput of your system. Now, based on our discussion, I hope it’s getting clear in our minds that real-time analytics is also about efficiency, but the real focus is the ability to react very quickly to changes in your data.

Kanchan Shringi 00:33:01 So it seems similar to your answer to the earlier question on the difference between streaming analytics and real-time analytics. So beyond just loading the system, which seems to be the focus of the Lambda architecture, you’re saying there have to be more optimizations for the reading?

Dhruba Borthakur 00:33:18 Correct, absolutely. So Lambda doesn’t tell you much about how fast a query can be, how your queries are changing from minute to minute, or what the query volume of your workload is. Those are not part of the design of the Lambda architecture.

Kanchan Shringi 00:33:32 Okay. So then let’s jump into the specific technical accomplishments with the aggregator-leaf-tailer architecture.

Dhruba Borthakur 00:33:40 At Rockset, we use the aggregator-leaf-tailer architecture. In short, we call it ALT. The ALT architecture by itself is not something that suddenly came out of the blue because of a lightning bolt from the sky. It is inspired by many of the systems we built when I was at Facebook, some of the backend systems. I’ll give an example of where you might already be using an ALT architecture system. If you use Facebook, you load the Facebook app, and then you see the Facebook feed, right? The data feed, which is about comments, posts, and likes from your friends. So that’s a data feed that’s coming in at actually very high volume, because all your friends are posting things and liking pictures. And the Facebook news feed is an analytical system, because it doesn’t show you everything. It takes a combination of your best friends and tries to apply some ranking logic across all the feed items that it could possibly show you, right?

Dhruba Borthakur 00:34:40 And it does relevance matching and sorting, and then shows you the feed in the Facebook app. So this is a real-time feed, which is focused on reducing data latency. You don’t want to see posts or comments from your friends that were posted yesterday, right? You want to see them right when your friend posted a comment. So this backend was built using an ALT architecture, or an ALT-like architecture. The beauty of the ALT architecture is that it is focused on serving low-latency queries on large datasets. It has to be efficient and throughput-optimized, just like Lambda, but it has an additional constraint that your queries have to be fast, and you should be able to support a large volume of queries if needed. So from the use case perspective, this is what is different about an ALT architecture.

Dhruba Borthakur 00:35:34 Now, if I double-click on it and peel one more layer of the onion, the technical differentiator of the ALT architecture is that it has three components. It has the aggregators, the leaves, and the tailers. The tailer is the component that is basically involved in writing to the system. When you’re sending data into the system, it is the tailers that are actually indexing this data or creating it in a format that is queryable. That’s where a lot of work is happening, on the tailers. And then the tailer sends it to the leaves. The leaves are the components that store this data persistently so you don’t lose it. So the leaves are kind of your storage system. And then the aggregators are part of the query system. When queries come in, the aggregators are the components that fetch the data from the leaves, do whatever processing your SQL query might be telling them to do, and then give you results.
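The three roles described above can be sketched in a few lines of Python. This is a toy, in-memory illustration only, not Rockset's implementation: the class names `Tailer`, `Leaf`, and `Aggregator` follow the transcript, while every method name and the choice of a simple (field, value) index are assumptions made for the example.

```python
from collections import defaultdict


class Leaf:
    """Storage role: keeps records plus the index entries a tailer built."""

    def __init__(self):
        self.records = {}                  # doc_id -> record
        self.inverted = defaultdict(set)   # (field, value) -> set of doc_ids

    def store(self, doc_id, record, index_entries):
        self.records[doc_id] = record
        for key in index_entries:
            self.inverted[key].add(doc_id)


class Tailer:
    """Write role: turns incoming records into a queryable form."""

    def __init__(self, leaf):
        self.leaf = leaf
        self.next_id = 0

    def ingest(self, record):
        doc_id = self.next_id
        self.next_id += 1
        # The indexing work happens here, on the write path.
        index_entries = [(field, value) for field, value in record.items()]
        self.leaf.store(doc_id, record, index_entries)
        return doc_id


class Aggregator:
    """Query role: fetches matching data from the leaves and processes it."""

    def __init__(self, leaves):
        self.leaves = leaves

    def average(self, field, where_field, where_value):
        values = []
        for leaf in self.leaves:
            for doc_id in leaf.inverted.get((where_field, where_value), ()):
                values.append(leaf.records[doc_id][field])
        return sum(values) / len(values) if values else None


leaf = Leaf()
tailer = Tailer(leaf)
for rec in [{"team": "a", "score": 10}, {"team": "a", "score": 20}, {"team": "b", "score": 5}]:
    tailer.ingest(rec)
aggregator = Aggregator([leaf])
print(aggregator.average("score", "team", "a"))
```

Because the write path, the storage, and the query path are separate objects, each one could in principle be replicated on its own, which is the property the rest of the discussion relies on.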

Dhruba Borthakur 00:36:33 So there are three separate stages in the ALT architecture. And the reason why this is important is because this is a disaggregated architecture, perfectly built for cloud systems. This did not appeal much when you had an on-prem system where your hardware is fixed. The reason a disaggregated architecture matters is that it is very cloud-friendly, and I can explain why. What happens is that if there’s a high volume, if the write rate to your system increases, then you spin up more and more tailers. You don’t have to touch the leaves and aggregators, because it’s just that more data is coming in, but the amount of data that you have in your system is constant, or the queries are not increasing in number. So you just need to scale up the tailers. Similarly, when there’s no data coming in, you spin down the tailers and you don’t have any extra costs for the tailing component, right?

Dhruba Borthakur 00:37:23 Similarly, if your data set increases day over day and month over month, then you need to add more and more leaves. But if your data set is decreasing for whatever reason, then you can keep shrinking the number of storage nodes, the leaves, in the system. And the aggregators are actually super exciting, because when more queries come, you spin up more aggregators and you serve them with the same low latency that your real-time system needs. But then when there’s no more load, and people are sleeping or whatever in the nighttime, or maybe it’s not Christmas Day so there is less load, you can spin down all your aggregators and you have no cost for keeping them up. So basically the disaggregated architecture is cloud-friendly, and it can give you the lowest cost compared to an on-prem system where you have to provision everything for peak capacity. This ties into the other question you asked me, which is how you can reduce costs for analytics. The ALT architecture is a great way to reduce costs as well, because it’s disaggregated in nature.
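The independent scaling just described can be made concrete with a small sizing sketch. The per-node capacities below (`TAILER_MB_PER_S`, `LEAF_GB`, `AGGREGATOR_QPS`) are hypothetical numbers picked only for illustration, and the `plan` function is an assumption, not a Rockset tool. The point is that each tier is sized from its own driver: write rate, stored data, or query rate.

```python
import math
from dataclasses import dataclass


@dataclass
class Load:
    write_mb_per_s: float   # drives the number of tailers
    stored_gb: float        # drives the number of leaves
    queries_per_s: float    # drives the number of aggregators


# Hypothetical per-node capacities, chosen only for illustration.
TAILER_MB_PER_S = 50
LEAF_GB = 1000
AGGREGATOR_QPS = 200


def plan(load: Load) -> dict:
    """Size each tier from its own driver, independently of the others."""
    return {
        "tailers": math.ceil(load.write_mb_per_s / TAILER_MB_PER_S),
        "leaves": math.ceil(load.stored_gb / LEAF_GB),
        "aggregators": math.ceil(load.queries_per_s / AGGREGATOR_QPS),
    }


quiet_night = plan(Load(write_mb_per_s=100, stored_gb=5000, queries_per_s=0))
write_burst = plan(Load(write_mb_per_s=1000, stored_gb=5000, queries_per_s=0))
print(quiet_night)
print(write_burst)
```

In this sketch a tenfold write burst changes only the tailer count, while the leaf and aggregator counts stay where they were. A system with one shared pool of compute would have to size that pool for the sum of the peaks.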

Kanchan Shringi 00:38:26 Could you contrast it with Apache Druid, Flink, and Kudu? My impression is that they have been used for similar use cases.

Dhruba Borthakur 00:38:34 So, good question. From an architecture perspective, they are different systems. Like I said, the aggregator-leaf-tailer architecture has three parts to the system. The three things that are disaggregated in this architecture are, first, the CPU that you need to write data into the system, right? The second thing is the amount of storage in your system. And the third is the CPU that you need to serve queries from your system. These are three different things in this architecture. Whereas, for example, if you look at Apache Kudu or Druid, there is no clear separation between the compute that you need to write data into the system versus the compute you need to query. There is a shared set of compute servers. Now, that works fine for many use cases, but the challenge for that architecture shows up when you suddenly have a high write volume. Let’s say some software suddenly started to send a hundred times more data than it was sending last hour.

Dhruba Borthakur 00:39:35 Now your queries will suffer, because they share the same CPU, the same set of compute nodes, that is used for writing the data as well as querying the data. They’re not segregated; that’s because the system is shared in nature. Similarly, if suddenly there is a high number of queries coming into an existing system, let’s say Druid, for example, your new data might get delayed, so it’s not as fresh anymore, because all the compute is being used to serve all the queries in your system. So this is one big difference from the architecture perspective. Now, you also mentioned Flink. Flink is very much a streaming system, which basically brings us back to the difference between streaming analytics and real-time analytics. So, for example, in an ALT architecture, you could actually use Flink as part of the tailers.

Dhruba Borthakur 00:40:29 You could think about it as if the tailers are made of Flink nodes, where things are happening when data is coming into the system, but then you actually store it in the leaves, where the actual data storage is. And then the aggregators still need to be there to query all the data that Flink has output to the leaves. So Flink is more like a streaming system where you can process data as and when it is coming in. After you’re done processing, you can park it in the leaves, and then the queries can come into the aggregators. So they’re kind of complementary to one another. Flink could also be similar to, say, Spark Streaming use cases, where you use it more like an ETL tool: there is a set of transformations that you can do using Flink, and then you can park the result in an ALT architecture system and start querying the leaves and aggregators. So, yeah, I have seen people use the ALT architecture and Flink together in one system.

Kanchan Shringi 00:41:27 What I keep getting is the fact that a lot of these systems were optimized for the storage, and what you’re now focusing on is how to make the query better with real-time indexing and how to scale these three components separately so that it is cloud native.

Dhruba Borthakur 00:41:46 Absolutely. So the focus for the ALT architecture is: how can you serve a lot of queries efficiently? How can you make the queries low-latency? How can you make sure that you use only the minimum amount of hardware needed to serve these queries? So there are three things to make these queries efficient. You also mentioned indexing. Indexing is super useful for us to be able to serve queries at low latency. I mean, indexing is a time-tested way of reducing query latencies. If you look at a MySQL database, an Oracle database, or anything else, Informix, whatever else was there 40 years back, they all have this thing called indexing. It’s just that nobody had actually used indexing on large datasets before. Most of the indexing was for small, relational kind of transactional databases. People never actually used indexing for analytical applications. What people are now realizing is that to serve low-latency queries on large data sets, you need to index this data. Indexing is the way to reduce latency. The aggregator-leaf-tailer architecture lets you do this, because there’s a cost of indexing at the time when data comes in, which is paid by the tailers, and then there’s a cost for the queries to be able to use the index as part of the query, which is paid by the aggregators. So yes, the ALT architecture definitely helps the indexing part of things.

Kanchan Shringi 00:43:15 Yeah. On the indexing, I did read the term converged indexing. So can you explain what that is?

Dhruba Borthakur 00:43:23 There are two or three different types of indexing that people are used to in the software industry. One is the columnar index. A columnar index is an index where the values of each column across all your records are stored together. Let’s say you have a table with three fields, right? Name, age, and salary. Let’s say these are the three columns in your database. Now, if you keep all the salaries in one place, separately from the other fields, and somebody says, hey, give me the average salary of a person, I can just scan through all the salary fields of all the billions of records in my database and give you the average. It’s very fast. So columnar storage, or a columnar index, is a traditional way to make scan-type queries fast. That’s one type of index that people are used to. Then the other type of index people are used to is something called an inverted index.

Dhruba Borthakur 00:44:19 An inverted index is what is essentially used by open source software like Lucene and Elasticsearch. It is used essentially to give you very low latency on point queries. Suppose you have a query on this table saying, find me the person whose salary is exactly $1,000, right? Now, there will probably be, like, three such people in your billion- or million-person database. If you build an inverted index, those queries will come back really fast. It doesn’t have to scan the entire table to figure this out. This is the second type of index. And the third type of index is essentially very much like a relational table with a primary key. In the example I gave you, you have the name and the age and the salary. You could have an index on, say, the employee ID of the person. So that’s the third index. That’s how traditional relational tables are stored, with the primary key.

Dhruba Borthakur 00:45:14 And that’s a record-based or row-based index, which could be indexed by the employee ID of the person. So if somebody is getting hired or leaving the company, you can use the employee ID to find that person’s record and update the fields in the record, right? Converged indexing is essentially building all these kinds of indexes as one thing in the backend. The application doesn’t have to specify what indexes to build. The system automatically builds these indexes, and then, based on the query, it tries to use the right index for the right purpose. So typically a converged index means that you build an inverted index, you build a columnar index, and you build a row index as one thing, and you store it and leverage it in your queries. Just to give it a little bit more color: the row index is essentially very much like the indexes that MySQL or MongoDB build by default. The inverted index is very much like what Lucene or Elasticsearch builds by default. And a columnar index is very much like an index that’s built by Redshift or Snowflake or Vertica or any other kind of analytical software. So this is the association: the converged index essentially builds the indexes from all these different types of backend systems you might be used to and stores them in one place.
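To make the three index types concrete, here is a minimal in-memory sketch that maintains a row index, a columnar index, and an inverted index for every record on write, using the name, age, and salary example from the conversation. The class `ConvergedIndex` and its methods are hypothetical names for illustration; how Rockset physically stores these indexes is not described in this transcript.

```python
from collections import defaultdict


class ConvergedIndex:
    """Toy index that keeps row, columnar, and inverted views of each record."""

    def __init__(self):
        self.row = {}                      # primary key -> whole record
        self.column = defaultdict(dict)    # field -> {primary key: value}
        self.inverted = defaultdict(set)   # (field, value) -> primary keys

    def put(self, key, record):
        old = self.row.get(key)
        if old is not None:
            # Drop stale entries so an update does not leave ghosts behind.
            for field, value in old.items():
                del self.column[field][key]
                self.inverted[(field, value)].discard(key)
        self.row[key] = record
        for field, value in record.items():
            self.column[field][key] = value
            self.inverted[(field, value)].add(key)

    def get(self, key):
        """Row index: fetch one record by primary key."""
        return self.row.get(key)

    def average(self, field):
        """Columnar index: scan a single field without touching the others."""
        values = list(self.column[field].values())
        return sum(values) / len(values) if values else None

    def find(self, field, value):
        """Inverted index: point lookup without scanning the table."""
        return sorted(self.inverted.get((field, value), ()))


idx = ConvergedIndex()
idx.put(1, {"name": "Ann", "age": 30, "salary": 1000})
idx.put(2, {"name": "Bob", "age": 40, "salary": 4000})
idx.put(3, {"name": "Cy", "age": 50, "salary": 1000})
print(idx.average("salary"), idx.find("salary", 1000), idx.get(2))
```

The caller never declares which index to build: every field of every record lands in all three structures, and each query method picks the structure that suits it. The price is extra work on every write, which is why the cost of indexing comes up next in the conversation.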

Kanchan Shringi 00:46:34 This obviously affects the rate at which you’re able to ingest the data as well. So maybe talk about some of the technical challenges there. Is that what led to the separation and trying to scale them independently?

Dhruba Borthakur 00:46:49 Yeah. In general, I think people think that building indexes is costly. That’s why most databases have a command saying, create index on this column. If your record has 500 fields, people don’t want to build 500 indexes, because building indexes in general has been a difficult proposition for most users. But at Rockset, what we do is build a converged index for all the fields in your data, because our technology makes the cost of indexing much lower compared to traditional systems. So it actually makes it economically feasible to build indexes on every field of your data. Now think about the power of the system. You don’t need a DBA, first of all, to build an index on whatever fields are there. The index is built on every field. And so now when a query comes in, almost all the queries are fast, because the indexes are on every field of the data.

Dhruba Borthakur 00:47:46 You asked me about the challenges of building this. The challenges are manifold, but let me tell you how the challenges are addressed in our software. One of them is that we don’t use a B-tree to store this data. Typically a binary tree or a B-tree is how databases store data. What happens is that the B-tree has something called high write amplification when you are updating data, because the data gets updated in place. But as far as building these indexes for the converged index at Rockset, what we do is use a log-structured merge tree instead of a B-tree. It’s called an LSM engine. We use an LSM engine called RocksDB, which is open source. That lets us index a large volume of data at a much better price-performance compared to the B-tree-based databases that people were used to earlier.
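The difference from update-in-place storage can be illustrated with a toy log-structured merge tree. This sketch is not RocksDB and does not use its API; `TinyLSM`, its methods, and the tiny `memtable_limit` are assumptions chosen to show the idea: writes go to an in-memory table, full tables are flushed as immutable sorted runs, and a later compaction merges the runs.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []                 # newest run first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A write never rewrites an existing run; it only touches the memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:          # newest first, so the latest value wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        merged = {}
        for run in reversed(self.runs):    # oldest first, so newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)      # memtable is full: flushed as the first sorted run
db.put("a", 10)     # newer value for "a" goes to a new run, the old run is untouched
db.put("c", 3)
print(db.get("a"), len(db.runs))
db.compact()
print(db.runs)
```

The update to key `a` is written sequentially into a new run rather than modifying the old one in place, and the obsolete value is only discarded when `compact` merges the runs. Real LSM engines add write-ahead logs, levels, and bloom filters, none of which are modeled here.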

Dhruba Borthakur 00:48:40 That’s one. And the second way we try to address the challenges is that Rockset’s converged index is very much built like a search engine and not like a database. Think about Google search, right? When you are searching for a term, it is built for low latency. Let’s say you are searching for restaurants in San Francisco. It’s actually hitting a large database and giving you results back, but the database is optimized for latency; your queries need to be really fast. Rockset is also built like this. The converged index is basically what you would call a document-sharded database, which means that when data comes in, or a record comes in, all the fields of the record are indexed on one machine. A record is not split across multiple servers, even if your record is very big and has 500 fields.

Dhruba Borthakur 00:49:35 All the fields are going to be indexed on one machine, chosen randomly. And then when a query comes in, it has to hit all the machines in the cluster, get results back, and give the results back to the user. So it uses a scatter-gather kind of algorithm to fan out to all possible machines in the cluster and get the results back from them. That’s unlike a traditional database, let’s say Druid or something else, where a query that comes in is going to hit only the machines that have the data, because the data is partitioned. Whereas in Rockset, or in a converged indexing system, we scatter the documents among all the machines in the cluster, and so when a query comes in, it has to hit all the machines in the cluster and give the results. The advantage of that is that it is much more cost-effective to build local indexes on a document, which is why we adopt this approach for building all the indexes. You don’t need something like distributed Paxos or some other algorithm to be able to index data in large volumes, because you have this document-sharded, scatter-gather way of storing all the documents and retrieving results. So those are the two primary ways we have reduced the cost of indexing on these large streaming databases.
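Document sharding with scatter-gather queries can be sketched as follows. This is an in-process illustration under stated assumptions: `Shard` and `Cluster` are hypothetical names, shards are plain objects rather than machines, and placement uses a seeded `random.Random` only so the example is repeatable.

```python
import random
from collections import defaultdict


class Shard:
    """One machine: stores whole documents and indexes them locally."""

    def __init__(self):
        self.docs = {}
        self.inverted = defaultdict(set)   # (field, value) -> doc_ids on this shard

    def index(self, doc_id, doc):
        # Every field of the document is indexed on this one shard.
        self.docs[doc_id] = doc
        for field, value in doc.items():
            self.inverted[(field, value)].add(doc_id)

    def search(self, field, value):
        return [self.docs[d] for d in sorted(self.inverted.get((field, value), ()))]


class Cluster:
    def __init__(self, num_shards, seed=0):
        self.shards = [Shard() for _ in range(num_shards)]
        self.rng = random.Random(seed)
        self.next_id = 0

    def ingest(self, doc):
        # Placement needs no cross-shard coordination: pick any one shard.
        shard = self.rng.choice(self.shards)
        shard.index(self.next_id, doc)
        self.next_id += 1

    def search(self, field, value):
        results = []
        for shard in self.shards:                        # scatter to every shard
            results.extend(shard.search(field, value))   # gather partial results
        return results


cluster = Cluster(num_shards=4)
for i in range(20):
    cluster.ingest({"city": "sf" if i % 2 == 0 else "nyc", "seq": i})
hits = cluster.search("city", "sf")
print(len(hits))
```

Each write touches exactly one shard, so index maintenance stays local. The trade-off is on the read side, where every query visits every shard, even those holding no matching documents.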

Kanchan Shringi 00:50:51 As I hear this, you know, I’m thinking this is blurring the lines for me between OLTP and OLAP. So do you envision this replacing current application workloads, so OLTP workloads as well as search?

Dhruba Borthakur 00:51:05 Yeah, so that’s a good point. In OLAP, typically people are just talking about aggregations and joins, which are basically, like I said, cubing and some other pre-aggregations. But when you are talking about the use cases that most user-facing real-time analytics are used for, they do need cubing and OLAP, but they also need things like search queries. They also need queries like ranking and relevance matching. These are typically not part of the traditional kind of OLAP that most people are used to. So yeah, probably we need a different kind of workload, which is basically an OLAP system, but with far more things associated with it. Some people have started to call it operational analytics, or operational analytical processing, where real timeliness of the data is super important. So it’s not just about cubing, and it’s not just about aggregations and joins, which have to be there, but you also need searching, and you also need search indexes and relevance ranking and pattern matching to be able to serve all these operational analytical queries that you might need 40 at a time. And I think

Kanchan Shringi 00:52:26 So one last question for this section, before we move on to the last one on embedding real-time analytics. There was a note about schema-less ingestion. Can you explain that?

Dhruba Borthakur 00:52:37 Yes. So schema-less ingestion falls into the purview of real-time analytics, and let me explain why. Real-time analytics is all about making decisions on your fresh data. Now think about data coming in from five different sources, with five different formats. One of them is JSON with five fields. Another one is coming in from a traditional relational database. The third stream is application logs that are coming in from, say, Kafka or Flink or something else. Now, if there’s a way for you to take all this data, which doesn’t have an integrated schema, and make it queryable, so that people at the time of query can leverage the schema that is being deciphered by the system automatically, that is called schema-less ingestion. What Rockset does is essentially allow you to dump semi-structured data from different sources. You don’t have to specify a schema at the time of writing to this database.

Dhruba Borthakur 00:53:38 So basically, it brings together the best of a NoSQL database with a SQL database, right? NoSQL databases, I feel, became really popular in the last decade because they could ingest data in a free format. So this is schema-less ingestion. On the other hand, software developers don’t like to create a lot of schemas themselves, because schemas keep on changing, but they need to use SQL, because SQL is a very rich and powerful language. And prior to Rockset, there wasn’t a way to run SQL queries on a NoSQL database. That’s why they’re called NoSQL databases. So what Rockset does is essentially take this data, and it’s a schema-less thing that’s happening, but the system is creating a SQL schema automatically for your application developers. That schema is based on the data that is coming in, and because of the indexing, the system can find out all the different pieces of data, all the different types of data that are in the system.

Dhruba Borthakur 00:54:39 And it actually gives a SQL interface with a well-defined schema to application developers. So now application developers feel like, oh, actually it’s a SQL system with a strict schema. I can inspect the schema. I can call describe table and I can see the schema, but nobody has to create it up front. So if the new data that is coming in has new fields, they automatically appear in the schema. This is what I mean by schema-less ingestion, with a strict schema that is used for queries, but not at the time of writing this data.
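A small sketch can show what "no schema on write, inspectable schema on read" might look like. The function `infer_schema` is a hypothetical helper written for this example; it simply records which Python types have been observed for each field. The field names reuse the fitness-app theme from later in the conversation, and the sample records are invented.

```python
from collections import defaultdict


def infer_schema(records):
    """Derive a field -> observed types mapping from records written without a schema."""
    schema = defaultdict(set)
    for record in records:
        for field, value in record.items():
            schema[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in sorted(schema.items())}


# Records arrive in whatever shape the producers send; nothing is declared up front.
records = [
    {"user": "ann", "steps": 4200},
    {"user": "bob", "steps": "n/a"},                       # same field, different type
    {"user": "cy", "steps": 3100, "temperature": 21.5},    # a new field appears
]
schema = infer_schema(records)
print(schema)
```

Here the `temperature` field shows up in the derived schema as soon as one record carries it, and the mixed types seen for `steps` are reported rather than rejected. A describe-table style command could be backed by a structure like this, though the transcript does not say how Rockset derives its schema internally.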

Kanchan Shringi 00:55:10 I will also include your Medium post on the ALT architecture in our show notes, so interested listeners can refer to that in detail. So coming to our last section, which is really the key focus of our podcast today, and we’ve been building to that, it is how to embed real-time analytics in SaaS applications. First of all, what does that even mean, embedding the analytics within the application?

Dhruba Borthakur 00:55:39 Traditionally, I have seen people use analytics mostly to make their businesses better. How can I make my business process better? There’s a lot of analytics that lets you explore what-if situations, right? What if I had done this? How would my results have been different? That’s what traditional analytics is all about. So it has nothing to do with the products that the company is selling. It has more to do with how the business of the enterprise is doing. Now, on the other hand, in recent times, what we have seen is that a lot of analytics needs to be consumed to make the products themselves better. So I gave you the example of the construction company, right? Their product is the ability to deliver cement cost-effectively from one place to another. So that’s their product. So how can you embed analytics into this product?

Dhruba Borthakur 00:56:33 Another example I’ll give you is online gaming. Hundreds of people are playing an online game. You need analytics for things like creating a gaming leaderboard. The leaderboard is who’s the best player who can do these six different things in the last major show. You need real-time analytics because the leaderboard query is a very complex query, and you don’t want to show a leaderboard of yesterday’s data, because people are actually playing games and they want to see the leaderboard getting refreshed as some people are winning and some people are losing. So this is what I mean by embedding analytics into your application. This is a newer area, where your company’s products are getting directly impacted. It’s very different from traditional analytics, which is more focused on making businesses better. How does this change the architecture of the application? The architecture changes because, since the analytics are embedded in products, the volume of queries usually varies a lot.

Dhruba Borthakur 00:57:36 If it was a business application, then it would be the CEO or the business analyst in your company who makes a query, let’s say every day or every week. So the pattern is very well-defined. Whereas when you build it into your product, I think there are certain demands on the architecture. The first demand is that the query latencies have to be really low. That’s one basic building block for the architecture. And the second one is that the query volume can also change based on how your product is being used by your users. Take, for example, a fitness app that you are running on your phone, and suddenly a lot more people are running on the road because it’s a Saturday. So now you want to show them an analytical product telling them who the top people performing today are, right?

Dhruba Borthakur 00:58:24 Let’s say you also want to see how you compare with yesterday’s run or with your own schedule. Now you need to show them real-time analytics on this, and the volume could change from hour to hour very quickly. So the architecture has to be very much focused on low-latency queries on these large data sets, and also the volume of queries could change from time to time or minute to minute. So these are two basic differences as far as the application is concerned. As far as the data is concerned, what also happens is that the schema of the data keeps on changing constantly. Let’s say, for the fitness app, the application developers will try to capture more data from your phone or from your device. Say next week they’re going to ship new features, which are also going to capture the temperature of your surroundings where you are running.

Dhruba Borthakur 00:59:16 So now the data stream that is coming into your data set has a new field, say the temperature from minute to minute. Now your database has to store this additional field in every record. So the data variety also changes, because it’s the application developer who is controlling what data is getting produced and what data is getting consumed. And the rate of change of an application frequently demands that your data formats change, or that new fields appear in your data that you need to process. So you need flexibility in your architecture so that it can handle multiple data types. Schema-less ingest is a great feature for these kinds of use cases. And the rate of change of your application is basically enabled if you can support these kinds of things in your backend architecture.

Kanchan Shringi 01:00:05 Do you see customers using Rockset now, you know, for augmenting some parts of the stack or replacing some services?

Dhruba Borthakur 01:00:14 So Rockset by itself is a SQL API on semi-structured data. In general, people can essentially use it for replacing a lot of the SQL analytical systems that they already have, but this is in theory. This is not what we’ve seen in practice. What we actually see is that people use Rockset for solving real-time analytical needs. Take, for example, this fitness tracker that I gave you examples about, right? They also have a very good way of upselling products to their users, but it has to be real-time, at that right moment of time, that they can upsell a certain kind of product to the user. And it’s very real-time in nature. So they use Rockset for this kind of use case. On the other hand, Rockset is a SQL system with JDBC and SQL APIs that query large datasets very quickly.

Dhruba Borthakur 01:01:13 We see that people also start using it for other use cases, just because it’s very easy to use. It doesn’t have any operational overhead for our users. One thing they also see is that now their application developers can move very fast, although they might not need the data latency to be one second. They might be okay with, say, half an hour of data latency. But the fact that it’s schema-less ingest, and the fact that it enables application developers to add more features to their application without having to talk to a database engineer or an ETL engineer, or do provisioning or capacity planning, leads them to start using Rockset for different things, which are again SQL on semi-structured data, but where the data latency might not need to be one second. They might be okay with 10 seconds or 15 seconds, or 10 minutes or 15 minutes of data latency, just because it makes them run faster, as far as the developers in that company are concerned. So that’s what we have traditionally seen for most of our customers.

Kanchan Shringi 01:02:19 What about the folks building the SaaS applications? Do the developers building them now need to gain some additional skills?

Dhruba Borthakur 01:02:28 Our focus always has been to empower developers, application developers, to leverage the power, all the insights or intelligence that is hidden in the data that they’re collecting. So most of the time, the application developers who use Rockset don’t really need to have a data team or a data engineering team to handle all the data needs. They typically start off with making SQL queries over REST. Rockset is essentially a SQL query engine. So they dump a lot of JSON or CSV or XML data into Rockset. It could be dumped at one time, or the application developer can also tell Rockset, my data is all in S3, so please pull it in from there. Or, my data is in Kafka, so keep continuously tailing it. And the only thing that they have to learn is how to make a SQL query over REST.

Dhruba Borthakur 01:03:23 So this is one option for most of our application developers. We also have Python, where if you know just Python, you could use Python constructs to make queries that automatically get generated. But the even more interesting thing is this: let’s say one application developer writes a SQL query, iterates on the SQL query, and then finds out that this is the right SQL query to be made. He can actually park the SQL query as something called a Query Lambda. So Rockset has this concept of a Query Lambda, which is basically a SQL query that can be stored in the backend, obviously as parameterized SQL. And what it exposes is a REST API that you can hit from any machine or any browser. So you can just hit a REST API. There is no SQL on your client side anymore. The client just hits a Query Lambda ID.

Dhruba Borthakur 01:04:19 And in the backend the SQL query is executed and the results are returned to the application. So we have seen that many people start to use Rockset as a developer API, because not everybody needs to know SQL. If one developer knows SQL and he parked it as a Query Lambda, now all other developers in his company start to use that Query Lambda. You can actually use it from any language now because it’s just a REST endpoint. There is no SQL involved, and it doesn’t need to link in any Rockset libraries or anything at all. So the developer adoption effort is pretty minimal. The barrier to adoption is very low for a new developer. He doesn’t have to learn some new things that are Rockset specific.
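
The idea of parking a parameterized SQL query in the backend and calling it by name can be sketched in a few lines. This is only an illustration of the pattern, not Rockset's API: the `QueryLambdaStore` class, its `save` and `execute` methods, the `orders` table, and the use of an in-memory SQLite database are all assumptions made for the example. In a hosted service, `execute` would sit behind a REST endpoint so that clients send only a name and parameters.

```python
import sqlite3


class QueryLambdaStore:
    """Hypothetical stand-in for a backend that stores named, parameterized SQL."""

    def __init__(self, conn):
        self.conn = conn
        self.lambdas = {}

    def save(self, name, sql):
        # Park a parameterized SQL query under a name.
        self.lambdas[name] = sql

    def execute(self, name, params):
        # Run a stored query by name; the caller never handles the SQL text.
        cursor = self.conn.execute(self.lambdas[name], params)
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Illustrative data set (assumed, not from the episode).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 10.0), ("acme", 15.0), ("globex", 7.5)],
)

store = QueryLambdaStore(conn)

# One developer who knows SQL parks the query once...
store.save(
    "total_by_customer",
    "SELECT customer, SUM(amount) AS total FROM orders "
    "WHERE customer = :customer GROUP BY customer",
)

# ...and everyone else calls it by name with parameters only.
result = store.execute("total_by_customer", {"customer": "acme"})
print(result)
```

Keeping the SQL on the server and binding parameters there also means clients never concatenate query strings themselves.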

Kanchan Shringi 01:05:00 Are you tracking any technical challenges that you haven’t solved yet, but that you have encountered while working with customers?

Dhruba Borthakur 01:05:09 Absolutely. So typically customers in different verticals have different requirements. Take, for example, the healthcare industry: there are certain rules and regulations that are mandated by the government, such as the HIPAA regulation. So we are working on making sure that Rockset is HIPAA compliant. But on the other hand, we have ways to mask certain fields of your data, so that if somebody has health data and they still want to use Rockset today, they could mask out certain fields of their data, so that Rockset doesn’t get to know about any of those. Let’s say you have some PII data that the customer cannot share with Rockset; they can very easily mask it, or they can hash it to a certain value. So we have tools to be able to ensure that our customers are adhering to the regulations that they need. Similarly, say for financial companies, they might have some other sets of requirements that Rockset might not meet, especially related to the security of the data.

Dhruba Borthakur 01:06:14 Essentially, you can talk about it as SOC 2 compliance. So we are working on some of those. Some of those are needed to put more sensitive data into Rockset, and we will have all those compliances mostly done in another three months or so, or maybe a month or so from now. That’s one area. The second area of challenges, as a cloud service, and not just specific to Rockset but for any other cloud service, is the ability to give our developers, the people who use Rockset, the ability to debug bugs in their code by themselves, right? So we are building tools and processes so that we can help them debug bugs in their code quickly and easily. That’s basically one way for us to make sure that people can run at full speed when they’re using Rockset. Some of these are difficult to solve because the variety of problems is quite big.
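
The field masking mentioned in this answer, where sensitive fields are either dropped or hashed to a fixed value before the data leaves the customer, can be sketched as follows. This is a minimal illustration under stated assumptions: the `mask_fields` function, its parameters, the salt, and the sample record are all hypothetical and do not describe Rockset's actual tooling. A plain salted SHA-256 over low-entropy values is also not sufficient on its own for regulatory anonymization.

```python
import hashlib


def mask_fields(record, drop=(), hash_fields=(), salt=""):
    """Return a copy of record with some fields removed and others hashed.

    Hypothetical helper for illustration only.
    """
    masked = {}
    for key, value in record.items():
        if key in drop:
            # Masked out entirely: the downstream service never sees it.
            continue
        if key in hash_fields:
            # Replaced by a deterministic digest, so equal inputs still
            # produce equal values and can be grouped or joined on.
            data = (salt + str(value)).encode("utf-8")
            masked[key] = hashlib.sha256(data).hexdigest()
        else:
            masked[key] = value
    return masked


# Made-up sample record.
patient = {"name": "Jane Doe", "ssn": "000-00-0000", "heart_rate": 72}

safe = mask_fields(patient, drop=("name",), hash_fields=("ssn",), salt="demo-salt")
print(sorted(safe.keys()))
```

Hashing rather than dropping a field keeps it usable as a join or grouping key while hiding the raw value.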

Kanchan Shringi 01:07:10 Thanks for all that, Dhruba. So my takeaway from this section really was the fact that embedding analytics into the SaaS apps really helps make the products themselves better: reacting quickly in real time, handling much higher query volumes and many more data types. And as we talked about earlier in the very first section, this does introduce operational overhead. There are additional things the SaaS developers have to account for, to also build the workflows to react to this and also to match the demands.

Dhruba Borthakur 01:07:44 Yeah, that’s a great summary. All of these things are important. All these things have to fit in place and complement one another to be able to embed analytics into your application.

Kanchan Shringi 01:07:54 So let’s start to wrap up now. I hope we covered a lot of ground and at least the key things that you were hoping to cover. Is there anything that we missed that you would like to talk about?

Dhruba Borthakur 01:08:05 For real-time analytics, operational complexity is definitely something that is top of mind for most people. Not because it is hard, but because even in earlier times, people could actually do some of these real-time analytics if they cobbled together five different software solutions and hired five people to be able to do real-time analytics. But the real differentiator, I think, is that people are seeing more and more solutions in the market to be able to do real-time analytics as a canned solution. Rockset is definitely one of the primary ones there, where you can just point and click at different data sources. And you can think about data as an API rather than thinking of data as having gravity and trying to do a lot of heavyweight lifting. So if there are tools that people can use that let them use data as a really simple API, so they don’t have to think about the complexities of queries or scaling up and down or fault tolerance or whatever else, that’s a great enabler, I feel, for making these embedded real-time analytics solutions super useful and adopted by lots of developers.

Kanchan Shringi 01:09:20 How can people contact you?

Dhruba Borthakur 01:09:22 Please go to the Rockset website. So Rockset is a hosted service, which means that as a user, you don’t have to install anything. You go there, click and sign up for an account. And within five minutes, you’ll actually be able to try connecting your data set to Rockset and making queries. Sometimes, if you want to know more about the backend instead of trying Rockset as a service, you could also just click on a demo request, or you can just send me an email, and we can definitely help answer your questions and see how we can help in setting up your system so you can use it.

Kanchan Shringi 01:10:00 Thank you so much, Dhruba. This was a very, very interesting session for me. I hope our listeners enjoyed it and learned from it as well. Thanks a lot for coming.

Dhruba Borthakur 01:10:09 Thank you. Thanks, Kanchan. I really enjoyed the discussion with you.

Kanchan Shringi 01:10:13 Thank you. This is Kanchan Shringi for Software Engineering Radio. Thanks for listening.

SE Radio 01:10:19 Thanks for listening to SE Radio, an educational program brought to you by IEEE Software magazine. For more about the podcast, including other episodes, visit our website at [email protected]. To provide feedback, you can comment on each episode on the website or reach us on LinkedIn, Facebook, Twitter, or through our Slack channel at [email protected]. You can also email [email protected]. This and all other episodes of SE Radio are licensed under Creative Commons license 2.5. Thanks for listening.

[End of Audio]

SE Radio theme: “Broken Reality” by Kevin MacLeod (Licensed under Creative Commons: By Attribution 3.0)
