SE Radio 592: Jaxon Repp on Distributed Data Infrastructure

Jaxon Repp of HarperDB speaks with Brijesh Ammanath about distributed data infrastructure, including what it is and why it’s important. They discuss the key factors that make distributed data infrastructure attractive, as well as challenges to implementing it. The episode explores the architecture and design principles, the key security considerations, and the transition factors for distributed data Infrastructure. Brought to you by IEEE Computer Society and IEEE Software.

Show Notes

Related Episodes

Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Brijesh Ammanath 00:00:18 Welcome to Software Engineering Radio. I’m your host, Brijesh Ammanath. Our guest today is Jackson Repp. Jackson is the CTO and head of marketing for HarperDB. He has over 25 years of experience architecting, designing, and developing enterprise software. He’s the founder of three technology startups and has consulted with multiple Fortune 500 companies on IoT. That’s Internet of Things and Digital Transformation Initiatives. Jackson, welcome.

Jaxon Repp 00:00:43 Thank you very much for having me.

Brijesh Ammanath 00:00:45 In today’s session, we will talk about Distributed Data Infrastructure, understand what it is, benefits of using it, the challenges which are unique to it. We will deep dive into the architecture and design patterns of distributed data infrastructure and also delve into security concentrations. Finally, we’ll look at what one should consider while transitioning into a distributed data infrastructure setup. So let’s start with the fundamentals. Jackson, can you help explain to our listeners the fundamental concept of distributed data infrastructure and why it’s important?

Jaxon Repp 00:01:17 Absolutely. There are a few reasons why one would want to set up one’s data infrastructure in a way that data does not live in a single location. The old adage was that I want to be more resilient. A data center might go down. So if it does, I don’t want my entire application to fail. I would like to be able to route to a different place where the data is able to be served and my application can continue to function. That is what we have sort of dealt with for years and years. And there’s lots of rules and tools around setting up a distributed data infrastructure, mostly for resilience. And that includes things like disaster recovery, a site that’s very far away that eventually gets that data. I may not access it, so it doesn’t need to be very particularly fast, but if all of my main office buildings burned down, I still need that data.

Jaxon Repp 00:02:06 So I needed to be and exist somewhere. More recently, it has been a challenge of scale. I have so many users that are trying to interact with my data, that having a single database answer those questions or accept those inserts is not sufficient. Or if it is, I’m spending a lot of money to put a terabyte of RAM and multiple cores in a machine so that it can perform at a level that is satisfactory for my users when they’re trying to interact with that data. If I am in many locations, obviously I divide up that user base and the resource requirements for each node of my data infrastructure are lower. It introduces other challenges though, however, the most recent reason, and this is I don’t think it’s entirely pandemic driven, but before the pandemic sub 50 millisecond response times for APIs were not something everybody was demanding.

Jaxon Repp 00:03:06 We rarely heard about it. High performance applications, absolutely. But for most people it was, get it to me within half a second, 500 milliseconds, and that’s fine. And when everybody started working from home, everybody got real impatient for apps that were slow. And the newest reason we have distributed data infrastructure and the one that sort of drives my company and companies like us, is the idea of extremely low latency. So not only am I distributing the load across multiple data ingest points or access points for queries, and not only am I resilient, and not only am I protected in case of a disaster but also, I am closer to the user. So I am lowering the latency for that response or for that insert. So it’s really a combination of latency, cost, resiliency, and ultimately the whole balance of that is complexity, because obviously we keep adding moving parts to a system, it gets more and more complex.

Brijesh Ammanath 00:04:06 Yeah, all of that makes sense. So resiliency, scale, low latency, all of that is very critical and important for any business that’s out there on the net right now. But circling back to the fundamental concept, what does data, when we say it’s a distributed data infrastructure, what does that mean?

Jaxon Repp 00:04:24 It means that I’m going to store data in more than one place. The idea that I may divide my data into multiple servers, I may divide my data across multiple regions. I may replicate my data in whole or in part across multiple regions. So traditional sharding divides up very large data sets across multiple servers, keeps track of where each piece of data is and allows you to access it. So that’s the basic level of putting data in more than one place. When I talk about distributed data infrastructure, I mean your data is accessible in whole or in part in many different locations around, let’s say the globe, or it may be around your service area, but distributed data infrastructure are the servers and the connections and the software that allows you to access data as if it was sitting on your laptop as quickly as possible, even though it may be located in multiple places in the world.

Brijesh Ammanath 00:05:27 Understood. And why is it called data infrastructure? Is it same as distributed databases or when you say it’s distributed data infrastructure, are you taking a different view of what the stack is and does it include additional components and layers?

Jaxon Repp 00:05:42 Well, infrastructure requires obviously hardware, the servers to run it, the software that is going to store your data. And when you have distributed data, you need some way for those independent nodes, those independent servers to speak to each other. So there’s tooling around how those nodes communicate, how a transaction that happens on one server is replicated and reconciled on a different server where a similar transaction might have just happened to affect the exact same record. So the tooling and systems to make sure that everything in an end state is what should have happened based on the time at which it happened anywhere in the world is complex and requires a lot of individual parts. So I refer to infrastructure as a combination of software and hardware, and not just databases, but all the tooling around it that allows you to break them out and have them live more than one place.

Brijesh Ammanath 00:06:42 Right. So in my mind side, it’s much more comprehensive than just a distributed database because includes not just the database, but it includes the processing, it includes the engines, integration tools, and all of that is combined under the umbrella term data, distributed data infrastructure.

Jaxon Repp 00:07:00 Absolutely. When you think about what you’re really trying to achieve, extremely low latency for users that are dispersed around the planet, there’s a lot of tooling that not only goes into just storing the data and making sure that this entity that covers the planet has your data and is readily accessible, but there’s also a lot of tooling to make sure that people can push data in, in a way, or get data out in a way that fits with their end use. So there’s also lots of software involved in making it easy. We often refer to distributed data infrastructure as a data fabric. There’s a lot of work that goes into making a weave. There’s lots of different sorts of fibers that go into it, but the objective is to provide a uniform surface from which and to which data can flow.

Brijesh Ammanath 00:07:51 Understood. And are there any specific considerations which are required for distributed data infrastructure to be optimized for IoT applications?

Jaxon Repp 00:08:02 Yes, I think one of the off cited statistics is that we’re going to have several trillion devices generating information and where that information ends up, whether it’s just fired off, not collected, and it exists in the ether and that device just measures something and every once in a while you go over and just look at the current value, that’s a perfectly good sensor. And walking across the room and looking at it is a perfectly good way to manage it. If I’ve got several hundred under my domain or several hundred thousand for say, a large manufacturing concern, then really what you’re looking at is how can I handle that level of throughput? How can I handle it securely? How can I reconcile that if they are doing ingest into different nodes, different servers, different databases in different parts of the planet, how do I leave those together? How do I make sure that I reconcile them in time? So are my clocks aligned? Things like that. There’s a lot of concerns when you’re talking about distributed data infrastructure that are absolutely table stakes in order to be able to say, I trust that if somebody in Mumbai puts a piece of data into this server that somebody in Los Angeles when they ask for a question of my server will get the right answer. That includes the data that was just inserted on the other side of the planet.

Brijesh Ammanath 00:09:29 True. We touched on the benefits, but I want to double check on some of the benefits that you mentioned earlier. So data availability is crucial for businesses. How does distributing data across geographical locations on nodes contribute to improve data availability? And how does that all sync up together?

Jaxon Repp 00:09:48 You can imagine, and we can go back to the IoT use case, when I have hundreds of thousands of sensors producing data, I may be looking at temperature and I might be looking at vibration and I might be looking at any number of parameters that are coming off that machine that I’m measuring. And if I ingest that all off of hundreds of thousands of machines that are creating that data every five milliseconds, that is a tremendous amount of ingress. And when I have all of that data, the availability of it depends on what my use case is. If I’m just trying to create a dashboard that tells me the current value of what’s the current temperature of that machine? Is it too hot? Do I need to turn it off for a while? Do I need to increase cooling? Is it about to fail?

Jaxon Repp 00:10:37 Do I need to do preventative maintenance? Do I need to, I don’t know, change the oil, then that’s easy. I could just look at that. But if I wanted to allow intelligence systems to look at data, to look at patterns in that data, say machine learning, right? AI looking at every single reading off of every single sensor attached to every single machine, and also considering other external factors like the temperature and humidity in the environment, in each location of my factory, I often call it the Rumsfeldian challenge, right? I don’t know what I don’t know. So the safest method often is to default collecting absolutely everything. So that what truly needs to be available for my enterprise, what truly can make a difference, what truly can improve my performance profitability, resiliency is in there somewhere. And we are very, very quickly getting to a place where distributed data infrastructure is a tool for very quickly collecting and returning, but also analyzing all of that data in a way that perhaps humans can’t even do anymore.

Jaxon Repp 00:11:44 So having a platform that accesses data. When we talk about availability from a data system, you’re often talking about somebody typing in a SQL query and asking it a question or a customer coming in and hitting an API endpoint that asks a question and gets that back even at the peak volumes of hundreds of thousands of requests a second for some of the largest networks in the world. That’s fine, that’s, we can do that with existing infrastructure today. But when you talk about a machine learning algorithm that needs or is capable of processing things hundreds of thousands of times faster than that, because what they really want to do is recognize a pattern and perhaps turn off a machine before it blows up and kills people. Or in some people’s eyes even worse, loses productivity for a day at a million dollars an hour.

Jaxon Repp 00:12:35 That’s what availability is becoming. It’s not just what you and I are going to look at and be able to comprehend. It’s what machines are going to look at and the depth and breadth of what they can look at is several orders of magnitude, obviously what we can do, and that’s where the industry is going. So while we say low latency for users and API endpoints or ingest from lots of users that are maybe adding comments on a YouTube blog, that’s really, really small scale compared to what we’re seeing in industry today. And so infrastructure needs to be able to handle that and that’s what we are building tooling for. That’s what truly distributed data infrastructure is great at.

Brijesh Ammanath 00:13:16 Do you have any specific examples where any of your customers have benefited from implementing distributed data infrastructure? I think it’ll really help bring those points around resiliency, scalability, low latency to life.

Jaxon Repp 00:13:30 Absolutely. I’ve worked on an IOT platform in a former life, and we were effectively sensorizing hot dip galvanizing plant, which is where you take the steel poles that hold up road signs, like literally everything is hot tip galvanized in the world so that the steel doesn’t rust. And that process was incredibly manual and the people who dipped each of these massive pieces of metal in three chemical ponds and then in a pool of molten nickel were treating it like it was an art. Now the reality is the system required or the, the legal requirements for any given road sign holding piece of metal, have a certain thickness of zinc that you need to put on there in order to guarantee that it won’t rust and it will withstand whatever element and whatever environment it’s put in. And treating it like an art didn’t seem right because what humans would do would prefer to never put too little on so they don’t have to redo it.

Jaxon Repp 00:14:28 And as a result, they were putting way more zinc on this metal than they needed to, and that was the primary cost center for this business. So we built a system where we could sensorize it and we could say it was in this tank for this long, this tank for this long, this tank for this long. Then we dipped it in the zinc and the operator left it this long. Then we measured how thick it was versus how thick it was supposed to be. And we said, we’re going to put a green light right in front of you. And I know you think it’s an art when you take it out but take it out when the green light is there. I promise you we’ve already done the work. We’ve analyzed not only the length of time and the mill spec of how thick that zinc was, but we’ve also analyzed all the external factors like weather, temperature, humidity, and when that light turns green, pull it out of that tank.

Jaxon Repp 00:15:21 And sure enough, we were meeting the mill spec required and we were using 30% of the raw materials that we were using before. So that’s a 70% increase in efficiency. And we did that project in a few weeks because humans are really terrible at judging objectively chemical processes that science and measurement and tooling is much better at. And in order to be able to make that calculation as to when that light turns green, we needed all of the data from all of the tanks and all of the external factors, and we needed a system that could also respond and have output that said, hey, you’re done. Turn on the green light and tell that guy to take that out of that tank. As a result, profits jumped and the customer is extremely happy.

Brijesh Ammanath 00:16:06 Very interesting example, and help me understand the scale of the operation, which needed a distributed data infrastructure, and a traditional relational database would not have worked with you.

Jaxon Repp 00:16:17 So it needed to be able to store the data locally. Because the other thing that’s extremely true about hot dip galvanizing is it’s an incredibly dirty process and it doesn’t happen anywhere where humans are. So it often doesn’t happen anywhere where there’s really good network connectivity and it happens across massive yards. So the measurement of that mill spec doesn’t happen next to the shed, but the shed usually isn’t super close to the office either in case there’s an industrial accident. So there’s lots of failures in network connectivity. So you need to work locally, so you need to be able to store that somewhere. And then when network connectivity is restored, which it does sporadically, you need to be able to move that data into the central location because looking at it locally only gives me one to maybe a hundred pieces a day, but I need enough data to tell you exactly what environmental factors, exactly what temperatures each of the tanks are at to be able to build an algorithm that will tell me to take it out at the right time no matter where I am across the country. So in essence, what we did is we were recording the data locally, we were employing an algorithm that ultimately told you to turn the light on, but the algorithm was a product of all the aggregate data across all of the individual facilities where this process was happening.

Brijesh Ammanath 00:17:42 Right, understood. So I guess this is an example of known unknowns being translated into known knowns and then having an implementation which, analyzed it and changed an art form into a highly mechanical form.

Jaxon Repp 00:17:56 Absolutely. That’s exactly it. We took raw data and we turned it into human action.

Brijesh Ammanath 00:18:01 Excellent. Let’s move on to the challenges. So what are some of the challenges that organizations face or might face when implementing a distributed data infrastructure? And if you’ve got an example to walk us through those challenges that any of your clients have faced or while you have been consulting or in your current role where you have experienced clients implementing a distributed data infrastructure, what were the challenges they face that would help bring it to life as well?

Jaxon Repp 00:18:28 It’s a great question because the challenges are different every time. And I think one of the things we’ve learned at HarperDB as we go into different organizations to help them improve performance and resiliency and all of the benefits is under understanding exactly what their topology, right? Where their servers should live, what their servers need to accomplish, what is the load and what is the necessity for availability? And ultimately what is the scale of that data? I think most industries, most clients have a certain amount of data that they are used to dealing with. And when you start to tell them that there are insights, right, the unknown unknowns that they could capture and optimize from, maybe not today, but in the future, the planning of that topology and that infrastructure, including where I am today, and the path to get to the end result.

Jaxon Repp 00:19:25 That’s the biggest challenge because for most customers, they can’t afford to shut down their entire data infrastructure and migration feels like a multi-year task. So the opportunity is to provide them with an infrastructure where it’s not rip and replace. They don’t have to tear things out. They want to extend their existing system with distributed data infrastructure that delivers all the benefits. And the challenges are how do we keep your existing system functional, as functional as possible while delivering those benefits out on the edge. Specifically, we talk about the observability is super-hot right now. It’s the hottest trend. Everybody wants so many statistics about their systems operating status primarily to optimize, to make sure things don’t go wrong because data infrastructure is so critical. But when we are doing that, we look at, well, what’s truly happening in your data pipeline? So for requests say, how many layers of caching are there, what is the response time for every single layer of that cache?

Jaxon Repp 00:20:29 What are your cache hit and miss rates? How can we optimize that? We’re working with a specific customer right now who has 600 million SKUs, and they are trying to manage the availability of those. And obviously you don’t want to serve every single one of those from a database in real time. You want to cache as many of those responses as you can. So how do you cache them? How do you decide what gets cached? What remains cached? What gets evicted if I run out of room because servers do have limitations, and how do I optimize this because it might not be the same in every location. So my rules for evicting things from a cache, my rules for partitioning data and moving a record perhaps somewhere else are wholly determined based on the user behavior in that particular region. So helping them plan and understand that moving data is a natural part of a distributed system as opposed to the current thinking, which is my database lives right over there and that’s where all my data is.

Jaxon Repp 00:21:33 So getting customers used to the idea that sometimes the machine or the system or the algorithm or the AI is going to make decisions that maybe your human engineer wouldn’t do, and your human engineer will say, I don’t know why that record isn’t in cache because I just hit it and it came back as a cache missed. I just hit it and it came back with, it had to access disc instead of being pulled from the RAM. Why is that? And many times we investigate and we’re like, well because you’re literally the only person hitting that particular endpoint, asking for that particular item. And it’s their favorite item, it’s the item they always test with, but it has fallen out of favor because there is a newer model, or everybody else in the world is looking for a different version now. So helping them understand how the rules are often optimized, not by humans, but by machines.

Jaxon Repp 00:22:27 And another great example is the idea of database admins who like to look at query plans and optimize queries. That hand, that is their job. That is what they’ve been doing for 30 years, 40 years. So they know exactly how to do it. They work really, well with antiquated databases. I won’t say antiquated, but just enterprise scale databases that have been running for over a decade and they know every toggle that they can handle. And it’s hard to help them understand that modern systems can toggle those things on and off and reorder and optimize queries far better, far faster, and in real time than a human can do it. So helping them understand that modern software isn’t trying to take their job. There’s lots of complexities that come with a distributed data infrastructure. It’s just how do you help them understand their new role as a manager of a distributed data system as opposed to a database IE singular database administrator.

Brijesh Ammanath 00:23:29 Right. So if I understand it correctly, it’s the challenges are more around changing the viewpoint of technologists from the traditional centralized database to understanding the complexities around optimization and rule based usage, which are in a distributed data infrastructure. Is that right?

Jaxon Repp 00:23:48 It is. It’s also understanding the limits of a distributed data system. And there’s a mathematical theorem called the cap theorem that says you can have consistency, availability, and partitioning, but you can only have two. Right? The three-legged stool says you cannot divide your data across multiple instances and have it be available if you want it to be exactly the same all the time. IE consistent. So I cannot have two nodes on opposite sides of the planet that are always on and ready to answer questions. If I also want the second I put in data on one side of the planet at the very same millisecond or a nanosecond after, I want to ask that question, I want to ask that question on the other side of the planet and get the answer reflective of the data you just put in. Because it takes time.

Jaxon Repp 00:24:39 It takes time for data to move across the planet. In our most optimized systems we’ve seen 75 milliseconds to replicate an insert around the planet to a hundred nodes. So it’s super-fast, but it’s not instant. And that theorem puts limitations on what truly distributed infrastructure can do. You could either lock the row, which means that it’s not going to be available to everybody while I am writing it globally. And then you don’t have availability. You could have a single database, which means I don’t have partitioning, or I could accept that I will be what we call eventually consistent, and again, eventually means perhaps 75 milliseconds. But I wouldn’t want to have a banking app where I could take a dollar out of an account that has $1 in it and you could do the same thing on the other side of the planet at the exact same nanosecond. That feels like a very bad way to run a bank.

Brijesh Ammanath 00:25:35 Thanks. So, from a distributed perspective, I guess the primary contention would be around the availability in the cap theorem?

Jaxon Repp 00:25:44 Yes. And you can choose, right? There are absolutely products that, let’s accept that in a distributed data system, everything is partitioned because we have more than one node. There are products in the marketplace. CockroachDB is one that does global row locking and has global consistency, right? They’re asset compliant across the cluster. That is a totally acceptable paradigm, but you sacrifice performance because not everybody can write to a row while that row is locked. So as a result performance may be lower, but you get that asset consistency worldwide. My company Harper db, we focus on availability. So we are eventually consistent. We work really hard to make sure that our paths between nodes are optimized and we can lower that latency as low as 75 milliseconds or less between those nodes to reconcile a transaction. But we can’t have both.

Brijesh Ammanath 00:26:38 Right. Can you explain to our users what does eventual consistency mean?

Jaxon Repp 00:26:42 Eventual consistency means that, let’s say I have two nodes, one in Los Angeles and one in London. And I’ll be writing to a table, I will be asking it questions and that will be happening in on, on both servers. If I insert a record, a new blog post, I put that in there. And consistency would mean that if somebody in London or somebody hitting the London server asks for all of my blog posts, they would see all of my blog posts, including the one I just put in there just literally a nanosecond ago. It just got committed to disc, the transaction is complete, and they say, gimme all this blog posts right then and it will appear. And that is consistency in an asset compliant way, right? Eventual consistency means that I will write it, but it might not be immediately available.

Jaxon Repp 00:27:35 IE consistent in London because it takes time for that data to move over there. The way you achieve consistency or asset compliance, right? That consistency on a global scale is for me to have the London server. Basically when I insert the data into Los Angeles, the first thing I do is say, I’m going to change this row, or I’m going to put a new row in lock this row and don’t let anybody read it because I am going to change it. So they may wait 75 milliseconds. So the database query might be slow, right? Because it’s waiting for that to be done. And ultimately that is the tradeoff that we make. So the idea that for a real time system, which most often is user behavior, it is sensor data, it is lots of analytics that are coming in. Is it super important?

Jaxon Repp 00:28:27 Is it critical? Will people lose money if it’s not available? The second is committed to disc everywhere, because if it is super critical that it’s available everywhere, think again, banking transaction. Yep. Then really the model is to either have global row locking or to have a single monolithic database that I write to that of course locks that row while other people are like piling in on it, even from London. But then their latency is high because where does that database live? Just if it just lives in Los Angeles, then people on the other side of the planet have a lesser user experience or higher latency while they’re trying to access data. So we tend to focus on Harbor D beyond on making it as fast and performing as possible for ingested REITs and we sacrifice consistency in order to achieve that.

Brijesh Ammanath 00:29:13 Perfect. Clearly understood. We’ll move on to the next topic, which is around security. Security is a top concern when it comes to data. And what are the unique challenges faced in our distributed data infrastructure setup from a security perspective?

Jaxon Repp 00:29:27 Well, right off the bat, it is very similar to any network, but we can start with the database aspect of it in that the database needs to be secured in exactly the same way. A normal monolithic database needs to be secured, right? All the ports, access control roles and users, and the ability to determine what people are allowed to access, let’s say what databases or what tables, or even in the case of our product, what attributes. So think of it as column level security. So your select star with your role, might be different than my select star. My select star might bring back everything and yours brings back everything except the PII, the personally identifiable information. So, so normal database considerations as to how people can access this data and whether or not it’s encrypted on disc and whether or not we want to do that based on the trade off in terms of latency.

Jaxon Repp 00:30:22 Because obviously in order for me to search stuff and return stuff that’s encrypted, there’s encryption and decryption and it just adds to that time. So there’s lots of standard database security things that you need to consider. And then in a distributed data system, you also need to consider the security of the communication between those instances of that database. So what sort of resiliency do you have? What sort of encryption as the data travels across, right? So at the very minimum, TLS sometimes MTLS the idea that things need to be absolutely secure in flight so that you can’t have somebody listening in the middle and intercepting that data because obviously it’s not sitting in one place anymore. It moves around constantly. So there’s the security of the channel itself, the security of network access to hardware obviously, but then there’s the security of the actual transmission of those messages.

Jaxon Repp 00:31:15 And I would say less security than assurance. Can I handle a network outage? Can I route around it? Can I handle the fact that I might have data that I just put into Los Angeles server that needs to be in London, but somehow that network connection, I cannot peer-to-peer connect to it. So how do I intelligently understand all the routes that are available to me, understand that the data needs to get there and publish it to that server in such a way that it still gets there. That may mean rerouting it, that may mean publishing it to a different server that then forwards it on to Los Angeles or to London. It may mean storing it and retrying it. And it may mean that I want to have an exactly once delivery, I send it and I get an act back so I know that it’s been received, at which point I’m like, that one’s taken care of, let’s move on to the next one.

Jaxon Repp 00:32:12 Or do I want to, am I okay if, if something that gets put into this data, say a sensor data chunk, obviously acting back for every transfer of data takes time. It slows down the rate of transmission. So if for raw sensor data that I can afford to miss one or two readings because I’m collecting them every five milliseconds is okay, if missing some of that is okay, then maybe I just fire and forget. Maybe I just fire it off and say I hope it gets there. Maybe I will fire it off three times. Figuring my failure rate is so low that if I fire it three times, at least once it’s going to get there. So on the other side, how do I recognize that I’ve already received that message and not accept it more than once? Because if I don’t protect myself against that, then perhaps somebody could literally spam me and DDoS my system to death.

Jaxon Repp 00:33:01 So there’s lots of rules around both the database storage, the instance itself, the communication between the system. And then when you’re looking at a distributed data infrastructure, what you’re not actually doing is saying, I’m going to talk to the Los Angeles server, I’m going to talk to the London server. You are going to talk to a single endpoint and you are going to put data in and you need intelligent routing to make sure that that actually gets into the system. So, global DNS based load balancing security products for DDoS in front of that and, lots of the big network services that are traditionally employed in every network are super important for distributed systems. Because if I’m able to get in, in a way that is malicious, you risk not just taking down one of your databases, which happens to people, fairly frequently, but you risk taking down a whole network of servers.

Jaxon Repp 00:33:57 So you’ve already sold in perhaps a solution that guarantees sub 25 millisecond response times globally. And because your distributed system has in some way been compromised, not only aren’t they getting the extremely low latency that they are counting on properly for downstream systems that look for data at a certain rate so that they can make intelligent decisions in real time. Now the whole network’s down. So now that system is going to back up and there’s going to be errors everywhere. So the other part of this is what we can control for is the managed service network that we set up, but we can also extend that data fabric sometimes on-prem where we don’t control the infrastructure, right? It’s the network closet down the hall and it has a massive hard drive and it’s where you’re running the really big analytical queries because it’s just faster. You want it in house. Somebody decided that it was a good idea not to move to the cloud for this because it would be too slow, right? Too far away. Because I need sub millisecond response times. So when I can’t control for that, how do I recognize patterns coming from something outside my managed infrastructure that are indicative of something that is wrong or incorrect, poorly shaped, or even malevolent.

Brijesh Ammanath 00:35:11 Right. Now we’ll move into the dig a bit deeper into the architecture and move to the architecture section. I appreciate as you give examples in this section, you might touch on the product that the HarperDB product that you’re currently working with. So feel free to give examples from there as well in terms of how we have solved problems around architecture. In HarperDB. Let me start with asking we know that NoSQL databases are inherently not asset compliant. So the asset which stands for atomicity consistency, isolation, and durability. And the usual approach to solve for us for this is to use multi-model databases. Can you start off by explaining what are multimodal databases?

Jaxon Repp 00:35:56 Multi-model databases store data in multiple formats in order to achieve multiple goals. So the idea, if your listeners are familiar with like a row based versus column oriented database, each has advantages, time series database, which has processes that create aggregates, right? I don’t want to run and tell me what happened. The average every 15 minutes across the time series database. So you may store it in a different way so that you have very quick access to aggregates. You may store it in row if you are often returning individual entries, like this is my blog post, you may store it column oriented. If you are looking at things from an analytical perspective, there are lots of ways to store things and often you store them in multiple ways based on multiple uses. HarperDB, our product is a single model database and we are acid compliant at the node level.

Jaxon Repp 00:36:54 We are a no SQL solution, but we have SQL semantics on top of it because we find that lots of customers are super used to SQL and they want to join things. So one of the challenges with one of the most popular no SQL databases is that you can’t do joins. You have to pull the records and do that in code and ultimately that’s far slower. So the way we have attacked consistency is we absolutely are asset compliant at the node level. We just can’t, in a distributed system be ask and compliant across the network because we need to lean into the fact that we are extremely performant. Mm-hmm , we want to accept and return data as quickly as possible. That’s that trade-off for consistency, right? Eventual consistency allows us to be asset compliant at that node level, but we are eventually consistent across the cluster.

Brijesh Ammanath 00:37:43 Right. For all listeners who want to learn more about multimodal databases, they can listen to one of our past episodes, which is episode 353, which goes into depth about multimodal databases. We’ll move to the next section, which is around businesses wanting to transition to distributed data infrastructure model. And I appreciate it’s specifically relevant, especially relevant for real-time domains. What are some practical steps that businesses would want to look at? Or what are the best practices to be kept in mind when you plan the transition?

Jaxon Repp 00:38:19 I think one of the biggest things we see customers concerned with, they understand all of the benefits. They have accepted the idea, at least with our product of eventual consistency. Obviously, the ones who don’t use other products and we wholeheartedly recommend them, but the biggest thing is the lift involved and the fear and the entrenched uncertainty and job responsibilities. It’s really, we can architect the solution. It is institutional concern over all the things we’ve talked about. So going through for a large enterprise, the security audit going through the pen test going through looking at dependencies and software and running us through a tool of analysis. So can we survive the gauntlet of each enterprise’s individual security requirements and teams? Can we explain the value effectively enough to the decision makers at a high level?

Jaxon Repp 00:39:23 And then also turn around and explain at the lowest level to the actual infrastructure IT teams exactly what we need, why we need it, what the ports are, what the throughput is when they, their internal tooling needs to fire off an alert about something that’s wrong. How do they understand that there are no external transactions coming in, but the database is still moving things around, they still see network traffic, that makes them nervous because they’re used to, well, the, the transaction load peaks at noon every day and then it goes down. So itís really, education is, most of my job is going into organizations, helping them draw it out. They get what the graph looks like, they get what the schematic looks like, they get where the nodes are, they understand why we would place applications where we would and why we would partition data the way we would and why we’d set up access controls the way we would.

Jaxon Repp 00:40:20 But they are nervous, understandably about simply jumping off a system that works and that they continue to throw money at to vertically scale. IE add resources to serve a growing user base or a new requirement for lower latency. So they see the benefits, but there’s challenges when you talk about moving from my current application stack, which might have let’s say a hundred moving parts, right? That’s standard. We have tooling and systems and measurement and we can sort of hold those in our mind and we know every one of the a hundred pieces and we get it. But if I want to be in a hundred places around the planet, well now I’m worried about 10,000 moving parts. And when I talk about the parts, I’m talking about the database, but in an application stack you’ve got the application layer, right?

Jaxon Repp 00:41:08 Maybe it’s your API servers, and then you’ve got your real-time streaming service, your message queue like a Kafka. And if you use like a MongoDB traditionally, you’ll have a Redis cache sitting in front of it. And if I’ve got all of that to set up everywhere, it becomes a little daunting. Because one of the things we do is we, we draw your existing infrastructure, your existing stack, and we say if you want low latency, we can achieve it with all of the parts you have right now. Here’s what it looks like. There’s an awful lot of boxes on this page, and I’ll lean into HarperDB product a little bit. Is what we’ve done is we realized the distributed infrastructure is complex and scary. And so we built an application layer on top of the database so you can build your APIs, you can build your machine learning classification or recommendation engines there. You can even build machine learning models in its long running processes like collecting sensor data. So your application layer is directly on top of the database. So there’s not over the air call or a connection or a driver that you need to use, which lowers latency but also lowers complexity. And then the same real-time streaming mesh network that connects all our nodes is capable of ingesting Kafka streams. It’s capable of ingesting MQTT, MQTT protocol from mobile devices or mobile apps. So we built that in too.

Brijesh Ammanath 00:42:29 Sorry, what protocol was that?

Jaxon Repp 00:42:31 MQTT.

Brijesh Ammanath 00:42:32 Sorry, I’m not aware of it. Can you help explain what it is?

Jaxon Repp 00:42:35 Oh, MQTT is a protocol, a pub/sub protocol that has its own specific package shape, but is a pub/sub based model. So think a client that wants to use a little bit less data than sending a whole Json object as a packet, MQTT is more efficient, and it has built in logic around exactly one’s delivery or at least one’s delivery or at most one’s delivery.

Brijesh Ammanath 00:43:00 Okay.

Jaxon Repp 00:43:01 It’s a communication protocol, it’s a messaging protocol and it’s one of the many ways that people try to ingest and egress or subscribe, publish and or subscribe data into modern systems. And you have to accommodate that. So we built within our streaming engine, because our data replication paradigm is pub/sub. I can choose to have a table in Los Angeles publish to a table in London, or I could choose not to. So in many ways, I don’t have to move the data around all the time. It’s not an equal copy everywhere. It’s what data needs to be moved and what data doesn’t, and we accommodate that. So if you think about like your traditional message queue, your traditional database, your traditional API servers, your traditional in-memory cache, like a Redis in front of a MongoDB, we built all of that into a single product specifically because we knew that this was going to be a complex undertaking and we wanted to overcome the objections of trying to manage a hundred thousand moving parts when we walked into an enterprise and said, distributed computing is your future and we’re a database that talks to each other.

Jaxon Repp 00:44:09 Honestly, the reason we built the application layer on top of Harper DB was because we had a distributed database. We had message queue moving the messages around, and we went into deployments, which were super focused on low latency. Our customers API servers where they’re in the cloud somewhere and we’re like, okay, well we definitely want to be in the same region and maybe weíd even like to be in the same availability zone data center. Can we be in the same rack? How close can we get? And we realized that we weren’t really in control of the latency. There was a gap between where the client connected and where the data lived. So we couldn’t guarantee an SLA in terms of latency because how close could we get? So we built a full function application layer on top of the database specifically because we wanted to lower latency.

Jaxon Repp 00:44:59 It turned out the way we built it with credit to the team, the care we put into the ergonomics and developer experience, it’s been very, very easy for our customers to transition from their existing API or application infrastructure into transition those functions into our application layer A, they do it because it’s easy, but B, it because it lowers latency. And certainly when we go into enterprise and we’re selling them the solution, we don’t make anybody do anything. So we can start with your API servers just staying and making calls out to Harper DB and getting data just like they do with their existing database. That’s super easy. But the migration plan obviously is to eventually put everything into a single platform so that it’s much easier to sort of understand and manage.

Brijesh Ammanath 00:45:46 So from a migration perspective, start off with this security audit and then make sure that you know your entire dependency graph and a lot of time is spent on education and if you have a product which gives you the entire platform, that makes things much easier.

Jaxon Repp 00:46:01 Absolutely.

Brijesh Ammanath 00:46:02 Looking ahead, what trends or developments do you see in this space and how would they shape the organizations and the way they leverage data?

Jaxon Repp 00:46:14 I think the biggest trend is the idea that we want to build applications that are so performant that people don’t even notice that you’re loading data at all. It is a combination of user experience through UI and prefetching and being really intelligent about how you deliver data. No longer just initiating a call and waiting for it to be assembled and returned to you. But streaming that data back in chunks can begin to have an immediate impact on the user experience while the rest of it fills in the background. Pub/sub is huge. The idea that clients now connect permanently to a given server and they say, ìI want to know about this and anytime anything changes in the database, I want you to push it to meî. I don’t want to pull you, I don’t want to waste bandwidth, I don’t want to waste the overhead.

Jaxon Repp 00:47:16 I want to connect once, do my handshake, my SSL handshake once, and then I just want to be here and I want constant real-time updates, right? It’s getting closer and closer to real time, not just for apps that always sort of were real time, but apps that polling seemed like a perfectly acceptable strategy three years ago and, and now everybody’s like, oh, it’s pub/sub, that’s better. And when you do that, there are considerations that have to be taken in terms of the load that holding open a socket for every single client brings up. And you have to be able to understand how that affects your capacity. So what, even though we are distributing the load across multiple servers, the capacity of each server in a pub subsystem, even with persistent connections on a pub/sub, when I am say persisting a query which might have a user ID in there, that means I’m holding a little bit of RAM on that machine to look for records that affect my specific user or my specific query, right?

Jaxon Repp 00:48:22 A between these two numbers and there might be a hundred thousand people connected to the server and none of them are asking for data between the same two numbers. So that’s a RAM consideration. So how many of those sockets and how much work can I do to serve that? And does that number go down? Does it mean I need more servers? Does it need to mean I need to add resources to each server? And what does that balance? So it is absolutely an infrastructure and resource planning. And then obviously that plays directly into things like Kubernetes and containerized infrastructure where autoscaling happens and autoscaling for a distributed database. How is that achieved, right? I can add a new node automatically, but how do I get the data from, a currently performing node into a new node that I just spun up to handle additional connections and how do I do that in a way that responds to a peak in user behavior that might only last for 15 minutes, but it may be 10 x what I’m used to.

Jaxon Repp 00:49:21 So how do you not fail in a really dynamic system when the level of traffic, the intensity of those transactions can change in a split second? And to be honest, one of the solutions that I think everybody’s looking at, and you hear this across every database when you hear their CEOs speak in public is, AI is probably going to have some role in that, right? The idea that I can spin things up and duplicate it and I can sort of fake it till I make it and bring records over distributed querying the idea that yes, I have a new node, but I might actually be sourcing the data elsewhere until such time as the new node has a complete copy of all the data it needs to respond to requests autonomously. So there’s lots of things that are really exciting as we go forward.

Brijesh Ammanath 00:50:06 Sounds great. We have covered a lot of ground here. Was there anything that I missed that you would like to mention Jackson?

Jaxon Repp 00:50:13 I think I’ve said the word AI and machine learning lots today. And I can imagine a world where distributed data systems now that have multiple copies, whether it’s partial or whole or sharded and are distributing queries. Those sorts of systems are interesting. But I see a scale coming that is 10x several orders of magnitude larger than what we have right now because I’ve seen classification and recommendation engines that are so scary good. I’ve seen AI that is so incredibly competent because it has near instant access to every piece of data when it’s training, which is an incredibly expensive process. And then local training or retraining of a model based on local data. So the idea that everybody is sort of leaning into AI is amazing and I’m super interested to see how that affects what data infrastructure even looks like. Is it something completely different than the product we’ve built?

Jaxon Repp 00:51:17 And I’m absolutely sure that it is. And I would challenge everybody who deals with data infrastructure to look at how data truly is accessed, what are those patterns? What’s the human factor? But now what is the machine factor? How do you collect everything from so many disparate sources, not just the sensors that you own, but all of the other things in the world, including user behavior, server load, whatever and sort of auto generate a data fabric that is the sort of thinking we need. And if you’re dealing with data infrastructure in any way right now, and it’s maybe the more monolithic architecture, begin to think about how you can optimize the shape of your data and the shape of your applications to get ready for that. Because it’s coming.

Brijesh Ammanath 00:52:02 Right. If people want to find out more about what you’re up to, where can they go to?

Jaxon Repp 00:52:07 Well, my company is HarperDB and you can find us [email protected] and we have lots of blogs there. I occasionally write stuff. We have lots of guest authors that come in that do tutorials and teach people how to build applications inside distributed systems. We have a developer center that has all the resources you need to get started, and ultimately lots of templates for our applications that can help you bootstrap a distributed system quickly and easily. And that includes using HarperDB Cloud, which is our software as a service offering. So while you can always install HarperDB locally and that it’s a single NPM command, you can also like spin up a free instance online today.

Brijesh Ammanath 00:52:49 Jackson, thank you for coming to the show. It’s been a real pleasure. This is Brijesh Ammanath for Software Engineering Radio. Thank you for listening.

[End of Audio]

SE Radio 592: Jaxon Repp on Distributed Data Infrastructure

Show Notes

Related Episodes

Transcript

Join the discussion

More from this show

SE Radio 674: Vilhelm von Ehrenheim on Autonomous Testing

SE Radio 673: Abhinav Kimothi on Retrieval-Augmented Generation

SE Radio 672: Luca Palmieri on Rust In Production

Menu

Recent posts

Search

Search

SE Radio 592: Jaxon Repp on Distributed Data Infrastructure

Show Notes

Related Episodes

Transcript

Join the discussion

More from this show

SE Radio 674: Vilhelm von Ehrenheim on Autonomous Testing

SE Radio 673: Abhinav Kimothi on Retrieval-Augmented Generation

SE Radio 672: Luca Palmieri on Rust In Production

Menu

Recent posts