Tug Grall of Redis Labs discusses Redis, its evolution over the years, and emerging use cases today. Host Akshay Manchale spoke with Tug about Redis’ fundamental data structures and their common uses, its module-based ecosystem, and Redis’ applicability in a wide range of applications beyond being a layer for caching data, such as search, machine learning, serverless applications, and pub/sub.
This episode sponsored by BMC Software.
Transcript
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Akshay Manchale 00:00:49 Welcome to Software Engineering Radio, I’m Akshay Manchale. Today I’ll be talking to Tug Grall. Tug is a Technical Marketing Manager at Redis Labs. He’s a passionate developer and open source advocate with a deep background across the open source community. Prior to Redis Labs, he held positions at the likes of RedHat, MapR, MongoDB, Couchbase, and Oracle. We covered Redis previously in Episode 209. In this episode, we’ll go over the basics and then talk about more emerging use cases and how it’s currently being used and managed. Tug, welcome to the show.
Tug Grall 00:01:23 Very happy to be here. Thanks for the introduction.
Akshay Manchale 00:01:27 Great. So, let’s start off on a very high level. Can you talk about what Redis is and what it’s used for?
Tug Grall 00:01:33 So, Redis is one of the many databases that you have on the market. It’s part of the NoSQL movement that emerged around 2007, I would say, when you needed new databases to handle new use cases around speed, scalability, and flexibility. Redis was created as an open source project by Salvatore when he needed a very, very fast dictionary for data he wanted to access, and he created, in open source, a key-value store purely in memory. This is the initial DNA of this database, and over time (more than 10 years now) the community has added more and more features to it.
Akshay Manchale 00:02:20 So what sort of applications is Redis used in?
Tug Grall 00:02:24 So technically it’s used in almost any type of application. But it solves one problem very well, and that is very fast access to the data. This is why, as an entry point, projects use Redis as a cache or as a session store. This is one of the reasons it was built: when you build an application, you want to access the data as fast as possible, especially if you know that you will access the data multiple times in a row, like in user sessions. It doesn’t make sense to have to go back to disk, or make a long network round trip to another database, and so on. So people start to use Redis as a simple key-value store in memory for this type of use case. And over time, because it’s very simple to use, they start to do more and more things with the data. Keeping the same logic of being very, very simple to use, Redis has 10 different core data structures: simple strings, lists, sets, hashes, and so on. These are the types of data structures that will help you solve specific problems for your application, and they go a lot beyond a simple session store or cache. You can use them to store your e-commerce cart, with information about what you want to buy, or to do some recommendations with very, very fast access. So, if you look at it from a use case point of view, because it’s used to share data between different calls or different services, it could be used in retail or e-commerce to build a cache on top of a catalog or to hold a user’s session. That applies to almost any business you do today. If you talk about shared configuration, because you have different services in different technologies that want to share, for example, a specific configuration, you can use Redis. You can have a service in PHP that saves some data and a service in another language that wants to access this data. And this could be just actions on events or actions on configuration. So this is why, in terms of use cases across industries, I like to say Redis is everywhere. Usually, when you work on Redis and you go to a developer conference or a meet-up, you go into the room and ask, “Who is using Redis, or who has touched Redis?” You can be sure that almost everybody has seen it, heard about it, or knows that it’s there somewhere. That’s also because Redis is embedded in many, many solutions, like CMSs and e-commerce platforms, where it’s used internally by default to cache the data. So, this is kind of the DNA: speed, simplicity, and scalability for your data.
Akshay Manchale 00:05:27 So a lot of the speed of access comes from keeping things resident in memory, as opposed to on disk as in other databases, is that right?
Tug Grall 00:05:35 Yes, this is a choice of the Redis team. The philosophy of Redis has been that we want the data to be in memory, so that you have consistent latency. Usually, when we monitor an application, we look at one millisecond or less to get an operation done, and the good part of being in memory is that it will always be like that. You just need to be careful about how you scale. Also, we don’t necessarily store the data in memory only. When you configure Redis, you can say, “I want the data also stored on disk,” just in case you have to stop Redis. So, if one node crashes, when you restart it, you load the data from disk into memory. But when your application is working, it always goes to memory, and the whole dataset is in memory.
Tug Grall 00:06:31 So the main reason is to be sure that I, as a developer, have tested my application and know that this specific operation takes half a millisecond or 1 millisecond to get the data and do the comparison between these two lists. If I do that today, tomorrow, or in one week, I will keep the same kind of performance, because the data will always be in memory. In a traditional database, you may have data that you have not used for a while. So, let’s say you want to compare two lists, and right now they’re in memory. Great, but over time, because they are not used, they will go back to disk, and when your application needs them you will have to go from disk to memory and then get back to the user. So, this is one of the things that makes Redis very interesting for applications when you need what we call real time, in the sense that you know the time you will spend to retrieve the data from the database.
Akshay Manchale 00:07:28 So, what sort of data do you normally store in Redis? I know there are certain data structures that it supports out of the box. Can you walk us through some of the simple ones that are commonly used?
Tug Grall 00:07:37 So, you have 10 data types inside Redis. What we call “Redis core” is really the Redis you are thinking about when you’re asking me the question. The reason I say Redis core is that, as we will discuss later, Redis has an extension mechanism called modules that allows you to create new data structures and your own commands. So in Redis core, you have the key, and you associate a value with it. This is why we talk about a key-value store. But people will often say, “Oh, a key-value store, it’s a simple key and a simple value.” In Redis that’s not the case. The key is simple and very easy to define. You define it the way you want, and the smaller the better.
Tug Grall 00:08:24 Why? Because it’s kept in memory. But it has to be long enough, or clear enough, that you as a user understand what this key is. If it’s a UUID, you have no clue what it means; it’s not necessarily easy to read. So, associated with this key, you can have a string that could be up to 512 MB. Be careful, okay, because it’s not only the space that you use in memory, it’s also the network cost that you have to pay. So you have this string. The string is very often used if you have to cache information such as a fragment of HTML content, some user information, or the result of a web service call, and you don’t really care about how it’s structured. You will use this string. You also have hashes.
Tug Grall 00:09:13 So behind your key, you will have a set of fields, each with a name and a value. Typically, instead of having a long string that contains, using XML or JSON, some information about a user profile, like first name, last name, and so on, with a hash you will be able to add fields behind the key: first name, last name, and so on. The benefit is that you can have as many fields as you want. You could have thousands and thousands of fields, and when you save, when you retrieve, when you do an increment, you can work on a subset of these fields. The hash is a way of organizing and structuring content. So typically, if you do a product catalog, you may use this data structure with an SKU as the key and then a set of fields. And when you communicate with the database, you will only save or retrieve a subset of these fields, depending on the business API you are dealing with. In addition to these core, very commonly used data structures, you also have sets and sorted sets. You can put values in a set, query them, add to them, manipulate them, and navigate through them. A sorted set is the same, except that each value associated with the key has a score. A common use case for sorted sets is ranking in gaming. If you want to have the top five players for a specific thing, or the most downloaded features, and so on, you want to organize content in a way that you can sort and query very efficiently. For a specific sort (an “order by”), you will just have a key that is, say, the top 10 users for this game, and under it the name of each user with a score, and so on. And when you query that, you can read it back in order. And you also have lists. Lists are a very basic structure, not necessarily basic in the way they are implemented, but in the way you manipulate them. All you can do is pop and add items at the beginning or the end of the list. These are many of the core features that are used by applications. And then you have others, like pub/sub, which allows you to publish and subscribe to data and events. It’s used for notifications. It’s used for communicating between two different services in real time, with one system calling the other. And in Redis 5, one of the data structures that I really like, because I worked a lot with messaging technology before, is Redis Streams. It’s a log of events, more or less like Kafka. You can send an event, store the event in the database, and then have a set of consumers that can read part of it, read it entirely, read and write, allowing you to build some good, advanced messaging applications using Redis, out of the box.
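The sorted-set ranking use case mentioned above can be sketched in a few lines. This is a minimal in-memory model in plain Python, not Redis itself: the `Leaderboard` class, the player names, and the scores are all hypothetical, and the comments name the Redis commands (ZADD, ZINCRBY, ZREVRANGE) that play the corresponding role. Tie-breaking here is a simplification and may not match Redis ordering.

```python
# Minimal in-memory model of a sorted set used as a leaderboard (illustrative only).

class Leaderboard:
    def __init__(self):
        self._scores = {}  # member -> score

    def add(self, member, score):
        """Set a member's score (the role ZADD plays in Redis)."""
        self._scores[member] = score

    def increment(self, member, amount):
        """Add to a member's score (the role ZINCRBY plays in Redis)."""
        self._scores[member] = self._scores.get(member, 0) + amount
        return self._scores[member]

    def top(self, n):
        """Return the n highest-scoring (member, score) pairs,
        similar in spirit to ZREVRANGE ... WITHSCORES."""
        ranked = sorted(self._scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:n]


board = Leaderboard()
board.add("alice", 120)
board.add("bob", 95)
board.add("carol", 150)
board.increment("bob", 40)   # bob now has 135
print(board.top(2))          # [('carol', 150), ('bob', 135)]
```

A real sorted set keeps its members ordered as they are written, so reads of the top entries stay cheap; this sketch sorts on every read only to stay short.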
Akshay Manchale 00:12:26 So is that like a durable messaging system like Kafka or is there a difference there?
Tug Grall 00:12:31 No, it is durable, with the trade-off of being in memory. Right? So when I say in memory, once again, the data does not only sit in memory. It can be saved on disk; you choose when you want to save. When you configure and start your Redis instance, you can say, “I want memory only, so I don’t want to store anything on disk.” Or, “I want to save to disk every six hours, every four hours.” This is good if, for example, you have a big cache and you are exporting a big dataset every day from the mainframe into Redis to have very fast access, and you don’t want to redo all the processing if something crashes or your container goes away. But you can also use append-only mode, where on every mutation, or every second, the data is saved to disk.
Tug Grall 00:13:27 So, if we talk about this specific use case of having a persistent log: XADD is the command, so you XADD one message, XADD another message, and you can choose to save to disk every time. You will not have a big impact on performance because it’s done asynchronously. We often talk about Redis being single-threaded. That applies to the server-side operations, but everything related to replication or to saving on disk is done in a different thread to avoid impacting performance. In this case you save the data on disk, so it’s persisted, and you can do whatever you want with the data. I just want to add something a little bit specific on this. As you mentioned, I’m working for Redis Labs, and Redis Labs also adds some features on top of the community version, in the enterprise version. One of them is something called Redis on Flash, which is only available in the enterprise version. The idea is that if you have a bigger dataset, with what we call cooler data and hot data, some of the data will not sit in memory all the time. Keys will stay in memory, and some of the values will be on disk. In this case, we are closer to a very, very fast database that is close to in-memory.
Akshay Manchale 00:14:56 So, in some sense, Redis is used as a cache on top of other data stores. So you can have updates going into your primary data store through Redis maybe it’s going out of Redis. How do you manage recency of data?
Tug Grall 00:15:10 So you have many different ways of doing that. And also, just to be a little bit controversial, you also have lots of people now using Redis as the primary database for their application. You can, and you should, use Redis as your database if the data model fits and if you don’t need the data present somewhere else. It’s possible. So, to go back to your question: yes, when it’s used as a cache, the biggest challenge with all caching technologies, infrastructures, and architectures is when do you invalidate? How do you know that you have to go back to the initial data? Here you have different options, and there is no magic. First, the developer really has to understand the data they are manipulating. Let me take a very basic example that is very simple to implement. Suppose you are calling a web service that gives the user the weather forecast in their area. You are building an agriculture kind of application, and you want to give the forecast. The forecast will not change every five minutes, right? It’s okay if you cache the forecast for the next five days for one or two hours. So in this case, the simplest way in Redis is to put a TTL (a time to live, an expiration) of one hour. That way, as a developer, you have nothing to do. You call the web service for the weather, you set the value in Redis, and you say it expires in one hour. That’s it; it will be done for you. When you have something more complex, it will really depend on what access you have. Can you get notifications from the initial data source? What we see more and more relates to what we were saying about Redis Streams and Kafka two minutes ago. If you are using an event sourcing or messaging approach, you will use that to invalidate your cache. You can receive an event and say, “When I receive this type of event, I will invalidate the cache.” And you have two ways of doing that: either you really invalidate the cache by just deleting your object, or you ask the system to go to the main data source and put the fresh data into the cache. This is also how we see a lot of CDC (Change Data Capture) integrations. In this case, you plug into your initial database, capture the transactions, and push the events into Redis to keep the data up to date with the initial data source. Here the data is coming from the golden record, which is your mainframe, your archive database, or MySQL, pushing data to Redis. Lately, we’ve been using a module called RedisGears. It’s part of the ecosystem and is available in open source. One interesting use for it is a write-behind cache with Redis. In this case, your application interacts only with Redis, but when you modify something in Redis, the update is pushed to your relational database. Redis is already up to date, and you will eventually use a Redis Stream to push the data into your relational database. So you see, you have different ways of doing that. The key, when you work as a developer or as an architect, is to first understand that with caching you always have these questions: How long will my data be valid? What events make this data invalid? Is it something that I can control? Is it something that is only time-based? And if it’s something I can control, what will be the best solution for that?
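The time-based option from the weather example can be illustrated with a small stand-in for a key-value store with per-key expiration. This is a sketch in plain Python under stated assumptions, not Redis code: the `ExpiringStore` class, the `weather:paris` key, and the cached value are hypothetical, and the clock is injected so the hour can pass instantly.

```python
import time


class ExpiringStore:
    """Tiny in-memory stand-in for a key-value store with per-key expiration."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the example can fake time
        self._data = {}       # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        """Store a value that expires ttl_seconds from now."""
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        """Return the value, or None if the key is missing or has expired."""
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]   # clean up the expired entry on read
            return None
        return value


# Fake clock so one hour passes instantly in this demo.
now = [0.0]
store = ExpiringStore(clock=lambda: now[0])

store.set("weather:paris", "sunny, 21C", ttl_seconds=3600)
fresh = store.get("weather:paris")   # still cached
now[0] += 3601                       # a little over one hour later
stale = store.get("weather:paris")   # expired, so None
print(fresh, stale)
```

The point of the design is the one made in the discussion: once the expiration is attached at write time, the application has no invalidation logic of its own for this key.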
Akshay Manchale 00:19:05 So you mentioned there’s a time to live for certain data structures, right? So, let’s say something actually expires, and then an application is trying to look for that particular key. Does it actually reach out to the database first and then do a put on Redis? Or is it just reaching out to Redis, and then Redis, if there’s a cache miss, populates it from the data store? How does that work?
Tug Grall 00:19:29 I like the way you ask the question, because it makes me think. It goes back to the initial DNA of Redis: as simple as possible, make it simple. What does that mean? It doesn’t do anything for you besides expiration. So, if we go back to this example where I have to call a database or a web service: after one hour, if you try to get the key, Redis will not return the value. It will say it’s nil, or that the key does not exist. What happens in this case? Depending on exactly where it is in the process, the key may really be gone from the database, because you have cleanup jobs that actually delete expired keys. Or, if you arrive just one second or one millisecond after the expiration, Redis will check whether the time to live has passed.
Tug Grall 00:20:19 And if it has, it will delete the key and return nil. So by default, Redis will not try to go back and call the web service or the database for that. This is usually something that you put inside your application. And you have many systems that do that out of the box, but it’s not inside Redis. For example, when you work with Spring in Java, you have many frameworks where you can activate some kind of caching layer. In this case, Hibernate, Spring Data, and so on will work out automatically by themselves whether they need to put the data in Redis or not, and they will deal with accessing the database. What you can also do is build something yourself inside the database, using a Lua script, some specific services, or even a module that calls the database itself.
Tug Grall 00:21:16 But the thing we have to be aware of and careful about is that we want Redis to be as fast, as simple, and as efficient as possible. So, where is the best place to fetch the data? How you get the data out of the golden record, the main system of record, really depends on your application and how fast it has to go. So the traditional way is: I try the cache; if it’s not in the cache, I call the database, I store the value in Redis, and I return the value. That way, afterwards, it’s back in Redis.
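The "traditional way" described here (try the cache, fall back to the system of record, store, return) can be sketched as a single function. Everything named below is an assumption for illustration: `get_with_cache`, `DictCache`, `slow_lookup`, and the `user:42` key are hypothetical. The stand-in cache exposes `get` and `setex` because those names mirror methods on the redis-py client, so a real client could in principle be passed instead, keeping in mind that a real client returns bytes unless configured otherwise.

```python
def get_with_cache(cache, key, load_from_source, ttl_seconds=3600):
    """Cache-aside read: try the cache, fall back to the source, then store."""
    value = cache.get(key)
    if value is not None:
        return value                       # cache hit
    value = load_from_source(key)          # cache miss: go to the system of record
    cache.setex(key, ttl_seconds, value)   # store it for the next caller
    return value


class DictCache:
    """Stand-in exposing get/setex; it ignores the TTL to keep the demo short."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def setex(self, key, ttl_seconds, value):
        self._data[key] = value


calls = []

def slow_lookup(key):
    """Hypothetical database or web-service call."""
    calls.append(key)
    return f"profile-for-{key}"


cache = DictCache()
first = get_with_cache(cache, "user:42", slow_lookup)
second = get_with_cache(cache, "user:42", slow_lookup)   # served from the cache
print(first, second, len(calls))
```

Keeping this logic in the application, rather than in the cache, matches the point made above: the store only expires keys, and the caller decides where to fetch fresh data from.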
Akshay Manchale 00:21:50 That’s interesting. One of the other things that I know about Redis is, Redis is also known for probabilistic data structures. And it’s a topic we covered previously in Episode 358. Can you tell us what probabilistic data structures are present in Redis and how they’re commonly used?
Tug Grall 00:22:07 So, in the core, you have HyperLogLog, which allows you to count values in an efficient way. The idea is to count billions and billions of values while keeping the memory size very, very small. For example, say you are YouTube and you want a quick count of the number of hits on a specific movie. One of the things you want to see is how many times a specific resource has been accessed. HyperLogLog will do that, and the size is limited to 12 KB in memory, which is very efficient even if you have big numbers. The price to pay in this case is that it’s an approximation of the data. So, if you want to know the count to the exact unit, this is not the right fit. And this one is in Redis core.
Tug Grall 00:23:02 Then you have one of the modules developed by the community, called RedisBloom, which is based on the Bloom filter. It’s used to check in a very, very efficient way whether something is in a list or not. And the interesting part is (I just have to be careful in the way I express this) that if you want to quickly know whether a value you are checking is part of a list or not, instead of having to comb through a very large set of data, you make a guess. The guess will be more or less accurate: the error tolerance will be bigger or smaller depending on how much space in memory you give it.
Tug Grall 00:24:40 I was talking with one of the users of Redis in India, and they are building a payment platform. So, they have many, many merchants, and they want to do A/B testing between different features. They have millions of customers. And they want to check: if this customer is part of this city, and has made a specific purchase of specific items in the last six months or the last 24 hours, then I want to show this feature, or I do not want to show this feature because I’m testing a new kind of feature.
Tug Grall 00:25:06 This decision has to be very, very fast. The rule is very simple: is he part of this list, this list, or this list? But doing these two or three tests has to be very, very efficient, because you have to do it for millions of users with hundreds or thousands of features. So initially, in Redis, they were using sets and comparing sets. But the issue was that it was taking a little bit too much time, even in Redis, so they switched to Bloom filters, allowing them to go from a few thousand merchants to more than 1 million merchants in the database and still be very efficient. So, the idea of being probabilistic is: yes, you are not 100% sure, but you do it very, very, very fast, which allows you to add new features to your system. You just have to understand, better than I just explained it, exactly what the trade-off is, because it’s all about trade-offs.
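To make the trade-off concrete, here is a minimal Bloom filter in plain Python. It is a teaching sketch, not the RedisBloom implementation: the class, the bit-array size, the number of hash functions, and the merchant IDs are all assumptions chosen for illustration. The key property is that a "no" is always correct, while a "yes" can occasionally be wrong, and the chance of a wrong "yes" shrinks as more bits are allocated.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self._bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        """Derive num_hashes bit positions from salted SHA-256 digests."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set every bit position derived from the item."""
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True means probably present."""
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


merchants_in_test = BloomFilter()
for merchant_id in ("m-1001", "m-1002", "m-1003"):
    merchants_in_test.add(merchant_id)

print(merchants_in_test.might_contain("m-1002"))   # always True for added items
print(merchants_in_test.might_contain("m-9999"))   # usually False, but not guaranteed
```

The filter stores only bits, never the items themselves, which is why its memory use stays small and fixed no matter how long the underlying list grows.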
Akshay Manchale 00:26:10 Well, with this, since it’s not an exact answer, but it’s also being served from the cache, you have a risk that it might just go away, like when your instance restarts or something. Now whatever you had in your Bloom filter is actually lost. So how do you reconstruct that? How do these probabilistic structures come back online and give you an efficient answer reasonably fast?
Tug Grall 00:26:33 So, the way I would do it myself is to make sure that it doesn’t disappear. How do you do that? You can use persistence on disk, that’s one option. And the other one, if your data in Redis is important and complex to rebuild, either because it’s a large volume or because it’s a Bloom filter, where the volume is small but it’s based on millions of entries, is this: you should use persistence, but more importantly, you should use high availability. You should set it up seriously, with a master and a replica, so you have a copy of your data inside the cluster. It’s very easy to do in Redis: you just create a cluster where you have multiple Redis instances running on multiple machines. Every time you modify your value, it will be replicated to another node. That way, if something happens, you fail over to this node. I’m not really answering your question, because I don’t have an answer for rebuilding it. If it’s millions and millions of entries, you have to save the data somewhere, but you don’t do it manually. You use persistence and you use replication, giving high availability to your Redis instances.
Akshay Manchale 00:27:54 Okay, that makes sense. I want to talk a little bit about the query aspect of it. So, you know, you can do look-ups by key. You can do existence checks on lists, but what is the concurrency model for all of this? Is it always consistent with respect to another write operation coming in? How does that work in Redis? Is it transactional?
Tug Grall 00:28:14 So every operation is atomic. Nothing else will happen while one operation runs; operations execute one after the other. So you can reason about it and understand what’s happening. You don’t have transactions in the sense that you cannot do begin transaction, modify, modify, modify, and roll back. But what you can do, using pipelines and transactions, is be sure that operations 1, 2, 3, and 4 happen one after the other and nobody else comes in between. Being single-threaded doesn’t mean that you cannot have multiple clients connected to the same server. Obviously, you have many clients, but with the APIs that you have inside Redis, you can be sure of the order, and you can block when you need to, to do exactly what you mean. For example, you want to increment a value, set something in a specific key, and retrieve some values, and you want to be sure that nothing happens between the first operation and the last one, because these operations have to go together. This is what we call a transaction in Redis, and you can do that. It’s called a transactional pipeline. Obviously you should not do that all the time, because it will impact your other clients, which will wait a little bit until these three operations have been done before continuing. So in this case, it’s managed by Redis itself.
Akshay Manchale 00:29:38 Okay, interesting.
Akshay Manchale 00:30:05 I want to switch gears a little bit and talk about some more emerging use cases that are coming up with Redis. I see that it’s being used in machine learning stacks. I see it’s being used in search and graph databases. So there’s this whole array of new use cases that it is being used in. Can you talk about some of the more interesting, non-traditional uses of Redis today, with some examples?
Tug Grall 00:30:29 You know, it’s a very good follow-up to what you were just asking about transactions and queries. Redis is very fast, Redis is quite easy to use, and the API is quite simple for developers. One of the challenges is this. Suppose, to go to the exact question you asked, I use hashes to store all the users who connected last week. So I have hundreds or thousands of entries, but I would like to do a simple query like, “Can you give me the top five, or the number of currently connected users coming from California?” If I want to do that with Redis core, without anything else, I have to manage it myself, because I cannot do a full scan.
Tug Grall 00:31:17 So what I will have to do is use a hash to store the user profile that contains the state, but I also have to keep, somewhere, a list of the connected users for each state to be able to get this value. So, I will manage my own index and the relation between my hash and my list of connected users for a specific state. It’s a pattern that people use, but only very advanced developers will be able to do it: people who have the time to design this and who really, really, really need this kind of millisecond latency. In reality, this request was becoming more and more regular in discussions: “I want to do that.” This is exactly why the Redis community and Redis Labs started to develop modules that answer this type of question.
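The hand-managed index described here can be modeled with two plain Python dictionaries: one standing in for the per-user hashes and one standing in for a per-state collection of user keys. This is a sketch of the bookkeeping burden, not Redis code; the function names, key patterns (`user:<id>`, `connected:<state>`), and sample users are hypothetical.

```python
from collections import defaultdict

profiles = {}                        # "user:<id>" -> dict of profile fields
users_by_state = defaultdict(set)    # "connected:<state>" -> set of user keys


def connect_user(user_id, first_name, last_name, state):
    """Store the profile and update the hand-maintained index together."""
    key = f"user:{user_id}"
    old = profiles.get(key)
    if old is not None:
        # Keep the index consistent if the user's state changed.
        users_by_state[f"connected:{old['state']}"].discard(key)
    profiles[key] = {"first_name": first_name, "last_name": last_name, "state": state}
    users_by_state[f"connected:{state}"].add(key)


def disconnect_user(user_id):
    """Remove the profile and its index entry."""
    key = f"user:{user_id}"
    profile = profiles.pop(key, None)
    if profile is not None:
        users_by_state[f"connected:{profile['state']}"].discard(key)


def count_connected(state):
    """Answer the query from the index, without scanning every profile."""
    return len(users_by_state[f"connected:{state}"])


connect_user(1, "Ada", "Lovelace", "CA")
connect_user(2, "Alan", "Turing", "CA")
connect_user(3, "Grace", "Hopper", "NY")
connect_user(2, "Alan", "Turing", "NY")   # user moved: the index must follow
disconnect_user(1)
print(count_connected("CA"), count_connected("NY"))   # 0 2
```

Every write path has to remember to touch both structures, which is the maintenance cost that motivates letting an indexing module do this work instead.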
Tug Grall 00:32:08 So one of the first modules was about how we can do indexing, querying, and full-text search inside Redis. To take exactly the example I gave: you want to store a hash and be able to query it without having to manage the index yourself, just by creating an index and saying, “Okay, I have a hash that has a key. The pattern of the key is user, colon, something. For everything that starts with user, I want to index first name, last name, and state.” That way I will be able to do some queries. So, one of the use cases that is becoming more and more common is to query by value in a very simple way, and it’s done using the RediSearch module. In this case, just by providing more advanced capabilities for querying and aggregation, you don’t need to go to a third-party tool to do that.
Tug Grall 00:32:59 Following the same principle, developers want to create complex applications using graph data and graph databases. They want to be able to quickly detect anomalies or make comparisons, to find patterns, and to analyze the impact of something. The goal of the Redis community and the modules was to provide a module that will help them do that. This is why RedisGraph was built. So, you see: querying and indexing, graph queries, and, for the same reason, time series. You can organize time series in a very simple way, using the simplicity of Redis, do queries by tag, find an interval very fast, and so on. The way we approach this, as a community and as a company, is to ask what the main use cases are, in terms of “I want to work with this type of data,” and to provide a data structure and the commands.
Tug Grall 00:34:06 So, typically you will see, we talked about Bloom filter, search for indexing and querying data by value, including full text search. Time series, graph database, and you also have gears. I talk about it’s kind of some eventing system or some or less programming because you can add some programming on application inside your database, and you talk about AI. So artificial intelligence, machine learning globally is becoming more and more important in many enterprise and what’s happening? More data, more need, you need fast access to the data. This is where Redis come to the picture. And what happened is people can start with data as a cache. Like we talk at the beginning of the discussion, but what is more interesting, can I cache a part of my model? Can I use the floor or something inside Redis? So the community and Redis Labs has built Redis AI to aid you to add inside Redis. A part of your artificial intelligence kind of infrastructure and architecture to make your application faster and easy to manage.
Akshay Manchale 00:35:24 So is it fair to say that its kind of like cache system model for you to give you the response based on whatever approximation that the machine learning model does rather than being used for learning itself?
Tug Grall 00:35:37 Yes, exactly. I have not personally spent a lot of time on this module. We are adding more examples as we speak. It’s something that is quite important with these days. You have lots of investment in the development to show more example. It’s more on the execution of the model in the application to be as fast as possible.
Akshay Manchale 00:36:05 So you also mentioned several times about Redis gears, and it’s used in serverless programming and as a computing thing on a data, can you walk us through the examples in a little more detail about what Redis Gears is?
Tug Grall 00:36:19 So one of the questions, or use cases, is: you use Redis, and you want to react to some specific event and process some data on that event. First of all, when I talk about serverless computing inside the database, it’s connected to the data, right? It’s not to execute a web server. It’s really to process and manipulate the data that you have inside Redis, to do something with the data that you have. For example, when you have a specific event, or a specific job that you want to run, instead of having to add something outside Redis, you can do that inside Redis. So one example would be: when you have specific data, or an update on specific data, you want to modify some other data in your overall database.
Tug Grall 00:37:15 So, you kind of want a trigger that will add some information and verify it, and you want to do that as close as possible to the data, inside the DB. Instead of coding something outside, with all of the processing happening outside the database and then modifying the DB again, the idea of Gears is to react to an event inside the DB: when you insert, when you are notified. So, when a specific thing happens in the DB, including on a specific schedule, you execute some code. Depending on how you develop and deploy, it can be written today in Python, and we are working to implement this logic in Java, in C, and so on. And this is important when you talk about scale, because if you have a small data set with a single process, you can put the processing on the same server and it will be almost the same, but imagine you have a distributed database.
Tug Grall 00:38:13 So you have not only replication for high availability, but you have multiple processes, what we call clustering inside Redis, where you distribute the data. You shard the data, based on the keys, across different machines. What is nice is that the gear itself, the business logic that is running, is on each of the machines, close to the instance, inside the instance. So you distribute the processing in parallel. You can imagine something like a MapReduce approach to the data, because you distribute the load, and you do aggregations with efficient access to the data. This is what RedisGears is for. And the idea of RedisGears is two things. It’s there for you as a developer to use, or to extend if you want. I sometimes use it in the demonstrations I do just to capture an event and put it on a messaging system, and then you consume the event. Or you have the write-behind cache, or write-behind feature, where we use Gears to capture events on modifications inside Redis, with some simple business logic that connects to a relational database and does the update, delete, or create in the relational database. So in this case, it’s just consuming the event, manipulating the data, and interacting with an external system.
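The distribute-then-aggregate idea described here can be illustrated with a toy model in plain Python. This is not the RedisGears API: the shards are ordinary dictionaries, and the names `SHARDS`, `local_count`, and `distributed_count` are hypothetical, chosen only to show each shard computing a partial result next to its own data before the partial results are merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for three Redis shards; each holds part of the keyspace.
SHARDS = [
    {"order:1": "paris", "order:2": "london"},
    {"order:3": "paris"},
    {"order:4": "berlin", "order:5": "paris"},
]


def local_count(shard):
    """'Map' step: runs next to the data and counts values in one shard."""
    return Counter(shard.values())


def distributed_count(shards):
    """Run the local step on every shard in parallel, then merge ('reduce')."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_count, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


if __name__ == "__main__":
    print(distributed_count(SHARDS))
```

Only the small per-shard counters travel to the merge step in this sketch, which is the point of running the logic close to each instance.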
Akshay Manchale 00:39:34 So, one thing I want to clarify: initially you said that Redis is single-process, single-threaded. Are Gears in line with those single-process operations? Or is it outside of that, or is it sort of synchronously executed?
Tug Grall 00:39:48 So, when we talked about Redis as a single-threaded process, it’s not really a single-threaded process. You have other work going on, like saving to disk, replication, executing Gears. The single thread is where you interact with the data. Suppose you have a single instance of Redis and hundreds of client applications writing to it. Many are doing sets, some are doing deletes, some are doing gets, and so on. These commands are always on the same thread. That way you don’t have to do concurrency management, locking, and all of this stuff. It’s always on the same thread. This is where it’s single-threaded. You have other threads outside, and typically Gears is doing some threading outside. So it doesn’t break the model of being single-threaded, because single-threaded applies when you manipulate the data. When a gear is writing data, it’s back on the single thread, but the execution engine is on a different thread.
Akshay Manchale 00:40:50 Okay, so your latency is still the same with respect to your operations.
Tug Grall 00:40:55 Yes, this is really the focus of the design of Redis: to be sure that saving, writing, and reading data are consistent in terms of expectations. This is why, when you size your Redis cluster, you have some basic rules. You will say, I will use a few cores, or one core, for each instance, and one or more cores for the other stuff, so you can have multiple instances of Redis on the same machine. So, you will take three cores for each instance, one additional core for Gears, I/O, and so on. And you will say, I will use 25 gigabytes of data in each process. And on average, obviously it will change depending on the use case and the dataset, but I would say 25,000 operations per second. I can start with that if I don’t know exactly what my use case is; I am safe. So this is how you optimize to work with multiple cores. Because this part of the process is single-threaded, it’s very simple to tune the level of load you put on your machines, just keeping some memory and cores for the other parts, like I/O, network, replication, all these kinds of things.
Akshay Manchale 00:42:27 So, let’s say you start off using Redis to accelerate your application, and then you notice that you need to scale up. What are the signs that tell you that you need to increase your memory footprint, or maybe even distribute and add additional nodes? How do you go about that?
Tug Grall 00:42:43 So one of the things you usually have to look at is the latency of your application, just to be sure you don’t overload the system. And this part is about the number of operations: in this case, you just look at how many operations per second you have on your machine. If it starts to show latency, you look at where it’s waiting. Is it waiting to access memory, is it waiting to access a specific part? Then you split the database into multiple processes, so you shard the data to be sure that it doesn’t wait too much. You distribute the load in terms of operations so you don’t wait. So this one is just looking at the latency. The other part is more of a safety net. I was talking about replication for high availability.
Tug Grall 00:43:35 Suppose you have one terabyte of memory inside your machine. You can put, I have not tested it so I don’t know for sure, 500 gigabytes of data in one single Redis. It will work, not an issue. If you don’t have millions of operations per second, it will work perfectly. But, going back to the earlier discussion, what happens if it crashes? You have to do an upgrade of your server, you have a crash of the network, and so on. You have replicated the data, which is good. So you have a copy. It’s perfect. You go over to the other one, and it’s working perfectly. But the issue is that if one of the servers has totally crashed, you will need to restart the server, and you will need to copy these 500 gigabytes from that specific instance to the other instance.
Tug Grall 00:44:32 So it will be costly in terms of operations on the network. This is why one of the best practices is to say: on a single instance of Redis, try to stay under 25 gigabytes. It’s not a hard limit. It’s just good practice, because that way, if you have to replicate the data, restart from disk, do an upgrade, or move the database to another server, it’s a lot faster to move 25 gigabytes instead of that big, big thing that you have to load back into memory. So to go back to your initial question, you look at the number of operations per second. If you start to see latency, too much wait time between requests, you have to distribute the data to distribute the load. If your instances are becoming too big, you should distribute the data for safety. When I say safety: something will happen. How fast do we go back to normal?
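The two rules of thumb mentioned in this answer (roughly 25 gigabytes and roughly 25,000 operations per second per instance) can be turned into a small starting-point calculation. This is only a sketch: the function name `shards_needed` and its parameters are hypothetical, and the defaults are the speaker's conservative starting values, not hard limits.

```python
import math


def shards_needed(dataset_gb, ops_per_sec,
                  max_gb_per_shard=25, max_ops_per_shard=25_000):
    """Estimate how many Redis instances (shards) to start with.

    Whichever constraint is tighter, memory or throughput, decides the count.
    """
    by_memory = math.ceil(dataset_gb / max_gb_per_shard)
    by_throughput = math.ceil(ops_per_sec / max_ops_per_shard)
    return max(1, by_memory, by_throughput)


if __name__ == "__main__":
    # 500 GB at a modest request rate: memory is the limiting factor.
    print(shards_needed(dataset_gb=500, ops_per_sec=10_000))
    # 10 GB at a high request rate: throughput is the limiting factor.
    print(shards_needed(dataset_gb=10, ops_per_sec=100_000))
```

In practice the measured latency of the application, as described above, should confirm or correct whatever number this kind of estimate gives.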
Akshay Manchale 00:45:29 So the time to recover your data is a really, really important factor there. Let’s say you do have a multi-instance Redis. So now you have your keys distributed across different instances, possibly different clusters or different physical machines. How do you find what you’re looking for? Is that something that’s left to the application?
Tug Grall 00:45:51 No, it’s not. Some people will do it manually, but obviously Redis has been built to allow you to scale out and distribute data. This is what we call the cluster, and the Cluster API. What happens is that you distribute keys into small virtual slots, 16,000-and-something slots. You take a key, it is hashed, and that gives you a number between zero and 16,000-something. And this slot will be on one of the physical servers, one of the instances. So if you use clustering, you will automatically find where you have to go to find the data. This is done automatically for you if you are using the proper client. If you are using Redis Enterprise or Redis Cloud, it’s transparent because you are connected to one single
Tug Grall 00:46:43 what we call the proxy, which does the work of distribution. What you have to be careful about: a little bit earlier in the discussion you asked about transactions, and I told you that it’s possible to begin something and do step one, step two, step three. This step one, step two, step three, up to five, has to be done on the same instance, on the same shard. You don’t want to get into the very complex case where you start something on this node and have to finish on the other node. That’s a kind of two-phase commit, global transaction story that you don’t want; it’s very complex and it will have a big impact. So the question will be: yes, but you are telling me that Redis is distributing the data all over the place, so how do I manage that?
Tug Grall 00:47:28 So you can, and this is something we say when we talk to developers and do education or discussions about scalability: from day one, if you know that you have a big data set or a big number of operations per second, think about the cluster. You don’t necessarily have to implement it immediately, but understand the impact, because of what I just said. Suppose you are building a system where you have a customer and some information about the customer, like invoices or products, and they need to be linked together across two or more different keys. What you can do with Redis is, when you create the key, you put a hash tag in it. In fact, by default, if you put something between two curly brackets in the key, this is what will be used to calculate the slot.
Tug Grall 00:48:20 So if we talk about the customer, you have the customer ID inside your key. It could be the customer profile, it could be a list of orders, it could be the list of calls to the CRM. It could be anything you want. In this case, you put this hash tag, and Redis Cluster will make sure that everything that has the same hash tag, the same customer number 001, everything that is related to customer 001, will go to the same shard, so to the same Redis instance. We will still distribute the data, but at least you can control a little bit where this stuff is going. So the idea of clustering, if it’s well done, is that it should be transparent for the developer and efficient, while giving you the flexibility for this kind of use case.
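The slot calculation and the hash tag rule described here can be sketched in a few lines of Python. The sketch follows the Redis Cluster specification as best understood: the slot is CRC16 (the XMODEM variant) of the key modulo 16384, and if the key contains a non-empty `{...}` section, only that section is hashed. The function names `crc16_xmodem` and `key_slot` and the example keys are illustrative, not part of any client library.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Return the cluster hash slot (0..16383) for a key."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" means the whole key is hashed.
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


if __name__ == "__main__":
    for key in ("{customer:001}:profile",
                "{customer:001}:orders",
                "{customer:001}:crm-calls"):
        print(key, key_slot(key))
```

All three example keys share the tag `customer:001`, so they land in the same slot and therefore on the same shard, which is what makes multi-key operations on one customer possible in a cluster.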
Akshay Manchale 00:49:09 In other words, the application developer has to do some sort of data modeling to be able to place data depending on the operations and business logic.
Tug Grall 00:49:19 Yeah, if you need this kind of what we call multi-key operations.
Akshay Manchale 00:49:26 Okay, that makes sense. So what are the trade-offs with respect to replication? If you have multiple instances, do you usually do it synchronously or asynchronously? Is that configurable? How do you go about that?
Tug Grall 00:49:37 It’s done asynchronously, for efficiency. Keep in mind that what we want is to be sure that the default operation is fast, fast, fast, fast. The challenge in this case is that, since it’s done asynchronously, by default you don’t know if the replication happened. So your approach will be: I will save my data on disk, I will replicate, and that way in 99.9999 percent of situations my data is safe. Even if I break something, it will just take the replica, or restore from disk; I am safe. So most of the time you just don’t care about replication. You just activate it, you say I want replication enabled, and it will replicate the data. For some operations, and this is at the operation level, you can do more. This is what is interesting when we talk about Redis, and many NoSQL databases have the same type of approach:
Tug Grall 00:50:37 the basic rule is that by default I want it to be simple, I want it to be fast, I want it to be replicated for high availability. But for some operations in Redis you may want to say: for these specific keys, when I do these specific operations, I want to be sure that the data has been replicated, so I will wait until it’s done. We have a specific command inside Redis that is called WAIT. In this case, it will wait for the replication to happen, and you get an exception if it failed. That doesn’t mean the write has failed. It could be that just the replication has failed. You can choose to say, I don’t care, it will eventually happen, or you can retry. So by default it’s asynchronous and transparent; as a developer, I don’t care. But for a specific operation, I can control it and wait for the replication to happen. In this case, as you can guess, there is a price to pay: you have to wait. So the latency of the operation will depend on the replication factor.
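A minimal sketch of the per-operation pattern described here, under a few assumptions: `client` is expected to behave like a redis-py `redis.Redis` object, exposing `set(key, value)` and `wait(num_replicas, timeout)`; the helper name `set_and_confirm` is hypothetical. In open source Redis, WAIT reports how many replicas acknowledged the preceding writes, so this sketch compares that count with the requested number rather than relying on an exception.

```python
def set_and_confirm(client, key, value, replicas=1, timeout_ms=100):
    """Write a key, then block until `replicas` replicas acknowledge it.

    Returns True if enough replicas acknowledged within the timeout.
    A False result does not undo the write on the primary; the caller
    decides whether to retry, alert, or accept eventual replication.
    """
    client.set(key, value)
    acknowledged = client.wait(replicas, timeout_ms)
    return acknowledged >= replicas
```

With a real server and the redis-py package installed, this could be called as `set_and_confirm(redis.Redis(), "order:42", "paid")`. Ordinary writes would skip the helper and stay on the fast asynchronous path.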
Akshay Manchale 00:51:34 Right. And is that the same approach for durability, when you say you can back up your Redis data to disk? Is it the same sort of guarantee, or configuration, that you can have?
Tug Grall 00:51:45 I would say no. With WAIT, it really depends on the version, and on whether it’s Redis Enterprise. In Redis Enterprise the replication is done in memory, so you control only the in-memory path and you don’t care about the disk. In open source, and this is where I would have to check in detail which version, but I think this changes in the new Redis 6 version even in open source, it was previously done through disk. So if the replication was done, it always went through disk. Just to be sure I clarify this: with WAIT, what you control is the replication, but in some open source versions the replication is made from disk, not from memory.
Akshay Manchale 00:52:39 Okay. So, what are the operational complexities of going to a clustered system versus a single instance? And are there modules, or what’s in the ecosystem, to manage those operational complexities? What are they, and how can you manage them?
Tug Grall 00:52:57 So the operational complexity of high availability, replication, and cluster distribution of the data really depends on the type of environment you are in. One of the things that I want to start with is that more and more users of Redis use managed services. In this case, operations means you just go to the cloud, you choose your plan, and it’s done for you. If you are running it in-house, this is where it becomes important to understand how much time you want to spend doing ops versus buying autopilot software. Because each instance of Redis is very simple to run. To be honest, if you want to start a cluster on your laptop, or on two or three machines, it’s not that complex.
Tug Grall 00:53:50 You have to put some files in place and start it, just being a little bit careful about the order in which you do that, and check the configuration. Then one day you may have to do an upgrade, so you have to be careful about this. So it’s just a lot of control over a bunch of configurations. If you run with a solution like Kubernetes, or with autopilot software, this whole part of clustering, high availability, and upgrades is managed for you. It’s one click of a button or a CLI command, so it’s not complex. Even with open source, and this is the important part of open source, it will give you as much high availability as you want, but you will have to do the configuration yourself. And it’s mostly about being careful with the configuration and how you start the processes on your developer machine. So, it’s a very important part,
Tug Grall 00:54:44 if you want to do something serious and you want to manage that yourself, so you don’t want to use a managed service in the cloud and you don’t want to use autopilot software, you have to have a DevOps team or an ops team that can automate things. Because you will add more and more. If you know exactly the size, and you only have three nodes and three processes, it’s not a big deal. In this case, even I, and I’m not a DevOps expert, will be able to do it. But if the goal is to scale out, and you may add more and more data, more and more use cases, more and more databases, you really have to be sure that, if you don’t use a managed service, you have good automation of your own configuration.
Akshay Manchale 00:55:24 Yeah, that makes sense. So, I want to drill down into the pub/sub support that Redis has. Can you talk about who the publishers and subscribers are here? Can you have users who are subscribing to changes from a Redis instance? Is that how it works, or is it Redis to Redis?
Tug Grall 00:55:44 So, you have two things. On your specific question about pub/sub: when Redis users think about pub/sub, they are really thinking about two things, publish and subscribe. Normally in Redis you have a key and a value; in the publish/subscribe model, the key is called a channel, because it’s not persisted. So you do a publish, say on a user-notification channel, you send a message, and it’s published inside the database, okay? Nothing happens until you have some subscribers, which can subscribe to one channel or to multiple channels. In this case, it could be a Java application publishing a message to many different systems.
Tug Grall 00:56:41 It could be one Python, one Go, 25 Java, and so on. If they are up and running when the message is sent, it will be pushed to and consumed by all the clients, the subscribers. So it goes from a Redis application, a Redis client application, to other Redis client applications. When you use publish/subscribe, this can be done either by you, with your application publishing an event, or with what we call keyspace notifications inside Redis: you can configure Redis to say, for this key pattern, I want to be notified on this channel when a key matching the pattern is new, or deleted, or expired. So for example, when we were talking about cache expiration, and I didn’t think about it when you asked the question, if we take this web service call, the weather API, I can use publish/subscribe.
Tug Grall 00:57:46 The publish will be done by the database automatically: every time one of the keys for the weather API expires, it will send an event, and my application will subscribe to it. So one way of doing messaging is publish/subscribe, but it’s fire and forget. When I fire, I don’t know what’s happening. I don’t know if somebody is consuming it, and I should not care. So it’s good for really real-time notifications and this kind of thing. If you really want messaging where, when you send a message, you have better control as a consumer, where you can control only-once or at-least-once delivery and have a pattern for retry if it fails and so on, this is where you use Redis Streams as a data structure, because it’s persisted and you have the concept of consumers that can register for a specific set of messages. So, you see, you have multiple possibilities. You can even use lists for that, but I will stay with publish/subscribe and streams. But these are between applications, Redis client applications, right? It’s not used for replication between Redis instances and so on. This stuff is for you, building your application.
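The contrast drawn here between fire-and-forget publish/subscribe and a persisted stream can be shown with a small in-memory toy model. This is not a Redis client: `ToyBroker` and its methods are hypothetical names used only for illustration. In real Redis the corresponding commands are PUBLISH and SUBSCRIBE for channels, and XADD and XREAD for streams.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory illustration of channel vs. stream semantics."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # channel -> list of inboxes
        self.streams = defaultdict(list)      # stream name -> stored entries

    def subscribe(self, channel):
        """Register a subscriber and return its inbox (a plain list)."""
        inbox = []
        self.subscribers[channel].append(inbox)
        return inbox

    def publish(self, channel, message):
        """Deliver to subscribers connected right now; nothing is stored."""
        for inbox in self.subscribers[channel]:
            inbox.append(message)
        return len(self.subscribers[channel])  # number of receivers

    def stream_add(self, stream, message):
        """Append to a persisted log and return the entry's position."""
        self.streams[stream].append(message)
        return len(self.streams[stream]) - 1

    def stream_read(self, stream, after=-1):
        """Return every entry stored after position `after`."""
        return self.streams[stream][after + 1:]


if __name__ == "__main__":
    broker = ToyBroker()
    broker.publish("weather", "cache expired")      # nobody listening: lost
    broker.stream_add("weather-events", "cache expired")
    late_inbox = broker.subscribe("weather")
    print(late_inbox)                               # the early message is gone
    print(broker.stream_read("weather-events"))     # still available
```

A consumer that remembers the last position it processed can call `stream_read` again after a failure and pick up where it left off, which is the basis for retry patterns that a plain channel cannot offer.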
Akshay Manchale 00:59:01 Sure. So, you can have like a many to many system or many clients publishing to a key space and then many clients consuming events from that. Okay, that’s nice. Redis has a rich ecosystem. What are some other interesting modules or extensions that you want to mention to our users?
Tug Grall 00:59:21 So the modules that I talked about are really part of the ecosystem when you want to use new commands and new data structures. But you also have a big ecosystem of other things, for example integration with Prometheus for monitoring and a connector for Grafana. So you can have a Grafana data source that you can use to monitor your Redis, but also to inspect the data: you can create graphs and inspect them. So the ecosystem around monitoring in Redis is huge. You can integrate Redis into your monitoring system, into Grafana, into Datadog. And I will also say there is a big ecosystem of Redis as a managed service everywhere. Redis Labs is one of the providers, offering Redis with all the modules online, but it’s very important to have something in the cloud today, as more and more applications are built on managed services; a managed database is key. And I will say another big part of the ecosystem is the fact that almost every language has a client, so you can use Redis with almost any language you want. So it’s plugged into many, many tools.
Akshay Manchale 01:00:35 So to wrap things up, you know, obviously Redis still has that simplicity grounded in its evolution so far. Where do you think Redis is going in the next couple of years? What can we expect?
Tug Grall 01:00:47 So the idea is, because of the need for fast, highly available, real-time applications, Redis is used more and more as a database, as the main database for your application. So this is where it’s going. For this, you need multi-model, or a multi-programming model, with the modules. A good example is that we launched RedisAI this year as a community, and this is where it’s going. It’s really about adding more commands to give you more features to build richer applications where only Redis is needed. Not seen only as a cache, but working as a database more and more.
Akshay Manchale 01:01:37 Okay. That sounds really exciting. With that, I’d like to wrap up. Tug, thank you so much for coming on the show. This is Akshay Manchale for Software Engineering Radio. Thank you for listening.
[End of Audio]
SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)