Mike Freedman - SE Radio guest

SE Radio 623: Michael J. Freedman on TimescaleDB

Michael J. Freedman, the Robert E. Kahn Professor in the Computer Science Department at Princeton University, as well as the co-founder and CTO of Timescale, speaks with SE Radio host Gavin Henry about TimescaleDB. They revisit what time series data means in 2024, the history of TimescaleDB, how it integrates with PostgreSQL, and they take the listeners through a complete setup. Freedman discusses the types of data well-suited for a time-series database, the types of sectors that have these requirements, why PostgreSQL is the best, Pg callbacks, Pg hooks, C programming, Rust, their open source contributions and projects, data volumes, columnar data, indexes, backups, why it is common to have one table for your time-series data, when not to use TimescaleDB, IoT data formats, Pg indexes, how Pg works without TimescaleDB, sharding, and how to manage your upgrades if not using Timescale Cloud. Brought to you by IEEE Computer Society and IEEE Software magazine.




Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Gavin Henry 00:00:18 Welcome to Software Engineering Radio. I’m your host Gavin Henry. And today my guest is Michael J. Freedman. Michael J. Freedman is the Robert E. Kahn Professor in the Computer Science Department at Princeton University, as well as the co-founder and CTO of Timescale, building a category-defining relational database and Cloud platform for time series data. His work broadly focuses on distributed systems, networking and security, and has led to commercial products and deployed systems reaching millions of users daily. Honors include, right, here’s a big list: ACM Grace Murray Hopper Award, ACM SIGOPS Mark Weiser Award, I think, ACM Fellow, Presidential Early Career Award, that sounds like a good one, Sloan Fellow, and DARPA CSSG member. Mike, welcome to Software Engineering Radio. Is there anything I missed in that impressive bio that you’d like to add?

Mike Freedman 00:01:08 Thanks for having me Gavin. I think it’s academics like to give ourselves accolades, so there you go.

Gavin Henry 00:01:14 Very impressive. So we’re going to have a chat about what time series data means in 2024, why TimescaleDB is needed and how it integrates with PostgreSQL. Lastly, we’ll close off discussing a full example use case. So Mike, let’s start with time series data. How would you define it today in 2024?

Mike Freedman 00:01:34 We think of time series data as any type of metric event information that you generally want to collect over time because it changes. This is a type of workload that often actually has a pattern that makes it look append mostly. So you are often collecting these either regular or irregular streams of data and you want to collect it over time because what’s valuable to you for your business, for your product is not only what’s happening now but the ability to look back on how that information changes over time and use that to understand the past and also predict the future.

Gavin Henry 00:02:21 What sort of common timeframes do you see in what we understand time series data to be?

Mike Freedman 00:02:26 I think it depends a lot on the application. We have users who really care about what happened in the last 24 hours. We often have users in different domains. For example, in the energy space in renewables, in things like oil and gas, there are often compliance reasons and historical reasons to understand seasonality, where they store it for going back 10 years and use that. Also to understand what, seasonal trends are and how the data changes over time. That often translates to features that you end up building or that, we end up building as part of our database. To also recognize that how you might access information in the last few minutes, few hours might be very different from how you want to access data or the derived roll-ups you have on that data stretching back months or years.

Gavin Henry 00:03:22 And what in particular makes this type of data difficult to work with? For example, the volumes we get, the time spans, or…

Mike Freedman 00:03:31 Yeah, I think when we set out to build Timescale, we did it really because of the scale and performance that this domain needs. Both, and you even alluded to this, both the volume and the velocity of the data are often different than you might see in a traditional kind of e-commerce or transactional application. You often have data arriving at a very fast rate, and because you are also keeping the historical data, you don’t delete it after a short period of time. So that translates to large data volumes, and yet at the same time you want to actually build operational workloads. And by that I mean you are using this data; you don’t throw all this data in some data lake or data warehouse and never look at it again. When people are coming to Timescale, they’re doing so because they’re building operational applications.

Mike Freedman 00:04:24 And by that I mean we actually have developers who are using this to build products on top of it, and they’re showing those products, typically customer facing, they’re mission critical, and so they also need responsiveness and good performance when they’re continuing to build those applications, run the APIs, show the live dashboards to their customers, as opposed to just somewhere where it sits around for a while and you occasionally do some ad hoc reporting on top of that. So it puts a lot of demand on what you technically need for a database in order to satisfy those requirements.

Gavin Henry 00:04:56 And I know you mentioned, Mike, they’re operational. So in my head I presume the intervals of the data we’re working with need to be consumed, processed and available in the ops tools, systems, alarming, monitoring, whatever it is you’re using that data for, pretty close to the intervals it’s arriving, or are there lags that you can accept? How does that look?

Mike Freedman 00:05:21 I think what you can accept often differs by application. I could say that in Timescale’s case, Timescale even preserves transactional semantics. So what that means is as soon as the data is written into Timescale, it’s available to queries. And also the transactional semantics mean that it doesn’t create these confusing views for customers or for users such that you could see partial data; it still maintains the atomicity that you might expect from a traditional transactional database. Taking a step back if I could talk about this for a minute, which is an observation that we came to realize after some time working in this space. Traditionally when it comes to databases, what a lot of engineers think about are these two workload patterns people talk about, called OLTP for Online Transaction Processing or OLAP for Online Analytical Processing. And this often had to do with OLTP being these really short and simple queries.

Mike Freedman 00:06:27 They’re the things that are the backend to your e-commerce site. You have some type of ORM or some type of software, you do a small read and write to the database. And then OLAP were these big analytical workloads where you kept this really wide row with lots of columns, you threw it in a data warehouse, and occasionally you came back later and were able to ask random ad hoc queries of the data. I think what we have seen is two things. One is that we are serving queries that are analytical in nature. And so from that regard, some of the aspects are similar to what people traditionally thought of as OLAP. But the big difference is we serve these operational settings, unlike the traditional use case where again you threw it in this data warehouse or data lake and then only later did people make ad hoc queries about that.

Mike Freedman 00:07:22 And so one of the ways that changes is that I think it becomes more about who’s using the database than this definition of the data shape. And what I mean by that is, traditionally, you had developers who are building applications; databases served as the backend to the applications that they’re developing that are serving their customer’s critical internal use cases, et cetera. Or you had data teams which basically managed how data flowed through their system, spent a lot of time with ETL jobs, that is, ways to move data from point A to point B, and basically became the owners of the data and almost an internal service team that gave access to other people in their organization or their company. But they didn’t really know what was being done with that. Their job was just to bring it together and let other people do machine learning, reporting or so on, on top of that.

Mike Freedman 00:08:23 And the difference is that in the first case, in the developer case, which is where we see a lot of the really exciting time series use cases and where Timescale plays, is that the developer comes in often owning both sides of it. They own the data in and they own how it’s used. And so they often think more deeply and design their schema, design the data structures they have in the database, design even their queries to really be built around production. Let’s figure out how this works. Not because I’m running a one-off query, but because this query is being triggered or being accessed every time a customer hits one of my dashboards. And so you just end up with different design decisions made in a database where you want to build for the case where people want these production workloads, as opposed to let’s just throw it somewhere where it’s cheap and we could later do ad hoc queries against it.

Gavin Henry 00:09:19 Thank you. That’s a very comprehensive explanation there. What are examples of where TimescaleDB shines with time series data?

Mike Freedman 00:09:29 I think it ties into what I just described before, which is people building products, building customer facing APIs, building critical dashboards on top of this data that they’re collecting. And let me give you the counter example. When you speak to different people, time series data can mean different things. Not because of the nature of the data, which is you have some type of information that’s being collected over time that’s changing and you want to look at the history, but more about how important it is for you as a company and how you use it. One of the big places where the notion of a time series emerged in the last 15 years, and where you see some of the specialization, is in server monitoring; it initially started in what people called APM, Application Performance Monitoring. And that became broadened into observability. And you see there are both companies that started with AppD and Dynatrace and New Relic and more recently Datadog.

Mike Freedman 00:10:36 And you often also see open-source software, probably the best known is Prometheus, and now it’s being standardized in OpenTelemetry, that is focused at collecting the data. The difference here is the use case of this is typically an internal one, and it’s important that you have observability, but if there are any blips in your observability architecture, it’s not great because you’re flying blind. But that also doesn’t mean that your customers are offline, right? And so the use case there is this vertical solution where the end user might be the SRE or the DevOps team, where what they really care about is: are their dashboards there, is their alerting in place? They don’t really care about the system behind it. And so if you look at what gets built there, it’s a bunch of full stack solutions where you have an agent running on an end host, it’s collecting data, it’s being kept in a storage.

Mike Freedman 00:11:38 There has been a lot of innovation on how to make that super cheap. The latest being to offload it quickly to an object store like S3, and then let’s make it easy to build dashboards and build a whole alerting on top of that. And so when some people think about time series data, they immediately go to the observability use case, particularly if they’re an engineer who deals at all with operations. But actually, I explain this by starting with something that we’re not: we saw this use case of throw your data somewhere, build some dashboards on these metrics, not as mission critical, no need for as much flexibility and no need to often integrate this with other important business data, as a very different use case than what Timescale started at and continues to see.

Mike Freedman 00:12:30 So when Timescale started, going back in history a little bit, we were initially building an IoT platform. We wanted a place to collect and store data coming off the emergence of all these new types of connected devices. And when we initially attempted to use one of these time series databases that started also in this observability space I just described, we kind of struggled for several different reasons. We struggled because it in general wasn’t performant enough for the type of workloads we saw, and it didn’t focus on the reliability the same way that we needed.

Gavin Henry 00:13:03 And that was IoT.

Mike Freedman 00:13:05 Well, we were starting in IoT but we were initially using a database that had its roots in this observability space.

Gavin Henry 00:13:12 So just to go back to an example of where this shines. So you’re in the historic part, you’re in the IoT sector. You’re getting, let’s call it telemetry data from these IoT devices. The backend service or database that you’ve got at that time is not good enough for what you’re trying to do. How does that present itself to you so you can say this isn’t good enough? Is it that you can’t do certain queries, or it’s too slow, or you tweaked it to death and you can’t get anything out of it? And at what line in the sand do you go, right, this is the right use case for time series data, this is what we’re using, it’s not good enough, let’s go build something?

Mike Freedman 00:13:52 Yeah. Why did we take the crazy path to go build a database, which seems like a high bar.

Gavin Henry 00:13:57 Yeah. There are two bases to that: let’s build a product, but also, we know this is the type of data we want to model and this is how to model it. What I’m trying to get to in this section, from a developer’s point of view, is the question I’m asking and the answer I’m trying to get: this is time series data, so I need to reach for that system.

Mike Freedman 00:14:14 Right? So there are probably four axes where we saw current limitations, and the first two, which I alluded to already and maybe developers can understand, were really around the axes of performance and reliability. But the other two I think were perhaps more instructive even from day one. And that was we really wanted to use full SQL, and we did that because it has this massive amount of flexibility, and it comes with an entire ecosystem. And if you look at a lot of the databases that were initially developed in this space, they were “SQL-ish” but they for example didn’t allow you to do joins or include other types of clauses in your where clauses — your where predicate — which ended up making them a lot less flexible for the type of questions you might need. And the second thing was really — and I know this sounds strange — is that we wanted or we needed joins.

Mike Freedman 00:15:12 And how that often translates is not only are you collecting your metrics, the stream of time-series data, but you often have important metadata around it, other types of business data, other types of information that you collected. In this IoT example, where this was is we at the time were running two databases — one Postgres, a traditional relational database, and the other one being this time series database. And we would often want to ask questions like tell me all devices of this particular customer that came online within the last period of time, or all devices of a certain type. And this information is often spread between these two locations, between this business data and these streams of time series data. Now this becomes very natural in a traditional relational database, in fact very natural, and ultimately Timescale. But in a lot of these things that were built only for observability or built for metrics, you don’t have the ability to do the type of rich joins you might expect.

Mike Freedman 00:16:19 And so you end up needing to not just maintain multiple databases, which is operationally more complex, but you need to then, in your application, effectively write the equivalent of a join. And not only is that not as performant, but it’s more like, hey, I want to ask a query and before I could do the query, I need to do a two-week engineering sprint to roll out a change to my microservices that now allows me to do this in application space. And we quickly found that this would be very unproductive and really went against the type of platform we wanted. And so to solve that IoT problem, which cared about operational ease of bringing the data together so you could do richer queries on it, which cared about SQL, which cared about performance, which cared about reliability, that’s what ultimately took us down the path of building Timescale. And frankly we think that those types of requirements are fundamental to anybody using time series in these mission critical, kind of customer facing settings. And we see that, I can talk about it, from IoT, from product analytics, from FinTech. We see this in crypto, we see it in a lot of energy sectors. We operate in transportation, logistics, banking, e-commerce, anything dealing with events. We see those same requirements that start in IoT across almost all application domains.

Gavin Henry 00:17:41 Yeah, I use it myself, TimescaleDB, and obviously with Postgres, in a main open-source application I have that consumes Voice over IP fraud data. So I have all these interesting nodes floating about and they get probes from potential hackers trying to find open phone systems. And I think the main feature I use from your product is time buckets. So they can have a heat graph to show which days had the most events across the nodes that they run.
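
As a rough illustration of the time_bucket feature Gavin mentions, a per-day rollup behind such a heat graph might look like the following sketch; the table and column names here are hypothetical, not taken from his application:

-- Count probe events per day over the last 30 days (hypothetical schema).
SELECT time_bucket('1 day', event_time) AS day,
       count(*) AS events
FROM events
WHERE event_time > now() - INTERVAL '30 days'
GROUP BY day
ORDER BY day;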

Mike Freedman 00:18:12 That’s good to hear. I actually didn’t know that you were also a user.

Gavin Henry 00:18:15 Yeah, no, that’s why I reached out. Because for me as a developer, just to summarize what we’ve just spoken about in that section and move us on, the web application stack that I use is Phoenix and Elixir. So Postgres is one of the main backends for that. And I knew I needed time series data because of these events streaming in, and I wanted to process them and display them all correctly. And TimescaleDB was just a really nice fit because I didn’t need anything else. I just needed to put in the Timescale part and deploy that. So yeah, that was just a little example of how I personally use TimescaleDB. So I’d like to take us on to the next section just to explore why time series data is hard in traditional databases, obviously Postgres we’ll talk about there, and why or how TimescaleDB makes this easy. So, when you started using Postgres, did you try to use that for everything before you reached for a time series database?

Mike Freedman 00:19:12 From the beginning we started by splitting our workloads into a time series database and we kept our metadata in Postgres and partly is that we were concerned from day one about building for scale. And so we kind of recognized some of the challenges that traditional relational databases in Postgres and really more of the basic data structures that they use have with dealing with this type of data. And then of course when you brought them back, Postgres is more than just a database, it’s almost a platform for development and we could talk a little bit about why we’re extremely bullish on Postgres in general and why we’ve really embraced it in the last decade.

Gavin Henry 00:19:56 Yeah, I’d like to get that in a second because we’ve done a couple of shows on performance and extending it and benchmarking it. Just from a naive approach, if my data’s in a database I would probably do something like select some of the rows based on a timestamp between midnight and 23:59:59 across that data, and it would be slow because there’d be lots of rows and there wouldn’t be many indexes on that. Why is it so much better with TimescaleDB or a time series database in general?
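
For context, a plain-Postgres version of the kind of query Gavin is sketching might look roughly like this (table and column names are hypothetical):

-- Scan a full day of rows from a large, lightly indexed table.
SELECT count(*)
FROM metrics
WHERE ts BETWEEN '2024-01-01 00:00:00' AND '2024-01-01 23:59:59';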

Mike Freedman 00:20:30 Yes so let me, this is what I’m going to go a little bit of a deeper dive into maybe dataset, some data structures, some ideas.

Gavin Henry 00:20:37 Go for it, we love that type of stuff.

Mike Freedman 00:20:38 Although I’ll try to keep it at an appropriate level. So let’s maybe start with a thought experiment of what would normally happen if you’re writing to a traditional database, and in particular let’s say that you’re writing to a single table, and I think that’s the most relevant example. You can of course, let’s say, create a data structure where you create a table per hour or a separate table per day, but that then pushes off a lot of the management to the developer, where you need to exactly select which tables per day you want to query, and it becomes very hard to write applications. So let’s just start by saying hey, we have one table, we call it our metrics and we write everything there. And so if you imagine, there are a couple of places where the performance when you try to query stuff starts to suffer. First, something like Postgres is traditionally storing its data in what you’d call row-oriented form.

Mike Freedman 00:21:33 And so what that means is that you write a row of data, whether it’s 50 bytes or 500 or 5,000 bytes, row by row as it kind of comes into the database. Immediately what happens is, if you turn around and say, hey, what happens if I want to query for a single column over time? You need to read the entire row out of the database in order to access that one single column. And so one part of the way you query time series databases is you want to do these scans over a longer period of time, but sometimes for a smaller number of columns, and for those, row-oriented databases aren’t as performant for these long types of scans. Behind the scenes what Timescale does is the recent data is stored in row format. So we can write to it quick, because row format is faster to write because you write row by row.

Mike Freedman 00:22:27 But we change, after data reaches a certain age, we convert those rows into compressed columns, which allows you to have much faster scans over time. There are a couple more things, but let me talk about a second main point. If you think about writing to a single table, effectively what you write in is the order at which it arrives at the database. So it gets written to the database in the order that it gets sent to the database. But on top of that you often, as you alluded to, will build indexes. One common index would be based on time, the timestamp. That one doesn’t work that badly. If you think about time as a tree, you might build a traditional B-tree over your data. It’s kind of right-leaning, or left-leaning depending on the way you define it, in that the latest timestamp is always on one side.

Mike Freedman 00:23:21 So if you query for recent data, all of the timestamps associated with that data are kind of in the same region of the tree of the index structure. That will work well because what’s happening under the covers is your database engine is keeping some of these index pages or some of these database pages in memory, the working set and some of it’s on disk. So if you are always interacting with the latest data and if you look at your index and that is always part of the same subset of the index tree, that part will fit in memory well and so you’d be able to make relatively reasonably fast queries to it. On the other hand, if you want other types of indexes, which by the way are super common, for example in your case you’re collecting VoIP data about your devices, and it’d be pretty common to not only index by time but to also index by something like device comma time. Because during your query you don’t want to just say tell me information about all my devices. It’s common to probably say tell me information about this specific device over time, by the way do you do that type of query?

Gavin Henry 00:24:23 Yes, the probes that come in are SIP, Session Initiation Protocol for VoIP, and inside that’ll have a user agent. So that’ll be the user agent from the device. It’s usually spoofed; it might be the phone system it’s come from. There’ll be the phone number that they’re trying to call. Usually they’re the low-cost ones, where they’re just trying to probe to see if they can make an outside call before the premium ones come through, which might have a certain prefix. We then look into the IP address to see what country they are from and who allocated them. So there’s loads of metadata around that, which you might say show me all the user agents that attacked during this timeframe. So exactly what you’ve said.

Mike Freedman 00:25:02 And do you index anything other than the timestamp?

Gavin Henry 00:25:07 I think the user agents and the phone numbers we index by.

Mike Freedman 00:25:10 Okay. And it also sounds like not every application is similar in this case it sounds like you have a highly ephemeral set of maybe user agents, or they change frequently.

Gavin Henry 00:25:19 Yeah, particularly from a security perspective it’s interesting, but it’s not critical. But to your example, it’s just one table called events.

Mike Freedman 00:25:28 Yep, totally. That is super common. It also makes management a lot easier. But going back to the example I was describing before, in this case, if you have an index by, let’s say, device ID comma timestamp or user agent comma timestamp, the problem is the working set is no longer concentrated; it is spread across all the user agents in your system. And so unlike the timestamp example, where the common access patterns are to the portion of the tree that’s the recent data, you are probably getting a random set of device IDs or a random set of user agents that are being interacted with. And so if you try to build an index over this, and first of all you totally can and do build an index, the access patterns to this index are somewhat random based on the device ID or the user agent.

Mike Freedman 00:26:16 And so what that means is that if this entire index doesn’t fit in memory, then you’re going to be pulling more and more data from disk. And that is of course when your performance starts to suffer, because memory’s a lot faster than accessing disk. And so one of the things that, how this translates to the way Timescale works, is that we have the abstraction which we call hypertables, and it looks like a standard table except under the covers we’re doing this automated partitioning based on a customer provided partitioning key, which is usually the timestamp. And so what that means is you basically create a policy that says give me a new, we call it chunks, give me a new chunk or give me a new partition every day, every week. We have some applications that do it every 10 minutes because they’re high volume.

Mike Freedman 00:27:04 And what that means is that when you query something over time, it kind of has additional metadata so that it doesn’t need to access your data over all time, or the indexes that are built across all time; it only accesses the chunks, and only the small indexes built on them, based on your time interval. So if you have three years of data but you say select star from the metrics table where timestamp is greater than now minus one day, it will basically only run this query in parallel across the chunks corresponding to that day and therefore also use the smaller indexes on top of them. And because of the workload patterns, those things are much more likely to be in memory, they’re smaller. And so that translates, even before we get into all the fancier stuff like columnar compression, to a lot more scalable and more efficient queries. You’ll also see as we talk through how we use the same type of partitioning to really enact a lot of other policies in our system where you want to access recent data differently than you might access old data.
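
A minimal sketch of what Mike describes, using hypothetical table and column names and an assumed one-day chunk interval:

-- Turn a plain table into a hypertable, chunked by day on its time column.
SELECT create_hypertable('metrics', 'ts', chunk_time_interval => INTERVAL '1 day');

-- Only the chunks overlapping the last day, and their smaller per-chunk
-- indexes, need to be touched to answer this query.
SELECT * FROM metrics WHERE ts > now() - INTERVAL '1 day';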

Gavin Henry 00:28:12 Yeah, my application, this application uses create hypertable on the events table on the event timestamp. We have a unique index on the event UUID and the event timestamp, so we don’t get fake data. Then we have three indexes, which are the source IP of the probe, the number that was attempted to be called, and then the client ID, for example the auth ID, client ID, because the nodes that send in data are all authorized by an API. So they get a client ID from that. It’s simple and it’s very fast with the bucket window thing.
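
A hedged sketch of the kind of schema Gavin describes; all names and types here are illustrative guesses, not his actual application:

CREATE TABLE events (
    event_uuid    uuid        NOT NULL,
    event_time    timestamptz NOT NULL,
    source_ip     inet,
    dialed_number text,
    client_id     text,
    user_agent    text
);

-- Partition by time; any unique index must include the partitioning column.
SELECT create_hypertable('events', 'event_time');
CREATE UNIQUE INDEX ON events (event_uuid, event_time);

-- Lookup indexes for the common query patterns mentioned above.
CREATE INDEX ON events (source_ip, event_time DESC);
CREATE INDEX ON events (dialed_number, event_time DESC);
CREATE INDEX ON events (client_id, event_time DESC);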

Mike Freedman 00:28:46 That’s what we see a lot. We see people often have these relatively simple schemas. They’re not, let’s say, a thousand columns of random data; it’s well-structured, but yeah.

Gavin Henry 00:28:59 This is almost like an audit log table. So you’ve got the application name that’s sending in the probe, the version, the timestamp, UUID and then some other metadata whether it was received by TCP or UDP, destination source IP, the usual things you’d want to see and then specific things that we want, which is what they’re trying to do.

Mike Freedman 00:29:20 Totally. And a lot of people when they think about time series, they often think about this as regular, let’s say sensor data that arrives at a regular interval. Another big use case, which sounds more similar to what you’re doing, is really events. The information is used to describe an event in the system that is irregular in nature, and in your case it’s these probes, these VoIP or session establishment protocols. When SIP runs, it’s a kind of event happening in the system. We see the same thing with product analytics; it’s when a user clicks on something in a product. Or we see a bunch of people running gaming applications where it’s actually an in-game event that is being logged to the database. And we think that most data, at least most high-volume data, has that time series nature to it.

Mike Freedman 00:30:08 Even if you think about traditional e-commerce applications, which is like the canonical OLTP application that developers think about, where is most of your data, where’s the biggest table? And that’s your orders table. It’s a log of every order placed in the system, right? And if you think about it from that perspective, an order is just an event log kind of ordered by time. And so that type of event log can be treated the same way, and a lot of the technologies that we built for so-called time series apply equally to the type of event log that you have in a traditional e-commerce database. So what’s interesting is one thing that we see is that if you start looking into most databases storing hundreds of gigabytes, terabytes or more, actually where that data is, it’s sitting in what looks like time series data. And so it’s very applicable to the type of technologies we built into Postgres.

Gavin Henry 00:31:03 It’s quite funny, because when you come up with the actual names and define what something does, you start to see it everywhere, don’t you? You say hold on a minute, we can apply that to this and that, and it’s just been waiting to be found.

Mike Freedman 00:31:15 We very much agree.

Gavin Henry 00:31:17 So just to close off that section, TimescaleDB makes asking these questions and getting the answers easy because it changes the row data into columns that are easy to find, and compresses them and chunks them up. Is that a dumb way to express it, or?

Mike Freedman 00:31:34 Yeah, there are a couple of key enabling technologies I already alluded to. One is that we do this automated chunking or partitioning in what we call hypertables. The second is, as that data starts to get past its initial phase where you need high ingest, we change it to a compressed columnar format, which makes querying it faster, as well as we often see in production 90, 95, even 98% compression ratios. Really impressive, because we’ve done a lot of optimized algorithms there. We also have a notion of a data life cycle. I talked about from row format to column format; in our Cloud platform, for really old data we have tiering to even things like bottomless storage on S3, again with a different performance-cost trade off. And perhaps the last thing is we have a technology which we call continuous aggregates, which, if you’re a developer coming from the database space, you might call an incremental materialized view.

Mike Freedman 00:32:35 All that means, it’s a fancy word for: we perform automated rollups in the background so you don’t have to do that. So in your example, if you’re building an API or building a dashboard, you are often querying the same thing over and over because it feeds your API or dashboard. You’ll basically create that rollup in the database, and whenever data comes in, at a regular interval the database will automatically kind of roll that up and create that post-processed result that then gets, it’s almost like a caching layer that gets served to the application. But of course, unlike a cache, you could do full SQL against it. And so that becomes a powerful way to, again, further improve performance for developers who know the type of applications they’re building.
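
As a sketch of the compression and continuous-aggregate features described here (table, view, and column names are hypothetical and the intervals are illustrative):

-- Compress chunks once they are older than a week, segmenting by device.
ALTER TABLE metrics SET (timescaledb.compress, timescaledb.compress_segmentby = 'device_id');
SELECT add_compression_policy('metrics', INTERVAL '7 days');

-- Continuous aggregate: an incrementally maintained daily rollup.
CREATE MATERIALIZED VIEW metrics_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', ts) AS day,
       device_id,
       avg(value) AS avg_value
FROM metrics
GROUP BY day, device_id;

-- Keep the rollup refreshed in the background.
SELECT add_continuous_aggregate_policy('metrics_daily',
       start_offset      => INTERVAL '3 days',
       end_offset        => INTERVAL '1 hour',
       schedule_interval => INTERVAL '1 hour');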

Gavin Henry 00:33:20 Fantastic, thank you. So now I want to get very technical and ask you to explain how TimescaleDB is integrated with the famous and all glorious PostgreSQL. How do you do it?

Mike Freedman 00:33:33 So we’ll see how technical I get, but Postgres is interesting in that, I alluded to this earlier, it’s not just a database but almost becomes a platform. And one of the ways that happened is that, more than 15 years ago, Postgres started down the path of supporting what is called an extension framework. And so, what does an extension framework mean? It means that there are basically hooks and callbacks throughout the Postgres software that allow third parties to come in and basically integrate themselves, such that for example, when a query might come in, you could register a callback to yourself such that your software, your extension software, might get called before standard processing. And so there are hooks throughout the code base so you could decide to integrate yourself at different points.

Mike Freedman 00:34:33 And even in the last five years, that framework has been further enriched through what is called, internally in the Postgres software, table access methods, and there’s even a form of this for index access methods. The first part of that integration was to change the ability to handle SQL queries and do query processing, and now it makes it easier and easier to not just handle queries, DML calls, and schema changes like DDL calls, but also some of the ways you might store your actual data internally. We’re starting to do that with the way we store our compressed data and some of the ways we even build indexes. The short answer is, what does it mean to build on top of it? It means that you have a bunch of hooks through the code base; you as a developer continue to write in C, it’s low level, you’re running in the same process space. So you have to be careful about it, because if you have any bugs or seg faults, that of course will crash the whole of Postgres, not just your own little extension. And because part of it is you actually work in the same memory space, you have access to the same memory domains that Postgres is normally operating in, and that allows you that level of fine-grained control.

Gavin Henry 00:35:47 In the setup of the application we’ve been discussing that I created, I’ve got a CREATE EXTENSION IF NOT EXISTS timescaledb. So that’s the kind of standard Postgres way to install an extension, isn’t it?

Mike Freedman 00:35:59 Basically, building an extension has two parts. There’s obviously the software that gets shipped with the code that you install. And then on top of that, on a per-database perspective, you have this command called create extension, which is what in some sense loads the endpoints from that extension into your runtime.
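
For reference, the per-database step being discussed is the standard extension command; a minimal sketch follows (note that TimescaleDB also needs to be listed in shared_preload_libraries in postgresql.conf):

CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Check which extension version is available versus installed.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'timescaledb';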

Gavin Henry 00:36:22 Do you get access to all the native Postgres utilities? As for memory, creating things, all the different hooks, so you’re not having to write all this from scratch; you get access to the whole internal Postgres ecosystem of C libraries and functions?

Mike Freedman 00:36:39 You do get access to much of it. There are different points in the Postgres code base that you get integrated. Not everything, not everywhere; you know, it’s in general a tradeoff between what are the things that are needed, as well as the fact that more hooks that the core exposes obviously increases, in some sense, how much testing they need to do and the possibilities for things going wrong. I think one thing that also happened is a lot of the extensions that are built on Postgres are relatively simpler or well constrained. They might be things to, let’s say, support a certain type; they might extend the functionality to support a new type of, almost like a built-in UDF, a function. They might be a new index type. So one of the more popular ones is PostGIS, which is geospatial indexing. There might be a bunch of maintenance things. So pg_stat_statements is a common extension that basically keeps track of query execution time. Something like Timescale, maybe Timescale and another extension at the time called Citus, were probably the first two that really took it deep, in that we started changing a lot more of the internals of Postgres than perhaps was even imagined when the extension framework was first built. So that runs into a lot of pros and cons. Yeah.

Gavin Henry 00:38:09 So when you’re hooking into these callbacks and hooks for a normal insert, that data’s there in a normal database as well, but it has all the Timescale bit around it.

Mike Freedman 00:38:19 Yeah, so it has the Timescale bit, but it’s also basically like, whenever you query a database that has Postgres installed and running, effectively we get a callback in the Timescale code; there’s a chain of these callback functions, and we get a callback that basically is almost like a handler: do you want to handle that? And so we look at what tables, what database objects that query is touching, and if it’s affecting one of the hypertables or one of the other things under kind of our management, then we will process it via those rules. If not, it kind of falls through and goes to normal processing. It’s also the case that we kind of interact at different levels. We have some code that runs, let’s say, on your insert pipeline. One of the reasons why, if you do an insert, we need to route your insert to the right chunk that I described before.

Mike Freedman 00:39:17 Part of that is also one of the ways that we do automated chunk management. So if a chunk is, let’s say, defined for this one-hour interval and the hour threshold has passed, and you do a write with a new timestamp which actually does not have a table defined for it, rather than fail, which would occur if you’re manually creating these types of Postgres partitions, we actually, as part of the insert pipeline, will automatically create this database object and then insert that data in it. So again, it’s much easier. Our code paths are different of course for the insert pipeline, for the query pipelines, for various schema modifications, and it really affects many different parts of the database.

Gavin Henry 00:40:06 And you spoke about routing it because the data has been chunked. Is that something that’s automated based on the type of queries you’re making, or is that some type of configuration thing you need to do?

Mike Freedman 00:40:17 As soon as you ran that create hypertable command, basically everything happens somewhat automagically for you. So that create hypertable command takes, in some sense, an interval. There is a default, but that defines the size of this chunk. You could change it over time, and you just write; you could do any type of insert and it will route the rows of that insert into the right chunk. In many cases it’s the latest; in some cases you are collecting older data, what we might call backfills to older time intervals. Behind the scenes, now that we have this compressed columnar format, it transparently handles writing this data to an uncompressed region next to the compressed region and then asynchronously combining the two to recompress it. All that kind of happens transparently to you as a developer.

Mike Freedman 00:41:11 Similarly on the query side, in your case you are just querying that one table of events. You are not realizing that the query engine under the covers, the query optimizer, is saying okay, I have a thousand tables here; based on what the query predicate is, I only need to access these eight chunks of a thousand. Some of these chunks are compressed, some are not. I’m going to in parallel execute against these eight chunks, compute any ordering, limiting, whatever you want on top of that, and return to the client. All of that is transparent to you as a developer.
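
One way to observe the chunk exclusion described here is a plain EXPLAIN on a time-bounded query; with constant time bounds, the plan should only include scans of the chunks that overlap the range (a sketch with hypothetical names):

EXPLAIN
SELECT * FROM events
WHERE event_time >= '2024-06-01' AND event_time < '2024-06-02';
-- The resulting plan appends scans over only the matching chunks rather
-- than every chunk in the hypertable.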

Gavin Henry 00:41:50 And it just sits in the normal sort of storage engine possibilities that Postgres can offer?

Mike Freedman 00:41:55 Yeah, we reuse the heap store in Postgres in a very particular way to also support our compressed columnar storage. I could kind of go deeper into that if you want, as well as operating with normal memory spaces as you would in traditional Postgres. The one place where that’s different is we introduced, about a year ago on our Cloud, tiered storage, where beyond having this row-based storage for the most recent stuff and compressed storage for most of the hot stuff, this tiered storage, which is basically backed by S3, is for much older data. That is using a very different technology than traditional Postgres heap storage, because we’re actually storing these as Parquet files on something like S3 and then building a query engine on top of that. But again, for you as a developer, you just query your database, and it transparently decides which type of storage it needs to pull this from, based on what that where predicate is: is this data in the last month, which is all in fast storage, or is it from five years ago, where I need to go back to that tiered storage backed on S3? All the machinery there, though, is transparent to you as a user.

Gavin Henry 00:43:12 And how do you test between different versions of Postgres with your integration? Is there like a Postgres test suite that you’d normally do from the front end with your select and things or do you have your own custom test suites that you use behind the scenes in front of the callbacks or behind the callbacks as it were?

Mike Freedman 00:43:31 So I mean testing becomes a whole different problem, or not problem but challenge. We certainly have, from day one, like many good software engineering practices, a full suite of unit tests that gets run on every PR. We use various forms of static analysis, again because we’re writing most of this in C, although we do have some newer extensions in Rust as well; obviously C introduces a lot of …

Gavin Henry 00:43:57 That’s what I was going to ask, if that was around when you were thinking about doing this.

Mike Freedman 00:44:01 Rust, it wasn’t really popular when we started, and also it works well for some things, and it doesn’t work as well for other things. Part of it is because we’re integrating really closely with the Postgres core, and when we do so we find in many cases that C remains more appropriate for interacting with it. But to go back to your question, we obviously have a full suite of unit tests. We have static analysis that we run constantly, and we internally have, of course, benchmarking suites, not only for correctness but also to look at performance regressions. And so we’ve been similarly doing a lot of work over the last year or two around query performance, not just in optimizations. We even, I talked about the columnar format, launched last year support for vectorized instruction sets. So this is when you could use low-level instructions in the architecture, like x86 SIMD instructions.

Mike Freedman 00:44:56 And so that further improves performance. So a lot of what we’re also testing for is not just correctness but also performance regressions. The other thing I’ll say is, because we write a database but we also run a cloud with thousands of customers, we then have a whole rollout procedure in that we first roll out to dev, then we roll out to a first set of services running on cloud that runs Timescale services. So we dogfood: we build a lot of advanced software for Timescale Cloud using Timescale services, the same as our customers. And so we will roll out changes to those regions. We kind of have a canary approach to rollouts where we’ll roll out new versions to only a percentage of our customers across different regions, observe how that rollout might affect performance before and after, because we have a lot of low-level observability that we can collect about performance, and then use those to decide to go forward. So in some sense we could build better software because we’re also the operators of a popular cloud today.

Gavin Henry 00:45:59 That’s the best way, isn’t it? And do you work closely with the Postgres team when you find things because you’re in it so deep?

Mike Freedman 00:46:06 So there are two answers. First of all, we’ve actually for a couple of years employed Timescalers whose job it is to be upstream contributors. So one of the things that was exciting last year is that, even though we’re not the oldest startup and we’re still a relatively modest sized company, we were in the top 10 companies that contributed to Postgres in terms of — I think it was Postgres 16 — in terms of the number of contributions or PR reviews or whatever. So we are definitely trying to work with the community, both inside the core as well as building both the open-source extensions as well as our core TimescaleDB, which is free for anybody in the community to use. The good thing is we found that Postgres itself is generally very stable.

Mike Freedman 00:46:57 I think they really take the right approach here, which is that their goal is not rapid change but careful evolution. And I think for such a core piece of technology that serves almost the whole internet so well, that’s really the right approach. And so there have been some ways that we’ve contributed upstream, contributed ideas, but the good thing is that Postgres really provides a lot of those types of hooks that we use, and other people use, to build these types of extensions. I would say that we do have a lot of experience building a lot more advanced extensions than most, and we’ve contributed some ideas around that. One of the areas is that we open-sourced last year a security checker, which we call pgspot. We found out that a vast number of the Postgres extensions have security flaws in similar manners. And so we’re just in the process of doing a lot of outreach for another wave of extensions where we found more flaws. And so I don’t want to go into the details yet, obviously.

Gavin Henry 00:48:02 Yeah, we’ll put that in the show notes and get a link for people to have a look at them.

Mike Freedman 00:48:04 Yeah, for responsible disclosure reasons, but I mean there are different ways that people also contribute to the community, and this ability to make the whole ecosystem more secure is certainly something that we take a lot of pride in.

Gavin Henry 00:48:16 Yeah. That’s very impressive, thank you. So I’ve got two questions left before we move on to discussing a complete example of where we use TimescaleDB to finish off the episode. So, what makes you the same as other time series databases? This is a twist on a great question that one of the other volunteer hosts of the show normally asks, which I will also ask: what makes you different from everyone else? Let’s start with what makes you the same as other time series databases.

Mike Freedman 00:48:43 I think one of the ways that we’re the same is we often try to look at the same type of workloads. In many databases they have the notion of, if you pull a common use case for Postgres, including with an ORM, they often talk about a CRUD workload: create, read, update, delete. And that often by its nature is very key-value store; it’s point queries, it’s create a row, update a row, read a row, delete a row, because you are managing them like a user table. And I think the workload for so many dealing with time series or events, internally we sometimes refer to this with an acronym we don’t use publicly too much, but I really like it, it’s “awesome”: Append, Update, Scan, Materialize. By that I mean you’re not creating a random record; most records are being appended to the end of your database. Occasionally you do updates, but they’re not as primary a use case.

Mike Freedman 00:49:44 Scan, you are often not reading an individual row; you are looking at data over a larger period of time. And Materialize, it’s not all about these random scans; it’s that you often have known things that you want to access. So it is beneficial to materialize those results so you could satisfy those common queries quickly. And so I think if you look at a number of the databases that try to handle this data, even though they might say it differently, they optimize for these things like insert heavy or append heavy as I described. They optimize for these scans over longer periods of time. They may have some support for materialization to enable fast queries to known things you want. So I think you often see those, at least, goals as similarities. I think, to be honest, the biggest difference is that most teams that have gone down this path thought they needed to build something from scratch, and perhaps that’s the biggest way that Timescale is different, because building something from scratch comes with a lot of trade-offs.

Mike Freedman 00:50:54 Not just in terms of the flexibility and the richness of both the query language and the database functionality, but I think that ecosystem takes a long time to develop, both for the features inside the database as well as the software built around it that you could use. You mentioned earlier that, hey, you went to use Elixir and it just worked with Timescale; it worked with Timescale not because of what we did, it worked with Timescale because it worked with Postgres. And the last thing is people: the ecosystem is the developers who know how to use it and can become efficient really quickly. And they have a mental model of what’s happening. They understand kind of more implicitly what it probably could do or what it can’t do. And so there’s not that whole learning phase that you’d normally have with a company’s brand new technology. So I think that’s probably the biggest difference: our ability to build this on top of Postgres.

Gavin Henry 00:51:44 That’s great. You’ve covered both of my questions here, what makes you the same and different. And that was one of the biggest surprising things for me: I imagined in my head, when I read the articles and came across the name and the different blog posts I was reading, that there would be like this TimescaleDB file, like SQLite or something, in my application, and it was a real pleasant surprise when I saw it was part of Postgres that way.

Mike Freedman 00:52:09 Yeah, that’s the common one. When I think about our customer journey or our user journey, the very common user journey that we have is that a developer starts building their application and they immediately reach for Postgres, because in some sense that’s become the default today. It’s the no-brainer. I don’t know what I want to do, I reach for Postgres, it’s this great flexible, powerful thing that’s not going to fail me. And at some point, what happens is their product takes off, they start inserting more data, they’re starting to encounter these problems. And so we kind of think that what we’ve done is we’ve made Postgres powerful; we’ve scaled the performance of Postgres to make it work for these demanding applications like IoT sensor data, crypto, AI, events, in general time series data and others, where you love the ecosystem and reliability and everything that you understand about Postgres, but you need it to scale more, you need more performance, you need more features for this type of use case.

Mike Freedman 00:53:15 And so your two options are either really abandoning Postgres and going to some more niche database, that has of course other problems and unknown unknowns, or trying to figure out how you can continue to scale with Postgres. And I think that’s really the part of the journey where we come in, in that we’ve enabled people to scale with Postgres, and people are often surprised at how far you could scale it. I talked earlier about how in running our Cloud we build our own services on top of Timescale. And what’s really interesting is the way we do that now is we actually use, in our Cloud, vanilla Timescale services like any customer does. And the thing that is actually backing some of our production features is vanilla Timescale services; that is, we build on top of the same thing that all of our customers can use.

Mike Freedman 00:54:07 One example of this is we have a new Insights product feature that basically gives you really low-level information about how your queries are working. So it collects sketches of every database query that you do, normalized. It covers the shape of queries, timing information, how many pages it hit in memory or on disk. It allows you really to drill into individual queries to understand how your database is operating and where things are working fast and where things might be working slow and how to debug it. Why do I bring this up? That service, which is a standard Timescale service, ingests about 10 billion records a day and is approaching a petabyte. It’s using tiered storage, it’s using columnar compression, all this stuff. Normally you don’t think of a Postgres database as being a petabyte. That’s what we run, and that is customer facing. It serves all our customers. And so indeed you could actually make Postgres scale.

Gavin Henry 00:55:04 Well, that’s looped back nicely to the start of the show when I was looking for some scalability numbers. That’s a great way to help us finish up the show on an example. So what I’d like to close off with, as an example project, it could be anything time series related, obviously, is one that covers how we set up TimescaleDB, how we scale it, and important things that a lot of people don’t talk about, the operational side. How do we back it up? How do we check that it’s working okay? How do we upgrade it; is that just upgrading Postgres or is that something else? So we’ve got around 10, 15 minutes for that. Would you like to have a stab, Mike? So let’s start, how do we get set up with the vanilla TimescaleDB?

Mike Freedman 00:55:47 I have two answers here. One is that there’s how you do all these things if you’re using Timescale’s open source or the free version of Timescale, and there’s obviously how you do that if you use Timescale Cloud, which is wildly easier than the former. So let me somewhat answer this in two ways. Of course, the easiest is, many of these instructions are in our docs, so any listener is certainly welcome to go to our docs and just follow the various instructions to be able to copy and paste various commands. The good thing is, if the listeners are familiar with how to manage Postgres, managing Timescale looks incredibly similar, and that’s because for the most part, Timescale somewhat innovates on the internals of the database.

Mike Freedman 00:56:44 How we store data, how we query data, what SQL features we support, not on the operations side. You already mentioned the one additional thing about installation, and again this only applies if you are self-managing the software, not using our Cloud, where you don't need to worry about any of this: not only do you install Postgres, but you also install the Timescale binaries. The other thing I would add is, if you go to our docs, we have Docker images, RPMs, Ubuntu packages, and whatnot that make it very easy to install this using standard package-management software. But once you install TimescaleDB, if you're running self-managed, inside your database you just run one command, CREATE EXTENSION, which is a standard command in Postgres for extension management, and that enables it for your application.

Gavin Henry 00:57:39 Yeah, that's what I do. I pull in the TimescaleDB Docker image in my development environment and then run that CREATE EXTENSION command.
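
For anyone following along, a minimal sketch of that development setup, assuming the timescale/timescaledb Docker image; the tag and password are placeholders, so pick the tag that matches your Postgres version:

    # start a dev instance from the TimescaleDB image (tag is an assumption)
    docker run -d --name tsdb -p 5432:5432 -e POSTGRES_PASSWORD=devpassword timescale/timescaledb:latest-pg16
    # enable the extension in the target database
    psql -h localhost -U postgres -c "CREATE EXTENSION IF NOT EXISTS timescaledb;"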

Mike Freedman 00:57:47 Yep. When it comes to some operational things like HA, high-availability replicas, whether you're running replicas purely for failover or for scaling read queries, again that looks very much like traditional Postgres operations. You set up streaming replication between a primary and a replica the same way that you would with Postgres. If you are, again, self-managing, the built-in open-source Postgres will enable you to stream to a replica. It won't provide the broader framework to detect when a primary goes down and then execute the commands to fail over to the replica. That's often done with other open-source software; one of the common ones is called Patroni, which is very common in the Postgres community for managing this type of failover, and it works very naturally if you're running Timescale. In fact, two of the initial three engineers who developed Patroni, at the time at a company called Zalando, have now been at Timescale for the last four or five years, so it's also technology that we use heavily internally. When it comes to backup, you can run traditional backups, whether they're whole backups like pg_dump and pg_restore, or streaming incremental backups that allow point-in-time recovery. One common one is pgBackRest, which we also use internally, and there are others like WAL-G; those all similarly work with Timescale. So the good thing is, if you are deciding to self-manage Timescale, many of the operations that you have for Postgres also work with Timescale.
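
To make that concrete, a rough sketch of what the self-managed pieces Mike lists look like on the command line; the hostnames, user names, database names, and pgBackRest stanza are all placeholders, and pgBackRest assumes you have already written its configuration file:

    # clone the primary onto a standby and have pg_basebackup write the streaming-replication settings
    pg_basebackup -h primary.example.com -U replicator -D /var/lib/postgresql/data -R -X stream -P
    # whole logical backup and restore with pg_dump / pg_restore
    pg_dump -Fc mydb -f mydb.dump
    pg_restore -d mydb mydb.dump
    # incremental, point-in-time-capable backups with pgBackRest
    pgbackrest --stanza=main stanza-create
    pgbackrest --stanza=main --type=full backup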

Gavin Henry 00:59:28 Maybe a better way to ask about this is: which tools that you would normally use for Postgres wouldn't be used?

Mike Freedman 00:59:35 I would say there are a couple of ways it's different. Number one, upgrades are slightly different, and that is because you upgrade Postgres and you upgrade the extensions on Postgres separately. That makes sense because we obviously release software at a different cadence than Postgres does. So there is a standard Postgres command for updating an extension, and that of course works with Timescale. But I think where some care needs to be taken is version compatibility, in that we typically build our extension to be compatible with the last three major versions of Postgres, sometimes four. But especially when people are self-managing, upgrades can be harder, so they don't do them for many years. And so occasionally we find customers who might still be running, let's say, PG 11 and an old version of Timescale, and they want to upgrade to the latest version of Timescale.

Mike Freedman 01:00:39 The latest version of Timescale might only run on, I think it's, I forget if it's PG 12 or PG 13 onward. And so they have to upgrade these in a couple of different stages, where they might upgrade to an intermediate version of Timescale, upgrade Postgres to the latest, then upgrade Timescale again to the latest version that fits the latest version of Postgres. So there's a little bit of a compatibility matrix that people need to think about. Of course, keep your upgrades up to date so you don't need to worry about this, or if you use the Cloud, we make sure that all these updates are always applied, so you don't need to worry about it either. The last thing I would say is that databases are generally easy to operate when they're small. They get harder to operate when they're big, where more things can go wrong.
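
For reference, a small sketch of the extension-update step Mike describes, using standard Postgres commands; run it against each database that has the extension, and as I recall the Timescale docs recommend doing the update as the first command in a fresh psql session started with -X:

    # see which Timescale version is installed versus available for this Postgres build
    psql -X -c "SELECT default_version, installed_version FROM pg_available_extensions WHERE name = 'timescaledb';"
    # update the extension within the current Postgres major version
    psql -X -c "ALTER EXTENSION timescaledb UPDATE;"
    # then upgrade Postgres itself (pg_upgrade or your package manager) and, if a newer Timescale
    # release targets the new major version, repeat the ALTER EXTENSION step there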

Mike Freedman 01:01:24 We're very used to running production databases at terabyte and multi-terabyte scale, and we have a lot of customers who do that in production. Our experience is that some of the places where people get into trouble when they're self-managing are these: it seems easy to get a database running when it's 10 gigabytes, but maintaining reliable replicas and reliable incremental backups, doing smoke testing on your backups, making sure you can spin those up, and making sure you do upgrades smoothly becomes hard to do in a self-managed environment when you're running at scale, starting at hundreds of gigabytes, certainly terabytes, with thousands or hundreds of thousands of database objects. So that's just something for people to be conscious and considerate of: make sure you have great DBAs when you start getting to that scale, or of course look to a managed service.
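
As one illustration of the backup smoke testing Mike mentions, a rough sketch of restoring the latest dump into a scratch database and spot-checking it; the dump file and table name are placeholders, and the pre/post restore helpers are from Timescale's restore documentation as I understand it:

    # create a throwaway database and prepare it for a TimescaleDB restore
    createdb restore_check
    psql -d restore_check -c "CREATE EXTENSION timescaledb;"
    psql -d restore_check -c "SELECT timescaledb_pre_restore();"
    # restore the most recent dump, then finish the Timescale restore
    pg_restore -d restore_check latest.dump
    psql -d restore_check -c "SELECT timescaledb_post_restore();"
    # spot-check that data actually came back ('conditions' is a placeholder table)
    psql -d restore_check -c "SELECT count(*) FROM conditions;"
    dropdb restore_check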

Gavin Henry 01:02:18 Yeah, I've been caught out with that, just in the dev environment, upgrading between Docker images, because they come as one thing; normally the image has the latest Timescale version tied to that Postgres version. I got into a little bit of a mess, but it's just because I didn't read the docs properly. Okay, that's great. It's obvious it was a great choice to have Postgres as the core and hook into that, because it comes with so much that you didn't need to invent yourselves and could just iterate on, which is brilliant. You've also explained how the Cloud offering works to make our lives easier, which is great. I think I'm going to wrap up the show now, if that's okay. I think we've done a great job of covering what time series data is and what TimescaleDB brings. But a question I always like to ask at the end of my show: if there is one thing that you'd like a software engineer to remember from this episode, what would you like it to be, Mike?

Mike Freedman 01:03:07 I think it's that Postgres is for everything. We really believe that engineers can make their jobs simpler, more reliable, and more flexible if they simplify their stack: probably 99% of workloads will work with Postgres and the things that have made it more powerful, like Timescale and what we've done for time series, AI, and so on, as opposed to building this very Rube Goldberg-like ecosystem of lots of databases, lots of integrations, lots of ETLs. So we think that simplicity can be king, and developers can go a lot farther than they ever thought possible with just Postgres.

Gavin Henry 01:03:50 Thank you. Was there anything we missed in the show that you thought might come up or you’d like to mention?

Mike Freedman 01:03:56 I think that's great, Gavin. If anybody wants to check it out, they can look at Timescale.com to see our database software itself as well as the Cloud experience we've built. What makes us excited every day is that we have users who are building, in many cases, the future of computing, anywhere from next-generation energy to satellites and space to fusion technologies to new forms of finance and money. It's really an exciting time to be working on data infrastructure and to enable developers to be the creative people that they are.

Gavin Henry 01:04:41 And where can people connect with you if they want to follow up or have any questions about the things we mentioned?

Mike Freedman 01:04:46 The easiest way is probably on Twitter at Michael Freedman, or I am Mike@Timescale, if anyone is interested.

Gavin Henry 01:04:54 Excellent. I’ll put those in the show notes. Michael, thank you for coming on the show. It’s been a real pleasure. This is Gavin Henry for Software Engineering Radio. Thank you for listening.

Mike Freedman 01:05:01 Thank you for having me.

[End of Audio]
