Lukas Fittl of pganalyze discusses the performance of Postgres, one of the world’s most popular database systems. SE Radio host Philip Winston speaks with Fittl about database indexing, queries, maintenance, scaling, and stored procedures. They also discuss some features of pganalyze, such as the index and vacuum advisors.
- SE Radio 454: Thomas Richter on Postgres as an OLAP database
- SE Radio 362: Simon Riggs on Advanced Features of PostgreSQL
- SE Radio 328: Bruce Momjian on the Postgres Query Planner
- Twitter @LukasFittl
- Mastodon: https://hachyderm.io/@lukas
- LinkedIn: https://www.linkedin.com/in/lfittl/
- SolarWinds SQL Sentry
- Postgres Wiki Performance Tools
- Citus Data
Transcript brought to you by IEEE Software magazine and IEEE Computer Society.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Philip Winston 00:00:43 Hi, this is Philip Winston for Software Engineering Radio. Today my guest is Lukas Fittl. Lukas is the founder and CEO of pganalyze. He has worked as a software engineer in the Postgres ecosystem for the last 15 years, including as a founding engineer at the startup Citus Data, which was acquired by Microsoft in 2019. Prior to that, he was an early engineer at Product Hunt, which was acquired by AngelList in 2016. Postgres is one of the most popular database systems in the world with over 30 years of active development. Today we’ll discuss Postgres database performance along with the pganalyze tools. We’ll start by introducing Postgres and pganalyze and then tour through various Postgres performance-related features such as indexing, queries, scaling, maintenance, and, if we have time, stored procedures. Welcome to the show, Lukas. Can you tell us what Postgres is from your point of view?
Lukas Fittl 00:01:43 Of course. And thank you, Phil, for having me. So Postgres in general, if you’re an application engineer, you might think of it as the thing that just does the work behind the scenes for you. Postgres is a relational database system. It just turned 27 years old, so it’s old enough to drive a car and to write its own thesis. Postgres was started at the University of California, Berkeley 27 years ago, and it is extremely popular today. Most likely, if you’re using an application today, Postgres will be the backing data store behind it. The main data in the application is going to be stored in Postgres in many cases. And so really, to me, Postgres is one of the most important open-source technologies today.
Philip Winston 00:02:24 What types of applications does Postgres excel at? I know it’s a very general-purpose database, but is there an area it’s most known for?
Lukas Fittl 00:02:34 Sure, and I think there isn’t necessarily one type of data that you can store in Postgres. Generally, what I’ve seen is that it excels at system-of-record type data. So let’s say you build an application and you’re thinking about where to actually store the data for that application: that’s where Postgres excels. But what Postgres is also good at is being extensible. So for example, PostGIS is a very popular extension for Postgres that allows you to work with geospatial data in Postgres. More recently there’s an extension called pgvector, which allows you to store vector data in Postgres so you can work with your AI/ML embedding type data in Postgres. And so that’s really the power of Postgres: that flexibility of being able to support individual data types that are useful for particular use cases.
Philip Winston 00:03:18 Yeah, I’m glad you mentioned PostGIS. That was going to be one of my questions. I think when we get into the specifics we might come back to PostGIS and how it impacts indexes and other things. How about the cloud computing angle on Postgres? I know Amazon has this RDS service and Aurora is a type of database. Are these Postgres under the hood or are they compatible with Postgres?
Lukas Fittl 00:03:45 Yeah, and that’s a good question. So generally you could run Postgres yourself on a virtual machine, right? Similar to the Linux kernel, you can just run Linux yourself; you can build it yourself. But what’s interesting, of course, is that roughly 15 years ago, database-as-a-service took off as a concept. I would say one of the most prominent early starters in that space was Heroku Postgres. These days, unfortunately, not that many people are using Heroku anymore, but I think it was one of the first places that really made developer experience the main thing that they cared about. Heroku Postgres was a managed database service where instead of you having to run your own virtual machine, you essentially just had an API that you could use, API in this case meaning you run your SQL queries against a connection string.
Lukas Fittl 00:04:31 And so nowadays, of course, AWS’s Amazon RDS service as well as Amazon Aurora is probably the most popular managed Postgres-as-a-service database that you could find. What’s notable there, I would say, is that Aurora, for example, is a fork of Postgres. So what they’ve done essentially is they’ve taken the open-source Postgres code, which is BSD-licensed, so very liberally licensed, and they essentially made modifications to the storage engine. Part of the motivation there was to increase resiliency: when you have different availability zones in AWS, they want to have a better concept of how the resilient architecture of the database works. And so Aurora is essentially very tied to AWS’s approach of doing things, versus RDS, which is still a forked version, but much less patched essentially than Aurora.
Lukas Fittl 00:05:19 And so if you want to be portable, right? If you want to have your Postgres work similarly on GCP, on AWS, on Azure, then it’s certainly better, in my opinion, to focus on the RDS type service. Or Google, for example: you have Google’s AlloyDB service, which is again their forked version of Postgres, and they also have Cloud SQL, which is their more standard Postgres. And so I think oftentimes it’s actually better not to stray too much into the forks of Postgres, because you get special behaviors that are unique to each provider’s infrastructure.
Philip Winston 00:05:48 This might be self-evident, but why is database performance so important?
Lukas Fittl 00:05:54 Sure, and I would say it kind of comes back to Postgres being your source of truth, your system of record, meaning that in a lot of cases when your application wants to answer a web request, for example, it will have to ask a database like Postgres. One thing I saw this morning, for example: Google announced a new way of measuring page performance. They’re now saying, I think, that anything slower than 200 milliseconds on a mobile device is actually a slow page, and they will rank it lower. Now, if that was an application, right, if you were actually querying customer data for example, ultimately what you would have is a web request that goes to a Postgres database, and that request is a query. And so if that query is slower than 200 milliseconds, you would by definition fail that performance test. Database performance matters because ultimately it is actual web performance, right? It’s actual end-user performance in many cases. And so whether it is measurements that Google makes, or actual end users being frustrated because your application is slow, database performance matters because it essentially builds a better experience for the end users of your applications.
Philip Winston 00:07:02 And we’ll get into this later, but certainly your ability to scale might be hampered by your database performance. So things are working fine when you have 10,000 users, but then you have a million users and performance falls apart. So I think that’s something we’ll discuss in a little bit. But I wanted to mention from a previous show that I’ll cite soon, SQL is a declarative language. What are the implications for this as far as performance? It seems like it’s related to the fact that you can’t necessarily see what steps or instruct the database in what steps to take. You have to just count on it figuring that out.
Lukas Fittl 00:07:41 Yeah, I think that’s a great question. So you’re essentially stating what you want, right? You’re telling the database, please give me this data from this table and filter it in this way, or sort it in this way. But you’re not necessarily telling it how to do that. And so it is both the power of databases, but also of course the downside, which is that the database will make that choice for you. You declare your intent, and the database will find a way to answer that query, to answer that intent of yours that you declared using the query language. Now of course this relates to index usage and all these other topics, but I think really it is fundamental to how databases work, especially relational databases: you’re not saying, I’m calling this method which works exactly this way; instead you’re saying, I want this data and I want it filtered and grouped and sorted this way. Figure it out for me.
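To make that declarative point concrete, here is a small sketch; the table and column names are hypothetical, not from the conversation:

```sql
-- Declares *what* we want: the ten newest orders for one customer.
SELECT id, total, created_at
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 10;

-- Postgres decides *how*: a sequential scan plus a sort, an index scan on
-- customer_id followed by a sort, or a single scan of a composite index
-- like (customer_id, created_at) that returns rows already in order.
```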
Philip Winston 00:08:30 So let’s start talking about pganalyze. When did you start working on this project and what was the original motivation at the time?
Lukas Fittl 00:08:38 Sure. So pganalyze, I started, I think, 11 years ago now, so 2012, as a side project initially. It kind of grew out of my own frustrations, I would say, with working with Postgres. I first worked with Postgres when I co-founded a company back in 2007, back in Europe. We were doing a social media blogging type site, and at the core of that site was a Postgres database. And back then there was just no good tooling to understand why our site was slow; we had essentially bad user experiences and we didn’t understand why. And so a couple of years later, I got together with somebody I had worked with at the time who, from my perspective, was a Postgres expert, and I essentially went and said, how could we make a product out of this?
Lukas Fittl 00:09:21 How could we solve, for example, one challenge at the time: how do I know what is slow in the database? I see the CPU utilization of my database, but how can I actually know what is taking time? We could really only observe a slow page load, but how do we actually know it is this particular query? Now, what Postgres introduced around that time is an extension called pg_stat_statements. What that extension does is measure all the queries that get executed on the database side. The reason that’s so important is because you can just ask the database: give me all the queries that you’ve seen and how long they’ve taken. And then from that you can decide what to optimize, which indexes to create. And so what pganalyze in its very first iteration did was literally just show that data.
Lukas Fittl 00:10:04 So it was a very ugly-looking Bootstrap site; that at the time was the way to do things. It was literally just pulling that data and showing it in a quick view. It didn’t really work as a business in the beginning; it was more a project just for our own benefit essentially. But already that was really useful, just having that data of which queries have run, and then also over time. So you can say, well, yesterday at 5:00 PM we saw a spike in query activity. Why was that? Why was it suddenly slower? And so that was really the initial focus of the product.
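The kind of question described here can be answered directly from the view the extension exposes; a sketch, noting that the timing columns are named total_exec_time/mean_exec_time on Postgres 13+ and total_time/mean_time on older versions:

```sql
-- Requires the extension (typically preloaded via shared_preload_libraries):
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top queries by total time spent across all executions.
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```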
Philip Winston 00:10:36 So I think one thing you said there is that it’s the load on the database, or the types and amounts of queries, that are going to determine the performance in some cases. So having a slow query that you almost never run is maybe not a problem, but having millions of queries that have a performance issue is obviously a much bigger problem. So can you talk about Postgres’s built-in EXPLAIN feature, and I guess EXPLAIN ANALYZE, and what does pganalyze do that goes beyond these?
Lukas Fittl 00:11:11 Sure. Yeah. So think back to what we talked about earlier in terms of SQL being a declarative language: you’re not really telling the database how to do something. And so the EXPLAIN feature in Postgres, what that does is it will explain to you what the database will do. So generally when Postgres gets a query, it first does the parsing of the query, it then does what’s called parse analysis, to actually say, well, these tables you’re referencing are actually these physical tables in the database, and then it does planning, and then it does execution. Now, the part where Postgres essentially has a lot of choice is that planning process, because in the planning process Postgres will decide which indexes to use, how to join your tables, how to group things, how to sort things.
Lukas Fittl 00:11:52 All these decisions are being made in the planning process. And so what the EXPLAIN statement does in Postgres is show you the query plan that is determined in that planning phase, and it will also show you the cost that Postgres determines. So generally, when Postgres makes a decision on which way to query a table, which index to use or how to join, it will do that based on a cost, and costs are essentially just units that Postgres is using. So they’re not time or anything; they’re literally just: this has cost five, this has cost 10, I’ll choose cost five because it’s less cost essentially. And so when you do just EXPLAIN without ANALYZE, really what this does is show you the output of the planning stage. It just shows the result of planning.
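A sketch of what that output looks like; the table, index, and numbers here are illustrative, and the exact plan shape will vary by schema and data:

```sql
EXPLAIN SELECT * FROM users WHERE email = '[email protected]';

-- Possible output (costs are abstract units, not milliseconds;
-- the range is startup cost..total cost, with estimated rows and row width):
--
--  Index Scan using users_email_idx on users
--        (cost=0.42..8.44 rows=1 width=120)
--    Index Cond: (email = '[email protected]'::text)
```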
Lukas Fittl 00:12:33 The good thing about just bare EXPLAIN is that it’s really fast, because planning is always intended to be fast. Postgres will not spend a lot of time finding the perfect plan, because then each of your queries would be slow. Instead, it essentially does a best-effort search in a lot of cases. And so EXPLAIN will give you that even for very complex queries. You’ll quickly get the explain plan, or the query plan, by just putting EXPLAIN in front of your SQL statement. Now, EXPLAIN ANALYZE on the other hand actually runs the query. And so EXPLAIN ANALYZE is very useful if you want to know which portion of the query took which runtime. The thing to know about EXPLAIN ANALYZE, and this might not be obvious so I’ll mention it now, is that EXPLAIN ANALYZE itself has overhead to run.
Lukas Fittl 00:13:14 And so there are some settings you can use to reduce that overhead. But one of the challenges is that if you’re measuring things with EXPLAIN ANALYZE, in a sense you’re measuring a target that is not the same as the usual execution time, so there are some intricacies there. The other thing with EXPLAIN ANALYZE that’s really useful is EXPLAIN ANALYZE with BUFFERS. So BUFFERS is a special option that you can pass when you’re doing EXPLAIN ANALYZE. The reason it’s so useful is because it shows you how much data the database is loading. So you can see not just the runtime of the query and the individual parts of the query, but also how many bytes of data, how many blocks of data, are being loaded. And that is super useful, because ultimately a lot of times the bottlenecks in a query are I/O problems.
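A sketch of the combined form just described (hypothetical table; the annotated output lines are examples of the shape, not real measurements):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM orders WHERE created_at > now() - interval '1 day';

-- ANALYZE actually executes the query and adds real timings per plan node:
--   (actual time=0.051..12.307 rows=1 loops=1)
-- BUFFERS adds I/O counters per plan node:
--   Buffers: shared hit=1204 read=389
-- "hit" = blocks found in shared buffers; "read" = blocks fetched from
-- the operating system / disk, which is where slow I/O shows up.
```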
Lukas Fittl 00:13:56 And so oftentimes you end up just having slow I/O because you’re going to network storage in the cloud. And so understanding which part of your query is doing that I/O is super important. Now, what we do in pganalyze is integrate with an extension called auto_explain. auto_explain is really useful because instead of you running these commands interactively, connecting to the database, you actually get the queries recorded by the database itself. So auto_explain says, if the threshold for example is, let’s say, one second, each query that is slower than one second will get logged to the Postgres log. Now, in pganalyze we pick up those log entries. So pganalyze as a product will just parse the logs continuously and grab these log entries that say this query took this long and here was the execution plan. There are various settings that we recommend tuning for auto_explain so it’s safe to use in production. But generally, people have reported, I would say, in the most expensive version of it, between one and 3% overhead, and it’s definitely closer to 1% if you use the right settings. And it’s just so useful, because you get the exact plan that executed at the time of your query being slow, not just the plan that you see when you run the query right now.
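For reference, a minimal auto_explain setup looks like the following; the specific values are illustrative, not the settings pganalyze recommends:

```
# postgresql.conf -- enable auto_explain (values here are examples)
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '1s'   # log plans for queries slower than 1s
auto_explain.log_analyze = on          # include actual runtimes per node
auto_explain.log_buffers = on          # include I/O counters
auto_explain.log_timing = off          # skip per-row timing to cut overhead
auto_explain.log_format = 'json'       # easier for tools to parse
```

Note that log_analyze with log_timing on is the most expensive combination, which is why turning per-row timing off is a common way to keep the overhead low.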
Philip Winston 00:15:08 So give us a sense of using pganalyze. Is it primarily visualizing this auto_explain data? Are there completely different features that people run, say, for looking at memory or disk? What’s the basic set of options if I’m running pganalyze?
Lukas Fittl 00:15:27 For sure. Yeah, and I’ll speak to how I look at the system, because we use pganalyze ourselves to optimize our own database, which I generally find is the best way to build a product: use it yourself. And so first and foremost, it shows you data that you don’t really see otherwise. Oftentimes today, what you would see in your cloud provider dashboard is a very high-level graph that says this is the utilization; maybe you’ll see something like active sessions, but you don’t really see the level of detail that pganalyze shows you. And so usually when I use pganalyze, I would use our query overview to see which queries are the most active, and then I would jump into an individual query.
Lukas Fittl 00:16:04 Now, for each of those queries, I can get the performance over time. Again, that is what we referenced earlier: that’s really data that comes from pg_stat_statements. Now, where I think pganalyze is unique, and this was true five years ago but it’s still true today, what we do well is we actually link together these statistics tables, so the pg_stat_statements type data, and the actual Postgres log data. Sometimes what’s challenging is that when you have these auto_explain plans, they’re in your logs, but you’d have to just read through log files, and it becomes very tedious. And so what we do is pull those up into the query page, so that you see all the explain plans that were recorded for the query, right there on that specific query page.
Lukas Fittl 00:16:43 And so that saves you a lot of time, because you don’t have to go looking for log files. There are of course many other features that we’ve added to pganalyze, again from our perspective of using it, as well as from customers reporting things. One other thing I think that’s noteworthy in the context of query performance is what we call query tags. One challenge that you sometimes have as an application engineer is that SQL queries are not really what you’re thinking in, right? You’re thinking in ORM calls; your ORM kind of abstracts the SQL query away from you. And so what we’ve added is a way to essentially tag your queries with the location in which the query originates. So you can say: this line in source code, this file, maybe the name of this particular controller in my application, and then the query tags feature will highlight that on the query page. And so that way I can actually understand better which part of my application is causing this slowness.
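The query-tags idea can be sketched in a few lines. This mimics the sqlcommenter-style convention of appending the query’s origin as a trailing SQL comment; the helper name and tag keys are made up for illustration and are not pganalyze’s actual API:

```ruby
# Append origin metadata to a SQL statement as a trailing comment, so the
# tags survive into pg_stat_statements and the server logs.
def tag_query(sql, tags = {})
  return sql if tags.empty?
  parts = tags.sort.map { |key, value| "#{key}='#{value}'" }.join(",")
  "#{sql} /* #{parts} */"
end

puts tag_query(
  "SELECT * FROM invoices WHERE id = $1",
  "controller" => "InvoicesController",
  "file"       => "app/controllers/invoices_controller.rb"
)
# SELECT * FROM invoices WHERE id = $1 /* controller='InvoicesController',file='app/controllers/invoices_controller.rb' */
```

In practice an ORM plugin would add such comments automatically at query time, so no application query needs hand-editing.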
Philip Winston 00:17:32 How do you think about database performance compared to general application performance or maybe operating system performance. It seems like there’s a lot of similarities, but there’s differences also.
Lukas Fittl 00:17:46 Yeah, and I think tooling is maybe one way to answer this question. So for example, we ourselves use a combination of pganalyze and Sentry for our own performance monitoring. When we want to look at this from an application-centric perspective, we would look first in Sentry. What Sentry would show us is a trace, for example; it would tell us this particular web request took this amount of time, but really that stops at the individual SQL query. And so if I’m thinking of individual users experiencing a slow web experience, I would start there, and then I would go into pganalyze to really do the deep dive and find out why the query was slow. Why was this a pattern? Is this query always slow? I think from an operational perspective, we for example have our own Grafana dashboards as well.
Lukas Fittl 00:18:28 And so I think for some of the high-level metrics, systems like Grafana or AWS CloudWatch are best suited, because they give you that summary view over a lot of different pieces of infrastructure. To me, really, it’s a combination: use the right tool for the right task. If you’re trying to drill into your database, use something that’s database-specific. If you’re trying to understand which end-user transactions are slow, use an APM tool. If you’re trying to get a sense of your infrastructure, use something that’s good for infrastructure monitoring. I think it’s really good to use purpose-built tooling versus something generic that doesn’t really do a good job.
Philip Winston 00:19:03 Yeah. So it sounds like you’re talking about monitoring up and down the stack kind of at all levels. How about for Postgres itself? Are there other performance monitoring or optimization tools that you use yourself?
Lukas Fittl 00:19:17 Yeah, good question. I mean, there are a lot of projects in open-source, if you’re ever interested in this from a non-biased perspective; I always have my own biases, of course. There’s a good list in the Postgres wiki that just lists out all the different monitoring tools. What I use myself, again, kind of comes back to EXPLAIN, I would say. So when I’m looking at a slow query, I would try to understand why it’s slow right now. I would maybe look at a recorded explain plan in pganalyze. But then, if I want to experiment, if I want to try out different variants of the query, like maybe I’ll remove a part of the query and see if it performs differently, usually what I would do is just run an EXPLAIN on the database directly.
Lukas Fittl 00:19:52 And I would say that’s really the most common tool that I use. Of course there are many alternatives that people might also use for other parts. I’ll give you one example: in pganalyze we have a feature to do log monitoring, but a very popular tool to do this in the open-source world, without a product like pganalyze, is called pgBadger. And I think pgBadger is a great tool. I’ve used it in the past; I no longer use it because pganalyze solves that for me. But there are obviously a lot of great projects in the Postgres ecosystem that are created by individuals just because they’re scratching their own itch, essentially solving their own problems. So it’s always good to find things like that.
Philip Winston 00:20:26 I’ll include links to these applications in the show notes. What language or languages was pganalyze written in? Is part of it a Postgres extension, or how does it have access to the Postgres-specific data?
Lukas Fittl 00:20:41 Yeah, it’s a good question. Let me see. So by trade I started with C++ personally, but there’s no C++ in pganalyze, luckily. There’s a bunch of C, which I’ll get to in a moment. We started with Ruby initially, and so a lot of the web application that we have is built in Ruby. We recently started adopting Rust, so a lot of the performance-sensitive processing that we’re doing is in Rust. The way to think about what we’re doing is: you have your database server, and we have a component that runs on your server, or close to your server in the case of a managed service. That component sends data to our service, that data gets processed and ultimately stored in a Postgres database, and then that data gets queried in our web API.
Lukas Fittl 00:21:21 And so the way our stack looks is: that component that runs in your environment is actually a Go application, for historic reasons. That sends into our statistics processing, which is Rust. And then we have a web interface that’s built in Ruby. To me, it again comes back to using the best tool for the job. For example, with Go the big benefit is that a lot of cloud providers actually publish SDKs for Go. The AWS Rust SDK, for example, I think is actually in pretty good shape now, but a year ago it was still experimental, and so you couldn’t really access the AWS APIs in a stable way in Rust, versus Go, which just works well. And so Go was a good fit there. The choice of Rust in our backend really comes down to performance: how do we make sure that we have the least overhead when we’re pulling in these statistics and running our analysis? That’s where Rust sits. And then Ruby is just very convenient for basic CRUD-type applications where you’re mostly showing data and you have stuff like users that sign in. Ruby and Ruby on Rails is just good at that.
Philip Winston 00:22:21 So is the C part you mentioned, is that an extension?
Lukas Fittl 00:22:24 Oh yeah, sorry, let me actually speak to that. So we intentionally said we don’t want to build an extension, because the problem with extensions, and that is true today and hopefully will not be true at some point in the future, is that if you’re using a managed service provider like AWS RDS, you can’t actually use your own custom extensions. If I today went and created a custom extension, I would have to then ask AWS to enable that extension on their system, and that is a long process, and they usually do it based on customers requesting it, not me as an individual just requesting it. And so it’s very hard to get a custom extension adopted by the cloud providers. There are some efforts recently to improve on that, but what we essentially said early on is we don’t think it’s the right approach to build our own extension, because it limits our adoption, essentially our use.
Lukas Fittl 00:23:11 And so instead, we are actually using portions of the Postgres source in our own application. Roughly six years ago, I first created a library called pg_query. What pg_query does is pull the parser out of Postgres and make it a standalone library. So you could think of it this way: when Postgres gets your query, as mentioned earlier, it first does the parsing, then it does the parse analysis, then the actual planning, then the execution. And that very first step of parsing is very important, because when you’re trying to understand what a query is even doing, like which columns it’s referencing, which tables it’s referencing, you need the parse tree that that first step determines. And so what pg_query does is pluck out that particular portion of Postgres, put it in a C library, and then also provide a Ruby wrapper around it. So with pg_query in our Ruby application at the time, we were able to essentially just call the parser and say: this query, what is it actually doing? Is it an update statement? What is it updating? What’s the table name? Which columns is it filtering by? For all that very in-depth analysis, we needed to essentially extract the parser.
Philip Winston 00:24:17 So is pganalyze today strictly a service or is there also an open-source package that you can install or how does that work?
Lukas Fittl 00:24:26 Yes, it’s essentially a combination. So we open-source parts of the libraries that we built, pg_query being the best example of this. pg_query is open-source; if you have a use case for parsing queries yourself, you can just run pg_query in your own application, and people have done all kinds of things with that. Fun fact: if you install GitLab, GitLab bundles pg_query as part of their install, because they’re using it in some part of their own tooling. But pganalyze as a product is a commercial product. We are a small company, we are bootstrapped, so no VC money; it’s just our own money from essentially building a product and selling it to customers. And so to make that sustainable, we’ve essentially chosen that the parts we find are generally useful are open-source, but the pganalyze product itself is a commercial product.
Philip Winston 00:25:11 Now, before we dive into Postgres performance-specific tips and tricks, I would like to reference three Software Engineering Radio episodes that dealt with Postgres that you might find useful: Episode 454, Thomas Richter on Postgres as an OLAP Database, from 2021; Episode 362, Simon Riggs on Advanced Features of PostgreSQL, 2019; and Episode 328, Bruce Momjian on the Postgres Query Planner, 2018, which I found particularly interesting for the way he discussed that in detail. One of those shows mentions OLAP. Can you talk about the difference between OLTP and OLAP, which is I guess the difference between transactions and analytics, particularly with respect to performance and performance monitoring?
Lukas Fittl 00:26:10 Of course. Yeah, so generally the way I would think about it: if you have a typical web application today, let’s say a CRM tool or something, that would be a typical OLTP application. OLTP applications have a lot of fast queries; they’re usually a system-of-record type system where you’re serving a web request, for example. Now, an OLAP system on the other hand would be something more like a data warehouse, where you’re trying to do business analytics, business intelligence type queries, where you’re trying to say: how many of my customers have done this in this timeframe? And you’re looking at a lot of records. Now, the good news is Postgres can do both: Postgres can be used both as an OLTP system as well as an OLAP-type system.
Lukas Fittl 00:26:54 Generally, though, I would say Postgres will excel at OLTP-type workloads. For OLAP workloads, it can make sense to use more purpose-built databases in some cases. In particular, one thing to note about Postgres’s way of storing data is that it’s generally a row-based format. Row-based formats make a lot of sense if you’re querying for a small set of individual rows. But in an OLAP use case, what you’re often doing is querying over a lot of data, and there a column-store type format makes a lot of sense. There are actually extensions for Postgres now that do store data in columnar format. But generally speaking, I would say Postgres is very well suited for OLTP and in some cases suited for OLAP. Now, when I look at this from a performance optimization perspective, it again comes back to query plans.
Lukas Fittl 00:27:39 So when I optimize a query plan for an OLTP workload, I would usually have a very simple query in many cases, where it's more a question of how do I squeeze out that last remaining bit of performance. Versus in an OLAP case, it's usually very complex queries that join a lot of data or scan a lot of data. And if you have the right set of indexes, you can certainly improve both, right? So indexes matter for both OLAP and OLTP. But I would say in the case of OLAP, you might be able to more purposely build an index just for the particular report you're generating once a day, for example. Versus in the case of OLTP, where you have just small, fast queries all the time, you really want to look at the whole workload, right? So you really want to look at all the queries that have run in the last seven days or last 30 days, and then get a sense for what they are querying for, what they are filtering for, and then optimize for that set of queries versus just individual queries.
Philip Winston 00:28:33 You mentioned there might be cases where you need to switch to a more specialized database. I’m guessing you’ve done your performance analysis and there just is no knob to turn, there’s just nothing to make it faster. Is that an experience that you’ve had where you just don’t find a solution even though you can see the problem?
Lukas Fittl 00:28:52 I would think about what problem you're trying to solve, right? So maybe a good analogy would be: if I am analyzing data that's just stored internally and I'm just doing a report across all the data, so most of my reports are loading up all the information, then, to be honest, I don't think Postgres is a good fit for that. If you're looking at all of your data, most of the time Postgres will have more overhead than other systems. Column stores just by design have less overhead, and even if you use a columnar extension for Postgres, that would still not be as good as a purpose-built system. However, the big benefit of Postgres would be that it integrates better with other things you have, right? So coming back to this OLTP versus OLAP comparison, what people in the last few years have essentially created is the term HTAP.
Lukas Fittl 00:29:41 And so HTAP is kind of a hybrid where you have some OLTP type queries and some OLAP type queries. And why that matters is because you're not going to choose a purpose-built database that's good at data warehousing when you're also doing individual OLTP type queries. This was actually something we saw a lot back when I was at Citus Data: what Citus was good at, and why people chose Citus over other solutions, was that they had a mix of workloads. So they had a use case where they wanted to be able to do some analysis across their whole data set, or across subsets of it, but then they also had individual users just adding records and doing inserts. And that type of thing is where Postgres is really good at combining these approaches. But if I had just OLAP, I would probably pick something else. Very popular these days, for example, is ClickHouse; ClickHouse is a column store, and that can make sense in many cases. I was referencing Sentry earlier as one of our own tools; Sentry, for example, adopted ClickHouse for parts of their workload internally, and they're also using Postgres for other parts, right? So they essentially started using both data stores because that made sense; it gave them the best solution to combine them.
Philip Winston 00:30:49 Let's dive into indexing, probably one of the most important aspects of database performance, just from a high level. How do you make the decision about which columns to index, what type of index to create, whether to have an index at all? How do you decide these things?
Lukas Fittl 00:31:07 Sure. Before we dive into how to make the decision, for those of you who are not familiar with all the different index types in Postgres, maybe just a quick recap. So the most common index type in Postgres would be the B-tree index, right? And the way to think about a B-tree is essentially that it helps you optimize filtering as well as sorting for your data. Maybe the best comparison is when you have no indexes at all; one of the index choices is no index. Then what you're doing is essentially looking at a large file that stores all your data, and you're going through it row by row, filtering each row and asking, does this row match the query? If it matches, it's returned; if it doesn't match, it isn't.
Lukas Fittl 00:31:46 And what a B-tree index is, is essentially an additional data structure that optimizes that lookup so that you don't have to manually go through each row, and it does that very well. In, I would say, 90% of cases you should use a B-tree index. There are of course other index types, and we don't necessarily have to go into all the details today, but just be aware that, for example, in Postgres, when you're doing lookups like "is this value included in this array?", you would want to use a GIN index. When you're working with geospatial data, you might want to use a GiST index, or an SP-GiST index in some cases. And for some equality-only lookup workloads, it can actually make sense to use hash indexes. So there are just different index types with different trade-offs that factor into your indexing decisions. Now, coming back to your question of how we reason about this, right?
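As a rough sketch of those index types in SQL (the table and column names here are hypothetical, not from the episode):

```sql
-- B-tree, the default: equality and range filters, plus sorting
CREATE INDEX ON users (created_at);

-- GIN: "is this value contained in this array (or JSONB document)?"
CREATE INDEX ON posts USING gin (tags);

-- GiST (or SP-GiST): often used for geospatial data, e.g. with PostGIS
CREATE INDEX ON places USING gist (location);

-- Hash: supports equality lookups only
CREATE INDEX ON sessions USING hash (token);
```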
Lukas Fittl 00:32:29 How do we essentially make a choice of indexes? To me, indexing is essentially the developer's job, right? So maybe the first important thing to note is that I don't think indexes are the DBA's job or the DevOps job. I think it's the application engineer's job to create the right indexes. And the way to think about that, for me, is: you're writing your code, your code makes ORM calls, these ORM calls produce SQL queries, and these SQL queries ultimately look up data in the database. Now, when you're trying to decide which index to create, really you need to be thinking about the queries you're making, right? You need to be thinking about what those queries are filtering by. So when I do a select from a table, or your ORM does a select from a table, you have a WHERE clause, and it says where column A equals something and column B equals something else, right?
Lukas Fittl 00:33:24 Really that is the most important piece because this is something that the database will not do for you, right? The database will not look at these where clauses and say, oh they’re always querying for this set of where clauses, I’ll create index, right? That is a responsibility that the application engineer has. And so really it comes down to which queries are you running, what where clauses do they have that they’re filtering by? And then if you want to get more advanced, which join conditions are you using and which order conditions, like how are you sorting your data? Because the best performance is going to be when you look at a combination of all the conditions you’re filtering by as well as the order in which you want your data returned. Because an index can both filter the data but it can also give you a sorted result and you can really get tremendous performance improvements if you get this right.
Lukas Fittl 00:34:12 I can actually share a specific example that I worked on recently, where we had a roughly 400x performance improvement, a 414x improvement to be exact, where we took a query from 1.3 seconds, so roughly 1,300 milliseconds, down to 3.13 milliseconds. So a huge improvement. And as I referenced earlier, interactive page loads might have a budget of around 200 milliseconds, right? So you can see how, before, when the query took 1.3 seconds, it was actually too slow; somebody would have had a slow experience. That indexing change actually made a difference, and it tremendously improved the query performance to the point that it was actually interactive. And that is really why I think it matters to look at the whole stack, right? Because it doesn't matter what code you're running on the application side; if you're making that single SQL query and it turns out to be that slow, then suddenly everything else is going to feel sluggish.
Philip Winston 00:35:02 So you mentioned several different types of indexes. Is B-tree the default or does Postgres automatically select any of these other index types in certain cases?
Lukas Fittl 00:35:15 B-tree is the default. So if you don't tell Postgres which index type to use, it's just going to give you a B-tree index, which I think contributes to the fact that most people have B-tree indexes. The thing also to note there is that index types relate to the operators you use in your queries. I mentioned earlier this example where you have column A equals some value; the equals sign is the operator you're using in that case. Another example would be if you have a not-equal sign; that would be something that your regular B-tree index cannot index well. Actually, I think it can filter in some cases, but it doesn't work well. And there are other operators in Postgres, like the containment operator on JSONB, for example, where you're saying, give me the JSON documents that match certain conditions, which again a B-tree index isn't a good fit for. And so Postgres will not make that choice for you. You actually have to tell Postgres CREATE INDEX and you say USING, and by default, if you don't specify USING, it's essentially assumed to be USING btree. But you could just as well say USING gin, or USING gist, or using, for example, ivfflat, a particular index type that the pgvector extension introduced. So that's a totally different index type, for vector embedding type data.
Philip Winston 00:36:23 So you mentioned operators, there’s something called operator classes. Does that have to do with choosing an index also?
Lukas Fittl 00:36:31 I would argue most people listening to this hopefully don't ever have to worry about operator classes. The quick version is that an operator class is essentially the behind-the-scenes implementation of how Postgres matches things up, right? So when you have this operator in your query, how Postgres ultimately gets from "this index type supports this operator class" to "this operator class supports this operator". It's part of what Postgres itself does during planning. I think in practice what most people will do is create an index, run an EXPLAIN, and then realize the index isn't used, right? And then hopefully realize that part of the problem is that the operator is not supported. There's actually a good overview in the Postgres Wiki; if you look at particular index types, the Postgres Wiki or the Postgres documentation will tell you which index types support which operators. So I think it's really more a question of being diligent, right? When you look at your queries and you're not using an equals operator because you're doing something else, just understand that not all indexes support all operators. It's very specific which index type supports which operator.
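A minimal sketch of that diligence check, with hypothetical names: EXPLAIN shows whether the planner actually picked the index for a given operator.

```sql
-- A B-tree index on email supports =, but not every operator:
CREATE INDEX ON users (email);

EXPLAIN SELECT * FROM users WHERE email = '[email protected]';
-- The plan should show an Index Scan (or Bitmap Index Scan)

EXPLAIN SELECT * FROM users WHERE email <> '[email protected]';
-- A not-equal filter will typically show a Seq Scan instead
```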
Philip Winston 00:37:33 Yeah, I can definitely relate to documentation having concepts which you don't actually encounter as an end user, or may not have to deal with. You mentioned JSONB; that is a binary format for JSON that Postgres stores. Do you know anything about the history of that? It seems like when I used it, I didn't really appreciate that it's a pretty significant addition to the database, because you can store large blobs of JSON and query into that JSON. Do you know much about how that came about?
Lukas Fittl 00:38:10 Sure, I know some of it; there are definitely other folks who are the experts on it and actually did the work here. But my understanding of the history is this: if you recall, maybe 10 years ago, NoSQL was all the rage, right? Everybody was like, oh, stop using SQL, stop using relational databases, we'll just use MongoDB and do lookups on JSON data. And we've obviously come around since then; NoSQL kind of died off as a concept, and these days some people are using MongoDB, some people are using Postgres. What JSONB really did is bring this concept of document stores to Postgres. The history here is that if you look back many Postgres versions, like Postgres 8-something, there was just a text data type, right?
Lukas Fittl 00:38:49 So if I had Json type data, I would just use the text like data type in Postgres, I would just put it in a text column. Obviously, I could do a lookup on that text column, right? If I wanted to match the whole document, but I couldn’t really query inside the document. Now after the text type, which was always there, essentially the Json type got introduced I think roughly maybe 15 years ago, I think slightly more actually maybe 18 years ago. And so what the Json type did is it essentially gave validation to that data type. So you actually had the fact that this was a Json structured document, but it was still essentially text inside Postgres. Now what Json B is doing, it’s very similar to what MongoDB is doing with I think their Bson(?) on internal representation, which is, it’s essentially creating, as I understand parts of the Json are essentially stored differently and they’re stored differently in order to enable faster lookups on the data and also to be able to index the data.
Lukas Fittl 00:39:44 So it’s essentially making part of the, like for example, the fact that you have an integer versus text value in your Json document gets represented in a particular way and that then enables the database to actually work efficiently on it. And what Json B really enables for you is that you’re able to do things like, let’s say you have a column and you have JS document in that column and it’s set of key and value pairs and you want to look up a particular key. And so in Postgres you can essentially just say in newer versions you can actually use the dot syntax and older versions you would do an arrow and then you would say the key name and then it would give you that particular value that the key references. And so that is something you can index in Json B columns, right? So you can create an index that optimizes this type of lookup and that is enabled by having Json B as that kind of binary structured form of Json in the database.
Philip Winston 00:40:32 So when using JSONB, is it ever necessary to pull out a key and put it in as a full first-class column? Is that faster access, or is it totally fine to just have all your data in the JSON and query into it?
Lukas Fittl 00:40:49 Yeah, I would definitely recommend, if you're always querying for a particular key, that it's better to not have it in a JSONB, because what you're essentially doing is repeating all of these keys, and you're also incurring some overhead. JSONB lookups are fast, but they're not as fast as a regular column. Usually what I see, and what I think is the best practice, is: if you have unstructured data that is different between your rows, so some customers have this data and some customers have other data, then JSONB is a great fit. If you always have a particular field, then I would not put it in a JSONB column; I would put it in a regular column, because that is going to have the best performance and the least data storage requirements. That mix is essentially the best practice that I've seen.
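One way to get that mix, sketched with the same hypothetical names: on Postgres 12 and newer, a stored generated column can promote a hot JSONB key to a regular, indexable column while the rest of the document stays unstructured.

```sql
-- Extract payload->>'type' into a regular column, kept in sync by Postgres:
ALTER TABLE events
  ADD COLUMN event_type text
  GENERATED ALWAYS AS (payload->>'type') STORED;

-- A plain B-tree index on the extracted column:
CREATE INDEX ON events (event_type);
```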
Philip Winston 00:41:32 That's kind of what I expected when I was using it. How about expression indexes? Is that another index type, or an aspect of indexing?
Lukas Fittl 00:41:44 Yeah, I would say it's an aspect of indexing. One thing that's cool about Postgres is that it has a lot of versatility in how you can do things. Expression indexes are essentially a way for you to index not just a single column. So when I do CREATE INDEX, I choose what to index, and usually I would say this column, or multiple columns, which has certain trade-offs. But let's say we're querying our users table and we're comparing an email address. Usually what I've seen is that people want to compare email addresses lowercase, right? Because if somebody signs in with their email address typed uppercase versus lowercase, you want to match them the same way. And so in your query you would have select star from users where lower(email) equals a certain value.
Lukas Fittl 00:42:26 Now what Postgres would do usually is if you had just an index on the column email, it would actually not be able to use the index because the lookup you’re doing is not on email, it is on the lower function on that email column. And so what an expression index lets you do is it lets you index that expression. And so you can actually essentially index the result of the lower function, this case on the email. And so your index will just have all the lowercase emails. And so that way all the lookups are going to be fast because you’re actually going to be able to match the index to that particular query. Now unfortunately, the inverse also doesn’t work. So if you have an index, an expression index on lower email, if you just do a regular query with just email, without that function, you’re not going to be able to use the expression index, right? So it’s very particular to how your queries are structured, how your workload works, and you really got to understand which queries you’re running to be able to make those choices essentially whether to use expression indexes or not.
Philip Winston 00:43:20 How about pganalyze's Index Advisor? How did that feature come about, and how does it work as a developer?
Lukas Fittl 00:43:27 Sure. It came about from essentially this exact challenge. If I'm an engineer, I don't usually think about individual queries, right? I essentially write code, add a new feature, and that feature has queries associated with it, but I don't really think of that day to day. So the motivation behind Index Advisor is that your feature development ultimately produces queries, and those queries ultimately run against your database. And oftentimes people don't add the right indexes, because it's not something that's always part of the engineering practice. Oftentimes what happens is people make a best effort, right? They try to understand, well, maybe this is the right index, but they're not really looking at the whole query workload. So the motivation behind our work on Index Advisor is: what if we took your whole query workload, right?
Lukas Fittl 00:44:13 All the queries that you’ve run in a database, we split it up into what we call scans, which are essentially on this table I’m filtering for this set of conditions. That’s a scan for us across all your queries, across all these scans. We would then say, what is the right set of indexes to choose? Now this is not a novel idea actually. Like this idea, I think maybe 30 years ago, there was a paper already on the problem of index selection, right? So index selection itself is generally a problem that people have encountered all the time with relational databases. I think what’s unique in our approach here is that what we’re actually doing, first of all, we’re doing it for Postgres. So if you use SQL Server today, SQL Server has a built-in feature for index recommendations, but Postgres itself does not have anything built in.
Lukas Fittl 00:44:57 So first of all it’s in Postgres. The second thing is we are doing this across your whole workload, not across individual queries. So if you are interested in individual queries, there is, for example, the HyperPG extension in Postgres, HyperPG lets you say, if I have this query and if this hypothetical index existed, what would be the cost of the query, right? So that planning cost that we talked about earlier. And so HyperPG is a very important building block to what we are doing, but it is just a piece, right? And really what we’re trying to solve is that I would say that mental overhead that application engineers have when they think of query workloads, when I, I mean I’m just looking at our own database and there’s like thousands of queries and so I’m not going to go through each and every single query and look at the where clause. Like that’s just way too tedious. It just takes too much time. It’s what we essentially said, how can we do better, right? How can we actually make this a better experience for the engineers?
Philip Winston 00:45:51 I think we could spend the whole show on indexing, but there are some other topics I want to try to get to, although we'll probably refer back to indexes in some cases. So let's talk about queries, in the sense of: what performance problems can be fixed just by changing the SQL query itself? So not database configuration, not table configuration, not indexes, but just, more or less, you wrote the SQL poorly. Do those types of situations happen?
Lukas Fittl 00:46:22 These types of situations definitely happen. There are definitely a lot of cases that are not indexes. I just want to make one last remark on indexes, which is: sometimes, when an index is not being used, the change you should make is not to add an index but to change your query, right? Sometimes it actually makes sense to rewrite the query so that you are matching your existing indexes. One example of this would be partial indexes; just know that this is a pitfall, and if you don't see your indexes being used, make sure your queries match your partial indexes' conditions. Now, if we're not talking indexing, one other class of problem queries that I've often observed regards pagination. It's when you have an application and you're showing a list of results, essentially a search, and you're not just showing the first page but maybe the fifth page.
Lukas Fittl 00:47:11 And from a user perspective, usually what you would want is a page one, page two, page three, page four, right? And you click on the fifth page and it shows you the fifth page results. Usually how you implement this on a Postgres side is you do limit and offset. If you’ve ever read anything about limit and offset, then there’s essentially a group of people that say offset is bad, you should not be using it because what Postgres will do, so let’s say I’m saying Postgres, give me the fifth page of results, right? Postgres will actually fetch all of the results and then paginate in that result set to give you that offset of, let’s say each page is a hundred results, right? We’ll, giving the results between like results 400 and result 500, right? And we’re giving you these hundred records, it had to also load the other records right before that. And so oftentimes in terms of rewriting queries, the change you can make here, which unfortunately is a bit invasive, but it’s really big, a performance improvement is instead of doing limit and offset, you would actually change your where clause to look up what’s called keyset based pagination, where you’re saying, give me all the records that are larger than this ID number or larger than this identifier. And that’s going to be much faster from a query perspective, but you do have to change your query to benefit from that.
Philip Winston 00:48:19 You mentioned partial indexes, and I did read about that, where you could index, for example, only non-null rows, which was a simple case given, and I imagine it can get much more complicated than that. How about views and materialized views? Are those things that any query should consider using for performance?
Lukas Fittl 00:48:40 So I would say views don't really give you a benefit in performance. They really give you a benefit in terms of abstracting your data better, right? Especially when you're running manual queries, which is more of an OLAP use case: you find yourself writing queries, and you always have this really complex query that always looks the same, maybe it always joins the same things, and you just want an easier way of querying. A view will help you do that, because essentially it's a rewrite of your query, a simplification of your query. It's kind of like a function call, you could say. You're just simplifying your query into a shorter form, and you can just say select from the view, and it'll automatically translate into the much more complex query.
Lukas Fittl 00:49:25 What materialized views do on the other hand, materialized views are a cached version of that data. So instead of the data being retrieved at query time, it actually gets stored in that form. And so the materialist view can have a performance benefit. The big downside with materialized views is that they get refreshed explicitly, not automatically. And so it’s something that you have to, like you say, create materialized view on a particular query. Then the materialized view will cache that data. But then the problem is if you want the UpToDate data, you do need to say refresh materialized view. To get the updated version in new versions of Postgres, that’s actually, there’s a version of this called refreshed materialist view concurrently, which if you are on that version that supports it, definitely use that version. Usually in Postgres concurrently means it takes less locks, so it has less overhead essentially on database. If you use the version without concurrently, it’ll actually block activity on that view. So if you have queries that try to query the materialized view, they will actually be blocked for that refresh. So that’s just something to watch out for when you’re doing a lot of refreshes.
Philip Winston 00:50:27 So I'm guessing CTEs, common table expressions, are like views in that they're not directly a performance concern, but maybe they do let you write longer and longer queries. Is there any impact you've seen from CTEs on performance?
Lukas Fittl 00:50:43 So CTEs are interesting, because they behave differently depending on which Postgres version you're looking at. A CTE, for those of you who are not familiar, works within an individual query. Let's say I'm doing select star from my table and then I want to join against some other data, and the other data is not just a single table but maybe some other complex query that I want to run. A CTE lets you abstract that: you write WITH, and you say "my_data AS", and then you have a subquery that is all the data you want to work with, and later on you reference that CTE in the same query. Now, in older Postgres versions, what Postgres would actually do is materialize the CTE. Materializing not in the sense of a materialized view, which is actually stored on disk, but materializing in terms of keeping it in memory: it would first run that CTE subquery, remember its result, and then run the other parts of the query against it.
Lukas Fittl 00:51:37 Now in Postgres 12, if I recall correctly, they changed this so that instead of a CTE always being materialized, you can now either say with the name of CTE as materialized or not materialized. So you can kind of control the behavior or if you don’t explicitly specify, Postgres will actually choose what to do. Now the reason that this was the change was made is because oftentimes this materialization component of CTEs actually works against you because you’re essentially preventing Postgres from doing certain optimizations and how chooses the query plans. And so in newer Postgres versions, oftentimes it’s actually a benefit to have that materialization not happen. But sometimes when you are trying to run your query in a certain way, you would actually want the materialization and so then it does matter to actually add that keyword so that you are explicitly materializing and then later essentially reference just to that cash results.
Philip Winston 00:52:29 Maybe I should have asked this first, but how about joins themselves, the bread and butter of relational databases? What sorts of performance implications are there in how you perform a join? I'm guessing maybe it's the order, A join with B or B join with A. What ways are there to do this better or worse?
Lukas Fittl 00:52:50 No, you're right. So when you're joining tables, the most important choice that the Postgres planner has to make is which order to join the tables in, and the second most important choice is how to join them. Now, for the join order, there are various ways Postgres does it. I would say the most surprising thing to know is that join order is actually quite expensive to calculate, because when you try to find the optimal join order, you need to understand the costs of all the different variants. And so Postgres will actually switch over to a different way of planning if, I think by default, you have more than 10 joined tables. So if you join more than 10 tables, Postgres will actually do something special; be aware of that, and there are certain settings that relate to it.
Lukas Fittl 00:53:32 And so once Postgres has found a join order that works, it also looks at the join method that it’s using. The three join methods that exist are either hash joins, merge joins, or nested loop joins. You can always use a nested loop. So nested loop just means get the data for one of the tables and then iterate over each row and find the data for the other table. Because of that iteration, nested loops can be quite slow because you’re essentially, if you have two big tables and you have two big result sets, you will have one table where you’re doing a lot of small kind of queries into the table versus merge joins and hash joins. What they do is they load the two kind of results separately and then merge or hash join them together. And so that way it’s much faster essentially from a just pure lookup perspective because it’s much simpler lookup to do.
Philip Winston 00:54:20 Okay, that was some about queries. How about maintenance? I created that category, maintenance. One topic there I found was vacuuming. As an application developer, it sounds a lot like garbage collection. Is that something that your tool is able to monitor, or recommend how to configure, I guess?
Lukas Fittl 00:54:43 So we are actually in the process of launching what we call the pganalyze Vacuum Advisor. Vacuuming, for those of you who don't know: garbage collection is one way of looking at it; vacuuming is essentially defragmentation, right? The way to think about it is that the way Postgres does things, when you're doing an UPDATE statement or a DELETE statement, the old row version that you no longer need is still on disk. What vacuuming does is clean up that old version of the row. Now, some of you might ask, why don't we just remove the data right away? If I'm doing a delete, why do we keep the old data around? The answer to that is the way Postgres determines visibility: if you have multiple concurrent connections, it might be that another connection is still able to see the row that you're just deleting.
Lukas Fittl 00:55:26 And so that’s why it’s so important that Postgres doesn’t just delete it right away. It actually kind of keeps it around for a bit longer until nothing else sees it. And then vacuum essentially cleans that up. Now what I found is that vacuum and Postgres is one of those maintenance commands that are actually really like, it seemed really simple, but it gets really complicated. Where it gets really complicated is, so Postgres has a process called autovacuum. And what auto vacuum does is essentially it’s a, you could think of it as a more advanced version of a CRM job that runs on your tables and it will run the vacuum command on your tables on a schedule. But the schedule is not just every day at five, but instead the schedule is when a certain condition is met, then run the vacuum. And there’s a lot of different conditions.
Lukas Fittl 00:56:08 The most important one to know is the number of dead rows. If you have a certain number of deletes, for example, or a certain number of updates, then at some point autovacuum will kick in and will actually clean up that data and essentially make the space reusable. And so if you’re seeing your table size grow and grow and become bloated, then oftentimes the issue is that autovacuum is not running often enough and is not actually cleaning up the data. Now what we found is that we already collect a lot of useful data about this in pganalyze, but so far we haven’t really made tuning recommendations. Our new vacuum advisor essentially gives you tuning recommendations, where it specifically says: for this table, tweak this threshold to run vacuum more frequently to avoid bloat accumulating. And we do that based on looking at the data we’re receiving and projecting whether bloat would have been avoided if vacuum had run more frequently, and then we make a recommendation to change these table-specific settings.
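A per-table tuning change of the kind Lukas describes might look like this. This is a minimal sketch; the table name and threshold values are hypothetical, and the right values depend on your workload:

```sql
-- Trigger autovacuum on this table once roughly 1% of its rows are dead,
-- instead of the default 20% (autovacuum_vacuum_scale_factor = 0.2):
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold = 1000
);

-- Check dead-row counts and when autovacuum last ran, per table:
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```

The `pg_stat_user_tables` query is one way to spot tables where dead rows accumulate faster than autovacuum cleans them up.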
Philip Winston 00:57:04 Another maintenance task might be backups. I almost feel like some application developers these days just expect their database to be redundant and to essentially never have to restore from backups. But I imagine other people still do regular backups. What are the performance implications of backups, in terms of not degrading user-perceived performance?
Lukas Fittl 00:57:30 I think the good news is, as you say, most people don’t have to worry about backups today. If you are in the cloud, what matters most is that you understand what your company’s backup policy is, like how long backups should be kept, and make sure that your databases actually keep them that long. I don’t really think you have to worry much about the whole database backup in that sense. What is helpful to understand is that if you’re doing a pg_dump on your database, that actually incurs a performance cost. The big difference here is what I would call a physical backup versus a logical backup. A logical backup would be pg_dump, which actually runs queries against your database, versus a physical backup, which essentially reads files. If you ran pg_dump, what would happen is you would produce a slow query; it’s essentially a select star on your table. And that will have overhead, because you’re getting all the data, copying all the data into Postgres shared buffers, and then returning it to the client. That’s just slow, and it’s something you want to watch out for. If you have a large database, it’s usually a best practice not to do that on the primary. You might want to use a replica instead, and then you run pg_dump on the replica.
Philip Winston 00:58:32 Okay, speaking of replicas, how about a little comment on scaling in general? I think I mentioned this at the beginning: your user load has gotten 10 times or a hundred times bigger and your performance is suffering. Two options in most backend cases are horizontal or vertical scaling. How am I going to decide which of these, or what combination, is going to solve my problem? Because that seems like a pretty big operational question.
Lukas Fittl 00:59:04 Yeah, and I recall this from my experience back at Citus, where we saw this problem a lot. The way we solved it at Citus was we essentially had this extension for Postgres which would let you run multiple Postgres servers, and you could think of your whole set of servers as one. So you could run a query, again for example a select star, and it could actually go across the nodes transparently behind the scenes. That is obviously the most scalable approach, because if you want more scalability, you just add another node. Sometimes people solve this on the application side instead, where you do sharding and the application does the routing between the different nodes. So if you think of yourself as the billion-dollar company that needs to be able to scale infinitely, then certainly thinking about a multi-node approach is very useful.
Lukas Fittl 00:59:47 The one thing I would caution against there: I think there’s a trend towards making everything a distributed database, and that is actually not a good choice sometimes. The one thing I recall from Citus is that a lot of the time back then, we would actually try to make queries go to a single node as much as possible, because querying a large set of nodes is expensive, and there are essentially speed-of-light constraints in terms of making sure that things are consistent. So what you will actually find is that some of these databases that promise infinite scalability, where it’s all distributed, end up being slower for very small, fast queries, where it’s actually much better to have a single node you’re querying. That’s just something to think about: are you doing a lot of querying across nodes, and if you are, can you partition your data in a way so that doesn’t happen? There are lots of details to this, but I would still personally recommend the Citus extension; it’s open-source, so you can just install it on your own server. Some of my colleagues from back then are still at Azure, so they also offer a managed service for it.
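With the Citus extension, distributing a table and keeping each tenant’s queries on a single node looks roughly like this. A minimal sketch; the table and column names are hypothetical:

```sql
CREATE EXTENSION citus;

CREATE TABLE events (
  tenant_id bigint NOT NULL,
  event_id  bigint NOT NULL,
  payload   jsonb
);

-- Shard by tenant_id so all rows for one tenant land on one node.
-- Queries filtered by tenant_id can then be routed to a single node,
-- avoiding the cross-node fan-out Lukas warns about.
SELECT create_distributed_table('events', 'tenant_id');
```

Choosing a distribution column that most queries filter on is what makes single-node routing possible in the first place.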
Philip Winston 01:00:48 How about partitioning? Is that different from replicas in terms of dividing up your data? What is partitioning, and what are the upsides and downsides?
Lukas Fittl 01:01:00 So I would say there are essentially three ways to scale out from a single table on a single server. Partitioning, in the simplest form, can mean partitions in Postgres on a single server. That means you still have what looks like a regular table, but instead of being a single table on a server, it’s actually multiple tables on the same server. That oftentimes makes a lot of sense. Partitioning in that sense is useful when, for example, you have time-series data and you want to drop old partitions; that’s a really fast operation. So partitioning will help you manage large sets of data on a single server better. A variant of partitioning is what we call sharding. Sharding is when you’re splitting up your data, but instead of splitting it up on the same server, you’re splitting it up across multiple servers.
Lukas Fittl 01:01:43 And this comes back to an extension like Citus for example. Now the third wave scaling out would be replication. This is something, let’s imagine you have your AWS Aurora cluster and you can just add additional reader nodes. And so a reader node in Aurora is just a replica in Postgres terms. So that means it’s a full copy of your database that essentially follows behind the primary that you could use for queries. The challenge there of course is that you’re still limited to the performance of an individual node. So if your queries are really expensive, they’re not going to get faster by using replicas, they’re just going to spread out more. So if you have too many queries, that can be a way to scale out.
Philip Winston 01:02:19 Great. Well, we talked about indexing, queries, maintenance, and scaling. I have one last topic to touch on and then we’ll start wrapping up. I’m not sure how much you have to say about this, but as an application developer, I’m kind of interested in the concept of stored procedures. This almost sounds to me like having your database run arbitrary code. How common are stored procedures today, what languages can you use to write them, and can you monitor their performance? Because that starts to sound like application performance monitoring.
Lukas Fittl 01:02:53 So procedures are interesting. Procedures could be either create procedure or create function in Postgres, on the server side. What’s interesting about them is that in the workloads I see, it’s often migrations from other database systems. So if you migrate from Oracle to Postgres or SQL Server to Postgres, that’s where I see a lot of people, a lot of old-school databases so to say, use procedures to encode business logic in the database. And it has kind of grown out of fashion since then, I think. If you look at applications from the last 10 years, most of the time people don’t put business logic in the database to that same degree. I think part of it has to do with functions on the database side being more constrained in terms of which languages you can use, but also harder to debug.
Lukas Fittl 01:04:24 And what’s really nice about PL/Rust specifically is that Amazon RDS also added support for it recently. So if you’re on RDS, you can now deploy your Rust functions to your RDS database as well, using PL/Rust. Now if we think about how to debug the performance of that, that’s where it gets a bit tricky. There are ways to track function performance in Postgres. The function will show up as a query, so just the fact that the query runs will show up; and when the function runs queries itself, those inner queries will also be visible in tools like pg_stat_statements, and through that also in pganalyze, for example. But where it gets tricky is when you have very complex functions. The tooling on the Postgres side today is unfortunately a bit lacking when it comes to referencing back which part of the function caused which query to be slow, and so it actually becomes harder to debug. I personally don’t put a lot of logic into functions that do a lot of stuff on the database side, just because that extra level of indirection makes it harder to reason about and harder to profile. So I would say use with caution, but if you have a workload where this makes sense, there are definitely good options today.
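As a sketch of the visibility Lukas describes: a simple server-side function, and how its inner queries surface in pg_stat_statements. The function and table are hypothetical, and this assumes pg_stat_statements is loaded via shared_preload_libraries:

```sql
CREATE FUNCTION active_user_count() RETURNS bigint
LANGUAGE plpgsql AS $$
BEGIN
  -- With pg_stat_statements.track = 'all', this inner query is tracked
  -- as its own entry, but nothing maps it back to a specific line
  -- of the function, which is the debugging gap described above.
  RETURN (SELECT count(*) FROM users WHERE active);
END;
$$;

-- Top statements by total execution time, inner queries included:
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

The default `pg_stat_statements.track = 'top'` only records top-level statements; switching it to `'all'` is what makes statements nested inside functions visible.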
Philip Winston 01:05:35 I was kind of relieved when you said it’s potentially going out of style. It sounds super powerful but also potentially very hard to manage, in terms of your application logic being split and partially living on the database.
Lukas Fittl 01:05:49 Yep, that’s exactly right.
Philip Winston 01:05:50 So let’s start wrapping up. What’s next for pganalyze? What features are coming this year, or community events, or just what are you looking forward to with pganalyze? And then we’ll talk about Postgres.
Lukas Fittl 01:06:03 For sure. Yeah, so for us at pganalyze it’s really about how we can make Postgres smarter, how we can make Postgres work better for you as an application engineer. We’re about to launch our new vacuum advisor, which will give you the right per-table tuning insights. We just gave a talk at pgCon in Ottawa, which was finally an in-person conference again after many years, and there we talked about our new index selection algorithm. We’re going to release a new version of the Index Advisor later this year that features that new constraint-programming-based model. And then generally, the way we’re looking at this is: how can we give you the right data at the right time? Ideally, how can we notify you and alert you of a problem in your database proactively, in a way that really ties to specific queries, indexes, or tables?
Philip Winston 01:06:49 Yeah, just as an aside, I think one of your features was notifying you about a problem before it gets bad. So there’s a dashboard where you can set up alarms and just kind of keep an eye on things?
Lukas Fittl 01:07:02 That’s right. Yeah. One practice that I’ve seen, especially at larger companies with data platform teams, is that they’ve adopted pganalyze as a way to get Slack notifications. So we’ll actually say, here’s a new missing index, and put that in the Slack channel for your individual application teams, which is really useful. It takes a little bit of the burden off of that central data team that today does a lot of the manual work here. It’s just a question of getting data to the right people in the right channel.
Philip Winston 01:07:28 How about Postgres itself? Are there any major releases coming up, particularly related to performance, or just any features you’re looking forward to there?
Lukas Fittl 01:07:38 Yeah, so Postgres is generally on a yearly release cadence. We’ll have the Postgres 16 release, which is currently in beta; it will probably come out in September or October. That’s the usual schedule; it depends a bit on the release candidates and whether any major bugs are found. And Postgres 17 development has started now; the branch just opened about a month ago. What’s really exciting for me in 16, from a performance optimization perspective, is a new way of getting statistics about the I/O performance of the system. There’s a view called pg_stat_io that Melanie Plageman contributed to Postgres. What that new view does is let you see which part of the system is creating I/O. So it lets you say: is it the vacuuming, is it the queries, is it something else that’s causing things to be slow?
Lukas Fittl 01:08:21 And so you kind of have this high-level kind of bird’s eye view on the whole database activity that you didn’t really have before. And again, it’s coming out in Postgres 16. And I think both extent PG is to read that view and kind of provide insights based on that as well. In Postgres 17, which is currently under development, I’m really excited about asynchronous Io. There’s a lot of work ongoing behind the scenes about synchronous Io and how it makes the database faster. We’ll see how much actually gets into Postgres17. So there’s obviously no bit of ambiguity early in the development cycle, but I think what Andres Fre and Thomas MRO and team have worked on there, there’s some really interesting kind of patches being worked on for Postgres.
Philip Winston 01:08:56 Great. Where can people read more about you, pganalyze, and even Postgres? I’m curious if there are any niches in the community you can point to that are interesting to follow. I know it’s such a big project that it might be a little intimidating to people; I’m just curious how you found the community and what resources you like to use.
Lukas Fittl 01:09:16 For sure. So let’s start with Postgres. If you’re really into it, what I would encourage you to read is the Postgres hackers mailing list. The hackers mailing list is kind of like the Linux kernel mailing list or the LLVM project mailing list: very technical, a lot of deeply technical content, but that’s where all the development for Postgres happens. Postgres is a community project; there’s not a single company behind it, it’s many people contributing. And so you will see each and every decision being made out in the open, and that’s really fascinating to observe, just reading the mailing list. Again, that’s pgsql-hackers, which you can find on the Postgres project’s mailing lists page. Otherwise, if you’re interested in just what’s new with Postgres, at pganalyze we record a weekly series called Five Minutes of Postgres, where I’m essentially featuring what’s new with Postgres.
Lukas Fittl 01:10:01 So if you just want to keep up with Postgres things, you can subscribe to Five Minutes of Postgres on YouTube. And then if you’re interested in pganalyze, we have a newsletter that we send out each month. We also, I for example, have a Mastodon account, we have a Twitter account for pganalyze. So if you want to follow us on social media sites, that’s of course an option as well. And in general, feel free to email us anytime, we generally try to be approachable and answer any questions you have about what we’re doing or what we’re talking about.
Philip Winston 01:10:27 Excellent. And I’ll put links to all those in the show notes. Thanks, Lucas, for your time today.
Lukas Fittl 01:10:33 Of course. Thank you, Philip.
Philip Winston 01:10:34 This is Philip Winston for Software Engineering Radio.
[End of Audio]