SE Radio 485: Howard Chu on B+tree Data Structure in Depth

Howard Chu, CTO of Symas Corp and chief architect of the OpenLDAP project, discusses the key features of B+tree data structures, which are the default selection for efficient and predictable storage of sorted data. Host Gavin Henry spoke with Chu about B+tree data structures; why they allow searches, insertions, and deletions in logarithmic time; what a tree structure is; how LMDB uses them; the history of what got us to B+tree; roots, leafs, nodes, pages, RAM, NVMe, SSDs, file formats, code efficiency, self-balancing, AVL trees, Red Black trees, RDMs, SQL, indexing, concurrency; how to deal with crashes mid-write; and the decision factors of when you need B+tree data structures.

This episode sponsored by Shortcut.

Show Notes

Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Gavin Henry 00:00:16 Welcome to Software Engineering Radio. I’m your host, Gavin Henry. Today, my guest is Howard Chu. Howard Chu has been writing free and open-source software since the 1980s. His work has spanned a range of computing topics, including most of the GNU utilities for example, GTC, GDB, et cetera, networking protocols and tools. Kernal and file system drivers and focused on maximizing the useful workforce system. Howard founded Symas Corporation with five other partners in 1999 and serves as CTO. Its current focuses, database oriented, covering LDAP protocol, LMDB and other non-relational database technologies. Howard, welcome to Software Engineering Radio. Is there anything I missed in your bio that you’d like to add?

Howard Chu That’s quite an extensive list.

Gavin Henry 00:01:05 Cool. So, today’s show is on a pretty meaty topic called B+tree data structures, and they keep coming up in other related Software Engineering Radio shows. We touched on them in my first show I did for Software Engineering Radio with you on LMDB; we did some in Episode 454 with Thomas Richter on Postgres as well as PP database, with Simon Riggs on Postgres, Episode 362, on database engines, Episode 445. So, someone has to try and tackle the topic and I know you’re an expert in it. So here we go. So, I think because it’s so complex, I’m going to split the show into three parts. We’re going to dig into what tree algorithms are, B trees, lay the foundation, move on to B tree and a bit of depth what’s involved. And then probably halfway through the show, we should have enough laid down to tackle B+trees. So, here we go. I’d like to start with an overview tree algorithm is it’s terms in its history, so we can lay a good foundation, then move us on to B trees and finally onto people. So, what do we need to understand first?

Howard Chu 00:02:24 Well, all right. So, these data structures are used to quickly search for items in a large data set, and the tree structure is used because it allows you to search more quickly than, for example, just a linear list layout.

Gavin Henry 00:02:45 And how would you describe a tree algorithm? Is that where you can search things quickly or,

Howard Chu 00:02:53 Well, it’s called a tree just because of the, the logical way that data is arranged. You start with a central root point and then it branches out to different levels to cover the entire span of the data range.

Gavin Henry 00:03:13 So, if you just, just think about a regulatory in the ground, the root would be the trunk and you turn upside down and think about,

Howard Chu 00:03:21 Yeah, the computer science version of a tree is upside down from what you would see in nature.

Gavin Henry 00:03:27 Okay. I’ll keep that image in my head. So, there’s a few terms coming up that, you know, have a stab at so we can try and break these up. So, what a b tree.

Howard Chu 00:03:38 Okay. So, that would be the most fundamental starting point for most of these tree designs. And it’s based on a simple search algorithm where let’s go back to our flat list. Let’s say you have a list of a hundred items and you want to find one specific item in that list. The slow way to find it is to just start at the beginning of the list and check every element until you find the one you’re looking for. And on average, it will take you one half of the number of elements in the list to find the item you want. So, if you’ve got a hundred items on average, it will take you 50 lookups to find what you’re looking for. Okay? So, we can, we can improve that drastically using a binary search. In a binary search, you take whatever list you have and you examine the element in the middle, right?

Howard Chu 00:04:37 So, if we’ve got our 50 items, sorry, if we have a 100 items, we look at the 50th item and see, you know, is it the one we’re looking for? And sorry, I forgot to mention that this only works if the list is in sorted order, right? So, go to the halfway point and you find the 50th item you see, is this what I want? If not, you see, say, is it higher or lower than what I want? If it’s higher than what I want, then I go and divide the next interval down. So, from the middle of the 50th, I go down to the 25th. Okay. So, at each point, if I don’t find what I’m looking for, I can either go higher or lower and divide the next interval in half. So, that that’s the essence of the binary search. And the way this way, this improves things is that every step you eliminate at least one half of the elements from consideration. So, it’ll get you to the one you’re looking for much faster than a simple linear search.

Gavin Henry 00:05:46 Excellent. I think you’ve answered my question of why we need a tree algorithm and binary search tree. So, is there a particular time you would use them or, you know, certain type of data, or does it matter?

Howard Chu 00:06:00 It could be used with any kind of data. So, the other thing that gets us to the tree structure. So, first of all, what I described as binary search is how you execute it. If you have a flat linear list, right? But you can basically encode all of those having and subdivisions by rearranging your flat list into a tree structure, right? So, at the root of the tree, you put your 50th element and then at the two children of that, you put on the left half, say, you put the lower 50 items and the right half, you put the upper 50 items. And then at each level below that you subdivide and, and keep on branching out. And the other thing to point out is, you know, every element, every node in this binary tree has up to two children.

Gavin Henry 00:06:56 Yeah. I was just trying to visualize the imagined through as a dot and then you draw a triangle shape, at the top, and then two dots on the other side, you’re trying to leave children as it were. Yeah. If you have the dot and draw a straight line down that most you’re going to split that into two and then have two more dots, split them down again. So, maybe like, um, visualize it, um, deck of a house of cards, you know, when you’re stocking cards, You got the two cards pointing together at the top of the one card base. And then every end of the card is another two points of a card. Isn’t it. If you’re looking at that fun tone, you’d see the root structure, two points, Just pick it down on the spot.

Howard Chu 00:07:57 Good analogy. So, you’re asking, why do we need them? Or when do we use them?

Gavin Henry 00:08:02 Yeah. How and when is it used?

Howard Chu 00:08:07 That’s the original, you know, the generic binary tree that we just described. You know, these days it’s only used as a teaching tool to give you the idea of what you’re trying to accomplish. You know, in practice, a plain binary tree has a lot of management problems. So, it really isn’t used in real life. It’s, it’s used in computer science classes to teach you about binary searches.

Gavin Henry 00:08:37 Okay, well, let’s, at least it helps us discuss the terms, which we’ve touched on a root being, the base of that tree or the top of the house of cards and leaf is that joining a root to the next node

Howard Chu 00:08:51 Leaf would be the bottom most, a terminal node, and anything between a root and a leaf is generally called a branch.

Gavin Henry 00:09:00 Okay. Okay. So, there’s not enough foundation to understand what B+tree is? Do you think, or,

Howard Chu 00:09:07 You know, that covers binary trees. So, I, as I alluded to, you know, there, there are some problems when you work with just a plain binary tree that would lead us into, you know, discussing B+trees.

Gavin Henry 00:09:21 Would you want to touch on some of the problems? I know it’s academic, but

Howard Chu 00:09:25 Yeah. Well, the first thing is, you know, it’s easy to construct a binary tree. If all of your input data is already sorted when you begin. But if you’re, if you have a very naive algorithm that says, for example, you start with a node with any particular value and you want to insert the next node. And you could say, is the new node less than the current node? Or is it greater than the current node? And if it’s less, you make it the left child. And if it’s great or you make it the right child, if you actually start with a perfectly sorted input data set, then this very simple binary tree construction algorithm will actually give you a linear linked list because every new node will always be either greater than, okay, say you have a list that’s sorted in a sending orders. And every new node will always be greater than the one that preceded it. So, every time you do an insert, you’ll always insert onto the right most branch. And so you’ll get a degenerate tree, which actually looks like a straight line, which is not very useful for the other search property. So, the tree,

Gavin Henry 00:10:47 Because the left, the left leaf is the lower side and the right leaf is the upper side. So, if everything’s higher than the previous one, that you’re only going to be populating in the right side on you.

Howard Chu 00:10:59 That’s right. Yeah. You’ll never populate the left side. So, this is one of the problems with a plain binary tree is that it’s very easy to get into these degenerate forms that aren’t balanced. Right. And if it’s not balanced, then you don’t get any advantage of the binary search properties.

Gavin Henry 00:11:16 Yeah. Where are you split it in half and look at the higher or lower, because there’s never a lower, So, you’d be better off just going back to the original way and counting.

Howard Chu 00:11:27 Yeah. Yeah. And so that’s the most fundamental problem that motivated the development of something like a B+tree It’s been around at least since the 1970s. I think the first, first published paper in 1971 or 72.

Gavin Henry 00:11:49 Okay. Just before we move on to what be chief issues solve, can we just revisit a root, a leaf? The node? Was that the right word? Root

Howard Chu 00:12:01 They’re all nodes, right? There’s a root node. There’s a leaf node. And there are branch nodes in between.

Gavin Henry 00:12:06 Yeah. So, a, a leaf is, yeah. Describe it again. I won’t get on the way.

Howard Chu 00:12:12 So, all right. So, the roots is, you know, the starting point, right? And the leaf is a terminal point. There, a leaf has no children. It’s the end of the line and anything that is interposed, anything that connects between the root and the leaf is a brand.

Gavin Henry 00:12:35 Okay. I know a node can only have two branches

Howard Chu 00:12:40 And retreat. Yes, I, no. It can only have two branches. I know it can have up to two branches. It could have zero. It could have one,

Gavin Henry 00:12:52 It could be a right or left depending on whether the value is higher or lower. We had a good stab at covering the basics and the history there. So, if we feel lost, move on to the main types of beachy. So, what is a balanced tree? I can revisit that.

Howard Chu 00:13:09 Okay. So, you know, we mentioned that, you know, if you just naively construct a binary tree and you have data of a particular arrangement, for example, sorted in a sending order, then you won’t have a very good tree structure. You won’t be balanced and it won’t give you any search advantages. So, this problem was recognized pretty early on. And people thought about ways to make the tree structure self-managing and self-balancing so that as you insert data to it, you know, it always manages to check its own status and rebalance things as necessary and rebalancing in this case means actually changing the shape of the tree. For example, let’s say I, I construct a tree with three nodes, right. And I’m just inserting the values one, two, and three. And if I start naively by just saying, uh, insert node one, and then I insert node two, and since two is greater than one, I put it on the right child. And then I insert nodes. Three, three is greater than two. So, I put three as the right child of two. Okay. So, now I have a three, the three node tree, that’s a straight line. Right. And a rebalancing operation in this case would look at that tree and say, well, that’s no good. We’re going to rotate it left by one position. So, that node number two becomes the root then no number one becomes the left child node. Number three is still the right child of no two. And then we have a perfectly balanced tree.

Gavin Henry 00:15:00 Yeah. I’m thinking of might be totally irrelevant, but you know, when you go to a website and they’ve got one of those looks like a molecule structure, when your mouse hovers over and moves about kind of like one of those where you’ve started with the root one and then two, three. So, you’ve just drag number two and per the other top, then you’ve got one below it in three reforming the tree structure. Yeah, I got that.

Howard Chu 00:15:26 So, what we just described is a balancing operation and the idea is to develop an algorithm that manages a tree structure so that it balances itself automatically so that it can do the analysis that we just described and just automatically keep a tree balanced. And the bee tree is actually only one of several ways that have come up in computing to solve this problem, to solve the self-balancing problem. And two other ones that are very common. Again, they’re taught early on in computer science courses, a red, black tree, and then AVL tree. So,

Gavin Henry 00:16:11 Well, let me just chat by those two, where are we talking about an algorithm that rebalances the tree, some practical way of that might be there’s an insert value. And part of that function looks to see what is there already, and then make some decisions and gives you an okay. But behind the scenes is reject things. So, how would that

Howard Chu 00:16:35 Behind the scenes it’s rebalancing.

Gavin Henry 00:16:38 Yeah. So, you have an API that you’re using to insert a child or a value. And then part of the algorithm, the library that you’re using looks at everything behind the scenes and rearranges it, it gives you like a return value of one or, okay.

Howard Chu 00:16:56 Yeah. Generally it’s for example, in the case of a bee tree, it will attempt to do the insert first and then fix up anything that needs to be fixed up afterwards.

Gavin Henry 00:17:10 So, I’ll take the data as quick as possible. And then,

Howard Chu 00:17:13 I mean, you know, the effect is the same because you know, the caller will have to wait for whatever cleanup or rebalancing to finish before the API returns.

Gavin Henry 00:17:24 Okay. Just like a normal IPO. Okay. Um, since you mentioned those other two types of trees that you’re familiar with, if you can give us a bit of information about them, that’d be great.

Howard Chu 00:17:34 Well, they’re, they’re both improved versions of a binary tree, for example, in an AVL tree. First of all, we were talking about binary trees, where we inserted a node with a simple value right node with I knowed with the value too. And I know that the value three in an AVL tree, we add a little extra information to each node. We call it a balance factor. So, when you insert a value into a node or into the tree, goes into a node and the node starts off with a balance factor, say of zero that says it, that means it’s balanced. And if you add a left child to it, it might get a balanced factor of minus one that says, okay, there’s something on the left side. Or if you add a child on the right side, you might get a balance factor of plus one, meaning that it’s got a child on the right side. And the algorithms that maintains this tree will go up and down examine these balanced factors. And if the balance factor ever exceeds plus one or minus one, like if the balance factor it gets to two, or it gets to minus two, then it knows, oh, this tree is now too heavy on one side and it’ll know that it needs to rebalance it. That’s the basic idea behind an AVL tree.

Gavin Henry 00:19:03 And this ought to be value something. What does the AVL stand for?

Howard Chu 00:19:07 As I recall, they were the initials of the people who invented it. Red, black tree is, is actually named, they maintain balance factors as well, but they, they named them those colors right in black. That’s the same general idea.

Gavin Henry 00:19:28 Okay. Ray, how do I say this?

Howard Chu 00:19:35 Okay. So, so far we’ve been talking about a binary search tree where every node can have two children, right. And every tree is where any node can,

Gavin Henry 00:19:48 Sorry to interrupt. That’s just clicked why it’s a binary search tree because there’s only two values episode.

Howard Chu 00:20:01 So, you have binary search, binary search tree branch factor of two up to two children. And then M area tree is just the more generalized form where any node can have up to M children or whatever arbitrary number that you want to assign it to,

Gavin Henry 00:20:22 But they have to be equal than they have to all levels of the tree. You need to have up to that amount.

Howard Chu 00:20:29 That’s right. You know, this we’re going to start diverging soon from typical computer science teaching, as we go a little further into this, but for now, okay. We’re talking about B+trees and a fixed branching factor, your branch factor being the maximum number of children, a particular note.

Gavin Henry 00:20:52 Yeah. We should probably just stick to the tree. Okay. Um, so I think we touched upon logarithmic and beggar notation and our first show together, probably quite across quite a few other shows we’ve done on Software Engineering Radio. Just like to give a summary of that because that’s what this all brings us. Isn’t it? That’s how it’s

Howard Chu 00:21:15 Sure. Well, the big O notation basically you’ll see capital and parentheses and some factor inside. And that means, you know, Symantec rate means on the order of, right. So, if you say something is Vigo of log in, that means the complexity of that function is on the order of a logarithmic value, logarithm of the number of elements. So, if we look at our very first example of the linear search, you know, in big O notation, that’s big and where the complexity of searching that element, searching that structure is basically equal to the number of elements in the structure.

Gavin Henry 00:22:04 So, it’s a form of notation to let you know how fast or slow,

Howard Chu 00:22:09 Well, how complex something is, which usually means how fast or slow it is. Yeah. Uh, you know, the, the larger, the number, the slower it is. And so, you know, logarithm of an is always smaller than the value event. Um, and it increases much more slowly and itself increases, right? And so that’s, that’s a very desirable property in algorithm because it means the complexity grows very slowly. That means as data volumes increase, the performance decreases very slowly, you know, ideally you would want the performance to not decrease at all, but

Gavin Henry 00:22:52 This is normally the graph starts level and then carves up near the end or the other way around when you see,

Howard Chu 00:23:00 Uh, starts curves up slowly. Yeah.

Gavin Henry 00:23:04 Yeah. That’s right. So, B+tree as a balanced tree, that’s just the short way of saying it.

Howard Chu 00:23:13 Yeah. It’s it is one of the self-balancing binary trees, except it’s not, it’s not necessarily binary. Right. It’s in arbitrary factor.

Gavin Henry 00:23:25 Okay. And is there a proven statistic of how fast they are? You know, if you follow that structure?

Howard Chu 00:23:33 Well, it, you know, it is a logarithmic algorithm and the base of the logarithm just depends on the M the branch factor. If you have a binary, if you have a bee tree that has, you know, M equals two, then it will be log and face two. If you give it a branch factor of four, then there’ll be some base for,

Gavin Henry 00:24:03 And, um, how would you decide between a binary tree and, you know, instead of

Howard Chu 00:24:11 Some larger factor. So, that is really, that’s really a judgment call that needs to be tuned for the use case, because one thing you’re, you’re trading off a couple of things here, the more branches, the more, yeah. The more elements you can pack to a node, more branches you can put to a node, the fewer nodes you need. Okay. But, you know, the, the larger node becomes the more work it takes to maintain the pointers inside. So, like, if I have a node that has a hundred elements, that would be slower to manage than the node, that only has four elements.

Gavin Henry 00:25:01 Yeah. And if the, if you’re doing, I think, anyway, this is just how I’m understanding it. If you’ve got a binary tree and the no’s just got two children, the right leaf as the higher value on the left, but then if you go to four or seven or wherever, you’ve got to then order all those in the right way, doing you.

Howard Chu 00:25:22 Yeah. You have to keep all of them sorted and you have to, you actually have to execute a binary search within that node to find what you’re looking for.

Gavin Henry 00:25:32 And you’ve got to keep that on in your head. I was just thinking about it when you write the code.

Howard Chu 00:25:38 Yeah. Generally, you’re going to find examples in textbook examples of a bee tree that, you know, they’ll pick some fixed number and never worry about it again, it will be like, okay, we’re going to do eight nodes or 16, 16 weaves per node or whatever. And then just forget about it, set it and not worry again.

Gavin Henry 00:26:05 Is it something that you would come back and revisit at some point or,

Howard Chu 00:26:11 Well, you know, we’re talking about the general bee tree here when, when we get to B plus trees, you know, those, those considerations go out the window. So,

Gavin Henry 00:26:25 So, we’ve spoken about a bee tree. We know we’re going to move on to a B plus tree, but there’s a B minus tree. Can you tell us a bit about that?

Howard Chu 00:26:34 Um, B minus three is just another way of writing poetry. That’s

Gavin Henry 00:26:40 Maybe it’s a typo. Okay. We’ve covered logarithmic time. So, B+tree is well-suited for storage systems that read and write relatively large blocks of data such as desks. Do you agree with that?

Howard Chu 00:26:57 I think that’s more, uh, more suitable for B plus tree, a bee tree, the general form of a bee tree. It can be tailored for block storage. And the idea again is you make your nodes the same size as your disc locks.

Gavin Henry 00:27:20 Okay. Yeah. So, this is the actual link of getting back to our systems and why it’s used there. Yes. Okay. That makes sense. So, if, if the block size, which nowadays is 496 kilobytes, isn’t it. And that would be your biggest please note, what would that be?

Howard Chu 00:27:42 That would be the size of your note. Yeah. Yeah. In the bee tree, all of your nodes should be the same size.

Gavin Henry 00:27:51 Okay. So, let’s start the big one. So, just to recap, we’ve gone over a binary search tree. That’s a root node, and then children, each leaf can have two children left and all right. The rights generally, the higher value in the left is a lower then sort of founded upon that we’ve moved on to balance trees. So, there’s always a left and the right we’ve discussed bigger and logarithmic time. The current uses for B+trees. Now we’re going to move on to B plus trees. All right, let’s start again. What is the B plus tree data structure used for today?

Howard Chu 00:28:36 It finds a lot of use in all kinds of data management applications. For example, in many say SQL databases, relational databases will be plus tree is almost always used for indexing. And in some cases it’s also used for primary data tables.

Gavin Henry 00:29:00 And why is that?

Howard Chu 00:29:03 Well, it’s always used for indexing because you know, it is a structure, it’s an order to structure, right? So, it’s, uh, it’s fast to look up elements soon. And, um, you know, it lets you move sequentially through the data set. So, if you want to retrieve elements in the sorted order, in a particular range, it’s very, very well suited for that in, for the primary data sets. It’s not always used because you know, generally will be, plus tree has range limits on the sizes of items you can put into it. So, while you might have data with relatively small keys, say eight byte keys, if you’re attaching, say a 200 megabyte value to that key, you know, it may not be so easy to manage in a B plus tree. So, you might use a different data structure to store the actual data that goes with the keys.

Gavin Henry 00:30:10 Okay. What relation? So, it was say it was a simple key value story or trying to implement what relationship does the Beech tree they’re trying to have to the key and the volume as the note, a key, or is the key just a bit of data in that tree.

Howard Chu 00:30:28 Okay. So, if we back up and look at what we talked about in B trees, you know, we actually were using a much more simplified example where a node only contains a single value, like an integer, right. One, two or three. So, in, in that case, in the terms that you’re talking about, it would be a key only database. The only thing we were storing in there was a searchable value. That was a key in the case of the B-plus tree. You know, we have those keys, those numeric keys, but we can also associate with them an arbitrary blob of data and in a B-plus tree structure, you know, the only the leaves of the tree actually store the data values, all of the root interior, uh, nodes will always store only the keys.

Gavin Henry 00:31:30 Okay. So, the key coming in, for example, in a database search, part of a select, because you’ve got an index created where such and such equals something, then you know, that’s the key. So, then you can reliably protect how quickly that’s going to take. You know, it’s going to be fast. You can see if you find an index, which will then recover the value, which is the, the row, where are the information stored.

Howard Chu 00:31:58 Yeah. And again, if you’re using a, B plus three only as an index, then the value that you get back is just going to be a pointer to the data that lives somewhere else.

Gavin Henry 00:32:12 Yeah. I think I’m getting,

Howard Chu 00:32:16 And again, there’s the alternative where you actually use the B plus tree to store the primary data as well. You know, in that case, the value that you retrieve will be, you know, the complete data item that was associated to that key.

Gavin Henry 00:32:29 Yeah. Because if you’ve got a quick lookup and the index, you want still want a quick way to find the actual data you’re asking for as well. So, you would, you could potentially have to be plus trees, like you just said, okay, we can move on to question three. So, what extent is adoption driven by some of the hardware like SSTs or types of memory for this chase jock charges up to, they have to change it for where the blocks are stored or anything like that, or is that of transparent?

Howard Chu 00:33:06 You know, it’s a lot of the answer to that question depends on what the operating system is doing with the different memory structures or the, you know, the newer types of memories you’re talking about where there’s persistent memory or like Optane memory or some other things like that. Uh, you have the option of making that memory transparently usable, or you have the option of making it an explicit block device. So, it really depends on what the operating system has done. One of the things that we, we didn’t state explicitly here is that in the B plus tree, again, we’re talking about nodes or nodes that are the size of disk blocks, right. In the B plus tree terminology, we’re going to call those pages. So, one page of a, B plus tree equals one page in a file system. One page on disc.

Gavin Henry 00:34:11 Yeah. I think I’m thinking back to the LMDB show and we were talking about CPU caches and all sorts of things like that. Keeping the data that can be held in cash small and nice what you’re trying to do here. Isn’t it?

Howard Chu 00:34:28 Yeah. That’s why we only store keys in the interior pages and keep the, the actual data at the week. That means you can fit the maximum proportion of the key index in memory.

Gavin Henry 00:34:47 So, you’re not always having to go back to the file system to get the keys,

Howard Chu 00:34:51 Right? Yeah. So, when you’re doing key lookups, they can all be in memory and very fast. And then when you actually get to the bottom of the tree and find the value that you want, then you might actually have to go out to disk or some other slower storage, and that’s the idealized case, right. But obviously if you have a very small database, it could all be entirely in memory. And if you have a very large database, you could wind up with, you know, portions of the interior index also pushed outside of memory.

Gavin Henry 00:35:27 When we’re talking about a B plus treat data structure, obviously that’s the thing, but what does that thing look like on a computer?

Howard Chu 00:35:39 Well, there’s, you know, there’s certainly a lot of leeway in how you construct it bits and bytes level, but conceptually it’s not much different than the Beech tree, right? You have a page, every page can have some number of pointers in it to children. And in this case, as I mentioned before, we no longer talk about it as an Anne Mary tree it’s as many child pointers as will fit in a page. So, it’s not a fixed number. It’s just cRAM in as much as will fit before we have to overflow into a new page.

Gavin Henry 00:36:25 Okay. I’m still asking some really basic questions here. Sorry. Um, so is this day a structure constructed from something on a file system or in a tax file or something like that? My question is coming from thinking about the data dot MDB file LMDB crates in the first show we did together, what’s inside there, you know,

Howard Chu 00:36:52 Well, that is a, just a sequence of pages and each page, you know, has pointers to other pages.

Howard Chu 00:37:05 So, in the textbooks, they will give you an example of a B plus tree where, you know, all of the dimensions are fixed Constance. It’ll say you’re going to create a B plus tree where every key is eight bytes long, and there are eight keys per page. That’s it you’re done in the real world. You know, keys can be various sizes and because the sizes vary, that means the number of items that you can fit per page also varies. Okay. So, this is where like a database engine, like LM DB departs from what the textbook tells you, because the textbooks give you a very simplified version where, you know, there’s, there are no variables and the reality you need to accommodate keys and records of varying sizes

Gavin Henry 00:38:06 And you revisit and provide a simple description of a B plus.

Howard Chu 00:38:15 Yeah, we’ll start again. W we have a root page which can have, for example, a list of up to eight pointers to children, right. And the children will be pages as well. So, all of the pages are always the same size. So, a branch page, again, points to more children. And then at the bottom of the tree, you have the leaf page, which points to actual data. And within each page of the root or the branch pages, you have the only thing that you store in those pages are keys. He’s an appointed to the next level down, or the next thing does that make sense?

Gavin Henry 00:39:05 I was just following it down. My, my brain there. And B plus tree is a replacement for bee trees or are they just different options?

Howard Chu 00:39:14 Yeah, I would say it’s a replacement. You know, you find B plus trees being used everywhere and B+trees not so much.

Gavin Henry 00:39:24 Okay. Well, we don’t really need to discuss the trade-offs between them. If everyone just uses,

Howard Chu 00:39:30 We touched on it before, you know, the main, the main reason is that by keeping all the data values at the bottom leaf pages, you get more keys in memory.

Gavin Henry 00:39:42 Should it be the smaller structures we’ve touched on? Um, RDBMS has asked you all day basis that use them for primary keys and actually looking up the data that the keys point to as well. Can you think of any, like a products, examples that use them or is it mainly just indexes?

Howard Chu 00:40:05 Yeah. Well, um, you know, for example, sequel light, uh, you know, uses B plus trees for its indexes and for some of its tables as well, obviously open Nelda app uses L and D. So, that’s using B plus trees for indexes and for, you know, main data, you know, they’re popular databases, Postgres, uh, based on B plus trees, Maria DB, or my sequel, they’re actually using something called I Sam completely different structure.

Gavin Henry 00:40:42 Yeah. I’ve seen that a few times. Okay. Are there any places that you think could benefit from PE plus, please, where you don’t see them mainly in the database world?

Howard Chu 00:40:53 I’m going to guess that, you know, the database world has been around this for long enough. They know pretty well where they can leverage it. I would say in databases, they’re all already using it where it needs to be.

Gavin Henry 00:41:07 And, um, have we touched upon the term height when we talk about leaf nodes?

Howard Chu 00:41:14 Well, height has just, you know, a description of, you know, how many levels of pages are in the tree.

Gavin Henry 00:41:23 Yeah. So, how big the house of cards could be? How many

Howard Chu 00:41:26 Exactly. Yeah. So, if you, if you have a tree that’s just a single root page, then you’ve got a height of one there’s, there’s nothing going on. You know, if that root has just a bunch of leaf pages immediately under it, then it’s got a height of two. If it’s got a root page and then some branch pages immediately under that, some leaf pages got a height of three.

Gavin Henry 00:41:52 Okay. And, um, do you have any tips on how to keep, where you’re working on in your head while you’re dealing with these best to visualize things, write some sketches down, or do you just get used to them when you’re doing it?

Howard Chu 00:42:11 I suppose, sketching on paper when you need to, but you know,

Gavin Henry 00:42:16 Using a library

Howard Chu 00:42:18 Doesn’t matter, you know, when you’re using the library, all you care about is here, insert this key and here find this key and giving this value. So, you know, you don’t really care what, what the structure looks like underneath.

Gavin Henry 00:42:31 And, um, you did your own B plus tree algorithm than you. You obviously know the front end and the back end to that thing. Were there any lessons learned or tips or warnings to never do it yourself? It’s almost thinking about your takeaways from actually doing the B plus two algorithm yourself.

Howard Chu 00:42:58 First of all, as I mentioned before, what you learned in the textbooks is only a very simplified version of what you’re going to need in the real world. So, if you want to write one of these yourself, you know, be prepared for the textbooks to not tell you everything you need to know, and then you really need to have an extremely good reason to write this yourself because they’re at, you know, at this point there are many libraries available and probably should just use one that exists off the shell.

Gavin Henry 00:43:35 Yeah. I think it’s the profile and benchmark on looking at stuff as, and when there’s an issue with something we’ll trust that that person or that company or team knows what they’re doing.

Howard Chu 00:43:50 Right. And what, you know, 10 years ago, when I started writing L and D you know, the things that we wanted it to have obviously did not exist off the shelf. So, that prompted me to write it. But I’d say it’s, it’s hard today to find a use case that’s not well covered by an existing library.

Gavin Henry 00:44:12 Yeah. I think people that are doing new things tend to reinvent things and forgot that there’s probably something already there that was designed 20 years ago. They just haven’t done enough.

Howard Chu 00:44:24 And there’s, there’s so much tendency to just reinvent

Gavin Henry 00:44:30 With yourself for trusting someone else has done it better than you. You’ve got a very strong reputation of, you know, speed and performance and getting the best out of things. Have you had to have words with yourself to not go down that root?

Howard Chu 00:44:50 It’s always the last resort. Yeah. Always, always. I search for something that exists that, you know, that would fit my needs before sitting down to write my own tool.

Gavin Henry 00:45:02 Yeah. So, I think that was good advice for any of the listeners. So, I think we’ve touched on a couple of my questions already. I was going to ask why B plus trees are used for indexing and table indexes. And that’s because I’ll try and recap. I’ve got predictable times for searches and finding the things on there. Very efficient, key value operations. So,

Howard Chu 00:45:28 Yeah, they’re, they’re actually optimal for searches. You’ll always get the lower, lower bound search time.

Gavin Henry 00:45:38 And, um, okay. So, what the B plus tree, is it complicated to get the data into it in the first place? How would you put a ton of data into a B plus tree structure?

Howard Chu 00:45:53 Is it complicated?

Gavin Henry 00:45:55 We’re doing a whole show on it. So, yeah, it’s complicated.

Howard Chu 00:46:02 It’s certainly, you know, it’s not as easy as the plain binary tree. That’s not as easy as some of the other self-balancing trees, but one of the advantages, again, that it has since it uses these relatively large pages is as you insert records into it, it has a fairly, fairly low rate of Neo allocations, right? Every time you allocate a new page, it’s good for, you know, 20, 30, 50 elements before that fills up and you’re forced to allocate another page. So, it’s got pretty good efficiency and in that respect as far as resource constraints.

Gavin Henry 00:46:45 Okay. That makes sense. I’ve got question from one of the other hosts here. Um, how do we handle concurrency control database basis that you use?

Howard Chu 00:46:57 There’s a lot of different ways to handle that.

Gavin Henry 00:47:01 So, that would be somebody to processes trying to insert data using the B+tree library. And I haven’t quite finished putting that data in the right place. So, an easy way to understand that.

Howard Chu 00:47:16 So, there’s, this is one of those topics that, you know, like I said, it has multiple different approaches. Okay. There’s the very simple approach which actually LMDB uses, which we lock the entire tree for one writer and we don’t let go of it until that write has been completely finished. So, in that sense, we actually have, uh, no concurrency for writes. There’s always only one writer at a time. And as I said, this is very simple. And,

Gavin Henry 00:47:52 And there’s a decision that you thought about changing at any point?

Howard Chu 00:47:57 No, no, no. That’s that, that’s a decision that was reached after quite a bit of experience with other concurrency schemes.

Gavin Henry 00:48:06 And did you have to make a trade off for that, or

Howard Chu 00:48:11 In some ways, you know, as, as you get to much larger databases, you might see that your overall right throughput could be slower than optimal. It really tends to depend on the right patterns. Another approach that, uh, for example, Berkeley DB uses is to use a lock coupling that is, as you navigate down the tree from the root to the leaf, where you need to insert data, you take locks on every page that you touch on the way down.

Gavin Henry 00:48:51 So, then when the next right comes through him might be trying, put it somewhere else.

Howard Chu 00:48:56 So, it it’s the next right coming through is going to a different part of the tree. Then it’s possible for the two of them to proceed without interfering with each other.

Gavin Henry 00:49:07 And when you traverse the tree to find what the algorithm is trying to figure out where to put the data in the tree, does it always need to kind of go straight to that point or does it have to do a similar start each time?

Howard Chu 00:49:21 Generally you always search from the roots for every new operation.

Gavin Henry 00:49:27 Okay. So, there’s no quick way to say this is the last place I put something, this new one is best placed here.

Howard Chu 00:49:36 Uh, well actually in LMDB, we do have an optimization that does that putting in a couple of elements that are, you know, that sort close to each other, or sort of next to each other LMDB will check that before it goes back to the root. So, you have a cursor that points to where you’re inserting data. When you insert a new record, it’ll look at what the cursor is currently pointing at and see if it’s, uh, see if it spans the space that the new record belongs to. And if it does, then it’ll just stay on current page. And if it doesn’t, then it’ll fall back to what it needs to do. It’ll go back to the group and search itís way down.

Gavin Henry 00:50:23 Do you fall back to searching the whole root philosophy leaf, if that cursor breaks or something happens to it? Or do you just trust the fact that at points, the last place something was post

Howard Chu 00:50:37 Well, since there is no other writers interfering with it, then yeah. The curse will always points to

Gavin Henry 00:50:45 Reason to have one writer.

Howard Chu 00:50:46 Exactly. Yeah. As I said, it’s simplifies so many things that, you know, in my opinion, it wasn’t worth the headaches of maintaining all the other state needed to do multi page locking.

Gavin Henry 00:51:02 Yeah. It’s like having a plus one backup for the backup, for the backup, the backup, you know,

Howard Chu 00:51:08 Yeah, how deep you go. That’s one of the problems with the approach in Berkeley DB is that you can get into locking conflicts very easily, you know, lock conflicts are not and exceptional, unusual condition. They’re a routine occurrence in Berkeley DB when you have multiple writers. So, every time you hit a lock conflict, one of the writers has to give up release solids locks and retry later.

Gavin Henry 00:51:45 And then I try to call sometimes if there was a crash there’ve been blocks left.

Howard Chu 00:51:51 Yeah. That’s, that’s another, some other aspect of that’s that’s to do with, um, you know, writing in place versus LMDB is comprehend, right. But that’s not quite related to the locking issues. So, for concurrency management, you know, that’s, that’s one approach and that’s the most common approach that B trees use is the multi-page locks that we use. There’s another method, which, you know, was relatively recent. You know, I was reading about it back in 2009, called a B link tree where leaf nodes contain link pointers to their siblings. And by maintaining these link pointers, it actually allows you to do adds and deletes with high concurrency without as much lock overhead. So, this was a really interesting mechanism when I was reading about it. Unfortunately at the time it was very new research and there were no working implementations that I could find anywhere on the web. And, you know, the, the research paper that I read didn’t provide me enough information that I could implement it myself. So, I kind of set it aside, but it was, it was a very promising paper at the time, I think in the years between then and now they’re having at least two implementations of it developed, of course, while it solves some of the locking problems that we were just talking about, it doesn’t solve the crash problem that you just mentioned where there’s something crashes in the middle, then pages will be lost.

Gavin Henry 00:53:37 Yeah. And w what was the last place write to and who was doing, what do you think there’s, you just touched upon improvements that could be made for these types of data structures. Do you think in 2021, we’ve reached the protectable times for certain things and the software algorithms, these data structures, and we’re now relying on hardware, or do you think there’ll still be some major breakthrough? Because I know you stay on top of all these stuff.

Howard Chu 00:54:09 You know, I, I see a lot of research into new algorithms that are supposed to be tuned specifically for persistent random or SSDs and thus and such. I find that, you know, the research I’ve read so far is not making any massive breakthroughs. And I don’t know from that trend, I’m not expecting anything huge in the future. There’s a lot of small scale refinements is really what it looks like. That’s what it amounts to.

Gavin Henry 00:54:47 And what do you think the future is for this type of data access?

Howard Chu 00:54:53 Uh, well, it’s, it’s going to be more of the same, you know, today we have a multi-tiered memory hierarchy, right? Where we talk about CPU caches, and then RAM and then storage, which is disks or SSDs or whatever in the future, we’re going to have a couple of different tiers of RAM. We’ll have static RAM dynamic RAM, persistent RAM. I would like to see further in the future, you know, a hundred percent persistent RAM and no more plain dynamic RAM, but you know, we’re not going to get there yet. We’re going to have this intermediate step first, where we have dynamic RAM co-existing with persistent RAM. And so that will be another layer of the memory hierarchy. And, uh, it’ll just, it’s just one more level of a pyramid,

Gavin Henry 00:55:48 But that won’t change the page sizes that we’re dealing with.

Howard Chu 00:55:54 It doesn’t have to,

Gavin Henry 00:55:57 The same algorithms could still work, but because the access speeds are faster, it should still be pretty predictably, the same speed or?

Howard Chu 00:56:07 Yeah. And generic B plus tree algorithm will still provide optimal search performance as you layer these different memory technologies above, but try so that, that really doesn’t need to change. Uh, page sizes will increase. As you mentioned, four kilobyte pages are already a bit small for today’s CPU, disk subsystems 16 kilobytes is more reasonable, 30 to 64 Kane probably be pushing it. Yeah. Four K is bit too small.

Gavin Henry 00:56:45 And is that some size you have to Peck and then the B plus tree structure, or

Howard Chu 00:56:51 From the software side, you have to use what the hardware supports. So, some CPU’s today, for example, apples, and one processor are now using 16 kilobyte pages. So, that’s the page size of your database reviews as well.

Gavin Henry 00:57:09 So, that, that means you can have a shallower tree. Yes, you’re storing more value back or values or partners to the

Howard Chu 00:57:19 Volume page,

Gavin Henry 00:57:22 Which again, speeds up getting things because it’s not so deep.

Gavin Henry 00:58:42 I’d like to start wrapping up. I think we’ve done a pretty good job on that. Let’s just round off with the description of a B+tree again, cause we’ve digressed a little bit. So, correct me on this — I’ve nabbed it off the internet — just confirm for me: in the B plus tree copies of the keys are stored in the internal notes. The keys and records are stored and leaves. In addition, the leaf node may include a pointer to the next leaf node to speed sequential hacks. So, are we calling the notes pages there?

Howard Chu 00:59:15 In that case there, Yes. The nodes are pages. Yes.

Gavin Henry 00:59:18 Why did it make sense to use the word page when everything else is just nodes in the previous data structures?

Howard Chu 00:59:25 The word page is associated with virtual memory management and disk management.

Gavin Henry 00:59:32 And that’s what we’re trying to improve here. That’s relaying the data on top of the disks.

Howard Chu Yeah.

Gavin Henry Excellent. Okay. So, obviously selection of the correct data structures to use is extremely important to us for software engineers and companies. But if there was one thing from this journey of an episode, you would want our listeners to remember what would you like that to be?

Howard Chu 00:59:59 I would say most of the time for your data management needs, you’re just going to need a B+ tree and you can ignore anything else.

Gavin Henry 01:00:10 Perfect. And don’t try and do it yourself. There should be a library for it. Yeah. Was there anything we missed that you’d like to mention? We did cover quite a lot.

Howard Chu 01:00:21 We covered a lot of ground here, so I think we’re good.

Gavin Henry 01:00:25 Excellent. So, where can people find out more about your B+tree work or what you’re working on? Obviously we’ve got your Twitter account that we’ll link to, but I’m sure if people are interested, they can look into the source code of LMDB and get their head around how you’ve done B+trees.

Howard Chu 01:00:46 Sure. Yeah. And also the LMDB.tech website has the formatted documentation online. So, while you’re browsing through the source code.

Gavin Henry 01:01:02 Thank you. I’ll put that in the show notes. So, this show personally has really helped me understand B trees. Hopefully the listeners will come off a little bit better than when they started listening.

Howard Chu 01:01:13 They’re less confused,

Gavin Henry 01:01:15 Howard, thank you for coming on the show. It’s been a real pleasure. This is Gavin Henry for Software Engineering Radio. Thank you for listening.

[End of Audio]

SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)

Join the discussion

You must be logged in to post a comment.

1 comment

Kristjan S says:

November 17, 2021 at 6:31 pm

Great podcast! Just a little note about MySQL/MariaDB at around 40:30. It was mentioned that they use ISAM, and not B+ tree structure, I guess referring to the MyISAM storage engine. MyISAM was the default storage engine for MySQL until December 2009. The new default is InnoDB, which does use B+ trees for indices.

SE Radio 485: Howard Chu on B+tree Data Structure in Depth

Show Notes

Transcript

Join the discussion

1 comment

More from this show

SE Radio 718: Will Sentance on JS Modernization

SE Radio 717: Eric Tschetter on Decoupling Observability

SE Radio 716: Martin Kleppmann Local-First Software

Menu

Recent posts

Search

Search

SE Radio 485: Howard Chu on B+tree Data Structure in Depth

Show Notes

Transcript

Join the discussion

1 comment

More from this show

SE Radio 718: Will Sentance on JS Modernization

SE Radio 717: Eric Tschetter on Decoupling Observability

SE Radio 716: Martin Kleppmann Local-First Software

Menu

Recent posts