Understanding B+trees (by looking at the implementation)
If you’ve ever wondered how databases handle millions of rows efficiently, the answer usually given is B+trees. If you can’t visualize a B+tree off the top of your head, this article is for you.
The language of choice is awk. It’s a simple scripting language, which lets us focus on the algorithm itself rather than on implementation details and quirks. Some familiarity with awk will make it easier to follow along, but isn’t strictly necessary.
The full implementation can be found here.
Outline of a B+tree
A B+tree is a self-balancing tree data structure that maintains sorted data and allows searching and manipulation of data in logarithmic time (in less technical terms: fast). In B+trees, the data itself is stored inside the leaf nodes and the internal nodes act as pointers to the leaf nodes. (In this implementation I’ll skip the data, because it has no bearing on the idea and structure of a B+tree.)
Nodes
In a B+tree, all the data is reached through the leaf nodes, while the internal nodes are used purely for routing. All the leaf nodes are at the same level.
-> Internal Nodes: These contain only keys to guide the search, not actual data values. The keys are used to locate elements inside the B+tree, and each internal node holds pointers to its child nodes.
-> Leaf Nodes: These contain keys and actual data values. They are linked together to form a linked list, which allows for efficient range queries and ordered traversal.
The number of children a node can have is dependent on the order of the B+tree.
Order
The order of a B+tree defines the maximum number of children that an internal node can have. The exact value is implementation-dependent. The root node follows special rules: it may have fewer children than the order suggests, but as soon as it is an internal node it must have at least two children and contain at least one key.
So picture it something like this:
[17 | 35] <- root/internal node
/ | \
[5 | 10] [20 | 25 | 30] [40 | 45 | 50] <- leaf nodes
Here:
-> The root [17 | 35] is an internal node; it contains only keys and pointers to child nodes.
-> Each bracketed group below it is a leaf node, containing keys (and their data).
-> The leaf nodes are linked together like [5|10] <-> [20|25|30] <-> [40|45|50].
Representation of a B+Tree in awk
The first necessary building block is to create nodes:
function new_node(leaf, id) {
    id = ++node_count;
    node[id]["leaf"] = leaf;
    node[id]["n"] = 0;
    node[id]["next"] = 0;
    return id;
}
Awk’s associative arrays can be used to store multiple facts about a single “thing”, somewhat like structs in C (though different in important ways). Each node is just a small collection of key/value pairs indexed by a numeric identifier:
-> id = ++node_count;
Every node in our B+tree needs a unique identifier. We keep a global counter node_count that increments each time we create a new node. This gives us a simple way to refer to nodes by number (e.g., node 1, node 2, etc.). It’s worth noting that in awk almost everything lives in the global scope: the only local variables are function parameters. That is why id appears in new_node’s parameter list even though callers never pass it; the extra parameter gives us a local variable. Everything else, like the node table and node_count, stays global, which keeps things simple.
-> node[id]["leaf"] = leaf;
Each node stores a flag indicating whether it’s a leaf node (leaf == 1) or an internal node (leaf == 0).
This lets later search/insert functions know whether the node holds real data or just keys/pointers.
-> node[id]["n"] = 0;
This tracks how many keys are currently stored in the node. Initially, every new node starts empty.
-> node[id]["next"] = 0;
For leaf nodes, this will eventually link to the next leaf (if there is a next leaf). We will use this for creating the linked list that makes B+trees efficient for range queries. For internal nodes, this value isn’t used.
-> return id;
Finally, we return the ID of the new node so the caller can store or reference it.
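To make the representation concrete, the function can be exercised on its own. One caveat: the article’s code relies on gawk-style arrays of arrays (node[id]["leaf"]); the quick sketch below flattens those into POSIX-compatible node[id, "leaf"] subscripts so it runs under any awk, but the bookkeeping is identical:

```shell
# Minimal sketch of new_node, with gawk's nested arrays flattened to
# POSIX awk's node[id, "leaf"] subscripts so it runs under any awk.
awk '
function new_node(leaf, id) {
    id = ++node_count
    node[id, "leaf"] = leaf
    node[id, "n"] = 0
    node[id, "next"] = 0
    return id
}
BEGIN {
    a = new_node(1)    # a leaf node
    b = new_node(0)    # an internal node
    printf "a=%d leaf=%d n=%d\n", a, node[a, "leaf"], node[a, "n"]
    printf "b=%d leaf=%d\n", b, node[b, "leaf"]
}'
```

Running it prints a=1 leaf=1 n=0 and b=2 leaf=0: two nodes, numbered sequentially, each starting empty.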
Inserting keys
Now that the basic representation of a node is understood, the next step is to find an algorithm for inserting keys into the tree. Insertion in a B+tree is recursive and split-driven: when a node is full, it needs to be split, and a key needs to be promoted upwards in order to keep the structure balanced.
Let’s look at the basic premise:
function insert(r, key, s) {
    if (node[r]["n"] == 2 * order - 1) {
        s = new_node(0);
        node[s]["children"][1] = r;
        split_child(s, 1, r);
        insert_nonfull(s, key);
        root = s;
    } else {
        insert_nonfull(r, key);
    }
}
-> if (node[r]["n"] == 2 * order - 1): check whether the root node (r) is full. Each node can hold at most 2*order - 1 keys before needing to be split.
-> s = new_node(0): if the root is full, create a new root. It will be an internal node, and the old root becomes its first child.
-> split_child(s, 1, r): split the old root into two halves. One key moves up into the new root, and the two halves become its children. At this point, the tree grows in height by one level; this is the only situation in which a B+tree increases its height.
-> insert_nonfull(s, key): the new root is guaranteed to be non-full (it holds just one key after the split), so the key can be inserted through it.
-> root = s: update the global root reference to point to this new node.
-> else { insert_nonfull(r, key) }: if the root is not full, simply insert the key into the tree.
Inserting keys into non-full nodes
The function for doing so might look a little complicated, but it follows a simple idea: never descend into a full node
Before moving deeper into the tree, the child that is about to be visited must have room for at least one more key. If it doesn’t, it is split on the way down, so by the time a leaf is reached, insertion is guaranteed to succeed without any backtracking.
The idea is to keep the tree balanced and prevent complicated backtracking logic after insertion.
function insert_nonfull(x, k, i, c) {
    i = node[x]["n"]
    if (node[x]["leaf"]) {
        while (i >= 1 && k < node[x]["keys"][i]) {
            node[x]["keys"][i+1] = node[x]["keys"][i];
            i--;
        }
        node[x]["keys"][i+1] = k;
        node[x]["n"]++;
    } else {
        while (i >= 1 && k < node[x]["keys"][i]) i--
        i++;
        c = node[x]["children"][i];
        if (node[c]["n"] == 2 * order - 1) {
            split_child(x, i, c);
            if (k >= node[x]["keys"][i]) {
                i++;
            }
        }
        insert_nonfull(node[x]["children"][i], k);
    }
}
-> if (node[x]["leaf"]): if the current node is a leaf, shift the existing keys to make space and place the new key in sorted order.
-> while (i >= 1 && k < node[x]["keys"][i]) i--: otherwise, find which child should receive the new key by scanning the node’s key list from right to left.
-> if (node[c]["n"] == 2 * order - 1): check for overflow; if that child is full, split it using split_child() before descending into it.
Here the focus is on the structure of the B+tree, so there are only keys, no values. Adding values would simply mean adding another array element at node[x], such as node[x]["values"].
It would take just a couple of minor changes, along the lines of shifting/inserting one more value:
if (node[x]["leaf"]) {
    while (i >= 1 && k < node[x]["keys"][i]) {
        node[x]["keys"][i+1] = node[x]["keys"][i];
        node[x]["values"][i+1] = node[x]["values"][i];
        i--;
    }
    node[x]["keys"][i+1] = k;
    node[x]["values"][i+1] = v;
    node[x]["n"]++;
}
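To see the shift-and-insert step in isolation, here is a hypothetical standalone sketch: a single leaf holding the pairs 10=a, 20=b, 30=c (plain flat arrays instead of the node table), into which the pair 25=x is inserted:

```shell
# Standalone sketch of the sorted shift-and-insert from insert_nonfull,
# using flat arrays instead of the node table.
awk 'BEGIN {
    n = 3
    split("10 20 30", keys); split("a b c", vals)   # leaf: 10=a 20=b 30=c
    k = 25; v = "x"                                 # new key/value pair
    i = n
    while (i >= 1 && k < keys[i]) {                 # shift larger keys right
        keys[i+1] = keys[i]; vals[i+1] = vals[i]
        i--
    }
    keys[i+1] = k; vals[i+1] = v; n++               # drop the pair into the gap
    for (i = 1; i <= n; i++) printf "%s=%s ", keys[i], vals[i]
    print ""
}'
```

The loop shifts 30=c one slot to the right and drops 25=x into the gap, printing 10=a 20=b 25=x 30=c.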
Splitting a child node
This is the most important (and most complicated) function in the entire implementation, and it is what keeps the B+tree balanced.
Whenever a node becomes full by reaching its maximum number of keys, it is split into two halves and a key is promoted up to its parent. This ensures no node ever exceeds its capacity, and all leaves stay at the same depth.
function split_child(p, i, y, z, t, j) {
    t = order;
    z = new_node(node[y]["leaf"]);
    if (node[y]["leaf"]) {
        for (j = 1; j <= t - 1; j++) {
            node[z]["keys"][j] = node[y]["keys"][j + t];
        }
        node[z]["n"] = t - 1;
        node[z]["next"] = node[y]["next"];
        node[y]["next"] = z;
        node[y]["n"] = t;
        promote_key = node[z]["keys"][1];
    } else {
        for (j = 1; j <= t - 1; j++) {
            node[z]["keys"][j] = node[y]["keys"][j + t];
        }
        for (j = 1; j <= t; j++) {
            node[z]["children"][j] = node[y]["children"][j + t];
        }
        node[z]["n"] = t - 1;
        node[y]["n"] = t - 1;
        promote_key = node[y]["keys"][t];
    }
    for (j = node[p]["n"]; j >= i; j--) {
        node[p]["keys"][j+1] = node[p]["keys"][j];
    }
    node[p]["keys"][i] = promote_key;
    for (j = node[p]["n"] + 1; j > i; j--) {
        node[p]["children"][j+1] = node[p]["children"][j];
    }
    node[p]["children"][i+1] = z;
    node[p]["n"]++;
}
-> z = new_node(node[y]["leaf"]): create a new sibling node of the same type.
-> if (node[y]["leaf"]) {: the leaf case.
-> for (j = 1; j <= t - 1; j++) node[z]["keys"][j] = node[y]["keys"][j + t]: copy the upper half of the keys into the new node.
-> node[z]["n"] = t - 1: the new right leaf holds t - 1 keys.
-> node[z]["next"] = node[y]["next"]; node[y]["next"] = z: preserve the linked-list structure: the new node becomes the next leaf, keeping range scans efficient.
-> node[y]["n"] = t: the left leaf keeps t keys.
-> promote_key = node[z]["keys"][1]: promote the first key of the right leaf to the parent; the key also stays in the right leaf.
-> else {: the internal-node case.
-> for (j = 1; j <= t - 1; j++) node[z]["keys"][j] = node[y]["keys"][j + t]: copy the upper half of the keys into the new node.
-> for (j = 1; j <= t; j++) node[z]["children"][j] = node[y]["children"][j + t]: copy the upper half of the children.
-> node[z]["n"] = t - 1; node[y]["n"] = t - 1: update the key counts of both halves.
-> promote_key = node[y]["keys"][t]: promote the middle key of the left node; unlike in the leaf case, it is sent upwards and removed from the node.
-> for (j = node[p]["n"]; j >= i; j--) node[p]["keys"][j+1] = node[p]["keys"][j]: shift keys in the parent node to make room for the promoted key.
-> node[p]["keys"][i] = promote_key: insert promote_key into parent p at slot i.
-> node[p]["children"][i+1] = z: make the parent aware of its new child node.
This can be visualized the following way:
Before split (t = 3):
[ 1 | 2 | 3 | 4 | 5 ] (leaf)
After split:
promote(4)
↑
[ 1 2 3 ] ---- [ 4 5 ]
Basically, the important notions here are that
-> all leaves appear at the same depth, and the number of keys in each leaf is constrained by the order (t): at most 2t - 1
-> the number of children of an internal node (except the root) is likewise constrained by the order: at most 2t. A higher t means a flatter tree with more keys per node, but more time spent scanning within each node
-> internal nodes store separator keys to help locate leaf nodes, but no data; this keeps searches efficient
-> leaf nodes store all data keys, sorted, and linked left to right
-> splits happen only on the way down, so creating new nodes by splitting doesn’t need any additional traversal logic
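The index arithmetic of a leaf split can be replayed in isolation. This sketch mirrors the figure above for t = 3, starting from a full leaf of 2t - 1 = 5 keys:

```shell
# Replaying the leaf-split arithmetic from split_child for t = 3.
awk 'BEGIN {
    t = 3
    split("1 2 3 4 5", y)            # full leaf: 2t-1 = 5 keys
    for (j = 1; j <= t - 1; j++)     # upper half moves to the new node z
        z[j] = y[j + t]
    yn = t; zn = t - 1               # left keeps t keys, right gets t-1
    promote = z[1]                   # first key of the right leaf goes up
    out = "left:"
    for (j = 1; j <= yn; j++) out = out " " y[j]
    out = out " | right:"
    for (j = 1; j <= zn; j++) out = out " " z[j]
    print out " | promote: " promote
}'
```

It prints left: 1 2 3 | right: 4 5 | promote: 4, matching the diagram: the left leaf keeps t keys, the right leaf gets t - 1, and the right leaf’s first key is promoted.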
Searching
Compared to insertion, searching in a B+tree is refreshingly simple. Begin at the root and recursively follow the correct child pointers until the correct leaf is found. Once there, scan the keys in order, and if a match is found, the search is complete.
function search(x, k, i, child) {
    if (node[x]["leaf"]) {
        for (i = 1; i <= node[x]["n"]; i++)
            if (node[x]["keys"][i] == k) {
                printf("Found %d in leaf node %d\n", k, x)
                return 1
            }
        printf("%d not found (reached leaf %d)\n", k, x)
        return 0
    }
    i = 1
    while (i <= node[x]["n"] && k >= node[x]["keys"][i]) i++
    child = node[x]["children"][i]
    return search(child, k)
}
-> if (node[x]["leaf"]): in a leaf, scan through its keys and check whether the given key (k) matches one of them.
-> otherwise, in an internal node, find which child branch to follow and search again recursively.
This mirrors how the tree is structured: internal nodes act as guides that steer the search toward the leaf holding the relevant key.
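The internal-node step can be demonstrated on its own. The sketch below runs the same child-selection loop as search() against the separator keys [17 | 35] from the earlier figure, probing one key per leaf:

```shell
# The child-selection loop from search(), run against the root [17 | 35]
# of the earlier figure.
awk 'BEGIN {
    n = 2; split("17 35", keys)          # separator keys of the root node
    split("5 20 40", probes)             # one probe key per leaf
    for (m = 1; m <= 3; m++) {
        k = probes[m]
        i = 1
        while (i <= n && k >= keys[i]) i++   # same loop as in search()
        printf "key %d -> child %d\n", k, i
    }
}'
```

Each probe lands in the expected child: 5 -> child 1, 20 -> child 2, 40 -> child 3.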
Range search
Unlike a plain B-tree, where data might be scattered across internal and leaf nodes, a B+tree keeps all the actual keys in the leaves, and those leaves are linked together. This means a whole range of values can be scanned efficiently without repeatedly climbing up and down the tree.
function search_range(r, a, b, x, i, k) {
    x = r;
    while (!node[x]["leaf"]) {
        i = 1;
        while (i <= node[x]["n"] && a >= node[x]["keys"][i]) {
            i++;
        }
        x = node[x]["children"][i];
    }
    while (x) {
        for (i = 1; i <= node[x]["n"]; i++) {
            k = node[x]["keys"][i];
            if (k >= a && k <= b) {
                printf "%d ", k;
            }
            if (k > b) {
                print "";
                return;
            }
        }
        x = node[x]["next"];
    }
    print "";
}
-> while (!node[x]["leaf"]): walk down the tree to the leaf that would contain the lower bound of the range. Basically the same logic as a single-key search, except that it stops as soon as it reaches a leaf.
-> while (x): in this second loop, the function walks toward the upper bound of the range, printing everything between and including the two bounds. The moment a key greater than the upper bound is hit, the search stops. There is no need to climb back up at any point.
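The second loop can be simulated without building the tree on top. The sketch below hand-builds the three linked leaves from the first figure (with gawk’s nested arrays again flattened to POSIX node[id, field] subscripts, and the descent to the starting leaf skipped) and scans the range [10..40]:

```shell
# The linked-leaf scan from search_range(), run over hand-built leaves
# [5|10] -> [20|25|30] -> [40|45|50].
awk '
function leaf(id, ks, m, j, a) {
    m = split(ks, a)
    node[id, "n"] = m
    for (j = 1; j <= m; j++) node[id, "keys", j] = a[j]
    node[id, "next"] = 0
}
BEGIN {
    leaf(1, "5 10"); leaf(2, "20 25 30"); leaf(3, "40 45 50")
    node[1, "next"] = 2; node[2, "next"] = 3   # the leaf linked list
    a = 10; b = 40; x = 1                      # range [10..40], start at leaf 1
    while (x) {
        for (i = 1; i <= node[x, "n"]; i++) {
            k = node[x, "keys", i]
            if (k >= a && k <= b) printf "%d ", k
            if (k > b) { print ""; exit }
        }
        x = node[x, "next"]
    }
    print ""
}'
```

It prints 10 20 25 30 40: the scan crosses two leaf boundaries purely by following next pointers, never revisiting the internal nodes.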
Summary on searching
Both of these search functions (search() and search_range()) map directly to how databases use B+trees internally.
With a query like SELECT * FROM users WHERE id = 20 the database performs an index lookup, something that works similarly to the search() function implemented here. It starts at the root of the B+tree, walks through internal nodes using the key values, and stops when it reaches the leaf that holds the row pointer for id = 20. If the index is on a primary key, that pointer may even be the data itself in the case of a clustered index.
The principle portrayed in search_range() is also how range queries work with indexes in the InnoDB storage engine, and it is why range queries like SELECT * FROM users WHERE age BETWEEN 25 AND 35 are so efficient. Instead of repeatedly climbing up and down the tree for each value, the engine navigates down to the starting leaf once, then scans forward through the linked leaves, doing one linear pass over k results rather than k separate tree traversals.
This is the big selling point of a B+tree over a regular B-tree. In a B-tree, data can live in both internal nodes and leaves, so the same range query might need to bounce up and down the tree repeatedly. The B+tree’s linked leaves eliminate this backtracking entirely. This type of operation shows up as a range access (type: range) in EXPLAIN output.
Visualizing the index structure in action
root@debian-test:~# awk -f b.awk
----
Inserting keys 1-5:
Node 1: 1 2 3 4 5 (leaf)
At this point, the tree is just a single leaf node. But it's now full
(2*order-1 = 5 keys), so the next insertion will trigger a split.
-----
Observe split happen at key 6:
Node 2: 4
Node 1: 1 2 3 (leaf)
Node 3: 4 5 6 (leaf)
The leaf split into two nodes, and key 4 was promoted to a new internal node.
This is the tree's first growth in height.
-----
Full tree:
Node 8: 10
Node 2: 4 7
Node 1: 1 2 3 (leaf)
Node 3: 4 5 6 (leaf)
Node 4: 7 8 9 (leaf)
Node 9: 13 16
Node 5: 10 11 12 (leaf)
Node 6: 13 14 15 (leaf)
Node 7: 16 17 18 19 20 (leaf)
Search tests:
Found 5 in leaf node 3
Found 15 in leaf node 6
25 not found (reached leaf 7)
Range search [8..17]:
8 9 10 11 12 13 14 15 16 17
This small output shows the entire lifecycle of a B+tree: from a single leaf, through the first split, to a balanced multi-level structure. Every internal node acts purely as a guide, while the leaves store all the actual keys, linked together for fast range scans.
In MariaDB, this is exactly what happens inside an index: inserts trigger node splits, searches walk the tree, and range queries stream through the linked leaves.
Conclusion
The important takeaway is that with B+trees, searches stay fast even on huge data sets. That is largely because the tree is balanced, so every element is reachable in the same number of steps, and also because the tree depth grows very slowly. Real-world indexes with millions of records will still only have a depth of four or five levels, making lookups very cheap.
Besides that, the other noteworthy point is that by keeping all data in leaf nodes and connecting them via a linked list, B+trees turn expensive range queries into cheap sequential scans. This is why many storage engines use variants of B+tree-like structures for indexing.
In this small awk example, only the bare-bone idea is presented, but the behavior is the same as in real systems like MariaDB.
When you run a query with a condition like WHERE id BETWEEN 1000 AND 2000, the engine is effectively walking through a B+tree structure much like the one built here.
It is worth noting that B+tree indexes do have limitations: they are poorly suited to complicated text matches, like WHERE REGEXP {expression}. This kind of structure can only locate entries by their left-most prefix. That is great when looking for single values like numbers, but bad when trying to find a specific word or phrase inside a paragraph.
Consider this piece of text:
Today was not such a rainy day
Using a regular index in InnoDB we can index the full sentence, but a lookup can only use that index when the search term starts from the leftmost prefix. A REGEXP condition can’t use the index at all, and even with LIKE, only prefix patterns qualify. So a query like:
SELECT data FROM sometable WHERE data LIKE 'Today was not such%';
will benefit from the index, but a query like:
SELECT data FROM sometable WHERE data LIKE '%rainy%';
will not.
There’s no entry in the index that starts with ‘rainy’, so the index structure/algorithm is blind to it. With B+trees, it is basically impossible to efficiently run complicated queries on text.
So even though B+trees are very cool, they’re not always the solution, and it’s good to be aware of other types of indexes for cases like these.