#LLRBT | Explore Tumblr posts and blogs

lesliebklein · 11 years ago

Text

Explaining Red-Black Trees Standing on One Foot

"Cracking the Coding Interview" makes the claim: Academic books prepare you for fancy research, but they’re not going to help you much in an interview. Why? I'll give you a hint: your interviewers haven’t seen Red-Black Trees since they were in school either.

Red-Black Trees were invented in 1978 and re-invented in 2008. They are sooooo intuitive now! Lots of material on the Internet presents Red-Black trees using the old techniques and the words come across as total mumbo jumbo. I heard a MIT prof use words like "non-intuitive" and "I need to use my cheat sheet." A perfect example of an impenetrable lecture on Red-Black trees is MIT's OCW lecture. It is 83 mins of classroom torture.

I'm going to try to explain Red-Black Trees standing on one foot. Here I go.

Standing On One Foot

After studying Binary Search Trees (BSTs), we realize that it would be great to construct a balanced tree, no matter the order that data is inserted into the BST. Even if the data arrives in ascending order, the tree is guaranteed to be no higher than lg N (that's lg base 2).

To balance a tree, we need to store more than 1 data item (the "key") in a node. So let's allow 1 or 2 keys per node. Each node then has 2 or 3 children. Let's call this a 2-3 tree.

Data is always inserted into a node at the bottom of the tree. If it is inserted into a 2-node then the node becomes a 3-node node. Easy. If the key is inserted into a 3-node, then we have to fix that.

We fix this "temporary" 4-node by splitting it. The middle key is moved up into its parent node. The other 2 keys are each put into a 2-node.

Now we check the parent. If it is a 3-node then we have nothing to fix. Otherwise, we have to repeat the process of splitting the temporary 4-node. Again, the middle key is moved up into its parent node. We repeat this until the parent doesn't need to be split. If the parent is the root of the tree and it must be split, then we split it (as usual) and the height of the tree is increased by 1.

A 2-3 node tree is perfectly balanced because we constructed it to be so. But implementing this tree directly results in very complex code.

Let's explore another idea: since we spent so much time learning about BSTs and we wrote code to get and put and find the min and max, and rank and in-order traversal.... wouldn't it be great to use all that on whatever balanced tree we create?

So the question is: How can we represent a 2-3 tree as our well-known and loved BST?

Here's an idea (not mine!): lets connect the 2 keys in a 3-node with a link. If we draw a 3-node as two 2-nodes connected by an internal link then we have a perfectly balanced tree that consists only of 2-nodes! In order to mark these links as "internal" so that they can be distinguished from the other links in the tree, let's color them red.

In order to keep this story simple, let's draw the tree so that all red links are drawn as left branches. (A tree with this constraint is called a left-leaning red-black tree (LLRB tree)).

You might say, but if I add a key to a 2-node (which makes it a 3-node) and I color the internal link red, and I draw the tree as a BST, the link could be a right branch. Good point. So we have to reassign 2 links: the links of the 2-nodes created from the keys in the 3-node. That's it. Just 2 link reassignments!

There are 2 other cases where we must reassign links to maintain the invariant that all red links lean left. These reassignments are done after a key is inserted into the tree. The result? A nearly-balanced BST.

And the best part is that all the code we wrote when studying BSTs works fine!!! No change required.

Okay. I've got to put my foot down. Whew.

Just put my foot down...

In fact, the additional code required to insert data into the BST to keep the tree near-perfectly balanced consists of 3 functions: rotateLeft, rotateRight, flipColors. Each of these functions consists of 5 assignment statements. Unbelievable how powerful these simple statements can be.

I've said the tree is nearly-perfectly balanced because the black height from root to any leaf is constant at lg N. The red links do add height in the BST. In the worse case, a path can alternate red-black-red-black... so the longest path is 2 lg N. In a 2-3 tree, there is no red link and that is why the 2-3 tree is guaranteed to have height lg N. (It was constructed that way!)

A Python Implementation

I've implemented a LLRB Tree in python and did some experiments. (Each experiment inserts N random keys and is averaged over 100 trials.) About 25% of the links are red. I calculate the average number of red links per path after each insertion, and I can see (in a printout of numbers) the number of red links grow and fall, which is the result of a temporary 4-node split at the root. This would be really cool if I knew how plot this data graphically. (Another project.)

EDIT: Results from running binary tree experiments in ipython notebook are here and here .

Here are some results:

run experiments.py -N 1000000 -t 100 mean/std % red 25.392963 / 0.0333445637398 min/max 25.2962 / 25.4836 mean/std black height 15.5 / 0.5 mean/std height 28.18 / 0.698283610004

run experiments.py -N 500000 -t 100 mean/std % red 25.38765 / 0.0410224365439 min/max 25.2844 / 25.4916 mean/std black height 14.88 / 0.324961536185 mean/std height 26.6 / 0.632455532034

run experiments.py -N 50000 -t 100 mean/std % red 25.40336 / 0.155529773355 min/max 25.03 / 25.818 mean/std black height 12.0 / 0.0 mean/std height 21.85 / 0.779422863406

run experiments.py -N 10000 -t 100 mean/std % red 25.3713 / 0.340821522208 min/max 24.56 / 26.26 mean/std black height 10.0 / 0.0 mean/std height 18.39 / 0.676683086829

run experiments.py -N 5000 -t 100 mean/std % red 25.3846 / 0.44915792323 min/max 24.48 / 26.72 mean/std black height 9.0 / 0.0 mean/std height 16.95 / 0.683739716559

A Generalized Red-Black Tree

The next question to ask is: What about a balanced tree based not on a 2-3 tree but on a 2-3-4-5-6-...-M-1 tree?

Such a tree actually has a name. It is called an M-order B-tree. A B-Tree describes a balanced tree that can have any number of keys in a node, not just 2 or 3. B-Trees are used to search through huge data stores that can't fit into memory and may be stored on local disk or remotely somewhere on the web.

Python modules do exist that implement B-trees for web frameworks. I haven't explored these yet but I want to see how they are implemented. Some of the code is over 7 years old which leads me to think it is implemented in the pre-2008 way. I'm sure there is room for improvement.

#algorithms #red-black trees #LLRBT #2-3 trees

0 notes