Paper Review 06 - Tree Indexing on Flash Disks

My thoughts on Tree Indexing on Flash Disks


Lately, I’ve been reading Alex Petrov’s Database Internals. While my professional career so far has exposed me to a lot of these concepts, I’ve never had to build a database from the ground up. This has left my understanding of many core database concepts somewhat superficial, so I’ve been reading this book to better understand where the gaps in my knowledge are. While my familiarity with the fundamentals gave me a strong intuition for most topics, I found FD-Trees enigmatic. The claim of a data structure that builds on B-Trees but is optimized for SSDs with slow random writes seemed neat, but the description in the book didn’t help me understand how such a property was achieved. Fortunately, the original paper for FD-Trees, along with some source code, was able to shed light on the matter.

What is an FD-Tree?

Consider SSDs/flash storage. SSDs have blazing fast random reads compared to spinning disks. However, random writes tend to be slow due to the overhead of looking up, erasing, and allocating blocks. Sequential writes are also better for the lifespan of the SSD. While B-Trees do optimize for page locality, reads and writes can still incur a lot of random I/O.

FD-Trees approach the problem by using a B-Tree followed by tiers of sorted runs of increasing length. The key intuition is that merging two sorted lists into a new one requires only sequential writes. However, to maintain efficiency, these sorted runs still form a tree-like structure. To avoid preserving the tree-like properties on every write, writes are initially “buffered” in the top level (a small B-Tree - this keeps random writes localized to a small region), and then propagated to lower levels as the top level fills up.
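To make this concrete, here is a minimal sketch of that write path in Python. The `FDTree` class, the capacities, and the use of plain in-memory sorted lists are all my own illustrative stand-ins (the real structure stores pages on flash and merges in a streaming fashion), but the shape is the same: inserts land in a small head buffer, and a full head cascades down through sorted runs using only merge-style sequential writes.

```python
# Illustrative sketch only: in-memory stand-ins for the head B-Tree and the
# on-disk sorted runs. Capacities are tiny so the cascade is easy to observe.
import bisect

HEAD_CAPACITY = 4   # capacity of the small head B-Tree (illustrative)
GROWTH_FACTOR = 4   # each level is ~k times larger than the one above

class FDTree:
    def __init__(self):
        self.head = {}    # stands in for the small head B-Tree
        self.levels = []  # levels[i] is a sorted run of (key, value) tuples

    def insert(self, key, value):
        self.head[key] = value
        if len(self.head) > HEAD_CAPACITY:
            self._merge_down()

    def _merge_down(self):
        # Flush the head into level 0; cascade while a level overflows.
        run = sorted(self.head.items())
        self.head = {}
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(run)
                return
            # Merging two sorted runs needs only sequential reads and writes;
            # sorted() here stands in for a streaming two-way merge.
            merged = sorted(self.levels[i] + run)
            capacity = HEAD_CAPACITY * GROWTH_FACTOR ** (i + 1)
            if len(merged) <= capacity:
                self.levels[i] = merged
                return
            self.levels[i] = []
            run = merged
            i += 1

    def get(self, key):
        if key in self.head:
            return self.head[key]
        for level in self.levels:
            j = bisect.bisect_left(level, (key,))
            if j < len(level) and level[j][0] == key:
                return level[j][1]
        return None
```

Note that `get` here binary-searches each level independently; the fences described below are what let the real structure descend level to level instead.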

Get off my lawn

In order to maintain the tree nature of the sorted runs, the first entry of every page on a level \(L_i\) has a corresponding entry in level \(L_{i-1}\) that acts as a pointer to that page. Additionally, if the last entry of any page is not already a fence, it must be an internal fence (a fence to an adjacent key/fence). This seems to be required to make operations on the FD-Tree easier to express. The structure of these fences necessitates that any write on a level \(L_i\) must modify all preceding levels.
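Here is a rough sketch of how fence-guided search might look, under my own simplified layout (levels as lists of pages, entries tagged as `"data"` or `"fence"`); the real on-disk format differs, but the idea is the same: each lookup descends one page per level by following the nearest fence at or before the search key.

```python
# Illustrative layout: levels[i] is a list of pages; each page is a sorted
# list of entries, either ("data", key, value) or ("fence", key, child_page),
# where child_page is an index into the next level's page list.
import bisect

def search(levels, key):
    page_id = 0
    for depth, level in enumerate(levels):
        page = level[page_id]
        keys = [e[1] for e in page]
        j = bisect.bisect_right(keys, key) - 1  # rightmost entry with key <= target
        if j < 0:
            return None
        entry = page[j]
        if entry[0] == "data" and entry[1] == key:
            return entry[2]
        # Back up to the nearest fence and follow it into the next level.
        while j >= 0 and page[j][0] != "fence":
            j -= 1
        if j < 0 or depth + 1 == len(levels):
            return None
        page_id = page[j][2]
    return None
```

For example, with a one-page head containing fences into a two-page lower level, a lookup for a key in the second page follows the fence with the largest key not exceeding the target.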

Deamortization

The amortized costs of merging can be too high in practice. This can be resolved by having two head B-Trees, and performing the merge on one, while letting the other service new read requests. During the merge, fences from the new levels are inserted into the active tree instead of the old one.
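A sketch of the idea, with names and step budgets of my own invention (and ignoring fences entirely): when the active head fills, it is frozen and drained into level 0 a few merge steps per subsequent operation, while the other head absorbs new writes.

```python
# Illustrative sketch of a deamortized head: the frozen head is merged out
# incrementally, bounded by steps_per_op, so no single insert pays for a
# whole merge. Fence maintenance is omitted for brevity.
import heapq

class DeamortizedHead:
    def __init__(self, capacity=4, steps_per_op=2):
        self.capacity = capacity
        self.steps_per_op = steps_per_op
        self.active = {}    # head currently taking writes
        self.frozen = None  # iterator draining the head being merged out
        self.pending = []   # output run of the in-progress merge
        self.level0 = []    # sorted run below the heads

    def insert(self, key, value):
        self._do_merge_steps()
        self.active[key] = value
        if len(self.active) >= self.capacity and self.frozen is None:
            # Swap heads: freeze the full one and start draining it into level 0.
            self.frozen = iter(heapq.merge(sorted(self.active.items()), self.level0))
            self.active = {}
            self.pending = []

    def _do_merge_steps(self):
        if self.frozen is None:
            return
        for _ in range(self.steps_per_op):
            item = next(self.frozen, None)
            if item is None:
                self.level0 = self.pending  # merge finished: install the new run
                self.frozen = None
                return
            self.pending.append(item)
```

Reads in a real implementation would have to consult both heads plus the partially built run; this sketch only shows the write-side choreography.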

Thoughts

  • Merges being fast enough is counterintuitive

I would have expected the amortized cost of merging two sorted levels to be prohibitively expensive. However, breaking the problem into layers, and amortizing the cost over many inserts, plays a bigger role than I would have guessed. The deamortization strategy also helps to mitigate issues.
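A rough back-of-the-envelope sketch of why (my own approximation, not the paper’s exact analysis): if each level is \(k\) times larger than the one above, there are about \(\log_k(n/|L_0|)\) levels. Merging \(L_i\) into \(L_{i+1}\) rewrites all of \(L_{i+1}\), and \(L_{i+1}\) absorbs roughly \(k\) such merges before overflowing itself, so each entry is rewritten \(O(k)\) times per level. With \(B\) entries per page, that works out to roughly \(O\!\left(\frac{k}{B}\log_k \frac{n}{|L_0|}\right)\) page writes amortized per insertion - small, because \(B\) is large and all of those writes are sequential.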

  • More async?

While deamortization helps, I wonder if there are benefits to an approach that merges levels that are almost full (past some threshold) in the background. These merges would have to be done out of place (with an atomic pointer swap at some point) in order to allow reads to continue working. Writes would need to be buffered elsewhere - maybe writes that cause the head to fill up could block or trigger immediate execution of a merge on level 0.

  • Can we do better than rewriting the whole tree?

Merging seems like it can be very expensive, since every merge modifies all preceding levels. I wonder if there are ways this can be mitigated - e.g. making every level a B-Tree with write buffering, or adopting stronger invariants that introduce some amount of random writing, but only to fix up fences in some limited way (maybe a combination of in-memory state and an append-only log?).

  • Storing write state as a linked list

Something clever in the source code is that write state is actually a linked list, encoding state for every level that is being “concurrently” written. Not exactly a groundbreaking technique, but I always love elegant uses of linked lists.
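The pattern is small enough to sketch - one node per in-flight level, chained together (this is my own minimal rendering, not the paper’s actual structs):

```python
# Illustrative: a merge cascading through several levels is represented as a
# chain of per-level write states, one node per level being written.
class MergeState:
    def __init__(self, level, write_cursor, next_state=None):
        self.level = level                 # which level this node writes to
        self.write_cursor = write_cursor   # next page offset at this level
        self.next = next_state             # state for the next deeper level

def depth(state):
    # Number of levels currently being written in this cascade.
    n = 0
    while state is not None:
        n += 1
        state = state.next
    return n

# A merge touching levels 0, 1, and 2 is just a three-node chain.
cascade = MergeState(0, 0, MergeState(1, 0, MergeState(2, 0)))
```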

This paper is well over a decade old, so now I’m curious how the state of the art has progressed since then. This was a fun read, and it goes to show that sometimes just reading the original material is more approachable than secondhand sources.

Written on March 3, 2024