Paper Review 06 - Tree Indexing on Flash Disks
My thoughts on Tree Indexing on Flash Disks
Lately, I’ve been reading Alex Petrov’s Database Internals. While my professional career has exposed me to many of these concepts, I’ve never had to build a database from the ground up, which has left my understanding of some core database concepts somewhat superficial, so I’ve been reading the book to find the gaps in my knowledge. While my familiarity with the fundamentals gave me a strong intuition for most of the material, I found the topic of FD-Trees enigmatic. The claim of a data structure that builds on B-Trees but is optimized for SSDs with slow random writes seemed neat, yet the description in the book didn’t help me understand how such a property was achieved. Fortunately, the original paper for FD-Trees, along with some source code, was able to shed light on the matter.
What is an FD-Tree?
Consider SSDs/flash storage. SSDs have blazing fast random reads compared to spinning disks. However, random writes tend to be slow due to the overhead of looking up and allocating blocks. Sequential writes are also better for the lifespan of the SSD. While B-Trees do optimize for page locality, reads and writes can still incur a lot of random seeking.
FD-Trees approach the problem by using a small B-Tree followed by tiers of sorted runs of increasing length. The key intuition is that creating and merging two sorted lists only ever requires sequential writes. To stay efficient, however, these sorted runs are still linked together into a tree-like structure. To avoid preserving the tree-like properties on every write, writes are initially “buffered” in the top level (a small B-Tree, which keeps random writes localized to a small region) and then propagated to the lower levels as the top level fills up.
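To make that shape concrete, here’s a rough Python sketch of the write path: a small in-memory dict standing in for the head B-Tree, sorted lists standing in for the on-disk runs, and names like HEAD_CAPACITY and GROWTH_FACTOR that are my own placeholders rather than anything from the paper. It ignores fences entirely and uses sorted() as a stand-in for a streaming merge, so treat it as an illustration of the layering rather than the actual algorithm.

```python
import bisect

HEAD_CAPACITY = 4   # entries the head "B-Tree" holds before spilling (illustrative)
GROWTH_FACTOR = 4   # each level is this many times larger than the one above (illustrative)

class FDTreeSketch:
    def __init__(self):
        self.head = {}     # small, randomly-updated region (stands in for the head B-Tree)
        self.levels = []   # levels[i] is a sorted run of (key, value) pairs

    def insert(self, key, value):
        self.head[key] = value            # random write, but confined to the head
        if len(self.head) > HEAD_CAPACITY:
            self._merge_down(0, sorted(self.head.items(), key=lambda kv: kv[0]))
            self.head.clear()

    def _merge_down(self, i, run):
        # Merging two sorted runs is one sequential read of each input and one
        # sequential write of the output; sorted() here stands in for that pass.
        if i == len(self.levels):
            self.levels.append([])
        merged = sorted(self.levels[i] + run, key=lambda kv: kv[0])
        capacity = HEAD_CAPACITY * GROWTH_FACTOR ** (i + 1)
        if len(merged) > capacity:
            self.levels[i] = []
            self._merge_down(i + 1, merged)   # an overflowing level cascades downward
        else:
            self.levels[i] = merged

    def lookup(self, key):
        if key in self.head:
            return self.head[key]
        for level in self.levels:             # search levels top to bottom (newest first)
            j = bisect.bisect_left(level, (key,))
            if j < len(level) and level[j][0] == key:
                return level[j][1]
        return None
```

The property worth noticing is that the only structure touched by random writes is the head; everything below it is only ever produced by sequential merges.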
Get off my lawn
In order to maintain the tree nature of the sorted runs, the first entry of every page on a level \(L_i\) has a corresponding entry in level \(L_{i-1}\) that acts as a pointer to that page. Additionally, if the last entry of any page is not already a fence, it must be an internal fence (a fence to an adjacent key/fence). This seems to be required to make operations on the FD-Tree easier to express. The structure of these fences necessitates that any write to a level \(L_i\) must modify all preceding levels.
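Here’s roughly how I picture a lookup descending through those fences. The Entry layout, the KEY/FENCE tags, and the search_level helper are all my own illustrative names - this is not the paper’s page format.

```python
from bisect import bisect_right
from dataclasses import dataclass

KEY, FENCE = 0, 1

@dataclass
class Entry:
    key: int
    kind: int      # KEY for a real record, FENCE for a pointer entry
    payload: int   # record id for KEY entries; target page index in the next level for FENCE entries

def search_level(pages, page_idx, key):
    """Binary-search one page; return (exact match or None, fence to follow into the next level)."""
    page = pages[page_idx]
    keys = [e.key for e in page]
    i = bisect_right(keys, key) - 1              # position of the last entry with e.key <= key
    prefix = page[:i + 1]
    match = next((e for e in prefix if e.kind == KEY and e.key == key), None)
    fence = next((e for e in reversed(prefix) if e.kind == FENCE), None)  # closest preceding fence
    return match, fence
```

A full lookup would start in the head B-Tree and repeat this per level, following fence.payload into a page of the next level each time; if I’m reading the invariants right, the extra internal fences exist precisely so that a covering fence can always be found inside the page you landed on.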
Deamortization
The amortized costs of merging can be too high in practice. This can be resolved by having two head B-Trees, performing the merge on one while letting the other service incoming requests. During the merge, fences from the new levels are inserted into the active tree instead of the old one.
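My mental model of that, sketched with dicts standing in for the two head B-Trees (the method names are mine, not the paper’s):

```python
class DeamortizedHead:
    def __init__(self):
        self.active = {}     # receives new inserts, plus fences produced by the in-flight merge
        self.merging = None  # frozen head currently being merged into the lower levels

    def start_merge(self):
        self.merging, self.active = self.active, {}

    def insert(self, key, value):
        self.active[key] = value

    def add_fence(self, key, page_idx):
        # Fences pointing at freshly written lower-level pages land in the active
        # head, so they survive once the old head is discarded.
        self.active[key] = ("fence", page_idx)

    def lookup(self, key):
        if key in self.active:                     # newest data first
            return self.active[key]
        if self.merging is not None and key in self.merging:
            return self.merging[key]
        return None

    def finish_merge(self):
        self.merging = None   # the old head's contents now live in the lower levels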
Thoughts
- Merges being fast enough is counterintuitive
I would have expected the amortized cost of merging two sorted levels to be prohibitively expensive. However, breaking the problem into layers matters more than I would have guessed: the merges are purely sequential I/O, and their cost is amortized over many inserts. The deamortization strategy also helps to mitigate the remaining issues.
- More async?
While deamortization helps, I wonder if there are benefits to an approach that merges levels that are almost full (past some threshold) in the background. These merges would have to be done out of place (with an atomic pointer swap at some point) in order to allow reads to continue to work. Writes would need to be buffered elsewhere - maybe writes that cause the head to fill up could block or trigger immediate execution of a merge on level 0.
- Can we do better than rewriting the whole tree?
Merging seems like it can be very expensive, since every merge modifies all preceding levels. I wonder if there are ways this can be mitigated - e.g. making every level a B-Tree with write buffering, or adopting stronger invariants that introduce some amount of random writing, but only to fix up fences in some limited way (maybe a combination of in-memory state and an append-only log?).
- Storing write state as a linked list
Something clever in the source code is that write state is actually a linked list, encoding state for every level that is being “concurrently” written (sketched below). Not exactly a groundbreaking technique, but I always love elegant uses of linked lists.
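For what it’s worth, my paraphrase of that shape (purely illustrative - this is not the actual source):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LevelWriteState:
    level: int                                  # which level this node is currently writing
    next_page: int                              # sequential write cursor within that level
    next: Optional["LevelWriteState"] = None    # state for the next level touched by the cascade

def deepest(state: LevelWriteState) -> LevelWriteState:
    # Walking the list reaches the lowest level touched by the in-flight merge.
    while state.next is not None:
        state = state.next
    return state
```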
This paper is almost a decade old, so now I’m curious how the state of the art has progressed since then. This was a fun read, and it goes to show that sometimes just reading the original material is more approachable than second-hand sources.