Paper Review 03 - Exploiting Nil-Externality for Fast Replicated Storage
Today’s paper comes from the DASSL lab at UIUC and was published in SOSP '21.
Go Illini!
What is replication and why is it slow?
A replica is a copy of a database, often read-only, that serves either as a backup or as an additional endpoint for load balancing. A naive approach to replication may involve periodically taking snapshots of the entire state of the leader, but in many applications this isn’t practical. For very large datasets, it is prohibitively expensive, and for frequently modified data, the snapshot frequency might not be high enough to ensure the required level of availability. While a possible improvement might be to instead track the delta between snapshots, a far simpler and more elegant approach exists - replicate the log of operations being submitted to the database, and reconstruct the state on each replica. While this is straightforward for single-node databases, distributed databases introduce issues of consistency.
There are two consistency models to consider - eventual consistency and strong consistency. Eventually consistent systems guarantee that if a write is submitted, eventually all reads will show the results of the write having been executed. This allows reads on some replicas to lag behind others, and possibly show updates from one node of the database but not earlier updates from another until a later point in time. Strongly consistent (also called linearizable) systems guarantee that all reads submitted after a write is committed will see the effects of the write. This means that there is an agreed-upon ordering of the executed operations (hence linearizable), and all replicas will only present states that correspond to prefixes of the operation log. Many applications require strong consistency despite the performance penalties it may bring when compared to eventual consistency.
What is a Nil-Externalizing interface, and how does it speed up replication?
Nil-Externality is a term defined by this paper:
“A nil-externalizing (nilext) interface may modify state within a storage system but does not externalize its effects or system state immediately to the outside world (apart from the acknowledgment itself). As a result, a storage system can apply a nilext operation in a deferred manner after acknowledgment, improving performance.”
An example of a nilext interface is a write that does not return any information about the state of the database. Let’s take a simple example of a dictionary in Python to illustrate the concept:
myDB = {}
# what if this set didn't actually happen at this point:
myDB['key0'] = 0
# We're reading the key, so we have to apply the write before/at this point:
print(myDB['key0'])
Let’s model the advantages of a nilext interface in Python:
from time import sleep
from threading import Thread

# Extending `dict` so that this behaves like a special dictionary
class SlowDict(dict):
    # Override the setitem operator - defines dict[key] = value
    def __setitem__(self, key, value):
        # Penalize writes to simulate the slow storage operations that
        # the paper addresses
        sleep(1)
        # Use the original setitem implementation from the underlying dict
        super().__setitem__(key, value)

    def update(self, other):
        # To keep things simple, we'll make setitem the only write operation
        raise NotImplementedError

class NilExtAwareDict(SlowDict):
    def __init__(self, *args, **kwargs):
        self.deferred_ops = []
        # Maps each key with deferred updates to the number of deferred
        # updates for that key - let's assume we can store this in a fast way
        self.deferred_keys = {}

        # Apply updates in the background
        def background_process():
            while True:
                self._apply_one_update()
                sleep(1)

        # Daemon thread, so pending background work doesn't block exit
        self.bg_thread = Thread(target=background_process, daemon=True)
        self.bg_thread.start()
        super().__init__(*args, **kwargs)

    def _apply_one_update(self):
        # Nothing to apply if no writes are pending
        if not self.deferred_ops:
            return
        key, value = self.deferred_ops.pop(0)
        self.deferred_keys[key] -= 1
        if self.deferred_keys[key] == 0:
            del self.deferred_keys[key]
        super().__setitem__(key, value)

    def __setitem__(self, key, value):
        # Acknowledge immediately; the write is applied later
        self.deferred_ops.append((key, value))
        # Increment the pending-write counter for `key`
        self.deferred_keys[key] = self.deferred_keys.get(key, 0) + 1

    def __getitem__(self, key):
        # fast path - this key isn't affected by any deferred writes
        if key not in self.deferred_keys:
            return super().__getitem__(key)
        # slow path - wait until the deferred writes to this key have been
        # applied (the paper's impl directly triggers the updates here)
        while key in self.deferred_keys:
            sleep(1)
        return super().__getitem__(key)
With this example we can see that nilext
operations (writes, or in our example,
__setitem__
calls) can return quickly, while operations that externalize state
(reads/__getitem__
) might be slower if the write operation has not already
been executed in the background.
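As a quick sanity check, we can time the two paths (a rough sketch; the exact numbers depend on how the background thread gets scheduled):

from time import perf_counter

db = NilExtAwareDict()

start = perf_counter()
db['key0'] = 0  # nilext write: acknowledged immediately, applied later
print(f"write acknowledged in {perf_counter() - start:.2f}s")  # ~0s

start = perf_counter()
print(db['key0'])  # externalizing read: waits for the deferred write
print(f"read served in {perf_counter() - start:.2f}s")  # a few seconds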
Unlike my example, the paper needs to deal with distributed operations, which means it additionally needs a consensus protocol to agree upon an ordering of operations. The authors deal with this by deferring the point where order is established to the operations that externalize state. For consensus, the authors introduce a protocol, SKYROS, which seems to be quite similar to VR (Viewstamped Replication) with some additional considerations to benefit from knowledge of nilext operations. One key detail is that nilext updates still need to be applied in an agreed-upon order. This is accomplished by adding updates to a durability log, which requires acknowledgments from a supermajority (at least \(\frac{3}{4}\) of the nodes) of the participating replicas (this is explained in the paper with some really clear examples). This can be problematic in cases where a supermajority is not available, which the authors handle by having clients retry operations a few times before marking them as non-nilext, which immediately finalizes the order and only requires a simple majority (more than \(\frac{1}{2}\) of the nodes).
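To make the quorum sizes concrete, here’s a small sketch based purely on the fractions above (the paper states its quorums in terms of f, the number of tolerated failures, so treat these as approximations):

import math

def supermajority(n):
    # fast path: "at least 3/4 of the nodes"
    return math.ceil(3 * n / 4)

def simple_majority(n):
    # slow path: more than half of the nodes
    return n // 2 + 1

for n in (3, 5, 7):
    print(f"n={n}: fast path needs {supermajority(n)} acks, "
          f"slow path needs {simple_majority(n)}")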
The paper goes on to claim and show that this can improve performance in many cases, as many workloads lie mainly on the fast path.
My thoughts
- Connections to filesystem approaches
This approach reminds me a lot of how some filesystems I’ve worked on handled writes and tiered storage. In some of the systems I’ve seen, writes remain in memory, and reads that touch dirty blocks read from the newly written memory, while other reads either read from a cache, or move down to the next tier. This paper seems to apply a similar methodology towards improving performance of strongly consistent databases.
- Allowing different operations to have different consistency guarantees
Not exactly related to this paper, but I wonder if there are applications that would benefit from being able to request reads that prioritize performance over consistency, potentially gaining better read performance. Maybe some model like a per-key configuration of consistency would be useful - see the sketch below.
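A purely hypothetical sketch of that idea, reusing the toy NilExtAwareDict from above (nothing like this appears in the paper):

class TunableDict(NilExtAwareDict):
    def __init__(self, *args, **kwargs):
        # keys registered here opt out of strong consistency
        self.relaxed_keys = set()
        super().__init__(*args, **kwargs)

    def relax(self, key):
        # mark `key` as tolerating stale (but fast) reads
        self.relaxed_keys.add(key)

    def __getitem__(self, key):
        if key in self.relaxed_keys:
            # skip the deferred-write check: never blocks, but may miss
            # recent writes (or even KeyError if none have applied yet)
            return dict.__getitem__(self, key)
        return super().__getitem__(key)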
- Detecting workloads and switching strategies
I wonder if there are workloads that perform worse due to the overhead of deferring execution (e.g. write, read, write, read, etc.), that could be detected and used to trigger a mode where execution is immediate (i.e. treat all updates as externalizing), and whether that switch would bring material performance gains. I imagine this could work by keeping a counter to track the frequency of non-nilext requests submitted - see the sketch below. The discussion near the end of the paper, where SKYROS is combined with commutative protocols, seems to suggest that this might be beneficial.
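Here’s a hypothetical sketch of that counter-based switch, again building on the toy NilExtAwareDict rather than anything from the paper:

from collections import deque

class AdaptiveDict(NilExtAwareDict):
    WINDOW = 100      # how many recent operations to consider
    THRESHOLD = 0.5   # go immediate when reads exceed this fraction

    def __init__(self, *args, **kwargs):
        # sliding window of recent operations; True marks an externalizing op
        self.recent = deque(maxlen=self.WINDOW)
        super().__init__(*args, **kwargs)

    def _read_heavy(self):
        return self.recent and sum(self.recent) / len(self.recent) > self.THRESHOLD

    def __setitem__(self, key, value):
        self.recent.append(False)
        # apply immediately when reads dominate, but only if no older write
        # for this key is still pending (otherwise we'd reorder writes)
        if self._read_heavy() and key not in self.deferred_keys:
            SlowDict.__setitem__(self, key, value)
        else:
            super().__setitem__(key, value)

    def __getitem__(self, key):
        self.recent.append(True)
        return super().__getitem__(key)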
All in all, an interesting paper. Deferred execution often opens up opportunities for a lot of optimizations (e.g. combining updates to the same key so storage only has to be accessed once - sketched below - though this could still be done at other levels without the specific approach in this paper), which makes me really excited about the ideas presented here!
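As one example of such an optimization, here’s a minimal hypothetical coalescing variant of the earlier SlowDict, where only the newest pending value per key ever touches slow storage:

class CoalescingDict(SlowDict):
    def __init__(self, *args, **kwargs):
        # key -> newest deferred value; older pending writes get overwritten
        self.pending = {}
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        # acknowledge immediately; coalesce with any older pending write
        self.pending[key] = value

    def __getitem__(self, key):
        # reads must observe pending writes first
        if key in self.pending:
            return self.pending[key]
        return super().__getitem__(key)

    def flush(self):
        # one slow write per key, no matter how many updates were submitted
        for key, value in self.pending.items():
            super().__setitem__(key, value)
        self.pending.clear()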