Paper Review 03 - Exploiting Nil-Externality for Fast Replicated Storage
Today’s paper comes from the DASSL lab at UIUC and was published in SOSP '21.
Go Illini!
What is replication and why is it slow?
A replica is a copy of a database, often read-only, that serves either as a backup or as an additional endpoint for load balancing. A naive approach to replication may involve periodically taking snapshots of the entire state of the leader, but in many applications this isn’t practical. For very large datasets, it is prohibitively expensive, and for frequently modified data, the snapshot frequency might not be high enough to ensure the required level of availability. While a possible improvement might be to instead track the delta between snapshots, a far simpler and more elegant approach exists - replicate the log of operations being submitted to the database, and reconstruct the state on each replica. While this is straightforward for single-node databases, distributed databases introduce issues of consistency.
There are two consistency models to consider - eventual consistency and strong consistency. Eventually consistent systems guarantee that if a write is submitted, eventually all reads will show the results of the write having been executed. This allows reads on some replicas to lag behind others, and possibly show updates from one node of the database but not earlier updates from another until a later point in time. Strongly consistent (also called linearizable) systems guarantee that all reads submitted after a write is committed will see the effects of the write. This means that there is an agreed-upon ordering of the executed operations (hence linearizable), and all replicas will only present states that correspond to prefixes of the operation log. Many applications require strong consistency despite the performance penalties it may bring when compared to eventual consistency.
What is a Nil-Externalizing interface, and how does it speed up replication?
Nil-Externality is a term defined by this paper:
“A nil-externalizing (nilext) interface may modify state within a storage system but does not externalize its effects or system state immediately to the outside world (apart from the acknowledgment itself). As a result, a storage system can apply a nilext operation in a deferred manner after acknowledgment, improving performance.”
An example of a nilext interface is a write that does not return any information about the state of the database. Let’s take a simple example of a dictionary in Python to illustrate the concept:
myDB = {}
# what if this set didn't actually happen at this point:
myDB['key0'] = 0
# We're reading the key, so we have to apply the write before/at this point:
print(myDB['key0'])
Let’s model the advantages of a nilext interface in Python:
from time import sleep
from threading import Thread

# Extending `dict` so that this behaves like a special dictionary
class SlowDict(dict):
    # Override the setitem operator - defines dict[key] = value
    def __setitem__(self, key, value):
        # Penalize writes to simulate the slow storage operations that
        # the paper addresses
        sleep(1)
        # Use the original setitem implementation from the underlying dict
        super().__setitem__(key, value)

    def update(self, other):
        # To keep things simple, we'll make setitem the only write operation
        raise NotImplementedError

class NilExtAwareDict(SlowDict):
    def __init__(self, *args, **kwargs):
        self.deferred_ops = []
        # Maps each key with deferred updates to the number of deferred
        # updates for that key - let's assume we can store this in a fast way
        self.deferred_keys = {}

        # Apply updates in the background
        def background_process():
            while True:
                self._apply_one_update()
                sleep(1)

        # Daemon thread, so pending background work doesn't block exit
        self.bg_thread = Thread(target=background_process, daemon=True)
        self.bg_thread.start()
        super().__init__(*args, **kwargs)

    def _apply_one_update(self):
        # Nothing to apply if no writes are pending
        if not self.deferred_ops:
            return
        key, value = self.deferred_ops.pop(0)
        self.deferred_keys[key] -= 1
        if self.deferred_keys[key] == 0:
            del self.deferred_keys[key]
        super().__setitem__(key, value)

    def __setitem__(self, key, value):
        # Acknowledge immediately; the write is applied later
        self.deferred_ops.append((key, value))
        # Increment the pending-write counter for `key`
        self.deferred_keys[key] = self.deferred_keys.get(key, 0) + 1

    def __getitem__(self, key):
        # fast path - this key isn't affected by any deferred writes
        if key not in self.deferred_keys:
            return super().__getitem__(key)
        # slow path - wait until the deferred writes to this key have been
        # applied (the paper's impl directly triggers the updates here)
        while key in self.deferred_keys:
            sleep(1)
        return super().__getitem__(key)
With this example we can see that nilext
operations (writes, or in our example,
__setitem__
calls) can return quickly, while operations that externalize state
(reads/__getitem__
) might be slower if the write operation has not already
been executed in the background.
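As a quick sanity check, we can time the two paths (a rough sketch; the exact numbers depend on how the background thread gets scheduled):

from time import perf_counter

db = NilExtAwareDict()

start = perf_counter()
db['key0'] = 0  # nilext write: acknowledged immediately, applied later
print(f"write acknowledged in {perf_counter() - start:.2f}s")  # ~0s

start = perf_counter()
print(db['key0'])  # externalizing read: waits for the deferred write
print(f"read served in {perf_counter() - start:.2f}s")  # a few seconds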
Unlike my example, the paper needs to deal with distributed operations, which means it additionally needs a consensus protocol to agree upon an ordering of operations. The authors deal with this by deferring the point where order is established to the operations that externalize state. For consensus, the authors introduce a protocol, SKYROS, which seems to be quite similar to VR (Viewstamped Replication) with some additional considerations to benefit from knowledge of nilext operations. One key detail is that nilext updates still need to be applied in an agreed-upon order. This is accomplished by adding updates to a durability log, which requires acknowledgments from a supermajority (at least \(\frac{3}{4}\) of the nodes) of the participating replicas (this is explained in the paper with some really clear examples). This can be problematic in cases where a supermajority is not available, which the authors handle by having clients retry operations a few times before marking them as non-nilext, which immediately finalizes the order and only requires a simple majority (more than \(\frac{1}{2}\) of the nodes).
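To make the quorum sizes concrete, here’s a small sketch based purely on the fractions above (the paper states its quorums in terms of f, the number of tolerated failures, so treat these as approximations):

import math

def supermajority(n):
    # fast path: "at least 3/4 of the nodes"
    return math.ceil(3 * n / 4)

def simple_majority(n):
    # slow path: more than half of the nodes
    return n // 2 + 1

for n in (3, 5, 7):
    print(f"n={n}: fast path needs {supermajority(n)} acks, "
          f"slow path needs {simple_majority(n)}")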
The paper goes on to claim and show that this can improve performance in many cases, as many workloads lie mainly on the fast path.
My thoughts
- Connections to filesystem approaches
This approach reminds me a lot of how some filesystems I’ve worked on handled writes and tiered storage. In some of the systems I’ve seen, writes remain in memory, and reads that touch dirty blocks read from the newly written memory, while other reads either read from a cache, or move down to the next tier. This paper seems to apply a similar methodology towards improving performance of strongly consistent databases.
- Allowing different operations to have different consistency guarantees
Not exactly related to this paper, but I wonder if there are applications that would benefit from being able to request reads that prioritize performance over consistency, potentially gaining better read performance. Maybe some model like a per-key configuration of consistency would be useful - see the sketch below.
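A purely hypothetical sketch of that idea, reusing the toy NilExtAwareDict from above (nothing like this appears in the paper):

class TunableDict(NilExtAwareDict):
    def __init__(self, *args, **kwargs):
        # keys registered here opt out of strong consistency
        self.relaxed_keys = set()
        super().__init__(*args, **kwargs)

    def relax(self, key):
        # mark `key` as tolerating stale (but fast) reads
        self.relaxed_keys.add(key)

    def __getitem__(self, key):
        if key in self.relaxed_keys:
            # skip the deferred-write check: never blocks, but may miss
            # recent writes (or even KeyError if none have applied yet)
            return dict.__getitem__(self, key)
        return super().__getitem__(key)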
- Detecting workloads and switching strategies
I wonder if there are workloads that perform worse due to the overhead of deferring execution (e.g. write, read, write, read, etc.), that could be detected and used to trigger a mode where execution is immediate (i.e. treat all updates as externalizing), and whether that switch would bring material performance gains. I imagine this could work by keeping a counter to track the frequency of non-nilext requests submitted - see the sketch below. The discussion near the end of the paper, where SKYROS is combined with commutative protocols, seems to suggest that this might be beneficial.
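Here’s a hypothetical sketch of that counter-based switch, again building on the toy NilExtAwareDict rather than anything from the paper:

from collections import deque

class AdaptiveDict(NilExtAwareDict):
    WINDOW = 100      # how many recent operations to consider
    THRESHOLD = 0.5   # go immediate when reads exceed this fraction

    def __init__(self, *args, **kwargs):
        # sliding window of recent operations; True marks an externalizing op
        self.recent = deque(maxlen=self.WINDOW)
        super().__init__(*args, **kwargs)

    def _read_heavy(self):
        return self.recent and sum(self.recent) / len(self.recent) > self.THRESHOLD

    def __setitem__(self, key, value):
        self.recent.append(False)
        # apply immediately when reads dominate, but only if no older write
        # for this key is still pending (otherwise we'd reorder writes)
        if self._read_heavy() and key not in self.deferred_keys:
            SlowDict.__setitem__(self, key, value)
        else:
            super().__setitem__(key, value)

    def __getitem__(self, key):
        self.recent.append(True)
        return super().__getitem__(key)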
All in all, an interesting paper. Deferred execution often opens up opportunities for a lot of optimizations (e.g. combining updates to the same key so storage only has to be accessed once - sketched below - though this could still be done at other levels without the specific approach in this paper), which makes me really excited about the ideas presented here!
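As one example of such an optimization, here’s a minimal hypothetical coalescing variant of the earlier SlowDict, where only the newest pending value per key ever touches slow storage:

class CoalescingDict(SlowDict):
    def __init__(self, *args, **kwargs):
        # key -> newest deferred value; older pending writes get overwritten
        self.pending = {}
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        # acknowledge immediately; coalesce with any older pending write
        self.pending[key] = value

    def __getitem__(self, key):
        # reads must observe pending writes first
        if key in self.pending:
            return self.pending[key]
        return super().__getitem__(key)

    def flush(self):
        # one slow write per key, no matter how many updates were submitted
        for key, value in self.pending.items():
            super().__setitem__(key, value)
        self.pending.clear()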