Senior/Staff System Design Interviews: LSM Trees (or LSM Engines)

Whether a senior/staff system design interview ends up meriting an offer often depends on how well one understands scalability and how to tune a database to achieve high levels of performance.

We touched upon B-trees in a previous post, which often underlie the relational databases we are familiar with. We’ll now begin discussing a contrasting data structure that underlies a number of NoSQL stores: LSM Trees.

LSM Trees (Log-Structured Merge Trees) have emerged as a popular data structure that optimizes for write-heavy workloads. We’ll get into two component of the LSM Tree: an in-memory memtable and disk-stored SSTables (Sorted String Tables). We’ll also describe ‘sequential writes’, a concept central to how LSM trees make super fast writes to memory. But first, let’s address a frequent major impediment to performance: Disk I/O.

Why is Disk I/O considered slow?

Disk I/O (input/output) of traditional spinning hard disk drives (HDDs) is relatively expensive due to their mechanical characteristics. When reading or writing, HDDs require physical movement, resulting in seek time, rotational latency, and slow ‘random accesses’. Solid state drives (SSDs) instead use semiconductors as the basis for their storage mechanism, which leads to faster data retrieval times (often > 10x faster). Though some state that SSDs have zero seek-time (as they lack the same physical mechanisms of HDDs), there is still overhead/latency associated with them.

And while SSDs are much faster than HDDs, RAM can be an order faster than both! The input/output to/from these drives often become unnecessary bottlenecks in systems when they haven’t been accounted for properly.

The Speed-Up of Sequential Writes

As mentioned earlier, the efficiency of the LSM Tree lies in its ability to capitalize on sequential writes to memory. As new data is written to the database, it is first stored in the memtable, residing in memory. Memory access is significantly faster than disk access, enabling this to be a rapid process. The memtable is always chosen to be a structure that will maintain the ordering of its elements (such as a red-black tree). We’ll discuss why in a moment. 

As seen above, LSM Trees accumulate data in the sorted memtable. That is, until a certain pre-configured threshold is reached. Once the threshold is met, the contents of the memtable are flushed to disk in a sequential manner, appending to an immutable ‘SSTable’ that stores the data in contiguous memory. 

Note: the elements in the tree can be interpreted as a ‘primary key’ of a database row. Perhaps the illustration below better describes the relation between the tree and database indexing. Logically, it does not make a difference whether we include the values in our illustrations, as LSM trees affect keys, not values. Below we have alphabetic keys corresponding to values V1, …, V6:

The SSTable is an on-disk data structure that stores key-value pairs, which are in order due to the memtable’s self-sorting nature, as shown below.

Above we see that the SSTable being written to is using consecutive (sequential) slots in memory. 

There is overhead associated with formulating a request to write to a certain location in memory, and this formulation happens numerous times for random writes and just once for sequential writes – hence the speed-up associated with sequential writes.

There are more technical details involved in this process (a proper in-depth introduction to sequential writes in SSDs should touch upon how memory is arranged into blocks and the lack of a notion of two blocks’ truly being “next to each other” physically). For this article, however, we don’t dive too deeply into computer architecture for fear of potentially missing out on the bigger picture.

Properties of SSTables

We’ve talked a lot about how quickly LSM trees write data, but let’s not forget that we ultimately want a database we can quickly read from as well. In this vein, there are a few interesting properties to note.

Handling Duplicate Keys

There are two scenarios regarding duplicate keys that we care about – when duplicates are in the memtable vs. in the SSTables.

We can think about the first-case as a data-structures and algorithms problem. In particular, in the process of adding a new element to a sorted binary tree, it requires only constant extra time to identify whether the key is already present in the tree, and, if so, replace it with the new key-value pair.

In the second case, suppose a key-value pair <key, val1> ends up being written to one SSTable and <key, val2> to another. In this case, the true value is whichever was added last (i.e. is in the most recent SSTable). Therefore, we’d need to go through the SSTables reverse-chronologically, if we were searching for an element.

Due to the sortedness of the SSTable, we can search for a key in a given SSTable in log-time. We can use an optimization based on Bloom Filters to filter out entire tables that definitely don’t contain our key. Another optimization is a Sparse Index, which can rule out large chunks of a given table (though the search will still have logarithmic time complexity).

Deletion

We don’t delete anything from SSTables ever, as this would introduce random writes. SSTables are immutable once created. Instead, we associate a special value to the to-be-deleted key, called a tombstone, which we use as a semantically equivalent representation of a key’s deletion.

Ever-expanding

As described, since there is no concept of deletion, SSTables are always getting bigger, even though a lot of the keys may have duplicate/stale values. To curb this, there are often background threads running a process called compaction, which take k different SSTables and create a new result SSTable containing only the most recent version of a value. Since SSTables are sorted structures by nature, this is equivalent to the algorithmic problem of combining k sorted arrays, while avoiding duplicates. After the new SSTable is created, the disk space housing the k original ones can be freed.

Preparing for interviews? Just interested in learning?

Get system design articles written by Staff+ engineers delivered straight to your inbox!