Present the new storage engine we are developing for Couchbase. Explain the motivations for the new storage engine: why we need one, based on observations from customer use cases. Give an overview of two popular storage structures, followed by a more detailed overview of ForestDB, the codename for our next-generation storage engine. Review the main index structures and key features of ForestDB. Show the performance evaluations we did with ForestDB and compare the results with popular storage engines in the market today. Then I will talk about the various optimizations we do for SSDs and explain why they are critical.
Modern web and mobile applications at global scale need to support, in some cases, hundreds of millions of users and devices, and these users and devices generate large volumes of unstructured data at an ever-increasing rate. These applications need to maintain high performance and very fast response times as the database grows larger and larger. To support massive databases and large numbers of concurrent users, we have been working on various innovations in Couchbase, leveraging fast memory, fast networks, and fast storage. The storage engine is one of the most critical components for meeting high performance and scalability requirements.
B+Tree is one of the most popular storage structures and is used by many database systems. It is basically a generalization of the binary search tree many of us learned in introductory computer science classes. Each node in a B+Tree contains 2 or more key-value or key-pointer pairs. As the diagram shows, there are two types of nodes: index nodes and leaf nodes. Index nodes contain a sorted list of keys, with pointers to the next level of index nodes. Leaf nodes contain the sorted list of key-value pairs generated by the application. It is important to note that the ENTIRE key string is maintained in all the nodes. This has performance and storage implications, as we will see in the next slide.
Because we store the entire key in each index node, larger keys incur significant storage overhead in the index nodes. Also, as the keys get larger, the number of key-pointer pairs we can fit in a node becomes smaller, and the depth of the tree grows as more data is loaded. Performance degrades correspondingly, as one needs to traverse more nodes to get to a particular key-value pair in a leaf node. We observed this storage overhead and performance issue with our customers. Before CB 2.0, we used SQLite, a general-purpose B+Tree engine that updates data in place. A SQLite database suffers from disk fragmentation when there are lots of inserts and deletes. Fragmentation means the data blocks are spread out sub-optimally on disk and the system has to do many random seeks to commit changes to the database, leading to slower response times. To address this problem, we developed Couchstore, which uses an append-only approach. This allows our system to write changes sequentially to the end of the file very efficiently. Other variants of B+Tree were introduced as well, and we will review the popular LSM storage structure used by the LevelDB, RocksDB, and WiredTiger storage engines.
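To make the effect of key size concrete, here is a rough back-of-the-envelope sketch. The 4 KB node size, 8-byte pointers, fully packed nodes, and one billion keys are all assumptions chosen for illustration; they are not measurements from Couchbase or SQLite.

```python
def btree_height(num_keys, node_size, key_size, pointer_size=8):
    """Estimate how many levels a B+Tree needs when every node stores whole keys.

    Assumes fully packed nodes; node_size and pointer_size are illustrative.
    """
    fanout = node_size // (key_size + pointer_size)  # key-pointer pairs per node
    if fanout < 2:
        raise ValueError("node too small to hold two entries")
    height, capacity = 1, fanout
    while capacity < num_keys:  # add a level until the tree can address all keys
        capacity *= fanout
        height += 1
    return fanout, height


for key_size in (8, 64, 512):
    fanout, height = btree_height(10**9, 4096, key_size)
    print(f"key={key_size:>3} bytes  fanout={fanout:>3}  levels={height}")
```

Under these assumptions, growing the key from 8 to 512 bytes cuts the fanout from 256 to 7 and raises the number of levels to traverse from 4 to 11.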
The LSM storage structure improves write performance by appending updates and new data to the end of the file, and by deferring and batching index changes, which are maintained in memory.
Reads may require traversing multiple trees, so we need to merge periodically to combine adjacent trees.
LSM has the same limitations as B+Tree for long keys.
Because current storage engines fall short of the requirements of modern applications, we set out to develop the next-generation storage engine.
We designed the new storage engine to handle variable-length and fixed-length keys that may be long, and to be optimized for SSDs as appropriate. We want a compact index structure to reduce storage overhead and improve read and write performance. And we want the storage engine to index a large number of keys regardless of the key patterns.
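The read and merge behaviour can be illustrated with a toy model. This is a minimal sketch, not how LevelDB, RocksDB, or WiredTiger are implemented: the class name `TinyLSM`, the in-memory dictionary, and the sorted lists standing in for on-disk trees are all hypothetical.

```python
import bisect


class TinyLSM:
    """Toy LSM: writes go to memory, full batches are flushed as sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}          # recent writes, kept in memory
        self.runs = []              # flushed sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush the whole batch at once as one immutable sorted run.
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:       # may have to search several runs
            # (key,) sorts just before (key, value), so this finds the key's slot.
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def merge(self):
        """Combine all runs into one so reads search a single structure."""
        merged = {}
        for run in reversed(self.runs):   # oldest first, newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []
```

Before `merge()` a lookup may touch every run; afterwards it touches one, which is the reason for merging periodically.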
We call our next-generation storage engine ForestDB. Its main index structure is a hierarchical B+Tree-based trie (HB+Trie), originally presented at ACM SIGMOD in 2011 by Jung-Sang Ahn. We ran performance benchmarks, and they show better read and write performance with less storage overhead compared to other KV storage engines. We released 1.0 Beta last October and are working hard toward 1.0 GA as part of the upcoming CB 4.0 release.
This is a list of key features of ForestDB. Instead of going through the list of features, I would like to go over … (next slide)
… the main index structure and the key designs of ForestDB, so that you understand how we achieve a compact index structure and high read/write performance.
ForestDB is based on HB+Trie. A trie is basically a prefix tree. The key difference between HB+Trie and a traditional trie is that each node in the HB+Trie contains a separate B+Tree. What does that mean? To add a key-value pair to the HB+Trie, we first split the key into fixed-size chunks. In this example, we split the long key into 4-byte chunks. We use the first chunk to construct the first level of B+Tree, the second chunk to construct the second level of B+Tree, and so on. So the leaf nodes of a B+Tree point either to other B+Trees or to the documents. We also maintain pointers across the documents, which allow us to scan the documents in lexicographical order; this is very useful for range scans.
HB+Trie supports prefix compression. Let me explain it by example.
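The chunking step on its own is simple. A minimal sketch, where the function name `split_into_chunks` and the sample key are made up for illustration:

```python
def split_into_chunks(key: bytes, chunk_size: int = 4):
    """Split a key into fixed-size chunks; the last chunk may be shorter."""
    return [key[i:i + chunk_size] for i in range(0, len(key), chunk_size)]


# Each chunk would be used as the lookup key for one level of B+Tree.
print(split_into_chunks(b"user:0001:profile"))
```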
In this example, the chunk size is one byte. Initially the tree is empty, and we want to insert the first key, ‘aaaa’.
Since this key can be distinguished by its first chunk, ‘a’, we create the first level of B+Tree using the first chunk.
Next we insert the key ‘bbbb’. Again it is distinguishable by the first chunk, ‘b’, so we can index this key using the first level of B+Tree.
Now the next key is ‘aaab’. The first chunk, ‘a’, cannot distinguish the new key ‘aaab’ from the existing one, ‘aaaa’.
The first distinguishable chunk is the 4th chunk.
So we create the next level of B+Tree using the 4th chunk. We store the common prefix ‘aa’ in the second-level B+Tree, so we do not have to create separate B+Trees for the second and third chunks. This is known as prefix compression.
Next we want to insert ‘bbcd’. Again it cannot be distinguished by the first chunk.
The first distinguishable chunk is the 3rd chunk.
So we create the next level of B+Tree using the 3rd chunk, and store the second chunk, ‘b’, as the common prefix.
This example shows you two key features of the HB+Trie structure that reduce index storage by not having to store the entire key in the B+Tree nodes: splitting the keys into fixed-size chunks, and further reducing storage using prefix compression.
It is critical to maintain a compact index structure. If the index consumes a lot of space, it uses more memory and more disk space, and is less efficient. Let me show you an example.
We showed you how prefix compression helps reduce the storage required when keys have common prefixes, which can happen in secondary index keys, for example when indexing data by zip code.
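The walkthrough can be reproduced with a simplified model. This is only a sketch under stated assumptions, not ForestDB's implementation: a Python dict stands in for each B+Tree, the chunk size is one character, all keys have the same length, and the names `Node`, `insert`, and `lookup` are hypothetical.

```python
class Node:
    """One trie level: a 'B+Tree' (here a dict) keyed on a single chunk."""

    def __init__(self, depth, skipped=""):
        self.depth = depth        # index of the chunk this level is keyed on
        self.skipped = skipped    # common-prefix chunks skipped above this level
        self.entries = {}         # chunk -> (key, value) document or child Node


def insert(node, key, value):
    chunk = key[node.depth]
    entry = node.entries.get(chunk)
    if entry is None:                       # chunk is unique at this level
        node.entries[chunk] = (key, value)
    elif isinstance(entry, tuple):          # collides with an existing document
        old_key = entry[0]
        if old_key == key:
            node.entries[chunk] = (key, value)
            return
        d = node.depth + 1
        while old_key[d] == key[d]:         # find the first distinguishable chunk
            d += 1
        child = Node(d, skipped=key[node.depth + 1:d])
        child.entries[old_key[d]] = entry
        child.entries[key[d]] = (key, value)
        node.entries[chunk] = child
    else:                                   # entry is a child level
        child, start = entry, node.depth + 1
        seg = key[start:child.depth]
        if seg == child.skipped:
            insert(child, key, value)
            return
        off = 0                             # key diverges inside the skipped prefix
        while seg[off] == child.skipped[off]:
            off += 1
        mid = Node(start + off, skipped=seg[:off])
        mid.entries[child.skipped[off]] = child
        mid.entries[key[start + off]] = (key, value)
        child.skipped = child.skipped[off + 1:]
        node.entries[chunk] = mid


def lookup(node, key):
    while True:
        entry = node.entries.get(key[node.depth])
        if entry is None:
            return None
        if isinstance(entry, tuple):
            return entry[1] if entry[0] == key else None
        if key[node.depth + 1:entry.depth] != entry.skipped:
            return None
        node = entry


root = Node(0)
for k in ("aaaa", "bbbb", "aaab", "bbcd"):
    insert(root, k, "doc-" + k)
```

After these four inserts, the level under ‘a’ is keyed on the 4th chunk with skipped prefix ‘aa’, and the level under ‘b’ is keyed on the 3rd chunk with skipped prefix ‘b’, matching the slides.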
HB+Trie tree works well even when keys are randomly distributed…
In ForestDB, we maintain two main indexes. We use the HB+Trie to index the keys from applications, as we explained earlier, and we also maintain a sequence number index. The sequence index is used to index the sequence numbers assigned to all mutations processed by ForestDB: inserts, deletes, and updates. ForestDB maintains a global sequence number for each KV instance; we increment the global sequence number for each mutation and assign that sequence number to the mutation. So an application can retrieve a value by its key or by its sequence number.
Let me explain the concept of write-ahead logging (WAL). With write-ahead logging, we can simply append changes to data records to the data file and defer the index changes. We want to avoid updating the main indexes every time we do a batch commit, as updating the indexes is expensive: it consumes storage and incurs additional I/O operations. Using write-ahead logging, we can improve performance by batching updates and writing them sequentially to the end of the DB file, and reduce the index update overhead by maintaining the WAL indexes in memory. So we keep the changes to the indexes in memory, and periodically we append the index changes sequentially to the DB file.
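The two-index idea can be sketched as follows. This is a toy model, not the ForestDB API: the class and method names are hypothetical, dicts stand in for the HB+Trie and the sequence index, and dropping the superseded sequence entry on update is an assumption of this sketch.

```python
class KVInstance:
    """Toy KV instance with a key index and a sequence-number index."""

    def __init__(self):
        self.seqnum = 0     # global sequence number for this instance
        self.by_key = {}    # key -> (seqnum, value); stands in for the HB+Trie
        self.by_seq = {}    # seqnum -> key; stands in for the sequence index

    def _mutate(self, key, value):
        self.seqnum += 1                     # every mutation gets a new number
        old = self.by_key.get(key)
        if old is not None:
            del self.by_seq[old[0]]          # assumption: keep only the latest
        self.by_key[key] = (self.seqnum, value)
        self.by_seq[self.seqnum] = key
        return self.seqnum

    def set(self, key, value):
        return self._mutate(key, value)

    def delete(self, key):
        return self._mutate(key, None)       # None marks a deleted document

    def get_by_key(self, key):
        entry = self.by_key.get(key)
        return None if entry is None else entry[1]

    def get_by_seq(self, seqnum):
        key = self.by_seq.get(seqnum)
        return None if key is None else (key, self.by_key[key][1])
```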
To retrieve a value based on a key, we first inspect if the key exists in the in-memory WAL index. If it does, we use the offset to look up the value from the file. If the key is not cached in the WAL index, then we traverse the HB+Trie tree to look up the offset and retrieve the data.
So write-ahead logging not only improves write performance; it helps with reads as well.
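The write and read paths just described can be sketched like this. It is a simplified model under assumptions: a Python list stands in for the append-only DB file, a dict stands in for the HB+Trie, and the names `AppendOnlyStore`, `flush_wal`, and `wal_threshold` are made up for illustration.

```python
class AppendOnlyStore:
    """Toy append-only store with an in-memory WAL index."""

    def __init__(self, wal_threshold=3):
        self.log = []            # stands in for the append-only DB file
        self.wal_index = {}      # in memory: key -> offset of recent writes
        self.main_index = {}     # stands in for the HB+Trie
        self.wal_threshold = wal_threshold

    def set(self, key, value):
        self.log.append((key, value))            # sequential append only
        self.wal_index[key] = len(self.log) - 1  # index change stays in memory
        if len(self.wal_index) > self.wal_threshold:
            self.flush_wal()

    def flush_wal(self):
        """Apply the batched index changes to the main index in one pass."""
        self.main_index.update(self.wal_index)
        self.wal_index.clear()

    def get(self, key):
        offset = self.wal_index.get(key)         # 1. check the WAL index first
        if offset is None:
            offset = self.main_index.get(key)    # 2. fall back to the main index
        if offset is None:
            return None
        return self.log[offset][1]
```

A recently written key is found in the in-memory WAL index without traversing the main index at all, which is why the WAL helps reads as well as writes.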
Next let’s talk about Block Cache
ForestDB supports its own block cache, to provide more flexible cache management. It manages HB+Trie nodes, B+Tree nodes, and user data on a block basis. With its own block cache management, ForestDB can prioritize index node blocks over data blocks. Basically we want to keep more index nodes in the cache to support faster index traversal; otherwise tree traversal may require more disk I/Os. We also have the option to bypass the OS page cache, which is shared by all user-level processes. In our performance evaluations, bypassing the OS page cache gives us better performance, especially for SSDs.
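One simple way to favour index blocks is to evict data blocks first. The sketch below illustrates that idea only; the strict-priority policy, the class name `BlockCache`, and the two LRU lists are assumptions of this example and not a description of ForestDB's actual cache design.

```python
from collections import OrderedDict


class BlockCache:
    """Toy block cache that evicts data blocks before index blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.index_blocks = OrderedDict()   # LRU order, oldest first
        self.data_blocks = OrderedDict()

    def get(self, block_id):
        for lru in (self.index_blocks, self.data_blocks):
            if block_id in lru:
                lru.move_to_end(block_id)   # mark as most recently used
                return lru[block_id]
        return None                         # miss: caller reads from disk

    def put(self, block_id, payload, is_index):
        self.index_blocks.pop(block_id, None)
        self.data_blocks.pop(block_id, None)
        target = self.index_blocks if is_index else self.data_blocks
        target[block_id] = payload
        while len(self.index_blocks) + len(self.data_blocks) > self.capacity:
            # Evict the least recently used data block if any exist;
            # index blocks are evicted only when no data blocks remain.
            victim = self.data_blocks if self.data_blocks else self.index_blocks
            victim.popitem(last=False)
```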
I am going to skip this slide, which gives the details on how we manage the Block Cache
Note: an AVL tree is a self-balancing binary search tree.
Let’s talk about compaction…
As I mentioned before, ForestDB uses an append-only file structure to improve write performance. This means that we have to perform compaction to garbage-collect deleted and stale data blocks from disk. We provide two compaction options. The first is manual compaction, which is invoked via an API. This allows the user to monitor the level of stale data and compact the file as needed. The other is automatic compaction, which is done by ForestDB itself.
You can choose manual or automatic compaction based on your application requirements.
It is interesting to note that compaction does not block writes. The system keeps performing reads and writes concurrently while compaction is going on, because we can still write to the WAL section of the new file.
The diagram shows a typical database storage stack, which goes through the OS file system layer to read from and write to the storage devices. The OS file system cache is shared across all user-level applications, and its caching policy may not be tuned for the database system. In addition, current enterprise SSD systems provide ultra-high performance, but it is difficult to take advantage of that because one has to go through the OS file system.
Because of these limitations, advanced database storage stacks bypass the file system cache, both to provide more flexible cache management and to avoid the overhead of the file system cache. In this case, the database system manages the storage and caching directly. Many database systems use this architecture, and this is what we implement for ForestDB.
Compaction adds significant overhead. Even though ForestDB has compact index structures, compaction is still expensive. Basically one has to read all the blocks, discard stale blocks, and write all valid blocks to a new file. We want to improve compaction performance so that it does not impact regular read/write operations.
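The basic copy-based compaction can be sketched in a few lines. This is a simplified model under assumptions: a list of (key, value) records stands in for the append-only file, a dict of key to offset stands in for the index, and the function name `compact` is hypothetical.

```python
def compact(old_log, index):
    """Copy only live records into a new log and rebuild the index.

    old_log: list of (key, value) records in append order.
    index:   dict mapping each key to the offset of its latest record.
    """
    new_log, new_index = [], {}
    for offset, (key, value) in enumerate(old_log):   # read every block
        if index.get(key) == offset:                  # still referenced: live
            new_index[key] = len(new_log)
            new_log.append((key, value))              # rewrite it to the new file
        # otherwise the record is stale and is simply dropped
    return new_log, new_index


old_log = [("a", 1), ("b", 2), ("a", 3)]
new_log, new_index = compact(old_log, {"a": 2, "b": 1})
print(new_log, new_index)
```

Even in this toy version, the cost is visible: every record is read, and every live record is written again.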
We recently prototyped SWAT-based compaction optimization, a collaboration between Couchbase and Professor Lee in South Korea. SSDs do not support in-place updates. When a page is written, the SSD layer creates a new copy of the page, and the logical address in the file system is updated to point to the new physical address. The logical-to-physical address mapping is managed by the Flash Translation Layer (FTL).
With SWAT, there is no need to write a new file for compaction; we just create new logical-to-physical mappings to the valid blocks. This extends the FTL with the SWAT (Swap and Trim) interface.
This is just a prototype, and we expect further improvements. We are also working with major SSD vendors to support the SWAT interface.
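The remapping idea can be pictured with a toy model of the logical-to-physical table. This is purely illustrative: `ToyFTL` and `remap_and_trim` are hypothetical names, and the sketch does not reproduce the actual SWAT interface or any real SSD firmware behaviour.

```python
class ToyFTL:
    """Toy flash translation layer: logical block address -> physical page."""

    def __init__(self):
        self.mapping = {}
        self.physical_writes = 0    # counts pages actually written to flash

    def write(self, lba):
        # Every write lands on a fresh physical page (no in-place update).
        self.mapping[lba] = self.physical_writes
        self.physical_writes += 1

    def remap_and_trim(self, moves, trims):
        """Point new logical addresses at existing pages, then drop stale ones."""
        for src, dst in moves:
            self.mapping[dst] = self.mapping.pop(src)   # no data is copied
        for lba in trims:
            self.mapping.pop(lba, None)                 # page can be reclaimed


ftl = ToyFTL()
for lba in range(4):
    ftl.write(lba)
# Suppose blocks 1 and 3 are valid and blocks 0 and 2 are stale.
ftl.remap_and_trim(moves=[(1, 100), (3, 101)], trims=[0, 2])
print(ftl.mapping, ftl.physical_writes)
```

In this model the valid blocks become reachable at their new logical addresses without any additional physical writes, which is the saving the approach is after.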
SSD has parallel channels. If you look inside enterprise SSD, it has its own CPU, and cache, and multiple IO
Example – multiple async IOs for secondary indexes queries
Can have multiple chained deltas…at some point will compact into single object
Q & A.
Support for raw file: In the future we plan to implement our own volume manager that works on unformatted disk volumes.
WAL index flushing: Right now this is done as part of the thread performing the set operation that causes the WAL index size to exceed a threshold. We plan to implement a separate system thread to flush the WAL index.
Compression: Yes, we currently use Snappy compression on index and data nodes. Compression is optional. We will explore other compression algorithms in the future.
Number of threads: At some point, disk I/O utilization is maximized and adding more threads will not help with throughput.
Couchbase Live Europe 2015: Storing Big Data: ForestDB the Couchbase Next Generation Storage Engine
A Next Generation Storage Engine for NoSQL Database Systems
Software Architect, Couchbase Inc.
VP Product Management, Couchbase Inc.