A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

The B+-tree has been used as one of the main index structures in the database field for more than four decades. However, typical B+-tree implementations show scalability and performance issues as modern global-scale Web and mobile applications generate volumes of data that have not been seen before. Various key-value storage engines based on variants of the B+-tree, or on alternatives such as the log-structured merge-tree (LSM-tree), have been proposed to address these limitations. At Couchbase, we have also been working on a new key-value storage engine that provides high scalability and performance, and recently released the beta version of ForestDB, whose main index structure is based on the Hierarchical B+-Tree-based Trie (HB+-Trie). In this presentation, we introduce ForestDB and discuss why it is well suited to modern big data applications. We also explain various optimizations planned for ForestDB, especially for solid-state drives (SSDs).


  1. Preview: A Next Generation Storage Engine for NoSQL Database Systems. Chiyoung Seo, Software Architect, Couchbase Inc.
  2. Contents
     - Why do we need a new KV storage engine?
     - ForestDB: HB+-Trie, block aligning, Write Ahead Logging (WAL), block buffer cache, evaluation
     - Optimizations for Solid-State Drives (SSDs): async I/O to exploit the parallel I/O capabilities of SSDs, a volume manager inside ForestDB, lightweight and I/O-efficient compaction, Append / Prepend support
     - Summary
  3. Why do we need a new KV storage engine?
  4. Modern Web / Mobile Applications
     - Operate on huge volumes of unstructured data
     - A significant amount of new data is constantly generated by hundreds of millions of users or devices
     - Still require high performance and scalability in managing their ever-growing databases
     - The underlying storage engine is one of the most critical parts of a database system for providing high performance and scalability
  5. B+Tree
     - The main storage index structure in the database field
     - A generalization of the binary search tree
     - Each node consists of two or more {key, value (or pointer)} pairs
     - Fanout (or branch) degree: the number of KV pairs in a node
     - Node size is generally fitted to a multiple of the page size
  6. B+Tree (structure diagram). Notation: Ki is the i-th smallest key in a node, Pi is the child pointer corresponding to Ki, Vi is the value corresponding to Ki, and f is the fanout degree. The root and index (non-leaf) nodes hold {key, pointer} pairs; leaf nodes hold {key, value} pairs.
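
To make the node layout in slide 6 concrete, here is a minimal C++ sketch of a B+Tree node keyed as in the diagram; the struct and field names are illustrative, not ForestDB's actual types.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative only: a B+Tree node with fanout f. Index (non-leaf) nodes
// pair keys K1..Kf with child pointers P1..Pf (block offsets); leaf nodes
// pair keys with value/document offsets V1..Vf.
struct BPlusTreeNode {
    bool leaf;                          // true for leaf nodes
    std::vector<std::string> keys;      // K1..Kf, kept in sorted order
    std::vector<uint64_t> children;     // Pi: child block offsets (index nodes only)
    std::vector<uint64_t> values;       // Vi: value/document offsets (leaf nodes only)
    uint64_t next_leaf;                 // right-sibling link for range scans
};
```
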
  7. B+Tree Limitations
     - Not suitable for indexing variable-length or fixed-length long keys
     - Significant space overhead, as entire key strings are indexed in non-leaf nodes
     - Tree depth grows quickly as more data is loaded
     - I/O performance degrades significantly as the data size gets bigger
     - Several variants of the B+Tree were proposed: LevelDB (Google), RocksDB (Facebook), TokuDB (Tokutek)
  8. Goals
     - A fast and scalable index structure for variable-length or fixed-length long keys
     - Targeting block I/O storage devices: not only SSDs but also legacy HDDs
     - Less storage space overhead
     - Reduced write amplification
     - Efficient regardless of the key pattern: works well both for keys sharing a common prefix and for keys that do not
  9. ForestDB
  10. ForestDB
     - Key-value storage engine developed by the Couchbase Caching / Storage team
     - Its main index structure is built from the Hierarchical B+-Tree based Trie, or HB+-Trie. HB+-Trie was originally presented at the ACM SIGMOD 2011 Programming Contest by Jung-Sang Ahn, who works at Couchbase (http://db.csail.mit.edu/sigmod11contest/sigmod_2011_contest_poster_jungsang_ahn.pdf)
     - Significantly better read and write performance with less storage overhead
     - Supports various server OSs (CentOS, Ubuntu, Debian, Mac OS X, Windows) and mobile OSs (iOS, Android)
     - 1.0 beta was released last week
  11. Main Features
     - Multi-Version Concurrency Control (MVCC) with an append-only storage model
     - Write-Ahead Logging (WAL)
     - A value can be retrieved by its sequence number or disk offset in addition to its key
     - Custom compare function to support a customized key order
     - Snapshot support to provide different views of the database
     - Rollback to revert the database to a specific point
     - Ranged iteration by keys or sequence numbers
     - Transactional support with read_committed or read_uncommitted isolation level
     - Manual or auto compaction, configured per KV instance
  12. ForestDB: Main Index Structure
  13. HB+Trie (Hierarchical B+Tree based Trie)
     - A trie (prefix tree) whose nodes are B+Trees
     - A key is split into a list of fixed-size chunks (sub-strings of the key); for example, the variable-length key "a83jgls83jgo29a…" with a 4-byte chunk size becomes the chunks "a83j", "gls8", "3jgo", …
     - The first chunk is used to search the root B+Tree, the second chunk the next-level B+Tree, and so on until a document is reached; traversal follows lexicographical key order
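
A small sketch of the chunking step in slide 13: a variable-length key is cut into fixed-size chunks, and each trie level consumes one chunk. The helper name and the handling of a short final chunk are simplifications, not ForestDB's implementation.

```cpp
#include <string>
#include <vector>

// Split a key into fixed-size chunks; the last chunk may be shorter.
std::vector<std::string> split_into_chunks(const std::string& key, size_t chunk_size = 4) {
    std::vector<std::string> chunks;
    for (size_t pos = 0; pos < key.size(); pos += chunk_size) {
        chunks.push_back(key.substr(pos, chunk_size));
    }
    return chunks;
}
// "a83jgls83jgo29a" with 4-byte chunks -> {"a83j", "gls8", "3jgo", "29a"}
```
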
  14. Prefix Compression. As in the original trie, each node (B+Tree) is created on demand (except for the root node). Example with a chunk size of 1 byte: start with an empty root B+Tree keyed by the 1st chunk, and insert 'aaaa'.
  15. Prefix Compression. After inserting 'aaaa', the root B+Tree (keyed by the 1st chunk) holds a single entry 'a' pointing directly to the document 'aaaa'; the key is distinguishable by its first chunk 'a'.
  16. Prefix Compression. Insert 'bbbb': the root gains an entry 'b' pointing to 'bbbb', since it is distinguishable by its first chunk 'b'.
  17. Prefix Compression. Insert 'aaab': it cannot be distinguished from 'aaaa' using the first chunk 'a' alone.
  18. Prefix Compression. The first chunk at which 'aaaa' and 'aaab' can be distinguished is the 4th chunk.
  19. Prefix Compression. A new B+Tree keyed by the 4th chunk is created, storing the skipped common prefix 'aa'; it holds entries 'a' -> 'aaaa' and 'b' -> 'aaab', and the root entry 'a' now points to this sub-tree.
  20. Prefix Compression. Insert 'bbcd': it cannot be distinguished from 'bbbb' using the first chunk 'b' alone.
  21. Prefix Compression. The first chunk at which 'bbbb' and 'bbcd' can be distinguished is the 3rd chunk.
  22. Prefix Compression. A new B+Tree keyed by the 3rd chunk is created, storing the skipped common prefix 'b'; it holds entries 'b' -> 'bbbb' and 'c' -> 'bbcd', and the root entry 'b' now points to this sub-tree.
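
The on-demand node creation in slides 14-22 hinges on finding the first chunk at which two colliding keys differ; a new B+Tree is created at that chunk's level, and the chunks skipped in between are stored as its compressed prefix. A minimal sketch with hypothetical helper names (1-based chunk index, to match the slides):

```cpp
#include <string>

// Return the chunk at a byte position, or "" past the end of the key.
static std::string chunk_at(const std::string& k, size_t pos, size_t len) {
    return pos < k.size() ? k.substr(pos, len) : std::string();
}

// 1-based index of the first chunk where two keys differ (chunk size in bytes).
// With chunk_size = 1: "aaaa" vs "aaab" -> 4, "bbbb" vs "bbcd" -> 3.
size_t first_distinguishable_chunk(const std::string& a, const std::string& b,
                                   size_t chunk_size = 1) {
    size_t longest = a.size() > b.size() ? a.size() : b.size();
    for (size_t pos = 0; pos < longest; pos += chunk_size) {
        if (chunk_at(a, pos, chunk_size) != chunk_at(b, pos, chunk_size)) {
            return pos / chunk_size + 1;
        }
    }
    return longest / chunk_size + 1;   // keys are identical
}
// The new B+Tree is created at this chunk's level; the chunks skipped between the
// parent tree's chunk and this one become the stored prefix ('aa' in slide 19,
// 'b' in slide 22).
```
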
  23. Benefits
     - When keys are sufficiently long and uniformly random (e.g., UUIDs or hash values)
     - When keys have common prefixes (e.g., secondary index keys)
     - Example (chunk size = 2 bytes): inserting a83jfl2iejzm302k, dpwk3gjrieorigje, z9382h3igor8eh4k, 283hgoeir8goerha, 023o8f9o8zufisue creates root entries 'a8', 'dp', 'z9', '28', '02', each pointing directly to its document
  24. Benefits (continued)
     - The majority of keys can be indexed by the first chunk, so there will be only one B+Tree in the HB+Trie
     - We do not need to store and compare entire key strings
  25. Benefits (continued)
     - When the chunk size is n bits, up to 2^n keys can be indexed by the first chunk alone
     - n = 32 (4 bytes): 2^32 ≈ 4 billion keys; n = 64 (8 bytes): 2^64 ≈ 10^19 keys
  26. ForestDB Index Structures
     - ForestDB maintains two index structures over the same append-only DB file: the HB+Trie (key index) and a sequence B+Tree (sequence-number index, 8-byte integers)
     - Either index retrieves the file offset of a value, using the key or the sequence number respectively
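
Both indexes resolve a lookup to a file offset, and the document is then read from that offset in the append-only file. A hedged sketch with in-memory stand-ins for the HB+Trie, the sequence B+Tree, and the file; the names are illustrative, not ForestDB's API:

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <unordered_map>

struct Doc { std::string key; uint64_t seqnum; std::string value; };

// In-memory stand-ins: both indexes map to a file offset.
std::unordered_map<std::string, uint64_t> key_index;   // HB+Trie stand-in: key -> offset
std::map<uint64_t, uint64_t>              seq_index;   // sequence B+Tree stand-in: seqnum -> offset
std::unordered_map<uint64_t, Doc>         db_file;     // stand-in for the append-only file

std::optional<Doc> get_by_key(const std::string& key) {
    auto it = key_index.find(key);
    if (it == key_index.end()) return std::nullopt;
    return db_file.at(it->second);                      // read the doc at that offset
}

std::optional<Doc> get_by_seq(uint64_t seqnum) {
    auto it = seq_index.find(seqnum);
    if (it == seq_index.end()) return std::nullopt;
    return db_file.at(it->second);
}
```
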
  27. ForestDB: Write Ahead Logging
  28. Write Ahead Logging
     - Append updates first, and update the main indexes later
     - Main purposes: maximize write throughput through sequential, append-only writes, and reduce the number of index nodes to be written through batched updates
     - A DB header is appended for every commit; the DB file is a sequence of documents, index node blocks, and headers
     - The WAL indexes are in-memory hash tables: an ID index mapping h(key) to a file offset, and a sequence-number index mapping h(seq no) to a file offset
  29. Write Ahead Logging: key query path
     1. Look up the WAL index first
     2. On a hit, return immediately
     3. On a miss, look up the HB+Trie (or the sequence B+Tree)
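
A minimal sketch of that query path, assuming the in-memory WAL ID index is a hash table from key to file offset and falling back to a stubbed main-index lookup; names are illustrative:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// In-memory WAL ID index: updated on every append, flushed into the
// HB+Trie / sequence B+Tree in batches. (The slides key it by h(key);
// a plain hash map of the key itself is used here for simplicity.)
std::unordered_map<std::string, uint64_t> wal_id_index;   // key -> offset of latest append

// Stub standing in for the on-disk HB+Trie lookup.
std::optional<uint64_t> hbtrie_lookup(const std::string& /*key*/) {
    return std::nullopt;
}

// Key query path from slide 29.
std::optional<uint64_t> find_offset(const std::string& key) {
    auto it = wal_id_index.find(key);     // 1. check the WAL index first
    if (it != wal_id_index.end())
        return it->second;                // 2. hit: return immediately
    return hbtrie_lookup(key);            // 3. miss: fall back to the main index
}
```
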
  30. ForestDB: Block Cache
  31. Block Cache
     - ForestDB has its own block cache layer, managed on a per-block basis
     - Index node blocks get higher priority than data blocks
     - Provides an option to bypass the OS page cache
     - The cache sits between the index structures (HB+Trie or sequence index, WAL index) and the DB file: block reads and writes go through the cache, and file I/O happens only on a cache miss or eviction
  32. Block Cache (structures)
     - A global LRU list of the database files that are currently open
     - A separate AVL tree per file to keep track of dirty blocks
     - A separate hash table per file keyed by block id, whose value points to the cache entry in either the clean LRU list or the AVL tree
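
A hedged sketch of those per-file structures: a hash table from block id to cache entry, a clean-block LRU list, and a dirty-block structure ordered by block id (the slides use an AVL tree; std::map plays that role here). Names and layout are illustrative, not ForestDB's code:

```cpp
#include <cstdint>
#include <list>
#include <map>
#include <unordered_map>
#include <vector>

constexpr size_t BLOCK_SIZE = 4096;

struct CacheEntry {
    uint64_t bid;                 // block id within the file
    std::vector<char> data;       // block contents (BLOCK_SIZE bytes)
    bool dirty;
};

// Per-file cache state; the files themselves sit on a global LRU list (not shown).
struct FileCache {
    std::unordered_map<uint64_t, CacheEntry*> table;  // bid -> entry (clean or dirty)
    std::list<CacheEntry*> clean_lru;                 // clean blocks in LRU order; evict from the tail
    std::map<uint64_t, CacheEntry*> dirty;            // dirty blocks ordered by bid, so a flush
                                                      // can write them out in increasing offset order
};
```
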
  33. ForestDB: Compaction
  34. Compaction
     - Manual compaction: performed by calling the public compact API
     - Daemon compaction: a single daemon thread inside ForestDB manages compaction automatically
     - Compaction interleaves with writers: while a compaction task is running, a writer thread can still write dirty items into the WAL section of the new file
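
Conceptually, compacting an append-only file is a scan-and-copy of the live document versions into a new file, after which the new file replaces the old one. A simplified sketch of that idea, not ForestDB's daemon implementation:

```cpp
#include <map>
#include <string>

struct Doc { std::string key; std::string body; bool deleted; };

// Conceptual sketch only: walk the main index of the current file (which points
// at the latest version of every document), append those live versions to a new
// file, then drop the old file. Stale versions are garbage-collected simply
// because nothing copies them.
void compact(const std::map<std::string, Doc>& current_index,   // key -> latest version
             std::map<std::string, Doc>& new_file) {            // stand-in for the new DB file
    for (const auto& [key, doc] : current_index) {
        if (!doc.deleted) {
            new_file.emplace(key, doc);
        }
    }
    // In ForestDB, a writer thread can keep appending to the new file's WAL
    // section while this copy is in progress, so writes are not blocked.
}
```
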
  35. ForestDB: Evaluation
  36. ForestDB DGM Performance
     - Evaluation environment: 64-bit machine running CentOS 6.5, Intel Xeon 2.00 GHz CPU (6 cores, 12 threads), 32 GB RAM, Crucial M4 SSD
     - Data: 32-byte keys and 1 KB values; 100M items loaded; 100 GB of logical data in total
  37. KV Storage Engine Configurations
     - LevelDB: compression disabled; write buffer size 256 MB (initial load), 4 MB (otherwise); buffer cache size 8 GB
     - RocksDB: compression disabled; write buffer size 256 MB (initial load), 4 MB (otherwise); maximum of 8 background compaction threads, 8 background memtable flushes, and 8 write buffers; buffer cache size 8 GB (uncompressed)
     - ForestDB: compression disabled; WAL size 4,096 documents; buffer cache size 8 GB
  38. Initial Load Performance (chart): ForestDB takes 3x ~ 6x less time than LevelDB and RocksDB.
  39. Initial Load Performance (chart): ForestDB incurs 4x less write overhead.
  40. Read-Only Performance (chart): operations per second vs. number of reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB; ForestDB is 2x ~ 5x faster.
  41. Write-Only Performance (chart): operations per second vs. write batch size (1, 4, 16, 64, 256 documents) for ForestDB, LevelDB, and RocksDB; ForestDB is 3x ~ 5x faster. Small batch sizes (e.g., fewer than 10 documents) are not usually common.
  42. Write Amplification (chart): write amplification (normalized to a single document size) vs. write batch size (1, 4, 16, 64, 256 documents); ForestDB shows 4x ~ 20x less write amplification than LevelDB and RocksDB.
  43. Mixed (Unrestricted) Workload Performance (chart): operations per second vs. number of reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB; ForestDB is 2x ~ 5x faster.
  44. Optimizations for Solid-State Drives
  45. OS File System Stack Overhead (diagram). In the typical database storage stack, the database storage engine sits on top of the OS file system (metadata management, volume manager, buffer/page cache), which talks to SSDs through the block I/O interface (SATA, PCI). In the advanced database storage stack, the storage engine talks to the block I/O interface directly, bypassing the OS file system.
  46. Advanced Database Storage Stack
     - Bypass the entire OS file system stack
     - Volume manager: operates on unformatted disks, maintains the list of valid blocks used by the database, and garbage-collects all invalid blocks
     - Buffer cache: allows different cache policies based on the application workload
  47. Database Compaction
     - Required for the append-only storage model, to garbage-collect stale data blocks
     - Uses significant disk I/O bandwidth: the entire database file is read and all valid blocks are written into a new file
     - Affects other performance metrics: regular read/write performance drops significantly during compaction
  48. SWAT-Based Compaction Optimization
     - A logical page can change its physical address in flash memory whenever it is overwritten
     - For this reason, the mapping table between logical block addresses (LBA) and physical block addresses (PBA) is maintained by the Flash Translation Layer (FTL)
  49. SWAT-Based Compaction Optimization
     - A new compacted file can be created simply by creating new LBA-to-PBA mappings that point to only the valid pages of the current DB file, instead of copying those pages
     - This requires extending the FTL with a new interface, SWAT (Swap and Trim)
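
A hedged sketch of the idea: rather than copying valid pages, compaction asks the FTL to remap the new file's logical addresses onto the physical pages that already hold the valid data, then trims the old mappings. The slides only name the interface SWAT (Swap and Trim); the functions below are a hypothetical rendering of it:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using LBA = uint64_t;   // logical block address (file system view)
using PBA = uint64_t;   // physical block address (flash view)

// Simplified FTL mapping table.
std::unordered_map<LBA, PBA> ftl_map;

// Hypothetical SWAT primitive: point a destination LBA at the physical page
// currently backing a source LBA, then invalidate (trim) the source mapping.
void swat(LBA dst, LBA src) {
    ftl_map[dst] = ftl_map.at(src);   // "swap": the new file's page reuses the old physical page
    ftl_map.erase(src);               // "trim": the old logical page no longer maps anywhere
}

// Compaction then becomes a sequence of swat() calls over the valid pages of
// the current DB file; no page contents are read or rewritten by the host.
void compact_via_swat(const std::vector<LBA>& valid_pages_in_old_file, LBA new_file_start) {
    LBA dst = new_file_start;
    for (LBA src : valid_pages_in_old_file) {
        swat(dst++, src);
    }
}
```
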
  50. SWAT-Based Compaction Optimization (results)
     - The SWAT interface was implemented on the OpenSSD development platform by adapting its FTL code
     - Total time taken for compactions was reduced by 17x
     - The number of compactions triggered was reduced by 4x
  51. Utilizing Parallel Channels on SSDs
     - Exploit an async I/O library (e.g., libaio) to better utilize the parallel I/O capabilities of SSDs
     - Quite useful when querying secondary indexes, where items satisfying a query predicate are located in multiple blocks on different channels
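
A minimal libaio sketch of that pattern: several reads are submitted together so the SSD can serve them from different channels, then completions are reaped in one call. Error handling is omitted, the file name is hypothetical, and buffers must be aligned (with the file opened O_DIRECT) for this to behave as intended:

```cpp
// compile with: g++ aio_reads.cpp -laio   (assumes the libaio development package)
#include <fcntl.h>
#include <libaio.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>
#include <vector>

int main() {
    const int NREADS = 8, BLK = 4096;
    int fd = open("data.fdb", O_RDONLY | O_DIRECT);          // hypothetical file name

    io_context_t ctx;
    std::memset(&ctx, 0, sizeof(ctx));
    io_setup(NREADS, &ctx);                                   // one context, NREADS in-flight ops

    std::vector<iocb> cbs(NREADS);
    std::vector<iocb*> cbps(NREADS);
    for (int i = 0; i < NREADS; ++i) {
        void* buf = nullptr;
        posix_memalign(&buf, BLK, BLK);                       // aligned buffer for O_DIRECT
        io_prep_pread(&cbs[i], fd, buf, BLK, (long long)i * BLK);  // read block i
        cbps[i] = &cbs[i];
    }
    io_submit(ctx, NREADS, cbps.data());                      // all reads go out together

    std::vector<io_event> events(NREADS);
    io_getevents(ctx, NREADS, NREADS, events.data(), nullptr); // wait for all completions

    io_destroy(ctx);
    close(fd);
    return 0;
}
```
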
  52. Append / Prepend Limitations
     - More and more applications use the Append / Prepend APIs provided by Couchbase Server, e.g. mobile messaging services (Viber)
     - The value grows over time, and each append is a read-modify-write of the whole value: curr_val = Get("foo"); new_val = curr_val + delta; Set("foo", new_val);
     - As values get bigger, compaction happens frequently
  53. Storage-Level Append / Prepend APIs
     - Append("foo", delta) internally checks whether the key exists; if it does, only the delta value is written to the file, together with the offset of the old value
     - Appended / prepended values are consolidated periodically or as part of compaction
     - Less compaction overhead
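
A hedged sketch of the on-disk idea: each storage-level append writes only the delta plus a back-pointer to the previous record for the same key, and a read (or the periodic consolidation) walks that chain to reassemble the full value. The record layout and names are illustrative:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One appended record: the delta bytes plus the offset of the previous record
// for the same key (0 means "no previous value").
struct DeltaRecord {
    uint64_t prev_offset;
    std::string delta;
};

std::unordered_map<uint64_t, DeltaRecord> db_file;   // offset -> record (stand-in for the DB file)

// Reassemble the full value by walking the chain backwards, then concatenating
// the deltas oldest-first; consolidation would write the result out as a single
// record and drop the chain, reducing compaction work later.
std::string materialize(uint64_t newest_offset) {
    std::vector<const std::string*> parts;
    for (uint64_t off = newest_offset; off != 0; off = db_file.at(off).prev_offset) {
        parts.push_back(&db_file.at(off).delta);
    }
    std::string value;
    for (auto it = parts.rbegin(); it != parts.rend(); ++it) value += **it;
    return value;
}
```
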
  54. Summary
  55. Summary
     - ForestDB: a compact main index structure built from the HB+-Trie; high performance, space efficiency, and scalability
     - Various optimizations for solid-state drives: compaction, volume manager, exploiting parallel I/O channels on SSDs, Append / Prepend
     - ForestDB integrations: Couchbase Server secondary index, Couchbase Lite (The Future of Couchbase Mobile session @5:10pm today), Couchbase Server KV engine
  56. Questions? chiyoung@couchbase.com
