
Next Generation Storage Engine: ForestDB: Couchbase Connect 2015


The B+-tree has been used as one of the main index structures in the database field for more than four decades. However, with the unprecedented amount of data being generated by modern, global-scale web, mobile, and IoT applications, typical B+-tree implementations are beginning to show scalability and performance issues. Various key-value storage engines based on variants of the B+-tree, or on alternatives such as the log-structured merge-tree (LSM-tree), have been proposed to address these limitations. At Couchbase, we have been working on a new key-value storage engine, ForestDB, whose main index structure is based on the Hierarchical B+-Tree based Trie (HB+-Trie) and which provides high scalability and performance. In this presentation, we introduce ForestDB and discuss why ForestDB is a good fit for modern big data applications.


  1. FORESTDB: NEXT GENERATION STORAGE ENGINE FOR COUCHBASE. Chiyoung Seo, Software Architect, Couchbase Inc.
  2. Contents  Why do we need a new KV storage engine?  ForestDB  HB+-Trie  Write-Ahead Logging (WAL)  Block buffer cache  Evaluation  Optimizations for Solid-State Drives (SSDs)  Volume manager inside ForestDB  Lightweight and I/O-efficient compaction  Async I/O to exploit the parallel I/O capabilities of SSDs  Summary
  3. Why do we need a new KV storage engine?
  4. Modern Web / Mobile / IoT Applications  Operate on huge volumes of unstructured data  A significant amount of new data is constantly generated by hundreds of millions of users or devices  Still require high performance and scalability in managing their ever-growing databases  The underlying storage engine is one of the most critical parts of a database system for providing high performance and scalability
  5. B+Tree  Main storage index structure in the database field  A generalization of the binary search tree  Each node consists of two or more {key, value (or pointer)} pairs  Fanout (or branching) degree: the number of KV pairs in a node  Node size is generally fitted to a multiple of the page size
  6. B+Tree (diagram)  The root node and index (non-leaf) nodes store {Ki, Pi} pairs, where Ki is the i-th smallest key in the node and Pi is the pointer corresponding to Ki  Leaf nodes store {Ki, Vi} pairs, where Vi is the value corresponding to Ki  f is the fanout degree  A node layout sketch follows below
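The node layout above can be sketched roughly in C. This is an illustrative shape only, not ForestDB's actual code; the fixed key size and fanout are arbitrary assumptions.

```c
/* Illustrative sketch of the B+Tree node layout described above.
 * Type and field names are my own, not ForestDB's; the key size and
 * fanout are arbitrary assumptions. */
#include <stdbool.h>
#include <stdint.h>

#define FANOUT  128  /* f: number of {key, value/pointer} pairs per node */
#define KEY_LEN 32   /* assumed fixed key size */

typedef struct bnode {
    bool     is_leaf;
    uint16_t nentries;             /* number of used entries (<= FANOUT) */
    uint8_t  key[FANOUT][KEY_LEN]; /* K1..Kf, kept in sorted order */
    union {
        uint64_t child[FANOUT];    /* P1..Pf: child node offsets (non-leaf) */
        uint64_t value[FANOUT];    /* V1..Vf: e.g., document offsets (leaf) */
    } u;
} bnode;
```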
  7. B+Tree Limitations  Not suitable for indexing variable-length or long fixed-length keys  Significant space overhead, as entire key strings are indexed in non-leaf nodes  Tree depth grows quickly as more data is loaded  I/O performance degrades significantly as the data size gets bigger  Several B+Tree variants and alternatives have been proposed:  LevelDB (Google)  RocksDB (Facebook)  TokuDB (Tokutek)  WiredTiger (MongoDB)
  8. Goals  A fast and scalable index structure for variable-length or long fixed-length keys  Targeting block I/O storage devices: not only SSDs but also legacy HDDs  Less storage space overhead  Reduced write amplification  Efficiency regardless of the key pattern: both for keys sharing a common prefix and for keys without one
  9. ForestDB
  10. ForestDB  A key-value storage engine developed by the Couchbase Caching / Storage team  Its main index structure is built from a Hierarchical B+-Tree based Trie, or HB+-Trie  The ForestDB paper has been accepted for publication in IEEE Transactions on Computers  Significantly better read and write performance with less storage overhead  Supports various server OSs (CentOS, Ubuntu, Debian, Mac OS X, Windows) and mobile OSs (iOS, Android)  Currently in Beta; 1.0 GA will be released in July  The underlying storage engine for the secondary index, mobile, and key-value engine in Couchbase
  11. Main Features  Multi-Version Concurrency Control (MVCC) with an append-only storage model  Write-Ahead Logging (WAL)  A value can be retrieved by its sequence number or disk offset in addition to its key  Custom compare functions to support customized key ordering  Snapshot support to provide different views of the database  Rollback to revert the database to a specific point  Ranged iteration by keys or sequence numbers  Transaction support with read_committed or read_uncommitted isolation levels  Multiple key-value instances per database file  Manual or automatic compaction, configured per database file  (a usage sketch follows below)
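A minimal usage sketch of the basics above, written from memory of the public C API (forestdb.h) as of 2015; treat every call below as an assumption to verify against the headers of your version.

```c
/* Hedged sketch of basic ForestDB usage, from memory of the public C API
 * (forestdb.h) circa 2015; verify names and signatures against your version. */
#include <libforestdb/forestdb.h>
#include <stdlib.h>

int main(void) {
    fdb_file_handle *dbfile;
    fdb_kvs_handle *kvs;
    fdb_config config = fdb_get_default_config();
    fdb_kvs_config kvs_config = fdb_get_default_kvs_config();

    fdb_open(&dbfile, "./test.fdb", &config);        /* one database file */
    fdb_kvs_open_default(dbfile, &kvs, &kvs_config); /* default KV instance */

    fdb_set_kv(kvs, "key", 3, "value", 5);           /* append-only update */
    fdb_commit(dbfile, FDB_COMMIT_NORMAL);           /* appends a DB header */

    void *value;
    size_t valuelen;
    if (fdb_get_kv(kvs, "key", 3, &value, &valuelen) == FDB_RESULT_SUCCESS)
        free(value);                                 /* caller frees the copy */

    fdb_kvs_close(kvs);
    fdb_close(dbfile);
    fdb_shutdown();
    return 0;
}
```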
  12. ForestDB: Main Index Structure
  13. HB+-Trie (Hierarchical B+-Tree based Trie)  A trie (prefix tree) whose nodes are B+Trees  A variable-length key is split into a list of fixed-size chunks (sub-strings of the key); e.g., with 4-byte chunks the key 'a83jgls83jgo29a...' becomes chunk1 = 'a83j', chunk2 = 'gls8', chunk3 = '3jgo', and so on  A lookup searches the root B+Tree (a node of the HB+-Trie) using chunk1, the next-level B+Tree using chunk2, then chunk3, until a document is reached  Traversal follows lexicographical key order
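To make the chunking concrete, here is a small helper that splits a key into fixed-size chunks; an illustrative sketch, not ForestDB's code.

```c
/* Illustrative sketch of splitting a key into fixed-size chunks for an
 * HB+-Trie lookup; not ForestDB's actual code. */
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 4

/* Copies the i-th chunk of `key` into `chunk`, zero-padding the tail
 * chunk of a key whose length is not a multiple of CHUNK_SIZE. */
static void get_chunk(const char *key, size_t keylen, size_t i, char *chunk) {
    size_t off = i * CHUNK_SIZE;
    size_t n = keylen > off ? keylen - off : 0;
    if (n > CHUNK_SIZE) n = CHUNK_SIZE;
    memset(chunk, 0, CHUNK_SIZE);
    memcpy(chunk, key + off, n);
}
```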
  14. Prefix Compression  As in the original trie, each node (a B+Tree) is created on demand, except for the root node  The following example uses a chunk size of 1 byte
  15. Insert 'aaaa': the root B+Tree, which uses the 1st chunk as its key, stores 'aaaa' under chunk 'a'; the key is distinguishable by its first chunk
  16. Insert 'bbbb': distinguishable by its first chunk 'b', so the root B+Tree stores it directly
  17. Insert 'aaab': 'aaaa' and 'aaab' cannot be distinguished using the first chunk 'a'
  18. The first chunk that distinguishes 'aaaa' from 'aaab' is the 4th
  19. A new B+Tree using the 4th chunk as its key is created under 'a', storing the skipped common prefix 'aa'; it holds 'aaaa' under chunk 'a' and 'aaab' under chunk 'b'
  20. Insert 'bbcd': 'bbbb' and 'bbcd' cannot be distinguished using the first chunk 'b'
  21. The first chunk that distinguishes 'bbbb' from 'bbcd' is the 3rd
  22. A new B+Tree using the 3rd chunk as its key is created under 'b', storing the skipped common prefix 'b'; it holds 'bbbb' under chunk 'b' and 'bbcd' under chunk 'c'  (a node-shape sketch follows below)
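Putting the walkthrough together, the shape of an HB+-Trie node with prefix compression might look like the following; all names are illustrative placeholders, not ForestDB's internals.

```c
/* Illustrative shape of an HB+-Trie node with prefix compression;
 * names are placeholders, not ForestDB's internal types. */
#include <stddef.h>
#include <stdint.h>

typedef struct hbtrie_node {
    uint32_t chunk_no;        /* which chunk this B+Tree uses as its key
                                 (1st for the root, 4th/3rd in the example) */
    uint8_t *skipped_prefix;  /* common prefix skipped since the parent,
                                 e.g., "aa" or "b" in the example */
    size_t   prefix_len;
    void    *btree_root;      /* root of the B+Tree indexing one chunk;
                                 entries point to documents or child nodes */
} hbtrie_node;
```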
  23. Benefits  When keys are sufficiently long and uniformly random (e.g., UUIDs or hash values)  When keys share common prefixes (e.g., secondary index keys)  Example (chunk size = 4 bytes): insert a83jfl2iejzm302k, dpwk3gjrieorigje, z9382h3igor8eh4k, 283hgoeir8goerha, 023o8f9o8zufisue; the root B+Tree distinguishes them by their first chunks (a8, dp, z9, 28, 02)
  24. Benefits (continued)  With such keys, the majority can be indexed by the first chunk alone  There will be only one B+Tree in the HB+-Trie  We do not need to store and compare entire key strings
  25. ForestDB Index Structures  ForestDB maintains two index structures:  HB+-Trie: the key index  Sequence B+Tree: the sequence number (8-byte integer) index  Either index retrieves the file offset of a value, using the key or the sequence number  (Diagram: both indexes point at documents appended in the DB file)
  26. ForestDB: Write-Ahead Logging
  27. Write-Ahead Logging  Append updates first, and reflect them in the main indexes later  Main purposes:  Maximize write throughput through sequential writes (append-only updates)  Reduce the number of index nodes to be written through batched updates  A DB header is appended for every commit  The WAL indexes are in-memory structures (hash tables): an ID index mapping h(key) to a file offset, and a sequence-number index mapping h(seq no) to a file offset
  28. Write-Ahead Logging: key query path  1. Retrieve the WAL index first  2. On a hit, return immediately  3. On a miss, retrieve the HB+-Trie (or sequence B+Tree)  (a sketch of this read path follows below)
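The read path on this slide, sketched in C; the types and lookup helpers are placeholders introduced here for illustration, not ForestDB's API.

```c
/* Sketch of the WAL-first read path on this slide; all types and helpers
 * are illustrative placeholders, not ForestDB's API. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct wal_index wal_index; /* in-memory hash table: h(key) -> offset */
typedef struct hbtrie hbtrie;       /* on-disk main index */

/* Placeholder lookups; real versions would probe the hash table / trie. */
static bool wal_lookup(wal_index *w, const void *key, size_t klen,
                       uint64_t *offset)
{ (void)w; (void)key; (void)klen; (void)offset; return false; }
static bool hbtrie_lookup(hbtrie *t, const void *key, size_t klen,
                          uint64_t *offset)
{ (void)t; (void)key; (void)klen; (void)offset; return false; }

/* 1. Retrieve the WAL index first; 2. on a hit, return immediately;
 * 3. on a miss, fall back to the main index (HB+-Trie). */
bool db_lookup(wal_index *w, hbtrie *t,
               const void *key, size_t klen, uint64_t *offset) {
    if (wal_lookup(w, key, klen, offset))
        return true;
    return hbtrie_lookup(t, key, klen, offset);
}
```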
  29. ForestDB: Block Cache
  30. Block Cache  ForestDB has its own block cache layer, managed on a per-block basis  Index node blocks are given higher priority than data blocks  An option is provided to bypass the OS page cache  (Diagram: the HB+-Trie / sequence index and the WAL index issue block reads and writes through the block cache layer, which falls back to file reads and writes on the DB file only when a cache miss or eviction occurs)
  31. Block Cache  A global LRU list of the database files that are currently open  A separate AVL tree per file to keep track of dirty blocks  A separate hash table per file, keyed by block ID, whose value points to a cache entry in either the clean LRU list or the AVL tree  (illustrative struct shapes follow below)
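A rough sketch of how the per-file cache bookkeeping above could be laid out; all names are illustrative, not ForestDB's actual structures.

```c
/* Illustrative shapes of the block cache structures on this slide;
 * names are placeholders, not ForestDB's actual code. */
#include <stdint.h>

typedef struct cache_entry {
    uint64_t bid;              /* block ID */
    void    *block;            /* cached block contents */
    int      dirty;            /* 1 if tracked in the per-file dirty AVL tree */
    struct cache_entry *lru_prev, *lru_next; /* links in the clean LRU list */
} cache_entry;

typedef struct file_cache {
    struct file_cache *lru_prev, *lru_next; /* links in the global file LRU */
    void *dirty_tree;          /* AVL tree of dirty blocks, keyed by bid */
    void *hash;                /* hash table: bid -> cache_entry* (clean
                                  LRU list or dirty AVL tree) */
} file_cache;
```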
  32. ForestDB: Compaction
  33. Compaction  Manual compaction: performed by calling the public compact API  Daemon compaction: a single daemon thread inside ForestDB manages compaction automatically  An additional API allows the application to retain stale data up to a given snapshot marker  The compactor thread can interleave with a writer thread
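From memory, manual compaction is triggered through a single public API call along these lines; the exact signature is an assumption to verify against forestdb.h.

```c
/* Hedged sketch of triggering manual compaction, from memory of the
 * ForestDB public C API; the exact signature may differ by version. */
#include <libforestdb/forestdb.h>

void compact_example(fdb_file_handle *dbfile) {
    /* Writes all live data to a new file and switches over to it. */
    fdb_status s = fdb_compact(dbfile, "./test_compacted.fdb");
    if (s != FDB_RESULT_SUCCESS) {
        /* handle error */
    }
}
```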
  34. ForestDB: Evaluation
  35. ForestDB Performance  Evaluation environment:  64-bit machine running CentOS 6.5  Intel Xeon 2.00 GHz CPU (6 cores, 12 threads)  32 GB RAM and a Crucial M4 SSD  Data:  Key size 32 bytes, value size 1 KB  100M items loaded  Logical data size 100 GB in total
  36. KV Storage Engine Configurations  LevelDB  Compression is disabled  Write buffer size: 256 MB (initial load), 4 MB (otherwise)  Buffer cache size: 8 GB  RocksDB  Compression is disabled  Write buffer size: 256 MB (initial load), 4 MB (otherwise)  Maximum number of background compaction threads: 8  Maximum number of background memtable flushes: 8  Maximum number of write buffers: 8  Buffer cache size: 8 GB (uncompressed)  ForestDB  Compression is disabled  WAL size: 4,096 documents  Buffer cache size: 8 GB
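The ForestDB row of this configuration could be expressed through fdb_config roughly as follows; the field names are from memory of forestdb.h and may differ by version.

```c
/* Hedged sketch of the ForestDB benchmark settings above, from memory of
 * the fdb_config fields circa 2015; verify field names against forestdb.h. */
#include <libforestdb/forestdb.h>
#include <stdbool.h>

fdb_config make_benchmark_config(void) {
    fdb_config config = fdb_get_default_config();
    config.buffercache_size = 8ULL * 1024 * 1024 * 1024; /* 8 GB cache */
    config.wal_threshold = 4096;            /* WAL size: 4,096 documents */
    config.compress_document_body = false;  /* compression disabled */
    return config;
}
```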
  37. Initial Load Performance  (Chart) ForestDB completes the initial load in 3x ~ 6x less time than LevelDB and RocksDB
  38. Initial Load Performance  (Chart) ForestDB incurs 4x less write overhead during the initial load
  39. Read-Only Performance  (Chart: throughput in operations per second vs. number of reader threads, 1 to 8) ForestDB delivers 2x ~ 5x higher read throughput than LevelDB and RocksDB
  40. Write-Only Performance  (Chart: throughput in operations per second vs. write batch size, 1 to 256 documents) ForestDB delivers 3x ~ 5x higher write throughput  Note that a small batch size (e.g., fewer than 10 documents) is not common in practice
  41. Write-Only Performance  (Chart: write amplification, normalized to a single document size, vs. write batch size, 1 to 256 documents) ForestDB shows 4x ~ 20x less write amplification than LevelDB and RocksDB
  42. Mixed Workload Performance  (Chart: throughput in operations per second vs. number of reader threads, 1 to 8, unrestricted mixed workload) ForestDB delivers 2x ~ 5x higher throughput
  43. Optimizations for Solid-State Drives  Please join the deep-dive session tomorrow, presented by Prof. Sang-Won Lee and Sundar Sridharan
  44. OS File System Stack Overhead  (Diagram) A typical database storage stack places the database storage engine on top of the OS file system, with its page cache and metadata management, which in turn sits on the block I/O interface (SATA, PCIe) to the SSDs  An advanced database storage stack removes the file system: the storage engine runs its own volume manager and buffer cache directly on the block I/O interface to the SSDs
  45. Database Compaction  Required by the append-only storage model, to garbage-collect stale data blocks  Uses significant disk I/O bandwidth: the entire database file is read, and all valid blocks are written into a new file  Affects other performance metrics: regular read / write performance drops significantly
  46. Compaction Optimization  Adapt the SSD Flash Translation Layer (FTL) to provide a new API, SHARE  Avoids copying non-stale physical blocks from the old file to the new file  Leverage Btrfs (B-tree file system) Copy-On-Write (COW)  Allows us to share non-stale physical blocks between the old file and the new file  Much less write amplification, which extends the SSD lifespan  (a user-space illustration of the Btrfs-side idea follows below)
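One way to share non-stale blocks between two files on Btrfs from user space is the Linux clone-range ioctl; the hedged sketch below illustrates the COW idea on the slide (the SHARE API itself is an FTL change, which this does not model), and it is not ForestDB's actual implementation.

```c
/* Hedged sketch of block sharing between an old and a new file on Btrfs
 * using the Linux clone-range ioctl (FICLONERANGE, Linux >= 4.5); this
 * illustrates the COW idea, not ForestDB's actual implementation. */
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONERANGE, struct file_clone_range */

/* Shares `len` bytes at src_off in old_fd into dst_off in new_fd without
 * physically copying the blocks; both files must be on the same Btrfs. */
int share_blocks(int old_fd, int new_fd,
                 long long src_off, long long dst_off, long long len) {
    struct file_clone_range fcr = {
        .src_fd      = old_fd,
        .src_offset  = src_off,
        .src_length  = len,
        .dest_offset = dst_off,
    };
    return ioctl(new_fd, FICLONERANGE, &fcr);
}
```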
  47. Utilizing Parallel I/O Channels on SSDs  Exploit an async I/O library (e.g., libaio) to better utilize the parallel I/O capabilities of SSDs  Performance boost in various operations:  A Multi-Get API to fetch multiple documents at once  Reading non-stale blocks from the old file during compaction  Traversing secondary indexes when the documents satisfying a query predicate are located in different blocks  (a libaio sketch follows below)
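A minimal sketch of batched reads with libaio, as the slide suggests; illustrative only, not ForestDB's code. The queue depth and block size are arbitrary assumptions.

```c
/* Minimal sketch of batched reads with libaio; link with -laio.
 * Illustrative only, not ForestDB's code. */
#include <libaio.h>

#define QDEPTH 64
#define BLKSZ  4096

/* Submits nreads BLKSZ-byte reads at the given offsets and waits for all
 * of them; the SSD can service the queued reads in parallel. */
int batched_read(int fd, const long long *offsets, int nreads, char **bufs) {
    io_context_t ctx = 0;
    struct iocb cbs[QDEPTH], *list[QDEPTH];
    struct io_event events[QDEPTH];

    if (nreads > QDEPTH || io_setup(QDEPTH, &ctx) < 0)
        return -1;
    for (int i = 0; i < nreads; i++) {
        io_prep_pread(&cbs[i], fd, bufs[i], BLKSZ, offsets[i]);
        list[i] = &cbs[i];
    }
    if (io_submit(ctx, nreads, list) != nreads) {       /* queue all reads */
        io_destroy(ctx);
        return -1;
    }
    if (io_getevents(ctx, nreads, nreads, events, NULL) != nreads) {
        io_destroy(ctx);
        return -1;
    }
    io_destroy(ctx);
    return 0;
}
```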
  48. Compaction Time Reduction Through Async I/O  (Diagram: during compaction, B+Tree nodes of the HB+-Trie, including old versions of nodes, are read from the current DB file, and the valid nodes and documents are written to the new compacted DB file)  Compaction time for a 10 GB file with a 512 MB buffer cache, using async I/O with queue depth 64: sync I/O 1,543 secs vs. async I/O 586 secs (about 2.6x faster)
  49. Summary
  50. Summary  ForestDB  A compact main index structure built from the HB+-Trie  High performance, space efficiency, and scalability  Various optimizations for Solid-State Drives  Compaction  Volume manager  Exploiting parallel I/O channels on SSDs  ForestDB integrations  Couchbase Server secondary index  Couchbase Lite  Couchbase Server KV engine  Couchbase full-text search engine
  51. Questions? chiyoung@couchbase.com
