Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Log Structured Merge Tree
Pinglei Guo
at15 at1510086
Agenda
● History
● Questions after reading the paper
● An example: Cassandra
● The original paper: Why & How & Visualizati...
1996 LSM Tree
The log-structured merge-tree
cited by 401
2006 Bigtable
Bigtable: A distributed storage system for
structur...
History of LSM Tree
1
What’s the trend for Database?
● Data become larger, more write
● Non-Relational Databases emerge, H...
Questions after reading the paper
2
● Do I still need WAL/WBL when I use log structured merge tree
● Is LSM Tree a data st...
Quick Answers
2
● Do I still need WAL/WBL when I use log structured merge tree
Yes
● Is LSM Tree a data structure like B+ ...
3
Cassandra as first example
Why? (not O'Neil 96, Bigtable, LevelDB)
why we pick Cassandra as first example?
3
1. It give us a high level overview of a full real system
2. It is easier to und...
3
● Commit Log (WAL)
● Memtable (C0 in paper)
Write goes to
Operations return before
the data is written to disk
(Fast)
Ca...
3
● Memtable are dumped to disk as SSTable
● SSTable are merged by background process
immutable
Cassandra ‘Merge’
SSTable:...
3
● Read from MemTable
● use Bloom filter to identify SSTables
● Load SSTable index
● Read from multiple SSTables
● Merge ...
4
O'Neil 96 The LSM tree
Its name leads to confusion
● Log structured merge tree is not log like WAL
● Log comes from log ...
4
O'Neil 96 The LS Merge tree
Let's talk about Merge
Merge is the subtle part (that I don't understand clearly)
Two Merges...
4
O'Neil 96 The LS Merge tree
Q1: Why we need to Merge?
A : Because we put data on different media
4
O'Neil 96 The LS Merge tree
1. Merge is needed because we put data on different media
Q2: Why put data on different medi...
4
O'Neil 96 The LS Merge tree
1. Merge is needed because we put data on different media
1. Speed & Access pattern
2. Price...
4
O'Neil 96 The LS Merge tree
1. Merge is needed because we put data on different media
1. Speed & Access pattern
2. Price...
4
O'Neil 96 The LS Merge tree
1. Merge is needed because we put data on different media
1. Speed & Access pattern
2. Price...
4
O'Neil 96 The LS Merge tree
1. Merge is needed because we put data on different media
1. Speed & Access pattern
2. Price...
4
O'Neil 96 The LS Merge tree
1. Merge is needed because we put data on different media
2. Put data on different media to ...
● Batch
● Append
● speed up
● more efficient space usage
Principle: You don't write to the next level until you have to, a...
Mem
Disk
10 12
1 8 10 11 12 13
Client: Write <6, "foo">
7
9
Before After
Mem
Disk
10 12
1 8 10 11 12 13
7
96 (foo)
10 12 1...
DB: I need to merge
load leaf node into memory emptying, pick node
Mem
Disk
10 12
1 8 10 11 12 13
7
96 (foo)
10 12
1 8
Mem...
DB: I need to merge
filling
Mem
Disk
10 12
1 8 10 11 12 13
7
96 (foo)
10 12
1 8
1 6
Mem
Disk
10 12
1 8 10 11 12 13
7
9
10 ...
Mem
Disk
10 12
1 8 10 11 12 13
7
9
10 12
8
1 6 (foo)
Client: Write <6, "bar">
6 (bar)
Client: Read <6, ?>
Mem
Disk
10 12
1...
Client: Delete <6, "bar">
Mem
Disk
10 12
1 8 10 11 12 13
7
9
10 12
8
1 6 (foo)
6 (bar) (dead)
O' Neil 96
4
Client: Read <6...
O' Neil 96
4
Mem
Disk
10 12
1 8 10 11 12 13
710 12
1 6 (foo) 8 9
Mem
Disk
10 12
1 8 10 11 12 13
7
9
10 12
8
1 6 (foo)
6 (b...
Mem
Disk
Client: Write <6, "I am foo">
Before
After Cassandra
4
7 Ha Ha
13 Excited
Mem
Disk
6 I am foo
7 Ha Ha
13 Excited
...
DB: I need to dump
Before
4
Mem
Disk
6 I am foo
7 Ha Ha
13 Excited
CassandraAfter
Mem
Disk
1 This 8 is 9 radom 10 gen 11 t...
DB: I need to compact
4
Disk
1 This 8 is 9 radom 10 gen 11 text 12 !
1 This 6 I am foo 7 Ha Ha 8 is 9 radom 10 gen 11 text...
Compare of O'Neil 96 and Cassandra
4
O'Neil 96 Cassandra
in memory structure AVL/2-3 Tree Map
on disk structure B+ Tree SS...
Summary
4
O'Neil 96 The LS Merge tree
● Write to fast level
● Read from both fast and slow.
● Data is flushed from fast le...
LevelDB & RocksDB
5
From RocksDB: Challenges of LSM-Trees in Practice
LevelDB & RocksDB
5
From RocksDB: Challenges of LSM-Trees in Practice
LevelDB & RocksDB
Bloom Filter for range queries
From RocksDB: Challenges of LSM-Trees in Practice
LevelDB & RocksDB
Bloom Filter for range queries
From RocksDB: Challenges of LSM-Trees in Practice
Full Answers
2
● Do I still need WAL/WBL when I use log structured merge tree
Yes
● Is LSM Tree a data structure like B+ T...
Reference & Suggested reading
1. SSTable and log structured storage leveldb
2. Notes for reading LSM paper
3. Cassandra: a...
Thank You!
Happy weekend and Lunar New Year!
Pinglei Guo
at1510086at15
Upcoming SlideShare
Loading in …5
×

Log Structured Merge Tree

5,391 views

Published on

Explain O'Neil 96 Log Structured Merge Tree and compare it with Cassandra and RocksDB

Published in: Software
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Log Structured Merge Tree

  1. 1. Log Structured Merge Tree Pinglei Guo at15 at1510086
  2. 2. Agenda ● History ● Questions after reading the paper ● An example: Cassandra ● The original paper: Why & How & Visualization ● Suggested reading
  3. 3. 1996 LSM Tree The log-structured merge-tree cited by 401 2006 Bigtable Bigtable: A distributed storage system for structured data cited by 4917 2011 LevelDB LevelDB: A Fast Persistent Key-Value Store History of LSM Tree 1992 LSF The design and implementation of a log-structured file system cited by 1885 2013 RocksDB Under the Hood: Building and open-sourcing RocksDB 2015 TSM Tree The New InfluxDB Storage Engine: Time Structured Merge Tree 2010 Cassandra Cassandra: a decentralized structured storage system 2007 HBase 1
  4. 4. History of LSM Tree 1 What’s the trend for Database? ● Data become larger, more write ● Non-Relational Databases emerge, HBase, Cassandra ● Database are also used for analysis and decision making ● Bigtable ● Cassandra ● HBase ● PNUTS (from Yahoo! 阿里他爸) ● LevelDB && RocksDB ● MongoDB (wired tiger) ● SQLite (optional) ● InfluxDB Databases using LSM Tree Databases using LevelDB/RocksDB ● Riak KV (TS) ● TiKV ● InfluxDB (before 1.0) ● MySQL (in facebook) ● MongoDB (in facebook's dismissed Parse) Facebook: eat my own Rocks
  5. 5. Questions after reading the paper 2 ● Do I still need WAL/WBL when I use log structured merge tree ● Is LSM Tree a data structure like B+ Tree, is there a textbook implementation ● Can someone explain the rolling merge process in detail ● Databases using LSM Tree often have the concept of column family, is it an alias for Column Database
  6. 6. Quick Answers 2 ● Do I still need WAL/WBL when I use log structured merge tree Yes ● Is LSM Tree a data structure like B+ Tree, is there a textbook implementation No ● Can someone explain the rolling merge process in detail I will try ● Databases using LSM Tree often have the concept of column family, is it an alias for Column Database No, JavaScript != Java + Script
  7. 7. 3 Cassandra as first example Why? (not O'Neil 96, Bigtable, LevelDB)
  8. 8. why we pick Cassandra as first example? 3 1. It give us a high level overview of a full real system 2. It is easier to understand than original paper 3. It is battle tested 4. It is open source
  9. 9. 3 ● Commit Log (WAL) ● Memtable (C0 in paper) Write goes to Operations return before the data is written to disk (Fast) Cassandra Write
  10. 10. 3 ● Memtable are dumped to disk as SSTable ● SSTable are merged by background process immutable Cassandra ‘Merge’ SSTable: Sorted String Table ● Bloom Filter ● Index ● Data
  11. 11. 3 ● Read from MemTable ● use Bloom filter to identify SSTables ● Load SSTable index ● Read from multiple SSTables ● Merge the result and return Cassandra Read (simplified)
  12. 12. 4 O'Neil 96 The LSM tree Its name leads to confusion ● Log structured merge tree is not log like WAL ● Log comes from log structured file system ● LSM Tree is a concept than a concrete implementation ● Tree can be replaced by other data structure like map ● More intuitive name could be buffered write, multi level storage, write back cache for index Log is borrowed, Tree can be replaced, Merge is the king
  13. 13. 4 O'Neil 96 The LS Merge tree Let's talk about Merge Merge is the subtle part (that I don't understand clearly) Two Merges ● Post-Write: Merge fast (small) level to slow (big) level ● Read: Read from both fast level and slow level and return the merged result Merge Sort ● A new array need to be allocated ● Two sub array must be sorted before merge
  14. 14. 4 O'Neil 96 The LS Merge tree Q1: Why we need to Merge? A : Because we put data on different media
  15. 15. 4 O'Neil 96 The LS Merge tree 1. Merge is needed because we put data on different media Q2: Why put data on different media? 1. Speed & Access pattern The 5 minutes rule
  16. 16. 4 O'Neil 96 The LS Merge tree 1. Merge is needed because we put data on different media 1. Speed & Access pattern 2. Price 3. Durability Q2: Why put data on different media? ● Tape ● HDD ● SSD ● RAM
  17. 17. 4 O'Neil 96 The LS Merge tree 1. Merge is needed because we put data on different media 1. Speed & Access pattern 2. Price 3. Durability Q2: Why put data on different media? ● Tape ● HDD ● SSD ● RAM Other media? NVM?
  18. 18. 4 O'Neil 96 The LS Merge tree 1. Merge is needed because we put data on different media 1. Speed & Access pattern 2. Price 3. Durability Q2: Why put data on different media? ● Tape ● HDD ● SSD ● RAM Other media? Distributed system is also ‘media’
  19. 19. 4 O'Neil 96 The LS Merge tree 1. Merge is needed because we put data on different media 1. Speed & Access pattern 2. Price 3. Durability Q2: Why put data on different media? Distributed systems -> media that resist larger failure ● Natural disasters ● Human misbehave ● Fail of one machine ● Fail of entire datacenter ● Fail of a country ● Fail of planet earth
  20. 20. 4 O'Neil 96 The LS Merge tree 1. Merge is needed because we put data on different media 2. Put data on different media to gain 1. Faster Speed 2. Lower Price 3. Resistance to various level of Failures Q3: How to merge?
  21. 21. ● Batch ● Append ● speed up ● more efficient space usage Principle: You don't write to the next level until you have to, and you write in the fastest way How to Merge is important 4 O'Neil 96 The LS Merge tree
  22. 22. Mem Disk 10 12 1 8 10 11 12 13 Client: Write <6, "foo"> 7 9 Before After Mem Disk 10 12 1 8 10 11 12 13 7 96 (foo) 10 12 10 12 O' Neil 96 4
  23. 23. DB: I need to merge load leaf node into memory emptying, pick node Mem Disk 10 12 1 8 10 11 12 13 7 96 (foo) 10 12 1 8 Mem Disk 10 12 1 8 10 11 12 13 7 96 (foo) 10 12 1 8 O' Neil 96 4
  24. 24. DB: I need to merge filling Mem Disk 10 12 1 8 10 11 12 13 7 96 (foo) 10 12 1 8 1 6 Mem Disk 10 12 1 8 10 11 12 13 7 9 10 12 8 1 6 (foo) append to disk O' Neil 96 4
  25. 25. Mem Disk 10 12 1 8 10 11 12 13 7 9 10 12 8 1 6 (foo) Client: Write <6, "bar"> 6 (bar) Client: Read <6, ?> Mem Disk 10 12 1 8 10 11 12 13 7 9 10 12 8 1 6 (foo) 6 (bar) [foo, bar] Fetch from both level and return merged result O' Neil 96 4
  26. 26. Client: Delete <6, "bar"> Mem Disk 10 12 1 8 10 11 12 13 7 9 10 12 8 1 6 (foo) 6 (bar) (dead) O' Neil 96 4 Client: Read <6, ?> Mem Disk 10 12 1 8 10 11 12 13 7 9 10 12 8 1 6 (foo) 6 (bar) (dead) [foo]
  27. 27. O' Neil 96 4 Mem Disk 10 12 1 8 10 11 12 13 710 12 1 6 (foo) 8 9 Mem Disk 10 12 1 8 10 11 12 13 7 9 10 12 8 1 6 (foo) 6 (bar) (dead) DB: I need to merge Before After
  28. 28. Mem Disk Client: Write <6, "I am foo"> Before After Cassandra 4 7 Ha Ha 13 Excited Mem Disk 6 I am foo 7 Ha Ha 13 Excited 1 This 8 is 9 radom 10 gen 11 text 12 ! 1 This 8 is 9 radom 10 gen 11 text 12 !
  29. 29. DB: I need to dump Before 4 Mem Disk 6 I am foo 7 Ha Ha 13 Excited CassandraAfter Mem Disk 1 This 8 is 9 radom 10 gen 11 text 12 ! 6 I am foo 7 Ha Ha 13 Excited 1 This 8 is 9 radom 10 gen 11 text 12 !
  30. 30. DB: I need to compact 4 Disk 1 This 8 is 9 radom 10 gen 11 text 12 ! 1 This 6 I am foo 7 Ha Ha 8 is 9 radom 10 gen 11 text 12 ! 13 Excited 6 I am foo 7 Ha Ha 13 Excited Before After Cassandra
  31. 31. Compare of O'Neil 96 and Cassandra 4 O'Neil 96 Cassandra in memory structure AVL/2-3 Tree Map on disk structure B+ Tree SSTable, Index, Bloomfilter level (component) C_0, C_1 .... C_n Memtable, SSTable flush to disk when Memory can’t hold Memory can’t hold and/or timer persist to disk by Write new block (append) dump new SSTable from Memtable (append) merge is done at Memory (empty, filling block) Disk (Compaction in background) concurrency control Complex SSTable is immutable, data have (real world) timestamp for versioning, updating value does not bother dump or merge delete Tombstone, delete at merge Tombstone, delete at merge
  32. 32. Summary 4 O'Neil 96 The LS Merge tree ● Write to fast level ● Read from both fast and slow. ● Data is flushed from fast level to slow level when they are too big ● Real delete is defered to merge
  33. 33. LevelDB & RocksDB 5 From RocksDB: Challenges of LSM-Trees in Practice
  34. 34. LevelDB & RocksDB 5 From RocksDB: Challenges of LSM-Trees in Practice
  35. 35. LevelDB & RocksDB Bloom Filter for range queries From RocksDB: Challenges of LSM-Trees in Practice
  36. 36. LevelDB & RocksDB Bloom Filter for range queries From RocksDB: Challenges of LSM-Trees in Practice
  37. 37. Full Answers 2 ● Do I still need WAL/WBL when I use log structured merge tree Yes ● Is LSM Tree a data structure like B+ Tree, is there a textbook implementation No, it’s how you use different data structure in different storage media ● Can someone explain the rolling merge process in detail I tried ● Databases using LSM Tree often have the concept of column family, is it an alias for Column Database No, see Distinguishing Two Major Types of Column-Stores
  38. 38. Reference & Suggested reading 1. SSTable and log structured storage leveldb 2. Notes for reading LSM paper 3. Cassandra: a decentralized structured storage system 4. Bigtable: A distributed storage system for structured data 5. RocksDB Talks 6. Pathologies of Big Data 7. Distinguishing Two Major Types of Column-Stores 8. Visualization of B+ Tree 9. Time structured merge tree 10. Code: Cassandra, LevelDB, RocksDB, Indeed LSM Tree, InfluxDB (Talk is cheap, show me the code) code is cheap, show me the proof; proof is cheap, I just want to sleep
  39. 39. Thank You! Happy weekend and Lunar New Year! Pinglei Guo at1510086at15

×