The Story of RocksDB
Embedded Key-Value Store for Flash and RAM

Dhruba Borthakur & Haobo Xu
Database Engineering@Facebook...
Monday, December 9, 13
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

This presentation describes why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB, and how it differs from other open source key-value stores. Dhruba describes some of the salient features in RocksDB that are needed for supporting embedded-storage deployments. He explains typical workloads that could be the primary use cases for RocksDB. He also lays out the roadmap to make RocksDB the key-value store of choice for highly multi-core processors and RAM-speed storage devices.

Published in: Technology, Education
Transcript of "Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook"

  1. The Story of RocksDB: Embedded Key-Value Store for Flash and RAM. Dhruba Borthakur & Haobo Xu, Database Engineering@Facebook. Monday, December 9, 13
  4. A Client-Server Architecture with disks
      • Application Server to Database Server: network roundtrip = 50 microseconds
      • Database Server to locally attached disks: disk access = 10 milliseconds
  5. Client-Server Architecture with fast storage
      • Application Server to Database Server: network roundtrip = 50 microseconds
      • Database Server storage: SSD = 100 microseconds, RAM = 100 nanoseconds
      • Latency dominated by network
  6-9. Architecture of an Embedded Database
      • Storage attached directly to application servers: SSD = 100 microseconds, RAM = 100 nanoseconds
      • Also on the diagram: network roundtrip = 50 microseconds, Database Server
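The latency figures on slides 4 to 9 can be combined into a rough per-request estimate. The sketch below is only arithmetic on the quoted numbers; it assumes one storage access per request and ignores queuing, serialization, and CPU time, none of which the slides quantify. The function names are illustrative.

```python
# Latency figures quoted on the slides, in microseconds.
NETWORK_ROUNDTRIP_US = 50.0
DISK_ACCESS_US = 10_000.0   # 10 milliseconds
SSD_ACCESS_US = 100.0
RAM_ACCESS_US = 0.1         # 100 nanoseconds


def client_server_latency(storage_us):
    """One request: a network roundtrip plus one storage access (assumption)."""
    return NETWORK_ROUNDTRIP_US + storage_us


def embedded_latency(storage_us):
    """Embedded database: the storage access only, with no network hop."""
    return storage_us


def network_share(storage_us):
    """Fraction of client-server latency spent on the network."""
    return NETWORK_ROUNDTRIP_US / client_server_latency(storage_us)


for name, us in [("disk", DISK_ACCESS_US), ("SSD", SSD_ACCESS_US), ("RAM", RAM_ACCESS_US)]:
    print(f"{name}: client-server {client_server_latency(us):.1f} us, "
          f"embedded {embedded_latency(us):.1f} us, "
          f"network share {network_share(us):.1%}")
```

Under these assumptions the network is a negligible part of a disk-backed request, about a third of an SSD-backed one, and nearly all of a RAM-backed one, which is the motivation the slides give for embedding the store in the application server.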
  10-12. Any pre-existing embedded databases?
      • Two categories on the slide: Open Source and FB Proprietary
      • Key-value stores listed (slide 11): 1. Berkeley DB, 2. SQLite, 3. Kyoto TreeDB, 4. LevelDB
      • Properties listed (slide 12): 1. Highly performant, 2. No transaction log, 3. Fixed size keys
  13-17. Comparison of open source databases

      Database       Random Reads       Random Writes
      LevelDB        129,000 ops/sec    164,000 ops/sec
      Kyoto TreeDB   151,000 ops/sec    88,500 ops/sec
      SQLite3        134,000 ops/sec    9,860 ops/sec

  18-20. HBase and HDFS (in April 2012)
      • Random Reads: HDFS (1 node) = 93,000 ops/sec, HBase (1 node) = 35,000 ops/sec
      • Details of this experiment: http://hadoopblog.blogspot.com/2012/05/hadoop-and-solid-state-drives.html
  21-30. Log Structured Merge Architecture (diagram labels)
      • Write Request from Application
      • Read Write data in RAM
      • Transaction log
      • Read Only data in RAM on disk
      • Periodic Compaction
      • Scan Request from Application
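The pieces named on the Log Structured Merge Architecture slides (read-write data in RAM, a transaction log, read-only data, periodic compaction, scans) can be sketched as a toy in-memory model. This is not RocksDB or LevelDB code: the class name `ToyLSM`, the `memtable_limit` parameter, and the use of Python lists as stand-ins for the log and on-disk files are all assumptions made for illustration.

```python
class ToyLSM:
    """Illustrative only: in-memory stand-ins for the log and the on-disk files."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # read-write data in RAM
        self.log = []        # stand-in for the transaction log
        self.runs = []       # read-only sorted runs, newest first

    def put(self, key, value):
        self.log.append((key, value))   # record in the log, then apply in RAM
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into a sorted, read-only run.
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}
        self.log = []   # flushed entries no longer need the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:            # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one, keeping the newest value per key.
        merged = {}
        for run in reversed(self.runs):  # oldest first, newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []

    def scan(self):
        # Merge every run and the memtable, newest value per key, in key order.
        merged = {}
        for run in reversed(self.runs):
            merged.update(run)
        merged.update(self.memtable)
        return sorted(merged.items())


db = ToyLSM(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    db.put(k, v)
print(db.get("a"), len(db.runs))
db.compact()
print(db.scan())
```

A read may have to consult the memtable and several runs before it finds the key, and compaction reduces the number of runs at the cost of rewriting data; those are the read and write amplification costs the later slides discuss.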
  31-32. LevelDB has low write rates
      • Facebook Application 1: write rate of only 2 MB/sec per machine; only one CPU was used
      • We developed multithreaded compaction: 10x improvement in write rate, and 100% of CPUs are in use
  33-34. LevelDB has stalls
      • Facebook Feed: P99 latencies were tens of seconds; single-threaded compaction
      • We implemented thread-aware compaction: dedicated thread(s) to flush the memtable, pipelined memtables
      • P99 reduced to less than a second
  35-36. LevelDB has high write amplification
      • Facebook Application 2: Level Style Compaction
      • Write amplification of 70, which is very high
      • Diagram (Stage 1, Stage 2, Stage 3): Level-0: 5 bytes; Level-1: 6 bytes, 11 bytes; Level-2: 10 bytes, 10 bytes, 10 bytes
      • Two compactions by LevelDB Style Compaction
  37-38. Our solution: lower write amplification
      • Facebook Application 2: we implemented Universal Style Compaction
      • Start from the newest file; include the next file in the candidate set if candidate set size >= size of next file
      • Diagram (Stage 1, Stage 2): Level-0: 5 bytes; Level-1: 6 bytes; Level-2: 10 bytes, 10 bytes
      • Single compaction by Universal Style Compaction
      • Write amplification reduced to <10
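The candidate-selection rule on slides 37 and 38 can be read literally as the loop below. This is a sketch of the rule as worded on the slide, not the RocksDB implementation, which may apply further tuning parameters. The function name `pick_candidates`, the example sizes, and the choice to stop at the first file that fails the test are assumptions.

```python
def pick_candidates(file_sizes_newest_first):
    """Return the sizes of the files chosen for one compaction.

    Literal reading of the slide: start from the newest file and keep adding
    the next (older) file while the candidate set is at least as large as it.
    """
    if not file_sizes_newest_first:
        return []
    candidates = [file_sizes_newest_first[0]]   # start from the newest file
    total = candidates[0]
    for size in file_sizes_newest_first[1:]:
        if total >= size:        # candidate set size >= size of next file
            candidates.append(size)
            total += size
        else:
            break                # assumption: stop at the first file that is too large
    return candidates


# Hypothetical file sizes, newest first.
print(pick_candidates([4, 3, 6, 20]))
```

With these hypothetical sizes the three newest files are merged in one compaction and the much larger oldest file is left untouched, so its bytes are not rewritten.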
  39-41. LevelDB has high read amplification
      • Secondary Index Service: LevelDB does not use blooms for scans
      • We implemented prefix scans: range scans within the same key prefix
      • Blooms created for prefix
      • Reduces read amplification
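The idea of building blooms over key prefixes, so that a range scan within one prefix can skip files that cannot contain it, can be sketched as follows. This is a toy model and not RocksDB code: `PrefixBloom`, `prefix_scan`, the fixed prefix length, and the SHA-256-based hashing are all assumptions chosen for clarity, and the scan prefix is assumed to be exactly `prefix_len` bytes long.

```python
import hashlib


class PrefixBloom:
    """Toy bloom filter over fixed-length key prefixes (illustrative only)."""

    def __init__(self, prefix_len, num_bits=1024, num_hashes=3):
        self.prefix_len = prefix_len
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for clarity

    def _positions(self, prefix):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + prefix).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_key(self, key):
        for pos in self._positions(key[:self.prefix_len]):
            self.bits[pos] = 1

    def may_contain_prefix(self, prefix):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(prefix[:self.prefix_len]))


def prefix_scan(files, prefix):
    """files: list of (PrefixBloom, sorted list of (key, value)) pairs."""
    results, files_read = [], 0
    for bloom, entries in files:
        if not bloom.may_contain_prefix(prefix):
            continue              # skip the file without reading it
        files_read += 1
        results.extend((k, v) for k, v in entries if k.startswith(prefix))
    return sorted(results), files_read


def make_file(entries, prefix_len=5):
    bloom = PrefixBloom(prefix_len)
    for key, _ in entries:
        bloom.add_key(key)
    return bloom, sorted(entries)


files = [
    make_file([(b"user1:a", 1), (b"user1:b", 2)]),
    make_file([(b"user2:a", 3)]),
]
print(prefix_scan(files, b"user1"))
```

A bloom filter never reports a present prefix as absent, so skipping a file on a negative answer is safe; false positives only cost an unnecessary file read.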
  42-44. LevelDB: read modify write = 2X IOs
      • Counter increments: Get value, value++, Put value
      • LevelDB uses 2X IOPS
      • We implemented MergeRecord: put the “++” operation in a MergeRecord
      • Background compaction merges all MergeRecords
      • Uses only 1X IOPS
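The difference between a read-modify-write increment and a merge record can be sketched with a toy counter store that counts its own reads and writes. This models only the idea on slides 42 to 44; the class `CounterStore` and its fields are hypothetical and do not reflect RocksDB internals.

```python
class CounterStore:
    """Toy store: base values plus pending merge records (illustrative only)."""

    def __init__(self):
        self.base = {}       # fully merged values
        self.pending = {}    # key -> list of unmerged deltas
        self.reads = 0
        self.writes = 0

    def increment_read_modify_write(self, key):
        self.reads += 1                      # Get value
        value = self.base.get(key, 0) + 1    # value++
        self.writes += 1                     # Put value
        self.base[key] = value

    def merge(self, key, delta):
        self.writes += 1                     # a write only, no read
        self.pending.setdefault(key, []).append(delta)

    def compact(self):
        # Background step: fold pending deltas into the base values.
        for key, deltas in self.pending.items():
            self.base[key] = self.base.get(key, 0) + sum(deltas)
        self.pending = {}

    def get(self, key):
        # A read must also apply deltas that have not been compacted yet.
        return self.base.get(key, 0) + sum(self.pending.get(key, []))


rmw = CounterStore()
merged = CounterStore()
for _ in range(3):
    rmw.increment_read_modify_write("likes")
    merged.merge("likes", 1)
print(rmw.reads + rmw.writes, merged.reads + merged.writes)
```

The trade-off in this model is that a read issued before compaction has to combine the pending deltas itself, so the saving applies to write-heavy counters.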
  45-47. LevelDB has a Rigid Design
      • LevelDB design: cannot tune the system, fixed file sizes
      • We wanted a pluggable architecture
      • Pluggable compaction filter, e.g. TimeToLive
      • Pluggable memtable/sstable for RAM/Flash
      • Pluggable Compaction Algorithm
  48-50. The changes we made to LevelDB
      Inherited from LevelDB:
      • Log Structured Merge DB
      • Gets/Puts/Scans of keys
      • Forward and Reverse Iteration
      RocksDB:
      • 10X higher write rate
      • Fewer stalls
      • 7x lower write amplification
      • Blooms for range scans
      • Ability to avoid read-modify-write
      • Optimizations for flash or RAM
      • And many more…
  51-52. RocksDB is born!
      • Key-Value persistent store
      • Embedded
      • Optimized for fast storage
      • Server workloads
  53-54. What is it not?
      • Not distributed
      • No failover
      • Not highly-available: if the machine dies you lose your data
  55. RocksDB API
      ▪ Keys and values are arbitrary byte arrays.
      ▪ Data is stored sorted by key.
      ▪ The basic operations are Put(key,value), Get(key), Delete(key) and Merge(key, delta).
      ▪ Forward and backward iteration is supported over the data.
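The shape of the API on slide 55 (byte-array keys and values, data kept sorted by key, forward and backward iteration) can be illustrated with a small in-memory stand-in. This is not the RocksDB interface: the class `SortedKV` and its method names are hypothetical, and Merge is left out of this sketch.

```python
import bisect


class SortedKV:
    """In-memory stand-in for a store that keeps byte-string keys sorted."""

    def __init__(self):
        self._keys = []      # sorted list of keys
        self._values = {}    # key -> value

    def put(self, key: bytes, value: bytes):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def delete(self, key: bytes):
        if key in self._values:
            del self._values[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def forward(self, start: bytes = b""):
        """Yield (key, value) for keys >= start, in ascending order."""
        i = bisect.bisect_left(self._keys, start)
        for key in self._keys[i:]:
            yield key, self._values[key]

    def backward(self, start: bytes):
        """Yield (key, value) for keys <= start, in descending order."""
        i = bisect.bisect_right(self._keys, start)
        for key in reversed(self._keys[:i]):
            yield key, self._values[key]


kv = SortedKV()
kv.put(b"b", b"2")
kv.put(b"a", b"1")
kv.put(b"c", b"3")
kv.delete(b"b")
print(list(kv.forward(b"a")))
print(list(kv.backward(b"b")))
```

Because keys are compared as raw bytes, the iteration order is bytewise order; an application that needs a different ordering would have to encode its keys accordingly.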
  56-60. RocksDB Architecture (diagram labels)
      • Memory: Active MemTable, ReadOnly MemTable, Switch
      • Persistent Storage: log files; LSM of sst files
      • Flush; Compaction
      • Write Request; Read Request
  61. Log Structured Merge Tree -- Writes
      ▪ Log Structured Merge Tree
      ▪ New Puts are written to memory and optionally to the transaction log
      ▪ A log write sync option can also be specified for each individual write
      ▪ We say RocksDB is optimized for writes; what does this mean?
  62-63. RocksDB Write Path (diagram labels)
      • Active MemTable, ReadOnly MemTable, Switch
      • log files; LSM of sst files
      • Flush; Compaction
      • Write Request
  64. Log Structured Merge Tree -- Reads
      ▪ Data could be in memory or on disk
      ▪ Consult multiple files to find the latest instance of the key
      ▪ Use bloom filters to reduce IO
  65-68. RocksDB Read Path (diagram labels)
      • Memory: Active MemTable, ReadOnly MemTable
      • Persistent Storage: log files; LSM of sst files; Blooms
      • Flush; Compaction
      • Read Request
  69-78. RocksDB: Open & Pluggable (diagram labels)
      • Write Request from Application; Get or Scan Request from Application
      • Customizable WAL; Transaction log
      • Blooms
      • Pluggable Memtable format in RAM
      • Pluggable sst data format on storage
      • Pluggable Compaction
  79. Example: Customizable WALogging
      • In-house replication solution wants to be able to embed an arbitrary blob in the RocksDB WAL stream for log annotation
      • Use case: indicate where a log record came from in multi-master replication
      • Solution: a Put that only speaks to the log
  80-81. Example: Customizable WALogging (diagram labels)
      • Write Request from the Replication Layer, in one write batch: PutLogData(“I came from Mars”), Put(k1,v1)
      • Active MemTable: k1, v1
      • log: “I came from Mars”, k1/v1
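The behaviour shown on slides 79 to 81, a record that is written to the log but never reaches the memtable, can be modelled with a toy write batch. The slide names `PutLogData` and `Put`; everything else below (the Python classes `WriteBatch` and `ToyDB`, their fields, and the tuple encoding of records) is a hypothetical stand-in and not the RocksDB API.

```python
class WriteBatch:
    """Ordered list of records to apply together (illustrative only)."""

    def __init__(self):
        self.records = []   # (kind, payload) tuples in insertion order

    def put(self, key, value):
        self.records.append(("put", (key, value)))

    def put_log_data(self, blob):
        # Annotation that is meant for the log only.
        self.records.append(("log_data", blob))


class ToyDB:
    def __init__(self):
        self.memtable = {}
        self.wal = []       # stand-in for the write-ahead log

    def write(self, batch):
        self.wal.extend(batch.records)       # every record reaches the log
        for kind, payload in batch.records:
            if kind == "put":                # only puts reach the memtable
                key, value = payload
                self.memtable[key] = value


db = ToyDB()
batch = WriteBatch()
batch.put_log_data("I came from Mars")
batch.put("k1", "v1")
db.write(batch)
print(db.memtable)
print(db.wal)
```

A replication layer that tails the log can read the annotation next to the k1/v1 record, while readers of the database never see it.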
  82. Example: Pluggable SST format
      • One Facebook use case needs extremely fast response but could tolerate some loss of durability
      • Quick hack: mount sst in tmpfs
      • Still not performant: the existing sst format is block based
      • Solution: a much simpler format that just stores sorted key/value pairs sequentially
      • No blocks, no caching, mmap the whole file
      • Build an efficient lookup index on load
  83. Example: Blooms for MemTable
      • Same use case: after we optimized sst access, memtable lookup becomes a major cost in query
      • Problem: Get needs to go through the memtable lookups that eventually return no data
      • Solution: just add a bloom filter to memtable!
  84-87. RocksDB Read Path (diagram labels)
      • Memory: Blooms, Blooms, Active MemTable, ReadOnly MemTable
      • Persistent Storage: log files; LSM of sst files; Blooms
      • Flush; Compaction
      • Read Request
  88. Example: Pluggable memtable format
      • Another Facebook use case has a distinct load phase where no query is issued
      • Problem: write throughput is limited by a single writer thread
      • Solution: a new memtable representation that does not keep keys sorted
  89-93. Example: Pluggable memtable format (diagram labels)
      • Memory: Unsorted MemTable, ReadOnly MemTable, Switch
      • Persistent Storage: log files; LSM of sst files
      • Sort, Flush; Compaction
      • Write Request; Read Request
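The unsorted memtable on slides 88 to 93 defers ordering work until flush time ("Sort, Flush" on the diagram). A minimal sketch of that idea, assuming an append-only buffer and a single sort at flush; the class `UnsortedMemTable` and its methods are hypothetical, not the RocksDB memtable interface:

```python
class UnsortedMemTable:
    """Append-only buffer: inserts do no ordering work (illustrative only)."""

    def __init__(self):
        self.entries = []   # (key, value) pairs in arrival order

    def put(self, key, value):
        # Constant-time append; no sorted structure is maintained on insert.
        self.entries.append((key, value))

    def sort_and_flush(self):
        """Sort once at flush time, keeping the latest value for each key."""
        latest = {}
        for key, value in self.entries:   # later writes overwrite earlier ones
            latest[key] = value
        self.entries = []
        return sorted(latest.items())     # contents of the new sorted file


mem = UnsortedMemTable()
for key, value in [("c", 1), ("a", 2), ("b", 3), ("a", 4)]:
    mem.put(key, value)
print(mem.sort_and_flush())
```

The cost is that point lookups and scans against such a buffer are slow, which is acceptable only because the slide describes a load phase in which no query is issued.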
  94. Possible workloads for RocksDB?
      ▪ Serving data to users via a website
      ▪ A spam detection backend that needs fast access to data
      ▪ A graph search query that needs to scan a dataset in realtime
      ▪ Distributed Configuration Management Systems
      ▪ Fast serve of Hive Data
      ▪ A Queue that needs a high rate of inserts and deletes
  95-97. Futures
      • Scale linearly with number of CPUs: 32, 64 or higher core machines; ARM processors
      • Scale linearly with storage IOPS: striped flash cards; RAM & NVRAM storage
  98-100. Come Hack with us
      • RocksDB is Open Sourced
      • http://rocksdb.org
      • Developers group: https://www.facebook.com/groups/rocksdb.dev/
      • Help us HACK RocksDB