Yoshinori Matsunobu
Production Engineer, Meta
RocksDB Performance and Reliability
LESSONS LEARNED FROM YEARS OF PRODUCTION ENGINEERING AT SCALE
Agenda
1. RocksDB Overview
2. Differences between LSM and B+Tree
3. Performance Practices
4. Operations and Reliability Practices
ROCKSDB OVERVIEW
What is RocksDB
http://rocksdb.org/
Open Source Log-Structured Merge (LSM) database, forked from LevelDB
• Key-Value LSM persistent store
• Easier integration -- Embedded
• Native compression -- Optimized for fast storage
Used by many backend services at Meta, and by many large external services and products
• Column Family, Transaction, Parallelism, etc
• Major use cases inside Meta:
﹘ MyRocks: MySQL on top of RocksDB (RocksDB Storage Engine)
﹘ ZippyDB: Distributed key value store on top of RocksDB
[Figure: write path (Write Request, Memory, Switch, Flush, Files on Persistent Storage, Compaction) and read path (Read Request, Memory, Persistent Storage, Bloom Filter, Files)]
ROCKSDB OVERVIEW
Leveled Compaction
For each level, data is sorted by key
(In Level 0, data is sorted by key per file)
Compaction merges 1 Level n file with 10 Level n+1 files, then writes the result into Level n+1
Read Amplification: 1 to number of levels (depending on cache -- L0~L2 are usually cached)
Write Amplification: 1 + 1 + fanout * (number of levels - 2) / 2
Space Amplification: 1.11
• 11% is much smaller than B+Tree’s fragmentation
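The amplification figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a fanout of 10 and that each level is 1/fanout the size of the next one; the function names are hypothetical and not part of RocksDB.

```python
def write_amplification(fanout: int, num_levels: int) -> float:
    """Formula from the slide: 1 + 1 + fanout * (number of levels - 2) / 2."""
    return 1 + 1 + fanout * (num_levels - 2) / 2


def space_amplification(fanout: int, num_levels: int) -> float:
    """Total size relative to the last level, assuming each level is
    1/fanout the size of the level below it (an idealized model)."""
    return sum(1 / fanout**i for i in range(num_levels))


print(write_amplification(fanout=10, num_levels=5))            # 17.0
print(round(space_amplification(fanout=10, num_levels=4), 2))  # 1.11
```

With a fanout of 10 the series 1 + 0.1 + 0.01 + ... converges to about 1.11, which is where the 11% overhead comes from.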
ROCKSDB OVERVIEW
RocksDB Features
Column Family
TransactionDB, BlobDB, TTLDB
Prefix Bloom Filter, Partitioned Filter
DeleteRange, SingleDelete
Merge Operator
Backup Engine
Most configuration parameters can be changed online
DIFFERENCES BETWEEN LSM AND B+TREE
LSM vs B+Tree
Smaller space usage
• Smaller fragmentation overhead
• Works well with compression (saves more space than InnoDB Compression)
Lower write amplification
Slower read performance, which is relatively more visible for memory-bound workloads
Generally, faster write performance
• Maintaining secondary index is cheaper since LSM doesn’t need random reads
• Tables with only primary keys are slower to insert, due to higher unique key constraint check (Get) cost
Major difference vs InnoDB
• RocksDB TransactionDB does not support Gap Lock. Migrating from InnoDB Repeatable Read is tricky.
ROCKSDB PERFORMANCE
RocksDB Performance Practices
Use Jemalloc memory allocator
Understand RocksDB data formats, and keep important data sets in memory
Compression
Compaction
ROCKSDB PERFORMANCE
RocksDB file format – data, index and filter
<beginning_of_file>
[data block 1]
[data block 2]
...
[data block N]
[meta block 1: filter block]
[meta block 2: index block]
[meta block 3: compression dictionary block]
[meta block 4: range deletion block]
...
[meta block K: future extended block]
[metaindex block]
[Footer]
<end_of_file>
Data block -> Storing actual key/values
Filter block -> Storing bloom filter
Index block -> Offsets of each data block
Index block size depends on the number of data blocks
• Shrinking data blocks from 16KB to 4KB increases index block size by about 4x
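The 4x figure follows from the index holding one entry per data block. A rough sketch, assuming a fixed per-entry cost and an arbitrary 1 GiB of data (both are illustration-only assumptions):

```python
def index_entries(data_bytes: int, block_size: int) -> int:
    """One index entry per data block (ceiling division)."""
    return -(-data_bytes // block_size)


data = 1 << 30  # 1 GiB of data, an arbitrary example size
ratio = index_entries(data, 4 * 1024) / index_entries(data, 16 * 1024)
print(ratio)  # 4.0
```

Real index sizes also depend on key length, format_version and index_block_restart_interval, so treat the ratio as an approximation.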
ROCKSDB PERFORMANCE
Index and Filter size reduction
Filter and Index block cache hit rate is important
Size info can be obtained from Table Property, and cache info is periodically logged in LOG
“optimize_filters_for_hits=true” avoids storing filter in Lmax (saving total filter size by 90%)
Ribbon Filter saves bloom filter size by ~30% with comparable CPU util
Parameters to save index block size
• format_version=4 or 5
• index_block_restart_interval=16
Watch rocksdb_block_cache_index_miss
enable_index_compression=false to save CPU time
MyRocks has information_schema to expose SST file metrics
mysql> select sum(data_block_size)/1024/1024/1024 as size_gb,
sum(index_block_size)/1024/1024/1024 as index_gb,
sum(filter_block_size)/1024/1024/1024 as filter_gb
from information_schema.rocksdb_sst_props;
+-------------------+----------------+----------------+
| size_gb | index_gb | filter_gb |
+-------------------+----------------+----------------+
| 1009.362400736660 | 2.661879514344 | 1.734282894991 |
+-------------------+----------------+----------------+
ROCKSDB PERFORMANCE
Direct I/O
RocksDB supports Direct I/O for SST files (data files)
Buffered I/O uses substantial memory (slab) in Linux Kernel
Better memory efficiency and lower %system CPU with Direct I/O, especially if your workload is memory bound
Adjust Block Cache accordingly, since the filesystem cache is no longer used
Do not mix Buffered I/O and Direct I/O (serialized I/O)
use_direct_io_for_flush_and_compaction=ON
use_direct_reads=ON
cache_high_pri_pool_ratio=0.5
ROCKSDB PERFORMANCE
Hybrid Compression
RocksDB allows setting a different compression algorithm per level
Use stronger compression algorithm (Zstandard) in Lmax to save space
Use faster compression algorithm (LZ4 or None) in higher levels to keep up with writes
compression_per_level=kLZ4Compression or kNoCompression
bottommost_compression=kZSTD
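The resulting layout can be pictured with a small model. This is only a sketch: the function is hypothetical, and it assumes data has already reached the last level, so that the bottommost setting applies there.

```python
def effective_compression(num_levels: int,
                          upper: str = "kLZ4Compression",
                          bottommost: str = "kZSTD") -> list[str]:
    """Return the algorithm applied at each level, L0 first.
    Simplified model: every level except the last uses `upper`."""
    return [upper] * (num_levels - 1) + [bottommost]


for level, algo in enumerate(effective_compression(7)):
    print(f"L{level}: {algo}")
```

Because most of the data lives in the last level, the stronger algorithm there captures most of the space saving while the upper levels stay cheap to write.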
ROCKSDB PERFORMANCE
Avoid Compaction if possible
SST File Writer API
• It is users’ responsibility to presort rows by keys
[Figure: Normal Write Path in RocksDB (Flush, Compaction, Compaction, ...) compared with the Faster Write Path]
ROCKSDB PERFORMANCE
Bloom Filter
Pay attention to Bloom Filter Size
- “optimize_filters_for_hits=true” avoids storing filter in Lmax (saving total filter size by 90%)
- Ribbon Filter saves bloom filter size by ~30% with comparable CPU util
Whole Key Filtering
Prefix Bloom Filter
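The 90% saving can be estimated from the level size ratio. A minimal sketch, assuming 10 bits per key, a fanout of 10, four populated levels and one billion keys; all of these numbers and function names are illustrative assumptions.

```python
def filter_bytes(num_keys: float, bits_per_key: float = 10.0) -> float:
    """Approximate filter size: bits per key times key count, in bytes."""
    return num_keys * bits_per_key / 8


def last_level_fraction(fanout: int, num_levels: int) -> float:
    """Share of all keys that sit in the last level in an idealized tree."""
    sizes = [fanout**i for i in range(num_levels)]
    return sizes[-1] / sum(sizes)


total_keys = 1_000_000_000
frac = last_level_fraction(fanout=10, num_levels=4)
full = filter_bytes(total_keys)
without_lmax = filter_bytes(total_keys * (1 - frac))
print(round(frac, 2))                     # 0.9
print(round(1 - without_lmax / full, 2))  # 0.9
```

The trade-off is that lookups for keys that do not exist must read the last level's data blocks, since no filter is there to rule them out.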
ROCKSDB PERFORMANCE
Understand what happens with Delete
“Delete” adds a tombstone
MyRocks Update is a combination of Delete and Put
Tombstones don’t disappear until bottom level compaction happens
Some reads need to scan lots of tombstones => inefficient
• In this example, reading 5 entries is needed just for getting one row
RocksDB has an optimized API called SingleDelete, but it can’t eliminate tombstone overheads
• SingleDelete disappears when finding a matching Put. It has a requirement that same-key operations don’t repeat (e.g. Put(1) -> Put(1) -> SD(1) does not work)
• MyRocks internally uses SingleDelete for secondary keys
INSERT INTO t VALUES (1),(2),(3),(4),(5);
  Entries: Put(1), Put(2), Put(3), Put(4), Put(5)
DELETE FROM t WHERE id <= 4;
  Entries: Delete(1), Delete(2), Delete(3), Delete(4), Put(5)
SELECT COUNT(*) FROM t;
  Entries read: Delete(1), Delete(2), Delete(3), Delete(4), Put(5)
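The cost of the example above can be reproduced with a toy scan. This is a simplified model, assuming one entry per key (older Puts already merged away); it is not how the RocksDB iterator is implemented.

```python
PUT, DELETE = "Put", "Delete"


def count_rows(entries):
    """Scan sorted (key, op) entries; return (live_rows, entries_read)."""
    rows = read = 0
    for _key, op in entries:
        read += 1          # every entry is read, including tombstones
        if op == PUT:
            rows += 1      # only Puts are visible to the query
    return rows, read


after_delete = [(1, DELETE), (2, DELETE), (3, DELETE), (4, DELETE), (5, PUT)]
print(count_rows(after_delete))  # (1, 5)
```

Five entries are read to return a single row, and the ratio gets worse as tombstones accumulate.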
ROCKSDB PERFORMANCE
Scanning too many tombstones degrades read perf
Range scan (Seek) may hit this issue
There can be millions of consecutive tombstones if they are not handled properly
RocksDB exposes metrics as perf_context INTERNAL_DELETE_SKIPPED_COUNT, with perf context level >= 2
Operations can't be killed while seeking through tombstones
Deletion-Triggered Compaction (DTC) is one of the workarounds
• When creating new SST files, if they contain a certain number of tombstones, trigger another compaction to wipe the tombstones immediately
﹘ MyRocks has a sysvar to control that (rocksdb_compaction_sequential_deletes = 49999 / rocksdb_compaction_sequential_deletes_window = 50000)
﹘ RocksDB has an API to do that
﹘ Trade offs between high read cost and more compaction cost
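The window-based trigger can be sketched as a sliding count of deletes over the entries written to a new file. This is an illustrative model only: the function is hypothetical, and whether the real check uses "at least" or "more than" the threshold is an assumption here.

```python
from collections import deque


def needs_compaction(is_delete_flags, window=50000, threshold=49999):
    """True if any `window` consecutive entries hold at least
    `threshold` deletes (defaults mirror the sysvar values above)."""
    recent = deque()
    deletes = 0
    for flag in is_delete_flags:
        recent.append(flag)
        deletes += flag
        if len(recent) > window:
            deletes -= recent.popleft()   # slide the window forward
        if deletes >= threshold:
            return True
    return False


# Small numbers to keep the example readable.
print(needs_compaction([True] * 4 + [False], window=5, threshold=4))  # True
print(needs_compaction([True, False] * 10, window=5, threshold=4))    # False
```

A lower threshold removes tombstones sooner at the price of more compaction work, which is the trade-off named above.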
ROCKSDB PERFORMANCE
Slowdown because of too many point lookups
Point Lookup calls Get(). This is more expensive than point lookup from B+Tree
May hit RocksDB LRU block cache contention
• Visible as high %system CPU if that’s the case
• Improvements in RocksDB in progress
Typical workarounds
• Use MultiGet API
﹘ Instead of Get() x N times, issue one MultiGet()
﹘ MyRocks uses MultiGet when setting optimizer_switch = 'mrr=on,mrr_cost_based=off,batched_key_access=on'
• Adding more secondary indexes (different key/values)
﹘ Convert non-covering index scans (1 + N reads) to covering index scans (1 or 1 + small number of reads)
﹘ Cost to update secondary index is cheaper in LSM thanks to skipping reads
RocksDB Reliability Practices
Write Stalls
Understand metrics to watch
Error Handling
Data consistency
Recovery on failure
ROCKSDB RELIABILITY
Preventing Write Stall
Write Stalling is one of the most common problems in RocksDB/LSM
Writes stall because:
• Writing too fast
• L0 flush and compactions are not fast enough
• Creating too many L0 files
• Too many pending compaction bytes
• Inefficient CompactRange API usage
• Wrong Bulk Loading API usage (loading SST file into L0 instead of Lmax, invoking full compactions)
Write stall stats are available from status counters and LOGs
mysql> show global status like 'rocksdb_stall%';
+----------------------------------------------------+-------+
| Variable_name | Value |
+----------------------------------------------------+-------+
| rocksdb_stall_l0_file_count_limit_slowdowns | 0 |
| rocksdb_stall_locked_l0_file_count_limit_slowdowns | 0 |
| rocksdb_stall_l0_file_count_limit_stops | 0 |
| rocksdb_stall_locked_l0_file_count_limit_stops | 0 |
| rocksdb_stall_pending_compaction_limit_stops | 0 |
| rocksdb_stall_pending_compaction_limit_slowdowns | 0 |
| rocksdb_stall_memtable_limit_stops | 0 |
| rocksdb_stall_memtable_limit_slowdowns | 0 |
| rocksdb_stall_total_stops | 0 |
| rocksdb_stall_total_slowdowns | 0 |
| rocksdb_stall_micros | 0 |
+----------------------------------------------------+-------+
11 rows in set (0.00 sec)
2022/02/15-21:03:46.600403 7f5f077ff700 [WARN] [db/column_family.cc:929] [default]
Stopping writes because of estimated pending compaction bytes 1041689026590
ROCKSDB RELIABILITY
MemTable/L0 Stalls
If all MemTables get full, and if they can’t be flushed (e.g. max L0 files), further writes are blocked
Reported as these counters
• stall_memtable_limit_stops | slowdowns
• stall_l0_file_count_limit_stops | slowdowns
• stall_total_stops | slowdowns
Common workarounds
• Allow more L0 files -- Increase level0_slowdown_writes_trigger and level0_stop_writes_trigger (typically 20 | 30)
• Make MemTable flush faster -- use faster compression algorithm in L0 (kNoCompression, kLZ4Compression)
• Make L0 compactions faster -- use faster compression algorithm in L1, 2
• Start compaction earlier -- decrease level0_file_num_compaction_trigger (typically 4)
• Be careful about implicit Flush in RocksDB (e.g. SetOptions, Checkpoint) since it creates an L0 file
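How the three L0 triggers interact can be sketched as a simple classifier. The thresholds below are the typical values quoted above (4, 20, 30); the function and its return labels are hypothetical and ignore the memtable and pending-compaction-bytes conditions.

```python
def l0_write_state(l0_files: int,
                   compaction_trigger: int = 4,
                   slowdown_trigger: int = 20,
                   stop_trigger: int = 30) -> str:
    """Classify write behaviour from the number of L0 files alone."""
    if l0_files >= stop_trigger:
        return "stop"         # writes are blocked
    if l0_files >= slowdown_trigger:
        return "slowdown"     # writes are throttled
    if l0_files >= compaction_trigger:
        return "compacting"   # L0 -> L1 compaction is scheduled
    return "normal"


for n in (2, 4, 20, 30):
    print(n, l0_write_state(n))
```

Raising the slowdown and stop triggers widens the gap in which compaction can catch up, at the cost of more L0 files to check on reads.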
ROCKSDB RELIABILITY
Metrics to watch
RocksDB has two important metrics structures
- Stats (e.g. stalls, data/index/filter block cache hit/miss, compaction bytes)
- Perf Context (e.g. tombstone scanned, block decompressed time)
- perf_context_level >= 2 is recommended to get the most useful info, such as tombstones scanned. Level 3 is a little expensive because it also collects time stats
MyRocks exposes most metrics via information_schema and show global status
mysql> select * from rocksdb_perf_context_global;
+---------------------------------+-----------------+
| STAT_TYPE | VALUE |
+---------------------------------+-----------------+
| USER_KEY_COMPARISON_COUNT | 270471364854 |
| BLOCK_CACHE_HIT_COUNT | 7014318274 |
| BLOCK_READ_COUNT | 555394733 |
| BLOCK_READ_BYTE | 4359686643590 |
| BLOCK_READ_TIME | 67045272264489 |
| BLOCK_CHECKSUM_TIME | 2065141339797 |
| BLOCK_DECOMPRESS_TIME | 27036226090470 |
| GET_READ_BYTES | 604107492243 |
| MULTIGET_READ_BYTES | 26614080073 |
| ITER_READ_BYTES | 4515817650181 |
| INTERNAL_KEY_SKIPPED_COUNT | 64344684548 |
| INTERNAL_DELETE_SKIPPED_COUNT | 1141058309 |
| INTERNAL_RECENT_SKIPPED_COUNT | 8580663 |
| INTERNAL_MERGE_COUNT | 0 |
| GET_SNAPSHOT_TIME | 478716678460 |
| GET_FROM_MEMTABLE_TIME | 3107700425345 |
| GET_FROM_MEMTABLE_COUNT | 1745423505 |
| GET_POST_PROCESS_TIME | 579743978173 |
| GET_FROM_OUTPUT_FILES_TIME | 102555066991914 |
| SEEK_ON_MEMTABLE_TIME | 226655444780 |
| SEEK_ON_MEMTABLE_COUNT | 104572447 |
| NEXT_ON_MEMTABLE_COUNT | 38671332 |
| PREV_ON_MEMTABLE_COUNT | 2687679 |
| SEEK_CHILD_SEEK_TIME | 23240171176784 |
| SEEK_CHILD_SEEK_COUNT | 668676730 |
…
ROCKSDB RELIABILITY
Most configurations are Dynamic
RocksDB has database level and column family level configurations
Majority of the configurations are column family level
You can change most RocksDB configuration parameters without stopping database
Parameter change examples:
• Decreasing Block cache size to avoid Memory Pressure
• Increasing L0 file limits to avoid L0 stalls
• Changing compression algorithm (effective on next Flush/Compaction)
Column Family parameter change (SetOptions API) involves MemTable Flush. So if you hit L0 stop, you can’t change parameters (fix in roadmap)
ROCKSDB RELIABILITY
I/O Error Handling
RocksDB returns an error to the caller on I/O errors, and it is up to the RocksDB user to decide how to handle it
• Normally users get kIOError but it's not guaranteed (e.g. kIncomplete)
Typical failure handling on errors
• Aborting server
• Returning errors
• Retrying
• In any case, don’t suppress errors
ROCKSDB RELIABILITY
I/O Error Handling in MyRocks
MyRocks can't roll back on errors at engine commit, so it aborts the server instead and lets crash recovery resolve binlog-engine consistency.
ROCKSDB RELIABILITY
Unique Key Constraints
RocksDB API Put() does not check if the same key exists or not.
Unlike INSERT in InnoDB, Put() does not return “key already exists” error
Call Get() for checking existence
Call GetForUpdate() to lock the key
MyRocks INSERT calls GetForUpdate() before Put(), so it can detect unique key violations
You have a choice to blindly insert without reading at all (MyRocks REPLACE has an option to do that)
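The difference between a blind Put() and a checked insert can be shown with a toy in-memory store. This is a sketch of the pattern only: TinyStore and its methods are hypothetical, and a single Python lock stands in for per-key locking.

```python
import threading


class TinyStore:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        """Blind write: silently overwrites an existing key."""
        with self._lock:
            self._data[key] = value

    def insert_unique(self, key, value):
        """Read under the lock, then write: the Get-before-Put pattern."""
        with self._lock:
            if key in self._data:
                raise KeyError(f"duplicate key: {key!r}")
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)


store = TinyStore()
store.put(1, "a")
store.put(1, "b")              # no error, the old value is replaced
try:
    store.insert_unique(1, "c")
except KeyError as err:
    print("rejected:", err)
```

The checked path pays for one extra read per insert, which is why tables with only a primary key insert more slowly than a blind write would.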
ROCKSDB RELIABILITY
Data consistency
When you physically copy RocksDB database elsewhere, make sure you copy all dependent files – SST files, WAL, Manifest, blob files
• Several online copy solutions – RocksDB backup engine, myrocks_hotbackup, xtrabackup
By default, RocksDB allows opening a database even if WAL files are missing
• This may end up opening an inconsistent database
• This is because Manifest file does not track WAL files
Use more strict option to enforce file integrity
• Turn track_and_verify_wals_in_manifest on
﹘ This tracks WAL file and size
﹘ Opening database with missing WALs is rejected
ROCKSDB RELIABILITY
Recovery on Database Crash
RocksDB has a parameter called wal_recovery_mode
• RocksDB default is 2 (kPointInTimeRecovery)
• It used to have default 1 (kAbsoluteConsistency)
• Mode 1 has the side effect of refusing to open the RocksDB database, even if it can be recovered
Instance crash (incl process crash) may leave the tail WAL file incomplete
RocksDB refuses to start with param value 1 (kAbsoluteConsistency)
RocksDB does NOT refuse with param value 2 (kPointInTimeRecovery)
General recommendation:
• Use wal_recovery_mode=2 with track_and_verify_wals_in_manifest=ON
• Rely on replication to recover lost transactions
2022-03-26T02:21:26.166366-07:00 0 [Note] [MY-000000] [Server] RocksDB: Opening TransactionDB...
2022-03-26T02:21:28.620095-07:00 0 [ERROR] [MY-000000] [Server] RocksDB: Error opening instance, Status Code: 2, Status: Corruption: truncated record body
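The difference between the two modes can be modelled as a WAL replay that meets a damaged tail record. This is a toy sketch: the record layout and function are hypothetical, and only the mode numbers come from the text above.

```python
ABSOLUTE_CONSISTENCY = 1   # wal_recovery_mode=1
POINT_IN_TIME = 2          # wal_recovery_mode=2


def replay_wal(records, mode):
    """records: list of (payload, is_intact). Returns recovered payloads."""
    recovered = []
    for payload, intact in records:
        if not intact:
            if mode == ABSOLUTE_CONSISTENCY:
                # Any damage is fatal: the database refuses to open.
                raise IOError("Corruption: truncated record body")
            break  # point-in-time: keep everything before the damage
        recovered.append(payload)
    return recovered


wal = [("txn1", True), ("txn2", True), ("txn3", False)]  # crash cut txn3 short
print(replay_wal(wal, POINT_IN_TIME))  # ['txn1', 'txn2']
```

In mode 2 the truncated transaction is dropped, so it has to be recovered from elsewhere, for example through replication as recommended above.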
OTHER TOPICS
Dealing with Snapshot Conflicts
InnoDB natively supports range lock (next key lock / gap lock) by default
• This was for historical reason to work with Statement Based Binary Logging in MySQL
• Often caused hot row lock contentions
• Range lock is not held with Row Based Binary Logging + Read Committed Isolation Level
RocksDB (and many other databases including PostgreSQL) do not support range lock
• There is ongoing work with an external contributor to support it in RocksDB
PostgreSQL Repeatable Read (and Serializable) returns “Snapshot Conflict” error on conflicts
MyRocks uses RocksDB TransactionDB and implements the same behavior
You can’t eliminate “snapshot conflict” errors with Repeatable Read / Serializable isolation without range lock
Handling errors, or switching to Read Committed are typical workarounds
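Handling the error on the application side usually means retrying the whole transaction. A minimal sketch, assuming the client surfaces the conflict as an exception; SnapshotConflict and run_with_retry are hypothetical names, not part of any MySQL driver.

```python
class SnapshotConflict(Exception):
    """Stand-in for the error a client sees on a snapshot conflict."""


def run_with_retry(txn, max_attempts=3):
    """Run `txn` (a callable executing one whole transaction) and retry
    it from the start when it fails with a snapshot conflict."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SnapshotConflict:
            if attempt == max_attempts:
                raise   # give up and surface the error to the caller


calls = {"n": 0}


def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise SnapshotConflict()
    return "committed"


print(run_with_retry(flaky_txn))  # committed
```

The retried callable must re-read its inputs on every attempt, because the snapshot from the failed attempt is no longer valid.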
OTHER TOPICS
InnoDB to MyRocks/RocksDB migration steps
InnoDB RR (Repeatable Read) -> InnoDB RC (Read Committed) -> MyRocks RC
• Evaluate if there are queries depending on gap lock
﹘ Meta-MySQL feature: gap_lock_write_log and gap_lock_raise_error are sysvars to help
• InnoDB RC to MyRocks RC is straightforward
InnoDB RR -> MyRocks RR (-> MyRocks RC)
• Evaluate if there are noticeable number of snapshot conflict errors
﹘ rocksdb_snapshot_conflict_errors is a status counter that tells how often snapshot conflicts are hit
﹘ Users see ‘Snapshot Conflict’ error message with ‘DEADLOCK’ error code
• Flipping from RR to RC eliminates snapshot conflict errors
﹘ But it is necessary to verify if RC is safe
Summary
RocksDB is a modern LSM database library, with years of production deployments at scale
Compared to B+Tree, RocksDB (LSM) saves space and offers faster write performance, but pay attention to read performance drops
Pay attention to data, index and filter block size and cache miss
Utilize compression and compaction tuning options
Pay attention to tombstone scanning costs, and utilize several mitigations like Deletion Triggered Compaction
Pay attention to write stalls
RocksDB Performance and Reliability Practices

More Related Content

What's hot

Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problems
Alexander Korotkov
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Databricks
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
MIJIN AN
 
Log Structured Merge Tree
Log Structured Merge TreeLog Structured Merge Tree
Log Structured Merge Tree
University of California, Santa Cruz
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
Flink Forward
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Bo Yang
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Dvir Volk
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
Alluxio, Inc.
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
Flink Forward
 
The Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScale
The Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScaleThe Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScale
The Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScale
Colin Charles
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
MIJIN AN
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
confluent
 
Modeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQLModeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQL
ScyllaDB
 
Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at Facebook
Databricks
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controller
confluent
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 

What's hot (20)

Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problems
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
 
Log Structured Merge Tree
Log Structured Merge TreeLog Structured Merge Tree
Log Structured Merge Tree
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
 
The Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScale
The Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScaleThe Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScale
The Proxy Wars - MySQL Router, ProxySQL, MariaDB MaxScale
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
 
Modeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQLModeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQL
 
Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at Facebook
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controller
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 

Similar to RocksDB Performance and Reliability Practices

M|18 How Facebook Migrated to MyRocks
M|18 How Facebook Migrated to MyRocksM|18 How Facebook Migrated to MyRocks
M|18 How Facebook Migrated to MyRocks
MariaDB plc
 
MyRocks introduction and production deployment
MyRocks introduction and production deploymentMyRocks introduction and production deployment
MyRocks introduction and production deployment
Yoshinori Matsunobu
 
Migrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMigrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at Facebook
MariaDB plc
 
Why databases cry at night
Why databases cry at nightWhy databases cry at night
Why databases cry at night
Michael Yarichuk
 
Vote NO for MySQL
Vote NO for MySQLVote NO for MySQL
Vote NO for MySQL
Ulf Wendel
 
JSSUG: SQL Sever Performance Tuning
JSSUG: SQL Sever Performance TuningJSSUG: SQL Sever Performance Tuning
JSSUG: SQL Sever Performance Tuning
Kenichiro Nakamura
 
My sql 56_roadmap_april2012
My sql 56_roadmap_april2012My sql 56_roadmap_april2012
My sql 56_roadmap_april2012
sqlhjalp
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...
Alluxio, Inc.
 
SQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinarSQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinar
Denny Lee
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
ScyllaDB
 
MariaDB ColumnStore
MariaDB ColumnStoreMariaDB ColumnStore
MariaDB ColumnStore
MariaDB plc
 
Proving out flash storage array performance using swingbench and slob
Proving out flash storage array performance using swingbench and slobProving out flash storage array performance using swingbench and slob
Proving out flash storage array performance using swingbench and slob
Kapil Goyal
 
Ms sql server architecture
Ms sql server architectureMs sql server architecture
Ms sql server architecture
Ajeet Singh
 
POLARDB: A database architecture for the cloud
POLARDB: A database architecture for the cloudPOLARDB: A database architecture for the cloud
POLARDB: A database architecture for the cloud
oysteing
 
Fudcon talk.ppt
Fudcon talk.pptFudcon talk.ppt
Fudcon talk.ppt
webhostingguy
 
Clustered Columnstore - Deep Dive
Clustered Columnstore - Deep DiveClustered Columnstore - Deep Dive
Clustered Columnstore - Deep Dive
Niko Neugebauer
 
SQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery WebinarSQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery Webinar
Denny Lee
 
DBCC - Dubi Lebel
DBCC - Dubi LebelDBCC - Dubi Lebel
DBCC - Dubi Lebel
sqlserver.co.il
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Amazon Web Services
 
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum
 

Similar to RocksDB Performance and Reliability Practices (20)

M|18 How Facebook Migrated to MyRocks
M|18 How Facebook Migrated to MyRocksM|18 How Facebook Migrated to MyRocks
M|18 How Facebook Migrated to MyRocks
 
MyRocks introduction and production deployment
MyRocks introduction and production deploymentMyRocks introduction and production deployment
MyRocks introduction and production deployment
 
Migrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMigrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at Facebook
 
Why databases cry at night
Why databases cry at nightWhy databases cry at night
Why databases cry at night
 
Vote NO for MySQL
Vote NO for MySQLVote NO for MySQL
Vote NO for MySQL
 
JSSUG: SQL Sever Performance Tuning
JSSUG: SQL Sever Performance TuningJSSUG: SQL Sever Performance Tuning
JSSUG: SQL Sever Performance Tuning
 
My sql 56_roadmap_april2012
My sql 56_roadmap_april2012My sql 56_roadmap_april2012
My sql 56_roadmap_april2012
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...
 
SQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinarSQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinar
 
RocksDB Performance and Reliability Practices

  • 1. Yoshinori Matsunobu Production Engineer, Meta RocksDB Performance and Reliability LESSONS LEARNED FROM YEARS OF PRODUCTION ENGINEERING AT SCALE
  • 2. 1. RocksDB Overview 2. Differences between LSM and B+Tree 3. Performance Practices 4. Operations and Reliability Practices Agenda
  • 3. ROCKSDB OVERVIEW What is RocksDB http://rocksdb.org/ Open Source Log-Structured Merge (LSM) database, forked from LevelDB • Key-Value LSM persistent store • Easier integration -- Embedded • Native compression -- Optimized for fast storage Used at many backend services at Meta, and many external large services and products • Column Family, Transaction, Parallelism, etc • Major use cases inside Meta: ﹘ MyRocks: MySQL on top of RocksDB (RocksDB Storage Engine) ﹘ ZippyDB: Distributed key value store on top of RocksDB
  • 4. [Diagram: write path. Labels: Write Request, Memory, Switch, Persistent Storage, Flush, Files, Compaction]
  • 5. [Diagram: read path. Labels: Read Request, Memory, Persistent Storage, Bloom Filter, Files]
  • 6. ROCKSDB OVERVIEW Leveled Compaction
    For each level, data is sorted by key
    (In Level 0, data is sorted by key per file)
    Compaction merges 1 Level n file + 10 Level n+1 files, then writes the result into Level n+1
    Read Amplification: 1 ~ number of levels (depending on cache -- L0~L2 are usually cached)
    Write Amplification: 1 + 1 + fanout * (number of levels - 2) / 2
    Space Amplification: 1.11
    • 11% is much smaller than B+Tree’s fragmentation
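The amplification figures on this slide can be reproduced with a little arithmetic. The sketch below is only an illustration: the function names are made up for this example, and the space amplification model (each level holding 1/fanout of the data of the level below it) is an assumption about where the 1.11 figure comes from, not something the slide spells out.

```python
def write_amplification(fanout: int, num_levels: int) -> float:
    """Slide formula: 1 + 1 + fanout * (number of levels - 2) / 2."""
    return 1 + 1 + fanout * (num_levels - 2) / 2


def space_amplification(fanout: int, num_levels: int) -> float:
    """Assumed model: the last level holds the full data set and every
    level above it holds 1/fanout of the level below."""
    return sum(fanout ** -i for i in range(num_levels))


if __name__ == "__main__":
    print(write_amplification(fanout=10, num_levels=5))
    print(round(space_amplification(fanout=10, num_levels=3), 2))
```

With a fanout of 10, the upper levels add roughly 11% on top of the last level under this model, which matches the 1.11 quoted on the slide.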
  • 7. ROCKSDB OVERVIEW RocksDB Features Column Family TransactionDB, BlobDB, TTLDB Prefix Bloom Filter, Partitioned Filter DeleteRange, SingleDelete Merge Operator Backup Engine Most configuration parameters can be changed online
  • 8. DIFFERENCES BETWEEN LSM AND B+TREE LSM vs B+Tree Smaller space usage • Smaller fragmentation overhead • Working well with compression (Saving better than InnoDB Compression) Lower write amplification Slower read performance. For memory bound workloads, it is relatively more visible. Generally, faster write performance • Maintaining secondary index is cheaper since LSM doesn’t need random reads • Tables with only primary keys are slower to insert, due to higher unique key constraint check (Get) cost Major difference vs InnoDB • RocksDB TransactionDB does not support Gap Lock. Migrating from InnoDB Repeatable Read is tricky.
  • 9. ROCKSDB PERFORMANCE RocksDB Performance Practices Use Jemalloc memory allocator Understand RocksDB data formats, and keep important data sets in memory Compression Compaction
  • 10. ROCKSDB PERFORMANCE RocksDB file format – data, index and filter
    <beginning_of_file>
    [data block 1]
    [data block 2]
    ...
    [data block N]
    [meta block 1: filter block]
    [meta block 2: index block]
    [meta block 3: compression dictionary block]
    [meta block 4: range deletion block]
    ...
    [meta block K: future extended block]
    [metaindex block]
    [Footer]
    <end_of_file>
    Data block -> Storing actual key/values
    Filter block -> Storing bloom filter
    Index block -> Offsets of each data block
    Index block size depends on the number of data blocks
    • 16KB -> 4KB data block will increase index block size by 4x
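The 4x figure follows from the index block holding one entry per data block. A minimal sketch of that arithmetic, assuming index size grows linearly with the number of entries (it ignores key length and restart intervals); the function name and the 64 MB file size are hypothetical:

```python
def index_entries(sst_file_bytes: int, data_block_bytes: int) -> int:
    """One index entry per data block (ceiling division)."""
    return -(-sst_file_bytes // data_block_bytes)


if __name__ == "__main__":
    file_size = 64 * 1024 * 1024  # hypothetical 64 MB SST file
    with_16k = index_entries(file_size, 16 * 1024)
    with_4k = index_entries(file_size, 4 * 1024)
    print(with_16k, with_4k, with_4k / with_16k)
```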
  • 11. ROCKSDB PERFORMANCE Index and Filter size reduction
    Filter and Index block cache hit rate is important
    Size info can be obtained from Table Property, and cache info is periodically logged in LOG
    “optimize_filters_for_hits=true” avoids storing filter in Lmax (saving total filter size by 90%)
    Ribbon Filter saves bloom filter size by ~30% with comparable CPU util
    Parameters to save index block size
    • format_version=4 or 5
    • index_block_restart_interval=16
    Watch rocksdb_block_cache_index_miss
    enable_index_compression=false to save CPU time
    MyRocks has information_schema to expose SST file metrics
    mysql> select sum(data_block_size)/1024/1024/1024 as size_gb,
           sum(index_block_size)/1024/1024/1024 as index_gb,
           sum(filter_block_size)/1024/1024/1024 as filter_gb
           from information_schema.rocksdb_sst_props;
    +-------------------+----------------+----------------+
    | size_gb           | index_gb       | filter_gb      |
    +-------------------+----------------+----------------+
    | 1009.362400736660 | 2.661879514344 | 1.734282894991 |
    +-------------------+----------------+----------------+
  • 12. ROCKSDB PERFORMANCE Direct I/O RocksDB supports Direct I/O for SST files (data files) Buffered I/O uses substantial memory (slab) in Linux Kernel Better memory efficiency and lower %system CPU with Direct I/O, especially if your workload is memory bound Adjust Block Cache accordingly, since filesystem cache can no longer be useful Do not mix Buffered I/O and Direct I/O (serialized I/O) use_direct_io_for_flush_and_compaction=ON use_direct_reads=ON cache_high_pri_pool_ratio=0.5
  • 13. ROCKSDB PERFORMANCE Hybrid Compression
    RocksDB allows a different compression algorithm to be set for each level
    Use a stronger compression algorithm (Zstandard) in Lmax to save space
    Use a faster compression algorithm (LZ4 or None) in the upper levels (closer to L0) to keep up with writes
    compression_per_level=kLZ4Compression or kNoCompression
    bottommost_compression=kZSTD
  • 14. ROCKSDB PERFORMANCE Avoid Compaction if possible SST File Writer API • It is users’ responsibility to presort rows by keys Normal Write Path in RocksDB …. Flush Compaction Compaction Faster Write Path
  • 15. ROCKSDB PERFORMANCE Bloom Filter Pay attention to Bloom Filter Size - “optimize_filters_for_hits=true” avoids storing filter in Lmax (saving total filter size by 90%) - Ribbon Filter saves bloom filter size by ~30% with comparable CPU util Whole Key Filtering Prefix Bloom Filter
  • 16. ROCKSDB PERFORMANCE Understand what happens with Delete
    “Delete” adds a tombstone
    MyRocks Update is a combination of Delete and Put
    Tombstones don’t disappear until bottom level compaction happens
    Some reads need to scan lots of tombstones => inefficient
    • In this example, reading 5 entries is needed just to get one row
    RocksDB has an optimized API called SingleDelete, but it can’t eliminate tombstone overheads
    • A SingleDelete disappears when it finds a matching Put. It requires that same-key operations don’t repeat (e.g. Put(1) -> Put(1) -> SD(1) does not work)
    • MyRocks internally uses SingleDelete for secondary keys
    Example:
    INSERT INTO t VALUES (1),(2),(3),(4),(5);  writes  Put(1) Put(2) Put(3) Put(4) Put(5)
    DELETE FROM t WHERE id <= 4;  leaves  Delete(1) Delete(2) Delete(3) Delete(4) Put(5)
    SELECT COUNT(*) FROM t;  reads  Delete(1) Delete(2) Delete(3) Delete(4) Put(5)
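The two points on this slide, tombstone scanning and the SingleDelete restriction, can be illustrated with a toy model. This is not RocksDB's compaction logic: the functions and the two-level layout below are invented for the example, and only mimic the behavior the slide describes.

```python
def scan(ops):
    """ops: list of (op, key), oldest first. Returns (live_keys, entries_read).

    The newest entry per key wins, but the reader still has to step over
    every tombstone to find the live rows.
    """
    newest = {}
    for op, key in ops:
        newest[key] = op
    live = [k for k in sorted(newest) if newest[k] == "put"]
    return live, len(newest)


def compact_upper(ops):
    """Compact one key's entries in an upper level (oldest first),
    without touching older entries that live in a lower level."""
    out = []
    for op in ops:
        if op == "single_delete" and out and out[-1] == "put":
            out.pop()      # SingleDelete and the matching Put cancel out
        else:
            out = [op]     # otherwise the newer entry shadows older ones
    return out


def visible(upper, lower):
    """Upper-level entries are newer than lower-level ones; newest wins."""
    merged = lower + upper
    return bool(merged) and merged[-1] == "put"


if __name__ == "__main__":
    ops = [("put", k) for k in range(1, 6)] + [("delete", k) for k in range(1, 5)]
    print(scan(ops))  # one live row, five entries read

    # Put(1) sits in a lower level; Put(1) and SD(1) are compacted above it.
    print(visible(compact_upper(["put", "single_delete"]), ["put"]))  # old Put resurfaces
    print(visible(compact_upper(["put", "delete"]), ["put"]))         # regular Delete still hides it
```

In this model the SingleDelete and the newer Put annihilate each other, so nothing is left to hide the older Put, which is why repeated Puts on the same key are not allowed before a SingleDelete.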
  • 17. ROCKSDB PERFORMANCE Scanning too many tombstones degrades read perf
    Range scan (Seek) may hit this issue
    Consecutive tombstones can number in the millions if they are not dealt with properly
    RocksDB exposes this metric in perf_context as INTERNAL_DELETE_SKIPPED_COUNT, with perf context level >= 2
    Operations can’t be killed while Seek is scanning tombstones
    Deletion-Triggered Compaction (DTC) is one of the workarounds
    • When creating new SST files, if they contain a certain number of tombstones, trigger another compaction to wipe the tombstones immediately
    ﹘ MyRocks has sysvars to control that (rocksdb_compaction_sequential_deletes = 49999 / rocksdb_compaction_sequential_deletes_window = 50000)
    ﹘ RocksDB has an API to do that
    ﹘ It is a trade-off between high read cost and more compaction cost
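The window and threshold settings above describe a sliding-window check over the entries written to a new SST file. A minimal sketch of such a check, assuming the file is flagged once any window of `window` consecutive entries contains at least `trigger` tombstones; the function is hypothetical and the exact comparison used by MyRocks or RocksDB may differ:

```python
from collections import deque


def needs_compaction(entries, window=50000, trigger=49999):
    """entries: iterable of booleans in key order, True for a tombstone."""
    recent = deque()
    deletes = 0
    for is_delete in entries:
        recent.append(is_delete)
        deletes += is_delete
        if len(recent) > window:
            deletes -= recent.popleft()   # slide the window forward
        if deletes >= trigger:
            return True
    return False


if __name__ == "__main__":
    print(needs_compaction([True] * 5 + [False], window=5, trigger=4))
    print(needs_compaction([True, False] * 10, window=5, trigger=4))
```

Lowering the trigger relative to the window makes compactions fire more often, which is the read cost versus compaction cost trade-off the slide mentions.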
  • 18. ROCKSDB PERFORMANCE Slowdown because of too many point lookups Point Lookup calls Get(). This is more expensive than point lookup from B+Tree May hit RocksDB LRU block cache contentions • Visible as high %system CPU if that’s the case • Improvements in RocksDB in progress Typical workarounds • Use MultiGet API ﹘ Instead of Get() x N times, issue one MultiGet() ﹘ MyRocks uses MultiGet when setting optimizer_switch = ‘mrr=on,mrr_cost_based=off, batched_key_access=on’ • Adding more secondary indexes (different key/values) ﹘ Convert non-covering index scans (1 + N reads) to covering index scans (1 or 1 + small number of reads) ﹘ Cost to update secondary index is cheaper in LSM thanks to skipping reads
  • 19. RocksDB Reliability Practices Write Stalls Understand metrics to watch Error Handling Data consistency Recovery on failure
  • 20. ROCKSDB RELIABILITY Preventing Write Stall
    Write Stalling is one of the most common problems in RocksDB/LSM
    Write stalls because:
    • Writing too fast
    • L0 flush and compactions are not fast enough
    • Creating too many L0 files
    • Too many pending compaction bytes
    • Inefficient CompactRange API usage
    • Wrong Bulk Loading API usage (loading SST file into L0 instead of Lmax, invoking full compactions)
    Write stall stats are available from status counters and LOGs
    mysql> show global status like 'rocksdb_stall%';
    +----------------------------------------------------+-------+
    | Variable_name                                      | Value |
    +----------------------------------------------------+-------+
    | rocksdb_stall_l0_file_count_limit_slowdowns        | 0     |
    | rocksdb_stall_locked_l0_file_count_limit_slowdowns | 0     |
    | rocksdb_stall_l0_file_count_limit_stops            | 0     |
    | rocksdb_stall_locked_l0_file_count_limit_stops     | 0     |
    | rocksdb_stall_pending_compaction_limit_stops       | 0     |
    | rocksdb_stall_pending_compaction_limit_slowdowns   | 0     |
    | rocksdb_stall_memtable_limit_stops                 | 0     |
    | rocksdb_stall_memtable_limit_slowdowns             | 0     |
    | rocksdb_stall_total_stops                          | 0     |
    | rocksdb_stall_total_slowdowns                      | 0     |
    | rocksdb_stall_micros                               | 0     |
    +----------------------------------------------------+-------+
    11 rows in set (0.00 sec)
    2022/02/15-21:03:46.600403 7f5f077ff700 [WARN] [db/column_family.cc:929] [default] Stopping writes because of estimated pending compaction bytes 1041689026590
  • 21. ROCKSDB RELIABILITY MemTable/L0 Stalls If all MemTables get full, and if they can’t be flushed (e.g. max L0 files), further writes are blocked Reported as these counters • stall_memtable_limit_stops | slowdowns • stall_l0_file_count_limit_stops | slowdowns • stall_total_stops | slowdowns Common workarounds • Allow more L0 files -- Increase level0_slowdown_writes_trigger and level0_stop_writes_trigger (typically 20 | 30) • Make MemTable flush faster -- use faster compression algorithm in L0 (kNoCompression, kLZ4Compression) • Make L0 compactions faster – use faster compression algorithm in L1, 2 • Start compaction earlier -- decrease level0_file_num_compaction_trigger (typically 4) • Be careful about implicit Flush in RocksDB (e.g. SetOptions, CheckPoint) since it creates a L0 file
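The L0 triggers can be read as a simple threshold check on the number of L0 files. The sketch below is a simplified model with a hypothetical function name; the default values are the "typically 20 | 30" figures from the slide, and the real stall logic also considers MemTables and pending compaction bytes.

```python
def l0_write_state(l0_files: int,
                   slowdown_trigger: int = 20,
                   stop_trigger: int = 30) -> str:
    """Classify write behavior from the L0 file count alone."""
    if l0_files >= stop_trigger:
        return "stop"
    if l0_files >= slowdown_trigger:
        return "slowdown"
    return "ok"


if __name__ == "__main__":
    for n in (4, 20, 30):
        print(n, l0_write_state(n))
    # Raising the limits, as the slide suggests, keeps writes flowing longer.
    print(l0_write_state(30, slowdown_trigger=40, stop_trigger=60))
```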
  • 22. ROCKSDB RELIABILITY Metrics to watch
    RocksDB has two important metrics structures
    - Stats (e.g. stalls, data/index/filter block cache hit/miss, compaction bytes)
    - Perf Context (e.g. tombstone scanned, block decompressed time)
    - perf_context_level >= 2 is recommended to get most useful info like tombstone scanned. 3 is a little expensive to get time stats
    MyRocks exposes most metrics via information_schema and show global status
    mysql> select * from rocksdb_perf_context_global;
    +---------------------------------+-----------------+
    | STAT_TYPE                       | VALUE           |
    +---------------------------------+-----------------+
    | USER_KEY_COMPARISON_COUNT       | 270471364854    |
    | BLOCK_CACHE_HIT_COUNT           | 7014318274      |
    | BLOCK_READ_COUNT                | 555394733       |
    | BLOCK_READ_BYTE                 | 4359686643590   |
    | BLOCK_READ_TIME                 | 67045272264489  |
    | BLOCK_CHECKSUM_TIME             | 2065141339797   |
    | BLOCK_DECOMPRESS_TIME           | 27036226090470  |
    | GET_READ_BYTES                  | 604107492243    |
    | MULTIGET_READ_BYTES             | 26614080073     |
    | ITER_READ_BYTES                 | 4515817650181   |
    | INTERNAL_KEY_SKIPPED_COUNT      | 64344684548     |
    | INTERNAL_DELETE_SKIPPED_COUNT   | 1141058309      |
    | INTERNAL_RECENT_SKIPPED_COUNT   | 8580663         |
    | INTERNAL_MERGE_COUNT            | 0               |
    | GET_SNAPSHOT_TIME               | 478716678460    |
    | GET_FROM_MEMTABLE_TIME          | 3107700425345   |
    | GET_FROM_MEMTABLE_COUNT         | 1745423505      |
    | GET_POST_PROCESS_TIME           | 579743978173    |
    | GET_FROM_OUTPUT_FILES_TIME      | 102555066991914 |
    | SEEK_ON_MEMTABLE_TIME           | 226655444780    |
    | SEEK_ON_MEMTABLE_COUNT          | 104572447       |
    | NEXT_ON_MEMTABLE_COUNT          | 38671332        |
    | PREV_ON_MEMTABLE_COUNT          | 2687679         |
    | SEEK_CHILD_SEEK_TIME            | 23240171176784  |
    | SEEK_CHILD_SEEK_COUNT           | 668676730       |
    …
  • 23. ROCKSDB RELIABILITY Most configurations are Dynamic RocksDB has database level and column family level configurations Majority of the configurations are column family level You can change most RocksDB configuration parameters without stopping database Parameter change examples: • Decreasing Block cache size to avoid Memory Pressure • Increasing L0 file limits to avoid L0 stalls • Changing compression algorithm (effective on next Flush/Compaction) Column Family parameter change (SetOptions API) involves MemTable Flush. So if you hit L0 stop, you can’t change parameters (fix in roadmap)
  • 24. ROCKSDB RELIABILITY I/O Error Handling RocksDB returns an error to the caller on I/O errors, and it is up to the RocksDB user to decide how to handle it • Normally users get kIOError but it’s not guaranteed (e.g. kIncomplete) Typical failure handling on errors • Aborting server • Returning errors • Retrying • In any case, don’t suppress errors
  • 25. ROCKSDB RELIABILITY I/O Error Handling in MyRocks Can’t roll back on errors at engine commit. So we abort server instead, and let crash recovery resolve binlog-engine consistency.
  • 26. ROCKSDB RELIABILITY Unique Key Constraints RocksDB API Put() does not check if the same key exists or not. Unlike INSERT in InnoDB, Put() does not return “key already exists” error Call Get() for checking existence Call GetForUpdate() to lock the key MyRocks INSERT wraps with GetForUpdate() and Put(), so it can find unique key violation You have a choice to blindly insert without reading at all (MyRocks REPLACE has an option to do that)
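The difference between a blind Put() and a checked insert can be shown with a toy in-memory store. Everything here is hypothetical (the `ToyStore` class and its methods are not RocksDB APIs); a single lock stands in for the per-key lock that GetForUpdate() takes.

```python
import threading


class ToyStore:
    """In-memory stand-in for a key-value store; not a RocksDB binding."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        """Blind write: silently overwrites, like Put()."""
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def insert_unique(self, key, value):
        """Check and write under one lock, loosely mirroring
        GetForUpdate() followed by Put()."""
        with self._lock:
            if key in self._data:
                raise KeyError(f"duplicate key: {key!r}")
            self._data[key] = value


if __name__ == "__main__":
    store = ToyStore()
    store.put("a", 1)
    store.put("a", 2)            # no error, value replaced
    print(store.get("a"))
    try:
        store.insert_unique("a", 3)
    except KeyError as exc:
        print("rejected:", exc)
```

The checked path costs an extra read per insert, which is the Get cost the earlier LSM vs B+Tree slide attributes to tables with only primary keys.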
  • 27. ROCKSDB RELIABILITY Data consistency
    When you physically copy a RocksDB database elsewhere, make sure you copy all dependent files – SST files, WAL, Manifest, blob files
    • Several online copy solutions – RocksDB backup engine, myrocks_hotbackup, xtrabackup
    By default, RocksDB allows opening a database even if WAL files are missing
    • This may end up opening the database in an inconsistent state
    • This is because the Manifest file does not track WAL files
    Use a stricter option to enforce file integrity
    • Turn track_and_verify_wals_in_manifest on
    ﹘ This tracks WAL files and their sizes
    ﹘ Opening a database with missing WALs is rejected
  • 28. ROCKSDB RELIABILITY Recovery on Database Crash
    RocksDB has a parameter called wal_recovery_mode
    • RocksDB default is 2 (kPointInTimeRecovery)
    • It used to have default 1 (kAbsoluteConsistency)
    • 1 has a side effect: it refuses to open the RocksDB database even when it could be recovered
    An instance crash (including a process crash) may leave the tail WAL file incomplete
    RocksDB refuses to start with param value 1 (kAbsoluteConsistency)
    RocksDB does NOT refuse with param value 2 (kPointInTimeRecovery)
    General recommendation:
    • Use wal_recovery_mode=2 with track_and_verify_wals_in_manifest=ON
    • Rely on replication to recover lost transactions
    2022-03-26T02:21:26.166366-07:00 0 [Note] [MY-000000] [Server] RocksDB: Opening TransactionDB...
    2022-03-26T02:21:28.620095-07:00 0 [ERROR] [MY-000000] [Server] RocksDB: Error opening instance, Status Code: 2, Status: Corruption: truncated record body
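The difference between the two recovery modes can be sketched with a toy write-ahead log whose last record is cut short by a crash. The record layout (length and CRC header followed by the payload) and the mode names are invented for this illustration and are not RocksDB's WAL format or API.

```python
import struct
import zlib


def encode(payload: bytes) -> bytes:
    """Toy record: 4-byte length, 4-byte CRC32, then the payload."""
    return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload


def recover(wal: bytes, mode: str):
    """mode is 'absolute' (fail on any damage) or 'point_in_time'
    (keep the consistent prefix and drop the damaged tail)."""
    records, pos = [], 0
    while pos < len(wal):
        header = wal[pos:pos + 8]
        ok = len(header) == 8
        if ok:
            length, crc = struct.unpack("<II", header)
            payload = wal[pos + 8:pos + 8 + length]
            ok = len(payload) == length and zlib.crc32(payload) == crc
        if not ok:
            if mode == "absolute":
                raise ValueError("Corruption: truncated record body")
            break
        records.append(payload)
        pos += 8 + length
    return records


if __name__ == "__main__":
    wal = encode(b"txn1") + encode(b"txn2") + encode(b"txn3")[:-1]  # torn tail
    print(recover(wal, "point_in_time"))
    try:
        recover(wal, "absolute")
    except ValueError as exc:
        print("refused to open:", exc)
```

In the point-in-time case the torn transaction is simply missing after recovery, which is why the slide pairs that mode with replication to bring back lost transactions.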
  • 29. OTHER TOPICS Dealing with Snapshot Conflicts
    InnoDB natively supports range lock (next key lock / gap lock) by default
    • This was for historical reasons, to work with Statement Based Binary Logging in MySQL
    • It often caused hot row lock contention
    • Range lock is not held with Row Based Binary Logging + Read Committed Isolation Level
    RocksDB (and many other databases including PostgreSQL) does not support range lock
    • Work to support it in RocksDB is ongoing with an external contributor
    PostgreSQL Repeatable Read (and Serializable) returns a “Snapshot Conflict” error on conflicts
    MyRocks uses RocksDB TransactionDB and implements the same behavior
    You can’t eliminate “snapshot conflict” errors with Repeatable Read / Serializable isolation without range lock
    Handling the errors or switching to Read Committed are the typical workarounds
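One way of "handling errors" is to rerun the whole transaction when a snapshot conflict is reported. The sketch below assumes the application can safely retry the transaction from the start; `SnapshotConflict` and `run_with_retry` are hypothetical names, not part of any MySQL or RocksDB client library.

```python
import random
import time


class SnapshotConflict(Exception):
    """Hypothetical stand-in for the conflict error raised by a client."""


def run_with_retry(txn, max_attempts=3, base_delay=0.01):
    """Call txn() again on snapshot conflicts, with a small random backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SnapshotConflict:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * attempt * random.random())


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_txn():
        calls["n"] += 1
        if calls["n"] < 3:
            raise SnapshotConflict("snapshot conflict")
        return "committed"

    print(run_with_retry(flaky_txn), "after", calls["n"], "attempts")
```

Other errors are deliberately not caught, in line with the earlier advice not to suppress errors.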
  • 30. OTHER TOPICS InnoDB to MyRocks/RocksDB migration steps
    InnoDB RR (Repeatable Read) -> InnoDB RC (Read Committed) -> MyRocks RC
    • Evaluate whether there are queries depending on gap lock
    ﹘ Meta-MySQL feature: gap_lock_write_log and gap_lock_raise_error are sysvars to help
    • InnoDB RC to MyRocks RC is straightforward
    InnoDB RR -> MyRocks RR (-> MyRocks RC)
    • Evaluate whether there is a noticeable number of snapshot conflict errors
    ﹘ rocksdb_snapshot_conflict_errors is a status counter that tells how often snapshot conflicts are hit
    ﹘ Users see a ‘Snapshot Conflict’ error message with the ‘DEADLOCK’ error code
    • Flipping from RR to RC eliminates snapshot conflict errors
    ﹘ But it is necessary to verify that RC is safe
  • 31. Summary RocksDB is a modern LSM database library, with years of production deployments at scale Compared to B+Tree, RocksDB (LSM) saves space and offers faster write performance, but pay attention to read performance drops Pay attention to data, index and filter block size and cache miss Utilize compression and compaction tuning options Pay attention to tombstone scanning costs, and utilize several mitigations like Deletion Triggered Compaction Pay attention to write stalls