Mark Callaghan, Facebook
HighLoad++ 2013
Presentation Transcript

  • MySQL versus something else: Evaluating alternative databases
    Mark Callaghan, Small Data Engineer
    October 2013
  • What metric is important?
    ▪ Throughput
    ▪ Throughput while minimizing response time variance
    ▪ Efficiency - reduce cost while meeting response time goals
  • My focus is storage efficiency
    ▪ Use flash to get IOPs
    ▪ Use spinning disks to get capacity
    ▪ Use both to reduce cost while improving quality of service

      device           frequent reads  frequent writes  read IOPs  write IOPs
      flash            yes             yes              yes        maybe
      flash            yes             no               yes        no
      SATA, /dev/null  no              yes              no         maybe
      SATA, /dev/null  no              no               no         no
  • What technology would you choose today?
    ▪ How do you value flexibility?
      ▪ Servers you buy today will be in production for a few years
      ▪ Newer & faster hardware arrives each year
      ▪ Software can last even longer in production
    ▪ We have several generations of HW on the small data tiers
      ▪ Pure-disk (SAS array + HW RAID)
      ▪ Flashcache (SATA array + HW RAID, flash)
      ▪ Pure-flash
  • Common definitions
    ▪ Sorted run - rows stored in key order
      ▪ may be stored using many range-partitioned files
    ▪ Memtable - sorted run in memory
    ▪ L0 - 1 or more sorted runs on disk
    ▪ L1, L2, ..., Lmax - each is 1 sorted run on disk
      ▪ Lmax is the largest level by size
      ▪ L1 < L2 < ... < Lmax
    ▪ live% - percentage of live data in the database
  • Amplification factors
    ▪ Framework for describing the efficiency of database algorithms
      ▪ How much is done physically in response to a logical change?
    ▪ Write amplification
    ▪ Read amplification
    ▪ Space amplification
    ▪ Can determine
      ▪ How many disks or how much flash you must buy
      ▪ How long your flash might last
      ▪ Whether you can buy lower endurance flash
  • Read amplification
    ▪ Read-amp == disk reads per query
      ▪ Assume some data is in cache
      ▪ Assume the index is covering for the query
      ▪ Separate results for point query versus short range scan
    ▪ Example: b-tree with all non-leaf levels in cache
      ▪ Point read-amp - 1 disk read to get the leaf block
      ▪ Short range read-amp - 1 or 2 disk reads to get the leaf blocks
  • Read amplification and bloom filters
    ▪ Bloom filter summary
      ▪ f(key) -> { no, maybe }
      ▪ Use ~10 bits/row to get a reasonable false positive rate
      ▪ Great for avoiding disk reads on point queries
      ▪ Bonus - prevent disk reads for keys that don't exist
    ▪ Useless for general range scans like "select x where y < 100"
    ▪ Can be useful for an equality prefix like "select x where q = 10 and y < 100"
      ▪ use a bloom filter on q
    ▪ Too many bloom filter checks can hurt response time
      ▪ each sorted run on disk needs a bloom filter check
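
    A minimal sketch of the f(key) -> { no, maybe } contract described above, added for
    illustration and not part of the original slides. The BloomFilter class, its parameters
    and the keys are invented for the example; it uses ~10 bits/key and 7 hash functions.

        import hashlib

        class BloomFilter:
            def __init__(self, num_keys, bits_per_key=10, num_hashes=7):
                self.size = max(1, num_keys * bits_per_key)   # total bits, ~10 per key
                self.num_hashes = num_hashes
                self.bits = bytearray((self.size + 7) // 8)

            def _positions(self, key):
                # derive num_hashes bit positions from the key
                for i in range(self.num_hashes):
                    h = hashlib.sha256(f"{i}:{key}".encode()).digest()
                    yield int.from_bytes(h[:8], "big") % self.size

            def add(self, key):
                for pos in self._positions(key):
                    self.bits[pos // 8] |= 1 << (pos % 8)

            def maybe_contains(self, key):
                # False means "no" (skip the disk read); True means "maybe" (must read)
                return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

        bf = BloomFilter(num_keys=1000)
        bf.add("user:42")
        assert bf.maybe_contains("user:42")    # added keys always return "maybe"
        print(bf.maybe_contains("user:9999"))  # usually False, so the disk read is avoided
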
  • Write amplification
    ▪ Write-amp == bytes written per byte changed
      ▪ Insert 100 bytes with write-amp=5 and 500 bytes will be written
      ▪ For now ignore the penalty from small random writes
    ▪ Some writes are done immediately, others are deferred
      ▪ Immediate -> redo log
      ▪ Deferred -> b-tree dirty pages not forced on commit, LSM compaction
  • Write amplification, part 2
    ▪ HW can increase write-amp
      ▪ Read live pages and write them elsewhere when cleaning flash blocks
      ▪ Only a cost for algorithms that do small random writes
    ▪ Redo log writes can increase write-amp
      ▪ Writes must be done in multiples of 512 bytes or larger
      ▪ Inserting a 100 byte row that forces a 512 byte sector write for redo has write-amp=5
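
    The sector-rounding effect above as a worked calculation (my own arithmetic, not from
    the deck); the function name is invented for the example.

        import math

        def redo_write_amp(row_bytes, sector_bytes=512):
            # redo writes are rounded up to whole sectors
            return math.ceil(row_bytes / sector_bytes) * sector_bytes / row_bytes

        print(redo_write_amp(100))   # 5.12 -> roughly the write-amp=5 example above
        print(redo_write_amp(1000))  # 1.024 -> larger (grouped) writes amortize the padding
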
  • Why write amplification matters
    ▪ Write endurance for the flash device
      ▪ The wrong algorithm can wear out the device too soon
      ▪ The right algorithm might let you buy a lower cost/endurance device
    ▪ Write-amp can predict peak performance
      ▪ If storage can sustain 400 MB/second of writes
      ▪ And write-amp is 10
      ▪ Then the database can sustain 40 MB/second of changes
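
    The same peak-performance bound as a one-line helper (illustrative only; the name and
    numbers just restate the slide's example).

        def sustained_change_rate_mb(device_write_mb_per_s, write_amp):
            # logical change rate is bounded by device write throughput / write-amp
            return device_write_mb_per_s / write_amp

        print(sustained_change_rate_mb(400, 10))  # 40 MB/second of changes
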
  • Simple request - make counting faster
    ▪ Some web-scale workloads need to maintain counts
      ▪ Database is IO-bound
      ▪ Workload should be write-heavy, counters might not be read
    ▪ update foo set count = count + 1 where key = ‘bar’
      ▪ Read-modify-write
      ▪ Write-only: write the delta, merge deltas later when queried/compacted
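
    A rough sketch of the write-only alternative, added for illustration and not from the
    slides: append a delta without reading the current value, then fold the deltas together
    when the counter is queried or compacted. The names deltas, increment and read_count are
    invented for the example.

        from collections import defaultdict

        deltas = defaultdict(list)   # stands in for delta records accumulated across sorted runs

        def increment(key):
            deltas[key].append(1)    # write-only: no read-modify-write

        def read_count(key, base=0):
            return base + sum(deltas[key])   # merge the deltas at query/compaction time

        for _ in range(3):
            increment("page:home")
        print(read_count("page:home"))  # 3
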
  • Space amplification
    ▪ Space-amp == sizeof(database files) / sizeof(data)
      ▪ Assume database files are in steady state (fragmented & compacted)
      ▪ Ignore secondary indexes
      ▪ Space-amp == 100 / %live
    ▪ Things that change space amplification
      ▪ B-tree fragmentation
      ▪ Old versions of rows that have yet to be collected
      ▪ Compression
      ▪ Per row/page metadata (rollback pointer, transaction ID, ...)
  • Space versus write amplification
    ▪ Sorry for the confusion
      ▪ Databases store N blocks in 1 extent
      ▪ Flash devices store N pages in 1 block
    ▪ Copy out
      ▪ Read live data from the cleaned extent, write it elsewhere
      ▪ Cost is a function of the percentage of live data
      ▪ Larger live% means less space and more write amplification
      ▪ Smaller live% means more space and less write amplification
  • Space versus write amplification
    [Diagram: an old flash block with 25% live pages - 75 dead pages are dropped, 25 live
     pages are copied to a new block, leaving 75 pages ready for new writes]
    ▪ Assuming all blocks have 25% live pages, 100 pages are written per 75 new page writes
      ▪ %live is 25%
      ▪ write-amp is 100 / (100 - %live) == 100 / 75
      ▪ space-amp is 100 / %live == 4
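
    The slide's arithmetic as a small helper (a sketch that assumes a uniform %live across
    blocks; the function name is invented).

        def cleaning_amplification(live_pct):
            write_amp = 100.0 / (100.0 - live_pct)   # pages written per page of new data
            space_amp = 100.0 / live_pct             # raw capacity per byte of live data
            return write_amp, space_amp

        print(cleaning_amplification(25))  # (1.33, 4.0) -> the 25% live example above
        print(cleaning_amplification(66))  # (~2.9, ~1.5) -> the log-only example later on
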
  • Disclaimer
    ▪ There are many assumptions in the rest of the slides
    ▪ Assumption #1: workloads have no skew
      ▪ Most real workloads have skew
      ▪ Let's save skew for a much longer discussion
    ▪ Assumption #2: workload is update-only
    ▪ I am trying to start a discussion rather than solve everything
      ▪ This won't be confused with a lecture on algorithm analysis
      ▪ We might disagree on technology, but we can agree on terminology
  • Database algorithms
    ▪ B-tree
      ▪ Update-in-place (UIP)
      ▪ Copy-on-write using sequential (COW-S) and random (COW-R) writes
    ▪ Log structured merge tree (LSM)
      ▪ LevelDB-style compaction (leveled)
      ▪ HBase-style compaction (n-files, size-tiered)
    ▪ Other
      ▪ Log-only - Bitcask
      ▪ Memtable + L1 - Sophia via sphia.org
      ▪ Memtable, L0, L1 - MaSM
      ▪ TokuDB/TokuMX - fractional cascading
  • B-tree
    algorithm  fixed-page (fragments)  in-place write-back  needs garbage collection (block or extent cleaning)  example
    UIP        yes                     yes                  single-block HW GC if flash                           InnoDB
    COW-R      yes                     no                   single-block HW GC if flash                           LMDB
    COW-S      no                      no                   multi-block SW GC                                     ?
  • B-tree: UIP and COW-R
    ▪ When non-leaf levels are in cache
      ▪ Point read-amp is 1, range read-amp is 1 or 2
    ▪ When dirty pages are forced after each row change
      ▪ Write-amp is sizeof(page) / sizeof(row)
      ▪ More write-amp from torn-page protection
      ▪ Add +1 for the redo log
      ▪ Include HW write-amp when using flash
      ▪ Forcing data pages too soon increases write-amp
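
    A back-of-the-envelope version of that worst case (a sketch with invented page and row
    sizes; it ignores torn-page protection and HW write-amp, which add more on top).

        def btree_write_amp(page_bytes, row_bytes):
            # one dirty page forced per row change, plus 1 for the redo log
            return page_bytes / row_bytes + 1

        print(btree_write_amp(16384, 128))  # 129 for a 16 KB page and a 128 byte row
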
  • B-tree: UIP and COW-R, space amplification
    ▪ Fragmentation because b-tree pages are not full on average
      ▪ After a page split 1 full page becomes 2 half-full pages
      ▪ With InnoDB we have many indexes with pages that are ~60% full
    ▪ Fixed page size reduces compression, with InnoDB 2X compression
      ▪ Default fixed page size is 8kb
      ▪ Compress 16kb to 6kb, still write out 8kb
      ▪ It is hard to use a compression window larger than one page
    ▪ Per-row metadata uses 13+ bytes on InnoDB
  • B-tree: COW-S
    ▪ Read amplification is the same as for UIP and COW-R
    ▪ Write amplification
      ▪ Has SW write-amp, the cost of cleaning previously written extents
      ▪ Smaller page size from better compression and no fragmentation
      ▪ No HW write-amp on flash
    ▪ Space amplification
      ▪ Compresses better than UIP/COW-R because page size is not fixed
      ▪ Almost no fragmentation
      ▪ Space-amp from old versions of pages that have yet to be cleaned
      ▪ More (less) space-amp means less (more) write-amp
  • LSM with leveled compaction
    ▪ Implemented by LevelDB and Cassandra
    ▪ Database is memtable, L0, L1, ..., Lmax
    ▪ Less read-amp and space-amp, more write-amp
    ▪ Similar to the original LSM design from the paper by O'Neil
      ▪ Difference is the use of many range-partitioned files per level
      ▪ Increases write-amp by a small amount
      ▪ Prevents temporary doubling of Lmax during compaction
    ▪ Compaction from L1 to L2
      ▪ reads N bytes from L1
      ▪ reads 10*N bytes from L2
      ▪ writes 10*N + N bytes back to L2
  • LSM with leveled compaction
    [Diagram: memtable above Level 0 (1 GB), Level 1 (1 GB) and Level 2 (10 GB - 10X more
     data); L0 files each cover keys 0..99 while L1 and L2 are range-partitioned into files
     covering disjoint key ranges such as 00..01, 11..19 and 90..99]
  • LSM with leveled compaction
    ▪ Point read amplification
      ▪ 1 bloom filter check per L0 file and per level for L1->Lmax + 1 disk read
    ▪ Range read amplification
      ▪ 1 disk read per level and per L0 file, bloom filters don't help
    ▪ Write amplification
      ▪ 10 per level starting with L2 + 1 for redo + 1 for L0 + ~1 for L1
    ▪ Space amplification
      ▪ 1.1 assuming 90% of data is on the maximum level
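
    A rough total built from the breakdown above (a sketch that assumes a fanout of 10 and
    no skew; the function name is invented).

        def leveled_write_amp(levels_below_l1, fanout=10):
            # 1 for redo + 1 for the L0 flush + ~1 for L0->L1, then ~fanout per level from L2 down
            return 1 + 1 + 1 + fanout * levels_below_l1

        print(leveled_write_amp(3))  # about 33 when L2, L3 and L4 exist
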
  • LSM with n-files compaction
    ▪ Implemented by HBase, WiredTiger and Cassandra
    ▪ Database is memtable, L0, L1
    ▪ Files in L0 have varying sizes
    ▪ Less write-amp, more read-amp and space-amp
    ▪ Compaction cost determined by:
      ▪ #files merged at a time
      ▪ sizeof(L1) / sizeof(file created by memtable flush)
    ▪ If the memtable is 1 GB, L1 is 64 GB, and 2 files are merged at a time
      ▪ then a row is written to files of size 1, 2, 4, 8, 16, 32 and 64 GB
      ▪ write-amp is 7
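
    The 1 GB / 64 GB example above as a sketch (assumes file sizes double at each merge,
    as in the slide; the function name is invented).

        def nfiles_write_amp(memtable_gb, l1_gb, merge_width=2):
            size, writes = memtable_gb, 1      # first write: the file created by the memtable flush
            while size < l1_gb:
                size *= merge_width            # each merge rewrites the row into a larger file
                writes += 1
            return writes

        print(nfiles_write_amp(1, 64))  # 7 -> files of 1, 2, 4, 8, 16, 32 and 64 GB
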
  • LSM with n-files compaction, L1=64 GB
    [Diagram: the memtable flushes to L0 files of 1, 2, 4, 8, 16 and 32 GB (two of each
     size), which are eventually merged into a 64 GB L1]
  • LSM with n-files compaction
    ▪ Point read amplification
      ▪ 1 bloom filter check per file + 1 disk read
    ▪ Range read amplification
      ▪ 1 disk read per file, bloom filters don't help with range scans
    ▪ Write amplification
      ▪ Usually much less than leveled compaction
      ▪ Add 1 for redo
    ▪ Space amplification
      ▪ Trade write for space amplification
      ▪ Usually greater than 2
  • Log-only
    ▪ Bitcask (part of Riak/Basho) is an example of this
    ▪ Data is written 1+ times
      ▪ Write data once to a log
      ▪ Write again when the row is live during log cleaning
      ▪ Copy data from the tail to the head of the log when out of disk space
  • Log-only
    [Diagram: new data is appended to the newest log file (Log 4); the cleaner reads the
     oldest log file (Log 1), copies live data back to the head of the log and discards
     dead data]
  • Log-only
    ▪ Point read amplification is 1
    ▪ Range read amplification is 1 per value in the range
    ▪ Write and space amplification are related
      ▪ Write amplification is 100 / (100 - %live)
      ▪ Space amplification is 100 / %live
    ▪ When 66% of the data in the logs is live
      ▪ Space-amp is 1.5
      ▪ Write-amp is 3
  • Memtable + L1
    ▪ I think Sophia (sphia.org) is an example of this
    ▪ Database is memtable, L1
    ▪ Do compaction between the memtable & L1 when the memtable is full
    ▪ Great when the database on disk is not too much bigger than RAM
  • Memtable + L1
    [Diagram: the memtable and the existing L1 are compacted together into a new L1]
  • Memtable + L1
    ▪ Point read amplification is 1
    ▪ Range read amplification is 1
    ▪ Write amplification
      ▪ The ratio sizeof(database) / sizeof(memtable)
      ▪ +1 for the redo log
    ▪ Space amplification is 1
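
    The memtable + L1 estimate above as a one-liner (illustrative sizes; the function name
    is invented).

        def memtable_l1_write_amp(db_gb, memtable_gb):
            # each memtable flush rewrites all of L1, plus 1 for the redo log
            return db_gb / memtable_gb + 1

        print(memtable_l1_write_amp(100, 1))  # 101 - so this works best when the database
                                              # on disk is not much bigger than RAM
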
  • Memtable + L0 + L1
    ▪ MaSM is an example of this
    ▪ Database is memtable, L0, L1
      ▪ sizeof(L0) == sizeof(L1)
      ▪ Looks like the file structures from a 2-pass external sort
    ▪ Tradeoffs
      ▪ Minimize write-amp
      ▪ Maximize read-amp
  • Memtable + L0 + L1
    [Diagram: the memtable flushes to L0 files; all L0 files are merged with L1 on compaction]
  • Memtable + L0 + L1
    ▪ Point read amplification is 1 disk read + many bloom filter checks
    ▪ Range read amplification is 1 disk read per L0 file + 1
    ▪ Write amplification is 3
      ▪ Write to the redo log, L0 and L1
    ▪ Space amplification is 2
  • TokuDB, TokuMX
    ▪ Read amplification
      ▪ 1 disk read for point queries
      ▪ 1 or 2 disk reads for range queries
    ▪ Write amplification
      ▪ 10 per level + 1 for redo
      ▪ Won't use as many levels as LevelDB
    ▪ Space amplification
      ▪ No internal fragmentation, variable size pages are written
      ▪ Similar to LevelDB
  • Database algorithms
    algorithm        point read-amp  range read-amp  write-amp         space-amp
    UIP b-tree       1               1 or 2          page/row * HW GC  1.5 to 2
    COW-R b-tree     1               1 or 2          page/row * HW GC  1.5 to 2
    COW-S b-tree     1               1 or 2          page/row * SW GC  1
    LSM leveled      1 + N*bloom     N               10 per level      1.1X
    LSM n-files      1 + N*bloom     N               can be < 10       can be > 2
    log-only         1               N               1 / (1 - %live)   1 / %live
    memtable+L1      1               1               database/mem      1
    memtable+L0+L1   1 + N*bloom     N               3                 2
    tokudb           1               2               10 per level      1.1X
  • Two things to remember
    ▪ You can trade space/read amplification against write amplification
      ▪ Switch database algorithms or tune the existing algorithm
      ▪ It is hard to minimize read, write & space amplification at the same time
    ▪ One size doesn't fit all
      ▪ The workload I care about has different types of indexes
      ▪ Some indexes should be optimized for short range scans
      ▪ Other indexes can be optimized for write amplification
      ▪ It would be nice to support both in one database engine
  • Thank you
    facebook.com/MySQLatFacebook
    Mark Callaghan, Small Data Engineer
    October 2013