Mark Callaghan, Facebook

MySQL versus something else
Evaluating alternative databases

Mark Callaghan
Small Data Engineer
October, 2013

Friday, October 25, 13

What metric is important?
▪

Throughput

▪

Throughput while minimizing response time variance

▪

Efﬁciency - reduce cost while meeting response time goals


My focus is storage efficiency
▪

Use flash to get IOPs

▪

Use spinning disks to get capacity

▪

Use both to reduce cost while improving quality of service
frequent
reads

frequent
writes

read IOPs

write IOPs

flash

yes

yes

yes

maybe

flash

yes

no

yes

no

SATA, /dev/null

no

yes

no

maybe

SATA, /dev/null

no

no

no

no

device


What technology would you choose today?
▪

How do you value flexibility?
▪
▪

Servers you buy today will be in production for a few years

▪

▪

Newer & faster hardware arrives each year
Software can last even longer in production

We have several generations of HW on the small data tiers
▪

Pure-disk (SAS array + HW RAID)

▪

Flashcache (SATA array + HW RAID, flash)

▪

Pure-flash


Common deﬁnitions
▪

Sorted run - rows stored in key order
▪

may be stored using many range-partitioned ﬁles

▪

Memtable - sorted run in memory

▪

L0 - 1 or more sorted runs on disk

▪

L1, L2, ... Lmax - each is 1 sorted run on disk
▪
▪

▪

Lmax is the largest level
by size L1 < L2 ... < Lmax

live% - percentage of live data in the database


Amplification factors
▪

Framework for describing efficiency of database algorithms

▪

How much is done physically in response to a logical change?
▪
▪

Write amplification

▪

▪

Read amplification
Space amplification

Can determine
▪

How many disks or flash you must buy

▪

How long your flash might last

▪

Whether you can buy lower endurance flash


Read ampliﬁcation
▪

Read-amp == disk reads per query
▪
▪

Assume some data is in cache

▪

▪

Separate results for point query versus short range scan
Assume the index is covering for the query

Example: b-tree with all non-leaf levels in cache
▪

Point read-amp - 1 disk read to get the leaf block

▪

Short range read-amp - 1 or 2 disk reads to get the leaf blocks


Read amplification and bloom filters
▪

Bloom filter summary
▪

f(key) -> { no, maybe }

▪

Use ~10 bits/row to get reasonable false positive rate

▪

Great for avoiding disk reads on point queries

▪

Bonus - prevent disk reads for keys that don’t exist

▪

Useless for general range scans like “select x where y < 100”

▪

Can be useful for equality prefix like “select x where q = 10 and y < 100”
▪

▪

use bloom filter on q

Too many bloom filter checks can hurt response time
▪


each sorted run on disk needs a bloom filter check

▪

Write-amp == bytes written per byte changed
▪
▪

▪

Insert 100 bytes with write-amp=5 and 500 bytes will be written
For now ignore penalty from small random writes

Some writes done immediately, others are deferred
▪

Immediate -> redo log

▪

Deferred -> b-tree dirty pages not forced on commit, LSM compaction


Write ampliﬁcation, part 2
▪

HW can increase write-amp
▪
▪

▪

Read live pages and write them elsewhere when cleaning ﬂash blocks
Only a cost for algorithms that do small random writes

Redo log writes can increase write-amp
▪

Writes must be done to a multiple of 512 or larger

▪

Insert 100 byte row, force 512 byte sector for redo has write-amp=5


Why write ampliﬁcation matters
▪

Write endurance for ﬂash device
▪
▪

▪

The wrong algorithm can wear out the device too soon
The right algorithm might let you buy lower cost/endurance device

Write-amp can predict peak performance
▪

If storage can sustain 400 MB/second of writes

▪

And write-amp is 10

▪

Then database can sustain 40 MB/second of changes


Simple request - make counting faster
▪

Some web-scale workloads need to maintain counts
▪
▪

▪

Database is IO-bound
Workload should be write-heavy, counters might not be read

update foo set count = count + 1 where key = ‘bar’
▪

Read-modify-write

▪

Write-only: write delta, merge deltas later when queried/compacted


▪

Space-amp == sizeof(database files) / sizeof(data)
▪
▪

Assume database files are in steady state (fragmented & compacted)

▪

▪

Ignore secondary indexes
Space-amp == 100 / %live

Things that change space amplification
▪

B-tree fragmentation

▪

Old versions of rows that are yet to be collected

▪

Compression

▪

Per row/page metadata (rollback pointer, transaction ID, ...)


Space versus write amplification
▪

Sorry for the confusion
▪
▪

▪

Databases store N blocks in 1 extent
Flash devices store N pages in 1 block

Copy out
▪

Read live data from the cleaned extent, write it elsewhere

▪

Cost is a function of the percentage of live data

▪

Larger live% means less space and more write amplification

▪

Smaller live% means more space and less write amplification


Space versus write amplification
Old flash block assuming all blocks have 25% live pages

75 dead pages

25 live pages

Block cleaning copies 25 pages
New flash block

75 pages ready for new writes

25 copied
pages

Write 100 pages total per 75 new page writes:
* %live is 25%
* write-amp is 100 / (100 - %live) == 100 / 75
* space-amp is 100 / %live == 4


Disclaimer
▪

There are many assumptions in the rest of the slides.
▪

Assumption #1: workloads have no skew.
▪
▪

▪

▪

Most real workloads have skew.
Lets save skew for a much longer discussion

Assumption #2: workload is update-only

I am trying to start a discussion rather than solve everything.
▪

This won’t be confused as a lecture on algorithm analysis.

▪

We might disagree on technology, but we can agree on terminology


Database algorithms
▪

B-tree
▪
▪

▪

Update-in-place (UIP)
Copy-on-write using sequential (COW-S) and random (COW-R) writes

Log structured merge tree (LSM)
▪
▪

▪

LevelDB-style compaction (leveled)
HBase-style compaction (n-ﬁles, size-tiered)

Other
▪

Log-only - Bitcask

▪

Memtable + L1 - Sophia via Sphia.org

▪

Memtable, L0, L1 - MaSM

▪

TokuDB/TokuMX - fractional cascading


B-tree
fixed-page
(fragments)

in-place

write-back

needs
garbage
collection
(block or extent
cleaning)

UIP

yes

yes

single-block

HW GC if flash

InnoDB

COW-R

yes

no

single-block

HW GC if flash

LMDB

COW-S

no

no

multi-block

SW GC

?

algorithm


example

B-tree: UIP and COW-R
▪

When non-leaf levels are in cache
▪

▪

Point read-amp is 1, range read-amp is 1 or 2

When dirty pages are forced after each row change
▪

Write-amp is sizeof(page) / sizeof(row)

▪

More write-amp from torn-page protection

▪

Add +1 for redo log

▪

Include HW write-amp when using ﬂash

▪

Forcing data pages too soon increases write-amp


B-tree: UIP and COW-R, space ampliﬁcation
▪

Fragmentation because b-tree pages are not full on average
▪
▪

▪

After a page split 1 full page becomes 2 half-full pages
With InnoDB we have many indexes with pages that are ~60% full

Fixed page size reduces compression, with InnoDB 2X compression
▪

Default ﬁxed page size is 8kb

▪

Compress 16kb to 6kb, still write out 8kb

▪

It is hard to use a compression window larger than one page

▪

Per-row metadata uses 13+ bytes on InnoDB


B-tree: COW-S
▪

Read amplification is the same as for UIP and COW-R

▪

▪
▪

Has SW write-amp, cost of cleaning previously written extents

▪

▪

Smaller page size from better compression and no fragmentation
No HW write-amp on flash

▪

Compresses better than UIP/COW-R because page size not fixed

▪

Almost no fragmentation

▪

Space-amp from old versions of pages that have yet to be cleaned

▪

More (less) space-amp means less (more) write-amp


LSM with leveled compaction
▪

Implemented by LevelDB and Cassandra

▪

Database is memtable, L0, L1, ..., Lmax

▪

Less read-amp and space-amp, more write-amp

▪

Similar to original LSM design from paper by O’Neil
▪

Difference is the use of many range-partitioned ﬁles per level
▪
▪

▪

Increases write-amp by a small amount
Prevents temporary doubling of Lmax during compaction

Compaction from L1 to L2
▪

reads N bytes from L1

▪

reads 10*N bytes from L2

▪

writes 10*N + N bytes back to L2


memtable

keys: 00..01

keys: 0..99

keys: 11..19

keys: 0..99

keys: 0..99

keys: 90..99

Level 0 (1GB)

Level 1 (1GB)

10X more data

keys:
000..001


keys:
002...003

keys: 90..99

Level 2 (10GB)

▪

Point read amplification
▪

▪

Range read amplification
▪

▪

1 disk read per level and per L0 file, bloom filters don’t help

▪

▪

1 bloom filter check per L0 file and per level for L1->Lmax + 1 disk read

10 per level starting with L2 + 1 for redo + 1 for L0 + ~1 for L1

▪


1.1 assuming 90% of data is on the maximum level

LSM with n-files compaction
▪

Implemented by Hbase, WiredTiger and Cassandra

▪

Database is memtable, L0, L1
▪

Files in L0 have varying sizes

▪

Less write-amp, more read-amp and space-amp

▪

Compaction cost determined by:
▪
▪

▪

#files merged at a time
sizeof(L1) / sizeof(file created by memtable flush)

If memtable is 1 GB, L1 is 64 GB, 2 files are merged at a time
▪

then a row is written to files of size 1, 2, 4, 8, 16, 32 and 64 GB

▪

write-amp is 7


LSM with n-ﬁles compaction, L1=64 GB
memtable

64 GB
1 GB

1 GB

2 GB

2 GB

4 GB

4 GB

L0 ﬁles


8 GB

8 GB

16 GB 16 GB 32 GB 32 GB

L1

LSM with n-files compaction
▪

Point read amplification
▪

▪

Range read amplification
▪

▪

1 bloom filter check per file + 1 disk read
1 disk read per file, bloom filters don’t help with range scans

▪
▪

Trade write for space amplification

▪

▪

Usually much less than leveled compaction
Add 1 for redo

▪


Usually greater than 2

Log-only
▪

Bitcask (part of Riak/Basho) is an example of this

▪

Data is written 1+ times
▪
▪

▪

Write data once to a log
Write again when row is live during log cleaning

Copy data from tail to head of log when out of disk space


Log-only
new data
newest log ﬁle

Log 4

Log 3
live data
Log 2

oldest log ﬁle


Log 1

cleaner

dead data

/dev/null

Log-only
▪

Point read amplification is 1

▪

Range read amplification is one per value in the range

▪

Write and space amplification are related
▪
▪

▪

Write amplification is 100 / (100 - %live)
Space amplification is 100 / %live

When 66% of the data in the logs is live
▪

Space-amp is 1.5

▪

Write-amp is 3


Memtable + L1
▪

I think Sophia (sphia.org) is an example of this

▪

Database is memtable, L1

▪

Do compaction between memtable & L1 when memtable is full

▪

Great when database on disk not too much bigger than RAM


Memtable + L1
memtable
compact

L1

new L1


Memtable + L1
▪

Point read amplification is 1

▪

Range read amplification is 1

▪

▪
▪

▪

The ratio sizeof(database) / sizeof(memtable)
+1 for redo log

Space amplification is 1


Memtable + L0 + L1
▪

MaSM is an example of this

▪

Database is memtable, L0, L1
▪
▪

▪

sizeof(L0) == sizeof(L1)
Looks like ﬁle structures from 2-pass external sort

Tradeoffs
▪

Minimize write-amp

▪

Maximize read-amp


Memtable + L0 + L1
memtable
L0

L0

L0

L0

L0

Merge all on compaction

L1


Memtable + L0 + L1
▪

Point read amplification is 1 disk read + many bloom filter checks

▪

Range read amplification 1 disk read per L0 file + 1

▪

Write amplification is 3
▪

▪

Write to redo log, L0 and L1

Space amplification is 2


TokuDB, TokuMX
▪

Read ampliﬁcation
▪
▪

▪

1 disk read for point queries
1 or 2 disk reads for range read queries

▪
▪

▪

10 per level + 1 for redo
Won’t use as many levels as LevelDB

▪

No internal fragmentation, variable sizes pages written

▪

Similar to LevelDB


Database algorithms
point
read-amp

range
read-amp

write-amp

space-amp

UIP b-tree

1

1 or 2

page/row * HW GC

1.5 to 2

COW-R b-tree

1

1 or 2

page/row * HW GC

1.5 to 2

COW-S b-tree

1

1 or 2

page/row * SW GC

1

LSM leveled

1 + N*bloom

N

10 per level

1.1 X

LSM n-ﬁles

1 + N*bloom

N

can be < 10

can be > 2

log-only

1

N

1 / (1 - %live)

1 / %live

memtable+L1

1

1

database/mem

1

1 + N*bloom

N

3

2

1

2

10 per level

1.1 X

algorithm

memtable+L0+L1
tokudb


Two things to remember
▪

You can trade space/read versus write amplification
▪
▪

▪

Switch database algorithms or tune existing algorithm
Hard to minimize read, write & space amplification

One size doesn’t fit all
▪

The workload I care about has different types of indexes
▪
▪

▪


Some indexes should be optimized for short range scans
Other indexes can be optimized for write amplification

would be nice to support both in one database engine

Thank you
facebook.com/MySQLatFacebook

Mark Callaghan
Small Data Engineer
October, 2013


Mark Callaghan, Facebook

Recommended

Recommended

More Related Content

What's hot

What's hot (8)

Similar to Mark Callaghan, Facebook

Similar to Mark Callaghan, Facebook (20)

More from Ontico

More from Ontico (20)

Recently uploaded

Recently uploaded (20)

Mark Callaghan, Facebook