Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets - Percona Live 2016

Novel Approaches to MySQL Compression for Modern Data Sets
Less Is More
Ernie Souhrada
Database Engineer / Bit Wrangler, Pinterest
Percona Live Data Performance Conference – 19 April 2016
•  Introductions
•  The Data Explosion
•  Stand Back, I’m Going to Math
•  So Many Options, So Little CPU
•  Don’t Try This At Home
•  Not Your Grandfather’s GZIP
•  Ooh, Shiny Numbers!
•  Q&A
Agenda
Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 
My god, it’s full of cats!
Who am I?
•  Database Engineer at Pinterest (January 2015)
–  One of two people solely responsible for hundreds of TB of MySQL data
–  Also loosely affiliated with HBase and Core SRE teams
•  Previously: Percona, Sun, assorted random small companies
•  Jack of many trades, master of some

Why am I here?
•  Interested in almost EVERYTHING (not just tech)
•  Mathematician by training; compression is fundamentally a math
problem.
Who Am I, Why Am I Here?
Turning technical skill into cat food since 1996
“Every two days now we create as much information as we did from the
dawn of civilization up to 2003.” – Eric Schmidt, Google [1]

He said this in 2010.

•  Mostly user-generated content
–  Over 2 million cat videos on YouTube in 2015 [2]
–  Lots of unstructured data, not easily put into relational form

•  Don’t forget the NSA!
–  Although nobody really knows how much data they have….
The Data Explosion
Because ‘DELETE’ is a four-letter word.
The Data Explosion
In 2012, there were 2.1 billion people on the internet[3]
2012
The Data Explosion
Two years later, that number rose to 2.4 billion[4]
2014
The Data Explosion
Drowning in a sea of bits
Storage costs are stabilizing[5]
$0.02/GB
The Data Explosion
Drowning in a sea of bits
But data volume is still increasing!

2016: 1.1 ZB of global IP traffic per year (>1 billion GB/month)

2019: 2 ZB[6]


2011: 1.8 ZB of information created

2012: 2.8 ZB

2020: 40 ZB[7]
The Data Explosion
Mo’ data, mo’ problems.
TRUNCATE is also a four-letter word. (So is DROP…)
The Data Explosion
What to do?
•  Delete
•  Some organizations afraid to delete anything
•  Creation velocity still a problem
•  Collect less? 
•  Pray to the storage gods?
•  Panic!

•  Spend the money, buy more storage
•  May be inevitable
•  ROI and efficiency still matter
Trading CPU cycles for disk space since 2015
The Data Explosion
Compression to the rescue!
•  Well, sort of.
•  Workload matters.
•  Structure of data matters.


•  Decrease velocity of data growth



•  Thank you, Gordon Moore!
Compressed pins are compressed.
The Data Explosion
Pinterest, 12 months ago:
•  Lots of data stored as JSON blobs
•  Workload is read-heavy, but not overall QPS-heavy
•  No compression being used
•  i2.4xlarge for DB servers (3TB of disk)
•  Estimated disk space exhaustion around EOQ1 2016
•  More servers?
•  Bigger servers?
•  Panic?
Compressed pins are compressed.
The Data Explosion
Pinterest, today:
•  Pin data still stored as JSON blobs
•  i2.4xlarge for DB servers (3TB of disk)
•  Workload profile hasn’t changed much
•  InnoDB page compression being used
•  Approximately 50% space reduction
•  Reduction in data growth velocity
•  Disk space exhaustion estimated Q2 2017
•  Still looking for ways to do more with our
existing resources
Entropy is more than just the heat death of the universe.
Stand Back, I’m Going To Math
Entropy: A mathematical measure of information or uncertainty.
•  Computed as a function of a probability distribution.
•  Claude Shannon (1948): A Mathematical Theory of Communication
More formally:
Suppose X is a discrete random variable which takes on values from a finite set, also denoted X.
Then the entropy of the random variable X is defined to be:
$$H(X) = -\sum_{x \in X} P(x) \log_2 P(x)$$
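As a quick illustration of the entropy definition, here is a minimal Python sketch that evaluates the sum for a distribution given as a list of probabilities. The function name `entropy` and the sample distributions are illustrative assumptions, not part of the talk.

```python
import math

def entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    # Outcomes with P(x) = 0 are skipped; they contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_coin = entropy([0.5, 0.5])      # two equally likely outcomes
biased_coin = entropy([0.9, 0.1])    # more predictable, so lower entropy
four_sided = entropy([0.25] * 4)     # four equally likely outcomes
```

A fair coin carries exactly one bit per flip, while the biased coin carries less than half a bit, which is why predictable data leaves room for compression.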
Encoding to binary strings for fun and profit
Stand Back, I’m Going To Math
An encoding is a function that maps elements from the set X to the set of finite binary strings.
$$f : X \to \{0,1\}^*$$
Extend this to finite sequences (strings) of elements:
$$f(x_1 x_2 x_3 \ldots x_k) = f(x_1) \,\|\, f(x_2) \,\|\, f(x_3) \,\|\, \cdots \,\|\, f(x_k)$$
where || is the concatenation operator.
So, we can really think of the encoding like this:
$$f : X^* \to \{0,1\}^*$$
For a given set X, there are infinitely many encodings. Why?
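The extension from single elements to strings can be sketched in a few lines of Python. The three-symbol alphabet and its codewords below are hypothetical, chosen only to make the concatenation visible.

```python
def encode(symbols, code):
    """Extend a per-symbol code to a sequence by concatenating codewords."""
    return "".join(code[s] for s in symbols)

# Hypothetical code for the alphabet X = {"a", "b", "c"}.
code = {"a": "0", "b": "10", "c": "11"}
encoded = encode("abca", code)   # f(a) || f(b) || f(c) || f(a)
```

One way to see why there are infinitely many encodings: prefixing every codeword with an extra 0 yields another valid encoding, and that step can be repeated indefinitely.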
But not just any encoding will do.
Stand Back, I’m Going To Math
•  Injective
•  Guarantees an unambiguous decoding
•  Prefix-free
•  Allows sequential decoding, no memory required
•  An encoding is prefix-free if there do not exist distinct elements x, y in X and a string S
in {0,1}* such that f(x) = f(y) || S
•  Lossless
•  Informally, exactly what it sounds like – given an encoded string E, we can decode it back
precisely into the original string S
•  Efficient!
•  Use as few bits as possible to encode each string.
•  How low can we go?
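The prefix-free property and the sequential decoding it enables can be checked mechanically. This is a minimal sketch; the helper names `is_prefix_free` and `decode` and both sample codes are assumptions made for illustration.

```python
def is_prefix_free(code):
    """True if no codeword is a prefix of another symbol's codeword."""
    words = list(code.values())
    return not any(
        i != j and w.startswith(v)
        for i, w in enumerate(words)
        for j, v in enumerate(words)
    )

def decode(bits, code):
    """Decode left to right, emitting a symbol as soon as a codeword matches."""
    reverse = {word: symbol for symbol, word in code.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            symbols.append(reverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(symbols)

good = {"a": "0", "b": "10", "c": "11"}
bad = {"a": "0", "b": "01", "c": "11"}   # f(b) = f(a) || "1"
```

With the `good` code the decoder never needs to look ahead or backtrack. With the `bad` code, the bits "01" could be read as the single symbol b or as a followed by the start of another codeword, so greedy decoding is no longer safe.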
A little theory before some practice.
Stand Back, I’m Going To Math
One more definition.
Suppose that we have a string $x_1 \cdots x_k$ such that each $x_i$ in the string occurs independently according to a
specified probability distribution. The probability of any such string (note that the elements of the
string do not need to be distinct) is given by:
$$P(x_1 \cdots x_k) = \prod_{i=1}^{k} P(x_i)$$
This is just basic probability.
Consider a fair coin that gets flipped twice. Possible outcomes are: HH, HT, TH, TT
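Working the coin example through the product formula, with P(H) = P(T) = 1/2 for a fair coin:

```latex
P(\mathrm{HT}) = P(\mathrm{H}) \cdot P(\mathrm{T}) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}
```

The same product gives 1/4 for HH, TH, and TT, and the four probabilities sum to 1.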
CAT BREAK!
Stand Back, I’m Going To Math
Efficiency cat likes short strings
Stand Back, I’m Going To Math
The efficiency of a particular encoding f is
defined as the weighted average length of an
encoding of an element of X.
$$\ell(f) = \sum_{x \in X} P(x)\,|f(x)|$$
Where |y| denotes the length of string y.
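The weighted average length is straightforward to compute. In this sketch the distribution and the code are hypothetical values picked so the arithmetic is easy to follow.

```python
def average_length(probabilities, code):
    """Weighted average codeword length: sum of P(x) * |f(x)| over x in X."""
    return sum(p * len(code[x]) for x, p in probabilities.items())

# Hypothetical distribution and prefix-free code over X = {"a", "b", "c"}.
probabilities = {"a": 0.5, "b": 0.25, "c": 0.25}
code = {"a": "0", "b": "10", "c": "11"}
bits_per_symbol = average_length(probabilities, code)
```

Here the average works out to 0.5·1 + 0.25·2 + 0.25·2 = 1.5 bits per symbol. A fixed-length code for three symbols needs 2 bits per symbol, so giving the most likely symbol the shortest codeword pays off.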
Putting it all together
Stand Back, I’m Going To Math
Source Coding Theorem (informally stated):
A string S of length N, whose elements are drawn from X according to a probability distribution with
entropy H(X), can be compressed into more than N*H(X) bits with negligible risk of data loss as N → ∞,
but it cannot be compressed into fewer than N*H(X) bits without virtually guaranteeing data loss.
For the average codeword length of an optimal prefix-free encoding f:
$$H(X) \le \ell(f) < H(X) + 1$$
What does this mean?
It provides a bound on encoding efficiency for lossless compression algorithms.
Proof is left as an exercise to the reader.
But you can use Huffman coding to actually find an efficient code that satisfies the above.
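To make the Huffman remark concrete, here is a small sketch that builds a Huffman code with Python's `heapq` and checks the bound numerically. The four-symbol distribution is an assumption for illustration, and ties may be broken differently by other implementations.

```python
import heapq
import math

def huffman_code(probabilities):
    """Build a binary Huffman code for a {symbol: probability} mapping."""
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if counter == 1:  # a lone symbol still needs one bit
        return {sym: "0" for sym in probabilities}
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)    # the two least likely subtrees
        p1, _, right = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in left.items()}
        merged.update({s: "1" + w for s, w in right.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probabilities = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
code = huffman_code(probabilities)
entropy = -sum(p * math.log2(p) for p in probabilities.values())
avg_len = sum(p * len(code[s]) for s, p in probabilities.items())
```

For this distribution the entropy is about 1.85 bits and the Huffman code averages 1.9 bits per symbol, which sits inside the interval from H(X) to H(X) + 1.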
Looking at things differently
Stand Back, I’m Going To Math
It’s not possible to have an average information
content of more than one bit per bit of message
without losing data.

On average, English text has roughly one bit of
entropy per letter.[8]

ASCII is an 8-bit encoding. It should come as no
surprise that English text compresses quite well.
The last slide on theory, I promise
Stand Back, I’m Going To Math
We don’t necessarily have to think of individual letters.
-  Bigrams, trigrams
-  Words or tokens (think about SQL keywords or a JSON document)

Some strings come out smaller when compressed. 
Some come out larger.

There’s no universal encoding that works equally-well for every set of source strings.
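The "some come out smaller, some come out larger" point is easy to observe with a general-purpose compressor. This sketch uses Python's `zlib` as a stand-in; the sample sentence and the seeded pseudo-random bytes are assumptions chosen for the demonstration.

```python
import random
import zlib

english = (
    b"Compression works by finding patterns in the data and describing them "
    b"more briefly. English text is full of patterns: common letters, common "
    b"words, and common phrases appear again and again, so a general-purpose "
    b"compressor can describe the text in fewer bytes than the original."
)

# Pseudo-random bytes have no patterns to exploit; fixed seed for repeatability.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(english)))

english_compressed = zlib.compress(english)
noise_compressed = zlib.compress(noise)
```

The English sample shrinks, while the noise grows slightly because the compressed stream still has to carry its own header and checksum. A lossless encoding that shortens some inputs must lengthen others.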
•  “Old” compression technology
•  Application layer
•  SQL functions: COMPRESS() / UNCOMPRESS()
•  ARCHIVE storage engine
•  InnoDB page compression
•  “New” compression technology
•  TokuDB
•  MyRocks
•  MySQL 5.7 “punch hole” transparent compression
•  Server-level column compression… what?!
So Many Options, So Little CPU!
Compression sounds great! I want some for my database, too.
Don’t Try This At Home
Just because you can do something doesn’t mean you should.
Application-Level Compression
The Good:
•  Not limited in choice of algorithm
•  Scales horizontally with app servers
•  Minimizes network traffic
•  Works with any storage engine
•  Fine-grained control over what to
compress and what to leave alone
The Bad:
•  Might require a lot of code retrofit
•  Significant operational overhead in the
event of incidents
•  Potentially-significant loss of SQL
functionality
•  WHERE clauses on compressed data
•  SQL functions
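A minimal sketch of what application-level compression looks like in practice follows. It uses Python's built-in `sqlite3` purely as a self-contained stand-in for MySQL, and the table name, column names, and sample pin are hypothetical.

```python
import json
import sqlite3
import zlib

# sqlite3 stands in for MySQL here only so that the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pins (id INTEGER PRIMARY KEY, data BLOB)")

def store_pin(pin_id, pin):
    """Serialize to JSON and compress in the application before the INSERT."""
    blob = zlib.compress(json.dumps(pin).encode("utf-8"))
    conn.execute("INSERT INTO pins (id, data) VALUES (?, ?)", (pin_id, blob))

def load_pin(pin_id):
    """Fetch the opaque blob and decompress it in the application."""
    row = conn.execute("SELECT data FROM pins WHERE id = ?", (pin_id,)).fetchone()
    (blob,) = row
    return json.loads(zlib.decompress(blob).decode("utf-8"))

pin = {"description": "A very fluffy cat", "link": "http://example.com/cat"}
store_pin(1, pin)
restored = load_pin(1)
```

The database only ever sees compressed bytes, which is why WHERE clauses and SQL functions on the column content stop being useful, and why every code path that touches the column has to go through helpers like these.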
Unless you’re Batman. Then be Batman.
Don’t Try This At Home
When might you consider it?
•  New projects, maybe
•  Existing projects, maybe not
•  The data to be compressed doesn’t need anything more than store/retrieve
•  You’re OK with the output of ‘SHOW PROCESSLIST’ screwing up your terminal
•  Network bandwidth is at a premium but CPU is plentiful (MySQL on Mars?)
Don’t Try This At Home
You’re not Batman.
SQL Function Compression (COMPRESS/UNCOMPRESS)
The Good:
•  Works with any storage engine
•  Fine-grained control over what to
compress and what to leave alone
The Bad:
•  All of the same negatives of
application-level compression but
without any of the major benefits.
•  Extra load on the MySQL server
When might you consider it?
•  For any serious project, probably never
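For readers curious what the server-side functions produce, the sketch below approximates the byte layout that the MySQL manual describes for COMPRESS(): a 4-byte uncompressed length, low byte first, followed by a zlib stream. The Python function names are hypothetical, and the manual's special handling of values ending in a space is deliberately left out, so treat this as an approximation rather than a byte-exact reimplementation.

```python
import struct
import zlib

def mysql_style_compress(data: bytes) -> bytes:
    """Approximate the documented COMPRESS() layout (assumption: no trailing-space case)."""
    if not data:
        return b""  # empty strings are stored as empty strings
    # 4-byte uncompressed length, low byte first, then the zlib stream.
    return struct.pack("<I", len(data)) + zlib.compress(data)

def mysql_style_uncompress(blob: bytes) -> bytes:
    """Reverse of the sketch above: read the length prefix, then inflate."""
    if not blob:
        return b""
    (length,) = struct.unpack("<I", blob[:4])
    data = zlib.decompress(blob[4:])
    if len(data) != length:
        raise ValueError("length prefix does not match payload")
    return data

payload = b'{"description": "A very fluffy cat"}' * 10
blob = mysql_style_compress(payload)
```

The work is the same as in application-level compression, except that it runs on the database server's CPU instead of being spread across app servers.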
Don’t Try This At Home
Included for the sake of completeness only
ARCHIVE Storage Engine
The Good:
•  Convenient
•  Mature
The Bad:
•  No UPDATE or DELETE
•  SELECT is a table scan
•  Not a usable general-purpose engine
When might you consider it?
•  Data that never needs to be updated and is rarely accessed
•  Data that can be lost or regenerated in an emergency
Honey, I shrunk the database!


Not Your Grandfather’s GZIP
InnoDB Page Compression (pre-5.7)
The Good:
•  Mature
•  No need to retrofit code
•  Decent compression ratio
•  Reasonably performant for many things
The Bad:
•  Memory inefficient
•  Not as space-efficient as it could be
•  Not much configurability
When might you consider it?
•  Read-mostly workloads of low to moderate concurrency
•  For many users, it’s still the only game in town
Eh.


Not Your Grandfather’s GZIP
InnoDB Punch-Hole Compression (5.7+)
The Good:
•  Configurable choice of algorithm
•  No need to retrofit code
•  No more buffer pool inefficiency
The Bad:
•  Immature
•  Crashed my test server
•  FS fragmentation
•  Doesn’t seem to play well with XFS
When might you consider it?
•  Maybe 5.8, but that’s just my opinion.
•  Maybe if you’re using FusionIO NVMFS
Hole-punching revisited (or, how I learned to stop worrying and love deadlocks)
Not Your Grandfather’s GZIP
InnoDB Punch-Hole Compression (5.7+) continued.
Lots of this in dmesg:
[203516.812112] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

CPUs reporting nontrivial IO wait and nothing else:
05:54:38 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
05:54:39 PM all 0.31 0.00 0.00 6.20 0.00 0.00 0.00 0.00 93.49
05:54:39 PM 0 1.00 0.00 0.00 13.00 0.00 0.00 0.00 0.00 86.00
05:54:39 PM 1 1.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 87.00
05:54:39 PM 2 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 88.00
05:54:39 PM 3 0.00 0.00 0.00 10.00 0.00 0.00 0.00 0.00 90.00
05:54:39 PM 4 1.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 87.00
05:54:39 PM 5 0.00 0.00 0.00 13.13 0.00 0.00 0.00 0.00 86.87
05:54:39 PM 6 3.00 0.00 1.00 11.00 0.00 0.00 0.00 0.00 85.00
05:54:39 PM 7 0.00 0.00 0.00 14.14 0.00 0.00 0.00 0.00 85.86
05:54:39 PM 8 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
05:54:39 PM 9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:54:39 PM 10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:54:39 PM 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:54:39 PM 12 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:54:39 PM 13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:54:39 PM 14 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:54:39 PM 15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
What does Tokutek mean, anyway?


Not Your Grandfather’s GZIP
TokuDB
The Good:
•  Fully transactional
•  Very good compression ratio
•  Optimized for high write volume
•  Code changes not likely needed
The Bad:
•  Reads can be slower than InnoDB
•  MySQL’s datadir becomes a mess
•  Some InnoDB constructs unsupported
•  Limited MySQL community knowledge
When might you consider it?
•  Lower-end storage technology (slow SSD vs. Flash)
•  Data that can benefit from multiple clustering indexes (time series data, perhaps)
•  Dedicated server (no InnoDB)
Get your rocks on!


Not Your Grandfather’s GZIP
RocksDB (MyRocks)
The Good:
•  Fully transactional
•  Good compression ratio
•  Optimized for high write volume
•  Generally very fast
•  Low write amplification
The Bad:
•  Not GA yet.
•  Currently only available as part of
Facebook MySQL 5.6
•  Some InnoDB constructs unsupported
•  Locking behavior different from InnoDB
When might you consider it?
•  Need high compression ratio
•  Concerned about SSD burnout
•  Becomes available separately from FB-MySQL
Hey, I didn’t see THAT in the manual


Not Your Grandfather’s GZIP
InnoDB Column Compression
The Good:
•  Configurable compression dictionary
•  Very good compression ratio possible
•  Excellent performance under load
•  Very memory-efficient
The Bad:
•  Not yet released to the public (not GA)
When should you consider it?
•  Storage of a lot of JSON, XML, or other compressible BLOB data
•  After it becomes GA
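The effect of a compression dictionary on small JSON documents can be approximated with zlib's preset-dictionary feature. This is only a sketch of the general idea: the talk does not describe how the server implements column compression, and the field names, sample pin, and helper names below are all hypothetical.

```python
import json
import zlib

# Hypothetical "one pin" dictionary: a sample document with the usual field names.
sample_pin = {"id": 1, "board_id": 2, "description": "sample",
              "link": "http://example.com/",
              "image_url": "http://example.com/a.jpg",
              "created_at": "2016-01-01"}
dictionary = json.dumps(sample_pin).encode("utf-8")

def compress(data, zdict=None):
    """Deflate data, optionally priming the compressor with a preset dictionary."""
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(data) + c.flush()

def decompress(blob, zdict=None):
    """Inflate data; the same dictionary must be supplied if one was used."""
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()

pin = json.dumps({"id": 12345, "board_id": 678,
                  "description": "A very fluffy cat",
                  "link": "http://example.com/cat",
                  "image_url": "http://example.com/cat.jpg",
                  "created_at": "2016-04-19"}).encode("utf-8")

plain = compress(pin)
with_dict = compress(pin, dictionary)
```

Because every document repeats the same field names, priming the compressor with them lets even a short value be encoded mostly as references into the dictionary, which a per-value compressor without a dictionary cannot do.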
But first… A CAT.
Ooh, Shiny Numbers!
There are so many of them
Ooh, Shiny Numbers!
Recall that we’ve already gone from uncompressed to InnoDB page compression
•  Performance is good
•  We think we can do better on disk space efficiency

However…
•  Not going to engage in massive code rewrite
•  ARCHIVE engine isn’t relevant to us
•  MyRocks isn’t yet in a state where we’d spend significant time on it

So…
•  Page compression
•  Column compression without dictionary
•  Column compression with dictionary of various sizes
•  TokuDB
•  Punch-hole (or not...)
Servers, start your engines
Ooh, Shiny Numbers!
Choose a typical ‘pins’ shard, of which there are thousands. Call it N.
•  Shard N contains about 20GB of raw, uncompressed data
•  InnoDB page compression brings this down to around 10GB
•  Up to 20% fragmentation overhead
•  Run ‘OPTIMIZE TABLE’ and we go down to 8.4GB – this is our starting point
•  Set up several test servers with various compression configurations
Server A: page compressed – the control
Server B: column compression, no dictionary
Server C: column compression, one pin dictionary
Server D: column compression, four pin dictionary
Server E: column compression, eight pin dictionary
Server F: column compression, 32K dictionary
Server G: TokuDB, default settings
They don’t lie. And 65% of all statistics are made up.
Ooh, Shiny Numbers!
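| Metric | Server A | Server B | Server C | Server D | Server E | Server F | Server G |
|---|---|---|---|---|---|---|---|
| Size (GB) | 8.4 | 8.2 | 5.4 | 5.4 | 5.4 | 5.2 | 3.6 |
| Dump rate (rows/sec) | 52.2K | 33.3K | 34.3K | 32.4K | 30.6K | 25K | 53.5K |
| Replication, 1 thread | 2:40 | 2:52 | 2:35 | 2:57 | 2:47 | 3:00 | 6:36 |
| Replication, 16 threads | 0:19 | 0:19 | 0:21 | 0:19 | 0:19 | 0:22 | 1:46 |
| RO QPS, 16 threads | 35K | 40K-50K | 40K-50K | 40K-50K | 40K-50K | 40K-50K | 20K |
| RO P99.9999 | 10ms | 10ms | 10ms | 10ms | 10ms | 10ms | 40ms |
| RW QPS, 16 threads | 25K-30K | 30K-40K | 30K-40K | 30K-40K | 30K-40K | 30K-40K | 18K |
| RW P99.9999 | 30ms | 25ms | 25ms | 25ms | 25ms | 25ms | 40ms |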
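To put the size row in relative terms, this short sketch computes each server's space saving against Server A, the page-compressed control. The sizes are copied from the results above; the variable names are illustrative.

```python
# Sizes in GB as reported in the results.
sizes_gb = {"A": 8.4, "B": 8.2, "C": 5.4, "D": 5.4, "E": 5.4, "F": 5.2, "G": 3.6}
baseline = sizes_gb["A"]  # page compression is the control

# Percentage saved relative to the control, rounded to one decimal place.
savings_pct = {
    server: round(100 * (1 - size / baseline), 1)
    for server, size in sizes_gb.items()
}
```

Relative to page compression, column compression without a dictionary saves only about 2%, a single-pin dictionary saves roughly 36%, the 32K dictionary roughly 38%, and TokuDB roughly 57%.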
Replication resync rate, single thread
Ooh, Shiny Numbers!
Replication resync rate, 16-thread MTS
Ooh, Shiny Numbers!
Interpreting the images on the pages to come
For the graphs on the next several slides:
•  Server A (page compression) is RED
•  Server B (column compression, no dictionary) is LIGHT GREEN
•  Server C (column compression, one pin) is BLUE
•  Server D (column compression, four pins) is LIGHT BLUE
•  Server E (column compression, eight pins) is DARK RED
•  Server F (column compression, 32K of pins) is PURPLE
•  Server G (TokuDB) is GOLD/YELLOW
A Key to the Graphics Kingdom
SELECT 256, 128, 32, 16, 8, 4, 1 threads(pquery)
Ooh, Shiny Numbers
p99.9 Read Performance (Log Scale y-axis)
Ooh, Shiny Numbers
Read performance for ALL the 9s! (p99.9999)
Ooh, Shiny Numbers
Read/write QPS for 16, 8, 4, 1, 32, 64, 128 threads
Ooh, Shiny Numbers
P99.9 write performance for the previous graph (log10 scale)
Ooh, Shiny Numbers
P99.9999 overall performance for the previous QPS (r/w) graph (log10 scale)
Ooh, Shiny Numbers
What’d we get out of this?
Summary Results
•  Even with just the simplest predefined dictionary (a single pin, which captures all of the JSON
field names), we get dramatically improved space efficiency. With a better dictionary, we can likely
do even better, and at our scale, a few percent can be a nontrivial improvement.
•  At low concurrency (running threads <= number of cores), there isn’t too much difference between
column compression and page compression when it comes to performance.
•  At higher concurrency (number of running threads > number of cores in the machine), page
compression falls over pretty badly on the read-only test. Column compression continues working
quite well up to 256 active threads and perhaps even higher.
•  TokuDB wins on compression easily, but otherwise doesn’t do that well for our workload in a
default configuration (and with all the other tables on the server still InnoDB).
•  Column compression looks like a serious winner, at least for what we need. I don’t think we’ll be
the only ones.
Credit where credit is due.
Notes & References
[1] http://techcrunch.com/2010/08/04/schmidt-data/ 
[2] http://nymag.com/scienceofus/2015/06/heres-a-study-about-internet-cats.html
[3] https://www.domo.com/blog/2012/06/how-much-data-is-created-every-minute/
[4] https://www.domo.com/blog/2014/04/data-never-sleeps-2-0/ 
[5] http://www.mkomo.com/cost-per-gigabyte-update 
[6] http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white_paper_c11-481360.html 
[7] http://www.webopedia.com/quick_ref/just-how-much-data-is-out-there.html 
[8] http://people.seas.harvard.edu/~jones/cscie129/papers/stanford_info_paper/entropy_of_english_9.htm
Questions? Answers!
email: esouhrada@pinterest.com | twitter: @denshikarasu | pinterest engineering blog: https://engineering.pinterest.com
We are hiring! https://careers.pinterest.com
1 of 49

More Related Content

Similar to Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets - Percona Live 2016

2014 pycon-talk2014 pycon-talk
2014 pycon-talkc.titus.brown
3.1K views35 slides

Similar to Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets - Percona Live 2016(20)

Recently uploaded(20)

UiPath Document Understanding_Day 2.pptxUiPath Document Understanding_Day 2.pptx
UiPath Document Understanding_Day 2.pptx
RohitRadhakrishnan8250 views
DU Series - Day 4.pptxDU Series - Day 4.pptx
DU Series - Day 4.pptx
UiPathCommunity73 views
informationinformation
information
khelgishekhar6 views
AI Powered event-driven translation botAI Powered event-driven translation bot
AI Powered event-driven translation bot
Jimmy Dahlqvist15 views
DU_SERIES_Session1.pdfDU_SERIES_Session1.pdf
DU_SERIES_Session1.pdf
RohitRadhakrishnan8773 views
Sustainable MarketingSustainable Marketing
Sustainable Marketing
Theo van der Zee6 views
Pen Testing - Allendevaux.pdfPen Testing - Allendevaux.pdf
Pen Testing - Allendevaux.pdf
SourabhKumar328076 views
informing ideas.docxinforming ideas.docx
informing ideas.docx
MollyBrown8612 views
Existing documentaries (1).docxExisting documentaries (1).docx
Existing documentaries (1).docx
MollyBrown8613 views
 FS Design 2024 V2.pptx FS Design 2024 V2.pptx
FS Design 2024 V2.pptx
paswanlearning7 views
Audience profile.pptxAudience profile.pptx
Audience profile.pptx
MollyBrown8612 views
KHNOG 5: APNIC ServicesKHNOG 5: APNIC Services
KHNOG 5: APNIC Services
APNIC405 views
Serverless cloud architecture patternsServerless cloud architecture patterns
Serverless cloud architecture patterns
Jimmy Dahlqvist15 views

Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets - Percona Live 2016

  • 1. Novel Approaches to MySQL Compression for Modern Data Sets Less Is More Ernie Souhrada Database Engineer / Bit Wrangler, Pinterest Percona Live Data Performance Conference – 19 April 2016 1
  • 2. •  Introductions •  The Data Explosion •  Stand Back, I’m Going to Math •  So Many Options, So Little CPU •  Don’t Try This At Home •  Not Your Grandfather’s GZIP •  Ooh, Shiny Numbers! •  Q&A Agenda 2 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 My god, it’s full of cats!
  • 3. Who am I? •  Database Engineer at Pinterest (January 2015) –  One of two people solely responsible for hundreds of TB of MySQL data –  Also loosely affiliated with HBase and Core SRE teams •  Previously: Percona, Sun, assorted random small companies •  Jack of many trades, master of some Why am I here? •  Interested in almost EVERYTHING (not just tech) •  Mathematician by training; compression is fundamentally a math problem. Who Am I, Why Am I Here? 3 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 Turning technical skill into cat food since 1996
  • 4. “Every two days now we create as much information as we did from the dawn of civilization up to 2003.” – Eric Schmidt, Google [1] He said this in 2010. •  Mostly user-generated content –  Over 2 million cat videos on YouTube in 2015 [2] –  Lots of unstructured data, not easily put into relational form •  Don’t forget the NSA! –  Although nobody really knows how much data they have…. The Data Explosion 4 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 Because ‘DELETE’ is a four-letter word.
  • 5. The Data Explosion 5 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 In 2012, there were 2.1 billion people on the internet[3] 2012
  • 6. The Data Explosion 6 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 Two years later, that number rose to 2.4 billion[4] 2014
  • 7. The Data Explosion 7 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 Drowning in a sea of bits Storage costs are stabilizing[5] $0.02/GB
  • 8. The Data Explosion 8 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 Drowning in a sea of bits But data volume is still increasing! 2016: 1.1 ZB of global IP traffic per year (>1 billion GB/month) 2019: 2 ZB[6] 2011: 1.8 ZB of information created 2012: 2.8 ZB 2020: 40 ZB[7]
  • 9. The Data Explosion 9 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 Mo’ data, mo’ problems.
  • 10. TRUNCATE is also a four-letter word. (So is DROP…) The Data Explosion 10 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 What to do? •  Delete •  Some organizations afraid to delete anything •  Creation velocity still a problem •  Collect less? •  Pray to the storage gods? •  Panic! •  Spend the money, buy more storage •  May be inevitable •  ROI and efficiency still matter
  • 11. Trading CPU cycles for disk space since 2015 The Data Explosion 11 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 Compression to the rescue! •  Well, sort of. •  Workload matters. •  Structure of data matters. •  Decrease velocity of data growth •  Thank you, Gordon Moore!
  • 12. Compressed pins are compressed. The Data Explosion 12 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 Pinterest, 12 months ago: •  Lots of data stored as JSON blobs •  Workload is read-heavy, but not overall QPS-heavy •  No compression being used •  i2.4xlarge for DB servers (3TB of disk) •  Estimated disk space exhaustion around EOQ1 2016 •  More servers? •  Bigger servers? •  Panic?
  • 13. Compressed pins are compressed. The Data Explosion 13 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 Pinterest, today: •  Pin data still stored as JSON blobs •  i2.4xlarge for DB servers (3TB of disk) •  Workload profile hasn’t changed much •  InnoDB page compression being used •  Approximately 50% space reduction •  Reduction in data growth velocity •  Disk space exhaustion estimated Q2 2017 •  Still looking for ways to do more with our existing resources
  • 14. Entropy is more than just the heat death of the universe. Stand Back, I’m Going To Math 14 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 Entropy: A mathematical measure of information or uncertainty. •  Computed as a function of a probability distribution. •  Claude Shannon (1948): A Mathematical Theory of Communication More formally: Suppose X is a discrete random variable which takes on values from a finite set X. Then, then entropy of the random variable X is defined to be: H(X) = − P(x)log x∈X ∑ 2P(x)
  • 15. Encoding to binary strings for fun and profit Stand Back, I’m Going To Math 15 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016 An encoding is a function that maps elements from the set X to the set of finite binary strings. f : X → {0,1}* Extend this to finite sequences (strings) of elements: f (x1x2 x3...xk ) = f (x1)|| f (x2 )|| f (x3)||... || f (xk ) f : X* → {0,1}* where || is the concatenation operator So, we can really think of the encoding like this: For a given set X, there are infinitely many encodings. Why?
  • 16. Stand Back, I’m Going To Math
    But not just any encoding will do.
    •  Injective
       –  Guarantees an unambiguous decoding
    •  Prefix-free
       –  Allows sequential decoding, no memory required
       –  An encoding is prefix-free if there do not exist distinct elements x, y in X and a string S in {0,1}* such that f(x) = f(y) || S
    •  Lossless
       –  Informally, exactly what it sounds like: given an encoded string E, we can decode it back precisely into the original string S
    •  Efficient!
       –  Use as few bits as possible to encode each string.
       –  How low can we go?
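To make the prefix-free property concrete, the sketch below checks a code table for it and decodes a bit string one bit at a time without lookahead. The two code tables are made-up examples, and bit strings are represented as Python strings of "0" and "1" purely for readability.

```python
def is_prefix_free(code):
    """True if no codeword is a prefix of (or equal to) another symbol's codeword."""
    words = list(code.values())
    for i, a in enumerate(words):
        for j, b in enumerate(words):
            if i != j and a.startswith(b):
                return False
    return True

def encode(message, code):
    """Concatenate the codewords of each symbol in the message."""
    return "".join(code[symbol] for symbol in message)

def decode(bits, code):
    """Decode left to right, emitting a symbol as soon as a codeword matches."""
    reverse = {word: symbol for symbol, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

good = {"a": "0", "b": "10", "c": "110", "d": "111"}
bad = {"a": "0", "b": "01", "c": "11"}  # "0" is a prefix of "01"

print(is_prefix_free(good), is_prefix_free(bad))
print(decode(encode("abacad", good), good))
```

The greedy decoder is only safe because the code is prefix-free: the moment the accumulated bits match a codeword, no longer codeword could also match.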
  • 17. Stand Back, I’m Going To Math
    A little theory before some practice.
    One more definition.
    Suppose that we have a string $x_1 \ldots x_k$ such that each $x_i$ in the string occurs according to a specified probability distribution. The probability of any such string (note that the elements of the string do not need to be distinct) is given by:
    $P(x_1 \ldots x_k) = \prod_{i=1}^{k} P(x_i)$
    This is just basic probability. Consider a fair coin that gets flipped twice. Possible outcomes are: HH, HT, TH, TT, each with probability 1/2 × 1/2 = 1/4.
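The coin example can be checked mechanically. This sketch assumes the symbols are drawn independently (which is what the product formula implies) and uses exact fractions so the arithmetic is not blurred by floating point; the helper name is illustrative.

```python
from fractions import Fraction
from itertools import product
from math import prod

def string_probability(string, distribution):
    """Probability of a string whose symbols are drawn independently."""
    return prod(distribution[symbol] for symbol in string)

fair_coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

# All strings of length 2 over the alphabet {H, T}.
outcomes = ["".join(flips) for flips in product("HT", repeat=2)]
probabilities = {s: string_probability(s, fair_coin) for s in outcomes}
print(probabilities)
```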
  • 18. Stand Back, I’m Going To Math
    CAT BREAK!
  • 19. Stand Back, I’m Going To Math
    Efficiency cat likes short strings
    The efficiency of a particular encoding f is defined as the weighted average length of an encoding of an element of X.
    $\ell(f) = \sum_{x \in X} P(x) \, |f(x)|$
    where |y| denotes the length of string y.
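A small worked example of the weighted average length, comparing a fixed-width code with a variable-length one over the same four-symbol alphabet. The distribution and both code tables are invented for illustration.

```python
def average_length(distribution, code):
    """Weighted average codeword length: sum over x of P(x) * |f(x)|."""
    return sum(p * len(code[x]) for x, p in distribution.items())

distribution = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
fixed = {"a": "00", "b": "01", "c": "10", "d": "11"}
variable = {"a": "0", "b": "10", "c": "110", "d": "111"}

print(average_length(distribution, fixed))     # 2.0 bits per symbol
print(average_length(distribution, variable))  # 1.75 bits per symbol
```

Giving the most frequent symbol the shortest codeword is where the saving comes from; the rare symbols pay for it with longer codewords.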
  • 20. Stand Back, I’m Going To Math
    Putting it all together
    Source Coding Theorem (informally stated):
    A string S of length N, whose elements are drawn from X according to a probability distribution with entropy H(X), can be compressed into more than N*H(X) bits with negligible risk of data loss as N → ∞, but it cannot be compressed into fewer than N*H(X) bits without virtually guaranteeing data loss.
    $H(X) \le \ell(f) < H(X) + 1$
    What does this mean? It provides a bound on encoding efficiency for lossless compression algorithms.
    Proof is left as an exercise to the reader. But you can use Huffman coding to actually find an efficient code that satisfies the above.
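Since the slide points at Huffman coding, here is a compact sketch of the textbook algorithm, followed by a check that its average length lands inside the bound. The four-symbol distribution is an arbitrary example, and this is a teaching version rather than anything production-grade.

```python
import heapq
import math
from itertools import count

def huffman_code(distribution):
    """Build a binary Huffman code for a {symbol: probability} mapping."""
    if len(distribution) == 1:
        return {next(iter(distribution)): "0"}
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {s: ""}) for s, p in distribution.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least likely subtrees, prefixing their codewords.
        p0, _, zeros = heapq.heappop(heap)
        p1, _, ones = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in zeros.items()}
        merged.update({s: "1" + w for s, w in ones.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

def entropy(distribution):
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

def average_length(distribution, code):
    return sum(p * len(code[s]) for s, p in distribution.items())

distribution = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
code = huffman_code(distribution)
h = entropy(distribution)
avg = average_length(distribution, code)
print(code)
print(h, avg, h <= avg < h + 1)
```

For this distribution the entropy is a little under 1.85 bits and the Huffman code averages 1.9 bits per symbol, so the code sits just above the lower bound.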
  • 21. Stand Back, I’m Going To Math
    Looking at things differently
    It’s not possible to have an average information content of more than one bit per bit of message without losing data.
    On average, English text has roughly one bit of entropy per letter. [8]
    ASCII is an 8-bit encoding.
    It should come as no surprise that English text compresses quite well.
  • 22. Stand Back, I’m Going To Math
    The last slide on theory, I promise
    We don’t necessarily have to think of individual letters.
       –  Bigrams, trigrams
       –  Words or tokens (think about SQL keywords or a JSON document)
    Some strings come out smaller when compressed. Some come out larger.
    There’s no universal encoding that works equally well for every set of source strings.
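The "some come out larger" point is easy to see with an off-the-shelf compressor. The sketch below feeds zlib two inputs: pseudo-random bytes, which have no structure to exploit, and a repetitive JSON array. The JSON field names and the seed are made up for the demonstration.

```python
import json
import random
import zlib

rng = random.Random(2016)  # fixed seed so the run is repeatable
random_bytes = bytes(rng.getrandbits(8) for _ in range(1000))

# Hypothetical records: the same keys repeated 50 times with small differences.
json_bytes = json.dumps(
    [{"id": i, "type": "pin", "description": "a picture of a cat"} for i in range(50)]
).encode("utf-8")

for label, data in (("random", random_bytes), ("json", json_bytes)):
    compressed = zlib.compress(data)
    print(label, len(data), "->", len(compressed))
```

The random input grows slightly because zlib still has to add its header and checksum, while the JSON shrinks to a small fraction of its original size.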
  • 23. So Many Options, So Little CPU!
    Compression sounds great! I want some for my database, too.
    •  “Old” compression technology
       –  Application layer
       –  SQL functions: COMPRESS() / UNCOMPRESS()
       –  ARCHIVE storage engine
       –  InnoDB page compression
    •  “New” compression technology
       –  TokuDB
       –  MyRocks
       –  MySQL 5.7 “punch hole” transparent compression
       –  Server-level column compression… what?!
  • 24. Don’t Try This At Home
    Just because you can do something doesn’t mean you should.
    Application-Level Compression
    The Good:
    •  Not limited in choice of algorithm
    •  Scales horizontally with app servers
    •  Minimizes network traffic
    •  Works with any storage engine
    •  Fine-grained control over what to compress and what to leave alone
    The Bad:
    •  Might require a lot of code retrofit
    •  Significant operational overhead in the event of incidents
    •  Potentially significant loss of SQL functionality
       –  WHERE clauses on compressed data
       –  SQL functions
  • 25. Don’t Try This At Home
    Unless you’re Batman. Then be Batman.
    When might you consider it?
    •  New projects, maybe
    •  Existing projects, maybe not
    •  The data to be compressed doesn’t need anything more than store/retrieve
    •  You’re OK with the output of ‘SHOW PROCESSLIST’ screwing up your terminal
    •  Network bandwidth is at a premium but CPU is plentiful (MySQL on Mars?)
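For a sense of what application-level compression looks like in practice, here is a minimal sketch of the store/retrieve path: the application compresses a JSON document before writing it and decompresses it after reading. The `pack`/`unpack` names and the sample document are hypothetical, zlib is just one of many algorithms an application could pick, and the actual database calls are left out.

```python
import json
import zlib

def pack(document):
    """Serialize to compact JSON, then compress; the result is what gets stored."""
    text = json.dumps(document, separators=(",", ":"))
    return zlib.compress(text.encode("utf-8"))

def unpack(blob):
    """Reverse of pack(): decompress, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Hypothetical document; the field names are made up for illustration.
pin = {"id": 42, "board_id": 7, "description": "cat " * 50, "link": "http://example.com/cat"}

blob = pack(pin)
print(len(json.dumps(pin)), "bytes of JSON ->", len(blob), "bytes stored")
print(unpack(blob) == pin)
```

The trade-off from the slides shows up immediately: the server only ever sees `blob`, an opaque byte string, so nothing in SQL can filter on or inspect the fields inside it.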
  • 26. Don’t Try This At Home
    You’re not Batman.
    SQL Function Compression (COMPRESS/UNCOMPRESS)
    The Good:
    •  Works with any storage engine
    •  Fine-grained control over what to compress and what to leave alone
    The Bad:
    •  All of the same negatives as application-level compression, but without any of the major benefits.
    •  Extra load on the MySQL server
    When might you consider it?
    •  For any serious project, probably never
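To show what ends up in the column when the SQL functions are used, the sketch below mimics the storage layout the MySQL manual describes for COMPRESS(): a 4-byte little-endian length of the uncompressed string followed by a zlib stream. This is an approximation written from that description, not MySQL's code; the function names are hypothetical, and it ignores the documented edge case where a trailing "." is appended if the result ends in a space.

```python
import struct
import zlib

def compress_like_mysql(data):
    """Approximate the COMPRESS() layout: length prefix + zlib stream."""
    if not data:
        return b""  # empty strings are stored as empty strings
    return struct.pack("<I", len(data)) + zlib.compress(data)

def uncompress_like_mysql(blob):
    """Approximate UNCOMPRESS(): check the length prefix, then inflate."""
    if not blob:
        return b""
    (expected_length,) = struct.unpack("<I", blob[:4])
    data = zlib.decompress(blob[4:])
    if len(data) != expected_length:
        raise ValueError("length prefix does not match decompressed size")
    return data

payload = b'{"description": "a picture of a cat"}' * 20
stored = compress_like_mysql(payload)
print(len(payload), "->", len(stored))
print(uncompress_like_mysql(stored) == payload)
```

The work here is the same as in the application-level case, except that it runs on the database server's CPU, which is the "extra load" the slide complains about.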
  • 27. Don’t Try This At Home
    Included for the sake of completeness only
    ARCHIVE Storage Engine
    The Good:
    •  Convenient
    •  Mature
    The Bad:
    •  No UPDATE or DELETE
    •  SELECT is a table scan
    •  Not a usable general-purpose engine
    When might you consider it?
    •  Data that never needs to be updated and is rarely accessed
    •  Data that can be lost or regenerated in an emergency
  • 28. Not Your Grandfather’s GZIP
    Honey, I shrunk the database!
    InnoDB Page Compression (pre-5.7)
    The Good:
    •  Mature
    •  No need to retrofit code
    •  Decent compression ratio
    •  Reasonably performant for many things
    The Bad:
    •  Memory inefficient
    •  Not as space-efficient as it could be
    •  Not much configurability
    When might you consider it?
    •  Read-mostly workloads of low to moderate concurrency
    •  For many users, it’s still the only game in town
  • 29. Not Your Grandfather’s GZIP
    Eh.
    InnoDB Punch-Hole Compression (5.7+)
    The Good:
    •  Configurable choice of algorithm
    •  No need to retrofit code
    •  No more buffer pool inefficiency
    The Bad:
    •  Immature
    •  Crashed my test server
    •  FS fragmentation
    •  Doesn’t seem to play well with XFS
    When might you consider it?
    •  Maybe 5.8, but that’s just my opinion.
    •  Maybe if you’re using FusionIO NVMFS
  • 30. Not Your Grandfather’s GZIP
    Hole-punching revisited (or, how I learned to stop worrying and love deadlocks)
    InnoDB Punch-Hole Compression (5.7+) continued.
    Lots of this in dmesg:
    [203516.812112] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
    CPUs reporting nontrivial IO wait and nothing else:
    05:54:38 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
    05:54:39 PM  all    0.31    0.00    0.00    6.20    0.00    0.00    0.00    0.00   93.49
    05:54:39 PM    0    1.00    0.00    0.00   13.00    0.00    0.00    0.00    0.00   86.00
    05:54:39 PM    1    1.00    0.00    0.00   12.00    0.00    0.00    0.00    0.00   87.00
    05:54:39 PM    2    0.00    0.00    0.00   12.00    0.00    0.00    0.00    0.00   88.00
    05:54:39 PM    3    0.00    0.00    0.00   10.00    0.00    0.00    0.00    0.00   90.00
    05:54:39 PM    4    1.00    0.00    0.00   12.00    0.00    0.00    0.00    0.00   87.00
    05:54:39 PM    5    0.00    0.00    0.00   13.13    0.00    0.00    0.00    0.00   86.87
    05:54:39 PM    6    3.00    0.00    1.00   11.00    0.00    0.00    0.00    0.00   85.00
    05:54:39 PM    7    0.00    0.00    0.00   14.14    0.00    0.00    0.00    0.00   85.86
    05:54:39 PM    8    1.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.00
    05:54:39 PM    9    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
    05:54:39 PM   10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
    05:54:39 PM   11    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
    05:54:39 PM   12    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
    05:54:39 PM   13    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
    05:54:39 PM   14    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
    05:54:39 PM   15    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
  • 31. Not Your Grandfather’s GZIP
    What does Tokutek mean, anyway?
    TokuDB
    The Good:
    •  Fully transactional
    •  Very good compression ratio
    •  Optimized for high write volume
    •  Code changes not likely needed
    The Bad:
    •  Reads can be slower than InnoDB
    •  MySQL’s datadir becomes a mess
    •  Some InnoDB constructs unsupported
    •  Limited MySQL community knowledge
    When might you consider it?
    •  Lower-end storage technology (slow SSD vs. Flash)
    •  Data that can benefit from multiple clustering indexes (time series data, perhaps)
    •  Dedicated server (no InnoDB)
  • 32. Not Your Grandfather’s GZIP
    Get your rocks on!
    RocksDB (MyRocks)
    The Good:
    •  Fully transactional
    •  Good compression ratio
    •  Optimized for high write volume
    •  Generally very fast
    •  Low write amplification
    The Bad:
    •  Not GA yet.
    •  Currently only available as part of Facebook MySQL 5.6
    •  Some InnoDB constructs unsupported
    •  Locking behavior different from InnoDB
    When might you consider it?
    •  You need a high compression ratio
    •  You’re concerned about SSD burnout
    •  Once it becomes available separately from FB-MySQL
  • 33. Not Your Grandfather’s GZIP
    Hey, I didn’t see THAT in the manual
    InnoDB Column Compression
    The Good:
    •  Configurable compression dictionary
    •  Very good compression ratio possible
    •  Excellent performance under load
    •  Very memory-efficient
    The Bad:
    •  Not yet released to the public (not GA)
    When should you consider it?
    •  Storage of a lot of JSON, XML, or other compressible BLOB data
    •  After it becomes GA
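Why a compression dictionary helps so much with small JSON values can be shown with zlib's preset-dictionary support. This is only a stand-in for the idea: the talk does not say how the column compression feature is implemented, and the sample documents and field names below are invented. The dictionary is one sample document, so its keys and common value fragments are already "known" to the compressor before it sees the real data.

```python
import json
import zlib

def compress_with_dictionary(data, dictionary):
    """Deflate `data` with a preset dictionary primed into the compressor."""
    compressor = zlib.compressobj(level=6, zdict=dictionary)
    return compressor.compress(data) + compressor.flush()

def decompress_with_dictionary(blob, dictionary):
    """Inflate a blob produced with the same preset dictionary."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()

# Hypothetical documents; the field names are assumptions for illustration.
sample_pin = {"id": 1, "board_id": 10, "user_id": 100, "description": "example",
              "link": "http://example.com/a", "image_url": "http://example.com/a.jpg",
              "created_at": "2016-01-01 00:00:00"}
other_pin = {"id": 98765, "board_id": 4321, "user_id": 777, "description": "grumpy cat",
             "link": "http://example.com/cats/17", "image_url": "http://example.com/cats/17.jpg",
             "created_at": "2016-04-19 09:30:00"}

dictionary = json.dumps(sample_pin).encode("utf-8")
data = json.dumps(other_pin).encode("utf-8")

plain = zlib.compress(data)
primed = compress_with_dictionary(data, dictionary)
print(len(data), "raw,", len(plain), "without dictionary,", len(primed), "with dictionary")
print(decompress_with_dictionary(primed, dictionary) == data)
```

A single short document gives a general-purpose compressor very little repetition to work with; the dictionary supplies that repetition up front. The catch is that the reader needs the exact same dictionary to get the data back.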
  • 34. Ooh, Shiny Numbers!
    But first… A CAT.
  • 35. Ooh, Shiny Numbers!
    There are so many of them
    Recall that we’ve already gone from uncompressed to InnoDB page compression
    •  Performance is good
    •  We think we can do better on disk space efficiency
    However…
    •  Not going to engage in massive code rewrite
    •  ARCHIVE engine isn’t relevant to us
    •  MyRocks isn’t yet in a state where we’d spend significant time on it
    So…
    •  Page compression
    •  Column compression without dictionary
    •  Column compression with dictionary of various sizes
    •  TokuDB
    •  Punch-hole (or not...)
  • 36. Ooh, Shiny Numbers!
    Servers, start your engines
    Choose a typical ‘pins’ shard, of which there are thousands. Call it N.
    •  Shard N contains about 20GB of raw, uncompressed data
    •  InnoDB page compression brings this down to around 10GB
    •  Up to 20% fragmentation overhead
    •  Run ‘OPTIMIZE TABLE’ and we go down to 8.4GB; this is our starting point
    •  Set up several test servers with various compression configurations
    Server A: page compressed (the control)
    Server B: column compression, no dictionary
    Server C: column compression, one-pin dictionary
    Server D: column compression, four-pin dictionary
    Server E: column compression, eight-pin dictionary
    Server F: column compression, 32K dictionary
    Server G: TokuDB, default settings
  • 37. Ooh, Shiny Numbers!
    They don’t lie. And 65% of all statistics are made up.
    Metric                  Server A  Server B  Server C  Server D  Server E  Server F  Server G
    Size (GB)               8.4       8.2       5.4       5.4       5.4       5.2       3.6
    Dump rate (rows/sec)    52.2K     33.3K     34.3K     32.4K     30.6K     25K       53.5K
    Replication 1           2:40      2:52      2:35      2:57      2:47      3:00      6:36
    Replication 16          0:19      0:19      0:21      0:19      0:19      0:22      1:46
    RO QPS 16               35K       40K-50K   40K-50K   40K-50K   40K-50K   40K-50K   20K
    P99.9999 (RO)           10ms      10ms      10ms      10ms      10ms      10ms      40ms
    RW QPS 16               25K-30K   30K-40K   30K-40K   30K-40K   30K-40K   30K-40K   18K
    P99.9999 (RW)           30ms      25ms      25ms      25ms      25ms      25ms      40ms
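The size row is easier to compare as a percentage saved relative to the page-compressed control (Server A). The sketch below does only that arithmetic; the sizes are copied from the table, and the variable names are illustrative.

```python
# On-disk size in GB for each test server, taken from the results table.
sizes_gb = {"A": 8.4, "B": 8.2, "C": 5.4, "D": 5.4, "E": 5.4, "F": 5.2, "G": 3.6}

baseline = sizes_gb["A"]  # page compression is the control
savings_pct = {
    server: round(100 * (1 - size / baseline), 1)
    for server, size in sizes_gb.items()
}

for server, pct in savings_pct.items():
    print(f"Server {server}: {sizes_gb[server]} GB, {pct}% smaller than Server A")
```

Read this way, column compression without a dictionary saves only about 2% over page compression, a one-pin dictionary saves roughly 36%, and TokuDB saves roughly 57%.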
  • 38. Ooh, Shiny Numbers!
    Replication resync rate, single thread
  • 39. Ooh, Shiny Numbers!
    Replication resync rate, 16-thread MTS
  • 40. A Key to the Graphics Kingdom
    Interpreting the images on the pages to come
    For the graphs on the next several slides:
    •  Server A (page compression) is RED
    •  Server B (column compression, no dictionary) is LIGHT GREEN
    •  Server C (column compression, one pin) is BLUE
    •  Server D (column compression, four pins) is LIGHT BLUE
    •  Server E (column compression, eight pins) is DARK RED
    •  Server F (column compression, 32K of pins) is PURPLE
    •  Server G (TokuDB) is GOLD/YELLOW
  • 41. Ooh, Shiny Numbers
    SELECT 256, 128, 32, 16, 8, 4, 1 threads (pquery)
  • 42. Ooh, Shiny Numbers
    p99.9 Read Performance (Log Scale y-axis)
  • 43. Ooh, Shiny Numbers
    Read performance for ALL the 9s! (p99.9999)
  • 44. Ooh, Shiny Numbers
    Read/write QPS for 16, 8, 4, 1, 32, 64, 128 threads
  • 45. Ooh, Shiny Numbers
    P99.9 write performance for the previous graph (log10 scale)
  • 46. Ooh, Shiny Numbers
    P99.9999 overall performance for the previous QPS (r/w) graph (log10 scale)
  • 47. Summary Results
    What’d we get out of this?
    •  Even with just the simplest predefined dictionary (a single pin, which captures all of the JSON field names), we get dramatically improved space efficiency. With a better dictionary, we can likely do even better, and at our scale, a few percent can be a nontrivial improvement.
    •  At low concurrency (running threads <= number of cores), there isn’t much difference in performance between column compression and page compression.
    •  At higher concurrency (number of running threads > number of cores in the machine), page compression falls over pretty badly on the read-only test. Column compression continues working quite well up to 256 active threads and perhaps even higher.
    •  TokuDB wins on compression easily, but otherwise doesn’t do that well for our workload in a default configuration (and with all the other tables on the server still InnoDB).
    •  Column compression looks like a serious winner, at least for what we need. I don’t think we’ll be the only ones.
  • 48. Notes & References
    Credit where credit is due.
    [1] http://techcrunch.com/2010/08/04/schmidt-data/
    [2] http://nymag.com/scienceofus/2015/06/heres-a-study-about-internet-cats.html
    [3] https://www.domo.com/blog/2012/06/how-much-data-is-created-every-minute/
    [4] https://www.domo.com/blog/2014/04/data-never-sleeps-2-0/
    [5] http://www.mkomo.com/cost-per-gigabyte-update
    [6] http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white_paper_c11-481360.html
    [7] http://www.webopedia.com/quick_ref/just-how-much-data-is-out-there.html
    [8] http://people.seas.harvard.edu/~jones/cscie129/papers/stanford_info_paper/entropy_of_english_9.htm
  • 49. Questions? Answers!
    email: esouhrada@pinterest.com
    twitter: @denshikarasu
    pinterest engineering blog: https://engineering.pinterest.com
    We are hiring! https://careers.pinterest.com