Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets - Percona Live 2016

As data volume grows, finding ways to slow its growth becomes more and more important. We want to maximize the efficiency of our hardware before we spend money on more storage, and one way to do that is compression. These slides discuss compression theory and compression options in MySQL, ending with benchmark data that compares column-level compression in InnoDB with other available compression technologies. Presented at Percona Live 2016.

  1. Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets
     Ernie Souhrada, Database Engineer / Bit Wrangler, Pinterest
     Percona Live Data Performance Conference – 19 April 2016
  2. Agenda
     My god, it’s full of cats!
     • Introductions
     • The Data Explosion
     • Stand Back, I’m Going to Math
     • So Many Options, So Little CPU
     • Don’t Try This At Home
     • Not Your Grandfather’s GZIP
     • Ooh, Shiny Numbers!
     • Q&A
     Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets – Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016
  3. Who Am I, Why Am I Here?
     Turning technical skill into cat food since 1996
     Who am I?
     • Database Engineer at Pinterest (January 2015)
       – One of two people solely responsible for hundreds of TB of MySQL data
       – Also loosely affiliated with HBase and Core SRE teams
     • Previously: Percona, Sun, assorted random small companies
     • Jack of many trades, master of some
     Why am I here?
     • Interested in almost EVERYTHING (not just tech)
     • Mathematician by training; compression is fundamentally a math problem.
  4. The Data Explosion
     Because ‘DELETE’ is a four-letter word.
     “Every two days now we create as much information as we did from the dawn of civilization up to 2003.” – Eric Schmidt, Google [1]
     He said this in 2010.
     • Mostly user-generated content
       – Over 2 million cat videos on YouTube in 2015 [2]
       – Lots of unstructured data, not easily put into relational form
     • Don’t forget the NSA!
       – Although nobody really knows how much data they have…
  5. The Data Explosion
     In 2012, there were 2.1 billion people on the internet. [3]
  6. The Data Explosion
     Two years later, in 2014, that number rose to 2.4 billion. [4]
  7. The Data Explosion
     Drowning in a sea of bits
     Storage costs are stabilizing: $0.02/GB [5]
  8. The Data Explosion
     Drowning in a sea of bits
     But data volume is still increasing!
     • Global IP traffic per year: 1.1 ZB in 2016 (>1 billion GB/month), 2 ZB projected for 2019 [6]
     • Information created: 1.8 ZB in 2011, 2.8 ZB in 2012, 40 ZB projected for 2020 [7]
  9. The Data Explosion
     Mo’ data, mo’ problems.
  10. The Data Explosion
      TRUNCATE is also a four-letter word. (So is DROP…)
      What to do?
      • Delete
        – Some organizations afraid to delete anything
        – Creation velocity still a problem
      • Collect less?
      • Pray to the storage gods?
      • Panic!
      • Spend the money, buy more storage
        – May be inevitable
        – ROI and efficiency still matter
  11. The Data Explosion
      Trading CPU cycles for disk space since 2015
      Compression to the rescue!
      • Well, sort of.
      • Workload matters.
      • Structure of data matters.
      • Decrease velocity of data growth
      • Thank you, Gordon Moore!
  12. The Data Explosion
      Compressed pins are compressed.
      Pinterest, 12 months ago:
      • Lots of data stored as JSON blobs
      • Workload is read-heavy, but not overall QPS-heavy
      • No compression being used
      • i2.4xlarge for DB servers (3TB of disk)
      • Estimated disk space exhaustion around EOQ1 2016
      • More servers?
      • Bigger servers?
      • Panic?
  13. The Data Explosion
      Compressed pins are compressed.
      Pinterest, today:
      • Pin data still stored as JSON blobs
      • i2.4xlarge for DB servers (3TB of disk)
      • Workload profile hasn’t changed much
      • InnoDB page compression being used
      • Approximately 50% space reduction
      • Reduction in data growth velocity
      • Disk space exhaustion estimated Q2 2017
      • Still looking for ways to do more with our existing resources
  14. Stand Back, I’m Going To Math
      Entropy is more than just the heat death of the universe.
      Entropy: A mathematical measure of information or uncertainty.
      • Computed as a function of a probability distribution.
      • Claude Shannon (1948): A Mathematical Theory of Communication
      More formally: Suppose X is a discrete random variable which takes on values from a finite set X. Then the entropy of the random variable X is defined to be:
      $$H(X) = -\sum_{x \in X} P(x) \log_2 P(x)$$
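The entropy definition on the slide above can be tried out numerically. This is a minimal sketch, not part of the talk: the function names `shannon_entropy` and `empirical_entropy` are made up for illustration, and the empirical version simply treats observed symbol frequencies as the probability distribution.

```python
import math
from collections import Counter

def shannon_entropy(probabilities):
    """Return H(X) in bits for an iterable of probabilities that sum to 1."""
    # Terms with P(x) = 0 are skipped: p * log2(p) tends to 0 as p -> 0.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def empirical_entropy(data):
    """Estimate bits per symbol from the observed symbol frequencies in data."""
    counts = Counter(data)
    total = len(data)
    return shannon_entropy(c / total for c in counts.values())

fair_coin = shannon_entropy([0.5, 0.5])    # exactly 1 bit per flip
biased_coin = shannon_entropy([0.9, 0.1])  # roughly 0.469 bits per flip
```

The less uniform the distribution, the lower the entropy, which is why the biased coin carries less than half a bit per flip.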
  15. Stand Back, I’m Going To Math
      Encoding to binary strings for fun and profit
      An encoding is a function that maps elements from the set X to the set of finite binary strings:
      $$f : X \to \{0,1\}^*$$
      Extend this to finite sequences (strings) of elements:
      $$f(x_1 x_2 x_3 \ldots x_k) = f(x_1) \,\|\, f(x_2) \,\|\, f(x_3) \,\|\, \cdots \,\|\, f(x_k)$$
      where $\|$ is the concatenation operator. So, we can really think of the encoding like this:
      $$f : X^* \to \{0,1\}^*$$
      For a given set X, there are infinitely many encodings. Why?
  16. Stand Back, I’m Going To Math
      But not just any encoding will do.
      • Injective
        – Guarantees an unambiguous decoding
      • Prefix-free
        – Allows sequential decoding, no memory required
        – An encoding is prefix-free if there do not exist distinct elements x, y in X and a string S in {0,1}* such that f(x) = f(y) || S
      • Lossless
        – Informally, exactly what it sounds like: given an encoded string E, we can decode it back precisely into the original string S
      • Efficient!
        – Use as few bits as possible to encode each string.
        – How low can we go?
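To make the prefix-free property concrete, here is a small sketch under stated assumptions: codes are represented as plain dicts from symbol to bit string, and the two example codes (`good`, `bad`) are invented for illustration. It shows why a prefix-free code can be decoded left to right without lookahead.

```python
def is_prefix_free(code):
    """True if no codeword is a prefix of another symbol's codeword."""
    items = list(code.items())
    for i, (_, word_i) in enumerate(items):
        for j, (_, word_j) in enumerate(items):
            # f(x) = f(y) || S for distinct x, y means f(y) is a prefix of f(x).
            if i != j and word_i.startswith(word_j):
                return False
    return True

def decode(bits, code):
    """Decode left to right, emitting a symbol as soon as a codeword matches."""
    reverse = {word: symbol for symbol, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

good = {"a": "0", "b": "10", "c": "110", "d": "111"}
bad = {"a": "0", "b": "01", "c": "11"}  # "0" is a prefix of "01"
encoded = "".join(good[s] for s in "abad")
```

With `bad`, the bit string "011" could start with "a" or with "b", so a decoder would need to look ahead; with `good`, the first matching codeword is always the right one.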
  17. Stand Back, I’m Going To Math
      A little theory before some practice.
      One more definition. Suppose that we have a string $x_1 \cdots x_k$ such that each $x_i$ in the string occurs independently according to a specified probability distribution. The probability of any such string (note that the elements of the string do not need to be distinct) is given by:
      $$P(x_1 \cdots x_k) = \prod_{i=1}^{k} P(x_i)$$
      This is just basic probability. Consider a fair coin that gets flipped twice. Possible outcomes are: HH, HT, TH, TT, each with probability 1/2 × 1/2 = 1/4.
  18. Stand Back, I’m Going To Math
      CAT BREAK!
  19. Stand Back, I’m Going To Math
      Efficiency cat likes short strings
      The efficiency of a particular encoding f is defined as the weighted average length of an encoding of an element of X:
      $$\ell(f) = \sum_{x \in X} P(x)\,|f(x)|$$
      where |y| denotes the length of string y.
  20. Stand Back, I’m Going To Math
      Putting it all together
      Source Coding Theorem (informally stated): A string S of N elements of X, drawn according to a probability distribution with entropy H(X), can be compressed into more than N*H(X) bits with negligible risk of data loss as N → ∞, but it cannot be compressed into fewer than N*H(X) bits without virtually guaranteeing data loss.
      $$H(X) \le \ell(f) < H(X) + 1$$
      What does this mean? It provides a bound on encoding efficiency for lossless compression algorithms.
      Proof is left as an exercise to the reader. But you can use Huffman coding to actually find an efficient code that satisfies the above.
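The slide above points at Huffman coding as a way to find a code that meets the bound. The following is a textbook sketch, not taken from the talk: the function name `huffman_code` and the example distribution are assumptions chosen for illustration. It builds a code, then compares the average codeword length ℓ(f) with the entropy H(X).

```python
import heapq
import math

def huffman_code(probs):
    """Build a binary Huffman code for a dict {symbol: probability}."""
    # Heap entries: (probability, tie_breaker, {symbol: partial codeword}).
    # The integer tie-breaker keeps the dicts from ever being compared.
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if counter == 1:
        # Degenerate alphabet: give the only symbol a one-bit codeword.
        return {sym: "0" for sym in probs}
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing 0 and 1.
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in left.items()}
        merged.update({s: "1" + w for s, w in right.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
entropy = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(p * len(code[s]) for s, p in probs.items())
```

For this distribution every probability is a power of 1/2, so the average length equals the entropy exactly (1.75 bits per symbol). For other distributions the average lands somewhere in the one-bit window the inequality allows.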
  21. Stand Back, I’m Going To Math
      Looking at things differently
      It’s not possible to have an average information content of more than one bit per bit of message without losing data.
      On average, English text has roughly one bit of entropy per letter. [8] ASCII text is typically stored at 8 bits per character. It should come as no surprise that English text compresses quite well.
  22. Stand Back, I’m Going To Math
      The last slide on theory, I promise
      We don’t necessarily have to think of individual letters.
      • Bigrams, trigrams
      • Words or tokens (think about SQL keywords or a JSON document)
      Some strings come out smaller when compressed. Some come out larger.
      There’s no universal encoding that works equally well for every set of source strings.
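The claim that some strings shrink and others grow is easy to check with Python's standard `zlib` module. This is an illustrative sketch only; the JSON-like sample and the seeded pseudo-random bytes are invented inputs, not data from the talk.

```python
import random
import zlib

# Highly repetitive input: the same small JSON-like document, 200 times over.
repetitive = b'{"id": 1, "type": "pin", "board": "cats"}' * 200

# Pseudo-random input of the same length (seeded so the run is repeatable).
rng = random.Random(42)
noise = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))

small = zlib.compress(repetitive)  # far smaller than the input
large = zlib.compress(noise)       # slightly larger: header and block overhead

print(len(repetitive), len(small), len(large))
```

The repetitive input has very little entropy per byte, so it collapses; the random input is already close to 8 bits of entropy per byte, so the only thing compression can add is framing overhead.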
  23. So Many Options, So Little CPU!
      Compression sounds great! I want some for my database, too.
      • “Old” compression technology
        – Application layer
        – SQL functions: COMPRESS() / UNCOMPRESS()
        – ARCHIVE storage engine
        – InnoDB page compression
      • “New” compression technology
        – TokuDB
        – MyRocks
        – MySQL 5.7 “punch hole” transparent compression
        – Server-level column compression… what?!
  24. Don’t Try This At Home
      Just because you can do something doesn’t mean you should.
      Application-Level Compression
      The Good:
      • Not limited in choice of algorithm
      • Scales horizontally with app servers
      • Minimizes network traffic
      • Works with any storage engine
      • Fine-grained control over what to compress and what to leave alone
      The Bad:
      • Might require a lot of code retrofit
      • Significant operational overhead in the event of incidents
      • Potentially significant loss of SQL functionality
        – WHERE clauses on compressed data
        – SQL functions
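As a rough picture of what application-level compression looks like, here is a minimal sketch. It assumes the application stores JSON documents in a BLOB column and picks zlib as the algorithm; the helper names `pack` and `unpack` and the sample pin are hypothetical, and no database access is shown.

```python
import json
import zlib

def pack(document):
    """Serialize a dict to compact JSON and compress it before it is sent to the database."""
    raw = json.dumps(document, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, 6)

def unpack(blob):
    """Reverse of pack(): decompress a stored blob and parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Hypothetical document; real pin data will look different.
pin = {"id": 42, "board": "cats", "description": "a cat in a box " * 20}
blob = pack(pin)
```

The trade-off from the slide shows up immediately: the server only ever sees opaque bytes, so anything like a `WHERE` clause on the description would have to be done in the application after `unpack`.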
  25. Don’t Try This At Home
      Unless you’re Batman. Then be Batman.
      When might you consider it?
      • New projects, maybe
      • Existing projects, maybe not
      • The data to be compressed doesn’t need anything more than store/retrieve
      • You’re OK with the output of ‘SHOW PROCESSLIST’ screwing up your terminal
      • Network bandwidth is at a premium but CPU is plentiful (MySQL on Mars?)
  26. Don’t Try This At Home
      You’re not Batman.
      SQL Function Compression (COMPRESS/UNCOMPRESS)
      The Good:
      • Works with any storage engine
      • Fine-grained control over what to compress and what to leave alone
      The Bad:
      • All of the same negatives of application-level compression but without any of the major benefits.
      • Extra load on the MySQL server
      When might you consider it?
      • For any serious project, probably never
  27. Don’t Try This At Home
      Included for the sake of completeness only
      ARCHIVE Storage Engine
      The Good:
      • Convenient
      • Mature
      The Bad:
      • No UPDATE or DELETE
      • SELECT is a table scan
      • Not a usable general-purpose engine
      When might you consider it?
      • Data that never needs to be updated and is rarely accessed
      • Data that can be lost or regenerated in an emergency
  28. Not Your Grandfather’s GZIP
      Honey, I shrunk the database!
      InnoDB Page Compression (pre-5.7)
      The Good:
      • Mature
      • No need to retrofit code
      • Decent compression ratio
      • Reasonably performant for many things
      The Bad:
      • Memory inefficient
      • Not as space-efficient as it could be
      • Not much configurability
      When might you consider it?
      • Read-mostly workloads of low to moderate concurrency
      • For many users, it’s still the only game in town
  29. Not Your Grandfather’s GZIP
      Eh.
      InnoDB Punch-Hole Compression (5.7+)
      The Good:
      • Configurable choice of algorithm
      • No need to retrofit code
      • No more buffer pool inefficiency
      The Bad:
      • Immature
      • Crashed my test server
      • FS fragmentation
      • Doesn’t seem to play well with XFS
      When might you consider it?
      • Maybe 5.8, but that’s just my opinion.
      • Maybe if you’re using FusionIO NVMFS
  30. Not Your Grandfather’s GZIP
      Hole-punching revisited (or, how I learned to stop worrying and love deadlocks)
      InnoDB Punch-Hole Compression (5.7+) continued.
      Lots of this in dmesg:
      [203516.812112] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
      CPUs reporting nontrivial IO wait and nothing else:
      05:54:38 PM  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
      05:54:39 PM  all     0.31     0.00     0.00     6.20     0.00     0.00     0.00     0.00    93.49
      05:54:39 PM    0     1.00     0.00     0.00    13.00     0.00     0.00     0.00     0.00    86.00
      05:54:39 PM    1     1.00     0.00     0.00    12.00     0.00     0.00     0.00     0.00    87.00
      05:54:39 PM    2     0.00     0.00     0.00    12.00     0.00     0.00     0.00     0.00    88.00
      05:54:39 PM    3     0.00     0.00     0.00    10.00     0.00     0.00     0.00     0.00    90.00
      05:54:39 PM    4     1.00     0.00     0.00    12.00     0.00     0.00     0.00     0.00    87.00
      05:54:39 PM    5     0.00     0.00     0.00    13.13     0.00     0.00     0.00     0.00    86.87
      05:54:39 PM    6     3.00     0.00     1.00    11.00     0.00     0.00     0.00     0.00    85.00
      05:54:39 PM    7     0.00     0.00     0.00    14.14     0.00     0.00     0.00     0.00    85.86
      05:54:39 PM    8     1.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00    99.00
      05:54:39 PM    9     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
      05:54:39 PM   10     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
      05:54:39 PM   11     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
      05:54:39 PM   12     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
      05:54:39 PM   13     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
      05:54:39 PM   14     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
      05:54:39 PM   15     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
  31. Not Your Grandfather’s GZIP
      What does Tokutek mean, anyway?
      TokuDB
      The Good:
      • Fully transactional
      • Very good compression ratio
      • Optimized for high write volume
      • Code changes not likely needed
      The Bad:
      • Reads can be slower than InnoDB
      • MySQL’s datadir becomes a mess
      • Some InnoDB constructs unsupported
      • Limited MySQL community knowledge
      When might you consider it?
      • Lower-end storage technology (slow SSD vs. Flash)
      • Data that can benefit from multiple clustering indexes (time series data, perhaps)
      • Dedicated server (no InnoDB)
  32. Not Your Grandfather’s GZIP
      Get your rocks on!
      RocksDB (MyRocks)
      The Good:
      • Fully transactional
      • Good compression ratio
      • Optimized for high write volume
      • Generally very fast
      • Low write amplification
      The Bad:
      • Not GA yet.
      • Currently only available as part of Facebook MySQL 5.6
      • Some InnoDB constructs unsupported
      • Locking behavior different from InnoDB
      When might you consider it?
      • Need high compression ratio
      • Concerned about SSD burnout
      • Once it becomes available separately from FB-MySQL
  33. Not Your Grandfather’s GZIP
      Hey, I didn’t see THAT in the manual
      InnoDB Column Compression
      The Good:
      • Configurable compression dictionary
      • Very good compression ratio possible
      • Excellent performance under load
      • Very memory-efficient
      The Bad:
      • Not yet released to the public (not GA)
      When should you consider it?
      • Storage of a lot of JSON, XML, or other compressible BLOB data
      • After it becomes GA
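The idea behind a compression dictionary can be illustrated with zlib's preset-dictionary feature (the `zdict` argument in Python's `zlib` module). To be clear, this is an analogy and not the InnoDB column compression implementation: the helper names, field names, and sample pins below are all hypothetical.

```python
import json
import zlib

def compress_with_dict(data, dictionary):
    """Compress data with a preset dictionary of byte sequences expected to recur."""
    comp = zlib.compressobj(level=6, zdict=dictionary)
    return comp.compress(data) + comp.flush()

def decompress_with_dict(blob, dictionary):
    """Decompress; the very same dictionary must be supplied again."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()

# Hypothetical "one pin" dictionary: a sample document that carries the field names.
sample_pin = {"id": 1, "board_id": 10, "description": "",
              "link": "https://example.com/",
              "image_url": "https://example.com/1.jpg",
              "created_at": "2016-01-01T00:00:00"}
dictionary = json.dumps(sample_pin).encode("utf-8")

new_pin = {"id": 987654, "board_id": 4321, "description": "cat in a box",
           "link": "https://example.com/cats",
           "image_url": "https://example.com/987654.jpg",
           "created_at": "2016-04-19T09:30:00"}
data = json.dumps(new_pin).encode("utf-8")

plain = zlib.compress(data, 6)
with_dict = compress_with_dict(data, dictionary)
print(len(data), len(plain), len(with_dict))
```

Short documents compress poorly on their own because the compressor has not yet seen the repeated field names; seeding it with a dictionary lets even the first occurrence be encoded as a back-reference.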
  34. Ooh, Shiny Numbers!
      But first… A CAT.
  35. Ooh, Shiny Numbers!
      There are so many of them
      Recall that we’ve already gone from uncompressed to InnoDB page compression
      • Performance is good
      • We think we can do better on disk space efficiency
      However…
      • Not going to engage in massive code rewrite
      • ARCHIVE engine isn’t relevant to us
      • MyRocks isn’t yet in a state where we’d spend significant time on it
      So…
      • Page compression
      • Column compression without dictionary
      • Column compression with dictionary of various sizes
      • TokuDB
      • Punch-hole (or not...)
  36. Ooh, Shiny Numbers!
      Servers, start your engines
      Choose a typical ‘pins’ shard, of which there are thousands. Call it N.
      • Shard N contains about 20GB of raw, uncompressed data
      • InnoDB page compression brings this down to around 10GB
      • Up to 20% fragmentation overhead
      • Run ‘OPTIMIZE TABLE’ and we go down to 8.4GB; this is our starting point
      • Set up several test servers with various compression configurations
      Test servers:
      • Server A: page compressed (the control)
      • Server B: column compression, no dictionary
      • Server C: column compression, one pin dictionary
      • Server D: column compression, four pin dictionary
      • Server E: column compression, eight pin dictionary
      • Server F: column compression, 32K dictionary
      • Server G: TokuDB, default settings
  37. Ooh, Shiny Numbers!
      They don’t lie. And 65% of all statistics are made up.
      Metric                           Server A  Server B  Server C  Server D  Server E  Server F  Server G
      Size (GB)                        8.4       8.2       5.4       5.4       5.4       5.2       3.6
      Dump rate (rows/sec)             52.2K     33.3K     34.3K     32.4K     30.6K     25K       53.5K
      Replication resync, 1 thread     2:40      2:52      2:35      2:57      2:47      3:00      6:36
      Replication resync, 16 threads   0:19      0:19      0:21      0:19      0:19      0:22      1:46
      Read-only QPS, 16 threads        35K       40K-50K   40K-50K   40K-50K   40K-50K   40K-50K   20K
      Read-only P99.9999 latency       10ms      10ms      10ms      10ms      10ms      10ms      40ms
      Read/write QPS, 16 threads       25K-30K   30K-40K   30K-40K   30K-40K   30K-40K   30K-40K   18K
      Read/write P99.9999 latency      30ms      25ms      25ms      25ms      25ms      25ms      40ms
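To put the size row in perspective, the space saved relative to Server A (the page-compressed control) can be worked out directly. This is simple arithmetic on the figures from the table; the variable names are just for illustration.

```python
# On-disk size in GB for each test server, taken from the table above.
sizes_gb = {"A": 8.4, "B": 8.2, "C": 5.4, "D": 5.4, "E": 5.4, "F": 5.2, "G": 3.6}

baseline = sizes_gb["A"]  # page compression, the control

# Percentage of disk space saved relative to the control, to one decimal place.
savings_pct = {
    server: round(100 * (1 - size / baseline), 1)
    for server, size in sizes_gb.items()
}

for server, pct in savings_pct.items():
    print(f"Server {server}: {pct}% smaller than page compression")
```

Column compression without a dictionary saves only about 2%, the pin dictionaries save roughly 36% to 38%, and TokuDB saves about 57% on space alone.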
  38. Ooh, Shiny Numbers!
      Replication resync rate, single thread
  39. Ooh, Shiny Numbers!
      Replication resync rate, 16-thread MTS
  40. A Key to the Graphics Kingdom
      Interpreting the images on the pages to come
      For the graphs on the next several slides:
      • Server A (page compression) is RED
      • Server B (column compression, no dictionary) is LIGHT GREEN
      • Server C (column compression, one pin) is BLUE
      • Server D (column compression, four pins) is LIGHT BLUE
      • Server E (column compression, eight pins) is DARK RED
      • Server F (column compression, 32K of pins) is PURPLE
      • Server G (TokuDB) is GOLD/YELLOW
  41. Ooh, Shiny Numbers
      SELECT: 256, 128, 32, 16, 8, 4, 1 threads (pquery)
  42. Ooh, Shiny Numbers
      p99.9 Read Performance (Log Scale y-axis)
  43. Ooh, Shiny Numbers
      Read performance for ALL the 9s! (p99.9999)
  44. Ooh, Shiny Numbers
      Read/write QPS for 16, 8, 4, 1, 32, 64, 128 threads
  45. Ooh, Shiny Numbers
      P99.9 write performance for the previous graph (log10 scale)
  46. Ooh, Shiny Numbers
      P99.9999 overall performance for the previous QPS (r/w) graph (log10 scale)
  47. Summary Results
      What’d we get out of this?
      • Even with just the simplest predefined dictionary (a single pin, which captures all of the JSON field names), we get dramatically improved space efficiency. With a better dictionary, we can likely do even better, and at our scale, a few percent can be a nontrivial improvement.
      • At low concurrency (running threads <= number of cores), there isn’t too much difference between column compression and page compression when it comes to performance.
      • At higher concurrency (number of running threads > number of cores in the machine), page compression falls over pretty badly on the read-only test. Column compression continues working quite well up to 256 active threads and perhaps even higher.
      • TokuDB wins on compression easily, but otherwise doesn’t do that well for our workload in a default configuration (and with all the other tables on the server still InnoDB).
      • Column compression looks like a serious winner, at least for what we need. I don’t think we’ll be the only ones.
  48. Notes & References
      Credit where credit is due.
      [1] http://techcrunch.com/2010/08/04/schmidt-data/
      [2] http://nymag.com/scienceofus/2015/06/heres-a-study-about-internet-cats.html
      [3] https://www.domo.com/blog/2012/06/how-much-data-is-created-every-minute/
      [4] https://www.domo.com/blog/2014/04/data-never-sleeps-2-0/
      [5] http://www.mkomo.com/cost-per-gigabyte-update
      [6] http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white_paper_c11-481360.html
      [7] http://www.webopedia.com/quick_ref/just-how-much-data-is-out-there.html
      [8] http://people.seas.harvard.edu/~jones/cscie129/papers/stanford_info_paper/entropy_of_english_9.htm
  49. Questions? Answers!
      email: esouhrada@pinterest.com | twitter: @denshikarasu | pinterest engineering blog: https://engineering.pinterest.com
      We are hiring! https://careers.pinterest.com
