Massively Sharded MySQL
Evan Elias
Velocity Europe 2011
Tumblr's Size and Growth

                     1 Year Ago         Today
  Impressions        3 Billion/month    15 Billion/month
  Total Posts        1.5 Billion        12.5 Billion
  Total Blogs        9 Million          33 Million
  Developers         3                  17
  Sys Admins         1                  5
  Total Staff (FT)   13                 55
Our databases and dataset

• Machines dedicated to MySQL: over 175
• That's roughly how many production machines we had total a year ago
• Relational data on master databases: over 11 terabytes
• Unique rows: over 25 billion
MySQL Replication 101

• Asynchronous
• Single-threaded SQL execution on slave
• Masters can have multiple slaves
• A slave can only have one master
• Can be hierarchical, but complicates failure-handling
• Keep two standby slaves per pool: one to promote when a master fails, and the other to bring up additional slaves quickly
• Scales reads, not writes
Why Partition?
Reason 1: Write scalability

• No other way to scale writes beyond the limits of one machine
• During peak insert times, you'll likely start hitting lag on slaves before your master shows a concurrency problem
Why Partition?
Reason 2: Data size

• Working set won't fit in RAM
• SSD performance drops as the disk fills up
• Risk of a completely full disk
• Operational difficulties: slow backups, longer to spin up new slaves
• Fault isolation: all of your data in one place = single point of failure affecting all users
Types of Partitioning

• Divide a table
  • Horizontal Partitioning
  • Vertical Partitioning
• Divide a dataset / schema
  • Functional Partitioning
Horizontal Partitioning
Divide a table by relocating sets of rows

• Some support exists internally in MySQL, allowing you to divide a table into several files transparently, but with limitations
• Sharding is the implementation of horizontal partitioning outside of MySQL (at the application or service level). Each partition is a separate table; partitions may be located in different database schemas and/or different instances of MySQL.
Vertical Partitioning
Divide a table by relocating sets of columns

• Not supported internally by MySQL, though you can do it manually by creating separate tables
• Not recommended in most cases: if your data is already normalized, then vertical partitioning introduces unnecessary joins
• If your partitions are on different MySQL instances, then you're doing these "joins" in application code instead of in SQL
Functional Partitioning
Divide a dataset by moving one or more tables

• First eliminate all JOINs across tables in different partitions
• Move tables to new partitions (separate MySQL instances) using selective dumping, followed by replication filters
• Often just a temporary solution: if the table eventually grows too large to fit on a single machine, you'll need to shard it anyway
When to Shard

• Sharding is very complex, so it's best not to shard until it's obvious that you will actually need to!
• Predict when you will hit write scalability issues — determine this on spare hardware
• Predict when you will hit data size issues — calculate based on your growth rate
• Functional partitioning can buy time
Sharding Decisions

• Sharding key — a core column present (or derivable) in most tables
• Sharding scheme — how you will group and home data (ranges vs hash vs lookup table)
• How many shards to start with, or equivalently, how much data per shard
• Shard colocation — do shards coexist within a DB schema, a MySQL instance, or a physical machine?
Sharding Schemes
Determining which shard a row lives on

• Ranges: Easy to implement and trivial to add new shards, but requires frequent and uneven rebalancing due to differences in user behavior.
• Hash or modulus: Apply a function to the sharding key to determine the shard. Simple to implement, and distributes data evenly, but incredibly difficult to add new shards or rebalance existing ones.
• Lookup table: Highest flexibility, but impacts performance and adds a single point of failure. The lookup table may eventually become too large.
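The trade-offs above can be sketched in a few lines. The shard names, ranges, and function names here are hypothetical, not from Tumblr's code:

```python
import bisect

# Range-based: sorted inclusive upper bounds mapped to shards,
# matching the slides' blog-ID ranges (blogs 1-500, 501-1000).
RANGE_BOUNDS = [500, 1000]
RANGE_SHARDS = ["shard_a", "shard_b"]

def shard_by_range(blog_id):
    # The first bound >= blog_id picks the shard; appending a new
    # range is trivial, which is why adding shards is easy here.
    return RANGE_SHARDS[bisect.bisect_left(RANGE_BOUNDS, blog_id)]

def shard_by_modulus(blog_id, n_shards):
    # Even distribution, but changing n_shards remaps most existing
    # keys, which is why rebalancing under this scheme is so painful.
    return f"shard_{blog_id % n_shards}"

# Lookup table: maximal flexibility, but every query needs this extra
# hop, and the table itself becomes a growth and availability concern.
LOOKUP = {42: "shard_a", 43: "shard_b"}
```

Note how the same key lands on a different shard once a modulus-based cluster grows, whereas range boundaries never move for existing rows.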
Application Requirements

• Sharding key must be available for all frequent look-up operations. For example, you can't efficiently look up posts by their own ID anymore; you also need the blog ID to know which shard to hit.
• Support for read-only and offline shards. App code needs to gracefully handle planned maintenance and unexpected failures.
• Support for reading and writing to different MySQL instances for the same shard range — not for scaling reads, but for the rebalancing process
Service Requirements

• ID generation for PKs of sharded tables
• Nice-to-have: a centralized service for handling common needs:
  • Querying multiple shards simultaneously
  • Persistent connections
  • Centralized failure handling
  • Parsing SQL to determine which shard(s) to send a query to
Operational Requirements

• Automation for adding and rebalancing shards, and sufficient monitoring to know when each is necessary
• Nice-to-have: support for multiple MySQL instances per machine — makes cloning and replication setup simpler, and the overhead isn't too bad
How to initially shard a table
Option 1: Transitional migration with legacy DB

• Choose a cutoff ID of the table's PK (not the sharding key) which is slightly higher than its current max ID. Once that cutoff has been reached, all new rows get written to shards instead of legacy.
• Whenever a legacy row is updated by the app, move it to a shard
• A migration script slowly sweeps old rows (at the app level) in the background, moving them to shards, and gradually lowers the cutoff ID
• Reads may need to check both shards and legacy, but based on the ID you can make an informed choice of which to check first
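The read path implied by Option 1 can be sketched with dicts standing in for the legacy DB and the shards; the cutoff value and function name are made up for illustration:

```python
CUTOFF_ID = 1_000_000  # rows with PK >= cutoff are written to shards

def read_post(post_id, legacy, shard):
    """Check the store the ID makes likelier first, then fall back,
    since old rows also migrate to shards over time."""
    first, second = (shard, legacy) if post_id >= CUTOFF_ID else (legacy, shard)
    row = first.get(post_id)
    return row if row is not None else second.get(post_id)
```

As the migration script lowers CUTOFF_ID, a growing share of reads tries the shard first and skips the legacy round-trip entirely.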
How to initially shard a table
Option 2: All at once

1. Dark mode: the app redundantly sends all writes (inserts, updates, deletes) to the legacy database as well as the appropriate shard. All reads still go to the legacy database.
2. Migration: a script reads data from the legacy DB (sweeping by the sharding key) and writes it to the appropriate shard.
3. Finalize: move reads to the shards, and then stop writing data to legacy.
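A minimal sketch of the "dark mode" dual-write step, with dicts standing in for databases; the class and callable are illustrative, not Tumblr's actual code:

```python
class DarkModeWriter:
    """Step 1 of Option 2: every write goes to legacy AND the right
    shard, while all reads stay on legacy until migration finishes."""

    def __init__(self, legacy, shard_for):
        self.legacy = legacy        # legacy store: key -> row
        self.shard_for = shard_for  # callable: sharding key -> shard store

    def write(self, blog_id, post_id, row):
        self.legacy[(blog_id, post_id)] = row              # reads use this
        self.shard_for(blog_id)[(blog_id, post_id)] = row  # dark write

    def read(self, blog_id, post_id):
        return self.legacy.get((blog_id, post_id))         # still legacy-only
```

Because shards receive every new write from day one, the migration script only has to backfill rows older than the moment dark mode was enabled.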
Shard Automation
Tumblr's custom automation software can:

• Crawl the replication topology for all shards
• Manipulate server settings or concurrently execute arbitrary UNIX commands / administrative MySQL queries, on some or all shards
• Copy large files to multiple remote destinations efficiently
• Spin up multiple new slaves simultaneously from a single source
• Import or export arbitrary portions of the dataset
• Split a shard into N new shards
Splitting shards: goals

• Rebalance an overly large shard by dividing it into N new shards, of even or uneven size
• Speed
• No locks
• No application logic
• Divide an 800 GB shard (hundreds of millions of rows) in two in only 5 hours
• Full read and write availability: the shard-splitting process has no impact on live application performance, functionality, or data consistency
Splitting shards: assumptions

• All tables use InnoDB
• All tables have an index that begins with your sharding key, and the sharding scheme is range-based. This plays nice with range queries in MySQL.
• No schema changes in process of a split
• Disk is < 2/3 full, or there's a second disk with sufficient space
• Two standby slaves per shard pool (or more if multiple data centers)
• Uniform MySQL config between masters and slaves: log-slave-updates, unique server-id, generic log-bin and relay-log names, replication user/grants everywhere
• No real slave lag, or already handled in app code
• Redundant rows temporarily on the wrong shard don't matter to the app
Splitting shards: process
Large "parent" shard divided into N "child" shards

1. Create N new slaves in the parent shard pool — these will soon become masters of their own shard pools
2. Reduce the data set on those slaves so that each contains a different subset of the data
3. Move app reads from the parent to the appropriate children
4. Move app writes from the parent to the appropriate children
5. Stop replicating writes from the parent; take the parent pool offline
6. Remove rows that replicated to the wrong child shard
Splitting shards: step 1
Create N new slaves in the parent shard pool; these will soon become masters of their own shard pools
[Diagram: the parent master (blogs 1-1000) handles all app reads and writes, replicating to two standby slaves.]
[Diagram: two additional standby slaves are cloned from the existing standbys, so the parent master (still serving all reads/writes) now replicates to four slaves.]
Aside: copying files efficiently
To a single destination

• Our shard-split process relies on creating new slaves quickly, which involves copying around very large data sets
• For compression we use pigz (parallel gzip), but there are other alternatives like qpress
• Destination box: nc -l [port] | pigz -d | tar xvf -
• Source box: tar vc . | pigz | nc [destination hostname] [port]
Aside: copying files efficiently
To multiple destinations

• Add tee and a FIFO to the mix, and you can create a chained copy to multiple destinations simultaneously
• Each box makes efficient use of CPU, memory, disk, uplink, and downlink
• The performance penalty is only around 3% to 10% — much better than copying serially or copying in parallel from a single source
• http://engineering.tumblr.com/post/7658008285/efficiently-copying-files-to-multiple-destinations
Splitting shards: step 2
Reduce the data set on the new slaves so that each contains a different subset of the data
Importing/exporting in chunks

• Divide the ID range into many chunks, and then execute export queries in parallel. Likewise for import.
• Export via SELECT * FROM [table] WHERE [range clause] INTO OUTFILE [file]
• Use mysqldump to export DROP TABLE / CREATE TABLE statements, and then run those to get a clean slate
• Import via LOAD DATA INFILE
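One way to sketch the chunked export described above, generating one statement per chunk to run on parallel connections; the table, column, and path names are hypothetical:

```python
def chunk_ranges(lo, hi, chunk_size):
    """Yield inclusive (start, end) ID ranges covering [lo, hi]."""
    start = lo
    while start <= hi:
        end = min(start + chunk_size - 1, hi)
        yield start, end
        start = end + 1

def export_statements(table, key, lo, hi, chunk_size, outdir="/tmp/export"):
    """One SELECT ... INTO OUTFILE per chunk; each statement can run
    in parallel against a different connection."""
    for i, (start, end) in enumerate(chunk_ranges(lo, hi, chunk_size)):
        yield (f"SELECT * FROM {table} "
               f"WHERE {key} BETWEEN {start} AND {end} "
               f"INTO OUTFILE '{outdir}/{table}.{i:04d}.tsv'")
```

The matching import side would iterate the same chunk files with LOAD DATA INFILE, throttled to however many concurrent queries your benchmarking says the hardware can take.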
Importing/exporting gotchas

• Disable binary logging before import, to speed it up
• Disable any query-killer scripts, or filter out SELECT ... INTO OUTFILE and LOAD DATA INFILE
• Benchmark to figure out how many concurrent import/export queries can be run on your hardware
• Be careful with disk I/O scheduler choices if using multiple disks with different speeds
[Diagram: the parent master still takes all app reads/writes; all four standby slaves hold blogs 1-1000.]
[Diagram: two of the slaves export different halves of the dataset, drop and recreate their tables, and then import the data, becoming child shard masters for blogs 1-500 and blogs 501-1000.]
[Diagram: every pool needs two standby slaves, so spin them up for the child shard pools (two for blogs 1-500, two for blogs 501-1000).]
Splitting shards: step 3
Move app reads from the parent to the appropriate children
Moving reads and writes separately

• If configuration updates do not reach all of your web/app servers simultaneously, moving reads and writes at the same time would create consistency issues:
  • Web A gets the new config, writes a post for blog 200 to the 1st child shard
  • Web B is still on the old config, renders blog 200, reads from the parent shard, and doesn't find the new post
• Instead, move only reads first; let writes keep replicating from the parent shard
• After the first config update for reads, do a second one moving writes
• Then wait for the parent master's binlog to stop moving before proceeding
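The two-phase cutover above can be modeled as independent read and write pointers per range; the host names and map layout are illustrative:

```python
# State after the first config push: reads moved to the children,
# writes still pointed at the parent (and replicating down).
SHARD_MAP = {
    (1, 500):    {"read": "child1-master", "write": "parent-master"},
    (501, 1000): {"read": "child2-master", "write": "parent-master"},
}

def host_for(blog_id, mode):
    """mode is 'read' or 'write'; the second config push would flip
    the 'write' entries to the child masters as well."""
    for (lo, hi), hosts in SHARD_MAP.items():
        if lo <= blog_id <= hi:
            return hosts[mode]
    raise KeyError(f"no shard covers blog {blog_id}")
```

Keeping the two pointers separate is what makes the Web A / Web B race above harmless: a server on either config version still reads a replica of everything being written.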
[Diagram: the parent master still serves all app reads/writes; the child shard masters (blogs 1-500 and 501-1000) replicate from it, each with two standby slaves.]
[Diagram: the parent master now takes only app writes, which replicate down to the children; reads go directly to the appropriate child shard masters for these ranges.]
Splitting shards: step 4
Move app writes from the parent to the appropriate children
[Diagram: the parent master is no longer in use by the app; reads and writes for each range now go to the appropriate child shard masters.]
Splitting shards: step 5
Stop replicating writes from the parent; take the parent pool offline
[Diagram: the parent master is set globally read-only and its app user is dropped; the children are now complete shards of their own, but with some surplus data that needs to be removed.]
[Diagram: the deprecated parent master and its two leftover standby slaves (blogs 1-1000) are marked to be recycled soon; the two child pools remain, each an R/W shard master with two standby slaves.]
Splitting shards: step 6
Remove rows that replicated to the wrong child shard
Splitting shards: cleanup

• Until writes are moved to the new shard masters, writes to the parent shard replicate to all of its child shards
• Purge this data after the shard-split process is finished
• Avoid huge single DELETE statements — they cause slave lag, among other issues (huge, long transactions are generally bad)
• The data is sparse, so it's efficient to repeatedly do a SELECT (find the next sharding key value to clean) followed by a DELETE
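A sketch of that SELECT-then-DELETE loop over an in-memory stand-in for a child shard; in production each iteration would be two small MySQL queries rather than one big transaction:

```python
def purge_out_of_range(rows, lo, hi):
    """rows: dict keyed by (blog_id, post_id). Repeatedly 'SELECT'
    the next blog_id outside [lo, hi], then 'DELETE' just that blog's
    rows, keeping each delete small to avoid slave lag."""
    while True:
        stray = min((b for b, _ in rows if not lo <= b <= hi), default=None)
        if stray is None:
            return rows
        for key in [k for k in rows if k[0] == stray]:
            del rows[key]
```

Each pass touches one sharding-key value, so replication never falls behind the way it would under a single range-wide DELETE.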
General sharding gotchas

• Sizing shards properly: as small as you can operationally handle
• Choosing ranges: try to keep them even, and it's better if they end in 000s rather than 999s or, worse yet, random ugly numbers. This matters once shard naming is based solely on the ranges.
• Cutover: regularly create new shards to handle future data. Take the max value of your sharding key and add a cushion to it.
  • Before: highest blog ID is 31.8 million; the last shard handles range [30m, ∞)
  • After: that shard now handles [30m, 32m) and a new last shard handles [32m, ∞)
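The cutover arithmetic in that last bullet amounts to: round the max key plus a cushion up to a clean boundary, close the open-ended range there, and open a new last shard. The cushion and step values below are illustrative, not from the talk:

```python
def cutover_boundary(max_key, cushion, step):
    """Smallest multiple of step that is >= max_key + cushion; this
    becomes both the new end of the old last range and the start of
    the new one, keeping boundaries that end in 000s."""
    target = max_key + cushion
    return ((target + step - 1) // step) * step
```

With the slide's numbers (max blog ID 31.8 million) and a hypothetical 100k cushion rounded to the nearest million, the boundary lands on the 32 million the slide shows.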
Questions?

Email: evan -at- tumblr.com