Methods of Sharding MySQL


Published in: Technology


  • 1. Methods of Sharding MySQL (Percona Live NYC 2012) – Who are Palomino?
    Bespoke Services: we work with and like you.
    Production Experienced: senior DBAs, admins, and engineers.
    24x7: globally-distributed on-call staff.
    Short-term no-lock-in contracts.
    Professional Services (DevOps):
    ➢ Chef,
    ➢ Puppet,
    ➢ Ansible.
    Big Data Cluster Administration (OpsDev):
    ➢ MySQL, PostgreSQL,
    ➢ Cassandra, HBase,
    ➢ MongoDB, Couchbase.
  • 2. Methods of Sharding MySQL – Who am I?
    Tim Ellis
    CTO/Principal Architect, Palomino
    Achievements:
    ➢ Palomino Big Data Strategy.
    ➢ Datawarehouse Cluster at Riot Games.
    ➢ Back-end Storage Architecture for Firefox Sync.
    ➢ Led DB teams at Digg for four years.
    ➢ Harassed the Reddit team at one of their parties.
    Ensured Successful Business for:
    ➢ Digg, Friendster,
    ➢ Riot Games,
    ➢ Mozilla,
    ➢ StumbleUpon.
  • 3. Methods of Sharding MySQL – What is this Talk?
    Large cluster admin: when one DB isn't enough.
    ➢ What is a shard?
    ➢ What shard types can I choose?
    ➢ How to build a large DB cluster.
    ➢ How to administer that giant mess of DBs.
    Types of large clusters:
    ➢ Just a bunch of databases.
    ➢ Distributed database across machines.
  • 4. Methods of Sharding MySQL – Where the Focus will Lie
    12% – Sharding theory/considerations.
    25% – Building a Cluster to administer (tutorial):
    ➢ Palomino Cluster Tool.
    50% – Flexible large-cluster administration (tutorial):
    ➢ Tumblr's Jetpants.
    13% – Other sharding technologies (talk-only):
    ➢ Youtube's Vtocc (Vitess),
    ➢ Twitter's Gizzard,
    ➢ HAproxy.
  • 5. Methods of Sharding MySQL – What about the Silver Bullets?
    NoSQL Distributed Databases:
    ➢ Promise “sharding” for free,
    ➢ Uptime and horizontal scaling trivially.
    Reality:
    ➢ RDBMS is 40-yr-old tech,
    ➢ NoSQL is 10-yr-old tech,
    ➢ Which is responsible for how many high-profile downtimes in the past 10 years?
    ➢ Evaluate the alternatives without illusions.
  • 6. Methods of Sharding MySQL – What is a Shard?
    A location for a subset of data:
    ➢ Itself made of pieces.
    ➢ Typically itself redundant.
    [Diagram: Shard for User Data, Shard for Logging Data, Shard for Posts Data; three Masters and nine Slaves in total.]
  • 7. Methods of Sharding MySQL – What are the Sharding Method Choices?
    By-Function:
    ➢ Move busy tables onto new shard.
    ➢ Writes of busiest tables on new hardware.
    ➢ Writes of remaining tables on current.
    By-Columns:
    ➢ Split table into chunks of related columns, store each set on its own Master/Slaves shard.
    By-Rows:
    ➢ A table is split into N shards; each shard gets a subset of the rows of the table.
  • 8. Methods of Sharding MySQL – Shard Method Choices
    By-Function and By-Column Methods:
    ➢ Much easier.
    ➢ Can get you through months to years.
    ➢ Eventually you run out of options here.
    By-Row Method:
    ➢ The hardest to do.
    ➢ Requires new ways of accessing data.
    ➢ Often requires sophisticated cache strategies.
    ➢ Itself can be done several ways.
  • 9. Methods of Sharding MySQL – By-Function Sharding
    Picking a Functional Split:
    ➢ A subset of tables commonly joined.
    ➢ Tables outside this subset nearly never joined.
    ➢ One of them responsible for many writes.
    Every table outside the subset requires rewriting JOINs into code-based multi-SELECTs.
    Once the subset of tables is moved onto its own server, writes are distributed.
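The JOIN-to-multi-SELECT rewrite mentioned on this slide can be illustrated with a small sketch. It is not from the talk: two in-memory SQLite databases stand in for two MySQL shards, and the `users` and `posts` tables and the `posts_with_authors` helper are hypothetical names chosen for the example.

```python
import sqlite3

# Assumption: two in-memory SQLite databases stand in for two MySQL shards.
users_db = sqlite3.connect(":memory:")  # shard that keeps the "users" table
posts_db = sqlite3.connect(":memory:")  # new shard for the busy "posts" table

users_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
users_db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
posts_db.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)"
)
posts_db.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(10, 1, "hello"), (11, 2, "world"), (12, 1, "again")],
)


def posts_with_authors():
    """Replace `posts JOIN users` with two SELECTs that are joined in code."""
    posts = posts_db.execute(
        "SELECT id, user_id, title FROM posts ORDER BY id"
    ).fetchall()
    user_ids = sorted({user_id for _, user_id, _ in posts})
    if not user_ids:
        return []
    # Second SELECT goes to the other shard, restricted to the ids we need.
    placeholders = ",".join("?" for _ in user_ids)
    rows = users_db.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
    ).fetchall()
    names = dict(rows)
    return [(post_id, title, names.get(uid)) for post_id, uid, title in posts]
```

The application pays one extra round trip per former JOIN, which is why the slide suggests choosing a subset of tables that is rarely joined with the rest.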
  • 10. Methods of Sharding MySQL – By-Column Sharding (Vertical Partition)
    Identifying candidate table:
    ➢ Many columns (“users” anyone?),
    ➢ Many updates,
    ➢ Many indexes.
    Required: even split of columns/indexes by update frequency. Attempt: logical grouping.
    JOINs are neither possible nor desirable: write multi-SELECT code in the application DAL.
  • 11. Methods of Sharding MySQL – Row-based Sharding Choices
    Range-based Sharding:
    ➢ Easy to understand.
    ➢ Each shard gets a range of rows.
    ➢ Oft-times some shards are “hot.”
    ➢ Hot shards are split into separate shards.
    ➢ Cold shards are joined into a single shard.
    ➢ Juggling shard load is a frequent process.
    Typically the best solution. Shortcomings have known work-arounds.
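Range-based routing and the splitting of a hot range can be sketched in a few lines. This is an illustration only, not code from the talk or from any tool it covers; the shard names, range boundaries, and both function names are hypothetical.

```python
import bisect

# Hypothetical shard map: each entry is (first_id_in_range, shard_name).
# A range runs from its first id up to one less than the next entry's first id.
RANGES = [(1, "shard-a"), (300001, "shard-b"), (600001, "shard-c")]


def shard_for(row_id, ranges=RANGES):
    """Find the shard whose range contains row_id."""
    starts = [start for start, _ in ranges]
    idx = bisect.bisect_right(starts, row_id) - 1
    if idx < 0:
        raise KeyError(f"no shard covers id {row_id}")
    return ranges[idx][1]


def split_range(ranges, split_at, new_shard):
    """Split a hot range: ids from split_at up to the next range start
    are routed to new_shard. Copying the rows is not shown."""
    if split_at in [start for start, _ in ranges]:
        raise ValueError("a range already starts at that id")
    new_ranges = list(ranges)
    bisect.insort(new_ranges, (split_at, new_shard))
    return new_ranges
```

Because only the map entry for the hot range changes, rows on the other shards stay where they are, which is part of why range splits are a routine operation.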
  • 12. Methods of Sharding MySQL – Row-based Sharding Choices
    Modulus/Hash-based Sharding:
    ➢ Row key is hashed to an integer modulo the number of shards, then placed on that shard.
    ➢ Only rarely are some shards “hot.”
    ➢ Shard splitting is difficult to implement.
    Also a common method of sharding. We hope not to split shards often (or ever).
    When we do, it's a multi-week process.
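A minimal sketch of modulus/hash routing, and of why changing the shard count is painful, might look like the following. The use of MD5 as the hash and both function names are assumptions made for the example, not details from the talk.

```python
import hashlib


def shard_for(key, num_shards):
    """Hash the row key to an integer, then take it modulo the shard count."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def fraction_moved(keys, old_count, new_count):
    """Share of keys whose shard changes when the shard count changes."""
    keys = list(keys)
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)
```

With a uniform hash, going from 4 to 5 shards relocates about four in five keys in expectation, since a key stays put only when its hash gives the same remainder modulo 4 and modulo 5. Calling `fraction_moved(range(10000), 4, 5)` shows the effect on a sample.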
  • 13. Methods of Sharding MySQL – Row-based Sharding Choices
    Lookup Table-based Sharding:
    ➢ Easy to understand.
    ➢ Row key mapped to shard in a lookup table.
    ➢ Easy to move load off hot shards.
    ➢ Lookup table method is problematic:
      ➢ Single point of failure.
      ➢ Performance bottleneck.
      ➢ Billions of rows, itself may need sharding.
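Lookup-table routing can be sketched as below. The class and its methods are hypothetical, and an in-process dict stands in for what would in practice be a separate lookup database (the single point of failure the slide warns about).

```python
class LookupTableRouter:
    """Maps each row key to a shard through an explicit lookup table."""

    def __init__(self, default_shard):
        self.table = {}  # assumption: a dict stands in for the lookup table
        self.default_shard = default_shard

    def shard_for(self, key):
        """Return the key's shard, assigning the default shard on first use."""
        return self.table.setdefault(key, self.default_shard)

    def move(self, key, new_shard):
        """Relieve a hot shard by re-pointing one key (data copy not shown)."""
        self.table[key] = new_shard
```

Every read and write consults the table first, which is what makes moving load easy and also what makes the table a bottleneck.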
  • 14. Prerequisite: Build a Large Cluster – Allocating the Hardware
    Getting Hardware – your own company's:
    ➢ Can be politically-charged.
    ➢ Get a small batch first.
    ➢ Build small demonstration cluster.
    ➢ Get everyone on-board with the demo.
    Renting/Leasing Hardware – the Cloud:
    ➢ Allocate hardware in EC2 or elsewhere.
    ➢ Usually easier, but possibly harder admin:
      ➢ Hardware failure more common.
      ➢ Hardware/network flakiness more common.
  • 15. Prerequisite: Build a Large Cluster – Building the Cluster
    Okay, I've got the hardware. What next?
  • 16. Prerequisite: Build a Large Cluster – Building the Cluster
    Configuring the Hardware. The old dilemma:
    ➢ Spend days to install/configure DB software?
      Subsequent management is painful.
    ➢ Use SSH in “for” loops?
      Rolling your own configuration management tools is a lot of work.
    ➢ Learn a configuration management tool?
      Obvious choice in 2012. Well-documented tools like Chef, Puppet, Ansible.
  • 17. Configuration Management Tools – My Experience
    Puppet: 6 years ago at Digg
    ➢ Manage/Deploy hundreds of servers.
    ➢ Painful, but not as bad as hand-coding it all.
    Chef: 2 years ago at Drawn to Scale and Riot
    ➢ Manage/Deploy dozens of servers.
    ➢ Learning Ruby is a “joy” of its own.
    Ansible: 6 months ago at Palomino
    ➢ Manage/Deploy dozens of servers.
    ➢ First Palomino Cluster Tool subset built.
  • 18. Prerequisite: Build a Large Cluster – Configuration Management Options
    Pick your Configuration Management:
    ➢ Chef: Popular, use Ruby to “code your infrastructure.” Must learn Ruby.
    ➢ Puppet: Mature, use data structures to “define your infrastructure.” Less coding.
    ➢ Ansible: Tiny and modular, similar to Puppet, but with ordering for deployment. Pragmatic.
    Write/Get Recipes, Manifests, Playbooks?
    ➢ Writing is tedious. Can take >1 week.
    ➢ Get from internet? Often incomplete.
  • 19. Prerequisite: Build a Large Cluster – The Palomino Cluster Tool
    Palomino's tool for building large DB clusters:
    ➢ Chef, Puppet, Ansible modules.
    ➢ Open-source on Github.
    ➢ Google: “Palomino Cluster Tool.”
    ➢ Will build a large cluster for you in hours:
      ➢ Master(s)
      ➢ Slaves – hundreds of them as easy as two.
      ➢ MHA – when master fails, a slave takes over.
    ➢ Previously this would take days.
  • 20. The Palomino Cluster Tool – Building the Management Node
    Cluster Management Node:
    ➢ Will build the initial cluster.
    ➢ Will do subsequent cluster management.
    Tool for Initial Cluster Build:
    ➢ Palomino Cluster Tool (Ansible subset).
    Tool for Cluster Management:
    ➢ Jetpants (Ruby).
  • 21. The Palomino Cluster Tool – Building the Management Node
    Palomino Cluster Tool (Ansible subset).
    Why Ansible?
    ➢ No server to set up, simply uses SSH.
    ➢ Easy-to-understand non-code Playbooks.
    ➢ Use a language you know for modules.
    ➢ For demo purposes, obvious choice.
    ➢ Also production-worthy:
      ➢ Built by Michael DeHaan, long-time configuration management guru.
  • 22. The Palomino Cluster Tool – Building the Management Node
    Management node lives alongside your cluster.
    ➢ We are building our cluster in EC2.
    ➢ Thus management node in EC2.
    ➢ This tutorial assumes Ubuntu 12.04.
    ➢ t1.micro is fine for management node.
    Install basic tools:
    ➢ apt-get install git (for Ansible/P.C.T.)
    ➢ apt-get install make python-jinja2 (for Ansible)
  • 23. The Palomino Cluster Tool – Configuring the Management Node
    Install Ansible:
    ➢ git clone git://
    ➢ make install
    Install Palomino Cluster Tool:
    ➢ git clone git:// palominodb/PalominoClusterTool.git
    I think we just finished the management node!
  • 24. The Palomino Cluster Tool – Allocating Shard Nodes
    Shard nodes:
    ➢ m1.small or larger: at least 1.6GB RAM,
    ➢ :3306, :80, and :22 open between all (one security group in EC2),
    ➢ Ubuntu 12.04 (other Debian-alikes at your own risk – but may work!).
    Do not need OS/database configuration:
    ➢ Ansible will configure them.
  • 25. The Palomino Cluster Tool – Building the First Shard – Step 1
    From README: edit IP addresses in cluster layout file (PalominoClusterToolLayout.ini):
    # Alerting/Trending -----
    [alertmaster]
    [trendmaster]
    Servers -----
    [mhamanager]
    This section identical for all Shards.
  • 26. The Palomino Cluster Tool – Building the First Shard – Step 2
    From README: edit IP addresses in cluster layout file (PalominoClusterToolLayout.ini):
    [mysqlmasters]
    [mysqlslaves]
    [mysqls:vars]
    master_host=
    This section different for every Shard.
  • 27. The Palomino Cluster Tool – Building the First Shard – Step 3
    Run setup command to put configuration and SSH keys into /etc:
    $ cd PalominoClusterTool/AnsiblePlaybooks/Ubuntu-12.04
    $ ./ ShardA
    Run build command – it's a wrapper around Ansible Playbooks:
    $ ./ ShardA
  • 28. The Palomino Cluster Tool – Building the Second Shard
    Just make one shard with a master and many slaves.
    In real life, you might do something like this instead:
    for i in ShardB ShardC ShardD ; do
      (manual step): vim PalominoClusterToolLayout.ini
      (scriptable steps): ./ $i
                          ./ $i
    done
    Run them in separate terminals to save time.
  • 29. Make the Cluster Real – Data makes Shard Split Interesting
    Fill ShardA using random data script.*
    Palomino Cluster Tool includes such a tool.
    ➢ HelperScripts/
    $ ssh root@sharda-master
    # cd PalominoClusterTool/HelperScripts
    # mysql -e 'create database palomino'
    # ./ 1200000 3 | mysql -f palomino
    Install Jetpants, do shard split now.
    * Be sure /var/lib/mysql is on a large partition!
  • 30. Administering the Cluster – Install Jetpants
    General idea: Install Ruby >=1.9.2 and RubyGems, then Jetpants via RubyGems.
    On my systems, /etc/alternatives is always incorrect, so ln the proper binaries for Jetpants.
    # apt-get install ruby1.9.3 rubygems libmysqlclient-dev
    # ln -sf /usr/bin/ruby1.9.3 /etc/alternatives/ruby
    # ln -sf /usr/bin/gem1.9.3 /etc/alternatives/gem
    # gem install jetpants
  • 31. Administering the Cluster – Configure Jetpants
    General idea: edit /etc/jetpants.yaml, then create the Jetpants inventory and application configuration files and chown them to the Jetpants user:
    # vim /etc/jetpants.yaml
    # mkdir -p /var/jetpants
    # touch /var/jetpants/assets.json
    # chown jetpantsusr: /var/jetpants/assets.json
    # mkdir -p /var/www
    # touch /var/www/databases.yaml
    # chown jetpantsusr: /var/www/databases.yaml
  • 32. Administering the Cluster – Jetpants Shard Splits
    Tell Jetpants Console about your ShardA:
    Jetpants> s =, 999999999,,:ready) # master
    Jetpants> s.sync_configuration
    Create spares within Console for all others (improved workflow in Jetpants 0.7.8):
    Jetpants> topology.tracker.spares <<> topology.tracker.spares <<> topology.tracker.spares <<> topology.write_config
    Jetpants> topology.update_tracker_data
  • 33. Administering the Cluster – Jetpants Shard Splits
    Just for this tutorial:
    ➢ Create the “palomino” database,
    ➢ Break the replication on all the spares,
    ➢ Be sure spares are read/write:
      ➢ Edit my.cnf,
      ➢ service mysql restart
    ➢ Ensure “jetpants pools” output is proper:
      ➢ One master,
      ➢ Two slaves.
  • 34. Administering the Cluster – Jetpants Shard Splits
    How to perform an actual Shard Split:
    $ jetpants shard_split --min-id=1 --max-id=999999999
    Notes:
    ➢ Process takes hours. Use screen or nohup.
    ➢ LeftID == parent's first, RightID == parent's last, no overlap/gap.
    ➢ Make children 1-300000,300001-999999999.
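The "no overlap/gap" rule for child ranges can be checked before a split is started. The helper below is a hypothetical pre-flight check written for this illustration; it is not part of Jetpants.

```python
def validate_children(parent, children):
    """Check that child (min_id, max_id) ranges tile the parent range exactly:
    first child starts at the parent's first id, last child ends at the
    parent's last id, and neighbours neither overlap nor leave a gap."""
    parent_min, parent_max = parent
    ordered = sorted(children)
    if not ordered:
        return False
    if any(lo > hi for lo, hi in ordered):
        return False
    if ordered[0][0] != parent_min or ordered[-1][1] != parent_max:
        return False
    for (_, prev_max), (next_min, _) in zip(ordered, ordered[1:]):
        if next_min != prev_max + 1:
            return False
    return True
```

For the split on this slide, `validate_children((1, 999999999), [(1, 300000), (300001, 999999999)])` passes, while a second child starting at 300000 (overlap) or 300002 (gap) fails.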
  • 35. Jetpants Shard Splitting – The Gory Details
    After “jetpants shard_split”:
    ubuntu@ip-10-252-157-110:~$ jetpants pools
    shard-1-999999999 [3GB]
      master = ip-10-244-136-107
      standby slave 1 = ip-10-244-143-195
      standby slave 2 = ip-10-244-31-91
    shard-1-400000 (state: replicating) [2GB]
      master = ip-10-244-144-183
    shard-400001-999999999 (state: replicating) [1GB]
      master = ip-10-244-146-27
    0 global pools
    3 shard pools
    ---- --------------
    3 total pools
    3 masters
    0 active slaves
    2 standby slaves
    0 backup slaves
    ---- --------------
    5 total nodes
  • 36. Jetpants Improvements – The Result of an Experiment
    Jetpants only well-tested on RHEL/CentOS.
    Palomino Cluster Tool only well-tested to build Ubuntu 12.04 clusters.
    Little effort to fix Jetpants:
    ➢ /sbin/service location different,
    ➢ service mysql status output different.
  • 37. Jetpants Improvements – The Result of an Experiment
    Jetpants only well-tested on MySQL 5.1.
    I built a cluster of MySQL 5.5.
    A little more effort to fix Jetpants:
    ➢ Set master_host= is syntax error,
    ➢ reset slave needs keyword “all” appended.
  • 38. Jetpants Improvements – The Result of an Experiment
    Jetpants only well-tested on large datasets.
    I built a cluster with only hundreds of MB.
    A wee tad more effort to fix Jetpants:
    ➢ Some timings assumed large datasets,
    ➢ Edge cases for small/quick operations reported back to the author.
  • 39. Jetpants Improvements – OSS Collaboration and Win
    Evan Elias implemented these fixes last week!
    ➢ jetpants add_pool,
    ➢ jetpants add_shard,
    ➢ jetpants add_spare (with sanity-check spare),
    ➢ Shards with 1 slave (not for prod!),
    ➢ read_only spares not fatal,
    ➢ Debian-alike (Ubuntu) fixes,
    ➢ MySQL 5.5 fixes,
    ➢ Mid-split Jetpants pools output simpler.
    Really responsive ownership of project!
  • 40. Twitter's Gizzard – What is it?
    General Framework for distributed database.
    ➢ Hides sharding from you.
    ➢ Literally, it is middleware.
      ➢ Applications connect to Gizzard,
      ➢ Gizzard sends connections to proper place,
      ➢ Shard splits and hardware failure taken care of.
    ➢ Created at Twitter by rogue cowboys.
    ➢ Not completely production-ready.
      ➢ Better than rolling your own!
  • 41. Twitter's Gizzard – Why should I use it?
    You've settled on row-based partition scheme:
    ➢ Master nearing I/O capacity, won't scale up,
    ➢ Can't move some tables to their own pool,
    ➢ Can't split the columns/indexes out,
    ➢ You want to keep using the DBMS you already know and love: Percona Server.*
    ➢ Don't want to think about fault-tolerance or shard splits (much).
    * Actually use any storage back-end.
  • 42. Twitter's Gizzard – The Fine Print
    This sounds perfect. Why not Gizzard?
    Writes must follow strict diet. Must be:
    ➢ Idempotent*,
    ➢ Commutative**,
    ➢ Must not have tuberculosis.
    * Pfizer cannot remove the idempotency requirement of Gizzard.
    ** Even on evenings and weekends.
  • 43. Twitter's Gizzard – Expanding the Fine Print
    Idempotency:
    ➢ Submit a write. Again. And again.
    ➢ Must be identical to doing it once.
    ➢ Bad: “update set col = col + 1”
    Commutative – writes in arbitrary order:
    ➢ WriteA→WriteB→WriteC on Node1.
    ➢ WriteB→WriteC→WriteA on Node2.
    ➢ Bad: “update set col1 = 42”→“update set col2 = col1 + 5”
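The two "Bad" examples on this slide can be checked mechanically. In this sketch a Python dict stands in for a database row, and all function names are hypothetical; it models the properties themselves, not Gizzard's API.

```python
def increment(row):
    """Bad: `col = col + 1` changes the result every time it is replayed."""
    return {**row, "col": row["col"] + 1}


def set_value(row):
    """Good: `col = 42` gives the same result however often it is replayed."""
    return {**row, "col": 42}


def set_col1(row):
    """`update set col1 = 42`"""
    return {**row, "col1": 42}


def derive_col2(row):
    """`update set col2 = col1 + 5`: depends on what col1 holds right now."""
    return {**row, "col2": row["col1"] + 5}


def is_idempotent(write, row):
    """Applying the write twice must equal applying it once."""
    return write(write(row)) == write(row)


def commutes(write_a, write_b, row):
    """Both orderings of the two writes must leave the row identical."""
    return write_b(write_a(row)) == write_a(write_b(row))
```

Starting from `{"col1": 1, "col2": 0}`, running `set_col1` then `derive_col2` leaves col2 at 47, while the reverse order leaves it at 6, so two nodes that receive the writes in different orders would disagree.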
  • 44. Twitter's Gizzard – Expanding the Fine Print
    Cluster is Eventually Consistent:
    ➢ May return old values for reads.
    ➢ Unknown when consistency will occur.
    Like a politician's position on the budget:
    ➢ Might be consistent in the future.
    ➢ Just not right now.
    ➢ Or now.
  • 45. Twitter's Gizzard – Working Around the Shortcomings
    Gizzard work-around:
    ➢ Add timestamp to every transaction.
    ➢ Good:
      ➢ “col1.ts=1; update set col1=42” →
      ➢ “update set col2=col1 + 5 where col1.ts=1”
    ➢ Implementation trickier if DBMS doesn't support column attributes.
    Cannot escape: must radically re-think schema and application/DBMS interaction.
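One way to read the timestamp work-around is as a last-write-wins rule plus a guarded dependent write. The sketch below is an interpretation under that assumption, not Gizzard's implementation: each column stores a `(value, ts)` pair in a dict, and both function names are hypothetical.

```python
def apply_set(row, column, value, ts):
    """Timestamped write: takes effect only if newer than what is stored."""
    current = row.get(column)
    if current is not None and current[1] >= ts:
        return row  # replayed or stale write: no change
    return {**row, column: (value, ts)}


def apply_derived(row, ts_required, ts):
    """`col2 = col1 + 5 where col1.ts = ts_required`.
    Skipped unless col1 carries the expected timestamp; a real system would
    have to retry the write later (assumption, retry logic not shown)."""
    col1 = row.get("col1")
    if col1 is None or col1[1] != ts_required:
        return row
    return apply_set(row, "col2", col1[0] + 5, ts)
```

Replaying `apply_set` with the same timestamp changes nothing, and two sets on the same column converge on the higher timestamp regardless of arrival order. The guarded write never computes col2 from a col1 value it was not meant to see.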
  • 46. Twitter's Gizzard – Trying it Out
    I'm convinced! How do I begin?
    ➢ Learn Scala.
    ➢ Clone “rowz” from Github.
    ➢ Modify it to suit your needs.
    ➢ Learn how it interacts with existing tools.
    ➢ Write new monitoring/alerting plugins.
    ➢ Write unit tests!
    ➢ You should OSS it to help with overhead.
  • 47. Twitter's Gizzard – Trying it Out
    Sounds daunting. Maybe I'll roll my own?
    Learn from others' mistakes:
    ➢ Digg: 2 engineers 6 months. Code thrown away. Digg out of business.
    ➢ Countless identical stories in Silicon Valley.
    NIHS attitude == Go out of business*.
    * 8-figure R&D budgets excepted.
  • 48. Youtube's Vitess/Vtocc – What is it?
    Vitess is a library. Vtocc is an implementation using it.
    Vtocc is another middleware solution.
    ➢ Sharding,
    ➢ Caching,
    ➢ Connection-pooling,
    ➢ In-use at Youtube,
    ➢ Built-in fail-safe features.
  • 49. Youtube's Vtocc – Why use it?
    Proven high-volume sharding solution.
    Interesting feature-list:
    ➢ Auto query/transaction over-limit killing.
    ➢ Better query-cache implementation.
    ➢ Query comment-stripping for query cache.
    ➢ Query consolidation.
    ➢ Zero downtime restarts.
    Less coding than Gizzard (more plug-in).
  • 50. Youtube's Vtocc – Hold on, Zero Downtime Restarts?
    Just start new Vtocc instance.
    ➢ Instance1 passes new requests to Instance2,
    ➢ Instance1's connections get 30s to complete,
    ➢ Instance2 kills Instance1 and takes over.
    [Diagram: Vtocc Instance 1, Vtocc Instance 2]
  • 51. Youtube's Vtocc – The Fine Print
    Requires Particular Primary Keys:
    ➢ varbinary datatype,
    ➢ Choose carefully to prevent hot-spots.
    Max result-set size: larger resultsets fail.
    Additional administration burden:
    ➢ “My query was killed. Why?”
    ➢ Middleware adds spooky hard-to-diagnose failure modes.
  • 52. Youtube's Vtocc – Implementation Details
    ➢ Run Vtocc on same server as MySQL.
    ➢ Configure Vtocc fail-safes for expected load:
      ➢ Pool Size (connection count),
      ➢ Max Transactions (has own connection pool),
      ➢ Query Timeout (before killed),
      ➢ Transaction Timeout (before killed),
      ➢ Max Resultset Size in rows
        ➢ Go language doesn't free allocated memory, so pick this value carefully.
    ➢ More details:
  • 53. HAproxy – Re-thinking Proxy Topology
    Old-school Proxy Topology:
    ➢ DB Clients on one side,
    ➢ DB Servers on the other,
    ➢ Proxy in-between.
    Single Point of Failure
  • 54. HAproxy – Re-thinking Proxy Topology
    Free proxy provides new architecture option:
    ➢ Proxy on every DB client node.
    ➢ Good-bye single-point-of-failure.
    ➢ Hello configuration management for proxy.
    [Diagram: five HAproxy instances, one per DB client node.]
  • 55. Methods of Sharding MySQL – Q&A
    Questions? Suggestions:
    ➢ Interesting stuff. Got a job for me?
    ➢ Well I got a job for you. Interested?
    ➢ Warn me next time so I can sleep in the back row.
    ➢ Was that a question?
    Thank you! Emails to domain palominodb, username time.
    Percona Live 2012 in New York City. Enjoy the rest of the show!