Methods of Sharding MySQL Percona Live NYC 2012

Who are Palomino?
Bespoke Services: we work with and like you.
Production Experienced: senior DBAs, admins, and engineers.
24x7: globally-distributed on-call staff.
Short-term no-lock-in contracts.
Professional Services (DevOps):
➢ Chef,
➢ Puppet,
➢ Ansible.
Big Data Cluster Administration (OpsDev):
➢ MySQL, PostgreSQL,
➢ Cassandra, HBase,
➢ MongoDB, Couchbase.

Who am I?
Tim Ellis
CTO/Principal Architect, Palomino
Achievements:
➢ Palomino Big Data Strategy.
➢ Datawarehouse Cluster at Riot Games.
➢ Back-end Storage Architecture for Firefox Sync.
➢ Led DB teams at Digg for four years.
➢ Harassed the Reddit team at one of their parties.
Ensured Successful Business for:
➢ Digg, Friendster,
➢ Riot Games,
➢ Mozilla,
➢ StumbleUpon.

What is this Talk?
Large cluster admin: when one DB isn't enough.
➢ What is a shard?
➢ What shard types can I choose?
➢ How to build a large DB cluster.
➢ How to administer that giant mess of DBs.
Types of large clusters:
➢ Just a bunch of databases.
➢ Distributed database across machines.

Where the Focus will Lie
12% – Sharding theory/considerations.
25% – Building a Cluster to administer (tutorial):
➢ Palomino Cluster Tool.
50% – Flexible large-cluster administration (tutorial):
➢ Tumblr's Jetpants.
13% – Other sharding technologies (talk-only):
➢ YouTube's Vtocc (Vitess),
➢ Twitter's Gizzard,
➢ HAproxy.

What about the Silver Bullets?
NoSQL Distributed Databases:
➢ Promise “sharding” for free,
➢ Promise uptime and horizontal scaling, trivially.
Reality:
➢ RDBMS is 40-yr-old tech,
➢ NoSQL is 10-yr-old tech,
➢ Which one is responsible for how many high-profile downtimes in the past 10 years?
➢ Evaluate the alternatives without illusions.

What is a Shard?
A location for a subset of data:
➢ Itself made of pieces.
➢ Typically itself redundant.
[Diagram: three shards (User Data, Logging Data, Posts Data), each with one master and three slaves.]

What are the Sharding Method Choices?
By-Function:
➢ Move busy tables onto a new shard.
➢ Writes to the busiest tables go to new hardware.
➢ Writes to the remaining tables stay on current hardware.
By-Columns:
➢ Split a table into chunks of related columns, and store each set on its own Master/Slaves shard.
By-Rows:
➢ A table is split into N shards; each shard gets a subset of the rows of the table.

Shard Method Choices
By-Function and By-Column Methods:
➢ Much easier.
➢ Can get you through months to years.
➢ Eventually you run out of options here.
By-Row Method:
➢ The hardest to do.
➢ Requires new ways of accessing data.
➢ Often requires sophisticated cache strategies.
➢ Itself can be done several ways.

By-Function Sharding
Picking a Functional Split:
➢ A subset of tables commonly joined.
➢ Tables outside this subset nearly never joined with it.
➢ One of them responsible for many writes.
Every JOIN that reaches a table outside the subset must be rewritten as code-based multi-SELECTs.
Once the subset of tables is moved onto its own server, writes are distributed.

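
A minimal sketch of what "rewriting JOINs into code-based multi-SELECTs" can look like. Everything here is an assumption made for illustration: the table names (users, posts), the helper posts_with_author, and the use of two in-memory SQLite databases as stand-ins for two MySQL shards.

```python
import sqlite3

# Two in-memory SQLite databases stand in for two separate MySQL shards
# (an assumption for this demo; a real DAL would hold MySQL connections).
users_db = sqlite3.connect(":memory:")
posts_db = sqlite3.connect(":memory:")

users_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
users_db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

posts_db.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)"
)
posts_db.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(10, 1, "hello"), (11, 2, "sharding"), (12, 1, "again")],
)


def posts_with_author(posts_conn, users_conn):
    """Replace `posts JOIN users` with two SELECTs stitched together in code."""
    posts = posts_conn.execute(
        "SELECT id, user_id, title FROM posts ORDER BY id"
    ).fetchall()
    user_ids = sorted({user_id for _, user_id, _ in posts})
    if not user_ids:
        return []
    # Second SELECT goes to the other shard, restricted to the ids we need.
    placeholders = ",".join("?" for _ in user_ids)
    rows = users_conn.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
    ).fetchall()
    names = dict(rows)
    return [(post_id, title, names.get(user_id)) for post_id, user_id, title in posts]
```

The join now costs two round trips and some application memory, which is one reason caching tends to show up alongside sharding.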
By-Column Sharding (Vertical Partition)
Identifying a candidate table:
➢ Many columns (“users” anyone?),
➢ Many updates,
➢ Many indexes.
Required: an even split of columns/indexes by update frequency. Attempt: logical grouping.
JOINs are neither possible nor desirable: write multi-SELECT code in the application DAL.

Row-based Sharding Choices
Range-based Sharding:
➢ Easy to understand.
➢ Each shard gets a range of rows.
➢ Oft-times some shards are “hot.”
➢ Hot shards are split into separate shards.
➢ Cold shards are joined into a single shard.
➢ Juggling shard load is a frequent process.
Typically the best solution. Shortcomings have known work-arounds.

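
A small sketch of range-based routing, assuming an integer row id and a sorted in-memory shard map. The shard names, boundaries, and both function names are hypothetical; the last shard is treated as open-ended.

```python
import bisect

# Hypothetical shard map: (lowest id stored on the shard, shard name),
# kept sorted by lowest id. The last shard has no upper bound.
RANGES = [(1, "shard_a"), (300001, "shard_b"), (700001, "shard_c")]


def shard_for(row_id, ranges=RANGES):
    """Return the shard whose range contains row_id."""
    starts = [start for start, _ in ranges]
    index = bisect.bisect_right(starts, row_id) - 1
    if index < 0:
        raise KeyError(f"no shard covers id {row_id}")
    return ranges[index][1]


def split_shard(ranges, split_at, new_name):
    """Split a hot shard: ids >= split_at (up to the next boundary) move to new_name."""
    if any(start == split_at for start, _ in ranges):
        raise ValueError("a shard already starts at this id")
    return sorted(ranges + [(split_at, new_name)])
```

Splitting a hot shard only adds one boundary to the map, which is why juggling load is routine with this method. Moving the rows themselves is the expensive part and is not shown.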
Row-based Sharding Choices
Modulus/Hash-based Sharding:
➢ Row key is hashed to an integer modulo the number of shards, then placed on that shard.
➢ Only rarely are some shards “hot.”
➢ Shard splitting is difficult to implement.
Also a common method of sharding. We hope not to split shards often (or ever).
When we do, it's a multi-week process.

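
A sketch of hash/modulus placement, and of why changing the shard count is painful. The choice of MD5 and both function names are assumptions for illustration, not something the talk prescribes.

```python
import hashlib


def shard_index(row_key, shard_count):
    """Hash the row key to an integer, then take it modulo the shard count."""
    digest = hashlib.md5(str(row_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def fraction_moved(keys, old_count, new_count):
    """Fraction of keys that land on a different shard after resharding."""
    moved = sum(
        1 for key in keys
        if shard_index(key, old_count) != shard_index(key, new_count)
    )
    return moved / len(keys)
```

Calling fraction_moved(range(10000), 4, 5) shows that going from four shards to five relocates most keys, since a key stays put only when both remainders happen to coincide. That bulk data movement is what makes a split slow.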
Row-based Sharding Choices
Lookup Table-based Sharding:
➢ Easy to understand.
➢ Row key mapped to shard in a lookup table.
➢ Easy to move load off hot shards.
➢ Lookup table method is problematic:
  ➢ Single point of failure.
  ➢ Performance bottleneck.
  ➢ Billions of rows, itself may need sharding.

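
A toy sketch of the lookup-table idea, with a Python dict standing in for the lookup table. The class and method names are hypothetical. In a real deployment the table would live in a replicated store with a cache in front of it, which is where the single-point-of-failure and bottleneck concerns come from.

```python
class ShardDirectory:
    """Maps each row key to the shard that currently holds it."""

    def __init__(self):
        self._shard_by_key = {}

    def assign(self, row_key, shard):
        self._shard_by_key[row_key] = shard

    def lookup(self, row_key):
        # Every read and write pays for this lookup first.
        return self._shard_by_key[row_key]

    def move(self, row_key, new_shard):
        """Point one key at a new shard; returns the shard it came from."""
        old_shard = self._shard_by_key[row_key]
        self._shard_by_key[row_key] = new_shard
        return old_shard


directory = ShardDirectory()
directory.assign("user:1", "shard_a")
directory.assign("user:2", "shard_a")
directory.move("user:2", "shard_b")  # relieve a hot shard one key at a time
```

Moving load is a single-row update in the directory, at the cost of one extra lookup per query.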
Prerequisite: Build a Large Cluster: Allocating the Hardware
Getting Hardware – your own company's:
➢ Can be politically-charged.
➢ Get a small batch first.
➢ Build small demonstration cluster.
➢ Get everyone on-board with the demo.
Renting/Leasing Hardware – the Cloud:
➢ Allocate hardware in EC2 or elsewhere.
➢ Usually easier, but possibly harder admin:
  ➢ Hardware failure more common.
  ➢ Hardware/network flakiness more common.

Prerequisite: Build a Large Cluster: Building the Cluster
Okay, I've got the hardware. What next?

Prerequisite: Build a Large Cluster: Building the Cluster
Configuring the Hardware. The old dilemma:
➢ Spend days to install/configure DB software? Subsequent management is painful.
➢ Use SSH in “for” loops? Rolling your own configuration management tools is a lot of work.
➢ Learn a configuration management tool? Obvious choice in 2012. Well-documented tools like Chef, Puppet, Ansible.

Configuration Management Tools: My Experience
Puppet: 6 years ago at Digg
➢ Manage/Deploy hundreds of servers.
➢ Painful, but not as bad as hand-coding it all.
Chef: 2 years ago at Drawn to Scale and Riot
➢ Manage/Deploy dozens of servers.
➢ Learning Ruby is a “joy” of its own.
Ansible: 6 months ago at Palomino
➢ Manage/Deploy dozens of servers.
➢ First Palomino Cluster Tool subset built.

Prerequisite: Build a Large Cluster: Configuration Management Options
Pick your Configuration Management:
➢ Chef: Popular, use Ruby to “code your infrastructure.” Must learn Ruby.
➢ Puppet: Mature, use data structures to “define your infrastructure.” Less coding.
➢ Ansible: Tiny and modular, similar to Puppet, but with ordering for deployment. Pragmatic.
Write/Get Recipes, Manifests, Playbooks?
➢ Writing is tedious. Can take >1 week.
➢ Get from internet? Often incomplete.

Prerequisite: Build a Large Cluster: The Palomino Cluster Tool
Palomino's tool for building large DB clusters:
➢ Chef, Puppet, Ansible modules.
➢ Open-source on Github.
➢ https://github.com/time-palominodb/PalominoClusterTool
➢ Google: “Palomino Cluster Tool.”
➢ Will build a large cluster for you in hours:
  ➢ Master(s)
  ➢ Slaves – hundreds of them as easy as two.
  ➢ MHA – when master fails, a slave takes over.
➢ Previously this would take days.

The Palomino Cluster Tool: Building the Management Node
Cluster Management Node:
➢ Will build the initial cluster.
➢ Will do subsequent cluster management.
Tool for Initial Cluster Build:
➢ Palomino Cluster Tool (Ansible subset).
Tool for Cluster Management:
➢ Jetpants (Ruby).

The Palomino Cluster Tool: Building the Management Node
Palomino Cluster Tool (Ansible subset).
Why Ansible?
➢ No server to set up, simply uses SSH.
➢ Easy-to-understand non-code Playbooks.
➢ Use a language you know for modules.
➢ For demo purposes, obvious choice.
➢ Also production-worthy:
  ➢ Built by Michael DeHaan, long-time configuration management guru.

The Palomino Cluster Tool: Building the Management Node
Management node lives alongside your cluster.
➢ We are building our cluster in EC2.
➢ Thus management node in EC2.
➢ This tutorial assumes Ubuntu 12.04.
➢ t1.micro is fine for management node.
Install basic tools:
➢ apt-get install git (for Ansible/P.C.T.)
➢ apt-get install make python-jinja2 (for Ansible)

The Palomino Cluster Tool: Configuring the Management Node
Install Ansible:
➢ git clone git://github.com/ansible/ansible.git
➢ make install
Install Palomino Cluster Tool:
➢ git clone git://github.com/time-palominodb/PalominoClusterTool.git
I think we just finished the management node!

The Palomino Cluster Tool: Allocating Shard Nodes
Shard nodes:
➢ m1.small or larger: at least 1.6GB RAM,
➢ :3306, :80, and :22 open between all (one security group in EC2),
➢ Ubuntu 12.04 (other Debian-alikes at your own risk – but may work!).
Do not need OS/database configuration:
➢ Ansible will configure them.

The Palomino Cluster Tool: Building the First Shard – Step 1
From README: edit IP addresses in cluster layout file (PalominoClusterToolLayout.ini):

    # Alerting/Trending -----
    [alertmaster]
    10.252.157.110
    [trendmaster]
    10.252.157.110
    # Servers -----
    [mhamanager]
    10.252.157.110

This section is identical for all Shards.

The Palomino Cluster Tool: Building the First Shard – Step 2
From README: edit IP addresses in cluster layout file (PalominoClusterToolLayout.ini):

    [mysqlmasters]
    10.244.17.6
    [mysqlslaves]
    10.244.26.199
    10.244.18.178
    [mysqls:vars]
    master_host=10.244.17.6

This section is different for every Shard.

The Palomino Cluster Tool: Building the First Shard – Step 3
Run setup command to put configuration and SSH keys into /etc:

    $ cd PalominoClusterTool/AnsiblePlaybooks/Ubuntu-12.04
    $ ./00-Setup_PalominoClusterTool.sh ShardA

Run build command – it's a wrapper around Ansible Playbooks:

    $ ./10-MySQL_MHA_Manager.sh ShardA

The Palomino Cluster Tool: Building the Second Shard
Just make one shard with a master and many slaves. In real life, you might do something like this instead:

    for i in ShardB ShardC ShardD ; do
      (manual step): vim PalominoClusterToolLayout.ini
      (scriptable steps):
        ./00-Setup_PalominoClusterTool.sh $i
        ./10-MySQL_MHA_Manager.sh $i
    done

Run them in separate terminals to save time.

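
The manual vim step could plausibly be scripted too. The sketch below only renders the sections shown on the Step 1 and Step 2 slides. The helper name render_layout is hypothetical, it assumes the alerting, trending, and MHA manager roles share one IP as on the slides, and the real PalominoClusterToolLayout.ini may contain sections not shown here.

```python
def render_layout(manager_ip, master_ip, slave_ips):
    """Return layout-file text with only the sections shown on the slides."""
    lines = [
        "# Alerting/Trending -----",
        "[alertmaster]", manager_ip,
        "[trendmaster]", manager_ip,
        "# Servers -----",
        "[mhamanager]", manager_ip,
        # Everything below differs per shard.
        "[mysqlmasters]", master_ip,
        "[mysqlslaves]", *slave_ips,
        "[mysqls:vars]", f"master_host={master_ip}",
    ]
    return "\n".join(lines) + "\n"


# Example using the addresses from the slides.
layout_text = render_layout(
    "10.252.157.110", "10.244.17.6", ["10.244.26.199", "10.244.18.178"]
)
```

Writing layout_text to the layout file before each setup run would leave only the IP inventory as a manual input.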
Make the Cluster Real: Data makes Shard Split Interesting
Fill ShardA using random data script.* Palomino Cluster Tool includes such a tool.
➢ HelperScripts/makeGiantDatafile.pl

    $ ssh root@sharda-master
    # cd PalominoClusterTool/HelperScripts
    # mysql -e 'create database palomino'
    # ./makeGiantDatafile.pl 1200000 3 | mysql -f palomino

Install Jetpants, do shard split now.
* Be sure /var/lib/mysql is on a large partition!

Administering the Cluster: Install Jetpants
General idea: install Ruby >=1.9.2 and RubyGems, then Jetpants via RubyGems. On my systems, /etc/alternatives is always incorrect, so symlink the proper binaries for Jetpants.

    # apt-get install ruby1.9.3 rubygems libmysqlclient-dev
    # ln -sf /usr/bin/ruby1.9.3 /etc/alternatives/ruby
    # ln -sf /usr/bin/gem1.9.3 /etc/alternatives/gem
    # gem install jetpants

Administering the Cluster: Configure Jetpants
General idea: edit /etc/jetpants.yaml, then create the Jetpants inventory and application configuration files and chown them to the Jetpants user:

    # vim /etc/jetpants.yaml
    # mkdir -p /var/jetpants
    # touch /var/jetpants/assets.json
    # chown jetpantsusr: /var/jetpants/assets.json
    # mkdir -p /var/www
    # touch /var/www/databases.yaml
    # chown jetpantsusr: /var/www/databases.yaml

Administering the Cluster: Jetpants Shard Splits
Tell Jetpants Console about your ShardA:

    Jetpants> s = Shard.new(1, 999999999, '10.12.34.56', :ready) # 10.12.34.56 == ShardA master
    Jetpants> s.sync_configuration

Create spares within Console for all others (improved workflow in Jetpants 0.7.8):

    Jetpants> topology.tracker.spares << '10.23.45.67'
    Jetpants> topology.tracker.spares << '10.23.45.68'
    Jetpants> topology.tracker.spares << '10.23.45.69'
    Jetpants> topology.write_config
    Jetpants> topology.update_tracker_data

Administering the Cluster: Jetpants Shard Splits
Just for this tutorial:
➢ Create the “palomino” database,
➢ Break the replication on all the spares,
➢ Be sure spares are read/write:
  ➢ Edit my.cnf,
  ➢ service mysql restart
➢ Ensure “jetpants pools” output is proper:
  ➢ One master,
  ➢ Two slaves.

Administering the Cluster: Jetpants Shard Splits
How to perform an actual Shard Split:

    $ jetpants shard_split --min-id=1 --max-id=999999999

Notes:
➢ Process takes hours. Use screen or nohup.
➢ LeftID == parent's first id, RightID == parent's last id, with no overlap or gap between children.
➢ Make children 1-300000,300001-999999999.

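
The "no overlap/gap" rule is easy to check before starting an hours-long split. The sketch below is a standalone sanity check, not part of Jetpants; the function name and the tuple representation of child ranges are assumptions.

```python
def validate_children(parent_min, parent_max, children):
    """Check that child (min_id, max_id) ranges exactly tile the parent range.

    Returns the children sorted by min_id, or raises ValueError.
    """
    ordered = sorted(children)
    if not ordered:
        raise ValueError("at least one child range is required")
    for low, high in ordered:
        if low > high:
            raise ValueError(f"empty range {low}-{high}")
    if ordered[0][0] != parent_min:
        raise ValueError("first child must start at the parent's first id")
    if ordered[-1][1] != parent_max:
        raise ValueError("last child must end at the parent's last id")
    for (_, prev_high), (next_low, _) in zip(ordered, ordered[1:]):
        if next_low != prev_high + 1:
            raise ValueError(f"gap or overlap between {prev_high} and {next_low}")
    return ordered
```

For the split above, validate_children(1, 999999999, [(1, 300000), (300001, 999999999)]) passes, while a second child starting at 300002 would be rejected for leaving a gap.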
Jetpants Improvements: The Result of an Experiment
Jetpants only well-tested on RHEL/CentOS.
Palomino Cluster Tool only well-tested to build Ubuntu 12.04 clusters.
Little effort to fix Jetpants:
➢ /sbin/service location different,
➢ service mysql status output different.

Jetpants Improvements: The Result of an Experiment
Jetpants only well-tested on MySQL 5.1.
I built a cluster of MySQL 5.5.
A little more effort to fix Jetpants:
➢ “set master_host=” is a syntax error,
➢ “reset slave” needs keyword “all” appended.

Jetpants Improvements: The Result of an Experiment
Jetpants only well-tested on large datasets.
I built a cluster with only hundreds of MB.
A wee tad more effort to fix Jetpants:
➢ Some timings assumed large datasets,
➢ Edge cases for small/quick operations reported back to the author.

Jetpants Improvements: OSS Collaboration and Win
Evan Elias implemented these fixes last week!
➢ jetpants add_pool,
➢ jetpants add_shard,
➢ jetpants add_spare (with sanity-check spare),
➢ Shards with 1 slave (not for prod!),
➢ read_only spares not fatal,
➢ Debian-alike (Ubuntu) fixes,
➢ MySQL 5.5 fixes,
➢ Mid-split Jetpants pools output simpler.
Really responsive ownership of project!

Twitter's Gizzard: What is it?
General Framework for distributed database.
➢ Hides sharding from you.
➢ Literally, it is middleware.
  ➢ Applications connect to Gizzard,
  ➢ Gizzard sends connections to proper place,
  ➢ Shard splits and hardware failure taken care of.
➢ Created at Twitter by rogue cowboys.
➢ Not completely production-ready.
  ➢ Better than rolling your own!

Twitter's Gizzard: Why should I use it?
You've settled on a row-based partition scheme:
➢ Master nearing I/O capacity, won't scale up,
➢ Can't move some tables to their own pool,
➢ Can't split the columns/indexes out,
➢ You want to keep using the DBMS you already know and love: Percona Server.*
➢ Don't want to think about fault-tolerance or shard splits (much).
* Actually, you can use any storage back-end.

Twitter's Gizzard: The Fine Print
This sounds perfect. Why not Gizzard?
Writes must follow strict diet. Must be:
➢ Idempotent*,
➢ Commutative**,
➢ Must not have tuberculosis.
* Pfizer cannot remove the idempotency requirement of Gizzard.
** Even on evenings and weekends.

Twitter's Gizzard: Expanding the Fine Print
Idempotency:
➢ Submit a write. Again. And again.
➢ Must be identical to doing it once.
➢ Bad: “update set col = col + 1”
Commutative – writes in arbitrary order:
➢ WriteA→WriteB→WriteC on Node1.
➢ WriteB→WriteC→WriteA on Node2.
➢ Bad: “update set col1 = 42”→“update set col2 = col1 + 5”

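
The two "Bad" examples can be checked mechanically. In this sketch a row is a plain dict and a write is a function from row to row; all names are illustrative, and the checks only test one starting row rather than proving the property in general.

```python
import itertools


def increment(row):
    """“update set col = col + 1”: not idempotent."""
    return {**row, "col": row["col"] + 1}


def set_to_42(row):
    """“update set col = 42”: idempotent."""
    return {**row, "col": 42}


def set_col1(row):
    """“update set col1 = 42”."""
    return {**row, "col1": 42}


def set_col2_from_col1(row):
    """“update set col2 = col1 + 5”: depends on what ran before it."""
    return {**row, "col2": row["col1"] + 5}


def is_idempotent(write, row):
    """Applying the write twice must equal applying it once."""
    return write(write(row)) == write(row)


def is_commutative(writes, row):
    """Every ordering of the writes must produce the same final row."""
    results = []
    for order in itertools.permutations(writes):
        state = row
        for write in order:
            state = write(state)
        results.append(state)
    return all(result == results[0] for result in results)
```

With a starting row of {"col1": 0, "col2": 0}, running set_col1 first leaves col2 at 47, while running it second leaves col2 at 5, so two nodes replaying the same writes in different orders would disagree.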
Twitter's Gizzard: Expanding the Fine Print
Cluster is Eventually Consistent:
➢ May return old values for reads.
➢ Unknown when consistency will occur.
Like a politician's position on the budget:
➢ Might be consistent in the future.
➢ Just not right now.
➢ Or now.

Twitter's Gizzard: Working Around the Shortcomings
Gizzard work-around:
➢ Add timestamp to every transaction.
➢ Good:
  ➢ “col1.ts=1; update set col1=42” →
  ➢ “update set col2=col1 + 5 where col1.ts=1”
➢ Implementation trickier if DBMS doesn't support column attributes.
Cannot escape: must radically re-think schema and application/DBMS interaction.

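
One common reading of the timestamp work-around is "newest timestamp wins per column." The sketch below illustrates that general idea only. It is not Gizzard's API, the function names are made up, and it assumes timestamps are unique per column (two different values with the same timestamp would still depend on arrival order).

```python
def apply_write(row, column, value, ts):
    """Apply a timestamped write; stale or replayed writes are ignored."""
    current = row.get(column)
    if current is not None and current[1] >= ts:
        return row  # an equal-or-newer write is already stored
    updated = dict(row)
    updated[column] = (value, ts)
    return updated


def replay(writes, row=None):
    """Apply (column, value, ts) writes in the order given."""
    state = dict(row or {})
    for column, value, ts in writes:
        state = apply_write(state, column, value, ts)
    return state
```

Because each column keeps only its newest (value, timestamp) pair, replaying a write is a no-op and the final row does not depend on arrival order. The price is storing a timestamp beside every value, which is the schema re-think the slide warns about.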
Twitter's Gizzard: Trying it Out
I'm convinced! How do I begin?
➢ Learn Scala.
➢ Clone “rowz” from Github.
➢ https://github.com/twitter/Rowz
➢ Modify it to suit your needs.
➢ Learn how it interacts with existing tools.
➢ Write new monitoring/alerting plugins.
➢ Write unit tests!
➢ You should OSS it to help with overhead.

Twitter's Gizzard: Trying it Out
Sounds daunting. Maybe I'll roll my own?
Learn from others' mistakes:
➢ Digg: 2 engineers, 6 months. Code thrown away. Digg out of business.
➢ Countless identical stories in Silicon Valley.
NIHS (Not-Invented-Here Syndrome) attitude == Go out of business*.
* 8-figure R&D budgets excepted.

YouTube's Vitess/Vtocc: What is it?
Vitess is a library. Vtocc is an implementation using it.
Vtocc is another middleware solution.
➢ Sharding,
➢ Caching,
➢ Connection-pooling,
➢ In use at YouTube,
➢ Built-in fail-safe features.

YouTube's Vtocc: Why use it?
Proven high-volume sharding solution.
Interesting feature-list:
➢ Auto query/transaction over-limit killing.
➢ Better query-cache implementation.
➢ Query comment-stripping for query cache.
➢ Query consolidation.
➢ Zero downtime restarts.
Less coding than Gizzard (more plug-in).

YouTube's Vtocc: Hold on, Zero Downtime Restarts?
Just start a new Vtocc instance.
➢ Instance1 passes new requests to Instance2,
➢ Instance1's connections get 30s to complete,
➢ Instance2 kills Instance1 and takes over.
[Diagram: Vtocc Instance 1 handing over to Vtocc Instance 2.]

YouTube's Vtocc: The Fine Print
Requires Particular Primary Keys:
➢ varbinary datatype,
➢ Choose carefully to prevent hot-spots.
Max result-set size: larger resultsets fail.
Additional administration burden:
➢ “My query was killed. Why?”
➢ Middleware adds spooky hard-to-diagnose failure modes.

YouTube's Vtocc: Implementation Details
➢ Run Vtocc on same server as MySQL.
➢ Configure Vtocc fail-safes for expected load:
  ➢ Pool Size (connection count),
  ➢ Max Transactions (has own connection pool),
  ➢ Query Timeout (before killed),
  ➢ Transaction Timeout (before killed),
  ➢ Max Resultset Size in rows.
    ➢ Go language doesn't free allocated memory, so pick this value carefully.
➢ More details: http://code.google.com/p/vitess/wiki/Operations

HAproxy: Re-thinking Proxy Topology
Old-school Proxy Topology:
➢ DB Clients on one side,
➢ DB Servers on the other,
➢ Proxy in-between.
[Diagram: a single proxy between clients and servers, labeled “Single Point of Failure.”]

HAproxy: Re-thinking Proxy Topology
Free proxy provides new architecture option:
➢ Proxy on every DB client node.
➢ Good-bye single-point-of-failure.
➢ Hello configuration management for proxy.
[Diagram: an HAproxy instance on each DB client node.]

Q&A
Questions? Suggestions:
➢ Interesting stuff. Got a job for me?
➢ Well, I've got a job for you. Interested?
➢ Warn me next time so I can sleep in the back row.
➢ Was that a question?
Thank you! Emails to domain palominodb, username time.
Percona Live 2012 in New York City. Enjoy the rest of the show!