Managing MySQL at scale requires the ability to plan confidently for the future while remaining flexible and responsive to the dynamic needs of the present. Quickly increasing performance, deploying additional read slaves, refreshing Dev/Test, QA, and business copies of databases, and improving backup and restore times are critical capabilities in the fast-paced world of DevOps. In this session, you will learn how to avoid over-provisioning storage to improve performance, reduce replication slave creation times from hours to seconds, significantly shrink backup windows, and slash restore times, all while maintaining the ability to scale storage resources without downtime or performance impact.
1. Hello
Managing MySQL Scale Through Consolidation
Percona Live 04/15/15
Chris Merz, @merzdba
DB Systems Architect, SolidFire
2. Enterprise Scale MySQL Challenges
• Many MySQL instances (10s-100s-1000s)
• Often 100s of GB or multi-TB range
• Capacity planning and resource management
• Quickly respond to changing requirements
• Ability to quickly scale in real-time
3. Confidently Planning for the Future
• Understand application data storage profile
• Predict growth trajectory for MySQL platform
• Capacity planning and resource management
– Compute Resources
– Memory Allocation
– Storage Resources
• Growth and scaling plan for application
• Public cloud TCO tipping-point plan
4. Planning for the Future: Compute
• Compute Resources
– What is contributing to total CPU consumption?
(query processing, i/o wait, virtualized steal?, etc)
– Is CPU utilization truly driven from mysqld
churning and processing data (rather than wait,
steal, network wait, etc)?
– What is the CPU growth rate for your systems?
5. Planning for the Future: Memory
• Memory Resources
– Memory bound?
– ‘Hot set’ of data fit in memory?
– Are queries optimized to only pull required
data?
– Indexes? Over indexed? Under indexed?
– When is the next RAM increase required?
6. Planning for the Future: Storage
• Storage Resources
– Disk resource heavy?
– IOPS bound?
– Max IOPS ceiling for disk configuration?
– Disk latencies within acceptable tolerance?
– Sequential reads, large chunk processing?
– More random in nature? Disk heads thrashing?
– Is flash memory required?
– Disk usage consumption rate?
– When is the next storage addition required?
– How will you add that capacity?
– Extend existing filesystem natively? (xfs_growfs)
– Filesystem-per-database strategy (symlinks)?
7. Planning for the Future: Topology
• MySQL topology scaling plan
– Master cluster servers (Percona Cluster, Galera)
– Master shard servers (homegrown, ClustrixDB, etc)
– Slave servers (read-heavy environments)
– DevTest instances that require prod db copies
– Reporting and Data Warehouse instances
– Public vs Private Cloud
– Orchestration, OpenStack, DBaaS: Trove
8. Instrument and Gather
• System Performance Data is Essential
– Quantify change rate over time
– Monitor every layer, from app to storage
– Zabbix, Zenoss, Munin, Cacti, Nagios, Graphite
– Critical to real-time troubleshooting
“In God we trust; all others must bring data.”
9. Larger Data Sets == Larger Challenges
• Automation increasingly important
• Leverage software defined infrastructure
• Quickly react to increase performance
• Deploy additional slaves for read scaling
• Refresh DevTest, QA, Business copies
• Improve Backup and Restore times
10. Leverage Point: Pivot on the Storage Layer
• Invert the Dominant Paradigm
• Data Gravity and Storage Jiu-jitsu
• Move the platform around the data
• Modern shared storage capabilities
– Snap/clone, writable snapshots
– De-duplication, QoS allocation, Scale-out
• Efficient use of storage resources
11. Shared Storage: MySQL Ops Secret Weapon
• Real-time resource allocation
– Extend volume capacity
– Designate Min/Max/Burst IOPS
• Dev/Test secondary copies
– Deployment for growing teams
– Refreshes for faster iterations
• Replication slave creation
– MySQL read arrays
– HA warm standby copies
• Decrease/eliminate backup windows
• Accelerate restore scenarios
12. Real-time resource allocation
• Modern storage virtualizes resources
• Allows for dynamic allocation
• Increase capacity on the fly
• Change QoS settings (Min/Max/Burst IOPS)
• Scale ‘up’ instantly without lead time
• Scale out – horizontal storage growth
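On a QoS-capable array, changing a volume's Min/Max/Burst IOPS allocation is a metadata-only update: no restart, no data movement. A minimal sketch of building such a request; the function and field names here are illustrative assumptions (they mirror the shape of SolidFire's ModifyVolume call, but this is not a verified client):

```python
def build_qos_request(volume_id, min_iops, max_iops, burst_iops):
    """Validate and build a Min/Max/Burst QoS payload for one volume."""
    if not (min_iops <= max_iops <= burst_iops):
        raise ValueError("QoS tiers must satisfy min <= max <= burst")
    return {
        "volumeID": volume_id,
        # minIOPS: guaranteed floor; maxIOPS: sustained ceiling;
        # burstIOPS: short-term ceiling drawn from accrued credits
        "qos": {"minIOPS": min_iops, "maxIOPS": max_iops, "burstIOPS": burst_iops},
    }

# Dial a production volume up ahead of a heavy batch window:
print(build_qos_request(42, 1000, 5000, 10000))
```

Because the change is applied at the storage layer, mysqld sees the extra throughput immediately.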
13. Deploy Secondary Copies in Seconds
• Snapshot the prod MySQL storage volume
• Create a volume clone from the snapshot
• Attach the cloned volume to a MySQL VM
• service mysqld start
• For a 1TB dataset: ~6 hrs -> ~90 seconds
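The four steps above can be scripted end to end. A hedged sketch that returns the ordered actions as a plan: the `storage ...` verbs are placeholders for whatever snapshot/clone API your array exposes (they are not a real CLI), while the mount and service commands are standard Linux:

```python
def secondary_copy_plan(prod_volume, clone_name, datadir="/var/lib/mysql"):
    """Return the ordered actions to stand up a secondary MySQL copy."""
    return [
        f"storage snapshot create {prod_volume}",              # instant, metadata-only
        f"storage clone create {prod_volume}-snap {clone_name}",
        f"storage attach {clone_name} $(hostname)",            # present to the target VM
        f"mount /dev/disk/by-label/{clone_name} {datadir}",    # no data copied
        "service mysqld start",
    ]

plan = secondary_copy_plan("mysql-prod-01", "mysql-dev-01")
```

Nothing in the plan copies data over the network, which is why the wall-clock time collapses from hours to seconds.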
14. Deploy Replication Slaves in Seconds
• Flush the target master, SHOW MASTER STATUS
• Snapshot the prod MySQL storage volume
• Create a volume clone from the snapshot
• Mount the cloned volume to a MySQL instance
• Increment server_id; remove auto.cnf
• Script in the CHANGE MASTER config
• service mysqld start
• start slave
• For a 1TB dataset: ~9+ hrs -> ~100 seconds
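The CHANGE MASTER step is easy to script from the coordinates captured by SHOW MASTER STATUS on the master before the snapshot was taken. A minimal sketch (host and replication-user names are placeholders):

```python
def change_master_sql(host, log_file, log_pos, user="repl", port=3306):
    """Render the CHANGE MASTER TO statement for a cloned slave."""
    return (
        "CHANGE MASTER TO "
        f"MASTER_HOST='{host}', "
        f"MASTER_PORT={port}, "
        f"MASTER_USER='{user}', "
        f"MASTER_LOG_FILE='{log_file}', "
        f"MASTER_LOG_POS={log_pos};"
    )

# Coordinates captured from SHOW MASTER STATUS, pre-snapshot:
print(change_master_sql("db-master01", "mysql-bin.000042", 107))
```

Feed the rendered statement to the clone (along with its unique server_id and a fresh auto.cnf) before START SLAVE.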
15. Decrease or Eliminate Backup Windows
• Leverage instant snapshots
• Creates crash-consistent backups
• Zero impact to production performance
• Multiple volumes? Group snapshot
• Suitable for many use cases
• Efficient storage utilization (metadata only)
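A crash-consistent snapshot needs no coordination with MySQL at all, but wrapping the snapshot in a brief FLUSH TABLES WITH READ LOCK upgrades it to an application-consistent backup. A sketch of the ordering; `take_group_snapshot` is a hypothetical stand-in for your array's group-snapshot call, while the SQL statements are standard MySQL:

```python
def consistent_snapshot_steps(volumes):
    """Ordered steps for an application-consistent group snapshot."""
    return [
        "FLUSH TABLES WITH READ LOCK;",                # briefly quiesce writes
        f"take_group_snapshot({', '.join(volumes)})",  # metadata-only, near-instant
        "UNLOCK TABLES;",                              # lock held only for seconds
    ]

steps = consistent_snapshot_steps(["mysql-data", "mysql-logs"])
```

Because the snapshot itself completes in seconds, the lock window (and therefore the "backup window") effectively disappears.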
17. Managing Scale Through Consolidation
• Enterprise/Web Scale MySQL: new set of challenges
• Understanding growth patterns is essential
• Virtualization and Orchestration ecosystems
• Data Gravity requires a capable storage layer
• Pivot on storage to avoid data transfer
When you’re using MySQL in production, at scale in the enterprise, you’re usually not dealing with 100MB databases (unless you’re a service provider of some sort, perhaps a SaaS such as WordPress). When datasets get large, into the 100s of GB or multi-TB range, you end up managing a new class of challenges, which broadly include 1) planning for the future (scaling your MySQL platform resources over time), and 2) quickly responding to the needs of the present (scaling and re-allocating your system resources in real-time).
Requires understanding your application data storage profile,
Being able to predict, as accurately as possible, the growth trajectory for your MySQL systems
This requires knowing your resource consumption profiles for Compute, Memory, Storage
* Compute resources
* What is contributing to CPU consumption? (query processing, i/o wait, virtualized steal?, etc)
* Is CPU consumption truly from mysqld churning and processing data (rather than wait, steal, network wait, etc)?
* What is the CPU growth rate for your systems?
Are you memory bound?
Does the hotset of data fit in memory?
Are your application queries optimized to accurately target the required information, rather than pulling unnecessary data into memory? (select * vs select 1)
How about indexes? Are you over indexed? Under indexed?
Given data growth rate, when is the next RAM increase required to avoid serious performance degradation?
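One quick signal for the "does the hot set fit in memory?" question is the InnoDB buffer pool miss rate, computed from the Innodb_buffer_pool_read_requests and Innodb_buffer_pool_reads counters in SHOW GLOBAL STATUS. A minimal sketch of the arithmetic:

```python
def buffer_pool_miss_rate(read_requests, disk_reads):
    """Fraction of logical buffer pool reads that had to hit disk."""
    if read_requests == 0:
        return 0.0
    return disk_reads / read_requests

# e.g. 5M logical reads with 50k physical reads -> 1% miss rate
print(buffer_pool_miss_rate(5_000_000, 50_000))  # -> 0.01
```

Track this ratio over time: a steadily climbing miss rate is an early, quantifiable warning that the next RAM increase is approaching.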
Are you disk resource heavy?
Are you IOPS bound?
What is the max IOPS ceiling for your disk configuration?
Are disk latencies within acceptable range for the required query and application response time?
Does your application lean on sequential reads, such as range queries or large chunk processing?
Is the query profile more random in nature? If so, are your disk heads thrashing (if on spinning disk)?
Is flash memory indicated for your dataset and application needs?
What is your disk usage consumption rate?
When is the next storage resource addition required?
How will you add that capacity? Are you going to extend the existing filesystem via LVM?
Can you extend the existing filesystem natively? (xfs_growfs)
Do you use a filesystem-per-database strategy (symlinks)?
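For the LVM-plus-XFS path mentioned above, growing the data filesystem online is typically two commands: extend the logical volume, then grow XFS into the new space (xfs_growfs operates on a mounted filesystem, so no unmount is needed). A sketch that builds the command pair; the device and mount paths are examples, not your layout:

```python
def grow_commands(lv="/dev/vg_mysql/lv_data",
                  mount="/var/lib/mysql", add_gb=100):
    """Build the online-growth command pair for an LVM + XFS data volume."""
    return [
        f"lvextend -L +{add_gb}G {lv}",  # add space to the logical volume
        f"xfs_growfs {mount}",           # grow XFS into it, online
    ]

for cmd in grow_commands():
    print(cmd)
```

With a modern array underneath, the backing volume itself can also be extended on the fly before these two steps.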
At a higher level, do you have a growth-rate and topology scaling plan for:
* Master cluster servers (via technologies such as Percona Cluster or Galera)
* Master shard servers (via homegrown or MySQL extension technologies such as ClustrixDB)
* Slave servers (used in read-heavy environments)
* DevTest instances that require copies of production datasets
* Reporting instances for consolidated warehouse querying by various departments
* Even if you’re running in the public cloud, such as AWS, there comes a point in the growth curve where you have to weigh the cost of staying there against bringing servers back in-house and building your own private cloud to control costs. True dedicated performance in the public cloud understandably costs considerably more than oversubscribed, best-effort shared resource pools (example: pIOPS database storage volumes).
All of these things require you to understand your systems current resource consumption profiles, and be able to quantify the change rate over time, and in relation to other growth factors, such as application end-users (for every million users of our platform we add, we’ll require X additional resources for the MySQL layer), or in relation to internal company growth (for every 10 developers we add, we’ll require Y additional MySQL resource pools).
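Once you can quantify the change rate, the "when do we act?" question becomes simple arithmetic. A sketch of a storage-runway calculation, assuming linear growth and an 80% act-by threshold (both assumptions you should tune to your own growth curve):

```python
def days_until_threshold(used_gb, capacity_gb, growth_gb_per_day,
                         threshold=0.8):
    """Days until usage crosses the act-by threshold (default 80% full)."""
    budget_gb = capacity_gb * threshold - used_gb
    if growth_gb_per_day <= 0:
        return float("inf")  # flat or shrinking: no deadline
    return max(budget_gb, 0) / growth_gb_per_day

# 600 GB used of a 1 TB volume, growing 2 GB/day:
# plan the next storage addition within 100 days
print(days_until_threshold(600, 1000, 2))  # -> 100.0
```

The same shape of calculation works for RAM and IOPS headroom; the inputs come straight from the monitoring data described below.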
Gathering and accessing this data absolutely requires the ability to monitor every layer of your platform stack, and there are a great number of monitoring tools that you can use to instrument your environment in order to collect, graph, and analyze this data over time. Nagios + Graphite, Zabbix, Zenoss, Munin, Cacti, and many more. The most important point is that you capture this data over time and reserve time to analyze and understand growth trends.
These tools also are invaluable for real-time troubleshooting, allowing you to zero in and better ascertain and pinpoint root cause during incidents. Incident data can also be used to set up predictive and preventative alerting to avoid outages before they happen.
This type of tooling, data, and insight, in turn, enables one to confidently plan and to be responsive to the needs of the present, but as datasets get larger, so do the operational challenges.
DBAs and DevOps teams are faced with a specific class of problems when MySQL datasets move toward the TB and multi-TB range, or when the scale of various copies of the system exceeds the cost-benefit of manual operations. Automation becomes an increasingly important dimension of the MySQL landscape. When very large databases (VLDBs) are in play, the time it takes to copy or manipulate the corresponding data volumes becomes a major operational hindrance, and a whole host of strategies and tactics need to be employed to reduce or eliminate performance impact and downtime that would otherwise be required during maintenance operations. Operations staff touch-time requirements can also be minimized with various tools and techniques.
* Quickly respond to requirements to increase performance
* In virtualized environments, CPU and RAM allocations can be added quickly (though a restart is likely required)
* Storage capacity can be added, or storage allocations can be reconfigured to provide additional throughput
* In advanced storage systems, you can even dial up the IOPS allocations for a MySQL data volume on the fly, allowing more throughput to the database in real time, no restart required
* Deploy additional read-slaves for copies quickly
* Refresh Dev/Test, QA, and business copies of databases
* Improve backup and restore times
As your enterprise grows, your data gravity increases, along with additional customers, end-users, and organizational growth. This increased data mass can mean decreasing agility and drag on your business in the form of increased MySQL platform management complexity, longer response times for provisioning database systems, and delays in refreshing business reporting and data warehouse systems, among other challenges.
As compute platforms get more flexible, in both public and private clouds, it becomes increasingly possible to turn the traditional model of database platform management on its head. Rather than seeing the data layer as the child or property of the compute layer, make the compute layer an attribute of the dataset. With data gravity and ‘stickiness’ being the operational bottleneck, you can pivot on your storage layer: the capabilities of modern shared storage systems, combined with virtualization, allow you to flip that model. I refer to this technique as Storage Jiu-jitsu. The capabilities include instant snapshots and quick volume cloning (writable snapshots); data de-duplication for very efficient use of high-performance shared storage platforms; and QoS (Quality of Service) allocation to segregate performance resources between storage volumes, allowing you to put production MySQL database systems on the same storage array as DevTest and other secondary systems without fear of resource conflict (commonly known as the noisy-neighbor effect). Scale-out architectures are also a great development in storage, allowing for seamless storage growth without the need for huge migrations and massive capital poured into system refreshes. These systems are all designed to make efficient use of storage resources and avoid the need to over-provision. The key is to avoid using the network to copy and transfer data, and instead lean on the native capabilities of the underlying storage layer.
Let’s take a look at some examples of how this can be done in practice. Here are some of the operational models we use at SolidFire for managing our Percona Server MySQL platforms, completely reimagined and retooled to leverage next-generation storage capabilities:
- Dev/test secondary copies
- Replication slave creation
- Backup windows
- Restore scenarios