Scaling MySQL in AWS
Presented by: Laine Campbell
April 3rd, 2014
Percona Live 2014 - Scaling MySQL in AWS

Laine Campbell, CEO of Blackbird, will explain the options for running MySQL at high volumes at Amazon Web Services, exploring options around database as a service, hosted instances/storage and all appropriate availability, performance and provisioning considerations using real-world examples from Call of Duty, Obama for America and many more. Laine will show how to build highly available, manageable and performant MySQL environments that scale in AWS: how to maintain them, grow them and deal with failure. Some of the specific topics covered are:

* Overview of RDS and EC2 – pros, cons and usage patterns/antipatterns.
* Implementation choices in both offerings: instance sizing, ephemeral SSDs, EBS, provisioned IOPS and advanced techniques (RAID, mixed storage environments, etc…)
* Leveraging regions and availability zones for availability, business continuity and disaster recovery.
* Scaling patterns including read/write splitting, read distribution, functional dataset partitioning and horizontal dataset partitioning (aka sharding)
* Common failure modes – AZ and Region failures, EBS corruption, EBS performance inconsistencies and more.
* Managing and mitigating cost with various instance and storage options

Agenda
1. Overview of options: RDS and EC2/MySQL
2. MySQL scaling patterns
3. Performance/Availability
4. Implementation choices
5. Common failure patterns

Who the *&%^#$ am I?
Laine Campbell
Co-Founder and CEO, Blackbird (formerly PalominoDB)
9 years building the DB team/infrastructure at Travelocity.
7 years at PalominoDB/Blackbird, supporting 50+ companies, 1000s of databases and way too much coffee.

AWS Options for MySQL: RDS and EC2/MySQL
A love story...

AWS Relational Database Service (RDS)
Basic Operations Managed
Ease of Deployment
Supports Scaling via Replication
Reliable via Replication, EBS RAID, Multi-AZ

Managed Operations
Backups and Recovery
Provisioning
Patching
Auto Failover
Replication

RDS Backup and Recovery
Storage is done via EBS
Snapshot and binlog based (point in time)
A non-Multi-AZ implementation creates spikes in latency during backups
Avoided in Multi-AZ via backups on the secondary
Snapshots only

Advanced Backup and Recovery
Non-RDS backups are created via mysqldump, mydumper, or custom extraction
You can create non-RDS replicas using a logical backup, in 5.6 only
Non-RDS replicas will break during AZ failovers, so they are not useful for production or for large datasets

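The non-RDS replica path above comes down to a couple of SQL statements. This is a minimal sketch, assuming a logical backup has already been restored and was taken at known master binlog coordinates. The host, user, password and coordinates are hypothetical placeholders. `mysql.rds_set_configuration('binlog retention hours', N)` is the RDS procedure for keeping binlogs around long enough to finish the restore.

```python
"""Sketch: SQL to attach a self-managed MySQL 5.6 replica to an RDS master.

All hosts, users, passwords and binlog coordinates used here are hypothetical.
"""


def sql_quote(value: str) -> str:
    """Return a MySQL string literal, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def rds_binlog_retention_sql(hours: int) -> str:
    """Run on the RDS master: keep binlogs long enough to finish dump + restore."""
    return f"CALL mysql.rds_set_configuration('binlog retention hours', {int(hours)});"


def external_replica_sql(host: str, user: str, password: str,
                         log_file: str, log_pos: int, port: int = 3306) -> list:
    """Run on the external replica after restoring the logical backup.

    log_file/log_pos are the master binlog coordinates the backup corresponds to.
    """
    change_master = (
        "CHANGE MASTER TO"
        f" MASTER_HOST={sql_quote(host)},"
        f" MASTER_PORT={int(port)},"
        f" MASTER_USER={sql_quote(user)},"
        f" MASTER_PASSWORD={sql_quote(password)},"
        f" MASTER_LOG_FILE={sql_quote(log_file)},"
        f" MASTER_LOG_POS={int(log_pos)};"
    )
    return [change_master, "START SLAVE;"]


if __name__ == "__main__":
    print(rds_binlog_retention_sql(24))
    for stmt in external_replica_sql("mydb.example.us-east-1.rds.amazonaws.com",
                                     "repl_user", "change-me",
                                     "mysql-bin-changelog.000123", 4567):
        print(stmt)
```

The coordinates are tied to the current master, so expect to rebuild such a replica after a Multi-AZ failover.
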
Disaster Recovery
Cross-region replication is supported in 5.6
Cross-region replication incurs cross-region data transfer costs
Relay replicas recommended if you wish to minimize expenses

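The relay-replica recommendation is mostly arithmetic. A rough sketch, assuming cross-region transfer is approximately the binlog volume shipped per replica (ignoring protocol overhead and compression). The binlog volume, replica count and per-GB price below are hypothetical inputs, not quoted AWS rates.

```python
"""Sketch: cross-region binlog transfer with and without a relay replica."""


def cross_region_gb_per_day(binlog_gb_per_day: float, remote_replicas: int,
                            use_relay: bool) -> float:
    """GB/day crossing regions for `remote_replicas` replicas in one remote region.

    Without a relay, every remote replica pulls the binlog stream from the
    master separately. With a relay, one replica in the remote region pulls it
    once and the other remote replicas chain off it inside that region.
    """
    if remote_replicas < 0:
        raise ValueError("remote_replicas must be >= 0")
    if remote_replicas == 0:
        return 0.0
    copies = 1 if use_relay else remote_replicas
    return binlog_gb_per_day * copies


def monthly_cost(gb_per_day: float, price_per_gb: float, days: int = 30) -> float:
    """Transfer cost over `days` days at a flat (hypothetical) per-GB price."""
    return gb_per_day * days * price_per_gb


if __name__ == "__main__":
    for relay in (False, True):
        gb = cross_region_gb_per_day(50.0, remote_replicas=3, use_relay=relay)
        print(f"relay={relay}: {gb:.0f} GB/day, ${monthly_cost(gb, 0.02):.2f}/month")
```

With 50 GB/day of binlogs and three remote replicas, the relay cuts cross-region traffic from 150 to 50 GB/day, at the cost of one extra replication hop of lag for the chained replicas.
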
Provisioning
Initial creation of single or multi-AZ masters
Single command replica creation (serialized) via snapshots; multi-AZ avoids a one-minute IO suspension.

Patching
Automatically managed in maintenance windows
Alerts sent for the coming week, so you can determine impact, reschedule, etc…
Multi-AZ mitigates impact of invasive maintenance

RDS Challenges (Opportunities?)
Abstraction from kernel, OS processlist, OS commands etc...
No SUPER access; management tasks move to stored procedures (minimal but annoying)
Log access becomes more challenging (but manageable)
The more experienced an operator you are, the grumpier you will be!

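As one example of the stored-procedure workaround: without SUPER you cannot `KILL` other users' threads, and RDS provides `mysql.rds_kill_query(thread_id)` (and `mysql.rds_kill` for whole connections) instead. A minimal sketch, assuming rows come from `information_schema.PROCESSLIST`; the protected-user list and the 300-second threshold are assumptions.

```python
"""Sketch: build RDS kill calls for long-running queries (no SUPER needed)."""
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessRow:
    id: int       # PROCESSLIST.ID
    user: str     # PROCESSLIST.USER
    command: str  # PROCESSLIST.COMMAND, e.g. "Query" or "Sleep"
    time: int     # PROCESSLIST.TIME, seconds in the current state


# Never touch the RDS management account or system/replication threads.
PROTECTED_USERS = frozenset({"rdsadmin", "system user"})


def long_query_kill_statements(rows, max_seconds: int) -> list:
    """Return CALL statements for user queries running longer than max_seconds."""
    statements = []
    for row in rows:
        if row.user in PROTECTED_USERS:
            continue
        if row.command == "Query" and row.time > max_seconds:
            statements.append(f"CALL mysql.rds_kill_query({int(row.id)});")
    return statements


if __name__ == "__main__":
    sample = [
        ProcessRow(11, "app", "Query", 900),
        ProcessRow(12, "app", "Sleep", 5000),
        ProcessRow(13, "rdsadmin", "Query", 4000),
        ProcessRow(14, "app", "Query", 3),
    ]
    print(long_query_kill_statements(sample, max_seconds=300))
```
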
RDS Challenges (Opportunities?)
Snapshot backups not portable/accessible outside of RDS
Multi-AZ failover can strand replicas when relaxing binlog consistency for performance (sync_binlog=0).
Without the ability to manually CHANGE MASTER, one must rebuild all replicas after a failover.

RDS Visibility Impacts
Agent-based instrumentation that requires localhost installation won’t work
No access to tcpdump/port listening, sar, processlist for swapping, vmstat, iostat, etc...
Log forensics become harder but manageable (logs must be downloaded first)

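Downloading logs before forensics can be scripted. A minimal sketch using the RDS `DownloadDBLogFilePortion` API via boto3's `download_db_log_file_portion`, following `Marker` until `AdditionalDataPending` is false. The client is passed in so the loop can be exercised with a stub; the instance identifier in the usage comment is hypothetical.

```python
"""Sketch: pull a complete RDS log file for offline forensics."""


def download_rds_log(client, instance_id: str, log_file: str) -> str:
    """Return the full contents of `log_file` from RDS instance `instance_id`.

    `client` is expected to be boto3.client("rds"); it is a parameter so the
    function can be exercised with a stub and no AWS credentials.
    """
    marker = "0"  # "0" starts from the beginning of the log file
    chunks = []
    while True:
        response = client.download_db_log_file_portion(
            DBInstanceIdentifier=instance_id,
            LogFileName=log_file,
            Marker=marker,
        )
        chunks.append(response.get("LogFileData") or "")
        if not response.get("AdditionalDataPending"):
            break
        marker = response["Marker"]
    return "".join(chunks)


# Usage (requires boto3 and AWS credentials; instance name is hypothetical):
#   import boto3
#   text = download_rds_log(boto3.client("rds"), "my-db-instance",
#                           "slowquery/mysql-slowquery.log")
```

Once local, the file can go through the usual slow-log tooling.
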
EC2 and MySQL
All the MySQL you’ve come to love and hate
Any topologies you can dream
Access to many more types of instances and storage

Why RDS or EC2?
You can’t run 5.6, and you can’t tolerate the risk of a single region (~99.65% SLA per month)? Use EC2
You don’t have the operational expertise to manage backups, provisioning and replication? Use RDS
Pro tip: if you can’t manage a system, how can you troubleshoot advanced performance issues given the visibility limits of RDS?

Why RDS or EC2?
Want MariaDB or XtraDB? Use EC2
Large datasets generally require file-level backups and portability? Use EC2
Pro tip: if you can’t get a mysqldump or a parallel dump to load/export in a timely fashion, you probably don’t want RDS

Scaling Patterns for MySQL in AWS

Scaling in RDS - Vertical
RAM up to 244 GB per instance, creating excellent ability to put large datasets in RAM
Network performance up to 10 Gbps
CPU up to 32 cores
Provisioned IOPS are game changers, and mandatory for production, performance-sensitive applications.

Scaling in RDS - Provisioned IOPS
1,000 - 30,000 IOPS
100 GB to 3 TB
Stable, predictable IO
Realizing Max IOPS - 20,000
● cr1.8xlarge Instance Type
● MySQL 16 KB Page Size
● Full Duplex IO Channel
● 50% reads, 50% writes

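A worked example of what those conditions mean in bytes. This sketch assumes each IO moves one 16 KiB InnoDB page and that a full-duplex channel serves reads and writes in parallel, so a 50/50 mix uses both directions evenly. These are modelling assumptions, not AWS-published formulas.

```python
"""Sketch: throughput implied by an IOPS figure and a per-IO size."""

KIB = 1024
MIB = 1024 * 1024


def throughput_mib_per_s(iops: float, io_size_kib: float = 16) -> float:
    """Bytes moved per second at a given IOPS and IO size, expressed in MiB/s."""
    return iops * io_size_kib * KIB / MIB


def split_reads_writes(total_iops: float, read_fraction: float = 0.5) -> tuple:
    """Split a total IOPS figure into (read IOPS, write IOPS)."""
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    reads = total_iops * read_fraction
    return reads, total_iops - reads


if __name__ == "__main__":
    reads, writes = split_reads_writes(20_000)
    print(f"{reads:.0f} read + {writes:.0f} write IOPS")
    print(f"{throughput_mib_per_s(20_000):.1f} MiB/s total at 16 KiB per IO")
```

Under these assumptions, 20,000 IOPS of 16 KiB pages is 312.5 MiB/s, split into 10,000 reads and 10,000 writes per second.
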
Scaling in RDS - Provisioned IOPS
Overprovisioning beyond the realized IOPS can reduce latency
● In an unbalanced workload, for instance when reads consume the channel limits
● Write channel bandwidth remains unsaturated
● By doubling IOPS, you increase concurrency, thus reducing latency; transaction rates increase
● Consumption of IOPS can drop as transaction rates increase, manifesting as:
  ○ Improved use of group commit
  ○ Larger log writes

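One way to reason about "more IOPS, less latency" is Little's law, L = λW: the average number of IOs in flight (L) equals the completion rate (λ, the IOPS) times the average time each IO spends queued plus in service (W). Applying it to an EBS volume is a simplifying assumption in this sketch: it ignores burstiness and assumes the volume actually sustains the provisioned rate. The queue depth is a hypothetical figure.

```python
"""Sketch: average IO latency from Little's law, W = L / lambda."""


def avg_io_latency_ms(ios_in_flight: float, iops: float) -> float:
    """W = L / lambda, converted from seconds to milliseconds."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return ios_in_flight / iops * 1000.0


if __name__ == "__main__":
    in_flight = 32  # hypothetical average number of outstanding IOs
    for provisioned in (10_000, 20_000):
        latency = avg_io_latency_ms(in_flight, provisioned)
        print(f"{provisioned} IOPS -> {latency:.2f} ms per IO")
```

With 32 IOs in flight, going from 10,000 to 20,000 IOPS takes the average from 3.2 ms to 1.6 ms per IO.
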
Scaling in RDS - Reads
Native replication allows for scale-out of reads, just as in EC2 or your own datacenter
RAM up to 244 GB per instance, creating much better ability to put large datasets in RAM
5.6 allows for the InnoDB memcached plugin

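Read scale-out only helps if the application sends reads to the replicas. A naive read/write splitting sketch with hypothetical endpoint names. Classifying statements by leading keyword is a simplification: real routers also handle replication lag, session state and multi-statement transactions.

```python
"""Sketch: naive read/write splitting across a master and read replicas."""
import itertools


class ReadWriteRouter:
    READ_PREFIXES = ("SELECT", "SHOW")
    LOCKING_READS = ("FOR UPDATE", "LOCK IN SHARE MODE")

    def __init__(self, master: str, replicas: list):
        self.master = master
        # Round-robin over replicas; None means "send everything to the master".
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str, in_transaction: bool = False) -> str:
        stmt = sql.lstrip().upper()
        is_plain_read = stmt.startswith(self.READ_PREFIXES) and not any(
            clause in stmt for clause in self.LOCKING_READS
        )
        # Writes, locking reads and anything inside a transaction go to the master.
        if is_plain_read and not in_transaction and self._replicas is not None:
            return next(self._replicas)
        return self.master


if __name__ == "__main__":
    router = ReadWriteRouter("master.example", ["replica-a.example", "replica-b.example"])
    for q in ("SELECT * FROM orders WHERE id = 1",
              "UPDATE orders SET status = 'paid' WHERE id = 1",
              "select count(*) from orders",
              "SELECT * FROM orders WHERE id = 1 FOR UPDATE"):
        print(router.route(q), "<-", q)
```

Reads routed this way may be stale by the replica's lag, so read-your-own-writes paths usually stay on the master.
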
Scaling in RDS - Writes
Like any system, you must split workloads if writes consume max capacity of PIOPS.
● Functional Partitioning
● Sharding

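The two write-splitting options can be combined in a router: move whole workloads to their own instance first (functional partitioning), then spread a single workload's rows across instances (sharding). Every domain, endpoint and shard count here is hypothetical. The sketch uses a stable MD5-based hash rather than Python's built-in `hash()`, which is randomized per process for strings.

```python
"""Sketch: functional partitioning first, then hash-based sharding."""
import hashlib

# Functional partitioning: whole workloads live on their own instance.
FUNCTIONAL_PARTITIONS = {
    "billing": "billing-db.example",
    "analytics": "analytics-db.example",
}

# Sharding: one workload's rows are spread across several instances.
USER_SHARDS = [f"users-shard-{i}.example" for i in range(4)]


def shard_for(key, shards: list) -> str:
    """Map a shard key to one shard deterministically."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


def route_write(domain: str, shard_key=None) -> str:
    """Pick the instance that should receive a write for `domain`."""
    if domain in FUNCTIONAL_PARTITIONS:
        return FUNCTIONAL_PARTITIONS[domain]
    if shard_key is None:
        raise ValueError(f"domain {domain!r} is sharded; a shard_key is required")
    return shard_for(shard_key, USER_SHARDS)


if __name__ == "__main__":
    print(route_write("billing"))
    print(route_write("users", shard_key=42))
```

Plain modulo hashing remaps most keys when the shard count changes. A lookup directory (key range to instance) makes rolling shards up and down easier, at the cost of maintaining that metadata.
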
Scaling in RDS - Concerns
Sharding:
● Managing RDS instances to roll shards up and down can be a new paradigm.
● Overall, this can be done, but it does require a logical shift.
Resource Constraints:
● No access to SSDs (up to 91,250 read or 78,750 write IOPS of 16 KB size)
Data Movements:
● Lack of access to data copies outside of replica builds can dramatically increase data movement time

Scaling in EC2 - Vertical
Higher variety of instances. Similar top-level constraints of:
● RAM
● CPU
● PIOPS
● Network
Ephemeral SSD storage creates a whole new class of IO performance (up to 91,250 read or 78,750 write IOPS of 16 KB size)

Scaling in EC2 - Reads
In addition to standard MySQL replication, you have new options:
● Galera, MariaDB/Galera and XtraDB Cluster
● Tungsten Replicator and Cluster

Scaling in EC2 - Writes
Sharding is still necessary, but on EC2, unlike RDS, one has direct access to snapshots:
● Management of large datasets becomes much easier
● Shard management functions in more typical paradigms

Scaling in EC2 - Concerns
SSD and Ephemeral Storage
● Instances become even more volatile
● Backups via EBS snapshot are impossible, requiring LVM snapshots or similar
● One might consider keeping writes on PIOPS (20,000 max) and leveraging SSD for reads

Availability for MySQL in AWS

AWS Availability: Regions and Zones

AWS Availability: Regions and Zones
Amazon Regions equate to data centers in different geographical regions.
Availability zones are isolated from one another in the same region to minimize the impact of failures.

AWS Availability: Regions and Zones
Amazon states AZs do not share:
• Cooling
• Network
• Security
• Generators
• Facilities

AWS Availability: Regions and Zones
Apr 2011 - US East Region EBS Failed
● Incorrect network failover.
● Saturated intra-node communications.
● Cascading failures impacted EBS in all AZs.
Jul 2012 - US East Partial Impact
● Electrical storms impacted multiple sites.
● Failover of metadata DB took too long.
● EBS I/O was frozen to minimize corruption.

AWS Availability: Regions and Zones
99.95% Monthly SLA for a region (multiple AZs)
● Implies multiple AZ is mandatory
● Implies multi-region is necessary for 99.99% or higher

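The SLA figures translate into monthly downtime budgets. This sketch assumes a 30-day month. The multi-region calculation assumes regions fail independently, which is optimistic given the 2011 cascading EBS failure listed above.

```python
"""Sketch: monthly downtime allowed by an SLA, and the effect of a second region."""

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def downtime_budget_minutes(availability: float,
                            minutes: int = MINUTES_PER_MONTH) -> float:
    """Minutes per month the service may be down while still meeting the SLA."""
    return (1.0 - availability) * minutes


def combined_availability(*availabilities: float) -> float:
    """Probability that at least one of several independent sites is up."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


if __name__ == "__main__":
    for sla in (0.9995, 0.9999):
        print(f"{sla:.2%}: {downtime_budget_minutes(sla):.2f} min/month")
    print(f"two regions: {combined_availability(0.9995, 0.9995):.8f}")
```

99.95% leaves about 21.6 minutes per 30-day month and 99.99% only about 4.3, which is why hitting 99.99% implies going multi-region.
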
Availability in RDS - Multi-AZ
The core of an HA solution
Block-level replication, active/passive
Saves you from most master crashes
Reduces impact of backups, upgrades, and locks for provisioning replicas
When not on 5.6, and using sync_binlog != 1, you often lose replicas during failover

Availability in RDS - Multi-AZ
IO impact from replication
You do not get to choose the failover AZ, meaning you must be ready to move app servers

Availability in RDS - Replicas
Redundant replicas make total sense. N+1 meets most needs, given the ease of provisioning
You must have replicas in every AZ you have app servers in (if using replicas for reads)
AWS states cross-AZ latency is in the low single-digit milliseconds. Real-world experience shows occasional much larger spikes

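Keeping replicas in every app-server AZ pays off only if clients prefer the local one. A minimal sketch with hypothetical endpoints and AZ names; health is assumed to come from your own checks. Falling back to another AZ keeps reads working but pays the cross-AZ latency and occasional spikes noted above.

```python
"""Sketch: prefer a read replica in the application server's own AZ."""
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    endpoint: str
    az: str
    healthy: bool = True


def pick_replica(replicas, app_az: str, chooser=random.choice):
    """Return a healthy same-AZ replica if one exists, else any healthy one, else None."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        return None  # caller decides: fall back to the master or fail the read
    same_az = [r for r in healthy if r.az == app_az]
    return chooser(same_az or healthy)


if __name__ == "__main__":
    fleet = [
        Replica("replica-1a.example", "us-east-1a"),
        Replica("replica-1b.example", "us-east-1b", healthy=False),
        Replica("replica-1c.example", "us-east-1c"),
    ]
    print(pick_replica(fleet, "us-east-1a").endpoint)  # the same-AZ replica
    print(pick_replica(fleet, "us-east-1b").endpoint)  # 1b is down: cross-AZ fallback
```
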
Availability in EC2 - Options
You can use Galera, XtraDB Cluster, or similar for a read/write-anywhere solution
MySQL MHA can be used to do failovers
Continuent’s Tungsten product can also manage failovers

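A core step in any replica-promotion failover is choosing the most up-to-date replica. This is a simplified illustration of that one step, not MHA's or Tungsten's actual algorithm; MHA, for example, also applies missing relay-log events to the other replicas. It assumes all replicas replicated from the same master and that binlog file names end in a numeric sequence. The host names are hypothetical.

```python
"""Sketch: pick the most up-to-date replica to promote after a master failure."""
from typing import NamedTuple


class ReplicaStatus(NamedTuple):
    host: str
    relay_master_log_file: str  # SHOW SLAVE STATUS: Relay_Master_Log_File
    exec_master_log_pos: int    # SHOW SLAVE STATUS: Exec_Master_Log_Pos


def executed_coordinates(status: ReplicaStatus) -> tuple:
    """Master coordinates executed so far, e.g. 'mysql-bin.000123' @ 4 -> (123, 4)."""
    sequence = int(status.relay_master_log_file.rsplit(".", 1)[1])
    return sequence, status.exec_master_log_pos


def promotion_candidate(statuses):
    """Return the replica that has executed the furthest into the master's binlog."""
    if not statuses:
        raise ValueError("no replicas to promote")
    return max(statuses, key=executed_coordinates)


if __name__ == "__main__":
    replicas = [
        ReplicaStatus("db-a", "mysql-bin.000009", 98_000),
        ReplicaStatus("db-b", "mysql-bin.000010", 4),
        ReplicaStatus("db-c", "mysql-bin.000010", 1_200),
    ]
    print(promotion_candidate(replicas).host)  # db-c
```

Comparing the file sequence numerically, then the position, avoids treating a large offset in an older binlog file as "newer".
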
AWS Benefits: Dynamicity

AWS Availability: Regions and Zones
| Type of Change             | EC2                       | RDS Master (Non Multi-AZ) | RDS Master (Multi-AZ)    | RDS Replica                                      |
|----------------------------|---------------------------|---------------------------|--------------------------|--------------------------------------------------|
| Instance resize up/down    | Rolling migrations        | Moderate downtime         | Minimal downtime         | Moderate downtime (take out of service)          |
| EBS <-> PIOPS              | Severe performance impact | Severe performance impact | Minor performance impact | Severe performance impact (take out of service)  |
| PIOPS amount change        | Minor performance impact  | Minor performance impact  | Minor performance impact | Performance impact (take out of service)         |
| Disk space change (add)    | Performance impact        | Performance impact        | Minor performance impact | Performance impact (take out of service)         |
| Disk space change (reduce) | Rolling migrations        | Moderate downtime         | Moderate downtime        | Moderate downtime (take out of service)          |

AWS Failure Scenarios

Predicting and Managing Failure
Operations is about managing change and mitigating risk

Predicting and Managing Failure
Local Failures
• Database crashes
• Human error
  ○ Misconfigure
  ○ Write to a replica
  ○ Drop a table/database/career
• Localized EBS hangs and corruption
• Unacceptable/unpredictable performance

Predicting and Managing Failure
Local Failures
● When it goes bad, don’t waste time diagnosing.
  ○ Shoot it in the head!
● Plan!
  ○ Simulate availability and region level failures
  ○ Wipe storage, reduce IOPS, shut down
  ○ Chaos monkey is your friend
● Observe!
  ○ Monitor for early failures, predict

Predicting and Managing Failure
Mitigation
In RDS:
● Use Multi-AZ
● Use replicas in multiple AZs
● Replicate to multiple regions, and out of AWS
In EC2:
● Use a failover (Galera, Tungsten, MHA/HAProxy)
● Use multiple AZs and regions
● Frequent backups (practicing restores)
