RDS for MySQL, No BS Operations and Patterns

Amazon's RDS for MySQL is a wonderful tool with significant value. It can also create a lot of havoc if you are not aware of its limitations and changes before you make it a core part of your environment. In this deck, we discuss those issues.

Published in: Technology

  1. Laine Campbell, CEO, PalominoDB. RDS for MySQL: No BS Operations and Patterns.
  2. The Party Line: Relational Database Service. Fully managed. Simple to deploy. Easy to scale. Reliable. Cost effective.
  3. Fully Managed: Ignore the man behind the curtain. Backups, provisioning, patching, performance management, failover, replication.
  4. Fully Managed: Backups. Snapshot based, same as EBS. Snapshots cause spikes in latency (avoided in Multi-AZ). Snapshots are taken from the master, or from the standby in Multi-AZ. Set up automatic schedules. Point-in-time recovery via binlogs. User-executed snapshots.
  5. RDS Backups: Can I snapshot a replica? Nope. Back up from your master. Of course, you can promote a replica, then snapshot it for testbeds.
  6. RDS Backups: I like RDS backups when using Multi-AZ AND when loads are minimal. It's like unicorns are flying my binlogs to heaven.
  7. Fully Managed: Provisioning. Rapid master launches: master in a few minutes (or it's free?). Standby in a different AZ? Push a button! Rapid replica builds: need more replicas? Push a button!
  8. RDS Provisioning: Provisioning your master. Standalone (no failover or redundancy). Multi-AZ (standby in a separate availability zone). Pick your version. Pick your maintenance window.
  9. RDS Provisioning: Overview of AZs and Regions. Amazon Regions equate to data centers in different geographical regions (99.5% SLA, based on more than one AZ being unavailable). Availability zones are isolated from one another in the same region to minimize the impact of failures. RDS does not interact across regions.
  10. RDS Provisioning: Can multiple AZs save me? Amazon states AZs do not share: cooling, network, security, generators, facilities.
  11. RDS Provisioning: Can multiple AZs save me? Apr 2011, US East Region, EBS failed: incorrect network failover; saturated intra-node communications; cascading failures impacted EBS in all AZs. Jul 2012, US East, partial impact: electrical storms impacted multiple sites; failover of the metadata DB took too long; EBS I/O was frozen to minimize corruption.
  12. RDS Provisioning: Can multiple AZs save me? They can reduce risk. Cross-AZ latency can vary as much as 3x (too slow to allow MySQL Cluster across AZs). A Multi-AZ failover can create a degraded performance condition when minimal latency is required.
  13. Multi-AZ Failover (from AWS Docs)
  14. RDS Provisioning: Multi-AZ Magical Failover. Replicates via unicorn express. Fails over quite often, with up to 30 seconds of downtime. You do not get to choose your failover AZ. Typical I/O write impact for synchronous replication, aka unicorn express.
  15. Multi-AZ Failover (from AWS Blog)
  16. RDS Provisioning: Pick Your Version. MySQL 5.1 or MySQL 5.5 :( No MariaDB :( No XtraDB :( No Drizzle :( No TokuDB :(
  17. RDS Provisioning: Pick Your Maintenance Window. A 30-minute window in which your software patching can occur. Can be different for different instances. You need to plan ahead for instances to be out of service.
  18. RDS Provisioning: They'll shut off my DB????
  19. RDS Provisioning: Auto Minor Version Upgrade. If you choose no, you will not experience automatic upgrades (and thus downtime). Some critical security patches can still be applied. The RDS team is fairly good about communicating upgrades.
  20. RDS Provisioning: Basic Instance Types. Micro: 630 MB RAM, 2 ECU, low I/O. Small: 1.7 GB RAM, 1 ECU, medium I/O. Large: 7.5 GB RAM, 4 ECU, high I/O. XLarge: 15 GB RAM, 8 ECU, high I/O.
  21. RDS Provisioning: Fancy Instance Types. High Mem XL: 17.1 GB RAM, 6.5 ECU, high I/O. High Mem 2XL: 34 GB RAM, 13 ECU, high I/O. High Mem 4XL: 68 GB RAM, 26 ECU, high I/O.
  22. RDS Provisioning: Storage Provisioning. From 5 GB to 3 TB. At 300 GB, EBS volumes start to get striped. Striping = better performance. Provisioned IOPS (up to 30,000) = more stable I/O, and costs more too!
  23. RDS Provisioning: Virtual Private Cloud (VPC). Allows you to create your own virtual network simulating traditional DC networks. You must create a DB Subnet Group in the VPC. VPC subnets cannot cross availability zones. A VPC security group allows access control to your DB.
  24. RDS Provisioning: Virtual Private Cloud (VPC). Mixed architectures, with some VPC and some non-VPC, create major issues. Auto-scaling becomes difficult. Don't do it!
  25. RDS Provisioning: Database Security Groups. Control all MySQL access to RDS instances. Default to "deny all". Access can be granted by IP range and EC2 security groups.
  26. RDS Provisioning: Database Security Groups. Don't grant access to 10.x.x.x; use a security group. IPs are entered in CIDR (Classless Inter-Domain Routing) notation. Make sure you understand CIDR! (Or you may have unwelcome visitors!)
  27. RDS Provisioning: Parameter Groups. Define the parameters used by your RDS instances. There is a "default" group that you can modify. One or more RDS instances can map to an individual parameter group.
  28. RDS Provisioning: Parameter Group Best Practices. Don't ever use the default group. The default group doesn't allow dynamic parameter changes. Everything requires a restart. Build different groups for each MySQL master/replica grouping.
  29. RDS Provisioning: Parameter Group Best Practices. Use different parameter groups for masters vs. replicas. Consider using different parameter groups for different replica types (app query, ad hoc, ETL). Remember to use test environments. Test!!!
  30. RDS Provisioning: Why different parameter groups? Granularity. Do you want to apply the same parameter to everything in the cluster? Read only? Slow logging? innodb_flush_method?
  31. RDS Provisioning
  32. RDS Provisioning: Provisioning your Replicas. A replica does not have to be the same instance type as the master. Pick your availability zone (great for mapping replicas to app servers in the same AZ). Don't forget to apply a different parameter group than your master's.
  33. RDS Provisioning: Provisioning your Replicas. Adding a replica impacts your master's performance (if not in Multi-AZ). You can only launch in serial, and it can take a non-trivial amount of time to launch. Adding many replicas can take a while. Script it!
  34. RDS Provisioning: What can I do with my replica? Send queries to it. Promote it to a master. Poke it with a stick. Use it for special purposes (mysqldump, ETL, ad hoc).
  35. RDS Provisioning: Sending queries to the replica? Set up Route 53 CNAMEs with weighted round robin. Internal elastic load balancer in the VPC. VPC/Route 53 does not do a MySQL health check. HAProxy can be leveraged.
  36. RDS Provisioning: Replica-to-master promotion. This is a great way to build a test environment. Can be leveraged for rolling migrations. But a replica can't have a replica! Must promote first!
  37. RDS Provisioning: Replica promotion for failover. This can be used instead of Multi-AZ. Why? When using sync_binlog=0, a master failover in Multi-AZ may strand your replicas. The old log doesn't close correctly. The replica cannot proceed. And you can't move to the next log!
  38. RDS Provisioning: All of my replicas must be rebuilt!
  39. A Day in the Life: What does an RDS DBA do?
  40. A Day in the Life: What does an RDS DBA do? Need a replica? Push a button or call an API. Need to create a test environment? Promote a replica, call an API. New cluster? Push a button or call an API.
  41. A Day in the Life: What does an RDS DBA do? Need a backup? Push a button or call an API. Need to recover a database? Push a button or call an API. New cluster? Push a button or call an API.
  42. A Day in the Life: Need to do a query review? You don't have access to the logs at the file system level. You can look in the console or via the API for some initial diagnostics.
  43. A Day in the Life: Query Reviews. Need to do a REAL query review? Log to the CSV table, slow_log: mysql -u user -p -h host.rds.amazonaws.com -D mysql -s -r -e "SELECT CONCAT( '# Time: ', DATE_FORMAT(start_time, '%y%m%d %H%i%s'), '\n', '# User@Host: ', user_host, '\n', '# Query_time: ', TIME_TO_SEC(query_time), ' Lock_time: ', TIME_TO_SEC(lock_time), ' Rows_sent: ', rows_sent, ' Rows_examined: ', rows_examined, '\n', sql_text, ';' ) FROM mysql.slow_log" > /tmp/mysql.slow_log.log Then digest it: pt-query-digest --limit 100% /tmp/mysql.slow_log.log > /tmp/query-digest.txt
  44. A Day in the Life: Query Reviews. No microsecond patch. Using long-query-time=0 logs all queries, but they record as 0 on time. You have no accurate profile of query time for queries under 1 second. You also can't use tcpdump on the MySQL instance. We often use tcpdump when logging everything would drop performance on your DB instance to unacceptable levels. WHICH IT CAN.
  45. A Day in the Life: Need to rotate logs? CALL mysql.rds_rotate_slow_log; CALL mysql.rds_rotate_general_log;
  46. A Day in the Life: Need to kill a process? CALL mysql.rds_kill_query(99); kills the current query for this thread. CALL mysql.rds_kill(99); kills the thread.
  47. A Day in the Life: Managing Replication. Need to stop replication? Break it yourself! CALL mysql.rds_skip_repl_error; skips the current replication error.
  48. A Day in the Life: Reviewing Status Trends. The Global Status History event snapshots status into mysql.rds_global_status_history. You can trend this into many tools.
  49. Monitoring MySQL: CloudWatch. CPUUtilization, DatabaseConnections, FreeStorageSpace, Network In/Out, Read/Write IOPS, Read/Write Bytes, Read/Write Latency.
  50. Monitoring MySQL: Where are the MySQL metrics? CloudWatch doesn't expose them. You can use Cacti, Graphite, Zabbix, etc. for trending.
  51. Monitoring MySQL: Can I alert on CloudWatch metrics? CloudWatch allows you to set up your alerts. But you probably want all metrics and alerts in the same system, don't you?
  52. Monitoring MySQL: Also, CloudWatch is unreliable. It often doesn't poll at every interval. It can miss or skip important events.
  53. Monitoring MySQL: What can I use? Nagios can poll MySQL directly. Poll from Graphite.
  54. Some things that suck: Moving data in and out. Want to do a dump-and-load upgrade? Want to migrate to a new region? Want to do multi-layer replication?
  55. Some things that suck: Migrations/Upgrades out of RDS. Take a replica out of service. Dump your data. Upgrade your binaries. Load your data. Give replicas to your replica. Fail over reads, then writes. MINIMAL DOWNTIME.
  56. Some things that suck: Migrations/Upgrades in RDS
  57. Some things that suck: Migrations/Upgrades in RDS. Dump a bunch of tables. Load deltas via tons of scripting. Keep the deltas on each table minimal. Take a few hours of downtime. Sync the delta. Test. Go live and drink a lot.
  58. Some things that suck: This also applies to moving data between regions, migrating to EC2 from RDS, and migrating to a datacenter from AWS.
  59. Patterns for RDS: Prototyping and Testing. Rapid build and destroy. Short lifecycles. Quick testing lifecycles.
  60. Patterns for RDS: Moderate Uptime SLAs. Region-level SLA is 99.5% across two AZs (43.8 hours of downtime per year). Add in failover times for a Multi-AZ master (6 more hours). Expect around 4 days of downtime without multi-region.
  61. Patterns for RDS: That doesn't include downtime from bad queries, downtime from user error, or downtime from upgrades/migrations.
  62. Patterns for RDS: Relaxed Latency Requirements. Multi-AZ can introduce cross-AZ latency without AZ-specific architectural design. EBS storage can introduce unpredictable latency without P-IOPS. Snapshots of the master, replica builds and Multi-AZ failovers can impact write latency.
  63. Patterns for RDS: Relaxed Latency Requirements. If you use a write-through cache, this can be mitigated. If you use significant caching, this can be mitigated. If you use AZ-aware design, this can be mitigated.
  64. Patterns for RDS: Dataset Specifics. Small datasets can allow for rapid region migrations. Read-only datasets can also allow for this. Data you don't mind losing can also allow for this.
  65. Patterns for RDS: No DBA(s). You still need DBAs to design, tune and configure. But RDS does reduce some DBA overhead. With investment in automation, this overhead is not significant. Still, automation requires money/hours. If you have no budget, RDS is a good way to start.
  66. War Stories: Obama for America. US-East Region. Multi-AZ. 5 clusters, 30 instances. Provisioned IOPS, 1 TB storage.
  67. Obama for America: Data Growth. Opsview had no visibility into the OS, and thus we were surprised regularly by storage growth. Had to build custom plugins. Upgrading storage or instance size in Multi-AZ can cause an unpredictable downtime window. Downtime is small, but the whole process can take 30 minutes and you don't know when the REAL downtime will occur.
  68. Obama for America: Hurricane Sandy. Hurricane Sandy was poised to strike Virginia and US East. Luckily we had built out EC2 and data migration scripts. Took 3 days solid for the whole team to build out the US-West region.
  69. Obama for America: Human Error. While doing rolling DDL, sql_log_bin was disabled at the global level on the master. (Damn you 5.5!!!!) No access to binlogs made troubleshooting very challenging. An hour of troubleshooting because we blamed the disk and had no visibility. Had to rebuild all replicas in serial overnight once
  70. Obama for America: Migration to P-IOPS. Things that make you go hmmm....
  71. War Stories: Call of Duty, Black Ops 2. 5 clusters, 25 instances. US East. Multi-AZ. Provisioned IOPS.
  72. CoD Black Ops 2: Hurricane Sandy. Data migration scripts were not set up for continuous replication. Had to draw a line in the sand on when to move data. Any additional data would be lost if cutover occurred.
  73. CoD Black Ops 2: Multi-AZ Failover. Writes required sync_binlog=0. Master failed over to standby. All replicas stopped replicating. DBA couldn't “change master”. Read load swarmed the master while we rebuilt.
  74. CoD Black Ops 2: Provisioned IOPS. Came out, super exciting! Let's migrate! Oh, no push-button migration. 2 senior DBAs, 3 weeks to build migration scripts and test/migrate.
  75. Q&A. Laine Campbell, CEO, PalominoDB. http://www.slideshare.net/lainecampbell
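
A note on slide 60's numbers: the 43.8-hour figure follows directly from a 99.5% SLA over a year. The sketch below reproduces that arithmetic and adds the deck's 6-hour failover estimate. The function name is made up for illustration. The slide's "around 4 days" is the author's own estimate and is not derived by this arithmetic alone.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(sla_percent: float) -> float:
    """Hours per year a service can be down while still meeting the SLA."""
    return (1 - sla_percent / 100) * HOURS_PER_YEAR


region_sla_hours = allowed_downtime_hours(99.5)
failover_hours = 6  # the deck's estimate for Multi-AZ master failovers
total_hours = region_sla_hours + failover_hours

print(f"SLA budget:     {region_sla_hours:.1f} hours/year")
print(f"With failovers: {total_hours:.1f} hours/year ({total_hours / 24:.1f} days)")
```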
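
A note on slide 26's CIDR warning: a quick way to see why granting 10.x.x.x is too broad is to count what a prefix actually admits. This is a minimal sketch using Python's standard `ipaddress` module. The example ranges and the `is_allowed` helper are hypothetical, and nothing here calls the RDS API.

```python
import ipaddress


def describe(cidr: str) -> str:
    """Summarize how many addresses a CIDR block covers."""
    net = ipaddress.ip_network(cidr)
    return f"{cidr} admits {net.num_addresses:,} addresses ({net[0]} to {net[-1]})"


def is_allowed(client_ip: str, allowed_cidrs: list[str]) -> bool:
    """True if client_ip falls inside any of the allowed CIDR blocks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical rules: the whole 10/8 space, one subnet, one host.
for cidr in ("10.0.0.0/8", "10.1.2.0/24", "10.1.2.15/32"):
    print(describe(cidr))

# A host far outside your subnet still matches the broad rule.
print(is_allowed("10.200.3.4", ["10.0.0.0/8"]))   # matches the /8
print(is_allowed("10.200.3.4", ["10.1.2.0/24"]))  # does not match the /24
```

On shared address space, a /8 rule lets in far more than your own hosts, which is the deck's argument for referencing an EC2 security group instead.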
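
A note on slide 48's status trending: MySQL status counters are cumulative, so trending tools usually want per-second rates between snapshots. The sketch below assumes you have already pulled (timestamp, value) pairs for one counter from the status history table. The sample data is invented, and the function is illustrative rather than part of any RDS tooling.

```python
from datetime import datetime


def per_second_rates(samples):
    """Turn cumulative counter samples into per-second rates.

    samples: list of (datetime, cumulative_value), sorted by time.
    Intervals where the counter went backwards (for example after a
    restart or failover) are skipped rather than reported as negative.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = (t1 - t0).total_seconds()
        if elapsed <= 0 or v1 < v0:
            continue
        rates.append((t1, (v1 - v0) / elapsed))
    return rates


# Invented samples for a counter such as Questions, five minutes apart.
samples = [
    (datetime(2013, 1, 1, 12, 0), 1000),
    (datetime(2013, 1, 1, 12, 5), 4000),
    (datetime(2013, 1, 1, 12, 10), 100),  # counter reset
    (datetime(2013, 1, 1, 12, 15), 700),
]

for when, rate in per_second_rates(samples):
    print(f"{when:%H:%M} {rate:.1f}/s")
```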
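
A note on slide 35's read routing: the gap it points out is that DNS-level weighting knows nothing about MySQL health. One way to picture the alternative is a weighted pick that first filters out replicas failing a database-level check. This is a sketch only. The hostnames are hypothetical, and `is_healthy` stands in for whatever check you trust (a test query, a replication-lag threshold); it is not an HAProxy or Route 53 feature.

```python
import random


def pick_replica(replicas, is_healthy, rng=random):
    """Pick one replica host by weight, skipping unhealthy ones.

    replicas: list of (host, weight) pairs.
    is_healthy: callable taking a host and returning True or False.
    rng: the random module or a random.Random instance.
    """
    healthy = [(host, w) for host, w in replicas if w > 0 and is_healthy(host)]
    if not healthy:
        raise RuntimeError("no healthy replicas available")
    hosts, weights = zip(*healthy)
    return rng.choices(hosts, weights=weights, k=1)[0]


# Hypothetical replica endpoints with relative weights.
replicas = [
    ("replica-a.example.internal", 3),
    ("replica-b.example.internal", 1),
    ("replica-c.example.internal", 1),
]

# Pretend replica-b is lagging or down.
down = {"replica-b.example.internal"}
print(pick_replica(replicas, lambda host: host not in down))
```

Raising when nothing is healthy forces the caller to decide on a fallback, such as sending reads to the master, instead of silently picking a broken replica.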
