Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | AWS re:Invent 2013

Learn how to monitor your database performance closely and troubleshoot database issues quickly using a variety of features provided by Amazon RDS and MySQL including database events, logs, and engine-specific features. You also learn about the security best practices to use with Amazon RDS for MySQL. In addition, you learn about how to effectively move data between Amazon RDS and on-premises instances. Lastly, you learn the latest about MySQL 5.6 and how you can take advantage of its newest features with Amazon RDS.

Published in: Technology, Business

  1. 1. DAT302 - A Closer Look at Amazon RDS for MySQL: Deep Dive into Diagnostics, Security, and Migration • Pavan Pothukuchi, Sr. Product Manager, Amazon RDS • Sorin Stoina, Operations Lead, Optaros • Antonio Graeff, Technology Director, Titans Group • November 14, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. 2. What’s in 2013? Features
  3. 3. Diagnostics
  4. 4. Monitoring
  5. 5. Monitoring • CloudWatch & alarms • SNS notifications • Other monitoring tools • Connect dots!
  6. 6. RDS – CloudWatch & Alarms (charts: Swap Usage, Freeable Memory) • Scale Instance Up • Use Read Replicas
  7. 7. RDS – CloudWatch & Alarms (metric trend: potential action) • Freeable Space: Scale Storage Up • Binary Log Disk Usage: Check Read Replicas • Freeable Memory and Swap Usage: Scale Compute • Write Latency and Queue Depth: Add Provisioned IOPS • DB Connections: Check connection pooling
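
The metric-to-action table on this slide can be read as a simple lookup. A minimal Python sketch follows; the metric keys and the threshold values are hypothetical illustrations, not figures from the talk, and a real setup would more likely express them as CloudWatch alarms.

```python
# Metric trend -> potential action, taken from the slide's table.
# The dictionary keys are hypothetical names chosen for this sketch.
ACTIONS = {
    "free_storage": "Scale storage up",
    "binlog_disk_usage": "Check read replicas",
    "freeable_memory": "Scale compute",
    "swap_usage": "Scale compute",
    "write_latency": "Add Provisioned IOPS",
    "queue_depth": "Add Provisioned IOPS",
    "db_connections": "Check connection pooling",
}

# Hypothetical thresholds (not from the talk): (direction, limit).
THRESHOLDS = {
    "free_storage": ("below", 10 * 1024**3),   # less than 10 GiB free
    "swap_usage": ("above", 256 * 1024**2),    # more than 256 MiB swapped
    "db_connections": ("above", 500),          # more than 500 connections
}


def suggest_actions(samples):
    """Return the sorted, de-duplicated actions for metrics that breach a threshold."""
    actions = set()
    for name, value in samples.items():
        rule = THRESHOLDS.get(name)
        if rule is None:
            continue  # no threshold configured for this metric
        direction, limit = rule
        breached = value < limit if direction == "below" else value > limit
        if breached:
            actions.add(ACTIONS[name])
    return sorted(actions)
```

Calling `suggest_actions({"swap_usage": 512 * 1024**2, "db_connections": 1204})` would suggest both checking connection pooling and scaling compute.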
  8. 8. RDS - Event Notifications
  9. 9. RDS - Event Notifications
  10. 10. Database Logs • Error Log • Slow Query Log • General Log
  11. 11. Error Log • Archived every 5 min • Retained for 24 hours • Example: Unable to start MySQL • Sample log content: InnoDB: Initializing buffer pool, size = 6.0G InnoDB: Completed initialization of buffer pool InnoDB: Fatal error: cannot allocate memory for the buffer pool • Action: Audit memory parameters (e.g., innodb_buffer_pool_size)
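
The error-log example on this slide lends itself to a small scan for fatal InnoDB messages. A minimal Python sketch, assuming the log has already been downloaded as text; the function names are hypothetical and the only rule encoded is the one the slide gives.

```python
import re

# Matches lines such as:
#   InnoDB: Fatal error: cannot allocate memory for the buffer pool
FATAL = re.compile(r"InnoDB: Fatal error: (?P<msg>.+)")


def find_fatal_errors(log_text):
    """Return the message part of every InnoDB fatal error in the log text."""
    return [m.group("msg").strip() for m in FATAL.finditer(log_text)]


def suggest_action(message):
    """Map a fatal error message to a follow-up action (only the slide's case is known)."""
    if "cannot allocate memory for the buffer pool" in message:
        return "Audit memory parameters such as innodb_buffer_pool_size"
    return "Inspect the error log around this message"
```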
  12. 12. Slow Query Log • Download from AWS Management Console • Access from tables • Connect the dots
  13. 13. Other Monitoring Tools • MONyog • Percona • New Relic • Graphite • Splunk
  14. 14. Security
  15. 15. Security Internet VPC IAM
  16. 16. VPC • Run DB in a private subnet • Use separate Sec. Group for DB • Connect through CNAME • Use
  17. 17. AWS Identity and Access Management (IAM) • DO NOT share AWS account credentials • Create IAM users • Tag resources • Delegate access
  18. 18. Data Migration
  19. 19. Advanced Migration • Set up replication between On Premises and AWS (diagram: tables T1 and T2 on each side)
  20. 20. Customer Highlights • e-Commerce solutions: Sorin Stoina • Value Added Services - Telecom: Antonio Graeff
  21. 21. Who we are, What we do • Optaros is a global digital commerce service partner • Hosting and support for multiple customers • New and emerging shopping models – Flash sales – Private event retailing • High traffic “Daily Deal” sites – 5 million unique visitors – 2000 page views/second – 15 add-to-carts per second – 3 orders/sec • Using AWS since 2009, RDS since 2010
  22. 22. Private Event Retailing (PER) • “Daily Deal” or “Private Sales” • 24, 48, or 72 hour events • Massive discounts designed to entice customers • Invitation only – Customers are selected based on purchase history – Email blast is sent as the event starts • Users can “reserve” items for a limited time by adding them to their cart • “Cyber Monday every Monday”
  23. 23. Typical Shopping Cart Architecture
  24. 24. PER Traffic Pattern
  25. 25. RDS in E-Commerce • Highly transactional, ACID is a must • Highly available – Multi-AZ: fail-over, on-the-fly changes to RDS instances • Massive write and read-intensive loads – Writes: sign-up, add to cart, checkout – Provisioned IOPS – Reads: catalog browsing, stock availability – read replicas • Operational efficiency – High/low peak traffic ratio is huge, sometimes as high as 100:1 – 50+ database servers with 5 devops engineers
  26. 26. Tools & Techniques • Jenkins – Event prep automation • CloudFormation – Environment management • CloudWatch for metrics – And Graphite for good measure • Percona toolkit – http://www.percona.com/software/percona-toolkit • MONyog • Optaros Cloud Console – Database monitor
  27. 27. Jenkins
  28. 28. Jenkins • We have automated jobs to “Scale up” the infrastructure: – Frontend servers – increase auto-scaling array to 30+ – Start up to 10 extra cache machines – RDS read replicas – start 4 read replicas in parallel • Jobs complete within 30 minutes – used to take a lot longer before parallel read replica creation
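
The parallel read replica creation mentioned on this slide can be sketched in Python. This is an illustration only: the talk does not show the Jenkins job itself, the replica naming scheme is hypothetical, and `create_replica` is a caller-supplied function (for example, a thin wrapper around the boto3 RDS client's `create_db_instance_read_replica`).

```python
from concurrent.futures import ThreadPoolExecutor


def create_replicas_in_parallel(source_id, count, create_replica):
    """Start `count` read replicas of `source_id` concurrently.

    `create_replica(source_id, replica_id)` is supplied by the caller and
    performs the actual API request. `count` must be at least 1.
    Results are returned in the same order as the generated replica names.
    """
    # Hypothetical naming scheme for this sketch.
    names = [f"{source_id}-replica{i}" for i in range(count)]
    with ThreadPoolExecutor(max_workers=count) as pool:
        return list(pool.map(lambda name: create_replica(source_id, name), names))
```

Issuing the requests concurrently means the total wait is roughly that of the slowest replica rather than the sum of all of them, which matches the slide's remark about jobs finishing faster.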
  29. 29. AWS CloudFormation • Keep your RDS parameter groups, security groups and network ACLs in sync across environments sorin-macbook:stacks sorin$ stack -d cross-client-tools-prod.rb @@ -7188,7 +7188,7 @@ "innodb_purge_threads": 1, "max_allowed_packet": 20971520, "max_connect_errors": "10000", - "query_cache_size": 33554432, + "query_cache_size": 65554432, "thread_cache_size": 32, "tx_isolation": "READ-COMMITTED"
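
The diff on this slide compares parameter values between a deployed stack and its template. A minimal Python sketch of the same idea, assuming each parameter group has been loaded into a plain dictionary; the function name is hypothetical and this is not the `stack` tool shown on the slide.

```python
def diff_parameters(current, desired):
    """Return {name: (current_value, desired_value)} for every parameter that differs.

    A parameter missing on one side is reported with None on that side.
    """
    changes = {}
    for key in sorted(set(current) | set(desired)):
        old, new = current.get(key), desired.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes
```

With the values from the slide, only `query_cache_size` would be reported, changing from 33554432 to 65554432.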
  30. 30. Amazon CloudWatch and Graphite • Graphite is our central system for metrics – Pull RDS data from CloudWatch into Graphite – Parse InnoDB and system variables and push to Graphite – Application and system metrics go in there as well • Single dashboard for the whole application • Graphite’s API is polled by other alerting and monitoring systems as well
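
Pushing metrics into Graphite, as described on this slide, typically uses its plaintext protocol: one `path value timestamp` line per datapoint. A minimal Python sketch, assuming the CloudWatch datapoints have already been fetched; the metric prefix, host, and function names are hypothetical, and port 2003 is Graphite's usual plaintext listener.

```python
import socket


def to_graphite_lines(prefix, datapoints):
    """Format (metric, value, unix_timestamp) tuples as Graphite plaintext lines."""
    lines = []
    for metric, value, timestamp in datapoints:
        lines.append(f"{prefix}.{metric} {value} {int(timestamp)}\n")
    return "".join(lines)


def send_to_graphite(payload, host, port=2003, timeout=5.0):
    """Send a preformatted payload to a Graphite (carbon) plaintext listener."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload.encode("ascii"))
```

Keeping formatting separate from sending makes the formatting step easy to check without a running Graphite server.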
  31. 31. Amazon CloudWatch and Graphite
  32. 32. MONyog
  33. 33. MONyog • Commercial app for MySQL management • Monitors and alerts on key metrics • Useful diagnostics – Caches – Deadlocks – Temporary tables – etc. • Advice on best practices
  34. 34. MONyog alert • Server: prod rds-read-replica0 • Sampling timeframe: All Time/Current • Name: Currently running threads • Group: Current Connections • Type: Critical • Threshold: 500 • Value: 1204 • Advice: If the database is overloaded you'll get an increased number of queries running. Occasional spikes are OK for a very short period of time. Too many active threads indicate that: 1. MySQL is taking too much time to process your requests. 2. You are continuously retrieving/updating large datasets. Make sure that queries are tuned to use indexes. Execute SHOW FULL PROCESSLIST to find queries that are getting locked continuously. Try isolating long-running queries by enabling the slow query log.
  35. 35. Percona Toolkit • http://percona.com/software/percona-toolkit • pt-query-digest in particular – Can be used on the slow query log or a tcpdump file – Since you can’t access the RDS instances, you can run it on your application server – #tcpdump -i eth0 port 3306 -s 65535 -x -n -q -tttt > tcpdump.out – #pt-query-digest --type=tcpdump tcpdump.out • pt-table-checksum won’t work – It requires special privileges – Fortunately, it’s really easy to rebuild read replicas – sync_binlog can be a problem when using read replicas – Less of a problem with MySQL 5.6 crash-safe slaves
  36. 36. In-House Database Monitor • “Snapshot” InnoDB status and process list every 10 seconds • Go back in time up to 7 days • Helps identify contentions, rogue queries, etc. • Uses Amazon S3 for storage
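
The slide gives only the outline of the in-house monitor: a snapshot every 10 seconds, 7 days of history, and Amazon S3 for storage. A minimal Python sketch of how snapshot keys and retention could be handled under those constraints; the key layout and function names are hypothetical, not Optaros's implementation.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)   # "go back in time up to 7 days"
INTERVAL_SECONDS = 10           # snapshot every 10 seconds


def snapshot_key(instance, taken_at):
    """Build a storage key for a snapshot, floored to its 10-second slot."""
    floored = taken_at.second - taken_at.second % INTERVAL_SECONDS
    slot = taken_at.replace(second=floored, microsecond=0)
    return f"{instance}/{slot:%Y/%m/%d/%H%M%S}.txt"


def is_expired(taken_at, now):
    """True if a snapshot is older than the retention window."""
    return now - taken_at > RETENTION
```

Grouping keys by date keeps "go back in time" lookups to a single prefix listing per day.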
  37. 37. In-House Database Monitor
  38. 38. In-House Database Monitor
  39. 39. In-House Database Monitor
  40. 40. Up Next • Manage read replicas using CloudFormation • Use Provisioned IOPS more for lower latency • Upgrade more environments to MySQL 5.6 • Better disaster recovery – cross-region DB snapshot
  41. 41. Customer Highlights • e-Commerce solutions: Sorin Stoina • Value Added Services - Telecom: Antonio Graeff
  42. 42. Titans Group • VAS (Value Added Services) provider for mobile and fixed-line carriers and ISPs • White label personal cloud, mobile security and mobile learning products • Over 10 million active users in 17 countries in Latin America
  43. 43. Carrier billing platform • Complex business rules (trial and subscription periods, bundle, self-renewal) • Lots of safeguards to prevent overcharge • High volume, high value data • Uptime counts: lost transaction is lost revenue • Transactions concentrated in some days of the month • Many different regulatory issues for logging, privacy and data retention
  44. 44. Before
  45. 45. Before • Single pair of on-premises MySQL servers in master-slave configuration • Less than 100k transactions a day but growing fast • No full-time DBA • Rapidly iterating the application (while converting from PHP to Python)
  46. 46. Problems • Upgrading memory, CPU and storage (SSD) and still hitting hardware bottlenecks • Database for queues (please, don't!)
  47. 47. The turning point • AWS announces Provisioned IOPS Storage for RDS in September 2012 • Let's migrate!
  48. 48. Migrating from on-premises to RDS • Then: dump from MySQL and load on RDS, replay binary logs on RDS (downtime) • Percona Toolkit pt-table-sync for sanity checks • Now (much easier!): RDS as slave, promote slave to master (almost online)
  49. 49. After
  50. 50. After • Several RDS instances • Specialized databases by function (contracts, transactions, whitelists, blacklists) • Several million transactions a day and still growing fast • Still no full-time DBA
  51. 51. How RDS helped us • Focus on application versus focus on database operation • Easy scaling up • Multi-AZ - High availability (99.95% Uptime SLA) • Read Replica – for read load and ad hoc analysis • Snapshots - For testing and archival • Tagging - Cost reporting by product and client
  52. 52. Performance monitoring - New Relic
  53. 53. Performance monitoring - New Relic
  54. 54. Performance monitoring - New Relic
  55. 55. Performance monitoring - New Relic
  56. 56. Log management - Splunk
  57. 57. Next steps • Automate data lifecycle management • Migrate cold data from RDS to Redshift, S3, and Glacier
  58. 58. Please give us your feedback on this presentation DAT302 As a thank you, we will select prize winners daily for completed surveys!
