(DAT405) Amazon Aurora Deep Dive

Amazon Aurora is a MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. Amazon Aurora is disruptive technology in the database space, bringing a new architectural model and distributed systems techniques to provide far higher performance, availability and durability than previously available using conventional monolithic database techniques. In this session, we will do a deep-dive into some of the key innovations behind Amazon Aurora, discuss best practices and configurations, and share early customer experience from the field.


(DAT405) Amazon Aurora Deep Dive

  1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Anurag Gupta, VP, Aurora, Amazon Redshift, EMR. October 2015. DAT405 – Amazon Aurora Deep Dive: Design Overview of Aurora Performance and Availability. Commercial performance at open source prices:
• 6-way replication across 3 AZs
• Custom, scale-out SSD storage
• Less than 30s failovers or crash recovery
• Shared storage across replicas
• Up to 15 read replicas that act as failover targets
• Pay for the storage you use
• Automatic hotspot management
• Automatic IOPS provisioning
• 100K writes/second & 500K reads/second
• Buffer caches that survive a database restart
• SQL fault injection
• MySQL compatible
• Automatic volume growth
• Up to 64TB databases
• Proactive data block corruption detection
• Automated continuous backups to S3
• Automated repair of bad disks
• Peer-to-peer gossip replication
• Quorum writes tolerate drive or AZ failures
• 1/10th the cost of commercial databases
• Less than 10ms replica lag
  2. What is Amazon Aurora? A MySQL-compatible relational database with the performance and availability of commercial databases and the simplicity and cost-effectiveness of open source databases.
  3. SQL benchmark results. MySQL SysBench on R3.8XL (32 cores, 244 GB RAM). Write performance: 4 client machines with 1,000 connections each. Read performance: single client machine with 1,600 connections.
  4. Reproducing these results (https://d0.awsstatic.com/product-marketing/Aurora/RDS_Aurora_Performance_Assessment_Benchmarking_v1-2.pdf):
• Create an Amazon VPC (or use an existing one).
• Create four EC2 R3.8XL client instances to run the SysBench client. All four should be in the same AZ.
• Enable enhanced networking on your clients.
• Tune your Linux settings (see whitepaper).
• Install SysBench version 0.5.
• Launch an r3.8xlarge Amazon Aurora DB instance in the same VPC and AZ as your clients.
• Start your benchmark!
(Diagram: four R3.8XLARGE client instances driving an Amazon Aurora r3.8xlarge instance.)
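A minimal sketch of the last step, driving a SysBench 0.5 OLTP run from Python. The Aurora endpoint, credentials, Lua test path, table size, and run length are placeholders; the whitepaper linked above has the exact parameters the talk used.

    # Hypothetical SysBench 0.5 invocation; substitute your own endpoint and paths.
    import subprocess

    SYSBENCH_ARGS = [
        "sysbench",
        "--test=/usr/share/doc/sysbench/tests/db/oltp.lua",  # SysBench 0.5 Lua test (path varies by install)
        "--mysql-host=my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
        "--mysql-user=admin",
        "--mysql-password=secret",
        "--mysql-db=sbtest",
        "--oltp-tables-count=250",   # matches the 250-table workload used in the talk
        "--oltp-table-size=25000",   # assumed size for illustration
        "--num-threads=1000",        # 1,000 connections per client machine
        "--max-time=1800",           # 30-minute run
        "--max-requests=0",          # run for the full duration
    ]

    subprocess.run(SYSBENCH_ARGS + ["prepare"], check=True)  # load the tables once
    subprocess.run(SYSBENCH_ARGS + ["run"], check=True)      # then drive the workload

Run this once per client instance (four in parallel for the write test) to approximate the setup described above.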
  5. Beyond benchmarks. If only real-world applications saw benchmark performance. Distortions in the real world:
• Requests contend with each other
• Metadata rarely fits in the data dictionary cache
• Data rarely fits in the buffer cache
• Production databases need to run HA
  6. Scaling user connections (SysBench OLTP workload, 250 tables). Up to 8x faster.

    Connections    Amazon Aurora    RDS MySQL 30K IOPS (single AZ)
    50             40,000           10,000
    500            71,000           21,000
    5,000          110,000          13,000
  7. Scaling table count (SysBench write-only workload, 1,000 connections, default settings). Up to 11x faster.

    Tables    Amazon Aurora    MySQL I2.8XL local SSD    MySQL I2.8XL RAM disk    RDS MySQL 30K IOPS (single AZ)
    10        60,000           18,000                    22,000                   25,000
    100       66,000           19,000                    24,000                   23,000
    1,000     64,000           7,000                     18,000                   8,000
    10,000    54,000           4,000                     8,000                    5,000
  8. Scaling data set size.

SysBench write-only – up to 67x faster:

    DB Size    Amazon Aurora    RDS MySQL 30K IOPS (single AZ)
    1GB        107,000          8,400
    10GB       107,000          2,400
    100GB      101,000          1,500
    1TB        26,000           1,200

CloudHarmony TPC-C – up to 136x faster:

    DB Size    Amazon Aurora    RDS MySQL 30K IOPS (single AZ)
    80GB       12,582           585
    800GB      9,406            69
  9. Running with read replicas (SysBench write-only workload, 250 tables). Up to 500x lower replica lag.

    Updates per second    Amazon Aurora    RDS MySQL 30K IOPS (single AZ)
    1,000                 2.62 ms          0 s
    2,000                 3.42 ms          1 s
    5,000                 3.94 ms          60 s
    10,000                5.38 ms          300 s
  10. How do we achieve these results? Databases are all about I/O. Network-attached storage is all about packets/second. High-throughput processing does not allow context switches.

Do less work:
• Do fewer IOs
• Minimize network packets
• Cache prior results
• Offload the database engine

Be more efficient:
• Process asynchronously
• Reduce latency path
• Use lock-free data structures
• Batch operations together
  11. IO traffic in RDS MySQL (MySQL with standby). Types of write: binlog, data, double-write, log, FRM files.

IO flow:
• Issue write to EBS – EBS issues to mirror, ack when both done
• Stage write to standby instance using DRBD
• Issue write to EBS on standby instance

Observations:
• Steps 1, 3, 5 (in the diagram) are sequential and synchronous
• This amplifies both latency and jitter
• Many types of writes for each user operation
• Have to write data blocks twice to avoid torn writes

Performance (30-minute SysBench write-only workload, 100 GB data set, RDS Single-AZ, 30K PIOPS):
• 780K transactions
• 7,388K I/Os per million transactions (excludes mirroring, standby)
• Average 7.4 I/Os per transaction

(Diagram: primary instance in AZ 1 and standby instance in AZ 2, each writing to mirrored Amazon EBS volumes; backups to Amazon S3.)
  12. IO traffic in Aurora (database). Async 4/6 quorum distributed writes. Types of write: binlog, data, double-write, log, FRM files.

IO flow:
• Only write redo log records; all steps asynchronous
• No data block writes (checkpoint, cache replacement)
• Boxcar redo log records – fully ordered by LSN
• Shuffle to appropriate segments – partially ordered
• Boxcar to storage nodes and issue writes

Observations:
• 6X more log writes, but 9X less network traffic
• Tolerant of network and storage outlier latency

Performance (30-minute SysBench write-only workload, 100 GB data set):
• 27,378K transactions (35X more)
• 950K I/Os per 1M transactions, 6X amplification (7.7X less)

(Diagram: primary instance in AZ 1 with replica instances in AZ 2 and AZ 3, writing asynchronously to six storage nodes across the three AZs with a 4/6 quorum; continuous backup to Amazon S3.)
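A minimal sketch of the 4/6 quorum write idea described on this slide, using Python's asyncio. The transport and send_to_node are invented placeholders, not Aurora's API; the point is only that a write becomes durable once four of six storage nodes acknowledge it, so outlier latency from slow nodes stays off the critical path.

    # Sketch: a batch of redo log records is durable once 4 of 6 storage nodes ACK,
    # without waiting for the slowest nodes. send_to_node stands in for the real call.
    import asyncio
    import random

    NODES = [f"storage-node-{i}" for i in range(6)]  # 2 per AZ across 3 AZs
    WRITE_QUORUM = 4

    async def send_to_node(node: str, log_records: bytes) -> str:
        # Placeholder for shipping redo records to one storage node.
        await asyncio.sleep(random.uniform(0.001, 0.050))  # simulate outlier latency
        return node

    async def quorum_write(log_records: bytes) -> None:
        pending = [asyncio.create_task(send_to_node(n, log_records)) for n in NODES]
        acks = 0
        for fut in asyncio.as_completed(pending):
            await fut
            acks += 1
            if acks >= WRITE_QUORUM:
                # Durable: 4/6 nodes hold the records; stragglers can finish later.
                print(f"durable after {acks} acks")
                return

    asyncio.run(quorum_write(b"redo records for LSNs 10..22"))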
  13. IO traffic in Aurora (storage node).

IO flow:
① Receive record and add to in-memory queue
② Persist record and ACK
③ Organize records and identify gaps in log
④ Gossip with peers to fill in holes
⑤ Coalesce log records into new data block versions
⑥ Periodically stage log and new block versions to S3
⑦ Periodically garbage collect old versions
⑧ Periodically validate CRC codes on blocks

Observations:
• All steps are asynchronous
• Only steps 1 and 2 are in the foreground latency path
• Input queue is 46X less than MySQL (unamplified, per node)
• Favor latency-sensitive operations
• Use disk space to buffer against spikes in activity

(Diagram: log records from the primary instance flow through the storage node's incoming queue into an update queue and hot log, are sorted, grouped, and coalesced into data blocks, with peer-to-peer gossip to peer storage nodes, point-in-time snapshots, GC, scrub, and backup to S3.)
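A toy sketch of steps ③–⑤ above: organize incoming redo by LSN, spot holes to request from peers, and coalesce contiguous records for one block into a new block version. The record and block formats here are invented purely for illustration.

    # Toy model of a storage node: find LSN gaps (to fill via peer gossip) and
    # coalesce redo records for one block into a new block version.
    from dataclasses import dataclass

    @dataclass
    class RedoRecord:
        lsn: int
        block_id: int
        payload: str

    def find_gaps(records: list[RedoRecord], start_lsn: int) -> list[int]:
        """LSNs missing between start_lsn and the highest LSN received so far."""
        have = {r.lsn for r in records}
        top = max(have, default=start_lsn)
        return [lsn for lsn in range(start_lsn, top + 1) if lsn not in have]

    def coalesce(block: str, records: list[RedoRecord], block_id: int) -> str:
        """Apply all records for one block, in LSN order, producing a new version."""
        for r in sorted(records, key=lambda r: r.lsn):
            if r.block_id == block_id:
                block += f"|{r.payload}"   # stand-in for applying a redo change
        return block

    incoming = [RedoRecord(10, 1, "a"), RedoRecord(12, 1, "c"), RedoRecord(11, 2, "b")]
    print(find_gaps(incoming, start_lsn=10))        # [] -> nothing to gossip for
    print(coalesce("block1@v0", incoming, block_id=1))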
  14. Asynchronous group commits.

Traditional approach:
• Maintain a buffer of log records to write out to disk
• Issue write when buffer full or timeout waiting for writes
• First writer has latency penalty when write rate is low

Amazon Aurora:
• Request I/O with first write, fill buffer till write picked up
• Individual write durable when 4 of 6 storage nodes ACK
• Advance DB durable point up to earliest pending ACK

(Diagram: transactions T1…Tn each do reads, writes, and a commit; commits T1…T8 sit in a commit queue pending in LSN order – LSN 10, 12, 20, 22, 30, 34, 41, 47, 49, 50 – and complete as a group as the durable LSN at the head node advances over time.)
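A minimal sketch of the Aurora-style side of this slide: a commit does not force its own disk write, it simply waits until the database's durable LSN (advanced by storage ACKs) passes its commit LSN, so many commits ride one acknowledged write. The durable-LSN bookkeeping is simplified and the storage ACK is simulated with a timer.

    # Sketch: commits wait for the durable LSN to reach their commit LSN;
    # a single storage ACK releases a whole group of pending commits.
    import threading

    class DurableLsnTracker:
        def __init__(self) -> None:
            self.durable_lsn = 0
            self._cond = threading.Condition()

        def on_storage_ack(self, acked_lsn: int) -> None:
            # Called when writes covering everything up to acked_lsn reach a
            # 4/6 quorum; advances the durable point and wakes pending commits.
            with self._cond:
                if acked_lsn > self.durable_lsn:
                    self.durable_lsn = acked_lsn
                    self._cond.notify_all()

        def wait_for_commit(self, commit_lsn: int) -> None:
            with self._cond:
                self._cond.wait_for(lambda: self.durable_lsn >= commit_lsn)

    tracker = DurableLsnTracker()
    threading.Timer(0.05, tracker.on_storage_ack, args=(50,)).start()  # simulated group ACK

    for txn, lsn in [("T1", 10), ("T2", 22), ("T3", 50)]:
        tracker.wait_for_commit(lsn)        # all three complete off the same ACK
        print(f"{txn} committed (durable LSN {tracker.durable_lsn})")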
  15. Adaptive thread pool.

MySQL thread model:
• Standard MySQL – one thread per connection; doesn't scale with connection count
• MySQL EE – connections assigned to a thread group; requires careful stall threshold tuning

Aurora thread model:
• Re-entrant connections multiplexed to active threads
• Kernel-space epoll() inserts into latch-free event queue
• Dynamically sized thread pool
• Gracefully handles 5,000+ concurrent client sessions on r3.8xl

(Diagram: client connections feed a latch-free task queue via epoll(), which the thread pool drains.)
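A rough sketch of the multiplexing idea using Python's selectors module (epoll on Linux) and a small fixed worker pool, instead of one thread per connection. This is a generic pattern sketch, not Aurora's implementation, and it is not hardened for cross-thread selector use.

    # Sketch: many client sockets multiplexed onto a few worker threads.
    # epoll (via selectors) reports readable connections; ready connections go
    # onto a queue that a small, fixed pool of workers drains.
    import queue
    import selectors
    import socket
    import threading

    sel = selectors.DefaultSelector()               # uses epoll on Linux
    ready: "queue.Queue[socket.socket]" = queue.Queue()

    def worker() -> None:
        while True:
            conn = ready.get()
            data = conn.recv(4096)                  # serve one request
            if data:
                conn.sendall(b"ok\n")
                sel.register(conn, selectors.EVENT_READ)   # re-arm for the next request
            else:
                conn.close()                        # client disconnected

    def event_loop(server: socket.socket) -> None:
        sel.register(server, selectors.EVENT_READ, data="accept")
        while True:
            for key, _ in sel.select():
                if key.data == "accept":
                    conn, _addr = key.fileobj.accept()
                    sel.register(conn, selectors.EVENT_READ)
                else:
                    sel.unregister(key.fileobj)     # hand off to a worker
                    ready.put(key.fileobj)

    server = socket.create_server(("127.0.0.1", 5050))
    for _ in range(8):                              # 8 threads, not one per connection
        threading.Thread(target=worker, daemon=True).start()
    event_loop(server)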
  16. IO traffic in Aurora (read replica).

MySQL read scaling (logical): ship SQL statements to the replica. Write workload is similar on both instances, with independent storage (separate data volumes), which can result in data drift between master and replica. The master handles 30% reads / 70% writes; the replica serves 30% new reads while repeating the 70% write workload through a single-threaded binlog apply.

Amazon Aurora read scaling (physical): ship redo from the master to the replica. The replica shares storage and performs no writes; cached pages have redo applied, and the read view advances when all commits are seen. The master handles 30% reads / 70% writes (page cache updates), while the replica serves 100% new reads against shared multi-AZ storage.
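A toy sketch of the physical-replication idea on this slide: the replica applies incoming redo only to pages it already has cached, and otherwise does nothing, because shared storage can materialize any page on demand. The page and record formats are invented for illustration.

    # Toy model: an Aurora-style replica keeps its page cache current by applying
    # redo to cached pages only; uncached pages need no work, since shared storage
    # serves them on demand. The replica never writes.
    page_cache: dict[int, str] = {1: "page1@v0", 7: "page7@v0"}

    def apply_redo(page_id: int, change: str, lsn: int) -> None:
        if page_id in page_cache:
            page_cache[page_id] += f"|{change}@lsn{lsn}"   # keep the cached copy current
        # else: no write, no storage read -- the replica stays write-free

    for page_id, change, lsn in [(1, "upd", 30), (3, "ins", 34), (7, "del", 41)]:
        apply_redo(page_id, change, lsn)

    print(page_cache)   # pages 1 and 7 updated in place; page 3 left to shared storage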
  17. Improvements over the past few months.

Batch operations:
• Write batch size tuning
• Asynchronous send for read/write I/Os
• Purge thread performance
• Bulk insert performance

Lock contention:
• Hot row contention
• Dictionary statistics
• Mini-transaction commit code path
• Query cache read/write conflicts
• Dictionary system mutex

Customer feedback:
• Binlog and distributed transactions
• Lock compression
• Read-ahead

Other:
• Failover time reductions
• Malloc reduction
• System call reductions
• Undo slot caching patterns
• Cooperative log apply
  18. Availability. “Performance only matters if your database is up.”
  19. Storage node availability.
• Quorum system for read/write; latency tolerant
• Peer-to-peer gossip replication to fill in holes
• Continuous backup to S3 (designed for 11 9s durability)
• Continuous scrubbing of data blocks
• Continuous monitoring of nodes and disks for repair
• 10 GB segments as the unit of repair or hotspot rebalance, to quickly rebalance load
• Quorum membership changes do not stall writes

(Diagram: storage nodes spread across AZ 1, AZ 2, and AZ 3, with continuous backup to Amazon S3.)
  20. Instant crash recovery.

Traditional databases:
• Have to replay logs since the last checkpoint
• Typically 5 minutes between checkpoints
• Single-threaded in MySQL; requires a large number of disk accesses
• A crash at T0 requires reapplication of the SQL in the redo log since the last checkpointed data

Amazon Aurora:
• Underlying storage replays redo records on demand as part of a disk read
• Parallel, distributed, asynchronous
• No replay for startup
• A crash at T0 results in redo logs being applied to each segment on demand, in parallel, asynchronously
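A toy sketch of the on-demand replay idea: a storage-side read materializes the latest version of just the requested block by applying whatever redo has accumulated for it, so the database never runs a recovery pass at startup. The data structures are invented for illustration.

    # Toy model: reads apply pending redo for just the requested block, on demand,
    # instead of replaying the whole log at database startup.
    blocks = {1: "block1@checkpoint", 2: "block2@checkpoint"}
    pending_redo = {1: ["lsn20:upd", "lsn34:upd"], 2: []}   # per-block redo since checkpoint

    def read_block(block_id: int) -> str:
        redo = pending_redo.get(block_id, [])
        if redo:                                   # coalesce lazily, only when asked
            blocks[block_id] += "|" + "|".join(redo)
            pending_redo[block_id] = []
        return blocks[block_id]

    print(read_block(1))   # redo applied now, as part of this read
    print(read_block(2))   # nothing pending; served immediately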
  21. Survivable caches.
• We moved the cache out of the database process
• Cache remains warm in the event of a database restart
• Lets you resume fully loaded operations much faster
• Instant crash recovery + survivable cache = quick and easy recovery from DB failures

(Diagram: SQL, transactions, and caching layers; the caching process sits outside the DB process and remains warm across a database restart.)
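A minimal illustration of the separation, using a cache held by a process that outlives restarts of a "database" process. Here the cache is a multiprocessing.Manager dict purely to show the idea; it is not how Aurora implements its survivable cache.

    # Illustration: keep the cache in a process separate from the database process,
    # so restarting the database does not empty the cache.
    import multiprocessing as mp

    def db_process(cache) -> None:
        # Warm the cache on a cold start; reuse it unchanged after a restart.
        if "hot_page" not in cache:
            cache["hot_page"] = "expensive-to-load page image"
            print("db: cold start, loaded page into shared cache")
        else:
            print("db: restart, cache already warm:", cache["hot_page"])

    if __name__ == "__main__":
        with mp.Manager() as manager:          # the cache lives in the manager process
            cache = manager.dict()
            for attempt in ("first run", "after simulated crash/restart"):
                p = mp.Process(target=db_process, args=(cache,))
                p.start()
                p.join()                       # the db process exits; the cache survives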
  22. Faster, more predictable failover.

MySQL: app running → DB failure → failure detection → DNS propagation → recovery → recovery (15–20 sec).
Aurora with the MariaDB driver: app running → DB failure → failure detection → DNS propagation → recovery (3–20 sec).
  23. Simulate failures using SQL.

To cause the failure of a component at the database node:
    ALTER SYSTEM CRASH [{INSTANCE | DISPATCHER | NODE}]

To simulate the failure of disks:
    ALTER SYSTEM SIMULATE percent_failure DISK failure_type IN [DISK index | NODE index] FOR INTERVAL interval

To simulate the failure of networking:
    ALTER SYSTEM SIMULATE percent_failure NETWORK failure_type [TO {ALL | read_replica | availability_zone}] FOR INTERVAL interval
  24. Amazon Aurora team. work hard. have fun. make history.
  25. Thank you!
