AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)

  1. Deep Dive on Amazon Aurora. Anurag Gupta, VP, Database Services (Amazon EMR, Amazon Redshift, Amazon Aurora, Amazon Athena, AWS Glue). December 1, 2016. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Agenda • What is Aurora? • Review of Aurora performance • New performance enhancements • Review of Aurora availability • New availability enhancements • Other recent and upcoming feature enhancements
  3. What is Amazon Aurora? • Open source compatible relational database • Performance and availability of commercial databases • Simplicity and cost-effectiveness of open source databases
  4. Performance
  5. Scaling with instance sizes. Aurora scales with instance size for both read and write. [Charts: write performance and read performance by instance size; series: Aurora, MySQL 5.6, MySQL 5.7]
  6. Real-life data – gaming workload. Aurora vs. RDS MySQL on r3.4xlarge, Multi-AZ: Aurora is 3X faster.
  7. How did we achieve this? DO LESS WORK: • Do fewer I/Os • Minimize network packets • Cache prior results • Offload the database engine. BE MORE EFFICIENT: • Process asynchronously • Reduce latency path • Use lock-free data structures • Batch operations together. Databases are all about I/O. Network-attached storage is all about packets/second. High-throughput processing is all about context switches.
  8. I/O traffic in MySQL (MySQL with replica). [Diagram: primary instance in AZ 1 and replica instance in AZ 2, each writing to Amazon Elastic Block Store (EBS) and an EBS mirror, plus Amazon S3; steps numbered 1–5.] Types of write: log, binlog, data, double-write, FRM files. I/O flow: • Issue write to EBS – EBS issues to mirror, ack when both done • Stage write to standby instance through DRBD • Issue write to EBS on standby instance. Observations: • Steps 1, 3, 4 are sequential and synchronous • This amplifies both latency and jitter • Many types of writes for each user operation • Have to write data blocks twice to avoid torn writes. Performance: • 780K transactions • 7,388K I/Os per million txns (excludes mirroring, standby) • Average 7.4 I/Os per transaction. Benchmark: 30 minute SysBench write-only workload, 100GB dataset, RDS Multi-AZ, 30K PIOPS.
  9. I/O traffic in Aurora. [Diagram: primary instance in AZ 1 and replica instances in AZ 2 and AZ 3; async 4/6 quorum distributed writes to Amazon Aurora storage; Amazon S3.] Types of write: log, binlog, data, double-write, FRM files. I/O flow: • Boxcar redo log records – fully ordered by LSN • Shuffle to appropriate segments – partially ordered • Boxcar to storage nodes and issue writes. Observations: • Only write redo log records; all steps asynchronous • No data block writes (checkpoint, cache replacement) • 6X more log writes, but 9X less network traffic • Tolerant of network and storage outlier latency. Performance: • 27,378K transactions (35X more) • 950K I/Os per 1M txns, 6X amplification (7.7X less).
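The "35X more" and "7.7X less" figures follow from the numbers quoted on slides 8 and 9. A minimal arithmetic check, using only those quoted figures (the variable names are illustrative):

```python
# Figures quoted on slides 8 and 9 (30-minute SysBench write-only run).
mysql_txns_k = 780               # thousands of transactions, MySQL with replica
aurora_txns_k = 27_378           # thousands of transactions, Aurora
mysql_ios_per_m_txns_k = 7_388   # thousands of I/Os per million transactions
aurora_ios_per_m_txns_k = 950    # thousands of I/Os per million transactions

txn_ratio = aurora_txns_k / mysql_txns_k
io_ratio = mysql_ios_per_m_txns_k / aurora_ios_per_m_txns_k
mysql_ios_per_txn = mysql_ios_per_m_txns_k * 1_000 / 1_000_000

print(f"Aurora completed {txn_ratio:.1f}x as many transactions")
print(f"MySQL issued {io_ratio:.2f}x as many I/Os per transaction")
print(f"MySQL averaged {mysql_ios_per_txn:.1f} I/Os per transaction")
```

The ratios come out at roughly 35 and just under 7.8, in line with the rounded figures on the slide.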
  10. I/O traffic in Aurora (storage node). [Diagram: the primary instance sends log records to the storage node's incoming queue; inside the node: update queue, hot log, data blocks, and point-in-time snapshot, with sort/group, coalesce, GC, and scrub stages; ACK back to the primary; peer-to-peer gossip with peer storage nodes; backup to S3.] I/O flow: ① Receive record and add to in-memory queue ② Persist record and acknowledge ③ Organize records and identify gaps in log ④ Gossip with peers to fill in holes ⑤ Coalesce log records into new data block versions ⑥ Periodically stage log and new block versions to S3 ⑦ Periodically garbage collect old versions ⑧ Periodically validate CRC codes on blocks. Observations: • All steps are asynchronous • Only steps 1 and 2 are in foreground latency path • Input queue is 46X less than MySQL (unamplified, per node) • Favor latency-sensitive operations • Use disk space to buffer against spikes in activity
  11. I/O traffic in Aurora Replicas. MySQL read scaling [diagram: MySQL master (30% read, 70% write) and MySQL replica (30% new reads, 70% write), each with its own data volume; single-threaded binlog apply]: • Logical: ship SQL statements to replica • Write workload similar on both instances • Independent storage • Can result in data drift between master and replica. Amazon Aurora read scaling [diagram: Aurora master (30% read, 70% write) and Aurora replica (100% new reads) on shared Multi-AZ storage; page cache update]: • Physical: ship redo from master to replica • Replica shares storage; no writes performed • Cached pages have redo applied • Advance read view when all commits seen
  12. Real-life data – read replica latency. “In MySQL, we saw replica lag spike to almost 12 minutes which is almost absurd from an application’s perspective. With Aurora, the maximum read replica lag across 4 replicas never exceeded 20 ms.”
  13. Asynchronous group commits. Traditional approach: • Maintain a buffer of log records to write out to disk • Issue write when buffer full or time out waiting for writes • First writer has latency penalty when write rate is low. Amazon Aurora: • Request I/O with first write, fill buffer till write picked up • Individual write durable when 4 of 6 storage nodes ACK • Advance DB durable point up to earliest pending ACK. [Diagram: transactions T1…Tn issuing reads, writes, and commits over time; a commit queue holding pending commits (T1–T8) in LSN order; LSN growth; the durable LSN at the head node]
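The Aurora bullets on slide 13 (a write is durable at 4 of 6 ACKs, the durable point advances up to the earliest pending ACK, and commits wait in LSN order) can be illustrated with a toy model. This is a sketch under assumptions, not Aurora's implementation: the class `CommitTracker` and its methods are hypothetical names, and storage nodes are reduced to integer ids.

```python
import heapq


class CommitTracker:
    """Toy model of quorum-based durability tracking (not Aurora's actual code)."""

    def __init__(self, write_quorum=4):
        self.write_quorum = write_quorum  # "4 of 6" on the slide
        self.acks = {}          # LSN -> ids of storage nodes that acknowledged it
        self.durable_lsn = 0    # every write at or below this LSN has reached quorum
        self.commit_queue = []  # min-heap of (commit LSN, transaction name)

    def issue_write(self, lsn):
        """Register a log write that has been sent to the storage nodes."""
        self.acks[lsn] = set()

    def request_commit(self, txn, commit_lsn):
        """Queue a commit; it is acknowledged once durable_lsn reaches commit_lsn."""
        heapq.heappush(self.commit_queue, (commit_lsn, txn))

    def ack(self, lsn, node_id):
        """Record one storage-node ACK and return transactions that became durable."""
        if lsn in self.acks:
            self.acks[lsn].add(node_id)
        # Advance the durable point, stopping at the earliest write still short of quorum.
        for pending in sorted(self.acks):
            if len(self.acks[pending]) < self.write_quorum:
                break
            self.durable_lsn = pending
            del self.acks[pending]
        # Release queued commits in LSN order.
        released = []
        while self.commit_queue and self.commit_queue[0][0] <= self.durable_lsn:
            released.append(heapq.heappop(self.commit_queue)[1])
        return released
```

In this model a later write can reach quorum before an earlier one without releasing any commit: the durable point only moves once the earliest pending write has its four ACKs, at which point several queued commits may be released together.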
  14. Adaptive thread pool. MySQL thread model: • Standard MySQL – one thread per connection; doesn’t scale with connection count • MySQL EE – connections assigned to thread group; requires careful stall threshold tuning. Aurora thread model: • Re-entrant connections multiplexed to active threads • Kernel-space epoll() inserts into latch-free event queue • Dynamically sized thread pool • Gracefully handles 5000+ concurrent client sessions on r3.8xl. [Diagram: client connections feeding a latch-free task queue via epoll()]
  15. Aurora lock management. [Diagram: MySQL lock manager vs. Aurora lock manager handling concurrent scan, insert, and delete operations.] • Same locking semantics as MySQL • Concurrent access to lock chains • Multiple scanners allowed in an individual lock chain • Lock-free deadlock detection. Needed to support many concurrent sessions and high update throughput.
  16. New performance enhancements
  17. Cached read performance. • Catalog concurrency: improved data dictionary synchronization and cache eviction. • NUMA aware scheduler: Aurora scheduler is now NUMA aware; helps scale on multi-socket instances. • Read views: Aurora now uses a latch-free concurrent read-view algorithm to construct read views. [Chart: thousands of read requests/sec, 0–700, for MySQL 5.6, MySQL 5.7, Aurora 2015, Aurora 2016; 25% throughput gain. * R3.8xlarge instance, <1GB dataset using Sysbench]
  18. Non-cached read performance. • Smart scheduler: Aurora scheduler now dynamically assigns threads between I/O heavy and CPU heavy workloads. • Smart selector: Aurora reduces read latency by selecting the copy of data on a storage node with best performance. • Logical read ahead (LRA): we avoid read I/O waits by prefetching pages based on their order in the btree. [Chart: thousands of requests/sec, 0–120, for MySQL 5.6, MySQL 5.7, Aurora 2015, Aurora 2016; 10% throughput gain. * R3.8xlarge instance, 1TB dataset using Sysbench]
  19. Hot row contention. Highly contended workloads had high memory and CPU usage. • 1.9 (Nov) – lock compression (bitmap for hot locks) • 1.9 – replace spinlocks with blocking futex – up to 12x reduction in CPU, 3x improvement in throughput • December – use dynamic programming to release locks: from O(totalLocks * waitLocks) to O(totalLocks). Throughput on Percona TPC-C 100 improved 29x (from 1,452 txns/min to 42,181 txns/min). [Diagram: MySQL lock manager vs. Aurora lock manager handling concurrent scan, insert, and delete operations]
  20. Hot row contention. * Numbers are in tpmC, measured using release 1.10 on an R3.8xlarge, MySQL numbers using RDS and EBS with 30K PIOPS.
     Percona TPC-C – 10GB
                          MySQL 5.6   MySQL 5.7   Aurora   Improvement
     500 connections          6,093      25,289   73,955         2.92x
     5000 connections         1,671       2,592   42,181         16.3x
     Percona TPC-C – 100GB
                          MySQL 5.6   MySQL 5.7   Aurora   Improvement
     500 connections          3,231      11,868   70,663         5.95x
     5000 connections         5,575      13,005   30,221         2.32x
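The slide does not say which baseline the "Improvement" column uses. Dividing the Aurora figure by the MySQL 5.7 figure reproduces every value in the column, so that appears to be the baseline. A small check using the tpmC numbers from slide 20 (the dictionary layout is illustrative):

```python
# tpmC figures from slide 20: (MySQL 5.6, MySQL 5.7, Aurora, quoted improvement)
results = {
    ("10GB", 500): (6_093, 25_289, 73_955, 2.92),
    ("10GB", 5000): (1_671, 2_592, 42_181, 16.3),
    ("100GB", 500): (3_231, 11_868, 70_663, 5.95),
    ("100GB", 5000): (5_575, 13_005, 30_221, 2.32),
}

# Ratio of Aurora throughput to MySQL 5.7 throughput for each configuration.
ratios = {key: aurora / mysql57 for key, (_, mysql57, aurora, _) in results.items()}

for (dataset, connections), ratio in ratios.items():
    quoted = results[(dataset, connections)][3]
    print(f"{dataset}, {connections} connections: {ratio:.2f}x (slide quotes {quoted}x)")
```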
  21. Batch insert performance. • Accelerates batch inserts sorted by primary key – works by caching the cursor position in an index traversal. • Dynamically turns itself on or off based on data pattern. • Avoids contention in acquiring latches while navigating down the tree. • Bi-directional, works across all insert statements: LOAD INFILE, INSERT INTO SELECT, INSERT INTO REPLACE, and multi-value inserts. [Diagrams: index root with leaf records R0–R8.] MySQL: traverses B-tree starting from root for all inserts. Aurora: inserts avoid index traversal.
  22. Faster index build. • MySQL 5.6 leverages Linux read ahead – but this requires consecutive block addresses in the btree. It inserts entries top down into the new btree, causing splits and excessive logging. • Aurora’s scan pre-fetches blocks based on position in tree, not block address. • Aurora builds the leaf blocks and then the branches of the tree: no splits during the build; each page touched only once; one log record per page. 2-4X better than MySQL 5.6 or MySQL 5.7. [Chart: index build time in hours, 0–12, for RDS MySQL 5.6, RDS MySQL 5.7, and Aurora 2016 on r3.large with a 10GB dataset, r3.8xlarge with a 10GB dataset, and r3.8xlarge with a 100GB dataset]
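The "leaf blocks first, then the branches" idea on slide 22 is the general bottom-up bulk-loading technique for B-trees. The sketch below shows that general technique on plain Python lists; it is not Aurora's code, and the function name `build_bottom_up` and the `fanout` parameter are assumptions made for illustration.

```python
def build_bottom_up(sorted_keys, fanout=4):
    """Build index levels from already-sorted keys, leaves first.

    Returns a list of levels; each level is a list of pages and each page is a
    list of keys. Every page is written exactly once and never split.
    """
    if not sorted_keys:
        return []
    # Pack the sorted keys into full leaf pages.
    level = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    levels = [level]
    # Build each branch level from the first key of every page below it.
    while len(level) > 1:
        separators = [page[0] for page in level]
        level = [separators[i:i + fanout] for i in range(0, len(separators), fanout)]
        levels.append(level)
    return levels


levels = build_bottom_up(list(range(20)), fanout=4)
for depth, pages in enumerate(levels):
    print(f"level {depth}: {len(pages)} page(s)")
```

Because the input is sorted, pages fill left to right and no page is revisited, which is what makes the "no splits" and "each page touched only once" properties possible.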
  23. Why spatial index. Need to store and reason about spatial data: • E.g., “Find all people within 1 mile of a hospital” • Spatial data is multi-dimensional • B-Tree indexes are one-dimensional. Aurora supports spatial data types (point/polygon): • GEOMETRY data types inherited from MySQL 5.6 • This spatial data cannot be indexed. Two possible approaches: • Specialized access method for spatial data (e.g., R-Tree) • Map spatial objects to one-dimensional space & store in B-Tree - space-filling curve using a grid approximation. [Diagram of topological relations between shapes A and B: A COVERS B / B COVEREDBY A; A CONTAINS B / B INSIDE A; A TOUCH B / B TOUCH A; A OVERLAPBDYINTERSECT B / B OVERLAPBDYINTERSECT A; A OVERLAPBDYDISJOINT B / B OVERLAPBDYDISJOINT A; A EQUAL B / B EQUAL A; A DISJOINT B / B DISJOINT A; A COVERS B / B ON A]
  24. Spatial indexes in Aurora. R-Tree used in MySQL 5.7. Challenges with R-Trees: • Keeping it efficient while balanced • Rectangles should not overlap or cover empty space • Degenerates over time • Re-indexing is expensive. Z-index used in Aurora. Z-index (dimensionally ordered space filling curve): • Uses regular B-Tree for storing and indexing • Removes sensitivity to resolution parameter • Adapts to granularity of actual data without user declaration • E.g., GeoWave (National Geospatial-Intelligence Agency)
  25. Spatial index benchmarks. Sysbench – points and polygons. [Chart: select-only (reads/sec) and write-only (writes/sec), 0–140,000, Aurora vs. MySQL 5.7.] * r3.8xlarge using Sysbench on <1GB dataset * Write only: 4000 clients; select only: 2000 clients; ST_EQUALS
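A Z-order (Morton) curve maps a two-dimensional grid cell to a single integer by interleaving the bits of its coordinates, which is what lets an ordinary B-Tree store the result. The sketch below shows the generic bit-interleaving step only; the function name `z_value` and the 16-bit grid resolution are assumptions, and the slides do not describe how Aurora chooses its grid.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key.

    Bit i of x lands at position 2*i and bit i of y at position 2*i + 1.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


# Sorting grid cells by their Z-order key gives the order in which a
# one-dimensional index would store them.
cells = [(x, y) for y in range(2) for x in range(2)]
print(sorted(cells, key=lambda cell: z_value(*cell)))
```

Cells that are close on the grid tend to receive nearby keys, so a range scan over the one-dimensional key covers a compact region of space, at the cost of some false positives that must be filtered afterwards.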
  26. Availability “Performance only matters if your database is up”
  27. Storage durability. • Storage volume automatically grows up to 64 TB • Quorum system for read/write; latency tolerant • Peer to peer gossip replication to fill in holes • Continuous backup to S3 (built for 11 9s durability) • Continuous monitoring of nodes and disks for repair • 10 GB segments as unit of repair or hotspot rebalance • Quorum membership changes do not stall writes. [Diagram: storage spread across AZ 1, AZ 2, and AZ 3, with Amazon S3]
  28. Aurora Replicas. • Aurora clusters contain a primary node and up to fifteen replicas • Failing database nodes are automatically detected and replaced • Failing database processes are automatically detected and recycled • Customer applications may scale out read traffic across replicas • Replicas are automatically promoted on persistent outage. [Diagram: primary and secondary nodes across AZ 1, AZ 2, and AZ 3]
  29. Continuous backup. • Take periodic snapshot of each segment in parallel; stream the redo logs to Amazon S3 • Backup happens continuously without performance or availability impact • At restore, retrieve the appropriate segment snapshots and log streams to storage nodes • Apply log streams to segment snapshots in parallel and asynchronously. [Diagram: segments 1–3 over time, showing segment snapshots, log records, and the recovery point]
  30. Instant crash recovery. Traditional databases: • Have to replay logs since the last checkpoint • Typically 5 minutes between checkpoints • Single-threaded in MySQL; requires a large number of disk accesses • Crash at T0 requires a re-application of the SQL in the redo log since last checkpoint. Amazon Aurora: • Underlying storage replays redo records on demand as part of a disk read • Parallel, distributed, asynchronous • No replay for startup • Crash at T0 will result in redo logs being applied to each segment on demand, in parallel, asynchronously. [Diagrams: checkpointed data and redo log, with the crash at T0]
  31. Survivable caches. • We moved the cache out of the database process • Cache remains warm in the event of database restart • Lets you resume fully loaded operations much faster • Instant crash recovery + survivable cache = quick and easy recovery from DB failures. [Diagram: SQL, transactions, and caching layers; the caching process is outside the DB process and remains warm across a database restart]
  32. Faster failover. [Timelines. MySQL: DB failure, failure detection, DNS propagation, recovery (two recovery phases), app running. Aurora with MariaDB driver: DB failure, failure detection, DNS propagation, recovery, app running. Durations marked on the slide: 15–20 sec and 3–20 sec.]
  33. Database failover time. [Four histograms of the share of fail-overs by duration, x-axis 1–35.] • 0–5s: 30% of fail-overs • 5–10s: 40% of fail-overs • 10–20s: 25% of fail-overs • 20–30s: 5% of fail-overs
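The four buckets on slide 33 can be turned into a cumulative view with a short calculation. The sketch below uses only the percentages quoted on the slide; the bucket labels are written out by hand.

```python
from itertools import accumulate

# Share of fail-overs per duration bucket, as quoted on slide 33.
buckets = [("<= 5s", 30), ("<= 10s", 40), ("<= 20s", 25), ("<= 30s", 5)]

# Running total of the shares, one entry per bucket upper bound.
cumulative = list(accumulate(share for _, share in buckets))

for (label, _), total in zip(buckets, cumulative):
    print(f"{label}: {total}% of fail-overs")
```

Read this way, 70% of the fail-overs in the sample completed within 10 seconds and 95% within 20 seconds.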
  34. New availability enhancements
  35. Availability is about more than HW failures. You also incur availability disruptions when you: 1. Patch your database software 2. Modify your database schema 3. Perform large scale database reorganizations 4. Restore a database after a user error
  36. Zero downtime patching. Before ZDP: user sessions terminate during patching. With ZDP: user sessions remain active through patching. [Diagrams: application state and networking state moving from the old DB engine to the new DB engine on top of the storage service]
  37. Zero downtime patching – current constraints. We have to fall back to our current patching model when we can’t park connections: • Long running queries • Open transactions • Bin-log enabled • Parameter changes pending • Temporary tables open • Locked tables • SSL connections open • Read replica instances. We are working on addressing the above.
  38. Database cloning. Create a copy of a database without duplicate storage costs: • Creation of a clone is nearly instantaneous – we don’t copy data • Data copy happens only on write – when original and cloned volume data differ. Typical use cases: • Clone a production DB to run tests • Reorganize a database • Save a point-in-time snapshot for analysis without impacting production system. [Diagram: a production database serving production applications, with clones serving dev/test applications, benchmarks, and production applications]
  39. How does it work? Both databases reference the same pages on the shared distributed storage system. [Diagram: source database and cloned database each listing pages 1–4, both pointing at physical pages 1–4 in the shared distributed storage system]
  40. How does it work? (contd.) As databases diverge, new pages are added appropriately to each database while still referencing pages common to both databases. [Diagram: source database with pages 1–5 and cloned database with pages 1–6, over physical pages in the shared distributed storage system]
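The behavior on slides 39 and 40 is a copy-on-write scheme: a clone starts by pointing at the same physical pages, and only writes add new ones. The following is a toy model of that idea, not Aurora's storage layer; `PageStore` and `Database` are hypothetical names, and every write is modeled as allocating a fresh physical page.

```python
class PageStore:
    """Shared storage holding physical pages keyed by an internal id."""

    def __init__(self):
        self.pages = {}
        self._next_id = 0

    def allocate(self, data):
        """Store a new physical page and return its id."""
        page_id = self._next_id
        self._next_id += 1
        self.pages[page_id] = data
        return page_id


class Database:
    """A database is just a table mapping logical page numbers to physical pages."""

    def __init__(self, store, page_table=None):
        self.store = store
        self.page_table = dict(page_table or {})

    def write(self, page_no, data):
        # Writes never modify a shared page in place; they point to a new one.
        self.page_table[page_no] = self.store.allocate(data)

    def read(self, page_no):
        return self.store.pages[self.page_table[page_no]]

    def clone(self):
        # Cloning copies only the mapping, not the page contents.
        return Database(self.store, self.page_table)
```

Creating a clone copies a small mapping regardless of how much data the pages hold, which is consistent with the "nearly instantaneous" claim; storage grows only as the two databases diverge.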
  41. Online DDL: Aurora vs. MySQL. MySQL: • Full table copy; rebuilds all indexes – can take hours or days to complete. • Needs temporary space for DML operations • DDL operation impacts DML throughput • Table lock applied to apply DML changes. Amazon Aurora: • We add an entry to the metadata table and use schema versioning to decode the block. • Added a modify-on-write primitive to upgrade the block to the latest schema when it is modified. • Currently support add NULLable column at end of table. • Priority is to support other add column, drop/reorder, modify datatypes. [Diagram: index root and leaf pages.] Metadata table:
     table name   operation   column-name   time-stamp
     Table 1      add-col     column-abc    t1
     Table 2      add-col     column-qpr    t2
     Table 3      add-col     column-xyz    t3
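The Aurora side of slide 41 combines schema versioning on read with a modify-on-write upgrade. A minimal sketch of how those two pieces could fit together, assuming a block simply remembers the schema version it was written under; `Block`, `Table`, and their methods are hypothetical names, not Aurora internals.

```python
from dataclasses import dataclass


@dataclass
class Block:
    schema_version: int  # schema version the rows were written under
    rows: list           # list of row tuples


class Table:
    def __init__(self, columns):
        self.schemas = [list(columns)]  # index = schema version
        self.blocks = []

    def add_nullable_column(self, name):
        # Metadata-only change: no block is rewritten.
        self.schemas.append(self.schemas[-1] + [name])

    def insert_block(self, rows):
        version = len(self.schemas) - 1
        self.blocks.append(Block(version, [tuple(row) for row in rows]))

    def read(self, block_no):
        # Decode with the latest schema: columns added later read as NULL (None).
        width = len(self.schemas[-1])
        return [row + (None,) * (width - len(row)) for row in self.blocks[block_no].rows]

    def update(self, block_no, row_no, values):
        # Modify-on-write: upgrade the whole block to the latest schema first.
        block = self.blocks[block_no]
        block.rows = self.read(block_no)
        block.schema_version = len(self.schemas) - 1
        block.rows[row_no] = tuple(values)
```

In this model adding a column costs one metadata append regardless of table size, while the cost of rewriting blocks is spread across later updates.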
  42. Online DDL performance.
     On r3.large
                    Aurora     MySQL 5.6    MySQL 5.7
     10GB table     0.27 sec   3,960 sec    1,600 sec
     50GB table     0.25 sec   23,400 sec   5,040 sec
     100GB table    0.26 sec   53,460 sec   9,720 sec
     On r3.8xlarge
                    Aurora     MySQL 5.6    MySQL 5.7
     10GB table     0.06 sec   900 sec      1,080 sec
     50GB table     0.08 sec   4,680 sec    5,040 sec
     100GB table    0.15 sec   14,400 sec   9,720 sec
  43. Online point-in-time restore. Online point-in-time restore is a quick way to bring the database to a particular point in time without having to restore from backups. • Rewind the database to quickly recover from unintentional DML/DDL operations. • Rewind multiple times to determine the desired point in time in the database state. For example, quickly iterate over schema changes without having to restore multiple times. [Diagram: timeline t0–t4 with “Rewind to t1” and “Rewind to t3”; the rewound portions are marked invisible]
  44. Online vs. offline point-in-time restore (PiTR). Online PiTR: • Online PiTR operation changes the state of the current DB • Current DB is available within seconds, even for multi-terabyte DBs • No additional storage cost as current DB is restored to prior point in time • Multiple iterative online PiTRs are practical • Rewind has to be within the allowed rewind period based on purchased rewind storage • Cross-region online PiTR is not supported. Offline PiTR: • PiTR creates a new DB at desired point in time from the backup of the current DB • New DB instance takes hours to restore for multi-terabyte DBs • Each restored DB is billed for its own storage • Multiple iterative offline PiTRs are time consuming • Offline PiTR has to be within the configured backup window or from snapshots • Aurora supports cross-region PiTR
  45. How does it work? • Aurora takes periodic snapshots within each segment in parallel and stores them locally • At rewind time, each segment picks the previous local snapshot and applies the log streams to the snapshot to produce the desired state of the DB. [Diagram: storage segments 1–3 over time, showing segment snapshots, log records, and the rewind point]
  46. How does it work? (contd.) Logs within the log stream are made visible or invisible based on the branch within the LSN tree, providing a consistent view for the DB. • The first rewind performed at t2 to rewind the DB to t1 makes the logs in purple color invisible • The second rewind performed at time t4 to rewind the DB to t3 makes the logs in red and purple invisible. [Diagram: timeline t0–t4 with “Rewind to t1” and “Rewind to t3”; the rewound portions are marked invisible]
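The visible/invisible idea on slide 46 can be illustrated with a toy log in which a rewind hides a range of records instead of deleting them. This is a simplified sketch: it uses integer timestamps in place of LSNs and a flat list of hidden ranges in place of the LSN tree, and `RewindableLog` is a hypothetical name.

```python
class RewindableLog:
    """Toy log where rewinds mark ranges of records invisible."""

    def __init__(self):
        self.records = []    # (timestamp, payload), appended in time order
        self.invisible = []  # (target, now] ranges hidden by rewinds

    def append(self, timestamp, payload):
        self.records.append((timestamp, payload))

    def rewind(self, now, target):
        """Hide everything written after `target` up to and including `now`."""
        if target >= now:
            raise ValueError("rewind target must be earlier than the current time")
        self.invisible.append((target, now))

    def visible_records(self):
        return [
            payload
            for timestamp, payload in self.records
            if not any(lo < timestamp <= hi for lo, hi in self.invisible)
        ]
```

Replaying the scenario from the slide, a rewind at t2 back to t1 followed by a rewind at t4 back to t3 leaves only the records written up to t1 and between t2 and t3 visible.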
  47. Removing blockers
  48. My applications require PostgreSQL. Amazon Aurora PostgreSQL compatibility now in preview. • Same underlying scale-out, 3 AZ, 6 copy, fault tolerant, self healing, expanding database-optimized storage tier • Integrated with a PostgreSQL 9.6 compatible database. Session DAT206-R today @3:30 - Venetian, Level 3, San Polo 3403. [Diagram: SQL, transactions, and caching layers over logging + storage, with Amazon S3]
  49. An R3.Large is too expensive for my use case. T2 RI discounts: up to 34% with a 1-year RI; up to 57% with a 3-year RI. T2.Small coming in Q1 2017. * Prices are for Virginia.
     Instance         vCPU   Mem      Hourly price
     db.t2.medium     2      4        $0.082
     db.r3.large      2      15.25    $0.29
     db.r3.xlarge     4      30.5     $0.58
     db.r3.2xlarge    8      61       $1.16
     db.r3.4xlarge    16     122      $2.32
     db.r3.8xlarge    32     244      $4.64
  50. My databases need to meet certifications. • Amazon Aurora gives each database instance IP firewall protection • Aurora offers transparent encryption at rest and SSL protection for data in transit • Amazon VPC lets you isolate and control network configuration and connect securely to your IT infrastructure • AWS Identity and Access Management (IAM) provides resource-level permission controls. [Two items are marked *New* on the slide.]
  51. Aurora Auditing. MariaDB server_audit plugin [diagram: DDL, DML, query, DCL, and connect events each create an event string, then a single write to file]. Aurora native audit support [diagram: each event type creates its event string, which passes through a latch-free queue to multiple file writers]. Aurora can sustain over 500K events/sec. Sysbench select-only workload on 8xlarge instance:
                  MySQL 5.7   Aurora   Improvement
     Audit Off    95K         615K     6.47x
     Audit On     33K         525K     15.9x
  52. AWS ecosystem. • Lambda: generate AWS Lambda events from Aurora stored procedures. • S3: load data from Amazon S3, store snapshots and backups in S3. • IAM: use AWS IAM roles to manage database access control. • CloudWatch: upload systems metrics and audit logs to Amazon CloudWatch. [Slide markers: *NEW*, Q1]
  53. MySQL compatibility. “We ran our compatibility test suites against Amazon Aurora and everything just worked.” - Dan Jewett, VP, Product Management at Tableau. MySQL 5.6 / InnoDB compatible: • No application compatibility issues reported since launch • MySQL ISV applications run pretty much as is. Working on 5.7 compatibility: • Running a bit slower than expected • Back ported 81 fixes from different MySQL releases. [Partner logo groups: Business Intelligence, Data Integration, Query and Monitoring]
  54. Timeline. Columns: Available now (1.9), Available in Dec (1.10), Available in Q1. Rows: Performance, Availability, Security, Ecosystem. Items: • PCI/DSS • HIPAA/BAA • Fast online schema change • Managed MySQL to Aurora replication • Cross-region snapshot copy • Online Point in Time Restore • Database cloning • Zero-downtime patching • Spatial indexing • Lock compression • Replace spinlocks with blocking futex • Faster index build • Aurora auditing • IAM Integration • Copy-on-write volume • T2.Medium • T2.Small • CloudWatch for metrics, audit
  55. Thank you! We are collecting feedback forms in the back. There is also a pile of temporary tattoos there that you can put on before the re:Play party. Two sheets in each one, so you can share with a friend. Have fun!