
Amazon Aurora Storage Demystified: How It All Works (DAT363) - AWS re:Invent 2018


Amazon Aurora is a high performance, highly scalable database service with MySQL- and PostgreSQL-compatibility. One of its key components is an innovative storage system that is optimized for database workloads and specifically designed to take advantage of modern cloud technology. Hear from the team that built Amazon Aurora's storage system on how the system is designed, how it works, and what you need to know to get the most out of it.



  1. Amazon Aurora Storage Demystified: How It All Works. Murali Brahmadesam, Director of Engineering, Aurora Storage, Amazon Web Services. DAT363. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Agenda • Amazon Aurora – brief overview • Quick recap: database internals • Cloud-native database architecture • The log is the database • Durability at scale • Distributed commits • Read scalability • Performance results. Recent features: • Global Databases – announced @ re:Invent 2018 • Fast Database Cloning • Database Backtrack
  3. Amazon Aurora… • Speed and availability of high-end commercial databases • Simplicity and cost-effectiveness of open-source databases • Drop-in compatibility with MySQL and PostgreSQL • Simple pay-as-you-go pricing • Delivered as a managed service • Enterprise database at an open-source price
  4. Aurora customer adoption: the fastest-growing service in AWS history; Aurora is used by ¾ of the top 100 AWS customers.
  6. Quick recap: database internals • Data is organized as fixed-size pages, kept in main memory in the "buffer pool," and persisted to durable storage by periodic "checkpoints" • Data is modified in place in the buffer pool using the DO-UNDO-REDO protocol, with before-images and after-images stored in the write-ahead log (WAL)
  7. Quick recap: database internals, continued — the DO-UNDO-REDO protocol (UNDO and REDO phases, illustrated)
  8. Quick recap: database internals, continued — recovery. A checkpoint is taken, then a system failure occurs. • Tx1 can be ignored, as the checkpoint was taken after its commit • Tx2 and Tx3 are redone using the REDO procedure • Tx4 is undone using the UNDO procedure
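The recovery rule above can be sketched in a few lines. This is an illustrative sketch of the checkpoint-relative classification, not MySQL or Aurora code; all names are assumptions.

```python
def recovery_actions(transactions, checkpoint_time, crash_time):
    """For each (name, commit_time) pair -- commit_time is None if the
    transaction was still open at the crash -- decide what recovery does."""
    actions = {}
    for name, commit_time in transactions:
        if commit_time is not None and commit_time <= checkpoint_time:
            actions[name] = "ignore"   # already durable in the checkpoint
        elif commit_time is not None and commit_time <= crash_time:
            actions[name] = "redo"     # committed after the checkpoint: replay the WAL
        else:
            actions[name] = "undo"     # uncommitted at crash time: roll back
    return actions

# The four transactions from the slide: checkpoint at t=50, crash at t=100
txs = [("Tx1", 10), ("Tx2", 60), ("Tx3", 80), ("Tx4", None)]
print(recovery_actions(txs, checkpoint_time=50, crash_time=100))
# {'Tx1': 'ignore', 'Tx2': 'redo', 'Tx3': 'redo', 'Tx4': 'undo'}
```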
  10. Traditional database architecture: compute (SQL, transactions, caching, logging) with attached storage. Databases are all about I/O. Design principles for 40+ years: increase I/O bandwidth, decrease the number of I/Os.
  11. Database in the cloud: compute and storage have different lifetimes. Compute instances • fail and are replaced • are shut down to save cost • are scaled up/down/out based on load. Storage, on the other hand, has to be long-lived. Decouple compute and storage for scalability, availability, and durability (network-attached storage).
  12. Aurora uses a service-oriented architecture. The logging and storage layer moved into a multitenant, purpose-built, log-structured distributed storage system designed for databases; SQL, transactions, and caching remain on the database instance. Leveraged existing AWS services: Amazon Elastic Compute Cloud (Amazon EC2), Amazon Virtual Private Cloud (Amazon VPC), Amazon DynamoDB, Amazon Simple Workflow Service (Amazon SWF), Amazon Route 53, Amazon Simple Storage Service (Amazon S3), and others.
  14. Offload redo processing to distributed storage • The database instance writes redo log records to storage • The database instance reads pages on demand from storage, which uses the same redo log applicator as the database instance to produce the correct page image • Storage materializes database pages in the background • Storage performs continuous backups of redo logs and pages without impacting the database instance • Storage is replicated for durability and availability
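The "same redo log applicator" idea above can be sketched as follows: a read materializes the current page image by applying the page's pending redo records, in LSN order, on top of the last materialized image. This is a toy model with assumed names, not Aurora's actual applicator.

```python
def materialize(base_image, redo_records):
    """Apply redo records in LSN order to produce the current page image.

    base_image:   dict representing the last materialized page version
    redo_records: list of (lsn, (key, new_value)) changes for this page
    """
    page = dict(base_image)
    for lsn, (key, value) in sorted(redo_records):
        page[key] = value
    return page

# Two redo records arrived out of order; sorting by LSN gives the right image
base = {"balance": 100}
redo = [(12, ("balance", 80)), (11, ("balance", 90))]
print(materialize(base, redo))   # {'balance': 80}
```

The same function serves both the foreground path (a read that needs a page now) and the background path (periodic coalescing), which is the point of sharing one applicator.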
  15. I/O flow in an Amazon Aurora storage node: ① Receive log records and add them to the in-memory queue ② Durably persist log records and ACK ③ Organize records and identify gaps in the log ④ Gossip with peers to fill in holes ⑤ Coalesce log records into new page versions ⑥ Periodically stage the log and new page versions to S3 ⑦ Periodically garbage-collect old versions ⑧ Periodically validate CRC codes on blocks. Note: all steps are asynchronous; only steps 1 and 2 are in the foreground latency path.
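Steps 1–4 of this flow can be sketched as a toy storage node: queue and persist incoming records, ACK immediately (the only foreground work), then sort the persisted LSNs and report holes to be filled by gossip. All names are illustrative assumptions, not Aurora's implementation.

```python
class StorageNode:
    def __init__(self):
        self.incoming = []      # step 1: in-memory queue
        self.hot_log = {}       # step 2: durably persisted records, keyed by LSN

    def receive(self, lsn, record):
        self.incoming.append((lsn, record))      # step 1: enqueue
        self.hot_log[lsn] = record               # step 2: stand-in for a durable write
        return ("ACK", lsn)                      # only steps 1-2 are in the latency path

    def gaps(self):
        """Step 3: sort persisted LSNs and list the holes to request
        from peers via gossip (step 4)."""
        lsns = sorted(self.hot_log)
        missing = []
        for a, b in zip(lsns, lsns[1:]):
            missing.extend(range(a + 1, b))
        return missing

node = StorageNode()
for lsn in (1, 2, 3, 6, 7):      # records 4 and 5 never arrived
    node.receive(lsn, f"redo-{lsn}")
print(node.gaps())               # [4, 5] -> ask peers for these
```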
  17. Uncorrelated and independent failures: at scale there are continuous independent failures due to failing nodes, disks, and switches — either hard failures or regular maintenance. The solution is to replicate storage for resilience. One common strawman: replicate 3 ways with 1 copy per AZ, using write and read quorums of 2/3.
  18. What about an AZ failure? With 1 copy per AZ, it boils down to losing 1 node: still have 2/3 nodes → can establish quorum → no data loss.
  19. What about AZ+1 failures? Losing 1 node in an AZ while another AZ is down: lose 2/3 nodes → lose quorum → lose data.
  20. Aurora tolerates AZ+1 failures: replicate 6 ways with 2 copies per AZ. Write quorum of 4/6; read quorum of 3/6 (only for repair). What if there is an AZ failure? Still have 4/6 nodes → maintain write availability. What if there is an AZ+1 failure? Still have 3 nodes (the read/repair quorum) → no data loss; rebuild the failed node by copying from one of the other 3 and recover write availability.
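The quorum arithmetic on this slide is simple enough to check directly. A small sketch (illustrative, not production code) of the 2-copies-per-AZ layout with a 4/6 write quorum and a 3/6 read/repair quorum:

```python
WRITE_QUORUM, READ_QUORUM, COPIES = 4, 3, 6   # 2 copies in each of 3 AZs

def surviving(failed_azs, extra_failed_nodes=0):
    """Copies left after losing whole AZs (2 copies each) plus extra nodes."""
    return COPIES - 2 * failed_azs - extra_failed_nodes

def can_write(alive): return alive >= WRITE_QUORUM
def can_read(alive):  return alive >= READ_QUORUM

az_down = surviving(failed_azs=1)                            # 4 copies left
az_plus_one = surviving(failed_azs=1, extra_failed_nodes=1)  # 3 copies left

print(can_write(az_down), can_read(az_down))          # True True: AZ loss is fine
print(can_write(az_plus_one), can_read(az_plus_one))  # False True: no data loss,
                                                      # repair then recover writes
```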
  21. Is a 4/6 quorum sufficient for handling AZ+1 failures? It depends on repairing a failed node before AZ+1 becomes AZ+2 (a double fault). The probability of AZ+2 within the repair interval is a function of MTTF. MTTF — and therefore p(AZ+2) — can only be reduced so much; instead, reduce the repair interval (MTTR).
  22. Segmented storage • Partition the volume into n fixed-size segments, replicating each segment 6 ways into a protection group (PG); a single PG failing is enough to fail the entire volume • Trade-off between the likelihood of faults and time to repair: if segments are too small, failures are more likely; if segments are too big, repairs take too long • Choose the biggest size that lets us repair "fast enough" — currently a segment size of 10 GB, as a 10 GB segment can be repaired in ~10 seconds on a 10 Gbps link.
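The "~10 seconds" figure follows directly from the numbers on the slide; a quick back-of-envelope check (ignoring protocol overhead and assuming the link is fully available for the copy):

```python
segment_bytes = 10 * 1024**3      # 10 GiB segment
link_bits_per_s = 10 * 10**9      # 10 Gbps link

repair_seconds = segment_bytes * 8 / link_bits_per_s
print(round(repair_seconds, 1))   # ~8.6 s raw transfer, i.e. on the order of 10 s
```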
  23. Resilience and operations: leverage failure tolerance for maintenance operations — heat management, OS and security patching, software upgrades to the storage fleet. Execute upgrades one AZ at a time; at most one member of a PG is patched at a time.
  24. Fast and reversible membership changes. Use quorum sets and epochs to: enable quicker transitions with epoch advances; create richer temporary quorums during a change; reverse changes with further quorum transitions. Membership updates occur without consensus and only go forward. Epoch 1: all nodes (A–F) healthy. Epoch 2: node F is in a suspect state; a second quorum group is formed with node G; both quorums are active. Epoch 3: node F is confirmed unhealthy; the new quorum group with node G is active.
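The three-epoch transition can be sketched as follows: during the change, two quorum sets are active and a write must satisfy both, so the change can be finalized (or reversed) just by advancing the epoch. The class structure is an assumption for illustration, not Aurora's membership protocol.

```python
class Membership:
    def __init__(self, members):
        self.epoch = 1
        self.quorum_sets = [set(members)]     # one or two active quorum sets

    def begin_change(self, suspect, replacement):
        """Epoch 2: add a second quorum set with the replacement node;
        both sets stay active, so the change is still reversible."""
        new_set = (self.quorum_sets[0] - {suspect}) | {replacement}
        self.quorum_sets.append(new_set)
        self.epoch += 1

    def finalize(self):
        """Epoch 3: keep only the new set (dropping it instead would reverse)."""
        self.quorum_sets = [self.quorum_sets[-1]]
        self.epoch += 1

    def write_ok(self, acked, write_quorum=4):
        """A write succeeds only if every active quorum set is satisfied."""
        return all(len(acked & qs) >= write_quorum for qs in self.quorum_sets)

m = Membership("ABCDEF")
m.begin_change(suspect="F", replacement="G")
print(m.write_ok(set("ABCD")))    # True: 4 acks satisfy both ABCDEF and ABCDEG
print(m.write_ok(set("ABCG")))    # False: only 3 members of ABCDEF acked
m.finalize()
print(m.write_ok(set("ABCG")))    # True: only ABCDEG remains active
```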
  25. Continuous backup, snapshots • Take periodic snapshots of each PG in parallel; stream the redo logs to Amazon S3 • Backup happens continuously without performance or availability impact • A volume-level snapshot is an O(1) operation: a marker in the continuous backup stream • At restore, retrieve the appropriate PG snapshots and log streams to storage nodes and apply the log streams in parallel, asynchronously • Volume-level snapshots can be incrementally copied cross-region.
  27. Aurora: asynchronous commit processing. Storage nodes establish compact consistency points that: increase monotonically; are continuously returned to the database; do not vote on accepting a write; execute idempotent operations on local state. Database nodes handle locking, transactions, deadlocks, constraints, etc.
  28. Backward chaining of redo log records. Each redo log record includes backlink LSNs: 1. Page backlink: the previous log record for the modified page — used to materialize blocks on demand and in the background 2. Segment backlink: the previous log record in the segment — used to identify records not received by the storage node 3. Volume backlink: the previous log record in the volume — used to regenerate metadata as a fallback path.
  29. Storage consistency points. Segment Complete LSN (SCL): the low-water mark below which all log records have been received; maintained for each segment at each storage node using segment backlinks; used to identify "holes" (missing writes) when gossiping with peers; sent to the DB with each write acknowledgement. Protection Group Complete LSN (PGCL): can advance once the DB sees the SCL advance at 4/6 segments. Volume Complete LSN (VCL): can advance once the DB sees the PGCL advance at all PGs. Consistency Point LSN (CPL): the final log record of a mini-transaction (MTR) is flagged as a CPL. Volume Durable LSN (VDL): the highest CPL that is smaller than or equal to the VCL.
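These definitions compose mechanically, so they can be sketched as three small functions. The LSN values are made up for illustration; the logic follows the definitions on the slide.

```python
def pgcl(scls, write_quorum=4):
    """PGCL: the highest LSN such that at least 4 of the 6 segments have
    SCL >= it, i.e. the 4th-highest SCL in the protection group."""
    return sorted(scls, reverse=True)[write_quorum - 1]

def vcl(pgcls):
    """VCL: every protection group must be complete up to this point."""
    return min(pgcls)

def vdl(cpls, vcl_value):
    """VDL: the highest mini-transaction boundary (CPL) at or below the VCL."""
    return max(c for c in cpls if c <= vcl_value)

pg1 = [95, 100, 100, 98, 90, 100]   # SCLs of the six segments in PG 1
pg2 = [99, 97, 99, 99, 99, 96]      # SCLs of the six segments in PG 2

v = vcl([pgcl(pg1), pgcl(pg2)])     # pgcl(pg1)=98, pgcl(pg2)=99 -> VCL=98
print(v, vdl([80, 96, 101], v))     # 98 96: CPL 101 is beyond the VCL
```

A commit at LSN L can then be acknowledged once VDL >= L, which is exactly the condition on the next slide.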
  30. Commits in Aurora: no flush and no consensus is required. A commit completes when the DB can prove that all of its changes have met quorum — by ensuring that VDL >= commit LSN — and commits are acknowledged asynchronously, for multiple transactions at a time.
  31. Crash recovery. At the crash, the log has gaps above the Volume Complete LSN (VCL). Immediately after crash recovery, logs above the Volume Durable LSN (VDL) — including the ragged edge of logs that did not meet quorum — are truncated.
  32. Instant crash recovery. Traditional database: must replay logs since the last checkpoint (typically 5 minutes between checkpoints); replay is single-threaded in MySQL and requires a large number of disk accesses — a crash at T0 requires re-application of the redo log since the last checkpoint. Aurora: the underlying storage replays redo records on demand as part of a disk read — parallel, distributed, and asynchronous, with no replay needed at startup; a crash at T0 results in redo logs being applied to each segment on demand, in parallel, asynchronously.
  34. Read replicas and custom reader endpoints • Up to 15 promotable read replicas across multiple Availability Zones, all sharing the distributed storage volume • Redo-log-based replication leads to low replica lag — typically < 10 ms • Custom reader endpoints with a configurable failover order.
  35. Aurora avoids quorum reads. The DB can use storage consistency points to pick a node: it needs a node that is up to date, it is enough to read a page as of the VDL, and so it is sufficient to pick a segment whose SCL >= VDL. For each protection group, at least 4 nodes in the quorum group will have the most recent data.
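This replica-free read rule is a one-liner: instead of assembling a read quorum, filter the protection group for segments whose SCL has reached the VDL and read from any one of them. Illustrative values and names.

```python
def pick_read_node(segment_scls, vdl):
    """Return the indexes of segments that are safe to read as of the VDL."""
    return [i for i, scl in enumerate(segment_scls) if scl >= vdl]

# Six segments of one protection group; the 4/6 write quorum guarantees
# at least 4 of them are current at the VDL
scls = [100, 100, 99, 100, 97, 100]
print(pick_read_node(scls, vdl=100))   # [0, 1, 3, 5]
```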
  37. Aurora I/O profile. MySQL with a replica writes the binlog, data, double-write buffer, log, and frm files through Amazon Elastic Block Store (EBS), with EBS mirroring, across two AZs. Aurora sends only log records, using asynchronous 4/6-quorum writes across three AZs, with continuous backup to Amazon S3. For a 30-minute Sysbench run: MySQL executed 780K transactions at an average of 7.4 I/Os per transaction; Aurora executed 27M transactions (35× more) at 0.95 I/Os per transaction including 6× amplification (7.7× fewer).
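The two multipliers on this slide follow from its raw numbers; a quick check (the slide rounds the I/O ratio to 7.7×):

```python
mysql_tx, mysql_io_per_tx = 780_000, 7.4        # 30-min Sysbench run, MySQL
aurora_tx, aurora_io_per_tx = 27_000_000, 0.95  # same run, Aurora

print(round(aurora_tx / mysql_tx))                   # ~35x more transactions
print(round(mysql_io_per_tx / aurora_io_per_tx, 1))  # ~7.8x fewer I/Os per tx
```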
  38. Write and read throughput (Sysbench with 250 tables and 200,000 rows per table on R4.16XL) — Aurora MySQL is 5× faster than MySQL. Write throughput: Aurora 5.7: 200,000; Aurora 5.6: 170,000; MySQL 5.7: 9,536; MySQL 5.6: 5,592. Read throughput: Aurora 5.7: 705,000; Aurora 5.6: 705,000; MySQL 5.7: 290,787; MySQL 5.6: 257,122.
  40. Global physical replication (primary region → secondary region): ① The primary instance sends log records in parallel to storage nodes, replica instances, and the replication server ② The replication server streams log records to the replication agent in the secondary region ③ The replication agent sends log records in parallel to storage nodes and replica instances ④ The replication server pulls log records from storage nodes to catch up after outages. High throughput: up to 150K writes/sec with negligible performance impact. Low replica lag: < 1 sec cross-region replica lag under heavy load. Fast recovery: < 1 min to accept full read-write workloads after a region failure.
  41. Global replication performance: logical vs. physical replication. (Charts of QPS and replica lag over time; the lag axis spans 0–600 seconds for logical replication versus 0–5 seconds for physical replication.)
  43. Fast Database Cloning: create a copy of a database without duplicate storage costs • Creation of a clone is instantaneous, since it doesn't require a deep copy • Data is copied only on write — when the original and cloned volume data diverge. Typical use cases: clone a production DB to run tests; reorganize a database; save a point-in-time snapshot for analysis without impacting the production system.
  44. Cloning vs. point-in-time restore (PiTR). Cloning: 1. References the pages of the source DB to create the new DB 2. The new DB instance is available in minutes, even for multi-terabyte DBs 3. Storage cost: no additional cost for shared DB pages 4. Multiple clone DBs can share the same DB pages 5. The cloned database contains the data from the source database at the time of cloning. PiTR: 1. Copies the pages of the source DB from backup to create the new DB 2. The new DB instance is available in hours for multi-terabyte DBs 3. Storage cost: each DB is billed for its own storage 4. Each restored DB has its own copy of the DB pages 5. PiTR produces a database from any point in time within the backup window.
  45. Database cloning: how does it work? Both databases reference the same pages on the shared distributed storage system — the source's pages are spread across protection groups, and the clone initially references every one of them rather than copying them.
  46. Database cloning: how does it work? (continued) As the databases diverge, new pages are added to each database as appropriate, while both continue to reference the pages they still have in common.
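The copy-on-write behavior on these two slides can be sketched with a toy page map: the clone is O(1) because it only shares references, and a page diverges only when one side writes to it. The class is an assumed structure for illustration, not Aurora's storage format.

```python
class Volume:
    def __init__(self, pages=None):
        self.pages = dict(pages or {})   # page id -> page content

    def clone(self):
        """O(1): the clone starts by referencing the same pages."""
        return Volume(self.pages)

    def write(self, page_id, content):
        """Copy-on-write: only this volume's mapping diverges."""
        self.pages[page_id] = content

src = Volume({1: "a", 2: "b", 3: "c", 4: "d"})
cl = src.clone()                 # instantaneous, no deep copy
cl.write(5, "e")                 # clone adds page 5
src.write(3, "c'")               # source rewrites page 3; clone keeps the old version

shared = [p for p in src.pages if p in cl.pages and src.pages[p] == cl.pages[p]]
print(shared)                    # [1, 2, 4]: unmodified pages are still shared
```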
  48. Database Backtrack: a quick way to bring the database to a particular point in time without having to restore from backups • Rewind the database to quickly recover from unintentional DML/DDL operations • Rewind multiple times to find the desired point in the database's state — for example, quickly iterate over schema changes without having to restore multiple times.
  49. Backtrack vs. point-in-time restore (PiTR). Backtrack: 1. Changes the state of the current DB 2. The current DB is available within seconds, even for multi-terabyte DBs 3. No additional storage cost, as the current DB is restored to a prior point in time 4. Multiple iterative backtracks are quick 5. The rewind has to be within the configured backtrack period. PiTR: 1. Creates a new DB at the desired point in time from the backup 2. The new DB instance takes hours to restore for multi-terabyte DBs 3. Each restored DB is billed for its own storage 4. Multiple iterative PiTRs take time 5. PiTR has to be within the configured backup window.
  50. Backtrack: how does it work? Log records are made visible or invisible based on the branch within the LSN tree, providing a consistent view of the DB. The first rewind, performed at t2 to rewind the DB to t1, makes the logs written between t1 and t2 invisible; the second rewind, performed at t4 to rewind the DB to t3, makes the logs written between t3 and t4 invisible as well.
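The key idea — rewinding hides log records instead of deleting them — can be sketched as a visibility set over a log stream. This is an illustrative structure, not Aurora's on-disk format.

```python
class LogStream:
    def __init__(self):
        self.records = []            # (lsn, payload), append-only
        self.invisible = set()       # LSNs hidden by rewinds

    def append(self, lsn, payload):
        self.records.append((lsn, payload))

    def rewind_to(self, target_lsn):
        """Backtrack: mark everything after the target invisible.
        Nothing is deleted, so the operation is fast and repeatable."""
        for lsn, _ in self.records:
            if lsn > target_lsn:
                self.invisible.add(lsn)

    def visible(self):
        """The consistent view the DB serves after rewinding."""
        return [lsn for lsn, _ in self.records if lsn not in self.invisible]

log = LogStream()
for lsn in range(1, 7):
    log.append(lsn, f"change-{lsn}")
log.rewind_to(3)                 # rewind hides LSNs 4-6
log.append(7, "change-7")        # writing resumes on the new branch
print(log.visible())             # [1, 2, 3, 7]
```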
  52. Publications • Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In SIGMOD 2017 • Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes. In SIGMOD 2018
  53. Thank you! Murali Brahmadesam — brahmade@amazon.com
