4. Traditional database architecture
Databases are all about I/O.
Design principles for > 40 years:
• Increase I/O bandwidth
• Decrease the number of I/Os!
(Diagram: a monolithic stack – SQL, Transactions, Caching, and Logging on the compute tier, with attached storage below.)
6. Aurora approach: the log is the database
Page version t5 can be created by applying the log records from t1 through t5.
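The idea above can be sketched as materializing a page version by replaying log records. This is an illustrative model only; the record layout and `materialize` function are not Aurora's actual format.

```python
# Hypothetical sketch: a page version at LSN t5 is materialized by
# replaying the log records from t1 through t5 against the page image.
from dataclasses import dataclass

@dataclass
class LogRecord:
    lsn: int          # log sequence number
    offset: int       # byte offset within the page
    data: bytes       # bytes to write at that offset

def materialize(page: bytearray, log: list, up_to_lsn: int) -> bytearray:
    """Apply log records in LSN order up to the requested version."""
    for rec in sorted(log, key=lambda r: r.lsn):
        if rec.lsn <= up_to_lsn:
            page[rec.offset:rec.offset + len(rec.data)] = rec.data
    return page

page = bytearray(8)
log = [LogRecord(1, 0, b"ab"), LogRecord(3, 2, b"cd"), LogRecord(5, 4, b"ef")]
materialize(page, log, up_to_lsn=5)
assert bytes(page) == b"abcdef\x00\x00"
```

Because any page version can be rebuilt from the log, the engine never needs to ship full data blocks to storage.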
7. Aurora : Offload checkpointing to the storage fleet
8. Aurora approach: compute & storage separation
Compute & storage have different lifetimes
Compute instances
• fail and are replaced
• are shut down to save cost
• are scaled up/down/out on the basis of load needs
(Diagram: SQL, Transactions, Caching, and Logging on the compute tier; network storage below.)
Storage, on the other hand, has to be long-lived
Decouple compute and storage for scalability, availability, durability
9. Scale-out, distributed architecture
• Purpose-built, log-structured distributed storage system designed for databases
• Storage volume is striped across hundreds of storage nodes distributed over 3 different availability zones
• Six copies of data, two copies in each availability zone, to protect against AZ+1 failures
• Plan to apply the same principles to other layers of the stack
(Diagram: a master and replicas – each running SQL, Transactions, and Caching – across Availability Zones 1–3, over a shared storage volume on storage nodes with SSDs.)
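The 6-copy / 3-AZ quorum math can be checked with a small sketch (illustrative, not Aurora code): with two copies per AZ, losing one full AZ plus one more node still leaves a 3/6 read quorum, so no data is lost and the missing copies can be repaired.

```python
# Illustrative sketch of the 6-copy, 3-AZ quorum scheme:
# write quorum 4/6, read quorum 3/6, two copies per AZ.
COPIES = [("az1", 1), ("az1", 2), ("az2", 1), ("az2", 2), ("az3", 1), ("az3", 2)]
WRITE_QUORUM, READ_QUORUM = 4, 3

def survivors(failed_az: str, extra_failed: tuple) -> int:
    """Copies still alive after losing one whole AZ plus one extra node."""
    return sum(1 for c in COPIES if c[0] != failed_az and c != extra_failed)

# Under every AZ+1 failure, a read quorum survives (data is safe and
# repairable), even though writes must wait for repair to regain 4/6.
for az in ("az1", "az2", "az3"):
    for extra in COPIES:
        if extra[0] != az:
            n = survivors(az, extra)
            assert n >= READ_QUORUM   # reads (and repair) still possible
            assert n < WRITE_QUORUM   # write quorum temporarily lost
```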
12. Aurora MySQL performance
Aurora read/write throughput compared to MySQL 5.6, based on industry-standard benchmarks (MySQL SysBench; R4.16XL: 64 cores / 488 GB RAM).
(Charts: write performance, axis up to 250,000; read performance, axis up to 700,000; Aurora vs. MySQL 5.6.)
13. Speed – how did we achieve this?
DO LESS WORK
• Do fewer I/Os
• Minimize network packets
• Cache prior results
• Offload the database engine
BE MORE EFFICIENT
• Process asynchronously
• Reduce the latency path
• Use lock-free data structures
• Batch operations together
DATABASES ARE ALL ABOUT I/O
NETWORK-ATTACHED STORAGE IS ALL ABOUT PACKETS/SECOND
HIGH-THROUGHPUT PROCESSING IS ALL ABOUT CONTEXT SWITCHES
14. IO Traffic in MySQL
MYSQL WITH REPLICA
Types of write: binlog, data, double-write, log, FRM files
(Diagram: a primary instance in AZ 1 and a replica instance in AZ 2, each writing to Amazon Elastic Block Store (EBS) and its EBS mirror, with backups to Amazon S3; the write path is numbered 1–5.)
IO FLOW
• Issue the write to EBS – EBS issues it to the mirror, ACK when both are done
• Stage the write to the standby instance through DRBD
• Issue the write to EBS on the standby instance
OBSERVATIONS
• Steps 1, 3, and 4 are sequential and synchronous
• This amplifies both latency and jitter
• There are many types of writes for each user operation
• Data blocks have to be written twice to avoid torn writes
PERFORMANCE
• 780K transactions
• 7,388K I/Os per million txns (excludes mirroring and standby)
• Average of 7.4 I/Os per transaction
(30-minute SysBench write-only workload, 100 GB dataset, RDS Multi-AZ, 30K PIOPS)
15. IO Traffic in Aurora – DB Engine
AMAZON AURORA
Type of write: redo log records only
(Diagram: a primary instance in AZ 1 and replica instances in AZ 2 and AZ 3, issuing asynchronous 4/6-quorum distributed writes to the storage fleet, with continuous backup to Amazon S3.)
IO FLOW
• Only write redo log records; all steps are asynchronous
• No data block writes (checkpoint, cache replacement)
• 6X more log writes, but 9X less network traffic
• Tolerant of network and storage outlier latency
• Boxcar redo log records – fully ordered by LSN
• Shuffle to the appropriate segments – partially ordered
• Boxcar to storage nodes and issue writes
OBSERVATIONS / PERFORMANCE
• 27,378K transactions – 35X MORE
• 950K I/Os per 1M txns (6X amplification) – 7.7X LESS
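The "boxcar and shuffle" flow can be sketched as follows. The segment count, record shape, and the modulo page-to-segment mapping are illustrative assumptions, not Aurora's actual placement scheme.

```python
# Hedged sketch: redo records arrive fully ordered by LSN, then are
# shuffled into per-segment boxcars (partially ordered) for the
# storage nodes that own each page.
from collections import defaultdict

NUM_SEGMENTS = 4  # illustrative; a real volume has many segments

def shuffle_to_segments(redo_records):
    """redo_records: list of (lsn, page_id, payload), ordered by LSN."""
    boxcars = defaultdict(list)
    for lsn, page_id, payload in sorted(redo_records):  # fully ordered by LSN
        seg = page_id % NUM_SEGMENTS                    # owning segment (assumed mapping)
        boxcars[seg].append((lsn, page_id, payload))    # partially ordered per segment
    return boxcars  # each boxcar is then sent to its storage nodes

recs = [(10, 7, "a"), (11, 2, "b"), (12, 7, "c"), (13, 0, "d")]
cars = shuffle_to_segments(recs)
assert cars[3] == [(10, 7, "a"), (12, 7, "c")]  # LSN order preserved per segment
```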
16. I/O flow in Amazon Aurora storage node
①Receive log records and add to in-memory queue
and durably persist log records
② ACK to the database
③ Organize records and identify gaps in log
④ Gossip with peers to fill in holes
⑤ Coalesce log records into new page versions
⑥ Periodically stage log and new page versions to S3
⑦ Periodically garbage collect old versions
⑧ Periodically validate CRC codes on blocks
(Diagram: the database instance sends log records to the storage node's incoming queue; the node updates the queue, ACKs, appends to the hot log, sorts/groups and coalesces records into data pages, gossips with peer storage nodes, and runs continuous backup to S3, GC, and scrub in the background.)
Note:
• All steps are asynchronous
• Only steps 1 and 2 are in the foreground latency path
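Steps ③–④ above can be sketched as gap detection plus gossip. The function names and the contiguous-LSN assumption are simplifications for illustration, not Aurora's protocol.

```python
# Illustrative sketch: a storage node sorts its received log records,
# finds LSN gaps, and gossips with peers to fill the holes.
def find_gaps(received_lsns, start_lsn):
    """Return LSNs missing from the contiguous sequence starting at start_lsn."""
    have = set(received_lsns)
    high = max(have, default=start_lsn - 1)
    return [lsn for lsn in range(start_lsn, high + 1) if lsn not in have]

def gossip_fill(my_lsns, peer_lsns, start_lsn):
    """Pull any missing records that a peer happens to have."""
    missing = find_gaps(my_lsns, start_lsn)
    return sorted(set(my_lsns) | (set(missing) & set(peer_lsns)))

mine = [1, 2, 5, 6]          # records this node received directly
peer = [1, 2, 3, 4, 5, 6]    # records a peer node holds
assert find_gaps(mine, 1) == [3, 4]
assert gossip_fill(mine, peer, 1) == [1, 2, 3, 4, 5, 6]
```

Because gossip runs in the background, the database never waits on it; only receive-and-ACK (steps ① and ②) sit on the latency path.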
17. Asynchronous Group Commits
(Diagram: transactions T1…Tn issue reads, writes, and commits over time; commit records are assigned LSNs (10, 12, 20, 22, 30, 34, 41, 47, 49, 50) as the LSN grows, and pending commits T1–T8 wait in a commit queue in LSN order until the durable LSN at the head node passes them.)
TRADITIONAL APPROACH
• Maintain a buffer of log records to write out to disk
• Issue the write when the buffer is full or a timeout expires
• The first writer pays a latency penalty when the write rate is low
AMAZON AURORA
• Request I/O with the first write; fill the buffer until the write is picked up
• An individual write is durable when 4 of 6 storage nodes ACK
• Advance the DB durable point up to the earliest pending ACK
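The Aurora-style commit path described above can be sketched as advancing a durable LSN and releasing queued commits behind it. The names (`advance_durable_lsn`, the per-LSN ACK counts) are illustrative, not Aurora's internals.

```python
# Sketch: a commit completes only once the durable LSN, advanced through
# every consecutive LSN with a 4/6 quorum of ACKs, passes its commit LSN.
import heapq

WRITE_QUORUM = 4

def advance_durable_lsn(acks_by_lsn, prev_vdl):
    """Advance the durable point through consecutive quorum-ACKed LSNs."""
    vdl = prev_vdl
    while acks_by_lsn.get(vdl + 1, 0) >= WRITE_QUORUM:
        vdl += 1
    return vdl

def complete_commits(commit_queue, vdl):
    """Pop pending commits (min-heap keyed by commit LSN) up to the durable LSN."""
    done = []
    while commit_queue and commit_queue[0][0] <= vdl:
        done.append(heapq.heappop(commit_queue)[1])
    return done

acks = {1: 6, 2: 5, 3: 4, 4: 2}            # LSN -> storage-node ACK count
queue = [(2, "T1"), (3, "T2"), (4, "T3")]  # (commit LSN, transaction)
heapq.heapify(queue)
vdl = advance_durable_lsn(acks, prev_vdl=0)
assert vdl == 3                            # LSN 4 has only 2 ACKs so far
assert complete_commits(queue, vdl) == ["T1", "T2"]
```

No transaction waits for its own dedicated flush; each simply waits for the durable point to move past its commit LSN.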
18. Adaptive Thread Pool
AURORA THREAD MODEL
• Re-entrant connections are multiplexed to active threads
• Kernel-space epoll() inserts into a latch-free task queue
• Dynamically sized thread pool
• Gracefully handles 5,000+ concurrent client sessions on r3.8xl
MYSQL THREAD MODEL
• Standard MySQL – one thread per connection; doesn’t scale with connection count
• MySQL EE – connections assigned to thread groups; requires careful stall-threshold tuning
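The multiplexing idea can be shown with Python's `selectors` module (which uses epoll on Linux): one event loop serves many connections instead of one thread per connection. A minimal sketch, not Aurora's implementation; the echo "query" handling is a stand-in.

```python
# One event loop multiplexing several client "sessions": the kernel
# reports which sockets are ready, and only those are handled.
import selectors, socket

sel = selectors.DefaultSelector()

def serve_ready(n_events):
    """Handle n ready connections on the single loop, echoing each request."""
    handled = []
    while len(handled) < n_events:
        for key, _ in sel.select(timeout=1):
            data = key.fileobj.recv(4096)
            if data:
                key.fileobj.sendall(data)   # echo back as a trivial "query"
                handled.append(data)
    return handled

# Three concurrent client connections, one loop.
pairs = [socket.socketpair() for _ in range(3)]
for server_side, _ in pairs:
    server_side.setblocking(False)
    sel.register(server_side, selectors.EVENT_READ)

for i, (_, client_side) in enumerate(pairs):
    client_side.sendall(b"req%d" % i)

echoed = serve_ready(3)
assert sorted(echoed) == [b"req0", b"req1", b"req2"]
```

A real pool would hand ready connections to a small set of worker threads sized to the CPU count, rather than handling them inline.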
19. Aurora Lock Management
Needed to support many concurrent sessions and high update throughput.
• Same locking semantics as MySQL
• Concurrent access to lock chains
• Multiple scanners allowed in an individual lock chain
• Lock-free deadlock detection
(Diagram: the MySQL lock manager serializes scan, delete, and insert operations on a lock chain; the Aurora lock manager lets them proceed concurrently.)
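A toy model of a lock chain that admits multiple scanners at once: scanners take shared locks and are mutually compatible, while writers take exclusive locks that conflict. This is a simplification of the semantics above, not Aurora's lock manager.

```python
# Illustrative record-lock chain: shared (scan) locks coexist,
# exclusive (insert/delete) locks conflict with everything.
SHARED, EXCLUSIVE = "S", "X"
COMPATIBLE = {(SHARED, SHARED): True, (SHARED, EXCLUSIVE): False,
              (EXCLUSIVE, SHARED): False, (EXCLUSIVE, EXCLUSIVE): False}

class LockChain:
    def __init__(self):
        self.granted = []  # (txn, mode) pairs currently holding the lock

    def try_lock(self, txn, mode):
        """Grant if compatible with every holder; else the caller must wait."""
        if all(COMPATIBLE[(held, mode)] for _, held in self.granted):
            self.granted.append((txn, mode))
            return True
        return False  # waiter; deadlock detection would run lock-free

chain = LockChain()
assert chain.try_lock("scan-1", SHARED)           # first scanner
assert chain.try_lock("scan-2", SHARED)           # scanners share the chain
assert not chain.try_lock("delete-1", EXCLUSIVE)  # writer must wait
```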
20. Parallel Query Processing
Aurora storage has thousands of CPUs:
• Presents an opportunity to push down and parallelize query processing using the storage fleet
• Moving processing close to the data reduces network traffic and latency
However, there are significant challenges:
• Data stored on a storage node is not range-partitioned – full scans are required
• Data may be in flight
• Read views may not allow viewing the most recent data
• Not all functions can be pushed down to storage nodes
(Diagram: the database node pushes down predicates to the storage nodes and aggregates their results.)
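The pushdown idea can be sketched as follows: each storage node filters and partially aggregates its own rows, and the database node only combines the partials. The row shape and function names are illustrative assumptions.

```python
# Sketch: predicate pushdown plus partial aggregation at storage nodes,
# with the database node combining the small partial results.
def storage_node_scan(rows, predicate):
    """Runs on each storage node: filter locally, return a partial aggregate."""
    matched = [r for r in rows if predicate(r)]
    return len(matched), sum(r["amount"] for r in matched)

def database_node_query(nodes, predicate):
    """Runs on the database node: combine the partial aggregates."""
    partials = [storage_node_scan(rows, predicate) for rows in nodes]
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total

nodes = [
    [{"region": "us", "amount": 10}, {"region": "eu", "amount": 5}],
    [{"region": "us", "amount": 7}],
]
assert database_node_query(nodes, lambda r: r["region"] == "us") == (2, 17)
```

Only two small numbers cross the network per node, instead of every matching row.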
21. CI – Continuous Innovation, Continuous Improvement
• Collaborate with other AWS and EC2 teams
• Operational performance – fleet performance
• Product performance improvement
23. Simplified Storage Management
• Continuous backup
• Automatic storage scaling – auto-incremented in 10 GB units, up to 64 TB+
• Fast database cloning – copy-on-write (COW) pages
• Backtrack – rewind the database without restoring from backups
(Diagram: a production database with multiple clones serving dev/test applications, benchmarks, and production applications.)
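Fast cloning's copy-on-write pages can be sketched as follows: the clone initially shares every page with its parent and copies a page only when it is written. The `Volume` structure is an illustrative simplification, not Aurora's storage format.

```python
# Illustrative copy-on-write cloning: clone() is O(1) because no pages
# are copied up front; a page diverges only when it is first written.
class Volume:
    def __init__(self, pages=None, parent=None):
        self.own = pages or {}      # pages this volume has written itself
        self.parent = parent        # shared, read-only lineage

    def read(self, page_id):
        if page_id in self.own:
            return self.own[page_id]
        return self.parent.read(page_id) if self.parent else None

    def write(self, page_id, data):
        self.own[page_id] = data    # copy-on-write: only now does it diverge

    def clone(self):
        return Volume(parent=self)  # instant: all pages still shared

prod = Volume({1: "alpha", 2: "beta"})
dev = prod.clone()                  # dev/test clone, no data copied
assert dev.read(1) == "alpha"       # reads fall through to the parent
dev.write(1, "changed")             # only page 1 is duplicated
assert dev.read(1) == "changed" and prod.read(1) == "alpha"
```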
24. Simplified DB Node Management – Aurora Serverless
• Starts up on demand, shuts down when not in use
• Scales up/down automatically
• No application impact when scaling
• Pay per second, 1-minute minimum
(Diagram: the application connects through request routers to scalable DB capacity drawn from a warm pool of instances, on top of database storage.)
26. Global replication – faster disaster recovery and enhanced data locality
• Promote a read replica to a master for faster recovery in the event of a disaster
• Bring data close to your customers’ applications in different regions
• Promote to a master for easy migration
27. Global Physical Replication
(Diagram: the primary region runs a primary instance in AZ 1 and replica instances in AZ 2 and AZ 3 over a storage fleet, with async 4/6-quorum writes, continuous backup to Amazon S3, and a replication server fleet; the secondary region mirrors this with replica instances in three AZs, its own storage fleet and S3 backup, and a replication agent. Steps ①–④ are marked on the data path.)
① The primary instance sends log records in parallel to storage nodes, replica instances, and the replication server
② The replication server streams log records to the replication agent in the secondary region
③ The replication agent sends log records in parallel to storage nodes and replica instances
④ The replication server pulls log records from storage nodes to catch up after outages
High throughput: up to 150K writes/sec with negligible performance impact
Low replica lag: < 1 sec cross-region replica lag under heavy load
Fast recovery: < 1 min to accept full read-write workloads after region failure
30. What it takes to be in Aurora
• Background in computer science or engineering
• Experience as developers, software development managers, and product managers (tech)
• Development background in databases, storage, networking, operating systems, or systems development
• Programming languages: C++ (engine, storage); Java/Python/scripting (control plane, infrastructure)
• Managers – clear and concise communication and writing
• Interview process – indexed on both technical strengths and Leadership Principles
31. Customer Obsession
Leaders start with the customer and work backwards.
Ownership
Leaders are owners. They think long term and don’t
sacrifice long-term value for short-term results.
Invent and Simplify
Leaders expect and require innovation and invention from
their teams and always find ways to simplify.
Are Right, A Lot
Leaders are right a lot.
Learn and Be Curious
Leaders are never done learning and always seek to
improve themselves.
Hire and Develop the Best
Leaders raise the performance bar with every hire and
promotion.
Insist on the Highest Standards
Leaders have relentlessly high standards — many people
may think these standards are unreasonably high.
Think Big
Thinking small is a self-fulfilling prophecy.
Bias for Action
Speed matters in business.
Frugality
Accomplish more with less.
Earn Trust
Leaders listen attentively, speak candidly, and treat others
respectfully.
Dive Deep
Leaders operate at all levels, stay connected to the details, audit
frequently, and are skeptical when metrics and anecdote differ.
Have Backbone; Disagree and Commit
Leaders are obligated to respectfully challenge decisions when they
disagree, even when doing so is uncomfortable or exhausting.
Deliver Results
Leaders focus on the key inputs for their business and deliver them
with the right quality and in a timely fashion.
Amazon Leadership Principles
32. Leveraging the cloud ecosystem
• Invoke Lambda events from stored procedures/triggers
• Load data from S3; store snapshots and backups in S3
• Use IAM roles to manage database access control
• Upload system metrics and audit logs to CloudWatch
33. Automate administrative tasks
Customer:
• Schema design
• Query construction
• Query optimization
AWS:
• Automatic failover
• Backup & recovery
• Isolation & security
• Industry compliance
• Push-button scaling
• Automated patching
• Advanced monitoring
• Routine maintenance
AWS takes care of time-consuming database management tasks, freeing customers to focus on their applications and business.
35. Who is moving to Aurora, and why?
Customers using open source engines:
• Higher performance – up to 5x
• Better availability and durability
• Reduced cost – up to 60%
• Easy migration; no application change
Customers using commercial engines:
• One tenth of the cost; no licenses
• Integration with the cloud ecosystem
• Comparable performance and availability
• Migration tooling and services
37. Why work @ Aurora
Customer Impact
Technical Skills Development
Personal Interests
38. A day in my life @ Aurora
Support – Chat, Email, Calls
Collaboration with Global Teams
Impact customer businesses
Use a variety of Cloud services
Design Reviews
Live Deployments
Knowledge Sharing (Aurora, AWS, Amazon)
People (1x1s, promotions, career development, hiring, etc.)
Fun @ Work
39. Career Growth and Development @ Aurora
• You make the career plan that works best for YOU
• Reviews focus on your superpowers and growth opportunities
• Goal setting: we set challenging business and personal goals and are encouraged to achieve (and exceed) them
• Feedback designed for continuous growth
• Countless opportunities within Aurora, AWS, and Amazon
40. “You don’t choose your passions, your passions choose
you. All of us are gifted with certain passions, and the
people who are lucky are the ones who get to follow those
things.”
– Jeff Bezos