Which Database is Right for My Workload?: Database Week San Francisco
1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Choosing the Right Database(s)
for Your Cloud Applications
Miguel Cervantes
Associate Solutions Architect
Joyjeet Banerjee
Enterprise Solutions Architect
2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Why are we even having this discussion?
Because:
• Picking the right database based on imperfect data is challenging
• Decades of traditional app development have conditioned us to put everything in a big box
• Decades of IT budgeting/procurement have required us to price it out first
• Moving/migrating critical data carries risk
• Staff is hard to retrain
• Legacy concerns limit choices
• …
Using fit-for-purpose databases for your various cloud-native app data needs: polyglot persistence
3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Cloud-native apps are different…
• Demand-based scalable compute tiers (e.g., AWS Auto Scaling)
• Orchestrated container-based services (e.g., Amazon ECS, Amazon EKS)
• Decoupled microservice architectures
• Serverless event-driven computing (e.g., AWS Lambda, Amazon API Gateway, AWS Step Functions)
• Streaming data processing: IoT, gaming, clickstreams
• AI/ML: recommendation engines, chat bots
• …
The Twelve-Factor App - https://12factor.net/
4. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
…and place different demands on the database tier
Functional Requirements:
• High throughput
• Large data sets
• Flexible consistency
• Global scale needs
• Low latency
Operational Requirements:
• High load variability
• Large connection counts
• Near-zero downtime
• High data durability
• Low cost
• Automation
5. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
The good news
• Cloud makes it easy to add new databases
• The options to choose from are increasing, and getting better
• Microservices are driving polyglot persistence
… and the challenges
• Selecting the right database for the job
• Maintaining consistency between different datasets
6. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
So what is the right database solution?
• Shape of data, relationships
• Data set size
• Actions on data, query patterns
• Data durability
• Expertise
• Legacy dependencies
• Business needs
• Platform integration
• Latency, performance
• Operational load
7. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Decision Matrix
Amazon Aurora MySQL
• Shape: structured, semi-structured
• Size: mid-TB range
• Workload: K/V lookups, transactional, light analytics
• Performance: high throughput, low latency
• Durability: high
• Expertise: relational, MySQL, SQL Server
• Legacy: user-defined code, COTS
• Business needs: database freedom
• Ops load: low to moderate
• Platform integration: Serverless, IAM, Lambda, Auto Scaling, Amazon S3

Amazon Aurora PostgreSQL
• Shape: structured, semi-structured
• Size: mid-TB range
• Workload: transactional, light analytics
• Performance: high throughput, low latency
• Durability: high
• Expertise: relational, PostgreSQL, Oracle
• Legacy: user-defined code, COTS
• Business needs: database freedom
• Ops load: moderate
• Platform integration: in the works

Amazon RDS DB Engines
• Shape: structured, semi-structured
• Size: low-TB range
• Workload: transactional, light analytics
• Performance: mid-to-high throughput, low latency
• Durability: user controlled
• Expertise: engine specific
• Legacy: engine requirements, COTS
• Business needs: right sizing
• Ops load: moderate
• Platform integration: log streaming

Amazon DynamoDB
• Shape: semi-structured
• Size: high-TB range
• Workload: K/V lookups, NoSQL, OLTP, document store
• Performance: ultra-high throughput, low latency
• Durability: high
• Expertise: NoSQL
• Legacy: no active development
• Business needs: zero downtime, ultra-high scale
• Ops load: low
• Platform integration: Serverless, IAM, Lambda, DAX, Auto Scaling, Kinesis Streams

Amazon Neptune
• Shape: graph-structured
• Size: mid-TB range
• Workload: graph, highly connected data, transactional
• Performance: high throughput, low latency
• Durability: high
• Expertise: graph, Gremlin, SPARQL
• Business needs: database freedom
• Ops load: low
• Platform integration: IAM, Amazon S3
8. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Decision Matrix (cont.)
Amazon ElastiCache
• Shape: semi-structured, unstructured
• Size: low-TB range
• Workload: in-memory caching, K/V lookups, NoSQL
• Performance: high throughput, ultra-low latency
• Durability: low (in-memory; auto failover to read replica)
• Expertise: caching, NoSQL
• Business needs: response latency, DB cost optimization
• Ops load: low
• Platform integration: scalable clusters

Amazon Redshift
• Shape: structured, semi-structured
• Size: PB range
• Workload: optimized analytics
• Performance: mid-to-high latency
• Durability: high
• Expertise: DW, data science
• Legacy: user-defined code, COTS
• Business needs: cost optimization
• Ops load: moderate
• Platform integration: IAM, Amazon S3

Amazon Athena
• Shape: structured, semi-structured
• Size: TB range
• Workload: flexible analytics
• Performance: high latency
• Durability: high (Amazon S3)
• Expertise: data lakes, data science
• Business needs: flexibility
• Ops load: low
• Platform integration: IAM, Amazon S3

Amazon EMR
• Shape: semi-structured
• Size: PB range
• Workload: flexible analytics
• Performance: low-to-high latency
• Durability: user-controlled
• Expertise: data lakes, data science, DW
• Legacy: tooling & versioning
• Business needs: cost optimization
• Ops load: high
• Platform integration: IAM, Amazon S3, EC2 Spot
9. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
When should you consider Amazon Aurora MySQL or PostgreSQL?
• Migrations from commercial engines: Oracle, SQL Server, DB2
• Customers reaching scaling limits on Amazon RDS MySQL, PostgreSQL
• Customers migrating from complex, HA environments of MySQL, PostgreSQL
• High-throughput workloads
• OLTP, HTAP workloads
When should you consider Amazon Aurora MySQL versus PostgreSQL compatible?
• Customer expertise
• Complex data types and code in the DB: stored procs, triggers, assemblies
• The Schema Conversion Tool can give you an idea of coverage for your workload
10. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
When should you use Amazon Aurora MySQL or PostgreSQL?
• Need or can tolerate relational semantics
• High-throughput writes and significant read scaling (see the endpoint-routing sketch below)
• Up to 15 low-lag read replicas in region, multi-master (coming soon)
• Large data sets: tens of TB, more when sharded
• Up to 64 TB / cluster volume
• High volumes of concurrent, non-persistent connections (MySQL)
• Variable load, infrequent use
• Aurora Serverless (coming soon), auto scaling replicas
• HTAP with highly parallelizable analytical queries
• Aurora Parallel Query engine (coming soon)
• High HA and durability requirements
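As a rough illustration of the read-scaling point above (a minimal sketch, not from the deck): Aurora's cluster endpoint always targets the current writer, while the separate reader endpoint load-balances across replicas. Endpoint hostnames, credentials, and the schema below are hypothetical; pymysql stands in for any MySQL client.

```python
# Minimal read/write splitting across Aurora endpoints.
# All hostnames, credentials, and table names are placeholders.
import pymysql

WRITER = "mycluster.cluster-abc123.us-west-2.rds.amazonaws.com"     # cluster (writer) endpoint
READER = "mycluster.cluster-ro-abc123.us-west-2.rds.amazonaws.com"  # reader endpoint, load-balanced

def connect(host):
    return pymysql.connect(host=host, user="app", password="secret",
                           database="shop", connect_timeout=5)

# Writes always go to the cluster endpoint (the current master).
conn = connect(WRITER)
cur = conn.cursor()
cur.execute("INSERT INTO orders (item, qty) VALUES (%s, %s)", ("widget", 2))
conn.commit()
conn.close()

# Reads that tolerate replica lag go to the reader endpoint.
conn = connect(READER)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM orders")
print(cur.fetchone())
conn.close()
```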
11. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
12. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Scale-out, distributed architecture
• Logging is pushed down to a purpose-built, log-structured distributed storage system
• The storage volume is striped across hundreds of storage nodes distributed across 3 Availability Zones (AZs)
• Six copies of data, two copies in each AZ
[Diagram: SQL/transaction/caching tiers (master and replicas) across Availability Zones 1-3, over a shared storage volume of SSD-backed storage nodes]
13. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pattern: High-throughput workload
Need: High throughput, low latency, handle growth economically
Action: Migrating to multiple Amazon Aurora MySQL clusters (sharded)
Lessons Learned:
• Aurora is effective for semi-structured data with effective, optimized query patterns
• Disciplined data design & testing approach
• Sharding is an effective way to scale out, scale writes, and reduce index sizes, if query patterns make sense (see the routing sketch below)
Migration: In-house dual-writes + batch backfill
• DNA similarity matching algorithm
• Millions of reads/batched writes
• Key-value, range-restricted queries
• Achieved <10 ms read latency
• 10x reduction in cost vs Cassandra
• Aurora r3.xlarge-r3.2xlarge clusters as shards
• Exponential growth
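A minimal sketch of the shard-routing idea (illustrative only, not this customer's actual code): a stable hash of the partition key picks one of several Aurora cluster endpoints, so the mapping survives process restarts. All endpoint names are hypothetical.

```python
# Stable hash-based routing across sharded Aurora clusters.
import hashlib

SHARD_ENDPOINTS = [
    "shard0.cluster-abc.us-west-2.rds.amazonaws.com",
    "shard1.cluster-def.us-west-2.rds.amazonaws.com",
    "shard2.cluster-ghi.us-west-2.rds.amazonaws.com",
]

def shard_for(key: str) -> str:
    """Map a partition key to a shard endpoint.

    Uses md5 rather than Python's built-in hash() so the mapping is
    identical across processes and restarts.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARD_ENDPOINTS)
    return SHARD_ENDPOINTS[index]

print(shard_for("sample-dna-sequence-id"))
```

Note that adding a shard changes the modulus and remaps most keys; consistent hashing or a lookup table avoids that if resharding is expected.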
14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pattern: Multi-master
Need: High throughput & no downtime (connection storms)
Action: Migrating to a single Amazon Aurora MySQL cluster
Lessons Learned:
• Multi-master as a near-zero-downtime failover strategy is common on-premises
• What is the true failover time? Almost always similar to, or slower than, Aurora
• Even same-engine migrations may be complex: beware of time zone conversions & replication
• ~300 GB data set
• 57M active devices / 600M API calls/day
• 20k-80k (120k peak) writes/sec; reads: 4-5x
• Load balancing w/ hot failover
• Statement-based binlog replication
15. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What about analytics? Amazon Redshift versus Amazon Aurora
When should you consider Amazon Redshift?
• Analytical workloads, batch processing
• Always online, small number of user connections, heavy queries
• Columnar storage advantage: aggregations and scan queries
• Integration with data lakes: native import/export to Amazon S3, Spectrum (see the COPY sketch below)
When should you use Amazon Redshift?
• Structured data analytics (optionally enhanced with unstructured data via Spectrum)
• Scale: up to 2 PB, control over storage, massively parallel query engine (up to 4,608 vCPUs)
Amazon Redshift versus Amazon Aurora?
• Consider Aurora for hybrid workloads, but beware of the impact of OLAP on OLTP queries (isolate via a clone)
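One possible shape of the S3 import path mentioned above (a sketch, not from the deck): a COPY statement loading Parquet from S3, submitted here through the Redshift Data API via boto3. The cluster identifier, database, IAM role, and bucket path are all placeholders; a plain PostgreSQL driver connection would work equally well.

```python
# Load data from Amazon S3 into Redshift with COPY, via the Data API.
import boto3

client = boto3.client("redshift-data", region_name="us-west-2")

copy_sql = """
    COPY clickstream
    FROM 's3://my-data-lake/clickstream/2018/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""

response = client.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
print(response["Id"])  # statement id; poll describe_statement for status
```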
16. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pattern: Multi-DB analytics with Amazon Aurora &
Amazon Redshift
17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
When should you consider Amazon DynamoDB?
• OLTP workloads
• Active development
• Mission-critical applications
• Scaling problems with other databases
• Need to reduce operations workload
• Need consistent latency or performance under load
• Data durability
18. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
When should you use Amazon DynamoDB?
• Few or zero relationships between entities
• Data is (or can be) organized in hierarchies/aggregates
• Single digit millisecond latency requirements
• Data can’t be lost
• Need for global replication
• Variable workload based on time of day (a capacity auto scaling sketch follows this list)
• Quick growth or large-scale events
• E.g.: Halloween, New Year's Eve, Black Friday, Cyber Monday…
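One way to absorb time-of-day load swings on provisioned tables is target-tracking auto scaling. A minimal boto3 sketch, with the table name and capacity bounds as placeholders:

```python
# Target-tracking auto scaling on a DynamoDB table's write capacity.
import boto3

autoscaling = boto3.client("application-autoscaling", region_name="us-west-2")

autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/Cart",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=5,
    MaxCapacity=500,
)

autoscaling.put_scaling_policy(
    PolicyName="cart-write-scaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/Cart",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # keep consumed/provisioned around 70%
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```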
19. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pattern: Optimistic concurrency control with Amazon DynamoDB
1. Get the cart: GetItem
{
  "TableName": "Cart",
  "Key": {"CartId": {"N": "2"}}
}
2. Update the cart: conditional PutItem (or UpdateItem)
{
  "TableName": "Cart",
  "Item": {
    "CartId": {"N": "2"},
    "LastUpdate": {"N": "t4"},
    "CartItems": {…}
  },
  "ConditionExpression": "LastUpdate = :v1",
  "ExpressionAttributeValues": {":v1": {"N": "t3"}}
}
CART item: CartId: 2, LastUpdate: t3, CartItems: [{ID: 2, Qty: 1}, {ID: 5, Qty: 2}]
• Use conditions to implement optimistic concurrency control, ensuring data consistency
• The GetItem call can be eventually consistent
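The same flow in Python with boto3 (a sketch, not from the deck): numeric version stamps stand in for the slide's t3/t4 shorthand, since DynamoDB's N type requires numeric strings. The ConditionalCheckFailedException branch is where a real app would re-read and retry.

```python
# Optimistic concurrency: read, then write back only if unchanged.
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="us-west-2")

# 1. Read the cart (may be eventually consistent); assumes the item exists.
cart = dynamodb.get_item(TableName="Cart", Key={"CartId": {"N": "2"}})["Item"]
last_update = cart["LastUpdate"]                      # e.g. {"N": "3"}
new_stamp = str(int(last_update["N"]) + 1)            # next version stamp

# 2. Conditional write: fails if another writer updated the item meanwhile.
try:
    dynamodb.put_item(
        TableName="Cart",
        Item={
            "CartId": {"N": "2"},
            "LastUpdate": {"N": new_stamp},
            "CartItems": cart.get("CartItems", {"L": []}),
        },
        ConditionExpression="LastUpdate = :v1",
        ExpressionAttributeValues={":v1": last_update},
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("Cart changed under us; re-read and retry")
    else:
        raise
```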
20. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pattern: Building blocks for serverless microservices with Amazon DynamoDB
[Diagram: OLTP/command side: Amazon API Gateway → AWS Lambda → Amazon DynamoDB; query side: Amazon DynamoDB Stream → AWS Lambda → Amazon S3 / Amazon ES / Amazon Athena]
1. API Gateway + Lambda + Amazon DynamoDB: core building block
2. DynamoDB Streams + AWS Lambda: building block for reliable event delivery (handler sketch below)
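A minimal sketch of building block 2 (not from the deck): a Lambda handler that consumes the DynamoDB stream and archives new item images to S3 for downstream Athena queries. The bucket name and key layout are hypothetical, and the stream must be configured to include new images.

```python
# Lambda handler for a DynamoDB stream: archive changed items to S3.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-query-side-bucket"  # hypothetical bucket

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            new_image = record["dynamodb"]["NewImage"]
            key = "carts/{}.json".format(
                record["dynamodb"]["Keys"]["CartId"]["N"])
            s3.put_object(Bucket=BUCKET, Key=key,
                          Body=json.dumps(new_image).encode("utf-8"))
    # Returning normally checkpoints the batch; raising retries it,
    # which is what makes this delivery path reliable.
    return {"processed": len(event["Records"])}
```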
21. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pattern: data consistency between microservices
Saga pattern using DynamoDB Streams and Lambda
[Diagram: shopping cart service (Cart table) → DynamoDB Stream → AWS Lambda → product catalog service (Product table)]
The CartService.commit API updates the cart item from {CheckOut: false, Items: {}} to {CheckOut: true, Items: {product1, …}}; the resulting stream record triggers a Lambda function that applies the change to the product catalog, e.g. {pid: "product1", qty: 34}. A sketch of that handler follows.
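A sketch of the stream-driven step of such a saga (illustrative only; the attribute names and item shapes are assumptions based on the slide): when CheckOut flips to true, decrement the quantity of each cart item in the Product table. Error handling and compensating actions, the rest of the saga, are elided.

```python
# Lambda handler: propagate committed carts to the Product table.
import boto3

dynamodb = boto3.client("dynamodb")

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "MODIFY":
            continue
        new_image = record["dynamodb"]["NewImage"]
        # Only act when the cart has just been checked out.
        if new_image.get("CheckOut", {}).get("BOOL") is not True:
            continue
        for item in new_image.get("Items", {}).get("L", []):
            pid = item["M"]["pid"]["S"]
            qty = item["M"]["qty"]["N"]
            dynamodb.update_item(
                TableName="Product",
                Key={"pid": {"S": pid}},
                UpdateExpression="SET qty = qty - :q",
                ExpressionAttributeValues={":q": {"N": qty}},
            )
```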
22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
To wrap up:
• Polyglot persistence: use fit-for-purpose DBs for each service
• Traditional DB patterns don't scale well to cloud-native apps
• Using only a few decision criteria isn't always selective enough
• Nothing beats thorough testing and tuning
• Patterns are useful, but there’s a lot of variability in workloads
23. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Useful Resources
AWS Database Blog:
https://aws.amazon.com/blogs/database/
Look for articles in the “Under the Hood” series for technical product details
AWS Database Migration Service:
https://aws.amazon.com/dms/
Migrate your databases to the cloud
24. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Thank you!