Learn about architecture best practices for combining AWS storage and database technologies. We outline AWS storage options (Amazon EBS, Amazon EC2 instance storage, Amazon S3, and Amazon Glacier) along with AWS database options including Amazon ElastiCache (in-memory data store), Amazon RDS (SQL database), Amazon DynamoDB (NoSQL database), Amazon CloudSearch (search), Amazon EMR (Hadoop), and Amazon Redshift (data warehouse). Then we discuss how to architect your database tier by using the right database and storage technologies to achieve the required functionality, performance, availability, and durability—at the right cost.
2. The Third Platform
• Built on:
  – Mobile devices
  – Cloud services
  – Social technologies
  – Big data
• Billions of users
• Millions of apps
3. Data Volume, Velocity, Variety
• 2.7 zettabytes (ZB) of data exist in the digital universe today
  – 1 ZB = 1 billion terabytes
• 450 billion transactions per day by 2020
• More unstructured data than structured data
4. Common Questions from Database Developers
Cloud Migration
• How do I move my data to the cloud?
Data/Storage Technologies
• What data store should I use?
  – SQL or NoSQL?
  – Hadoop or a data warehouse?
  – What about search?
Management Concerns
• Is my data secure in the cloud?
• Can I get relational features without management nightmares?
• My data volume, velocity, and variety are exploding!
• How can I reduce cost?
Performance and Delivery
• Need low latency (ms or µs)
• Need high throughput
• Need to ship in days, not years!
6. Cloud Data Tier Architecture – Use the Right Tool for the Job!
[Diagram: Client Tier → App/Web Tier → Data Tier, where the data tier comprises search, cache, blob store, ETL, NoSQL, SQL, data warehouse, and Hadoop components]
10. AWS Primitive Compute and Storage
Compute Capabilities
• Many different EC2 instance types
  – General purpose
  – Compute optimized
  – Storage optimized
  – Memory optimized
• Host any major data storage technology
  – RDBMS
  – NoSQL
  – Cache
Raw Storage Options
• EC2 instance store (ephemeral)
• Amazon Elastic Block Store (EBS)
  – Standard volume: 1 TB, ~100 IOPS per volume
  – Provisioned IOPS volume: 1 TB, up to 4,000 IOPS per volume
  – Stripe multiple volumes for higher IOPS or storage
Primitives add flexibility, but also come with operational burden!
11. AWS Data Tier Architecture – Use the right tool for the job!
[Diagram: a data tier built from Amazon ElastiCache, Amazon CloudSearch, Amazon Elastic MapReduce, Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, and Amazon Redshift, with AWS Data Pipeline moving data between them]
20. Data Characteristics: Hot, Warm, Cold

              Hot         Warm       Cold
Volume        MB–GB       GB–TB      PB
Item size     B–KB        KB–MB      KB–TB
Latency       ms          ms, sec    min, hrs
Durability    Low–High    High       Very High
Request rate  Very High   High       Low
Cost/GB       $$–$        $–¢¢       ¢
22. What data store should I use?

Service            Avg latency           Data volume          Item size          Request rate           Cost ($/GB/month)  Durability
ElastiCache        ms                    GB                   B–KB               Very High              $$                 Low
Amazon DynamoDB    ms                    GB–TB (no limit)     KB (64 KB max)     Very High              ¢¢                 Very High
Amazon RDS         ms, sec               GB–TB (3 TB max)     KB (~row size)     High                   ¢¢                 High
CloudSearch        ms, sec               GB–TB                KB (1 MB max)      High                   $                  High
Amazon Redshift    sec, min              TB–PB (1.6 PB max)   KB (64 K max)      Low                    ¢                  High
Amazon EMR (Hive)  sec, min, hrs         GB–PB (~nodes)       KB–MB              Low                    ¢                  Moderate
Amazon S3          ms, sec, min (~size)  GB–PB (no limit)     KB–GB (5 TB max)   Low–Very High (no limit)  ¢               Very High
Amazon Glacier     hrs                   GB–PB (no limit)     GB (40 TB max)     Very Low (no limit)    ¢                  Very High

Services run roughly from hot data (top) through warm data to cold data (bottom).
23. AWS Data Tier Architecture - Use the right tool for the job!
[Diagram: the same data tier as slide 11 — Amazon ElastiCache, Amazon CloudSearch, Amazon Elastic MapReduce, Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, and Amazon Redshift, connected by AWS Data Pipeline]
25. Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”

Request rate (writes/sec)   Object size (bytes)   Total size (GB/month)   Objects per month
300                         2048                  1483                    777,600,000
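The figures above follow directly from the quoted request rate. As a minimal sketch of the comparison (the per-request and per-GB prices below are placeholders, not current AWS pricing), you can reproduce the arithmetic and weigh the two services' monthly cost:

```python
# Back-of-the-envelope cost model for the workload quoted above.
WRITES_PER_SEC = 300
OBJECT_SIZE_BYTES = 2048

objects_per_month = WRITES_PER_SEC * 60 * 60 * 24 * 30                 # 777,600,000
total_gb_per_month = objects_per_month * OBJECT_SIZE_BYTES / 1024**3   # ~1,483 GB

# Placeholder unit prices -- substitute current AWS pricing for a real answer.
S3_PUT_PER_1K = 0.005             # $ per 1,000 PUT requests (placeholder)
S3_STORAGE_PER_GB = 0.03          # $ per GB-month (placeholder)
DDB_PRICE_PER_10_WCU_HR = 0.0065  # $ per 10 write capacity units per hour (placeholder)
DDB_STORAGE_PER_GB = 0.25         # $ per GB-month (placeholder)

s3_monthly = (objects_per_month / 1000) * S3_PUT_PER_1K \
             + total_gb_per_month * S3_STORAGE_PER_GB

# DynamoDB: a 2 KB item consumes 2 write capacity units (1 KB each).
wcu = WRITES_PER_SEC * 2
hours_per_month = 24 * 30
ddb_monthly = (wcu / 10) * DDB_PRICE_PER_10_WCU_HR * hours_per_month \
              + total_gb_per_month * DDB_STORAGE_PER_GB

print(f"S3:       ${s3_monthly:,.0f}/month")    # per-request cost dominates
print(f"DynamoDB: ${ddb_monthly:,.0f}/month")
```

With these placeholder prices, the billions of tiny PUTs dominate the S3 bill, which is why many small, frequently written items tend to favor DynamoDB.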
30. Amazon RDS
When to use
• Transactions
• Complex queries
• Medium to high query/write rate
  – Up to 30 K IOPS (15 K reads + 15 K writes)
• 100s of GB to low TBs
• Workload fits on a single node
• High durability
When not to use
• Massive read/write rates
  – Example: 150 K write requests per second
• Data size or throughput demands sharding
  – Example: 10s or 100s of terabytes
• Simple Get/Put and queries that a NoSQL store can handle
• Complex analytics
[Diagram: push-button scaling; Multi-AZ deployment across AZ 1 and AZ 2 within a region; read replicas]
31. Amazon RDS Best Practices
• Use the right DB instance class
• Use EBS-optimized instances
  – db.m1.large, db.m1.xlarge, db.m2.2xlarge, db.m2.4xlarge, db.cr1.8xlarge
• Use provisioned IOPS
• Use multi-AZ for high availability
• Use read replicas for
  – Scaling reads
  – Schema changes
  – Additional failure recovery
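As a minimal boto3 sketch of these settings (instance identifiers, sizes, and credentials are hypothetical examples), a Multi-AZ MySQL instance with Provisioned IOPS plus a read replica might be provisioned like this:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Multi-AZ primary with Provisioned IOPS storage (values are examples).
rds.create_db_instance(
    DBInstanceIdentifier="app-primary",      # hypothetical name
    DBInstanceClass="db.m1.xlarge",          # one of the EBS-optimized classes
    Engine="mysql",
    AllocatedStorage=500,                    # GB
    StorageType="io1",
    Iops=5000,                               # provisioned IOPS for consistent I/O
    MultiAZ=True,                            # synchronous standby in a second AZ
    MasterUsername="admin",
    MasterUserPassword="change-me",          # manage secrets properly in practice
)

# Read replica for scaling reads and additional failure recovery.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="app-replica-1",
    SourceDBInstanceIdentifier="app-primary",
)
```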
32. Amazon DynamoDB
When to use
• Fast and predictable performance
• Seamless/massive scale
• Autosharding
• Consistent, low latency
• No size or throughput limits
• Very high durability
• Key-value or simple queries
When not to use
• Need multi-item/row or cross-table transactions
• Need complex queries or joins
• Need real-time analytics on historical data
• Storing cold data
33. Amazon DynamoDB Best Practices
• Keep item size small
• Store metadata in Amazon DynamoDB and large blobs in Amazon S3
• Use a table with a hash key for extremely high scale
• Use a table per day, week, month, etc. for storing time-series data
• Use conditional/OCC updates
• Use a hash-range key to model
  – 1:N relationships
  – Multi-tenancy
• Avoid hot keys and hot partitions

Example time-series schema: Events_table_2012 with Event_id (hash key), Timestamp (range key), and Attribute1 … AttributeN, rolled into weekly tables Events_table_2012_05_week1, Events_table_2012_05_week2, Events_table_2012_05_week3 with the same key schema.
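A minimal boto3 sketch of the conditional-update and per-week table patterns (table and attribute names are hypothetical, mirroring the example schema above):

```python
import datetime
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

def weekly_table_name(ts: datetime.datetime) -> str:
    # One table per week keeps hot, recent data separate from cold weeks,
    # which can later be archived or dropped cheaply.
    week = (ts.day - 1) // 7 + 1
    return f"Events_table_{ts.year}_{ts.month:02d}_week{week}"

def record_event(event_id: str, ts: datetime.datetime, payload: dict) -> None:
    table = dynamodb.Table(weekly_table_name(ts))
    # Conditional (OCC-style) put: fail rather than silently overwrite
    # an existing item with the same hash key.
    table.put_item(
        Item={"Event_id": event_id, "Timestamp": ts.isoformat(), **payload},
        ConditionExpression="attribute_not_exists(Event_id)",
    )
```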
34. Amazon ElastiCache (Memcached)
When to use
• Transient key-value store
• Need to speed up reads/writes
• Caching frequent SQL, NoSQL, or DW query results
• Saving transient, frequently updated data
  – Increment/decrement game scores/counters
  – Web application session storage
• Best-effort deduplication
When not to use
• Storing infrequently used data
• Need persistence
35. Amazon ElastiCache (Memcached) Best Practices
• Use autodiscovery
• Share memcached client objects in your application
• Use TTLs
• Account for per-connection memory overhead
• Use Amazon CloudWatch alarms / SNS alerts for
  – Number of connections
  – Swap memory usage
  – Freeable memory
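A minimal cache-aside sketch with TTLs, using the pymemcache client against a hypothetical endpoint (a production client would use ElastiCache autodiscovery to learn the node list):

```python
from pymemcache.client.base import Client

# Hypothetical ElastiCache node endpoint.
cache = Client(("my-cluster.abc123.use1.cache.amazonaws.com", 11211))

def load_profile_from_database(user_id: str) -> bytes:
    # Hypothetical stand-in for a real SQL/NoSQL read.
    return f"profile-data-for-{user_id}".encode()

def get_user_profile(user_id: str) -> bytes:
    key = f"profile:{user_id}"
    value = cache.get(key)
    if value is None:
        value = load_profile_from_database(user_id)
        # Always set a TTL so stale entries age out instead of
        # lingering until memory pressure evicts them.
        cache.set(key, value, expire=300)  # seconds
    return value
```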
36. Amazon ElastiCache (Redis)
When to use
• Key-value store with advanced data structures
  – Strings, lists, sets, sorted sets, hashes
• Caching
• Leaderboards
• High-speed sorting
• Atomic counters
• Queuing systems
• Activity streams
When not to use
• Need “native” sharding or scale-out
• Need “hard” persistence
• Data won’t fit in memory
• Need transaction rollback even under exceptions
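As a minimal sketch of the leaderboard and atomic-counter use cases with redis-py (endpoint and key names are hypothetical):

```python
import redis

# Hypothetical ElastiCache Redis endpoint.
r = redis.Redis(host="my-redis.abc123.use1.cache.amazonaws.com", port=6379)

# Leaderboard: a sorted set keyed by score.
r.zadd("leaderboard:game1", {"alice": 1200, "bob": 950})
r.zincrby("leaderboard:game1", 50, "bob")                 # atomic score bump
top10 = r.zrevrange("leaderboard:game1", 0, 9, withscores=True)

# Atomic counter with a TTL so it ages out on its own.
r.incr("pageviews:home")
r.expire("pageviews:home", 86400)
```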
37. Amazon ElastiCache (Redis) Best Practices
• Use TTLs
• Use the right instance types
  – Instances with high ECU/vCPU and network performance yield the highest throughput; example: m2.4xlarge, m2.2xlarge
• Use read replicas
  – Increase read throughput
  – AOF cannot protect against all failure modes
  – Promote read replicas to primary
• Use an RDB file snapshot for on-premises to Amazon ElastiCache migration
• Key parameter group settings
  – Avoid “AOF with fsync always”: huge impact on performance
  – AOF (+ RDB) with fsync everysec: best durability + performance
  – Pub/sub: set client-output-buffer-limit-pubsub-hard-limit and client-output-buffer-limit-pubsub-soft-limit based on the workload
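A minimal boto3 sketch of applying those parameter-group settings (the group name and values are hypothetical, and the AOF parameters are only available on certain Redis engine versions, so verify them against your version):

```python
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

# Hypothetical parameter group attached to a Redis cluster.
elasticache.modify_cache_parameter_group(
    CacheParameterGroupName="my-redis-params",
    ParameterNameValues=[
        # AOF with fsync every second: durability/performance balance.
        {"ParameterName": "appendonly", "ParameterValue": "yes"},
        {"ParameterName": "appendfsync", "ParameterValue": "everysec"},
        # Size pub/sub client output buffers for the workload (bytes).
        {"ParameterName": "client-output-buffer-limit-pubsub-hard-limit",
         "ParameterValue": "33554432"},
        {"ParameterName": "client-output-buffer-limit-pubsub-soft-limit",
         "ParameterValue": "8388608"},
    ],
)
```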
38. Amazon CloudSearch
When to use
• No search expertise
• Full-text search
• Ranking
• Relevance
• Structured and unstructured data
• Faceting (e.g., price buckets)
  – $0 to $10 (4 items)
  – $10 and above (3 items)
When not to use
• Not as a replacement for a database
  – Not as a system of record
  – Transient data
  – Nonatomic updates
39. Amazon CloudSearch Best Practices
• Batch documents for uploading
• Use Amazon CloudSearch for searching and another store for retrieving full records for the UI (i.e., don’t use return fields)
• Include other data, like popularity scores, in documents
• Use stop words to remove common terms
• Use fielded queries to reduce match sets
• Query latency is proportional to query specificity
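A minimal batched-upload sketch with boto3 (the domain endpoint, document IDs, and fields are hypothetical):

```python
import json
import boto3

# Hypothetical CloudSearch domain document endpoint.
docs = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-mydomain-abc123.us-east-1.cloudsearch.amazonaws.com",
)

# Upload documents in batches rather than one request per document.
batch = [
    {"type": "add", "id": "prod-1",
     "fields": {"title": "Red shoes", "price": 9.5, "popularity": 120}},
    {"type": "add", "id": "prod-2",
     "fields": {"title": "Blue hat", "price": 14.0, "popularity": 87}},
]
docs.upload_documents(
    documents=json.dumps(batch).encode("utf-8"),
    contentType="application/json",
)
```

Note the popularity field carried alongside the searchable text, matching the guideline above about including ranking signals in documents.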
40. Amazon Redshift
When to use
• Information analysis and reporting
• Complex DW queries that summarize historical data
• Batched large updates, e.g., daily sales totals
• 10s of concurrent queries
• 100s of GB to PB
• Compression
• Column-based storage
• Very high durability
When not to use
• OLTP workloads
  – 1000s of concurrent users
  – Large numbers of singleton updates
41. Amazon Redshift Best Practices
• Use the COPY command to load large data sets from Amazon S3, Amazon DynamoDB, or Amazon EMR/EC2/Unix/Linux hosts
  – Split your data into multiple files
  – Use GZIP or LZOP compression
  – Use a manifest file
• Choose a proper sort key
  – Range or equality predicates in the WHERE clause
• Choose a proper distribution key
  – Join column, foreign key or largest dimension, GROUP BY column
  – Avoid a distribution key for denormalized data
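A minimal sketch of a compressed, manifest-driven COPY (the cluster endpoint, table, bucket, and IAM role are hypothetical), issued through a standard PostgreSQL driver such as psycopg2:

```python
import psycopg2

# Hypothetical cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="change-me",
)

copy_sql = """
    COPY sales
    FROM 's3://my-bucket/loads/sales.manifest'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
    GZIP
    MANIFEST;
"""
with conn, conn.cursor() as cur:
    # Loads every (gzipped) file listed in the manifest, in parallel
    # across the cluster's slices.
    cur.execute(copy_sql)
```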
42. Amazon Elastic MapReduce
When to use
• Batch analytics/processing
  – Answers in minutes or hours
• Structured and unstructured data
• Parallel scans of the entire dataset with uniform query performance
• Supports Hive QL + other languages
• GB, TB, or PB of data
• Replicated data store (HDFS) for ad hoc and real-time queries (HBase)
When not to use
• Real-time analytics (DW)
  – Need answers in seconds
• 1000s of concurrent users
43. Amazon Elastic MapReduce Best Practices
• Choose between transient and persistent clusters for best TCO
• Leverage Amazon S3 integration for highly durable and interim storage
• Right-size cluster instances based on each job; one size does not fit all
• Leverage resizing and Spot Instances to add and remove capacity cost-effectively
• Tuning cluster instances can be easier than tuning Hadoop code
[Diagram: resizing a cluster cuts a job flow’s duration from 14 hours to 7 hours]
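A minimal boto3 sketch of a transient Hive cluster with Spot core capacity and S3 logging (names, bucket, instance counts, and release label are hypothetical):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Transient cluster: terminates when its steps finish, logs to S3.
emr.run_job_flow(
    Name="nightly-batch",                    # hypothetical
    LogUri="s3://my-emr-logs/nightly/",      # hypothetical bucket
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Spot capacity to scale out cost-effectively.
            {"InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 8},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # transient, not persistent
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```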
44. AWS Data Pipeline
When to use
• Automate movement and transformation of data (ETL in the cloud)
• Dependency management
  – Data
  – Control
• Schedule management
• Transient Amazon EMR clusters
• Regular data-movement patterns
  – Every hour or day
  – Every 30 minutes
• Amazon DynamoDB backups
  – Cross-region
When not to use
• Scheduling intervals of less than 15 minutes
• Execution latency of less than a minute
• Event-based scheduling
45. AWS Data Pipeline Best Practices
• Use dependency-based rather than time-based scheduling
• Make your activities idempotent
• Add your own tools using a shell activity
• Use Amazon S3 for staging
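A minimal sketch of the idempotency guideline for a shell activity: derive the output location from the scheduled time, so a retried run overwrites its own output instead of producing duplicates (the bucket, key prefix, and transform step are hypothetical):

```python
import sys
import boto3

def transform_input_data(scheduled_start: str) -> bytes:
    # Hypothetical stand-in for the real ETL step.
    return f'{{"window": "{scheduled_start}"}}'.encode()

def run_activity(scheduled_start: str) -> None:
    """Shell-activity entry point; Data Pipeline can pass the scheduled
    start time (e.g. '2014-01-15T03:00:00') as an argument."""
    result = transform_input_data(scheduled_start)

    # Deterministic key: re-running the same scheduled interval writes
    # to the same S3 staging object, making retries idempotent.
    key = f"staging/output/{scheduled_start}.json"
    boto3.client("s3").put_object(
        Bucket="my-pipeline-bucket", Key=key, Body=result
    )

if __name__ == "__main__":
    run_activity(sys.argv[1])
```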
46. Amazon S3
When to use
• Storing large objects
• Key-value store: Get/Put/List
• Unlimited storage
• Versioning
• Very high durability
  – 99.999999999%
• Very high throughput (via parallel clients)
• Storing persistent data
  – Backups
  – Source/target for EMR
  – Blob store with metadata in SQL or NoSQL
When not to use
• Complex queries
• Very low latency (ms)
• Search
• Read-after-write consistency for overwrites
• Need transactions
47. Amazon S3 Best Practices
• Use a random hash prefix for keys
• Ensure a random access pattern
• Use Amazon CloudFront for high-throughput GETs and PUTs
• Leverage the high-durability, high-throughput design of Amazon S3 for backup and as a common storage sink
  – Durable sink between data services
  – Supports decoupling and asynchronous delivery
• Consider RRS for lower-cost, lower-durability storage of derivatives or copies
• Consider parallel threads and multipart upload for faster writes
• Consider parallel threads and range GETs for faster reads
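A minimal boto3 sketch combining the hash-prefix and parallel multipart-upload guidelines (bucket and file names are hypothetical):

```python
import hashlib
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

def hashed_key(logical_name: str) -> str:
    # A random-looking prefix spreads keys across S3's index partitions,
    # avoiding hot spots under very high request rates.
    prefix = hashlib.md5(logical_name.encode()).hexdigest()[:4]
    return f"{prefix}/{logical_name}"

# Parallel multipart upload for faster writes of large objects.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,   # switch to multipart above 8 MB
    max_concurrency=10,                    # parallel part uploads
)
s3.upload_file(
    "backup-2014-01-15.tar.gz",            # hypothetical local file
    "my-data-sink-bucket",                 # hypothetical bucket
    hashed_key("backup-2014-01-15.tar.gz"),
    Config=config,
)
```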
48. Amazon Glacier
When to use
• Infrequently accessed data sets
• Very low cost storage
• Data retrieval times of several hours are acceptable
• Encryption at rest
• Very high durability
  – 99.999999999%
• Unlimited amount of storage
When not to use
• Frequent access
• Low-latency access
49. Amazon Glacier Best Practices
• Reduce request and storage costs with aggregation
  – Aggregate your files into bigger files before sending them to Amazon Glacier
  – Store checksums along with your files
  – Use a format that lets you access files within your aggregate archive
• Improve speed and reliability with multipart upload
• Reduce costs with ranged retrievals
• Maintain your own index in a highly durable store
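A minimal sketch of the aggregation guideline (vault, file names, and index store are hypothetical): bundle many small files into one tar archive, record a checksum, and upload a single archive instead of many:

```python
import hashlib
import tarfile
import boto3

glacier = boto3.client("glacier")

# Aggregate many small files into one archive: fewer (billable) upload
# requests, and tar's member index lets you locate files later.
with tarfile.open("batch-2014-01.tar", "w") as tar:
    for name in ["log1.txt", "log2.txt", "log3.txt"]:   # hypothetical files
        tar.add(name)

with open("batch-2014-01.tar", "rb") as f:
    data = f.read()
checksum = hashlib.sha256(data).hexdigest()   # record in your own index

resp = glacier.upload_archive(vaultName="my-vault", body=data)
# Keep the archive ID, checksum, and member list in a durable store
# (e.g. DynamoDB): Glacier has no fast listing, so your index is the
# lookup path for later ranged retrievals.
archive_id = resp["archiveId"]
```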
50. Amazon EC2 + Amazon EBS/Instance Storage
When to use
• Alternate data store technologies
• Hand-tuned performance needs
• Direct/admin access required
When not to use
• When a managed service will do the job
• When operational experience is low
51. Amazon EBS Best Practices
• Pick the right EC2 instance type
  – Higher “network performance” instances for driving more Amazon EBS IOPS
  – EBS-optimized EC2 instances for dedicated throughput between EC2 and Amazon EBS
• Use provisioned IOPS volumes for database workloads requiring consistent IOPS
• Use standard volumes for workloads requiring low to moderate IOPS and occasional bursts
• Stripe multiple Amazon EBS volumes for higher IOPS or storage
  – RAID 0 for higher I/O
  – RAID 10 for highest local durability
• Amazon EBS snapshots
  – Quiesce the file system and take a snapshot
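A minimal boto3 sketch of provisioning several Provisioned IOPS volumes for striping (the AZ, sizes, IOPS, and instance ID are hypothetical; the RAID 0 array itself is assembled inside the guest OS, e.g. with mdadm):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create several Provisioned IOPS volumes to stripe as RAID 0.
volume_ids = []
for _ in range(4):
    vol = ec2.create_volume(
        AvailabilityZone="us-east-1a",
        Size=100,              # GB per volume
        VolumeType="io1",
        Iops=2000,             # consistent IOPS per volume
    )
    volume_ids.append(vol["VolumeId"])

# Attach them to a (hypothetical) EBS-optimized instance.
for i, vol_id in enumerate(volume_ids):
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])
    ec2.attach_volume(
        VolumeId=vol_id,
        InstanceId="i-0123456789abcdef0",
        Device=f"/dev/sd{chr(ord('f') + i)}",   # /dev/sdf, /dev/sdg, ...
    )
```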
52. Amazon EC2 Best Practices
• HI instances: best IOPS/$
• HS instances: best GB/$
• Best vCPU/$
• Best memory (GiB)/$
55. AWS Data Tier Architecture - Use the right tool for the job!
[Diagram: recap of the data tier — Amazon ElastiCache, Amazon CloudSearch, Amazon Elastic MapReduce, Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, and Amazon Redshift, with AWS Data Pipeline connecting them]