Learn about architecture best practices for combining AWS storage and database technologies. We outline AWS storage options (Amazon EBS, Amazon EC2 instance storage, Amazon S3, and Amazon Glacier) along with AWS database options including Amazon ElastiCache (in-memory data store), Amazon RDS (SQL database), Amazon DynamoDB (NoSQL database), Amazon CloudSearch (search), Amazon EMR (Hadoop), and Amazon Redshift (data warehouse). Then we discuss how to architect your database tier by using the right database and storage technologies to achieve the required functionality, performance, availability, and durability—at the right cost.
2. The Third Platform
• Built on:
  – Mobile devices
  – Cloud services
  – Social technologies
  – Big data
• Billions of users
• Millions of apps
3. Data Volume, Velocity, Variety
• 2.7 zettabytes (ZB) of data exist in the digital universe today
  – 1 ZB = 1 billion terabytes
• 450 billion transactions per day by 2020
• More unstructured data than structured data
4. Common Questions from Database Developers
Cloud Migration
• How do I move my data to the cloud?
Data/Storage Technologies
• What data store should I use?
  – SQL or NoSQL?
  – Hadoop or a data warehouse?
  – What about search?
Management Concerns
• Is my data secure in the cloud?
• Can I get relational features without management nightmares?
• My data volume, velocity, and variety are exploding!
• How can I reduce cost?
Performance and Delivery
• Need low latency (ms or µs)
• Need high throughput
• Need to ship in days, not years!
6. Cloud Data Tier Architecture – Use the Right Tool for the Job!
[Diagram: Client Tier → App/Web Tier → Data Tier, where the data tier comprises search, cache, blob store, ETL, NoSQL, SQL, data warehouse, and Hadoop components]
10. AWS Primitive Compute and Storage
Compute Capabilities
• Many different EC2 instance types
  – General purpose
  – Compute optimized
  – Storage optimized
  – Memory optimized
• Host any major data storage technology
  – RDBMS
  – NoSQL
  – Cache
Raw Storage Options
• EC2 instance store (ephemeral)
• Amazon Elastic Block Store (EBS)
  – Standard volume: 1 TB, ~100 IOPS per volume
  – Provisioned IOPS volume: 1 TB, up to 4,000 IOPS per volume
  – Stripe multiple volumes for higher IOPS or storage
Primitives add flexibility, but also come with operational burden!
11. AWS Data Tier Architecture – Use the right tool for the job!
[Diagram: a data tier built from Amazon ElastiCache, Amazon CloudSearch, Amazon Elastic MapReduce, Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, and Amazon Redshift, with AWS Data Pipeline moving data between them]
20. Data Characteristics: Hot, Warm, Cold

              Hot         Warm       Cold
Volume        MB–GB       GB–TB      PB
Item size     B–KB        KB–MB      KB–TB
Latency       ms          ms, sec    min, hrs
Durability    Low–High    High       Very High
Request rate  Very High   High       Low
Cost/GB       $$–$        $–¢¢       ¢
22. What data store should I use?

Service            Avg latency           Data volume          Item size          Request rate           Cost ($/GB/month)  Durability
ElastiCache        ms                    GB                   B–KB               Very High              $$                 Low
Amazon DynamoDB    ms                    GB–TB (no limit)     KB (64 KB max)     Very High              ¢¢                 Very High
Amazon RDS         ms, sec               GB–TB (3 TB max)     KB (~row size)     High                   ¢¢                 High
CloudSearch        ms, sec               GB–TB                KB (1 MB max)      High                   $                  High
Amazon Redshift    sec, min              TB–PB (1.6 PB max)   KB (64 K max)      Low                    ¢                  High
Amazon EMR (Hive)  sec, min, hrs         GB–PB (~nodes)       KB–MB              Low                    ¢                  Moderate
Amazon S3          ms, sec, min (~size)  GB–PB (no limit)     KB–GB (5 TB max)   Low–Very High (no limit)  ¢               Very High
Amazon Glacier     hrs                   GB–PB (no limit)     GB (40 TB max)     Very Low (no limit)    ¢                  Very High

Services run roughly from hot data (top) through warm data to cold data (bottom).
23. AWS Data Tier Architecture - Use the right tool for the job!
[Diagram: the same data tier as slide 11 — Amazon ElastiCache, Amazon CloudSearch, Amazon Elastic MapReduce, Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, and Amazon Redshift, connected by AWS Data Pipeline]
25. Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”

Request rate (writes/sec)   Object size (bytes)   Total size (GB/month)   Objects per month
300                         2048                  1483                    777,600,000
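The figures above follow directly from the quoted request rate. As a minimal sketch of the comparison (the per-request and per-GB prices below are placeholders, not current AWS pricing), you can reproduce the arithmetic and weigh the two services' monthly cost:

```python
# Back-of-the-envelope cost model for the workload quoted above.
WRITES_PER_SEC = 300
OBJECT_SIZE_BYTES = 2048

objects_per_month = WRITES_PER_SEC * 60 * 60 * 24 * 30                 # 777,600,000
total_gb_per_month = objects_per_month * OBJECT_SIZE_BYTES / 1024**3   # ~1,483 GB

# Placeholder unit prices -- substitute current AWS pricing for a real answer.
S3_PUT_PER_1K = 0.005             # $ per 1,000 PUT requests (placeholder)
S3_STORAGE_PER_GB = 0.03          # $ per GB-month (placeholder)
DDB_PRICE_PER_10_WCU_HR = 0.0065  # $ per 10 write capacity units per hour (placeholder)
DDB_STORAGE_PER_GB = 0.25         # $ per GB-month (placeholder)

s3_monthly = (objects_per_month / 1000) * S3_PUT_PER_1K \
             + total_gb_per_month * S3_STORAGE_PER_GB

# DynamoDB: a 2 KB item consumes 2 write capacity units (1 KB each).
wcu = WRITES_PER_SEC * 2
hours_per_month = 24 * 30
ddb_monthly = (wcu / 10) * DDB_PRICE_PER_10_WCU_HR * hours_per_month \
              + total_gb_per_month * DDB_STORAGE_PER_GB

print(f"S3:       ${s3_monthly:,.0f}/month")    # per-request cost dominates
print(f"DynamoDB: ${ddb_monthly:,.0f}/month")
```

With these placeholder prices, the billions of tiny PUTs dominate the S3 bill, which is why many small, frequently written items tend to favor DynamoDB.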
30. Amazon RDS
When to use
• Transactions
• Complex queries
• Medium to high query/write rate
  – Up to 30 K IOPS (15 K reads + 15 K writes)
• 100s of GB to low TBs
• Workload fits on a single node
• High durability
When not to use
• Massive read/write rates
  – Example: 150 K write requests per second
• Data size or throughput demands sharding
  – Example: 10s or 100s of terabytes
• Simple Get/Put and queries that a NoSQL store can handle
• Complex analytics
[Diagram: push-button scaling; Multi-AZ deployment across AZ 1 and AZ 2 within a region; read replicas]
31. Amazon RDS Best Practices
• Use the right DB instance class
• Use EBS-optimized instances
  – db.m1.large, db.m1.xlarge, db.m2.2xlarge, db.m2.4xlarge, db.cr1.8xlarge
• Use provisioned IOPS
• Use multi-AZ for high availability
• Use read replicas for
  – Scaling reads
  – Schema changes
  – Additional failure recovery
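As a minimal boto3 sketch of these settings (instance identifiers, sizes, and credentials are hypothetical examples), a Multi-AZ MySQL instance with Provisioned IOPS plus a read replica might be provisioned like this:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Multi-AZ primary with Provisioned IOPS storage (values are examples).
rds.create_db_instance(
    DBInstanceIdentifier="app-primary",      # hypothetical name
    DBInstanceClass="db.m1.xlarge",          # one of the EBS-optimized classes
    Engine="mysql",
    AllocatedStorage=500,                    # GB
    StorageType="io1",
    Iops=5000,                               # provisioned IOPS for consistent I/O
    MultiAZ=True,                            # synchronous standby in a second AZ
    MasterUsername="admin",
    MasterUserPassword="change-me",          # manage secrets properly in practice
)

# Read replica for scaling reads and additional failure recovery.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="app-replica-1",
    SourceDBInstanceIdentifier="app-primary",
)
```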
32. Amazon DynamoDB
When to use
• Fast and predictable performance
• Seamless/massive scale
• Autosharding
• Consistent, low latency
• No size or throughput limits
• Very high durability
• Key-value or simple queries
When not to use
• Need multi-item/row or cross-table transactions
• Need complex queries or joins
• Need real-time analytics on historical data
• Storing cold data
33. Amazon DynamoDB Best Practices
• Keep item size small
• Store metadata in Amazon DynamoDB and large blobs in Amazon S3
• Use a table with a hash key for extremely high scale
• Use a table per day, week, month, etc. for storing time-series data
• Use conditional/OCC updates
• Use a hash-range key to model
  – 1:N relationships
  – Multi-tenancy
• Avoid hot keys and hot partitions

Example time-series schema: Events_table_2012 with Event_id (hash key), Timestamp (range key), and Attribute1 … AttributeN, rolled into weekly tables Events_table_2012_05_week1, Events_table_2012_05_week2, Events_table_2012_05_week3 with the same key schema.
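A minimal boto3 sketch of the conditional-update and per-week table patterns (table and attribute names are hypothetical, mirroring the example schema above):

```python
import datetime
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

def weekly_table_name(ts: datetime.datetime) -> str:
    # One table per week keeps hot, recent data separate from cold weeks,
    # which can later be archived or dropped cheaply.
    week = (ts.day - 1) // 7 + 1
    return f"Events_table_{ts.year}_{ts.month:02d}_week{week}"

def record_event(event_id: str, ts: datetime.datetime, payload: dict) -> None:
    table = dynamodb.Table(weekly_table_name(ts))
    # Conditional (OCC-style) put: fail rather than silently overwrite
    # an existing item with the same hash key.
    table.put_item(
        Item={"Event_id": event_id, "Timestamp": ts.isoformat(), **payload},
        ConditionExpression="attribute_not_exists(Event_id)",
    )
```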
34. Amazon ElastiCache (Memcached)
When to use
• Transient key-value store
• Need to speed up reads/writes
• Caching frequent SQL, NoSQL, or DW query results
• Saving transient, frequently updated data
  – Increment/decrement game scores/counters
  – Web application session storage
• Best-effort deduplication
When not to use
• Storing infrequently used data
• Need persistence
35. Amazon ElastiCache (Memcached) Best Practices
• Use autodiscovery
• Share memcached client objects in your application
• Use TTLs
• Account for per-connection memory overhead
• Use Amazon CloudWatch alarms / SNS alerts for
  – Number of connections
  – Swap memory usage
  – Freeable memory
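A minimal cache-aside sketch with TTLs, using the pymemcache client against a hypothetical endpoint (a production client would use ElastiCache autodiscovery to learn the node list):

```python
from pymemcache.client.base import Client

# Hypothetical ElastiCache node endpoint.
cache = Client(("my-cluster.abc123.use1.cache.amazonaws.com", 11211))

def load_profile_from_database(user_id: str) -> bytes:
    # Hypothetical stand-in for a real SQL/NoSQL read.
    return f"profile-data-for-{user_id}".encode()

def get_user_profile(user_id: str) -> bytes:
    key = f"profile:{user_id}"
    value = cache.get(key)
    if value is None:
        value = load_profile_from_database(user_id)
        # Always set a TTL so stale entries age out instead of
        # lingering until memory pressure evicts them.
        cache.set(key, value, expire=300)  # seconds
    return value
```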
36. Amazon ElastiCache (Redis)
When to use
• Key-value store with advanced data structures
  – Strings, lists, sets, sorted sets, hashes
• Caching
• Leaderboards
• High-speed sorting
• Atomic counters
• Queuing systems
• Activity streams
When not to use
• Need “native” sharding or scale-out
• Need “hard” persistence
• Data won’t fit in memory
• Need transaction rollback even under exceptions
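As a minimal sketch of the leaderboard and atomic-counter use cases with redis-py (endpoint and key names are hypothetical):

```python
import redis

# Hypothetical ElastiCache Redis endpoint.
r = redis.Redis(host="my-redis.abc123.use1.cache.amazonaws.com", port=6379)

# Leaderboard: a sorted set keyed by score.
r.zadd("leaderboard:game1", {"alice": 1200, "bob": 950})
r.zincrby("leaderboard:game1", 50, "bob")                 # atomic score bump
top10 = r.zrevrange("leaderboard:game1", 0, 9, withscores=True)

# Atomic counter with a TTL so it ages out on its own.
r.incr("pageviews:home")
r.expire("pageviews:home", 86400)
```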
37. Amazon ElastiCache (Redis) Best Practices
• Use TTLs
• Use the right instance types
  – Instances with high ECU/vCPU and network performance yield the highest throughput; example: m2.4xlarge, m2.2xlarge
• Use read replicas
  – Increase read throughput
  – AOF cannot protect against all failure modes
  – Promote read replicas to primary
• Use an RDB file snapshot for on-premises to Amazon ElastiCache migration
• Key parameter group settings
  – Avoid “AOF with fsync always”: huge impact on performance
  – AOF (+ RDB) with fsync everysec: best durability + performance
  – Pub/sub: set client-output-buffer-limit-pubsub-hard-limit and client-output-buffer-limit-pubsub-soft-limit based on the workload
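A minimal boto3 sketch of applying those parameter-group settings (the group name and values are hypothetical, and the AOF parameters are only available on certain Redis engine versions, so verify them against your version):

```python
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

# Hypothetical parameter group attached to a Redis cluster.
elasticache.modify_cache_parameter_group(
    CacheParameterGroupName="my-redis-params",
    ParameterNameValues=[
        # AOF with fsync every second: durability/performance balance.
        {"ParameterName": "appendonly", "ParameterValue": "yes"},
        {"ParameterName": "appendfsync", "ParameterValue": "everysec"},
        # Size pub/sub client output buffers for the workload (bytes).
        {"ParameterName": "client-output-buffer-limit-pubsub-hard-limit",
         "ParameterValue": "33554432"},
        {"ParameterName": "client-output-buffer-limit-pubsub-soft-limit",
         "ParameterValue": "8388608"},
    ],
)
```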
38. Amazon CloudSearch
When to use
• No search expertise
• Full-text search
• Ranking
• Relevance
• Structured and unstructured data
• Faceting (e.g., price buckets)
  – $0 to $10 (4 items)
  – $10 and above (3 items)
When not to use
• Not as a replacement for a database
  – Not as a system of record
  – Transient data
  – Nonatomic updates
39. Amazon CloudSearch Best Practices
• Batch documents for uploading
• Use Amazon CloudSearch for searching and another store for retrieving full records for the UI (i.e., don’t use return fields)
• Include other data, like popularity scores, in documents
• Use stop words to remove common terms
• Use fielded queries to reduce match sets
• Query latency is proportional to query specificity
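A minimal batched-upload sketch with boto3 (the domain endpoint, document IDs, and fields are hypothetical):

```python
import json
import boto3

# Hypothetical CloudSearch domain document endpoint.
docs = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-mydomain-abc123.us-east-1.cloudsearch.amazonaws.com",
)

# Upload documents in batches rather than one request per document.
batch = [
    {"type": "add", "id": "prod-1",
     "fields": {"title": "Red shoes", "price": 9.5, "popularity": 120}},
    {"type": "add", "id": "prod-2",
     "fields": {"title": "Blue hat", "price": 14.0, "popularity": 87}},
]
docs.upload_documents(
    documents=json.dumps(batch).encode("utf-8"),
    contentType="application/json",
)
```

Note the popularity field carried alongside the searchable text, matching the guideline above about including ranking signals in documents.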
40. Amazon Redshift
When to use
• Information analysis and reporting
• Complex DW queries that summarize historical data
• Batched large updates, e.g., daily sales totals
• 10s of concurrent queries
• 100s of GB to PB
• Compression
• Column-based storage
• Very high durability
When not to use
• OLTP workloads
  – 1000s of concurrent users
  – Large numbers of singleton updates
41. Amazon Redshift Best Practices
• Use the COPY command to load large data sets from Amazon S3, Amazon DynamoDB, or Amazon EMR/EC2/Unix/Linux hosts
  – Split your data into multiple files
  – Use GZIP or LZOP compression
  – Use a manifest file
• Choose a proper sort key
  – Range or equality predicates in the WHERE clause
• Choose a proper distribution key
  – Join column, foreign key or largest dimension, GROUP BY column
  – Avoid a distribution key for denormalized data
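A minimal sketch of a compressed, manifest-driven COPY (the cluster endpoint, table, bucket, and IAM role are hypothetical), issued through a standard PostgreSQL driver such as psycopg2:

```python
import psycopg2

# Hypothetical cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="change-me",
)

copy_sql = """
    COPY sales
    FROM 's3://my-bucket/loads/sales.manifest'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
    GZIP
    MANIFEST;
"""
with conn, conn.cursor() as cur:
    # Loads every (gzipped) file listed in the manifest, in parallel
    # across the cluster's slices.
    cur.execute(copy_sql)
```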
42. Amazon Elastic MapReduce
When to use
• Batch analytics/processing
  – Answers in minutes or hours
• Structured and unstructured data
• Parallel scans of the entire dataset with uniform query performance
• Supports Hive QL + other languages
• GB, TB, or PB of data
• Replicated data store (HDFS) for ad hoc and real-time queries (HBase)
When not to use
• Real-time analytics (DW)
  – Need answers in seconds
• 1000s of concurrent users
43. Amazon Elastic MapReduce Best Practices
• Choose between transient and persistent clusters for best TCO
• Leverage Amazon S3 integration for highly durable and interim storage
• Right-size cluster instances based on each job; one size does not fit all
• Leverage resizing and Spot Instances to add and remove capacity cost-effectively
• Tuning cluster instances can be easier than tuning Hadoop code
[Diagram: resizing a cluster cuts a job flow’s duration from 14 hours to 7 hours]
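A minimal boto3 sketch of a transient Hive cluster with Spot core capacity and S3 logging (names, bucket, instance counts, and release label are hypothetical):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Transient cluster: terminates when its steps finish, logs to S3.
emr.run_job_flow(
    Name="nightly-batch",                    # hypothetical
    LogUri="s3://my-emr-logs/nightly/",      # hypothetical bucket
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Spot capacity to scale out cost-effectively.
            {"InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 8},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # transient, not persistent
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```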
44. AWS Data Pipeline
When to use
• Automate movement and transformation of data (ETL in the cloud)
• Dependency management
  – Data
  – Control
• Schedule management
• Transient Amazon EMR clusters
• Regular data-movement patterns
  – Every hour or day
  – Every 30 minutes
• Amazon DynamoDB backups
  – Cross-region
When not to use
• Scheduling intervals of less than 15 minutes
• Execution latency of less than a minute
• Event-based scheduling
45. AWS Data Pipeline Best Practices
• Use dependency-based rather than time-based scheduling
• Make your activities idempotent
• Add your own tools using a shell activity
• Use Amazon S3 for staging
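A minimal sketch of the idempotency guideline for a shell activity: derive the output location from the scheduled time, so a retried run overwrites its own output instead of producing duplicates (the bucket, key prefix, and transform step are hypothetical):

```python
import sys
import boto3

def transform_input_data(scheduled_start: str) -> bytes:
    # Hypothetical stand-in for the real ETL step.
    return f'{{"window": "{scheduled_start}"}}'.encode()

def run_activity(scheduled_start: str) -> None:
    """Shell-activity entry point; Data Pipeline can pass the scheduled
    start time (e.g. '2014-01-15T03:00:00') as an argument."""
    result = transform_input_data(scheduled_start)

    # Deterministic key: re-running the same scheduled interval writes
    # to the same S3 staging object, making retries idempotent.
    key = f"staging/output/{scheduled_start}.json"
    boto3.client("s3").put_object(
        Bucket="my-pipeline-bucket", Key=key, Body=result
    )

if __name__ == "__main__":
    run_activity(sys.argv[1])
```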
46. Amazon S3
When to use
• Storing large objects
• Key-value store: Get/Put/List
• Unlimited storage
• Versioning
• Very high durability
  – 99.999999999%
• Very high throughput (via parallel clients)
• Storing persistent data
  – Backups
  – Source/target for EMR
  – Blob store with metadata in SQL or NoSQL
When not to use
• Complex queries
• Very low latency (ms)
• Search
• Read-after-write consistency for overwrites
• Need transactions
47. Amazon S3 Best Practices
• Use a random hash prefix for keys
• Ensure a random access pattern
• Use Amazon CloudFront for high-throughput GETs and PUTs
• Leverage the high-durability, high-throughput design of Amazon S3 for backup and as a common storage sink
  – Durable sink between data services
  – Supports decoupling and asynchronous delivery
• Consider RRS for lower-cost, lower-durability storage of derivatives or copies
• Consider parallel threads and multipart upload for faster writes
• Consider parallel threads and range GETs for faster reads
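A minimal boto3 sketch combining the hash-prefix and parallel multipart-upload guidelines (bucket and file names are hypothetical):

```python
import hashlib
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

def hashed_key(logical_name: str) -> str:
    # A random-looking prefix spreads keys across S3's index partitions,
    # avoiding hot spots under very high request rates.
    prefix = hashlib.md5(logical_name.encode()).hexdigest()[:4]
    return f"{prefix}/{logical_name}"

# Parallel multipart upload for faster writes of large objects.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,   # switch to multipart above 8 MB
    max_concurrency=10,                    # parallel part uploads
)
s3.upload_file(
    "backup-2014-01-15.tar.gz",            # hypothetical local file
    "my-data-sink-bucket",                 # hypothetical bucket
    hashed_key("backup-2014-01-15.tar.gz"),
    Config=config,
)
```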
48. Amazon Glacier
When to use
• Infrequently accessed data sets
• Very low cost storage
• Data retrieval times of several hours are acceptable
• Encryption at rest
• Very high durability
  – 99.999999999%
• Unlimited amount of storage
When not to use
• Frequent access
• Low-latency access
49. Amazon Glacier Best Practices
• Reduce request and storage costs with aggregation
  – Aggregate your files into bigger files before sending them to Amazon Glacier
  – Store checksums along with your files
  – Use a format that lets you access files within your aggregate archive
• Improve speed and reliability with multipart upload
• Reduce costs with ranged retrievals
• Maintain your own index in a highly durable store
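A minimal sketch of the aggregation guideline (vault, file names, and index store are hypothetical): bundle many small files into one tar archive, record a checksum, and upload a single archive instead of many:

```python
import hashlib
import tarfile
import boto3

glacier = boto3.client("glacier")

# Aggregate many small files into one archive: fewer (billable) upload
# requests, and tar's member index lets you locate files later.
with tarfile.open("batch-2014-01.tar", "w") as tar:
    for name in ["log1.txt", "log2.txt", "log3.txt"]:   # hypothetical files
        tar.add(name)

with open("batch-2014-01.tar", "rb") as f:
    data = f.read()
checksum = hashlib.sha256(data).hexdigest()   # record in your own index

resp = glacier.upload_archive(vaultName="my-vault", body=data)
# Keep the archive ID, checksum, and member list in a durable store
# (e.g. DynamoDB): Glacier has no fast listing, so your index is the
# lookup path for later ranged retrievals.
archive_id = resp["archiveId"]
```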
50. Amazon EC2 + Amazon EBS/Instance Storage
When to use
• Alternate data store technologies
• Hand-tuned performance needs
• Direct/admin access required
When not to use
• When a managed service will do the job
• When operational experience is low
51. Amazon EBS Best Practices
• Pick the right EC2 instance type
  – Higher “network performance” instances for driving more Amazon EBS IOPS
  – EBS-optimized EC2 instances for dedicated throughput between EC2 and Amazon EBS
• Use provisioned IOPS volumes for database workloads requiring consistent IOPS
• Use standard volumes for workloads requiring low to moderate IOPS and occasional bursts
• Stripe multiple Amazon EBS volumes for higher IOPS or storage
  – RAID 0 for higher I/O
  – RAID 10 for highest local durability
• Amazon EBS snapshots
  – Quiesce the file system and take a snapshot
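A minimal boto3 sketch of provisioning several Provisioned IOPS volumes for striping (the AZ, sizes, IOPS, and instance ID are hypothetical; the RAID 0 array itself is assembled inside the guest OS, e.g. with mdadm):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create several Provisioned IOPS volumes to stripe as RAID 0.
volume_ids = []
for _ in range(4):
    vol = ec2.create_volume(
        AvailabilityZone="us-east-1a",
        Size=100,              # GB per volume
        VolumeType="io1",
        Iops=2000,             # consistent IOPS per volume
    )
    volume_ids.append(vol["VolumeId"])

# Attach them to a (hypothetical) EBS-optimized instance.
for i, vol_id in enumerate(volume_ids):
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])
    ec2.attach_volume(
        VolumeId=vol_id,
        InstanceId="i-0123456789abcdef0",
        Device=f"/dev/sd{chr(ord('f') + i)}",   # /dev/sdf, /dev/sdg, ...
    )
```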
52. Amazon EC2 Best Practices
• HI instances: best IOPS/$
• HS instances: best GB/$
• Best vCPU/$
• Best memory (GiB)/$
55. AWS Data Tier Architecture - Use the right tool for the job!
[Diagram: recap of the data tier — Amazon ElastiCache, Amazon CloudSearch, Amazon Elastic MapReduce, Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, and Amazon Redshift, with AWS Data Pipeline connecting them]