AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent 2013
Learn about architecture best practices for combining AWS storage and database technologies. We outline AWS storage options (Amazon EBS, Amazon EC2 Instance Storage, Amazon S3 and Amazon Glacier) along with AWS database options including Amazon ElastiCache (in-memory data store), Amazon RDS (SQL database), Amazon DynamoDB (NoSQL database), Amazon CloudSearch (search), Amazon EMR (Hadoop) and Amazon Redshift (data warehouse). Then we discuss how to architect your database tier by using the right database and storage technologies to achieve the required functionality, performance, availability, and durability—at the right cost.


Transcript of "AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent 2013"

  1. DAT203 - AWS Storage and Database Architecture Best Practices. Siva Raghupathy, Amazon Web Services. © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. The Third Platform
     • Built on: mobile devices, cloud services, social technologies, big data
     • Billions of users
     • Millions of apps
  3. Data Volume, Velocity, Variety
     • 2.7 zettabytes (ZB) of data exists in the digital universe today (1 ZB = 1 billion terabytes)
     • 450 billion transactions per day by 2020
     • More unstructured data than structured data
  4. Common Questions from Database Developers
     Cloud migration
     • How do I move (my data) to the cloud?
     Data/storage technologies
     • What data store should I use? SQL or NoSQL? Hadoop or DW? What about search?
     Management concerns
     • Is my data (in the cloud) secure?
     • Relational features w/o management nightmares?
     • My data volume, velocity, and variety are exploding!
     • How can I reduce cost?
     Performance and delivery
     • Need low latency (ms or µs)
     • Need high throughput
     • Need to ship in days – not years!
  5. Cloud Data Tier Anti-Pattern (diagram: Data Tier)
  6. Cloud Data Tier Architecture – Use the Right Tool for the Job! Client Tier → App/Web Tier → Data Tier (search, cache, blob store, ETL, NoSQL, SQL, data warehouse, Hadoop)
  7. AWS platform: Deployment & Administration, App Services, Compute, Storage, Database, Networking – on the AWS Global Infrastructure
  8. AWS Managed Database & Storage Services
     • Structured, complex query – SQL: Amazon RDS (MySQL, Oracle, SQL Server); data warehouse: Amazon Redshift
     • Structured, simple query – NoSQL: Amazon DynamoDB; cache: Amazon ElastiCache (Memcached, Redis); search: Amazon CloudSearch
     • Unstructured, custom query – Hadoop: Amazon Elastic MapReduce (EMR)
     • Unstructured, no query – cloud storage: Amazon S3 and Amazon Glacier
  9. AWS Primitive Compute and Storage
     Compute capabilities
     • Many different EC2 instance types: general purpose, compute optimized, storage optimized, memory optimized
     • Host any major data storage technology: RDBMS, NoSQL, cache
     Raw storage options
     • EC2 instance store (ephemeral)
     • Amazon Elastic Block Store (EBS)
       – Standard volume: 1 TB, ~100 IOPS per volume
       – Provisioned IOPS volume: 1 TB, up to 4,000 IOPS per volume
       – Stripe multiple volumes for higher IOPS or storage
     Primitives add flexibility, but also come with operational burden!
  10. AWS Data Tier Architecture – Use the right tool for the job! Data Tier: Amazon ElastiCache, Amazon CloudSearch, Amazon Elastic MapReduce, Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Data Pipeline
  11. Reference Architecture
  12. Reference Architecture (diagram): Amazon CloudSearch, Amazon ElastiCache, Amazon RDS, Amazon EMR, Amazon DynamoDB, Amazon Redshift, AWS Data Pipeline, Amazon S3, Amazon Glacier
  13. Use Case: A Video Streaming Application
  14. Use Case: A Video Streaming App – Upload (Amazon CloudSearch, Amazon RDS, Amazon DynamoDB, Amazon S3)
  15. Use Case: A Video Streaming App – Discovery (CloudFront, Amazon CloudSearch, Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, Amazon S3, Amazon Glacier)
  16. Use Case: A Video Streaming App – Recs (Amazon DynamoDB, Amazon EMR, Amazon S3, Amazon Glacier)
  17. Use Case: A Video Streaming App – Analytics (Amazon EMR, Amazon S3, Amazon Redshift, Amazon Glacier)
  18. What is the temperature of your data?
  19. Data Characteristics: Hot, Warm, Cold

      | Characteristic | Hot       | Warm    | Cold      |
      | Volume         | MB–GB     | GB–TB   | PB        |
      | Item size      | B–KB      | KB–MB   | KB–TB     |
      | Latency        | ms        | ms, sec | min, hrs  |
      | Durability     | Low–High  | High    | Very High |
      | Request rate   | Very High | High    | Low       |
      | Cost/GB        | $$–$      | $–¢¢    | ¢         |
  20. (Chart: Amazon Glacier, Amazon S3, Amazon ElastiCache, Amazon EMR, Amazon DynamoDB, Amazon RDS, and Amazon Redshift plotted against structure, request rate, cost/GB, latency, and data volume, each axis running low to high.)
  21. What data store should I use?

      | Metric | ElastiCache | Amazon DynamoDB | Amazon RDS | CloudSearch | Amazon Redshift | Amazon EMR (Hive) | Amazon S3 | Amazon Glacier |
      | Average latency | ms | ms | ms, sec | ms, sec | sec, min | sec, min, hrs | ms, sec, min (~size) | hrs |
      | Data volume | GB | GB–TB (no limit) | GB–TB (3 TB max) | GB–TB | TB–PB (1.6 PB max) | GB–PB (~nodes, no limit) | GB–PB (no limit) | GB–PB (no limit) |
      | Item size | B–KB | KB (64 KB max) | KB–MB (~row size) | KB (1 MB max) | KB (64 K max) | KB–MB | KB–GB (5 TB max) | GB (40 TB max) |
      | Request rate | Very High | Very High | High | High | Low | Low | Low–Very High (no limit) | Very Low (no limit) |
      | Storage cost ($/GB/month) | $$ | ¢¢ | ¢¢ | $ | ¢ | ¢ | ¢ | ¢ |
      | Durability | Low–Moderate | Very High | High | High | High | High | Very High | Very High |

      Hot data → ElastiCache, DynamoDB; warm data → RDS, CloudSearch, Redshift, EMR; cold data → S3, Glacier.
  22. AWS Data Tier Architecture – Use the right tool for the job! Data Tier: Amazon ElastiCache, Amazon CloudSearch, Amazon Elastic MapReduce, Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Data Pipeline
  23. Cost Conscious Design
  24. Cost Conscious Design. Example: Should I use Amazon S3 or Amazon DynamoDB? “I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”

      | Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month |
      | 300 | 2,048 | 1,483 | 777,600,000 |
  25. Cost Conscious Design. Example: Should I use Amazon S3 or Amazon DynamoDB?
  26. Amazon S3 or Amazon DynamoDB?

      | Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month |
      | 300 | 2,048 | 1,483 | 777,600,000 |
  27. | Scenario | Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month | Verdict |
      | Scenario 1 | 300 | 2,048 | 1,483 | 777,600,000 | use Amazon DynamoDB |
      | Scenario 2 | 300 | 32,768 | 23,730 | 777,600,000 | use Amazon S3 |
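A back-of-the-envelope comparison like the scenarios above can be sketched in a few lines. The prices below are illustrative placeholders (not current rates); the point is that S3's per-request cost dominates for small objects, while DynamoDB's write-unit cost grows with item size, so the verdict flips between the two scenarios.

```python
import math

# Illustrative 2013-era list prices (assumed for the example; actual prices
# vary by region and change over time).
S3_PUT_PRICE_PER_1K = 0.005          # USD per 1,000 PUT requests
DDB_PRICE_PER_10_WCU_HOUR = 0.0065   # USD per 10 write capacity units per hour
HOURS_PER_MONTH = 720
SECONDS_PER_MONTH = 30 * 24 * 3600

def s3_request_cost(writes_per_sec):
    """Monthly S3 PUT-request cost (storage cost excluded for brevity)."""
    puts_per_month = writes_per_sec * SECONDS_PER_MONTH
    return puts_per_month / 1000 * S3_PUT_PRICE_PER_1K

def dynamodb_write_cost(writes_per_sec, item_bytes):
    """Monthly DynamoDB write-throughput cost; one WCU covers a 1 KB write."""
    wcu_per_write = math.ceil(item_bytes / 1024)
    wcu = writes_per_sec * wcu_per_write
    return wcu / 10 * DDB_PRICE_PER_10_WCU_HOUR * HOURS_PER_MONTH

# Scenario 1: 2 KB objects -> DynamoDB wins; Scenario 2: 32 KB objects -> S3 wins.
print(s3_request_cost(300))              # ~3888 USD (requests only, either scenario)
print(dynamodb_write_cost(300, 2048))    # ~281 USD
print(dynamodb_write_cost(300, 32768))   # ~4493 USD
```

The same object count costs the same in S3 requests either way; only the DynamoDB side scales with object size, which is what drives the crossover.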
  28. Best Practices
  29. Amazon RDS
      When to use
      • Transactions
      • Complex queries
      • Medium to high query/write rate – up to 30 K IOPS (15 K reads + 15 K writes)
      • 100s of GB to low TBs
      • Workload can fit in a single node
      • High durability
      When not to use
      • Massive read/write rates – example: 150 K write requests per second
      • Data size or throughput demands sharding – example: 10s or 100s of terabytes
      • Simple Get/Put and queries that a NoSQL can handle
      • Complex analytics
      (Diagrams: push-button scaling; Multi-AZ across AZ 1 and AZ 2 in a region; read replicas)
  30. Amazon RDS Best Practices
      • Use the right DB instance class
      • Use EBS-optimized instances (db.m1.large, db.m1.xlarge, db.m2.2xlarge, db.m2.4xlarge, db.cr1.8xlarge)
      • Use provisioned IOPS
      • Use Multi-AZ for high availability
      • Use read replicas for scaling reads, schema changes, and additional failure recovery
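The read-replica advice comes down to routing in the application: writes go to the primary endpoint, reads fan out across replica endpoints. A minimal sketch of that routing decision (the endpoint names are made up for the example):

```python
import itertools

class ReadWriteRouter:
    """Send writes to the primary; round-robin reads across read replicas."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql):
        # Crude read detection for the sketch: SELECT statements go to replicas.
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary

router = ReadWriteRouter("primary.db.example.com",
                         ["replica-1.db.example.com", "replica-2.db.example.com"])
print(router.endpoint_for("INSERT INTO videos VALUES (...)"))  # primary
print(router.endpoint_for("SELECT * FROM videos"))             # replica-1
print(router.endpoint_for("SELECT * FROM videos"))             # replica-2
```

Note that replicas lag the primary, so read-your-own-writes paths should still hit the primary.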
  31. Amazon DynamoDB
      When to use
      • Fast and predictable performance
      • Seamless/massive scale
      • Autosharding
      • Consistent/low latency
      • No size or throughput limits
      • Very high durability
      • Key-value or simple queries
      When not to use
      • Need multi-item/row or cross-table transactions
      • Need complex queries, joins
      • Need real-time analytics on historic data
      • Storing cold data
  32. Amazon DynamoDB Best Practices
      • Keep item size small
      • Store metadata in Amazon DynamoDB and large blobs in Amazon S3
      • Use a table with a hash key for extremely high scale
      • Use a table per day, week, month, etc. for storing time series data (example: Events_table_2012_05_week1/week2/week3, each with Event_id as hash key and Timestamp as range key)
      • Use conditional/OCC updates
      • Use a hash-range key to model 1:N relationships and multi-tenancy
      • Avoid hot keys and hot partitions
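The table-per-period idea is easy to mechanize: derive the table name from the event's date, so recent hot data stays in a small table and old tables can be dropped or archived wholesale. A sketch, following the slide's naming example:

```python
from datetime import date

def events_table_for(day):
    """Name of the weekly events table that holds items for the given day.

    Table-per-period keeps hot recent data in a small, cheap-to-scan table
    and lets whole old tables be dropped (or exported to S3) instead of
    deleting items one by one.
    """
    week = (day.day - 1) // 7 + 1
    return f"Events_table_{day.year}_{day.month:02d}_week{week}"

print(events_table_for(date(2012, 5, 3)))   # Events_table_2012_05_week1
print(events_table_for(date(2012, 5, 10)))  # Events_table_2012_05_week2
```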
  33. Amazon ElastiCache (Memcached)
      When to use
      • Transient key-value store
      • Need to speed up reads/writes
      • Caching frequent SQL, NoSQL, or DW query results
      • Saving transient and frequently updated data – increment/decrement game scores/counters, web application session storage
      • Best-effort deduplication
      When not to use
      • Store infrequently used data
      • Need persistence
  34. Amazon ElastiCache (Memcached) Best Practices
      • Use autodiscovery
      • Share memcached client objects in application
      • Use TTLs
      • Consider memory for connections overhead
      • Use Amazon CloudWatch alarms / SNS alerts on number of connections, swap memory usage, and freeable memory
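The TTL advice pairs naturally with the cache-aside pattern: read the cache first, fall back to the database on a miss, and populate with a TTL so stale entries age out. A local sketch (the dict-backed class stands in for a real memcached client, which exposes the same set-with-TTL/get shape):

```python
import time

class TtlCache:
    """Minimal local stand-in for a memcached client, to show TTL semantics."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}

    def set(self, key, value, ttl):
        self._store[key] = (value, self._clock() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]   # lazily expire
            return None
        return value

def get_video_metadata(cache, video_id, load_from_db):
    """Cache-aside read: try the cache, fall back to the database, populate."""
    value = cache.get(video_id)
    if value is None:
        value = load_from_db(video_id)
        cache.set(video_id, value, ttl=300)  # TTL bounds staleness and memory use
    return value
```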
  35. Amazon ElastiCache (Redis)
      When to use
      • Key-value store with advanced data structures – strings, lists, sets, sorted sets, hashes
      • Caching
      • Leaderboards
      • High-speed sorting
      • Atomic counters
      • Queuing systems
      • Activity streams
      When not to use
      • Need “native” sharding or scale-out
      • Need “hard” persistence
      • Data won’t fit in memory
      • Need transaction rollback even under exceptions
  36. Amazon ElastiCache (Redis) Best Practices
      • Use TTL
      • Use the right instance types – instances with high ECU/vCPU and network performance yield the highest throughput (example: m2.4xlarge, m2.2xlarge)
      • Use read replicas
        – Increase read throughput
        – AOF cannot protect against all failure modes; promote read replicas to primary
      • Use an RDB file snapshot for on-premises to Amazon ElastiCache migration
      • Key parameter group settings
        – Avoid “AOF with fsync always” – huge impact on performance
        – AOF (+ RDB) with fsync everysec – best durability + performance
        – Pub/sub: set client-output-buffer-limit-pubsub-hard-limit and client-output-buffer-limit-pubsub-soft-limit based on the workloads
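The leaderboard use case maps directly onto Redis sorted sets (ZINCRBY to score, ZREVRANGE to rank). A local sketch of the same semantics, without a Redis server, to show why the data structure fits:

```python
class Leaderboard:
    """Local sketch of the sorted-set leaderboard pattern Redis provides."""
    def __init__(self):
        self._scores = {}

    def add_score(self, player, points):
        # Mirrors ZINCRBY: accumulate a player's score atomically.
        self._scores[player] = self._scores.get(player, 0) + points

    def top(self, n):
        """Players ordered by score, highest first (what ZREVRANGE returns)."""
        return sorted(self._scores, key=lambda p: -self._scores[p])[:n]

board = Leaderboard()
board.add_score("alice", 50)
board.add_score("bob", 30)
board.add_score("alice", 10)
print(board.top(2))  # ['alice', 'bob']
```

In Redis the sort order is maintained incrementally on write, so `top(n)` is cheap even for millions of players; the sketch re-sorts only because it is a stand-in.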
  37. Amazon CloudSearch
      When to use
      • No search expertise
      • Full-text search
      • Ranking
      • Relevance
      • Structured and unstructured data
      • Faceting – e.g. “$0 to $10 (4 items)”, “$10 and above (3 items)”
      When not to use
      • As a replacement for a database – not as a system of record
      • Transient data
      • Nonatomic updates
  38. Amazon CloudSearch Best Practices
      • Batch documents for uploading
      • Use Amazon CloudSearch for searching and another store for retrieving full records for the UI (i.e. don’t use return fields)
      • Include other data, like popularity scores, in documents
      • Use stop words to remove common terms
      • Use fielded queries to reduce match sets
      • Query latency is proportional to query specificity
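Batching uploads means sending one request that carries many document operations instead of one request per document. A simplified sketch of assembling such a batch (the "type"/"id"/"fields" shape follows the CloudSearch document-batch format, somewhat simplified; the field names are invented for the example):

```python
import json

def make_add_batch(documents):
    """Build one JSON batch of 'add' operations for a single upload request."""
    return json.dumps([
        {"type": "add", "id": doc_id, "fields": fields}
        for doc_id, fields in documents
    ])

batch = make_add_batch([
    ("movie-1", {"title": "Example Film", "popularity": 87}),
    ("movie-2", {"title": "Another Film", "popularity": 42}),
])
print(batch)
```

Note the popularity score travels inside the document, per the "include other data in documents" advice, so it can be used in ranking expressions.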
  39. Amazon Redshift
      When to use
      • Information analysis and reporting
      • Complex DW queries that summarize historical data
      • Batched large updates, e.g. daily sales totals
      • 10s of concurrent queries
      • 100s of GB to PB
      • Compression
      • Column based
      • Very high durability
      When not to use
      • OLTP workloads – 1000s of concurrent users, large number of singleton updates
  40. Amazon Redshift Best Practices
      • Use the COPY command to load large data sets from Amazon S3, Amazon DynamoDB, or Amazon EMR/EC2/Unix/Linux hosts
        – Split your data into multiple files
        – Use GZIP or LZOP compression
        – Use a manifest file
      • Choose a proper sort key – range or equality on the WHERE clause
      • Choose a proper distribution key – join column, foreign key or largest dimension, GROUP BY column; avoid a distribution key for denormalized data
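The split-compress-manifest loading recipe can be sketched as plain string construction: a manifest JSON listing the split parts, and a COPY statement that references it. The bucket names and credentials below are placeholders:

```python
import json

def make_manifest(s3_urls):
    """Redshift manifest file listing the split input files (all mandatory)."""
    return json.dumps({"entries": [{"url": u, "mandatory": True} for u in s3_urls]})

def make_copy_statement(table, manifest_url, credentials):
    """COPY statement that loads via a manifest, expecting GZIP-compressed parts."""
    return (f"COPY {table} FROM '{manifest_url}' "
            f"CREDENTIALS '{credentials}' GZIP MANIFEST;")

# Splitting into one file per slice lets every node load in parallel.
manifest = make_manifest([f"s3://my-bucket/sales/part-{i:04d}.gz" for i in range(4)])
stmt = make_copy_statement("daily_sales", "s3://my-bucket/sales.manifest",
                           "aws_access_key_id=<key>;aws_secret_access_key=<secret>")
print(stmt)
```

Splitting into a multiple of the cluster's slice count keeps all slices busy during the load, which is the main reason COPY outperforms row-at-a-time INSERTs.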
  41. Amazon Elastic MapReduce
      When to use
      • Batch analytics/processing – answers in minutes or hours
      • Structured and unstructured data
      • Parallel scans of the entire dataset with uniform query performance
      • Supports Hive QL + other languages
      • GB, TB, or PB of data
      • Replicated data store (HDFS) for ad hoc and real-time queries (HBase)
      When not to use
      • Real-time analytics (DW) – need answers in seconds
      • 1000s of concurrent users
  42. Amazon Elastic MapReduce Best Practices
      • Choose between transient and persistent clusters for best TCO
      • Leverage Amazon S3 integration for highly durable and interim storage
      • Right-size cluster instances based on each job – not one size fits all
      • Leverage resizing and spot to add and remove capacity cost-effectively
      • Tuning cluster instances can be easier than tuning Hadoop code
      (Chart: job flow duration 14 hours vs. 7 hours)
  43. AWS Data Pipeline
      When to use
      • Automate movement and transformation of data (ETL in the cloud)
      • Dependency management – data, control
      • Schedule management
      • Transient Amazon EMR clusters
      • Regular data-move patterns – every hour or day, every 30 minutes
      • Amazon DynamoDB backups – cross-region
      When not to use
      • Less than a 15-minute scheduling interval
      • Execution latency of less than a minute
      • Event-based scheduling
  44. AWS Data Pipeline Best Practices
      • Use dependency-based rather than time-based scheduling
      • Make your activities idempotent
      • Add in your tools using shell activity
      • Use Amazon S3 for staging
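Idempotence here means that rerunning an activity for the same scheduled period produces the same output, not a duplicate. Writing to a deterministic, period-keyed staging path achieves that: retries and backfills simply overwrite. A minimal illustration (a dict stands in for the S3 staging area):

```python
def run_activity(store, period, records):
    """An idempotent activity: its output key is derived only from the period,
    and the write replaces rather than appends, so reruns are safe."""
    key = f"staging/output/{period}/part-0000"   # deterministic per period
    store[key] = sorted(records)                 # overwrite, never append
    return key

store = {}
run_activity(store, "2013-11-14", ["b", "a"])
run_activity(store, "2013-11-14", ["b", "a"])    # rerun: same result, no duplicates
print(store)
```

An append-style activity (e.g. writing to a timestamped random key) would double its output on every retry, which is exactly what dependency-driven pipelines must avoid.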
  45. Amazon S3
      When to use
      • Store large objects
      • Key-value store – Get/Put/List
      • Unlimited storage
      • Versioning
      • Very high durability – 99.999999999%
      • Very high throughput (via parallel clients)
      • Storing persistent data – backups, source/target for EMR, blob store with metadata in SQL or NoSQL
      When not to use
      • Complex queries
      • Very low latency (ms)
      • Search
      • Read-after-write consistency for overwrites
      • Need transactions
  46. Amazon S3 Best Practices
      • Use a random hash prefix for keys
      • Ensure a random access pattern
      • Use Amazon CloudFront for high-throughput GETs and PUTs
      • Leverage the high durability, high throughput design of Amazon S3 for backup and as a common storage sink
        – Durable sink between data services; supports decoupling and asynchronous delivery
      • Consider RRS for lower-cost, lower-durability storage of derivatives or copies
      • Consider parallel threads and multipart upload for faster writes
      • Consider parallel threads and range GET for faster reads
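The random hash prefix advice reflects how S3 partitioned its key space at the time: sequential key names (timestamps, counters) concentrate load on one partition, while a short hash prefix spreads it. A sketch of the technique (the key path is a made-up example):

```python
import hashlib

def hashed_key(natural_key, prefix_len=4):
    """Prefix an object key with a few hash characters so lexicographically
    adjacent uploads land on different S3 key-space partitions."""
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    return f"{digest[:prefix_len]}/{natural_key}"

print(hashed_key("videos/2013/11/14/upload-000001.mp4"))
print(hashed_key("videos/2013/11/14/upload-000002.mp4"))
```

The prefix is deterministic, so the full key can always be recomputed from the natural name; listing by natural prefix, however, is lost, which is the trade-off of the pattern.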
  47. Amazon Glacier
      When to use
      • Infrequently accessed data sets
      • Very low cost storage
      • Data retrieval times of several hours are acceptable
      • Encryption at rest
      • Very high durability – 99.999999999%
      • Unlimited amount of storage
      When not to use
      • Frequent access
      • Low-latency access
  48. Amazon Glacier Best Practices
      • Reduce request and storage costs with aggregation
        – Aggregate your files into bigger files before sending them to Amazon Glacier
        – Store checksums along with your files
        – Use a format that allows you to access files within your aggregate archive
      • Improve speed and reliability with multipart upload
      • Reduce costs with ranged retrievals
      • Maintain your own index in a highly durable store
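The aggregation advice (bundle, checksum, keep an index elsewhere) can be sketched with a tar archive built in memory; tar keeps per-member offsets, so individual files remain addressable, which is what makes ranged retrievals possible later. In practice the archive bytes would go to Glacier and the index to a durable store such as DynamoDB:

```python
import hashlib
import io
import tarfile

def aggregate(files):
    """Bundle many small files into one tar archive for a single upload,
    returning the archive bytes plus a checksum index to store separately."""
    buf = io.BytesIO()
    index = {}
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
            index[name] = hashlib.sha256(data).hexdigest()
    return buf.getvalue(), index

archive, index = aggregate({"log-001.txt": b"alpha", "log-002.txt": b"beta"})
print(len(index))  # 2
```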
  49. Amazon EC2 + Amazon EBS/Instance Storage
      When to use
      • Alternate data store technologies
      • Hand-tuned performance needs
      • Direct/admin access required
      When not to use
      • When a managed service will do the job
      • When operational experience is low
  50. Amazon EBS Best Practices
      • Pick the right EC2 instance type
        – Higher “network performance” instances for driving more Amazon EBS IOPS
        – EBS-optimized EC2 instances for dedicated throughput between EC2 and Amazon EBS
      • Use provisioned IOPS volumes for database workloads requiring consistent IOPS
      • Use standard volumes for workloads requiring low to moderate IOPS and occasional bursts
      • Stripe multiple Amazon EBS volumes for higher IOPS or storage
        – RAID0 for higher I/O; RAID10 for highest local durability
      • Amazon EBS snapshots – quiesce the file system and take a snapshot
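The striping arithmetic is simple but worth stating: RAID0 across EBS volumes adds both IOPS and capacity, at the cost that losing any one volume loses the array (hence RAID10 when local durability matters). A tiny sketch of the sizing math:

```python
def raid0_profile(volume_count, iops_per_volume, gb_per_volume):
    """Aggregate IOPS and capacity of a RAID0 stripe across EBS volumes.
    Throughput is also capped by the instance's EBS bandwidth, not shown here."""
    return {"iops": volume_count * iops_per_volume,
            "capacity_gb": volume_count * gb_per_volume}

# e.g. four 1 TB provisioned-IOPS volumes at 4,000 IOPS each
print(raid0_profile(4, 4000, 1024))  # {'iops': 16000, 'capacity_gb': 4096}
```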
  51. Amazon EC2 Best Practices (chart: HI – best IOPS/$; HS – best GB/$; best vCPU/$; best memory GiB/$)
  52. Summary
  53. Cloud Data Tier Architecture Anti-Pattern (diagram: Data Tier)
  54. AWS Data Tier Architecture – Use the right tool for the job! Data Tier: Amazon ElastiCache, Amazon CloudSearch, Amazon Elastic MapReduce, Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Data Pipeline
  55. Reference Architecture (diagram): Amazon CloudSearch, Amazon ElastiCache, Amazon RDS, Amazon EMR, Amazon DynamoDB, Amazon Redshift, AWS Data Pipeline, Amazon S3, Amazon Glacier
  56. Cost Conscious Design
  57. Please give us your feedback on this presentation. DAT203. As a thank you, we will select prize winners daily for completed surveys!
  58. Remember…