AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent 2013

Learn about architecture best practices for combining AWS storage and database technologies. We outline AWS storage options (Amazon EBS, Amazon EC2 Instance Storage, Amazon S3 and Amazon Glacier) along with AWS database options including Amazon ElastiCache (in-memory data store), Amazon RDS (SQL database), Amazon DynamoDB (NoSQL database), Amazon CloudSearch (search), Amazon EMR (hadoop) and Amazon Redshift (data warehouse). Then we discuss how to architect your database tier by using the right database and storage technologies to achieve the required functionality, performance, availability, and durability—at the right cost.

Presentation Transcript

  • DAT203 - AWS Storage and Database Architecture Best Practices Siva Raghupathy, Amazon Web Services © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • The Third Platform
    – Built on: mobile devices, cloud services, social technologies, big data
    – Billions of users
    – Millions of apps
  • Data Volume, Velocity, Variety
    – 2.7 zettabytes (ZB) of data exist in the digital universe today (1 ZB = 1 billion terabytes)
    – 450 billion transactions per day by 2020
    – More unstructured data than structured data
  • Common Questions from Database Developers
    – Cloud migration: How do I move my data to the cloud?
    – Data/storage technologies: What data store should I use? SQL or NoSQL? Hadoop or a data warehouse? What about search?
    – Management concerns: Is my data secure in the cloud? Can I get relational features without management nightmares? My data volume, velocity, and variety are exploding! How can I reduce cost?
    – Performance and delivery: Need low latency (ms or µs), need high throughput, need to ship in days, not years!
  • Cloud Data Tier Anti-Pattern [Diagram: a single, do-everything data tier]
  • Cloud Data Tier Architecture – Use the Right Tool for the Job! [Diagram: client tier → app/web tier → data tier built from search, cache, blob store, ETL, NoSQL, SQL, data warehouse, and Hadoop components]
  • [Diagram: the AWS platform – deployment & administration, app services, compute, storage, database, and networking, on top of the AWS global infrastructure]
  • AWS Managed Database & Storage Services
    – Structured, complex query: SQL (Amazon RDS – MySQL, Oracle, SQL Server); data warehouse (Amazon Redshift)
    – Structured, simple query: NoSQL (Amazon DynamoDB); cache (Amazon ElastiCache – Memcached, Redis); search (Amazon CloudSearch)
    – Unstructured, custom query: Hadoop (Amazon Elastic MapReduce, EMR)
    – Unstructured, no query: cloud storage (Amazon S3, Amazon Glacier)
  • AWS Primitive Compute and Storage
    – Compute capabilities: many different EC2 instance types (general purpose, compute optimized, storage optimized, memory optimized); host any major data storage technology (RDBMS, NoSQL, cache)
    – Raw storage options: EC2 instance store (ephemeral); Amazon Elastic Block Store (EBS) standard volumes (1 TB, ~100 IOPS per volume) and provisioned IOPS volumes (1 TB, up to 4,000 IOPS per volume); stripe multiple volumes for higher IOPS or storage
    – Primitives add flexibility, but also come with operational burden!
  • AWS Data Tier Architecture – Use the right tool for the job! [Diagram: a data tier of Amazon ElastiCache, Amazon CloudSearch, Amazon Elastic MapReduce, Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, and Amazon Redshift, plus AWS Data Pipeline]
  • Reference Architecture
  • Reference Architecture [Diagram: Amazon CloudSearch, Amazon ElastiCache, Amazon RDS, Amazon EMR, Amazon DynamoDB, Amazon Redshift, AWS Data Pipeline, Amazon S3, Amazon Glacier]
  • Use Case: A Video Streaming Application
  • Use Case: A Video Streaming App – Upload [Diagram: Amazon CloudSearch, Amazon RDS, Amazon DynamoDB, Amazon S3]
  • Use Case: A Video Streaming App – Discovery [Diagram: Amazon CloudFront, Amazon CloudSearch, Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, Amazon S3, Amazon Glacier]
  • Use Case: A Video Streaming App – Recommendations [Diagram: Amazon DynamoDB, Amazon EMR, Amazon S3, Amazon Glacier]
  • Use Case: A Video Streaming App – Analytics [Diagram: Amazon EMR, Amazon S3, Amazon Redshift, Amazon Glacier]
  • What is the temperature of your data?
  • Data Characteristics: Hot, Warm, Cold

                   Hot         Warm       Cold
     Volume        MB–GB       GB–TB      PB
     Item size     B–KB        KB–MB      KB–TB
     Latency       ms          ms, sec    min, hrs
     Durability    Low–High    High       Very High
     Request rate  Very High   High       Low
     Cost/GB       $$–$        $–¢¢       ¢
  • [Chart: Amazon Glacier, Amazon S3, Amazon EMR, Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, and Amazon Redshift arranged along axes of structure (low → high), request rate, cost/GB, latency, and data volume]
  • What data store should I use?

     Service             Avg. latency    Data volume          Item size          Request rate               Cost ($/GB/mo)   Durability
     ElastiCache         ms              GB                   B–KB               Very High                  $$               Low–Moderate
     Amazon DynamoDB     ms              GB–TB (no limit)     KB (64 KB max)     Very High                  ¢¢               Very High
     Amazon RDS          ms, sec         GB–TB (3 TB max)     KB (~row size)     High                       ¢¢               High
     Amazon CloudSearch  ms, sec         GB–TB                KB (1 MB max)      High                       $                High
     Amazon Redshift     sec, min        TB–PB (1.6 PB max)   KB (64 KB max)     Low                        ¢                High
     Amazon EMR (Hive)   sec, min, hrs   GB–PB (~nodes)       KB–MB              Low                        ¢                High
     Amazon S3           ms, sec, min    GB–PB (no limit)     KB–GB (5 TB max)   Low–Very High (no limit)   ¢                Very High
     Amazon Glacier      hrs (~size)     GB–PB (no limit)     GB (40 TB max)     Very Low (no limit)        ¢                Very High

    (ElastiCache and DynamoDB target hot data; RDS, CloudSearch, Redshift, and EMR warm data; S3 warm-to-cold data; Glacier cold data.)
  • AWS Data Tier Architecture – Use the right tool for the job! [Diagram: Amazon ElastiCache, Amazon CloudSearch, Amazon Elastic MapReduce, Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Data Pipeline]
  • Cost Conscious Design
  • Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB? “I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”

     Request rate (writes/sec)   Object size (bytes)   Total size (GB/month)   Objects per month
     300                         2,048                 1,483                   777,600,000
  • Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?
  • Amazon S3 or Amazon DynamoDB?

     Request rate (writes/sec)   Object size (bytes)   Total size (GB/month)   Objects per month
     300                         2,048                 1,483                   777,600,000
  • Amazon S3 or Amazon DynamoDB?

                   Request rate   Object size   Total size   Objects per   Verdict
                   (writes/sec)   (bytes)       (GB/month)   month
     Scenario 1    300            2,048         1,483        777,600,000   use Amazon DynamoDB
     Scenario 2    300            32,768        23,730       777,600,000   use Amazon S3
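    To make the comparison concrete, here is a back-of-the-envelope sketch in Python (not from the original deck). The prices are illustrative placeholders, storage costs and request tiers are simplified, and the 1 KB write-capacity-unit model reflects 2013-era DynamoDB; check current AWS pricing before relying on the output.

    ```python
    # Rough monthly cost comparison: S3 PUT-request cost vs. DynamoDB
    # provisioned-write cost. All prices below are assumed placeholders.
    S3_PUT_PRICE_PER_1K = 0.005        # USD per 1,000 PUT requests (assumed)
    DDB_PRICE_PER_WCU_HOUR = 0.00065   # USD per write-capacity-unit-hour (assumed)
    HOURS_PER_MONTH = 720

    def s3_monthly_request_cost(objects_per_month):
        return objects_per_month / 1000 * S3_PUT_PRICE_PER_1K

    def dynamodb_monthly_write_cost(writes_per_sec, object_size_bytes):
        # One write capacity unit covers a 1 KB item; larger items consume
        # proportionally more units (ceiling division).
        wcu_per_write = -(-object_size_bytes // 1024)
        wcu = writes_per_sec * wcu_per_write
        return wcu * DDB_PRICE_PER_WCU_HOUR * HOURS_PER_MONTH

    for label, size in [("Scenario 1 (2 KB)", 2048), ("Scenario 2 (32 KB)", 32768)]:
        print(label,
              "S3: $%.0f" % s3_monthly_request_cost(777_600_000),
              "DynamoDB: $%.0f" % dynamodb_monthly_write_cost(300, size))
    ```

    With these assumed prices, the 2 KB scenario favors DynamoDB's provisioned writes, while at 32 KB the per-item write cost tips the balance toward S3, matching the verdicts above.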
  • Best Practices
  • Amazon RDS
    When to use:
      – Transactions
      – Complex queries
      – Medium to high query/write rate: up to 30 K IOPS (15 K reads + 15 K writes)
      – 100s of GB to low TBs of data
      – Workload can fit in a single node
      – High durability
    When not to use:
      – Massive read/write rates (e.g., 150 K write requests per second)
      – Data size or throughput demands sharding (e.g., 10s or 100s of terabytes)
      – Simple Get/Put and queries that a NoSQL store can handle
      – Complex analytics
    [Diagram: push-button scaling; Multi-AZ deployment across AZ 1 and AZ 2 in a region; read replicas]
  • Amazon RDS Best Practices
    – Use the right DB instance class
    – Use EBS-optimized instances (db.m1.large, db.m1.xlarge, db.m2.2xlarge, db.m2.4xlarge, db.cr1.8xlarge)
    – Use provisioned IOPS
    – Use Multi-AZ for high availability
    – Use read replicas for scaling reads, schema changes, and additional failure recovery
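    A minimal sketch of these settings using boto3 (which postdates this talk); the identifiers, instance class, and sizes are hypothetical:

    ```python
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Multi-AZ primary on provisioned-IOPS (io1) storage.
    rds.create_db_instance(
        DBInstanceIdentifier="video-metadata",      # hypothetical name
        DBInstanceClass="db.m5.xlarge",             # modern stand-in for the classes above
        Engine="mysql",
        MasterUsername="admin",
        MasterUserPassword="change-me",
        AllocatedStorage=500,                       # GB
        StorageType="io1",
        Iops=10000,                                 # consistent provisioned IOPS
        MultiAZ=True,                               # synchronous standby in another AZ
    )

    # Read replica for scaling reads and additional failure recovery.
    rds.create_db_instance_read_replica(
        DBInstanceIdentifier="video-metadata-replica-1",
        SourceDBInstanceIdentifier="video-metadata",
    )
    ```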
  • Amazon DynamoDB
    When to use:
      – Fast and predictable performance
      – Seamless/massive scale; autosharding
      – Consistent, low latency
      – No size or throughput limits
      – Very high durability
      – Key-value or simple queries
    When not to use:
      – Need multi-item/row or cross-table transactions
      – Need complex queries or joins
      – Need real-time analytics on historic data
      – Storing cold data
  • Amazon DynamoDB Best Practices
    – Keep item size small
    – Store metadata in Amazon DynamoDB and large blobs in Amazon S3
    – Use a table with a hash key for extremely high scale
    – Use a table per day, week, month, etc. for storing time-series data
    – Use conditional/OCC updates
    – Use a hash-range key to model 1:N relationships and multi-tenancy
    – Avoid hot keys and hot partitions
    [Diagram: Events_table_2012 with Event_id (hash key), Timestamp (range key), Attribute1 … AttributeN, split into weekly tables Events_table_2012_05_week1, _week2, _week3]
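    A minimal boto3 sketch (hypothetical table and attribute names) of two of the practices above: a hash-range key on a weekly time-series table, and a conditional write:

    ```python
    import boto3
    from botocore.exceptions import ClientError

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

    # One table per week keeps hot time-series data small and cheap to retire.
    table = dynamodb.create_table(
        TableName="Events_table_2012_05_week1",
        KeySchema=[
            {"AttributeName": "event_id", "KeyType": "HASH"},
            {"AttributeName": "timestamp", "KeyType": "RANGE"},
        ],
        AttributeDefinitions=[
            {"AttributeName": "event_id", "AttributeType": "S"},
            {"AttributeName": "timestamp", "AttributeType": "N"},
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
    )
    table.wait_until_exists()

    # Conditional write: insert only if this event_id doesn't already exist.
    try:
        table.put_item(
            Item={"event_id": "e-123", "timestamp": 1385000000,
                  "s3_blob": "s3://my-bucket/blobs/e-123"},  # large blob lives in S3
            ConditionExpression="attribute_not_exists(event_id)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
    ```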
  • Amazon ElastiCache (Memcached)
    When to use:
      – Transient key-value store
      – Need to speed up reads/writes
      – Caching frequent SQL, NoSQL, or DW query results
      – Saving transient and frequently updated data: increment/decrement game scores/counters; web application session storage
      – Best-effort deduplication
    When not to use:
      – Storing infrequently used data
      – Need persistence
  • Amazon ElastiCache (Memcached) Best Practices
    – Use autodiscovery
    – Share memcached client objects in the application
    – Use TTLs
    – Account for the memory overhead of connections
    – Use Amazon CloudWatch alarms / SNS alerts on: number of connections; swap memory usage; freeable memory
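    A minimal caching sketch with TTLs, assuming the pymemcache client and a hypothetical cluster endpoint; in a real cluster the autodiscovery configuration endpoint would replace the single node shown here:

    ```python
    import json
    from pymemcache.client.base import Client

    cache = Client(("my-cluster.xxxxxx.cfg.use1.cache.amazonaws.com", 11211))

    def get_user(user_id, db_lookup):
        key = "user:%s" % user_id
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)          # cache hit
        user = db_lookup(user_id)              # fall through to SQL/NoSQL
        cache.set(key, json.dumps(user), expire=300)  # 5-minute TTL
        return user
    ```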
  • Amazon ElastiCache (Redis)
    When to use:
      – Key-value store with advanced data structures: strings, lists, sets, sorted sets, hashes
      – Caching
      – Leaderboards
      – High-speed sorting
      – Atomic counters
      – Queuing systems
      – Activity streams
    When not to use:
      – Need “native” sharding or scale-out
      – Need “hard” persistence
      – Data won’t fit in memory
      – Need transaction rollback even under exceptions
  • Amazon ElastiCache (Redis) Best Practices
    – Use TTLs
    – Use the right instance types: instances with high ECU/vCPU and network performance yield the highest throughput (e.g., m2.4xlarge, m2.2xlarge)
    – Use read replicas to increase read throughput; AOF cannot protect against all failure modes, so promote read replicas to primary on failure
    – Use an RDB file snapshot for on-premises to Amazon ElastiCache migration
    – Key parameter group settings: avoid “AOF with fsync always” (huge impact on performance); “AOF (+ RDB) with fsync everysec” gives the best durability + performance; for pub-sub, set client-output-buffer-limit-pubsub-hard-limit and client-output-buffer-limit-pubsub-soft-limit based on the workload
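    A minimal leaderboard sketch with Redis sorted sets (one of the use cases on the previous slide), assuming a recent redis-py client and a hypothetical endpoint:

    ```python
    import redis

    r = redis.Redis(host="my-redis.xxxxxx.use1.cache.amazonaws.com", port=6379)

    def record_score(board, player, points):
        r.zincrby(board, points, player)   # atomic score increment
        r.expire(board, 86400)             # keep each daily board for 24 h (TTL)

    def top_n(board, n=10):
        # Highest scores first, with their values.
        return r.zrevrange(board, 0, n - 1, withscores=True)

    record_score("leaderboard:2013-11-14", "alice", 42)
    print(top_n("leaderboard:2013-11-14"))
    ```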
  • Amazon CloudSearch
    When to use:
      – No search expertise in house
      – Full-text search, ranking, relevance
      – Structured and unstructured data
      – Faceting (e.g., “$0 to $10 (4 items)”, “$10 and above (3 items)”)
    When not to use:
      – As a replacement for a database (not a system of record)
      – Transient data
      – Nonatomic updates
  • Amazon CloudSearch Best Practices
    – Batch documents for uploading
    – Use Amazon CloudSearch for searching and another store for retrieving full records for the UI (i.e., don’t use return fields)
    – Include other data, like popularity scores, in documents
    – Use stop words to remove common terms
    – Use fielded queries to reduce match sets
    – Query latency is proportional to query specificity
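    A minimal batching sketch assuming boto3's cloudsearchdomain client and a hypothetical document endpoint; the "add"-operation batch is CloudSearch's JSON batch format:

    ```python
    import json
    import boto3

    client = boto3.client(
        "cloudsearchdomain",
        endpoint_url="https://doc-videos-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
    )

    # Batch several documents into one upload; popularity is included as a
    # field so it can feed ranking, per the best practices above.
    batch = [
        {"type": "add", "id": "video-1",
         "fields": {"title": "Intro to S3", "popularity": 87}},
        {"type": "add", "id": "video-2",
         "fields": {"title": "DynamoDB deep dive", "popularity": 93}},
    ]

    client.upload_documents(
        documents=json.dumps(batch).encode("utf-8"),
        contentType="application/json",
    )
    ```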
  • Amazon Redshift
    When to use:
      – Information analysis and reporting
      – Complex DW queries that summarize historical data
      – Batched large updates, e.g., daily sales totals
      – 10s of concurrent queries
      – 100s of GB to PB of data
      – Compression; column-based storage
      – Very high durability
    When not to use:
      – OLTP workloads: 1000s of concurrent users; large numbers of singleton updates
  • Amazon Redshift Best Practices
    – Use the COPY command to load large data sets from Amazon S3, Amazon DynamoDB, or Amazon EMR/EC2/Unix/Linux hosts: split your data into multiple files; use GZIP or LZOP compression; use a manifest file
    – Choose a proper sort key: range or equality predicates in the WHERE clause
    – Choose a proper distribution key: join column, foreign key or largest dimension, GROUP BY column; avoid a distribution key for denormalized data
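    A minimal load sketch assuming psycopg2 and hypothetical cluster, bucket, and credential values; it follows the split-files/GZIP/manifest advice above:

    ```python
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...",
    )
    with conn, conn.cursor() as cur:
        # The manifest lists the pre-split, gzipped files so COPY can load
        # them in parallel across the cluster's slices.
        cur.execute("""
            COPY daily_sales
            FROM 's3://my-bucket/sales/manifest'
            CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
            MANIFEST
            GZIP
            DELIMITER '|';
        """)
    ```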
  • Amazon Elastic MapReduce
    When to use:
      – Batch analytics/processing: answers in minutes or hours
      – Structured and unstructured data
      – Parallel scans of the entire dataset with uniform query performance
      – Supports Hive QL and other languages
      – GB, TB, or PB of data
      – Replicated data store (HDFS); ad-hoc and real-time queries (HBase)
    When not to use:
      – Real-time analytics (use a DW): need answers in seconds
      – 1000s of concurrent users
  • Amazon Elastic MapReduce Best Practices
    – Choose between transient and persistent clusters for the best TCO
    – Leverage Amazon S3 integration for highly durable and interim storage
    – Right-size cluster instances based on each job, not one size fits all
    – Leverage resizing and Spot Instances to add and remove capacity cost-effectively
    – Tuning cluster instances can be easier than tuning Hadoop code
    [Chart: job flow duration of 14 hours vs. 7 hours]
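    A minimal transient-cluster sketch using boto3 (which postdates this talk); the names, instance counts, bid price, and release label are illustrative, and the Hive step invocation is one common pattern rather than the deck's own example:

    ```python
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    emr.run_job_flow(
        Name="nightly-log-crunch",
        ReleaseLabel="emr-5.36.0",
        LogUri="s3://my-bucket/emr-logs/",           # durable logs in S3
        Instances={
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
                {"Name": "tasks", "InstanceRole": "TASK", "Market": "SPOT",
                 "BidPrice": "0.10", "InstanceType": "m5.xlarge",
                 "InstanceCount": 8},                # cheap, removable capacity
            ],
            "KeepJobFlowAliveWhenNoSteps": False,    # transient: shut down after steps
        },
        Steps=[{
            "Name": "hive-aggregation",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": "command-runner.jar",
                              "Args": ["hive-script", "--run-hive-script",
                                       "--args", "-f", "s3://my-bucket/etl.q"]},
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    ```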
  • AWS Data Pipeline
    When to use:
      – Automate movement and transformation of data (ETL in the cloud)
      – Dependency management (data and control dependencies)
      – Schedule management
      – Transient Amazon EMR clusters
      – Regular data-move patterns: every 30 minutes, every hour, every day
      – Amazon DynamoDB backups, including cross-region
    When not to use:
      – Scheduling intervals of less than 15 minutes
      – Execution latency of less than a minute
      – Event-based scheduling
  • AWS Data Pipeline Best Practices
    – Use dependency-based rather than time-based scheduling
    – Make your activities idempotent
    – Add your own tools using the shell activity
    – Use Amazon S3 for staging
  • Amazon S3
    When to use:
      – Storing large objects
      – Key-value store: Get/Put/List
      – Unlimited storage; versioning
      – Very high durability: 99.999999999%
      – Very high throughput (via parallel clients)
      – Storing persistent data: backups; source/target for EMR; blob store with metadata in SQL or NoSQL
    When not to use:
      – Complex queries
      – Very low latency (ms)
      – Search
      – Read-after-write consistency for overwrites
      – Need transactions
  • Amazon S3 Best Practices
    – Use a random hash prefix for keys to ensure a random access pattern
    – Use Amazon CloudFront for high-throughput GETs and PUTs
    – Leverage the high-durability, high-throughput design of Amazon S3 for backup and as a common storage sink: a durable sink between data services that supports decoupling and asynchronous delivery
    – Consider RRS for lower-cost, lower-durability storage of derivatives or copies
    – Consider parallel threads and multipart upload for faster writes
    – Consider parallel threads and range GETs for faster reads
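    A minimal sketch of the random-hash-prefix and multipart-upload advice, assuming boto3 and a hypothetical bucket; note the hash-prefix guidance is the 2013-era recommendation:

    ```python
    import hashlib
    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    def hashed_key(natural_key):
        # Prefix with the first 4 hex chars of an MD5 so keys don't sort into
        # one hot partition, e.g. "a1b2/videos/2013/11/clip-42.mp4".
        prefix = hashlib.md5(natural_key.encode()).hexdigest()[:4]
        return "%s/%s" % (prefix, natural_key)

    config = TransferConfig(multipart_threshold=8 * 1024 * 1024,  # parts over 8 MB
                            max_concurrency=10)                   # parallel threads

    s3.upload_file("clip-42.mp4", "my-video-bucket",
                   hashed_key("videos/2013/11/clip-42.mp4"),
                   Config=config)
    ```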
  • Amazon Glacier
    When to use:
      – Infrequently accessed data sets
      – Very low-cost storage
      – Data retrieval times of several hours are acceptable
      – Encryption at rest
      – Very high durability: 99.999999999%
      – Unlimited amount of storage
    When not to use:
      – Frequent access
      – Low-latency access
  • Amazon Glacier Best Practices
    – Reduce request and storage costs with aggregation: aggregate your files into bigger files before sending them to Amazon Glacier; store checksums along with your files; use a format that lets you access individual files within the aggregate archive
    – Improve speed and reliability with multipart upload
    – Reduce costs with ranged retrievals
    – Maintain your own index in a highly durable store
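    A minimal aggregation sketch: tar many small files into one archive, keep your own checksum index (to be persisted in a durable store such as DynamoDB or S3), then upload the single archive. Assumes boto3 and a hypothetical vault name:

    ```python
    import hashlib
    import tarfile
    import boto3

    files = ["logs/a.log", "logs/b.log", "logs/c.log"]
    index = {}

    # One big tar instead of many tiny archives cuts per-request costs, and
    # tar's layout lets individual members be located for ranged retrieval.
    with tarfile.open("batch-0001.tar", "w") as tar:
        for path in files:
            tar.add(path)
            with open(path, "rb") as f:
                index[path] = hashlib.sha256(f.read()).hexdigest()

    glacier = boto3.client("glacier")
    with open("batch-0001.tar", "rb") as body:
        resp = glacier.upload_archive(vaultName="video-cold-storage", body=body)

    # Persist archiveId + index yourself; Glacier's own inventory lags behind.
    print(resp["archiveId"], index)
    ```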
  • Amazon EC2 + Amazon EBS/Instance Storage
    When to use:
      – Alternate data store technologies
      – Hand-tuned performance needs
      – Direct/admin access required
    When not to use:
      – When a managed service will do the job
      – When operational experience is low
  • Amazon EBS Best Practices
    – Pick the right EC2 instance type: higher “network performance” instances for driving more Amazon EBS IOPS; EBS-optimized EC2 instances for dedicated throughput between EC2 and Amazon EBS
    – Use provisioned IOPS volumes for database workloads requiring consistent IOPS
    – Use standard volumes for workloads requiring low to moderate IOPS and occasional bursts
    – Stripe multiple Amazon EBS volumes for higher IOPS or storage: RAID0 for higher I/O; RAID10 for highest local durability
    – Amazon EBS snapshots: quiesce the file system, then take the snapshot
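    A minimal striping sketch using boto3: create several provisioned-IOPS volumes, wait for them, and attach them for an OS-level RAID0 array (e.g., with mdadm). The AZ, sizes, and instance ID are hypothetical:

    ```python
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    volume_ids = []
    for _ in range(4):                                  # four volumes to stripe
        vol = ec2.create_volume(AvailabilityZone="us-east-1a",
                                Size=200,               # GB each
                                VolumeType="io1",
                                Iops=4000)              # provisioned IOPS per volume
        volume_ids.append(vol["VolumeId"])

    # Wait until the volumes are usable, then attach them; assemble the
    # RAID0 array inside the guest OS afterwards.
    ec2.get_waiter("volume_available").wait(VolumeIds=volume_ids)
    for i, vid in enumerate(volume_ids):
        ec2.attach_volume(VolumeId=vid,
                          InstanceId="i-0123456789abcdef0",   # hypothetical
                          Device="/dev/sd" + chr(ord("f") + i))
    ```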
  • Amazon EC2 Best Practices [Chart: instance families compared – HI for best IOPS/$, HS for best GB/$, best vCPU/$, best memory (GiB)/$]
  • Summary
  • Cloud Data Tier Architecture Anti-Pattern [Diagram: a single, do-everything data tier]
  • AWS Data Tier Architecture – Use the right tool for the job! [Diagram: Amazon ElastiCache, Amazon CloudSearch, Amazon Elastic MapReduce, Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Data Pipeline]
  • Reference Architecture [Diagram: Amazon CloudSearch, Amazon ElastiCache, Amazon RDS, Amazon EMR, Amazon DynamoDB, Amazon Redshift, AWS Data Pipeline, Amazon S3, Amazon Glacier]
  • Cost Conscious Design
  • Please give us your feedback on this presentation (DAT203). As a thank-you, we will select prize winners daily for completed surveys!
  • Remember…