Architecting for Failure in AWS - PuppetConf 2013

ARCHITECTING IN AWS
for resilience & cost at scale
Jos Boumans - @jiboumans
http://rafaykhan619.wix.com/downhouse
Thursday 22 August 13

RIPE NCC
Engineering manager for RIPE Database
http://www.ripe.net/db

CANONICAL
http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775
Engineering manager for Ubuntu Server 10.04 & 10.10
http://www.ubuntu.com/business/server/overview

KRUX
VP of Operations & Infrastructure
http://www.krux.com/

SOME OF OUR CUSTOMERS

LOTS OF TRAFFIC
http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html

AVERAGE REQUESTS* / SEC
http://mashable.com/2013/03/21/happy-7th-birthday-twitter/
http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
*Twitter: New tweets
Wikipedia: Articles read
Krux: New data points
[Bar chart; axis from 0 to 15,000 requests / sec]

MONTHLY UNIQUE USERS
[Bar chart; axis from 0 to 800,000,000 monthly unique users]
http://en.wikipedia.org/wiki/Wikipedia
http://mashable.com/2013/03/21/happy-7th-birthday-twitter/

WE CHOSE 'THE CLOUD'
http://previewnetworks.com/blog/

THERE ARE DOWNSIDES
http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines

RESILIENCE & COST AT SCALE

FOCUS ON AWS
http://aws.amazon.com/

APRIL 21, 2011
http://aws.amazon.com/message/680587/
Also: June 29, 2012 - October 22, 2012 - December 24, 2012
http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/

ROOT CAUSE CATEGORIES
# of issues per category:
Software: 8
Automation: 4
Process: 14
http://www.slideshare.net/rahultyagi50999/amazon-cloud-major-outages-analysis
Software bugs & human error

JUNE 29, 2012
http://www.fanpop.com/spots/thunderstorm/images/25416163/title/thunderstorms-wallpaper
http://aws.amazon.com/message/67457/

AWS OUTAGE = YOUR OUTAGE
http://it.mario.wikia.com/wiki/Lakitu

RESILIENCE @ SCALE
Embrace Failure: Hardware will fail. Humans will make errors.
Nature will produce thunderstorms.
http://blabitcanada.com/category/twitter-2/

DEFINE 'AVAILABLE'
Things will break, so choose your degraded state.
http://libcom.org/library/occupied-wall-street-some-tactical-thoughts-malcolm-harris

BASIC API CALL
3 potential points of failure
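
A rough way to see why three potential points of failure matter: if a call succeeds only when every component on its path is up, and failures are independent, the availabilities multiply. The sketch below assumes three hypothetical components; the names and the 99.9% figures are illustrative and do not come from the talk.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request that needs every component to be up,
    assuming the components fail independently of each other."""
    return prod(availabilities)


# Hypothetical components and figures, for illustration only.
components = {"load_balancer": 0.999, "api_server": 0.999, "database": 0.999}
combined = serial_availability(components.values())
print(f"combined availability: {combined:.4%}")
```

Three components at 99.9% each leave the whole call at roughly 99.7%, so every extra hop in the path costs availability unless it has a fallback.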

FALLBACK PATTERNS
The cost of resilience should be accuracy or latency
http://redis.io/
http://memcached.org/
http://varnish-cache.org/
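
One way to pay for resilience with accuracy rather than downtime is to serve the last known good value when the primary source fails. This is a minimal sketch under stated assumptions: `StaleCacheFallback`, `flaky_fetch` and the 300-second staleness limit are hypothetical names and values, and an in-process dictionary stands in for a cache such as Redis or memcached.

```python
import time


class StaleCacheFallback:
    """Wrap a fetch function; if it fails, serve the last known good value."""

    def __init__(self, fetch, max_stale_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache = {}  # key -> (value, stored_at)

    def get(self, key):
        try:
            value = self._fetch(key)
        except Exception:
            # Primary failed: degrade to a possibly stale copy if we have one.
            if key in self._cache:
                cached, stored_at = self._cache[key]
                if self._clock() - stored_at <= self._max_stale:
                    return cached, "stale"
            raise  # nothing usable cached, so surface the original error
        self._cache[key] = (value, self._clock())
        return value, "fresh"


# Hypothetical backend that works once and then goes down.
calls = {"count": 0}


def flaky_fetch(key):
    calls["count"] += 1
    if calls["count"] > 1:
        raise ConnectionError("backend down")
    return f"profile-for-{key}"


store = StaleCacheFallback(flaky_fetch)
print(store.get("user42"))  # served by the backend
print(store.get("user42"))  # backend is down, served from the cache
```

The caller gets a "fresh" or "stale" label with each value, so it can decide whether a degraded answer is acceptable for that request.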

USER EXPERIENCE
My tweet got posted

RESILIENCE TOOLS
Storage, Network & ACL
http://wordyou.ru/kolonki/my-teper-ne-na-avrore-a-na-titanike.html

MANY SMALL NODES VERSUS
A FEW LARGER NODES
The benefits of the many outweigh the benefits of the few
http://www.stealingfaith.com/2012/07/08/throw-off-the-tiny-ropes/
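
The arithmetic behind this trade-off can be sketched as follows, assuming equal-sized nodes that share load evenly (a simplification; the node counts are illustrative and not from the talk).

```python
def capacity_after_failures(node_count, failed=1):
    """Fraction of total capacity left after `failed` equal-sized nodes die."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    if not 0 <= failed <= node_count:
        raise ValueError("failed must be between 0 and node_count")
    return (node_count - failed) / node_count


# Illustrative fleet sizes delivering the same total capacity.
for nodes in (2, 4, 20):
    remaining = capacity_after_failures(nodes)
    print(f"{nodes:>2} nodes: {remaining:.0%} of capacity left after one failure")
```

Losing one node out of two halves the fleet, while losing one out of twenty removes only a twentieth of it, so the spare headroom needed to survive a single failure shrinks as the node count grows.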

DATABASES
CAP Theorem applies.
Your choice: sacrifice availability or consistency. Orange is a lie.
RDBMS
BigTable Based
Master / Slave based
CouchDB
Dynamo Based
http://ferd.ca/beating-the-cap-theorem-checklist.html

SIMPLE STORAGE SERVICE
S3: Arguably AWS' best feature
http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/
http://aws.amazon.com/s3/
https://forums.aws.amazon.com/message.jspa?messageID=182919#182919

CACHE WHAT YOU CAN
HTTP Responses, DB Queries, User content
Browsers have caches too!
http://cruncht.com/95/drupal-caching/
http://redis.io/
http://memcached.org/
http://varnish-cache.org/

CLIENT SIDE STORAGE
Keep a copy of your users' data locally
http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html
http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/

USE ELASTIC LOAD BALANCERS
They will save you more than once
http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/

USE GLOBAL LOAD BALANCING
Fail over to the closest data center on region failure
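
The failover decision itself can be sketched as "pick the nearest region that still passes its health check". This is only an illustration of the selection logic, not how any particular DNS or load-balancing provider implements it; `pick_region` and the latency figures are hypothetical.

```python
def pick_region(latencies_ms, healthy):
    """Return the lowest-latency region that is currently healthy."""
    candidates = {
        region: ms for region, ms in latencies_ms.items() if region in healthy
    }
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical latencies from one client's point of view, in milliseconds.
latencies = {"us-east-1": 20, "us-west-2": 80, "eu-west-1": 110}

print(pick_region(latencies, healthy={"us-east-1", "us-west-2", "eu-west-1"}))
# With us-east-1 failing its health check, traffic moves to the next closest.
print(pick_region(latencies, healthy={"us-west-2", "eu-west-1"}))
```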

SHOUT OUT: DYN
DNS for Bit.ly, Quora, Twitter, Wikia, Fastly, etc.
http://dyn.com

USE IAM ROLES FOR ACCESS
Humans make mistakes, including your humans

COST @ SCALE
Scaling without breaking the bank
http://mgx.com/blogs/wp-content/uploads/2013/07/piggybank.jpg

EMR + SPOT INSTANCES
On demand rate: $0.165 / hour
http://aws.amazon.com/ec2/spot-instances/

AMAZON REDSHIFT
Economical Business Intelligence
Scales with data size
http://www.flitemedia.com/music.php
http://aws.amazon.com/redshift
http://www.tableausoftware.com/

AMAZON GLACIER
"Tapes for the Cloud Era"
Writes vastly cheaper than reads
http://aws.amazon.com/glacier/
http://www.gorp.com/parks-guide/glacier-national-park-outdoor-pp2-guide-cid350021.html

AWS SIMPLE EMAIL SERVICE
Dealing with email is boring and time-consuming
http://aws.amazon.com/ses/
http://bfsdaniels.copycop.com/blog/all-about-printing/hypertargeting-with-direct-mail/

AWS SIMPLE QUEUE SERVICE
Excellent for latency-insensitive, low-volume queues
http://www.toledoblade.com/Retail/2013/01/13/Disney-s-magic-bracelet-new-key-to-its-kingdom.html
http://aws.amazon.com/sqs/
http://colby.id.au/benchmarking-sqs

INSTANCE MARKETPLACE
Buy & sell reserved instances
http://commons.wikimedia.org/wiki/File:Javanese_market_place.jpg
http://aws.amazon.com/ec2/reserved-instances/marketplace/

AWS DYNAMO DB
Excellent for small keys & high read rates
at known & consistent IOPS
http://hlbike.en.ecplaza.net/2.jpg
http://aws.amazon.com/dynamodb/

MAXIMIZE IOPS
RAID 0 Ephemeral drives
Use m1.xlarge or c1.xlarge, or use SSDs if you need >20k IOPS
http://calculator.s3.amazonaws.com/calc5.html
http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#disk-performance

RED FLAGS
Anti-patterns to watch out for
http://grandprix247.com/2012/09/03/spa-pile-up-renews-focus-on-formula-1-safety-matters/

PROVISIONED IOPS EBS
Ephemeral storage on c1/m1.xlarge or SSD is better
If you must: m*large or c1.xlarge for dedicated NIC
http://www.slideshare.net/AmazonWebServices/ebs-mongo-dbwebinarfinal-nn
http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html
http://navidoo.ru/interest/Nasha_jizn/17676.html

AWS DYNAMO DB
For high write rates or
large/variable keys
http://aws.amazon.com/dynamodb/
http://www.walltowall.co.uk/program/standing-tall-worlds-tallest-people_93.aspx

HIGH IO/DISK/RAM NODES
Use them deliberately
http://elledecoration.co.za/2010/07/gigantic/

AWS CLOUDWATCH
Metric collection, Amazon style
Cost prohibitive & resolution too low
http://www.flickr.com/photos/65683080@N08/6893582132/
http://aws.amazon.com/cloudwatch/

LOWER COST PER METRIC
Use Graphite & statsd
http://graphite.wikidot.com/
https://github.com/etsy/statsd
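
Part of what keeps the per-metric cost low is how little work the application does: a statsd metric is a single plain-text UDP datagram of the form `<name>:<value>|<type>`. The sketch below builds and sends one using only the standard library. The metric names are hypothetical, and the host and port assume a statsd daemon on localhost at its default port, 8125.

```python
import socket

# Counter, timer (milliseconds) and gauge, as defined by statsd.
VALID_TYPES = ("c", "ms", "g")


def format_metric(name, value, metric_type):
    """Build a statsd datagram: <name>:<value>|<type>."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    """Fire-and-forget one metric over UDP; nothing is read back."""
    payload = format_metric(name, value, metric_type)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Calling `send_metric("api.requests", 1, "c")` increments a counter, and `send_metric("api.latency", 42, "ms")` records a timing. Because UDP is connectionless, a missing or overloaded statsd daemon does not block the application; the datagram is simply lost.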

HOSTED ALTERNATIVES
Circonus: All the insights you ever wanted
StackDriver: Optimized for AWS
http://circonus.com
http://stackdriver.com

AWS CLOUDFORMATION
Templatize your entire stack
Harder to use as complexity increases
http://aws.amazon.com/cloudwatch/
http://fullnfenil7.blogspot.com/2012/05/amazing-cloud-shapes-photos.html#.UhKrZmRgZHg

RDS FOR ANALYTICS/REPORTS
Paying OLTP prices for BI usage
Sharding will be a matter of time
http://nerds.airbnb.com/redshift-performance-cost
http://business901.com/blog1/understanding-your-customer-problem/

Q & A
http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html
@jiboumans
http://slideshare.net/jiboumans
