Reliability & Scale in AWS while letting you sleep through the night

ONE MAN OPS
Reliability & Scale in AWS while letting you sleep through the night
Jos Boumans - @jiboumans
http://www.fwallpaper.net/picture_pics-Sleepy-cat.html

ONE OF A KIND
My own category

RIPE NCC
Engineering manager for RIPE Database
http://www.ripe.net/db

CANONICAL
Engineering manager for Ubuntu Server 10.04 & 10.10

http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 http://www.ubuntu.com/business/server/overview

KRUX
VP of Operations & Infrastructure

http://www.krux.com/

LOTS OF TRAFFIC
http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html

AVERAGE REQUESTS* / SEC
[Bar chart comparing Twitter, Wikipedia and Krux; axis from 0 to 10,000]
*Twitter: New tweets (https://twitter.com/tps_watcher)
Wikipedia: Articles read (http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm)
Krux: New data points

MONTHLY UNIQUE USERS
[Bar chart; axis from 0 to 500,000,000]
http://www.mediabistro.com/alltwitter/twitter-active-total-users_b17655
http://technorati.com/technology/article/wikipedias-nonprofit-parent-raises-20-million/

WE CHOSE 'THE CLOUD'
http://previewnetworks.com/blog/

THERE ARE DOWNSIDES
http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines

FOCUS ON AWS
http://aws.amazon.com/

APRIL 21, 2011
http://aws.amazon.com/message/65648/
http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/ http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html

... SOME OUTAGES ...
... SKIPPED FOR BREVITY ...

JUNE 14, 2012
http://www.laczik.org/BMW/repair/E38_wiring_harness/E38_wiring_harness.html http://blog.pagerduty.com/2012/06/outage-post-mortem-june-14/

JUNE 29, 2012
http://www.fanpop.com/spots/thunderstorm/images/25416163/title/thunderstorms-wallpaper http://aws.amazon.com/message/67457/

AWS OUTAGE = YOUR OUTAGE
http://it.mario.wikia.com/wiki/Lakitu

THE RULES HAVE CHANGED
You're not in Kansas anymore

http://entreatmenot.blogspot.com/2011/04/shattered-dreams.html

NETWORK WILL PARTITION
And it will happen often

http://thevinylvillain.blogspot.com/2010_04_01_archive.html

DISK IO WILL FLUCTUATE
On a good day, it's mediocre

http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm

IP ADDRESSES WILL CHANGE
IP lease is 8 hours
DNS TTL is 60 seconds
www.fantom-xp.com
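
One way to live with changing addresses is to never hold a resolved IP longer than the DNS TTL. A minimal Python sketch, assuming the 60 second TTL from the slide; the class name `TtlResolver` and its injectable `lookup` and `clock` parameters are hypothetical, chosen for illustration.

```python
import socket
import time


class TtlResolver:
    """Caches a hostname's address for at most ttl seconds, then looks it up again."""

    def __init__(self, ttl=60.0, lookup=socket.gethostbyname, clock=time.monotonic):
        self.ttl = ttl
        self.lookup = lookup  # injectable so the sketch can be exercised offline
        self.clock = clock
        self._cache = {}  # hostname -> (address, expiry time)

    def resolve(self, hostname):
        now = self.clock()
        cached = self._cache.get(hostname)
        if cached and cached[1] > now:
            return cached[0]
        # Entry missing or expired: ask DNS again instead of trusting the old IP.
        address = self.lookup(hostname)
        self._cache[hostname] = (address, now + self.ttl)
        return address
```

Connecting by hostname through something like `resolver.resolve("db.internal")` on every new connection means a replaced instance is picked up within one TTL.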

INSTANCES WILL DIE
And it will always be your Database Master

http://room57.deviantart.com/art/Hangman-188353196

HUMANS MAKE MISTAKES
Including your humans

EMBRACE FAILURE
Hardware will fail. Humans will make errors.
Nature will produce thunderstorms.
http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm

ADJUST YOUR STRATEGY
Don't bring a knife to a gun fight

http://www.flickr.com/photos/statlerhotel/6628770499/sizes/l/in/photostream/

DATA STORES
Some work better than others

http://gustavhoiland.com/2010/03/10/stacked-boxes/

RDBMS
CouchDB
BigTable Based
Dynamo Based
Master / Slave based

CAP THEOREM
Your choice: sacrifice availability or consistency.
Orange is a lie.

MYSQL / ORACLE VS RDS
See: Network partitioning & instances dying

BIGTABLE BASED STORES
HBase, Accumulo, Hypertable
Still suffer when network partitioning happens
http://www.cloudera.com/cdh4/

DYNAMO BASED STORES
Cassandra, Riak, DynamoDB

http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html http://aws.amazon.com/dynamodb/faqs/

GO HOSTED?
CouchDB, MongoDB, Riak, Cassandra, HBase
Your Latency May Vary
http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html

CLIENT SIDE STORAGE
Keep a copy of your user's data locally

http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/ http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html

FILE STORES
EBS vs Instance Store

http://homedezine.blogspot.com/2011/04/day-my-cat-removed-carpet-photo-studio.html

SIMPLE STORAGE SERVICE
S3: Arguably AWS' best feature

http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/

TRAFFIC SHAPING
Control every part of the request

http://www.visualphotos.com/image/2x4154765/man_standing_with_traffic_cones_in_shape_of_u-turn

STAY LOCAL IF YOU CAN
Going off box exposes you to risks you need to mitigate

http://southshorewoman.com/issue/june-2010/article/local-character

CACHE WHAT YOU CAN
HTTP Responses, DB Queries, User content
Browsers have caches too!
http://theoatmeal.com/blog/charity_money
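
A rough Python sketch of caching a DB query in process, using the standard library's `functools.lru_cache`. The functions `query_database` and `cached_segments` and the sample data are made up for illustration; note that `lru_cache` evicts by size, not by age, so it only suits data that may be stale for the life of the process.

```python
from functools import lru_cache

backend_calls = []  # records every trip to the (pretend) database


def query_database(user_id):
    """Hypothetical stand-in for a slow, off-box query."""
    backend_calls.append(user_id)
    return {"user_id": user_id, "segments": ["news", "sports"]}


@lru_cache(maxsize=1024)
def cached_segments(user_id):
    # Return an immutable tuple so callers cannot mutate the cached value.
    return tuple(query_database(user_id)["segments"])
```

The second call for the same `user_id` is answered from memory and never leaves the box.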

USE ELASTIC LOAD BALANCERS
They will save you more than once

http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/

USE GLOBAL LOAD BALANCING
Fail over to the closest data center on region failure

SHOUT OUT: DYN
DNS for Bit.ly, Quora, Twitter, Wikia, etc

USE A CDN
Critical items should always be available

http://kadanthuponanimidangal.blogspot.com/2010/12/blog-post_6992.html

MEASURE EVERYTHING
Find outliers, deviants & trends before they cause trouble

http://www.themoviedb.org/movie/629-the-usual-suspects

GRAPHITE, STATSD & COLLECTD
Use Statsd & Collectd for application/system metrics
Use graphite to store, aggregate & visualize
http://hostedgraphite.com/
http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/
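
For application metrics, the StatsD wire format is a single `name:value|type` line over UDP. A minimal Python sketch, assuming StatsD listens on its default port 8125 on localhost; the helper names `format_metric` and `send_metric` are hypothetical.

```python
import socket


def format_metric(name, value, metric_type):
    """Builds a StatsD line such as 'http.beacon:320|ms'."""
    return "%s:%s|%s" % (name, value, metric_type)


def send_metric(name, value, metric_type="c", host="localhost", port=8125):
    """Fire and forget: UDP means a dead StatsD never blocks the app."""
    payload = format_metric(name, value, metric_type).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
```

Usage: `send_metric("deploys", 1)` increments a counter, `send_metric("http.beacon", 320, "ms")` records a timer.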

GRAPH EVENTS
Deployments, outages, CDN reconfigurations, failed builds, etc
Anything that's important to the health of your ecosystem
http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/

COMPARE WEEK TO WEEK
Overlay week to week graphs using timeShift()
Quickly identifies trends and deviations from trends
http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-10
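
A sketch of the week to week overlay as a Graphite render URL, built in Python. It assumes Graphite's `/render` endpoint and its `timeShift()` and `alias()` functions; the host `graphite.example.com`, the metric name and the helper names are placeholders.

```python
from urllib.parse import urlencode


def week_over_week_targets(metric):
    """Two Graphite targets: the metric now, and the same metric shifted back 7 days."""
    return [
        'alias(%s,"this week")' % metric,
        'alias(timeShift(%s,"7d"),"last week")' % metric,
    ]


def week_over_week_url(base_url, metric, window="-24h"):
    """Builds a render URL that draws both series on one graph."""
    params = [("target", target) for target in week_over_week_targets(metric)]
    params += [("from", window), ("format", "png")]
    return "%s/render?%s" % (base_url.rstrip("/"), urlencode(params))
```

Usage: `week_over_week_url("http://graphite.example.com", "stats.timers.http.beacon.mean")`.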

FORECASTING
Use Holt-Winters confidence bands
Verify that your metrics are within normal tolerance
https://github.com/ripienaar/graphite-graph-dsl/wiki/Creating-Holt-Winters-Forecasts

FIND INDIVIDUAL OUTLIERS
Absolute numbers mean very little
Use mean & standard deviation
http://en.wikipedia.org/wiki/File:Black_sheep-1.jpg
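
A small Python sketch of the mean and standard deviation approach: flag hosts that sit far from the rest of the fleet rather than above a fixed number. The function name, the host names and the 2 sigma cutoff are assumptions for illustration.

```python
from statistics import mean, pstdev


def find_outliers(values_by_host, max_sigmas=2.0):
    """Returns hosts whose value is more than max_sigmas standard deviations from the fleet mean."""
    values = list(values_by_host.values())
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # every host reports the same value
    return sorted(
        host for host, value in values_by_host.items()
        if abs(value - center) > max_sigmas * spread
    )
```

In a small fleet a single bad host inflates the standard deviation itself, so the cutoff needs tuning per metric.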

ALERT ON TRENDS
Once you go over a threshold, it's too late
Alert on unwanted trends and fix them preemptively
http://sub-second.blogspot.com/2012/06/reporting-response-times-percentile.html http://aphyr.github.com/riemann/
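
One possible way to alert on a trend instead of a threshold: fit a straight line through recent samples and ask when it will cross the limit. A Python sketch under simplifying assumptions (linear growth, an upper limit); all function names here are hypothetical.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for (timestamp, value) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    variance_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance_t == 0:
        raise ValueError("need samples at two or more distinct times")
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = covariance / variance_t
    return slope, mean_v - slope * mean_t


def seconds_until(samples, threshold):
    """Projected seconds after the last sample until the trend reaches threshold."""
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None  # flat or falling: never reaches an upper limit
    crossing_time = (threshold - intercept) / slope
    return max(0.0, crossing_time - samples[-1][0])


def should_alert(samples, threshold, lead_time):
    """True when the projected crossing falls within lead_time seconds."""
    eta = seconds_until(samples, threshold)
    return eta is not None and eta <= lead_time
```

Usage: with disk usage samples taken hourly, `should_alert(samples, 90, 24 * 3600)` pages while there is still a day left to act.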

MEASURE WITHOUT RETROFIT
LogFormat "http.beacon:%D|ms" stats
CustomLog "|nc -u localhost 8125" stats
http://absinthemindedhero.blogspot.com/2012/03/victory-nonetheless.html http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/

SHOUT OUT: NEW RELIC
Python, Ruby, .NET, Java, PHP support
In-depth profiling of your app for performance & errors.

CONFIGURATION MANAGEMENT
Unique snowflakes are bad

http://www.torange.us/Plants/Conifers/spruce-needles-in-hoarfrost-424.html

PUPPET VS CHEF
Yes.

http://puppetlabs.com/
http://www.opscode.com/chef

INFRASTRUCTURE AS CODE
Use different environments
Measure and report on it
http://americansingercanary.com/green.htm

SHOUT OUT: UBUNTU
Ubuntu + cloud-init + boto = awesome*
*I am biased

http://www.123rf.com/photo_4871141_food-pyramid-isolated-on-white.html https://github.com/krux/ops-tools

DEV = PRODUCTION
"I dunno, it worked on my laptop"
Instead, use vagrant
http://vagrantup.com/

ROLL YOUR OWN AMIS
Instantly boot up new deployments
Reduce Time to Respond
http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://puppetlabs.com/blog/rapid-scaling-with-auto-generated-amis-using-puppet/

CONFIDENT DEPLOYS
That human error could be yours

http://www.etsy.com/listing/37178125/stormtrooper-regrets-those-were-the

CONTINUOUS INTEGRATION
Ours: Github + Jenkins + FPM + apt::s3
From commit to deployable in one command
http://github.com/
http://jenkins-ci.org/
https://github.com/thekad/apt-s3
https://github.com/jordansissel/fpm/wiki/

ONE CLICK DEPLOYMENTS
Deployments should not be exciting.
Don't create a checklist; automate & track
http://www.thegreenhead.com/2012/07/one-click-butter-cutter.php https://checkmarkable.com/

DARK LAUNCHES
Exercise the code without impacting the user experience
http://www.kissmetrics.com/
http://www.layoutsparks.com/pictures/moon-23 https://github.com/yahoo/boomerang/
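
A minimal Python sketch of a dark launch, assuming request handlers are plain callables: the new code runs on every request, but its result is thrown away and its failures are only logged. The names `dark_launch`, `current` and `candidate` are hypothetical.

```python
import logging

log = logging.getLogger("dark_launch")


def dark_launch(current, candidate, enabled=True):
    """Wraps `current` so `candidate` also runs, but only `current` affects the caller."""
    def handler(request):
        response = current(request)
        if enabled:
            try:
                candidate(request)  # result is discarded on purpose
            except Exception:
                # The user never sees this; the ops team does.
                log.exception("dark-launched code failed")
        return response
    return handler
```

Flipping `enabled` to `False` turns the experiment off without a deploy; in production the candidate would usually run off the request path so it cannot add latency.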

SHADOW TRAFFIC
Test new code against live traffic

http://doppelthingers.tumblr.com/post/12839979386/traffic-light-shadow-hangman-and-possibly-his https://gist.github.com/3125323
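
A sketch of shadow traffic in Python, not taken from the linked gist: live requests are answered by the current build, while a sample of them is replayed against the new build and any differences are recorded. `ShadowMirror` and its parameters are hypothetical names; a real setup would mirror asynchronously, often at the load balancer or proxy.

```python
import random


class ShadowMirror:
    """Serves from `primary`; replays a sample of requests against `shadow`."""

    def __init__(self, primary, shadow, sample_rate=0.1, rng=random.random):
        self.primary = primary
        self.shadow = shadow
        self.sample_rate = sample_rate
        self.rng = rng
        self.mismatches = []  # (request, live response, shadow response or error)

    def handle(self, request):
        response = self.primary(request)
        if self.rng() < self.sample_rate:
            try:
                shadow_response = self.shadow(request)
            except Exception as exc:
                # A crash in the new code is a finding, not a user-facing error.
                self.mismatches.append((request, response, repr(exc)))
            else:
                if shadow_response != response:
                    self.mismatches.append((request, response, shadow_response))
        return response
```

An empty `mismatches` list after a day of real traffic is a stronger signal than any synthetic test suite.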

SLEEP TIGHT
Slides at: www.Slideshare.net/jiboumans
We're hiring: www.krux.com
http://raafay-awan.blogspot.com/2011/08/cats-cutest-of-creatures.html
