Not everything that happens
in Vegas stays in Vegas
or “getting devs to be on call for what they ship” :-)
1. Speed of innovation
3. Running costs
a. “It’ll cost what it ends up costing”
In practice, they found that holding to the first two ended
up costing way less than otherwise expected.
Riot Games + League of Legends
Cloud == ideal for MMOs. Solve launch issues.
● Chef gets used a lot here.
○ They talked about their evolution with it and lessons learned
● What sucked?
○ 25 minute bootstrap runs
○ External dependencies (including S3)
○ Duplicating application deployment recipes
● golden masters and immutable servers simplify your
● “if you’re doing chef without BerkShelf you’re doing it
● Make it easy to throw up new things
Testing in production
Netflix, Riot, Kickstarter - they all do this.
● 10s to 100s of code pushes per day
● 1000s to 100,000s of config changes per day
○ they tune their A/B testing constantly
Of course, they also have the instrumentation to react to
How’re other people doing DevOps?
Good news - we’re at the “more sophisticated” end of the
Every “cloud native” was doing this.
Things other people did better:
● “Golden master” AMIs
● Immutable instances
● Absolute ownership of vertical slices
● Config management (Chef/Puppet) featured
● Extensive monitoring+logs+visibility == “table stakes”
○ for developers!
● Easy to throw up new things
● Run many small, simple, collaborating things
Who? Riot Games, Netflix, change.org, Kickstarter
Logging aggregation is important
Lots of third-party companies are offering centralized
logging services; there's a huge appetite for logging
● DIY - Lumberjacking slides
DEMO: Monitoring & Logging
● Tag Metrics, awesome Metric discoverability
● CloudWatch integration
○ I never knew I could see ELB metrics :-)
● Alarms are integrated
● You can template Dashboards
● Can Search, Save Searches, Alerts on searches
● No alert on patterns
● Archive to S3 / Push to Redshift
Logging aggregation is FOR DEVELOPERS!!!
Saves lots of time when you’re on call.
Benefit of logging as a service.
● When your infrastructure is in trouble, you do not
want to have your logging analytic system on the
AWS services that Loggly could use:
● Kafka + Storm vs Kinesis
● Elasticsearch vs CloudSearch
Predictive Analytics using Storm, Hadoop, R and
● Provisioned IOPS solve all issues :)
● ELBs do not perform well at extremely high volume
● DNS round robin is a very good basic load
● Cassandra works very well for application data.
● Cassandra does not work well as a queue system,
hard to track order of events.
● Keep the architecture simple.
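
As a rough illustration of the DNS round robin point above, here is a minimal sketch of the rotation idea: the name server hands out the same set of records but rotates their order on each response, and a client that takes the first record ends up spreading across backends. The addresses and the `rotated_answers` helper are hypothetical, invented for this example. Real resolvers also cache answers, and plain DNS has no health checks, so actual traffic is less even than this.

```python
from collections import deque

def rotated_answers(records, queries):
    """Yield the record list once per query, rotated one step each time."""
    ring = deque(records)
    for _ in range(queries):
        yield list(ring)
        ring.rotate(-1)  # move the first record to the back

# Hypothetical backend addresses (documentation-only address range).
backends = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]

# Model a naive client that always connects to the first record returned.
picks = [answer[0] for answer in rotated_answers(backends, 6)]
print(picks)
```

With three backends and six queries, each backend is picked twice, which is the whole mechanism: no state beyond the rotation, and nothing to operate.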
Many types of load
● Load testing
○ (running a marathon), predict future load and
plan in advance
● Stress testing
○ Break things (figure out limits), mitigation
● Resilience test
○ Figure out how many parts of the architecture
you can lose and still operate
● Performance test
○ How do latency and throughput change as
the load increases
Phase roll out and measure
● Load Testing is necessary but not sufficient.
○ Deploy to alpha cluster.
○ The release cycle is important: phased
deployment, one box, monitor and ramp up.
○ Monitor performance and behaviour, look at
the 99th percentile of traffic, not the average.
● Netflix records 1.2 billion metrics per day
○ 5-minute SLA
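
To make the "not at the average" advice concrete, here is a small sketch of how a mean can hide a slow tail that a 99th-percentile check catches, and how that check could gate a phased ramp-up. The latency numbers, the 500 ms budget, and the `percentile` and `safe_to_ramp` helpers are all assumptions made up for this example, not anything shown in the talks.

```python
import math
from statistics import mean

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def safe_to_ramp(samples, p99_budget_ms):
    """Gate for the next rollout phase: only ramp up if p99 is within budget."""
    return percentile(samples, 99) <= p99_budget_ms

# Hypothetical latencies (ms) from the first box: 98 fast requests, 2 very slow.
latencies_ms = [20] * 98 + [2000] * 2

avg = mean(latencies_ms)
p99 = percentile(latencies_ms, 99)
print(f"mean={avg:.1f} ms, p99={p99} ms, ramp={safe_to_ramp(latencies_ms, 500)}")
```

Here the mean sits well under the hypothetical 500 ms budget while the p99 is far over it, so a gate based on the average would have ramped up a release that is hurting some users.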
We took part in the AWS GameDay
Inspired by the 2012 Obama For America DevOps
and Amazon.com ops teams
● Build an Autoscaling application
● Exchange administrative IAM credentials with
● Break your opponent's systems
● Restore your system
● Lessons learned
Who would be interested if we ran this?
It needs a full day, ~6 hours.