Five years of EC2 distilled

Grig Gheorghiu

Silicon Valley Cloud Computing Meetup, Feb. 19th 2013

@griggheo
agiletesting.blogspot.com
whoami

• Dir of Technology at Reliam (managed
  hosting)
• Sr Sys Architect at OpenX
• VP Technical Ops at Evite
• VP Technical Ops at Nasty Gal
EC2 creds

• Started with a personal m1.small instance in 2008
• Still around!
• UPTIME:
  5:13:52 up 438 days, 23:33, 1 user, load average: 0.03, 0.09, 0.08
EC2 at OpenX
• end of 2008
• 100s then 1000s of instances
• one of largest AWS customers at the time
• NAMING is very important
  • terminated a DB server by mistake
  • in an ideal world naming wouldn't matter
EC2 at OpenX (cont.)
• Failures are very frequent at scale
• Forced to architect for failure and
  horizontal scaling
• Hard to scale at all layers at the same time
  (scaling the app server layer can overwhelm the
  DB layer; play whack-a-mole)
• Elasticity: easier to scale out than scale back
EC2 at OpenX (cont.)
• Automation and configuration management
  become critical
  • Used a little-known tool, 'slack'
  • Rolled our own EC2 management tool in Python,
    wrapped around the EC2 Java API (sketch below)
  • Testing deployments is critical (one
    mistake can get propagated everywhere)
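A minimal sketch of what such a tool can look like, using the boto library (the talk's actual tool wrapped the EC2 Java API; the region and tag handling below are illustrative):

    # Illustrative only - the original tool wrapped the EC2 Java API,
    # but boto is the natural Python choice for the same kind of wrapper.
    import boto.ec2

    def list_named_instances(region='us-east-1'):
        """Print each instance with its Name tag - naming matters!"""
        conn = boto.ec2.connect_to_region(region)
        for reservation in conn.get_all_instances():
            for inst in reservation.instances:
                print('%s %s %s' % (inst.id, inst.state,
                                    inst.tags.get('Name', '<unnamed!>')))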
EC2 at OpenX (cont.)
• Hard to scale at the DB layer (MySQL)
  • mysql-proxy for r/w split (idea sketched below)
  • slaves behind HAProxy for reads
• HAProxy for LB, then ELB
  • ELB melted initially, had to be gradually warmed up
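The split itself happened at the proxy layer (mysql-proxy, HAProxy); purely to illustrate the idea, an application-level version with made-up hostnames might look like:

    # Application-level illustration of the read/write split that
    # mysql-proxy did at the proxy layer; hosts/credentials are made up.
    import random
    import MySQLdb

    MASTER = 'db-master.internal'
    READ_SLAVES = ['db-slave1.internal', 'db-slave2.internal']

    def get_connection(for_write=False):
        # Writes must hit the master; reads can go to any slave.
        host = MASTER if for_write else random.choice(READ_SLAVES)
        return MySQLdb.connect(host=host, user='app', passwd='secret',
                               db='appdb')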
EC2 at Evite

• Sharded MySQL at DB layer; application
  very write-intensive
• Didn’t do proper capacity planning/dark
  launching; had to move quickly from data
  center to EC2 to scale horizontally
• Engaged Percona at the same time
EC2 at Evite (cont.)
• Started with EBS volumes (separate for
  data, transaction logs, temp files)
• EBS horror stories
  • CPU wait up to 100%, instances AWOL
  • I/O very inconsistent, unpredictable
• Striping EBS volumes in RAID0 helps with
  performance but not with reliability (sketch below)
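The provisioning half of such a stripe, sketched with boto (volume count, size, zone, and instance ID are placeholders; assembling the array with mdadm on the instance is omitted):

    # Provision N identical EBS volumes destined for a RAID0 stripe.
    # RAID0 spreads I/O across volumes (throughput), but one failed
    # volume kills the whole array - performance, not reliability.
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')
    volumes = [conn.create_volume(100, 'us-east-1a') for _ in range(4)]
    for i, vol in enumerate(volumes):
        # Attach as /dev/sdf, /dev/sdg, ...; real code would poll
        # vol.update() until the volume is 'available' before attaching.
        conn.attach_volume(vol.id, 'i-12345678',
                           '/dev/sd%s' % chr(ord('f') + i))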
EC2 at Evite (cont.)
• EBS apocalypse in April 2011
• Hit us even with masters and slaves in diff.
  availability zones (but all in a single region - mistake!)
• IMPORTANT: rebuilding redundancy into your system is HARD
• For DB servers, reloading data on a new server is a lengthy process
EC2 at Evite (cont.)
• General operation: very frequent failures
  (once a week); nightmare for pager duty
• Got very good at disaster recovery!
  • Failover of master to slave
  • Rebuilding of slave from master with xtrabackup (sketch below)
• Local disks striped in RAID0 better than
  EBS
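The core of that rebuild with Percona's innobackupex wrapper, heavily compressed (paths and user are placeholders; copying the backup to the new slave and running CHANGE MASTER TO with the coordinates from xtrabackup_binlog_info are omitted):

    # Two core steps of a slave rebuild with Percona XtraBackup.
    import subprocess

    # 1. Hot backup of the running master into a timestamped directory.
    subprocess.check_call(['innobackupex', '--user=backup', '/backups'])

    # 2. Apply the transaction log so the data directory is consistent.
    subprocess.check_call(['innobackupex', '--apply-log',
                           '/backups/<timestamp>'])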
EC2 at Evite (cont.)
• Ended up moving DB servers back to data
  center
• Bare metal (Dell C2100, 144 GB RAM,
  RAID10); 2 MySQL instances per server
• Lots of tuning help from Percona
• BUT: EC2 was great for capacity planning!
  (Zynga does the same)
EC2 at Evite (cont.)
• Relational databases are not ready for the
  cloud (reliability, I/O performance)
• Still keep MySQL slaves in EC2 for DR
• Ryan Mack (Facebook): “We chose well-understood
  technologies so we could better predict capacity
  needs and rely on our existing monitoring and
  operational tool kits.”
EC2 at Evite (cont.)
• Didn’t use provisioned IOPS for EBS
• Didn’t use VPC
• Great experience with Elastic MapReduce,
  S3, Route 53 DNS
• Not so great experience with DynamoDB
• ELB OK but still need HAProxy behind it
EC2 at Nasty Gal
• VPC - really good idea!
  • Extension of data center infrastructure
  • Currently using it for dev/staging + some
    internal backend production
  • Challenging to set up VPN tunnels to various
    firewall vendors (Cisco, Fortinet) - not much
    debugging possible on the VPC side
Interacting with AWS
• AWS API (mostly Java based, but also Ruby
  and Python)
• Multi-cloud libraries: jclouds (Java), libcloud
  (Python), deltacloud (Ruby) - libcloud sketch below
• Chef knife
• Vagrant EC2 provider
• Roll your own
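libcloud, for example, gives the same list_nodes() call regardless of provider (credentials below are placeholders):

    # Same code path against different providers via Apache Libcloud.
    from libcloud.compute.types import Provider
    from libcloud.compute.providers import get_driver

    def node_names(provider, key, secret):
        driver = get_driver(provider)(key, secret)
        return [node.name for node in driver.list_nodes()]

    print(node_names(Provider.EC2, 'ACCESS_KEY_ID', 'SECRET_KEY'))
    print(node_names(Provider.RACKSPACE, 'USERNAME', 'API_KEY'))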
Proper infrastructure care and feeding
• Monitoring - alerting, logging, graphing
• It’s not in production if it’s not monitored
  and graphed
• Monitoring is for ops what testing is for
  dev
  • Great way to learn a new infrastructure
  • Dev and ops on pager
Proper infrastructure care and feeding
• Going from #monitoringsucks to
  #monitoringlove and @monitorama
• Modern monitoring/graphing/logging tools
 • Sensu, Graphite, Boundary, Server
    Density, New Relic, Papertrail, Pingdom,
    Dead Man’s Snitch
Proper infrastructure care and feeding
• Dashboards!
• Mission Control page with graphs based on
  Graphite and the Google Visualization API
• Correlate spikes and dips in graphs with errors
  (external and internal monitoring)
  • Akamai HTTP 500 alerts correlated with web
    server 500 errors and increased DB server I/O wait
Proper infrastructure care and feeding
• [graph] HTTP 500 errors as a percentage of all HTTP
  requests across all app servers in the last 60
  minutes (Graphite query sketched below)
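One plausible way to express that graph as a Graphite render query (asPercent and sumSeries are real Graphite functions; the metric paths and host are invented):

    # Build a Graphite render URL for "500s as a percentage of all
    # requests over the last 60 minutes".
    import urllib

    params = urllib.urlencode({
        'target': 'asPercent(sumSeries(app.*.http.500),'
                  'sumSeries(app.*.http.requests))',
        'from': '-60min',
        'format': 'png',
    })
    print('http://graphite.internal/render?' + params)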
Proper infrastructure care and feeding
• Expect failures and recover quickly
• Capacity planning
  • Dark launching
  • Measure baselines
  • Correlate external symptoms (HTTP 500) with
    metrics (CPU I/O wait), then keep metrics
    under certain thresholds by adding resources
Proper infrastructure care and feeding
• Automate, automate, automate! Chef, Puppet,
  CFEngine, Jenkins, Capistrano, Fabric (Fabric sketch below)
• Chef can be the single source of truth for infrastructure
  • Running chef-client continuously on nodes requires discipline
  • Logging into remote nodes is an anti-pattern (a hard habit to break!)
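Fabric, one of the tools named above, in a minimal sketch: trigger Chef runs across nodes from one place instead of logging into each box (hostnames are placeholders):

    # fabfile.py - run `fab converge` to kick off chef-client everywhere.
    from fabric.api import env, sudo, task

    env.hosts = ['web1.internal', 'web2.internal', 'db1.internal']

    @task
    def converge():
        """One Chef run per host in env.hosts."""
        sudo('chef-client')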
Proper infrastructure care and feeding
• Chef best practices
  • Use knife - no snowflakes!
  • Deploy new nodes, don't do massive in-place updates
• BUT: beware of OS monoculture
  • kernel bug after 200+ days of uptime
  • leapocalypse (the 2012 leap-second bug)
Is the cloud worth the hype?
• It's a game changer, but it's not magical; try before
  you buy! (benchmarks may surprise you)
• Cloud expert? Carry a pager or STFU
• Forces you to think about failure recovery,
  horizontal scalability, automation
• Something to be said for abstracting away the
  physical network - the most obscure bugs are
  network-related (ARP caching, routing tables)
So... when should I use the cloud?
• Great for dev/staging/testing
• Great for layers of infrastructure that
  contain many identical nodes and that are
  forgiving of node failures (web farms,
  Hadoop nodes, distributed databases)
• Not great for ‘snowflake’-type systems
• Not great for RDBMS (esp. write-intensive)
If you still want to use the cloud
• Watch that monthly bill!
• Use multiple cloud vendors
• Design your infrastructure to scale horizontally
  and to be portable across cloud vendors
  • Shared nothing
  • No SAN, no NAS
If you still want to use the cloud
• Don't get locked into vendor-proprietary services
  • EC2, S3, Route 53, EMR are OK
  • Data stores are not OK (DynamoDB)
  • OpsWorks - debatable (based on Chef, but still locks you in)
  • Wrap services in your own RESTful endpoints (sketch below)
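As an illustration of that last point, a thin Flask wrapper keeps the vendor store behind a seam you control (everything here is hypothetical):

    # Hypothetical thin REST wrapper: callers hit /items/<key>; only
    # backend_get() knows which vendor store answers it.
    from flask import Flask, jsonify

    app = Flask(__name__)
    _FAKE_STORE = {'example': 'value'}  # stand-in for DynamoDB etc.

    def backend_get(key):
        # Swap DynamoDB for another store without changing the endpoint.
        return _FAKE_STORE.get(key)

    @app.route('/items/<key>')
    def get_item(key):
        return jsonify(item=backend_get(key))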
Does EC2 have rivals?
• No (or at least not yet)
• Anybody use GCE?
• Other public clouds are either toys or smaller,
  with fewer features (no names named)
• Perception matters - not a contender unless
  featured on the High Scalability blog
• APIs matter less (can use multi-cloud libs)
Does EC2 have rivals?
• OpenStack, CloudStack, Eucalyptus all seem promising
• Good approach: private infrastructure (bare metal,
  private cloud) for performance/reliability + extension
  into the public cloud for elasticity/agility
  (EC2 VPC, RackConnect)
• How about PaaS?
  • Personally: too hard to relinquish control
