Rainmakers
How Netflix Operates Clouds for Maximum Freedom and Agility



             Jeremy Edberg
      Reliability Architect, Netflix
Do you have...

                           •    A release Engineer?

                           •    A QA department?

                           •    Chef or Puppet to
                                manage your systems?



Tweet @jedberg with feedback!
Do you have...


                      •     Upwards of 100 releases a day?




With more than 30 million streaming members in
  the United States, Canada, Latin America, the
United Kingdom, Ireland and the Nordics, Netflix is
the world's leading internet subscription service for
 enjoying movies and TV programs streamed over
         the internet to PCs, Macs and TV.
                                         Source: http://ir.netflix.com



The Netflix Way
                 •      Everything is “built for three”

                 •      Fully automated build tools to test and
                        make packages

                 •      Fully automated machine image bakery

                 •      Fully automated image deployment

                 •      Independent teams responsible for
                        both Dev and Ops

Philosophy




Automate all the things!

                     •     Application startup

                     •     Configuration

                     •     Code deployment

                     •     System deployment



Automation


                      •     Standard base image

                      •     Tools to manage all the systems

                      •     Automated code deployment




Shared state should be
       stored in a shared service

     Data on an instance should
       be replicated to other
             instances

“Build for Three”
  We hold a boot camp for new engineers to teach them how
        to build for a highly distributed environment.




Netflix on AWS
                                2012
                                IPv6




           Open Connect

Highly aligned, loosely coupled

                  •     Services are built by different teams
                        who work together to figure out what
                        each service will provide.

                  •     The service owner publishes an API
                        that anyone can use.



Advantages to a Service
           Oriented Architecture
             •     Easier auto-scaling

             •     Easier capacity planning

             •     Identify problematic code-paths more easily

             •     Narrow down the effects of a change

             •     More efficient local caching


Freedom and Responsibility

                  •     Developers deploy when they want

                  •     They also manage their own capacity
                        and autoscaling

                  •     And fix anything that breaks at 4am!



All systems choices assume
     some part will fail at some
                point.



The Monkey Theory

                           • Simulate things
                                that go wrong
                           • Find things that
                                are different

Execution




                                       Photo from I, Robot, copyright 20th Century Fox

Netflix built a global PaaS

               • Service Oriented
                      Architecture
               • HTTP/REST interfaces
                      between services

Netflix PaaS features
           •     Supports all regions and zones

           •     Multiple accounts

           •     Cross region/account replication

           •     Internationalized, localized and GeoIP routed

           •     Advanced key management

           •     Autoscaling with 1000s of instances

           •     Monitoring and alerting on millions of metrics

What AWS Provides
     •     Instances

     •     Machine Images

     •     Elastic IPs

     •     Load Balancers

     •     Security groups / Autoscaling groups

     •     Availability zones and regions

Linux Base AMI (CentOS or Ubuntu)
     •     Optional Apache
     •     Monitoring
          •     Log rotation to S3
          •     Appdynamics Machine Agent
     •     Java (JDK 6 or 7)
          •     Appdynamics App Agent monitoring
          •     GC and thread dump logging
          •     Tomcat
               •     Application war file, base servlet, platform,
                     interface jars for dependent services
               •     Healthcheck, status servlets, JMX interface,
                     Servo autoscale



The Netflix Platform
     •     Discovery (Eureka)
     •     Entrypoints (Edda)
     •     Configuration (Archaius)
     •     Zookeeper (Exhibitor)
     •     Logging (Blitz4j & Honu)
     •     NIWS
     •     Geo
     •     Base
     •     Circuit Breakers (Hystrix)
     •     Cassandra (Priam & Astyanax & CassJMeter)
     •     Cryptex
     •     AKMS
     •     EVCache
     •     Proxies
     •     i18n
     •     L10n
     Legend on the slide: Open Source

Open Source at Netflix
     [Timeline figure. Axis labels: Nov, Dec, 2012, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct]
     Projects shown: Curator, Astyanax, Servo, Priam, CassJMeter, Exhibitor,
     Archaius, Asgard, Chaos Monkey, Eureka, Edda, Blitz4j, Hystrix, Governator
Finding things
    •      Discovery (Eureka)
         •    Application to instance mapping
         •    Heartbeat to keep track of health
    •      Entrypoints (Edda)
         •    Local database of AWS resources
    •      NIWS (Netflix Internal Web Service)
         •    On instance software load balancer
         •    Handles retry logic
    •      Geo (Geolocation library)
         •    Provides IP to Lat/Lon mapping for any service that needs it.
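
The discovery bullets above (application to instance mapping, heartbeats to track health) can be sketched as a tiny in-memory registry. This is only an illustration of the idea, not Eureka's implementation or API: the `Registry` class, its method names, and the 90-second default TTL are all assumptions made for this example.

```python
import time
from typing import Callable, Dict, List


class Registry:
    """Hypothetical in-memory registry: app name -> instances with recent heartbeats."""

    def __init__(self, ttl_seconds: float = 90.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds          # assumed expiry window, not a Eureka setting
        self._clock = clock              # injectable so the sketch is testable
        self._last_seen: Dict[str, Dict[str, float]] = {}

    def heartbeat(self, app: str, instance_id: str) -> None:
        """Register the instance, or refresh it if it is already known."""
        self._last_seen.setdefault(app, {})[instance_id] = self._clock()

    def instances(self, app: str) -> List[str]:
        """Return instances of `app` whose last heartbeat is within the TTL."""
        now = self._clock()
        seen = self._last_seen.get(app, {})
        return sorted(i for i, t in seen.items() if now - t <= self._ttl)
```

An instance that stops sending heartbeats simply ages out of `instances()`, which is the property a client-side load balancer would rely on.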


Entrypoints (Edda)

               •     REST API
                   •      GET /REST/v2/instance/$id

               •     Keeps track of all resources

                   •      Autoscaling groups, EIPs, Instances,
                          Applications, Clusters, History


Entrypoints Exploration
     •     Find all active instances
               GET /REST/v2/view/instances
     •     Find all instances in a cluster
               GET /REST/v2/group/clusters
     •     Show only ASG name, instance ID and health
               /v2/aws/autoScalingGroups/edda-v123;_pp:(autoScalingGroupName,instances:(instanceId,lifecycleState))
     •     Which ASG contains a particular instance?
               /v2/aws/autoScalingGroups;instances.instanceId=i-96f3ca3a
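
The last two queries above combine a collection path with `;key=value` filters, a `;_pp` flag and a `:(...)` field selector. A small helper can assemble such paths. This is a sketch that only reproduces the string shapes shown on the slide; `edda_path` and `render_fields` are hypothetical names, and no real client library is implied.

```python
from typing import Mapping, Optional, Sequence, Tuple, Union

# A field is a plain name, or (name, [sub-fields]) for a nested selection.
Field = Union[str, Tuple[str, Sequence]]


def render_fields(fields: Sequence[Field]) -> str:
    """Render a field selector such as (a,b:(c,d))."""
    parts = []
    for field in fields:
        if isinstance(field, tuple):
            name, children = field
            parts.append(f"{name}:{render_fields(children)}")
        else:
            parts.append(field)
    return "(" + ",".join(parts) + ")"


def edda_path(collection: str,
              item: Optional[str] = None,
              filters: Optional[Mapping[str, str]] = None,
              fields: Optional[Sequence[Field]] = None,
              pretty: bool = False) -> str:
    """Build a query path in the style shown on the slide (hypothetical helper)."""
    path = f"/v2/{collection}"
    if item:
        path += f"/{item}"
    for key, value in (filters or {}).items():
        path += f";{key}={value}"      # filter on a (possibly nested) attribute
    if pretty:
        path += ";_pp"                 # flag as it appears in the slide's example
    if fields:
        path += ":" + render_fields(fields)
    return path


print(edda_path("aws/autoScalingGroups",
                filters={"instances.instanceId": "i-96f3ca3a"}))
```

Keeping the field selector as nested Python data avoids hand-balancing parentheses when a query selects sub-fields.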

Keeping it all Straight
        •     Configuration (Archaius)
             •  Global variables (Fast properties)
        •     Base
             •  Base system. Prod vs. Test, etc
        •     Zookeeper (Curator)
             •  Locks, other similar coordination
        •     Logging (Blitz4j and Honu)
             •  Keep track of what happened and store it for
                post analysis.

Keeping it Secure
     •     Cryptex

          •     Service for key management

          •     High, medium and low value keys

     •     AKMS (Amazon Key Management System)

          •     Hands out keys to instances (and dev boxes) so
                they don’t have to store the key on the instance


For more info, see SEC201: Security Panel

Storing it
•    Cassandra (Priam, Astyanax)
    •     Configure and access Cassandra
    •     Provide OO abstractions; handle connection
          pooling and discovery of hosts
•    EVCache (Ephemeral Volatile Cache)
    •     Wrapper for memcached to handle zone
          awareness and replication
•    Proxies
    •     Get data out of the datacenter and into
          the cloud.
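
The EVCache bullet describes a wrapper that adds zone awareness and replication on top of a cache. A minimal sketch of that idea follows, with plain dictionaries standing in for the per-zone memcached clusters. The class and method names are hypothetical, and the write-to-all, read-local-first policy is an assumption for illustration rather than EVCache's documented behavior.

```python
from typing import Dict, Optional, Sequence


class ZoneAwareCache:
    """Hypothetical wrapper: replicate writes to every zone, prefer local reads."""

    def __init__(self, local_zone: str, zones: Sequence[str]) -> None:
        if local_zone not in zones:
            raise ValueError("local_zone must be one of zones")
        self._local = local_zone
        # One dict per zone stands in for that zone's memcached cluster.
        self._stores: Dict[str, Dict[str, str]] = {zone: {} for zone in zones}

    def set(self, key: str, value: str) -> None:
        """Replicate the write to every zone."""
        for store in self._stores.values():
            store[key] = value

    def get(self, key: str) -> Optional[str]:
        """Read from the local zone first, then fall back to the other zones."""
        order = [self._local] + [z for z in self._stores if z != self._local]
        for zone in order:
            if key in self._stores[zone]:
                return self._stores[zone][key]
        return None

    def drop_zone(self, zone: str) -> None:
        """Simulate losing every cache node in one zone."""
        self._stores[zone].clear()
```

With this shape, losing the local zone costs latency on the next read but not the cached data.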
Data
                            What do we do with it all?




We store it!

                    • Cache (memcached)
                    • Cassandra
                    • RDS (MySQL)

Cassandra




Why Cassandra?

               • Availability over consistency
               • Writes over reads
               • We know Java
               • Open source + support

Using Cassandra at Netflix
        •     Priam
             •   Zero touch auto-config
             •   State management
             •   Token assignment
             •   Node replacement
             •   Backup/restore to/from S3
        •     Astyanax
             •   OO abstraction to Cassandra
             •   Multi-region support
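
Token assignment, listed under Priam above, decides where each node sits on the Cassandra ring. The slides do not describe Priam's actual scheme, so the sketch below shows only the textbook approach of spacing tokens evenly across the RandomPartitioner range of 0 to 2^127. The function name and the `offset` parameter are assumptions for this example.

```python
from typing import List

RING_SIZE = 2 ** 127  # token range used by Cassandra's RandomPartitioner


def evenly_spaced_tokens(node_count: int, offset: int = 0) -> List[int]:
    """Return one token per node, spaced evenly around the ring.

    `offset` shifts the whole ring (hypothetical knob, e.g. to keep a second
    ring from using identical tokens).
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return [(i * RING_SIZE // node_count + offset) % RING_SIZE
            for i in range(node_count)]


print(evenly_spaced_tokens(4))
```

Automating this step is what makes node replacement "zero touch": a replacement node can take over the token of the node it replaces instead of an operator computing one by hand.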


Cassandra Architecture




For more info, see DAT202: Optimizing your Cassandra Database on AWS

Tools
             •     Asgard

             •     AWS usage

             •     Atlas

             •     Chronos

             •     Build system

             •     Explorers (Cassandra and SimpleDB)

[Diagram: Elastic Load Balancer, Auto Scaling Group, Launch Configuration, Security Group, Amazon Machine Image, Instances]

api-frontend




   api-usprod-v007                             api-usprod-v008




Netflix has moved the
               granularity from the
             instance to the cluster
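
The earlier api-frontend slide shows two groups, api-usprod-v007 and api-usprod-v008, which suggests that each deployment creates a new versioned group within a cluster instead of modifying instances in place. Assuming the three-digit suffix is a sequence number (an inference from those two names, not something the slides state), the next group name could be derived as in this sketch. The helper name and the wrap-around at 999 are hypothetical.

```python
import re

# Assumed naming shape: <cluster>-v<three digits>, e.g. api-usprod-v007.
_VERSIONED = re.compile(r"^(?P<cluster>.+)-v(?P<seq>\d{3})$")


def next_group_name(current: str) -> str:
    """Return the name of the next group in the same cluster (hypothetical helper)."""
    match = _VERSIONED.match(current)
    if match is None:
        raise ValueError(f"not a versioned group name: {current!r}")
    seq = (int(match.group("seq")) + 1) % 1000   # wrap-around is an assumption
    return f"{match.group('cluster')}-v{seq:03d}"


print(next_group_name("api-usprod-v007"))
```

Running old and new groups side by side is what lets traffic be shifted, or shifted back, at the level of the whole cluster.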

Why Bake?
 Traditional: Generic AMI -> Instance
 •launch OS
 •install packages
 •install app

 Netflix: App AMI -> Instance
 •launch OS+app



Getting Baked
     [Build pipeline diagram]
     •     Jenkins steps: sync, resolve, compile, test, build, publish, report
     •     Source: Perforce / Git
     •     Ivy and Artifactory: libraries, snapshot / release libraries / apps,
           app bundles
     •     Ant targets, Groovy all over


Base Image Baking
     •     Start from a foundation AMI (Linux: CentOS, Fedora, Ubuntu) in S3 / EBS
     •     Bakery (ec2 slave instances) mounts it
     •     Install RPMs (Apache, Java...) from Yum / Apt
     •     Snapshot as the base AMI, ready for app bake

App Image Baking
     •     Start from the base AMI (Linux, Apache, Java, Tomcat) in S3 / EBS
     •     Bakery (ec2 slave instances) mounts it
     •     Install the app bundle from Jenkins / Yum / Artifactory
     •     Snapshot as the app AMI, ready to launch!
Linux Base AMI (CentOS or Ubuntu)
     •     Optional Apache
     •     Monitoring
          •     Log rotation to S3
          •     Appdynamics Machine Agent
     •     Java (JDK 6 or 7)
          •     Appdynamics App Agent monitoring
          •     GC and thread dump logging
          •     Tomcat
               •     Application war file, base servlet, platform,
                     interface jars for dependent services
               •     Healthcheck, status servlets, JMX interface,
                     Servo autoscale



Linux Base AMI (CentOS or Ubuntu)
     •     Optional Apache
     •     Monitoring
          •     Log rotation to S3
          •     Appdynamics Machine Agent
     •     Java (JDK 6 or 7)
          •     Appdynamics App Agent monitoring
          •     GC and thread dump logging
          •     JBoss
               •     Application war file, base servlet, platform,
                     interface jars for dependent services
               •     Healthcheck, status servlets, JMX interface,
                     Servo autoscale



Linux Base AMI (CentOS or Ubuntu)
     •     Optional Apache
     •     Monitoring
          •     Log rotation to S3
          •     Appdynamics Machine Agent
     •     Python
          •     monitoring
          •     logging
          •     Django
               •     Application file, base server, platform,
                     interface libs for dependent services


The Simian Army
                •     Chaos -- Kills random instances

                •     Chaos Gorilla -- Kills zones

                •     Chaos Kong -- Kills regions

                •     Latency -- Degrades network and injects faults

                •     Conformity -- Looks for outliers

                •     Circus -- Kills and launches instances to maintain zone balance

                •     Doctor -- Fixes unhealthy resources

                •     Janitor -- Cleans up unused resources

                •     Howler -- Yells about bad things like Amazon limit violations

                •     Security -- Finds security issues and expiring certificates
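
The first entry in the list above, killing random instances, is simple enough to sketch. The code below is not the Chaos Monkey implementation: `unleash`, its parameters, and the one-victim-per-group policy are assumptions for illustration, and the `terminate` callback stands in for whatever cloud API call would actually stop an instance.

```python
import random
from typing import Callable, Dict, List, Optional


def unleash(groups: Dict[str, List[str]],
            terminate: Callable[[str], None],
            probability: float = 1.0,
            rng: Optional[random.Random] = None) -> List[str]:
    """Terminate at most one randomly chosen instance per group.

    groups:      group name -> instance ids (hypothetical inventory)
    terminate:   callback that performs the actual termination
    probability: chance that a given group is hit on this run
    """
    rng = rng or random.Random()
    killed = []
    for name in sorted(groups):
        instances = groups[name]
        if instances and rng.random() < probability:
            victim = rng.choice(instances)
            terminate(victim)
            killed.append(victim)
    return killed
```

Passing the termination step in as a callback keeps the selection logic testable without touching real infrastructure.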
For more info, see ARC301: Intro to Chaos Monkey & the Simian Army

What’s going on?!




Atlas




Example Alert Config

{
  "clusters": [
    "epic_aggregator",
    "epic_aggregator-dev"
  ],
  "alerts": [
    // you can use javascript style comments in the config
    {
      "metricName": "EpicPlugin_NumDropped",
      "applyTo": "cluster",
      "condition": {
        "type": "StaticThreshold",
        "max": 0.0
      },
      "severity": "major",
      "description": "plugin is dropping metrics"
    },
    {
      "metricName": "EpicPlugin_NumDropped_Instance",
      "applyTo": "instance",
      "condition": {
        "type": "NumOccurrences",
        "num": 4,
        "condition": {
          "type": "StaticThreshold",
          "max": 0.0
        }
      },
      "overrides": {
        "service_key_override": "12345",
        "require_instance_status_not_in": ["DOWN", "OUT_OF_SERVICE"],
        "email_override": "devnull@netflix.com"
      },
      "severity": "minor"
    },
    {
      "metricName": "EpicPlugin_MetricCount",
      "applyTo": "instance",
      "description": "${instanceId} is reporting too many metrics",
      "condition": {
        "type": "NumOccurrences",
        "num": 4,
        "condition": {
          "type": "StaticThreshold",
          "max": 0.0
        }
      },
      "additionalDetails": {
        "statusUrl": "http://${publicDnsName}:7001/Status",
        "nacClusterUrl": "nac${env}/${region}/cluster/show/${cluster}"
      },
      "overrides": {
        "subject": "${instanceId} is reporting too many metrics",
        "incident_key": "${metricName}:${instanceId}",
        "service_key_override": "12345",
        "email_override": "devnull@netflix.com"
      },
      "severity": "minor"
    }
  ]
}
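
The example alert config nests a StaticThreshold condition inside a NumOccurrences condition. The slides do not define how these are evaluated, so the sketch below assumes one plausible reading: StaticThreshold breaches when the latest sample exceeds `max`, and NumOccurrences breaches when the inner condition was breached at least `num` times across the supplied window of samples. The `breaches` function is hypothetical and is not Atlas code.

```python
from typing import Any, Dict, Sequence


def breaches(condition: Dict[str, Any], samples: Sequence[float]) -> bool:
    """Evaluate a condition dict against a window of samples, oldest first.

    Assumed semantics (not confirmed by the source):
      StaticThreshold: the most recent sample is greater than "max".
      NumOccurrences:  the inner condition held at >= "num" points in the window.
    """
    kind = condition["type"]
    if kind == "StaticThreshold":
        return bool(samples) and samples[-1] > condition["max"]
    if kind == "NumOccurrences":
        inner = condition["condition"]
        # Re-evaluate the inner condition as of each point in time in the window.
        hits = sum(1 for end in range(1, len(samples) + 1)
                   if breaches(inner, samples[:end]))
        return hits >= condition["num"]
    raise ValueError(f"unknown condition type: {kind!r}")


dropped = {"type": "NumOccurrences", "num": 4,
           "condition": {"type": "StaticThreshold", "max": 0.0}}
print(breaches(dropped, [1.0, 1.0, 0.0, 1.0, 1.0]))
```

Wrapping a threshold in an occurrence count is a common way to keep a single noisy sample from paging anyone.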
Alert Tuning




Alert Systems
     [Diagram of alert flow]
     •     Sources: Atlas (alerting), Appdynamics (api)
     •     CORE Agents, Other Team’s Agent
     •     CORE Event Gateway
     •     Paging Service, Amazon SES



Chronos




Data Collection Pipeline

Data Processing Pipeline

For more info, see BDT303: Data Science with Elastic MapReduce

Chukwa/Honu messages / min

                                63 billion
                                messages a
                                   day




Best Practices




Incident Reviews
                                Ask the key questions:

                •      What went wrong?

                •      How could we have detected it sooner?

                •      How could we have prevented it?

                •      How can we prevent this class of
                       problem in the future?

                •      How can we improve our behavior for
                       next time?

Best Practices for Data
        •     Have multiple copies of all data
        •     Keep those copies in multiple AZs
        •     Avoid keeping state on a single instance
        •     Take frequent snapshots of EBS disks
        •     No secret keys on the instance

Netflix autoscaling
     [Graph with annotations: Deployment, Traffic Peak]


AWS Usage
                                Dollar amounts have been carefully removed




Going multi-zone




Benefits of Amazon’s Zones

           •     Loosely connected

           •     Low latency between zones

           •     99.95% uptime guarantee per region




Going Multi-region




Leveraging Multi-region

         •     100% uptime is theoretically possible.

         •     You have to replicate your data

         •     This will cost money




Circuit Breakers (Hystrix)
        Be liberal in what you accept, strict in what you send
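
The slide names the circuit breaker pattern without showing it. Below is a minimal sketch of the general pattern: after a number of consecutive failures the breaker opens and calls fail fast (or use a fallback) until a cool-down has passed, at which point one trial call is let through. This is not Hystrix's Java API; the class, its parameters and defaults are assumptions for this example.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None   # None means the circuit is closed

    def call(self, func: Callable[[], Any],
             fallback: Optional[Callable[[], Any]] = None) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Open: do not touch the dependency at all.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()   # open, or re-open after a failed trial
            if fallback is not None:
                return fallback()
            raise
        self._failures = 0
        self._opened_at = None                    # success closes the circuit
        return result
```

The fallback is what lets a caller degrade gracefully, for example by serving a cached or default response while the dependency recovers.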




Just a quick reminder...


                      •     (Some of) Netflix is open source:

                            •   https://github.com/netflix




We are sincerely eager to
  hear your feedback on this
presentation and on re:Invent.

 Please fill out an evaluation
    form when you have a
            chance.

Questions?




Getting in touch
         Email: jedberg@{gmail,netflix}.com
         Twitter: @jedberg
         Web: www.jedberg.net
         Facebook: facebook.com/jedberg
         Linkedin: www.linkedin.com/in/jedberg
Tweet @jedberg with feedback!

More Related Content

What's hot

SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...
DevClub_lv
 
DevOps-as-a-Service: Towards Automating the Automation
DevOps-as-a-Service: Towards Automating the AutomationDevOps-as-a-Service: Towards Automating the Automation
DevOps-as-a-Service: Towards Automating the Automation
Keith Pleas
 
More the merrier: a microservices anti-pattern
More the merrier: a microservices anti-patternMore the merrier: a microservices anti-pattern
More the merrier: a microservices anti-pattern
Chris Richardson
 
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and HailoMicroservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
gjuljo
 
Building a DevOps organization
Building a DevOps organizationBuilding a DevOps organization
Building a DevOps organization
Zinnov
 
Devops Strategy Roadmap Lifecycle Ppt Powerpoint Presentation Slides Complete...
Devops Strategy Roadmap Lifecycle Ppt Powerpoint Presentation Slides Complete...Devops Strategy Roadmap Lifecycle Ppt Powerpoint Presentation Slides Complete...
Devops Strategy Roadmap Lifecycle Ppt Powerpoint Presentation Slides Complete...
SlideTeam
 
The Observability Pipeline
The Observability PipelineThe Observability Pipeline
The Observability Pipeline
Tyler Treat
 
DEVSECOPS.pptx
DEVSECOPS.pptxDEVSECOPS.pptx
DEVSECOPS.pptx
MohammadSaif904342
 
DevOps introduction
DevOps introductionDevOps introduction
DevOps introduction
Christian F. Nissen
 
DevSecOps and the CI/CD Pipeline
 DevSecOps and the CI/CD Pipeline DevSecOps and the CI/CD Pipeline
DevSecOps and the CI/CD Pipeline
James Wickett
 
DevSecOps 101
DevSecOps 101DevSecOps 101
Shift Left Security - The What, Why and How
Shift Left Security - The What, Why and HowShift Left Security - The What, Why and How
Shift Left Security - The What, Why and How
DevOps.com
 
Introducing DevOps
Introducing DevOpsIntroducing DevOps
Introducing DevOps
Nishanth K Hydru
 
Benefits of DevSecOps
Benefits of DevSecOpsBenefits of DevSecOps
Dynatrace
DynatraceDynatrace
Dynatrace
Purnima Kurella
 
Why to Cloud Native
Why to Cloud NativeWhy to Cloud Native
Why to Cloud Native
Karthik Gaekwad
 
Customer case - Dynatrace Monitoring Redefined
Customer case - Dynatrace Monitoring RedefinedCustomer case - Dynatrace Monitoring Redefined
Customer case - Dynatrace Monitoring Redefined
Michel Duruel
 
Mastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to MicroservicesMastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to Microservices
Josh Evans
 
How to implement DevOps in your Organization
How to implement DevOps in your OrganizationHow to implement DevOps in your Organization
How to implement DevOps in your Organization
Dalibor Blazevic
 
Bridging the Security Testing Gap in Your CI/CD Pipeline
Bridging the Security Testing Gap in Your CI/CD PipelineBridging the Security Testing Gap in Your CI/CD Pipeline
Devops at Netflix (re:Invent)

  • 1. Rainmakers How Netflix Operates Clouds for Maximum Freedom and Agility Jeremy Edberg Reliability Architect, Netflix
  • 2. Do you have... • A release Engineer? • A QA department? • Chef or Puppet to manage your systems?
  • 3. Do you have... • Upwards of 100 releases a day?
  • 5. With more than 30 million streaming members in the United States, Canada, Latin America, the United Kingdom, Ireland and the Nordics, Netflix is the world's leading internet subscription service for enjoying movies and TV programs streamed over the internet to PCs, Macs and TV. Source: http://ir.netflix.com
  • 6. The Netflix Way • Everything is “built for three” • Fully automated build tools to test and make packages • Fully automated machine image bakery • Fully automated image deployment • Independent teams responsible for both Dev and Ops
  • 8. Automate all the things!
  • 9. Automate all the things! • Application startup • Configuration • Code deployment • System deployment
  • 10. Automation • Standard base image • Tools to manage all the systems • Automated code deployment
  • 11. Shared state should be stored in a shared service. Data on an instance should be replicated to other instances.
  • 12. “Build for Three”: We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.
  • 14. Netflix on AWS [diagram; labels: 2012, IPv6, Open Connect]
  • 15. Highly aligned, loosely coupled • Services are built by different teams who work together to figure out what each service will provide. • The service owner publishes an API that anyone can use.
  • 16. Advantages to a Service Oriented Architecture • Easier auto-scaling • Easier capacity planning • Identify problematic code-paths more easily • Narrow down the effects of a change • More efficient local caching
  • 17. Freedom and Responsibility • Developers deploy when they want • They also manage their own capacity and autoscaling • And fix anything that breaks at 4am!
  • 18. All systems choices assume some part will fail at some point.
  • 19. The Monkey Theory • Simulate things that go wrong • Find things that are different
  • 20. Execution (Photo from I, Robot, copyright 20th Century Fox)
  • 21. Netflix built a global PaaS • Service Oriented Architecture • HTTP/REST interfaces between services
  • 22. Netflix PaaS features • Supports all regions and zones • Multiple accounts • Cross region/account replication • Internationalized, localized and GeoIP routed • Advanced key management • Autoscaling with 1000s of instances • Monitoring and alerting on millions of metrics
  • 23. What AWS Provides • Instances • Machine Images • Elastic IPs • Load Balancers • Security groups / Autoscaling groups • Availability zones and regions
  • 24. [Instance stack diagram] • Linux Base AMI (CentOS or Ubuntu) • Optional Apache • Java (JDK 6 or 7) • AppDynamics App Agent monitoring • Tomcat • Log rotation to S3 • Monitoring (AppDynamics Machine Agent) • Application war file, base servlet, platform, interface jars for dependent services • Healthcheck, status servlets, JMX interface, Servo autoscale • GC and thread dump logging
  • 25. The Netflix Platform • Discovery (Eureka) • Entrypoints (Edda) • Circuit Breakers (Hystrix) • Configuration (Archaius) • Cassandra (Priam & Astyanax & CassJMeter) • Zookeeper (Exhibitor) • Cryptex • Logging (Blitz4j & Honu) • AKMS • EvCache • NIWS • Proxies • i18n • Geo • L10n • Base [diagram legend: Open Source]
  • 27. Open Source at Netflix [timeline diagram, Nov through Oct, 2012; projects shown: Curator, Astyanax, Servo, Priam, CassJMeter, Exhibitor, Archaius, Asgard, Chaos Monkey, Edda, Blitz4j, Hystrix, Governator, Eureka]
  • 28. Finding things • Discovery (Eureka) • Application to instance mapping • Heartbeat to keep track of health • Entrypoints (Edda) • Local database of AWS resources • NIWS (Netflix Internal Web Service) • On-instance software load balancer • Handles retry logic • Geo (Geolocation library) • Provides IP to Lat/Lon mapping for any service that needs it.
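
The "application to instance mapping" and "heartbeat to keep track of health" bullets on slide 28 can be illustrated with a small in-memory registry. This is a minimal sketch and not Eureka's actual API: the class name `HeartbeatRegistry`, its methods, and the 90-second expiry window are all assumptions made for illustration.

```python
import time


class HeartbeatRegistry:
    """Toy application-to-instance registry with heartbeat expiry (not Eureka's API)."""

    def __init__(self, ttl_seconds=90.0, clock=time.monotonic):
        self._ttl = ttl_seconds   # hypothetical expiry window, not a Netflix value
        self._clock = clock       # injectable so tests can control time
        self._last_seen = {}      # (app, instance_id) -> time of last heartbeat

    def heartbeat(self, app, instance_id):
        """Register the instance, or renew it if it is already known."""
        self._last_seen[(app, instance_id)] = self._clock()

    def instances(self, app):
        """Return the instance IDs of `app` whose last heartbeat is within the TTL."""
        now = self._clock()
        return sorted(
            instance_id
            for (name, instance_id), seen in self._last_seen.items()
            if name == app and now - seen <= self._ttl
        )
```

An instance that stops sending heartbeats simply drops out of `instances()` once the window passes, which is the property a client-side load balancer needs in order to stop routing to it.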
  • 29. Entrypoints (Edda) • REST API • GET /REST/v2/instance/$id • Keeps track of all resources • Autoscaling groups, EIPs, Instances, Applications, Clusters, History
  • 30. Entrypoints Exploration • Find all active instances: GET /REST/v2/view/instances • Find all instances in a cluster: GET /REST/v2/group/clusters • Show only ASG name, instance ID and health: /v2/aws/autoScalingGroups/edda-v123;_pp:(autoScalingGroupName,instances:(instanceId,lifecycleState)) • Which ASG contains a particular instance? /v2/aws/autoScalingGroups;instances.instanceId=i-96f3ca3a
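
The last two queries on slide 30 combine a collection path with `;key=value` filters and a `_pp:(...)` field selector. The sketch below only assembles such path strings; it makes no network calls. The helper names `edda_path` and `_selector` are hypothetical, and the syntax is copied from the slide rather than checked against a running Edda server.

```python
def _selector(fields):
    """Render a field list; a (parent, [children]) tuple becomes parent:(children)."""
    parts = []
    for field in fields:
        if isinstance(field, tuple):
            parent, children = field
            parts.append(f"{parent}:({_selector(children)})")
        else:
            parts.append(field)
    return ",".join(parts)


def edda_path(collection, name=None, matrix=None, fields=None):
    """Build a path like the ones on the slide (hypothetical helper)."""
    path = "/v2/" + collection.strip("/")
    if name:
        path += "/" + name
    for key, value in (matrix or {}).items():
        path += f";{key}={value}"           # filter, e.g. ;instances.instanceId=i-...
    if fields:
        path += ";_pp:(" + _selector(fields) + ")"  # restrict the returned fields
    return path
```

For example, `edda_path("aws/autoScalingGroups", matrix={"instances.instanceId": "i-96f3ca3a"})` reproduces the "which ASG contains a particular instance?" query.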
  • 31. Keeping it all Straight • Configuration (Archaius) • Global variables (Fast properties) • Base • Base system. Prod vs. Test, etc. • Zookeeper (Curator) • Locks, other similar coordination • Logging (Blitz4j and Honu) • Keep track of what happened and store it for post analysis.
  • 32. Keeping it Secure • Cryptex • Service for key management • High, medium and low value keys • AKMS (Amazon Key Management System) • Hands out keys to instances (and dev boxes) so they don’t have to store the key on the instance. For more info, see SEC201: Security Panel
  • 33. Storing it • Cassandra (Priam, Astyanax) • Configure and access Cassandra • Provide OO abstractions; handle connection pooling and discovery of hosts • EVCache (Eccentric Volatile Cache) • Wrapper for memcached to handle zone awareness and replication • Proxies • Get data out of the datacenter and into the cloud.
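
Slide 33 describes EVCache as a memcached wrapper that handles zone awareness and replication. A minimal sketch of that idea follows, using plain dictionaries in place of memcached clients. The class `ZoneAwareCache` and its write-everywhere, read-local-first policy are assumptions for illustration, not EVCache's real interface.

```python
class ZoneAwareCache:
    """Toy zone-aware cache: write to every zone, read from the local zone first."""

    def __init__(self, zones, local_zone):
        if local_zone not in zones:
            raise ValueError("local_zone must be one of zones")
        self._stores = {zone: {} for zone in zones}  # stand-ins for memcached pools
        self._local = local_zone

    def set(self, key, value):
        # Replicate the write to every zone so any zone can serve the read.
        for store in self._stores.values():
            store[key] = value

    def get(self, key):
        # Prefer the local zone; fall back to the others if it misses.
        order = [self._local] + [z for z in self._stores if z != self._local]
        for zone in order:
            if key in self._stores[zone]:
                return self._stores[zone][key]
        return None

    def drop_zone(self, zone):
        """Simulate losing every cache node in one zone."""
        self._stores[zone].clear()
```

After `drop_zone` on the local zone, `get` still returns the value from another zone, which is the behavior the replication bullet is after.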
  • 34. Data: What do we do with it all?
  • 35. We store it! • Cache (memcached) • Cassandra • RDS (MySQL)
  • 37. Why Cassandra? • Availability over consistency • Writes over reads • We know Java • Open source + support
  • 38. Using Cassandra at Netflix • Priam • Zero touch auto-config • State management • Token assignment • Node replacement • Backup/restore to/from S3 • Astyanax • OO abstraction to Cassandra • Multi-region support
  • 42. Cassandra Architecture. For more info, see DAT202: Optimizing your Cassandra Database on AWS
  • 43. Tools • Asgard • AWS usage • Atlas • Chronos • Build system • Explorers (Cassandra and SimpleDB)
  • 45. [Diagram; labels: Elastic Load Balancer, Auto Scaling Group, Instances, Security Group, Launch Configuration, Amazon Machine Image]
  • 46. api-frontend api-usprod-v007 api-usprod-v008
  • 47. api-frontend api-usprod-v007 api-usprod-v008
  • 51. Netflix has moved the granularity from the instance to the cluster
  • 52. Why Bake? • Traditional (Generic AMI to Instance): launch OS, install packages, install app • Netflix (App AMI to Instance): launch OS+app
  • 53. Getting Baked [build pipeline diagram; labels: source in Perforce / Git; Jenkins running Ant targets (resolve, compile, build, test, report, publish, sync); Groovy all over; Ivy snapshot / release libraries; Artifactory holding libraries / apps and app bundles]
  • 54. Base Image Baking [diagram] • Foundation AMI (Linux: CentOS, Fedora, Ubuntu) on S3 / EBS • Bakery (AWS EC2 slave instances): mount, install RPMs (Apache, Java...) via Yum / Apt, snapshot • Base AMI: ready for app bake
  • 55. App Image Baking [diagram] • Base AMI (Linux, Apache, Java, Tomcat) on S3 / EBS • Bakery (AWS EC2 slave instances): mount, install app bundle via Jenkins / Yum / Artifactory, snapshot • App AMI: ready to launch!
  • 56. [Instance stack diagram, Tomcat] • Linux Base AMI (CentOS or Ubuntu) • Optional Apache • Java (JDK 6 or 7) • AppDynamics App Agent monitoring • Tomcat • Log rotation to S3 • Monitoring (AppDynamics Machine Agent) • Application war file, base servlet, platform, interface jars for dependent services • Healthcheck, status servlets, JMX interface, Servo autoscale • GC and thread dump logging
  • 57. [Instance stack diagram, JBoss] • Linux Base AMI (CentOS or Ubuntu) • Optional Apache • Java (JDK 6 or 7) • AppDynamics App Agent monitoring • JBoss • Log rotation to S3 • Monitoring (AppDynamics Machine Agent) • Application war file, base servlet, platform, interface jars for dependent services • Healthcheck, status servlets, JMX interface, Servo autoscale • GC and thread dump logging
  • 58. [Instance stack diagram, Python] • Linux Base AMI (CentOS or Ubuntu) • Optional Apache • Python • monitoring • Django • Log rotation to S3 • Monitoring (AppDynamics Machine Agent) • Application file, base server, platform, interface libs for dependent services • logging
  • 59. The Monkey Theory • Simulate things that go wrong • Find things that are different
  • 60. The simian army • Chaos -- Kills random instances • Chaos Gorilla -- Kills zones • Chaos Kong -- Kills regions • Latency -- Degrades network and injects faults • Conformity -- Looks for outliers • Circus -- Kills and launches instances to maintain zone balance • Doctor -- Fixes unhealthy resources • Janitor -- Cleans up unused resources • Howler -- Yells about bad things like Amazon limit violations • Security -- Finds security issues and expiring certificates. For more info, see ARC301: Intro to Chaos Monkey & the Simian Army
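
The "Chaos -- Kills random instances" entry on slide 60 can be sketched as a victim-selection step. This is an illustration only, not the open-source Chaos Monkey code: the function `pick_victims`, the one-victim-per-group policy, and the `probability` knob are assumptions, and nothing here actually terminates anything.

```python
import random


def pick_victims(groups, probability=1.0, rng=None):
    """Choose at most one instance per group to terminate (hypothetical policy).

    groups: mapping of group name -> collection of instance IDs.
    probability: chance that a given group is targeted on this run.
    """
    rng = rng or random.Random()
    victims = {}
    for name, instances in groups.items():
        # Skip empty groups; otherwise roll the dice for this group.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(sorted(instances))
    return victims
```

Returning the selection rather than acting on it keeps the sketch safe to run; a real tool would pass each victim to the cloud provider's terminate call and log the action.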
  • 61. What’s going on?!
  • 63. Example Alert Config
      {
        "clusters": [
          "epic_aggregator",
          "epic_aggregator-dev"
        ],
        "alerts": [
          // you can use javascript style comments in the config
          {
            "metricName": "EpicPlugin_NumDropped",
            "applyTo": "cluster",
            "condition": {
              "type": "StaticThreshold",
              "max": 0.0
            },
            "severity": "major",
            "description": "plugin is dropping metrics"
          },
          {
            "metricName": "EpicPlugin_NumDropped_Instance",
            "applyTo": "instance",
            "condition": {
              "type": "NumOccurrences",
              "num": 4,
              "condition": {
                "type": "StaticThreshold",
                "max": 0.0
              }
            },
            "overrides": {
              "service_key_override": "12345",
              "require_instance_status_not_in": ["DOWN", "OUT_OF_SERVICE"],
              "email_override": "devnull@netflix.com"
            },
            "severity": "minor"
          },
          {
            "metricName": "EpicPlugin_MetricCount",
            "applyTo": "instance",
            "description": "${instanceId} is reporting too many metrics",
            "condition": {
              "type": "NumOccurrences",
              "num": 4,
              "condition": {
                "type": "StaticThreshold",
                "max": 0.0
              }
            },
            "additionalDetails": {
              "statusUrl": "http://${publicDnsName}:7001/Status",
              "nacClusterUrl": "nac${env}/${region}/cluster/show/${cluster}"
            },
            "overrides": {
              "subject": "${instanceId} is reporting too many metrics",
              "incident_key": "${metricName}:${instanceId}",
              "service_key_override": "12345",
              "email_override": "devnull@netflix.com"
            },
            "severity": "minor"
          }
        ]
      }
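
The config on slide 63 nests a `StaticThreshold` condition inside a `NumOccurrences` condition. The slides do not define the semantics, so the sketch below assumes that `StaticThreshold` fires when the latest sample exceeds `max`, and that `NumOccurrences` fires when its inner condition held for each of the last `num` samples. The function name `breaches` is hypothetical.

```python
def breaches(condition, values):
    """Return True if the newest samples in `values` trip `condition`.

    Assumed semantics (not confirmed by the slides):
      StaticThreshold: the latest sample is greater than "max".
      NumOccurrences:  the inner condition held for each of the last "num" samples.
    """
    kind = condition["type"]
    if kind == "StaticThreshold":
        return bool(values) and values[-1] > condition["max"]
    if kind == "NumOccurrences":
        num = condition["num"]
        if len(values) < num:
            return False
        inner = condition["condition"]
        # Evaluate the inner condition as of each of the last `num` samples.
        return all(
            breaches(inner, values[: i + 1])
            for i in range(len(values) - num, len(values))
        )
    raise ValueError(f"unknown condition type: {kind}")
```

Under these assumptions the nested form suppresses one-off spikes: a single dropped-metric sample does not page anyone, but four in a row do.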
  • 64. Alert Tuning
  • 65. Alert Systems [diagram; labels: Atlas alerting, AppDynamics, api, Other Team’s Agent, CORE Agent, CORE Event Gateway, Paging Service, Amazon SES]
  • 68. Data Collection Pipeline, Data Processing Pipeline [diagram]. For more info, see BDT303: Data Science with Elastic MapReduce
  • 69. Chukwa/Honu [chart: messages / min] 63 billion messages a day
  • 71. Incident Reviews Ask the key questions: • What went wrong? • How could we have detected it sooner? • How could we have prevented it? • How can we prevent this class of problem in the future? • How can we improve our behavior for next time?
  • 72. Best Practices for Data • Have multiple copies of all data • Keep those copies in multiple AZs • Avoid keeping state on a single instance • Take frequent snapshots of EBS disks • No secret keys on the instance
  • 73. Netflix autoscaling [chart; annotations: 1. Traffic Peak, 2. Deployment]
  • 74. AWS Usage [chart]. Dollar amounts have been carefully removed
  • 76. Benefits of Amazon’s Zones • Loosely connected • Low latency between zones • 99.95% uptime guarantee per region Tweet @jedberg with feedback!
  • 78. Leveraging Multi-region • 100% uptime is theoretically possible. • You have to replicate your data • This will cost money Tweet @jedberg with feedback!
  • 79. Circuit Breakers (Hystrix) Be liberal in what you accept, strict in what you send Tweet @jedberg with feedback!
  • 80. Just a quick reminder... • (Some of) Netflix is open source: • https://github.com/netflix Tweet @jedberg with feedback!
  • 81. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
  • 83. Getting in touch Email: jedberg@{gmail,netflix}.com Twitter: @jedberg Web: www.jedberg.net Facebook: facebook.com/jedberg Linkedin: www.linkedin.com/in/jedberg
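
The alert config on slide 63 nests a StaticThreshold condition inside a NumOccurrences condition. The slides do not say how these are evaluated, so the following is only a sketch under assumed semantics: StaticThreshold fires when the newest sample exceeds "max", and NumOccurrences fires when its nested condition held at each of the last "num" samples in a row. The function name condition_fires and the sample lists are hypothetical, not part of any Netflix tool.

```python
from typing import Any, Dict, List


def condition_fires(condition: Dict[str, Any], samples: List[float]) -> bool:
    """Evaluate an alert condition against samples ordered oldest to newest."""
    kind = condition["type"]
    if kind == "StaticThreshold":
        # Assumed semantics: fire when the newest sample exceeds "max".
        return bool(samples) and samples[-1] > condition["max"]
    if kind == "NumOccurrences":
        # Assumed semantics: the nested condition must hold at each of the
        # last "num" samples in a row.
        num = condition["num"]
        if len(samples) < num:
            return False
        return all(
            condition_fires(condition["condition"], samples[: len(samples) - back])
            for back in range(num)
        )
    raise ValueError(f"unknown condition type: {kind}")


# Condition copied from the EpicPlugin_NumDropped_Instance alert on slide 63.
dropped_on_instance = {
    "type": "NumOccurrences",
    "num": 4,
    "condition": {"type": "StaticThreshold", "max": 0.0},
}

# Four consecutive non-zero drop counts: the alert fires.
print(condition_fires(dropped_on_instance, [0.0, 2.0, 1.0, 3.0, 5.0]))
# A zero within the last four samples: the alert stays quiet.
print(condition_fires(dropped_on_instance, [2.0, 1.0, 0.0, 5.0]))
```

Under these assumptions, wrapping a threshold in NumOccurrences is a way to ignore one-off spikes and page only on sustained problems, which fits the "Alert Tuning" theme of slide 64.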

Editor's Notes

  1. My friends Joe and Carl already told you about Nac and our build system. This allows the devs to take control of their deployment. Each team is responsible for their own deployments and uptime. When something breaks, we have a system that lets us page a team who then gets on and fixes their stuff. Each team is responsible for their own destiny. So how do we stay reliable when we have no control? Information.
  2. Automate as much as you can
  3. The more automated things are, the easier it is to be a sysadmin: application startup, configuration, code deployment, full system deployment. The more automated things are, the easier it is to scale, especially in a virtualized environment with auto-scaling. And virtualized computing added the last bit: the ability to automate system deployment. (OK, that's not entirely true, but watch me wave my hands and say it is.)
  4. In most places, you have this. Standard image with tools to manage the systems and the deployment.
  5. By building for three, you can reasonably lose one of your instances and still be stable.
  6. Replication factor; quorum reads / writes.
  7. In most systems, you worry about the software and installing it on an OS. At Netflix, the smallest thing we worry about is the instance image, which lives in a cluster. We've essentially built a platform for doing automated deployment of Java code (and some Python too!)
  8. So, why do we bake custom images instead of just using Puppet or Chef to deploy packages dynamically to launched generic machines? We like to front-load the full machine assembly to build time, instead of waiting until deployment time. We do this because: • More reliable: fewer systems that can fail at deploy time, right when we need them most. • Faster launch: means quicker reaction to load increases, e.g. autoscaling up can be more precise. • Single image: produces exactly homogeneous clusters, with no file/package version skew across machines in a cluster.
  9. OK, back to the build pipeline again. We have a vague “app bundles” output in this diagram. Let's delve into how we manage the application bundle artifacts in more detail.
  10. The first step of the baking process is to create the “base” image that we will use for baking all app images. This is done once every week or two. We start with a standard Linux distro as a foundation (CentOS now, Ubuntu on the way), and add in our favorite, our custom and customized packages: • Apache, Java (JDK 6 and 7), Tomcat, Perl, Python, provisioning and startup scripts, log management tools, monitoring agents, etc. The end result is a beefed-up OS image that is ready to go, and just needs an app added.
  12. The gateway classifies and routes events based on severity and the systems involved. It currently processes around 48K events a day.
  13. (Step through.) Here are some best practices we've learned over the last year with EC2. Some of these we follow well, and some we need to follow better.
  14. At Netflix we use autoscaling to help manage reliability and cost. Here is one of our clusters scaling up and down. We are tuning for the holidays, so you can see parts where we are doing squeeze tests and adjusting the scaling speed and values.
  15. Amazon will help you as well. One way they do this is by providing zones. Each zone is like an island that is loosely connected to the other zones, but mostly distinct.
  16. So how do you get better than 99.95% uptime? Multiple zones! By spreading your systems out across multiple zones, you should be able to withstand the failure of one zone. In a little bit, I'll go over how reddit and Netflix used a multizone strategy to survive outages.
  17. Amazon, like other providers, also offers multiple regions. Regions are essentially like separate providers with the same feature set. Your data does not get shared across regions.
  18. You can contact me in one of these ways, or ask your question now. Thank you.
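
Notes 5 and 6 tie "building for three" to replication factor and quorum reads and writes. The arithmetic behind that can be sketched as below, assuming a simple majority quorum of the kind used by Dynamo-style data stores. The function names are hypothetical and the deck does not state which quorum settings Netflix actually uses.

```python
def quorum_size(replication_factor: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


def tolerated_failures(n: int) -> int:
    """Replicas that can be lost while a majority quorum is still reachable."""
    return n - quorum_size(n)


n = 3  # "built for three"
q = quorum_size(n)
print(q, reads_see_latest_write(n, q, q), tolerated_failures(n))  # 2 True 1
```

With three replicas, a quorum is two, so quorum reads and writes always overlap on at least one replica and the system keeps working after losing one instance, which matches the claim in note 5.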