Architectural Patterns for High
         Anxiety Availability
                     November 2012
                     Adrian Cockcroft
                  @adrianco #netflixcloud #qconsf
            http://www.linkedin.com/in/adriancockcroft



@adrianco
The Netflix Streaming Service

            Now in USA, Canada, Latin America, UK, Ireland,
             Sweden, Denmark, Norway and Finland


@adrianco
US Non-Member Web Site
              Advertising and Marketing Driven




@adrianco
Member Web Site
             Personalization Driven




@adrianco
Streaming Device API




@adrianco
Content Delivery Service
    Distributed storage nodes controlled by Netflix cloud services




@adrianco
November 2012 Traffic




@adrianco
Abstract
• Netflix on Cloud – What, Why and When

• Globally Distributed Architecture

• Benchmarks and Scalability

• Open Source Components

• High Anxiety

@adrianco
Blah Blah Blah

             (I’m skipping all the cloud intro etc.; I did that
               last year… Netflix runs in the cloud. If you
              hadn’t figured that out already, you aren’t
              paying attention and should go read InfoQ
                       and slideshare.net/netflix.)



@adrianco
Things we don’t do




@adrianco
Things We Do Do…
                                             In production
                                             at Netflix
•   Big Data/Hadoop                          2009
•   AWS Cloud                                2009
•   Application Performance Management       2010
•   Integrated DevOps Practices              2010
•   Continuous Integration/Delivery          2010
•   NoSQL, Globally Distributed              2010
•   Platform as a Service; Micro-Services    2010
•   Social coding, open development/github   2011

@adrianco
How Netflix Works

 [Diagram labels]
 Legend: Consumer Electronics, AWS Cloud Services, CDN Edge Locations
 Customer Device (PC, PS3, TV…)
 Web Site or Discovery API: User Data, Personalization
 Streaming API: DRM, QoS Logging
 OpenConnect CDN Boxes: CDN Management and Steering, Content Encoding


 @adrianco
Web Server Dependencies Flow
        (Home page business transaction as seen by AppDynamics)

 Each icon is three to a few hundred instances across three AWS zones.

 [Diagram: service flow beginning at "Start Here"; labeled nodes include the
  personalization movie group chooser, Cassandra, memcached, a web service
  and an S3 bucket]
 @adrianco
Component Micro-Services
            Test With Chaos Monkey, Latency Monkey




@adrianco
Three Balanced Availability Zones
                           Test with Chaos Gorilla

                                Load Balancers




          Zone A                      Zone B                  Zone C
   Cassandra and Evcache       Cassandra and Evcache   Cassandra and Evcache
         Replicas                    Replicas                Replicas




@adrianco
Triple Replicated Persistence
           Cassandra maintenance affects individual replicas
                              Load Balancers




          Zone A                    Zone B                  Zone C
   Cassandra and Evcache     Cassandra and Evcache   Cassandra and Evcache
         Replicas                  Replicas                Replicas




@adrianco
Isolated Regions

                      US-East Load Balancers                                              EU-West Load Balancers




      Zone A                      Zone B                  Zone C               Zone A                 Zone B               Zone C

 Cassandra Replicas          Cassandra Replicas      Cassandra Replicas   Cassandra Replicas     Cassandra Replicas   Cassandra Replicas




@adrianco
Failure Modes and Effects
Failure Mode          Probability   Mitigation Plan
Application Failure   High          Automatic degraded response
AWS Region Failure    Low           Wait for region to recover
AWS Zone Failure      Medium        Continue to run on 2 out of 3 zones
Datacenter Failure    Medium        Migrate more functions to cloud
Data store failure    Low           Restore from S3 backups
S3 failure            Low           Restore from remote archive
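
A rough sketch of what "continue to run on 2 out of 3 zones" implies for capacity, assuming traffic is balanced evenly across zones (as on the Three Balanced Availability Zones slide). The class and method names are made up for illustration; this is plain arithmetic, not Netflix code.

```java
// Illustrative sketch: per-zone load when traffic is balanced across zones
// and some zones are lost. Class and method names are hypothetical.
final class ZoneHeadroom {

    /** Fraction of total traffic each surviving zone carries. */
    static double sharePerZone(int totalZones, int failedZones) {
        int surviving = totalZones - failedZones;
        if (totalZones <= 0 || failedZones < 0 || surviving <= 0) {
            throw new IllegalArgumentException("need at least one surviving zone");
        }
        return 1.0 / surviving;
    }

    /** Load on a surviving zone relative to its normal load (1.0 means unchanged). */
    static double loadMultiplier(int totalZones, int failedZones) {
        return sharePerZone(totalZones, failedZones) / sharePerZone(totalZones, 0);
    }

    public static void main(String[] args) {
        System.out.printf("normal share per zone: %.3f%n", sharePerZone(3, 0));
        System.out.printf("share after one zone fails: %.3f%n", sharePerZone(3, 1));
        System.out.printf("load multiplier: %.2fx%n", loadMultiplier(3, 1));
    }
}
```

With three zones, each surviving zone goes from one third to one half of the traffic, so it needs roughly 50% headroom (or fast autoscaling) to absorb a zone failure.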




@adrianco
Zone Failure Modes
• Power Outage
    – Instances lost, ephemeral state lost
    – Clean break and recovery, fail fast, “no route to host”

• Network Outage
    – Instances isolated, state inconsistent
    – More complex symptoms, recovery issues, transients

• Dependent Service Outage
    – Cascading failures, misbehaving instances, human errors
    – Confusing symptoms, recovery issues, byzantine effects

        More detail on this topic at AWS Re:Invent later this month…
@adrianco
Cassandra backed Micro-Services

            A highly scalable, available and
             durable deployment pattern



@adrianco
Micro-Service Pattern
     One keyspace, replaces a single table or materialized view

 [Diagram: AppDynamics Service Flow Visualization]
   • Many Different Single-Function REST Clients
   • Stateless Data Access REST Service, Astyanax Cassandra Client
   • Single function Cassandra Cluster Managed by Priam, between 6 and 72 nodes
   • Optional Datacenter Update Flow

 Each icon represents a horizontally scaled service of three to hundreds of
 instances deployed over three availability zones
@adrianco
Stateless Micro-Service Architecture

 Linux Base AMI (CentOS or Ubuntu)
   • Optional Apache frontend, memcached, non-java apps
   • Monitoring: Log rotation to S3, AppDynamics machineagent, Epic/Atlas
   • Java (JDK 6 or 7)
       – AppDynamics appagent monitoring
       – GC and thread dump logging
       – Tomcat
           Application war file, base servlet, platform, client interface jars, Astyanax
           Healthcheck, status servlets, JMX interface, Servo autoscale




@adrianco
Astyanax
                 Available at http://github.com/netflix

• Features
    –   Complete abstraction of connection pool from RPC protocol
    –   Fluent Style API
    –   Operation retry with backoff
    –   Token aware
• Recipes
    –   Distributed row lock (without zookeeper)
    –   Multi-DC row lock
    –   Uniqueness constraint
    –   Multi-row uniqueness constraint
    –   Chunked and multi-threaded large file storage
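
To make the "operation retry with backoff" feature concrete, here is a minimal generic sketch in plain Java. It is not Astyanax's retry policy code; the class, method and parameter names are hypothetical, and it only illustrates capped exponential backoff around an arbitrary operation.

```java
import java.util.concurrent.Callable;

// Illustrative sketch only: a generic retry helper with capped exponential
// backoff. This is not Astyanax's retry policy code; all names are hypothetical.
final class RetryWithBackoff {

    /** Delay before retry number attempt (0-based): base * 2^attempt, capped at maxDelayMs. */
    static long delayMs(int attempt, long baseDelayMs, long maxDelayMs) {
        // Bound the shift so small base delays cannot overflow a long.
        long delay = baseDelayMs << Math.min(attempt, 20);
        return Math.min(delay, maxDelayMs);
    }

    /** Runs op, retrying on any exception until it succeeds or maxAttempts is used up. */
    static <T> T call(Callable<T> op, int maxAttempts, long baseDelayMs, long maxDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(delayMs(attempt, baseDelayMs, maxDelayMs));
                }
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Hypothetical operation that fails twice, then succeeds.
        String result = call(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "ok";
        }, 5, 10, 100);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

A production client would normally retry only on errors known to be transient and add jitter to the delay; both are left out here to keep the sketch short.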


@adrianco
Astyanax Query Example
Paginate through all columns in a row
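import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.Column;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.query.RowQuery;
import com.netflix.astyanax.util.RangeBuilder;

// keyspace and CF_STANDARD1 are assumed to be defined elsewhere
ColumnList<String> columns;
int pageSize = 10;
try {
  // Build a paginating query over row "A", fetching pageSize columns per call
  RowQuery<String, String> query = keyspace
      .prepareQuery(CF_STANDARD1)
      .getKey("A")
      .setIsPaginating()
      .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());

  // Each execute() returns the next page; an empty page ends the loop
  while (!(columns = query.execute().getResult()).isEmpty()) {
    for (Column<String> c : columns) {
      // process column c here
    }
  }
} catch (ConnectionException e) {
  // handle or log the connection failure here
}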


@adrianco
Astyanax - Cassandra Write Data Flows
           Single Region, Multiple Availability Zone, Token Aware

1. Client writes to local coordinator
2. Coordinator writes to other zones
3. Nodes return ack
4. Data written to internal commit log disks (no more than 10 seconds later)

If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.

[Diagram: Token Aware Clients in the middle of a ring of six Cassandra nodes
 (disks, two nodes in each of Zones A, B and C); numbers on the diagram refer
 to the steps above]
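
The "one node, a quorum, or all nodes" choice can be sketched as a small calculation. The code below assumes the standard Cassandra quorum rule (floor(RF / 2) + 1 replicas); the class and method names are hypothetical and nothing here calls Cassandra or Astyanax.

```java
// Illustrative sketch: how many replica acks a write waits for at each
// consistency level, and how many replicas can be down. Names are hypothetical.
final class WriteAcks {

    enum Level { ONE, QUORUM, ALL }

    /** Quorum size for a given replication factor: floor(RF / 2) + 1. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** Number of replica acks the coordinator waits for before answering the client. */
    static int required(Level level, int replicationFactor) {
        if (replicationFactor < 1) {
            throw new IllegalArgumentException("replication factor must be at least 1");
        }
        switch (level) {
            case ONE:    return 1;
            case QUORUM: return quorum(replicationFactor);
            case ALL:    return replicationFactor;
            default:     throw new AssertionError(level);
        }
    }

    /** Replicas that may be offline while the write still succeeds at this level. */
    static int tolerated(Level level, int replicationFactor) {
        return replicationFactor - required(level, replicationFactor);
    }

    public static void main(String[] args) {
        int rf = 3; // one replica per zone, as in the three-zone deployment
        for (Level level : Level.values()) {
            System.out.println(level + ": waits for " + required(level, rf)
                    + " acks, tolerates " + tolerated(level, rf) + " replicas down");
        }
    }
}
```

With a replication factor of 3, a quorum write waits for 2 acks and still succeeds with one replica (for example, one zone) unavailable, which is why hinted handoff can finish the third copy later.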


   @adrianco
Data Flows for Multi-Region Writes
          Token Aware, Consistency Level = Local Quorum

1. Client writes to local replicas
2. Local write acks returned to Client which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to remote coordinator
4. When data arrives, remote coordinator node acks and copies to other remote zones
5. Remote nodes ack to local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)

If a node or region goes offline, hinted handoff completes the write when the node comes back up.
Nightly global compare and repair jobs ensure everything stays consistent.

[Diagram: US Clients and EU Clients, each in the middle of a ring of six Cassandra nodes
 (disks, two nodes in each of Zones A, B and C); 100+ms latency between the two regions;
 numbers on the diagram refer to the steps above]
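
One way to see why Local Quorum hides the 100+ms cross-region hop from the client: the client continues as soon as the quorum-th fastest local replica has acked, and remote replication happens afterwards. The sketch below models only that rule; the latency numbers are invented for illustration and the names are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch: client-observed write latency under a local quorum is
// the latency of the quorum-th fastest local ack. Remote-region replication is
// asynchronous from the client's point of view and is not modeled.
final class LocalQuorumLatency {

    /** Latency after which a quorum of the local replicas has acked. */
    static long clientLatencyMs(long[] localAckMs) {
        if (localAckMs.length == 0) {
            throw new IllegalArgumentException("need at least one local replica");
        }
        long[] sorted = localAckMs.clone();
        Arrays.sort(sorted);
        int quorum = sorted.length / 2 + 1; // 2 of 3 local nodes
        return sorted[quorum - 1];
    }

    public static void main(String[] args) {
        long[] localAckMs = {2, 5, 40}; // hypothetical ack times for three local replicas
        System.out.println("client continues after "
                + clientLatencyMs(localAckMs) + " ms");
    }
}
```

In this made-up example the slow third replica and the remote region do not delay the client; they only affect how long the copies take to converge.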




   @adrianco
Cassandra Instance Architecture

 Linux Base AMI (CentOS or Ubuntu)
   • Tomcat and Priam on JDK: Healthcheck, Status
   • Monitoring: AppDynamics machineagent, Epic/Atlas
   • Java (JDK 7)
       – AppDynamics appagent monitoring
       – GC and thread dump logging
       – Cassandra Server
   • Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk holding Commit log and SSTables




@adrianco
Priam – Cassandra Automation
            Available at http://github.com/netflix

•   Netflix Platform Tomcat Code
•   Zero touch auto-configuration
•   State management for Cassandra JVM
•   Token allocation and assignment
•   Broken node auto-replacement
•   Full and incremental backup to S3
•   Restore sequencing from S3
•   Grow/Shrink Cassandra “ring”
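
As a simplified picture of "token allocation and assignment", the sketch below spaces tokens evenly around the ring, assuming Cassandra's RandomPartitioner token space of 0 to 2^127. This is not Priam's actual algorithm (which also has to account for zones and regions); the class and method names are hypothetical.

```java
import java.math.BigInteger;

// Illustrative sketch: evenly spaced initial tokens for a ring of N nodes,
// assuming the RandomPartitioner token space of 0 to 2^127.
// Not Priam's real allocation logic; names are hypothetical.
final class EvenTokens {

    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(127);

    /** Token for node i of nodeCount: i * 2^127 / nodeCount. */
    static BigInteger[] tokens(int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException("nodeCount must be at least 1");
        }
        BigInteger[] result = new BigInteger[nodeCount];
        BigInteger n = BigInteger.valueOf(nodeCount);
        for (int i = 0; i < nodeCount; i++) {
            result[i] = RING_SIZE.multiply(BigInteger.valueOf(i)).divide(n);
        }
        return result;
    }

    public static void main(String[] args) {
        for (BigInteger token : tokens(6)) {
            System.out.println(token);
        }
    }
}
```

Growing or shrinking the ring changes the spacing, so existing nodes need new tokens and data has to move, which is part of why automating it is worthwhile.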

@adrianco
Cassandra Backup
• Full Backup
    – Time based snapshot
    – SSTable compress -> S3

• Incremental
    – SSTable write triggers compressed copy to S3

• Archive
    – Copy cross region

[Diagram: ring of Cassandra nodes backing up to S3]

@adrianco
Deployment at Netflix

            Over 50 Cassandra Clusters
            Over 500 instances (m2.4xlarge and hi1.4xlarge)
            Over 30TB of daily backups
            Biggest cluster 72 nodes
            1 cluster over 250K writes/s

@adrianco
Cassandra Explorer for Data
                 Open source on github soon




@adrianco
ETL for Cassandra
•   Data is de-normalized over many clusters!
•   Too many to restore from backups for ETL
•   Solution – read backup files using Hadoop
•   Aegisthus
    – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html

    – High throughput raw SSTable processing
    – Re-normalizes many clusters to a consistent view
    – Extract, Transform, then Load into Teradata

@adrianco
Benchmarks and Scalability




@adrianco
Cloud Deployment Scalability
         New Autoscaled AMI – zero to 500 instances from 21:38:52 - 21:46:32, 7m40s
  Scaled up and down over a few days, total 2176 instance launches, m2.2xlarge (4 core 34GB)

                          Min.  1st Qu.  Median   Mean  3rd Qu.   Max.
                          41.0    104.2   149.0  171.8    215.8  562.0
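
The "zero to 500 instances in 7m40s" figure can be checked and turned into an average launch rate with a few lines of arithmetic. The sketch below uses only the timestamps and instance count quoted on the slide; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalTime;

// Illustrative sketch: elapsed time and average launch rate from the
// timestamps quoted on the slide. Names are hypothetical.
final class LaunchRate {

    /** Elapsed time between two same-day HH:mm:ss timestamps. */
    static Duration elapsed(String start, String end) {
        return Duration.between(LocalTime.parse(start), LocalTime.parse(end));
    }

    /** Average number of instances launched per minute over the given duration. */
    static double instancesPerMinute(int instances, Duration duration) {
        if (duration.isZero() || duration.isNegative()) {
            throw new IllegalArgumentException("duration must be positive");
        }
        return instances * 60.0 / duration.getSeconds();
    }

    public static void main(String[] args) {
        Duration d = elapsed("21:38:52", "21:46:32");
        System.out.println("elapsed seconds: " + d.getSeconds());
        System.out.printf("average rate: %.1f instances/minute%n",
                instancesPerMinute(500, d));
    }
}
```

The elapsed time works out to 460 seconds (7m40s), which averages to a little over one new instance per second.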




@adrianco
Scalability from 48 to 288 nodes on AWS
  http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html


                Client Writes/s by node count – Replication Factor = 3

  [Chart: client writes/s (y-axis, 0 to 1,200,000) against node count (x-axis, 0 to 350).
   Plotted values, in order of increasing node count from 48 to 288:
   174373, 366828, 537172, 1099837]

  Used 288 of m1.xlarge: 4 CPU, 15 GB RAM, 8 ECU
  Cassandra 0.8.6
  Benchmark config only existed for about 1hr
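
A quick way to read the benchmark is to compare writes per node at the two ends of the range named in the slide title (48 and 288 nodes), using the smallest and largest plotted values. The sketch below does only that arithmetic; the class and method names are hypothetical.

```java
// Illustrative sketch: per-node write throughput at the smallest and largest
// cluster sizes, using the end-point values from the chart. Names are hypothetical.
final class ScalingCheck {

    /** Average client writes per second handled per node. */
    static double perNode(long writesPerSec, int nodes) {
        if (nodes < 1) {
            throw new IllegalArgumentException("nodes must be at least 1");
        }
        return (double) writesPerSec / nodes;
    }

    /** Ratio of per-node throughput at the larger size to the smaller size (1.0 means linear). */
    static double scalingEfficiency(long smallWrites, int smallNodes,
                                    long largeWrites, int largeNodes) {
        return perNode(largeWrites, largeNodes) / perNode(smallWrites, smallNodes);
    }

    public static void main(String[] args) {
        System.out.printf("48 nodes:  %.0f writes/s per node%n", perNode(174_373L, 48));
        System.out.printf("288 nodes: %.0f writes/s per node%n", perNode(1_099_837L, 288));
        System.out.printf("efficiency: %.2f%n",
                scalingEfficiency(174_373L, 48, 1_099_837L, 288));
    }
}
```

Per-node throughput is roughly the same at both sizes (a ratio slightly above 1.0), which is what linear scaling looks like. These are client writes; with a replication factor of 3, each one is stored on three nodes.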


@adrianco
“Some people skate to the puck,
  I skate to where the puck is going to be”
               Wayne Gretzky




@adrianco
Cassandra on AWS
The Past                     The Future
• Instance: m2.4xlarge       • Instance: hi1.4xlarge
• Storage: 2 drives, 1.7TB   • Storage: 2 SSD volumes, 2TB
• CPU: 8 Cores, 26 ECU       • CPU: 8 HT cores, 35 ECU
• RAM: 68GB                  • RAM: 64GB
• Network: 1Gbit             • Network: 10Gbit
• IOPS: ~500                 • IOPS: ~100,000
• Throughput: ~100Mbyte/s    • Throughput: ~1Gbyte/s
• Cost: $1.80/hr             • Cost: $3.10/hr
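
To put the two columns side by side, the sketch below divides the hourly cost by the approximate IOPS for each instance type, using only the figures listed above. It is simple arithmetic with hypothetical names, and it ignores everything else that affects cost (storage size, CPU, how many nodes a cluster needs).

```java
// Illustrative sketch: hourly cost per IOPS for the two instance types,
// using the approximate figures from the slide. Names are hypothetical.
final class CostPerIops {

    /** Dollars per hour paid for each unit of IOPS. */
    static double dollarsPerHourPerIops(double dollarsPerHour, double iops) {
        if (iops <= 0) {
            throw new IllegalArgumentException("iops must be positive");
        }
        return dollarsPerHour / iops;
    }

    public static void main(String[] args) {
        double disk = dollarsPerHourPerIops(1.80, 500);     // m2.4xlarge: $1.80/hr, ~500 IOPS
        double ssd = dollarsPerHourPerIops(3.10, 100_000);  // hi1.4xlarge: $3.10/hr, ~100,000 IOPS
        System.out.printf("disk: %.6f USD per hour per IOPS%n", disk);
        System.out.printf("ssd:  %.6f USD per hour per IOPS%n", ssd);
        System.out.printf("disk costs about %.0fx more per IOPS%n", disk / ssd);
    }
}
```

On these numbers the SSD instance costs less than twice as much per hour but delivers about 200 times the IOPS, so the cost per IOPS drops by roughly two orders of magnitude.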


@adrianco
Cassandra Disk vs. SSD Benchmark
            Same Throughput, Lower Latency, Half Cost




@adrianco
Netflix Open Source Strategy
• Release PaaS Components git-by-git
    – Source at github.com/netflix – we build from it…
    – Intros and techniques at techblog.netflix.com
    – Blog post or new code every few weeks


• Motivations
    – Give back to Apache licensed OSS community
    – Motivate, retain, hire top engineers
    – “Peer pressure” code cleanup, external contributions

@adrianco
Instance creation

   Image baked:             Bakery & Build tools, Base AMI, Application Code
   ASG / Instance started:  Asgard, Odin, Autoscaling scripts
   Instance Running:        Instance




@adrianco
Application Launch

   Application initializing:                 Governator (Guice), Servo, Async logging
   Service Registry, configuration history:  Eureka, Archaius, Edda


@adrianco
Runtime

   Client Side Components:  Astyanax, Curator, NIWS LB, REST client, Dependency Command
   Server Side Components:  Priam, Exhibitor, Explorers
   Resiliency aids:         Chaos Monkey, Latency Monkey, Janitor Monkey, Cass JMeter



@adrianco
Open Source Projects
 Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon

 • Priam: Cassandra as a Service
 • Astyanax: Cassandra client for Java
 • CassJMeter: Cassandra test suite
 • Cassandra Multi-region EC2 datastore support
 • Aegisthus: Hadoop ETL for Cassandra
 • Explorers
 • Governator - Library lifecycle and dependency injection
 • Odin: Workflow orchestration
 • Async logging
 • Exhibitor: Zookeeper as a Service
 • Curator: Zookeeper Patterns
 • EVCache: Memcached as a Service
 • Eureka / Discovery: Service Directory
 • Archaius: Dynamics Properties Service
 • Edda: Queryable config history
 • Server-side latency/error injection
 • REST Client + mid-tier LB
 • Configuration REST endpoints
 • Servo and Autoscaling Scripts
 • Honu: Log4j streaming to Hadoop
 • Circuit Breaker: Robust service pattern
 • Asgard - AutoScaleGroup based AWS console
 • Chaos Monkey: Robustness verification
 • Latency Monkey
 • Janitor Monkey
 • Bakeries and AMI
 • Build dynaslaves



@adrianco
Cassandra Next Steps
• Migrate Production Cassandra to SSD
    – Many clusters done
    – 100+ SSD nodes running

• Autoscale Cassandra using Priam
    – Cassandra 1.2 Vnodes make this easier
    – Shrink Cassandra cluster every night

• Automated Zone and Region Operations
    – Add/Remove Zone, split or merge clusters
    – Add/Remove Region, split or merge clusters


@adrianco
YOLO




@adrianco
Skynet
        A Netflix Hackday project that might just terminate the
                               world…

       (hack currently only implemented in Powerpoint – luckily)




@adrianco
The Plot (kinda)
• Skynet is a sentient computer

• Skynet defends itself if you try to turn it off

• Connor is the guy who eventually turns it off

• Terminator is the robot sent to kill Connor

@adrianco
The Hacktors
• Cass_skynet is a self-managing Cassandra cluster
• Connor_monkey kills cass_skynet nodes
• Terminator_monkey kills connor_monkey nodes




@adrianco
The Hacktion
• Cass_skynet stores a history of its world and
  action scripts that trigger from what it sees
• Action response to losing a node
    – Auto-replace node and grow cluster size
• Action response to losing more nodes
    – Replicate cluster into a new zone or region
• Action response to seeing a Connor_monkey
    – Startup a Terminator_monkey

@adrianco
Implementation
• Priam
    – Autoreplace missing nodes
    – Grow cass_skynet cluster in zone, to new zones or regions
• Cassandra Keyspaces
    – Actions – scripts to be run
    – Memory – record event log of everything seen
• Cron job once a minute
    – Extract actions from Cassandra and execute
    – Log actions and results in memory
• Chaos Monkey configuration
    – Terminator_monkey: pick a zone, kill any connor_monkey
    – Connor_monkey: kill any cass_skynet or terminator_monkey


@adrianco
“Simulation”




@adrianco
High Anxiety




@adrianco
Takeaway

  Netflix has built and deployed a scalable global platform based on
                           Cassandra and AWS.

Key components of the Netflix PaaS are being released as Open Source
          projects so you can build your own custom PaaS.

                  SSDs in the cloud are awesome…

                         http://github.com/Netflix
                        http://techblog.netflix.com
                        http://slideshare.net/Netflix

                 http://www.linkedin.com/in/adriancockcroft
                   @adrianco http://perfcap.blogspot.com


@adrianco

More Related Content

What's hot

IBM Cloud Object Storage System (powered by Cleversafe) and its Applications
IBM Cloud Object Storage System (powered by Cleversafe) and its ApplicationsIBM Cloud Object Storage System (powered by Cleversafe) and its Applications
IBM Cloud Object Storage System (powered by Cleversafe) and its Applications
Tony Pearson
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integration
aspyker
 
Modular Level Design for Skyrim
Modular Level Design for SkyrimModular Level Design for Skyrim
Modular Level Design for Skyrim
Joel Burgess
 
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
DataStax
 
Past, Present and Future Challenges of Global Illumination in Games
Past, Present and Future Challenges of Global Illumination in GamesPast, Present and Future Challenges of Global Illumination in Games
Past, Present and Future Challenges of Global Illumination in Games
Colin Barré-Brisebois
 
SPU Shaders
SPU ShadersSPU Shaders
SPU Shaders
Slide_N
 
Physically Based and Unified Volumetric Rendering in Frostbite
Physically Based and Unified Volumetric Rendering in FrostbitePhysically Based and Unified Volumetric Rendering in Frostbite
Physically Based and Unified Volumetric Rendering in Frostbite
Electronic Arts / DICE
 
Frostbite on Mobile
Frostbite on MobileFrostbite on Mobile
Frostbite on Mobile
Electronic Arts / DICE
 
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Johan Andersson
 
Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se...
 Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se... Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se...
Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se...
Sudeep Das, Ph.D.
 
Voxel based global-illumination
Voxel based global-illuminationVoxel based global-illumination
Voxel based global-illumination
SeyedMorteza Mostajabodaveh
 
Practical Occlusion Culling in Killzone 3
Practical Occlusion Culling in Killzone 3Practical Occlusion Culling in Killzone 3
Practical Occlusion Culling in Killzone 3
Guerrilla
 
Combining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observabilityCombining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observability
Elasticsearch
 
Use Variable Rate Shading (VRS) to Improve the User Experience in Real-Time G...
Use Variable Rate Shading (VRS) to Improve the User Experience in Real-Time G...Use Variable Rate Shading (VRS) to Improve the User Experience in Real-Time G...
Use Variable Rate Shading (VRS) to Improve the User Experience in Real-Time G...
Intel® Software
 
Container World 2018
Container World 2018Container World 2018
Container World 2018
aspyker
 
Killzone Shadow Fall Demo Postmortem
Killzone Shadow Fall Demo PostmortemKillzone Shadow Fall Demo Postmortem
Killzone Shadow Fall Demo Postmortem
Guerrilla
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
Yves Raimond
 
Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014
Philip Fisher-Ogden
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks
EDB
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
HostedbyConfluent
 

What's hot (20)

IBM Cloud Object Storage System (powered by Cleversafe) and its Applications
IBM Cloud Object Storage System (powered by Cleversafe) and its ApplicationsIBM Cloud Object Storage System (powered by Cleversafe) and its Applications
IBM Cloud Object Storage System (powered by Cleversafe) and its Applications
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integration
 
Modular Level Design for Skyrim
Modular Level Design for SkyrimModular Level Design for Skyrim
Modular Level Design for Skyrim
 
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Architectures for High Availability - QConSF

  • 1. Architectural Patterns for High Anxiety Availability. November 2012. Adrian Cockcroft, @adrianco #netflixcloud #qconsf http://www.linkedin.com/in/adriancockcroft
  • 2. The Netflix Streaming Service: Now in USA, Canada, Latin America, UK, Ireland, Sweden, Denmark, Norway and Finland
  • 3. US Non-Member Web Site: Advertising and Marketing Driven
  • 4. Member Web Site: Personalization Driven
  • 6. Content Delivery Service: Distributed storage nodes controlled by Netflix cloud services
  • 8. Abstract
      • Netflix on Cloud – What, Why and When
      • Globally Distributed Architecture
      • Benchmarks and Scalability
      • Open Source Components
      • High Anxiety
  • 9. Blah Blah Blah (I’m skipping all the cloud intro etc., did that last year… Netflix runs in the cloud; if you hadn’t figured that out already you aren’t paying attention and should go read InfoQ and slideshare.net/netflix)
  • 10. Things we don’t do
  • 11. Things We Do Do… (with year in production at Netflix)
      • Big Data/Hadoop: 2009
      • AWS Cloud: 2009
      • Application Performance Management: 2010
      • Integrated DevOps Practices: 2010
      • Continuous Integration/Delivery: 2010
      • NoSQL, Globally Distributed: 2010
      • Platform as a Service; Micro-Services: 2010
      • Social coding, open development/github: 2011
  • 12. How Netflix Works [diagram labels: Consumer Electronics; AWS Cloud Services; CDN Edge Locations; Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding]
  • 13. Web Server Dependencies Flow (Home page business transaction as seen by AppDynamics). Each icon is three to a few hundred instances across three AWS zones. [diagram labels: Start Here; Web service; memcached; Cassandra; S3 bucket; Personalization movie group chooser]
  • 14. Component Micro-Services: Test with Chaos Monkey, Latency Monkey
  • 15. Three Balanced Availability Zones: Test with Chaos Gorilla [diagram labels: Load Balancers; Zone A, Zone B, Zone C, each with Cassandra and Evcache Replicas]
  • 16. Triple Replicated Persistence: Cassandra maintenance affects individual replicas [diagram labels: Load Balancers; Zone A, Zone B, Zone C, each with Cassandra and Evcache Replicas]
  • 17. Isolated Regions [diagram labels: US-East Load Balancers; EU-West Load Balancers; in each region Zone A, Zone B, Zone C, each with Cassandra Replicas]
  • 18. Failure Modes and Effects (Failure Mode / Probability / Mitigation Plan)
      • Application Failure / High / Automatic degraded response
      • AWS Region Failure / Low / Wait for region to recover
      • AWS Zone Failure / Medium / Continue to run on 2 out of 3 zones
      • Datacenter Failure / Medium / Migrate more functions to cloud
      • Data store failure / Low / Restore from S3 backups
      • S3 failure / Low / Restore from remote archive
  • 19. Zone Failure Modes
      • Power Outage
          – Instances lost, ephemeral state lost
          – Clean break and recovery, fail fast, “no route to host”
      • Network Outage
          – Instances isolated, state inconsistent
          – More complex symptoms, recovery issues, transients
      • Dependent Service Outage
          – Cascading failures, misbehaving instances, human errors
          – Confusing symptoms, recovery issues, byzantine effects
      More detail on this topic at AWS Re:Invent later this month…
  • 20. Cassandra backed Micro-Services: A highly scalable, available and durable deployment pattern
  • 21. Micro-Service Pattern [diagram labels: Many Different Single-Function REST Clients; Stateless Data Access REST Service (Astyanax Cassandra Client); Single function Cassandra Cluster Managed by Priam, Between 6 and 72 nodes; One keyspace, replaces a single table or materialized view; Optional Datacenter Update Flow]. Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones. AppDynamics Service Flow Visualization.
  • 22. Stateless Micro-Service Architecture
      • Linux Base AMI (CentOS or Ubuntu)
      • Optional Apache frontend, memcached, non-java apps
      • Java (JDK 6 or 7): AppDynamics appagent monitoring; GC and thread dump logging
      • Tomcat: Application war file, base servlet, platform, client interface jars, Astyanax; Healthcheck, status servlets, JMX interface, Servo autoscale
      • Monitoring: Log rotation to S3; AppDynamics machineagent; Epic/Atlas
  • 23. Astyanax: Available at http://github.com/netflix
      • Features
          – Complete abstraction of connection pool from RPC protocol
          – Fluent Style API
          – Operation retry with backoff
          – Token aware
      • Recipes
          – Distributed row lock (without zookeeper)
          – Multi-DC row lock
          – Uniqueness constraint
          – Multi-row uniqueness constraint
          – Chunked and multi-threaded large file storage
  • 24. Astyanax Query Example: Paginate through all columns in a row

        ColumnList<String> columns;
        int pageSize = 10;
        try {
            RowQuery<String, String> query = keyspace
                .prepareQuery(CF_STANDARD1)
                .getKey("A")
                .setIsPaginating()
                .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());

            // Each execute() returns the next page; an empty page ends the loop.
            while (!(columns = query.execute().getResult()).isEmpty()) {
                for (Column<String> c : columns) {
                    // process column c
                }
            }
        } catch (ConnectionException e) {
            // handle or log the failure
        }

  • 25. Astyanax - Cassandra Write Data Flows: Single Region, Multiple Availability Zone, Token Aware
      1. Client writes to local coordinator
      2. Coordinator writes to other zones
      3. Nodes return ack
      4. Data written to internal commit log disks (no more than 10 seconds later)
      • If a node goes offline, hinted handoff completes the write when the node comes back up.
      • Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
      • SSTable disk writes and compactions occur asynchronously.
      [diagram labels: Token Aware Clients; Cassandra nodes with Disks in Zone A, Zone B and Zone C; arrows numbered 1 to 4 matching the steps]
  • 26. Data Flows for Multi-Region Writes: Token Aware, Consistency Level = Local Quorum
      1. Client writes to local replicas
      2. Local write acks returned to Client, which continues when 2 of 3 local nodes are committed
      3. Local coordinator writes to remote coordinator
      4. When data arrives, remote coordinator node acks and copies to other remote zones
      5. Remote nodes ack to local coordinator
      6. Data flushed to internal commit log disks (no more than 10 seconds later)
      • If a node or region goes offline, hinted handoff completes the write when the node comes back up.
      • Nightly global compare and repair jobs ensure everything stays consistent.
      [diagram labels: US Clients; EU Clients; Cassandra nodes with Disks in Zone A, Zone B and Zone C of each region; 100+ms latency; arrows numbered 1 to 6 matching the steps]
  • 27. Cassandra Instance Architecture
      • Linux Base AMI (CentOS or Ubuntu)
      • Tomcat and Priam on JDK: Healthcheck, Status
      • Java (JDK 7): AppDynamics appagent monitoring; GC and thread dump logging
      • Cassandra Server
      • Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk holding Commit log and SSTables
      • Monitoring: AppDynamics machineagent; Epic/Atlas
  • 28. Priam – Cassandra Automation: Available at http://github.com/netflix
      • Netflix Platform Tomcat Code
      • Zero touch auto-configuration
      • State management for Cassandra JVM
      • Token allocation and assignment
      • Broken node auto-replacement
      • Full and incremental backup to S3
      • Restore sequencing from S3
      • Grow/Shrink Cassandra “ring”
  • 29. Cassandra Backup
      • Full Backup
          – Time based snapshot
          – SSTable compress -> S3
      • Incremental
          – SSTable write triggers compressed copy to S3
      • Archive
          – Copy cross region
      [diagram labels: ring of Cassandra nodes; S3 Backup]
  • 30. Deployment at Netflix
      • Over 50 Cassandra Clusters
      • Over 500 m2.4xlarge + hi1.4xlarge instances
      • Over 30TB of daily backups
      • Biggest cluster 72 nodes
      • 1 cluster over 250K writes/s
  • 31. Cassandra Explorer for Data: Open source on github soon
  • 32. ETL for Cassandra
      • Data is de-normalized over many clusters!
      • Too many to restore from backups for ETL
      • Solution – read backup files using Hadoop
      • Aegisthus
          – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
          – High throughput raw SSTable processing
          – Re-normalizes many clusters to a consistent view
          – Extract, Transform, then Load into Teradata
  • 34. Cloud Deployment Scalability
      • New Autoscaled AMI – zero to 500 instances from 21:38:52 to 21:46:32, 7m40s
      • Scaled up and down over a few days, total 2176 instance launches, m2.2xlarge (4 core 34GB)
      • Summary statistics shown on the slide: Min. 41.0; 1st Qu. 104.2; Median 149.0; Mean 171.8; 3rd Qu. 215.8; Max. 562.0
  • 35. Scalability from 48 to 288 nodes on AWS: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
      • Chart: Client Writes/s by node count – Replication Factor = 3 (y axis 0 to 1200000 writes/s, x axis 0 to 350 nodes)
      • Data labels, in increasing node count: 174373, 366828, 537172, 1099837
      • Used 288 of m1.xlarge: 4 CPU, 15 GB RAM, 8 ECU
      • Cassandra 0.8.6
      • Benchmark config only existed for about 1hr
  • 36. “Some people skate to the puck, I skate to where the puck is going to be” (Wayne Gretzky)
  • 37. Cassandra on AWS (The Past / The Future)
      • Instance: m2.4xlarge / hi1.4xlarge
      • Storage: 2 drives, 1.7TB / 2 SSD volumes, 2TB
      • CPU: 8 Cores, 26 ECU / 8 HT cores, 35 ECU
      • RAM: 68GB / 64GB
      • Network: 1Gbit / 10Gbit
      • IOPS: ~500 / ~100,000
      • Throughput: ~100Mbyte/s / ~1Gbyte/s
      • Cost: $1.80/hr / $3.10/hr
  • 38. Cassandra Disk vs. SSD Benchmark: Same Throughput, Lower Latency, Half Cost
  • 39. Netflix Open Source Strategy
      • Release PaaS Components git-by-git
          – Source at github.com/netflix – we build from it…
          – Intros and techniques at techblog.netflix.com
          – Blog post or new code every few weeks
      • Motivations
          – Give back to Apache licensed OSS community
          – Motivate, retain, hire top engineers
          – “Peer pressure” code cleanup, external contributions
  • 40. Instance creation [diagram labels, layout lost in extraction: Bakery & Build tools; Asgard; Odin; Base AMI; Instance; Application Code; Autoscaling scripts; Image baked; ASG / Instance started; Instance Running]
  • 41. Application Launch [diagram labels, layout lost in extraction: Governator (Guice); Eureka; Edda; Archaius; Servo; Async logging; Application initializing; Service Registry, configuration history]
  • 42. Runtime [diagram labels, grouped by column]
      • Client Side Components: Astyanax; Curator; NIWS LB; REST client
      • Server Side Components: Priam; Exhibitor; Cass JMeter; Explorers
      • Resiliency aids: Chaos Monkey; Latency Monkey; Janitor Monkey; Dependency Command
  • 43. Open Source Projects
      • Legend (the per-project colour coding is lost in this transcript): Github / Techblog; Apache Contributions; Techblog Post; Coming Soon
      • Priam: Cassandra as a Service
      • Exhibitor: Zookeeper as a Service
      • Servo and Autoscaling Scripts
      • Astyanax: Cassandra client for Java
      • Curator: Zookeeper Patterns
      • Honu: Log4j streaming to Hadoop
      • CassJMeter: Cassandra test suite
      • EVCache: Memcached as a Service
      • Circuit Breaker: Robust service pattern
      • Cassandra Multi-region EC2 datastore support
      • Eureka / Discovery: Service Directory
      • Asgard: AutoScaleGroup based AWS console
      • Aegisthus: Hadoop ETL for Cassandra
      • Archaius: Dynamic Properties Service
      • Chaos Monkey: Robustness verification
      • Edda: Queryable config history
      • Explorers
      • Latency Monkey: Server-side latency/error injection
      • Governator: Library lifecycle and dependency injection
      • Janitor Monkey
      • Odin: Workflow orchestration
      • REST Client + mid-tier LB
      • Bakeries and AMI
      • Async logging
      • Configuration REST endpoints
      • Build dynaslaves
  • 44. Cassandra Next Steps
      • Migrate Production Cassandra to SSD
          – Many clusters done
          – 100+ SSD nodes running
      • Autoscale Cassandra using Priam
          – Cassandra 1.2 Vnodes make this easier
          – Shrink Cassandra cluster every night
      • Automated Zone and Region Operations
          – Add/Remove Zone, split or merge clusters
          – Add/Remove Region, split or merge clusters
  • 46. Skynet: A Netflix Hackday project that might just terminate the world… (hack currently only implemented in PowerPoint, luckily)
  • 47. The Plot (kinda)
      • Skynet is a sentient computer
      • Skynet defends itself if you try to turn it off
      • Connor is the guy who eventually turns it off
      • Terminator is the robot sent to kill Connor
  • 48. The Hacktors
      • Cass_skynet is a self-managing Cassandra cluster
      • Connor_monkey kills cass_skynet nodes
      • Terminator_monkey kills connor_monkey nodes
  • 49. The Hacktion
      • Cass_skynet stores a history of its world and action scripts that trigger from what it sees
      • Action response to losing a node
          – Auto-replace node and grow cluster size
      • Action response to losing more nodes
          – Replicate cluster into a new zone or region
      • Action response to seeing a Connor_monkey
          – Start up a Terminator_monkey
  • 50. Implementation
      • Priam
          – Autoreplace missing nodes
          – Grow cass_skynet cluster in zone, to new zones or regions
      • Cassandra Keyspaces
          – Actions – scripts to be run
          – Memory – record event log of everything seen
      • Cron job once a minute
          – Extract actions from Cassandra and execute
          – Log actions and results in memory
      • Chaos Monkey configuration
          – Terminator_monkey: pick a zone, kill any connor_monkey
          – Connor_monkey: kill any cass_skynet or terminator_monkey
  • 53. Takeaway: Netflix has built and deployed a scalable global platform based on Cassandra and AWS. Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS. SSDs in the cloud are awesome… http://github.com/Netflix http://techblog.netflix.com http://slideshare.net/Netflix http://www.linkedin.com/in/adriancockcroft @adrianco http://perfcap.blogspot.com
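
Slides 25 and 26 say a write can wait for one node, a quorum, or all nodes, and that with Local Quorum the client continues once 2 of 3 local nodes have committed. A minimal sketch of that arithmetic follows, assuming the usual majority rule (half the replicas, rounded down, plus one). The class `WriteAcks` and its methods are hypothetical names made up for illustration; they are not part of Astyanax or Cassandra.

```java
// Hypothetical helper, not part of Astyanax or Cassandra: it only illustrates
// how many replica acks a write waits for at each consistency choice.
final class WriteAcks {
    enum Level { ONE, QUORUM, ALL }

    /** Number of replica acknowledgements a write must collect before the client continues. */
    static int required(Level level, int replicationFactor) {
        if (replicationFactor < 1) {
            throw new IllegalArgumentException("replicationFactor must be >= 1");
        }
        switch (level) {
            case ONE:    return 1;
            case QUORUM: return replicationFactor / 2 + 1; // integer division gives a strict majority
            case ALL:    return replicationFactor;
            default:     throw new AssertionError(level);
        }
    }

    /** Replicas that may be unavailable while the write still succeeds. */
    static int tolerated(Level level, int replicationFactor) {
        return replicationFactor - required(level, replicationFactor);
    }

    public static void main(String[] args) {
        int rf = 3; // triple replication, as on slide 16
        for (Level level : Level.values()) {
            System.out.println(level + ": wait for " + required(level, rf)
                    + " of " + rf + " acks, tolerates " + tolerated(level, rf) + " replica(s) down");
        }
    }
}
```

With three replicas, one per zone, a quorum write needs two acks and so survives the loss of one zone, which lines up with the "continue to run on 2 out of 3 zones" mitigation on slide 18.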

Editor's Notes

  1. Complete connection pool abstraction: Queries and mutations are wrapped in objects created by the Keyspace implementation, making it possible to retry failed operations. This differs from other connection pool implementations, in which the operation is created on a specific connection and must be completely redone if it fails. Simplified serialization via method overloading: The low-level Thrift library only understands data that is serialized to a byte array. Hector requires serializers to be specified for nearly every call. Astyanax minimizes the places where serializers are specified by using predefined ColumnFamily and ColumnPath definitions which specify the serializers. The API also overloads set and get operations for common data types. The internal library does not log anything. All internal events are instead ... calls to a ConnectionPoolMonitor interface. This allows customization of log levels and filtering of repeating events outside of the scope of the connection pool. Super columns will soon be replaced by Composite column names. As such it is recommended to not use super columns at all and to use Composite column names instead. There is some support for super columns in Astyanax but those methods have been deprecated and will eventually be removed.
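
The note above says that wrapping an operation in an object is what makes retry possible, and slide 23 lists "operation retry with backoff" as a feature. A minimal sketch of that idea follows, assuming a doubling delay with a cap. `RetrySketch`, `backoffMillis` and `execute` are hypothetical names for illustration only; this is not Astyanax's retry API, whose actual policy classes are not shown in this deck.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: this is NOT Astyanax's retry API. It only shows why an
// operation held as an object can be re-executed after a failure.
final class RetrySketch {

    /** Delay before retry number `retry` (0-based): base, 2x base, 4x base, ... capped at capMillis. */
    static long backoffMillis(long baseMillis, long capMillis, int retry) {
        long delay = baseMillis;
        for (int i = 0; i < retry && delay < capMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, capMillis);
    }

    /** Runs the operation until it succeeds or maxAttempts is used up, then rethrows the last failure. */
    static <T> T execute(Callable<T> operation, int maxAttempts, long baseMillis, long capMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return operation.call(); // the same object is simply called again
            } catch (Exception e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(backoffMillis(baseMillis, capMillis, attempt));
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated operation that fails twice before succeeding.
        String result = execute(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated timeout");
            }
            return "ok";
        }, 5, 1, 10);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Because the caller hands over a `Callable` rather than a half-finished call bound to one connection, the retry loop needs no knowledge of what the operation does; a real client would also choose a different connection or host for each attempt, which this sketch leaves out.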