Maximizing Scalability, Resiliency, and
Engineering Velocity in the Cloud
Coburn Watson
Manager, Cloud Performance, Netflix
Surge '13
Netflix, Inc.
• World's leading internet television network
• ~ 38 Million subscribers in 40+ countries
• Over a billion hours streamed per month
• Approximately 33% of all US Internet traffic at night
• Recent Notables
• Increased Originals catalog
• Large open source contribution
• OpenConnect (homegrown CDN)
2
About Me
• Manage Cloud Performance Engineering Team
• Sub-team of Cloud Solutions Organization
• Focus on performance since 2000
• Large-scale billing applications, eCommerce, datacenter mgmt., etc.
• Genentech, McKesson, Amdocs, Mercury Int., HP, etc.
• Passion for tackling performance at cloud-scale
• Looking for great performance engineers
• cwatson@netflix.com
3
Freedom and Responsibility
• Culture deck..a great read
• Good performers: 2x, Top performers: 10x
• What engineers dislike
• cumbersome processes
• deployment inefficiency
• restricted access
• restricted technical freedom
• lack of trust
• If removed…maximize:
• Engineering velocity
• Engineer satisfaction
4
Maximizing: Engineering
Velocity
5
How
• Implementation freedom
• SCM, libraries, language
• that said..platform benefits exist
• Deployment freedom
• Service team owns
• push schedule, functionality, performance
• operational activities (being paged)
• On-demand cloud capacity
• Thousands of instances at the push of a button
6
Rapid Deployment?
Impossible..
3-6 Months?
7
Rapid (Cloud) Deployment
3-5 Minutes
8
BaseAMI
• Supply the foundation
• Monitoring, java, apache, tomcat, etc.
• Open source project: Aminator
9
Pushing Code: Red-Black
• Gracefully roll code in, or out, of production
• Asgard is our AWS configuration mgmt. tool
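A minimal sketch of the red-black idea, assuming two server groups and a health check supplied by the caller. The names ServerGroup, red_black_push, and rollback are hypothetical and are not Asgard's API; the point is only that the old group stays up, idle, so code can be rolled back out quickly.

```python
from dataclasses import dataclass


@dataclass
class ServerGroup:
    name: str
    instances: int
    serving_traffic: bool = False


def red_black_push(old, new, healthy):
    """Move traffic to `new` if it passes the health check, else keep `old`."""
    # The new group is assumed to be launched alongside the old one already.
    if not healthy(new):
        new.serving_traffic = False
        old.serving_traffic = True
        return old
    new.serving_traffic = True
    old.serving_traffic = False  # old group is kept running for fast rollback
    return new


def rollback(old, new):
    """Roll the new code out of production by re-enabling the old group."""
    old.serving_traffic = True
    new.serving_traffic = False
    return old


old = ServerGroup("api-v041", instances=100, serving_traffic=True)
new = ServerGroup("api-v042", instances=100)
active = red_black_push(old, new, healthy=lambda group: group.instances > 0)
print(active.name, old.serving_traffic, new.serving_traffic)
```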
10
Not all Roses
Compounded risks with increased velocity
Risks: Decreased Reliability, Performance, and Scalability
11
Goal: CI (Continuous
Improvement)
12
Maximizing: Reliability
13
Fear (Revere) the Monkeys
• Simulate
• Latency
• Errors
• Initiate
• Instance Termination
• Availability Zone Failure
• Identify
• Configuration Drift
… in Test and Production
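A minimal sketch of the instance-termination idea, assuming a plain mapping of group name to instance IDs. The function pick_victims, the probability parameter, and the instance IDs are hypothetical; the code only selects candidates and prints them, it does not call any cloud API.

```python
import random


def pick_victims(groups, probability, rng):
    """Pick at most one random instance per group, with the given probability."""
    victims = []
    for name, instances in sorted(groups.items()):
        # Skip empty groups; otherwise roll the dice for this group.
        if instances and rng.random() < probability:
            victims.append((name, rng.choice(instances)))
    return victims


groups = {"api": ["i-a1", "i-a2", "i-a3"], "billing": ["i-b1"], "empty": []}
victims = pick_victims(groups, probability=1.0, rng=random.Random(7))
for group, instance in victims:
    print(f"would terminate {instance} in {group}")
```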
14
Tracking Change: Chronos
• Aggregate Significant Events *
• Current Sources:
• Pushes (Asgard)
• Production Change Requests (JIRA)
• AWS Notifications
• Dynamic Property Changes
• ASG Scaling Events
• Implementation
• Simple REST-service; customized adapters
* - “can disrupt production service”
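A minimal sketch of the adapter approach, assuming each source delivers a small dictionary. The field names, adapter functions, and EventLog class are hypothetical illustrations and not Chronos's actual interface; the REST layer is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    timestamp: int  # epoch seconds
    source: str
    summary: str


def asgard_adapter(raw):
    """Normalize a (hypothetical) push record into a common event."""
    return ChangeEvent(raw["time"], "asgard", f"push {raw['app']} {raw['version']}")


def jira_adapter(raw):
    """Normalize a (hypothetical) change-request record into a common event."""
    return ChangeEvent(raw["created"], "jira", f"change request {raw['key']}")


class EventLog:
    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def between(self, start, end):
        """Return events in [start, end], oldest first."""
        hits = [e for e in self._events if start <= e.timestamp <= end]
        return sorted(hits, key=lambda e: e.timestamp)


log = EventLog()
log.record(asgard_adapter({"time": 1000, "app": "api", "version": "v042"}))
log.record(jira_adapter({"created": 400, "key": "PCR-1"}))
log.record(ChangeEvent(5000, "aws", "scheduled instance retirement"))
window = log.between(0, 2000)
for event in window:
    print(event.timestamp, event.source, event.summary)
```

Querying a time window around an incident then shows which significant changes preceded it, regardless of which system produced them.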
15
Chronos, cont.
16
Automated Canary Analysis
• Identify regression between new and existing code
• Point ACA to baseline (prod) and canary ASG
• Typically analyze an hour's worth of time series data
• Compare ratio of averages between canary and baseline
• Evaluate range and noise; determine quality of signal
• Bucket: Hot, Cold, Noisy, or OK
• Multiple classifiers available
• Multiple metric collections (e.g. hand-picked by service, general)
• Rollup
• Constrained: along metric dimensions
• Final: Score the canary
• Implementation: R-based analysis
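A minimal sketch of the ratio-of-averages classification, written in Python for illustration even though the slide describes an R-based implementation. The tolerance, the noise threshold (coefficient of variation), the sample metrics, and the scoring rule (percentage of metrics classified OK) are all assumptions, not the values or classifiers ACA uses.

```python
from statistics import mean, pstdev


def coefficient_of_variation(series):
    """Relative noise of a series; infinite when the average is zero."""
    avg = mean(series)
    return float("inf") if avg == 0 else pstdev(series) / abs(avg)


def classify(canary, baseline, tolerance=0.2, max_cv=0.5):
    """Bucket one metric as HOT, COLD, NOISY, or OK."""
    noise = max(coefficient_of_variation(canary), coefficient_of_variation(baseline))
    if noise > max_cv:
        return "NOISY"  # signal quality too poor to judge
    ratio = mean(canary) / mean(baseline)
    if ratio > 1 + tolerance:
        return "HOT"
    if ratio < 1 - tolerance:
        return "COLD"
    return "OK"


def score(metrics):
    """Classify every metric, then roll up to a 0-100 canary score."""
    labels = {name: classify(c, b) for name, (c, b) in metrics.items()}
    ok = sum(1 for label in labels.values() if label == "OK")
    return labels, 100.0 * ok / len(labels)


# Each entry is (canary samples, baseline samples); values are made up.
metrics = {
    "latency_ms": ([150, 155, 160, 150], [100, 102, 98, 100]),
    "cpu_pct": ([40, 41, 39, 40], [40, 42, 38, 40]),
    "errors": ([0, 9, 0, 1], [1, 0, 8, 0]),
    "rps": ([60, 62, 58, 60], [100, 100, 100, 100]),
}
labels, canary_score = score(metrics)
print(labels, canary_score)
```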
17
ACA: in Action
[Figure: per-metric classifications (Hot, Cold, Noisy, OK) with a constrained rollup (dashed) and a final rollup]
18
Hystrix: Defend Your App
● Protection from downstream service failures
● Functional (unavailable) or performance in nature
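A minimal sketch of the circuit-breaker pattern behind this kind of protection. Hystrix itself is a Java library; the Python class below does not use or mirror its API, and the CircuitBreaker name, thresholds, and fallback value are assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, command, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: fail fast, skip the downstream call
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = command()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success closes the circuit
        return result


def flaky():
    raise TimeoutError("downstream timed out")


breaker = CircuitBreaker(failure_threshold=2, reset_after=60.0)
results = [breaker.call(flaky, lambda: "cached default") for _ in range(3)]
print(results, breaker.opened_at is not None)
```

The third call never reaches the failing dependency: once the circuit is open, the caller gets the fallback immediately, which covers both the unavailable case and the slow case.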
19
Maximizing: Scalability and
Performance
20
Dynamic Scaling
EC2 footprint autoscales 2500-3500 instances per day
• order of tens of thousands of EC2 instances
• Larger ASG spans 200-900 m2.4xlarge daily
Why:
• Improved scalability during unexpected workloads
• Absorb variance in service performance profile
• Reactive chain of dependencies
• Creates "reserved instance troughs" for batch activity
21
Dynamic Scaling, cont.
Example covers 3 services
• 2 edge (A,B), 1 mid-tier (C)
• C has more upstream services than simply A and B
Multiple Autoscaling Policies
• (A) System Load Average
• (B,C) Request-Rate based
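A minimal sketch of a request-rate based sizing rule, assuming a simple proportional target rather than the alarm-driven policies an actual ASG would use. The function name, the per-instance target of 100 requests per second, and the use of 200 and 900 as group bounds are assumptions for illustration.

```python
import math


def desired_capacity(total_rps, target_rps_per_instance, min_size, max_size):
    """Instances needed to keep each one at or below the target request rate."""
    needed = math.ceil(total_rps / target_rps_per_instance)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, needed))


for rps in (5_000, 45_000, 120_000):
    print(rps, desired_capacity(rps, target_rps_per_instance=100,
                                min_size=200, max_size=900))
```

A load-average policy such as the one on service A would follow the same shape, with a CPU-derived signal in place of the request rate.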
22
Dynamic Scaling, cont.
23
Dynamic Scaling, cont.
• Response time variability greatest during scaling events
• Average response time primarily between 75-150 msec
24
Dynamic Scaling, cont.
• Instance counts 3x, Aggregate requests 4.5x (not shown)
• Average CPU utilization per instance: ~25-55%
25
Cassandra Performance
Study performed:
• 24 node C* SSD-based cluster (hi1.4xlarge)
• mid-tier service load application
• Targeting 2x production rates
• Increase read ops from 30k to 70k in ~ 3 minutes
• Increase write ops 750 to 1500 in ~ 3 minutes
Results:
• 95th pctl response time increase: ~ 17 msec to 45 msec
• 99th pctl response time increase: ~ 35 msec to 80 msec
26
EVcache (memcached) Scalability
Response times consistent during 4x increase in load *
* Due to upstream code change
27
Cloud-scale Load Testing
• Ad-Hoc or CI-based load test model
• (CI) Run-over-run comparison; email on rule violation
1. Jenkins initiates job
2. JMeter instances apply load
3. Results written to S3
4. Instance metrics published to Atlas
5. Raw data fetched and processed
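A minimal sketch of the run-over-run comparison in the CI model, assuming results arrive as a metric-to-value mapping and that a higher value is always worse. The violations function, the 10% regression threshold, and the sample numbers are assumptions; the email step is left out.

```python
def violations(previous, current, max_regression=0.10):
    """Return metrics that grew by more than max_regression since the last run."""
    found = {}
    for metric, old in previous.items():
        new = current.get(metric)
        if new is None or old <= 0:
            continue  # nothing to compare against
        change = (new - old) / old
        if change > max_regression:
            found[metric] = round(change, 3)
    return found


previous = {"p95_ms": 40.0, "p99_ms": 80.0, "error_rate": 0.010}
current = {"p95_ms": 42.0, "p99_ms": 100.0, "error_rate": 0.010}
report = violations(previous, current)
if report:
    print("rule violation:", report)
```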
28
Conclusions
• Continually accelerate engineering velocity
• Evolve architecture and processes to mitigate risks
• Stateless micro-service architectures win!
• Remove barriers for engineers
• Last option should be to reduce rate of change
• Exercise failure and “thundering herd” scenarios
• Cloud native scaling and resiliency are key factors
• Leverage pre-existing OSS PaaS when possible
29
Netflix Open Source
Our Open Source Software simplifies mgmt at scale
Great projects, stunning colleagues:
jobs.netflix.com
30
Q&A
• cwatson@netflix.com
• Netflix Tech Blog: http://techblog.netflix.com
31

More Related Content

What's hot

Structure Data 2014: BIG DATA ANALYTICS RE-INVENTED, Ryan Waite
Structure Data 2014: BIG DATA ANALYTICS RE-INVENTED, Ryan WaiteStructure Data 2014: BIG DATA ANALYTICS RE-INVENTED, Ryan Waite
Structure Data 2014: BIG DATA ANALYTICS RE-INVENTED, Ryan Waite
Gigaom
 
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
HostedbyConfluent
 
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache KafkaKafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
confluent
 
INTRODUCING: CREATE PIPELINE
INTRODUCING: CREATE PIPELINEINTRODUCING: CREATE PIPELINE
INTRODUCING: CREATE PIPELINE
SingleStore
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Allen (Xiaozhong) Wang
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
Monal Daxini
 
Going from three nines to four nines using Kafka | Tejas Chopra, Netflix
Going from three nines to four nines using Kafka | Tejas Chopra, NetflixGoing from three nines to four nines using Kafka | Tejas Chopra, Netflix
Going from three nines to four nines using Kafka | Tejas Chopra, Netflix
HostedbyConfluent
 
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
confluent
 
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin KumarSiphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
confluent
 
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
Flink Forward
 
Flink Forward SF 2017: Bill Liu & Haohui Mai - AthenaX : Uber’s streaming pro...
Flink Forward SF 2017: Bill Liu & Haohui Mai - AthenaX : Uber’s streaming pro...Flink Forward SF 2017: Bill Liu & Haohui Mai - AthenaX : Uber’s streaming pro...
Flink Forward SF 2017: Bill Liu & Haohui Mai - AthenaX : Uber’s streaming pro...
Flink Forward
 
Netflix viewing data architecture evolution - EBJUG Nov 2014
Netflix viewing data architecture evolution - EBJUG Nov 2014Netflix viewing data architecture evolution - EBJUG Nov 2014
Netflix viewing data architecture evolution - EBJUG Nov 2014
Philip Fisher-Ogden
 
Business Continuity with Microservices-Based Apps and DevOps: Learnings from ...
Business Continuity with Microservices-Based Apps and DevOps: Learnings from ...Business Continuity with Microservices-Based Apps and DevOps: Learnings from ...
Business Continuity with Microservices-Based Apps and DevOps: Learnings from ...
DevOps.com
 
GCPLA Meetup Workshop - Migration from a Legacy Infrastructure to the Cloud
GCPLA Meetup Workshop - Migration from a Legacy Infrastructure to the CloudGCPLA Meetup Workshop - Migration from a Legacy Infrastructure to the Cloud
GCPLA Meetup Workshop - Migration from a Legacy Infrastructure to the Cloud
Samuel Chow
 
How to Enable Industrial Decarbonization with Node-RED and InfluxDB
How to Enable Industrial Decarbonization with Node-RED and InfluxDBHow to Enable Industrial Decarbonization with Node-RED and InfluxDB
How to Enable Industrial Decarbonization with Node-RED and InfluxDB
InfluxData
 
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
Beaming flink to the cloud @ netflix   ff 2016-monal-daxiniBeaming flink to the cloud @ netflix   ff 2016-monal-daxini
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
Monal Daxini
 
Deploying Confluent Platform for Production
Deploying Confluent Platform for ProductionDeploying Confluent Platform for Production
Deploying Confluent Platform for Production
confluent
 
Putting Kafka Together with the Best of Google Cloud Platform
Putting Kafka Together with the Best of Google Cloud Platform Putting Kafka Together with the Best of Google Cloud Platform
Putting Kafka Together with the Best of Google Cloud Platform
confluent
 
Session 03 data_migration_at_scale_by_sameer
Session 03 data_migration_at_scale_by_sameerSession 03 data_migration_at_scale_by_sameer
Session 03 data_migration_at_scale_by_sameer
Ashish Pandey
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
Monal Daxini
 

What's hot (20)

Structure Data 2014: BIG DATA ANALYTICS RE-INVENTED, Ryan Waite
Structure Data 2014: BIG DATA ANALYTICS RE-INVENTED, Ryan WaiteStructure Data 2014: BIG DATA ANALYTICS RE-INVENTED, Ryan Waite
Structure Data 2014: BIG DATA ANALYTICS RE-INVENTED, Ryan Waite
 
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
 
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache KafkaKafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
 
INTRODUCING: CREATE PIPELINE
INTRODUCING: CREATE PIPELINEINTRODUCING: CREATE PIPELINE
INTRODUCING: CREATE PIPELINE
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
 
Going from three nines to four nines using Kafka | Tejas Chopra, Netflix
Going from three nines to four nines using Kafka | Tejas Chopra, NetflixGoing from three nines to four nines using Kafka | Tejas Chopra, Netflix
Going from three nines to four nines using Kafka | Tejas Chopra, Netflix
 
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
 
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin KumarSiphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
 
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
 
Flink Forward SF 2017: Bill Liu & Haohui Mai - AthenaX : Uber’s streaming pro...
Flink Forward SF 2017: Bill Liu & Haohui Mai - AthenaX : Uber’s streaming pro...Flink Forward SF 2017: Bill Liu & Haohui Mai - AthenaX : Uber’s streaming pro...
Flink Forward SF 2017: Bill Liu & Haohui Mai - AthenaX : Uber’s streaming pro...
 
Netflix viewing data architecture evolution - EBJUG Nov 2014
Netflix viewing data architecture evolution - EBJUG Nov 2014Netflix viewing data architecture evolution - EBJUG Nov 2014
Netflix viewing data architecture evolution - EBJUG Nov 2014
 
Business Continuity with Microservices-Based Apps and DevOps: Learnings from ...
Business Continuity with Microservices-Based Apps and DevOps: Learnings from ...Business Continuity with Microservices-Based Apps and DevOps: Learnings from ...
Business Continuity with Microservices-Based Apps and DevOps: Learnings from ...
 
GCPLA Meetup Workshop - Migration from a Legacy Infrastructure to the Cloud
GCPLA Meetup Workshop - Migration from a Legacy Infrastructure to the CloudGCPLA Meetup Workshop - Migration from a Legacy Infrastructure to the Cloud
GCPLA Meetup Workshop - Migration from a Legacy Infrastructure to the Cloud
 
How to Enable Industrial Decarbonization with Node-RED and InfluxDB
How to Enable Industrial Decarbonization with Node-RED and InfluxDBHow to Enable Industrial Decarbonization with Node-RED and InfluxDB
How to Enable Industrial Decarbonization with Node-RED and InfluxDB
 
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
Beaming flink to the cloud @ netflix   ff 2016-monal-daxiniBeaming flink to the cloud @ netflix   ff 2016-monal-daxini
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
 
Deploying Confluent Platform for Production
Deploying Confluent Platform for ProductionDeploying Confluent Platform for Production
Deploying Confluent Platform for Production
 
Putting Kafka Together with the Best of Google Cloud Platform
Putting Kafka Together with the Best of Google Cloud Platform Putting Kafka Together with the Best of Google Cloud Platform
Putting Kafka Together with the Best of Google Cloud Platform
 
Session 03 data_migration_at_scale_by_sameer
Session 03 data_migration_at_scale_by_sameerSession 03 data_migration_at_scale_by_sameer
Session 03 data_migration_at_scale_by_sameer
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
 

Viewers also liked

2012 re:Invent Netflix: embracing the cloud final
2012 re:Invent Netflix: embracing the cloud final2012 re:Invent Netflix: embracing the cloud final
2012 re:Invent Netflix: embracing the cloud finalYuryIzrailevsky
 
NetflixOSS Meetup season 3 episode 2
NetflixOSS Meetup season 3 episode 2NetflixOSS Meetup season 3 episode 2
NetflixOSS Meetup season 3 episode 2
Ruslan Meshenberg
 
Engineering Velocity: Shifting the Curve at Netflix
Engineering Velocity: Shifting the Curve at NetflixEngineering Velocity: Shifting the Curve at Netflix
Engineering Velocity: Shifting the Curve at Netflix
Dianne Marsh
 
Think Like a Hacker
Think Like a HackerThink Like a Hacker
Think Like a Hacker
NormShield, Inc.
 
From Code to the Monkeys: Continuous Delivery at Netflix
From Code to the Monkeys: Continuous Delivery at NetflixFrom Code to the Monkeys: Continuous Delivery at Netflix
From Code to the Monkeys: Continuous Delivery at Netflix
Dianne Marsh
 
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing system
Danny Yuan
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPF
Alex Maestretti
 
Engineering Tools at Netflix: Enabling Continuous Delivery
Engineering Tools at Netflix: Enabling Continuous DeliveryEngineering Tools at Netflix: Enabling Continuous Delivery
Engineering Tools at Netflix: Enabling Continuous Delivery
Mike McGarr
 
OTT & The Future of Connected TV
OTT & The Future of Connected TVOTT & The Future of Connected TV
OTT & The Future of Connected TV
Clearbridge Mobile
 
Continuous Delivery at Netflix, and beyond
Continuous Delivery at Netflix, and beyondContinuous Delivery at Netflix, and beyond
Continuous Delivery at Netflix, and beyond
Mike McGarr
 
Implementing DevOps
Implementing DevOpsImplementing DevOps
Implementing DevOps
Mike McGarr
 
Splitting the Check on Compliance and Security
Splitting the Check on Compliance and SecuritySplitting the Check on Compliance and Security
Splitting the Check on Compliance and Security
Jason Chan
 
How Netflix thinks of DevOps. Spoiler: we don’t.
How Netflix thinks of DevOps. Spoiler: we don’t.How Netflix thinks of DevOps. Spoiler: we don’t.
How Netflix thinks of DevOps. Spoiler: we don’t.
Dianne Marsh
 
Critical Infrastructure Protection from Terrorist Attacks
Critical Infrastructure Protection from Terrorist AttacksCritical Infrastructure Protection from Terrorist Attacks
Critical Infrastructure Protection from Terrorist Attacks
BGA Cyber Security
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
Mao Geng
 
(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s Dilemma(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s Dilemma
Amazon Web Services
 
NormShield Cyber Threat & Vulnerability Orchestration Overview
NormShield Cyber Threat & Vulnerability Orchestration OverviewNormShield Cyber Threat & Vulnerability Orchestration Overview
NormShield Cyber Threat & Vulnerability Orchestration Overview
NormShield, Inc.
 

Viewers also liked (17)

2012 re:Invent Netflix: embracing the cloud final
2012 re:Invent Netflix: embracing the cloud final2012 re:Invent Netflix: embracing the cloud final
2012 re:Invent Netflix: embracing the cloud final
 
NetflixOSS Meetup season 3 episode 2
NetflixOSS Meetup season 3 episode 2NetflixOSS Meetup season 3 episode 2
NetflixOSS Meetup season 3 episode 2
 
Engineering Velocity: Shifting the Curve at Netflix
Engineering Velocity: Shifting the Curve at NetflixEngineering Velocity: Shifting the Curve at Netflix
Engineering Velocity: Shifting the Curve at Netflix
 
Think Like a Hacker
Think Like a HackerThink Like a Hacker
Think Like a Hacker
 
From Code to the Monkeys: Continuous Delivery at Netflix
From Code to the Monkeys: Continuous Delivery at NetflixFrom Code to the Monkeys: Continuous Delivery at Netflix
From Code to the Monkeys: Continuous Delivery at Netflix
 
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing system
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPF
 
Engineering Tools at Netflix: Enabling Continuous Delivery
Engineering Tools at Netflix: Enabling Continuous DeliveryEngineering Tools at Netflix: Enabling Continuous Delivery
Engineering Tools at Netflix: Enabling Continuous Delivery
 
OTT & The Future of Connected TV
OTT & The Future of Connected TVOTT & The Future of Connected TV
OTT & The Future of Connected TV
 
Continuous Delivery at Netflix, and beyond
Continuous Delivery at Netflix, and beyondContinuous Delivery at Netflix, and beyond
Continuous Delivery at Netflix, and beyond
 
Implementing DevOps
Implementing DevOpsImplementing DevOps
Implementing DevOps
 
Splitting the Check on Compliance and Security
Splitting the Check on Compliance and SecuritySplitting the Check on Compliance and Security
Splitting the Check on Compliance and Security
 
How Netflix thinks of DevOps. Spoiler: we don’t.
How Netflix thinks of DevOps. Spoiler: we don’t.How Netflix thinks of DevOps. Spoiler: we don’t.
How Netflix thinks of DevOps. Spoiler: we don’t.
 
Critical Infrastructure Protection from Terrorist Attacks
Critical Infrastructure Protection from Terrorist AttacksCritical Infrastructure Protection from Terrorist Attacks
Critical Infrastructure Protection from Terrorist Attacks
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s Dilemma(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s Dilemma
 
NormShield Cyber Threat & Vulnerability Orchestration Overview
NormShield Cyber Threat & Vulnerability Orchestration OverviewNormShield Cyber Threat & Vulnerability Orchestration Overview
NormShield Cyber Threat & Vulnerability Orchestration Overview
 

Similar to Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud

Benchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA ConstraintsBenchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA Constraints
Nicolas Poggi
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
NoSQLmatters
 
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
DataStax Academy
 
Release it! - Takeaways
Release it! - TakeawaysRelease it! - Takeaways
Release it! - Takeaways
Manuela Grindei
 
Effective Service Mesh to turbocharge Cloud Resiliency
Effective Service Mesh to turbocharge Cloud ResiliencyEffective Service Mesh to turbocharge Cloud Resiliency
Effective Service Mesh to turbocharge Cloud Resiliency
Liang Gang Yu
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & Storm
Otto Mok
 
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Lucidworks
 
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Paul Brebner
 
Tools. Techniques. Trouble?
Tools. Techniques. Trouble?Tools. Techniques. Trouble?
Tools. Techniques. Trouble?
Testplant
 
Log Monitoring and Anomaly Detection at Scale at ORNL
Log Monitoring and Anomaly Detection at Scale at ORNLLog Monitoring and Anomaly Detection at Scale at ORNL
Log Monitoring and Anomaly Detection at Scale at ORNL
Elasticsearch
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Helena Edelson
 
Architecture for Scale [AppFirst]
Architecture for Scale [AppFirst]Architecture for Scale [AppFirst]
Architecture for Scale [AppFirst]
AppFirst
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
Ted Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series Database
DataWorks Summit
 
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
Paul Brebner
 
Concurrency at Scale: Evolution to Micro-Services
Concurrency at Scale:  Evolution to Micro-ServicesConcurrency at Scale:  Evolution to Micro-Services
Concurrency at Scale: Evolution to Micro-Services
Randy Shoup
 
Play With Streams
Play With StreamsPlay With Streams
Play With Streams
Tianjian Chen
 
“Spikey Workloads” Emergency Management in the Cloud
“Spikey Workloads” Emergency Management in the Cloud“Spikey Workloads” Emergency Management in the Cloud
“Spikey Workloads” Emergency Management in the Cloud
Amazon Web Services
 

Similar to Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud (20)

Benchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA ConstraintsBenchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA Constraints
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
 
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
 
Release it! - Takeaways
Release it! - TakeawaysRelease it! - Takeaways
Release it! - Takeaways
 
Effective Service Mesh to turbocharge Cloud Resiliency
Effective Service Mesh to turbocharge Cloud ResiliencyEffective Service Mesh to turbocharge Cloud Resiliency
Effective Service Mesh to turbocharge Cloud Resiliency
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & Storm
 
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
 
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
 
Tools. Techniques. Trouble?
Tools. Techniques. Trouble?Tools. Techniques. Trouble?
Tools. Techniques. Trouble?
 
Log Monitoring and Anomaly Detection at Scale at ORNL
Log Monitoring and Anomaly Detection at Scale at ORNLLog Monitoring and Anomaly Detection at Scale at ORNL
Log Monitoring and Anomaly Detection at Scale at ORNL
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Architecture for Scale [AppFirst]
Architecture for Scale [AppFirst]Architecture for Scale [AppFirst]
Architecture for Scale [AppFirst]
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series Database
 
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
 
Concurrency at Scale: Evolution to Micro-Services
Concurrency at Scale:  Evolution to Micro-ServicesConcurrency at Scale:  Evolution to Micro-Services
Concurrency at Scale: Evolution to Micro-Services
 
CDN algos
CDN algosCDN algos
CDN algos
 
Play With Streams
Play With StreamsPlay With Streams
Play With Streams
 
“Spikey Workloads” Emergency Management in the Cloud
“Spikey Workloads” Emergency Management in the Cloud“Spikey Workloads” Emergency Management in the Cloud
“Spikey Workloads” Emergency Management in the Cloud
 

Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud

  • 1. Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud • Coburn Watson • Manager, Cloud Performance, Netflix • Surge '13
  • 2. Netflix, Inc. • World's leading internet television network • ~38 million subscribers in 40+ countries • Over a billion hours streamed per month • Approximately 33% of all US Internet traffic at night • Recent Notables • Increased Originals catalog • Large open source contribution • OpenConnect (homegrown CDN)
  • 3. About Me • Manage Cloud Performance Engineering Team • Sub-team of Cloud Solutions Organization • Focus on performance since 2000 • Large-scale billing applications, eCommerce, datacenter mgmt., etc. • Genentech, McKesson, Amdocs, Mercury Int., HP, etc. • Passion for tackling performance at cloud-scale • Looking for great performance engineers • cwatson@netflix.com
  • 4. Freedom and Responsibility • Culture deck...a great read • Good performers: 2x, Top performers: 10x • What engineers dislike • cumbersome processes • deployment inefficiency • restricted access • restricted technical freedom • lack of trust • If removed…maximize: • Engineering velocity • Engineer satisfaction
  • 6. How • Implementation freedom • SCM, libraries, language • that said...platform benefits exist • Deployment freedom • Service team owns • push schedule, functionality, performance • operational activities (being paged) • On-demand cloud capacity • Thousands of instances at the push of a button
  • 9. BaseAMI • Supply the foundation • Monitoring, java, apache, tomcat, etc. • Open source project: Aminator
  • 10. Pushing Code: Red-Black • Gracefully roll code in, or out, of production • Asgard is our AWS configuration mgmt. tool
  • 11. Not all Roses • Compounded risks with increased velocity • Risks: Decreased Reliability, Performance, and Scalability
  • 14. Fear (Revere) the Monkeys • Simulate • Latency • Errors • Initiate • Instance Termination • Availability Zone Failure • Identify • Configuration Drift • … in Test and Production
  • 15. Tracking Change: Chronos • Aggregate Significant Events* • Current Sources: • Pushes (Asgard) • Production Change Requests (JIRA) • AWS Notifications • Dynamic Property Changes • ASG Scaling Events • Implementation • Simple REST-service; customized adapters • * Significant: “can disrupt production service”
  • 17. Automated Canary Analysis • Identify regression between new and existing code • Point ACA to baseline (prod) and canary ASG • Typically analyze an hour's worth of time series data • Compare ratio of averages between canary and baseline • Evaluate range and noise; determine quality of signal • Bucket: Hot, Cold, Noisy, or OK • Multiple classifiers available • Multiple metric collections (e.g. hand-picked by service, general) • Rollup • Constrained: along metric dimensions • Final: Score the canary • Implementation: R-based analysis
  • 18. ACA: in Action • [Figure: per-metric classifications (HOT, COLD, NOISY, OK), with the constrained rollup (dashed) and the final rollup]
  • 19. Hystrix: Defend Your App • Protection from downstream service failures • Functional (unavailable) or performance in nature
  • 21. Dynamic Scaling • EC2 footprint autoscales 2500-3500 instances per day • order of tens of thousands of EC2 instances • Larger ASG spans 200-900 m2.4xlarge daily • Why: • Improved scalability during unexpected workloads • Absorb variance in service performance profile • Reactive chain of dependencies • Creates "reserved instance troughs" for batch activity
  • 22. Dynamic Scaling, cont. • Example covers 3 services • 2 edge (A, B), 1 mid-tier (C) • C has more upstream services than simply A and B • Multiple Autoscaling Policies • (A) System Load Average • (B, C) Request-Rate based
  • 24. Dynamic Scaling, cont. • Response time variability greatest during scaling events • Average response time primarily between 75-150 msec
  • 25. Dynamic Scaling, cont. • Instance counts 3x, aggregate requests 4.5x (not shown) • Average CPU utilization per instance: ~25-55%
  • 26. Cassandra Performance • Study performed: • 24 node C* SSD-based cluster (hi1.4xlarge) • mid-tier service load application • Targeting 2x production rates • Increase read ops from 30k to 70k in ~3 minutes • Increase write ops from 750 to 1500 in ~3 minutes • Results: • 95th pctl response time increase: ~17 msec to 45 msec • 99th pctl response time increase: ~35 msec to 80 msec
  • 27. EVCache (memcached) Scalability • Response times consistent during 4x increase in load* • * Due to upstream code change
  • 28. Cloud-scale Load Testing • Ad-hoc or CI-based load test model • (CI) Run-over-run comparison; email on rule violation • 1. Jenkins initiates job • 2. JMeter instances apply load • 3. Results written to S3 • 4. Instance metrics published to Atlas • 5. Raw data fetched and processed
  • 29. Conclusions • Continually accelerate engineering velocity • Evolve architecture and processes to mitigate risks • Stateless micro-service architectures win! • Remove barriers for engineers • Last option should be to reduce rate of change • Exercise failure and “thundering herd” scenarios • Cloud native scaling and resiliency are key factors • Leverage pre-existing OSS PaaS when possible
  • 30. Netflix Open Source • Our Open Source Software simplifies mgmt at scale • Great projects, stunning colleagues: jobs.netflix.com
  • 31. Q&A • cwatson@netflix.com • Netflix Tech Blog: http://techblog.netflix.com
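
The Automated Canary Analysis slide (17) describes comparing the ratio of averages between canary and baseline, checking range and noise, bucketing each metric as Hot, Cold, Noisy, or OK, and rolling the buckets up into a score. The sketch below illustrates that flow in Python under loud assumptions: the slides state the real analysis is R-based and do not give thresholds, so `tolerance`, `noise_limit`, the range-based noise check, and the "percent of OK metrics" score are all hypothetical choices made for illustration, not Netflix's classifier.

```python
from statistics import mean


def classify_metric(baseline, canary, tolerance=0.2, noise_limit=0.5):
    """Bucket one metric as 'hot', 'cold', 'noisy', or 'ok'.

    baseline, canary: lists of samples for the same metric.
    tolerance, noise_limit: hypothetical thresholds (not from the slides).
    """
    base_avg = mean(baseline)
    if base_avg == 0:
        # A ratio against a zero baseline carries no usable signal.
        return "noisy"

    # Crude signal-quality check: range of the baseline relative to its mean.
    spread = (max(baseline) - min(baseline)) / abs(base_avg)
    if spread > noise_limit:
        return "noisy"

    # Ratio of averages between canary and baseline.
    ratio = mean(canary) / base_avg
    if ratio > 1 + tolerance:
        return "hot"
    if ratio < 1 - tolerance:
        return "cold"
    return "ok"


def score_canary(buckets):
    """Final rollup (assumed form): percentage of metrics bucketed 'ok'."""
    if not buckets:
        return 0.0
    ok_count = sum(1 for bucket in buckets if bucket == "ok")
    return 100.0 * ok_count / len(buckets)


if __name__ == "__main__":
    metrics = {
        "latency_ms": ([100, 102, 98], [150, 155, 149]),
        "cpu_pct": ([40, 41, 39], [40, 42, 41]),
    }
    buckets = [classify_metric(b, c) for b, c in metrics.values()]
    print(buckets, score_canary(buckets))
```

A constrained rollup, as named on the slide, would apply the same scoring to subsets of metrics grouped along one dimension before computing the final score.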

Editor's Notes

  1. Maximum engineering velocity can only be achieved when deployment velocity is a non-factor…thousands of systems in the time it takes to get a coffee.
  2. Chronos is the “go to” tool when something goes awry in production
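
To make the Chronos idea concrete (aggregate significant change events from several sources, then consult them when production misbehaves), here is a minimal in-memory sketch in Python. It is not the Chronos API: the slides only say Chronos is a simple REST service with customized adapters, so the `ChangeEvent` and `EventLog` names, their fields, and the 30-minute default window are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """One significant event; field names are illustrative assumptions."""
    source: str          # e.g. "asgard", "jira", "aws", "asg-scaling"
    description: str
    timestamp: datetime


class EventLog:
    """Collects events from all sources and answers 'what changed around time T?'."""

    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def around(self, when, window=timedelta(minutes=30)):
        """Return events within +/- window of `when`, oldest first."""
        hits = [e for e in self._events if abs(e.timestamp - when) <= window]
        return sorted(hits, key=lambda e: e.timestamp)


if __name__ == "__main__":
    log = EventLog()
    incident = datetime(2013, 9, 12, 14, 0)
    log.record(ChangeEvent("asgard", "push of service A", incident - timedelta(minutes=10)))
    log.record(ChangeEvent("jira", "production change request", incident - timedelta(hours=5)))
    for event in log.around(incident):
        print(event.timestamp, event.source, event.description)
```

In a real deployment each adapter (Asgard pushes, JIRA change requests, AWS notifications, and so on) would call something like `record`, and the query would sit behind a REST endpoint.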