Ensuring Performance in a Fast-Paced Environment
Martin Spier 
Performance Engineering @ Netflix 
@spiermar 
mspier@netflix.com 
Performance & Capacity 2014 by CMG
Martin Spier 
● Performance Engineer @ Netflix 
● Previously @ Expedia and Dell 
● Performance 
o Architecture, Tuning and Profiling 
o Testing and Frameworks 
o Tool Development 
● Blog @ http://overloaded.io 
● Twitter @spiermar
● World's leading Internet television network 
● ⅓ of all traffic heading into American homes at 
peak hours 
● > 50 million members 
● > 40 countries 
● > 1 billion hours of TV shows and movies per 
month 
● 100s of different client devices
Agenda 
● How Netflix Works 
o Culture, Development Model, High-level 
Architecture, Platform 
● Ensuring Performance 
o Auto-Scaling, Squeeze Tests, Simian Army, Hystrix, 
Redundancy, Canary Analysis, Performance Test 
Framework, Large Scale Tests
Freedom and Responsibility 
● Culture deck* is TRUE 
o 9M+ views 
● Minimal process 
● Context over control 
● Root access to everything 
● No approvals required 
● Only Senior Engineers 
* http://www.slideshare.net/reed2001/culture-1798664
Independent Development Teams 
● Highly aligned, loosely coupled 
● Free to define release cycles 
● Free to choose any methodology
● But it’s an agile environment 
● And there is a “paved road”
Development Agility 
● Continuous innovation cycle 
● Shorter development cycles 
● Automate everything! 
● Self-service deployments 
● A/B Tests 
● Failure cost close to zero 
● Lower time to market 
● Innovation > Risk
Architecture 
● Scalable and Resilient 
● Micro-services 
● Stateless 
● Assume Failure 
● Backwards Compatible 
● Service Discovery
Zuul & Dynamic Routing 
● Zuul, the front door for all requests from devices and 
websites to the backend of the Netflix streaming 
application 
● Dynamic Routing 
● Monitoring 
● Resiliency and Security 
● Region and AZ Failure 
* https://github.com/Netflix/zuul
Cloud 
● Amazon’s AWS 
● Multi-region Active/Active 
● Ephemeral Instances 
● Auto-Scaling 
● Netflix OSS (https://github.com/Netflix)
Performance Engineering 
● Not a part of any development team 
● Not a shared service 
● Through consultation improve and maintain the 
performance and reliability 
● Provide self-service performance analysis utilities 
● Disseminate performance best practices 
● And we’re hiring!
What about Performance?
Auto-Scaling 
● 5-6x Intraday 
● Auto-Scaling Groups (ASGs) 
● Reactive Auto-Scaling 
● Predictive Auto-Scaling (Scryer)
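A reactive policy of the kind listed above can be sketched as a simple target-tracking rule: size the group so that each instance carries roughly a target load. This is only an illustration, not Netflix's actual policy; the function name, the requests-per-second metric, and the size bounds are all assumptions.

```python
import math


def desired_instances(current_instances, observed_rps_per_instance,
                      target_rps_per_instance, min_size=1, max_size=1000):
    """Return the group size that brings per-instance load back to target.

    All names and defaults are hypothetical. A real policy would also use
    cooldowns and usually scales down more cautiously than it scales up.
    """
    if target_rps_per_instance <= 0:
        raise ValueError("target_rps_per_instance must be positive")
    total_rps = current_instances * observed_rps_per_instance
    needed = math.ceil(total_rps / target_rps_per_instance)
    # Clamp to the configured group bounds.
    return max(min_size, min(max_size, needed))
```

For example, 10 instances each seeing 150 req/s against a target of 100 req/s would be resized to 15. A predictive approach would feed a forecast of `observed_rps_per_instance` into the same calculation ahead of time.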
Squeeze Tests 
● Stress Test, with Production Load 
● Steering Production Traffic 
● Understand the Upper Limits of Capacity 
● Adjust Auto-Scaling Policies 
● Automated Squeeze Tests
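One way to read the result of a squeeze test is to find the highest per-instance load that still met a latency objective. The sketch below assumes the test produced (requests per second per instance, p99 latency in ms) pairs and that a single latency limit defines "healthy"; both the data shape and the function name are hypothetical.

```python
def max_sustainable_load(observations, latency_slo_ms):
    """Return the highest load level reached before the first SLO breach.

    observations: iterable of (rps_per_instance, p99_latency_ms) pairs
    collected while traffic was stepped up. Returns None if even the
    lowest load level breached the objective.
    """
    best = None
    for rps, p99_ms in sorted(observations):
        if p99_ms > latency_slo_ms:
            break  # everything beyond this point is past the upper limit
        best = rps
    return best
```

The returned value is the kind of number that could then be used as the per-instance target when adjusting an auto-scaling policy.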
Red/Black Pushes 
● New builds are rolled out as new 
Auto-Scaling Groups (ASGs) 
● Elastic Load Balancers (ELBs) 
control the traffic going to each 
ASG 
● Fast and simple rollback if issues 
are found 
● Canary Clusters are used to test 
builds before a full rollout
Monitoring: Atlas 
● Humongous, 1.2 billion distinct time 
series 
● Integrated to all systems, production 
and test 
● 1 minute resolution, quick roll ups 
● 12-month persistence 
● API and querying UI 
● System and Application Level 
● Servo (github.com/Netflix/servo) 
● Custom dashboards
Vector 
● 1 second Resolution 
● No Persistence 
● Leverages Performance Co-Pilot (PCP)
● System-level Metrics 
● Java Metrics (parfait) 
● ElasticSearch, Cassandra 
● Flame Graphs (Brendan Gregg)
Mogul 
● ASG and Instance Level 
● Resource Demand; 
● Performance 
Characteristics; 
● And Downstream 
Dependencies.
Slalom 
● Cluster Level 
● High-level Demand Flow 
● Cross-application Request 
Tracing 
● Downstream and Upstream 
Demand
Canary Release 
“Canary release is a technique to reduce the risk 
of introducing a new software version in 
production by slowly rolling out the change to a 
small subset of users before rolling it out to the 
entire infrastructure and making it available to 
everybody.”
Automatic Canary Analysis (ACA) 
Exactly what the name implies. An automated 
way of analyzing a canary release.
ACA: Use Case 
● You are a service owner and have finished 
implementing a new feature into your application. 
● You want to determine if the new build, v1.1, is 
performing analogously to the existing build. 
● The new build is deployed automatically to a canary 
cluster 
● A small percentage of production traffic is steered to the 
canary cluster 
● After a short period of time, canary analysis 
is triggered
Automated Canary Analysis 
● For a given set of metrics, ACA will compare 
samples from baseline and canary; 
● Determine if they are analogous; 
● Identify any metrics that deviate from the 
baseline; 
● And generate a score that indicates the overall 
similarity of the canary.
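The steps above can be sketched in a few lines. The slides do not say which statistical comparison ACA uses, so this example swaps in a deliberately simple rule: two sample sets are treated as analogous when their medians differ by no more than a relative tolerance. The function names, the 10% tolerance, and the metric names in the usage note are assumptions.

```python
from statistics import median


def is_analogous(baseline, canary, tolerance=0.10):
    """True if the canary median is within `tolerance` of the baseline median."""
    base = median(baseline)
    cand = median(canary)
    if base == 0:
        return cand == 0
    return abs(cand - base) / abs(base) <= tolerance


def canary_score(baseline_metrics, canary_metrics, tolerance=0.10):
    """Return (score, deviating_metric_names).

    score is the percentage of metrics judged analogous; a metric missing
    from the canary is counted as deviating.
    """
    deviating = []
    for name, base_samples in baseline_metrics.items():
        canary_samples = canary_metrics.get(name)
        if not canary_samples or not is_analogous(base_samples, canary_samples, tolerance):
            deviating.append(name)
    total = len(baseline_metrics)
    score = 100.0 * (total - len(deviating)) / total if total else 0.0
    return score, deviating
```

Given samples for, say, `latency_ms`, `cpu_pct` and `errors_per_s`, a canary whose CPU usage is clearly higher while the other two match would score about 67 and report `cpu_pct` as the deviating metric.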
Automated Canary Analysis 
● The score will be associated 
with a Go/No-Go decision; 
● And the new build will be 
rolled out (or not) to the rest 
of the production 
environment. 
● No workload definitions 
● No synthetic load
What about pre-production 
Performance 
Testing? 
When is it appropriate?
Not always! 
Sometimes it doesn't make sense to run 
performance tests.
Remember the short release cycles? 
With the short time span between production builds, 
pre-production tests don’t warn us much sooner. 
(And there’s ACA)
So when? 
When it brings value. Not just because it is 
part of a process.
When? Use Cases 
● New Services 
● Large Code Refactoring 
● Architecture Changes 
● Workload Changes 
● Proof of Concept 
● Initial Cluster Sizing 
● Instance Type Migration
Use Cases, cont. 
● Troubleshooting 
● Tuning 
● Teams that release less frequently 
o Intermediary Builds 
● Base Components (Paved Road) 
o Amazon Cloud Images (AMIs) 
o Platform 
o Common Libraries
Who? 
● Push “tests” to development teams 
● Development understands the product; they 
developed it 
● Performance Engineering knows the tools 
and techniques (so we help!) 
● Easier to scale the effort!
How? Environment 
● Free to create any environment configuration 
● Integration stack 
● Full production-like or scaled-down environment 
● Hybrid model 
o Performance + integration stack 
● Production testing
How? Test Framework 
● Built around JMeter
How? Test Framework 
● Runs on Amazon’s EC2 
● Leverages Jenkins for orchestration
How? Analysis 
● In-house developed web analysis tool and API 
● Results persisted on Amazon’s S3 and RDS
How? Analysis 
● Automated analysis built-in (thresholds) 
● Customized alerts 
● Interface with monitoring tools
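Threshold-based analysis of a test run can be as small as the check below. The in-house tool itself is not described in the slides, so the function name, the metric names, and the "higher is worse" convention are assumptions made for illustration.

```python
def check_thresholds(results, thresholds):
    """Compare measured values against per-metric upper limits.

    results:    metric name -> measured value for this test run
    thresholds: metric name -> maximum acceptable value
    Returns a list of human-readable violations (empty means pass).
    """
    violations = []
    for metric, limit in thresholds.items():
        value = results.get(metric)
        if value is None:
            violations.append(f"{metric}: no measurement")
        elif value > limit:
            violations.append(f"{metric}: {value} exceeds limit {limit}")
    return violations
```

A non-empty list is what a customized alert or a failed Jenkins job could be driven from.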
Large Scale Tests 
● > 100k req/s 
● > 100 load generators 
● High Throughput Components 
o In-Memory Caches 
● Component scaling 
● Full production tests
Large Scale Tests: Problems 
● Your test client is likely the first bottleneck 
● Components are (often) not designed to 
scale 
o Great performance per node; 
o But they don’t scale horizontally. 
o Controller, data feeder, load generator*, result 
collection, result analysis, monitoring 
* often the exception
Large Scale Tests: Single Controller 
● Single controller, multiple load generators 
● Controller also serves as data feeder 
● Controller collects all results synchronously 
● Controller aggregates monitoring data 
● Batch and async might alleviate the problem 
● Analysis of large result sets is heavy (think 
percentiles)
Large Scale Tests: Distributed Model 
● Data Feeding and Load Generation 
o No Controller 
o Independent Load Generators 
● Data Collection and Monitoring 
o Decentralized Monitoring Platform 
● Data Analysis 
o Aggregation at node level 
o Hive/Pig 
o ElasticSearch
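Aggregation at node level is what makes percentiles affordable at this scale: raw percentiles cannot be averaged across nodes, but histograms can be summed. A minimal sketch, assuming fixed-width latency buckets (the 10 ms width and all function names are hypothetical), looks like this:

```python
import math
from collections import Counter

BUCKET_MS = 10  # hypothetical bucket width


def node_histogram(latencies_ms, bucket_ms=BUCKET_MS):
    """Run on each load generator: reduce raw latencies to bucket -> count."""
    return Counter(int(latency // bucket_ms) for latency in latencies_ms)


def merge(histograms):
    """Sum per-node histograms; the result is order-independent."""
    total = Counter()
    for histogram in histograms:
        total.update(histogram)
    return total


def percentile(histogram, pct, bucket_ms=BUCKET_MS):
    """Upper bound (ms) of the bucket holding the pct-th percentile (nearest rank)."""
    count = sum(histogram.values())
    if count == 0:
        raise ValueError("empty histogram")
    rank = max(1, math.ceil(count * pct / 100))
    seen = 0
    for bucket in sorted(histogram):
        seen += histogram[bucket]
        if seen >= rank:
            return (bucket + 1) * bucket_ms
```

Each node ships only a handful of counters instead of every sample, at the cost of precision limited to the bucket width.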
Takeaways 
● Canary analysis 
● Testing only when it brings VALUE 
● Leveraging cloud for tests 
● Automated test analysis 
● Pushing execution to development teams 
● Open source tools
Martin Spier 
mspier@netflix.com 
@spiermar 
http://overloaded.io/
References 
● parfait (https://code.google.com/p/parfait/) 
● servo (https://github.com/Netflix/servo) 
● hystrix (https://github.com/Netflix/Hystrix) 
● culture deck (http://www.slideshare.net/reed2001/culture-1798664) 
● zuul (https://github.com/Netflix/zuul) 
● scryer (http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html)
Backup Slides
Simian Army 
● Ensures cloud handles failures 
through regular testing 
● The Monkeys 
o Chaos Monkey: Resiliency 
o Latency: Artificial Delays 
o Conformity: Best-practices 
o Janitor: Unused Instances 
o Doctor: Health checks 
o Security: Security Violations 
o Chaos Gorilla: AZ Failure 
o Chaos Kong: Region Failure
Hystrix 
“... is a latency and fault 
tolerance library designed to 
isolate points of access to 
remote systems ...” 
● Stop cascading failures. 
● Fallbacks and graceful degradation 
● Fail fast and rapid recovery 
● Thread and semaphore isolation with 
circuit breakers 
● Real-time monitoring and 
configuration changes 
* https://github.com/Netflix/Hystrix
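The circuit-breaker and fallback ideas in the bullets above can be illustrated with a small single-threaded sketch. This is not Hystrix's API (Hystrix is a Java library and also provides thread and semaphore isolation, which are not shown here); the class, its parameters, and its defaults are assumptions.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker with a fallback (illustrative only)."""

    def __init__(self, failure_threshold=3, reset_timeout_s=5.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Open: fail fast without touching the remote system.
                return fallback()
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the caller gets the fallback immediately, which is what keeps a slow dependency from tying up the callers upstream of it.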
Real-time Analytics Platform (RTA) 
● ACA runs on top of RTA 
● Compute Engines 
o OpenCPU (R) 
o OpenPY (Python) 
● Data Sources 
o Real-time Monitoring Systems 
o Big Data Platforms 
● Reporting, Scheduling, Persistence
Slow Performance Regression 
● Deviation => “acceptable” regression 
● Small performance regressions might sneak in 
● Short release cycle = many releases 
● Many releases = cumulative regression
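The compounding effect is easy to quantify. The sketch below assumes every release carries the same small regression that stays within the accepted deviation; the 2% and 20-release figures in the note are illustrative, not from the talk.

```python
def cumulative_regression(per_release_regression, releases):
    """Total relative slowdown after compounding a per-release regression.

    per_release_regression: e.g. 0.02 for a 2% slowdown per release.
    """
    return (1 + per_release_regression) ** releases - 1
```

With an assumed 2% regression accepted per release, 20 releases compound to roughly a 49% slowdown, even though no single canary comparison would have flagged it.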
Testing Lower Level Components 
● Base AMIs 
o OS (Linux), tools and agents 
● Common Application Platform 
● Common Libraries 
● Reference Application 
o Leverages a common architecture (front, middle, 
data, memcache, jar clients, Hystrix) 
o Implements functions that stress 
specific resources (cpu, service, db)

 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 

Ensuring Performance in a Fast-Paced Environment (CMG 2014)

  • 1. Ensuring Performance in a Fast-Paced Environment Martin Spier Performance Engineering @ Netflix @spiermar mspier@netflix.com Performance & Capacity 2014 by CMG
  • 2. Martin Spier ● Performance Engineer @ Netflix ● Previously @ Expedia and Dell ● Performance o Architecture, Tuning and Profiling o Testing and Frameworks o Tool Development ● Blog @ http://overloaded.io ● Twitter @spiermar
  • 3. ● World's leading Internet television network ● ⅓ of all traffic heading into American homes at peak hours ● > 50 million members ● > 40 countries ● > 1 billion hours of TV shows and movies per month ● 100s of different client devices
  • 4. Agenda ● How Netflix Works o Culture, Development Model, High-level Architecture, Platform ● Ensuring Performance o Auto-Scaling, Squeeze Tests, Simian Army, Hystrix, Redundancy, Canary Analysis, Performance Test Framework, Large Scale Tests
  • 5. Freedom and Responsibility ● Culture deck* is TRUE o 9M+ views ● Minimal process ● Context over control ● Root access to everything ● No approvals required ● Only Senior Engineers * http://www.slideshare.net/reed2001/culture-1798664
  • 6. Independent Development Teams ● Highly aligned, loosely coupled ● Free to define release cycles ● Free to choose any methodology ● But it’s an agile environment ● And there is a “paved road”
  • 7. Development Agility ● Continuous innovation cycle ● Shorter development cycles ● Automate everything! ● Self-service deployments ● A/B Tests ● Failure cost close to zero ● Lower time to market ● Innovation > Risk
  • 9. Architecture ● Scalable and Resilient ● Micro-services ● Stateless ● Assume Failure ● Backwards Compatible ● Service Discovery
  • 10. Zuul & Dynamic Routing ● Zuul, the front door for all requests from devices and websites to the backend of the Netflix streaming application ● Dynamic Routing ● Monitoring ● Resiliency and Security ● Region and AZ Failure * https://github.com/Netflix/zuul
  • 11. Cloud ● Amazon’s AWS ● Multi-region Active/Active ● Ephemeral Instances ● Auto-Scaling ● Netflix OSS (https://github.com/Netflix)
  • 12. Performance Engineering ● Not a part of any development team ● Not a shared service ● Through consultation improve and maintain the performance and reliability ● Provide self-service performance analysis utilities ● Disseminate performance best practices ● And we’re hiring!
  • 14. Auto-Scaling ● 5-6x Intraday ● Auto-Scaling Groups (ASGs) ● Reactive Auto-Scaling ● Predictive Auto-Scaling (Scryer)
  • 15. Squeeze Tests ● Stress Test, with Production Load ● Steering Production Traffic ● Understand the Upper Limits of Capacity ● Adjust Auto-Scaling Policies ● Automated Squeeze Tests
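A minimal sketch of how a squeeze-test result might feed an auto-scaling policy: once the upper limit of a single instance is known, the cluster can be sized to keep each instance below a chosen fraction of that limit. The function name, the 50% target utilization and the example numbers are assumptions for illustration, not Netflix's actual policy.

```python
import math

def desired_instances(expected_rps, max_rps_per_instance,
                      target_utilization=0.5, min_instances=1):
    """Instances needed to serve expected_rps while keeping each instance
    at or below target_utilization of its squeeze-tested limit."""
    usable_rps = max_rps_per_instance * target_utilization
    return max(min_instances, math.ceil(expected_rps / usable_rps))

# Hypothetical numbers: a squeeze test showed ~500 req/s per instance.
print(desired_instances(expected_rps=12000, max_rps_per_instance=500))
```

A lower target utilization buys more headroom for traffic spikes at the cost of more instances; a reactive policy would re-evaluate this as load changes, while a predictive one would feed in a forecast for `expected_rps`.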
  • 16. Red/Black Pushes ● New builds are rolled out as new Auto-Scaling Groups (ASGs) ● Elastic Load Balancers (ELBs) control the traffic going to each ASG ● Fast and simple rollback if issues are found ● Canary Clusters are used to test builds before a full rollout
  • 17. Monitoring: Atlas ● Humongous, 1.2 billion distinct time series ● Integrated to all systems, production and test ● 1 minute resolution, quick roll ups ● 12-month persistence ● API and querying UI ● System and Application Level ● Servo (github.com/Netflix/servo) ● Custom dashboards
  • 18. Vector ● 1 second Resolution ● No Persistence ● Leverages Performance Co-Pilot (PCP) ● System-level Metrics ● Java Metrics (parfait) ● ElasticSearch, Cassandra ● Flame Graphs (Brendan Gregg)
  • 19. Mogul ● ASG and Instance Level ● Resource Demand; ● Performance Characteristics; ● And Downstream Dependencies.
  • 20. Slalom ● Cluster Level ● High-level Demand Flow ● Cross-application Request Tracing ● Downstream and Upstream Demand
  • 21. Canary Release “Canary release is a technique to reduce the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users before rolling it out to the entire infrastructure and making it available to everybody.”
  • 22. Automatic Canary Analysis (ACA) Exactly what the name implies. An automated way of analyzing a canary release.
  • 23. ACA: Use Case ● You are a service owner and have finished implementing a new feature into your application. ● You want to determine if the new build, v1.1, is performing analogously to the existing build. ● The new build is deployed automatically to a canary cluster ● A small percentage of production traffic is steered to the canary cluster ● After a short period of time, canary analysis is triggered
  • 24. Automated Canary Analysis ● For a given set of metrics, ACA will compare samples from baseline and canary; ● Determine if they are analogous; ● Identify any metrics that deviate from the baseline; ● And generate a score that indicates the overall similarity of the canary.
  • 25. Automated Canary Analysis ● The score will be associated with a Go/No-Go decision; ● And the new build will be rolled out (or not) to the rest of the production environment. ● No workload definitions ● No synthetic load
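The ACA flow described above (compare samples, flag deviating metrics, produce a score, map it to Go/No-Go) can be sketched as follows. The slides do not say which statistical comparison ACA uses, so the relative-difference-of-means rule, the 10% tolerance, the 0.9 score threshold and all names here are assumptions for illustration only.

```python
from statistics import mean

def canary_score(baseline, canary, tolerance=0.10):
    """Compare per-metric samples (dicts of metric name -> list of values).

    Returns (score, deviating): score is the fraction of metrics whose
    canary mean stays within `tolerance` of the baseline mean."""
    deviating = []
    for name, base_samples in baseline.items():
        base_mean = mean(base_samples)
        canary_mean = mean(canary[name])
        # Relative difference; fall back to 1.0 to avoid dividing by zero.
        denom = abs(base_mean) if base_mean != 0 else 1.0
        if abs(canary_mean - base_mean) / denom > tolerance:
            deviating.append(name)
    score = 1.0 - len(deviating) / len(baseline)
    return score, deviating

def go_no_go(score, threshold=0.9):
    """Map the similarity score to a rollout decision."""
    return "Go" if score >= threshold else "No-Go"

# Hypothetical samples collected from the baseline and canary clusters.
baseline = {"latency_ms": [100, 102, 98], "error_rate": [0.01, 0.01, 0.01]}
canary = {"latency_ms": [101, 99, 103], "error_rate": [0.05, 0.05, 0.05]}
score, deviating = canary_score(baseline, canary)
print(score, deviating, go_no_go(score))
```

Because both sample sets come from live production traffic, no workload definition or synthetic load is needed; only the choice of metrics and tolerances has to be maintained.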
  • 26. What about pre-production Performance Testing? When is it appropriate?
  • 27. Not always! Sometimes it doesn't make sense to run performance tests.
  • 28. Remember the short release cycles? With the short time span between production builds, pre-production tests don’t warn us much sooner. (And there’s ACA)
  • 29. So when? When it brings value. Not just because it is part of a process.
  • 30. When? Use Cases ● New Services ● Large Code Refactoring ● Architecture Changes ● Workload Changes ● Proof of Concept ● Initial Cluster Sizing ● Instance Type Migration
  • 31. Use Cases, cont. ● Troubleshooting ● Tuning ● Teams that release less frequently o Intermediary Builds ● Base Components (Paved Road) o Amazon Cloud Images (AMIs) o Platform o Common Libraries
  • 32. Who? ● Push “tests” to development teams ● Development understands the product, they developed it ● Performance Engineering knows the tools and techniques (so we help!) ● Easier to scale the effort!
  • 33. How? Environment ● Free to create any environment configuration ● Integration stack ● Full production-like or scaled-down environment ● Hybrid model o Performance + integration stack ● Production testing
  • 34. How? Test Framework ● Built around JMeter
  • 35. How? Test Framework ● Runs on Amazon’s EC2 ● Leverages Jenkins for orchestration
  • 36. How? Analysis ● In-house developed web analysis tool and API ● Results persisted on Amazon’s S3 and RDS
  • 37. How? Analysis ● Automated analysis built-in (thresholds) ● Customized alerts ● Interface with monitoring tools
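As a rough illustration of threshold-based automated analysis, the sketch below checks a test run's summary metrics against per-metric limits and reports violations that could drive an alert. The metric names, limits and function name are hypothetical; the in-house tool's actual interface is not described in the slides.

```python
def find_violations(results, thresholds):
    """Return (metric, value, limit) for every metric above its limit.

    Metrics without a threshold, or missing from the results, are skipped."""
    return [
        (metric, results[metric], limit)
        for metric, limit in thresholds.items()
        if metric in results and results[metric] > limit
    ]

# Hypothetical summary of one test run and the limits set by the team.
results = {"p99_latency_ms": 420, "error_rate": 0.002, "cpu_util": 0.55}
thresholds = {"p99_latency_ms": 300, "error_rate": 0.01}
for metric, value, limit in find_violations(results, thresholds):
    print(f"ALERT: {metric}={value} exceeds {limit}")
```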
  • 39. Large Scale Tests ● > 100k req/s ● > 100 load generators ● High Throughput Components o In-Memory Caches ● Component scaling ● Full production tests
  • 40. Large Scale Tests: Problems ● Your test client is likely the first bottleneck ● Components are (often) not designed to scale o Great performance per node; o But they don’t scale horizontally. o Controller, data feeder, load generator*, result collection, result analysis, monitoring * often the exception
  • 41. Large Scale Tests: Single Controller ● Single controller, multiple load generators ● Controller also serves as data feeder ● Controller collects all results synchronously ● Controller aggregates monitoring data ● Batch and async might alleviate the problem ● Analysis of large result sets is heavy (think percentiles)
  • 42. Large Scale Tests: Distributed Model ● Data Feeding and Load Generation o No Controller o Independent Load Generators ● Data Collection and Monitoring o Decentralized Monitoring Platform ● Data Analysis o Aggregation at node level o Hive/Pig o ElasticSearch
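One way to read "aggregation at node level" is that each load generator reduces its raw results to a compact summary that can be merged later, so percentiles never require shipping every sample to a single controller. The sketch below uses fixed-width latency histograms for that purpose; the 10 ms bucket width and all names are assumptions, and the slides do not state which summary structure was actually used.

```python
from collections import Counter

BUCKET_MS = 10  # assumed bucket width; trades accuracy for summary size

def node_histogram(latencies_ms):
    """Summarize one load generator's raw latencies as bucket -> count."""
    return Counter(int(latency // BUCKET_MS) for latency in latencies_ms)

def merge(histograms):
    """Merging node summaries is just adding counts per bucket."""
    total = Counter()
    for histogram in histograms:
        total.update(histogram)
    return total

def percentile(histogram, p):
    """Upper bound (ms) of the bucket holding the p-th percentile, 0 < p <= 100."""
    target = sum(histogram.values()) * p / 100
    seen = 0
    for bucket in sorted(histogram):
        seen += histogram[bucket]
        if seen >= target:
            return (bucket + 1) * BUCKET_MS
    raise ValueError("empty histogram")

# Two hypothetical load generators, each summarizing its own samples.
node_a = node_histogram([5, 12, 18, 25])
node_b = node_histogram([31, 47, 95, 250])
merged = merge([node_a, node_b])
print(percentile(merged, 50), percentile(merged, 100))
```

The merged result is only accurate to the bucket width, which is the price paid for summaries whose size does not grow with the number of requests.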
  • 43. Takeaways ● Canary analysis ● Testing only when it brings VALUE ● Leveraging cloud for tests ● Automated test analysis ● Pushing execution to development teams ● Open source tools
  • 44. Martin Spier mspier@netflix.com @spiermar http://overloaded.io/
  • 45. References ● parfait (https://code.google.com/p/parfait/) ● servo (https://github.com/Netflix/servo) ● hystrix (https://github.com/Netflix/Hystrix) ● culture deck (http://www.slideshare.net/reed2001/culture-1798664) ● zuul (https://github.com/Netflix/zuul) ● scryer (http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html)
  • 47. Simian Army ● Ensures cloud handles failures through regular testing ● The Monkeys o Chaos Monkey: Resiliency o Latency: Artificial Delays o Conformity: Best-practices o Janitor: Unused Instances o Doctor: Health checks o Security: Security Violations o Chaos Gorilla: AZ Failure o Chaos Kong: Region Failure
  • 48. Hystrix “... is a latency and fault tolerance library designed to isolate points of access to remote systems ...” ● Stop cascading failures. ● Fallbacks and graceful degradation ● Fail fast and rapid recovery ● Thread and semaphore isolation with circuit breakers ● Real-time monitoring and configuration changes * https://github.com/Netflix/Hystrix
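To make the circuit-breaker and fallback ideas concrete, here is a minimal sketch of the pattern in Python. It is not the Hystrix API (Hystrix is a Java library with its own command model); the class, parameters and defaults below are hypothetical and omit thread or semaphore isolation entirely.

```python
import time

class CircuitBreaker:
    """Fail fast to a fallback after repeated failures, then retry later."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: skip the remote call entirely
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if (self.failures >= self.failure_threshold
                    or self.opened_at is not None):
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote dependency in its own breaker, for example `breaker.call(fetch_recommendations, lambda: DEFAULT_LIST)`, where both names are placeholders; a slow or failing dependency then degrades one feature instead of cascading.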
  • 49. Real-time Analytics Platform (RTA) ● ACA runs on top of RTA ● Compute Engines o OpenCPU (R) o OpenPY (Python) ● Data Sources o Real-time Monitoring Systems o Big Data Platforms ● Reporting, Scheduling, Persistence
  • 50. Slow Performance Regression ● Deviation => “acceptable” regression ● Small performance regressions might sneak in ● Short release cycle = many releases ● Many releases = cumulative regression
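The arithmetic behind cumulative regression is simple compounding: if each release is allowed to be a small fraction slower than the previous one, the slowdown after n releases is (1 + r)^n - 1. The sketch below shows this and one possible guard, comparing each release against a fixed reference instead of its immediate predecessor. The 2% and 10% figures and the function names are assumptions for illustration.

```python
def cumulative_regression(per_release, releases):
    """Compound slowdown after `releases` builds, each `per_release` slower."""
    return (1 + per_release) ** releases - 1

def first_release_over_budget(latencies, budget=0.10):
    """Index of the first release whose latency exceeds the first release's
    latency by more than `budget`, or None if none does."""
    reference = latencies[0]
    for index, latency in enumerate(latencies):
        if (latency - reference) / reference > budget:
            return index
    return None

# Hypothetical: every release passes a 2% per-release check,
# yet 20 such releases compound to roughly a 49% slowdown.
print(f"{cumulative_regression(0.02, 20):.1%}")
latencies = [100 * 1.02 ** i for i in range(10)]
print(first_release_over_budget(latencies))
```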
  • 52. Testing Lower Level Components ● Base AMIs o OS (Linux), tools and agents ● Common Application Platform ● Common Libraries ● Reference Application o Leverages a common architecture (front, middle, data, memcache, jar clients, Hystrix) o Implements functions that stress specific resources (cpu, service, db)