SlideShare a Scribd company logo

Monitorama 2015 Netflix Instance Analysis

Monitorama 2015 talk by Brendan Gregg, Netflix. With our large and ever-changing cloud environment, it can be vital to debug instance-level performance quickly. There are many instance monitoring solutions, but few come close to meeting our requirements, so we've been building our own and open sourcing them. In this talk, I will discuss our real-world requirements for instance-level analysis and monitoring: not just the metrics and features we desire, but the methodologies we'd like to apply. I will also cover the new and novel solutions we have been developing ourselves to meet these needs and desires, which include use of advanced Linux performance technologies (eg, ftrace, perf_events), and on-demand self-service analysis (Vector).

1 of 69
Download to read offline
Netflix
Instance Performance Analysis
Requirements
Brendan Gregg
Senior Performance Architect
Performance Engineering Team
bgregg@netflix.com @brendangregg
Jun	
  2015	
  
Monitoring companies are selling
faster horses
I want to buy a car
Server/Instance Analysis Potential
In the last 10 years…
•  More Linux
•  More Linux metrics
•  Better visualizations
•  Containers
Conditions ripe for innovation: where is our Henry Ford?
This Talk
•  Instance analysis: system resources, kernel, processes
–  For customers: what you can ask for
–  For vendors: our desirables & requirements
–  What we are building (and open sourcing) at Netflix to
modernize instance performance analysis (Vector, …)
•  Over 60M subscribers
•  FreeBSD CDN for content delivery
•  Massive AWS EC2 Linux cloud
•  Many monitoring/analysis tools
•  Awesome place to work
Agenda
1.  Desirables
2.  Undesirables
3.  Requirements
4.  Methodologies
5.  Our Tools

Recommended

CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCoburn Watson
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkDatabricks
 
Functional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming FrameworksFunctional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming FrameworksHuafeng Wang
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache SparkJen Aman
 
Ensuring Performance in a Fast-Paced Environment (CMG 2014)
Ensuring Performance in a Fast-Paced Environment (CMG 2014)Ensuring Performance in a Fast-Paced Environment (CMG 2014)
Ensuring Performance in a Fast-Paced Environment (CMG 2014)Martin Spier
 
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsScaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsJim Dowling
 
Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1Hugo Louro
 

More Related Content

What's hot

The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache StormP. Taylor Goetz
 
Streaming and Messaging
Streaming and MessagingStreaming and Messaging
Streaming and MessagingXin Wang
 
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackCloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackAndrew Yongjoon Kong
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleSpark Summit
 
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Spark Summit
 
Multi-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceMulti-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceRobert Evans
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonJen Aman
 
Arc305 how netflix leverages multiple regions to increase availability an i...
Arc305 how netflix leverages multiple regions to increase availability   an i...Arc305 how netflix leverages multiple regions to increase availability   an i...
Arc305 how netflix leverages multiple regions to increase availability an i...Ruslan Meshenberg
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...SignalFx
 
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Shelan Perera
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Spark Summit
 
Analysis big data by use php with storm
Analysis big data by use php with stormAnalysis big data by use php with storm
Analysis big data by use php with storm毅 吕
 
Practical virtual network functions with Snabb (SDN Barcelona VI)
Practical virtual network functions with Snabb (SDN Barcelona VI)Practical virtual network functions with Snabb (SDN Barcelona VI)
Practical virtual network functions with Snabb (SDN Barcelona VI)Igalia
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Spark Summit
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Michael Noll
 
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark ClustersApril 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark ClustersYahoo Developer Network
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit
 

What's hot (20)

The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Streaming and Messaging
Streaming and MessagingStreaming and Messaging
Streaming and Messaging
 
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackCloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
 
Multi-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceMulti-tenant Apache Storm as a service
Multi-tenant Apache Storm as a service
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Arc305 how netflix leverages multiple regions to increase availability an i...
Arc305 how netflix leverages multiple regions to increase availability   an i...Arc305 how netflix leverages multiple regions to increase availability   an i...
Arc305 how netflix leverages multiple regions to increase availability an i...
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
 
Analysis big data by use php with storm
Analysis big data by use php with stormAnalysis big data by use php with storm
Analysis big data by use php with storm
 
Practical virtual network functions with Snabb (SDN Barcelona VI)
Practical virtual network functions with Snabb (SDN Barcelona VI)Practical virtual network functions with Snabb (SDN Barcelona VI)
Practical virtual network functions with Snabb (SDN Barcelona VI)
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
 
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark ClustersApril 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
 

Similar to Monitorama 2015 Netflix Instance Analysis

ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016Brendan Gregg
 
HPC Application Profiling & Analysis
HPC Application Profiling & AnalysisHPC Application Profiling & Analysis
HPC Application Profiling & AnalysisRishi Pathak
 
HPC Application Profiling and Analysis
HPC Application Profiling and AnalysisHPC Application Profiling and Analysis
HPC Application Profiling and AnalysisRishi Pathak
 
Introduction to Malware Analysis
Introduction to Malware AnalysisIntroduction to Malware Analysis
Introduction to Malware AnalysisAndrew McNicol
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsBrendan Gregg
 
2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurement2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurementPTIHPA
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...Rogue Wave Software
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsBrendan Gregg
 
RIoT (Raiding Internet of Things) by Jacob Holcomb
RIoT  (Raiding Internet of Things)  by Jacob HolcombRIoT  (Raiding Internet of Things)  by Jacob Holcomb
RIoT (Raiding Internet of Things) by Jacob HolcombPriyanka Aash
 
Monitoring and tuning your chef server - chef conf talk
Monitoring and tuning your chef server - chef conf talk Monitoring and tuning your chef server - chef conf talk
Monitoring and tuning your chef server - chef conf talk Andrew DuFour
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at NetflixBrendan Gregg
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf toolsBrendan Gregg
 
Systems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the CloudSystems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the CloudBrendan Gregg
 
from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018Chun-Yu Tseng
 
Performance Analysis: The USE Method
Performance Analysis: The USE MethodPerformance Analysis: The USE Method
Performance Analysis: The USE MethodBrendan Gregg
 
Performance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTracePerformance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTraceGraeme Jenkinson
 
Guider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGLGuider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGLPeace Lee
 
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1Adam Dunkels
 
Data Onboarding Breakout Session
Data Onboarding Breakout SessionData Onboarding Breakout Session
Data Onboarding Breakout SessionSplunk
 

Similar to Monitorama 2015 Netflix Instance Analysis (20)

ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
 
HPC Application Profiling & Analysis
HPC Application Profiling & AnalysisHPC Application Profiling & Analysis
HPC Application Profiling & Analysis
 
HPC Application Profiling and Analysis
HPC Application Profiling and AnalysisHPC Application Profiling and Analysis
HPC Application Profiling and Analysis
 
Introduction to Malware Analysis
Introduction to Malware AnalysisIntroduction to Malware Analysis
Introduction to Malware Analysis
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
 
2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurement2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurement
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 
ch11.ppt
ch11.pptch11.ppt
ch11.ppt
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
 
RIoT (Raiding Internet of Things) by Jacob Holcomb
RIoT  (Raiding Internet of Things)  by Jacob HolcombRIoT  (Raiding Internet of Things)  by Jacob Holcomb
RIoT (Raiding Internet of Things) by Jacob Holcomb
 
Monitoring and tuning your chef server - chef conf talk
Monitoring and tuning your chef server - chef conf talk Monitoring and tuning your chef server - chef conf talk
Monitoring and tuning your chef server - chef conf talk
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
 
Systems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the CloudSystems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the Cloud
 
from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018
 
Performance Analysis: The USE Method
Performance Analysis: The USE MethodPerformance Analysis: The USE Method
Performance Analysis: The USE Method
 
Performance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTracePerformance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTrace
 
Guider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGLGuider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGL
 
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
 
Data Onboarding Breakout Session
Data Onboarding Breakout SessionData Onboarding Breakout Session
Data Onboarding Breakout Session
 

More from Brendan Gregg

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing PerformanceBrendan Gregg
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingBrendan Gregg
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Brendan Gregg
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedBrendan Gregg
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)Brendan Gregg
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedBrendan Gregg
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceBrendan Gregg
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at NetflixBrendan Gregg
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareBrendan Gregg
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsBrendan Gregg
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityBrendan Gregg
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixBrendan Gregg
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019Brendan Gregg
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityBrendan Gregg
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018Brendan Gregg
 

More from Brendan Gregg (20)

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting Started
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing Tools
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflix
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF Observability
 
FlameScope 2018
FlameScope 2018FlameScope 2018
FlameScope 2018
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
 

Recently uploaded

"AIRe - AI Reliability Engineering", Denys Vasyliev
"AIRe - AI Reliability Engineering", Denys Vasyliev"AIRe - AI Reliability Engineering", Denys Vasyliev
"AIRe - AI Reliability Engineering", Denys VasylievFwdays
 
Automate Your Master Data Processes for Shared Service Center Excellence
Automate Your Master Data Processes for Shared Service Center ExcellenceAutomate Your Master Data Processes for Shared Service Center Excellence
Automate Your Master Data Processes for Shared Service Center ExcellencePrecisely
 
AI Act & Standardization: UNINFO involvement
AI Act & Standardization: UNINFO involvementAI Act & Standardization: UNINFO involvement
AI Act & Standardization: UNINFO involvementMimmo Squillace
 
Enhancing Productivity and Insight A Tour of JDK Tools Progress Beyond Java 17
Enhancing Productivity and Insight  A Tour of JDK Tools Progress Beyond Java 17Enhancing Productivity and Insight  A Tour of JDK Tools Progress Beyond Java 17
Enhancing Productivity and Insight A Tour of JDK Tools Progress Beyond Java 17Ana-Maria Mihalceanu
 
Curtain Module Manual Zigbee Neo CS01-1C.pdf
Curtain Module Manual Zigbee Neo CS01-1C.pdfCurtain Module Manual Zigbee Neo CS01-1C.pdf
Curtain Module Manual Zigbee Neo CS01-1C.pdfDomotica daVinci
 
How to write an effective Cyber Incident Response Plan
How to write an effective Cyber Incident Response PlanHow to write an effective Cyber Incident Response Plan
How to write an effective Cyber Incident Response PlanDatabarracks
 
LF Energy Webinar: Introduction to TROLIE
LF Energy Webinar: Introduction to TROLIELF Energy Webinar: Introduction to TROLIE
LF Energy Webinar: Introduction to TROLIEDanBrown980551
 
Importance of magazines in education ppt
Importance of magazines in education pptImportance of magazines in education ppt
Importance of magazines in education pptsafnarafeek2002
 
Breaking Barriers & Leveraging the Latest Developments in AI Technology
Breaking Barriers & Leveraging the Latest Developments in AI TechnologyBreaking Barriers & Leveraging the Latest Developments in AI Technology
Breaking Barriers & Leveraging the Latest Developments in AI TechnologySafe Software
 
Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?
Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?
Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?GleecusTechlabs1
 
Dynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineeringDynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineeringMassimo Talia
 
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...DianaGray10
 
Bit N Build Poland
Bit N Build PolandBit N Build Poland
Bit N Build PolandGDSC PJATK
 
"Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre...
"Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre..."Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre...
"Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre...shaiyuvasv
 
Zi-Stick UBS Dongle ZIgbee from Aeotec manual
Zi-Stick UBS Dongle ZIgbee from  Aeotec manualZi-Stick UBS Dongle ZIgbee from  Aeotec manual
Zi-Stick UBS Dongle ZIgbee from Aeotec manualDomotica daVinci
 
Are Human-generated Demonstrations Necessary for In-context Learning?
Are Human-generated Demonstrations Necessary for In-context Learning?Are Human-generated Demonstrations Necessary for In-context Learning?
Are Human-generated Demonstrations Necessary for In-context Learning?MENGSAYLOEM1
 
IT Nation Evolve event 2024 - Quarter 1
IT Nation Evolve event 2024  - Quarter 1IT Nation Evolve event 2024  - Quarter 1
IT Nation Evolve event 2024 - Quarter 1Inbay UK
 
Building Products That Think- Bhaskaran Srinivasan & Ashish Gupta
Building Products That Think- Bhaskaran Srinivasan & Ashish GuptaBuilding Products That Think- Bhaskaran Srinivasan & Ashish Gupta
Building Products That Think- Bhaskaran Srinivasan & Ashish GuptaISPMAIndia
 
The Future of Product, by Founder & CEO, Product School
The Future of Product, by Founder & CEO, Product SchoolThe Future of Product, by Founder & CEO, Product School
The Future of Product, by Founder & CEO, Product SchoolProduct School
 
Confoo 2024 Gettings started with OpenAI and data science
Confoo 2024 Gettings started with OpenAI and data scienceConfoo 2024 Gettings started with OpenAI and data science
Confoo 2024 Gettings started with OpenAI and data scienceSusan Ibach
 

Recently uploaded (20)

"AIRe - AI Reliability Engineering", Denys Vasyliev
"AIRe - AI Reliability Engineering", Denys Vasyliev"AIRe - AI Reliability Engineering", Denys Vasyliev
"AIRe - AI Reliability Engineering", Denys Vasyliev
 
Automate Your Master Data Processes for Shared Service Center Excellence
Automate Your Master Data Processes for Shared Service Center ExcellenceAutomate Your Master Data Processes for Shared Service Center Excellence
Automate Your Master Data Processes for Shared Service Center Excellence
 
AI Act & Standardization: UNINFO involvement
AI Act & Standardization: UNINFO involvementAI Act & Standardization: UNINFO involvement
AI Act & Standardization: UNINFO involvement
 
Enhancing Productivity and Insight A Tour of JDK Tools Progress Beyond Java 17
Enhancing Productivity and Insight  A Tour of JDK Tools Progress Beyond Java 17Enhancing Productivity and Insight  A Tour of JDK Tools Progress Beyond Java 17
Enhancing Productivity and Insight A Tour of JDK Tools Progress Beyond Java 17
 
Curtain Module Manual Zigbee Neo CS01-1C.pdf
Curtain Module Manual Zigbee Neo CS01-1C.pdfCurtain Module Manual Zigbee Neo CS01-1C.pdf
Curtain Module Manual Zigbee Neo CS01-1C.pdf
 
How to write an effective Cyber Incident Response Plan
How to write an effective Cyber Incident Response PlanHow to write an effective Cyber Incident Response Plan
How to write an effective Cyber Incident Response Plan
 
LF Energy Webinar: Introduction to TROLIE
LF Energy Webinar: Introduction to TROLIELF Energy Webinar: Introduction to TROLIE
LF Energy Webinar: Introduction to TROLIE
 
Importance of magazines in education ppt
Importance of magazines in education pptImportance of magazines in education ppt
Importance of magazines in education ppt
 
Breaking Barriers & Leveraging the Latest Developments in AI Technology
Breaking Barriers & Leveraging the Latest Developments in AI TechnologyBreaking Barriers & Leveraging the Latest Developments in AI Technology
Breaking Barriers & Leveraging the Latest Developments in AI Technology
 
Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?
Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?
Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?
 
Dynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineeringDynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineering
 
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
 
Bit N Build Poland
Bit N Build PolandBit N Build Poland
Bit N Build Poland
 
"Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre...
"Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre..."Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre...
"Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre...
 
Zi-Stick UBS Dongle ZIgbee from Aeotec manual
Zi-Stick UBS Dongle ZIgbee from  Aeotec manualZi-Stick UBS Dongle ZIgbee from  Aeotec manual
Zi-Stick UBS Dongle ZIgbee from Aeotec manual
 
Are Human-generated Demonstrations Necessary for In-context Learning?
Are Human-generated Demonstrations Necessary for In-context Learning?Are Human-generated Demonstrations Necessary for In-context Learning?
Are Human-generated Demonstrations Necessary for In-context Learning?
 
IT Nation Evolve event 2024 - Quarter 1
IT Nation Evolve event 2024  - Quarter 1IT Nation Evolve event 2024  - Quarter 1
IT Nation Evolve event 2024 - Quarter 1
 
Building Products That Think- Bhaskaran Srinivasan & Ashish Gupta
Building Products That Think- Bhaskaran Srinivasan & Ashish GuptaBuilding Products That Think- Bhaskaran Srinivasan & Ashish Gupta
Building Products That Think- Bhaskaran Srinivasan & Ashish Gupta
 
The Future of Product, by Founder & CEO, Product School
The Future of Product, by Founder & CEO, Product SchoolThe Future of Product, by Founder & CEO, Product School
The Future of Product, by Founder & CEO, Product School
 
Confoo 2024 Gettings started with OpenAI and data science
Confoo 2024 Gettings started with OpenAI and data scienceConfoo 2024 Gettings started with OpenAI and data science
Confoo 2024 Gettings started with OpenAI and data science
 

Monitorama 2015 Netflix Instance Analysis

  • 1. Netflix Instance Performance Analysis Requirements Brendan Gregg Senior Performance Architect Performance Engineering Team bgregg@netflix.com @brendangregg Jun  2015  
  • 2. Monitoring companies are selling faster horses I want to buy a car
  • 3. Server/Instance Analysis Potential In the last 10 years… •  More Linux •  More Linux metrics •  Better visualizations •  Containers Conditions ripe for innovation: where is our Henry Ford?
  • 4. This Talk •  Instance analysis: system resources, kernel, processes –  For customers: what you can ask for –  For vendors: our desirables & requirements –  What we are building (and open sourcing) at Netflix to modernize instance performance analysis (Vector, …)
  • 5. •  Over 60M subscribers •  FreeBSD CDN for content delivery •  Massive AWS EC2 Linux cloud •  Many monitoring/analysis tools •  Awesome place to work
  • 6. Agenda 1.  Desirables 2.  Undesirables 3.  Requirements 4.  Methodologies 5.  Our Tools
  • 11. Histograms …  or  a  density  plot  
  • 20. … Without Running All These
  • 22. Other Desirables •  Safe for production use •  Easy to use: self service •  [Near] Real Time •  Ad hoc / custom instrumentation •  Complete documentation •  Graph labels and units •  Open source •  Community
  • 25. Pie Charts …for real-time metrics usr   sys   wait   idle  
  • 26. Doughnuts usr   sys   wait   idle   …like pie charts but worse
  • 27. Traffic Lights …when used for subjective metrics These can be used for objective metrics For subjective metrics (eg, IOPS/latency) try weather icons instead RED == BAD (usually) GREEN == GOOD (hopefully)
  • 29. Acceptable T&Cs •  Probably acceptable: •  Probably not acceptable: •  Check with your legal team By  submi9ng  any  Ideas,  Customer  and  Authorized  Users  agree   that:  ...  (iii)  all  right,  Ftle  and  interest  in  and  to  the  Ideas,  including  all   associated  IP  Rights,  shall  be,  and  hereby  are,  assigned  to  [us]   XXX,  Inc.  shall  have  a  royalty-­‐free,  worldwide,  transferable,  and   perpetual  license  to  use  or  incorporate  into  the  Service  any   suggesFons,  ideas,  enhancement  requests,  feedback,  or  other   informaFon  provided  by  you  or  any  Authorized  User  relaFng  to  the   Service.  
  • 30. Acceptable Technical Debt •  It must be worth the … •  Extra complexity when debugging •  Time to explain to others •  Production reliability risk •  Security risk •  There is no such thing as a free trial
  • 31. Known Overhead •  Overhead must be known to be managed –  T&Cs should not prohibit its measurement or publication •  Sources of overhead: –  CPU cycles –  File system I/O –  Network I/O –  Installed software size •  We will measure it
  • 32. Low Overhead •  Overhead should also be the lowest possible –  1% CPU overhead means 1% more instances, and $$$ •  Things we try to avoid –  Tracing every function/method call –  Needless kernel/user data transfers –  strace (ptrace), tcpdump, libpcap, … •  Event logging doesn't scale
  • 33. Scalable •  Can the product scale to (say) 100,000 instances? –  Atlas, our cloud-wide analysis tool, can –  We tend to kill other monitoring tools that attempt this •  Real-time dashboards showing all instances: –  How does that work? Can it scale to 1k? … 100k? –  Adrian Cockcroft's spigo can simulate protocols at scale •  High overhead might be worth it: on-demand only
  • 34. Useful An instance analysis solution must provide actionable information that helps us improve performance
  • 36. Methodologies Methodologies pose the questions for metrics to answer Good monitoring/analysis tools should support performance analysis methodologies
  • 37. Drunk Man Anti-Method •  Tune things at random until the problem goes away
  • 38. Workload Characterization Study the workload applied: 1.  Who 2.  Why 3.  What 4.  How Target  Workload  
  • 39. Workload Characterization Eg, for CPUs: 1.  Who: which PIDs, programs, users 2.  Why: code paths, context 3.  What: CPU instructions, cycles 4.  How: changing over time Target  Workload  
  • 41. CPUs Who How What Why top,  htop! perf record -g! flame  graphs   monitoring   perf stat -a -d!
  • 42. Most Monitoring Products Today Who How What Why top,  htop! perf record -g! flame  Graphs   monitoring   perf stat -a -d!
  • 43. The USE Method •  For every resource, check: 1.  Utilization 2.  Saturation 3.  Errors •  Saturation is queue length or queued time •  Start by drawing a functional (block) diagram of your system / software / environment Resource   UFlizaFon   (%)  X  
  • 44. USE Method for Hardware Include busses & interconnects!
  • 46. Most Monitoring Products Today •  Showing what is and is not commonly measured •  Score: 8 out of 33 (24%) •  We can do better… U   S   E   U   S   E   U   S   E   U   S   E   U   S   E   U   S   E   U   S   E   U   S   E   U   S   E   U   S   E   U   S   E  
  • 47. Other Methodologies •  There are many more: –  Drill-Down Analysis Method –  Time Division Method –  Stack Profile Method –  Off-CPU Analysis –  … –  I've covered these in previous talks & books
  • 49. BaseAMI •  Many sources for instance metrics & analysis –  Atlas, Vector, sar, perf-tools (ftrace, perf_events), … •  Currently not using 3rd party monitoring vendor tools Linux  (usually  Ubuntu)   Java  (JDK  7  or  8)   Tomcat  GC  and   thread   dump   logging   hystrix,  metrics  (Servo),   health  check   OpFonal  Apache,   memcached,  Node.js,   …   Atlas,  S3  log  rotaFon,   sar,  erace,  perf,  stap,   perf-­‐tools   Vector,  pcp   ApplicaFon  war  files,   plahorm,  base  servelet  
  • 51. Netflix Atlas Select  Instance   Historical  Metrics   Select  Metrics  
  • 53. Netflix Vector Near  real-­‐7me,   per-­‐second  metrics   Flame  Graphs   Select   Metrics   Select  Instance  
  • 54. Java CPU Flame Graphs
  • 55. Needs -XX:+PreserveFramePointer and perf-map-agent Java CPU Flame Graphs Java   JVM   Kernel  
  • 56. sar •  System Activity Reporter. Archive of metrics, eg: •  Metrics are also in Atlas and Vector •  Linux sar is well designed: units, groups $ sar -n DEV! Linux 3.13.0-49-generic (prod0141) !06/06/2015!_x86_64_ !(16 CPU)! ! 12:00:01 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil! 12:05:01 AM eth0 4824.26 3941.37 919.57 15706.14 0.00 0.00 0.00 0.00! 12:05:01 AM lo 23913.29 23913.29 17677.23 17677.23 0.00 0.00 0.00 0.00! 12:15:01 AM eth0 4507.22 3749.46 909.03 12481.74 0.00 0.00 0.00 0.00! 12:15:01 AM lo 23456.94 23456.94 14424.28 14424.28 0.00 0.00 0.00 0.00! 12:25:01 AM eth0 10372.37 9990.59 1219.22 27788.19 0.00 0.00 0.00 0.00! 12:25:01 AM lo 25725.15 25725.15 29372.20 29372.20 0.00 0.00 0.00 0.00! 12:35:01 AM eth0 4729.53 3899.14 914.74 12773.97 0.00 0.00 0.00 0.00! 12:35:01 AM lo 23943.61 23943.61 14740.62 14740.62 0.00 0.00 0.00 0.00! […]!
  • 58. perf-tools •  Some front-ends to Linux ftrace & perf_events –  Advanced, custom kernel observability when needed (rare) –  https://github.com/brendangregg/perf-tools –  Unsupported hacks: see WARNINGs •  ftrace –  First added to Linux 2.6.27 –  A collection of capabilities, used via /sys/kernel/debug/tracing/ •  perf_events –  First added to Linux 2.6.31 –  Tracer/profiler multi-tool, used via "perf" command
  • 59. perf-tools: funccount •  Eg, count a kernel function call rate: •  Other perf-tools can then instrument these in more detail # ./funccount -i 1 'bio_*'! Tracing "bio_*"... Ctrl-C to end.! ! FUNC COUNT! bio_attempt_back_merge 26! bio_get_nr_vecs 361! bio_alloc 536! bio_alloc_bioset 536! bio_endio 536! bio_free 536! bio_fs_destructor 536! bio_init 536! bio_integrity_enabled 536! bio_put 729! bio_add_page 1004! ! [...]! Counts  are  in-­‐kernel,   for  low  overhead  
  • 61. eBPF •  Currently being integrated. Efficient (JIT) in-kernel maps. •  Measure latency, heat maps, …
  • 62. eBPF eBPF will make a profound difference to monitoring on Linux systems There will be an arms race to support it, post Linux 4.1+ If it's not on your roadmap, it should be
  • 64. Requirements •  Acceptable T&Cs •  Acceptable technical debt •  Known overhead •  Low overhead •  Scalable •  Useful
  • 65. Methodologies Support for: •  Workload Characterization •  The USE Method •  … Not starting with metrics in search of uses
  • 67. Instrument These With full eBPF support Linux has awesome instrumentation: use it!
  • 68. Links & References •  Netflix Vector –  https://github.com/netflix/vector –  http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html •  Netflix Atlas –  http://techblog.netflix.com/2014/12/introducing-atlas-netflixs-primary.html •  Heat Maps –  http://www.brendangregg.com/heatmaps.html –  http://www.brendangregg.com/HeatMaps/latency.html •  Flame Graphs –  http://www.brendangregg.com/flamegraphs.html –  http://techblog.netflix.com/2014/11/nodejs-in-flames.html •  Frequency Trails: http://www.brendangregg.com/frequencytrails.html •  Methodology –  http://www.brendangregg.com/methodology.html –  http://www.brendangregg.com/USEmethod/use-linux.html •  perf-tools: https://github.com/brendangregg/perf-tools •  eBPF: http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html •  Images: –  horse: Microsoft Powerpoint clip art –  gauge: https://github.com/thlorenz/d3-gauge –  eBPF ponycorn: Deirdré Straughan & General Zoi's Pony Creator
  • 69. Thanks •  Questions? •  http://techblog.netflix.com •  http://slideshare.net/brendangregg •  http://www.brendangregg.com •  bgregg@netflix.com •  @brendangregg Jun  2015