Actionable Metrics
                      Enabling Decision-Making in Netflix’s Decentralized
                                         Environment

                                      Cloud Tech III
                                     October 6, 2012
                                      Roy Rapoport
                               @royrapoport, rsr@netflix.com

Thursday, October 18, 12
Me

                     • Been in tech for about 20 years
                     • Systems engineering, networking, software
                           development, QA, release management
                     • Time at Netflix: 1195 days (3y:3m:1w)
                     • (Current) job at Netflix: Make things better
                           (Security Monkey, Python Platform, Central Alert Gateway, Breaking Stuff ...)




Metrics Humor



                       % of instances with even public IP addresses




Technology Overview
                     • SoA, REST, Mostly Java
                     • Simple overall architecture:




Culture Overview
     • Freedom and Responsibility
     • Distributed Operations
     • Get out of the way of Developers



The Metric Lifecycle

                     •     Send
                     • Look
                     • Alert

Systems

                     • Flexible
                     • Scalable
                     • Self-Service


Telemetry
                             Flexible, Scalable, Self-Service
                   import netflix.metrics
                   [...]
                       self.nm = netflix.metrics.Metrics("core_cag")
                   [...]
                   def api(self):
                       self.nm.nfCounter("api")
                       [...]
                       self.nm.nfCounter("application_%s" % application)
                   [...]
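
A minimal sketch of how a counter client with this shape could work, assuming counters are simply accumulated in memory and prefixed with the application name. The `Metrics` class below is a hypothetical stand-in, not the internal `netflix.metrics` module; only the `nfCounter` method name comes from the slide, and the `amount` parameter, the prefixing rule, and the `snapshot` helper are invented for illustration.

```python
from collections import Counter


class Metrics:
    """Hypothetical stand-in for a self-service counter client."""

    def __init__(self, app_name):
        self.app_name = app_name
        self._counts = Counter()

    def nfCounter(self, name, amount=1):
        # Assumption: counters are namespaced by application name,
        # so "api" in "core_cag" becomes "core_cag_api".
        self._counts["%s_%s" % (self.app_name, name)] += amount

    def snapshot(self):
        # Return a plain dict copy, e.g. for a periodic flush to a backend.
        return dict(self._counts)


nm = Metrics("core_cag")
nm.nfCounter("api")
nm.nfCounter("api")
nm.nfCounter("application_%s" % "billing")
print(nm.snapshot())
```

The point of the shape is the self-service part: a developer adds one line per thing worth counting and never has to register the metric anywhere first.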




Visualization
                           Flexible, Scalable, Self-Service




Alerting
                           Flexible, Scalable, Self-Service



     • Static vs Dynamic Thresholds
     • Compare to history
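
The difference between the two threshold styles can be sketched as follows. This is an illustration only, assuming a simple mean and standard deviation over recent history as the "dynamic" rule; the slides do not say which comparison is actually used, and the function names and the `num_sigmas` parameter are hypothetical.

```python
from statistics import mean, stdev


def static_alert(value, limit):
    """Fire when the value crosses a fixed, hand-picked limit."""
    return value > limit


def dynamic_alert(value, history, num_sigmas=3.0):
    """Fire when the value is far from what history says is normal."""
    if len(history) < 2:
        return False  # not enough history to judge
    center = mean(history)
    spread = stdev(history)
    return abs(value - center) > num_sigmas * spread


history = [100, 104, 98, 101, 97, 103, 99, 102]
print(static_alert(130, limit=500))   # False: 130 is under the fixed limit
print(dynamic_alert(130, history))    # True: 130 is far outside recent history
```

The static rule needs someone to know that 500 is the right number; the dynamic rule only needs the metric's own past.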




For Example ...
                           Last 3 hours’ core_tools.core_cag_api




                                         What the ...




For Example ...
                                  Visualization (Continued)

                           Last 4 days’ core_tools.core_cag_api




                                    even more questions!



For Example ...
                                   Visualization (Continued)

                           Last 10 days’ core_tools.core_cag_api




                                   What caused the spike?


For Example ...
                                 Visualization (Continued)

                           Show alert volume per application




                             Someone had a rough few days...


Don’t Like Surprises...
                 {
                           "alerts": [
                               {
                                   "applyTo": "cluster",
                                   "condition": {
                                       "minPercent": 90.0,
                                       "noise": 0.2,
                                       "maxPercent": 25.0,
                                       "type": "DoubleExponential"
                                   },
                                   "metricName": "core_cag_api",
                                   "severity": "major"
                               }
                           ],
                           "clusters": [
                               "core_tools"
                           ]
                 }
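
The `DoubleExponential` condition type suggests double exponential (Holt) smoothing, which forecasts the next point from a smoothed level and trend. Below is a minimal sketch of that general technique, not the alerting system's implementation. The slides do not define `minPercent`, `maxPercent`, or `noise`, so the sketch does not try to model them; `alpha`, `beta`, and `tolerance` are hypothetical parameters chosen for illustration.

```python
def holt_forecasts(series, alpha=0.5, beta=0.5):
    """One-step-ahead forecasts from double exponential (Holt) smoothing.

    forecasts[i] is the prediction for series[i + 2], made using only
    the points before it.
    """
    if len(series) < 3:
        raise ValueError("need at least three points")
    level = series[1]
    trend = series[1] - series[0]
    forecasts = []
    for actual in series[2:]:
        predicted = level + trend
        forecasts.append(predicted)
        # Update the smoothed level, then the smoothed trend.
        new_level = alpha * actual + (1 - alpha) * predicted
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def anomalies(series, tolerance=0.25, **smoothing):
    """Indexes whose value differs from its forecast by more than
    tolerance (a fraction of the forecast)."""
    flagged = []
    for offset, predicted in enumerate(holt_forecasts(series, **smoothing)):
        index = offset + 2
        if abs(series[index] - predicted) > tolerance * abs(predicted):
            flagged.append(index)
    return flagged


calls = [10, 12, 14, 16, 18, 40, 22, 24]
print(anomalies(calls))
```

Note that a spike also drags the level and trend with it, so the points right after a spike can be flagged as well until the forecast settles.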




Threshold Tuning


                     • An Abbreviated History ...



Threshold Tuning
                                               (in the beginning)

                     • Systems owned by IT
                     • Want an alert? Submit a ticket
                     • Want to tune an alert? Submit a ticket
                    Some priests offer their prayers to alien creatures best left
                    forgotten. This ill-advised worship twists their minds in odd
                    ways. Overlords find these warped men useful due to the
                    unnatural powers they can channel. The dark priests most
                    favored by their strange gods have powerful protections, and
                    defeating one of them is sure to bring down a terrible curse
                    upon the victor.
                      - http://www.descentinthedark.com/_d_/dark_priests.php


Threshold Tuning
                                        (It gets better)

                     • You get to configure your own threshold
                     • Freedom!
                     • Also, you have to configure your own thresholds




Threshold Tuning
                                  (Are we there yet?)

                     • Play with historical data
                     • Huge difference
                     • Still falls short
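
"Playing with historical data" can be as simple as replaying a candidate threshold over past values and counting how many alerts it would have raised. A minimal sketch, assuming a plain upper-bound threshold; the `backtest` function and the sample data are hypothetical.

```python
def backtest(history, threshold):
    """Count how many separate alert episodes a threshold would have caused."""
    episodes = 0
    firing = False
    for value in history:
        breached = value > threshold
        if breached and not firing:
            episodes += 1  # a new alert starts here
        firing = breached
    return episodes


last_week = [40, 42, 95, 97, 41, 43, 120, 44, 39, 88]
for candidate in (80, 90, 100):
    print(candidate, backtest(last_week, candidate))
```

This makes the trade-off visible before the pager does, but a human still has to pick the number, which is where it falls short.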



Threshold Tuning
                               (Yeah, that’s the ticket)

                     • Computers can be good at this




If Time Allows ...



Events vs Metrics

                     • Irregular Interval
                     • Point in time
                     • Lack magnitude
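
One way to picture the three differences is as two record types. This is a sketch only; the class and field names are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetricSample:
    """A measurement: reported on a regular interval and carries a value."""
    name: str
    timestamp: float
    value: float


@dataclass(frozen=True)
class Event:
    """Something that happened: arrives at any time and has no magnitude."""
    kind: str
    timestamp: float
    attributes: dict = field(default_factory=dict)


samples = [MetricSample("core_cag_api", t, 100.0) for t in (0.0, 60.0, 120.0)]
deploy = Event("deploy", 97.3, {"application": "core_cag"})

# Metric samples are evenly spaced; the event simply lands where it lands.
intervals = {b.timestamp - a.timestamp for a, b in zip(samples, samples[1:])}
print(intervals)
print(hasattr(deploy, "value"))
```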


Why Build It?

                     • Change management
                           •   Vs Change control

                     • What Changed?
                     • Better Alerting


Chronos
                     •     Rapidly Prototyped
                     •     Adapters and reporters
                     •     Easy querying
                     •     Alarming
                           •   Something happened
                           •   ... X times in Y minutes
                           •   Something didn’t happen
                     •     Medium volume
                     •     Recursive
                           •   Recursive
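
The three alarm shapes listed for Chronos (something happened, it happened X times in Y minutes, something didn't happen) can be sketched as checks over a list of event timestamps. This is an illustration under stated assumptions, not Chronos code: the function names are hypothetical, timestamps are plain seconds, and the window is taken as the interval `(now - window, now]`.

```python
def count_in_window(timestamps, now, window):
    """Number of events in the trailing window (now - window, now]."""
    return sum(1 for ts in timestamps if now - window < ts <= now)


def happened(timestamps, now, window):
    """Alarm shape 1: the event occurred at least once in the window."""
    return count_in_window(timestamps, now, window) >= 1


def happened_x_times(timestamps, now, window, x):
    """Alarm shape 2: the event occurred at least x times in the window."""
    return count_in_window(timestamps, now, window) >= x


def did_not_happen(timestamps, now, window):
    """Alarm shape 3: an expected event is missing from the window."""
    return count_in_window(timestamps, now, window) == 0


MINUTE = 60
deploys = [0, 5 * MINUTE, 7 * MINUTE, 8 * MINUTE]  # event times in seconds
backups = [0]

now = 10 * MINUTE
print(happened(deploys, now, window=5 * MINUTE))
print(happened_x_times(deploys, now, window=5 * MINUTE, x=3))
print(did_not_happen(backups, now, window=5 * MINUTE))
```

The third shape is the one metrics handle poorly: an absent event produces no data point at all, so it has to be checked against a clock rather than against a value.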



End Result
                     • Massive decrease in change control tickets
                      • Not talking about SOX or PCI
                     • Better visibility into changes
                     • Decreased TTR
                      • Especially for bad code deployments
                     • You should do this
I Didn’t Mention

                     • End-to-end testing and alerting
                     • External availability and performance
                     • Open Connect
                     • Jobs

Questions?




Thursday, October 18, 12

More Related Content

Viewers also liked

Canary Analyze All the Things
Canary Analyze All the ThingsCanary Analyze All the Things
Canary Analyze All the Thingsroyrapoport
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...Adrian Cockcroft
 
Traffic anomaly detection and attack
Traffic anomaly detection and attackTraffic anomaly detection and attack
Traffic anomaly detection and attackQrator Labs
 
Anomaly Detection for Security
Anomaly Detection for SecurityAnomaly Detection for Security
Anomaly Detection for SecurityCody Rioux
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteAlois Reitbauer
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSAdrian Cockcroft
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsManojit Nandi
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Parallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysisParallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysisManojit Nandi
 
Monitoring large scale Docker production environments
Monitoring large scale Docker production environmentsMonitoring large scale Docker production environments
Monitoring large scale Docker production environmentsAlois Reitbauer
 
Monitoring without alerts
Monitoring without alertsMonitoring without alerts
Monitoring without alertsAlois Reitbauer
 
The Dark Art of Production Alerting
The Dark Art of Production AlertingThe Dark Art of Production Alerting
The Dark Art of Production AlertingAlois Reitbauer
 
Can a monitoring tool pass the turing test
Can a monitoring tool pass the turing testCan a monitoring tool pass the turing test
Can a monitoring tool pass the turing testAlois Reitbauer
 
The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection. The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection. Alois Reitbauer
 
Ruxit - How we launched a global monitoring platform on AWS in 80 days.
Ruxit - How we launched a global monitoring platform on AWS in 80 days. Ruxit - How we launched a global monitoring platform on AWS in 80 days.
Ruxit - How we launched a global monitoring platform on AWS in 80 days. Alois Reitbauer
 
Monitoring Docker Application in Production
Monitoring Docker Application in ProductionMonitoring Docker Application in Production
Monitoring Docker Application in ProductionAlois Reitbauer
 
Anomaly Detection for Global Scale at Netflix
Anomaly Detection for Global Scale at NetflixAnomaly Detection for Global Scale at Netflix
Anomaly Detection for Global Scale at NetflixExtract Data Conference
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...tboubez
 

Viewers also liked (19)

Canary Analyze All the Things
Canary Analyze All the ThingsCanary Analyze All the Things
Canary Analyze All the Things
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
 
Traffic anomaly detection and attack
Traffic anomaly detection and attackTraffic anomaly detection and attack
Traffic anomaly detection and attack
 
Anomaly Detection for Security
Anomaly Detection for SecurityAnomaly Detection for Security
Anomaly Detection for Security
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident Syste
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWS
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World Systems
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Parallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysisParallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysis
 
Monitoring large scale Docker production environments
Monitoring large scale Docker production environmentsMonitoring large scale Docker production environments
Monitoring large scale Docker production environments
 
Monitoring without alerts
Monitoring without alertsMonitoring without alerts
Monitoring without alerts
 
The Dark Art of Production Alerting
The Dark Art of Production AlertingThe Dark Art of Production Alerting
The Dark Art of Production Alerting
 
Can a monitoring tool pass the turing test
Can a monitoring tool pass the turing testCan a monitoring tool pass the turing test
Can a monitoring tool pass the turing test
 
PyGotham 2016
PyGotham 2016PyGotham 2016
PyGotham 2016
 
The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection. The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection.
 
Ruxit - How we launched a global monitoring platform on AWS in 80 days.
Ruxit - How we launched a global monitoring platform on AWS in 80 days. Ruxit - How we launched a global monitoring platform on AWS in 80 days.
Ruxit - How we launched a global monitoring platform on AWS in 80 days.
 
Monitoring Docker Application in Production
Monitoring Docker Application in ProductionMonitoring Docker Application in Production
Monitoring Docker Application in Production
 
Anomaly Detection for Global Scale at Netflix
Anomaly Detection for Global Scale at NetflixAnomaly Detection for Global Scale at Netflix
Anomaly Detection for Global Scale at Netflix
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
 

Similar to Cloud Tech III: Actionable Metrics

Falling in Love with Frontend Exception | Devon 2012
Falling in Love with Frontend Exception | Devon 2012Falling in Love with Frontend Exception | Devon 2012
Falling in Love with Frontend Exception | Devon 2012Daum DNA
 
Internship dotCloud
Internship dotCloudInternship dotCloud
Internship dotCloudJill Mee
 
“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...
“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...
“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...Rodrigo Laiola Guimarães
 
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdfOpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdfOpenStack Foundation
 
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"Randy Bias
 
Cloudsearch @ ex.fm
Cloudsearch @ ex.fmCloudsearch @ ex.fm
Cloudsearch @ ex.fm__lucas
 
Java performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha GeeJava performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha GeeJAX London
 
Migrando do App Engine para o Heroku
Migrando do App Engine para o HerokuMigrando do App Engine para o Heroku
Migrando do App Engine para o HerokuFilipe Ximenes
 
App in the Air - Product Demo (Sep 2012)
App in the Air - Product Demo (Sep 2012)App in the Air - Product Demo (Sep 2012)
App in the Air - Product Demo (Sep 2012)Empatika
 
Retro-Fitting Atlassian Products into a Code-Cowboy Research Culture
Retro-Fitting Atlassian Products into a Code-Cowboy Research CultureRetro-Fitting Atlassian Products into a Code-Cowboy Research Culture
Retro-Fitting Atlassian Products into a Code-Cowboy Research CultureAtlassian
 
Reactive applications using Akka
Reactive applications using AkkaReactive applications using Akka
Reactive applications using AkkaMiguel Pastor
 
Bio-IT for Core Facility Managers
Bio-IT for Core Facility ManagersBio-IT for Core Facility Managers
Bio-IT for Core Facility ManagersChris Dagdigian
 
Triage: real-world error logging for web applications
Triage: real-world error logging for web applicationsTriage: real-world error logging for web applications
Triage: real-world error logging for web applicationsLuke Cawood
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkEvan Chan
 
Phpday - Automated acceptance testing with Behat and Mink
Phpday - Automated acceptance testing with Behat and MinkPhpday - Automated acceptance testing with Behat and Mink
Phpday - Automated acceptance testing with Behat and MinkRichard Tuin
 
Secrets of the asset pipeline
Secrets of the asset pipelineSecrets of the asset pipeline
Secrets of the asset pipelineKen Collins
 

Similar to Cloud Tech III: Actionable Metrics (20)

Falling in Love with Frontend Exception | Devon 2012
Falling in Love with Frontend Exception | Devon 2012Falling in Love with Frontend Exception | Devon 2012
Falling in Love with Frontend Exception | Devon 2012
 
Internship dotCloud
Internship dotCloudInternship dotCloud
Internship dotCloud
 
April JavaScript Tools
April JavaScript ToolsApril JavaScript Tools
April JavaScript Tools
 
What is SCRUM?
What is SCRUM?What is SCRUM?
What is SCRUM?
 
“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...
“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...
“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...
 
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdfOpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
 
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
 
Cloudsearch @ ex.fm
Cloudsearch @ ex.fmCloudsearch @ ex.fm
Cloudsearch @ ex.fm
 
hello-my-name-is-software-testing-v2-pdf
hello-my-name-is-software-testing-v2-pdfhello-my-name-is-software-testing-v2-pdf
hello-my-name-is-software-testing-v2-pdf
 
KubeSecOps
KubeSecOpsKubeSecOps
KubeSecOps
 
Java performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha GeeJava performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha Gee
 
Migrando do App Engine para o Heroku
Migrando do App Engine para o HerokuMigrando do App Engine para o Heroku
Migrando do App Engine para o Heroku
 
App in the Air - Product Demo (Sep 2012)
App in the Air - Product Demo (Sep 2012)App in the Air - Product Demo (Sep 2012)
App in the Air - Product Demo (Sep 2012)
 
Retro-Fitting Atlassian Products into a Code-Cowboy Research Culture
Retro-Fitting Atlassian Products into a Code-Cowboy Research CultureRetro-Fitting Atlassian Products into a Code-Cowboy Research Culture
Retro-Fitting Atlassian Products into a Code-Cowboy Research Culture
 
Reactive applications using Akka
Reactive applications using AkkaReactive applications using Akka
Reactive applications using Akka
 
Bio-IT for Core Facility Managers
Bio-IT for Core Facility ManagersBio-IT for Core Facility Managers
Bio-IT for Core Facility Managers
 
Triage: real-world error logging for web applications
Triage: real-world error logging for web applicationsTriage: real-world error logging for web applications
Triage: real-world error logging for web applications
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
 
Phpday - Automated acceptance testing with Behat and Mink
Phpday - Automated acceptance testing with Behat and MinkPhpday - Automated acceptance testing with Behat and Mink
Phpday - Automated acceptance testing with Behat and Mink
 
Secrets of the asset pipeline
Secrets of the asset pipelineSecrets of the asset pipeline
Secrets of the asset pipeline
 

Cloud Tech III: Actionable Metrics

  • 1. Actionable Metrics Enabling Decision-Making in Netflix’s Decentralized Environment Cloud Tech III October 6, 2012 Roy Rapoport @royrapoport, rsr@netflix.com Thursday, October 18, 12
  • 2. Me • Been in tech for about 20 years • Systems engineering, networking, software development, QA, release management • Time at Netflix: 1195 days (3y:3m:1w) • (Current) job at Netflix: Make things better (Security Monkey, Python Platform, Central Alert Gateway, Breaking Stuff...)
  • 7. Metrics Humor: % of instances with even public IP addresses
  • 9-12. Technology Overview • SOA, REST, Mostly Java • Simple overall architecture
  • 14-16. Culture Overview • Freedom and Responsibility • Distributed Operations • Get out of the way of Developers
  • 18-20. The Metric Lifecycle • Send • Look • Alert
  • 21. Systems • Flexible • Scalable • Self-Service
  • 22. Telemetry: Flexible, Scalable, Self-Service
        import netflix.metrics
        [...]
        self.nm = netflix.metrics.Metrics("core_cag")
        [...]
        def api(self):
            self.nm.nfCounter("api")
            [...]
            self.nm.nfCounter("application_%s" % application)
            [...]
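The telemetry slide only shows fragments, and `netflix.metrics` is an internal library, so the following is a minimal sketch rather than the real API. It assumes a hypothetical in-memory `Metrics` class whose `nfCounter` increments a named counter under the prefix passed to the constructor; the `AlertGateway` class and the application names are also invented for illustration.

```python
from collections import Counter


class Metrics:
    """Hypothetical in-memory stand-in for netflix.metrics.Metrics."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.counts = Counter()

    def nfCounter(self, name, amount=1):
        # Assumption: each call bumps a counter named "<prefix>_<name>".
        self.counts["%s_%s" % (self.prefix, name)] += amount


class AlertGateway:
    """Hypothetical service that counts every API call it receives."""

    def __init__(self):
        self.nm = Metrics("core_cag")

    def api(self, application):
        # One counter for overall API traffic, one per calling application.
        self.nm.nfCounter("api")
        self.nm.nfCounter("application_%s" % application)


gateway = AlertGateway()
for app in ["api_proxy", "api_proxy", "billing"]:
    gateway.api(app)
```

The point of the pattern is that a developer adds one line per metric and needs no ticket or central registration, which is what makes it self-service.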
  • 23-28. Visualization: Flexible, Scalable, Self-Service
  • 29-31. Alerting: Flexible, Scalable, Self-Service • Static vs Dynamic Thresholds • Compare to history
  • 32. For Example ... Last 3 hours’ core_tools.core_cag_api: What the ...
  • 33. For Example ... Visualization (Continued) Last 4 days’ core_tools.core_cag_api: even more questions!
  • 34. For Example ... Visualization (Continued) Last 10 days’ core_tools.core_cag_api: What caused the spike?
  • 35. For Example ... Visualization (Continued) Show alert volume per application: Someone had a rough few days...
  • 36. Don’t Like Surprises...
        {
          "alerts": [
            {
              "applyTo": "cluster",
              "condition": {
                "minPercent": 90.0,
                "noise": 0.2,
                "maxPercent": 25.0,
                "type": "DoubleExponential"
              },
              "metricName": "core_cag_api",
              "severity": "major"
            }
          ],
          "clusters": [
            "core_tools"
          ]
        }
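The alert configuration names a "DoubleExponential" condition but the slides do not say how `minPercent`, `maxPercent`, or `noise` are interpreted. As a rough illustration of the general idea of comparing a metric to its own history, here is a sketch of Holt's double exponential smoothing with a single percentage tolerance. The `alpha`, `beta`, and `tolerance_pct` parameters and the sample series are assumptions made for this example, not values from the talk.

```python
def holt_forecasts(values, alpha=0.5, beta=0.3):
    """Return one-step-ahead forecasts for values[2:], each made before seeing the value."""
    level = values[1]
    trend = values[1] - values[0]
    forecasts = []
    for x in values[2:]:
        forecasts.append(level + trend)
        # Update the smoothed level and trend with the newly observed value.
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def anomalies(values, tolerance_pct=50.0, alpha=0.5, beta=0.3):
    """Return indexes whose value deviates from its forecast by more than tolerance_pct percent."""
    flagged = []
    for i, forecast in enumerate(holt_forecasts(values, alpha, beta), start=2):
        if forecast == 0:
            continue  # a percentage deviation is undefined for a zero forecast
        deviation = abs(values[i] - forecast) / abs(forecast) * 100.0
        if deviation > tolerance_pct:
            flagged.append(i)
    return flagged


# Hypothetical per-interval counts for a metric such as core_cag_api.
series = [100, 102, 104, 106, 108, 110, 300, 114, 116]
flagged = anomalies(series)
```

The steady climb at the start of the series raises no alert because the trend term follows it, while the jump to 300 does. Because the spike is also folded into the level and trend, forecasts immediately after it can be thrown off as well, which is one reason a real system would need extra tuning knobs.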
  • 37. Threshold Tuning • An Abbreviated History ...
  • 38-41. Threshold Tuning (in the beginning) • Systems owned by IT • Want an alert? Submit a ticket • Want to tune an alert? Submit a ticket
        "Some priests offer their prayers to alien creatures best left forgotten. This ill-advised worship twists their minds in odd ways. Overlords find these warped men useful due to the unnatural powers they can channel. The dark priests most favored by their strange gods have powerful protections, and defeating one of them is sure to bring down a terrible curse upon the victor." - http://www.descentinthedark.com/_d_/dark_priests.php
  • 42-45. Threshold Tuning (It gets better) • You get to configure your own threshold • Freedom! • Also, you have to configure your own thresholds
  • 46-49. Threshold Tuning (Are we there yet?) • Play with historical data • Huge difference • Still falls short
  • 50-56. Threshold Tuning (Yeah, that’s the ticket) • Computers can be good at this
  • 57. If Time Allows ...
  • 58-61. Events vs Metrics • Irregular Interval • Point in time • Lack magnitude
  • 62-65. Why Build It? • Change management (vs change control) • What Changed? • Better Alerting
  • 67-74. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying • Alarming (something happened; ... X times in Y minutes; something didn’t happen) • Medium volume • Recursive • Recursive
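The Chronos slides mention alarming on events that happened "X times in Y minutes" and on events that did not happen. Chronos's actual interface is not shown in the deck, so the sketch below uses plain timestamps and two hypothetical helper functions to show how such checks might look.

```python
from datetime import datetime, timedelta


def happened_at_least(timestamps, count, window):
    """True if at least `count` events fall inside some span of length `window`."""
    ordered = sorted(timestamps)
    start = 0
    for end, ts in enumerate(ordered):
        # Slide the left edge forward until the span fits in the window.
        while ts - ordered[start] > window:
            start += 1
        if end - start + 1 >= count:
            return True
    return False


def did_not_happen(timestamps, now, window):
    """True if no event was recorded in the window ending at `now`."""
    return not any(now - window <= ts <= now for ts in timestamps)


# Hypothetical deployment events, expressed as minutes before `now`.
now = datetime(2012, 10, 6, 12, 0)
deploys = [now - timedelta(minutes=m) for m in (50, 9, 6, 2)]

too_many = happened_at_least(deploys, 3, timedelta(minutes=10))
quiet = did_not_happen(deploys, now, timedelta(minutes=1))
```

Both checks work on irregular, point-in-time data with no magnitude, which matches how the deck distinguishes events from metrics.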
  • 76-81. End Result • Massive decrease in change control tickets (not talking about SOX or PCI) • Better visibility into changes • Decreased TTR (especially for bad code deployments) • You should do this
  • 82. I Didn’t Mention • End-to-end testing and alerting • External availability and performance • Open Connect • Jobs