Ariel Smoliar
Monitoring Platform
Objective
Develop a data-driven service to understand,
mitigate and prevent production outages
“You can observe a lot by just watching.”
(Yogi Berra)
Deliver a reliable and scalable intelligent monitoring platform
to make customers and production happy
Leveraging Data
• Logging
• Time-series metrics
• API performance
• Normalization
Implement Machine Learning
• Trends on time-series data
• Metrics correlation
• Outlier and anomaly detection
• Predictive analytics
Embrace DevOps
• Collaboration
• MTTI and MTTR
• Failure automation
• War room
Approach to Solution
Data Monitoring
• The goal of monitoring is to detect problems before they turn
into outages, not to detect outages
• In my product planning I will be focusing on the following
components:
– Collecting data
– Visualizing data
– Trending and alerting
Let’s Proceed in Three Phases:
Phase 1: Interview dev and ops teams to better understand the
production, monitoring methods and DevOps practice
Phase 2: Implement immediate changes to the postmortem process
based on challenges that were identified
Phase 3: Develop a data-driven monitoring system to handle the
outages in a period of one year
Roadmap Over the Next Year
Timeline: Q1 Q2 Q3 Q4
Phase 1: Interviewing
Phase 2: Outage Understanding
Outcome: Detailed and focused postmortem service
Phase 3(a): Outage Mitigation
Outcome: New capabilities to reduce mean time to identification of outages
Phase 3(b): Outage Prevention
Outcome: New capabilities to reduce mean time to resolution of outages
Phase 3(c): Continuing Outage Prevention
Outcome: Contextualized data platform to reduce and prevent outages
Phase 1: Interview Dev and Ops Teams
Discuss the following topics:
DevOps
• Which production alerts or incidents require postmortem?
• How is knowledge shared today between Ops and Dev teams?
• How do you allocate ownership for fixing bugs after an outage?
• What is the actionable learning process after outage investigation?
• What are the communication channels?
Monitoring
• Which monitoring and alerting systems are being used?
• Which metrics are you using to measure continuous improvement?
• What KPIs are you using?
• What data do you log?
Production
• What are the main problems you see today in your production deployment?
• Can you specify any common or unusual patterns (dependency on user traffic, etc.)?
• Across how many data centers and cloud providers is the code deployed?
Phase 2: Outage Understanding
Immediate Changes
• Postmortem format should include four main components and not take too much time to
complete:
– Description of the outage
– Timeline of the events that identify the sequence of what actually happened
– Contributing conditions analysis: why the outage occurred and what contributed to it
– Recommendations to prevent the outage in the future
• The company’s greatest asset is its people. We need to make sure that engineers and ops feel
comfortable sharing the relevant information so that root cause analysis can be conducted better
• Actionable learning and ownership:
– Assign tasks to team members and track progress (field ticket/bug ID)
– Update the playbook (GitHub/wiki) depending on the recommendations
– Encourage discussion between engineering and ops teams in live chat rooms
Goal: Make sure the postmortem focuses on the process and the technology, not on finding
someone to blame; ensure that the data supports an actionable learning process
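The four-part postmortem format described above could be captured as a small record type. This is only a sketch: the class and field names (Postmortem, TimelineEvent, missing_sections) are hypothetical and not part of the original plan.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEvent:
    """One timestamped event in the outage timeline (hypothetical type)."""
    at: datetime
    description: str


@dataclass
class Postmortem:
    """The four components of the postmortem format (hypothetical type)."""
    description: str
    timeline: List[TimelineEvent] = field(default_factory=list)
    contributing_conditions: List[str] = field(default_factory=list)
    recommendations: List[str] = field(default_factory=list)

    def ordered_timeline(self) -> List[TimelineEvent]:
        """Return events sorted by time, so the sequence reads as it happened."""
        return sorted(self.timeline, key=lambda event: event.at)

    def missing_sections(self) -> List[str]:
        """List the components that are still empty."""
        sections = {
            "description": self.description.strip(),
            "timeline": self.timeline,
            "contributing_conditions": self.contributing_conditions,
            "recommendations": self.recommendations,
        }
        return [name for name, value in sections.items() if not value]
```

A form built on such a record can refuse to close a postmortem while missing_sections() is non-empty, which keeps the format short but complete.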
Priorities for the Team
Backend/UI
• Expanding the functionalities of the service to:
– Assign ownership and prioritize tasks
– Automatically open JIRA ticket to track the progress
– Update production launch readiness checklist (optional)
– Tag events (data center, device, etc.)
• Adding screenshot of graphs to the form
• Visualizing events that lead to outage on timeline
• Storing event timelines
Data Science
• Exploring option to use monitoring tools (Ganglia/CloudWatch) API to pull metric data
• Reviewing recent outage data to look for patterns
Mockup: Timeline visualization of events during an outage investigation
Phase 3(a): Outage Mitigation
• We should be able to better investigate outages with the PostMortem service
– Analyzing multiple timelines of previous outages (historical data) side by side can help to
identify patterns and improve MTTI and MTTR
– If a sequence of outage events is repeated, we should make sure that the postmortem
recommendations are better implemented
– Sharing knowledge, graphs and reports from the PostMortem service can improve
collaboration between teams
• We will be designing an open API platform to collect and analyze data (network, databases, APM
metrics, servers, system, logs, CDN) across all domains from all our monitoring systems into a
single place
• We will start exploring multiple analytics areas (baselining, correlation, trending, outlier and
anomaly detection) on time-series data and can expand to include categorical data
• We will set bi-monthly meetings to share information and get feedback from our internal
customers in order to learn from recent outages and communicate our progress
Goal: Expand the postmortem process with new tools to reduce the time spent on
identifying and investigating an outage. This phase will also involve designing the
advanced platform
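Comparing the timeline of a current outage against historical timelines, as proposed above, could start from a plain sequence-similarity score. The sketch below assumes each timeline has already been reduced to an ordered list of event-type labels; the function names and that representation are assumptions, and difflib is used only as a simple stand-in for whatever comparison method is eventually chosen.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def timeline_similarity(a: List[str], b: List[str]) -> float:
    """Similarity in [0, 1] between two ordered lists of event types."""
    return SequenceMatcher(None, a, b).ratio()


def most_similar_outage(
    current: List[str], history: Dict[str, List[str]]
) -> Tuple[str, float]:
    """Return (outage id, score) for the past outage closest to `current`.

    `history` maps a hypothetical outage id to its event sequence and
    must not be empty.
    """
    scored = [
        (outage_id, timeline_similarity(current, events))
        for outage_id, events in history.items()
    ]
    return max(scored, key=lambda pair: pair[1])
```

A high score against a past outage is a prompt to check whether that postmortem's recommendations were actually implemented.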
Priorities for the Team
Backend/UI
• Designing and implementing platform and data pipeline to collect, analyze and store
timestamped numerical data
• Automating historical outage timelines comparison
• Adding reporting system and option to share analysis insights
• Tracking system of open tasks from previous outages
Data Science
• Examining baseline creation for production
• Initial work on correlation analysis across multiple domains (PCA, etc.)
• Exploring open source projects (Netflix, Twitter, Etsy) for outlier and anomaly detection
• Reviewing trending algorithms
Mockup: Presenting multiple timelines of previous outages
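Baselining and outlier detection on time-series data, named in this phase, can be illustrated with a trailing median baseline and the median absolute deviation (MAD). This is one common textbook approach, not necessarily the one the platform would adopt; the function name, window size and threshold are assumed values.

```python
from statistics import median
from typing import List


def mad_outliers(
    values: List[float], window: int = 20, threshold: float = 3.5
) -> List[int]:
    """Return indexes whose value deviates strongly from the trailing baseline.

    Baseline = median of the previous `window` points; spread = median
    absolute deviation (MAD). `window` and `threshold` are assumed defaults.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        baseline = median(history)
        mad = median(abs(v - baseline) for v in history)
        if mad == 0:
            # Flat history: any change at all is treated as a deviation.
            if values[i] != baseline:
                flagged.append(i)
            continue
        # 0.6745 rescales MAD so the score is comparable to a z-score.
        score = 0.6745 * abs(values[i] - baseline) / mad
        if score > threshold:
            flagged.append(i)
    return flagged
```

Median and MAD are used instead of mean and standard deviation so that a single earlier spike does not inflate the baseline and hide the next one.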
Phase 3(b): Outage Prevention
• We should work with other teams to identify the business’s KPIs and then determine which
metrics can be collected to create and monitor those KPIs. Some examples of KPIs:
– Availability, latency, HTTP error codes (4xx, 5xx), user experience/number of users/revenue, etc.
• As we are moving forward with the new monitoring platform, it’s important to see if we
are improving these three parameters:
– Mean Time to Identification (MTTI)
– Mean Time to Resolution (MTTR)
– Number of outages
• We will focus on data quality and stress the importance of logging to the engineering
teams because the results of our analytics engine (for example correlating infrastructure
metrics related to end user experience with our mobile app) depend on the data we have
• We will keep automating our analytics engine to ensure that the platform is scalable and
not built on top of pre-defined patterns or rules
Goal: Improve data collection, processing, normalization and correlation capabilities
across the environments and data sources
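The three parameters listed above (MTTI, MTTR, number of outages) can be computed from outage records once each record carries three timestamps. The sketch assumes MTTI is measured from outage start to identification and MTTR from outage start to resolution; definitions vary between teams, and the Outage type and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Outage:
    """Hypothetical outage record with the three timestamps needed."""
    started: datetime
    identified: datetime
    resolved: datetime


def mean_delta(deltas: List[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mtti(outages: List[Outage]) -> timedelta:
    """Mean Time to Identification: start of outage to identification."""
    return mean_delta([o.identified - o.started for o in outages])


def mttr(outages: List[Outage]) -> timedelta:
    """Mean Time to Resolution: start of outage to resolution."""
    return mean_delta([o.resolved - o.started for o in outages])
```

Tracking these per quarter alongside len(outages) gives a direct view of whether the platform is improving the parameters it was built for.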
Priorities for the Team
Backend/UI
• Building scalable and stable platform to ingest data from multiple sources
• Visualization of results:
– beautiful dashboards
– trends
– correlations
• Alerting based on trends
• Implementing better data flow and sharing (RBAC)
Data Science
• Implementing trends based on time-series data
• Implementing and evaluating results of running metrics correlation on-demand
• Testing baselines and AD (ROC curves)
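Running metrics correlation on demand, as listed in the priorities above, could begin with a plain Pearson correlation between one target metric and a set of candidate metrics sampled at the same timestamps. The function names and the 0.8 cutoff are assumptions for illustration; the deck does not specify a correlation method.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # A constant series carries no correlation signal.
    return cov / sqrt(var_x * var_y)


def correlated_metrics(
    target: List[float],
    candidates: Dict[str, List[float]],
    min_abs_r: float = 0.8,  # assumed cutoff
) -> List[Tuple[str, float]]:
    """Return (metric name, r) pairs with |r| >= min_abs_r, strongest first."""
    results = [(name, pearson(target, series)) for name, series in candidates.items()]
    strong = [(name, r) for name, r in results if abs(r) >= min_abs_r]
    return sorted(strong, key=lambda pair: abs(pair[1]), reverse=True)
```

Correlation only narrows down which metrics move together during an incident; it does not establish which one caused the other.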
Logs are not sexy but…
Logging Practice
• Log everything: capturing every customer action and
internal transaction lets us gain insights into
what’s working and what’s not
• Assign a transaction ID (a session ID, for example)
through the app server for every transaction, to
expedite the investigation process
• Collect logs into our log management system;
later alerts will be streamed to the new
platform
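Attaching a transaction ID to every log line, as recommended above, can be done with Python's standard logging module. In this sketch the logger name, the log format and the helper names are assumptions; a real deployment would typically also include a timestamp in the format.

```python
import logging
import uuid
from typing import Optional


def build_logger(stream) -> logging.Logger:
    """Create a logger whose format includes the transaction ID."""
    logger = logging.getLogger("app.transactions")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s txn=%(transaction_id)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def transaction_logger(
    logger: logging.Logger, transaction_id: Optional[str] = None
) -> logging.LoggerAdapter:
    """Bind one transaction ID to every message logged through the adapter."""
    txn = transaction_id or uuid.uuid4().hex
    return logging.LoggerAdapter(logger, {"transaction_id": txn})
```

Every message logged through the adapter for one request then shares the same txn value, so an investigation can filter the log management system by that single ID.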
API Monitoring
To enrich the data, log each API call and monitor
the following information:
– Error code rate (authorization failures)
– Latency (90th, 95th percentile)
– Dependencies on 3rd party APIs as time spent on
external services
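The API measurements listed above (error code rate, latency percentiles, time spent on external services) could be derived from per-call log records roughly as follows. The ApiCall type, its fields, and the choice of 401/403 as authorization failures are assumptions; the percentile uses the nearest-rank method.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ApiCall:
    """Hypothetical per-call log record."""
    status: int               # HTTP status code
    latency_ms: float         # total time to serve the call
    external_ms: float = 0.0  # part of latency spent in 3rd party APIs


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(calls: List[ApiCall]) -> Dict[str, float]:
    """Aggregate error rates, latency percentiles and external time share."""
    latencies = [c.latency_ms for c in calls]
    total_latency = sum(latencies)
    return {
        "error_rate": sum(1 for c in calls if c.status >= 400) / len(calls),
        "auth_failure_rate": sum(1 for c in calls if c.status in (401, 403)) / len(calls),
        "p90_ms": percentile(latencies, 90),
        "p95_ms": percentile(latencies, 95),
        "external_share": (
            sum(c.external_ms for c in calls) / total_latency if total_latency else 0.0
        ),
    }
```

Percentiles are reported rather than the average because a small share of slow calls can dominate user experience while barely moving the mean.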
Phase 3(c): Continuing Outage Prevention
• At this point our platform is already contributing to outage mitigation:
– Data across all domains is collected, analyzed and visualized
– Easier to share information based on historical data
– Trends on time-series data allow us to predict earlier if something may go
wrong, preventing outages
• Improving data collection, processing, normalization and centralizing
monitoring data sources is an ongoing process. Any new sources can
enrich the data and help adjust the algorithms
• This phase will be critical in evaluating the machine learning
algorithms and making sure we have a robust alerting platform (false
positives and true positives) to reduce the number of outages
Goal: Converge the capabilities we have built towards a better system to reduce the
number of outages
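Evaluating the alerting platform in terms of false positives and true positives, as this phase calls for, amounts to computing points on a ROC curve from labeled history. The sketch assumes each past time window has an anomaly score from the detector and a label saying whether an outage really followed; the function and parameter names are hypothetical.

```python
from typing import List, Tuple


def roc_points(
    scores: List[float], labels: List[bool], thresholds: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, false positive rate, true positive rate) per threshold.

    An alert fires when score >= threshold. `labels` marks windows that
    were real outages. Both classes must be present in `labels`.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one outage and one non-outage window")
    points = []
    for t in thresholds:
        tp = sum(1 for s, is_outage in zip(scores, labels) if s >= t and is_outage)
        fp = sum(1 for s, is_outage in zip(scores, labels) if s >= t and not is_outage)
        points.append((t, fp / negatives, tp / positives))
    return points
```

Plotting these points for several thresholds shows the trade-off between missed outages and noisy alerts, which is the basis for choosing an alerting threshold.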
Priorities for the Team
Backend/UI
• Improving the platform infrastructure
• Monitoring the performance of the platform with the new solution
• Visualizing outlier and anomaly detection results
• Providing visibility into potential problems (predictive)
• Configuring chat rooms, emails, teams and owners to share information/alerts
• Planning a failure automation process
Data Science
• Implementing outlier and anomaly detection and evaluating performance
• Testing predictive analytics
– alerting based on sequence of events (divergence from normal baseline) that may
lead to an outage
• Open source the new AD framework
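Predictive visibility into potential problems can be illustrated with the simplest possible forecast: fit a straight line to recent samples of a metric and estimate how many sampling intervals remain before it crosses a limit. This is a sketch under the assumption of evenly spaced samples and a roughly linear trend; the function names are hypothetical and a production system would likely need a more robust model.

```python
from typing import List, Optional, Tuple


def linear_trend(values: List[float]) -> Tuple[float, float]:
    """Least-squares (slope, intercept) for evenly spaced samples."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def steps_until_threshold(values: List[float], threshold: float) -> Optional[float]:
    """Estimate sampling intervals until the fitted line reaches `threshold`.

    Returns 0.0 if the fitted current value is already at or above the
    threshold, and None if the trend is flat or decreasing.
    """
    slope, intercept = linear_trend(values)
    current = intercept + slope * (len(values) - 1)
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope
```

An alert can then be raised when the estimate drops below a chosen lead time, giving teams a chance to act before the limit is reached.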
Long-Term Product Vision
Automation
Automating workflow for relevant teams and advancing
failure automation will be needed for the growing number
of employees and the increasingly complex infrastructure.
Collaboration
Utilizing war room will make sure that all relevant teams
are involved and monitoring together. An enhanced
onboarding process will be needed for new engineers to
understand potential issues with production.
Analytics
Reducing the massive data stream to a more contextualized
view for faster escalation. Clustering, predictive analytics,
and a recommendation capability will be the core for the
success of the solution.
Conclusions
Deploying this three-phase approach will help to:
• Contextualize insights across all domains to make sure the
best user experience is continually provided
• Reduce the time required to investigate and resolve
production problems, leading to increased uptime
• Increase productivity: the right information gets to the right
people at the right time

 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 

Production Monitoring Platform

  • 2. Objective Develop a data-driven service to understand, mitigate and prevent production outages
  • 3. “You can observe a lot by just watching.” (Yogi Berra)
  • 4. Approach to Solution: Deliver a reliable and scalable intelligent monitoring platform to make customers and production happy. Leveraging Data: • Logging • Time-series metrics • APIs performance • Normalization. Implement Machine Learning: • Trends on time-series data • Metrics correlation • Outlier and anomaly detection • Predictive analytics. Embrace DevOps: • Collaboration • MTTI and MTTR • Failure automation • War room
  • 5. Data Monitoring • The goal of monitoring is to detect problems before they turn into outages, not to detect outages • In my product planning I will be focusing on the following components: – Collecting data – Visualizing data – Trending and alerting
  • 6. Let’s Proceed in Three Phases: Phase 1: Interview dev and ops teams to better understand the production, monitoring methods and DevOps practice. Phase 2: Implement immediate changes to the postmortem process based on challenges that were identified. Phase 3: Develop a data-driven monitoring system to handle the outages in a period of one year.
  • 7. Roadmap Over the Next Year (Q1 Q2 Q3 Q4) Phase 1: Interviewing. Phase 2: Outage Understanding. Outcome: Detailed and focused postmortem service. Phase 3(a): Outage Mitigation. Outcome: New capabilities to reduce mean time to identification of outages. Phase 3(b): Outage Prevention. Outcome: New capabilities to reduce mean time to resolution of outages. Phase 3(c): Continuing Outage Prevention. Outcome: Contextualized data platform to reduce and prevent outages.
  • 8. Phase 1: Interview Dev and Ops Teams (Production, Monitoring, DevOps) Discuss the following topics: Which production alerts or incidents require postmortem? How is knowledge shared today between Ops and Dev teams? How do you allocate ownership for fixing bugs after an outage? What is the actionable learning process after outage investigation? What are the communication channels? Which monitoring and alerting systems are being used? Which metrics are you using to measure continuous improvement? What KPIs are you using? What data do you log? What are the main problems you see today in your production deployment? Can you specify any common or unusual patterns (dependency on user traffic, etc.)? Across how many data centers and cloud providers is the code deployed?
  • 9. Phase 2: Outage Understanding Immediate Changes • Postmortem format should include four main components and not take too much time to complete: – Description of the outage – Timeline of the events that identify the sequence of what actually happened – Contributing conditions analysis: why the outage occurred and what contributed to it – Recommendations to prevent the outage in the future • The company’s greatest asset is its people. We need to make sure that the engineers/ops feel comfortable sharing the relevant information so that root cause analysis can be conducted better • Actionable learning and ownership: – Assign tasks to team members and track progress (field ticket/bug id) – Update playbook (github/wiki) depending on the recommendations – Encourage discussion between engineering and ops teams in live chat rooms Goal: Make sure the postmortem focuses on the process and the technology, not on finding who to blame; ensure that the data allows for an actionable learning process
  • 10. Priorities for the Team • Expanding the functionalities of the service to: – Assign ownership and prioritize tasks – Automatically open JIRA ticket to track the progress – Update production launch readiness checklist (optional) – Tag events (data center, device, etc.) • Adding screenshot of graphs to the form • Visualizing events that lead to outage on timeline • Storing event timelines • Exploring option to use monitoring tools (ganglia/CloudWatch) API to pull metric data • Reviewing recent outage data to look for patterns Backend/UI Data Science
  • 11. Mockups Timeline visualization of events during an outage investigation
  • 12. Phase 3(a): Outage Mitigation • We should be able to better investigate outages with the PostMortem service – Analyzing multiple timelines of previous outages (historical data) simultaneously can help to identify patterns and improve MTTI and MTTR – If a sequence of outage events is repeated, we should make sure that the postmortem recommendations are better implemented – Sharing knowledge, graphs and reports from the PostMortem service can improve collaboration between teams • We will be designing an open API platform to collect and analyze data (network, databases, APM metrics, servers, system, logs, CDN) across all domains from all our monitoring systems into a single place • We will start exploring multiple analytics areas (baselining, correlation, trending, outlier and anomaly detection) on time-series data and can expand to include categorical data • We will set bi-monthly meetings to share information and get feedback from our internal customers in order to learn from recent outages and communicate our progress Goal: Expand the postmortem process with new tools to reduce the time spent on identifying and investigating an outage. This phase will also involve designing the advanced platform
  • 13. Priorities for the Team • Designing and implementing platform and data pipeline to collect, analyze and store timestamped numerical data • Automating historical outage timelines comparison • Adding reporting system and option to share analysis insights • Tracking system of open tasks from previous outages • Examining baseline creation for production • Initial work on correlation analysis across multiple domains (PCA, etc.) • Exploring open source projects (Netflix, Twitter, Etsy) for outlier and anomaly detection • Reviewing trending algorithms Backend/UI Data Science
  • 15. Phase 3(b): Outage Mitigation • We should work with other teams to identify the business’s KPIs and then determine which metrics can be collected to create and monitor those KPIs. Some examples of KPIs: – Availability, latency, HTTP error codes (4xx, 5xx), user experience/number of users/revenue, etc. • As we move forward with the new monitoring platform, it’s important to see if we are improving these three parameters: – Mean Time to Identification (MTTI) – Mean Time to Resolution (MTTR) – Number of outages • We will focus on data quality and stress the importance of logging to the engineering teams because the results of our analytics engine (for example correlating infrastructure metrics related to end user experience with our mobile app) depend on the data we have • We will keep automating our analytics engine to ensure that the platform is scalable and not built on top of pre-defined patterns or rules Goal: Improve data collection, processing, normalization and correlation capabilities across the environments and data sources
  • 16. Priorities for the Team • Building scalable and stable platform to ingest data from multiple sources • Visualization of results: – beautiful dashboards – trends – correlations • Alerting based on trends • Implementing better data flow and sharing (RBAC) • Implementing trends based on time-series data • Implementing and evaluating results of running metrics correlation on-demand • Testing baselines and AD (ROC curves) Backend/UI Data Science
  • 17. Logs are not sexy but…
  • 18. Logging Practice • Log everything – this will enable us to use every customer action or internal transaction to gain insights into what’s working and what’s not • Assign a transaction ID (a session ID, for example) through the app server for every transaction, expediting the investigation process • Collect logs into our log management system; later, alerts will be streamed to the new platform
  • 19. API Monitoring To enrich the data, log each API call and monitor the following information: – Error code rate (authorization failures) – Latency (90th, 95th percentile) – Dependencies on 3rd party APIs as time spent on external services
  • 20. Phase 3(c): Continuing Outage Prevention • At this point our platform is already contributing to outage mitigation: – Data across all domains is collected, analyzed and visualized – Easier to share information based on historical data – Trends on time-series data allow us to predict earlier if something may go wrong, preventing outages • Improving data collection, processing, normalization and centralizing monitoring data sources is an ongoing process. Any new sources can enrich the data and help adjust the algorithms • This phase will be critical in evaluating the machine learning algorithms and making sure we have a robust alerting platform (false positives and true positives) to reduce the number of outages Goal: Converge the capabilities we have built towards a better system to reduce the number of outages
  • 21. Priorities for the Team • Implementing outlier and anomaly detection and evaluating performance • Testing predictive analytics – alerting based on sequence of events (divergence from normal baseline) that may lead to an outage • Open source the new AD framework Backend/UI Data Science • Improving the platform infrastructure • Monitoring the performance of the platform with the new solution • Visualizing outlier and anomaly detection results • Providing visibility into potential problems (predictive) • Configuring chat rooms, emails, teams and owners to share information/alerts • Planning a failure automation process
  • 22. Long-Term Product Vision Automation: Automating workflow for relevant teams and advancing failure automation will be needed for the growing number of employees and the increasingly complex infrastructure. Collaboration: Utilizing a war room will make sure that all relevant teams are involved and monitoring together. An enhanced onboarding process will be needed for new engineers to understand potential issues with production. Analytics: Reducing the massive data stream to a more contextualized view for faster escalation. Clustering, predictive analytics, and a recommendation capability will be the core for the success of the solution.
  • 23. Conclusions Deploying this three-phase approach will help to: • Contextualize insights across all domains to make sure the best user experience is continually provided • Accelerate time required to investigate and resolve production problems, leading to increased uptime • Increase productivity: right information gets to the right people at the right time
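
The API Monitoring slide (19) lists three things to track per API call: error code rate, latency at the 90th and 95th percentile, and time spent on external services. A minimal sketch of how those numbers could be derived from logged calls is given below. The record layout (`ApiCall`), the function names, the choice of 401/403 as authorization failures, and the nearest-rank percentile method are all assumptions made for illustration; the slides do not specify any of them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ApiCall:
    """One logged API call (hypothetical record layout, not from the slides)."""
    endpoint: str
    status: int          # HTTP status code returned to the caller
    latency_ms: float    # total time spent serving the call
    external_ms: float   # part of latency_ms spent waiting on 3rd party APIs


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def summarize(calls: List[ApiCall]) -> Dict[str, float]:
    """Aggregate the per-call metrics named on the API Monitoring slide."""
    if not calls:
        raise ValueError("no calls to summarize")
    total = len(calls)
    latencies = [c.latency_ms for c in calls]
    total_latency = sum(latencies)
    return {
        # Share of calls that returned any 4xx or 5xx status.
        "error_rate": sum(1 for c in calls if c.status >= 400) / total,
        # Assumption: 401 and 403 are counted as authorization failures.
        "auth_failure_rate": sum(1 for c in calls if c.status in (401, 403)) / total,
        "p90_ms": percentile(latencies, 90),
        "p95_ms": percentile(latencies, 95),
        # Fraction of all serving time that was spent on external services.
        "external_share": (
            sum(c.external_ms for c in calls) / total_latency if total_latency else 0.0
        ),
    }
```

Calling `summarize` on the calls collected in a fixed window (for example, one minute of log records) would yield one data point per metric, which is the kind of timestamped numerical data the platform in Phase 3 is meant to ingest.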

Editor's Notes

  1. Watch in play mode