Production Monitoring Platform

Ariel Smoliar
Monitoring Platform

Objective
Develop a data-driven service to understand,
mitigate and prevent production outages

“You can observe a lot by just watching.”
(Yogi Berra)

Approach to Solution
Deliver a reliable and scalable intelligent monitoring platform
to make customers and production happy
Leveraging Data
• Logging
• Time-series metrics
• API performance
• Normalization
Implement Machine Learning
• Trends on time-series data
• Metrics correlation
• Outlier and anomaly detection
• Predictive analytics
Embrace DevOps
• Collaboration
• MTTI and MTTR
• Failure automation
• War room

Data Monitoring
• The goal of monitoring is to detect problems before they turn
into outages, not to detect outages
• In my product planning I will be focusing on the following
components:
– Collecting data
– Visualizing data
– Trending and alerting

Let’s Proceed in Three Phases:
Phase 1: Interview dev and ops teams to better understand the
production, monitoring methods and DevOps practice
Phase 2: Implement immediate changes to the postmortem process
based on challenges that were identified
Phase 3: Develop a data-driven monitoring system to handle the
outages in a period of one year

Roadmap Over the Next Year
Timeline: Q1 Q2 Q3 Q4
Phase 1: Interviewing
Phase 2: Outage Understanding
Outcome: Detailed and focused postmortem service
Phase 3(a): Outage Mitigation
Outcome: New capabilities to reduce mean time to identification of outages
Phase 3(b): Outage Prevention
Outcome: New capabilities to reduce mean time to resolution of outages
Phase 3(c): Continuing Outage Prevention
Outcome: Contextualized data platform to reduce and prevent outages

Phase 1: Interview Dev and Ops Teams
Discuss the following topics (Production, Monitoring, DevOps):
• Which production alerts or incidents require postmortem?
• How is knowledge shared today between Ops and Dev teams?
• How do you allocate ownership for fixing bugs after an outage?
• What is the actionable learning process after outage investigation?
• What are the communication channels?
• Which monitoring and alerting systems are being used?
• Which metrics are you using to measure continuous improvement?
• What KPIs are you using?
• What data do you log?
• What are the main problems you see today in your production deployment?
• Can you specify any common or unusual patterns (dependency on user traffic, etc.)?
• Across how many data centers and cloud providers is the code deployed?

Phase 2: Outage Understanding
Immediate Changes
• Postmortem format should include four main components and not take too much time to
complete:
– Description of the outage
– Timeline of the events that identify the sequence of what actually happened
– Contributing conditions analysis: why the outage occurred and what contributed to it
– Recommendations to prevent the outage in the future
• The company’s greatest asset is its people. We need to make sure that the engineers/ops feel
comfortable sharing the relevant information so that we can better conduct root cause analysis
• Actionable learning and ownership:
– Assign tasks to team members and track progress (field ticket/bug id)
– Update playbook (GitHub/wiki) depending on the recommendations
– Encourage discussion between engineering and ops teams in live chat rooms
Goal: Make sure postmortem focuses on the process and the technology, not finding
who to blame; ensure that data allows for actionable learning process
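
The four-component postmortem format above could be captured as a small record type. This is a minimal sketch, not the actual form used by the service: the class names, field names and the `missing_components` helper are all hypothetical, chosen only to mirror the bullets on this slide. Note that the record tracks owners of follow-up tasks, not who caused the outage.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEvent:
    timestamp: datetime
    description: str


@dataclass
class ActionItem:
    owner: str
    task: str
    ticket_id: str = ""  # hypothetical field for the ticket/bug ID


@dataclass
class Postmortem:
    # The four main components named on the slide.
    description: str
    timeline: list[TimelineEvent] = field(default_factory=list)
    contributing_conditions: list[str] = field(default_factory=list)
    recommendations: list[str] = field(default_factory=list)
    # Follow-up tasks for actionable learning and ownership.
    action_items: list[ActionItem] = field(default_factory=list)

    def missing_components(self) -> list[str]:
        """Return the names of the main components that are still empty."""
        required = {
            "description": self.description.strip(),
            "timeline": self.timeline,
            "contributing_conditions": self.contributing_conditions,
            "recommendations": self.recommendations,
        }
        return [name for name, value in required.items() if not value]

    def sorted_timeline(self) -> list[TimelineEvent]:
        """Return the events in chronological order."""
        return sorted(self.timeline, key=lambda event: event.timestamp)
```

A form built on such a record could refuse to close a postmortem while `missing_components()` is non-empty, which keeps the format short but complete.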

Priorities for the Team (Backend/UI and Data Science)
• Expanding the functionalities of the service to:
– Assign ownership and prioritize tasks
– Automatically open JIRA ticket to track the progress
– Update production launch readiness checklist (optional)
– Tag events (data center, device, etc.)
• Adding screenshot of graphs to the form
• Visualizing events that lead to outage on timeline
• Storing event timelines
• Exploring option to use monitoring tools (Ganglia/CloudWatch) API to pull metric data
• Reviewing recent outage data to look for patterns

Mockups
Timeline visualization of events during an outage investigation

Phase 3(a): Outage Mitigation
• We should be able to better investigate outages with the PostMortem service
– Simultaneously analyzing multiple timelines of previous outages (historical data) can help to
identify patterns and improve MTTI and MTTR
– If an outage event sequence is repeated, we should make sure that the postmortem
recommendations are better implemented
– Sharing knowledge, graphs and reports from the PostMortem service can improve
collaboration between teams
• We will be designing an open API platform to collect and analyze data (network, databases, APM
metrics, servers, system, logs, CDN) across all domains from all our monitoring systems into a
single place
• We will start exploring multiple analytics areas (baselining, correlation, trending, outlier and
anomaly detection) on time-series data and can expand to include categorical data
• We will set bi-monthly meetings to share information and get feedback from our internal
customers in order to learn from recent outages and communicate our progress
Goal: Expand the postmortem process with new tools to reduce the time spent on
identifying and investigating an outage. This phase will also involve designing the
advanced platform
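
One simple way to check whether an outage event sequence repeats is to compare ordered lists of event tags. The sketch below uses Python's standard `difflib.SequenceMatcher`. It is an illustration under assumptions, not the PostMortem service's method: the function names and the event tags are hypothetical, and timelines are reduced to tag sequences with timestamps ignored.

```python
from difflib import SequenceMatcher


def timeline_similarity(events_a, events_b):
    """Return a similarity ratio in [0, 1] between two ordered lists of event tags."""
    return SequenceMatcher(None, events_a, events_b, autojunk=False).ratio()


def most_similar_outage(current, history):
    """Return (outage_id, similarity) for the historical outage closest to `current`.

    `history` maps an outage ID to its ordered list of event tags.
    """
    if not history:
        raise ValueError("history must contain at least one outage")
    scored = (
        (outage_id, timeline_similarity(current, events))
        for outage_id, events in history.items()
    )
    return max(scored, key=lambda pair: pair[1])
```

A high similarity to a past outage suggests revisiting that outage's recommendations to see whether they were actually implemented.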

• Designing and implementing
platform and data pipeline to
collect, analyze and store
timestamped numerical data
• Automating historical outage
timelines comparison
• Adding reporting system and
option to share analysis
insights
• Tracking system of open tasks
from previous outages
• Examining baseline creation
for production
• Initial work on correlation
analysis across multiple
domains (PCA, etc.)
• Exploring open source
projects (Netflix, Twitter,
Etsy) for outlier and
anomaly detection
• Reviewing trending
algorithms
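
As a starting point for baselining and outlier detection on a single metric, the sketch below uses the median as the baseline and the median absolute deviation (MAD) as the spread. This is a generic textbook technique (the modified z-score, with the commonly used 0.6745 constant and 3.5 cutoff), not the approach of any of the open source projects mentioned above. The function name is hypothetical.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of points whose modified z-score exceeds `threshold`.

    The baseline is the median of `values`; the spread is the median
    absolute deviation (MAD) from that baseline.
    """
    baseline = median(values)
    deviations = [abs(v - baseline) for v in values]
    mad = median(deviations)
    if mad == 0:
        # No spread to scale by; this sketch reports nothing
        # rather than dividing by zero.
        return []
    return [
        i for i, d in enumerate(deviations)
        if 0.6745 * d / mad > threshold
    ]
```

Median and MAD are used here instead of mean and standard deviation because a single large spike would otherwise inflate the baseline it is being compared against.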

Mockups
Presenting multiple timelines of previous outages

Phase 3(b): Outage Prevention
• We should work with other teams to identify the business’s KPIs and then determine which
metrics can be collected to create and monitor those KPIs. Some example KPIs:
– Availability, latency, HTTP error codes (4xx, 5xx), user experience/number of users/revenue, etc.
• As we are moving forward with the new monitoring platform, it’s important to see if we
are improving these three parameters:
– Mean Time to Identification (MTTI)
– Mean Time to Resolution (MTTR)
– Number of outages
• We will focus on data quality and stress the importance of logging to the engineering
teams because the results of our analytics engine (for example correlating infrastructure
metrics related to end user experience with our mobile app) depend on the data we have
• We will keep automating our analytics engine to ensure that the platform is scalable and
not built on top of pre-defined patterns or rules
Goal: Improve data collection, processing, normalization and correlation capabilities
across the environments and data sources
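
To track the three parameters above over time, MTTI and MTTR can be computed from recorded outage timestamps. A minimal sketch follows; the `Outage` record and its field names are hypothetical, and it assumes both durations are measured from the start of the outage (some teams measure resolution time from identification instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    started: datetime
    identified: datetime
    resolved: datetime


def mean_duration(durations):
    """Average a collection of timedelta values."""
    durations = list(durations)
    if not durations:
        raise ValueError("at least one outage is required")
    return sum(durations, timedelta()) / len(durations)


def mtti(outages):
    """Mean Time to Identification: start of outage until it is identified."""
    return mean_duration(o.identified - o.started for o in outages)


def mttr(outages):
    """Mean Time to Resolution: start of outage until it is resolved (assumed)."""
    return mean_duration(o.resolved - o.started for o in outages)
```

The number of outages is simply `len(outages)` for the chosen period, so all three parameters can come from the same postmortem records.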

• Building scalable and stable
platform to ingest data from
multiple sources
• Visualization of results:
– beautiful dashboards
– trends
– correlations
• Alerting based on trends
• Implementing better data
flow and sharing (RBAC)
• Implementing trends
based on time-series data
• Implementing and
evaluating results of
running metrics
correlation on-demand
• Testing baselines and anomaly
detection (AD) using ROC curves
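
Alerting based on trends, listed among the priorities above, can be illustrated with a least-squares line fitted to recent samples of a metric and projected forward to a threshold. This is a simplified sketch that assumes equally spaced samples and a roughly linear trend; the function names are hypothetical.

```python
def linear_trend(values):
    """Least-squares slope and intercept for equally spaced samples."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def steps_until_threshold(values, threshold):
    """Estimate how many more samples until the fitted line reaches `threshold`.

    Returns 0.0 if the fitted value is already at or above the threshold,
    and None if the trend is flat or moving away from it.
    """
    slope, intercept = linear_trend(values)
    current = intercept + slope * (len(values) - 1)
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope
```

An alert rule could fire when the estimate drops below some lead time, for example when disk usage is projected to reach 90% within the next few samples.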

Logging Practice
• Log everything: this will let us use every
customer action or internal transaction to gain
insights into what’s working and what’s not
• Assign transaction ID (session ID for example)
through the app server for every transaction,
expediting the investigation process
• Collect logs into our log management system;
later alerts will be streamed to the new
platform
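
The transaction ID practice above can be sketched with Python's standard `logging` and `contextvars` modules: an ID is set once per transaction and a filter stamps it on every log record. The names `transaction_id`, `TransactionIdFilter`, `build_logger` and `handle_request` are hypothetical, and a random UUID stands in for whatever ID the app server would actually assign.

```python
import contextvars
import logging
import uuid

# Holds the ID of the transaction currently being processed.
transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionIdFilter(logging.Filter):
    """Attach the current transaction ID to every log record."""

    def filter(self, record):
        record.transaction_id = transaction_id.get()
        return True


def build_logger(name="app", stream=None):
    """Create a logger whose lines carry a txn=<id> field."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s txn=%(transaction_id)s %(message)s"
    ))
    handler.addFilter(TransactionIdFilter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def handle_request(logger, action):
    """Process one transaction; every line logged inside shares the same ID."""
    token = transaction_id.set(uuid.uuid4().hex)
    try:
        logger.info("start %s", action)
        logger.info("done %s", action)
        return transaction_id.get()
    finally:
        # Restore the previous value so IDs never leak between transactions.
        transaction_id.reset(token)
```

During an investigation, searching the log management system for a single `txn=` value then returns every line belonging to that transaction.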

API Monitoring
To enrich the data, log each API call and monitor
the following information:
– Error code rate (e.g., authorization failures)
– Latency (90th, 95th percentile)
– Dependencies on 3rd party APIs as time spent on
external services
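
The three API measurements above can be derived from per-call log records. The sketch below is illustrative only: the `ApiCall` fields are hypothetical, authorization failures are assumed to be HTTP 401/403 responses, and percentiles use the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class ApiCall:
    status: int               # HTTP status code returned to the caller
    latency_ms: float         # total time spent serving the call
    external_ms: float = 0.0  # portion spent waiting on 3rd party APIs


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(calls):
    """Summarize error rates, latency percentiles and external-service share."""
    if not calls:
        raise ValueError("calls must not be empty")
    latencies = [c.latency_ms for c in calls]
    total_latency = sum(latencies)
    total_external = sum(c.external_ms for c in calls)
    return {
        "error_rate": sum(c.status >= 400 for c in calls) / len(calls),
        "auth_failure_rate": sum(c.status in (401, 403) for c in calls) / len(calls),
        "p90_ms": percentile(latencies, 90),
        "p95_ms": percentile(latencies, 95),
        "external_share": total_external / total_latency if total_latency else 0.0,
    }
```

Percentiles are reported rather than averages because a small number of very slow calls can hide behind a healthy-looking mean.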

Phase 3(c): Continuing Outage Prevention
• At this point our platform is already contributing to outage mitigation:
– Data across all domains is collected, analyzed and visualized
– Easier to share information based on historical data
– Trends on time-series data allow us to predict earlier whether something
may go wrong, preventing outages
• Improving data collection, processing, normalization and centralizing
monitoring data sources is an ongoing process. Any new sources can
enrich the data and help adjust the algorithms
• This phase will be critical in evaluating the machine learning
algorithms and making sure we have a robust alerting platform (false
positives and true positives) to reduce the number of outages
Goal: Converge the capabilities we have built towards a better system to reduce the
number of outages
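
Evaluating the alerting platform in terms of false positives and true positives can be done by scoring labeled historical windows at several alert thresholds, which yields the points of an ROC curve. This is a minimal sketch under assumptions: each window has an anomaly score and a label (1 if it preceded a real outage, 0 otherwise), and the function name is hypothetical.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, false_positive_rate, true_positive_rate) tuples.

    A window raises an alert when its score is at or above the threshold.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= t)
        points.append((t, fp / negatives, tp / positives))
    return points
```

Lowering the threshold catches more real problems but raises the false positive rate, so the chosen operating point is a trade-off between missed outages and alert fatigue.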

• Implementing outlier and
anomaly detection and
evaluating performance
• Testing predictive analytics
– alerting based on sequence
of events (divergence from
normal baseline) that may
lead to an outage
• Open-sourcing the new anomaly
detection (AD) framework
• Improving the platform
infrastructure
• Monitoring the performance of
the platform with the new
solution
• Visualizing outlier and anomaly
detection results
• Providing visibility into potential
problems (predictive)
• Configuring chat rooms, emails,
teams and owners to share
information/alerts
• Planning a failure automation
process

Long-Term Product Vision
Automation
Automating workflow for relevant teams and advancing
failure automation will be needed for the growing number
of employees and the increasingly complex infrastructure.
Collaboration
Utilizing war room will make sure that all relevant teams
are involved and monitoring together. An enhanced
onboarding process will be needed for new engineers to
understand potential issues with production.
Analytics
Reducing the massive data stream to a more contextualized
view for faster escalation. Clustering, predictive analytics,
and a recommendation capability will be the core for the
success of the solution.

Conclusions
Deploying this three-phase approach will help to:
• Contextualize insights across all domains to make sure the
best user experience is continually provided
• Accelerate time required to investigate and resolve
production problems, leading to increased uptime
• Increase productivity: right information gets to the right
people at the right time
