Z E B R I U M
Autonomous Incident & Root Cause Detection
Using ML to catch and characterize incidents faster with zero configuration
© 2020 by Zebrium, Inc.
Larry Lancaster, Founder/CTO, Zebrium
Scott McAllister, Developer Advocate, PagerDuty
Gavin Cohen, VP Product, Zebrium
Zebrium makes Forbes AI50 2020 list – America’s Most Promising Artificial Intelligence Companies.
The Problem: Humans Don’t Scale With Complexity
Today’s tools put the burden on the human – to spot outliers & know what to search in logs

                      Last Decade   Today
INCIDENT IMPACT       1x            100x
INCIDENT COMPLEXITY   1x            100x
HUMAN BANDWIDTH       1x            1x

SCENARIO                          BEST IN CLASS APPROACH
KNOWN CAUSE, KNOWN SYMPTOM        NEAR-TOTAL AUTOMATION
KNOWN CAUSE OR SYMPTOM            PARTIAL AUTOMATION
UNKNOWN CAUSE, UNKNOWN SYMPTOM    HUMAN SEARCH

But humans can’t scale with complexity
The Solution: ML Enabled Incident Recognition
[Chart: degree of automation vs. speed to resolve “Unknown” incidents]
• Logs & Metrics: Search, Chart, Alert (Elastic, Prometheus): Aggregate, chart, search, build alerts
• Anomaly Detection (Elastic, Sumologic, DataDog): Spot outliers & pattern changes in logs & metrics
• Incident Recognition (Zebrium): Auto detection of unknown incidents, with root cause
Our Solution: Autonomous Monitoring
Step 1: <5 mins to deploy (2 Helm commands)
Step 2: 5 mins to learn structures (zero configuration)
Step 3: <1 hr to incident detection (alerts via Slack, email, etc.)
Optional: incident signal from PagerDuty or any other tool
Our Technology
TECH DECK
Zebrium: How It Works
Logs (raw log stream):
1. ML-driven parsing & event categorization
2. Anomaly detection on event types
3. Pattern recognition on correlated anomalies
4. Auto incident creation
5. Optional user feedback
Metrics (metrics exporters):
• Preserve time stamps, labels, types
• Anomaly detection on metrics
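The stages above can be illustrated with a toy sketch. This is not Zebrium's algorithm, which the slides do not describe in detail. It assumes a deliberately naive approach: mask numeric tokens to derive an event type, treat event types absent from a baseline as anomalies, and report anomalies of different types that occur close together in time as one incident candidate. All function names, the masking rule, and the thresholds are hypothetical, and the optional user feedback stage is not modeled.

```python
import re
from collections import Counter

# Hypothetical masking rule: numbers, dotted numbers (IPs, versions) and hex
# values are treated as variables, so structurally equal lines share a type.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def event_type(line):
    """Stage 1 (toy): derive an event type by masking variable tokens."""
    return _VARIABLE.sub("<*>", line).strip()


def rare_events(baseline, live, max_baseline_count=0):
    """Stage 2 (toy): keep live events whose type is rare in the baseline.

    `baseline` is a list of log lines; `live` is a list of
    (timestamp_seconds, line) tuples.
    """
    seen = Counter(event_type(line) for line in baseline)
    return [(ts, event_type(line)) for ts, line in live
            if seen[event_type(line)] <= max_baseline_count]


def group_incidents(anomalies, window=60, min_types=2):
    """Stages 3-4 (toy): cluster anomalies separated by at most `window`
    seconds; a cluster with at least `min_types` distinct event types
    becomes an incident candidate."""
    incidents, cluster = [], []
    for ts, etype in sorted(anomalies):
        if cluster and ts - cluster[-1][0] > window:
            if len({e for _, e in cluster}) >= min_types:
                incidents.append(cluster)
            cluster = []
        cluster.append((ts, etype))
    if len({e for _, e in cluster}) >= min_types:
        incidents.append(cluster)
    return incidents


# Made-up example data, for illustration only.
baseline = ["user 1 logged in", "user 2 logged in", "request 10 took 5 ms"]
live = [
    (100, "user 3 logged in"),
    (130, "connection to 10.0.0.5 refused"),
    (150, "OOM killed process 4312"),
    (900, "disk 2 remounted read-only"),
]
incidents = group_incidents(rare_events(baseline, live))
print(incidents)
```

In this made-up data the two unfamiliar events at 130 s and 150 s form one incident candidate, while the isolated event at 900 s does not. Requiring several distinct anomalous event types in a short window is one simple way to separate a correlated failure from a one-off rare line.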
DEMO TIME: ONE
An unplanned disruption or degradation of service that is actively affecting customers’ ability to use the product.
@stmcallister
PEACETIME / WARTIME
NORMAL / EMERGENCY
OK / NOT OK
DON’T PANIC
“Rachael, I’d like you to investigate the increased latency, try to find the cause. I’ll come back to you in 5 minutes. Understood?”
“Understood.”
The goal is to handle the situation in a way that limits damage and reduces recovery time and costs.
Common Pitfalls in Troubleshooting
• Spending time on symptoms that aren’t relevant
• Latching on to causes of past problems
• Hunting down connections that are coincidences
Source: https://landing.google.com/sre/sre-book/chapters/effective-troubleshooting/
Reviews are Opportunities to Learn
Incident Response Without Zebrium
Flow: Logs & Metrics → INCIDENT CREATION (!) → SLOW HUMAN-DRIVEN INCIDENT RESOLUTION (scan dashboards, drill down, search logs)
[Chart: MTTR vs. service complexity (Standalone)] MTTR GROWS w/ COMPLEXITY
Incident Response With Zebrium
Flow: Logs & Metrics → INCIDENT CREATION (!) → FAST AUTONOMOUS INCIDENT RESOLUTION
A: Auto Incident Breakdown (incident events & root cause, metrics, event timeline)
B: Native search, Grafana for analytics
[Chart: MTTR vs. service complexity, Traditional vs. With Zebrium] SIGNIFICANT MTTR REDUCTION
How Zebrium is Integrated with PagerDuty
1. Incident alert from any source (any monitoring, APM, logger, observability, help desk or other tool)
2. PagerDuty incident created
3. Zebrium ML augments incident with root cause
4. Incident & root cause, metrics, event timeline
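The integration flow on this slide, where an incident raised by any tool is augmented with root-cause context, can be sketched roughly as follows. The slides do not show the integration's payloads or API, so the field names (`title`, `created_at`, `root_cause_candidates`), the look-back window, and the function itself are hypothetical. The sketch only shows the idea of selecting anomalous events from shortly before the incident signal and attaching them to the incident.

```python
from datetime import datetime, timedelta


def augment_incident(incident, anomalies, lookback_minutes=30):
    """Return a copy of `incident` with nearby anomalous events attached.

    `incident` is a dict with hypothetical keys "title" and "created_at"
    (ISO 8601 timestamp without a timezone suffix).
    `anomalies` is a list of (ISO 8601 timestamp, message) tuples.
    """
    created = datetime.fromisoformat(incident["created_at"])
    start = created - timedelta(minutes=lookback_minutes)
    # Keep only anomalies inside [created - lookback, created], oldest first.
    related = sorted(
        (datetime.fromisoformat(ts), msg)
        for ts, msg in anomalies
        if start <= datetime.fromisoformat(ts) <= created
    )
    augmented = dict(incident)
    augmented["root_cause_candidates"] = [msg for _, msg in related]
    return augmented


# Made-up example data, for illustration only.
incident = {"title": "Increased latency", "created_at": "2020-06-01T12:00:00"}
anomalies = [
    ("2020-06-01T11:50:00", "db connection refused"),
    ("2020-06-01T10:00:00", "cron job started"),
    ("2020-06-01T11:45:00", "OOM killed process"),
    ("2020-06-01T12:05:00", "late event"),
]
result = augment_incident(incident, anomalies)
print(result["root_cause_candidates"])
```

Ordering the attached events oldest first mirrors the event timeline named on the slide: the earliest anomaly in the window is often the most useful starting point for a responder.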
DEMO TIME: TWO
Real-life Incident Detection Examples
“Our cloud provider made an API change which caused problems downstream. Zebrium not only detected the issue, but also helped us debug it quickly.”
– Aran Khanna, CEO & Co-founder @ reserved.ai

“We were getting very odd error messages that didn’t seem to match our code. Right after installing, Zebrium detected an incident that let me quickly figure out the actual issue.”
– Yosef Deray, Lead Software Engineer @ Iralogix

“We used Litmus Chaos Engine for K8s to induce failures in our OpenEBS platform. Zebrium not only automatically detected every single failure, but also identified its root cause.”
– Murat Karslioglu, VP of Product Management @ MayaData
Examples Of Incidents Caught By ML
Bugs
• Undefined variables, undefined index, mismatched data structures
• Missing function arguments, input validation failed
• SQL syntax errors
Inter-Service Interactions
• Auth & connection failures
• Rate limit failures
• API version mismatch errors
Container Orchestration
• Missing containers, pods
• CrashLoopBackOffs, failed leader elections
Kafka, Databases
• Kafka broker pod failure
• SQL schema errors
• Database connectivity issues
Security
• Excessive auth failures
• Invalid root login attempts
• WAF attacks
• XSS vulnerabilities
Infrastructure
• OOM process failures
• Out of space, CPU saturation
• Object storage failures
• Network corruption
• VM failovers
Note: All issues caught by untrained ML (no pre-built rules required)
Contact:
Larry Lancaster – larry@zebrium.com
Scott McAllister – @stmcallister
Gavin Cohen – gavin@zebrium.com
Get started for free:
www.zebrium.com/sign-up