Autonomous Incident and Root Cause Detection

Z E B R I U M
Autonomous Incident & Root Cause Detection
Using ML to catch and characterize incidents faster with zero configuration
© 2020 by Zebrium, Inc.
Larry Lancaster, Founder/CTO, Zebrium
Scott McAllister, Developer Advocate, PagerDuty
Gavin Cohen, VP Product, Zebrium
Zebrium makes Forbes AI50 2020 list: America’s Most Promising Artificial Intelligence Companies.

The Problem: Humans Don’t Scale With Complexity
Today’s tools put the burden on the human to spot outliers & know what to search in logs
                      Last Decade   Today
INCIDENT IMPACT       1x            100x
INCIDENT COMPLEXITY   1x            100x
HUMAN BANDWIDTH       1x            1x
SCENARIO                          BEST IN CLASS APPROACH
Known cause, known symptom        Near-total automation
Known cause or symptom            Partial automation
Unknown cause, unknown symptom    Human search
But humans can’t scale with complexity

The Solution: ML Enabled Incident Recognition
Diagram: three tiers, plotted by degree of automation against speed to resolve “Unknown” incidents
• Incident Recognition: auto detection of unknown incidents, with root cause
• Anomaly Detection: spot outliers & pattern changes in logs & metrics
• Logs & Metrics (Search, Chart, Alert): aggregate, chart, search, build alerts
Tools shown on the diagram: Zebrium, Elastic, Sumologic, DataDog, Prometheus

Our Solution: Autonomous Monitoring
Step 1: <5 mins to deploy (2 Helm commands)
Step 2: 5 mins to learn structures (zero configuration)
Step 3: <1 hr to incident detection (alerts via Slack, email, etc.)
Optional incident signal from PagerDuty or any other tool

Zebrium: How It Works
Raw log stream:
1 ML-driven parsing & event categorization
2 Anomaly detection on event types
3 Pattern recognition on correlated anomalies
4 Auto incident creation
5 Optional user feedback
Metrics exporters: preserve time stamps, labels, types; anomaly detection on metrics
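The first three stages of this pipeline can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Zebrium's implementation: regex masking stands in for ML-driven parsing, a simple rarity rule stands in for anomaly detection, and a fixed time window stands in for pattern recognition on correlated anomalies. All function names, thresholds and log lines are hypothetical.

```python
import re
from collections import Counter

def event_type(line: str) -> str:
    """Assumed stand-in for ML parsing: mask variable tokens so that
    lines with the same shape map to the same event type."""
    masked = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    return re.sub(r"\d+", "<NUM>", masked)

def rare_events(events, max_count=1):
    """Flag events whose type occurs at most max_count times (assumed rule)."""
    counts = Counter(event_type(msg) for _, msg in events)
    return [(ts, msg) for ts, msg in events
            if counts[event_type(msg)] <= max_count]

def correlate(anomalies, window=30):
    """Group anomalies that occur within `window` seconds of the previous one."""
    groups = []
    for ts, msg in sorted(anomalies):
        if groups and ts - groups[-1][-1][0] <= window:
            groups[-1].append((ts, msg))
        else:
            groups.append([(ts, msg)])
    # Only clusters of two or more anomalies become candidate incidents.
    return [g for g in groups if len(g) >= 2]

# Made-up log events: (timestamp in seconds, message)
logs = [
    (0, "GET /health 200 in 3 ms"),
    (10, "GET /health 200 in 4 ms"),
    (20, "GET /health 200 in 2 ms"),
    (25, "connection to db 7 refused"),
    (31, "worker 12 restarting after OOM kill"),
    (40, "GET /health 200 in 5 ms"),
    (500, "cache warmed in 90 ms"),
]

incidents = correlate(rare_events(logs))
for group in incidents:
    print("candidate incident:", [msg for _, msg in group])
```

In this toy data the two rare events at 25 s and 31 s are close together and form one candidate incident, while the isolated rare event at 500 s is ignored. A real system would need far more robust parsing and statistics than this sketch.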

Incident: an unplanned disruption or degradation of service that is actively affecting customers’ ability to use the product.
@stmcallister

PEACETIME
WARTIME

NORMAL
EMERGENCY

Rachael, I’d like you to investigate
the increased latency, try to find the
cause. I’ll come back to you in 5
minutes. Understood?
Understood.

The goal is to handle the situation in a
way that limits damage and reduces
recovery time and costs.

Common Pitfalls in Troubleshooting
• Spending time on symptoms that aren’t relevant
• Latching on to causes of past problems
• Hunting down connections that are coincidences
Source: https://landing.google.com/sre/sre-book/chapters/effective-troubleshooting/

Reviews are
Opportunities to Learn

Incident Response Without Zebrium
Logs & Metrics: scan dashboards, drill down, search logs
INCIDENT CREATION SLOW HUMAN-DRIVEN
INCIDENT RESOLUTION
Chart: MTTR vs. service complexity (series: Standalone)
MTTR grows with complexity

Incident Response With Zebrium
Logs & Metrics: native search, Grafana for analytics
INCIDENT CREATION FAST AUTONOMOUS
INCIDENT RESOLUTION
Shown: Auto Incident Breakdown; Incident Events & Root Cause; Metrics Event Timeline
Chart: MTTR vs. service complexity (series: Traditional, With Zebrium)
Significant MTTR reduction

How Zebrium is Integrated with PagerDuty
Zebrium ML
1 Incident alert from any source (any monitoring, APM, logger, observability, help desk or other tool)
2 PagerDuty incident created
3 Zebrium augments incident with root cause
4
Incident & Root Cause
Metrics Event Timeline
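One way an external tool could attach root cause text to an existing PagerDuty incident is the incident notes endpoint of the PagerDuty REST API. The slides do not say which mechanism Zebrium uses, so the sketch below is only an assumption about what step 3 might look like. The helper name, incident ID, token, email address and log lines are all hypothetical placeholders, and the request is built but not sent.

```python
import json

PAGERDUTY_API = "https://api.pagerduty.com"

def build_note_request(incident_id, root_cause_lines, api_token, from_email):
    """Hypothetical helper: return (url, headers, body) for
    POST /incidents/{id}/notes on the PagerDuty REST API."""
    content = "Possible root cause events:\n" + "\n".join(root_cause_lines)
    url = f"{PAGERDUTY_API}/incidents/{incident_id}/notes"
    headers = {
        "Authorization": f"Token token={api_token}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
        # PagerDuty requires the email of a valid user for this call.
        "From": from_email,
    }
    body = json.dumps({"note": {"content": content}})
    return url, headers, body

# Placeholder values for illustration only.
url, headers, body = build_note_request(
    incident_id="PABC123",
    root_cause_lines=[
        "connection to db 7 refused",
        "worker 12 restarting after OOM kill",
    ],
    api_token="YOUR_API_TOKEN",
    from_email="oncall@example.com",
)
print(url)
print(body)
```

The returned tuple could then be passed to any HTTP client. Keeping request construction separate from sending makes the payload easy to inspect and test without touching a live PagerDuty account.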

Real-life Incident Detection Examples
“Our cloud provider made an API change which caused problems downstream. Zebrium not only detected the issue, but also helped us debug it quickly.”
- Aran Khanna, CEO & Co-founder @ reserved.ai
“We were getting very odd error messages that didn’t seem to match our code. Right after installing, Zebrium detected an incident that let me quickly figure out the actual issue.”
- Yosef Deray, Lead Software Engineer @ Iralogix
“We used Litmus Chaos Engine for K8s to induce failures in our OpenEBS platform. Zebrium not only automatically detected every single failure, but also identified its root cause.”
- Murat Karslioglu, VP of Product Management @ MayaData

Examples Of Incidents Caught By ML
Bugs
• Undefined variables, undefined index, mismatched data structures
• Missing function arguments, input validation failed
• SQL syntax errors
Inter-Service Interactions
• Auth & connection failures
• Rate limit failures
• API version mismatch errors
Container Orchestration
• Missing containers, pods
• CrashLoopBackOffs, failed leader elections
Kafka, Databases
• Kafka broker pod failure
• SQL schema errors
• Database connectivity issues
Security
• Excessive auth failures
• Invalid root login attempts
• WAF attacks
• XSS vulnerabilities
Infrastructure
• OOM process failures
• Out of space, CPU saturation
• Object storage failures
• Network corruption
• VM failovers
Note: All issues caught by untrained ML (no pre-built rules required)

Contact:
Larry Lancaster – larry@zebrium.com
Scott McAllister – @stmcallister
Gavin Cohen – gavin@zebrium.com
Get started for free:
www.zebrium.com/sign-up