AI Helps Observe Decentralised Systems

🤖
HOW AI HELPS OBSERVE
DECENTRALISED SYSTEMS
Dominic Wellington | @dwellington

FULL DISCLOSURE
I work for a vendor (Moogsoft)
…but this is not a product pitch
We are hiring!

We are living in a different world
from the one our systems and processes
were designed for

OLD WORLD
Static Environment
• Relatively small number of devices
• Slow rate of growth
• Low frequency of change (deployments)
Manageable Alert Volumes
• Problem is extracting enough information
• Relatively easy to understand

NEW WORLD
Fast-growing, fast-changing environment
• More and more devices
• More and more frequent releases
• More and more automation
Massive Alert Volumes
• From monitoring to observability
• Increasing specialisation

WE SPEND MORE TIME MANAGING IT
THAN USING IT

“The pace of change has never been this fast,
and it will never be this slow again.”
–Justin Trudeau, Prime Minister of Canada, Davos WEF 2018

COMPLEXITY
• Compute
• Network
• Storage
• Bare metal
• Hypervisor
• Private cloud
• Public cloud
• Hybrid cloud
• Virtual private cloud
• Software-defined networking
• Software-defined data center
• Software-defined everything
• Containers
• Serverless
• IaaS
• PaaS
• SaaS
• DevOps
Why every 9
costs 10 times more
than the last one
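
The arithmetic behind that slide can be sketched as follows. This is only an illustration of the downtime budget, assuming a 365-day year; the function name is hypothetical, and the link between a tenfold smaller budget and tenfold cost is the slide's claim, not something the code demonstrates.

```python
# Allowed downtime per year for a given number of "nines" of availability.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted at `nines` nines of availability."""
    unavailability = 10 ** -nines  # e.g. 3 nines (99.9%) -> 0.001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):10.2f} min/year")
```

Each extra nine divides the downtime budget by ten, so every step leaves one tenth of the room for error that the previous step had.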

LIVING ON THE EDGE
• What happens on the network edge is more & more important
• But! The edge is really far away
• Unreliable connectivity, limited bandwidth, constant flux
• There’s always something going wrong somewhere
• One device or a region? One production line or a factory?

Single faults
no longer cause impacts
Fault tolerance
does not mean
Zero Incidents

WE NEED TO CHANGE MONITORING
BECAUSE SYSTEMS HAVE CHANGED

A MAZE OF TWISTY SERVICES, ALL ALIKE

Booking software outages:
Passengers across world unable
to board planes
System outage:
Customers unable to use ATMs
to withdraw cash
4-hour outage:
Co-workers & teammates
unable to communicate
Worldwide outage on New Year’s Eve:
Family members unable to exchange
New Year greetings
🏦✈
📱💬

“Let’s have a good old-fashioned blamestorm”

THE STATISTICS SAY IT ALL
• 74% of incidents detected by end users before Support is aware
• >62% of the time the Application is not the cause of the Incident
• >36% of Incident Tickets escalated
• >32% of Tickets reassigned across silos

😱
From an informal attendee survey at
SREcon 18

🤔
SO HOW DO WE FIX MONITORING?

MONITORING
🔍
• Periodic polling
• Filtered
• Late addition
• Incident-driven

HIDDEN ASSUMPTIONS
• Information is expensive and valuable
• Faults are easy to detect (Byzantine Fault)
• All failure conditions are knowable

DASHBOARDS 🤮
• The internal health of the system
is irrelevant
• Individual requests are what
users care about
• Every dashboard is an artefact of
a past failure

OBSERVABILITY
👁
• Continuous stream
• High-cardinality
• Built in to infrastructure & apps
• Insight-driven

REALISATIONS
• Information is cheap, only valuable if queried
• User experience is not an afterthought
• …in fact it’s a key diagnostic information source
(just don’t treat your users as canaries) 🐤

INCIDENT-DRIVEN
—
RESOURCE
CONSUMPTION
INSIGHT-DRIVEN
—
ACTIONABLE
UNDERSTANDING

HOW TO FIND ACTIONABLE INSIGHTS?
PUT EVERYTHING IN A DATA LAKE!

Objects in rear view mirror
may be less relevant than they appear

“There are known knowns; there are things we know we know.
We also know there are known unknowns; that is to say we
know there are some things we do not know.
But there are also unknown unknowns –
the ones we don't know we don't know.
It is the latter category that tend to be the difficult ones.”
–Donald Rumsfeld

MONITORING AS IT IS
* slaps roof of NOC *
this bad boy can fit so many monitoring tools in it
-
🤷

MONITORING AS IT SHOULD BE
🤖
😕 AI? MACHINE LEARNING? 🤔
• Stanford definition: “Machine learning is the science of getting
computers to act without being explicitly programmed.”
• AI in IT Ops: bring interesting information to the attention of human
operators – without having to define it beforehand
AI
Machine learning
Deep learning

WHERE TO USE AI IN IT OPS?
• Ingestion: reduce noise and false alarms
• Correlation: identify related events across domains, avoid duplication
of effort and missed signals
• Collaboration: intelligent teaming, root cause analysis, knowledge
capture
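
The ingestion and correlation steps above can be sketched as a fixed-rule baseline. This is not machine learning and not any vendor's method: it is a minimal sketch assuming hypothetical event fields (`ts`, `source`, `check`) and a hypothetical 60-second window, showing the kind of work a learned approach would take over.

```python
def deduplicate(events):
    """Ingestion: collapse repeated (source, check) events into one alert with a count."""
    alerts = {}
    for ev in events:
        key = (ev["source"], ev["check"])
        if key in alerts:
            alerts[key]["count"] += 1
            alerts[key]["last_seen"] = ev["ts"]
        else:
            alerts[key] = {**ev, "count": 1, "last_seen": ev["ts"]}
    return list(alerts.values())


def correlate(alerts, window_s=60):
    """Correlation: group alerts first seen within window_s seconds of the previous one."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if groups and alert["ts"] - groups[-1][-1]["ts"] <= window_s:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


# Hypothetical events: timestamps are seconds since an arbitrary start.
events = [
    {"ts": 0, "source": "db01", "check": "disk_latency"},
    {"ts": 5, "source": "db01", "check": "disk_latency"},
    {"ts": 20, "source": "web03", "check": "http_5xx"},
    {"ts": 900, "source": "edge17", "check": "link_down"},
]
alerts = deduplicate(events)
groups = correlate(alerts)

if __name__ == "__main__":
    for i, group in enumerate(groups, start=1):
        print(i, [(a["source"], a["check"], a["count"]) for a in group])
```

Four raw events become three alerts and two candidate incidents. The fixed window is the weak point: it has to be chosen beforehand, which is exactly the kind of predefined rule the talk argues against.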

TEACHING THE MACHINE
• Inputs matter: choose the right feature vectors
• Regression problems: predict a value on a continuous distribution
• Classification vs clustering: assign to a known set of categories vs discover the groupings
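
To make the feature-vector and clustering points concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the alert messages are invented, the "feature vector" is just a set of lowercase words, and the greedy Jaccard-similarity grouping is a simple stand-in for whichever clustering algorithm a real system would use.

```python
def tokens(message):
    """Feature extraction: the set of lowercase words in one alert message."""
    return set(message.lower().split())


def jaccard(a, b):
    """Similarity of two token sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(messages, threshold=0.5):
    """Greedy single-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_tokens, member_messages)
    for msg in messages:
        feats = tokens(msg)
        for seed, members in clusters:
            if jaccard(seed, feats) >= threshold:
                members.append(msg)
                break
        else:
            # No existing cluster matched, so this message seeds a new one.
            clusters.append((feats, [msg]))
    return [members for _, members in clusters]


# Hypothetical alert messages.
messages = [
    "disk latency high on db01",
    "disk latency high on db02",
    "link down on edge17",
    "link down on edge18",
]
result = cluster(messages)

if __name__ == "__main__":
    for members in result:
        print(members)
```

No categories were defined in advance: the two groups emerge from the data. A classifier would instead need labelled examples of each category, which is the distinction the slide draws.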

AIOps
A New Framework for IT Ops
• Proactive insight
• Intelligent notification
• Intelligent collaboration
• Workflow automation
• Causal analysis
• Decision support
Ref: Innovation Insight for Algorithmic IT Operations Platforms

IN PRACTICE:
• This is IT Ops: speed matters, work in real time
• You don’t know what you need to know
• AI is a tool, not magic
• Process is how you make sure it works for users
🧠
⚡

WHY ARE WE NOT DOING THIS
ALREADY?
The Greek triad:
• Fear
• Honour
• Interest

WHY YOU SHOULD START RIGHT AWAY
These tools and
processes are incredible
force multipliers

🤗
THANK YOU!
Dominic Wellington | @dwellington
