1) Observability is needed to understand complex, decentralized systems because the environment has changed from static to fast-growing with massive volumes of data.
2) AI and machine learning can help with observability by ingesting continuous streams of high-cardinality data, identifying related events, and providing intelligent collaboration.
3) Implementing AIOps provides benefits like proactive insights, intelligent notifications, causal analysis, and decision support to improve incident management in real time.
AI Helps Observe Decentralised Systems
1. 🤖
HOW AI HELPS OBSERVE
DECENTRALISED SYSTEMS
Dominic Wellington | @dwellington
2. FULL DISCLOSURE
I work for a vendor (Moogsoft)
…but this is not a product pitch
We are hiring!
3.
4. We are living in a different world
from the one our systems and processes
were designed for
5. OLD WORLD
Static Environment
• Relatively small number of devices
• Slow rate of growth
• Low frequency of change (deployments)
Manageable Alert Volumes
• Problem is extracting enough information
• Relatively easy to understand
6. NEW WORLD
Fast-growing, fast-changing environment
• More and more devices
• More and more frequent releases
• More and more automation
Massive Alert Volumes
• From monitoring to observability
• Increasing specialisation
8. – Justin Trudeau, Prime Minister of Canada, Davos WEF 2018
“The pace of change has never been this fast,
and it will never be this slow again.”
9. COMPLEXITY
• Compute
• Network
• Storage
• Bare metal
• Hypervisor
• Private cloud
• Public cloud
• Hybrid cloud
• Virtual private cloud
• Software-defined networking
• Software-defined data center
• Software-defined everything
• Containers
• Serverless
• IaaS
• PaaS
• SaaS
• DevOps
Why every 9
costs 10 times more
than the last one
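A rough way to see the arithmetic behind this slide: each extra nine of availability cuts the yearly downtime budget by a factor of ten. The sketch below only computes that budget; the function name and the 365-day year are assumptions made for illustration, and the claim about cost is the slide's, not something the code derives.

```python
# Yearly downtime budget for N nines of availability (illustrative sketch).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Downtime budget per year for `nines` nines, e.g. 3 means 99.9%."""
    unavailability = 10 ** -nines  # 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in range(1, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):10.2f} min/year")
```

Three nines leaves roughly 526 minutes a year (under nine hours); five nines leaves a little over five minutes.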
10. LIVING ON THE EDGE
• What happens on the network edge is more & more important
• But! The edge is really far away
• Unreliable connectivity, limited bandwidth, constant flux
• There’s always something going wrong somewhere
• One device or a region? One production line or a factory?
17. Booking software outages:
Passengers across world unable
to board planes
System outage:
Customers unable to use ATMs
to withdraw cash
4-hour outage:
Co-workers & teammates
unable to communicate
Worldwide outage on New Year’s Eve:
Family members unable to exchange
New Year greetings
🏦✈
📱💬
20. THE STATISTICS SAY IT ALL
74% of incidents
detected by
end users
before Support
is aware
>62% of the time
the Application
is not the cause
of the Incident
>36%
Incident Tickets
escalated
>32% Tickets
reassigned
across silos
26. HIDDEN ASSUMPTIONS
• Information is expensive and valuable
• Faults are easy to detect (Byzantine Fault)
• All failure conditions are knowable
27. DASHBOARDS 🤮
• The internal health of the system
is irrelevant
• Individual requests are what
users care about
• Every dashboard is an artefact of
a past failure
29. REALISATIONS
• Information is cheap, only valuable if queried
• User experience is not an afterthought
• …in fact it’s a key diagnostic information source
(just don’t treat your users as canaries) 🐤
32. Objects in rear view mirror
may be less relevant than they appear
33. –Donald Rumsfeld
“There are known knowns; there are things we know we know.
We also know there are known unknowns; that is to say we
know there are some things we do not know.
But there are also unknown unknowns –
the ones we don't know we don't know.
It is the latter category that tend to be the difficult ones.”
34. MONITORING AS IT IS
* slaps roof of NOC *
this bad boy can fit so many monitoring tools in it
-
🤷
36. 😕 AI? MACHINE LEARNING? 🤔
• Stanford definition: “Machine learning is the science of getting
computers to act without being explicitly programmed.”
• AI in IT Ops: bring interesting information to the attention of human
operators – without having to define it beforehand
AI
Machine learning
Deep learning
37. WHERE TO USE AI IN IT OPS?
• Ingestion: reduce noise and false alarms
• Correlation: identify related events across domains, avoid duplication
of effort and missed signals
• Collaboration: intelligent teaming, root cause analysis, knowledge
capture
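To make the ingestion and correlation stages concrete, here is a minimal sketch. It is a toy heuristic, not machine learning and not any vendor's algorithm: repeated alerts are dropped if they recur within a time window, and the surviving alerts are grouped into candidate incidents whenever they arrive close together. The `Event` type, both function names, the window sizes and the sample alerts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some reference point
    source: str
    message: str


def deduplicate(events, window=60.0):
    """Ingestion: drop repeats of the same (source, message) within `window` s."""
    last_seen = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.source, ev.message)
        previous = last_seen.get(key)
        last_seen[key] = ev.timestamp
        if previous is None or ev.timestamp - previous > window:
            kept.append(ev)
    return kept


def correlate(events, gap=30.0):
    """Correlation: start a new group whenever there is a quiet gap > `gap` s."""
    groups = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if groups and ev.timestamp - groups[-1][-1].timestamp <= gap:
            groups[-1].append(ev)
        else:
            groups.append([ev])
    return groups


events = [
    Event(0, "switch-1", "link down"),
    Event(5, "switch-1", "link down"),  # repeat, dropped at ingestion
    Event(8, "db-1", "connection timeout"),
    Event(12, "web-1", "HTTP 503"),
    Event(600, "disk-7", "SMART warning"),
]
incidents = correlate(deduplicate(events))
for number, group in enumerate(incidents, 1):
    print(number, [f"{e.source}: {e.message}" for e in group])
```

Grouping purely by time is the crudest possible correlation; it does cross domains (network, database, web) but will also merge unrelated alerts that happen to coincide, which is where learned similarity measures would come in.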
38. TEACHING THE MACHINE
• Inputs matter: choose the right feature vectors
• Regression problems: continuous distribution
• Classification vs clustering: set of categories
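The distinction on this slide can be sketched with two toy models. Regression predicts a continuous value; classification picks one of a known set of categories. The feature vectors (load, error rate), the labels and all numbers below are made up for illustration. Clustering is not shown: it would group the same kind of vectors without being given labels.

```python
from statistics import mean


def fit_line(xs, ys):
    """Regression: least-squares line predicting a continuous value."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_centroid(labelled, point):
    """Classification: return the label whose mean vector is closest."""
    centroids = {
        label: tuple(mean(column) for column in zip(*vectors))
        for label, vectors in labelled.items()
    }
    return min(centroids, key=lambda label: sq_dist(centroids[label], point))


# Regression: latency (ms) as a function of load (requests/s), invented data.
slope, intercept = fit_line([10, 20, 30, 40], [105, 125, 145, 165])
print(slope * 50 + intercept)  # predicted latency at load 50

# Classification: feature vector = (CPU load, error rate), invented data.
labelled = {
    "healthy": [(0.2, 0.01), (0.3, 0.02)],
    "degraded": [(0.9, 0.20), (0.8, 0.30)],
}
print(nearest_centroid(labelled, (0.85, 0.25)))
```

Both models are only as good as the features they are fed, which is the slide's first point: the unscaled vectors here let CPU load dominate the distance, so a real pipeline would normalise its inputs first.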
39. AIOps
A New Framework for IT Ops
• Proactive insight
• Intelligent notification
• Intelligent collaboration
• Workflow automation
• Causal analysis
• Decision support
Ref: Innovation Insight for Algorithmic IT Operations Platforms
40. IN PRACTICE:
• This is IT Ops: speed matters, work in real time
• You don’t know what you need to know
• AI is a tool, not magic
• Process is how you make sure it works for users
🧠
⚡
41. WHY ARE WE NOT DOING THIS
ALREADY?
The Greek triad:
• Fear
• Honour
• Interest