We already trust artificial intelligence to drive our cars, but we still configure thresholds and sift through logs manually. In this talk, Ronny Lehmann, Loom CTO, will discuss how he spent months analyzing modern ops work until he was finally able to extract its common practices, and how we used this understanding to build a machine that complements ops teams: it automates the work that is better suited to machines and leaves to humans only the parts that require humans. What we built saves you the time spent on parsers, on configuring and tuning rules and alerts, on conducting root-cause analysis and triage and, finally, on figuring out what to do.
2. Confidential and Proprietary July 2017
Hi!
Ronny Lehmann
CTO & Founder – Loom Systems
Formerly 8200, BioCatch
Machine-Learning | High-performance Cloud-Computing
@ronnyle_mann
3. Confidential and Proprietary July 2017
Founded in April 2015
30 people (5 in San Francisco)
Bootstrapped for the first 2 years, recently funded
Actively hiring
4. Confidential and Proprietary July 2017
Today’s Big-Data Bottleneck:
You are.
2000’s Big-Data Bottlenecks:
✓ Storing
✓ Querying
✓ Real-time processing
5. Confidential and Proprietary July 2017
Good dev(ops) people are hard to find
Employee tenure very low (<3yrs. Source: PayScale)
Operations is Tribal Knowledge
Machines are very loyal, never ask for a raise and have excellent memory. Can (some of) this be done by machines?
6. Confidential and Proprietary July 2017
➜“I’ve been hearing this for 20 years”
Total Recall, a movie based on a book from 1966, featured a self-driving car as science fiction.
If artificial intelligence has matured enough to drive your car, it can probably also help with your IT.
Skeptical?!
7. Confidential and Proprietary July 2017
BOTS
Superior at bottom-up tasks
• Real-time trend detection
• Pattern Recognition
• Large Dimensionality
• Complex State
• Strict Methodology
HUMANS
Good at top-down tasks
• Deep reasoning
• Contextual thinking
• Tired
• Bored
• Lazy
• Frustrated
• Married
8. Confidential and Proprietary July 2017
That’s what we do @ Loom Systems
AIOps - Algorithmic IT operations
Use Big Data and Machine Learning Technologies to Achieve a Data-Centric Approach to Availability and Performance Monitoring.
Extend the Data-Centric Approach to Other ITOM (IT Operations Management) Disciplines, and Seek to Exploit the Linkages It Allows Between ITOM, SIEM and Business Analytics.
9. Confidential and Proprietary July 2017
Cracking the science behind data-science
Data Preparation
• Collection
• Normalization
• Sanitizing
• Preprocessing
Data Modelling
• Visualizations
• Define KPIs
• Reporting
• Rules & Thresholds
Root-Cause Analysis
• Aggregation
• Correlation
• Causality
Action
• Remedy
• Recommendation
• Insight
• Knowledge
11. Confidential and Proprietary July 2017
Three layers of context
Generic Context
Something is mentioned more than normal, or appears after a long absence
Something stopped/started happening
Common Business Context
Semantic words (timeout, Trojan, failure)
Common Software
Proprietary Business Context
Names of business products, servers, applications...
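The generic-context layer above can be illustrated with a small sketch. It assumes that each log pattern already has a list of daily counts; the function name, the `absence_days` parameter and the event labels are hypothetical and chosen for illustration only, not taken from the Loom product.

```python
def generic_context_events(history, today, absence_days=7):
    """Flag patterns that started, stopped, or reappeared after a long absence.

    history: dict mapping pattern -> list of daily counts (oldest first).
    today:   dict mapping pattern -> count observed today.
    """
    events = {}
    for pattern in set(history) | set(today):
        past = history.get(pattern, [])
        count = today.get(pattern, 0)
        recent = past[-absence_days:]
        if count > 0 and not any(past):
            # Never seen before today.
            events[pattern] = "started"
        elif count > 0 and not any(recent):
            # Seen in the past, but silent for the whole recent window.
            events[pattern] = "reappeared after long absence"
        elif count == 0 and recent and all(recent):
            # Seen every recent day, missing today.
            events[pattern] = "stopped"
    return events


history = {
    "backup finished": [1] * 10,
    "disk error": [2, 0, 0, 0, 0, 0, 0, 0],
    "login ok": [5] * 10,
}
today = {"disk error": 1, "login ok": 4, "kernel panic": 1}
print(generic_context_events(history, today))
```

Nothing here depends on what the patterns mean, which is what makes this layer generic: it works on any stream once the lines are grouped into patterns.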
12. Confidential and Proprietary July 2017
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port 48278 ssh2
Processing
Generic context – rate of this pattern in the logs
Common Business Context –
➜ Contextual words (Warn, Failed)
➜ Common Entities (User, IP, ssh)
Proprietary Business Context –
➜ Server Name
Real-time Structuring, Clustering
Token & Entity Extraction and Classification
Histogram    megatron         Server
Meter        sshd             Application
Meter        ronny            User
Meter        192.168.118.1    source_IP
Random       48278            source_port
Failed password for user [user] from [source_IP] port [source_port] ssh2
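As a rough illustration of the structuring step, the sketch below turns the sample sshd line into the template and entities shown on the slide. The regular expressions are hand-written assumptions for this one line; the talk describes a system that derives such structure automatically, which this sketch does not attempt.

```python
import re

# Hypothetical patterns for the sample line only; not the product's parser.
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]{8})\s"
    r"(?P<server>\S+)\s"
    r"(?P<application>\w+)\[(?P<pid>\d+)\]:\s"
    r"(?P<level>\w+)\s-\s"
    r"(?P<message>.*)$"
)
MESSAGE_RE = re.compile(
    r"Failed password for user (?P<user>\S+) "
    r"from (?P<source_IP>\d{1,3}(?:\.\d{1,3}){3}) "
    r"port (?P<source_port>\d+) ssh2"
)


def structure(line):
    """Return (template, entities) for a matching line, else None."""
    head = LINE_RE.match(line)
    if head is None:
        return None
    message = head["message"]
    body = MESSAGE_RE.search(message)
    if body is None:
        return None
    template = message
    # Substitute from right to left so earlier match offsets stay valid.
    for name in sorted(body.groupdict(), key=body.start, reverse=True):
        template = template[:body.start(name)] + f"[{name}]" + template[body.end(name):]
    entities = {
        "server": head["server"],
        "application": head["application"],
        **body.groupdict(),
    }
    return template, entities


LINE = ("Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password "
        "for user ronny from 192.168.118.1 port 48278 ssh2")
template, entities = structure(LINE)
print(template)
print(entities)
```

Once every line maps to a template plus typed entities, each entity value can be counted and tracked as its own signal.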
15. Confidential and Proprietary July 2017
- This is not (only) anomaly-detection (!)
Algorithms
3σ
Baseline
ARIMA
Feature extraction
Detection & Alerting
History
Scoring
Self Feedback
User Direct and Indirect Feedback
Detection
When tracking up to 1M signals, the system must automatically determine which kinds of detections are interesting for each signal (examples: website response time, ad-click rate)
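Of the algorithms listed on this slide, the 3σ baseline is the simplest to sketch. The version below compares each point with the mean and standard deviation of a sliding window of preceding points. The window size, the handling of a flat baseline and the function name are assumptions made for illustration; the slide itself stresses that detection is more than this single rule.

```python
from statistics import mean, stdev


def three_sigma_anomalies(values, window=20, k=3.0):
    """Return indices of points more than k sigma away from the mean
    of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Assumption: on a perfectly flat baseline, any change is flagged.
            if values[i] != mu:
                anomalies.append(i)
        elif abs(values[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in ms, with one spike at index 21.
response_ms = [100, 102, 98, 101, 99] * 4 + [100, 180, 101]
print(three_sigma_anomalies(response_ms))
```

A fixed rule like this suits a signal such as response time but not necessarily a rate or a counter, which is why the choice of detector has to be made per signal when there are up to 1M of them.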
16. Confidential and Proprietary July 2017
Root-Cause Analysis
When something breaks, anomalies are everywhere. How do you know what to fix?
17. Confidential and Proprietary July 2017
Root-Cause Analysis
When something breaks, everything starts complaining. How do you know what to fix?
18. Confidential and Proprietary July 2017
Automated Root-Cause Analysis: aggregating the detections, correlating them and determining causality between them.
How?
➜ Time-based causality
➜ Relationship-based analysis
➜ Graph-based analysis
Root-Cause Analysis
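One way to picture how time-based and graph-based analysis can combine is the sketch below. It assumes a known dependency graph between components and the time at which each component first became anomalous; both inputs, and all names, are hypothetical. A detection counts as explained when something it depends on broke first, and whatever remains unexplained is a root-cause candidate.

```python
def root_cause_candidates(detections, depends_on):
    """Rank anomalous entities that no earlier upstream anomaly explains.

    detections: dict entity -> time (seconds) its anomaly started.
    depends_on: dict entity -> set of entities it depends on.
    """
    candidates = []
    for entity, started in detections.items():
        upstream = depends_on.get(entity, set())
        # An upstream anomaly that began no later than this one explains it.
        explained = any(
            dep in detections and detections[dep] <= started
            for dep in upstream
        )
        if not explained:
            candidates.append(entity)
    # Earliest unexplained anomaly first.
    return sorted(candidates, key=lambda e: detections[e])


detections = {"web": 120, "auth": 100, "db": 95, "batch": 300}
depends_on = {"web": {"auth", "db"}, "auth": {"db"}}
print(root_cause_candidates(detections, depends_on))
```

In this example the web and auth anomalies collapse into the earlier database anomaly, while the unrelated batch anomaly stays a separate candidate.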
20. Confidential and Proprietary July 2017
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:57 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.1 port…
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user ronny from 192.168.118.16 port…
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user dror from 192.168.118.4 port…
Sep 27 14:25:54 megatron sshd[7498]: WARN - Failed password for user john from 192.168.118.14 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user dan from 192.168.118.121 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user gab from 192.168.118.51 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user anna from 192.168.118.66 port…
Sep 27 14:25:55 megatron sshd[7498]: WARN - Failed password for user dan from 192.168.118.123 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user jim from 192.168.118.133 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user nate from 192.168.118.201 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user stan from 192.168.118.194 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user paul from 192.168.118.144 port…
Sep 27 14:25:56 megatron sshd[7498]: WARN - Failed password for user avi from 192.168.118.81 port…
Sep 27 14:25:57 megatron sshd[7498]: WARN - Failed password for user stas from 192.168.118.54 port…
ronny is mentioned more than normal in the context of ssh failures
The context of ssh failures is mentioned more than normal
Root-Cause Analysis – Relationship Based
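The difference between the two scenarios above can be sketched by checking whether one entity value dominates the burst. The sketch assumes the burst has already been flagged as unusual; the `dominance` threshold and the function name are illustrative assumptions.

```python
from collections import Counter


def describe_burst(events, field, dominance=0.8):
    """Say whether a burst is about one entity or about the context itself.

    events: list of dicts with the extracted entities of each log line.
    field:  the entity to test for dominance, e.g. "user".
    """
    if not events:
        raise ValueError("events must not be empty")
    counts = Counter(event[field] for event in events)
    value, hits = counts.most_common(1)[0]
    if hits / len(events) >= dominance:
        return f"{value} is mentioned more than normal in the context of ssh failures"
    return "The context of ssh failures is mentioned more than normal"


single_user = [{"user": "ronny"} for _ in range(13)]
many_users = [{"user": u} for u in (
    "ronny", "dror", "john", "dan", "gab", "anna", "dan",
    "jim", "nate", "stan", "paul", "avi", "stas")]
print(describe_burst(single_user, "user"))
print(describe_burst(many_users, "user"))
```

The same log template produces both bursts; only the relationship between the template and its entities tells them apart.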
25. Confidential and Proprietary July 2017
Countering Alert Flooding / Alert Fatigue
➜ Overall rate of incidents
➜ Quality of an incident
An incident report:
➜ Root-Cause Analysis
➜ History of similar incidents
➜ Insights & Recommendations
Incident Enrichments
They’re called data scientists, but these are analysts, SREs, DevOps and others.
ASK: Who here believes that self-driving cars will be successful?
This book was released exactly 50 years ago. Indeed, science fiction sometimes takes too long to become reality.
Seriously – let’s get skepticism out of the way – I’m going to be talking about a working concept.
It’s not the car, or the street, or the stoplight. It’s the AI. AI is mature; it’s ready.
Humans are better at top-down, or open-ended, questions such as “where should I open my next branch?”
Machines are superior at rigorous and exhausting tasks, such as “keep track of our sales in every state, sliced by affiliates and browsers; let me know if something happens”.
Can we split responsibility?
Analysis consists of processing, analyzing, understanding, then acting.
Loom Ops does the processing, the analyzing and, to some extent, the understanding.
We must have automated processing if we want to:
Track much more
Ingest many sources
…
Loom covers the generic and common contexts, and will be able to interconnect them with proprietary contexts.
A single log line is automatically processed and translated into 8 different metrics, and that is without going into sequence analysis.
Can you see how hard it is to extract value from machine data?
We suppress “always-broken” alerts.
The machine-learning-based prioritization and filtering is self-adjusting, so that the incident rate fits the size of the team.
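A minimal sketch of the two ideas in these notes: drop alerts that fire almost all the time, then keep only as many of the highest-scored incidents as the team can handle. The scores are taken as given, and the `per_person` capacity, the `always_broken` cut-off and the function name are hypothetical; the talk does not describe how the actual ML-based prioritization works.

```python
def select_incidents(candidates, fire_history, team_size,
                     per_person=3, always_broken=0.9):
    """Suppress always-broken alerts, then cap incidents to team capacity.

    candidates:   list of (name, score) pairs; higher score is more important.
    fire_history: dict name -> list of booleans, one per past time window.
    """
    kept = []
    for name, score in candidates:
        history = fire_history.get(name, [])
        fire_rate = sum(history) / len(history) if history else 0.0
        if fire_rate >= always_broken:
            # Fires in nearly every window, so it carries no new information.
            continue
        kept.append((name, score))
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept[:team_size * per_person]]


candidates = [("disk-full", 0.4), ("login-failures", 0.9),
              ("cron-error", 0.7), ("latency", 0.8)]
fire_history = {"cron-error": [True] * 10,
                "disk-full": [False] * 9 + [True]}
print(select_incidents(candidates, fire_history, team_size=1, per_person=2))
```

Tying the cap to `team_size` is what makes the incident rate track the size of the team rather than the volume of data.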
Detection is very hard and usually ends with a vague lead, such as user complaints or high CPU.
You then go to the logs (the single source of truth), but there’s all this noise. You find many unusual things in different log streams. This is RCA: the process of understanding that the kids are fighting not because Silvia pushed John and he pushed back, but because they’re hungry.
The ops guy gets an alert: high CPU on the authentication server. He starts searching the logs for errors and, after a serious amount of work, narrows it down to this log line. Can you tell the difference in meaning between the two scenarios?
When things go wrong, it’s hard to tell the chain of causality.
We have fewer alerts because we suppress “always-broken” alerts and apply ML-based prioritization and filtering. This gives a much better result than a human-built rule engine.
Then there’s the quality of the incident, which translates to MTTR.
Fuzzy matching is crucial because no one uses ticketing systems. You need to get it as a “push”, and you need to be able to provide simple, fast feedback.
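To make the fuzzy-matching idea concrete, here is a small sketch that looks up past incidents whose summaries resemble a new one. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; the talk does not say which matching technique Loom uses, and the threshold and function name are assumptions.

```python
from difflib import SequenceMatcher


def similar_incidents(summary, history, threshold=0.8):
    """Return past incident summaries similar to `summary`, best match first."""
    scored = []
    for past in history:
        ratio = SequenceMatcher(None, summary.lower(), past.lower()).ratio()
        if ratio >= threshold:
            scored.append((ratio, past))
    scored.sort(reverse=True)
    return [past for _, past in scored]


history = [
    "SSH login failures spike on megatron",
    "disk usage above 90% on optimus",
    "ssh login failure spike on megatron",
]
print(similar_incidents("ssh login failures spike on megatron", history))
```

Approximate matching matters here because past incidents were never entered in a uniform way, so exact lookups would miss most of the relevant history.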
BTW, anomaly detection makes it possible for us to suppress “always-broken” alerts.
We also use machine-learning-based prioritization and filtering: we adjust the incident rate to the size of the team.