Dr. Jorge Cardoso
Huawei Munich Research Center &
University of Coimbra
--jorge.cardoso@huawei.com--
2018.08.28
Mastering AIOps
using Deep Learning, Time-Series Analysis,
and Distributed Tracing
iForesight is an intelligent new tool aimed at SRE cloud maintenance
teams. It uses artificial intelligence to help them quickly detect anomalies
when cloud services are slow or unresponsive.
DUSSELDORF, GERMANY
29–31 August
2018
Abstract
Artificial Intelligence for IT Operations
In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms can no longer be done manually
or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally exploiting AI to provide
tools for autonomous cloud operations. This talk will explain how deep learning, distributed traces, and time-series
analysis (sequence analysis) can be used to effectively detect anomalous cloud infrastructure behaviors during
operations to reduce the workload of human operators. The iForesight system is being used to evaluate this new O&M
approach. iForesight 2.0 is the result of 2 years of research with the goal of providing an intelligent new tool aimed at SRE
cloud maintenance teams. It uses artificial intelligence to help them quickly detect and predict anomalies
when cloud services are slow or unresponsive.
Dr. Jorge Cardoso is Chief Architect for Intelligent CloudOps at Huawei’s German Research Center in
Munich. Previously he worked for several major companies such as SAP Research (Germany) on the
Internet of Services and the Boeing Company in Seattle (USA) on Enterprise Application Integration.
He previously gave lectures at the Karlsruhe Institute of Technology (Germany), University of Georgia
(USA), University of Coimbra and University of Madeira (Portugal). He recently published his latest
book titled “Fundamentals of Service Systems” with Springer. His current research involves the
development of the next generation of Cloud Operations and Analytics using AI, Cloud Reliability and
Resilience, and High Performance Business Process Management systems. He has a Ph.D. in
Computer Science from the University of Georgia (USA).
Jorge Cardoso
GitHub | Slideshare.net | GoogleScholar
Positions in Industry
Interests Service Reliability Engineering, AIOps, Cloud Computing, Distributed Systems, Business Process Management
AIOps
Artificial Intelligence for IT Operations
Data Source Connectors
Distribution Layer
Time Series Registry
Stability Analysis: Machine Learning and Analytics
Visualization, Exploration, Reporting
Real-time Ingestion, Historical Data Storage
Stream Processing, Batch Processing
Analytical Data Store
DevOps, Business Operators, Developers
Pattern Matching, Anomaly Detection
Structural Analysis, RCA, Event Causality
Graph Analysis
Metrics, Tracing, Logging
Every year the management of IT Operations is more complex
► Increase in IT size, and event and alert volumes
► Digitalization with cloud, mobile, microservices
► Edge, IoT, serverless, …
To deal with this complexity, businesses are turning to AI to
automate incident management across production stacks,
including application, infrastructure, and monitoring tools.
Reference architecture for AIOps platforms (Source: Gartner)
Objective
Detect Structural Changes of Distributed Traces
Figure: Distributed Traces [1] (nodes N1, N2, N3, N4)
► Scenario: Service request response time increased by 50% compared to 12h ago.
► Observation: The structure of distributed traces has changed over the past 30 minutes.
► Root cause: Traces show 3 retries to service A before calling service B.
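The root cause in this scenario, repeated calls to service A before service B is reached, is a structural property that can be read off the span sequence. A minimal sketch, assuming a trace is already available as an ordered list of service names; the function name `longest_runs` and the sample traces are hypothetical and not part of iForesight:

```python
from itertools import groupby


def longest_runs(spans):
    """Return, per service, the longest run of consecutive calls to it."""
    runs = {}
    for service, group in groupby(spans):
        length = len(list(group))
        runs[service] = max(runs.get(service, 0), length)
    return runs


# Hypothetical traces: the usual structure versus one with three calls to A.
normal_trace = ["A", "B", "C"]
slow_trace = ["A", "A", "A", "B", "C"]

print(longest_runs(normal_trace))  # {'A': 1, 'B': 1, 'C': 1}
print(longest_runs(slow_trace))    # {'A': 3, 'B': 1, 'C': 1}
```

A hand-written rule like this only catches the patterns someone thought of in advance, which is one motivation for learning the normal structure from data instead.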
John Miller -> Trace ID = 23T874
SLO (response time) > 50%
keystone -> nova -> glance -> network
Figure: server microservices (image-api, image-registry, scheduler, network, compute, compute-api, rpc-broker, authentication) plotted over logical time, annotated with 30%, 70%, Warning Y, Warning A, Warning B, Warning C, Error Z
Trace t1 as a sequence of spans: A B C C B Z
[1] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, C. Shanbhag, Dapper, a
Large-Scale Distributed Systems Tracing Infrastructure, 2010, Google, Inc.
Objective
Detect Structural Changes of Distributed Traces
Figure labels: User Commands, Distributed Traces, Encoding, Detect Normal/Anomalous Traces
User commands (examples): host list, host show, hypervisor list, hypervisor show, hypervisor stats show, image add project, image create, image delete, image list, image remove project, image save, image set, image show, ip fixed add, flavor create, flavor delete, flavor list, flavor set, flavor show, flavor unset, …
Encoded traces (examples): alb5g1jksao98aj8ik5ew21, ab05g1jksao98ajkk5ew21, ab05g1jksalm8aj8ik5ew21, ab05g1jksao98aj8i------
Deep Learning
Long Short Term Memory
Long Short Term Memory (LSTM)
LSTMs [1] are models that capture sequential data by using an internal memory. They are well suited to analyzing
sequences of values and predicting the next point of a given time series. This makes LSTMs adequate for machine
learning problems that involve sequential data (see [3]), such as speech recognition, machine translation, visual
recognition and description, and distributed trace analysis.
Sequential processing using LSTM [2]
[1] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.
[2] Sequential processing in LSTM (from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
[3] LSTM model description (from Andrej Karpathy: http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
Trace t1 as a sequence of spans: A B C C B Z
Structure Formalization
Trace and Span Representation
► Training
▪ Collect traces
▪ Parse and abstract spans
▪ Feed span lists to deep neural network
▪ Train the network with traces
► Operation
▪ Collect real-time traces
▪ Parse and abstract spans
▪ Feed span lists to deep neural network
▪ Retrieve probability matrix
▪ Analyze matrix to decide on the severity of the
anomaly
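The "parse and abstract spans" step shared by training and operation can be sketched as mapping each span type to an integer symbol. This is a minimal illustration under assumptions: the span type names, the helper names `fit_label_encoder` and `encode_trace`, and the choice to reserve 0 for padding and -1 for unseen span types are all hypothetical.

```python
def fit_label_encoder(traces):
    """Assign each distinct span type an integer symbol, starting at 1.

    Symbol 0 is left unused so it can serve as a padding value later.
    """
    vocabulary = {}
    for trace in traces:
        for span_type in trace:
            if span_type not in vocabulary:
                vocabulary[span_type] = len(vocabulary) + 1
    return vocabulary


def encode_trace(trace, vocabulary, unknown=-1):
    """Map span types to symbols; unseen span types get `unknown`."""
    return [vocabulary.get(span_type, unknown) for span_type in trace]


# Hypothetical historical traces, each a list of abstracted span types.
historical = [
    ["auth", "compute-api", "scheduler", "compute"],
    ["auth", "image-api", "image-registry"],
]
vocab = fit_label_encoder(historical)
print(encode_trace(["auth", "image-api", "compute"], vocab))  # [1, 5, 4]
```

A span type that never appeared in training is itself a signal worth surfacing, which is why the sketch returns a distinct marker rather than raising an error.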
Figure: sampled traces as sequences of spans
Trace t1: 91 12 43 76 24
Trace t2: 04 93 63 25 35
Trace tn: 38 18 44 24 61
Training: 91 12 43 76 24
Operation (predicted reconstruction): 91 12 43 76 24
We analyze historical traces and collect the spans generated to create a structure containing span types
(symbols) using label encoding. The network is trained with the encoded traces. During operation, an
LSTM network takes as input a sequence of symbols representing a new real-time trace, e.g., 1: “72”, 2:
“83”, 3: “54”, …, n: “23”. The output is a matrix with the probability of each symbol belonging to a learned
structure. Analyzing the probability matrix makes it possible to decide whether a trace should be marked as normal
or anomalous.
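The slides do not say how the probability matrix is analyzed. One plausible rule, shown here purely as an assumption, is to flag every position where the observed symbol received a probability below a threshold and to use the flagged fraction as a severity score. The names `score_trace` and `severity`, the threshold of 0.05, and the sample matrix are hypothetical.

```python
def score_trace(symbols, prob_matrix, threshold=0.05):
    """Return the positions whose observed symbol the model finds unlikely.

    prob_matrix[i][s] is the model's probability for symbol s at position i.
    """
    unlikely = []
    for position, symbol in enumerate(symbols):
        if prob_matrix[position][symbol] < threshold:
            unlikely.append(position)
    return unlikely


def severity(symbols, prob_matrix, threshold=0.05):
    """Fraction of positions flagged as unlikely (0.0 means fully normal)."""
    if not symbols:
        return 0.0
    return len(score_trace(symbols, prob_matrix, threshold)) / len(symbols)


# Hypothetical output for a 3-span trace over a 3-symbol vocabulary.
trace = [0, 2, 1]
probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.10, 0.80],
    [0.97, 0.01, 0.02],  # symbol 1 is very unlikely at this position
]
print(score_trace(trace, probs))  # [2]
```

Reporting the flagged positions, not just a normal/anomalous label, points an operator at the span where the structure diverged.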
In this experiment, we processed >1M spans and >100K traces
Partial view of the spans collected
Overall Approach
From Encoding to Detection
Keras and TensorFlow
● LSTM network
● Dropout layer with a rate of 0.2 (high, but necessary to avoid quick
divergence)
● Softmax activation to generate probabilities over the different
categories
● Because it is a classification problem, the loss is calculated
via categorical cross-entropy, which compares the output
probabilities against the one-hot encoding
● Adam optimizer (instead of classical stochastic gradient
descent) to update weights
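To make the softmax and loss bullets concrete, the two formulas can be written out in plain Python. This is a hand-rolled sketch of the standard definitions, not the Keras implementation; the input scores are made up.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probabilities, one_hot_target):
    """Loss for one prediction: -sum(target * log(probability))."""
    return -sum(
        t * math.log(p)
        for p, t in zip(probabilities, one_hot_target)
        if t > 0
    )


# Hypothetical scores for three span categories; the true category is index 0.
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_cross_entropy(probs, [1, 0, 0])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(loss, 3))                # 0.417
```

With a one-hot target the loss reduces to the negative log of the probability assigned to the true category, so confident correct predictions drive it toward zero.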
Encoding scheme
● The encoding scheme pads each sequence of inputs with zeros, up to a pre-defined maximum length.
● This allows a chain of LSTM units of a specific length to be pre-allocated.
● We also pass information about the sequence lengths. This is important for not treating the zero padding as actual
inputs, and for injecting the error signal at the right unit in the sequence during backpropagation.
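The padding scheme can be sketched as follows. The slides do not state whether zeros are added on the left or the right, so right-padding is an assumption here, as are the helper names `pad_sequences` and `padding_mask` (written by hand for this sketch, not imported from a library).

```python
def pad_sequences(sequences, max_len, pad_value=0):
    """Right-pad (or truncate) each sequence to max_len; also return lengths."""
    padded, lengths = [], []
    for seq in sequences:
        kept = list(seq[:max_len])
        lengths.append(len(kept))
        padded.append(kept + [pad_value] * (max_len - len(kept)))
    return padded, lengths


def padding_mask(lengths, max_len):
    """1 marks a real input, 0 marks padding to be ignored by the loss."""
    return [[1 if i < n else 0 for i in range(max_len)] for n in lengths]


# Encoded traces of different lengths, for illustration only.
traces = [[91, 12, 43, 76, 24], [4, 93, 63], [38, 18]]
padded, lengths = pad_sequences(traces, max_len=5)
print(padded)   # [[91, 12, 43, 76, 24], [4, 93, 63, 0, 0], [38, 18, 0, 0, 0]]
print(lengths)  # [5, 3, 2]
print(padding_mask(lengths, 5)[1])  # [1, 1, 1, 0, 0]
```

Keeping the true lengths alongside the padded matrix is what lets the network skip padded steps and attach the error signal to the last real span.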
One-hot encoding
● One-hot encoding represents categorical variables as binary vectors.
● Categorical values are first mapped to integer values.
● Each integer value is then represented as a binary vector that is all zeros except at the index of the integer, which is
marked with a 1.
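A short sketch of these three steps, using the span symbols of trace t1 (A B C C B Z) as the categories; the helper name `one_hot` and the ordering of the categories are assumptions made for illustration.

```python
def one_hot(index, num_classes):
    """Binary vector of length num_classes with a single 1 at `index`."""
    if not 0 <= index < num_classes:
        raise ValueError("index out of range")
    vector = [0] * num_classes
    vector[index] = 1
    return vector


# Step 1: categorical values are mapped to integers.
categories = ["A", "B", "C", "Z"]
to_int = {name: i for i, name in enumerate(categories)}

# Step 2: each integer becomes a binary vector.
trace = ["A", "B", "C", "C", "B", "Z"]
encoded = [one_hot(to_int[span], len(categories)) for span in trace]
print(encoded[0])   # [1, 0, 0, 0]
print(encoded[-1])  # [0, 0, 0, 1]
```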
Tensorflow tutorial: https://github.com/campdav/text-rnn-tensorflow
Keras tutorial: https://github.com/campdav/text-rnn-keras
Figure (steps 1 to 4): Raw Span Events; Encoded Traces; LSTM Network; Network Training; Normal (green) and Anomalous (red) Traces Detected
Distributed Trace Anomaly Detection
Results
Goal: Classify distributed traces as normal or abnormal
► Benefits
▪ Very high accuracy in detection: 99.73%
▪ Extremely fast detection: O(n)
▪ The technique outperforms existing approaches,
e.g., sequence matching, Petri nets, clustering
► Limitations
▪ Requires special (non-trivial) handling when using recurrent neural networks such as LSTMs
▪ Long traces may require very long training times,
e.g., backpropagation across long sequences may result in vanishing gradients
► Improvements
▪ Truncate traces (remove steps from the beginning or the end)?
e.g., may lower the accuracy of predictions
▪ Summarize traces (remove well-known subtraces)?
e.g., focus on the most relevant parts of a trace while reducing its length
▪ Evaluate the prediction of remaining time to completion (see [Tax, 2017])
▪ Operationalize LSTM with other sequential data (see [Du, 2017])
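The two proposed ways of shortening long traces, truncation and summarization, could look roughly like this. Both helpers are hypothetical sketches of the idea rather than anything evaluated in the talk; `known` stands for a subtrace an operator has declared uninteresting.

```python
def truncate_trace(trace, max_len, keep="end"):
    """Keep at most max_len spans from the start or the end of the trace."""
    if max_len <= 0:
        return []
    return trace[-max_len:] if keep == "end" else trace[:max_len]


def remove_subtrace(trace, known):
    """Drop every non-overlapping occurrence of the well-known subtrace."""
    if not known:
        return list(trace)
    result, i, k = [], 0, len(known)
    while i < len(trace):
        if trace[i:i + k] == known:
            i += k  # skip the well-known subtrace
        else:
            result.append(trace[i])
            i += 1
    return result


trace = ["A", "B", "C", "C", "B", "Z"]
print(truncate_trace(trace, 3))            # ['C', 'B', 'Z']
print(remove_subtrace(trace, ["C", "C"]))  # ['A', 'B', 'B', 'Z']
```

Truncation discards context blindly, whereas removing known subtraces keeps the spans around them, which matches the trade-off noted in the bullets.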
[Tax, 2017] Predictive business process monitoring with LSTM neural networks, Tax, Niek and Verenich, Ilya and La Rosa, Marcello and Dumas,
Marlon, Proceedings of the 29th International Conference on Advanced Information Systems Engineering, 2017, 477--492, Springer.
[Du, 2017] M. Du, F. Li, G. Zheng, and V. Srikumar. 2017. DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning. In
Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS '17). ACM, 1285-1298.
Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive
statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time
without notice.
HUAWEI ENTERPRISE ICT SOLUTIONS A BETTER WAY

More Related Content

What's hot

data mining
data miningdata mining
data mining
uoitc
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Ghulam Imaduddin
 
Fraud detection ML
Fraud detection MLFraud detection ML
Fraud detection ML
MaatougSelim
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
Subhas Kumar Ghosh
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
DataminingTools Inc
 
Mining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTMining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDT
Davide Gallitelli
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial Intelligence
Manish Jain
 
Human Emotion Recognition using Machine Learning
Human Emotion Recognition using Machine LearningHuman Emotion Recognition using Machine Learning
Human Emotion Recognition using Machine Learning
ijtsrd
 
Machine learning & Time Series Analysis
Machine learning & Time Series AnalysisMachine learning & Time Series Analysis
Machine learning & Time Series Analysis
台灣量化交易協會
 
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNINGDETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
ijcsit
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
Md. Main Uddin Rony
 
Big data
Big dataBig data
Big data
Nausheen Hasan
 
Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
Iffat Firozy
 
AI Data Summit 2019 Thompson Sampling - Thompson Sampling Tutorial “The redis...
AI Data Summit 2019 Thompson Sampling - Thompson Sampling Tutorial “The redis...AI Data Summit 2019 Thompson Sampling - Thompson Sampling Tutorial “The redis...
AI Data Summit 2019 Thompson Sampling - Thompson Sampling Tutorial “The redis...
Hanan Shteingart
 
LSTM Tutorial
LSTM TutorialLSTM Tutorial
LSTM Tutorial
Ralph Schlosser
 
Em Algorithm | Statistics
Em Algorithm | StatisticsEm Algorithm | Statistics
Em Algorithm | Statistics
Transweb Global Inc
 
[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx
TuCaoMinh2
 
Machine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series PredictionMachine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series Prediction
Gianluca Bontempi
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
jeykottalam
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Vipin Batra
 

What's hot (20)

data mining
data miningdata mining
data mining
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Fraud detection ML
Fraud detection MLFraud detection ML
Fraud detection ML
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
 
Mining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTMining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDT
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial Intelligence
 
Human Emotion Recognition using Machine Learning
Human Emotion Recognition using Machine LearningHuman Emotion Recognition using Machine Learning
Human Emotion Recognition using Machine Learning
 
Machine learning & Time Series Analysis
Machine learning & Time Series AnalysisMachine learning & Time Series Analysis
Machine learning & Time Series Analysis
 
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNINGDETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
Big data
Big dataBig data
Big data
 
Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
 
AI Data Summit 2019 Thompson Sampling - Thompson Sampling Tutorial “The redis...
AI Data Summit 2019 Thompson Sampling - Thompson Sampling Tutorial “The redis...AI Data Summit 2019 Thompson Sampling - Thompson Sampling Tutorial “The redis...
AI Data Summit 2019 Thompson Sampling - Thompson Sampling Tutorial “The redis...
 
LSTM Tutorial
LSTM TutorialLSTM Tutorial
LSTM Tutorial
 
Em Algorithm | Statistics
Em Algorithm | StatisticsEm Algorithm | Statistics
Em Algorithm | Statistics
 
[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx
 
Machine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series PredictionMachine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series Prediction
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 

Similar to Mastering AIOps with Deep Learning

Go Observability (in practice)
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)
Eran Levy
 
Machine Learning-Based Phishing Detection
Machine Learning-Based Phishing DetectionMachine Learning-Based Phishing Detection
Machine Learning-Based Phishing Detection
IRJET Journal
 
BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6Rod Soto
 
Distributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using MLDistributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using ML
Jorge Cardoso
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals
Vrushali Lanjewar
 
slides_itc30_2018_Morichetta_v2.pdf
slides_itc30_2018_Morichetta_v2.pdfslides_itc30_2018_Morichetta_v2.pdf
slides_itc30_2018_Morichetta_v2.pdf
Andrea Morichetta
 
MCSoC'13 Keynote Talk "Taming Big Data Streams"
MCSoC'13 Keynote Talk "Taming Big Data Streams"MCSoC'13 Keynote Talk "Taming Big Data Streams"
MCSoC'13 Keynote Talk "Taming Big Data Streams"Hideyuki Kawashima
 
AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
AIOps: Anomalous Span Detection in Distributed Traces Using Deep LearningAIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
Jorge Cardoso
 
RESUME SCREENING USING LSTM
RESUME SCREENING USING LSTMRESUME SCREENING USING LSTM
RESUME SCREENING USING LSTM
IRJET Journal
 
Predictive Maintenance - Portland Machine Learning Meetup
Predictive Maintenance - Portland Machine Learning MeetupPredictive Maintenance - Portland Machine Learning Meetup
Predictive Maintenance - Portland Machine Learning Meetup
Ian Downard
 
IRJET- E-Commerce Website using Skyline Queries
IRJET-  	  E-Commerce Website using Skyline QueriesIRJET-  	  E-Commerce Website using Skyline Queries
IRJET- E-Commerce Website using Skyline Queries
IRJET Journal
 
Predicting Stock Price Movements with Low Power Consumption LSTM
Predicting Stock Price Movements with Low Power Consumption LSTMPredicting Stock Price Movements with Low Power Consumption LSTM
Predicting Stock Price Movements with Low Power Consumption LSTM
IRJET Journal
 
AI and Deep Learning
AI and Deep Learning AI and Deep Learning
AI and Deep Learning
Subrat Panda, PhD
 
Cloud data management
Cloud data managementCloud data management
Cloud data managementambitlick
 
Wireless Ad Hoc Networks
Wireless Ad Hoc NetworksWireless Ad Hoc Networks
Wireless Ad Hoc Networks
Tara Hardin
 
Advanced Open IoT Platform for Prevention and Early Detection of Forest Fires
Advanced Open IoT Platform for Prevention and Early Detection of Forest FiresAdvanced Open IoT Platform for Prevention and Early Detection of Forest Fires
Advanced Open IoT Platform for Prevention and Early Detection of Forest Fires
Ivo Andreev
 
Ssbse10.ppt
Ssbse10.pptSsbse10.ppt
Machine Learning AND Deep Learning for OpenPOWER
Machine Learning AND Deep Learning for OpenPOWERMachine Learning AND Deep Learning for OpenPOWER
Machine Learning AND Deep Learning for OpenPOWER
Ganesan Narayanasamy
 
Spark Streaming Early Warning Use Case
Spark Streaming Early Warning Use CaseSpark Streaming Early Warning Use Case
Spark Streaming Early Warning Use Case
random_chance
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
Paul Groth
 

Similar to Mastering AIOps with Deep Learning (20)

Go Observability (in practice)
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)
 
Machine Learning-Based Phishing Detection
Machine Learning-Based Phishing DetectionMachine Learning-Based Phishing Detection
Machine Learning-Based Phishing Detection
 
BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6
 
Distributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using MLDistributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using ML
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals
 
slides_itc30_2018_Morichetta_v2.pdf
slides_itc30_2018_Morichetta_v2.pdfslides_itc30_2018_Morichetta_v2.pdf
slides_itc30_2018_Morichetta_v2.pdf
 
MCSoC'13 Keynote Talk "Taming Big Data Streams"
MCSoC'13 Keynote Talk "Taming Big Data Streams"MCSoC'13 Keynote Talk "Taming Big Data Streams"
MCSoC'13 Keynote Talk "Taming Big Data Streams"
 
AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
AIOps: Anomalous Span Detection in Distributed Traces Using Deep LearningAIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
 
RESUME SCREENING USING LSTM
RESUME SCREENING USING LSTMRESUME SCREENING USING LSTM
RESUME SCREENING USING LSTM
 
Predictive Maintenance - Portland Machine Learning Meetup
Predictive Maintenance - Portland Machine Learning MeetupPredictive Maintenance - Portland Machine Learning Meetup
Predictive Maintenance - Portland Machine Learning Meetup
 
IRJET- E-Commerce Website using Skyline Queries
IRJET-  	  E-Commerce Website using Skyline QueriesIRJET-  	  E-Commerce Website using Skyline Queries
IRJET- E-Commerce Website using Skyline Queries
 
Predicting Stock Price Movements with Low Power Consumption LSTM
Predicting Stock Price Movements with Low Power Consumption LSTMPredicting Stock Price Movements with Low Power Consumption LSTM
Predicting Stock Price Movements with Low Power Consumption LSTM
 
AI and Deep Learning
AI and Deep Learning AI and Deep Learning
AI and Deep Learning
 
Cloud data management
Cloud data managementCloud data management
Cloud data management
 
Wireless Ad Hoc Networks
Wireless Ad Hoc NetworksWireless Ad Hoc Networks
Wireless Ad Hoc Networks
 
Advanced Open IoT Platform for Prevention and Early Detection of Forest Fires
Advanced Open IoT Platform for Prevention and Early Detection of Forest FiresAdvanced Open IoT Platform for Prevention and Early Detection of Forest Fires
Advanced Open IoT Platform for Prevention and Early Detection of Forest Fires
 
Ssbse10.ppt
Ssbse10.pptSsbse10.ppt
Ssbse10.ppt
 
Machine Learning AND Deep Learning for OpenPOWER
Machine Learning AND Deep Learning for OpenPOWERMachine Learning AND Deep Learning for OpenPOWER
Machine Learning AND Deep Learning for OpenPOWER
 
Spark Streaming Early Warning Use Case
Spark Streaming Early Warning Use CaseSpark Streaming Early Warning Use Case
Spark Streaming Early Warning Use Case
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
 

More from Jorge Cardoso

AIOps: Anomalies Detection of Distributed Traces
AIOps: Anomalies Detection of Distributed TracesAIOps: Anomalies Detection of Distributed Traces
AIOps: Anomalies Detection of Distributed Traces
Jorge Cardoso
 
Cloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injectionCloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injection
Jorge Cardoso
 
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Jorge Cardoso
 
Recapitulation Workshop Cloud Reliability Resilience 2016
Recapitulation Workshop Cloud Reliability Resilience 2016Recapitulation Workshop Cloud Reliability Resilience 2016
Recapitulation Workshop Cloud Reliability Resilience 2016
Jorge Cardoso
 
Shape the Cloud
Shape the CloudShape the Cloud
Shape the Cloud
Jorge Cardoso
 
DOST 2016 Cloud Without Failures
DOST 2016 Cloud Without FailuresDOST 2016 Cloud Without Failures
DOST 2016 Cloud Without Failures
Jorge Cardoso
 
Cloud Resilience with Open Stack
Cloud Resilience with Open StackCloud Resilience with Open Stack
Cloud Resilience with Open Stack
Jorge Cardoso
 
Evolution and Overview of Linked USDL
Evolution and Overview of Linked USDLEvolution and Overview of Linked USDL
Evolution and Overview of Linked USDL
Jorge Cardoso
 
Ten years of service research from a computer science perspective
Ten years of service research from a computer science perspectiveTen years of service research from a computer science perspective
Ten years of service research from a computer science perspective
Jorge Cardoso
 
Cloud Computing Automation: Integrating USDL and TOSCA
 Cloud Computing Automation: Integrating USDL and TOSCA Cloud Computing Automation: Integrating USDL and TOSCA
Cloud Computing Automation: Integrating USDL and TOSCA
Jorge Cardoso
 
Open Service Network Analysis
Open Service Network AnalysisOpen Service Network Analysis
Open Service Network Analysis
Jorge Cardoso
 
Open Semantic Service Networks: Modeling and Analysis
Open Semantic Service Networks: Modeling and AnalysisOpen Semantic Service Networks: Modeling and Analysis
Open Semantic Service Networks: Modeling and Analysis
Jorge Cardoso
 
Modeling Service Relationships for Service Networks
Modeling Service Relationships for Service NetworksModeling Service Relationships for Service Networks
Modeling Service Relationships for Service Networks
Jorge Cardoso
 
Linked USDL
Linked USDLLinked USDL
Linked USDL
Jorge Cardoso
 
Challenges for Open Semantic Service Networks : models, theory, applications
Challenges for Open Semantic Service Networks: models, theory, applications Challenges for Open Semantic Service Networks: models, theory, applications
Challenges for Open Semantic Service Networks : models, theory, applications
Jorge Cardoso
 
Description and portability of cloud services with USDL and TOSCA
Description and portability of cloud services with USDL and TOSCADescription and portability of cloud services with USDL and TOSCA
Description and portability of cloud services with USDL and TOSCA
Jorge Cardoso
 
Open Semantic Service Networks
Open Semantic Service NetworksOpen Semantic Service Networks
Open Semantic Service Networks
Jorge Cardoso
 
Dynamic Open Semantic Service Networks
Dynamic Open Semantic Service NetworksDynamic Open Semantic Service Networks
Dynamic Open Semantic Service Networks
Jorge Cardoso
 
Genssiz Projects: Year 2012 2013
Genssiz Projects: Year 2012 2013Genssiz Projects: Year 2012 2013
Genssiz Projects: Year 2012 2013
Jorge Cardoso
 
IEEE SE2012 Internet-based self-services
IEEE SE2012 Internet-based self-servicesIEEE SE2012 Internet-based self-services
IEEE SE2012 Internet-based self-servicesJorge Cardoso
 

More from Jorge Cardoso (20)

AIOps: Anomalies Detection of Distributed Traces
AIOps: Anomalies Detection of Distributed TracesAIOps: Anomalies Detection of Distributed Traces
AIOps: Anomalies Detection of Distributed Traces
 
Cloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injectionCloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injection
 
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
 
Recapitulation Workshop Cloud Reliability Resilience 2016
Recapitulation Workshop Cloud Reliability Resilience 2016Recapitulation Workshop Cloud Reliability Resilience 2016
Recapitulation Workshop Cloud Reliability Resilience 2016
 
Shape the Cloud
Shape the CloudShape the Cloud
Shape the Cloud
 
DOST 2016 Cloud Without Failures
DOST 2016 Cloud Without FailuresDOST 2016 Cloud Without Failures
DOST 2016 Cloud Without Failures
 
Cloud Resilience with Open Stack
Cloud Resilience with Open StackCloud Resilience with Open Stack
Cloud Resilience with Open Stack
 
Evolution and Overview of Linked USDL
Evolution and Overview of Linked USDLEvolution and Overview of Linked USDL
Evolution and Overview of Linked USDL
 
Ten years of service research from a computer science perspective
Ten years of service research from a computer science perspectiveTen years of service research from a computer science perspective
Ten years of service research from a computer science perspective
 
Cloud Computing Automation: Integrating USDL and TOSCA
 Cloud Computing Automation: Integrating USDL and TOSCA Cloud Computing Automation: Integrating USDL and TOSCA
Cloud Computing Automation: Integrating USDL and TOSCA
 
Open Service Network Analysis
Open Service Network AnalysisOpen Service Network Analysis
Open Service Network Analysis
 
Open Semantic Service Networks: Modeling and Analysis
Open Semantic Service Networks: Modeling and AnalysisOpen Semantic Service Networks: Modeling and Analysis
Open Semantic Service Networks: Modeling and Analysis
 
Modeling Service Relationships for Service Networks
Modeling Service Relationships for Service NetworksModeling Service Relationships for Service Networks
Modeling Service Relationships for Service Networks
 
Linked USDL
Linked USDLLinked USDL
Linked USDL
 
Challenges for Open Semantic Service Networks : models, theory, applications
Challenges for Open Semantic Service Networks: models, theory, applications Challenges for Open Semantic Service Networks: models, theory, applications
Challenges for Open Semantic Service Networks : models, theory, applications
 
Description and portability of cloud services with USDL and TOSCA
Description and portability of cloud services with USDL and TOSCADescription and portability of cloud services with USDL and TOSCA
Description and portability of cloud services with USDL and TOSCA
 
Open Semantic Service Networks
Open Semantic Service NetworksOpen Semantic Service Networks
Open Semantic Service Networks
 
Dynamic Open Semantic Service Networks
Dynamic Open Semantic Service NetworksDynamic Open Semantic Service Networks
Dynamic Open Semantic Service Networks
 
Genssiz Projects: Year 2012 2013
Genssiz Projects: Year 2012 2013Genssiz Projects: Year 2012 2013
Genssiz Projects: Year 2012 2013
 
IEEE SE2012 Internet-based self-services
IEEE SE2012 Internet-based self-servicesIEEE SE2012 Internet-based self-services
IEEE SE2012 Internet-based self-services
 

Mastering AIOps with Deep Learning

  • 1. Dr. Jorge Cardoso, Huawei Munich Research Center & University of Coimbra (jorge.cardoso@huawei.com), 2018.08.28. Mastering AIOps using Deep Learning, Time-Series Analysis, and Distributed Tracing. iForesight is an intelligent new tool aimed at SRE cloud maintenance teams. It uses artificial intelligence to help them quickly detect anomalies when cloud services are slow or unresponsive. Düsseldorf, Germany, 29–31 August 2018.
  • 2. Abstract: Artificial Intelligence for IT Operations. In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms can no longer be done manually or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally exploiting AI to provide tools for autonomous cloud operations. This talk explains how deep learning, distributed traces, and time-series analysis (sequence analysis) can be used to detect anomalous cloud infrastructure behaviors during operations and so reduce the workload of human operators. The iForesight system is being used to evaluate this new O&M approach. iForesight 2.0 is the result of 2 years of research with the goal of providing an intelligent new tool aimed at SRE cloud maintenance teams. It uses artificial intelligence to help them quickly detect and predict anomalies when cloud services are slow or unresponsive. Dr. Jorge Cardoso is Chief Architect for Intelligent CloudOps at Huawei’s German Research Center in Munich. Previously he worked for several major companies such as SAP Research (Germany) on the Internet of Services and the Boeing Company in Seattle (USA) on Enterprise Application Integration. He previously gave lectures at the Karlsruhe Institute of Technology (Germany), University of Georgia (USA), University of Coimbra and University of Madeira (Portugal). He recently published his latest book, “Fundamentals of Service Systems”, with Springer. His current research involves the development of the next generation of Cloud Operations and Analytics using AI, Cloud Reliability and Resilience, and High Performance Business Process Management systems. He has a Ph.D. in Computer Science from the University of Georgia (USA). Jorge Cardoso: GitHub | Slideshare.net | GoogleScholar. Interests: Service Reliability Engineering, AIOps, Cloud Computing, Distributed Systems, Business Process Management.
  • 3. AIOps: Artificial Intelligence for IT Operations. Every year the management of IT Operations is more complex: ► Increase in IT size, and event and alert volumes ► Digitalization with cloud, mobile, microservices ► Edge, IoT, serverless, … To deal with this complexity, businesses are turning to AI to automate incident management across production stacks, including application, infrastructure, and monitoring tools. [Figure: Reference architecture for AIOps platforms (source: Gartner). Labels shown: Data Source Connectors; Distribution Layer; Time Series Registry; Real-time Ingestion; Historical Data Storage; Stream Processing; Batch Processing; Analytical Data Store; Stability Analysis: Machine Learning and Analytics; Pattern Matching; Anomaly Detection; Structural Analysis; RCA; Event Causality; Graph Analysis; Visualization, Exploration, Reporting; Metrics; Tracing; Logging; DevOps; Business; Operators; Developers.]
  • 4. Objective: Detect Structural Changes of Distributed Traces. Distributed traces [1]: ► Scenario: Service requests’ response time increased 50% compared with 12 hours ago. ► Observation: The structure of the distributed traces has changed over the past 30 minutes. ► Root cause: Traces show three retries to service A before calling service B. Example: John Miller -> Trace ID = 23T874; SLO (response time) > 50%; keystone -> nova -> glance -> network. [Figure: a trace across servers (nodes N1 to N4) and microservices (image-api, image-registry, scheduler, network, compute, compute-api, rpc-broker, authentication) plotted over logical time, annotated with 30% and 70%, Warning Y, Warning A, Warning B, Warning C, and Error Z. Trace t1 as a sequence of spans: A B C C B Z.] [1] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, C. Shanbhag, Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, 2010, Google, Inc.
  • 5. Objective: Detect Structural Changes of Distributed Traces. [Figure labels: User Commands; Distributed Traces; Encoding; Detect Normal/Anomalous Traces.] User commands (partial list): host list, host show, hypervisor list, hypervisor show, hypervisor stats show, image add project, image create, image delete, image list, image remove project, image save, image set, image show, ip fixed add, flavor create, flavor delete, flavor list, flavor set, flavor show, flavor unset, … Encoded traces (as they appear on the slide): alb5g1jksao98aj8ik5e w21; ab05g1jksao98ajkk5ew 21; ab05g1jksalm8aj8ik5e w21; ab05g1jksao98aj8i--- ---
  • 6. Deep Learning: Long Short-Term Memory (LSTM). LSTMs [1] are models that capture sequential data by using an internal memory. They are well suited to analyzing sequences of values and predicting the next point of a given time series. This makes LSTMs suitable for machine learning problems that involve sequential data (see [3]), such as speech recognition, machine translation, visual recognition and description, and distributed trace analysis. [Figure: Sequential processing using LSTM [2], shown with trace t1 as a sequence of spans: A B C C B Z.] [1] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780. [2] Sequential processing in LSTM (from http://colah.github.io/posts/2015-08-Understanding-LSTMs/). [3] LSTM model description (from Andrej Karpathy, http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
  • 7. Structure Formalization: Trace and Span Representation. ► Training ▪ Collect traces ▪ Parse and abstract spans ▪ Feed span lists to deep neural network ▪ Train the network with traces ► Operation ▪ Collect real-time traces ▪ Parse and abstract spans ▪ Feed span lists to deep neural network ▪ Retrieve probability matrix ▪ Analyze matrix to decide on the severity of the anomaly. We analyze historical traces and collect the spans generated to create a structure containing span types (symbols) using label encoding. The network is trained with the encoded traces. During operation, an LSTM network takes as input a sequence of symbols representing a new real-time trace, e.g., 1: “72”, 2: “83”, 3: “54”, … n: “23”. The output is a matrix with the probability that each symbol belongs to a learned structure. Analyzing the probability matrix makes it possible to decide whether a trace should be marked as normal or anomalous. In this experiment, we processed >1M spans and >100K traces. [Figure: partial view of the spans collected. Training: traces t1, t2, …, tn as sequences of spans (91 12 43 76 24 04 93 63 25 35 38 18 44 24 61). Operation: a sampled trace (91 12 43 76 24) and its predicted reconstruction (91 12 43 76 24).]
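The label-encoding and matrix-analysis steps of slide 7 can be sketched roughly as follows. This is a minimal illustration, not the iForesight implementation: the function names, the reserved padding index 0, the `-1` marker for unseen span types, the `0.05` threshold, and the hand-written probability matrix are all assumptions made for the example. In practice the matrix would come from the trained LSTM.

```python
from typing import Dict, List, Sequence


def build_vocabulary(traces: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Label encoding: map each distinct span type to an integer symbol.

    Index 0 is reserved for padding (an assumption), so symbols start at 1.
    """
    vocab: Dict[str, int] = {}
    for trace in traces:
        for span in trace:
            if span not in vocab:
                vocab[span] = len(vocab) + 1
    return vocab


def encode_trace(trace: Sequence[str], vocab: Dict[str, int],
                 unknown: int = -1) -> List[int]:
    """Encode a trace as a list of symbols; unseen span types map to `unknown`."""
    return [vocab.get(span, unknown) for span in trace]


def suspicious_positions(encoded: Sequence[int],
                         prob_matrix: Sequence[Sequence[float]],
                         threshold: float = 0.05) -> List[int]:
    """Return positions whose observed symbol received a probability below
    `threshold`. prob_matrix[i][s] is the probability assigned to symbol s
    at position i (hypothetical layout)."""
    flagged = []
    for i, symbol in enumerate(encoded):
        row = prob_matrix[i]
        # Symbols outside the learned vocabulary get probability 0.
        p = row[symbol] if 0 <= symbol < len(row) else 0.0
        if p < threshold:
            flagged.append(i)
    return flagged


# Historical traces used to build the symbol table.
historical = [
    ["keystone", "nova", "glance", "network"],
    ["keystone", "nova", "network"],
]
vocab = build_vocabulary(historical)

# A new trace in which "nova" is called twice in a row.
encoded = encode_trace(["keystone", "nova", "nova", "glance"], vocab)

# Hand-written stand-in for the network output (columns: pad, 1, 2, 3, 4).
prob_matrix = [
    [0.0, 0.97, 0.01, 0.01, 0.01],
    [0.0, 0.02, 0.95, 0.02, 0.01],
    [0.0, 0.01, 0.02, 0.60, 0.37],
    [0.0, 0.01, 0.01, 0.90, 0.08],
]
flagged = suspicious_positions(encoded, prob_matrix)
```

Here the repeated call at position 2 is the only one flagged. How many flagged positions make a trace anomalous, and how that maps to severity, is a policy choice the slides do not specify.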
  • 8. Overall Approach: From Encoding to Detection. Keras and TensorFlow: ● LSTM network ● Dropout layer of 0.2 (high, but necessary to avoid quick divergence) ● Softmax activation to generate probabilities over the different categories ● Because this is a classification problem, the loss is computed via categorical cross-entropy, which compares the output probabilities against the one-hot encoding ● Adam optimizer (instead of classical stochastic gradient descent) to update the weights. Encoding scheme: ● The encoding scheme pads each input sequence with zeros, up to a predefined maximum length ● This makes it possible to pre-allocate a chain of LSTM units of a specific length ● We also pass information about the sequence lengths. This is important so that the zero padding is not treated as actual input and so that the error signal is injected at the right unit in the sequence during backpropagation. One-hot encoding: ● One-hot encoding represents categorical variables as binary vectors ● Categorical values are mapped to integer values ● Each integer value is represented as a binary vector that is all zeros except at the index of the integer, which is marked with a 1. TensorFlow tutorial: https://github.com/campdav/text-rnn-tensorflow Keras tutorial: https://github.com/campdav/text-rnn-keras [Figure labels, steps 1 to 4: Raw Span Events; Encoded Traces; LSTM Network; Network Training; Normal (green) and Anomalous (red) Traces Detected.]
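The encoding and loss steps of slide 8 can be illustrated without any deep learning framework. The sketch below is plain Python rather than the Keras/TensorFlow code used in the talk. It assumes right-padding with zeros (the slides do not say on which side the padding goes), and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def pad_sequence(encoded: Sequence[int], max_len: int,
                 pad: int = 0) -> Tuple[List[int], int]:
    """Right-pad a symbol sequence with zeros up to max_len.

    Also returns the true length so padding can be ignored later.
    """
    if len(encoded) > max_len:
        raise ValueError("sequence longer than max_len")
    return list(encoded) + [pad] * (max_len - len(encoded)), len(encoded)


def one_hot(symbol: int, num_classes: int) -> List[int]:
    """Binary vector that is all zeros except for a 1 at index `symbol`."""
    vec = [0] * num_classes
    vec[symbol] = 1
    return vec


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw scores into probabilities over the categories."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs: Sequence[float],
                              target: Sequence[int]) -> float:
    """Loss between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(max(p, 1e-12))
                for p, t in zip(probs, target) if t)


def masked_mean_loss(prob_rows: Sequence[Sequence[float]],
                     padded: Sequence[int], length: int) -> float:
    """Average the loss over the first `length` positions only, so the
    zero padding never contributes an error signal."""
    num_classes = len(prob_rows[0])
    losses = [
        categorical_cross_entropy(prob_rows[i], one_hot(padded[i], num_classes))
        for i in range(length)
    ]
    return sum(losses) / length


padded, length = pad_sequence([1, 2, 3], max_len=5)
probs = softmax([2.0, 1.0, 0.1, 0.0])
loss = categorical_cross_entropy([0.25, 0.25, 0.5, 0.0], one_hot(2, 4))
```

In Keras, the same ideas are usually expressed with built-in padding, masking, and loss utilities. The manual version is shown only to make each step explicit.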
  • 9. Distributed Trace Anomaly Detection: Results. Goal: classify distributed traces as normal or abnormal. ► Benefits ▪ Very high detection accuracy: 99.73% ▪ Extremely fast detection: O(n) ▪ The technique outperforms existing approaches, e.g., sequence matching, Petri nets, clustering ► Limitations ▪ Recurrent neural networks such as LSTMs require special (non-trivial) handling ▪ Long traces may require very long training times, e.g., backpropagation across long sequences may result in vanishing gradients ► Improvements ▪ Truncate traces (remove steps from the beginning or the end)? This may lower the accuracy of predictions ▪ Summarize traces (remove well-known subtraces)? e.g., focus on the most relevant parts of a trace while reducing its length ▪ Evaluate the prediction of remaining time to completion (see [Tax, 2017]) ▪ Operationalize LSTM with other sequential data (see [Du, 2017]). [Tax, 2017] N. Tax, I. Verenich, M. La Rosa, and M. Dumas. 2017. Predictive Business Process Monitoring with LSTM Neural Networks. In Proceedings of the 29th International Conference on Advanced Information Systems Engineering. Springer, 477-492. [Du, 2017] M. Du, F. Li, G. Zheng, and V. Srikumar. 2017. DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS '17). ACM, 1285-1298.
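The truncation idea listed under Improvements could look like the sketch below. It is only an illustration of removing steps from the beginning or the end of an encoded trace. The function name and the `keep` parameter are hypothetical, and the slide itself notes that truncation may lower prediction accuracy.

```python
from typing import List, Sequence


def truncate_trace(encoded: Sequence[int], max_len: int,
                   keep: str = "head") -> List[int]:
    """Shorten an encoded trace to at most max_len symbols.

    keep="head" drops steps from the end; keep="tail" drops steps
    from the beginning.
    """
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    if len(encoded) <= max_len:
        return list(encoded)
    if keep == "head":
        return list(encoded[:max_len])
    if keep == "tail":
        return list(encoded[len(encoded) - max_len:])
    raise ValueError("keep must be 'head' or 'tail'")


trace = [91, 12, 43, 76, 24]
head = truncate_trace(trace, 3)               # keeps the first three spans
tail = truncate_trace(trace, 3, keep="tail")  # keeps the last three spans
```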
  • 10. Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved. The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice. HUAWEI ENTERPRISE ICT SOLUTIONS A BETTER WAY