Distributed Trace & Log Analysis using ML
Prof. Jorge Cardoso
E-mail: jorge.cardoso@huawei.com
Chief Architect for AIOps
Munich & Dublin Research Center
2020.11.04
Definition: AIOps platforms utilize big data, modern machine learning and other advanced analytics
technologies to directly and indirectly enhance IT operations (monitoring, automation and service
desk) functions with proactive, personal and dynamic insight.
Gartner
Huawei SRE Summit: Running Planet Scale Systems
Dublin, 2-4 November 2020
ULTRA-SCALE AIOPS LAB 1
Intelligent Cloud Operations
The emerging field of AIOps
The field of AIOps, also known as Artificial Intelligence for IT Operations, uses advanced technologies to dramatically improve
the monitoring, operation, and troubleshooting of distributed systems. Its main premise is that operations can be automated using
monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how
AIOps – and many related fields such as deep learning, machine learning, distributed traces, graph analysis, time-series
analysis, sequence analysis, advanced statistics, NLP and log analysis – can be explored to effectively detect, localize,
predict, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs) by analyzing service management
data (e.g., distributed traces, logs, events, alerts, metrics). In particular, this talk will describe how a particular monitoring data
structure, called distributed traces, can be analyzed using deep learning to identify anomalies in its spans. This capability
empowers operators to quickly identify which components of a distributed system are faulty.
 Planet/large-scale Distributed systems and cloud computing
 Distributed traces, logs, events, alerts, metrics, ….
 Big Data platforms with Kubernetes, Hadoop, Spark, Flink, …
 Deep Learning, machine learning, data mining, advanced statistics, time-series analysis, NLP, ….
 Detect, localize, predict, and remediate failures in infrastructures
Dr. Jorge Cardoso is Chief Architect for Planet-scale AIOps at Huawei’s Munich and Ireland Research Centers. Previously he
worked for several major companies such as SAP Research (Germany) on the Internet of Services and the Boeing Company in
Seattle (USA) on Enterprise Application Integration. He previously gave lectures at the Karlsruhe Institute of Technology
(Germany), University of Georgia (USA), University of Coimbra and University of Madeira (Portugal). His current research
involves the development of the next generation of AIOps platforms, Cloud Operations and Analytics tools driven by AI, Cloud
Reliability and Resilience, and High Performance Business Process Management systems. He has a Ph.D. in Computer
Science from the University of Georgia (USA).
GitHub | Slideshare.net | GoogleScholar
Interests: AIOps, Service Reliability Engineering, Cloud Computing, Distributed Systems, Business Process Management
HUAWEI CLOUD
Site Reliability Engineering (SRE)
Reliability is an important feature of HUAWEI CLOUD. SRE is responsible for it …
 50% software engineer /
50% system engineer
 Load slows down systems
 Elite forces, focusing on
high-ROI work
 Automation
 Setup quantified SLO
 Run Err Budget
 Self-constraint, balanced
Dev/Ops
 Fact oriented
 Four golden signals
 Alerting from time-series
 Eliminating Toil
 Implement monitoring and
develop handling
processes
 Diagnosis, analysis, and
detailed data
 Define contextual,
customer-focused SLOs
 70% of outages due to
changes in a live system
 Addressing cascading
failures
 Take turns to monitor and
document solutions
 Distributed consensus for
reliability
 Regular review meetings
 Humans add latency
 Training through brain
games
 Perform regular drills in
the environment
 Guarantee that there are
few problems
 It's either a problem or a
quick recovery
 The recovery tool is
designed in the early
stage of the process
 Continuous review and
optimization
… a few numbers…
 worldwide, Huawei Cloud has 44
availability zones across 23 regions (June
2019)
 more than 180 cloud services and 180
solutions for a wide range of sectors
 Customers include the European Organization
for Nuclear Research (CERN), PSA
Group, Shenzhen Airport, Port of Tianjin,
…
SRE HUAWEI CLOUD
AI for IT Operations (AIOps)
Bringing AI to O&M
“Virtyt Koshi, the EMEA general manager for virtualization vendor Mavenir, reckons Google is able
to run all of its data centers in the Asia-Pacific with only about 30 people, and that a telco would
typically have about 3,000 employees to manage infrastructure across the same area."
“We began applying machine learning two years ago (2016) to operate
our data centers more efficiently… over the past few months,
DeepMind researchers began working with Google’s data center team
to significantly improve the system’s utility. Using neural networks
trained on different operating scenarios and parameters, we created a
more efficient and adaptive framework to understand data center
dynamics and optimize efficiency.” Eric Schmidt, Dec. 2018
SRE / O&M Activities
 System monitoring and 24x7 technical support, Tier 1-3 support
 Troubleshooting and resolution of operational issues
 Backup, restoration, archival services
 Update, distribution and release of software
 Change, capacity, and configuration management
 …
Harvard Business Review
Early indicators
Moogsoft AIOps,
Amazon EC2
Predictive Scaling,
Azure VM resiliency,
Amazon S3 Intelligent
Tiering
38.4% of organizations take
more than 30 minutes to
resolve IT incidents that
impact consumer-facing
services (PagerDuty)
https://www.lightreading.com/automation/google-has-intent-to-cut-humans-out-of-network/d/d-id/742158
https://www.lightreading.com/automation/automation-is-about-job-cuts---lr-poll/d/d-id/741989
https://www.lightreading.com/automation/the-zero-person-network-operations-center-is-here-(in-finland)/d/d-id/741695
Intelligent Operation & Maintenance
Troubleshooting
Anomaly Detection
 Response Time Analysis
‒ A service started responding to requests more slowly than
normal
‒ The change happened suddenly as a consequence of a
regression in the latest deployment
 System Load
‒ The demand placed on the system, e.g., REST API requests
per second, has increased since yesterday
 Error Analysis
‒ The rate of requests that fail -- either explicitly (HTTP 5xx) or
implicitly (HTTP 2xx with wrong content) -- is increasing slowly,
but steadily
 System Saturation
‒ The resources (e.g., memory, I/O, disk) used by key controller
services are rapidly reaching threshold levels
Complex System
 OBS. Object Storage Service
 EVS. Elastic Volume Service (block storage)
 VPC. Virtual Private Cloud (private virtual networks)
 ECS. Elastic Cloud Server (scalable computing)
SRE / O&M Activities
 System monitoring and 24x7/Tier 1-3 technical support
 Troubleshooting and resolution of operational issues
 Backup, restoration, archival services
 Update, distribution and release of software
 Change, capacity, and configuration management
 …
Troubleshooting tasks
 Anomaly detection. Determine what constitutes normal
system behavior, and then discern departures from that
normal system behavior
 Fault diagnosis (root cause analysis). Identify links of
dependency that represent causal relationships to discover
the true source of an anomaly
 Fault prediction. Use historical or streaming data to
predict incidents with varying degrees of probability
 Fault recovery. Explore how decision support systems can
manage and select recovery processes to repair failed
systems
Troubleshooting
Monitoring Data Sources
Logs. Service, microservices, and applications
generate logs, composed of timestamped
records with a structure and free-form text,
which are stored in system files.
Metrics. Examples of metrics include CPU
load, memory available, and the response
time of an HTTP request.
Traces. Traces record the workflow and tasks
executed in response to, e.g., an HTTP
request.
Events. Major milestones which occur within a
data center can be exposed as events.
Examples include alarms, service upgrades,
and software releases.
{"traceId": "72c53", "name": "get", "timestamp": 1529029301238, "id": "df332",
"duration": 124957, "annotations": [{"key": "http.status_code", "value": "200"}, {"key":
"http.url", "value": "https://v2/e5/servers/detail?limit=200"}, {"key": "protocol", "value":
"HTTP"}], "endpoint": {"serviceName": "hss", "ipv4": "126.75.191.253"}}
{"tags": ["mem", "192.196.0.2", "AZ01"], "data": [2483, 2669, 2576, 2560, 2549, 2506,
2480, 2565, 3140, …, 2542, 2636, 2638, 2538, 2521, 2614, 2514, 2574, 2519]}
{"id": "dns_address_match", "timestamp": 1529029301238, …}
{"id": "ping_packet_loss", "timestamp": 152902933452, …}
{"id": "tcp_connection_time", "timestamp": 15290294516578, …}
{"id": "cpu_usage_average", "timestamp": 1529023098976, …}
2017-01-18 15:54:00.467 32552 ERROR oslo_messaging.rpc.server [req-c0b38ace -
default default] Exception during message handling
System’s Components (e.g., OBS, EVS, VPC, ECS) are monitored and generate various types of data:
Logs, Metrics, Traces, Events, Topologies
INNOVATION PATH FORWARD
BACKGROUND & MOTIVATION DESCRIPTION ANTICIPATED IMPACT
Intelligent Log Analysis
Root Cause Localization using Application Logs
Problem
 Complex systems generate high volumes of logs
for manual analysis/troubleshooting. E.g., a single-node
check using logs takes >15 min
Analyzing logs manually is time-consuming
and error-prone
Intelligent Log Analysis
 Insight. Recent research has shown that it is
possible to identify and model the underlying
structure of application logs using ML [1, 2]
Higher levels of visualization and abstraction are required (bird's-eye
view, zoom in/out)
MAIN ACHIEVEMENT
Extracting valuable insights from massive, noisy, and
unstructured logs
 Pattern mining and NLP are used to abstract and summarize
massive logs
 Help O&M teams improve the efficiency of troubleshooting using
logs
ASSUMPTIONS & LIMITATIONS
 Processing requires historical logs to learn normality
 No real-time processing. Only on-demand operation is available
Status
 Higher levels of visualization and abstraction are
effective for operators to quickly troubleshoot
system using logs
Next steps
 Research other forms of visualization (e.g., Kernel
Density Estimation, Network Diagrams,
Correlation Matrices)
Technology Transition
 Evaluation of infrastructures to process ultra-scale
logs (e.g., batch versus stream processing)
Potential End Users
 Service Owners, DevOps and SRE
Log Analysis
Log analysis can be integrated into traditional log
platforms
HOW IT WORKS
 Performance is a challenge! Processing a million
log records per second is difficult
1. Fast log template mining using
the Drain algorithm
2. Language-aware log parsing
and keyword extraction using
NLP approaches (www.spacy.io)
3. Time-series classification using
a Poisson model
(en.wikipedia.org/wiki/Poisson_distribution)
4. Grouping using the Pearson
correlation coefficient
(en.wikipedia.org/wiki/Pearson_correlation_coefficient)
5. Distance-aware correlation
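The middle steps of this pipeline can be illustrated with a small sketch. It is not the Drain algorithm and not the talk's implementation: template mining is approximated by masking numbers, the per-minute counts are made-up data, and the thresholds (0.01 and 0.9) are assumptions chosen for the example. It only shows how a Poisson tail probability can flag a burst of a template and how the Pearson coefficient can group templates that burst together.

```python
import math
import re

def to_template(message):
    """Crude stand-in for template mining: lower-case and mask numbers."""
    return re.sub(r"\b\d+\b", "<*>", message.lower())

def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    below_k = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below_k

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long count series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0

# Hypothetical per-minute counts of two templates; the last minute is a burst.
counts = {
    "connection <*> failed": [1, 0, 2, 1, 1, 9],
    "retrying request <*>": [2, 1, 3, 2, 2, 12],
}

for template, series in counts.items():
    history = series[:-1]
    baseline = sum(history) / len(history)  # rate learned from history
    p_value = poisson_tail(series[-1], baseline)
    print(template, "burst" if p_value < 0.01 else "normal")

first, second = counts.values()
print("correlated" if pearson(first, second) > 0.9 else "independent")
```

The Poisson test needs a learned baseline rate, which matches the stated limitation that processing requires historical logs to learn normality.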
Fig.
Existing Log
Analysis services
(e.g., ELK)
10000 error messages
100 templates
30 events
6 events correlated
Fig.
Complex
services
generate
several GB
of logs per
second
as-is
Template mining Template analysis
 Visualization
 Anomaly detection
Fig. The process of intelligent log analysis
to-be
INNOVATION PATH FORWARD
BACKGROUND & MOTIVATION DESCRIPTION ANTICIPATED IMPACT
Trace Analysis (TA)
Structure-based Anomaly Detection and RCA
Current limitations
 Existing solutions only provide trace visualization tools
 Analyzing traces manually is error-prone and not scalable
While popular, only visualization tools
have been developed for trace analysis
Apply recent ML and statistical
approaches to process sequential data
 Explore the use
of Deep
Learning: Long
Short Term
Memory (LSTM)
 Explore the use
of attention
networks
 Explore the use
of association
rules
TRL 4. Small-scale prototype. Basic technological components are
integrated to establish that they will work together.
MAIN ACHIEVEMENT
Trace anomaly detection and root-cause analysis using
trace structure
 Previous attempts using Deep Learning (LSTM, CNN), Machine
Learning (OPTICS), and Sequence Analysis (LCS, multiple sequence
alignment, Needleman-Wunsch) algorithms did not enable root
cause analysis
ASSUMPTIONS & LIMITATIONS
 Microservices are instrumented with tracing capabilities
 Tracing tracks microservice to microservice communication
Improve trace-based RCA in 90%
HOW IT WORKS
1) Learning. For each service
endpoint, learn the traces’
structure it generates
2) Modeling. Aggregate all the
traces into a behavior model
3) Anomaly detection. When a
new trace is generated, compare
its structure with the behavior
model. If it was not seen
before, an anomaly exists
4) Root-cause analysis. When an
anomaly is detected, determine
in which span it occurred and
identify host, service, function
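The four steps above can be sketched as follows. This is a simplified illustration under assumptions: a trace structure is reduced to a flat tuple of span names (real traces are trees), the class `StructureModel` and its methods are hypothetical names, and root-cause analysis is approximated by the first span where the new trace diverges from the closest known structure.

```python
class StructureModel:
    """Behavior model: the set of span structures seen per service endpoint."""

    def __init__(self):
        self.known = {}  # endpoint -> set of span-name tuples

    def learn(self, endpoint, spans):
        """Steps 1 and 2: record the structure of a trace for an endpoint."""
        self.known.setdefault(endpoint, set()).add(tuple(spans))

    def check(self, endpoint, spans):
        """Steps 3 and 4: return None for a known structure, otherwise the
        index of the first span that diverges from the closest known trace."""
        spans = tuple(spans)
        structures = self.known.get(endpoint, set())
        if spans in structures:
            return None
        longest = 0
        for reference in structures:
            common = 0
            for seen, new in zip(reference, spans):
                if seen != new:
                    break
                common += 1
            longest = max(longest, common)
        return longest

model = StructureModel()
model.learn("create_vm", ["keystone", "nova-api", "nova-scheduler", "nova-compute"])

new_trace = ["keystone", "nova-api", "glance-api", "nova-compute"]
index = model.check("create_vm", new_trace)
if index is not None:
    print("ANOMALY at span", index, "service", new_trace[index])
```

If the new trace is a truncated prefix of a known structure, the returned index equals the length of the trace, which indicates a missing span rather than an unexpected one.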
Trace Events
OK | ANOMALY
Create VM
LSTM Network
Trace Structure
Research Next Steps
 Which data to inject into traces’ payload which
can significantly assist troubleshooting?
 New visualization tools beyond Gantt diagrams
for O&M
Technology Transition
 Integration / alignment tracing, logging and
metrics pipelines
Potential End Users
 SRE and DevOps
 PaaS and customer cloud native applications
E2E: Mobile apps, web clients, monoliths, etc.
Fig.
Red circles show structural anomalies (mockup)
Create
Model
Root-cause
Analysis
Anomaly
Detection
Host: 192.168.5.13
Service: nova-api
Function: schedule VM
Fig.
Elastic APM
Trace
visualization
(in blue)
Structural
anomalies
(in red)
Learn traces structure
Troubleshooting
Using Tracing Technologies
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/architecture-guide/chapter-1-components
Troubleshooting
1. Find root cause issues in requests across several services / hosts
2. Benchmark different systems and identify performance bottlenecks
3. Reconstruct the workflow of service-to-service communications
Tracing Technologies
Systems and pseudo-standards
Fig. Amazon AWS X-Ray
Fig. Google Cloud Trace
Fig. OpenTracing
 Tracing Services
‒ Google Stackdriver (cloud.google.com/trace)
‒ Amazon AWS X-Ray (aws.amazon.com/xray)
‒ Lightstep (lightstep.com)
 Tracing Systems
‒ Twitter Zipkin (zipkin.io)
‒ Uber Jaeger
‒ OSProfiler
‒ Pivot Tracing (pivottracing.io)
 Tracing Standards
‒ opentelemetry.io
‒ OpenTracing
‒ OpenCensus
[1] http://opentracing.io/documentation/
[2] https://github.com/opentracing/specification/blob/master/specification.md
OpenTracing provides consistent, expressive, vendor-neutral
APIs for popular platforms [1]
2019: The open source
distributed tracing projects
OpenCensus and OpenTracing
were merged into a new project
called OpenTelemetry
Troubleshooting OpenStack
Traces and Workflows
Distributed Tracing
1. Provides developers a detailed view of requests as they flow across a
distributed system
2. Identifies which microservices are involved in processing a request
3. Trace the path of a request to generate a transaction
4. Localize which microservice in a path (trace) is anomalous
Distributed Traces
Abstractions
 Markov Chain
‒ A stochastic process is a Markov chain if:
‒ The probability distribution of a state X_i depends on the previous state
X_{i-1} and does not depend on earlier states
‒ First-order Markov Chain, with transition probabilities a_{x_{i-1} x_i}:
p(x_1, \ldots, x_n) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_{i-1} = x_{i-1}) = p(x_1) \prod_{i=2}^{n} a_{x_{i-1} x_i}
‒ K-th order Markov Chain:
p(x_1, \ldots, x_n) = p(X_1 = x_1, \ldots, X_k = x_k) \prod_{i=k+1}^{n} p(X_i = x_i \mid X_{i-1} = x_{i-1}, X_{i-2} = x_{i-2}, \ldots, X_{i-k} = x_{i-k})
 Tree Structure
‒ Finite set of one or more nodes A, B, C, …
‒ Special nodes: the root node has no parent. Leaf nodes have no children.
‒ Remaining nodes are partitioned into n >= 0 disjoint sets T1, ..., Tn,
where each set is also a tree
‒ Tree representation: (A (B (C, D), E))
 Sequence of Events
‒ Given a set E = {e1,...,en} of event types, an event is a pair (A, t), where
A ∈ E and t ∈ N is the occurrence time of the event
‒ An event sequence s on E is an ordered sequence of events:
s = \langle (A_1, t_1), (A_2, t_2), \ldots, (A_n, t_n) \rangle
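As an illustration of the Markov chain abstraction, the sketch below estimates first-order transition probabilities from example span sequences and scores a new sequence as the product of its transition probabilities. The function names and the toy sequences are assumptions made for this example. The probability of the first state is passed in as `p_first` and defaults to 1.

```python
from collections import Counter, defaultdict

def fit_transitions(sequences):
    """Estimate first-order transition probabilities from observed sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for prev, ctr in counts.items()
    }

def sequence_probability(seq, transitions, p_first=1.0):
    """p(x_1) times the product of transition probabilities along seq."""
    prob = p_first
    for prev, nxt in zip(seq, seq[1:]):
        prob *= transitions.get(prev, {}).get(nxt, 0.0)  # unseen transition -> 0
    return prob

observed = [["A", "B", "C"], ["A", "B", "D"]]
transitions = fit_transitions(observed)
print(sequence_probability(["A", "B", "C"], transitions))
print(sequence_probability(["A", "C"], transitions))
```

A sequence that contains a transition never observed in training gets probability zero, which is the simplest form of structural anomaly detection on traces.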
Distributed Trace Analysis
Sequence Analysis
 Sequence Prediction
‒ Predict elements of a sequence on the basis of the preceding elements
‒ Given si, si+1, …, sj, predict sj+1, i.e., si, si+1, …, sj → sj+1
‒ Input: 1, 2, 3, 4, 5; output: 6
‒ Applications: weather forecast
 Sequence Classification
‒ Predict a class label for a given input sequence.
‒ Given si,si+1,…,sj, determine if it is legitimate, i.e., si,si+1,…,sj →yes or no
‒ Input: 1, 2, 3, 4, 5; output: “normal” or “anomalous”
‒ Applications: trace anomaly detection, DNA sequence classification
 Sequence Generation
‒ Generate new output sequence with same general characteristics as the input
‒ Input: [1, 3, 5], [7, 9, 11]; output: [3, 5, 7]
‒ Applications: text generation
 Sequence to Sequence Prediction
‒ Predict an output sequence given an input sequence.
‒ Input: 1, 2, 3, 4, 5; output: 6, 7, 8, 9, 10
‒ Applications: text summarization
1, 2, 3, 4, 5 ► 6
1, 2, 3, 4, 5 ► NORMAL
1, 3, 5 and 7, 9, 11 ► 3, 5, 7
1, 2, 3, 4, 5 ► 6, 7, 8, 9, 10
Sequence learning: from recognition and prediction to sequential decision making, IEEE Intelligent Systems, 2001,
http://www.sts.rpi.edu/~rsun/sun.expert01.pdf
Distributed Trace Analysis
Sequence Analysis
1. User commands: image
delete, create server, …
Services: keystone -> nova -> glance -> network (nodes N1, N2, N3, N4)
Fig. Event transitions with probabilities 30% and 70%: Warning Y, Warning A, Warning B, Error Z
a1 62 e2 28
Trace t1 Sequence of spans
auth_tokens_v3 1314421019193f1313162f253233191e341a3c3b402f2624192f214521162f163e2746282c192f2e2e24161d2c2c161628241635287a
images_v2 4342421119131616161a311316192216162f23213e1d2127272728241621191d5e5c471d132c4a16241616213c63
_var__detail_flavors_v2.1 5d13421314151316161b193b21242113364024192f162f131d21161621212c27282c2c2d162f5e5c132c16164d
image_schemas_v2 141319133113162f1644161b331e351a213b1613133b13192f5116132113131616252121282c2c162d21582e2e161d13643b132c4a78…
gather_result 1912141613252128211a1f402f221921162f2713282b2d2416131d13211a5a21
security-groups_v2.0 4c1319131d19131316132f
_var__images_v2 121313251b2f192f35211f241916223b1613231616242713462c53241654192c285a21
_var___var__action_servers_v2.1 61135d106c12131325191b1c193d281a4022136d13162f3b292116231d1627272741283b2416…f4a5121352f13
neutronclient.v2_0.client.retry_request 162f1319132f165124191316
4. Sequence Classification: identify invalid sequences
Service call encoding
16: {'v2.0', 'ports'}
51 : {'v2.0', '_var_', 'ports'}=200
24 : {'nova.compute.api.API.get'}=0
…
Service call encoding
13: {'v3', 'tokens', 'auth'}
Encoding matching
53.47%
2. A distributed trace is
generated
3. The distributed trace is
encoded into a sequence
Function call sequences
[{'v3'}=200, {'v3', 'tokens', 'auth'}=200, {'v...
Function call sequences
Similar call sequence
Anomaly Detection from System Tracing Data using Multimodal Deep Learning
IEEE Cloud 2019, July 3-8, 2019, Milan, Italy.
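The encoding and matching steps on this slide can be illustrated with a small sketch. It is an assumption-laden stand-in, not the method of the cited paper: each distinct service call gets a two-hex-digit code in order of first appearance, and similarity is computed with Python's `difflib.SequenceMatcher`, which is swapped in here because the slide does not say how the matching percentage is obtained. The call names are shortened, hypothetical examples.

```python
from difflib import SequenceMatcher

def build_codebook(traces):
    """Assign each distinct service call an integer code (1, 2, 3, ...)."""
    codebook = {}
    for trace in traces:
        for call in trace:
            codebook.setdefault(call, len(codebook) + 1)
    return codebook

def encode(trace, codebook):
    """Encode a trace as a string of two-hex-digit codes (at most 255 calls)."""
    return "".join(format(codebook[call], "02x") for call in trace)

def similarity(encoded_a, encoded_b):
    """Ratio of matching codes between two encoded traces, in [0, 1]."""
    codes_a = [encoded_a[i:i + 2] for i in range(0, len(encoded_a), 2)]
    codes_b = [encoded_b[i:i + 2] for i in range(0, len(encoded_b), 2)]
    return SequenceMatcher(None, codes_a, codes_b).ratio()

traces = [
    ["v3", "v3/auth/tokens", "v2.0/ports"],
    ["v3", "v3/auth/tokens", "v2/images"],
]
codebook = build_codebook(traces)
first, second = (encode(t, codebook) for t in traces)
print(first, second, round(100 * similarity(first, second), 2), "%")
```

Comparing on two-character code units rather than on raw characters avoids accidental matches between the digits of different codes.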
Distributed Trace Analysis
Why is trace analysis challenging?
1. Retry pattern
n_tries = 3
while True:
    try:
        ExecuteOperation()  # Success
        break
    except TimeoutException:
        n_tries -= 1  # One attempt has been used
        if n_tries == 0:  # Give up
            raise Failure
        sleep(1)  # Wait 1 s before retrying (time.sleep takes seconds)
T1: A, B, C, D, E, …
T2: A, B, C, C, D, E, …
T3: A, B, C, C, C, D, E, …
T4: A, B, C, C, C, Z
2. Cache-Aside pattern
v = cache.get(k)
if v is None:  # Check if the item is in the cache
    v = store.get(k)  # If not in cache, read from data store
    cache.put(k, v)  # Put a copy of the item into the cache
T1: A, B, C, D, E, …
T2: A, C, D, E, …
3. One-to-many subcalls
@app.route('/servers/<int:user_id>/<int:number>')
def create(user_id, number):
    for i in range(number):
        authentication(user_id)
        # Call remote microservice to create a server
        rpc.create_server()
    return 'created'
T1: A, B, C, D, E, …
T2: A, B, C, C, D, E, …
T3: A, B, C, C, C, D, E, …
T4: A, B, C, C, C, C, D, E, …
Many “hidden” software patterns affect running systems: circuit breaker, concurrency and parallelism, priority
queues, publisher-subscriber, bulkhead, etc.
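To see why such patterns make trace analysis hard, the sketch below simulates the retry pattern and prints the span sequence it produces. The function name `trace_with_retry` and the span labels A to E and Z follow the slide's notation but are otherwise assumptions: span C is the retried call, and Z stands for the failure path taken once all attempts are used.

```python
def trace_with_retry(fails_before_success, max_tries=3):
    """Return the span sequence when call C fails a given number of times."""
    spans = ["A", "B"]
    for attempt in range(1, max_tries + 1):
        spans.append("C")  # every attempt emits one more C span
        if attempt > fails_before_success:
            spans += ["D", "E"]  # the call succeeded, the workflow continues
            return spans
    spans.append("Z")  # all attempts failed, the error path is taken
    return spans

for failures in range(4):
    print(failures, trace_with_retry(failures))
```

All four sequences come from the same code path, yet they differ in length and content, so a detector that only accepts one exact sequence would raise false alarms on legitimate retries.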
Distributed Trace Analysis
Using LSTMs
1. Model OpenStack behavior
OPENSTACK_SEQUENCE = [
    ('keystone', 1.0, 'authenticate user with credentials and generate auth-token'),
    ('nova-api', 1.0, 'get user request and sends token to Keystone for validation'),
    ('keystone', 0.2, 'validate the token if not in cache'),
    ('nova-api', 1.0, 'starts VM creation process'),
    ('nova-database', 1.0, 'create initial database entry for new VM'),
    ('nova-scheduler', 1.0, 'locate an appropriate host using filters and weights'),
    ('nova-database', 1.0, 'execute query'),
    ('nova-compute', 1.0, 'dispatches request'),
    ('nova-conductor', 1.0, 'gets instance information'),
    ('nova-compute', 1.0, 'get image URI from image service'),
    ('glance-api', 1.0, 'get image metadata'),
    ('nova-compute', 1.0, 'setup network'),
    ('neutron-server', .5, 'allocate and configure network IP address'),
    ('nova-compute', 1.0, 'setup volume'),
    ('cinder-api', .75, 'attach volume to the instance or VM'),
    ('nova-compute', 1.0, 'generates data for the hypervisor driver')
]
3. Transform traces into sequences of integers
Train dataset
span_list
trace_id
0 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
1 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
2 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
3 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
4 [3, 5, 3, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
... ...
95 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
96 [3, 5, 3, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
97 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
98 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
99 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
[100 rows x 1 columns]
Test dataset
span_list
trace_id
0 [3, 5, 5, 1, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
1 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
[0,
['keystone',
'nova-api',
'nova-api',
'nova-database',
'nova-scheduler',
'nova-database',
'nova-compute',
'nova-conductor',
'nova-compute',
'glance-api',
'nova-compute',
'nova-compute',
'cinder-api',
'nova-compute']]
Cache Simulation: Probability of execution = 75%
Trace #0
 Generate Synthetic Traces (generate_traces())
‒ Create a simple model to generate traces
‒ OPENSTACK_SEQUENCE models sequences of spans which are valid
• e.g., ['keystone', 'nova-api', 'nova-api', 'nova-database', 'nova-scheduler', 'nova-database', 'nova-compute', 'nova-conductor', 'nova-compute', 'glance-api', 'nova-compute', 'nova-compute', 'cinder-api', 'nova-compute']
‒ Each microservice name is encoded into a number using a dictionary (create_coders())
 Trace Mutation (mutate_traces())
‒ Selected synthetic traces are mutated
‒ Spans can be deleted or replaced
2. Generate Synthetic Traces
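A minimal sketch of the generation, encoding, and mutation steps is shown below. The function names echo those on the slide, but the bodies are assumptions written for illustration and are not taken from the referenced repository. `SEQUENCE` is a shortened copy of the first entries of OPENSTACK_SEQUENCE without the description field, and code 0 is left free for padding.

```python
import random

# Shortened (service, probability) model based on OPENSTACK_SEQUENCE.
SEQUENCE = [
    ("keystone", 1.0),
    ("nova-api", 1.0),
    ("keystone", 0.2),  # token validation only runs if the token is not cached
    ("nova-api", 1.0),
    ("nova-database", 1.0),
]

def create_coders(sequence):
    """Map each service name to an integer code, starting at 1."""
    names = sorted({service for service, _ in sequence})
    encoder = {name: code for code, name in enumerate(names, start=1)}
    decoder = {code: name for name, code in encoder.items()}
    return encoder, decoder

def generate_traces(sequence, n_traces, rng):
    """Emit each span with its probability to obtain valid traces."""
    return [[s for s, p in sequence if rng.random() < p] for _ in range(n_traces)]

def mutate_trace(trace, n_symbols, rng):
    """Delete or replace one span to create an anomalous trace."""
    mutated = list(trace)
    position = rng.randrange(len(mutated))
    if rng.random() < 0.5:
        del mutated[position]
    else:
        others = [c for c in range(1, n_symbols + 1) if c != mutated[position]]
        mutated[position] = rng.choice(others)
    return mutated

rng = random.Random(0)
encoder, decoder = create_coders(SEQUENCE)
traces = generate_traces(SEQUENCE, 100, rng)
encoded = [[encoder[s] for s in t] for t in traces]
anomalous = mutate_trace(encoded[0], len(encoder), rng)
print(encoded[0], "->", anomalous)
```

A mutated trace always differs from its source, since a deletion changes the length and a replacement always picks a different code.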
Distributed Trace Analysis
Using LSTMs
[[0 0 3 ... 6 1 6]
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
...
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
[0 3 5 ... 6 1 6]]
Padding
2. One-hot encoding
Tensorflow tutorial: https://github.com/campdav/text-rnn-tensorflow
Keras tutorial: https://github.com/campdav/text-rnn-keras
Code available at: https://github.com/jorge-cardoso/aiops_practice
Subdirectory: dta_lstm
 Padding (_pad_traces())
‒ Pad each trace (sequence) with zeros (0), up to a pre-defined maximum length.
‒ This allows pre-allocating a chain of LSTM units of a specific length
‒ Padding is done using pad_sequences(sequences,...)
 One-hot encoding (_traces_to_binary())
‒ Represents categorical variables as binary vectors
‒ Each integer value is represented as a binary vector that has all zero values except the index of the integer, which is
marked with a 1
X = pad_sequences(X, maxlen=self.trace_size, dtype=np.int16,
                  truncating='pre', padding='pre',
                  value=DEFAULTS['padding_symbol'])
maxlen: int, maximum length of all sequences.
truncating='pre' removes values from the beginning of sequences longer than maxlen
padding='pre' pads each trace at the beginning with a special integer (e.g., 0)
1. Padding
keras.utils.to_categorical(X, num_classes=self.max_n_span_types,
                           dtype='int32')
X.shape = (n_traces, self.trace_size)
returns shape = (n_traces, self.trace_size, self.max_n_span_types)
[[[1. 0. 0. ... 0. 0. 0.]
[1. 0. 0. ... 0. 0. 0.]
[0. 0. 1. ... 0. 0. 0.]
...
]
]
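To make the two transformations concrete without a Keras installation, here is a NumPy-only sketch that mimics the behavior described above: pre-padding and pre-truncation to a fixed length, followed by one-hot encoding. The helper names `pad_pre` and `one_hot` are hypothetical and are not the Keras functions themselves.

```python
import numpy as np

def pad_pre(traces, maxlen, value=0):
    """Left-pad (and left-truncate) integer traces to a fixed length."""
    out = np.full((len(traces), maxlen), value, dtype=np.int16)
    for row, trace in enumerate(traces):
        tail = trace[-maxlen:]  # keep the last maxlen spans
        out[row, maxlen - len(tail):] = tail
    return out

def one_hot(X, num_classes):
    """Turn an integer array of shape (n, t) into shape (n, t, num_classes)."""
    return np.eye(num_classes, dtype=np.int32)[X]

X = pad_pre([[3, 5, 6], [1, 2, 3, 4, 5]], maxlen=4)
X_bin = one_hot(X, num_classes=7)
print(X)
print(X_bin.shape)
```

Each position of the padded array becomes a vector with a single 1, so the padding symbol 0 is treated as its own class.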
Distributed Trace Analysis
Using LSTMs
X [[0 0 3 ... 6 1 6]
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
...
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
[0 3 5 ... 6 1 6]]
y [[0 0 5 ... 1 6 0]
[0 0 5 ... 1 6 0]
[0 5 5 ... 1 6 0]
...
[0 0 5 ... 1 6 0]
[0 5 5 ... 1 6 0]
[0 5 5 ... 1 6 0]]
 Generate y
‒ Shift traces by one position to the left
‒ Fill new cell with DEFAULTS['padding_symbol']
# X.shape[0] is n_traces
# X.shape[1] is self.trace_size
y = np.roll(X, -1, axis=1)
# Pad y array where X array has zeros
y[X == 0] = DEFAULTS['padding_symbol']
# Write the DEFAULTS['padding_symbol'] at the last position of the shifted array
y[:, -1] = DEFAULTS['padding_symbol']
1. Generate y by shifting X
X 0 3 5 2 1 7
y 0 5 2 1 7 0
Shift traces left
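The shift can be checked on the small example from the slide. The sketch below wraps the same three NumPy operations in a standalone function; the name `shift_traces_left` and the constant `PADDING` are assumptions standing in for the class method and DEFAULTS['padding_symbol'].

```python
import numpy as np

PADDING = 0  # stands in for DEFAULTS['padding_symbol']

def shift_traces_left(X, padding=PADDING):
    """Build next-span targets: y[t] is the span that follows X[t]."""
    y = np.roll(X, -1, axis=1)  # circular shift one position to the left
    y[X == padding] = padding   # no target where the input is padding
    y[:, -1] = padding          # the wrapped-around value is not a real target
    return y

X = np.array([[0, 3, 5, 2, 1, 7]])
y = shift_traces_left(X)
print(X)
print(y)
```

One consequence worth noting: because targets are blanked wherever the input is padding, the first real span of a pre-padded trace never appears as a target, so the model is only asked to predict from the second span onwards.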
Distributed Trace Analysis
Using LSTMs
 Create DL model (_build_model())
‒ Use an LSTM network
‒ dropout layer of 0.2 (high, but necessary to avoid quick divergence)
‒ Softmax activation to generate probabilities over the different categories
‒ Because it is a classification problem, loss calculation via categorical cross-entropy compares the output probabilities
against the one-hot encoding
‒ ADAM optimizer (instead of the classical stochastic gradient descent) to update weights
import keras
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

model = Sequential()
# Three stacked LSTM layers; each returns the full sequence of hidden states
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True,
               input_shape=(self.trace_size, input_dim)))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
# A Dense layer is used as the output for the network.
model.add(TimeDistributed(Dense(input_dim, activation='softmax')))
if self.gpus > 1:
    # multi_gpu_model is only available in older Keras releases
    model = keras.utils.multi_gpu_model(model, gpus=self.gpus)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
1. Create DL model
Distributed Trace Analysis
Using LSTMs
 Train the model (fit())
‒ X = Obtain the traces composed of spans
‒ Pad the traces of X
‒ y = Shift the spans of the traces left
‒ Apply one-hot encoding to X and y
‒ Build the model
‒ Fit the model f(X) = y
# X_train has 2 columns: traceId, span_list. Here we select the span_list
X_train = X_train.loc[:, DEFAULTS['span_list_col_name']].values
X = self._pad_traces(X_train)
y = self._shift_traces_left(X)
self._get_lstm_parameters(X)
X = self._traces_to_binary(X)
y = self._traces_to_binary(y)
self.model = self._build_model(input_dim=X.shape[2])
self.model.fit(X, y, epochs=self.epochs, batch_size=self.batch_size, shuffle=True,
               validation_split=self.validation_split)
1. Train the model
Distributed Trace Analysis
Using LSTMs
Test Traces
X [[0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6]
[0 0 3 5 5 8 9 8 6 7 6 2 6 4 6 6]]
X[0]: [0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6] =
[[0, 0.2625601], [1, 0.38167596], [0, 0.75843567],
[0, 0.8912414], [9, 2.394792e-05], [0, 0.94694203],
[0, 0.9656691], [0, 0.9686248], [0, 0.9666971],
[0, 0.95213455], [0, 0.94981205], [0, 0.93016535],
[0, 0.60129535], [0, 0.5417027], [0, 0.6876041],
[0, 0.82936114]]
X[1]: [0 0 3 5 5 8 9 8 6 7 6 2 6 4 6 6] =
[[0, 0.2625601], [1, 0.38167596], [0, 0.75843567],
[0, 0.8912414], [0, 0.9302344], [0, 0.94507533],
[0, 0.9638574], [0, 0.9665326], [0, 0.9627945],
[0, 0.9435249], [0, 0.9405963], [0, 0.92075515],
[1, 0.3207201], [1, 0.46595544], [0, 0.64980185],
[0, 0.8390282]]
trace_id 0 indices [4]
trace_id 1 indices []
 Test traces (predict())
‒ X = Obtain the traces composed of spans
‒ Pad the traces of X
‒ Apply one-hot encoding to X
‒ Calculate: yhat = f(X)
‒ Identify in which spans the anomalies occur
X_test = X_test.loc[:, DEFAULTS['span_list_col_name']].values
X_test = self._pad_traces(X_test)
X_test_bin = self._traces_to_binary(X_test)
yhat = self.model.predict(X_test_bin)
self._identify_anomalies(X_test, yhat, prob=self.threshold)
return self.rank_prob
1. Test unseen traces
Distributed Trace Analysis
Using LSTMs
 Identifying Anomalous Traces Sequences (_identify_anomalies())
‒ Mark anomalies based on the difference between y and yhat
‒ y is the shifted version of X
‒ yhat = f(X)
‒ If we cannot correctly predict the shift, the trace did not occur before and is marked as anomalous
y = DTA_LSTM._shift_traces_left(X)
for i in range(len(X)):
    rp = self._compare_traces(y[i], yhat[i])
    self.rank_prob.append(rp)
    if self.explain:
        print('X[{}]: {} = {}'.format(i, X[i], rp))
return self.rank_prob
Input X = [0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6]
Shifted_X = [0 0 5 5 1 9 8 6 7 6 2 6 6 1 6 0]
Prediction
yhat = [0 0 5 5 12 9 8 6 7 6 2 6 6 1 6 0]
X[0]: [0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6] =
[[0, 0.2625601], [1, 0.38167596], [0, 0.75843567],
[0, 0.8912414], [9, 2.394792e-05], [0, 0.94694203],
[0, 0.9656691], [0, 0.9686248], [0, 0.9666971],
[0, 0.95213455], [0, 0.94981205], [0, 0.93016535],
[0, 0.60129535], [0, 0.5417027], [0, 0.6876041],
[0, 0.82936114]]
Error at index [4]
X[i]: Input: [ 0 0 0 0 0 0 0 0 0 0 0 10 10 17 5 5 7 7 6 6]
y[i]: Output: [ 0 0 0 0 0 0 0 0 0 0 0 10 17 5 5 7 7 6 6 0]
yhat[i]: Predicted: [ 0 0 0 0 0 0 0 0 0 0 0 10 17 5 5 7 7 6 6 0]
Correct prediction
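The printed pairs appear to hold, for each position, the rank of the true next span among the predicted probabilities and the probability assigned to it. Under that reading, the comparison step can be sketched as below. The function names, the toy probability vectors, and the threshold of 0.01 are assumptions for this example and are not the body of _compare_traces().

```python
import numpy as np

def compare_trace(y_true, y_prob):
    """For each position, return [rank, probability] of the true next span.
    Rank 0 means the true span was the model's most likely prediction."""
    result = []
    for target, probs in zip(y_true, y_prob):
        order = np.argsort(-probs)  # class indices, most likely first
        rank = int(np.where(order == target)[0][0])
        result.append([rank, float(probs[target])])
    return result

def anomalous_indices(rank_prob, threshold=0.01):
    """Positions whose true span received a probability below the threshold."""
    return [i for i, (_, prob) in enumerate(rank_prob) if prob < threshold]

y_true = [1, 2]                       # true next spans at two positions
y_prob = np.array([[0.1, 0.8, 0.1],   # model is confident in class 1
                   [0.6, 0.399, 0.001]])  # true class 2 is very unlikely
rank_prob = compare_trace(y_true, y_prob)
print(rank_prob, anomalous_indices(rank_prob))
```

Thresholding on probability rather than on rank alone tolerates positions where several continuations are legitimate, such as the optional cached-token step.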
Single and Bi-Models
 Use of single and bi-models to capture traces as
sequences of events and their response time
 Use sequential model representation by utilizing long
short-term memory (LSTM) networks
Related Research
Technical University of Berlin
S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection from System Tracing Data using Multimodal Deep Learning, IEEE Cloud 2019, July 3-8, 2019, Milan, Italy.
Idea: represent distributed traces as sequences of
events and their response time
Trace Structure (sequence of events) and
Response time (duration)
e.g., Service 11 → Service 21; duration = 12ms
Related Research
Facebook
https://blog.acolyer.org/2015/10/07/the-mystery-machine-end-to-end-performance-analysis-of-large-scale-internet-services/
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chow.pdf
Measure end-to-end performance of requests
 Infer causal relationships from logs
‒ Adding instrumentation retroactively is an
expensive task
‒ Hypothesize and confirm relationships in
messages
 Log Data
‒ Request & host id, timestamp, unique event label
 Finding Relationships
‒ Sample requests, store logs in Hive, and run
Hadoop jobs to infer causal relationships
‒ 2h of Hadoop processing to analyze 1.3M requests sampled over 30
days
‒ Assumes a hypothesized relationship between
two segments until finding a counterexample
‒ 3 types of relationship inferred: happens-before,
mutual exclusion, pipeline
 Applications
‒ Find critical path and slack segments for
performance optimization
‒ Anomaly detection: 1) select top 5% of end-to-end
latency, 2) identify segments with proportionally
greater representation in the outlier set of
requests than in the non-outlier set.
As the number of traces analyzed increases, the observation of new counterexamples
diminishes, leaving behind only true relationships
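The hypothesize-and-refute idea can be sketched for the happens-before relationship. This is a simplification made for illustration: each request is reduced to a dictionary of event label to timestamp, whereas the cited work reasons about segments, and the event names are made up. Every ordered pair starts as a hypothesis and is dropped as soon as one request contradicts it.

```python
from itertools import permutations

def infer_happens_before(requests):
    """Return the (a, b) pairs for which a preceded b in every request
    that contains both events."""
    events = set().union(*requests)
    hypotheses = set(permutations(sorted(events), 2))
    for request in requests:
        for a, b in list(hypotheses):
            if a in request and b in request and request[a] >= request[b]:
                hypotheses.discard((a, b))  # counterexample found
    return hypotheses

requests = [
    {"recv": 1, "db": 2, "send": 3},
    {"recv": 1, "db": 3, "send": 2},  # db and send swap order here
]
print(sorted(infer_happens_before(requests)))
```

With only two requests, the ordering between db and send is already refuted in both directions, while recv still precedes both, which illustrates why more traces leave fewer but more reliable relationships.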
Copyright©2019 Huawei Technologies Co., Ltd.
All Rights Reserved.
The information in this document may contain predictive
statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.
Bring digital to every person, home and
organization for a fully connected,
intelligent world.
Thank you.

Distributed Trace & Log Analysis using ML

  • 1. Distributed Trace & Log Analysis using ML Prof. Jorge Cardoso E-mail: jorge.cardoso@huawei.com Chief Architect for AIOps Munich & Dublin Research Center 2020.11.04 Definition: AIOps platforms utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly enhance IT operations (monitoring, automation and service desk) functions with proactive, personal and dynamic insight. Gartner Huawei SRE Summit: Running Planet Scale Systems Dublin, 2-4 November 2020
  • 2. ULTRA-SCALE AIOPS LAB 1 Intelligent Cloud Operations The emerging field of AIOps The field of AIOps, also known as Artificial Intelligence for IT Operations, uses advanced technologies to dramatically improve the monitoring, operation, and troubleshooting of distributed systems. Its main premise is that operations can be automated using monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how AIOps – and many related fields such as deep learning, machine learning, distributed traces, graph analysis, time-series analysis, sequence analysis, advanced statistics, NLP and log analysis – can be explored to effectively detect, localize, predict, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs) by analyzing service management data (e.g., distributed traces, logs, events, alerts, metrics). In particular, this talk will describe how a particular monitoring data structure, called distributed traces, can be analyzed using deep learning to identify anomalies in its spans. This capability empowers operators to quickly identify which componentsof a distributed system are faulty.  Planet/large-scale Distributed systems and cloud computing  Distributed traces, logs, events, alerts, metrics, ….  Big Data platforms with Kubernetes, Hadoop, Spark, Flink, …  Deep Learning, machine learning, data mining, advanced statistics, time-series analysis, NLP, ….  Detect, localize, predict, and remediate failures in infrastructures Dr. Jorge Cardoso is Chief Architect for Planet-scale AIOps at Huawei’s Munich and Ireland Research Centers. Previously he worked for several major companies such as SAP Research (Germany) on the Internet of Services and the Boeing Company in Seattle (USA) on Enterprise Application Integration. 
He previously gave lectures at the Karlsruhe Institute of Technology (Germany), University of Georgia (USA), University of Coimbra and University of Madeira (Portugal). His current research involves the development of the next generation of AIOps platforms, Cloud Operations and Analytics tools driven by AI, Cloud Reliability and Resilience, and High Performance Business Process Management systems. He has a Ph.D. in Computer Science from the University of Georgia (USA). GitHub | Slideshare.net | GoogleScholar Interests: AIOps, Service Reliability Engineering, Cloud Computing, Distributed Systems, Business Process Management
  • 3. ULTRA-SCALE AIOPS LAB 2 HUAWEI CLOUD Site Reliability Engineering (SRE) Reliability is an important feature of HAUWEI CLOUD. SRE is responsible for it …  50% software engineer / 50% system engineer  Load slows down systems  Elite forces, focusing on high-ROI work  Automation  Setup quantified SLO  Run Err Budget  Self-constraint, balanced Dev/Ops  Fact oriented  Four golden signals  Altering from time-series  Eliminating Toil  Implement monitoring and develop handling processes  Diagnosis, analysis, and detailed data  Define contextual, customer-focused SLOs  70% of outages due to changes in a live system  Addressing cascading failures  Take turns to monitor and document solutions  Distributed consensus for reliability  Regular review meetings  Humans add latency  Training through brain games  Perform regular drills in the environment  Guarantee that there are few problems  It's either a problem or a quick recovery  The recovery tool is designed in the early stage of the process  Continuous review and optimization … a few numbers…  worldwide, Huawei Cloud has 44 availability zones across 23 regions (June 2019)  more than 180 cloud services and 180 solutions for a wide range of sectors  Costumers include European Organization for Nuclear Research (CERN), PSA Group, Shenzhen Airport, Port of Tianjin, … SREHAUWIECLOUD
  • 4. AI for IT Operations (AIOps): Bringing AI to O&M
"Virtyt Koshi, the EMEA general manager for virtualization vendor Mavenir, reckons Google is able to run all of its data centers in the Asia-Pacific with only about 30 people, and that a telco would typically have about 3,000 employees to manage infrastructure across the same area."
"We began applying machine learning two years ago (2016) to operate our data centers more efficiently… over the past few months, DeepMind researchers began working with Google's data center team to significantly improve the system's utility. Using neural networks trained on different operating scenarios and parameters, we created a more efficient and adaptive framework to understand data center dynamics and optimize efficiency." Eric Schmidt, Dec. 2018
SRE / O&M Activities
 System monitoring and 24x7 technical support, Tier 1-3 support
 Troubleshooting and resolution of operational issues
 Backup, restoration, archival services
 Update, distribution and release of software
 Change, capacity, and configuration management
 …
Early indicators (Harvard Business Review)
 Moogsoft AIOps, Amazon EC2 Predictive Scaling, Azure VM resiliency, Amazon S3 Intelligent Tiering
 38.4% of organizations take more than 30 minutes to resolve IT incidents that impact consumer-facing services (PagerDuty)
https://www.lightreading.com/automation/google-has-intent-to-cut-humans-out-of-network/d/d-id/742158
https://www.lightreading.com/automation/automation-is-about-job-cuts---lr-poll/d/d-id/741989
https://www.lightreading.com/automation/the-zero-person-network-operations-center-is-here-(in-finland)/d/d-id/741695
  • 5. Intelligent Operation & Maintenance: Troubleshooting
Anomaly Detection
 Response Time Analysis
‒ A service started responding to requests more slowly than normal
‒ The change happened suddenly as a consequence of a regression in the latest deployment
 System Load
‒ The demand placed on the system, e.g., REST API requests per second, has increased since yesterday
 Error Analysis
‒ The rate of requests that fail, either explicitly (HTTP 5xx) or implicitly (HTTP 2xx with wrong content), is increasing slowly but steadily
 System Saturation
‒ The resources (e.g., memory, I/O, disk) used by key controller services are rapidly reaching threshold levels
Complex System
 OBS. Object Storage Service
 EVS. Elastic Volume Service (block storage)
 VPC. Virtual Private Cloud (private virtual networks)
 ECS. Elastic Cloud Server (scalable computing)
SRE / O&M Activities
 System monitoring and 24x7 / Tier 1-3 technical support
 Troubleshooting and resolution of operational issues
 Backup, restoration, archival services
 Update, distribution and release of software
 Change, capacity, and configuration management
 …
Troubleshooting tasks
 Anomaly detection. Determine what constitutes normal system behavior, and then discern departures from that normal behavior
 Fault diagnosis (root cause analysis). Identify dependency links that represent causal relationships to discover the true source of an anomaly
 Fault prediction. Use historical or streaming data to predict incidents with varying degrees of probability
 Fault recovery. Explore how decision support systems can manage and select recovery processes to repair failed systems
  • 6. Troubleshooting: Monitoring Data Sources
System components (e.g., OBS, EVS, VPC, ECS) are monitored and generate various types of data: logs, metrics, traces, events, topologies.
 Logs. Services, microservices, and applications generate logs, composed of timestamped records with a structure and free-form text, which are stored in system files.
    2017-01-18 15:54:00.467 32552 ERROR oslo_messaging.rpc.server [req-c0b38ace - default default] Exception during message handling
 Metrics. Examples of metrics include CPU load, memory available, and the response time of an HTTP request.
    {"tags": ["mem", "192.196.0.2", "AZ01"], "data": [2483, 2669, 2576, 2560, 2549, 2506, 2480, 2565, 3140, …, 2542, 2636, 2638, 2538, 2521, 2614, 2514, 2574, 2519]}
 Traces. A trace records the workflow and tasks executed in response to, e.g., an HTTP request.
    {"traceId": "72c53", "name": "get", "timestamp": 1529029301238, "id": "df332", "duration": 124957, "annotations": [{"key": "http.status_code", "value": "200"}, {"key": "http.url", "value": "https://v2/e5/servers/detail?limit=200"}, {"key": "protocol", "value": "HTTP"}], "endpoint": {"serviceName": "hss", "ipv4": "126.75.191.253"}}
 Events. Major milestones which occur within a data center can be exposed as events. Examples include alarms, service upgrades, and software releases.
    {"id": "dns_address_match", "timestamp": 1529029301238, …}
    {"id": "ping_packet_loss", "timestamp": 152902933452, …}
    {"id": "tcp_connection_time", "timestamp": 15290294516578, …}
    {"id": "cpu_usage_average", "timestamp": 1529023098976, …}
  • 7. Intelligent Log Analysis: Root Cause Localization using Application Logs
BACKGROUND & MOTIVATION | DESCRIPTION | INNOVATION | PATH FORWARD | ANTICIPATED IMPACT
Problem
 Complex systems generate high volumes of logs for manual analysis and troubleshooting. E.g., a single-node check using logs takes >15 min
 Analyzing logs manually is time consuming and error-prone
Intelligent Log Analysis
 Insight. Recent research has shown that it is possible to identify and model the underlying structure of application logs using ML [1, 2]
 Higher levels of visualization and abstraction are required (bird's-eye view, zoom in/out)
MAIN ACHIEVEMENT
Extracting valuable insights from massive, noisy and unstructured logs
 Pattern mining and NLP are used to abstract and summarize massive logs
 Helps O&M teams improve the efficiency of troubleshooting using logs
ASSUMPTIONS & LIMITATIONS
 Processing requires historical logs to learn normality
 No real-time processing. Only on-demand operation is available
HOW IT WORKS
 Performance is a challenge! Processing millions of log records per second is difficult
1. Fast log template mining using the Drain algorithm
2. Language-aware log parsing and keyword extraction using NLP approaches (www.spacy.io)
3. Time-series classification using a Poisson model (en.wikipedia.org/wiki/Poisson_distribution)
4. Grouping using the Pearson correlation coefficient (en.wikipedia.org/wiki/Pearson_correlation_coefficient)
5. Distance-aware correlation
Status
 Higher levels of visualization and abstraction are effective for operators to quickly troubleshoot systems using logs
Next steps
 Research other forms of visualization (e.g., Kernel Density Estimation, Network Diagrams, Correlation Matrices)
Technology Transition
 Evaluation of infrastructures to process ultra-scale logs (e.g., batch versus stream processing)
 Log analysis can be integrated into traditional log platforms
Potential End Users
 Service Owners, DevOps and SRE
Fig. (as-is) Complex services generate several GB of logs per second; existing log analysis services (e.g., ELK)
Fig. (to-be) The process of intelligent log analysis: template mining, then template analysis (visualization, anomaly detection); 10000 error messages → 100 templates → 30 events → 6 correlated events
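Step 4 of HOW IT WORKS (grouping by Pearson correlation) can be illustrated with a small sketch. It assumes that each log template has already been turned into a time series of occurrence counts per time window. The function name `correlated_pairs`, the 0.9 threshold and the sample counts are hypothetical, chosen for illustration only, and are not taken from the talk.

```python
import numpy as np

def correlated_pairs(counts, threshold=0.9):
    """Return pairs of templates whose count series are strongly correlated.

    counts: dict mapping a template name to its occurrences per time window
            (all series must have the same length and must not be constant).
    """
    names = list(counts)
    # np.corrcoef treats each row as one variable and returns the
    # matrix of Pearson correlation coefficients between rows.
    matrix = np.corrcoef([counts[name] for name in names])
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if matrix[i, j] >= threshold:
                pairs.append((names[i], names[j]))
    return pairs

# Hypothetical per-window counts for three log templates.
template_counts = {
    "disk full": [0, 1, 5, 9, 2],
    "write failed": [0, 2, 10, 18, 4],
    "login ok": [5, 4, 5, 4, 5],
}
groups = correlated_pairs(template_counts)
```

Templates that rise and fall together, such as "disk full" and "write failed" in this made-up data, end up in the same group, which is one plausible way to reduce many templates to a few correlated events.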
  • 8. Trace Analysis (TA): Structure-based Anomaly Detection and RCA
BACKGROUND & MOTIVATION | DESCRIPTION | INNOVATION | PATH FORWARD | ANTICIPATED IMPACT
Current limitations
 Existing solutions only provide trace visualization tools
 Analyzing traces manually is error-prone and not scalable
 While tracing is popular, only visualization tools have been developed for trace analysis
Apply recent ML and statistical approaches to process sequential data
 Explore the use of Deep Learning: Long Short-Term Memory (LSTM)
 Explore the use of attention networks
 Explore the use of association rules
TRL 4. Small-scale prototype. Basic technological components are integrated to establish that they will work together.
MAIN ACHIEVEMENT
Trace anomaly detection and root-cause analysis using trace structure
 Previous attempts using Deep Learning (LSTM, CNN), Machine Learning (OPTICS), and Sequence Analysis (LCS, multiple sequence alignment, Needleman-Wunsch) algorithms did not enable root-cause analysis
 Improve trace-based RCA by 90%
ASSUMPTIONS & LIMITATIONS
 Microservices are instrumented with tracing capabilities
 Tracing tracks microservice-to-microservice communication
HOW IT WORKS
1) Learning. For each service endpoint, learn the structure of the traces it generates
2) Modeling. Aggregate all the traces into a behavior model
3) Anomaly detection. When a new trace is generated, compare its structure with the behavior model. If it was not seen before, an anomaly exists
4) Root-cause analysis. When an anomaly is detected, determine in which span it occurred and identify the host, service, and function
Research Next Steps
 Which data can be injected into the traces' payload to significantly assist troubleshooting?
 New visualization tools beyond Gantt diagrams for O&M
Technology Transition
 Integration / alignment of tracing, logging and metrics pipelines
Potential End Users
 SRE and DevOps
 PaaS and customer cloud-native applications
 E2E: mobile apps, web clients, monoliths, etc.
Fig. Pipeline: trace events (e.g., Create VM) → trace structure → LSTM network → OK | ANOMALY
Fig. Learn trace structure → create model → anomaly detection → root-cause analysis (e.g., Host: 192.168.5.13, Service: nova-api, Function: schedule VM)
Fig. Elastic APM trace visualization (in blue) with structural anomalies (in red); red circles show structural anomalies (mockup)
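The four steps listed under HOW IT WORKS can be sketched with a deliberately simple behavior model. The sketch below stores every trace structure seen during learning in a prefix tree and reports the index of the first span that leaves the tree. This prefix tree is an assumption made for illustration; it is not the LSTM-based model described in the talk, and the function names and service sequences are hypothetical.

```python
END = "<end>"  # sentinel marking that a learned trace may stop here

def build_model(traces):
    """Steps 1 and 2: aggregate observed span sequences into a prefix tree."""
    root = {}
    for trace in traces:
        node = root
        for span in trace:
            node = node.setdefault(span, {})
        node[END] = {}
    return root

def first_anomalous_span(model, trace):
    """Steps 3 and 4: return the index of the first unexpected span.

    Returns None when the whole trace matches a learned structure, and
    len(trace) when the trace stops earlier than any learned structure.
    """
    node = model
    for index, span in enumerate(trace):
        if span not in node:
            return index
        node = node[span]
    return None if END in node else len(trace)

learned = build_model([
    ["keystone", "nova-api", "nova-scheduler", "nova-compute"],
    ["keystone", "nova-api", "keystone", "nova-scheduler", "nova-compute"],
])
normal = first_anomalous_span(
    learned, ["keystone", "nova-api", "nova-scheduler", "nova-compute"])
faulty = first_anomalous_span(
    learned, ["keystone", "nova-api", "glance-api", "nova-compute"])
```

Here `faulty` points at position 2 (the unexpected `glance-api` span); in a real system that span's metadata would give the host, service, and function to inspect. An exact-match model like this cannot generalize to unseen but valid structures, which is one motivation for a learned sequence model.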
  • 9. Troubleshooting Using Tracing Technologies
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/architecture-guide/chapter-1-components
Troubleshooting
1. Find the root cause of issues in requests across several services / hosts
2. Benchmark different systems and identify performance bottlenecks
3. Reconstruct the workflow of service-to-service communications
  • 10. Tracing Technologies: Systems and pseudo-standards
 Tracing Services
‒ Google Stackdriver (cloud.google.com/trace)
‒ Amazon AWS X-Ray (aws.amazon.com/xray)
‒ Lightstep (lightstep.com)
 Tracing Systems
‒ Twitter Zipkin (zipkin.io)
‒ Uber Jaeger
‒ OSProfiler
‒ Pivot Tracing (pivottracing.io)
 Tracing Standards
‒ opentelemetry.io
‒ OpenTracing
‒ OpenCensus
OpenTracing provides consistent, expressive, vendor-neutral APIs for popular platforms [1]
2019: The open source distributed tracing projects OpenCensus and OpenTracing were merged into a new project called OpenTelemetry
Fig. Amazon AWS X-Ray
Fig. Google Cloud Trace
Fig. OpenTracing
[1] http://opentracing.io/documentation/
[2] https://github.com/opentracing/specification/blob/master/specification.md
  • 11. Troubleshooting OpenStack: Traces and Workflows
Distributed Tracing
1. Provides developers with a detailed view of requests as they flow across a distributed system
2. Identifies which microservices are involved in processing a request
3. Traces the path of a request to generate a transaction
4. Localizes which microservice in a path (trace) is anomalous
  • 12. Distributed Traces: Abstractions
 Markov Chain
‒ A stochastic process is a Markov chain if the probability distribution of a state $X_i$ depends only on the previous state $X_{i-1}$ and not on any earlier state:
$$p(x_1, \ldots, x_n) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_{i-1} = x_{i-1}) = p(x_1) \prod_{i=2}^{n} a_{x_{i-1} x_i}$$
where $a_{x_{i-1} x_i}$ is the transition probability from state $x_{i-1}$ to state $x_i$
‒ K-th order Markov chain:
$$p(x_1, \ldots, x_n) = p(X_1 = x_1, \ldots, X_k = x_k) \prod_{i=k+1}^{n} p(X_i = x_i \mid X_{i-1} = x_{i-1}, X_{i-2} = x_{i-2}, \ldots, X_{i-k} = x_{i-k})$$
 Tree Structure
‒ Finite set of one or more nodes A, B, C, …
‒ Special nodes: the root node has no parent. Leaf nodes have no children.
‒ Remaining nodes are partitioned into n >= 0 disjoint sets T1, ..., Tn, where each set is also a tree
‒ Tree representation: (A (B (C, D), E))
 Sequence of Events
‒ Given a set $E = \{e_1, \ldots, e_n\}$ of event types, an event is a pair $(A, t)$, where $A \in E$ and $t \in \mathbb{N}$ is the occurrence time of the event
‒ An event sequence on $E$ is an ordered sequence of events:
$$s = \langle (A_1, t_1), (A_2, t_2), \ldots, (A_n, t_n) \rangle$$
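To make the first-order Markov chain abstraction concrete, the sketch below estimates the initial and transition probabilities from a handful of example sequences and then scores a sequence with the product of transition probabilities. The function names and the toy traces are hypothetical and only illustrate the formula; maximum-likelihood counting is assumed as the estimation method.

```python
from collections import Counter, defaultdict

def fit_markov_chain(sequences):
    """Estimate p(x_1) and the transition probabilities a[prev][next] by counting."""
    first_states = Counter(seq[0] for seq in sequences)
    transitions = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1
    p_init = {s: c / len(sequences) for s, c in first_states.items()}
    p_trans = {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in transitions.items()
    }
    return p_init, p_trans

def sequence_probability(seq, p_init, p_trans):
    """p(x_1, ..., x_n) = p(x_1) * product of a[x_{i-1}][x_i] for i = 2..n."""
    prob = p_init.get(seq[0], 0.0)
    for prev, nxt in zip(seq, seq[1:]):
        prob *= p_trans.get(prev, {}).get(nxt, 0.0)
    return prob

# Toy traces over span types A, B, C, D (illustrative only).
toy_traces = [["A", "B", "C"], ["A", "B", "C"], ["A", "B", "D"], ["A", "C", "D"]]
p_init, p_trans = fit_markov_chain(toy_traces)
p_common = sequence_probability(["A", "B", "C"], p_init, p_trans)  # 1.0 * 0.75 * 2/3
p_unseen = sequence_probability(["A", "D", "B"], p_init, p_trans)  # transition never observed
```

A sequence containing a transition that never occurred in the training data receives probability zero, which is the simplest way such a model can flag a trace as anomalous.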
  • 13. Distributed Trace Analysis: Sequence Analysis
 Sequence Prediction
‒ Predict elements of a sequence on the basis of the preceding elements
‒ Given $s_i, s_{i+1}, \ldots, s_j$, predict $s_{j+1}$, i.e., $s_i, s_{i+1}, \ldots, s_j \rightarrow s_{j+1}$
‒ Input: 1, 2, 3, 4, 5; output: 6
‒ Applications: weather forecast
 Sequence Classification
‒ Predict a class label for a given input sequence
‒ Given $s_i, s_{i+1}, \ldots, s_j$, determine if it is legitimate, i.e., $s_i, s_{i+1}, \ldots, s_j \rightarrow$ yes or no
‒ Input: 1, 2, 3, 4, 5; output: "normal" or "anomalous"
‒ Applications: trace anomaly detection, DNA sequence classification
 Sequence Generation
‒ Generate a new output sequence with the same general characteristics as the input
‒ Input: [1, 3, 5], [7, 9, 11]; output: [3, 5, 7]
‒ Applications: text generation
 Sequence to Sequence Prediction
‒ Predict an output sequence given an input sequence
‒ Input: 1, 2, 3, 4, 5; output: 6, 7, 8, 9, 10
‒ Applications: text summarization
Sequence learning: from recognition and prediction to sequential decision making, IEEE Intelligent Systems, 2001, http://www.sts.rpi.edu/~rsun/sun.expert01.pdf
  • 14. Distributed Trace Analysis: Sequence Analysis
1. User commands: image delete, create server, …
2. A distributed trace is generated
3. The distributed trace is encoded into a sequence
4. Sequence classification: identify invalid sequences
Fig. Services N1 → N2 → N3 → N4 (keystone → nova → glance → network), with branches labelled 30% and 70% and outcomes Warning Y, Warning A, Warning B, Error Z
Fig. Trace t1 as a sequence of spans, each encoded as a string of service-call codes:
auth_tokens_v3 1314421019193f1313162f253233191e341a3c3b402f2624192f214521162f163e2746282c192f2e2e24161d2c2c161628241635287a
images_v2 4342421119131616161a311316192216162f23213e1d2127272728241621191d5e5c471d132c4a16241616213c63
_var__detail_flavors_v2.1 5d13421314151316161b193b21242113364024192f162f131d21161621212c27282c2c2d162f5e5c132c16164d
image_schemas_v2 141319133113162f1644161b331e351a213b1613133b13192f5116132113131616252121282c2c162d21582e2e161d13643b132c4a78…
gather_result 1912141613252128211a1f402f221921162f2713282b2d2416131d13211a5a21
security-groups_v2.0 4c1319131d19131316132f
_var__images_v2 121313251b2f192f35211f241916223b1613231616242713462c53241654192c285a21
_var___var__action_servers_v2.1 61135d106c12131325191b1c193d281a4022136d13162f3b292116231d1627272741283b2416…f4a5121352f13
neutronclient.v2_0.client.retry_request 162f1319132f165124191316
Service call encoding: 13: {'v3', 'tokens', 'auth'}; 16: {'v2.0', 'ports'}; 51: {'v2.0', '_var_', 'ports'}=200; 24: {'nova.compute.api.API.get'}=0; …
Function call sequences: [{'v3'}=200, {'v3', 'tokens', 'auth'}=200, {'v...
Encoding matching between similar call sequences: 53.47%
Anomaly Detection from System Tracing Data using Multimodal Deep Learning, IEEE Cloud 2019, July 3-8, 2019, Milan, Italy.
  • 15. Distributed Trace Analysis: Why is trace analysis challenging?
Many "hidden" software patterns affect running systems: circuit breaker, concurrency and parallelism, priority queues, publisher-subscriber, bulkhead, etc.
1. Retry pattern
    n_tries = 3
    while True:
        try:
            ExecuteOperation()  # Success
            break
        except TimeoutException:
            n_tries -= 1
            if n_tries == 0:  # Give up
                raise Failure
            sleep(1)  # Wait 1 s before retrying the call
   Resulting traces:
    T1: A, B, C, D, E, …
    T2: A, B, C, C, D, E, …
    T3: A, B, C, C, C, D, E, …
    T4: A, B, C, C, C, Z
2. Cache-Aside pattern
    v = cache.get(k)
    if v is None:  # Check if the item is in the cache
        v = store.get(k)  # If not in cache, read from data store
        cache.put(k, v)  # Put a copy of the item into the cache
   Resulting traces:
    T1: A, B, C, D, E, …
    T2: A, C, D, E, …
3. One-to-many subcalls
    @app.route('/servers/<int:user_id>/<int:number>')
    def create(user_id, number):
        for i in range(number):
            authentication(user_id)
            # Call remote microservice to create a server
            rpc.create_server()
   Resulting traces:
    T1: A, B, C, D, E, …
    T2: A, B, C, C, D, E, …
    T3: A, B, C, C, C, D, E, …
    T4: A, B, C, C, C, C, D, E, …
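The effect of the retry pattern on trace structure can be reproduced with a small self-contained simulation. The sketch below is an illustration under stated assumptions: the span names A to E and Z follow the traces on the slide, while `TimeoutException`, `make_flaky`, `call_with_retry` and `traced_request` are hypothetical helpers written for this example.

```python
import time

class TimeoutException(Exception):
    """Raised by the simulated remote call when it times out."""

def make_flaky(failures):
    """Return an operation that times out `failures` times, then succeeds."""
    state = {"left": failures}
    def operation():
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutException()
        return "ok"
    return operation

def call_with_retry(operation, trace, n_tries=3, delay_s=0.0):
    """Call `operation` up to n_tries times, recording one span 'C' per attempt."""
    for attempt in range(n_tries):
        trace.append("C")
        try:
            return operation()
        except TimeoutException:
            if attempt == n_tries - 1:
                raise  # Give up after the last attempt
            time.sleep(delay_s)  # Wait before retrying the call

def traced_request(failures):
    """Simulate one request and return the sequence of spans it produces."""
    trace = ["A", "B"]
    try:
        call_with_retry(make_flaky(failures), trace)
        trace += ["D", "E"]
    except TimeoutException:
        trace.append("Z")  # Error-handling span
    return trace

traces = [traced_request(k) for k in range(4)]
```

The same request type yields four different but legitimate structures depending on how many timeouts occur, so a detector that treats every unseen sequence as an anomaly would raise false alarms unless all variants appear in its training data.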
  • 16. Distributed Trace Analysis: Using LSTMs
1. Model OpenStack behavior
    OPENSTACK_SEQUENCE = [
        ('keystone', 1.0, 'authenticate user with credentials and generate auth-token'),
        ('nova-api', 1.0, 'get user request and sends token to Keystone for validation'),
        ('keystone', 0.2, 'validate the token if not in cache'),
        ('nova-api', 1.0, 'starts VM creation process'),
        ('nova-database', 1.0, 'create initial database entry for new VM'),
        ('nova-scheduler', 1.0, 'locate an appropriate host using filters and weights'),
        ('nova-database', 1.0, 'execute query'),
        ('nova-compute', 1.0, 'dispatches request'),
        ('nova-conductor', 1.0, 'gets instance information'),
        ('nova-compute', 1.0, 'get image URI from image service'),
        ('glance-api', 1.0, 'get image metadata'),
        ('nova-compute', 1.0, 'setup network'),
        ('neutron-server', .5, 'allocate and configure network IP address'),
        ('nova-compute', 1.0, 'setup volume'),
        ('cinder-api', .75, 'attach volume to the instance or VM'),
        ('nova-compute', 1.0, 'generates data for the hypervisor driver')
    ]
   The second field of each tuple is the probability that the step is executed (slide annotation: "Cache simulation: probability of execution = 75%").
2. Generate synthetic traces
 Generate Synthetic Traces (generate_traces())
‒ Create a simple model to generate traces
‒ OPENSTACK_SEQUENCE models sequences of spans which are valid
  • e.g., ['keystone', 'nova-api', 'nova-api', 'nova-database', 'nova-scheduler', 'nova-database', 'nova-compute', 'nova-conductor', 'nova-compute', 'glance-api', 'nova-compute', 'nova-compute', 'cinder-api', 'nova-compute']
‒ Each microservice name is encoded into a number using a dictionary (create_coders())
 Trace Mutation (mutate_traces())
‒ Selected synthetic traces are mutated
‒ Spans can be deleted or replaced
3. Transform traces into sequences of integers
Train dataset
    trace_id  span_list
    0         [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
    1         [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
    2         [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
    3         [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
    4         [3, 5, 3, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
    ...       ...
    95        [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
    96        [3, 5, 3, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
    97        [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
    98        [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
    99        [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
    [100 rows x 1 columns]
Test dataset
    trace_id  span_list
    0         [3, 5, 5, 1, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
    1         [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
Trace #0 decoded: [0, ['keystone', 'nova-api', 'nova-api', 'nova-database', 'nova-scheduler', 'nova-database', 'nova-compute', 'nova-conductor', 'nova-compute', 'glance-api', 'nova-compute', 'nova-compute', 'cinder-api', 'nova-compute']]
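The generation, encoding and mutation steps can be sketched as follows. This is a minimal illustration, not the code from the repository referenced in the talk: the model is abbreviated to (service, probability) pairs, the function bodies are assumptions, and the integer codes will not match the ones shown in the tables above.

```python
import random

# Abbreviated behavior model: (service, probability that the step is executed).
MODEL = [
    ("keystone", 1.0), ("nova-api", 1.0), ("keystone", 0.2), ("nova-api", 1.0),
    ("nova-scheduler", 1.0), ("nova-compute", 1.0), ("neutron-server", 0.5),
    ("cinder-api", 0.75), ("nova-compute", 1.0),
]

def create_coders(model):
    """Map each service name to an integer; 0 is reserved for padding."""
    services = sorted({service for service, _ in model})
    encoder = {service: code for code, service in enumerate(services, start=1)}
    decoder = {code: service for service, code in encoder.items()}
    return encoder, decoder

def generate_trace(model, encoder, rng):
    """Walk the model once, keeping each step with its execution probability."""
    return [encoder[service] for service, prob in model if rng.random() < prob]

def mutate_trace(trace, n_symbols, rng):
    """Delete one span or replace it with a different span type."""
    mutated = list(trace)
    position = rng.randrange(len(mutated))
    if rng.random() < 0.5:
        del mutated[position]
    else:
        choices = [c for c in range(1, n_symbols + 1) if c != mutated[position]]
        mutated[position] = rng.choice(choices)
    return mutated

rng = random.Random(42)  # fixed seed so runs are repeatable
encoder, decoder = create_coders(MODEL)
train = [generate_trace(MODEL, encoder, rng) for _ in range(100)]
anomalous = mutate_trace(train[0], len(encoder), rng)
```

Steps with probability 1.0 always appear, so every generated trace has between 6 and 9 spans in this abbreviated model; the optional steps produce the structural variety that the detector must learn to accept as normal.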
  • 17. Distributed Trace Analysis: Using LSTMs
1. Padding
 Padding (_pad_traces())
‒ Pad each trace (sequence) with zeros (0), up to a pre-defined maximum length
‒ This makes it possible to pre-allocate a chain of LSTM units of a specific length
‒ Padding is done using pad_sequences(sequences, ...)
    X = pad_sequences(X, maxlen=self.trace_size, dtype=np.int16,
                      truncating='pre', padding='pre',
                      value=DEFAULTS['padding_symbol'])
‒ maxlen: int, maximum length of all sequences
‒ truncating='pre' removes values at the beginning of sequences longer than maxlen
‒ padding='pre' pads each trace at the beginning with a special integer (e.g., 0)
    [[0 0 3 ... 6 1 6]
     [0 0 3 ... 6 1 6]
     [0 3 5 ... 6 1 6]
     ...
     [0 0 3 ... 6 1 6]
     [0 3 5 ... 6 1 6]
     [0 3 5 ... 6 1 6]]
2. One-hot encoding
 One-hot encoding (_traces_to_binary())
‒ Represents categorical variables as binary vectors
‒ Each integer value is represented as a binary vector that has all zero values except the index of the integer, which is marked with a 1
    keras.utils.to_categorical(X, num_classes=self.max_n_span_types, dtype='int32')
‒ X.shape = (n_traces, self.trace_size)
‒ Returns shape = (n_traces, self.trace_size, self.max_n_span_types)
    [[[1. 0. 0. ... 0. 0. 0.]
      [1. 0. 0. ... 0. 0. 0.]
      [0. 0. 1. ... 0. 0. 0.]
      ... ] ]
Tensorflow tutorial: https://github.com/campdav/text-rnn-tensorflow
Keras tutorial: https://github.com/campdav/text-rnn-keras
Code available at: https://github.com/jorge-cardoso/aiops_practice (subdirectory: dta_lstm)
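For readers without Keras at hand, the two preprocessing steps can be reproduced with NumPy alone. The sketch below mimics what `pad_sequences` does with `padding='pre'` and `truncating='pre'`, and what one-hot encoding produces; `pad_traces` and `one_hot` are hypothetical helper names used for illustration, not the author's `_pad_traces()` and `_traces_to_binary()`.

```python
import numpy as np

def pad_traces(traces, maxlen, value=0):
    """Left-pad each trace with `value`; drop leading spans of traces longer than maxlen."""
    X = np.full((len(traces), maxlen), value, dtype=np.int16)
    for row, trace in enumerate(traces):
        kept = trace[-maxlen:]  # keep the last maxlen spans (truncating='pre')
        if kept:
            X[row, maxlen - len(kept):] = kept  # right-align (padding='pre')
    return X

def one_hot(X, num_classes):
    """Turn an integer array of shape (n, t) into a binary array of shape (n, t, num_classes)."""
    # Row k of the identity matrix is the one-hot vector for integer k.
    return np.eye(num_classes, dtype=np.float32)[X]

X = pad_traces([[3, 5, 6], [3, 5, 5, 8, 9]], maxlen=4)
X_bin = one_hot(X, num_classes=10)
```

With these inputs, `X` is `[[0, 3, 5, 6], [5, 5, 8, 9]]`: the short trace gains a leading padding symbol and the long trace loses its first span.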
  • 18. Distributed Trace Analysis: Using LSTMs
1. Generate y by shifting X
 Generate y
‒ Shift traces by one position to the left
‒ Fill the new cell with DEFAULTS['padding_symbol']
    # X.shape[0] is n_traces
    # X.shape[1] is self.trace_size
    y = np.roll(X, -1, axis=1)
    # Pad the y array where the X array has zeros
    y[X == 0] = DEFAULTS['padding_symbol']
    # Write DEFAULTS['padding_symbol'] at the last position of the shifted array
    y[:, -1] = DEFAULTS['padding_symbol']
Example (shift traces left):
    X = 0 3 5 2 1 7
    y = 0 5 2 1 7 0
X
    [[0 0 3 ... 6 1 6]
     [0 0 3 ... 6 1 6]
     [0 3 5 ... 6 1 6]
     ...
     [0 0 3 ... 6 1 6]
     [0 3 5 ... 6 1 6]
     [0 3 5 ... 6 1 6]]
y
    [[0 0 5 ... 1 6 0]
     [0 0 5 ... 1 6 0]
     [0 5 5 ... 1 6 0]
     ...
     [0 0 5 ... 1 6 0]
     [0 5 5 ... 1 6 0]
     [0 5 5 ... 1 6 0]]
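The shift can be checked on the small example from this slide. The following is a self-contained version of the same three NumPy operations, with `PADDING_SYMBOL` standing in for `DEFAULTS['padding_symbol']` (assumed here to be 0, as in the padded arrays shown).

```python
import numpy as np

PADDING_SYMBOL = 0  # stands in for DEFAULTS['padding_symbol']

def shift_traces_left(X):
    """Build the target array y: each position holds the next span of the trace."""
    y = np.roll(X, -1, axis=1)             # move every span one position to the left
    y[X == PADDING_SYMBOL] = PADDING_SYMBOL  # padded input positions predict padding
    y[:, -1] = PADDING_SYMBOL              # np.roll wraps around; overwrite the wrapped value
    return y

X = np.array([[0, 3, 5, 2, 1, 7]])
y = shift_traces_left(X)
```

Note that the first real span (3) never appears in `y`: the position before it is padding in `X`, so its target is forced to the padding symbol.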
  • 19. Distributed Trace Analysis: Using LSTMs
1. Create DL model
 Create DL model (_build_model())
‒ Use an LSTM network
‒ Dropout of 0.2 (high, but necessary to avoid quick divergence)
‒ Softmax activation to generate probabilities over the different categories
‒ Because it is a classification problem, the loss is computed with categorical cross-entropy, which compares the output probabilities against the one-hot encoding
‒ ADAM optimizer (instead of the classical stochastic gradient descent) to update weights
    model = Sequential()
    model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True,
                   input_shape=(self.trace_size, input_dim)))
    model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
    model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
    # A Dense layer is used as the output for the network.
    model.add(TimeDistributed(Dense(input_dim, activation='softmax')))
    if self.gpus > 1:
        model = keras.utils.multi_gpu_model(model, gpus=self.gpus)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  • 20. Distributed Trace Analysis: Using LSTMs
1. Train the model
 Train the model (fit())
‒ X = obtain the traces composed of spans
‒ Pad the traces of X
‒ y = shift the spans of the traces left
‒ Apply one-hot encoding to X and y
‒ Build the model
‒ Fit the model f(X) = y
    # X_train has 2 columns: traceId, span_list. Here we select the span_list
    X_train = X_train.loc[:, DEFAULTS['span_list_col_name']].values
    X = self._pad_traces(X_train)
    y = self._shift_traces_left(X)
    self._get_lstm_parameters(X)
    X = self._traces_to_binary(X)
    y = self._traces_to_binary(y)
    self.model = self._build_model(input_dim=X.shape[2])
    self.model.fit(X, y, epochs=self.epochs, batch_size=self.batch_size,
                   shuffle=True, validation_split=self.validation_split)
  • 21. Distributed Trace Analysis: Using LSTMs
1. Test unseen traces
 Test traces (predict())
‒ X = obtain the traces composed of spans
‒ Pad the traces of X
‒ Apply one-hot encoding to X
‒ Calculate: yhat = f(X)
‒ Identify in which spans the anomalies occur
    X_test = X_test.loc[:, DEFAULTS['span_list_col_name']].values
    X_test = self._pad_traces(X_test)
    X_test_bin = self._traces_to_binary(X_test)
    yhat = self.model.predict(X_test_bin)
    self._identify_anomalies(X_test, yhat, prob=self.threshold)
    return self.rank_prob
Test traces X
    [[0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6]
     [0 0 3 5 5 8 9 8 6 7 6 2 6 4 6 6]]
Output
    X[0]: [0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6] = [[0, 0.2625601], [1, 0.38167596], [0, 0.75843567], [0, 0.8912414], [9, 2.394792e-05], [0, 0.94694203], [0, 0.9656691], [0, 0.9686248], [0, 0.9666971], [0, 0.95213455], [0, 0.94981205], [0, 0.93016535], [0, 0.60129535], [0, 0.5417027], [0, 0.6876041], [0, 0.82936114]]
    X[1]: [0 0 3 5 5 8 9 8 6 7 6 2 6 4 6 6] = [[0, 0.2625601], [1, 0.38167596], [0, 0.75843567], [0, 0.8912414], [0, 0.9302344], [0, 0.94507533], [0, 0.9638574], [0, 0.9665326], [0, 0.9627945], [0, 0.9435249], [0, 0.9405963], [0, 0.92075515], [1, 0.3207201], [1, 0.46595544], [0, 0.64980185], [0, 0.8390282]]
    trace_id 0 indices [4]
    trace_id 1 indices []
  • 22. Distributed Trace Analysis: Using LSTMs
 Identifying Anomalous Trace Sequences (_identify_anomalies())
‒ Mark anomalies based on the difference between y and yhat
‒ y is the shifted version of X
‒ yhat = f(X)
‒ If the shift cannot be predicted correctly, the trace did not occur before and is marked as anomalous
    y = DTA_LSTM._shift_traces_left(X)
    for i in range(len(X)):
        rp = self._compare_traces(y[i], yhat[i])
        self.rank_prob.append(rp)
        if self.explain:
            print('X[{}]: {} = {}'.format(i, X[i], rp))
    return self.rank_prob
Example with an error at index [4]
    Input X         = [0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6]
    Shifted_X       = [0 0 5 5 1 9 8 6 7 6 2 6 6 1 6 0]
    Prediction yhat = [0 0 5 5 12 9 8 6 7 6 2 6 6 1 6 0]
    X[0]: [0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6] = [[0, 0.2625601], [1, 0.38167596], [0, 0.75843567], [0, 0.8912414], [9, 2.394792e-05], [0, 0.94694203], [0, 0.9656691], [0, 0.9686248], [0, 0.9666971], [0, 0.95213455], [0, 0.94981205], [0, 0.93016535], [0, 0.60129535], [0, 0.5417027], [0, 0.6876041], [0, 0.82936114]]
Example with a correct prediction
    X[i]: Input:        [ 0 0 0 0 0 0 0 0 0 0 0 10 10 17 5 5 7 7 6 6]
    y[i]: Output:       [ 0 0 0 0 0 0 0 0 0 0 0 10 17 5 5 7 7 6 6 0]
    yhat[i]: Predicted: [ 0 0 0 0 0 0 0 0 0 0 0 10 17 5 5 7 7 6 6 0]
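One plausible reading of the [rank, probability] pairs printed above is that, for every position, they give the rank and the predicted probability of the span that actually came next. The sketch below implements that reading; it is an assumption about what `_compare_traces()` computes, and the function names, the 0.01 threshold and the toy probabilities are hypothetical.

```python
import numpy as np

def rank_and_probability(y_true, yhat):
    """For each position, rank (0 = most likely) and probability of the actual next span.

    y_true: integer array of shape (trace_size,), the shifted trace.
    yhat:   array of shape (trace_size, n_span_types) with softmax outputs.
    """
    pairs = []
    for expected, distribution in zip(y_true, yhat):
        prob = float(distribution[expected])
        rank = int(np.sum(distribution > prob))  # span types the model preferred
        pairs.append((rank, prob))
    return pairs

def anomalous_indices(y_true, yhat, threshold=0.01):
    """Positions where the model gave the actual next span a probability below threshold."""
    pairs = rank_and_probability(y_true, yhat)
    return [index for index, (_, prob) in enumerate(pairs) if prob < threshold]

# Toy example: 3 positions, 4 span types; the second position is unexpected.
y_true = np.array([1, 2, 3])
yhat = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.05, 0.90, 0.001, 0.049],
    [0.10, 0.10, 0.10, 0.70],
])
pairs = rank_and_probability(y_true, yhat)
flagged = anomalous_indices(y_true, yhat)
```

Thresholding the probability rather than the rank matches the output shown on the slides, where a position with rank 1 and probability 0.38 is not reported while the position with probability 2.39e-05 is.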
  • 23. Related Research: Technical University of Berlin
Single and Bi-Models
 Use of single and bi-models to capture traces as sequences of events and their response time
 Use a sequential model representation based on long short-term memory (LSTM) networks
Idea: represent distributed traces as sequences of events and their response time
 Trace structure (sequence of events) and response time (duration), e.g., Service 11 → Service 21; duration = 12ms
S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection from System Tracing Data using Multimodal Deep Learning, IEEE Cloud 2019, July 3-8, 2019, Milan, Italy.
  • 24. Related Research: Facebook
Measure end-to-end performance of requests
 Infer causal relationships from logs
‒ Adding instrumentation retroactively is an expensive task
‒ Hypothesize and confirm relationships in messages
 Log Data
‒ Request and host id, timestamp, unique event label
 Finding Relationships
‒ Sample requests, store logs in Hive and run Hadoop jobs to infer causal relationships
‒ A 2h Hadoop job analyzes 1.3M requests sampled over 30 days
‒ Assume a hypothesized relationship between two segments until a counterexample is found
‒ 3 types of relationship inferred: happens-before, mutual exclusion, pipeline
‒ As the number of traces analyzed increases, the observation of new counterexamples diminishes, leaving behind only true relationships
 Applications
‒ Find critical path and slack segments for performance optimization
‒ Anomaly detection: 1) select the top 5% of end-to-end latency, 2) identify segments with proportionally greater representation in the outlier set of requests than in the non-outlier set
https://blog.acolyer.org/2015/10/07/the-mystery-machine-end-to-end-performance-analysis-of-large-scale-internet-services/
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chow.pdf
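The "hypothesize until a counterexample is found" idea can be sketched for the happens-before relationship alone. The example below is a simplified illustration, not the implementation from the cited paper: it ignores mutual exclusion and pipeline relationships, runs in memory rather than on Hive and Hadoop, and the function name and sample timestamps are hypothetical.

```python
from itertools import permutations

def infer_happens_before(traces):
    """Keep every ordered pair (a, b) for which a preceded b in all traces seen.

    traces: list of dicts mapping a segment label to its timestamp in one request.
    """
    labels = set().union(*traces)
    # Start by hypothesizing that every segment happens before every other one.
    hypotheses = set(permutations(labels, 2))
    for trace in traces:
        for a, b in list(hypotheses):
            # A single request in which a did not precede b is a counterexample.
            if a in trace and b in trace and trace[a] >= trace[b]:
                hypotheses.discard((a, b))
    return hypotheses

# Two hypothetical requests: B and C run in either order, A always comes first.
requests = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 3, "C": 2},
]
relations = infer_happens_before(requests)
```

After the two requests only (A, B) and (A, C) survive; both orderings of B and C have been refuted. With more requests, fewer new counterexamples appear and the surviving set stabilizes.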
  • 25. Copyright©2019 Huawei Technologies Co., Ltd. All Rights Reserved. The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice. Bring digital to every person, home and organization for a fully connected, intelligent world. Thank you.