Distributed Trace & Log Analysis using ML

Distributed Trace & Log Analysis using ML
Prof. Jorge Cardoso
E-mail: jorge.cardoso@huawei.com
Chief Architect for AIOps
Munich & Dublin Research Center
2020.11.04
Definition: AIOps platforms utilize big data, modern machine learning and other advanced analytics
technologies to directly and indirectly enhance IT operations (monitoring, automation and service
desk) functions with proactive, personal and dynamic insight.
Gartner
Huawei SRE Summit: Running Planet Scale Systems
Dublin, 2-4 November 2020

ULTRA-SCALE AIOPS LAB 1
Intelligent Cloud Operations
The emerging field of AIOps
The field of AIOps, also known as Artificial Intelligence for IT Operations, uses advanced technologies to dramatically improve
the monitoring, operation, and troubleshooting of distributed systems. Its main premise is that operations can be automated using
monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how
AIOps – and many related fields such as deep learning, machine learning, distributed traces, graph analysis, time-series
analysis, sequence analysis, advanced statistics, NLP and log analysis – can be explored to effectively detect, localize,
predict, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs) by analyzing service management
data (e.g., distributed traces, logs, events, alerts, metrics). In particular, this talk will describe how a particular monitoring data
structure, called distributed traces, can be analyzed using deep learning to identify anomalies in its spans. This capability
empowers operators to quickly identify which componentsof a distributed system are faulty.
 Planet/large-scale Distributed systems and cloud computing
 Distributed traces, logs, events, alerts, metrics, ….
 Big Data platforms with Kubernetes, Hadoop, Spark, Flink, …
 Deep Learning, machine learning, data mining, advanced statistics, time-series analysis, NLP, ….
 Detect, localize, predict, and remediate failures in infrastructures
Dr. Jorge Cardoso is Chief Architect for Planet-scale AIOps at Huawei’s Munich and Ireland Research Centers. Previously he
worked for several major companies such as SAP Research (Germany) on the Internet of Services and the Boeing Company in
Seattle (USA) on Enterprise Application Integration. He previously gave lectures at the Karlsruhe Institute of Technology
(Germany), University of Georgia (USA), University of Coimbra and University of Madeira (Portugal). His current research
involves the development of the next generation of AIOps platforms, Cloud Operations and Analytics tools driven by AI, Cloud
Reliability and Resilience, and High Performance Business Process Management systems. He has a Ph.D. in Computer
Science from the University of Georgia (USA).
GitHub | Slideshare.net | GoogleScholar
Interests: AIOps, Service Reliability Engineering, Cloud Computing, Distributed Systems, Business Process Management

HUAWEI CLOUD
Site Reliability Engineering (SRE)
Reliability is an important feature of HAUWEI CLOUD. SRE is responsible for it …
 50% software engineer /
50% system engineer
 Load slows down systems
 Elite forces, focusing on
high-ROI work
 Automation
 Setup quantified SLO
 Run Err Budget
 Self-constraint, balanced
Dev/Ops
 Fact oriented
 Four golden signals
 Altering from time-series
 Eliminating Toil
 Implement monitoring and
develop handling
processes
 Diagnosis, analysis, and
detailed data
 Define contextual,
customer-focused SLOs
 70% of outages due to
changes in a live system
 Addressing cascading
failures
 Take turns to monitor and
document solutions
 Distributed consensus for
reliability
 Regular review meetings
 Humans add latency
 Training through brain
games
 Perform regular drills in
the environment
 Guarantee that there are
few problems
 It's either a problem or a
quick recovery
 The recovery tool is
designed in the early
stage of the process
 Continuous review and
optimization
… a few numbers…
 worldwide, Huawei Cloud has 44
availability zones across 23 regions (June
2019)
 more than 180 cloud services and 180
solutions for a wide range of sectors
 Costumers include European Organization
for Nuclear Research (CERN), PSA
Group, Shenzhen Airport, Port of Tianjin,
…
SREHAUWIECLOUD

AI for IT Operations (AIOps)
Bringing AI to O&M
“Virtyt Koshi, the EMEA general manager for virtualization vendor Mavenir, reckons Google is able
to run all of its data centers in the Asia-Pacific with onlyabout 30 people, and that a telco would
typically have about 3,000 employees to manage infrastructure across the same area."
“We began applying machine learning two years ago (2016) to operate
our data centers more efficiently… over the past few months,
DeepMind researchers began working with Google’s data center team
to significantly improve the system’s utility. Using neural networks
trained on different operating scenarios and parameters, we created a
more efficient and adaptive framework to understand data center
dynamics and optimize efficiency.” Eric Schmidt, Dec. 2018
SRE / O&M Activities
 System monitoring and 24x7 technical support, Tier 1-3 support
 Troubleshooting and resolution of operational issues
 Backup, restoration, archival services
 Update, distribution and release of software
 Change, capacity, and configuration management
 …
Harvard Business Review
Early indicators
Moogsoft AIOps,
Amazon EC2
Predictive Scaling,
Azure VM resiliency,
Amazon S3 Intelligent
Tiering
38.4% of organizations take
more than 30 minutes to
resolve IT incidents that
impact consumer-facing
services (PagerDuty)
https://www.lightreading.com/automation/google-has-intent-to-cut-humans-out-of-network/d/d-id/742158
https://www.lightreading.com/automation/automation-is-about-job-cuts---lr-poll/d/d-id/741989
https://www.lightreading.com/automation/the-zero-person-network-operations-center-is-here-(in-finland)/d/d-id/741695

Intelligent Operation & Maintenance
Troubleshooting
Anomaly Detection
 Response Time Analysis
‒ A service started responding to requests more slowly than
normal
‒ The change happened suddenly as a consequence of
regressionin the latest deployment
 System Load
‒ The demand placed on the system, e.g., REST API requests
per second, increase since yesterday
 Error Analysis
‒ The rate of requests that fail -- either explicitly (HTTP 5xx) or
implicitly (HTTP 2xx with wrong content) -- is increasingslowly,
but steadily
 System Saturation
‒ The resources (e.g., memory, I/O, disk) used by key controller
services is rapidly reaching threshold levels
Complex System
 OBS. Object Storage Service
 EVS. Elastic Volume Service (block storage)
 VPC. Virtual Private (private virtual networks)
 ECS. Elastic Cloud Server (scalable computing)
SRE / O&M Activities
 System monitoring and 24x7/Tier 1-3 technical support
 Troubleshooting and resolution of operationalissues
 Backup, restoration, archival services
 Update, distribution and release of software
 Change, capacity, and configuration management
 …
Troubleshooting tasks
 Anomaly detection.Determine what constitutes normal
system behavior, and then to discern departures from that
normal system behavior
 Fault diagnosis (root cause analysis). Identify links of
dependencythat represent causal relationships to discover
the true source of an anomaly
 Fault Prediction. Use of historical or streaming data to
predict incidents with varying degrees of probability
 Fault recovery.Explore how decision support systems can
manage and select recovery processes to repair failed
systems

Troubleshooting
Monitoring Data Sources
Logs. Service, microservices, and applications
generate logs, composed of timestamped
records with a structure and free-form text,
which are stored in system files.
Metrics. Examples of metrics include CPU
load, memory available, and the response
time of a HTTP request.
Traces. Traces records the workflow and tasks
executed in response to, e.g., an HTTP
request.
Events. Major milestones which occur within a
data center can be exposed as events.
Examples include alarms, service upgrades,
and software releases.
{"traceId": "72c53", "name": "get", "timestamp": 1529029301238, "id": "df332",
"duration": 124957, “annotations": [{"key": "http.status_code", "value": "200"}, {"key":
"http.url", "value": "https://v2/e5/servers/detail?limit=200"}, {"key": "protocol", "value":
"HTTP"}, "endpoint": {"serviceName": "hss", "ipv4": "126.75.191.253"}]
{“tags": [“mem”, “192.196.0.2”, “AZ01”], “data”: [2483, 2669, 2576, 2560, 2549, 2506,
2480, 2565, 3140, …, 2542, 2636, 2638, 2538, 2521, 2614, 2514, 2574, 2519]}
{"id": "dns_address_match“, "timestamp": 1529029301238, …}
{"id": "ping_packet_loss“, "timestamp": 152902933452, …}
{"id": "tcp_connection_time“, "timestamp": 15290294516578, …}
{"id": "cpu_usage_average“, "timestamp": 1529023098976, …}
2017-01-18 15:54:00.467 32552ERROR oslo_messaging.rpc.server[req-c0b38ace -
default default] Exception during message handling
System’s Components (e.g., OBS, EVS, VPC, ECS) are monitored and generate various types of data:
Logs, Metrics, Traces, Events, Topologies

INNOVATION PATH FORWARD
BACKGROUND & MOTIVATION DESCRIPTION ANTICIPATED IMPACT
Intelligent Log Analysis
Root Cause Localization using Application Logs
Problem
 Complex system generates high volumes of logs
for manual analysis/troubleshooting. E.g., a single-
node check using logs takes >15 min
Analyzing logs manuallyis time
consuming and error-prone
IntelligentLog Analysis
 Insight. Recent research has shown that it is
possible to identify and model the underlying
structure of application logs using ML [1, 2] Higher levels of visualization and abstraction are required (Bird's-eye
view, zoom in/out)
MAIN ACHIEVEMENT
Extracting valuableinsights form massive, noisyand
unstructured logs
 Pattern mining and NLP are used to abstract and summarize
massive logs
 Help O&M teams improving efficiency of troubleshooting using
logs
ASSUMPTIONS & LIMITATIONS
 Processing requires historical logs to learn normality
 No real-time processing. Only on-demand operation is available
Status
 Higher levels of visualization and abstraction are
effective for operators to quickly troubleshoot
system using logs
Next steps
 Research other forms of visualization (e.g., Kernel
Density Estimation, Network Diagrams,
Correlation Matrices)
TechnologyTransition
 Evaluation of infrastructures to process ultra-scale
logs (e.g., batch versus stream processing)
Potential End Users
 Service Owners, DevOps and SRE
Log Analysis
Log analysis can be integrated into traditional log
platforms
HOW IT WORKS
 Performance is a challenge! Processing million
log records per second is difficult
1. Fast log template mining using
Drain algorithm
2. Language-aware log parsing
and keyword extraction using
NLP approaches (www.spacy.io)
3. Time-series classification using
Poisson model
(en.wikipedia.org/wiki/Poisson_d
istribution)
4. Grouping using Pearson
algorithm
(en.wikipedia.org/wiki/Pearson_c
orrelation_coefficient)
5. Distance-aware correlation
Fig.
Existing Log
Analysis services
(e.g., ELK)
10000 error messages
100 templates
30 events
6 eventscorrelated
Fig.
Complex
services
generates
several GB
of logs per
second
as-is
Template mining Template analysis
 Visualization
 Anomaly detection
Fig. The process of intelligent log analysis
to-be

INNOVATION PATH FORWARD
BACKGROUND & MOTIVATION DESCRIPTION ANTICIPATED IMPACT
Trace Analysis (TA)
Structure-based Anomaly Detection and RCA
Current limitations
 Existing solutions only provides trace visualization tools
 Analyze traces manually: error-prone and not scalable
While popular, onlyvisualization tools
have been developed for trace analysis
Apply recentML and statistical
approachesto process sequential data
 Explore the use
of Deep
Learning: Long
Short Term
Memory (LSTM)
 Explore the use
of attention
networks
 Explore the use
of association-
rules TRL 4. Small scaleprototype.Basic technological components are
integrated to establish thatthey will work together.
MAIN ACHIEVEMENT
Trace anomalydetection and root-causeanalysis using
trace structure
 Previous tentative using Deep Learning (LSTM, CNN), Machine
Learning (Optiks), Sequence Analysis (LCS, Multiple sequence
alignment, Needleman-Wunsch) algorithms did not enable root
cause analysis
ASSUMPTIONS & LIMITATIONS
 Microservices are instrumented with tracing capabilities
 Tracing tracks microservice to microservice communication
Improve trace-based RCA in 90%
HOW IT WORKS
1) Learning. For each service
endpoint, learn the traces’
structure it generates
2) Modeling. Aggregate all the
traces into a behavior model
3) Anomaly detection. When a
new trace is generate, compare
its structure with the behavior
model. If it was not seen
before, an anomaly exists
4) Root-cause analysis. When an
anomaly is detected, determine
in which span it occurred and
identify host, service, function
Trace Events
OK | ANOMALY
Create VM
LSTM Network
Trace Structure
Research Next Steps
 Which data to inject into traces’ payload which
can significantly assist troubleshooting?
 New visualization tools beyond Gantt diagrams
for O&M
Technology Transition
 Integration / alignment tracing, logging and
metrics pipelines
Potential End Users
 SRE and DevOps
 PaaS and customer cloud native applications
E2E: Mobile apps, web clients, monoliths, etc.
Fig.
Red circles show structural anomalies (mockup)
Create
Model
Root-cause
Analysis
Anomaly
Detection
Host: 192.168.5.13
Service: nova-api
Function: schedule VM
Fig.
Elastic APM
Trace
vitalization
(in blue)
Structural
anomalies
(in red)
Learn traces structure

Troubleshooting
Using Tracing Technolgies
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/architecture-guide/chapter-1-components
Troubleshooting
1. Find root cause issues in requests across several services / hosts
2. Benchmark different systems and identify performance bottlenecks
3. Reconstruct the workflow of service-to-service communications

Tracing Technologies
Systems and pseudo-standards
Fig. Amazon AWS X-Ray
Fig. Google Cloud Trace
Fig. OpenTracing
 Tracing Services
‒ Google Stackdriver(cloud.google.com/trace)
‒ Amazon AWS X-Ray (aws.amazon.com/xray)
‒ Lightstep (lightstep.com)
 Tracing Systems
‒ Twitter Zipkin (zipkin.io)
‒ Uber Jaeger
‒ OSProfiler
‒ Pivot Tracing (pivottracing.io)
 Tracing Standards
‒ opentelemetry.io
‒ OpenTracing
‒ OpenCensus
[1] http://opentracing.io/documentation/
[2] https://github.com/opentracing/specification/blob/master/specification.md
OpenTracing provides a consistent, expressive, vendor-neutral
APIs for popular platforms [1]
2019: The open source
distributed tracing projects
OpenCensus and OpenTracing
were merged into anew project
called OpenTelemetry

Troubleshooting Openstack
Traces and Workflows
Distributed Tracing
1. Provides developers a detailed view of requests as they flow across a
distributed system
2. Identifies which microservices areinvolved in processinga request
3. Trace the path of a request to generate a transaction
4. Localize which microservice in a path (trace) is anomalous

Distributed Traces
Abstractions
     
 
n
ki
kikiiiiiiikkn xXxXxXxXpxXxXpxxp ,...,,|,...,... 2211111
 Markov Chain
‒ A stochastic process is a Markov chain if:
‒ The probability distribution of a state Xi depends on the previous state
Xi-1 and does not depend on the previous states
‒ K-th order Markov Chain:
 Tree Structure
‒ Finite set of one or more nodes A, B, C, …
‒ Special nodes: the root node has no parent. Leaf nodes have children.
‒ Remaining nodes are partitioned into n >= 0 disjoint sets T1, ..., Tn,
where each set is also a tree
 Sequence of Events
‒ Given a set E = {e1,...,en} of event types, an event is a pair (A, t), where
A ∈ E and t ∈ N is the occurrence time of the event
‒ An event sequences on E is an ordered sequence of events:
‒ Tree representation: (A (B (C, D), E)
       
 

n
i
xx
n
i
iiiin ii
axpxXxXpxXpxxp
2
1
2
11111 1
)(|...
Markov
Chain
Tree
Sequence
),(),...,,(),,( 2211 nn tAtAtAs 

Distributed Trace Analysis
Sequence Analysis
 Sequence Prediction
‒ Predict elements of a sequence on the basis of the preceding elements
‒ Given si, si+1,…, sj, predict sj+1, si,si+1, i.e., sj →sj+1
‒ Input: 1, 2, 3, 4, 5; output: 6
‒ Applications: weather forecast
 Sequence Classification
‒ Predict a class label for a given input sequence.
‒ Given si,si+1,…,sj, determine if it is legitimate, i.e., si,si+1,…,sj →yes or no
‒ Input: 1, 2, 3, 4, 5; output: “normal” or “anomalous”
‒ Applications: trace anomaly detection, DNA sequence classification
 Sequence Generation
‒ Generate new output sequence with same general characteristics as the input
‒ Input: [1, 3, 5], [7, 9, 11]; output: [3, 5 ,7]
‒ Applications: text generation
 Sequence to Sequence Prediction
‒ Predict an output sequence given an input sequence.
‒ Input: 1, 2, 3, 4, 5; output: 6, 7, 8, 9, 10
‒ Applications: text summarization
1, 2, 3, 4, 5 ► 6
1, 2, 3, 4, 5 ► NORMAL
1, 3, 5 and 7, 9, 11 ► 3, 5 ,7
1, 2, 3, 4, 5 ► 6, 7, 8, 9, 10
Sequence learning: from recognition and prediction to sequential decision making, IEEE Intelligent Systems, 2001,
http://www.sts.rpi.edu/~rsun/sun.expert01.pdf

Sequence Analysis
1. User commands: image
delete, create server, …
N1 N2
N
3
N
4
keystone -> nova -> glance ->
network
Services
30
%
70
%
Warning Y
Warning A
Warning B
Error Z
a1 62 e2 28
Trace t1 Sequence of spans
auth_tokens_v3 1314421019193f1313162f253233191e341a3c3b402f2624192f214521162f163e2746282c192f2e2e24161d2c2c161628241635287a
images_v2 4342421119131616161a311316192216162f23213e1d2127272728241621191d5e5c471d132c4a16241616213c63
_var__detail_flavors_v2.1 5d13421314151316161b193b21242113364024192f162f131d21161621212c27282c2c2d162f5e5c132c16164d
image_schemas_v2 141319133113162f1644161b331e351a213b1613133b13192f5116132113131616252121282c2c162d21582e2e161d13643b132c4a78…
gather_result 1912141613252128211a1f402f221921162f2713282b2d2416131d13211a5a21
security-groups_v2.0 4c1319131d19131316132f
_var__images_v2 121313251b2f192f35211f241916223b1613231616242713462c53241654192c285a21
_var___var__action_servers_v2.1 61135d106c12131325191b1c193d281a4022136d13162f3b292116231d1627272741283b2416…f4a5121352f13
neutronclient.v2_0.client.retry_request 162f1319132f165124191316
4. SequenceClassification: identify invalid sequences
Service call encoding
16: {'v2.0', 'ports'}
51 : {'v2.0', '_var_', 'ports'}=200
24 : {'nova.compute.api.API.get'}=0
…
Service call encoding
13: {'v3', 'tokens', 'auth‘}
Encoding matching
53.47%
2. A distributed trace is
generated
3. The distributed trace is
encoded into a sequence
Function call sequences
[{'v3'}=200, {'v3', 'tokens', 'auth'}=200, {'v...
Function call sequences
Similar call sequence
Anomaly Detection from System Tracing Data using Multimodal Deep Learning
IEEE Cloud 2019, July 3-8, 2019, Milan, Italy.

Why is trace analysis challenging?
2. Cache-Aside pattern
1. Retry pattern n_tries = 3
while True:
try:
ExecuteOperation() # Success
break
except TimeoutException:
if n_tries == 0: # give up
raise Failure
else:
sleep(1000) # Wait until retrying the call
v = cache.get(k)
if v == None: # Check if the item is in the cache
v = store.get(k) # If not in cache, read from data store
cache.put(k, v) # Put a copy of the item into the cache
3. One-to-many subcalls
@app.route('/servers/<int:user_id>/<int:number>')
def create(user_id, number):
for i in range(number):
authentication(user_id)
# Call remote microservice to create a server
rpc.create_server()
T1: A, B, C, D, E, …
T2: A, B, C, C, D, E, ….
T3: A, B, C, C, C, D, E, …
T4: A, B, C, C, C, Z
T1: A, B, C, D, E, …
T1: A, C, D, E, …
T1: A, B, C, D, E, …
T2: A, B, C, C, D, E, ….
T3: A, B, C, C, C, D, E, …
T4: A, B, C, C, C, C, D, E, …
Many “Hidden” Software “Patterns” which affect running systems…Circuit breaker, concurrency and parallelism, priority
queues, publisher-subscriber, bulkhead, etc.

Using LSTMs
1. Model Openstackbehavior
OPENSTACK_SEQUENCE= [
('keystone', 1.0, 'authenticate user with credentials and generate auth-token'),
('nova-api', 1.0, 'get user request and sends token to Keystone for validation'),
('keystone', 0.2, 'validate the token if not in cache'),
('nova-api', 1.0, 'starts VM creation process'),
('nova-database', 1.0, 'create initial database entry for new VM'),
('nova-scheduler', 1.0, 'locate an appropriate host using filters and weights'),
('nova-database', 1.0, 'execute query'),
('nova-compute', 1.0, 'dispatches request'),
('nova-conductor', 1.0, 'gets instance information'),
('nova-compute', 1.0, 'get image URI from image service'),
('glance-api', 1.0, 'get image metadata'),
('nova-compute', 1.0, 'setup network'),
('neutron-server', .5, 'allocate and configure network IP address'),
('nova-compute', 1.0, 'setup volume'),
('cinder-api', .75, 'attach volume to the instance or VM'),
('nova-compute', 1.0, 'generates data for the hypervisor driver')
]
3. Transform traces into sequences of integers
Train dataset
span_list
trace_id
0 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
1 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
2 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
3 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
4 [3, 5, 3, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
... ...
95 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
96 [3, 5, 3, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
97 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
98 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
99 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
[100 rows x 1 columns]
Test dataset
span_list
trace_id
0 [3, 5, 5, 1, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
1 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
[0,
['keystone',
'nova-api',
'nova-api',
'nova-database',
'nova-scheduler',
'nova-database',
'nova-compute',
'nova-conductor',
'nova-compute',
'glance-api',
'nova-compute',
'nova-compute',
'cinder-api',
'nova-compute']]
Cache Simulation: Probability of execution = 75%
Trace #0
 Generate Synthetic Traces (generate_traces())
‒ Create a simple model to generate traces
‒ OPENSTACK_SEQUENCE models sequences of spans which are valid
• e.g., ['keystone‘, 'nova-api', 'nova-api', 'nova-database', 'nova-scheduler', 'nova-database', 'nova-compute', 'nova-
conductor', 'nova-compute', 'glance-api', 'nova-compute', 'nova-compute', 'cinder-api', 'nova-compute']
‒ Each microservice name is encoded into a number using a dictionary (create_coders())
 Trace Mutation (mutate_traces())
‒ Selected synthetic traces are mutated
‒ Spans can be deleted or replaced
2. Generate Synthetics Traces

Using LSTMs
[[0 0 3 ... 6 1 6]
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
...
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
[0 3 5 ... 6 1 6]]
Padding
2. Hot encoding
Tensorflow tutorial: https://github.com/campdav/text-rnn-tensorflow
Keras tutorial: https://github.com/campdav/text-rnn-keras
Code available at: https://github.com/jorge-cardoso/aiops_practice
Subdirectory:dta_lstm
 Padding (_pad_traces())
‒ Pad each trace (sequence) with zeros (0), up to a pre-defined maximum length.
‒ This allows to pre-allocate a chain of LSTM units of a specific length
‒ Padding is done using pad_sequences(sequences,...)
 One hot encoding (_traces_to_binary())
‒ Enables to represent categorical variables as binary vectors
‒ Each integer value is represented as a binary vector that has all zero values except the index of the integer, which is
marked with a 1
X = pad_sequences(X, maxlen=self.trace_size, dtype=np.int16,
truncating='pre', padding='pre',
value=DEFAULTS['padding_symbol'])
maxlen: int, maximum length of all sequences.
truncating='pre' remove values at the beginning from sequences larger than maxlen
padding='pre' pads each trace at the beginning with a special integer (e.g., 0)
1. Padding
keras.utils.to_categorical(X, num_classes=self.max_n_span_types,
dtype='int32')
X.shape = (n_traces, self.trace_size)
returns shape = (n_traces, self.trace_size, self.max_n_span_types)
[[[1. 0. 0. ... 0. 0. 0.]
[1. 0. 0. ... 0. 0. 0.]
[0. 0. 1. ... 0. 0. 0.]
...
]
]

Using LSTMs
X [[0 0 3 ... 6 1 6]
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
...
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
[0 3 5 ... 6 1
6]]
y [[0 0 5 ... 1 6 0]
[0 0 5 ... 1 6 0]
[0 5 5 ... 1 6 0]
...
[0 0 5 ... 1 6 0]
[0 5 5 ... 1 6 0]
[0 5 5 ... 1 6 0]]
 Generate y
‒ Shift traces by one position to the left
‒ Fill new cell with DEFAULTS['padding_symbol']
# X.shape is n_traces
# X.shape.y is self.trace_size
y = np.roll(X, -1, axis=1)
# Pad j array where X array has zeros
y[X == 0] = DEFAULTS['padding_symbol']
# Write the DEFAULTS['padding_symbol'] at the last
position of the shifted array
y[:, -1] = DEFAULTS['padding_symbol']
1. Generate y by shifting X
0 3 5 2 1 7X
y 0 5 2 1 7 0
Shift traces left

Using LSTMs
 Create DL model (_build_model())
‒ Use a LSTM network
‒ dropout layer of 0.2 (high, but necessary to avoid quick divergence)
‒ Softmax activation to generate probabilities over the different categories
‒ Because it is a classification problem, loss calculation via categorical cross-entropy compares the output probabilities
against one-one encoding
‒ ADAM optimizer (instead of the classical stochastic gradient descent) to update weights
model = Sequential()
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True,
input_shape=(self.trace_size, input_dim)))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
# A Dense layer is used as the output for the network.
model.add(TimeDistributed(Dense(input_dim, activation='softmax')))
if self.gpus > 1:
model = keras.utils.multi_gpu_model(model, gpus=self.gpus)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
1. Create DL model

Using LSTMs
 Train the model (fit())
‒ X = Obtains the traces composed of spans
‒ Pad the traces of X
‒ y = Shift the spans of the traces left
‒ Apply an one hot encoding to X and y
‒ Build the model
‒ Fit the model f(X) = y
# X_train has 2 columns: traceId, span_list. Here we select the span_list
X_train = X_train.loc[:, DEFAULTS['span_list_col_name']].values
X = self._pad_traces(X_train)
y = self._shift_traces_left(X)
self._get_lstm_parameters(X)
X = self._traces_to_binary(X)
y = self._traces_to_binary(y)
self.model = self._build_model(input_dim=X.shape[2])
self.model.fit(X, y, epochs=self.epochs, batch_size=self.batch_size, shuffle=True,
validation_split=self.validation_split)
1. Train the model

Using LSTMs
Test Traces
X [[0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6]
[0 0 3 5 5 8 9 8 6 7 6 2 6 4 6 6]]
X[0]: [0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6] =
[[0, 0.2625601], [1, 0.38167596], [0, 0.75843567],
[0, 0.8912414], [9, 2.394792e-05], [0, 0.94694203],
[0, 0.9656691], [0, 0.9686248], [0, 0.9666971],
[0, 0.95213455], [0, 0.94981205], [0, 0.93016535],
[0, 0.60129535], [0, 0.5417027], [0, 0.6876041],
[0, 0.82936114]]
X[1]: [0 0 3 5 5 8 9 8 6 7 6 2 6 4 6 6] =
[[0, 0.2625601], [1, 0.38167596], [0, 0.75843567],
[0, 0.8912414], [0, 0.9302344], [0, 0.94507533],
[0, 0.9638574], [0, 0.9665326], [0, 0.9627945],
[0, 0.9435249], [0, 0.9405963], [0, 0.92075515],
[1, 0.3207201], [1, 0.46595544], [0, 0.64980185],
[0, 0.8390282]]
trace_id 0 indices [4]
trace_id 1 indices []
 Test traces (predict())
‒ X = Obtains the traces composed of spans
‒ Pad the traces of X
‒ Apply an one hot encoding to X
‒ Calculate: yhat = f(X)
‒ Identify in which spans the anomalies occur
X_test = X_test.loc[:, DEFAULTS['span_list_col_name']].values
X_test = self._pad_traces(X_test)
X_test_bin = self._traces_to_binary(X_test)
yhat = self.model.predict(X_test_bin)
self._identify_anomalies(X_test, yhat, prob=self.threshold)
return self.rank_prob
1. Test unseen traces

Using LSTMs
 Identifying Anomalous Traces Sequences (_identify_anomalies())
‒ Mark anomalies based on the difference between y and yhat
‒ y is the shifted version of X
‒ yhat = f(X)
‒ If we cannot predict correctly the shift, this means the trace is did not occur before and is marked as anomalous
y = DTA_LSTM._shift_traces_left(X)
for i in range(len(X)):
rp = self._compare_traces(y[i], yhat[i])
self.rank_prob.append(rp)
if self.explain:
print('X[{}]: {} = {}'.format(i, X[i], rp))
return self.rank_prob
Input X = [0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6]
Shifted_X = [0 0 5 5 1 9 8 6 7 6 2 6 6 1 6 0]
Prediction
yhat = [0 0 5 5 12 9 8 6 7 6 2 6 6 1 6 0]
X[0]: [0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6] =
[[0, 0.2625601], [1, 0.38167596], [0, 0.75843567],
[0, 0.8912414], [9, 2.394792e-05], [0, 0.94694203],
[0, 0.9656691], [0, 0.9686248], [0, 0.9666971],
[0, 0.95213455], [0, 0.94981205], [0, 0.93016535],
[0, 0.60129535], [0, 0.5417027], [0, 0.6876041],
[0, 0.82936114]]
Error at index [4]
X[i]: Input: [ 0 0 0 0 0 0 0 0 0 0 0 10 10 17 5 5 7 7 6 6]
y[i]: Output: [ 0 0 0 0 0 0 0 0 0 0 0 10 17 5 5 7 7 6 6 0]
yhat[i]: Predicted: [ 0 0 0 0 0 0 0 0 0 0 0 10 17 5 5 7 7 6 6 0]
Correct prediction

Single and Bi-Models
 Use of single and bi-models to capture traces as
sequences of events and their response time
 Use sequential model representation by utilizing long-
short-term memory (LSTM) networks
Related Research
Technical University of Berlin
S. Nedelkoski,J. Cardoso,O. Kao, Anomaly Detectionfrom System Tracing Data using MultimodalDeep Learning, IEEECloud2019,July 3-8,2019,Milan, Italy.
Idea: represent distributed traces as sequences of
events and their response time
Trace Structure (sequence of events) and
Response time (duration)
e.g., Service 11 → Service 21; duration = 12ms

Related Research
Facebook
https://blog.acolyer.org/2015/10/07/the-mystery-machine-end-to-end-performance-analysis-of-large-scale-internet-services/
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chow.pdf
Measure end-to-end performance of requests
 Infer causal relationships from logs
‒ Adding instrumentation retroactively is an
expensive task
‒ Hypothesize and confirm relationships in
messages
 Log Data
‒ Request & host id, timestamp, unique event label
 Finding Relationships
‒ Samples requests, store logs in Hive and run
Hadoop jobs to infer causal relationships
‒ 2h Hadoop to analyze 1.3M requests sampled 30
days
‒ Assumesa hypothesized relationshipbetween
two segmentsuntil finding a counterexample
‒ 3 types of relationship inferred: happens-before,
mutual exclusion, pipeline
 Applications
‒ Find critical path and slack segments for
performance optimization
‒ Anomaly detection: 1) select top 5% of end-to-end
latency, 2) identify segmentswith proportionally
greater representation in the outlier set of
requests than in the non-outlier set.
As the number of traces analyzed increases, the observation of new counter
examples diminishes, leaving behind only true relationships

Copyright©2019 Huawei Technologies Co., Ltd.
All Rights Reserved.
The information in this document may contain predictive
statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.
Bring digital to every person, home and
organization for a fully connected,
intelligent world.
Thank you.

Distributed Trace & Log Analysis using ML

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Distributed Trace & Log Analysis using ML

Similar to Distributed Trace & Log Analysis using ML (20)

More from Jorge Cardoso

More from Jorge Cardoso (20)

Recently uploaded

Recently uploaded (20)

Distributed Trace & Log Analysis using ML