Predicting system failures can be of great benefit to managers, who gain better command over system performance.
The data that systems generate in the form of logs is a valuable source of information for predicting system reliability. As such, there is an increasing demand for
tools that mine logs and provide accurate predictions. Interpreting the information in logs, however, poses some challenges. This talk
presents how to effectively mine sequences of logs and provide accurate predictions.
The approach integrates different machine learning techniques to control for data brittleness, ensure accurate model selection and validation,
and increase the robustness of classification results. We apply the proposed approach to log sequences of 25 different applications of a software system for
car telemetry.
Mining System Logs to Learn Error Predictors, Universität Stuttgart, Stuttgart, June 2015
1. Mining System Logs to Learn Error Predictors
A Case Study of a Telemetry System
Barbara Russo
L.E.S.E.R.
Faculty of Computer Science, Free University of Bozen-Bolzano, Italy
Barbara.Russo@unibz.it
Universität Stuttgart - June 9th, 2015
2. A collaboration between
Free University of Bozen-Bolzano, Italy
and
University of Alberta, Canada
Barbara Russo, Giancarlo Succi, Witold Pedrycz (2015). Mining system logs to learn error predictors: a case study of a telemetry system. Empirical Software Engineering, 20(4), 879-927.
3. System events
• Events describe the behaviour within and across subsystems or components
– how the system changes over time
• Logs track events
4. The value of logs
• Log events carry information on
– the software application that generated the event and its state,
– the task and the user whose interaction with the system triggered the event, and
– the timestamp at which the event was generated.
5. Logs can be cryptic
6. Errors
• Some behaviours are desirable and some are not
• Undesirable behaviours are referred to as system errors
– crashes that immediately stop the system and are easily identifiable
– deviations from the expected output that let the system keep running and reveal themselves only at the completion of system tasks
7. Meaning of errors
• Events in error state (errors) act as alerts
– Manifestations of system failures?
– Originated from a series of preceding events?
– Immediate action must be taken?
– Indication of an underlying problem?
8. Goal
• Analysing the behaviour of a (composite) system by mining logs of events and predicting future system misbehaviour
• Composite: many applications or subsystems
9. Method
• Solve a classification problem with SVM
• Build a sequence abstraction by mining logs
• Integrate several statistical techniques to control for data brittleness and for accuracy of model selection and validation
• Discuss the classification problem at different degrees of defectiveness
10. Sequences
• A single event may not suffice to predict system failures
• An event sequence is a set of events, ordered by timestamp, occurring within a given time window
• A sequence abstraction is a representation of the identified sequences in a formal, machine-readable format
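A minimal sketch of the time-window grouping, assuming log records of the form (timestamp, event type, user); the window length and record layout are illustrative, not taken from the paper:

    from datetime import datetime, timedelta

    # Hypothetical log records: (timestamp, event_type, user).
    events = [
        (datetime(2015, 6, 9, 10, 0, 0), "A", "u1"),
        (datetime(2015, 6, 9, 10, 0, 40), "B", "u1"),
        (datetime(2015, 6, 9, 10, 2, 5), "A", "u2"),
    ]

    def build_sequences(events, window=timedelta(minutes=1)):
        """Group timestamp-ordered events into sequences per time window."""
        sequences, current, start = [], [], None
        for e in sorted(events, key=lambda ev: ev[0]):
            if start is None or e[0] - start > window:
                if current:
                    sequences.append(current)
                current, start = [], e[0]
            current.append(e)
        if current:
            sequences.append(current)
        return sequences

    print(build_sequences(events))  # two sequences: [A, B] and [A]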
11. Research question
• Is the amount and type of information carried by a
sequence enough to predict errors?
14. Example – sequence type
• sv1 = [0,1,0,1]
• sv2 = [2,1,1,0]
15. Sequence type
• µi – number of events of type i in a sequence
• sv = [µ1, …, µn] – vector of event multiplicities
• ρ(sv) – total number of errors in the sequences mapping into sv
16. Features to feed SVM
• v = [sv, µ(sv), ν(sv)] – feature vector
– µ(sv) = # of sequences mapping into sv
– ν(sv) = average # of users in the sequences mapping into sv
• v is a faulty feature if at least one event in one of its sequences is in an error state
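A sketch of how sv, µ(sv), ν(sv), and ρ(sv) could be computed from mined sequences, following the definitions above; the event alphabet and the per-event error flag are illustrative assumptions:

    from collections import Counter

    EVENT_TYPES = ["A", "B", "C", "D"]  # illustrative event alphabet

    # A sequence is a list of (event_type, user, is_error) tuples.
    def sequence_vector(seq):
        """sv = [mu_1, ..., mu_n]: multiplicity of each event type."""
        counts = Counter(etype for etype, _, _ in seq)
        return tuple(counts.get(t, 0) for t in EVENT_TYPES)

    def features(sequences):
        """Build v = [sv, mu(sv), nu(sv)] and rho(sv) per sequence type."""
        groups = {}
        for seq in sequences:
            groups.setdefault(sequence_vector(seq), []).append(seq)
        feats = []
        for sv, seqs in groups.items():
            mu = len(seqs)  # number of sequences mapping into sv
            nu = sum(len({u for _, u, _ in s}) for s in seqs) / mu  # avg # users
            rho = sum(err for s in seqs for _, _, err in s)         # total errors
            feats.append((list(sv) + [mu, nu], rho))
        return feats

    seqs = [[("A", "u1", 0), ("B", "u1", 1)], [("A", "u2", 0), ("B", "u2", 1)]]
    print(features(seqs))  # [([1, 1, 0, 0, 2, 1.0], 2)]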
17. Sequence vector semantics
• Patterns of system behaviour
– If µ>1 and ρ>0, such sequences denote a recurring reliability problem
• Distributed teams
– If ν>1, the comparative analysis of features with ρ>0 or ρ=0 tells whether errors originate from multiple users working on the same tasks
18. Example - features
• v1 = [0,1,0,1; 1,1], sv1 = [0,1,0,1]
– µ(sv1) = 1, ν(sv1) = 1, ρ(sv1) = 0
• v2 = [2,1,1,0; 1,2], sv2 = [2,1,1,0]
– µ(sv2) = 1, ν(sv2) = 2, ρ(sv2) = 2
19. The classification problem
[Diagram] Features, drawn from data sets with different ex-ante distributions (faulty, non-faulty), feed a classifier that assigns them to G1 = Faulty or G2 = Non-Faulty; the ex-post classification differs across the classifier's thresholds.
20. Classification
• False positive = features v that are predicted faulty but do not contain errors, ρ(sv)=0
• True positive = features v that are predicted faulty and contain error(s), ρ(sv)>0
• False negative = features v that are predicted non-faulty but contain error(s), ρ(sv)>0
• True negative = features v that are predicted non-faulty and do not contain errors, ρ(sv)=0
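The four outcomes can be restated as a check of the classifier's verdict against ρ(sv); a minimal sketch:

    def outcome(predicted_faulty, rho):
        """Compare the classifier's verdict with the error count rho(sv)."""
        if predicted_faulty:
            return "TP" if rho > 0 else "FP"
        return "FN" if rho > 0 else "TN"

    print(outcome(True, 2))   # TP: predicted faulty, contains errors
    print(outcome(True, 0))   # FP: predicted faulty, no errors
    print(outcome(False, 1))  # FN: predicted non-faulty, contains an error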
22. Build classifiers on historical data
[Diagram] Historical data is split into a training set and a test set:
1. the training set is used to tune the classifier's parameters
2. the test set is used to compute the classifier's fitting performance
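A minimal sketch of these two steps with scikit-learn; the library choice, the parameter grid, and the synthetic data are illustrative assumptions, not the paper's setup:

    import numpy as np
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.svm import SVC

    # Illustrative stand-ins for the real feature vectors v and labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))
    y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # 1. Tune the classifier's parameters on the training set.
    grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=5)
    grid.fit(X_train, y_train)

    # 2. Compute the classifier's fitting performance on the held-out set.
    print(grid.best_params_, grid.score(X_test, y_test))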
24. Validating sequence abstraction
• Did we put too much information in our features?
– Information Gain selects the features that contribute most to the information of a given classification category
– Classification category: sequences with a given number of error events
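Information gain for a discrete feature is the mutual information between that feature and the class; a sketch with scikit-learn's estimator (the library choice and synthetic data are assumptions; the paper may compute IG differently):

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    rng = np.random.default_rng(1)
    X = rng.integers(0, 4, size=(300, 6))  # illustrative discrete features
    y = (X[:, 2] > 1).astype(int)          # class driven by column 2

    ig = mutual_info_classif(X, y, discrete_features=True, random_state=1)
    keep = np.argsort(ig)[::-1][:3]        # retain the 3 most informative
    print(keep, ig[keep])                  # column 2 should rank first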
25. Control for the effect of the dataset's nature
• Does set balancing increase the quality of prediction?
– If classification categories are not equally represented in the datasets, classifiers might have low precision even though the true positive rate is high and the false positive rate is low
– Such imbalanced data sets are very frequent in software engineering
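One simple balancing scheme is to randomly undersample the majority class; a minimal sketch (the paper's k-splitting may balance differently):

    import numpy as np

    def undersample(X, y, random_state=0):
        """Balance a binary dataset by undersampling the majority class."""
        rng = np.random.default_rng(random_state)
        pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
        minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
        keep = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, keep])
        rng.shuffle(idx)
        return X[idx], y[idx]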
26. Parametric classification
• The problem varies depending on how many errors we allow in the system
• c – cut-off value, i.e., the number of errors in a sequence vector
• Categories:
– G1(c) = {v = [sv, µ(sv), ν(sv)] | ρ(sv) ≥ c}
– G2(c) = {v = [sv, µ(sv), ν(sv)] | ρ(sv) < c}
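The two categories translate directly into a labelling function over the cut-off c; a minimal sketch reusing the (v, ρ) features from the earlier sketch:

    def label(feats, c):
        """G1(c): rho(sv) >= c -> 1 (faulty); G2(c): rho(sv) < c -> 0."""
        return [(v, 1 if rho >= c else 0) for v, rho in feats]

    # Raising c makes the faulty class stricter: a feature whose sequences
    # contain 2 errors is faulty for c <= 2 but non-faulty for c = 3.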
28. Business Questions
• In our case study:
– Can we use Support Vector Machines to build suitable predictors?
– Is there any Support Vector Machine that performs best for all system applications?
– Is there any machine that performs best for the different levels of reliability requested of the system?
29. Descriptive analysis across apps
• 54 datasets, 25 of which have some faulty features
32. Splitting data
• Three approaches to control for artificial assumptions (a sketch of t-splitting follows this list)
– Varying the size of the split: "t-splitting"
– Reducing features with IG and varying the size: "t-splitting reduced"
– Balancing sets: "k-splitting", i.e., manipulating the sets so that the number of instances in the two categories is balanced
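A minimal sketch of t-splitting, refitting and re-evaluating as the training fraction t varies (the fractions and data are illustrative assumptions):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 6))
    y = (X[:, 1] > 0).astype(int)

    for t in (0.5, 0.6, 0.7, 0.8):  # illustrative training fractions
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=t, random_state=2)
        score = SVC().fit(X_tr, y_tr).score(X_te, y_te)
        print(f"t={t}: accuracy={score:.2f}")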
33. Types of SVM
• Different kernels
– Multilayer perceptron
– Linear
– Radial Basis Function
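In scikit-learn terms the three kernels could be instantiated as below; mapping the multilayer-perceptron kernel to the 'sigmoid' (tanh) kernel is an assumption on my part, not the paper's stated setup:

    from sklearn.svm import SVC

    kernels = {
        "MP (tanh, stand-in for multilayer perceptron)": SVC(kernel="sigmoid"),
        "Linear": SVC(kernel="linear"),
        "RBF": SVC(kernel="rbf"),
    }
    # Each machine is then tuned and evaluated per application, as in
    # the earlier train/test sketch.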
34. Fitting performance across applications
[Chart] Number of applications for which a classifier outperforms the others (by MR) in quality of fit
35. Prediction
[Charts] Prediction performance without filtering vs. filtered with IG
• Models with high fitting performance (bal > 0.73)
• Prediction performance averaged across t-splittings and models
36. Findings
• Results are better with IG filtering; MP is best across applications, but it is not the only one (cluster the applications?)
• Artificial balance does not help to identify a single classifier, but it helps to increase convergence in those classifiers that are not reduced with IG
37. Findings (superior to the literature)
• Best performance on an individual application (MP, c=3):
– 1% false positive rate, 94% true positive rate, and 95% precision
• Best performance across applications, averaged over models, for c=2:
– 9% false positive rate, 78% true positive rate, and 95% precision
38. What predictions can tell managers
• The application that manages the software tools of cars
– Pervasive in the telemetry system
• 106 distinct features over 10 different event types; 18% multiple sequences, and 89% with more than one user
• c=1
• IG reduces the features from 12 to 7, still including µ and ν
40. Prediction - assumptions
• Behaviour stays the same over the next three months
• 1000 features
• Category balance is that of the test set used for fitting (39%)
– 390 faulty features and 610 non-faulty features
41. In numbers
• We have 390 faulty features, 610 non-faulty features, and 450 predicted faulty features
• Predicted faulty features that contain no error:
– 67 = 11% * 610
• Faulty features we fail to predict: 70 = 18% * 390
           Pred pos   Pred neg   Total
    Pos       82%        18%      100%
    Neg       11%        89%      100%
    Total     45%        54%      100%
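A quick check of the slide's arithmetic, reproduced from the rates in the table (counts rounded as on the slide):

    faulty, non_faulty = 390, 610

    fp = round(0.11 * non_faulty)  # predicted faulty, no error:    67
    fn = round(0.18 * faulty)      # faulty but predicted clean:    70
    tp = round(0.82 * faulty)      # faulty and predicted faulty:  320
    tn = round(0.89 * non_faulty)  # clean and predicted clean:    543

    print(fp, fn, tp, tn)  # 67 70 320 543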
42. Cost of prediction
• Inspection cost. Wasted time ≥ 67 × the average cost to fix one error
– There might be more than one error per sequence on average
• Cost of undiscovered errors. Defect slippage ≥ 70
– A measure of system unreliability
– Cost of repairing errors at late stages (inaccuracy, higher cost due to pressure, not being able to fix)
43. Prediction
[ROC plot] True positive rate vs. false positive rate for the best prediction models (MP, RBF, and L kernels): FPr = 11%, TPr = 82%. The diagonal marks equal chance; moving right means higher inspection costs, moving down means a higher cost to fix undiscovered errors.
44. Recommendations
• Select models that accurately fit historical data before using them for predictions
– The best models for quality of fit are not always the best predictors for all splitting sizes of a feature set
• Reduce information redundancy
45. Recommendations
• Report fitting accuracy
• Use parametric classification
– The parameter being the number of errors a sequence must contain in order to be classified as defective/faulty
• Study prediction at different cut-off values, splitting sizes, or balances to solve the prediction problem independently of the level of reliability requested of the system and of the nature of the data
47. With artificial balance
• It does not help to identify a single classifier
• It helps to increase convergence in those classifiers that are not reduced with IG
48. With IG filter
[Table] Best classifiers across different t-splittings; classifiers with bal < 0.73 are not reported