Machine learning uses patterns in data to make predictions without the computer being explicitly programmed. This document introduces machine learning concepts through a real-world project example. It discusses what data scientists do, including prediction, anomaly detection, gaining insights, and decision making. It then demonstrates machine learning applications in areas such as predicting flight delays or employee attrition, and covers key steps such as data preprocessing, feature engineering, and building predictive models with decision trees.
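The summary above mentions building predictive models with decision trees. As a hedged illustration of that idea, here is a minimal decision stump (a one-level decision tree) in plain Python; the flight-hour data, labels, and threshold search below are invented for illustration and are not taken from the project in the slides.

```python
# A decision stump: a one-level decision tree that picks the single
# feature threshold which best separates two classes.

def train_stump(xs, ys):
    """Find the threshold on a 1-D feature that minimizes training errors."""
    best = None
    # Candidate thresholds: midpoints between consecutive sorted values.
    pts = sorted(xs)
    cands = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    for t in cands:
        for label_above in (0, 1):
            preds = [label_above if x > t else 1 - label_above for x in xs]
            errs = sum(p != y for p, y in zip(preds, ys))
            if best is None or errs < best[0]:
                best = (errs, t, label_above)
    return best[1], best[2]

def predict(t, label_above, x):
    return label_above if x > t else 1 - label_above

# Toy data echoing the flight-delay example: flights departing later in
# the day (feature = departure hour) are delayed (label = 1) more often.
hours   = [6, 7, 8, 9, 15, 17, 19, 21]
delayed = [0, 0, 0, 0, 1, 1, 1, 1]
t, above = train_stump(hours, delayed)
print(predict(t, above, 18))  # classify a 6 pm flight
```

The stump learns a split around midday and predicts "delayed" for the 6 pm flight; real decision-tree learners repeat this threshold search recursively over many features.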
Introduction to Data Science and Analytics (Srinath Perera)
This webinar serves as an introduction to WSO2 Summer School. It will discuss how to build a pipeline for your organization and for each use case, and the technology and tooling choices that need to be made for the same.
This session will explore analytics under four themes:
Hindsight (what happened)
Oversight (what is happening)
Insight (why is it happening)
Foresight (what will happen)
Recording http://t.co/WcMFEAJHok
a) What is data
b) Types of data
c) The difference between data science, big data, and data analytics
d) The relationship between data and artificial intelligence
Advantages and Disadvantages of Machine Learning (business Corporate)
In this presentation we cover the advantages and disadvantages of machine learning, looking at its benefits and limitations to understand where machine learning should and should not be used.
Machine Learning: Replicating the Human Brain (Nishant Jain)
These slides show how humans make decisions and how, following the same pattern, machines are trained to learn and make decisions. They give an overview of all the steps involved in designing an efficient decision-making machine.
What Is Data Science? | Introduction to Data Science | Data Science For Begin... (Simplilearn)
This Data Science presentation will help you understand what Data Science is, why we need it, the prerequisites for learning it, what a Data Scientist does, the Data Science lifecycle with an example, and career opportunities in the Data Science domain. You will also learn the differences between Data Science and Business Intelligence. The role of data scientist has been called one of the sexiest jobs of the century. The demand for data scientists is high, the number of opportunities for certified data scientists is increasing, and studies project a continued shortfall of qualified candidates to fill these roles. So let us dive into Data Science and understand what it is all about.
This Data Science Presentation will cover the following topics:
1. Need for Data Science
2. What is Data Science?
3. Data Science vs Business intelligence
4. Prerequisites for learning Data Science
5. What does a Data scientist do?
6. Data Science life cycle with use case
7. Demand for Data scientists
This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. Data scientist is the pinnacle rank in an analytics organization. Glassdoor ranked data scientist first in its 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data scientist, you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
The Data Science with Python course is recommended for:
1. Analytics professionals who want to work with Python
2. Software professionals looking to get into the field of analytics
3. IT professionals interested in pursuing a career in analytics
4. Graduates looking to build a career in analytics and data science
5. Experienced professionals who would like to harness data science in their fields
Methods for Sentiment Analysis: A Literature Study (vivatechijri)
Sentiment analysis is a trending topic, as everyone has an opinion on everything. The systematic study of these opinions can yield information that proves valuable to many companies and industries. A huge number of users are online and share their opinions and comments regularly, and this information can be mined and used efficiently. Companies can review their own products using sentiment analysis and make the necessary changes. The data is huge, so it requires efficient processing to collect and analyze it and produce the required results.
In this paper, we discuss the various methods used for sentiment analysis. It also covers techniques such as the lexicon-based approach, SVM [10], convolutional neural networks, the morphological sentence pattern model [1], and the IML algorithm. The paper surveys studies on various data sets such as the Twitter API, Weibo, movie reviews, IMDb, a Chinese micro-blog database [9], and more, and reports the accuracy results obtained by each system.
Modeling and Predicting Cyber Hacking Breaches (Venkat Projects)
Analyzing cyber incident data sets is an important method for deepening our understanding of the evolution of the threat situation. This is a relatively new research topic, and many studies remain to be done. In this paper, we report a statistical analysis of a breach incident data set corresponding to 12 years (2005–2017) of cyber hacking activities that include malware attacks. We show that, in contrast to the findings reported in the literature, both hacking breach incident inter-arrival times and breach sizes should be modeled by stochastic processes, rather than by distributions because they exhibit autocorrelations. Then, we propose particular stochastic process models to, respectively, fit the inter-arrival times and the breach sizes. We also show that these models can predict the inter-arrival times and the breach sizes. In order to get deeper insights into the evolution of hacking breach incidents, we conduct both qualitative and quantitative trend analyses on the data set. We draw a set of cybersecurity insights, including that the threat of cyber hacks is indeed getting worse in terms of their frequency, but not in terms of the magnitude of their damage.
Applying different classification techniques to different types of datasets, such as text and image datasets. Here I have used machine learning and deep learning for the text and image datasets, respectively.
A PPT that gives a brief introduction to Machine Learning and to products developed using Machine Learning algorithms, using both text and a few images in the slides as part of the explanation. It includes examples of products like Google Cloud Platform, Cozmo (a tiny robot built using Artificial Intelligence), IBM Watson, and many more.
How ML Can Improve Purchase Conversions (Sudeep Shukla)
- What is Machine Learning and what problems can it solve?
- Basic Machine Learning models
- Data gathering and data cleaning
- Parameters for judging whether the model is performing well
- Making it easy for sales & marketing teams to use the ML program
Machine Learning Engineer Salary, Roles And Responsibilities, Skills and Resu... (Simplilearn)
This presentation on "Machine Learning Engineer Salary, Skills & Resume" will help you understand who a Machine Learning engineer is, the salary of a Machine Learning engineer, the skills required to become one, and what a Machine Learning engineer's resume should look like. Machine Learning is the study of algorithms and data models that computer systems use to perform specific tasks without explicit instructions, relying instead on patterns learned from previous data. To make this possible, a Machine Learning engineer is required. Now, let us get started and understand what the job of a Machine Learning engineer looks like.
Below are the topics that we will be discussing in the presentation:
1. Introduction to Machine Learning
2. Responsibilities of a Machine Learning engineer
3. Salary Trends of a Machine Learning engineer
4. Skills of a Machine Learning engineer
5. Resume of a Machine Learning engineer
Why learn Machine Learning?
Machine Learning is taking over the world, and with that comes a growing need among companies for professionals who know the ins and outs of Machine Learning.
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
What skills will you learn from this Machine Learning course?
By the end of this Machine Learning course, you will be able to:
1. Master the concepts of supervised, unsupervised, and reinforcement learning and modeling.
2. Gain practical mastery over principles, algorithms, and applications of Machine Learning through a hands-on approach which includes working on 28 projects and one capstone project.
3. Acquire thorough knowledge of the mathematical and heuristic aspects of Machine Learning.
4. Understand the concepts and operation of support vector machines, kernel SVM, naive Bayes, decision tree classifier, random forest classifier, logistic regression, K-nearest neighbors, K-means clustering and more.
5. Be able to model a wide variety of robust Machine Learning algorithms, including deep learning, clustering, and recommendation systems.
We recommend this Machine Learning training course for the following professionals in particular:
1. Developers aspiring to be a data scientist or Machine Learning engineer
2. Information architects who want to gain expertise in Machine Learning algorithms
3. Analytics professionals who want to work in Machine Learning or artificial intelligence
4. Graduates looking to build a career in data science and Machine Learning
Learn more at https://www.simplilearn.com/big-data-and-analytics/machine-learning-certification-training-course
Application of Python in Medical Science (Aditya Nag)
Python is an interpreted high-level general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation. Its language constructs, as well as its object-oriented approach, aim to help programmers write clear, logical code for small and large-scale projects.
These slides are from a presentation on understanding Machine Learning at a high level. The talk touches on linear regression, neural networks, and how Deep Learning fits into Machine Learning.
Introduction to Machine Learning and Artificial Intelligence Technologies. Discover the basics surrounding this tech, including business uses and evolution over time.
The goal of machine learning is to program computers to use example data or past experience to solve a given problem. Many successful applications of machine learning exist already, including systems that analyze past sales data to predict customer behavior, optimize robot behavior so that a task can be completed using minimum resources, and extract knowledge from bioinformatics data.
Unit I and II Machine Learning MCA CREC.pptx (trishipaul)
A Machine Learning presentation covering the following topics:
Unit I – Introduction: Towards Intelligent Machines, Well posed Problems, Example of Applications in diverse fields, Data Representation, Domain Knowledge for Productive use of Machine Learning, Diversity of Data: Structured / Unstructured, Forms of Learning, Machine Learning and Data Mining, Basic Linear Algebra in Machine Learning Techniques.
Unit II – Supervised Learning – Rationale and Basics: Learning from Observations: Why Learning Works, Bias and Variance: Computational Learning Theory, Occam's Razor Principle and Overfitting Avoidance, Heuristic Search in Inductive Learning, Estimating Generalization Errors, Metrics for Assessing Regression, Metrics for Assessing Classification.
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ... (ijdpsjournal)
The paper proposes a solution for designing and developing seamless automation and integration of machine learning capabilities for Big Data, with the following requirements: 1) the ability to seamlessly handle and scale very large amounts of unstructured and structured data from diversified and heterogeneous sources; 2) the ability to systematically determine the steps and procedures needed for analyzing Big Data datasets based on data characteristics, domain expert inputs, and a data pre-processing component; 3) the ability to automatically select the most appropriate libraries and tools to compute and accelerate the machine learning computations; and 4) the ability to perform Big Data analytics with high learning performance but minimal human intervention and supervision. The whole focus is to provide a seamless, automated, and integrated solution that can be effectively used to analyze Big Data with high-frequency and high-dimensional features across different data characteristics and application problem domains, with high accuracy, robustness, and scalability. The paper highlights the research methodologies and research activities that we propose Big Data researchers and practitioners conduct in order to develop and support seamless automation and integration of machine learning capabilities for Big Data analytics.
If you’re learning data science, you’re probably on the lookout for cool data science projects. Look no further! We have a wide variety of guided projects that’ll get you working with real data in real-world scenarios while also helping you learn and apply new data science skills.
The projects in the list below are also designed to help you get a job! Each project was designed by a data scientist on our content team, and they’re representative examples of the real projects working data analysts and data scientists do every day. They’re designed to guide you through the process while also challenging your skills, and they’re open-ended so that you can put your own twist on each project and use it for your data science portfolio.
You can complete each project right in your browser, or you can download the data set to your computer and work locally! If you work on our site, you’ll also be able to download your code at any time so that you can continue locally, or upload your project to GitHub.
The sky is the limit here and what you decide to look into further is completely up to you and your imagination!
1. Learning by Doing
Learning by doing refers to a theory of education expounded by the American philosopher John Dewey. It is a hands-on approach to learning: students must interact with their environment in order to adapt and learn. This way of learning sharpens your current skills and knowledge and also helps you gain new skills that can only be acquired by doing.
Driving a car is a perfect example. You can read as much as you like about the theory of driving and the rules, and this is very important; the more you understand the theory, the better you get at the practical part. But you will only drive better by applying this knowledge on a real road. In addition, some skills and knowledge can only be gained by actually driving.
Data science is the same as driving. It is very important to have solid theoretical knowledge and to deepen it regularly so you can get better while working on a project. However, you should always apply this theoretical knowledge to projects. By doing so, you will deepen your understanding of these concepts, gain a better view of how they work in real life, and also show others that you have strong theoretical knowledge and can put it into practice.
There are different types of guided projects, and they bring a lot of benefits:
They remove the barriers between you and doing projects.
They save you time thinking about the project and preparing the data.
They let you apply theoretical knowledge without getting distracted by obstacles.
They offer practical tips that can save you effort and time in the future.
Machine Learning: Need of Machine Learning, Its Challenges and Its Applications (Arpana Awasthi)
The BCA Department of JIMS Vasant Kunj-II is one of the best BCA colleges in Delhi NCR. The curriculum is kept up to date, and the subjects include all the latest in-demand technologies.
The JIMS BCA course teaches Python to second-semester students and Artificial Intelligence Using Python to sixth-semester students.
Here is a small article on the Future of Machine Learning, hope you will find it useful.
Machine Learning is a field of computer science in which computer systems are able to learn from past experiences, examples, and environments. With the help of various Machine Learning algorithms, computers are given the ability to sense data and produce relevant results.
Machine learning algorithms provide techniques for predicting future outcomes or classifying information from a given input so that appropriate decisions can be made.
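As a small sketch of "predicting future outcomes from given input," here is ordinary least-squares fitting of a line in plain Python. The hours-vs-score data points are made up for illustration; they are not from any of the presentations above.

```python
# Ordinary least squares for a single feature: fit y = a*x + b to past
# observations, then use the fitted line to predict an unseen input.

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept from the means.
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Hypothetical past observations: hours studied vs. exam score.
hours  = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]
a, b = fit_line(hours, scores)
print(round(a * 6 + b, 1))  # predict the score for 6 hours of study
```

The model is "trained" by computing the slope and intercept from past data, and the prediction for an unseen input is read off the fitted line, which is the simplest instance of the predict-from-data pattern described above.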
what-is-machine-learning-and-its-importance-in-todays-world.pdf (Temok IT Services)
Machine Learning is an AI method for teaching computers to learn from their mistakes. Machine learning algorithms can "learn" directly from data, without using a predetermined equation as a model, by employing computational methods.
Unlocking the Potential of Artificial Intelligence: Machine Learning in Pract... (eswaralaldevadoss)
Machine learning is a subset of artificial intelligence that involves training computers to learn from data and make predictions or decisions based on that data. It involves building algorithms and models that can learn patterns and relationships from data and use that knowledge to make predictions or take actions.
Here are some key concepts that can help beginners understand machine learning:
Data: Machine learning algorithms require data to learn from. This data can come from a variety of sources such as databases, spreadsheets, or sensors. The quality and quantity of data can greatly impact the accuracy and effectiveness of machine learning models.
Training: In machine learning, training involves feeding data into a model and adjusting its parameters until it can accurately predict outcomes. This process involves testing and tweaking the model to improve its accuracy.
Algorithms: There are many different algorithms used in machine learning, each with its own strengths and weaknesses. Common machine learning algorithms include decision trees, random forests, and neural networks.
Supervised vs. Unsupervised Learning: Supervised learning involves training a model on labeled data, where the desired outcome is already known. Unsupervised learning, on the other hand, involves training a model on unlabeled data and allowing it to identify patterns and relationships on its own.
Evaluation: After training a model, it's important to evaluate its accuracy and performance on new data. This involves testing the model on a separate set of data that it hasn't seen before.
Overfitting vs. Underfitting: Overfitting occurs when a model is too complex and fits the training data too closely, leading to poor performance on new data. Underfitting occurs when a model is too simple and fails to capture important patterns in the data.
Applications: Machine learning is used in a wide range of applications, from predicting stock prices to identifying fraudulent transactions. It's important to understand the specific needs and constraints of each application when building machine learning models.
Overall, machine learning is a powerful tool that can help businesses and organizations make more informed decisions based on data. By understanding the basic concepts and techniques of machine learning, beginners can begin to explore the potential applications and benefits of this exciting field.
A brief introduction to DataScience with explaining of the concepts, algorithms, machine learning, supervised and unsupervised learning, clustering, statistics, data preprocessing, real-world applications etc.
It's part of a Data Science Corner Campaign where I will be discussing the fundamentals of DataScience, AIML, Statistics etc.
Mixed Methods Research in the Age of Big Data: A Primer for UX ResearchersUXPA International
What does UX research entail in what some are calling the “Age of Data Science?” Most would agree that some level of collaboration is needed -- Data Science results feeding UX Research and vice versa -- but can this be more meaningful than simply attending each other’s readouts?
In this session, you’ll hear some practical, approachable tips for qualitative UX Researchers to play a larger role in Big Data discussions. Stats expertise not required! These tips will help you break through the lexicon barriers between UX Research and Data Science, and provide a framework for collaboration that can lead to even more impactful research.
UXPA 2016: Mixed Methods Research in the Age of Big DataZachary Sam Zaiss
UX professionals have a long history of blending quantitative and qualitative research to better understand the customer experience. As Data Science has emerged as a discipline (with an increasing amount of hype), it's all too easy to engage only during results time, sharing information but working independently. At UXPA 2016, I made the case for deeper collaboration between UX professionals and Data Scientists during research and analysis time, for the sake of better Design outcomes for all.
Machine Learning with Azure and Databricks Virtual WorkshopCCG
Join CCG and Microsoft for a hands-on demonstration of Azure’s machine learning capabilities. During the workshop, we will:
- Hold a Machine Learning 101 session to explain what machine learning is and how it fits in the analytics landscape
- Demonstrate Azure Databricks’ capabilities for building custom machine learning models
- Take a tour of the Azure Machine Learning’s capabilities for MLOps, Automated Machine Learning, and code-free Machine Learning
By the end of the workshop, you’ll have the tools you need to begin your own journey to AI.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
3. What to expect
To get an introduction to machine learning by walking through the steps
of a real project as an example.
1. What does a data scientist do
2. Basic ideas in machine learning
3. Demo: a machine learning application
4. Data preparation for predictive modeling
5. Building predictive models: Decision trees and forests
6. Validating predictive models
7. Measuring performance of predictive models
7. Prediction
The ability to make reliable predictions about future events by using the
patterns seen in historical data.
Examples:
- Which one of my customers will end their contract
based on their mobile phone usage data?
- Given the friendship graph of my users,
what new connections are likely to be made?
8. Anomaly detection
Uncovering unusual events, potential frauds by noticing deviation of the data
from what is normal.
Examples:
- It could be suspicious if a customer suddenly
consumes much less power than is usual for
them, according to the data from the meters.
- By knowing which commands each user typically
issues, I am able to recognize weird, outlying
operations on a computer.
9. Gaining insights
Extracting hidden connections, knowledge about our customers, products,
business processes.
Examples:
- Based on the data about their visits, we can
discover typical segments of users and observe
in which respects they use our web site similarly
- We crawl Twitter for thousands of pieces of user
feedback and learn the general sentiment and
emotions towards our company
10. Making valid decisions
The ability to validate business-related hypotheses or compare
alternatives in a mathematical sense.
Examples:
- Will the subscription rate drop if we change the text
used in my email marketing campaign?
- How do I redesign my web page to maximize the time
spent by the visitors?
1) Define an experiment. 2) Measure the results on a
sample. 3) Infer the properties of the whole population.
11. What do we do?
We are building a data driven IT security product.
The software aims to find anomalies in IT security related system logs.
The behavior of the users of the IT system is analyzed, and if unusual
behavior is detected, alerts are raised.
This helps the work of a company's IT security experts by drawing their
attention to the most important events in the system.
13. What is Big Data?
From a technical point of view:
“a term for data sets that are so large or complex that traditional data processing
applications are inadequate” (Wikipedia) -> Infrastructure-wise Big Data
From a layman’s point of view:
“extremely large data sets that may be analysed computationally to reveal
patterns, trends, and associations, especially relating to human behaviour and
interactions” (Oxford Dictionary) -> Impact-wise Big Data
14. What is Big Data?
The 3 Vs (Gartner):
- Volume: the data to be processed takes GBs or TBs of space
- Velocity: new data arrives frequently, at high speed
- Variety: the data has a variety of formats and cannot be stored in
tabular (relational) form
15. Data scientist vs Big Data
Data scientists: professionals who deal with big data.
17. Types of learning
Understanding (meaningful learning):
What you learn converts to understanding of the concept.
Your knowledge is general: you can apply it to new situations.
Memorizing (rote learning):
What you learn can be quickly recalled, but is superficial and cannot
be applied in another context.
18. Computers and memorization
Computers are the best at accurately storing and recalling huge
amounts of data: documents, dictionaries, bits of video files, etc.
But this is solely memorizing. Do they understand what's going on?
If a student memorizes a question bank before the exam without
understanding a word, she might pass an exam containing the same
questions, but fail when answering new questions.
19. Machine learning
Machine learning is about making computers able to learn from
examples.
The goal is, after having seen many examples, to find patterns that
generalize well enough to be used in future situations.
Can a student actually gain understanding
by seeing questions and their answers?
20. Meaningful machine learning
As data scientists, while using machine learning as a tool, our most
important task is to prevent memorizing (called overfitting in this
context), because we want to use the acquired knowledge on new
examples in the future.
Although the machine will never "understand" the data,
we can motivate our algorithms to find trends,
correlation structures, and connections.
21. Nature of machine learning
A machine processes (learns from) more data than a human; it can deal
with amounts of data that we cannot.
With machines, learning can be automated;
machines deal with repetitive tasks more easily than humans.
The patterns found by the machines will never be perfect,
but given enough examples of appropriate quality and quantity, they will
be useful.
22. When to use machine learning
If the following conditions hold:
1) There is a pattern to be learned; a pattern between the questions
(inputs) and answers (output)
2) We cannot formulate the pattern mathematically
3) We have enough data (examples) for learning
(Abu-Mostafa: Learning from Data)
23. When not to use machine learning
To find out, for instance:
- The winning numbers of next week’s lottery (no pattern)
- The area of a triangle (can be formulated)
- The time of the next financial crisis (not enough data)
25. Learning game (Abu-Mostafa: Learning from Data)
Takeaways:
- There is no single solution, but there are many possible ones
- The number of examples seen during learning raises our
confidence in our solution
26. Key aspects to consider in a machine learning task
Data: What are the examples, and how do we get them?
Unit of observation: What is considered one example?
Observed features: What attributes do we store about an example?
Observed target variable: What is the attribute we want to be able to predict?
Outcome: What is the meaning of the predicted target variable?
Business case: How can we use the predictions?
27. Predict if an employee wants to quit
Data: Personal and work-related data from the HR database
Unit of observation: One employee
Observed features: Overtime, effectiveness, patterns in days off and sick days, commuting time, etc.
Observed target variable: Who quit in the past?
Outcome: What are the chances of someone quitting?
Business case: Prevent quitting by focused countermeasures, e.g. mentoring.
28. Predicting flight delays
Data: Air traffic data from airport systems
Unit of observation: A single flight from A to B
Observed features: Origin, destination, airline, day of year, weather
Observed target variable: Delay in minutes
Outcome: Prediction of punctuality
Business case: What is the expected loss on delays?
29. Biometric authentication with mouse dynamics
Data: Server logs about user sessions
Unit of observation: A single movement of the mouse cursor from A to B
Observed features: Length, straightness, speed
Observed target variable: The username of the user
Outcome: Anomaly level of a user session
Business case: Improved security with automatic alerts
30. Classify the mood of music
Data: 500 mp3 files
Unit of observation: A song in mp3
Observed features: ?
Observed target variable: Manually defined labels, either "cheerful" or "blue"
Outcome: ?
Business case: ?
31. How to represent in data table format
[Diagram: a data table whose header row names the features (observed
attributes) and the observed target variable; each further row is one
example (data point).]
34. Data preprocessing
A REPRODUCIBLE PROCESS turns raw data :( into data ready to be analyzed :).
Typical steps: parsing raw data, handling character encodings and date
formats, choosing data representations, joining data tables,
aggregations, pivoting.
37. Raw data of a session
record timestamp client timestamp button state x y
1434623080.316000 4053743.247000 NoButton Move 686 281
1434623080.419000 4053743.357000 NoButton Move 687 287
1434623080.615000 4053743.559000 Left Pressed 687 287
1434623080.745000 4053743.684000 Left Released 687 287
1434623081.557000 4053744.495000 NoButton Move 690 288
1434623081.667000 4053744.605000 NoButton Move 742 300
… (some 10k lines)
39. The target variable: what is the goal of the analysis?
[Diagram: the data table again, highlighting the observed target
variable column next to the feature columns; rows are the examples
(data points).]
40. The examples: what would be appropriate as an example?
[Diagram: the data table with the target column now labeled
"Made by user?", holding 1/0 values; rows are the examples
(data points).]
41. Gesture
A gesture: moving the cursor from one point to another in one go.
- Large enough to capture the mouse moving characteristics of a user,
- Small enough to have a lot of them to learn from.
A possible definition of a gesture:
We process the raw file from the beginning row-by-row. At each step,
if the time difference is larger than 0.3 sec, or a mouse button is
pressed, the current gesture ends and a new one starts.
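The gesture definition above can be sketched in code. This is a minimal sketch, assuming each event is simplified to a `(timestamp, button, state, x, y)` tuple (the raw rows shown earlier carry two timestamps; one is enough here), and the function name is illustrative:

```python
# Hypothetical sketch: split a stream of mouse events into gestures.
# A gesture ends when the time gap exceeds 0.3 s or a button is pressed.
def split_into_gestures(events, max_gap=0.3):
    gestures, current = [], []
    prev_ts = None
    for ts, button, state, x, y in events:
        gap_too_long = prev_ts is not None and ts - prev_ts > max_gap
        if current and (gap_too_long or state == "Pressed"):
            gestures.append(current)  # close the current gesture
            current = []
        current.append((ts, x, y))    # keep only what later steps need
        prev_ts = ts
    if current:
        gestures.append(current)
    return gestures
```

Each returned gesture is a list of `(timestamp, x, y)` points, ready for the feature engineering step that follows.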
43. The features: what are appropriate features of a gesture?
[Diagram: the data table with gestures as rows, feature columns still
to be defined, and the "Made by user?" target column.]
44. Feature engineering
What properties of gestures can be defined that might be useful in
differentiating between users?
[Diagram: a gesture as a sequence of points (ts0, x0, y0),
(ts1, x1, y1), ..., (tsn, xn, yn), ending in a click.]
45. Feature engineering
What properties of gestures can be defined that might be useful in
differentiating between users?
[Diagram: the same gesture as points (ts0, x0, y0) ... (tsn, xn, yn),
ending in a click.]
Duration: tsn - ts0
Path length: sum of distances between consecutive points
Avg. speed: path length / duration
Time to click: time spent between the last move and the click (if any)
Mean/std/etc. of (consecutive) speed/acceleration/etc. values
Also: angles
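The first three features listed above can be computed directly from a gesture's points. A hedged sketch (the function name is illustrative; a gesture is a list of `(timestamp, x, y)` points):

```python
import math

# Hypothetical sketch of the duration / path length / average speed
# features for one gesture.
def gesture_features(points):
    ts = [p[0] for p in points]
    duration = ts[-1] - ts[0]
    # path length: sum of distances between consecutive points
    path_length = sum(
        math.dist(points[i][1:], points[i + 1][1:])
        for i in range(len(points) - 1)
    )
    avg_speed = path_length / duration if duration > 0 else 0.0
    return {"duration": duration,
            "path_length": path_length,
            "avg_speed": avg_speed}
```

One such dictionary per gesture gives exactly the feature rows of the table on the next slide.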
46. The data set is now ready to be analyzed

Avg speed (pixel/sec)   Duration (sec)   ...   Made by user?
34.5                    5                ...   1
12.1                    3                ...   1
1.23                    12               ...   0
55.9                    3                ...   0

(Rows: gestures; columns: features (observed attributes) and the
observed target variable.)
48. Outline of predictive modeling
We have many observations about a certain event, process, etc. Each
observation pairs several features with a target variable.
With a learning algorithm and our data we aim to build a predictive
model that learns the typical value of the target for any
combination of feature values.
We can then use the model to predict the value of the target of
(new) observations solely based on their features.
49. Example: wine prices

Rain during harvest (mm)   Mean temperature in May (°C)   ...   Price (€)
18                         5                              ...   2.5
200                        4                              ...   16
180                        10                             ...   250
100                        2                              ...   9.5

(Rows: examples (data points); columns: features (observed attributes)
and the observed target variable.)
50. Example: the Titanic data set

sex      fare (£)   ...   survived
male     200        ...   1
female   40         ...   1
female   150        ...   1
male     40         ...   0

(Rows: examples (data points); columns: features (observed attributes)
and the observed target variable.)
51. Prediction problems
The two main types of prediction problems are:
- Classification: the target variable is a categorical variable (e.g.,
yes/no decision, letters to be recognized)
- Regression: the target variable is a continuous variable (e.g., age,
income, stock prices)
For both classification and regression, there are hundreds of learning
algorithms to choose from. Picking one is a problem in itself and
influences the success of the project.
52. Example of regression (1D)
Each blue point is an observation. We have to build a model that can tell
the income based on the age of the client.
[Scatter plot: monthly income (y) vs. age (x).]
53. Example of regression (1D)
The task translates to fitting a curve to the points that we see!
[Same scatter plot: monthly income (y) vs. age (x).]
54. Example of regression (1D)
1st solution: connecting the dots.
[Scatter plot with the points connected one to the next.]
55. Example of regression (1D)
2nd solution: draw a straight line through the points.
[Scatter plot with a straight line fitted through the points.]
58. Decision tree
We make predictions about the target (y) by answering
questions about the features (x1, ..., xn).
An answer to a question either leads to the next question or directly to a
prediction.
We store the series of decisions in a tree structure. The leaves contain the
predictions. Each node that is not a leaf contains a question.
59. Example: Titanic decision tree
Male?
  yes -> Age >= 10?
    yes -> dies
    no -> Family members on ship >= 3?
      yes -> dies
      no -> survives
  no -> survives
60. Building a tree (2D)
Let us build a decision tree to decide whether an article on a news portal will
be popular or not! We have two features: # of photos, # of paragraphs.
[Scatter plot: articles plotted by # photos and # paragraphs;
legend: popular / not popular.]
61. Building a tree (2D)
Cut the space based on # of paragraphs: # paragraphs > 10?
[Plot: the feature space cut at # paragraphs = 10.]
62. Building a tree (2D)
Cut the space based on # of paragraphs.
Tree so far:
# paragraphs > 10?
  yes -> not popular
[Plot: the # paragraphs > 10 region labeled "not popular".]
63. Building a tree (2D)
The rest is not homogeneous enough; we proceed with cutting.
[Plot: the # paragraphs <= 10 region still contains both popular and
not popular points.]
64. Building a tree (2D)
Cut the rest of the space based on # of photos: # photos > 6?
[Plot: the remaining region cut at # photos = 6.]
65. Building a tree (2D)
Cut the rest of the space based on # of photos.
Tree so far:
# paragraphs > 10?
  yes -> not popular
  no -> # photos > 6?
    yes -> popular
[Plot: the # photos > 6 region labeled "popular".]
66. Building a tree (2D)
The rest is not homogeneous enough; we proceed with cutting.
[Plot: the region with # paragraphs <= 10 and # photos <= 6 is still
mixed; the next cut will be at # paragraphs = 2.]
67. Building a tree (2D)
Cut again based on # of paragraphs: # paragraphs < 2?
[Plot: the remaining region cut at # paragraphs = 2.]
68. Building a tree (2D)
Cut again based on # of paragraphs.
Tree so far:
# paragraphs > 10?
  yes -> not popular
  no -> # photos > 6?
    yes -> popular
    no -> # paragraphs < 2?
      yes -> not popular
[Plot: the # paragraphs < 2 region labeled "not popular".]
69. Building a tree (2D)
The rest is homogeneous enough; we stop cutting the space.
The finished tree:
# paragraphs > 10?
  yes -> not popular
  no -> # photos > 6?
    yes -> popular
    no -> # paragraphs < 2?
      yes -> not popular
      no -> popular
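The finished article-popularity tree from the slides above translates directly into nested questions in code. A sketch (the function name is illustrative; the thresholds 10, 6 and 2 are the cuts made on the slides):

```python
# Hypothetical sketch: the decision tree from the slides as nested ifs.
def is_popular(n_paragraphs, n_photos):
    if n_paragraphs > 10:
        return False  # not popular
    if n_photos > 6:
        return True   # popular
    if n_paragraphs < 2:
        return False  # not popular
    return True       # popular
```

Following the ifs from top to bottom is exactly walking from the root of the tree down to a leaf.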
70. Using the tree for prediction
What will be the popularity of a new article according to the tree?
[Plot: a new article placed in the cut-up space; we answer the tree's
questions for it until we reach a leaf.]
71. What if we do not stop cutting the space?
Take this task as an example.
72. What if we do not stop cutting the space?
We cut the space to fully homogeneous areas.
73. What if we do not stop cutting the space?
You see that red area in the middle of a large blue one?
74. What if we do not stop cutting the space?
You see that red area in the middle of a large blue one? It is more like the
result of "getting lost in the details" than of seeing the true trend.
If a new point that is in fact blue accidentally falls there, it will be
classified as red.
75. Stopping criteria
A couple of rules for trees to prevent getting lost in the
details, i.e., growing too large and overfitting:
- They cannot have more levels than x,
- We do not cut areas with less than x points,
- We do not cut areas if one of the new areas would
have less than x points,
- We do not cut areas that are homogeneous enough
(as measured by entropy, Gini index etc.)
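The rules above can be sketched as a single check run before each cut. This is a minimal sketch with illustrative names; every numeric limit stands in for the "x" of the slide and is a placeholder, and homogeneity is approximated by the share of the majority class rather than entropy or the Gini index:

```python
MAX_DEPTH = 5             # "no more levels than x"
MIN_POINTS_TO_CUT = 10    # "do not cut areas with less than x points"
MIN_POINTS_PER_CHILD = 3  # "no new area with less than x points"
MAX_HOMOGENEITY = 0.95    # "do not cut areas homogeneous enough"

# Hypothetical sketch: decide whether to stop cutting at a node.
# `labels` are the class labels in the area; `split`, if given, is the
# pair of label lists the proposed cut would produce.
def should_stop(depth, labels, split=None):
    if depth >= MAX_DEPTH or len(labels) < MIN_POINTS_TO_CUT:
        return True
    majority = max(labels.count(c) for c in set(labels)) / len(labels)
    if majority >= MAX_HOMOGENEITY:  # homogeneous enough already
        return True
    if split is not None and min(len(s) for s in split) < MIN_POINTS_PER_CHILD:
        return True
    return False
```

Libraries expose the same ideas as hyperparameters (e.g. maximum depth, minimum samples per split or leaf).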
76. Random forests
In a forest there are several independent trees. Each tree grows seeing a
different random part of the whole data set.
When making a prediction, the trees of the forest vote on the answer.
[Diagram: many trees, each producing its own prediction for the same
"?" input.]
77. Aggregating the gesture-level predictions
After learning, the forest can predict whether a gesture was legal or not.

Avg. speed (pixel/sec)   Duration (sec)   ...   Made by user? (prediction)
34.5                     5                ...   1
12.1                     3                ...   1
1.23                     12               ...   0
55.9                     3                ...   0
78. Aggregating the gesture-level predictions
We need to make a decision about a whole session of a user!
For this, we aggregate (average) the predictions for the gestures in the
whole session:
- If the average is < 0.5, the session is regarded as illegal
- If the average is > 0.5, the session is regarded as legal

Avg. speed (pixel/sec)   Duration (sec)   ...   Made by user? (prediction)
34.5                     5                ...   1
12.1                     3                ...   1
1.23                     12               ...   0
55.9                     3                ...   0
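The session-level decision above can be sketched in a few lines. The function name is illustrative, the 0.5 threshold comes from the slide, and treating an exact tie as illegal is an assumption (the slide leaves the tie case open):

```python
# Hypothetical sketch: average the per-gesture 0/1 predictions of a
# session and compare against the 0.5 threshold.
def classify_session(gesture_predictions, threshold=0.5):
    avg = sum(gesture_predictions) / len(gesture_predictions)
    # ties fall on the cautious (illegal) side by assumption
    return "legal" if avg > threshold else "illegal"
```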
80. Overfitting is bad, what to do about it?
We are afraid of more complex models, but we need them!
How should we decide the amount of complexity that is JUST ENOUGH?
● A good model fits the known examples (obviously) but also fits unseen
examples
● That is the point: predicting the outcome of unseen examples is similar to
predicting examples from the future
● We can simulate having new examples by slicing the known dataset into two
parts:
○ Training dataset: examples only for training
○ Test dataset: examples only for measuring performance
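Slicing the known dataset into the two parts can be sketched with no library at all. The names, the 30% test ratio and the fixed seed are illustrative choices:

```python
import random

# Hypothetical sketch: shuffle the known examples, then slice off a
# test portion; the rest becomes the training set.
def train_test_split(examples, test_ratio=0.3, seed=42):
    rng = random.Random(seed)      # fixed seed -> reproducible split
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # train, test
```

Shuffling first matters: if the file is sorted (e.g. by user or by date), an unshuffled slice would give the model a biased view.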
82. Validation
[Diagram: the table of every known example, with its feature and
target columns, split into a training set and a test set.]
83. How is it done?
One can increase the complexity of the learning model as long as the
goodness of fit on the UNSEEN data increases.
● The goodness of fit on the training set will increase until full overfitting
● The goodness of fit on the test set will increase, but only to a certain
point
We can visualize this with the learning curve.
84. The learning curve
"The goodness of fit on the training set will increase until full overfitting."
That is, the error will decrease on the training set until full overfitting.
[Plot: error of model (y) vs. amount of complexity we allow (x); the
training-dataset curve keeps decreasing.]
85. The learning curve
"The goodness of fit on the test set will increase but just to a certain point."
That is, the error on the test set will decrease but just to a certain point.
[Plot: same axes; the unseen-test-dataset curve decreases, then rises
again while the training-dataset curve keeps decreasing.]
86. The learning curve
After the optimal point, every "bit of knowledge" the model gains is not
general, but data-specific knowledge about the particular training
dataset it sees.
[Plot: same axes; the optimal complexity lies where the
unseen-test-dataset error is lowest.]
87. Validation
There are some techniques (e.g., cross validation) which try to eliminate this loss of
information by selecting different parts of the known data as training sets and then
aggregating the results of these different scenarios.
We sacrifice some data (and potentially
information) but we gain objective,
measurable knowledge about how well
our model will perform “out there”.
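The cross-validation idea above can be sketched as generating k train/test index splits; each fold serves once as the test set while the others form the training set, and the k evaluation results are then averaged. A minimal sketch (the function name is illustrative; real implementations also shuffle and can stratify by class):

```python
# Hypothetical sketch of k-fold cross validation index splits.
def k_fold_indices(n_examples, k=5):
    # fold i holds examples i, i+k, i+2k, ...
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for test_fold in folds:
        train_idx = [j for fold in folds if fold is not test_fold
                     for j in fold]
        yield train_idx, test_fold
```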
89. Measuring performance
What do we mean exactly by “goodness of fit”?
We would like to have minor differences between the predictions and the
real value of the target attributes from the test data set.
If our problem is regression (the truth is a continuous variable):
● Add up the differences between the prediction and the truth over all
examples;
● The smaller the sum, the better our model.
● An exact match is rare, but a close guess is usable.
● E.g.: RMSE, the root of the mean squared error
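The RMSE mentioned above can be computed directly: square each per-example difference, average the squares, and take the root. A sketch (the function name is illustrative):

```python
import math

# Hypothetical sketch: root of the mean squared error.
# Smaller RMSE means a better regression fit.
def rmse(predictions, truths):
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predictions, truths))
        / len(truths)
    )
```

Squaring makes large misses count disproportionately more than many small ones, which is often the desired behavior.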
90. Measuring performance
What do we mean exactly by “goodness of fit”?
We would like to have minor differences between the predictions and the
real value of the target attributes from the test data set.
If our problem is classification:
● If the predicted class misses the true class, there is no magnitude of
error. Not correct is not correct. (There is no "slightly pregnant
woman".)
● Counting the rate of correct predictions seems like a good idea,
but it is not a great one.
91. What can a classifier model do?
Not so many things, considering two classes, namely: “positive” and “negative”:
Predicts “Positive” when the reality is “Positive”
Predicts “Positive” when the reality is “Negative”
Predicts “Negative” when the reality is “Negative”
Predicts “Negative” when the reality is “Positive”
92. Let's make a small table
If we rearrange the smiley faces into a table:

                Reality: +        Reality: -
Prediction +    True Positive     False Positive
Prediction -    False Negative    True Negative

The confusion matrix catches 'em all.
(The most important 2-by-2 matrix in machine learning.)
93. Let's make a small table
With a perfect model:

                Reality: +   Reality: -
Prediction +        5            0
Prediction -        0            5

● 5 positive and 5 negative cases in the dataset to be predicted.
● Every prediction is correct.
94. Let's make a small table
A more realistic scenario:

                Reality: +   Reality: -
Prediction +        4            1
Prediction -        1            4

● 5 positive and 5 negative cases in the dataset to be predicted.
● There is one misclassified case for each class.
95. Accuracy = the rate of the correctly classified cases.

                Reality: +   Reality: -
Prediction +      985            5
Prediction -        5            5

With the confusion matrix:
sum(blue cells, the correct predictions) / sum(all cells) = 990/1000 = 99%
Is this a good model?
Note that in a case like this, the model is likely to be
used to spot the NEGATIVE events. (Those are the
rare, interesting cases.)
This particular model performs awfully on
those cases: half of them are misclassified!
Remember the earlier comment on measuring
performance?
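Recomputing the imbalanced example above makes the trap explicit: 990 cases of the common class, 10 cases of the rare class the model is meant to spot, and half of the rare cases missed. A sketch (variable names are illustrative):

```python
# Hypothetical sketch: accuracy vs. rare-class recall on the
# imbalanced confusion matrix from the slide.
correct_common, total_common = 985, 990  # common class: 5 missed
correct_rare, total_rare = 5, 10         # rare class: half missed

accuracy = (correct_common + correct_rare) / (total_common + total_rare)
rare_recall = correct_rare / total_rare

# accuracy comes out at 99% while only half of the rare, interesting
# cases are caught, which is why accuracy alone can mislead.
```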
96. Some performance measures...
There are several other methods which use the values of the confusion matrix in
order to evaluate a classification model.
The method needs to be chosen carefully for the purpose of the application.
97. Many classifiers don't give strict verdicts
Though the target variable might be a discrete variable (orange/green), in practice
classifier models give class probabilities back (e.g., X% chance of
being green).
[Diagram: points placed along a 0%-100% probability axis.]
This means that one can decide which probability is high enough to predict a
particular label. If a song seems to be 95% "cheerful", it is a safer bet than
one which is 52% "cheerful".
Legend:
Color: the true class of a particular event, known from the test dataset
Position: the probability of being in the green class, as estimated by the model.
98. Many classifiers don't give strict verdicts
Though the target variable might be a discrete variable (orange/green), in practice
classifier models give class probabilities back (e.g., X% chance of
being green).
[Diagram: points placed along a 0%-100% probability axis.]
The good news: we have a much more detailed view of how the model works,
and of the amount of confidence it has in each prediction it makes.
The bad news: in order to retrieve discrete predictions, the user must decide how
to transform the probabilities into classes, i.e., define a probability threshold which
separates the classes.
99. Many classifiers don't give strict verdicts
Though the target variable might be a discrete variable (orange/green), in practice
classifier models give class probabilities back (e.g., X% chance of
being green).
[Diagram: the 0%-100% axis with all orange points on the left and all
green points on the right.]
Legend:
Color: the true class of a particular event, known from the test dataset
Position: the probability of being in the green class, as estimated by the model.
This is an amazing model! We can find a point in the middle which separates the
points into two groups that match the original two categories 100%.
100. Remember... we don't have perfect models :(
What should we do when our model outputs something more realistic, like this:
[Diagram: the 0%-100% axis; the two colors are separated at the sides
but mixed in the middle.]
Legend:
Color: the true class of a particular event, known from the test dataset
Position: the probability of being in the green class, as estimated by the model.
On the two sides, the picture is clear. But there are some borderline cases where
there is some confusion.
Two questions arise:
● How should we find a good threshold?
● How can we evaluate a model without a pre-defined threshold?
101. How should we find a good threshold?
Finding a threshold depends on the application and the problem domain itself, and
has little to do with machine learning.
- A threshold with a low false positive rate is needed before applying a risky treatment.
- A threshold with a low false negative rate is needed before a blood transfusion.
[Diagram: the 0%-100% probability axis with two candidate thresholds, A and B.]
- Towards the left-hand side, we classify every green correctly but misclassify a lot of
oranges as greens. This means a lot of false positives.
- Towards the right-hand side, we classify every orange correctly but misclassify a lot of
greens as oranges. This means a lot of false negatives.
102. How do we evaluate without a pre-defined threshold?
[Diagram: the 0%-100% axis, considering ALL thresholds between A and B.]
Every application has to deal with the
false positive vs. false negative trade-off,
and they deal with it differently.
Regardless of the application, we have to
be able to tell objectively whether one model
is better than another.
Why not compute the false positives
and false negatives for EVERY threshold,
and look at a particular model by
considering all these different scenarios?
103. ROC curve (Receiver Operating Characteristic)
Every point on the red curve shows the
corresponding rate of false positives and
true positives for a particular threshold.
The dotted line is a random model.
The further the red line is from the
dotted line, the better the model.
[Plot: "How many true positives at a threshold?" (y, 0%-100%) vs.
"How many false positives at a threshold?" (x, 0%-100%); the red ROC
curve bows above the dotted diagonal.]
104. ROC curve (Receiver Operating Characteristic)
AUC = "Area Under Curve"
The bigger the area under the red line,
the better the model.
The area under the dotted line: 0.5
The perfect model: 1.0
A perfect model is one where "you can decrease the false positive rate
to 0, and in that process, you don't generate any false negatives."
[Plot: same axes as on the previous slide.]
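AUC can also be computed without tracing the curve, via its rank interpretation: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counting half). A sketch (the function name is illustrative):

```python
# Hypothetical sketch: AUC via the rank interpretation.
def auc(probs, labels):
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    # count positive-vs-negative comparisons won; ties count half
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

A perfect separation gives 1.0, a model no better than chance gives about 0.5, matching the areas described above.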
105. Wrap up
- Data science as a field is big and diverse; machine learning is a key
tool to master
- Given enough examples, machines can learn
- Learning is more complex than memorizing
- A great effort is needed to prepare the examples (features and target)
- The bigger challenge is not fitting a model but avoiding overfitting
- Several key decisions have to be made after a model has been
constructed