INTRODUCTION TO MACHINE LEARNING
CHILD LEARNING
Child: Daddy, what is danger?
Dad: The possibility of suffering harm or injury.
Child: Daddy, what is an injury?
Dad: An instance of being injured.
Child: Daddy, what is an instance?
Dad: An example or single occurrence of something.
Child: Daddy, does it bother you that I'm asking so many questions?
Dad: Not at all. If you don't ask, you will never know.
CHILD LEARNING
Dad: Let me give you some examples…
CHILD LEARNING
Child: Now I understand, everything is dangerous.
Dad: No, there are things that aren't dangerous.
CHILD LEARNING
Child: And what are those?
CHILD LEARNING
And there is the most natural mode of learning:

Action                     Reaction             Lesson
Touching a hot stove       Aching hand          Do not touch again
Playing with toys          Fun                  Continue playing
Running into the road      Screaming parent     Don't run into roads
Running in the house       Fun                  Run in the house
Eating chocolate           Fun                  Search for chocolate
Eating too much chocolate  Stomach ache         Don't eat too much
Saying bla bla             No reaction          Try variations
Saying daddy               Overexcited parents  Do that again
SO, HOW DO CHILDREN LEARN?
1. From explanation
2. From examples
3. Reinforcement Learning
ABOUT US
Algorithms · Technology · Business
AGENDA
• What is Machine Learning
• Typical Machine Learning Tasks
• Supervised Learning
• Unsupervised Learning
• How to Get Started
• Summary
WHAT IS MACHINE LEARNING?
We say that a computer program learns a task if its performance on that
task improves as more experience is processed.
WHAT IS MACHINE LEARNING?
[Venn diagram: Machine Learning at the intersection of Statistics,
Databases & Big Data, Decision Theory, Artificial Intelligence, and
Optimization]
WHAT IS MACHINE LEARNING?
[The same Venn diagram; the overlap of Statistics, Databases & Big Data,
Decision Theory, Artificial Intelligence, and Optimization is labeled
Data Science]
AGENDA
• What is Machine Learning
• Typical Machine Learning Tasks
• Supervised Learning
• Unsupervised Learning
• How to Get Started
• Summary
TYPICAL MACHINE LEARNING TASKS
No two Machine Learning tasks are identical. Yet, we often use
the following categories:
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
SUPERVISED LEARNING
Estimate or predict an unknown result, given explicit values of some
explaining features.
The learning takes place over a history of observations for which both
the explaining features and the results are known.
Experience = supervised examples (exactly as in inferring what is
dangerous from examples).
We call the dataset that describes this experience the training set.
SUPERVISED LEARNING
Estimate or predict an unknown result, given explicit values of some
explaining features.
We call the dataset that describes this experience the training set.
When the unknown result is numeric, we call the task Regression.
When the unknown result is categorical, we call the task Classification.
SUPERVISED LEARNING
Example 1: What will be the annual spend of a new customer, given a set
of explaining features (e.g., demographics, first purchases, first
deposit etc.)?
Task qualifications: Prediction, Regression
Training set: a file in which each row represents a customer. For each
such customer we extract the explaining features at the prediction
point, as well as the annual spend (a year later).
SUPERVISED LEARNING
Example 2: What is the activity currently performed by a user
who is wearing a smart watch with inertial sensors?
Task qualifications: Assessment, Classification
Input: A set of sensor-based signals, along with an annotation of
the activity during each signal.
Requires a significant amount of pre-processing in order to
produce the training set.
SUPERVISED LEARNING
Task qualifications form two axes: Prediction vs. Assessment, and
Classification vs. Regression.
UNSUPERVISED LEARNING
Given a specific set of records, described by a given set of
features, either:
1. Extract interesting patterns that appear in the data
2. Provide insightful representation of the distribution of the
data
Experience: the more records we have, the more significant the patterns
we can extract, and the more accurate the representation.
UNSUPERVISED LEARNING
Example: Market Segmentation
Input data: Customers’ descriptions
Objective: Provide an insightful representation of the market
(what types of customers are there?)
Also known as cluster analysis
REINFORCEMENT LEARNING
Learning how best to react to situations through trial and error.
Simple Example: Multiple A/B testing
More Typical: Robot Navigation
Designing an RL system requires solving two difficult challenges:
• The exploration-exploitation dilemma
• Attributing delayed rewards
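The "multiple A/B testing" example above can be sketched as an epsilon-greedy multi-armed bandit. Everything below (the conversion rates, the epsilon value, the function name) is invented for illustration:

```python
import random

def epsilon_greedy(true_rates, steps=10_000, eps=0.1, seed=1):
    """Multi-armed bandit: with probability eps explore a random variant,
    otherwise exploit the variant with the best observed rate so far."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)   # how often each variant was shown
    wins = [0] * len(true_rates)     # how often it converted
    for _ in range(steps):
        if rng.random() < eps or not any(counts):
            arm = rng.randrange(len(true_rates))             # explore
        else:
            arm = max(range(len(true_rates)),                # exploit
                      key=lambda i: wins[i] / counts[i] if counts[i] else 0.0)
        counts[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]          # simulated reward
    return counts

# Hidden (made-up) conversion rates; the learner only observes the rewards.
# After enough trials, most traffic flows to the best variant.
print(epsilon_greedy([0.01, 0.10, 0.05]))
```

Note the trade-off the slide names: a larger eps explores more (learning faster which variant wins, at the cost of showing worse variants more often).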
UNSTRUCTURED INPUTS
The input data often come in an unstructured form, such as:
• Free text
• Speech
• Images
• Video
• Sensors
• Networks
AGENDA
• What is Machine Learning
• Typical Machine Learning Tasks
• Supervised Learning
• Unsupervised Learning
• How to Get Started
• Summary
SUPERVISED LEARNING
X1      X2      X3      ...  Xn-2      Xn-1      Xn      Y
x1,1    x2,1    x3,1    ...  xn-2,1    xn-1,1    xn,1    y1
x1,2    x2,2    x3,2    ...  xn-2,2    xn-1,2    xn,2    y2
...
x1,m-1  x2,m-1  x3,m-1  ...  xn-2,m-1  xn-1,m-1  xn,m-1  ym-1
x1,m    x2,m    x3,m    ...  xn-2,m    xn-1,m    xn,m    ym

Y = f(X1, X2, ..., Xn)
LEARNING THE CONCEPT OF A BIRD
An alien asks you: "What is a bird?"
You can try to define a bird, but the alien does not understand.
Why don't you give an example…
LEARNING THE CONCEPT OF A BIRD
Example #  Color  Can Fly?  Is Bird?
1          Black  Yes       Yes

What do you say about the following classification model:
"If Color = Black and Can_Fly = Yes then Bird, else Not_Bird"?
LEARNING THE CONCEPT OF A BIRD
Example #  Color  Can Fly?  Is Bird?
1          Black  Yes       Yes
2          Grey   Yes       Yes

What do you say about the following classification model:
"If Can_Fly = Yes then Bird, else Not_Bird"?
LEARNING THE CONCEPT OF A BIRD
Example #  Color  Can Fly?  Is Bird?
1          Black  Yes       Yes
2          Grey   Yes       Yes
3          Black  Yes       No

Supervised Learning means generalizing from given observations.
GENERALIZATION VS. SPECIFICATION
• A general concept is built on the explaining features. The right set
of explaining features is crucial for learning
• Being overly specific means memorizing rather than learning
• Being too general means being too coarse and missing some of the
details
• Finding the sweet spot between generalization and specificity is hard
GENERALIZATION VS. SPECIFICATION
Let us find a function that estimates Y = f(X).
[Three example fits: too general / too simple / underfitted; too
specific / too complex / overfitted; a nice solution to the trade-off]
OVERFITTING & UNDERFITTING
• We search for Y = f(X1, X2, ..., Xn)
• We know that in addition to the functional dependency (called bias),
the actual Y values are also affected by noise (called variance)
• We want the model to learn the bias, but not to be affected by the
variance
• A model that is too simple to learn the bias is called underfitted
• A model so complex that it adapts itself to the variance is called
overfitted
The more complexity you add to the model, the better you can fit it to
the training observations.
This is not always a good practice!
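The last point can be demonstrated with a tiny experiment: a Lagrange interpolating polynomial always achieves zero training error, however noisy the data. The sample points below are invented:

```python
def interpolate(points):
    """Lagrange interpolation: a polynomial that passes through every
    training point exactly, i.e. zero training error."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f

# Noisy samples of a roughly linear relationship (made-up numbers)
train = [(0, 0.2), (1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9)]
f = interpolate(train)

# Perfect on the training set -- but the degree-4 curve chases the noise,
# so there is no reason to expect it to generalize between the points.
assert all(abs(f(x) - y) < 1e-9 for x, y in train)
```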
A PARTIAL LIST OF SUPERVISED
LEARNING METHODS
• K-Nearest Neighbor
• SVM (Optimal Margin Linear Separation)
• Decision Trees
• Naïve Bayes
• Linear Regression
• Logistic Regression
• (Deep) Neural Networks
K-NN
[Scatter plot: Email Length vs. Recipients]
Given a new observation, find the K closest available observations and:
• In regression, use the average result of these K observations
• In classification, use voting amongst these K observations
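A minimal sketch of K-NN classification, assuming Euclidean distance and two made-up features in the spirit of the figure (email length, number of recipients):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training
    points. `train` is a list of ((x1, x2), label) pairs."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: (email length, recipients) -> spam or ham (all invented)
train = [((100, 1), "ham"), ((120, 2), "ham"), ((90, 1), "ham"),
         ((20, 50), "spam"), ((15, 60), "spam"), ((25, 40), "spam")]
print(knn_classify(train, (110, 1), k=3))   # -> ham
```

For regression, the same neighbor search would end with `sum(...) / k` over the neighbors' numeric results instead of a vote.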
K-NN
[The same scatter plot, with the K=3 nearest neighbors highlighted]
A few concerns:
• What should K be?
• Which distance measure should be used?
• Computational cost
LINEAR SEPARATORS
How would you classify this data?
[Scatter plot with axes X1, X2]
LINEAR SEPARATORS
[Scatter plot with axes X1, X2 and a maximum-margin separator]
In SVM we search for the linear separator that has the maximal margin.
Using a mathematical trick, called the Kernel Trick, SVMs can also find
non-linear separators.
DECISION TREES
Example: classify new customers into one of two groups:
Standard and VIP.
Training set: a list of customers that were once new, along with an
annotation that reflects whether these customers should have been
identified as VIP (this annotation is made only after some time).
Let us say that we have 1,000 VIPs and 4,000 Standard new
customers
DECISION TREES
Let us say that we have 1,000 VIPs and 4,000 Standard new
customers
Root node: 1,000 V, 4,000 S
DECISION TREES
The population is a mix of different types. What if we could find a
splitting criterion that creates two (or more) purer sub-populations?
Root node: 1,000 V, 4,000 S
DECISION TREES
The population is a mix of different types. What if we could find a
splitting criterion that creates two (or more) purer sub-populations?
Root node: 1,000 V, 4,000 S
    Self-Employed: 600 V, 800 S
    Employees: 400 V, 3,200 S
DECISION TREES
Now, we can take each sub-population and split it recursively,
until some stopping criteria are met.
Root node: 1,000 V, 4,000 S
    Self-Employed: 600 V, 800 S
    Employees: 400 V, 3,200 S
DECISION TREES
• Decision trees are the result of a recursive splitting mechanism
• Each split is chosen so as to maximize the purity of the
sub-populations that result from it
• There are a few ways to model node purity; often the concept of
minimal entropy (or a variation of it) is used
• Each split is made according to the values of one of the explaining
features
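The entropy-based purity criterion can be checked against the slide's own numbers. This sketch assumes Shannon entropy and weights each child node by its size:

```python
import math

def entropy(counts):
    """Shannon entropy of a class distribution, e.g. [1000, 4000]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

parent = entropy([1000, 4000])   # root node: 1,000 V, 4,000 S

# The Self-Employed (600 V, 800 S) / Employees (400 V, 3,200 S) split:
weighted = (1400 * entropy([600, 800]) + 3600 * entropy([400, 3200])) / 5000

# The split yields purer (lower-entropy) children on average,
# even though the Self-Employed node alone is less pure than the root.
assert weighted < parent
print(round(parent, 3), round(weighted, 3))
```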
LINEAR REGRESSION
[Scatter plot: House Price ($1000s) vs. Square Feet]
LINEAR REGRESSION
[The same scatter plot, with a fitted regression line]
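A least-squares line like the one in the figure can be fitted in closed form; the house-price points below are invented to match the axes:

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b   # intercept, slope

# Made-up points, roughly matching the figure's axes
sqft = [1000, 1500, 2000, 2500, 3000]
price = [150, 200, 280, 330, 400]   # $1000s
a, b = fit_line(sqft, price)
print(a, b)   # intercept ($1000s) and slope ($1000s per square foot)
```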
SUPERVISED LEARNING EVALUATION
Since Supervised Learning is all about generalization, a good
model is a model that can be applied successfully to new
observations
In classification tasks, we are often interested in the probability that
the model will predict the true outcome. This probability is called the
model's accuracy.
In regression tasks, we are often interested in the typical deviation
between the model's output and the true outcome. A common measure of
this deviation is the RMSE (root mean squared error).
SUPERVISED LEARNING EVALUATION
It is always possible to build an overfitted model, so the quality of
the model on the training set says very little about its capability to
generalize to new observations.
Therefore, never evaluate a model using the training set.
Instead:
• Use an independent (randomly selected) test set
• Use cross validation
SUPERVISED LEARNING EVALUATION
Confusion Matrix (rows: Actual, columns: Classified As)

        Blue  Red
Blue    7     1
Red     0     5

Accuracy (on the test set) = (7 + 5) / (7 + 5 + 1 + 0)
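The accuracy computation from the matrix, as a sketch (the dictionary layout is just one possible representation):

```python
# rows = actual class, columns = classified as (numbers from the slide)
matrix = {("Blue", "Blue"): 7, ("Blue", "Red"): 1,
          ("Red", "Blue"): 0, ("Red", "Red"): 5}

# Correct classifications sit on the diagonal of the confusion matrix
correct = sum(n for (actual, predicted), n in matrix.items()
              if actual == predicted)
accuracy = correct / sum(matrix.values())
print(accuracy)   # (7 + 5) / 13
```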
CROSS VALIDATION
Randomly break the training set into k mutually exclusive, collectively
exhaustive sets of similar size (often k = 10).
For i = 1, 2, ..., k:
    Train a model using all the sets except the i-th set.
    Evaluate the trained model on the i-th set.
You end up with k evaluation measures. Evaluate the entire
model as the average of these k results.
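The procedure above, sketched in Python. The `train_and_score` callable is a hypothetical stand-in for whatever training-and-evaluation routine you use:

```python
import random

def k_fold_indices(m, k=10, seed=0):
    """Split indices 0..m-1 into k roughly equal, mutually exclusive,
    collectively exhaustive folds."""
    idx = list(range(m))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, train_and_score, k=10):
    """Return the k per-fold scores; their average is the estimate."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i in range(k):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(train, test))
    return scores
```

For example, `sum(cross_validate(data, scorer, k=10)) / 10` gives the cross-validated estimate of the model's quality.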
SUPERVISED LEARNING SUMMARY
• Two sub-problems: classification and regression
• Supervised Learning is all about generalizing from a given training
set
• There is an inherent, hard-to-solve trade-off between generalization
and over-specification
• The more complexity you add to your model, the better it can fit the
training set; you may end up with an overfitted model
• Therefore, you never evaluate a model on the training set that was
used to induce it
• Instead, use either an independent test set or cross validation
SUPERVISED LEARNING SUMMARY
• We also became familiar with four SL methods: K-NN, SVM, decision
trees, and linear regression
AGENDA
• What is Machine Learning
• Typical Machine Learning Tasks
• Supervised Learning
• Unsupervised Learning
• How to Get Started
• Summary
UNSUPERVISED LEARNING
X1      X2      X3      ...  Xn-2      Xn-1      Xn
x1,1    x2,1    x3,1    ...  xn-2,1    xn-1,1    xn,1
x1,2    x2,2    x3,2    ...  xn-2,2    xn-1,2    xn,2
...
x1,m-1  x2,m-1  x3,m-1  ...  xn-2,m-1  xn-1,m-1  xn,m-1
x1,m    x2,m    x3,m    ...  xn-2,m    xn-1,m    xn,m
Extract interesting patterns from the input set or
Provide an insightful representation of the input space
UNSUPERVISED LEARNING
Unsupervised Learning tasks:
• Cluster Analysis
• Association Rules Mining
• Hidden Markov Models
• Dimensionality Reduction
• Self-Organising Maps
CLUSTER ANALYSIS
Data points that share a cluster need to be similar.
Data points in different clusters need to be different.
Similarity = low distance, difference = high distance?
CLUSTER ANALYSIS
K-Means:
Initialize: place k cluster centroids in the feature space.
Repeat until some stopping criteria are met:
    Associate each data point with the closest centroid.
    Move each centroid to the center of the points associated with it.
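The K-Means loop above, as a minimal sketch (the toy points and starting centroids are invented):

```python
import math

def k_means(points, centroids, iters=10):
    """Repeat: assign each point to its nearest centroid, then move each
    centroid to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups near (0, 0) and (9, 9)
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (5, 5)])
print(centroids)
```

A fixed iteration count stands in for the "stopping criteria" of the slide; a real implementation would typically stop when the assignments no longer change.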
CLUSTER ANALYSIS
Does distance mean similarity?
Which distance?
CLUSTER ANALYSIS
Does distance mean similarity?
Which distance?
For example, let us look at similarity in monthly salary.
Mr. X earns $2,500 a month.
Mrs. Y earns $250,000 a month.
Mr. Z earns $100,000 a month. Is he more similar (in terms of salary) to
X or to Y?
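The salary question illustrates why the choice of distance matters. Taking Z's salary as $100,000, the raw values put Z closer to X, while a log (ratio) scale puts Z much closer to Y:

```python
import math

# Monthly salaries from the example (Z's figure taken as $100,000)
x, y, z = 2_500, 250_000, 100_000

# Raw distance: Z is closer to X (97,500 vs. 150,000)...
assert abs(z - x) < abs(z - y)

# ...but on a log (ratio) scale, Z is much closer to Y:
# Z earns 40x what X earns, while Y earns only 2.5x what Z earns.
assert abs(math.log(z / x)) > abs(math.log(z / y))
```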
CLUSTER ANALYSIS
Does distance mean similarity?
Which distance?
How should we compute a multi-dimensional distance?

Player Name          Height  Position  Age  Plays in  Goals this year  Annual Wages  Country of Birth
Lionel Andrés Messi  169 cm  Forward   30   Spain     41               36M EUR       Argentina
Cristiano Ronaldo    185 cm  Forward   31   Spain     27               17M EUR       Portugal
AGENDA
• What is Machine Learning
• Typical Machine Learning Tasks
• Supervised Learning
• Unsupervised Learning
• How to Get Started
• Summary
HOW TO GET STARTED
• Maintaining and manipulating more and more data becomes more and more
affordable
• Machine Learning offers a very rich set of boxes. Selecting the right
boxes and building a business solution requires lots of experience
• Training the right models, tuning the parameters, evaluating
performance, and implementation all require some level of expertise, but
these should not be your first concerns
• The prediction is not in the box
HOW TO GET STARTED
[CRISP-DM cycle: Business Definition → Machine Learning → Implement →
Business Value]
HOW TO GET STARTED
A recommended checklist, before you even start:
1. What am I trying to achieve, business-wise?
2. What data does it require? Do I have this data? Am I allowed to use
it?
3. What will be the output of a machine learning model?
4. Can my operations use that output? How?
5. What machine learning task am I trying to solve?
6. What are the success criteria?
7. Who will be the ones to run the project?
8. How long will it take? How much will it cost?
AGENDA
• What is Machine Learning
• Typical Machine Learning Tasks
• Supervised Learning
• Unsupervised Learning
• How to Get Started
• Summary
SUMMARY
• Machine learning = designing machines that learn from
experience
• Three typical tasks:
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
• Supervised Learning:
• Learning means generalization
• Generalization vs. Specification, overfitting and underfitting
• Classification vs. Regression
SUMMARY
• Supervised Learning algorithms:
• K-NN
• SVM
• Decision Trees
• Linear Regression
• More
• Unsupervised Learning
• Cluster analysis: similarity and distance
• Association rules
• Reinforcement Learning
• The big data challenge of Machine Learning
• CRISP-DM