Chapter Two
Supervised learning
Shumet Tadesse Nigatu
Department of Computer Science
College of Informatics
University of Gondar
February 2024
Outline
1 Introduction
2 Regression
3 Classification
4 k-Nearest Neighbors (KNN)
5 Logistic Regression
6 Decision Tree
Introduction
Definition
Supervised learning is a type of machine learning paradigm in which an
algorithm is trained on a labeled dataset.
What is Labeled data?
Labeled data consists of input-output pairs, where the input is the data
the algorithm processes, and the output is the corresponding desired or
target output.
Goal
The goal of supervised learning is for the algorithm to learn a mapping
from inputs to outputs, making predictions or classifications on new,
unseen data.
Examples of Supervised learning
Input (X) | Output (Y) | Application
email | spam? (0/1) | spam filtering
audio | text transcripts | speech recognition
English | Amharic | machine translation
ad, user information | click? (0/1) | online advertising
image of phone | defect? (0/1) | visual inspection
Task
Each record is characterized by a tuple (X,y), where X is the attribute
set and y is the class label.
− X: attribute, predictor, independent variable, input
− y: class, response, dependent variable, output
Learn a model that maps each attribute set X into one of the
predefined class labels y
− Find a model for class attribute as a function of the values of other
attributes.
How does supervised learning work?
Supervised learning involves two steps: model construction and model
usage
Model construction
Pairs of input objects (tuples) and desired output values are used to
train the model
The set of tuples used for model construction is the training set
The model can be represented as classification rules, decision trees, or
mathematical formulae.
How does supervised learning work?
Model Usage
For predicting future or unknown objects
The known label of a test sample is compared with the result predicted
by the model
Accuracy rate is the percentage of test set samples that are correctly
predicted by the model
Test set is independent of training set
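To make the two steps concrete, here is a minimal Python sketch of the construct-then-use workflow. It assumes scikit-learn is available; the dataset and the classifier are placeholders chosen purely for illustration, not part of the lecture material.

# Minimal sketch of the two-step supervised workflow (assumes scikit-learn;
# dataset and classifier are illustrative placeholders).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # labeled data: (attribute set X, class label y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # model construction (training set)
y_pred = model.predict(X_test)                           # model usage (test set)
print("Accuracy rate:", accuracy_score(y_test, y_pred))  # fraction of test samples predicted correctly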
What is Regression?
Regression is a statistical method used in machine learning and
statistics to model the relationship between a dependent variable (or
target) and one or more independent variables (or features).
The goal of regression analysis is to understand and quantify the
relationship between variables, make predictions, and identify patterns
in the data.
Regression analysis can be used to make predictions and, under suitable
assumptions, to model causal relationships
There are different types of regression models, but the most common
ones are linear regression and logistic regression.
− Linear Regression: Predict a continuous dependent variable based on
one or more independent variables.
− Logistic Regression: Predict the probability that an instance belongs
to a particular category (binary classification).
Linear regression
Linear regression, a staple of classical statistical modeling, is one of
the simplest algorithms for doing supervised learning.
It serves as a good starting point for more advanced approaches
It is important to have a good understanding of linear regression
before studying more complex learning methods.
Simple Linear Regression Model
A linear regression algorithm models a linear relationship between a
dependent variable (y) and one or more independent variables (X); in
simple linear regression there is a single independent variable
The relationship between the variables is described by a linear function.
A change in the independent variable is associated with a corresponding
change in the dependent variable.
Linear Regression
The linear regression model provides a sloped straight line
representing the relationship between the variables.
Mathematically, we can represent a linear regression as: y = a0 + a1x + ε
Where,
y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
Linear Regression Function
y = a0 + a1x
Slope of the regression line: a1 = r · (Sy / Sx), where
r = Σ((x − x̄)(y − ȳ)) / √( Σ(x − x̄)² · Σ(y − ȳ)² )
Sy = √( Σ(y − ȳ)² / (n − 1) ), Sx = √( Σ(x − x̄)² / (n − 1) )
Y-intercept of the regression line: a0 = ȳ − a1x̄
Simple Linear Regression by Hand-Method 1
There are just a handful of steps in linear regression.
1 Calculate average(mean) of your X variable
2 Calculate the difference between each X and the average X
3 Square the differences and add it all up
4 Calculate average(mean) of your Y variable
5 Calculate the difference between each Y and the average Y
6 Square the differences and add it all up
7 Multiply the differences (of X and Y from their respective averages)
and add them all together
8 Calculate r
9 Calculate a1 using r, Sy, and Sx
10 Calculate a0 using ȳ, a1, and x̄
Simple Linear Regression by Hand - Method 2
Step 1: Calculate X·Y, X², and Y²
Step 2: Calculate ΣX, ΣY, ΣX·Y, ΣX², and ΣY²
Step 3: Calculate a0: The formula to calculate a0 is:
a0 = ( (ΣY)(ΣX²) − (ΣX)(ΣXY) ) / ( n(ΣX²) − (ΣX)² )
Step 4: Calculate a1: The formula to calculate a1 is:
a1 = ( n(ΣXY) − (ΣX)(ΣY) ) / ( n(ΣX²) − (ΣX)² )
Step 5: Place a0 and a1 in the estimated linear regression equation.
The estimated linear regression equation is: ŷ = a0 + a1·x
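As an illustration, Method 2 translates directly into code. Below is a minimal NumPy sketch; the function name and the toy data are illustrative, not part of the lecture.

import numpy as np

def simple_linear_regression(x, y):
    # Method 2: a0 and a1 computed directly from the raw sums.
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sx, sy = x.sum(), y.sum()
    sxy, sxx = (x * y).sum(), (x * x).sum()
    a1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a0 = (sy * sxx - sx * sxy) / (n * sxx - sx ** 2)
    return a0, a1  # estimated equation: y_hat = a0 + a1 * x

# Illustrative usage on toy data with an exact fit y = 2x + 1:
a0, a1 = simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])   # a0 = 1.0, a1 = 2.0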
Example
For this example, we use the salary.csv dataset. The dataset contains the
following two variables for 30 employees.
Years of Experience
Salary
We want to create a simple linear regression model using years of
experience as the explanatory variable and salary as the response variable.
How to Make Predictions with Linear
Regression?
Linear regression is a method we can use to quantify the relationship
between one or more predictor variables and a response variable.
One of the most common reasons for fitting a regression model is to
use the model to predict the values of new observations.
Steps to make predictions with a regression model
1 Collect the data.
2 Fit a regression model to the data.
3 Verify that the model fits the data well.
4 Use the fitted regression equation to predict the values of new
observations.
Exercise 1
The following dataset is about 12 months marketing budget and sales.
Create a simple linear regression model using Spend as the explanatory
variable and sales as the response variable.
Month Spend Sales
1 1000 9914
2 4000 40487
3 5000 54324
4 4500 50044
5 3000 34719
6 4000 42551
7 9000 94871
8 11000 118914
9 15000 158484
10 12000 131348
11 7000 78504
12 3000 36284
Exercise 2
The following dataset consists of students' study hours and corresponding
academic scores. Create a linear regression model to predict
marks (scores) based on the time spent studying.
Multiple Linear Regression
Multiple linear regression is used to model the relationship between
multiple independent variables (also called predictors or features) and
a single dependent variable.
In contrast to simple linear regression, which involves only one
independent variable, multiple linear regression considers two or more
predictors.
The general form of a multiple linear regression model with ’p’
predictors is given by the equation:
Y = β0 + β1X1 + β2X2 + . . . + βpXp + ε
Here:
− Y is the dependent variable (the variable you are trying to predict).
− β0 is the intercept term (the value of Y when all predictors are zero).
− β1, β2, . . . , βp are the coefficients, representing the change in Y for a
one-unit change in the corresponding predictor.
− X1, X2, . . . , Xp are the independent variables (predictors).
− ε is the error term, representing the unobserved factors that affect Y
but are not included in the model.
Multiple Linear Regression by Hand
Suppose we have a dataset with one response variable y and two
predictor variables X1 and X2
Use the following steps to fit a multiple linear regression model
− Step 1: Calculate X1², X2², X1y, X2y, and X1X2
− Step 2: Calculate Regression Sums. Next, make the following
regression sum calculations:
Σx1² = ΣX1² − (ΣX1)² / n
Σx2² = ΣX2² − (ΣX2)² / n
Σx1y = ΣX1y − (ΣX1)(Σy) / n
Σx2y = ΣX2y − (ΣX2)(Σy) / n
Σx1x2 = ΣX1X2 − (ΣX1)(ΣX2) / n
Multiple Linear Regression by Hand
Step 3: Calculate b0, b1, and b2. The formula to calculate b1 is:
b1 = ( (Σx2²)(Σx1y) − (Σx1x2)(Σx2y) ) / ( (Σx1²)(Σx2²) − (Σx1x2)² )
The formula to calculate b2 is:
b2 = ( (Σx1²)(Σx2y) − (Σx1x2)(Σx1y) ) / ( (Σx1²)(Σx2²) − (Σx1x2)² )
The formula to calculate b0 is:
b0 = ȳ − b1x̄1 − b2x̄2
Step 4: Place b0, b1, and b2 in the estimated linear regression
equation.
The estimated linear regression equation is: ŷ = b0 + b1·x1 + b2·x2
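A minimal NumPy sketch of the two-predictor case, mirroring the regression sums above rather than calling a library fitting routine; the function name and structure are illustrative.

import numpy as np

def two_predictor_regression(x1, x2, y):
    x1, x2, y = (np.asarray(v, float) for v in (x1, x2, y))
    n = len(y)
    # Step 2: regression sums, written exactly as in the slides.
    s11 = (x1 * x1).sum() - x1.sum() ** 2 / n        # sum x1^2
    s22 = (x2 * x2).sum() - x2.sum() ** 2 / n        # sum x2^2
    s1y = (x1 * y).sum() - x1.sum() * y.sum() / n    # sum x1*y
    s2y = (x2 * y).sum() - x2.sum() * y.sum() / n    # sum x2*y
    s12 = (x1 * x2).sum() - x1.sum() * x2.sum() / n  # sum x1*x2
    # Step 3: coefficients.
    den = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / den
    b2 = (s11 * s2y - s12 * s1y) / den
    b0 = y.mean() - b1 * x1.mean() - b2 * x2.mean()
    return b0, b1, b2  # estimated equation: y_hat = b0 + b1*x1 + b2*x2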
Exercise 3
The following dataset consists of employees' age, experience, and income.
Create a linear regression model to predict income based on age and
experience.
Check Your Progress
1 If the intercept (a0) in a simple linear regression model is 5, what
does this value represent in the context of the data?
2 What is the fundamental difference between simple linear regression
and multiple linear regression?
A The number of predictors
B The complexity of the model
C The type of response variable
D The presence of interactions
3 Why might one prefer multiple linear regression over simple linear
regression when modeling relationships between variables?
Polynomial Regression
Polynomial regression is a form of linear regression in which the
relationship between the independent variable x and the dependent
variable y is modelled as an nth-degree polynomial
In other words, instead of fitting a straight line (as in simple linear
regression) or a plane (as in multiple linear regression), polynomial
regression fits a curve to the data.
The general equation for polynomial regression of degree n is:
Y = a0 + a1·X + a2·X² + . . . + an·Xⁿ + ε
Here:
Y is the dependent variable.
X is the independent variable.
a0, a1, . . . , an are the coefficients of the polynomial terms.
ε is the error term.
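As a short sketch, the degree-n fit above can be done with NumPy's least-squares polynomial fit. Note that np.polyfit returns coefficients with the highest degree first, so they are reversed here to read as a0, a1, a2; the data are illustrative.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])     # illustrative predictor
y = np.array([2.1, 4.8, 9.7, 17.0, 26.2, 37.1])  # roughly quadratic response

coeffs = np.polyfit(x, y, deg=2)[::-1]  # least-squares fit, reversed -> [a0, a1, a2]
y_hat = sum(a * x ** i for i, a in enumerate(coeffs))  # Y = a0 + a1*X + a2*X^2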
Why do we use polynomial regression?
Polynomial regression is used when the relationship between the
independent variable(s) and the dependent variable is not linear but
can be better represented by a polynomial curve.
Here are some common scenarios where polynomial regression can be
beneficial:
1 Non-linear Relationships: When the relationship between the
variables is curvilinear or follows a pattern that cannot be captured by
a straight line.
2 Higher Order Patterns: Some relationships may exhibit higher-order
patterns, such as quadratic, cubic, or higher-degree behavior.
Polynomial regression can capture these patterns by including terms
like X², X³, etc.
3 Flexibility in Modeling: Polynomial regression provides a flexible
framework to model a wide range of relationships. By adjusting the
degree of the polynomial, you can fine-tune the model to better
represent the characteristics of the data.
When to Use Polynomial Regression?
We use polynomial regression when the relationship between a
predictor and response variable is nonlinear.
The easiest way to detect a nonlinear relationship is to create a
scatterplot of the response vs. predictor variable.
Check Your Progress
1 How does polynomial regression differ from linear regression?
A Polynomial regression uses a nonlinear relationship
B Polynomial regression can handle multiple predictors
C Polynomial regression is always more accurate
D Polynomial regression assumes homoscedasticity
2 Provide examples of real-world scenarios where polynomial regression
might outperform linear regression.
The Problem of Overfitting
Overfitting is an undesirable machine learning behavior that occurs
when the machine learning model gives accurate predictions for
training data but not for new data.
Overfitting is a common issue in machine learning where a model
learns not only the underlying patterns in the training data but also
captures noise and random fluctuations that are specific to that data.
In other words, an overfit model performs well on the training data
but fails to generalize effectively to new, unseen data.
The Problem of Overfitting
Underfitting
A model is said to underfit when it is too simple to capture the
complexities of the data
It represents the inability of the model to learn the training data
effectively, resulting in poor performance on both the training and
testing data
− In simple terms, an underfit model's predictions are inaccurate,
especially when applied to new, unseen examples
It mainly happens when we use a very simple model with overly
simplified assumptions
To address underfitting, we need to use more complex models, enhanced
feature representation, and less regularization
The Problem of Overfitting
Reasons for Underfitting
The model is too simple, so it may not be capable of representing the
complexities in the data
The input features used to train the model are not adequate
representations of the underlying factors influencing the target
variable.
The size of the training dataset used is not enough.
Features are not scaled.
Techniques to Reduce Underfitting
Increase model complexity.
Increase the number of features, performing feature engineering.
Remove noise from the data.
Increase the number of epochs or increase the duration of training to
get better results.
Overfitting
A model is said to be overfitted when it does not make accurate
predictions on testing data.
When a model fits the training data too closely, it starts learning from
the noise and inaccurate data entries in the data set.
− Testing on the test data then shows high variance.
− The model fails to categorize the data correctly because of too many
details and noise.
Overfitting is often caused by non-parametric and non-linear methods:
these types of machine learning algorithms have more freedom in
building the model from the dataset and can therefore build unrealistic
models.
In a nutshell, overfitting is a problem where the performance of a
machine learning algorithm on training data differs from its
performance on unseen data.
Addressing Overfitting
Question
Our goal when creating a model is to be able to use the model to predict
outcomes correctly for new examples. A model which does this is said to
generalize well.
When a model fits the training data well but does not work well with new
examples that are not in the training set, this is an example of —.
Reducing the problem of overfitting in machine learning involves
applying various techniques to ensure that the model generalizes well
to new, unseen data.
Here are some common techniques to address overfitting:
− Collect more training examples
− Select features to include/exclude
− Regularization
Regularization
Regularization is a technique used in machine learning to prevent
overfitting and improve the generalization performance of models
Reduce size of parameters
− effectively reducing the size of the parameter vector by driving some of
its components to zero
In regularization, the coefficient values are shrunk toward zero.
The regularization approach reduces the magnitude of the coefficients of
the independent variables while maintaining the same number of variables.
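As one concrete example of the idea, ridge (L2) regularization adds a penalty λ·Σaj² to the least-squares objective, which shrinks the coefficients toward zero. The closed-form sketch below assumes NumPy; it is illustrative and, for simplicity, also penalizes the intercept, which practical implementations usually avoid.

import numpy as np

def ridge_fit(X, y, lam=1.0):
    # Closed-form ridge regression: a = (X^T X + lam * I)^(-1) X^T y.
    # Larger lam shrinks the coefficient vector more strongly toward zero.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)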
Check Your Progress
1 What is underfitting in the context of machine learning models?
A Model fits the training data too closely.
B Model generalizes well to new, unseen data.
C Model fails to capture the underlying patterns in the data.
D Model exhibits high training accuracy.
2 Which of the following scenarios is indicative of overfitting?
A Low training error and low testing error.
B High training error and low testing error.
C Low training error and high testing error.
D High training error and high testing error.
Metrics used to evaluate regression
In regression problems, the prediction error is used to define the
model performance.
− The prediction error is also referred to as residuals and it is defined as
the difference between the actual and predicted values.
The regression model tries to fit a line that produces the smallest
difference between predicted and actual(measured) values.
Residual = actual value − predicted value
error(e) = y − ŷ
Metrics used to evaluate regression
There are several evaluation metrics commonly used to assess the
performance of regression models.
Mean Absolute Error (MAE)
It is the average of the absolute differences between the actual and
predicted values. MAE = (1/n) Σi |yi − ŷi|
Mean Squared Error (MSE)
It is the average of the squared differences between the actual and the
predicted values. MSE = (1/n) Σi (yi − ŷi)²
Root Mean Squared Error (RMSE)
It is the square root of the average squared difference between the
actual and the predicted values. RMSE = √( (1/n) Σi (yi − ŷi)² )
By taking the square root of MSE, we get the Root Mean Squared Error.
Metrics used to evaluate regression
R-squared (Coefficient of Determination)
R-squared explains to what extent the variance of one variable
explains the variance of the second variable.
In other words, it measures the proportion of variance of the
dependent variable explained by the independent variable.
R squared is a popular metric for identifying model accuracy.
− A larger R squared value indicates a better fit.
R² = 1 − ( Σi (yi − ŷi)² ) / ( Σi (yi − ȳ)² )
Adjusted R-Square
Adjusted R2 is the same as standard R2 except that it penalizes
models when additional features are added.
It measures the variation explained by only the independent variables
that actually affect the dependent variable.
Adjusted R² = 1 − ( (1 − R²)(n − 1) ) / (n − k − 1), where k is the
number of independent variables
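These metrics translate almost line-for-line into code. The sketch below assumes NumPy and is written to match the formulas above; the function name is illustrative. The same function can be used to check the exercise that follows.

import numpy as np

def regression_metrics(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    e = y - y_hat                                 # residuals
    mae = np.mean(np.abs(e))                      # Mean Absolute Error
    mse = np.mean(e ** 2)                         # Mean Squared Error
    rmse = np.sqrt(mse)                           # Root Mean Squared Error
    r2 = 1 - np.sum(e ** 2) / np.sum((y - np.mean(y)) ** 2)  # R-squared
    return mae, mse, rmse, r2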
Exercise
Suppose you have a dataset with 10 observations and two variables, X
(independent variable) and Y (dependent variable). You have built a linear
regression model and obtained the following predictions (Ŷ ) and actual
values (Y):
Y=[10,15,20,25,30,35,40,45,50,55] Ŷ =[12,18,22,28,33,37,41,47,51,56]
Compute the following regression evaluation metrics:
1 MAE
2 MSE
3 RMSE
4 R2
What is the primary characteristic of MAE?
A Emphasizes larger errors
B Sensitive to outliers
C Uses squared differences
D Provides a percentage measure
What is Classification?
Classification is a machine learning technique used to predict group
membership for data instances.
Given a collection of records (training set), each record contains a
set of attributes, one of the attributes is the class.
− Each record is characterized by a tuple (X,y), where X is the attribute
set and y is the class label.
X: attribute, predictor, independent variable, input
y: class, response, dependent variable, output
Task: Learn a model that maps each attribute set X into one of the
predefined class labels y
− Find a model for class attribute as a function of the values of other
attributes.
Goal: previously unseen records should be assigned a class as
accurately as possible.
− A test set is used to determine the accuracy of the model.
− Usually, the given data set is divided into training and test sets, with
training set used to build the model and test set used to validate it.
For example, one may use classification to predict whether the
weather on a particular day will be “sunny”, “rainy” or “cloudy”.
Examples of Classification Tasks
Email Spam Detection
− Given an email, predict whether it is spam or not
Handwritten Digit Recognition
− Given an image of a handwritten digit, classify it into one of the digits
from 0 to 9.
Disease Diagnosis
− Based on medical test results and patient information, predict whether
a patient has a specific disease (e.g., diabetes, cancer).
Sentiment Analysis
− Analyze a piece of text (e.g., a product review) and classify it as
positive, negative, or neutral sentiment.
Image Recognition
− Classify objects in an image into predefined categories (e.g., cat, dog,
car).
Credit Scoring
− Determine whether a person is likely to default on a loan based on their
credit history, income, and other relevant factors.
Customer Churn Prediction
− Predict whether a customer is likely to churn (stop using) a service
based on their usage patterns and demographics.
Classification Techniques
There are various classification methods. Popular classification techniques
include the following.
K-nearest neighbor: KNN is a non-parametric, lazy learning
algorithm where an instance is classified by the majority class of its
k-nearest neighbors.
Logistic Regression: Despite its name, logistic regression is used for
binary classification problems. It models the probability of an instance
belonging to a particular class.
Decision tree classifier: divides the decision space into
piecewise-constant regions.
Random Forest: Random Forest is an ensemble learning method
that constructs multiple decision trees during training and outputs the
class that is the mode of the classes from individual trees.
Support Vector Machines (SVM): SVM is a linear model that creates a
line or a hyperplane separating the data into classes.
Neural networks: partition by non-linear boundaries
Bayesian network: a probabilistic model
Check Your Progress
Question 1: What is the primary goal of a classification algorithm?
A Minimize computational complexity
B Predict continuous values
C Predict categorical labels for new instances
D Maximize feature dimensionality
Question 2: What is the primary goal of a classification algorithm in
handwritten digit recognition?
A Predicting colors of digits
B Identifying shapes of digits
C Predicting continuous values
D Recognizing the digit class (0-9)
What is KNN
K nearest neighbors is a simple algorithm used for both classification
and regression.
− It belongs to the family of lazy learning algorithms, as it doesn’t build
an explicit model during training.
It basically stores all available cases to classify the new cases by a
majority vote of its k neighbors.
− It memorizes the training instances and makes predictions based on the
majority class (for classification) or the average (for regression) of the
k-nearest neighbors in the feature space.
A case is assigned to the class most common among its K nearest
neighbors, as measured by a distance function (Euclidean, Manhattan, ...).
If K = 1, then the case is simply assigned to the class of its nearest
neighbor.
KNN
Steps involved in the KNN algorithm
1 Define the Value of K
2 Select the Distance Metric
3 Prepare the Data
4 Split the Dataset
5 Calculate Distances
6 Majority Voting (for Classification) or Weighted Average (for
Regression)
7 Make Predictions
8 Evaluate the Model
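A minimal sketch of these steps for classification, assuming NumPy; it uses Euclidean distance and unweighted majority voting, and the function name and data layout are illustrative.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    X_train = np.asarray(X_train, float)
    # Step 5: Euclidean distance from the new case to every training case.
    d = np.sqrt(((X_train - np.asarray(x_new, float)) ** 2).sum(axis=1))
    # Step 6: majority vote among the k nearest neighbors.
    nearest = np.argsort(d)[:k]
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]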
KNN
Example: Assume we have a dataset which can be plotted as follows.
Next, we need to classify a new data point (the black dot) into the blue
or red class. We assume K = 3, i.e., the algorithm finds the three
nearest data points.
The diagram shows the three nearest neighbors of the black dot.
Among those three, two lie in the red class; hence the black dot is also
assigned to the red class.
KNN:Example
Let L = {age = 7, sex = F}; what is the class label (recreation) of L?
Determine L's recreation class based on the following data records.
ED = √( (x1 − x2)² + (y1 − y2)² ), because we have only 2 attributes
We represent attribute values numerically: let female = 0 and male = 1
k = 3
KNN: Example
Find the ED of each and every person with L.
A = (56 − 7)² + (0 − 0)² = 2401, √2401 = 49
B = (34 − 7)² + (1 − 0)² = 730, √730 ≈ 27.02
C = (25 − 7)² + (0 − 0)² = 324, √324 = 18
D = (40 − 7)² + (1 − 0)² = 1090, √1090 ≈ 33.02
E = (35 − 7)² + (1 − 0)² = 785, √785 ≈ 28.02
F = (32 − 7)² + (0 − 0)² = 625, √625 = 25
G = (40 − 7)² + (0 − 0)² = 1089, √1089 = 33
H = (20 − 7)² + (1 − 0)² = 170, √170 ≈ 13.04
With k = 3, take the three smallest ED values: 13.04 (H, Neither),
18 (C, Neither), and 25 (F, football). Therefore L belongs to the class
Neither.
KNN: Key Concepts
Distance Metric: KNN relies on a distance metric (e.g., Euclidean
distance) to measure the similarity or dissimilarity between instances
in the feature space.
Parameter ’K’: The parameter ’K’ represents the number of
neighbors to consider when making a prediction. It is a crucial factor
that affects the model’s performance.
Decision Rule: For classification, the majority class among the
k-nearest neighbors determines the predicted class. For regression, the
average of the target values is used.
Advantages and Disadvantages of KNN
Advantages
Simple and easy to understand.
No training phase, making it computationally efficient during training.
Versatile and effective in a wide range of applications.
Disadvantages
Computationally expensive during prediction, especially with large
datasets.
Sensitive to irrelevant or redundant features.
Optimal choice of ’K’ may be task-dependent.
Check Your Progress
Question 1: What is the primary principle behind the K-Nearest
Neighbors (KNN) algorithm?
A Minimizing error residuals
B Finding centroids
C Predicting based on the majority class among k-neighbors
D Maximizing entropy
Question 2: In terms of computational complexity, during which phase
does KNN generally become more computationally expensive?
A Training phase
B Testing phase
C Feature extraction phase
D Model interpretation phase
What is Logistic Regression?
Logistic regression is a supervised machine learning algorithm used for
classification tasks where the goal is to predict the probability that an
instance belongs to a given class or not.
Logistic regression is used for binary classification, where we use the
sigmoid function, which takes the independent variables as input and
produces a probability value between 0 and 1.
Sigmoid Function/Logistic Function
The sigmoid function is a mathematical function used to map the
predicted values to probabilities.
It maps any real value into another value within a range of 0 and 1.
g(z) = 1 / (1 + e^(−z))
where
z = β0 + β1X1 + β2X2 + . . . + βnXn
e is the base of the natural logarithm (approximately 2.71828).
Logistic Regression-Key Points
The value of the logistic regression must be between 0 and 1, which
cannot go beyond this limit, so it forms a curve like the “S” form.
In Logistic regression, instead of fitting a regression line, we fit an “S”
shaped logistic function, which predicts two maximum values (0 or 1).
In logistic regression, we use the concept of a threshold value, which
defines the boundary between the classes 0 and 1: values above the
threshold tend to 1, and values below the threshold tend to 0.
Logistic regression predicts the output of a categorical dependent
variable.
− Once we have a probability, how do we actually classify the data?
Choose a threshold depending on the type of classification
problem we're solving:
y = 0 if g(z) < 0.5
y = 1 if g(z) ≥ 0.5
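A small sketch of the sigmoid and the 0.5 threshold rule, assuming NumPy; the β values below are illustrative placeholders rather than fitted coefficients.

import numpy as np

def sigmoid(z):
    # Maps any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([-1.0, 0.8, 0.5])  # illustrative beta0, beta1, beta2 (not fitted)
x = np.array([1.0, 2.0, 1.5])      # leading 1 multiplies the intercept beta0
p = sigmoid(beta @ x)              # probability the instance belongs to class 1
y = 1 if p >= 0.5 else 0           # threshold rule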
Logistic Regression
Advantages
Simple and Interpretable: Logistic regression is relatively simple to
understand and implement.
Computational Efficiency: Logistic regression is computationally
efficient and can handle large datasets efficiently.
Works well for Binary Classification: It is particularly effective
when the dependent variable is binary (two classes).
Less Prone to Overfitting: Logistic regression tends to be less
prone to overfitting compared to more complex models, making it a
good choice for smaller datasets.
Disadvantages
Assumption of Linearity
Sensitive to Outliers
Not Ideal for Multi-Class Classification
Check Your Progress
1 What is the primary objective of logistic regression?
A. Minimizing mean squared error
B. Maximizing likelihood
C. Minimizing regularization
D. Maximizing R-squared
2 If the output of a logistic regression model is 0.8 for a given
instance, how would you predict the class?
A. Class 0
B. Class 1
C. Insufficient information to determine
D. Class probability is not relevant for prediction
Decision Tree Based Classification
A decision tree is a flow-chart-like tree structure, where
− each internal node (nonleaf node) denotes a test on an attribute,
− each branch represents an outcome of the test, and
− each leaf node (or terminal node) holds a class label
Decision tree performs classification by constructing a tree based on
training instances with leaves having class labels.
− The tree is traversed for each test instance to find a leaf, and the class
of the leaf is the predicted class
Widely used learning method as it has been applied to:
− classify medical patients based on the disease,
− equipment malfunction by cause,
− loan applicant by likelihood of payment.
Decision tree learning: Algorithm
Aim: find a small tree consistent with the training examples
Idea: (recursively) choose ”most significant” attribute as root of
(sub)tree
− Tree is constructed in a top-down recursive divide-and-conquer manner
− At start, all the training examples/tuples are at the root
− Attributes are categorical (if continuous-valued, they are discretized in
advance)
− Examples are partitioned recursively based on selected attributes
− Optimal attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Conditions for stopping partitioning:
− All samples (tuples) for a given node belong to the same class
− There are no remaining attributes on which the tuples may be further
partitioned
− There are no samples (tuples) left for a given branch
Choosing the Splitting Attribute
At each node, the best attribute is selected for splitting the training
examples using a Goodness function
− The best attribute is the one that separates the classes of the training
examples fastest, resulting in the smallest tree
Typical goodness functions: information gain and GINI index
Information Gain: Select the attribute with the highest information
gain, i.e., the one that creates the smallest average disorder
− First, compute the disorder using Entropy; the expected information
needed to classify objects into classes
− Second, measure the Information Gain; to calculate by how much the
disorder of a set would reduce by knowing the value of a particular
attribute.
GINI index
− An alternative to information gain that measure impurity of attributes
in the classification task
− Select the attribute with the smallest GINI value.
Entropy
The Entropy measures the disorder of a set S containing a total of n
examples of which n+ are positive and n− are negative and it is given
by:
D(n+, n−) = −(n+/n) log2(n+/n) − (n−/n) log2(n−/n) = Entropy(S)   (1)
Some useful properties of the Entropy:
− D(n, m) = D(m, n)
− D(0, m) = D(m, 0) = 0
D(S)=0 means that all the examples in S have the same class
− D(m, m) = 1
D(S)=1 means that half the examples in S are of one class and half are
in the opposite class
Information Gain
Information gain is defined as the difference between the original
information requirement (i.e., based on just the proportion of classes)
and the new requirement (i.e., obtained after partitioning on A)
The Information Gain measures the expected reduction in entropy due
to splitting on an attribute A
GAINsplit = Entropy(S) − Σi=1..k (ni/n) · Entropy(i)   (2)
The parent node S is split into k partitions; ni is the number of records
in partition i
The attribute with the highest information gain is chosen as the
splitting attribute for node N
− This attribute minimizes the information needed to classify the tuples
in the resulting partitions and reflects the least randomness or
“impurity” in these partitions
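Equations (1) and (2) in code form; a minimal sketch using Python's math module, with the (pos, neg) class counts chosen purely for illustration.

from math import log2

def entropy(pos, neg):
    # D(n+, n-) from Eq. (1); terms with a zero count contribute 0.
    n = pos + neg
    return -sum(c / n * log2(c / n) for c in (pos, neg) if c > 0)

def info_gain(parent, partitions):
    # Eq. (2): parent and each partition are given as (pos, neg) counts.
    n = sum(parent)
    avg_disorder = sum((p + q) / n * entropy(p, q) for p, q in partitions)
    return entropy(*parent) - avg_disorder

# Illustrative: a 3+/5- parent split into partitions of 2+/2-, 1+/0-, 0+/3-
g = info_gain((3, 5), [(2, 2), (1, 0), (0, 3)])   # 0.954 - 0.5 = 0.454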
Example: The problem of “Sunburn”
You want to predict whether another person is likely to get sunburned
if they return to the beach. How can you do this?
Data collected: we predict based on the observed properties of the
people (hair color, height, weight, and lotion use)
Example: The problem of “Sunburn”
1 Attribute Selection by Information Gain to construct the
optimal decision tree
Entropy: the disorder of Sunburned
D({“Sarah”, “Dana”, “Alex”, “Annie”, “Emily”, “Pete”, “John”, “Katie”})
= D(3+, 5−) = −(3/8) log2(3/8) − (5/8) log2(5/8) = 0.954
2 Calculate the Average Disorder Associated with Hair Color
The first term of the sum: D(Sblonde) =
D({“Sarah”, “Annie”, “Dana”, “Katie”}) = D(2+, 2−) = 1
(|Sblonde| / |S|) · D(Sblonde) = (4/8) · 1 = 0.5
3 The second and third terms of the sum:
Sred = {“Emily”}
Sbrown = {“Alex”, “Pete”, “John”}
− These are both 0 because within each set all the examples have the
same class
− So the average disorder created when splitting on “hair color” is
0.5 + 0 + 0 = 0.5
Example: The problem of “Sunburn”
Which decision variable minimizes the disorder?
Test | Average Disorder
hair | 0.50
height | 0.69
weight | 0.94
lotion | 0.61
Which decision variable maximizes the Info Gain then?
Remember, it's the one which minimizes the average disorder.
− Gain(hair) = 0.954 − 0.50 = 0.454
− Gain(height) = 0.954 − 0.69 = 0.264
− Gain(weight) = 0.954 − 0.94 = 0.014
− Gain(lotion) = 0.954 − 0.61 = 0.344
The best decision tree?
Example: The problem of “Sunburn”
Once we have finished with hair colour we then need to calculate the
remaining branches of the decision tree.
Which attribute is best to classify the remaining examples?
This is the simplest and optimal one possible and it makes a lot of
sense.
It classifies 4 of the people on just the hair color alone.
Example: The problem of “Sunburn”
You can view Decision Tree as an IF-THEN-ELSE statement which
tells us whether someone will suffer from sunburn.
if (Hair-Colour=“red”) then
return(sunburned = yes)
else if (hair-colour=“blonde” and lotion-used=“No”) then
return(sunburned = yes)
else
return(sunburned = no)
end if
Rule Extraction from a decision tree
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
Exercise: Decision Tree for “buy computer or not”. Use the training
dataset given below to construct a decision tree.
Why decision tree in Machine Learning?
Relatively faster learning speed (than other classification methods)
Convertible to simple and easy to understand classification if-then-else
rules
Comparable classification accuracy with other methods
Does not require any prior knowledge of data distribution, works well
on noisy data.
Pros
Reasonable training time
Easy to implement
Easy to interpret
Can handle mixed data types
Cons
Cannot handle complicated relationships between features
Simple decision boundaries
Problems with lots of missing data
Check Your Progress
1 What is the role of the root node in a decision tree?
A It is the node with the highest impurity
B It is the node where predictions are made
C It is the node where the tree is pruned
D It is the topmost node where the first decision is made
2 Which technique helps to reduce the complexity of a decision
tree and prevent overfitting?
A Feature scaling
B Regularization
C Pruning
D Data normalization
3 Which of the following is an advantage of decision trees?
A Sensitivity to feature scales
B Difficulty handling non-linear relationships
C Limited interpretability
D Easy to interpret and understand
Metrics used to Evaluate Classifiers
Here are some key metrics commonly used for classifier evaluation:
1 Accuracy:
Definition: The proportion of correctly classified instances out of the
total instances.
Formula: Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
Use Case: Suitable for balanced datasets but may be misleading in the
presence of imbalanced classes.
2 Precision:
Definition: The proportion of true positive predictions out of all
positive predictions.
Formula: Precision = True Positives / (True Positives + False Positives)
Use Case: Emphasizes the accuracy of positive predictions, useful
when the cost of false positives is high.
3 Recall (Sensitivity or True Positive Rate):
Definition: The proportion of true positive predictions out of all actual
positives.
Formula: Recall = True Positives / (True Positives + False Negatives)
Use Case: Emphasizes capturing all positive instances, useful when
the cost of false negatives is high.
Metrics used to Evaluate Classifiers
4 F1 Score:
Definition: The harmonic mean of precision and recall.
Formula: F1 Score = (2 × Precision × Recall) / (Precision + Recall)
Use Case: Balances precision and recall, especially in imbalanced
datasets.
5 Specificity (True Negative Rate):
Definition: The proportion of true negative predictions out of all
actual negatives.
Formula: Specificity = True Negatives / (True Negatives + False Positives)
Use Case: Emphasizes the accuracy of negative predictions.
6 Area Under the ROC Curve (AUC-ROC):
Definition: The area under the Receiver Operating Characteristic
(ROC) curve.
Use Case: Provides a comprehensive measure of a classifier’s
performance across different decision thresholds.
7 Confusion Matrix:
Definition: A table summarizing true positives, true negatives, false
positives, and false negatives.
Use Case: Provides a detailed breakdown of classifier performance.
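The formulas above reduce to a few lines of code. The sketch below takes the four confusion-matrix counts directly; the function name is illustrative. It can be applied to the confusion matrix in the exercise that follows.

def classifier_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                              # sensitivity / true positive rate
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return accuracy, precision, recall, f1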
Metrics used to Evaluate Classifiers
Consider a binary classification problem where a model is trained to
predict whether an online transaction is fraudulent (Positive class) or not
fraudulent (Negative class). The model is evaluated on a test dataset
using a confusion matrix:
                           Actual Fraudulent | Actual Not Fraudulent
Predicted Fraudulent               120       |          10
Predicted Not Fraudulent            15       |         850
Compute the following metrics of the model:
Accuracy:
Precision:
Recall:
F1 Score:
  • 5. How does supervised learning work? Supervised learning involve two steps: Model construction and Model usage Model construction A pair of input objects (tuples) and desired output value used to train the model The set of tuples used for model construction is training set The model can be represented as classification rules, decision trees, or mathematical formulae. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 5 / 76
  • 6. How does supervised learning work? Model Usage For predicting future or unknown objects The known label of test sample is compared with the predicted result from the model Accuracy rate is the percentage of test set samples that are correctly predicted by the model Test set is independent of training set Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 6 / 76
  • 7. How does supervised learning work? Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 7 / 76
  • 8. Outline 1 Introduction 2 Regression 3 Classification 4 k- Nearest Neighbors (KNN) 5 Logistic Regression 6 Decision Tree Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 8 / 76
one or more independent variables. − Logistic Regression: Predict the probability that an instance belongs to a particular category (binary classification). Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 9 / 76
  • 10. Linear regression Linear regression, a staple of classical statistical modeling, is one of the simplest algorithms for doing supervised learning. It serves as a good starting point for more advanced approaches, and it is important to have a good understanding of linear regression before studying more complex learning methods. Simple Linear Regression Model The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (X). The relationship between the variables is described by a linear function: a change in one variable causes the other variable to change. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 10 / 76
  • 11. Linear Regression The linear regression model provides a sloped straight line representing the relationship between the variables. Mathematically, we can represent a linear regression as: $y = a_0 + a_1 x + \varepsilon$, where y = dependent variable (target variable), X = independent variable (predictor variable), $a_0$ = intercept of the line (gives an additional degree of freedom), $a_1$ = linear regression coefficient (scale factor applied to each input value), and $\varepsilon$ = random error. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 11 / 76
  • 12. Linear Regression Function $y = a_0 + a_1 x$. Slope of the regression line: $a_1 = r \frac{S_y}{S_x}$, where $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$, $S_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$, $S_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$. Y-intercept of the regression line: $a_0 = \bar{y} - a_1 \bar{x}$. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 12 / 76
  • 13. Simple Linear Regression by Hand - Method 1 There are just a handful of steps in linear regression. 1 Calculate the average (mean) of your X variable. 2 Calculate the difference between each X and the average $\bar{x}$. 3 Square the differences and add them all up. 4 Calculate the average (mean) of your Y variable. 5 Calculate the difference between each Y and the average $\bar{y}$. 6 Square the differences and add them all up. 7 Multiply the differences (of X and Y from their respective averages) and add them all together. 8 Calculate r. 9 Calculate $a_1$ using r, $S_y$ and $S_x$. 10 Calculate $a_0$ using $\bar{y}$, $a_1$ and $\bar{x}$. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 13 / 76
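To make Method 1 concrete, here is a minimal Python sketch (not part of the original slides) that follows the ten steps above; the arrays x and y are hypothetical values chosen for illustration:

```python
import numpy as np

# Hypothetical data: x = predictor, y = response
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()      # steps 1 and 4: means of X and Y
dx, dy = x - x_bar, y - y_bar          # steps 2 and 5: deviations from the means

# steps 3, 6, 7: sums of squared deviations and cross products
ss_x, ss_y, ss_xy = (dx**2).sum(), (dy**2).sum(), (dx * dy).sum()

r = ss_xy / np.sqrt(ss_x * ss_y)       # step 8: correlation coefficient
s_x = np.sqrt(ss_x / (n - 1))          # sample standard deviation of X
s_y = np.sqrt(ss_y / (n - 1))          # sample standard deviation of Y

a1 = r * s_y / s_x                     # step 9: slope
a0 = y_bar - a1 * x_bar                # step 10: intercept
print(f"y = {a0:.3f} + {a1:.3f}x  (r = {r:.3f})")
```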
  • 14. Simple Linear Regression by Hand - Method 2 Step 1: Calculate $X \cdot Y$, $X^2$, and $Y^2$. Step 2: Calculate $\Sigma X$, $\Sigma Y$, $\Sigma XY$, $\Sigma X^2$, and $\Sigma Y^2$. Step 3: Calculate $a_0$: $a_0 = \frac{(\Sigma Y)(\Sigma X^2) - (\Sigma X)(\Sigma XY)}{n(\Sigma X^2) - (\Sigma X)^2}$. Step 4: Calculate $a_1$: $a_1 = \frac{n(\Sigma XY) - (\Sigma X)(\Sigma Y)}{n(\Sigma X^2) - (\Sigma X)^2}$. Step 5: Place $a_0$ and $a_1$ in the estimated linear regression equation: $\hat{y} = a_0 + a_1 x$. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 14 / 76
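The same coefficients can be obtained directly from the raw sums of Method 2. A short sketch under the same assumptions (hypothetical data, NumPy used only for convenience):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor values
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])   # hypothetical response values
n = len(X)

# Steps 1-2: the required sums
sum_x, sum_y = X.sum(), Y.sum()
sum_xy, sum_x2 = (X * Y).sum(), (X**2).sum()

# Steps 3-4: closed-form solutions for the intercept and slope
a0 = (sum_y * sum_x2 - sum_x * sum_xy) / (n * sum_x2 - sum_x**2)
a1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)

# Step 5: the estimated regression equation
print(f"estimated equation: y-hat = {a0:.3f} + {a1:.3f}x")
```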
  • 15. Example For this example, we use the salary.csv dataset. The dataset contains the following two variables for 30 employees. Years of Experience Salary We want to create a simple linear regression model using years of experience as the explanatory variable and salary as the response variable. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 15 / 76
  • 16. How to Make Predictions with Linear Regression? Linear regression is a method we can use to quantify the relationship between one or more predictor variables and a response variable. One of the most common reasons for fitting a regression model is to use the model to predict the values of new observations. Steps to make predictions with a regression model 1 Collect the data. 2 Fit a regression model to the data. 3 Verify that the model fits the data well. 4 Use the fitted regression equation to predict the values of new observations. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 16 / 76
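As an illustration of these four steps in code, here is a sketch using scikit-learn (an assumption of this example, not a requirement of the chapter); the experience/salary numbers below are made up rather than taken from salary.csv:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Step 1: collect the data (hypothetical years-of-experience / salary pairs)
X = np.array([[1], [3], [5], [7], [10]])   # predictors must be 2-D for sklearn
y = np.array([40000, 52000, 63000, 75000, 98000])

# Step 2: fit a regression model to the data
model = LinearRegression().fit(X, y)

# Step 3: a quick check that the model fits the data reasonably well
print("R^2 on training data:", model.score(X, y))

# Step 4: use the fitted equation to predict a new observation (6 years)
print("predicted salary:", model.predict([[6]])[0])
```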
  • 17. Exercise 1 The following dataset gives 12 months of marketing budget (Spend) and Sales. Create a simple linear regression model using Spend as the explanatory variable and Sales as the response variable.
Month Spend Sales
1 1000 9914
2 4000 40487
3 5000 54324
4 4500 50044
5 3000 34719
6 4000 42551
7 9000 94871
8 11000 118914
9 15000 158484
10 12000 131348
11 7000 78504
12 3000 36284
Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 17 / 76
  • 18. Exercise 2 The following dataset consists of students' study hours and corresponding academic scores. Create a linear regression model to predict marks (scores) based on the time spent studying. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 18 / 76
  • 19. Multiple Linear Regression Multiple linear regression is used to model the relationship between multiple independent variables (also called predictors or features) and a single dependent variable. In contrast to simple linear regression, which involves only one independent variable, multiple linear regression considers two or more predictors. The general form of a multiple linear regression model with p predictors is given by the equation: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon$ Here: − Y is the dependent variable (the variable you are trying to predict). − $\beta_0$ is the intercept term (the value of Y when all predictors are zero). − $\beta_1, \beta_2, \ldots, \beta_p$ are the coefficients, representing the change in Y for a one-unit change in the corresponding predictor. − $X_1, X_2, \ldots, X_p$ are the independent variables (predictors). − $\varepsilon$ is the error term, representing the unobserved factors that affect Y but are not included in the model. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 19 / 76
  • 20. Multiple Linear Regression by Hand Suppose we have a dataset with one response variable y and two predictor variables $X_1$ and $X_2$. Use the following steps to fit a multiple linear regression model. − Step 1: Calculate $X_1^2$, $X_2^2$, $X_1 y$, $X_2 y$, and $X_1 X_2$. − Step 2: Calculate the regression sums: $\sum x_1^2 = \sum X_1^2 - (\sum X_1)^2 / n$, $\sum x_2^2 = \sum X_2^2 - (\sum X_2)^2 / n$, $\sum x_1 y = \sum X_1 y - (\sum X_1)(\sum y) / n$, $\sum x_2 y = \sum X_2 y - (\sum X_2)(\sum y) / n$, $\sum x_1 x_2 = \sum X_1 X_2 - (\sum X_1)(\sum X_2) / n$. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 20 / 76
  • 21. Multiple Linear Regression by Hand Step 3: Calculate $b_0$, $b_1$, and $b_2$. The formula to calculate $b_1$ is: $b_1 = \frac{(\sum x_2^2)(\sum x_1 y) - (\sum x_1 x_2)(\sum x_2 y)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2}$. The formula to calculate $b_2$ is: $b_2 = \frac{(\sum x_1^2)(\sum x_2 y) - (\sum x_1 x_2)(\sum x_1 y)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2}$. The formula to calculate $b_0$ is: $b_0 = \bar{y} - b_1 \bar{X}_1 - b_2 \bar{X}_2$. Step 4: Place $b_0$, $b_1$, and $b_2$ in the estimated linear regression equation: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 21 / 76
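A small sketch of these steps for two predictors; the dataset below is hypothetical and constructed so the true relationship is exactly y = 1 + 2*x1 + 1*x2:

```python
import numpy as np

# Hypothetical dataset: two predictors and one response
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([5.0, 6.0, 11.0, 12.0, 16.0])
n = len(y)

# Step 2: regression sums (deviation form)
sx1x1 = (X1**2).sum() - X1.sum()**2 / n
sx2x2 = (X2**2).sum() - X2.sum()**2 / n
sx1y  = (X1 * y).sum() - X1.sum() * y.sum() / n
sx2y  = (X2 * y).sum() - X2.sum() * y.sum() / n
sx1x2 = (X1 * X2).sum() - X1.sum() * X2.sum() / n

# Step 3: coefficients from the two-predictor normal equations
den = sx1x1 * sx2x2 - sx1x2**2
b1 = (sx2x2 * sx1y - sx1x2 * sx2y) / den
b2 = (sx1x1 * sx2y - sx1x2 * sx1y) / den
b0 = y.mean() - b1 * X1.mean() - b2 * X2.mean()

print(f"y-hat = {b0:.3f} + {b1:.3f}*x1 + {b2:.3f}*x2")  # recovers 1, 2, 1
```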
  • 22. Exercise 3 The following dataset consists of employees' age, experience, and income. Create a linear regression model to predict income based on age and experience. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 22 / 76
  • 23. Check Your Progress 1 If the intercept (a0) in a simple linear regression model is 5, what does this value represent in the context of the data? 2 What is the fundamental difference between simple linear regression and multiple linear regression? A The number of predictors B The complexity of the model C The type of response variable D The presence of interactions 3 Why might one prefer multiple linear regression over simple linear regression when modeling relationships between variables? Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 23 / 76
  • 24. Polynomial Regression Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial. In other words, instead of fitting a straight line (as in simple linear regression) or a plane (as in multiple linear regression), polynomial regression fits a curve to the data. The general equation for polynomial regression of degree n is: $Y = a_0 + a_1 X + a_2 X^2 + \ldots + a_n X^n + \varepsilon$ Here: Y is the dependent variable. X is the independent variable. $a_0, a_1, \ldots, a_n$ are the coefficients of the polynomial terms. $\varepsilon$ is the error term. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 24 / 76
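As an illustration, a degree-2 polynomial can be fitted with NumPy's polyfit; the data below is hypothetical and chosen to be roughly quadratic:

```python
import numpy as np

# Hypothetical data with a clearly curved (quadratic) relationship
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.0, 1.8, 5.1, 10.2, 17.0, 26.1])

# Fit Y = a0 + a1*X + a2*X^2; polyfit returns coefficients highest degree first
a2, a1, a0 = np.polyfit(X, Y, deg=2)
print(f"Y-hat = {a0:.2f} + {a1:.2f}*X + {a2:.2f}*X^2")

# Predict at a new point using the fitted curve
x_new = 6.0
print("prediction at x = 6:", a0 + a1 * x_new + a2 * x_new**2)
```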
  • 25. Why do we use polynomial regression? Polynomial regression is used when the relationship between the independent variable(s) and the dependent variable is not linear but can be better represented by a polynomial curve. Here are some common scenarios where polynomial regression can be beneficial: 1 Non-linear Relationships: When the relationship between the variables is curvilinear or follows a pattern that cannot be captured by a straight line. 2 Higher Order Patterns: Some relationships may exhibit higher-order patterns, such as quadratic, cubic, or higher-degree behavior. Polynomial regression can capture these patterns by including terms like $X^2$, $X^3$, etc. 3 Flexibility in Modeling: Polynomial regression provides a flexible framework to model a wide range of relationships. By adjusting the degree of the polynomial, you can fine-tune the model to better represent the characteristics of the data. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 25 / 76
  • 26. When to Use Polynomial Regression? We use polynomial regression when the relationship between a predictor and response variable is nonlinear. The easiest way to detect a nonlinear relationship is to create a scatterplot of the response vs. predictor variable. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 26 / 76
  • 27. Check Your Progress 1 How does polynomial regression differ from linear regression? A Polynomial regression uses a nonlinear relationship B Polynomial regression can handle multiple predictors C Polynomial regression is always more accurate D Polynomial regression assumes homoscedasticity 2 Provide examples of real-world scenarios where polynomial regression might outperform linear regression. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 27 / 76
  • 28. The Problem of Overfitting Overfitting is an undesirable machine learning behavior that occurs when the machine learning model gives accurate predictions for training data but not for new data. Overfitting is a common issue in machine learning where a model learns not only the underlying patterns in the training data but also captures noise and random fluctuations that are specific to that data. In other words, an overfit model performs well on the training data but fails to generalize effectively to new, unseen data. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 28 / 76
  • 29. The Problem of Overfitting Underfitting A model is said to underfit when it is too simple to capture the complexities of the data. It represents the inability of the model to learn the training data effectively, resulting in poor performance on both the training and testing data. − In simple terms, an underfit model's predictions are inaccurate, especially when applied to new, unseen examples. It mainly happens when we use a very simple model with overly simplified assumptions. To address underfitting, we need to use more complex models with enhanced feature representation and less regularization. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 29 / 76
  • 30. The Problem of Overfitting Reasons for Underfitting The model is too simple, so it may not be capable of representing the complexities in the data. The input features used to train the model are not adequate representations of the underlying factors influencing the target variable. The size of the training dataset is not large enough. Features are not scaled. Techniques to Reduce Underfitting Increase model complexity. Increase the number of features by performing feature engineering. Remove noise from the data. Increase the number of epochs or the duration of training to get better results. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 30 / 76
  • 31. Overfitting A model is said to be overfitted when it does not make accurate predictions on testing data. When a model gets trained with so much data, it starts learning from the noise and inaccurate data entries in our data set. − Testing with test data then results in high variance. − The model does not categorize the data correctly, because of too many details and noise. The causes of overfitting are the non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model based on the dataset and can therefore build unrealistic models. In a nutshell, overfitting is a problem where the evaluation of machine learning algorithms on training data differs from that on unseen data. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 31 / 76
  • 32. Addressing Overfitting Question Our goal when creating a model is to be able to use the model to predict outcomes correctly for new examples. A model which does this is said to generalize well. When a model fits the training data well but does not work well with new examples that are not in the training set, this is an example of —. Reducing the problem of overfitting in machine learning involves applying various techniques to ensure that the model generalizes well to new, unseen data. Here are some common techniques to address overfitting: − Collect more training examples − Select features to include /exclude − Regularization Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 32 / 76
  • 33. Regularization Regularization is a technique used in machine learning to prevent overfitting and improve the generalization performance of models. It reduces the size of the parameters − effectively shrinking the parameter vector by driving some of its components toward zero. In regularization, the coefficient values are shrunk toward zero. The regularization approach reduces the magnitude of the independent factors' coefficients while maintaining the same number of variables. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 33 / 76
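As a sketch of how regularization shrinks coefficients in practice, the example below compares plain least squares with scikit-learn's Ridge and Lasso on synthetic data; the alpha values are illustrative choices, not recommendations:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                 # 5 hypothetical features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=30)  # only 2 matter

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    # Ridge shrinks all coefficients; Lasso can drive irrelevant ones to exactly 0
    print(type(model).__name__, np.round(model.coef_, 2))
```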
  • 34. Check Your Progress 1 What is underfitting in the context of machine learning models? A Model fits the training data too closely. B Model generalizes well to new, unseen data. C Model fails to capture the underlying patterns in the data. D Model exhibits high training accuracy. 2 Which of the following scenarios is indicative of overfitting? A Low training error and low testing error. B High training error and low testing error. C Low training error and high testing error. D High training error and high testing error. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 34 / 76
  • 35. Metrics used to evaluate regression In regression problems, the prediction error is used to define the model performance. − The prediction error is also referred to as residuals and it is defined as the difference between the actual and predicted values. The regression model tries to fit a line that produces the smallest difference between predicted and actual(measured) values. Residual = actual value − predicted value error(e) = y − ŷ Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 35 / 76
  • 36. Metrics used to evaluate regression There are several evaluation metrics commonly used to assess the performance of regression models. Mean Absolute Error (MAE) It is the average of the absolute differences between the actual and predicted values. $MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$ Mean Squared Error (MSE) It is the average of the squared differences between the actual and the predicted values. $MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ Root Mean Squared Error (RMSE) It is the square root of the average squared difference between the actual and predicted values. $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$ By taking the square root of MSE, we get the Root Mean Square Error. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 36 / 76
  • 37. Metrics used to evaluate regression R-squared (Coefficient of Determination) R-squared explains to what extent the variance of one variable explains the variance of the second variable. In other words, it measures the proportion of variance of the dependent variable explained by the independent variable. R-squared is a popular metric for identifying model accuracy. − A larger R-squared value indicates a better fit. $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$ Adjusted R-Square Adjusted $R^2$ is the same as standard $R^2$ except that it penalizes models when additional features are added. It measures the variation explained by only the independent variables that actually affect the dependent variable. Adjusted $R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$ Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 37 / 76
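All of these metrics can be computed in a few lines; a minimal sketch with hypothetical actual and predicted values (k is the assumed number of predictors):

```python
import numpy as np

y     = np.array([10, 15, 20, 25, 30], dtype=float)  # hypothetical actual values
y_hat = np.array([12, 14, 21, 24, 31], dtype=float)  # hypothetical predictions
n, k  = len(y), 1                                    # n observations, k predictors

residuals = y - y_hat
mae  = np.mean(np.abs(residuals))                    # Mean Absolute Error
mse  = np.mean(residuals**2)                         # Mean Squared Error
rmse = np.sqrt(mse)                                  # Root Mean Squared Error
r2   = 1 - (residuals**2).sum() / ((y - y.mean())**2).sum()
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)        # penalizes extra features

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f} adjR2={adj_r2:.3f}")
```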
  • 38. Exercise Suppose you have a dataset with 10 observations and two variables, X (independent variable) and Y (dependent variable). You have built a linear regression model and obtained the following predictions (Ŷ ) and actual values (Y): Y=[10,15,20,25,30,35,40,45,50,55] Ŷ =[12,18,22,28,33,37,41,47,51,56] Compute the following regression evaluation metrics: 1 MAE 2 MSE 3 RMSE 4 R2 What is the primary characteristic of MAE? A Emphasizes larger errors B Sensitive to outliers C Uses squared differences D Provides a percentage measure Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 38 / 76
  • 39. Outline 1 Introduction 2 Regression 3 Classification 4 k- Nearest Neighbors (KNN) 5 Logistic Regression 6 Decision Tree Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 39 / 76
  • 40. Definition Classification is a machine learning technique used to predict group membership for data instances. Given a collection of records (training set), each record contains a set of attributes, one of the attributes is the class. − Each record is characterized by a tuple (X,y), where X is the attribute set and y is the class label. X: attribute, predictor, independent variable, input y: class, response, dependent variable, output Task: Learn a model that maps each attribute set X into one of the predefined class label y − Find a model for class attribute as a function of the values of other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. − A test set is used to determine the accuracy of the model. − Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. For example, one may use classification to predict whether the weather on a particular day will be “sunny”, “rainy” or “cloudy”. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 40 / 76
  • 41. Examples of Classification Tasks Email Spam Detection − Given an email, predict whether it is spam or not Handwritten Digit Recognition − Given an image of a handwritten digit, classify it into one of the digits from 0 to 9. Disease Diagnosis − Based on medical test results and patient information, predict whether a patient has a specific disease (e.g., diabetes, cancer). Sentiment Analysis − Analyze a piece of text (e.g., a product review) and classify it as positive, negative, or neutral sentiment. Image Recognition − Classify objects in an image into predefined categories (e.g., cat, dog, car). Credit Scoring − Determine whether a person is likely to default on a loan based on their credit history, income, and other relevant factors. Customer Churn Prediction − Predict whether a customer is likely to churn (stop using) a service based on their usage patterns and demographics. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 41 / 76
  • 42. Illustrating Classification Task Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 42 / 76
  • 43. Classification Techniques There are various classification methods. Popular classification techniques include the following. K-nearest neighbor: KNN is a non-parametric, lazy learning algorithm where an instance is classified by the majority class of its k-nearest neighbors. Logistic Regression: Despite its name, logistic regression is used for binary classification problems. It models the probability of an instance belonging to a particular class. Decision tree classifier: divides the decision space into piecewise constant regions. Random Forest: Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the class that is the mode of the classes from individual trees. Support Vector Machines (SVM): SVM is a linear model that creates a line or a hyperplane which separates the data into classes. Neural networks: partition by non-linear boundaries. Bayesian network: a probabilistic model. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 43 / 76
  • 44. Check Your Progress Question 1: What is the primary goal of a classification algorithm? A Minimize computational complexity B Predict continuous values C Predict categorical labels for new instances D Maximize feature dimensionality Question 2: What is the primary goal of a classification algorithm in handwritten digit recognition? A Predicting colors of digits B Identifying shapes of digits C Predicting continuous values D Recognizing the digit class (0-9) Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 44 / 76
  • 45. Outline 1 Introduction 2 Regression 3 Classification 4 k- Nearest Neighbors (KNN) 5 Logistic Regression 6 Decision Tree Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 45 / 76
  • 46. What is KNN K-nearest neighbors is a simple algorithm used for both classification and regression. − It belongs to the family of lazy learning algorithms, as it doesn't build an explicit model during training. It basically stores all available cases and classifies new cases by a majority vote of their k neighbors. − It memorizes the training instances and makes predictions based on the majority class (for classification) or the average (for regression) of the k-nearest neighbors in the feature space. A case is assigned to the class most common amongst its K nearest neighbors, measured by a distance function (Euclidean, Manhattan, ...). If K = 1, then the case is simply assigned to the class of its nearest neighbor. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 46 / 76
  • 47. KNN Steps involved in the KNN algorithm 1 Define the Value of K 2 Select the Distance Metric 3 Prepare the Data 4 Split the Dataset 5 Calculate Distances 6 Majority Voting (for Classification) or Weighted Average (for Regression) 7 Make Predictions 8 Evaluate the Model Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 47 / 76
  • 48. KNN Example: Assume we have a dataset which can be plotted as follows. Next, we need to classify a new data point (black dot) into the blue or red class. We are assuming K = 3, i.e., the algorithm finds the three nearest data points. We can see in the diagram the three nearest neighbors of the black dot. Among those three, two of them lie in the red class, hence the black dot will also be assigned to the red class. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 48 / 76
  • 49. KNN: Example Let L = {age = 7, sex = F}; what is the class label of L's recreation? Determine L's recreation class based on the following data records. $ED = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ because we have only 2 attributes. We should represent attribute values numerically, so let female = 0 and male = 1, with k = 3. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 49 / 76
  • 50. KNN: Example Find the ED of each and every person with L. A: $(56 - 7)^2 + (0 - 0)^2 = 2401$, then $\sqrt{2401} = 49$. B: $(34 - 7)^2 + (1 - 0)^2 = 730$, then $\sqrt{730} \approx 27.02$. C: $(25 - 7)^2 + (0 - 0)^2 = 324$, then $\sqrt{324} = 18$. D: $(40 - 7)^2 + (1 - 0)^2 = 1090$, then $\sqrt{1090} \approx 33.02$. E: $(35 - 7)^2 + (1 - 0)^2 = 785$, then $\sqrt{785} \approx 28.02$. F: $(32 - 7)^2 + (0 - 0)^2 = 625$, then $\sqrt{625} = 25$. G: $(40 - 7)^2 + (0 - 0)^2 = 1089$, then $\sqrt{1089} = 33$. H: $(20 - 7)^2 + (1 - 0)^2 = 170$, then $\sqrt{170} \approx 13.04$. With k = 3, take the three smallest ED values: 13.04 (H, Neither), 18 (C, Neither), and 25 (F, Football). By majority vote, L belongs to the class Neither. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 50 / 76
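The same calculation can be reproduced in a short Python sketch; the encoding (female = 0, male = 1) follows the slide, and only the class labels the slide states for H, C, and F are used:

```python
import numpy as np

query = np.array([7, 0])  # L: age = 7, sex = F (encoded as 0)

# (age, sex) pairs from the table
points = {"A": (56, 0), "B": (34, 1), "C": (25, 0), "D": (40, 1),
          "E": (35, 1), "F": (32, 0), "G": (40, 0), "H": (20, 1)}
labels = {"C": "Neither", "F": "Football", "H": "Neither"}  # as given on the slide

# Euclidean distance of every person to L
dists = {name: float(np.linalg.norm(np.array(p) - query))
         for name, p in points.items()}
for name, d in sorted(dists.items(), key=lambda kv: kv[1]):
    print(f"{name}: {d:.2f}")

# k = 3 nearest are H, C, F; two of the three votes are "Neither"
k3 = sorted(dists, key=dists.get)[:3]
votes = [labels[n] for n in k3]
print("3-NN:", k3, "-> majority:", max(set(votes), key=votes.count))
```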
  • 51. KNN: Key Concepts Distance Metric: KNN relies on a distance metric (e.g., Euclidean distance) to measure the similarity or dissimilarity between instances in the feature space. Parameter ’K’: The parameter ’K’ represents the number of neighbors to consider when making a prediction. It is a crucial factor that affects the model’s performance. Decision Rule: For classification, the majority class among the k-nearest neighbors determines the predicted class. For regression, the average of the target values is used. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 51 / 76
  • 52. Advantages and Disadvantages of KNN Advantages Simple and easy to understand. No training phase, making it computationally efficient during training. Versatile and effective in a wide range of applications. Disadvantage Computationally expensive during prediction, especially with large datasets. Sensitive to irrelevant or redundant features. Optimal choice of ’K’ may be task-dependent. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 52 / 76
  • 53. Check Your Progress Question 1: What is the primary principle behind the K-Nearest Neighbors (KNN) algorithm? A Minimizing error residuals B Finding centroids C Predicting based on the majority class among k-neighbors D Maximizing entropy Question 2: In terms of computational complexity, during which phase does KNN generally become more computationally expensive? A Training phase B Testing phase C Feature extraction phase D Model interpretation phase Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 53 / 76
  • 54. Outline 1 Introduction 2 Regression 3 Classification 4 k- Nearest Neighbors (KNN) 5 Logistic Regression 6 Decision Tree Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 54 / 76
  • 55. What is Logistic Regression? Logistic regression is a supervised machine learning algorithm used for classification tasks where the goal is to predict the probability that an instance belongs to a given class or not. Logistic regression is used for binary classification, where we use the sigmoid function, which takes the independent variables as input and produces a probability value between 0 and 1. Sigmoid Function/Logistic Function The sigmoid function is a mathematical function used to map the predicted values to probabilities. It maps any real value into another value within a range of 0 and 1. $g(z) = \frac{1}{1 + e^{-z}}$ where $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n$ and e is the base of the natural logarithm (approximately 2.71828). Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 55 / 76
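A one-function sketch of the sigmoid, showing how it squashes any real input into (0, 1):

```python
import numpy as np

def sigmoid(z):
    """Map any real value z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative z -> near 0, z = 0 -> exactly 0.5, large positive z -> near 1
for z in (-5.0, 0.0, 5.0):
    print(f"g({z}) = {sigmoid(z):.4f}")
```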
  • 56. Sigmoid Function Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 56 / 76
  • 57. Logistic Regression-Key Points The value of the logistic regression must be between 0 and 1, and cannot go beyond this limit, so it forms a curve like the "S" form. In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1). In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0. Logistic regression predicts the output of a categorical dependent variable. − Once we have a probability, how do we actually classify the data? Choose a probability threshold depending on the type of classification problem we're solving: $y = 0$ if $g(z) < 0.5$, and $y = 1$ if $g(z) \geq 0.5$. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 57 / 76
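Combining the sigmoid with the 0.5 threshold gives a complete, if toy, classifier; the coefficients below are hypothetical rather than fitted values:

```python
import numpy as np

def predict_class(x, beta0=-4.0, beta1=1.5, threshold=0.5):
    """Logistic-regression prediction for one feature x (hypothetical coefficients)."""
    z = beta0 + beta1 * x                # linear combination of the inputs
    p = 1.0 / (1.0 + np.exp(-z))         # sigmoid -> probability in (0, 1)
    return (1 if p >= threshold else 0), p

for x in (1.0, 3.0, 5.0):
    label, prob = predict_class(x)
    print(f"x = {x}: p = {prob:.3f} -> class {label}")
```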
  • 58. Logistic Regression Advantages Simple and Interpretable: Logistic regression is relatively simple to understand and implement. Computational Efficiency: Logistic regression is computationally efficient and can handle large datasets efficiently. Works well for Binary Classification: It is particularly effective when the dependent variable is binary (two classes). Less Prone to Overfitting: Logistic regression tends to be less prone to overfitting compared to more complex models, making it a good choice for smaller datasets. Disadvantage Assumption of Linearity Sensitive to Outliers Not Ideal for Multi-Class Classification Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 58 / 76
  • 59. Check Your Progress 1 What is the primary objective of logistic regression? A. Minimizing mean squared error B. Maximizing likelihood C. Minimizing regularization D. Maximizing R-squared 2 If the output of a logistic regression model is 0.8 for a given instance, how would you predict the class? A. Class 0 B. Class 1 C. Insufficient information to determine D. Class probability is not relevant for prediction Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 59 / 76
  • 60. Outline 1 Introduction 2 Regression 3 Classification 4 k- Nearest Neighbors (KNN) 5 Logistic Regression 6 Decision Tree Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 60 / 76
  • 61. Decision Tree Based Classification A decision tree is a flow-chart-like tree structure, where − each internal node (nonleaf node) denotes a test on an attribute, − each branch represents an outcome of the test, and − each leaf node (or terminal node) holds a class label Decision tree performs classification by constructing a tree based on training instances with leaves having class labels. − The tree is traversed for each test instance to find a leaf, and the class of the leaf is the predicted class Widely used learning method as it has been applied to: − classify medical patients based on the disease, − equipment malfunction by cause, − loan applicant by likelihood of payment. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 61 / 76
  • 62. Decision tree learning: Algorithm Aim: find a small tree consistent with the training examples Idea: (recursively) choose ”most significant” attribute as root of (sub)tree − Tree is constructed in a top-down recursive divide-and-conquer manner − At start, all the training examples/tuples are at the root − Attributes are categorical (if continuous-valued, they are discretized in advance) − Examples are partitioned recursively based on selected attributes − Optimal attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning: − All samples (tuples) for a given node belong to the same class − There are no remaining attributes on which the tuples may be further partitioned − There are no samples (tuples) left for a given branch Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 62 / 76
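In practice, libraries implement this recursive partitioning for us; below is a sketch using scikit-learn's DecisionTreeClassifier on hypothetical integer-encoded data, with criterion="entropy" so that splits are chosen by information gain as described above:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoded training data: each row is an attribute set X, y is the class
X = [[0, 1], [1, 0], [0, 0], [1, 1], [2, 0], [2, 1]]
y = [0, 1, 0, 1, 1, 0]

# criterion="entropy" selects splitting attributes by information gain
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Print the learned tree as nested if-then rules, then classify a new tuple
print(export_text(tree, feature_names=["attr1", "attr2"]))
print("prediction for [0, 1]:", tree.predict([[0, 1]]))
```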
  • 63. Choosing the Splitting Attribute At each node, the best attribute is selected for splitting the training examples using a goodness function. − The best attribute is the one that separates the classes of the training examples fastest, such that it results in the smallest tree. Typical goodness functions: information gain and GINI index. Information Gain: Select the attribute with the highest information gain, i.e., the one that creates the smallest average disorder. − First, compute the disorder using Entropy, the expected information needed to classify objects into classes. − Second, measure the Information Gain: calculate by how much the disorder of a set would reduce by knowing the value of a particular attribute. GINI index − An alternative to information gain that measures the impurity of attributes in the classification task. − Select the attribute with the smallest GINI value. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 63 / 76
  • 64. Entropy The Entropy measures the disorder of a set S containing a total of n examples, of which $n_+$ are positive and $n_-$ are negative, and it is given by: $D(n_+, n_-) = -\frac{n_+}{n}\log_2\frac{n_+}{n} - \frac{n_-}{n}\log_2\frac{n_-}{n} = Entropy(S)$ (1) Some useful properties of the Entropy: − $D(n, m) = D(m, n)$ − $D(0, m) = D(m, 0) = 0$; D(S) = 0 means that all the examples in S have the same class. − $D(m, m) = 1$; D(S) = 1 means that half the examples in S are of one class and half are in the opposite class. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 64 / 76
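A direct translation of the entropy formula, using the convention that 0 · log2 0 = 0:

```python
import math

def entropy(n_pos, n_neg):
    """Disorder D(n+, n-) of a set with n_pos positive and n_neg negative examples."""
    n = n_pos + n_neg
    d = 0.0
    for count in (n_pos, n_neg):
        if count > 0:                  # 0 * log2(0) is taken to be 0
            p = count / n
            d -= p * math.log2(p)
    return d

print(entropy(0, 5))   # 0.0    -> all examples share one class
print(entropy(4, 4))   # 1.0    -> evenly split set
print(entropy(3, 5))   # ~0.954 -> the value used in the sunburn example below
```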
  • 65. Information Gain Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). The Information Gain measures the expected reduction in entropy due to splitting on an attribute A: $Gain_{split} = Entropy(S) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)$ (2) The parent node S is split into k partitions; $n_i$ is the number of records in partition i. The attribute with the highest information gain is chosen as the splitting attribute for node N. − This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in these partitions. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 65 / 76
  • 66. Example: The problem of “Sunburn” You want to predict whether another person is likely to get sunburned if he is back to the beach. How can you do this? Data Collected: predict based on the observed properties of the people Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 66 / 76
  • 67. Example: The problem of "Sunburn" 1 Attribute Selection by Information Gain to construct the optimal decision tree. Entropy: the disorder of Sunburned. D({"Sarah", "Dana", "Alex", "Annie", "Emily", "Pete", "John", "Katie"}) = $D(3_+, 5_-) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} = 0.954$ 2 Calculate the average disorder associated with hair color. The first term of the sum: $D(S_{blonde})$ = D({"Sarah", "Annie", "Dana", "Katie"}) = $D(2_+, 2_-) = 1$, so $\frac{|S_{blonde}|}{|S|} \cdot D(S_{blonde}) = \frac{4}{8} \cdot 1 = 0.5$ 3 The second and third terms of the sum: $S_{red}$ = {"Emily"}, $S_{brown}$ = {"Alex", "Pete", "John"}. − These are both 0 because within each set all the examples have the same class. − So the average disorder created when splitting on "hair color" is 0.5 + 0 + 0 = 0.5. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 67 / 76
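A small self-contained sketch verifying these numbers (and anticipating the hair-color gain computed on the next slide):

```python
import math

def entropy(n_pos, n_neg):
    n = n_pos + n_neg
    d = 0.0
    for count in (n_pos, n_neg):
        if count > 0:
            d -= (count / n) * math.log2(count / n)
    return d

# Whole set: 3 sunburned, 5 not sunburned
print(round(entropy(3, 5), 3))                       # 0.954

# Average disorder after splitting on hair colour:
# blonde = (2+, 2-), red = (1+, 0-), brown = (0+, 3-)
avg = (4/8) * entropy(2, 2) + (1/8) * entropy(1, 0) + (3/8) * entropy(0, 3)
print(avg)                                           # 0.5
print(round(entropy(3, 5) - avg, 3))                 # Gain(hair) = 0.454
```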
  • 68. Example: The problem of "Sunburn" Which decision variable minimizes the disorder?
Test Average Disorder
hair 0.50
height 0.69
weight 0.94
lotion 0.61
Which decision variable maximizes the Info Gain then? Remember it's the one which minimizes the average disorder. − Gain(hair) = 0.954 - 0.50 = 0.454 − Gain(height) = 0.954 - 0.69 = 0.264 − Gain(weight) = 0.954 - 0.94 = 0.014 − Gain(lotion) = 0.954 - 0.61 = 0.344 The best decision tree? Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 68 / 76
  • 69. Example: The problem of "Sunburn" Once we have finished with hair colour, we then need to calculate the remaining branches of the decision tree. Which attribute is best to classify the remaining examples? The resulting tree is the simplest and optimal one possible, and it makes a lot of sense: it classifies 4 of the people on hair color alone. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 69 / 76
  • 70. Example: The problem of "Sunburn" You can view the decision tree as an IF-THEN-ELSE statement which tells us whether someone will suffer from sunburn. if (hair-colour = "red") then return(sunburned = yes) else if (hair-colour = "blonde" and lotion-used = "no") then return(sunburned = yes) else return(sunburned = no) end if Rule Extraction from a decision tree One rule is created for each path from the root to a leaf node. To form a rule antecedent, each splitting criterion is logically ANDed. The leaf node holds the class prediction, forming the rule consequent. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 70 / 76
  • 71. Exercise: Decision Tree for “buy computer or not”. Use the training dataset given below to construct decision tree Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 71 / 76
  • 72. Why decision tree in Machine Learning? Relatively fast learning speed (compared with other classification methods). Convertible to simple and easy-to-understand if-then-else classification rules. Comparable classification accuracy with other methods. Does not require any prior knowledge of data distribution, and works well on noisy data. Pros Reasonable training time Easy to implement Easy to interpret Can handle mixed data types Cons Cannot handle complicated relationships between features Simple decision boundaries Problems with lots of missing data Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 72 / 76
  • 73. Check Your Progress 1 What is the role of the root node in a decision tree? A It is the node with the highest impurity B It is the node where predictions are made C It is the node where the tree is pruned D It is the topmost node where the first decision is made 2 Which technique helps to reduce the complexity of a decision tree and prevent overfitting? A Feature scaling B Regularization C Pruning D Data normalization 3 Which of the following is an advantage of decision trees? A Sensitivity to feature scales B Difficulty handling non-linear relationships C Limited interpretability D Easy to interpret and understand Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 73 / 76
  • 74. Metrics used to Evaluate Classifiers Here are some key metrics commonly used for classifier evaluation: 1 Accuracy: Definition: The proportion of correctly classified instances out of the total instances. Formula: $Accuracy = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$ Use Case: Suitable for balanced datasets but may be misleading in the presence of imbalanced classes. 2 Precision: Definition: The proportion of true positive predictions out of all positive predictions. Formula: $Precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$ Use Case: Emphasizes the accuracy of positive predictions, useful when the cost of false positives is high. 3 Recall (Sensitivity or True Positive Rate): Definition: The proportion of true positive predictions out of all actual positives. Formula: $Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$ Use Case: Emphasizes capturing all positive instances, useful when the cost of false negatives is high. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 74 / 76
  • 75. Metrics used to Evaluate Classifiers 4 F1 Score: Definition: The harmonic mean of precision and recall. Formula: $F1\ Score = \frac{2 \times Precision \times Recall}{Precision + Recall}$ Use Case: Balances precision and recall, especially in imbalanced datasets. 5 Specificity (True Negative Rate): Definition: The proportion of true negative predictions out of all actual negatives. Formula: $Specificity = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$ Use Case: Emphasizes the accuracy of negative predictions. 6 Area Under the ROC Curve (AUC-ROC): Definition: The area under the Receiver Operating Characteristic (ROC) curve. Use Case: Provides a comprehensive measure of a classifier's performance across different decision thresholds. 7 Confusion Matrix: Definition: A table summarizing true positives, true negatives, false positives, and false negatives. Use Case: Provides a detailed breakdown of classifier performance. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 75 / 76
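A sketch computing these metrics from raw confusion-matrix counts; the counts below are placeholders, not the answer to the exercise that follows:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common classifier metrics from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)        # sensitivity / true positive rate
    specificity = tn / (tn + fp)        # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# Placeholder counts (hypothetical):
acc, prec, rec, spec, f1 = classification_metrics(tp=50, fp=5, fn=10, tn=100)
print(f"acc={acc:.3f} precision={prec:.3f} recall={rec:.3f} "
      f"specificity={spec:.3f} F1={f1:.3f}")
```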
  • 76. Metrics used to Evaluate Classifiers Consider a binary classification problem where a model is trained to predict whether an online transaction is fraudulent (positive class) or not fraudulent (negative class). The model is evaluated on a test dataset using a confusion matrix:
(blank) Actual Fraudulent Actual Not Fraudulent
Predicted Fraudulent 120 10
Predicted Not Fraudulent 15 850
Compute the following metrics of the model: Accuracy, Precision, Recall, F1 Score. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 76 / 76