Chapter Two
Supervised learning
Shumet Tadesse Nigatu
Department of Computer Science
College of Informatics
University of Gondar
February 2024
Outline
1 Introduction
2 Regression
3 Classification
4 k-Nearest Neighbors (KNN)
5 Logistic Regression
6 Decision Tree
Introduction
Definition
Supervised learning is a type of machine learning paradigm in which an
algorithm is trained on a labeled dataset.
What is Labeled data?
Labeled data consists of input-output pairs, where the input is the data
the algorithm processes, and the output is the corresponding desired or
target output.
Goal
The goal of supervised learning is for the algorithm to learn a mapping
from inputs to outputs, making predictions or classifications on new,
unseen data.
Examples of Supervised learning
Input (X) | Output (Y) | Application
email | spam? (0/1) | spam filtering
audio | text transcripts | speech recognition
English | Amharic | machine translation
ad, user information | click? (0/1) | online advertising
image of phone | defect? (0/1) | visual inspection
Task
Each record is characterized by a tuple (X,y), where X is the attribute
set and y is the class label.
− X: attribute, predictor, independent variable, input
− y: class, response, dependent variable, output
Learn a model that maps each attribute set X into one of the
predefined class labels y
− Find a model for class attribute as a function of the values of other
attributes.
How does supervised learning work?
Supervised learning involves two steps: model construction and model
usage
Model construction
Pairs of input objects (tuples) and desired output values are used to
train the model
The set of tuples used for model construction is the training set
The model can be represented as classification rules, decision trees, or
mathematical formulae.
How does supervised learning work?
Model Usage
For predicting future or unknown objects
The known label of a test sample is compared with the result predicted
by the model
Accuracy rate is the percentage of test set samples that are correctly
predicted by the model
Test set is independent of training set
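To make the two steps concrete, here is a minimal Python sketch of the construct-then-use workflow. It assumes scikit-learn is available; the dataset and the classifier are placeholders chosen purely for illustration, not part of the lecture material.

# Minimal sketch of the two-step supervised workflow (assumes scikit-learn;
# dataset and classifier are illustrative placeholders).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # labeled data: (attribute set X, class label y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # model construction (training set)
y_pred = model.predict(X_test)                           # model usage (test set)
print("Accuracy rate:", accuracy_score(y_test, y_pred))  # fraction of test samples predicted correctly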
What is Regression?
Regression is a statistical method used in machine learning and
statistics to model the relationship between a dependent variable (or
target) and one or more independent variables (or features).
The goal of regression analysis is to understand and quantify the
relationship between variables, make predictions, and identify patterns
in the data.
Regression analysis can be used to make predictions and, under suitable
assumptions, to model causal relationships
There are different types of regression models, but the most common
ones are linear regression and logistic regression.
− Linear Regression: Predict a continuous dependent variable based on
one or more independent variables.
− Logistic Regression: Predict the probability that an instance belongs
to a particular category (binary classification).
Linear regression
Linear regression, a staple of classical statistical modeling, is one of
the simplest algorithms for doing supervised learning.
It serves as a good starting point for more advanced approaches
It is important to have a good understanding of linear regression
before studying more complex learning methods.
Simple Linear Regression Model
A linear regression algorithm models a linear relationship between a
dependent variable (y) and one or more independent variables (X); in
simple linear regression there is a single independent variable
The relationship between the variables is described by a linear function.
A change in the independent variable is associated with a corresponding
change in the dependent variable.
Linear Regression
The linear regression model provides a sloped straight line
representing the relationship between the variables.
Mathematically, we can represent a linear regression as: y = a0 + a1x + ε
Where,
y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
Linear Regression Function
y = a0 + a1x
Slope of the regression line: a1 = r · (Sy / Sx), where
r = Σ((x − x̄)(y − ȳ)) / √( Σ(x − x̄)² · Σ(y − ȳ)² )
Sy = √( Σ(y − ȳ)² / (n − 1) ), Sx = √( Σ(x − x̄)² / (n − 1) )
Y-intercept of the regression line: a0 = ȳ − a1x̄
Simple Linear Regression by Hand-Method 1
There are just a handful of steps in linear regression.
1 Calculate average(mean) of your X variable
2 Calculate the difference between each X and the average X
3 Square the differences and add it all up
4 Calculate average(mean) of your Y variable
5 Calculate the difference between each Y and the average Y
6 Square the differences and add it all up
7 Multiply the differences (of X and Y from their respective averages)
and add them all together
8 Calculate r
9 Calculate a1 using r, Sy, and Sx
10 Calculate a0 using ȳ, a1, and x̄
Simple Linear Regression by Hand - Method 2
Step 1: Calculate X·Y, X², and Y²
Step 2: Calculate ΣX, ΣY, ΣX·Y, ΣX², and ΣY²
Step 3: Calculate a0: The formula to calculate a0 is:
a0 = ( (ΣY)(ΣX²) − (ΣX)(ΣXY) ) / ( n(ΣX²) − (ΣX)² )
Step 4: Calculate a1: The formula to calculate a1 is:
a1 = ( n(ΣXY) − (ΣX)(ΣY) ) / ( n(ΣX²) − (ΣX)² )
Step 5: Place a0 and a1 in the estimated linear regression equation.
The estimated linear regression equation is: ŷ = a0 + a1·x
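As an illustration, Method 2 translates directly into code. Below is a minimal NumPy sketch; the function name and the toy data are illustrative, not part of the lecture.

import numpy as np

def simple_linear_regression(x, y):
    # Method 2: a0 and a1 computed directly from the raw sums.
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sx, sy = x.sum(), y.sum()
    sxy, sxx = (x * y).sum(), (x * x).sum()
    a1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a0 = (sy * sxx - sx * sxy) / (n * sxx - sx ** 2)
    return a0, a1  # estimated equation: y_hat = a0 + a1 * x

# Illustrative usage on toy data with an exact fit y = 2x + 1:
a0, a1 = simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])   # a0 = 1.0, a1 = 2.0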
Example
For this example, we use the salary.csv dataset. The dataset contains the
following two variables for 30 employees.
Years of Experience
Salary
We want to create a simple linear regression model using years of
experience as the explanatory variable and salary as the response variable.
How to Make Predictions with Linear
Regression?
Linear regression is a method we can use to quantify the relationship
between one or more predictor variables and a response variable.
One of the most common reasons for fitting a regression model is to
use the model to predict the values of new observations.
Steps to make predictions with a regression model
1 Collect the data.
2 Fit a regression model to the data.
3 Verify that the model fits the data well.
4 Use the fitted regression equation to predict the values of new
observations.
Exercise 1
The following dataset is about 12 months marketing budget and sales.
Create a simple linear regression model using Spend as the explanatory
variable and sales as the response variable.
Month Spend Sales
1 1000 9914
2 4000 40487
3 5000 54324
4 4500 50044
5 3000 34719
6 4000 42551
7 9000 94871
8 11000 118914
9 15000 158484
10 12000 131348
11 7000 78504
12 3000 36284
Exercise 2
The following dataset consists of students' study hours and corresponding
academic scores. Create a linear regression model to predict
marks (scores) based on the time spent studying.
Multiple Linear Regression
Multiple linear regression is used to model the relationship between
multiple independent variables (also called predictors or features) and
a single dependent variable.
In contrast to simple linear regression, which involves only one
independent variable, multiple linear regression considers two or more
predictors.
The general form of a multiple linear regression model with ’p’
predictors is given by the equation:
Y = β0 + β1X1 + β2X2 + . . . + βpXp + ε
Here:
− Y is the dependent variable (the variable you are trying to predict).
− β0 is the intercept term (the value of Y when all predictors are zero).
− β1, β2, . . . , βp are the coefficients, representing the change in Y for a
one-unit change in the corresponding predictor.
− X1, X2, . . . , Xp are the independent variables (predictors).
− ε is the error term, representing the unobserved factors that affect Y
but are not included in the model.
Multiple Linear Regression by Hand
Suppose we have a dataset with one response variable y and two
predictor variables X1 and X2
Use the following steps to fit a multiple linear regression model
− Step 1: Calculate X1², X2², X1y, X2y, and X1X2
− Step 2: Calculate Regression Sums. Next, make the following
regression sum calculations:
Σx1² = ΣX1² − (ΣX1)² / n
Σx2² = ΣX2² − (ΣX2)² / n
Σx1y = ΣX1y − (ΣX1)(Σy) / n
Σx2y = ΣX2y − (ΣX2)(Σy) / n
Σx1x2 = ΣX1X2 − (ΣX1)(ΣX2) / n
Multiple Linear Regression by Hand
Step 3: Calculate b0, b1, and b2. The formula to calculate b1 is:
b1 = ( (Σx2²)(Σx1y) − (Σx1x2)(Σx2y) ) / ( (Σx1²)(Σx2²) − (Σx1x2)² )
The formula to calculate b2 is:
b2 = ( (Σx1²)(Σx2y) − (Σx1x2)(Σx1y) ) / ( (Σx1²)(Σx2²) − (Σx1x2)² )
The formula to calculate b0 is:
b0 = ȳ − b1x̄1 − b2x̄2
Step 4: Place b0, b1, and b2 in the estimated linear regression
equation.
The estimated linear regression equation is: ŷ = b0 + b1·x1 + b2·x2
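A minimal NumPy sketch of the two-predictor case, mirroring the regression sums above rather than calling a library fitting routine; the function name and structure are illustrative.

import numpy as np

def two_predictor_regression(x1, x2, y):
    x1, x2, y = (np.asarray(v, float) for v in (x1, x2, y))
    n = len(y)
    # Step 2: regression sums, written exactly as in the slides.
    s11 = (x1 * x1).sum() - x1.sum() ** 2 / n        # sum x1^2
    s22 = (x2 * x2).sum() - x2.sum() ** 2 / n        # sum x2^2
    s1y = (x1 * y).sum() - x1.sum() * y.sum() / n    # sum x1*y
    s2y = (x2 * y).sum() - x2.sum() * y.sum() / n    # sum x2*y
    s12 = (x1 * x2).sum() - x1.sum() * x2.sum() / n  # sum x1*x2
    # Step 3: coefficients.
    den = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / den
    b2 = (s11 * s2y - s12 * s1y) / den
    b0 = y.mean() - b1 * x1.mean() - b2 * x2.mean()
    return b0, b1, b2  # estimated equation: y_hat = b0 + b1*x1 + b2*x2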
Exercise 3
The following dataset consists of employees' age, experience, and income.
Create a linear regression model to predict income based on age and
experience.
Check Your Progress
1 If the intercept (a0) in a simple linear regression model is 5, what
does this value represent in the context of the data?
2 What is the fundamental difference between simple linear regression
and multiple linear regression?
A The number of predictors
B The complexity of the model
C The type of response variable
D The presence of interactions
3 Why might one prefer multiple linear regression over simple linear
regression when modeling relationships between variables?
Polynomial Regression
Polynomial regression is a form of linear regression in which the
relationship between the independent variable x and the dependent
variable y is modelled as an nth-degree polynomial
In other words, instead of fitting a straight line (as in simple linear
regression) or a plane (as in multiple linear regression), polynomial
regression fits a curve to the data.
The general equation for polynomial regression of degree n is:
Y = a0 + a1·X + a2·X² + . . . + an·Xⁿ + ε
Here:
Y is the dependent variable.
X is the independent variable.
a0, a1, . . . , an are the coefficients of the polynomial terms.
ε is the error term.
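As a short sketch, the degree-n fit above can be done with NumPy's least-squares polynomial fit. Note that np.polyfit returns coefficients with the highest degree first, so they are reversed here to read as a0, a1, a2; the data are illustrative.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])     # illustrative predictor
y = np.array([2.1, 4.8, 9.7, 17.0, 26.2, 37.1])  # roughly quadratic response

coeffs = np.polyfit(x, y, deg=2)[::-1]  # least-squares fit, reversed -> [a0, a1, a2]
y_hat = sum(a * x ** i for i, a in enumerate(coeffs))  # Y = a0 + a1*X + a2*X^2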
Why do we use polynomial regression?
Polynomial regression is used when the relationship between the
independent variable(s) and the dependent variable is not linear but
can be better represented by a polynomial curve.
Here are some common scenarios where polynomial regression can be
beneficial:
1 Non-linear Relationships: When the relationship between the
variables is curvilinear or follows a pattern that cannot be captured by
a straight line.
2 Higher Order Patterns: Some relationships may exhibit higher-order
patterns, such as quadratic, cubic, or higher-degree behavior.
Polynomial regression can capture these patterns by including terms
like X², X³, etc.
3 Flexibility in Modeling: Polynomial regression provides a flexible
framework to model a wide range of relationships. By adjusting the
degree of the polynomial, you can fine-tune the model to better
represent the characteristics of the data.
When to Use Polynomial Regression?
We use polynomial regression when the relationship between a
predictor and response variable is nonlinear.
The easiest way to detect a nonlinear relationship is to create a
scatterplot of the response vs. predictor variable.
Check Your Progress
1 How does polynomial regression differ from linear regression?
A Polynomial regression uses a nonlinear relationship
B Polynomial regression can handle multiple predictors
C Polynomial regression is always more accurate
D Polynomial regression assumes homoscedasticity
2 Provide examples of real-world scenarios where polynomial regression
might outperform linear regression.
The Problem of Overfitting
Overfitting is an undesirable machine learning behavior that occurs
when the machine learning model gives accurate predictions for
training data but not for new data.
Overfitting is a common issue in machine learning where a model
learns not only the underlying patterns in the training data but also
captures noise and random fluctuations that are specific to that data.
In other words, an overfit model performs well on the training data
but fails to generalize effectively to new, unseen data.
The Problem of Overfitting
Underfitting
A model is said to underfit when it is too simple to capture the
complexities of the data
It represents the inability of the model to learn the training data
effectively, resulting in poor performance on both the training and
testing data
− In simple terms, an underfit model's predictions are inaccurate,
especially when applied to new, unseen examples
It mainly happens when we use a very simple model with overly
simplified assumptions
To address underfitting, we need to use more complex models, enhanced
feature representation, and less regularization
The Problem of Overfitting
Reasons for Underfitting
The model is too simple, so it may not be capable of representing the
complexities in the data
The input features used to train the model are not adequate
representations of the underlying factors influencing the target
variable.
The size of the training dataset used is not enough.
Features are not scaled.
Techniques to Reduce Underfitting
Increase model complexity.
Increase the number of features, performing feature engineering.
Remove noise from the data.
Increase the number of epochs or increase the duration of training to
get better results.
Overfitting
A model is said to be overfitted when it does not make accurate
predictions on testing data.
When a model fits the training data too closely, it starts learning from
the noise and inaccurate data entries in the data set.
− Testing on the test data then shows high variance.
− The model fails to categorize the data correctly because of too many
details and noise.
Overfitting is often caused by non-parametric and non-linear methods:
these types of machine learning algorithms have more freedom in
building the model from the dataset and can therefore build unrealistic
models.
In a nutshell, overfitting is a problem where the performance of a
machine learning algorithm on training data differs from its
performance on unseen data.
Addressing Overfitting
Question
Our goal when creating a model is to be able to use the model to predict
outcomes correctly for new examples. A model which does this is said to
generalize well.
When a model fits the training data well but does not work well with new
examples that are not in the training set, this is an example of —.
Reducing the problem of overfitting in machine learning involves
applying various techniques to ensure that the model generalizes well
to new, unseen data.
Here are some common techniques to address overfitting:
− Collect more training examples
− Select features to include/exclude
− Regularization
Regularization
Regularization is a technique used in machine learning to prevent
overfitting and improve the generalization performance of models
Reduce size of parameters
− effectively reducing the size of the parameter vector by driving some of
its components to zero
In regularization, the coefficient values are shrunk toward zero.
The regularization approach reduces the magnitude of the coefficients of
the independent variables while maintaining the same number of variables.
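As one concrete example of the idea, ridge (L2) regularization adds a penalty λ·Σaj² to the least-squares objective, which shrinks the coefficients toward zero. The closed-form sketch below assumes NumPy; it is illustrative and, for simplicity, also penalizes the intercept, which practical implementations usually avoid.

import numpy as np

def ridge_fit(X, y, lam=1.0):
    # Closed-form ridge regression: a = (X^T X + lam * I)^(-1) X^T y.
    # Larger lam shrinks the coefficient vector more strongly toward zero.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)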
Check Your Progress
1 What is underfitting in the context of machine learning models?
A Model fits the training data too closely.
B Model generalizes well to new, unseen data.
C Model fails to capture the underlying patterns in the data.
D Model exhibits high training accuracy.
2 Which of the following scenarios is indicative of overfitting?
A Low training error and low testing error.
B High training error and low testing error.
C Low training error and high testing error.
D High training error and high testing error.
Metrics used to evaluate regression
In regression problems, the prediction error is used to define the
model performance.
− The prediction error is also referred to as residuals and it is defined as
the difference between the actual and predicted values.
The regression model tries to fit a line that produces the smallest
difference between predicted and actual(measured) values.
Residual = actual value − predicted value
error(e) = y − ŷ
Metrics used to evaluate regression
There are several evaluation metrics commonly used to assess the
performance of regression models.
Mean Absolute Error (MAE)
It is the average of the absolute differences between the actual and
predicted values. MAE = (1/n) Σi |yi − ŷi|
Mean Squared Error (MSE)
It is the average of the squared differences between the actual and the
predicted values. MSE = (1/n) Σi (yi − ŷi)²
Root Mean Squared Error (RMSE)
It is the square root of the average squared difference between the
actual and the predicted values. RMSE = √( (1/n) Σi (yi − ŷi)² )
By taking the square root of MSE, we get the Root Mean Squared Error.
Metrics used to evaluate regression
R-squared (Coefficient of Determination)
R-squared explains to what extent the variance of one variable
explains the variance of the second variable.
In other words, it measures the proportion of variance of the
dependent variable explained by the independent variable.
R squared is a popular metric for identifying model accuracy.
− A larger R squared value indicates a better fit.
R² = 1 − ( Σi (yi − ŷi)² ) / ( Σi (yi − ȳ)² )
Adjusted R-Square
Adjusted R2 is the same as standard R2 except that it penalizes
models when additional features are added.
It measures the variation explained by only the independent variables
that actually affect the dependent variable.
Adjusted R² = 1 − ( (1 − R²)(n − 1) ) / (n − k − 1), where k is the
number of independent variables
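These metrics translate almost line-for-line into code. The sketch below assumes NumPy and is written to match the formulas above; the function name is illustrative. The same function can be used to check the exercise that follows.

import numpy as np

def regression_metrics(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    e = y - y_hat                                 # residuals
    mae = np.mean(np.abs(e))                      # Mean Absolute Error
    mse = np.mean(e ** 2)                         # Mean Squared Error
    rmse = np.sqrt(mse)                           # Root Mean Squared Error
    r2 = 1 - np.sum(e ** 2) / np.sum((y - np.mean(y)) ** 2)  # R-squared
    return mae, mse, rmse, r2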
Exercise
Suppose you have a dataset with 10 observations and two variables, X
(independent variable) and Y (dependent variable). You have built a linear
regression model and obtained the following predictions (Ŷ ) and actual
values (Y):
Y=[10,15,20,25,30,35,40,45,50,55] Ŷ =[12,18,22,28,33,37,41,47,51,56]
Compute the following regression evaluation metrics:
1 MAE
2 MSE
3 RMSE
4 R2
What is the primary characteristic of MAE?
A Emphasizes larger errors
B Sensitive to outliers
C Uses squared differences
D Provides a percentage measure
What is Classification?
Classification is a machine learning technique used to predict group
membership for data instances.
Given a collection of records (training set), each record contains a
set of attributes, one of the attributes is the class.
− Each record is characterized by a tuple (X,y), where X is the attribute
set and y is the class label.
X: attribute, predictor, independent variable, input
y: class, response, dependent variable, output
Task: Learn a model that maps each attribute set X into one of the
predefined class labels y
− Find a model for class attribute as a function of the values of other
attributes.
Goal: previously unseen records should be assigned a class as
accurately as possible.
− A test set is used to determine the accuracy of the model.
− Usually, the given data set is divided into training and test sets, with
training set used to build the model and test set used to validate it.
For example, one may use classification to predict whether the
weather on a particular day will be “sunny”, “rainy” or “cloudy”.
Examples of Classification Tasks
Email Spam Detection
− Given an email, predict whether it is spam or not
Handwritten Digit Recognition
− Given an image of a handwritten digit, classify it into one of the digits
from 0 to 9.
Disease Diagnosis
− Based on medical test results and patient information, predict whether
a patient has a specific disease (e.g., diabetes, cancer).
Sentiment Analysis
− Analyze a piece of text (e.g., a product review) and classify it as
positive, negative, or neutral sentiment.
Image Recognition
− Classify objects in an image into predefined categories (e.g., cat, dog,
car).
Credit Scoring
− Determine whether a person is likely to default on a loan based on their
credit history, income, and other relevant factors.
Customer Churn Prediction
− Predict whether a customer is likely to churn (stop using) a service
based on their usage patterns and demographics.
Classification Techniques
There are various classification methods. Popular classification techniques
include the following.
K-nearest neighbor: KNN is a non-parametric, lazy learning
algorithm where an instance is classified by the majority class of its
k-nearest neighbors.
Logistic Regression: Despite its name, logistic regression is used for
binary classification problems. It models the probability of an instance
belonging to a particular class.
Decision tree classifier: divides the decision space into
piecewise-constant regions.
Random Forest: Random Forest is an ensemble learning method
that constructs multiple decision trees during training and outputs the
class that is the mode of the classes from individual trees.
Support Vector Machines (SVM): SVM is a linear model that creates a
line or a hyperplane separating the data into classes.
Neural networks: partition by non-linear boundaries
Bayesian network: a probabilistic model
Check Your Progress
Question 1: What is the primary goal of a classification algorithm?
A Minimize computational complexity
B Predict continuous values
C Predict categorical labels for new instances
D Maximize feature dimensionality
Question 2: What is the primary goal of a classification algorithm in
handwritten digit recognition?
A Predicting colors of digits
B Identifying shapes of digits
C Predicting continuous values
D Recognizing the digit class (0-9)
What is KNN
K nearest neighbors is a simple algorithm used for both classification
and regression.
− It belongs to the family of lazy learning algorithms, as it doesn’t build
an explicit model during training.
It basically stores all available cases to classify the new cases by a
majority vote of its k neighbors.
− It memorizes the training instances and makes predictions based on the
majority class (for classification) or the average (for regression) of the
k-nearest neighbors in the feature space.
A case is assigned to the class most common among its K nearest
neighbors, as measured by a distance function (Euclidean, Manhattan, ...).
If K = 1, then the case is simply assigned to the class of its nearest
neighbor.
KNN
Steps involved in the KNN algorithm
1 Define the Value of K
2 Select the Distance Metric
3 Prepare the Data
4 Split the Dataset
5 Calculate Distances
6 Majority Voting (for Classification) or Weighted Average (for
Regression)
7 Make Predictions
8 Evaluate the Model
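A minimal sketch of these steps for classification, assuming NumPy; it uses Euclidean distance and unweighted majority voting, and the function name and data layout are illustrative.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    X_train = np.asarray(X_train, float)
    # Step 5: Euclidean distance from the new case to every training case.
    d = np.sqrt(((X_train - np.asarray(x_new, float)) ** 2).sum(axis=1))
    # Step 6: majority vote among the k nearest neighbors.
    nearest = np.argsort(d)[:k]
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]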
KNN
Example: Assume we have a dataset which can be plotted as follows.
Next, we need to classify a new data point (the black dot) into the blue
or red class. We assume K = 3, i.e., the algorithm finds the three
nearest data points.
The diagram shows the three nearest neighbors of the black dot.
Among those three, two lie in the red class; hence the black dot is also
assigned to the red class.
KNN:Example
Let L = {age = 7, sex = F}; what is the class label (recreation) of L?
Determine L's recreation class based on the following data records.
ED = √( (x1 − x2)² + (y1 − y2)² ), because we have only 2 attributes
We represent attribute values numerically: let female = 0 and male = 1
k = 3
KNN: Example
Find the ED of each and every person with L.
A = (56 − 7)² + (0 − 0)² = 2401, √2401 = 49
B = (34 − 7)² + (1 − 0)² = 730, √730 ≈ 27.02
C = (25 − 7)² + (0 − 0)² = 324, √324 = 18
D = (40 − 7)² + (1 − 0)² = 1090, √1090 ≈ 33.02
E = (35 − 7)² + (1 − 0)² = 785, √785 ≈ 28.02
F = (32 − 7)² + (0 − 0)² = 625, √625 = 25
G = (40 − 7)² + (0 − 0)² = 1089, √1089 = 33
H = (20 − 7)² + (1 − 0)² = 170, √170 ≈ 13.04
With k = 3, take the three smallest ED values: 13.04 (H, Neither),
18 (C, Neither), and 25 (F, football). Therefore L belongs to the class
Neither.
KNN: Key Concepts
Distance Metric: KNN relies on a distance metric (e.g., Euclidean
distance) to measure the similarity or dissimilarity between instances
in the feature space.
Parameter ’K’: The parameter ’K’ represents the number of
neighbors to consider when making a prediction. It is a crucial factor
that affects the model’s performance.
Decision Rule: For classification, the majority class among the
k-nearest neighbors determines the predicted class. For regression, the
average of the target values is used.
Advantages and Disadvantages of KNN
Advantages
Simple and easy to understand.
No training phase, making it computationally efficient during training.
Versatile and effective in a wide range of applications.
Disadvantages
Computationally expensive during prediction, especially with large
datasets.
Sensitive to irrelevant or redundant features.
Optimal choice of ’K’ may be task-dependent.
Check Your Progress
Question 1: What is the primary principle behind the K-Nearest
Neighbors (KNN) algorithm?
A Minimizing error residuals
B Finding centroids
C Predicting based on the majority class among k-neighbors
D Maximizing entropy
Question 2: In terms of computational complexity, during which phase
does KNN generally become more computationally expensive?
A Training phase
B Testing phase
C Feature extraction phase
D Model interpretation phase
What is Logistic Regression?
Logistic regression is a supervised machine learning algorithm used for
classification tasks where the goal is to predict the probability that an
instance belongs to a given class or not.
Logistic regression is used for binary classification, where we use the
sigmoid function, which takes the independent variables as input and
produces a probability value between 0 and 1.
Sigmoid Function/Logistic Function
The sigmoid function is a mathematical function used to map the
predicted values to probabilities.
It maps any real value into another value within a range of 0 and 1.
g(z) = 1 / (1 + e^(−z))
where
z = β0 + β1X1 + β2X2 + . . . + βnXn
e is the base of the natural logarithm (approximately 2.71828).
Logistic Regression-Key Points
The value of the logistic regression must be between 0 and 1, which
cannot go beyond this limit, so it forms a curve like the “S” form.
In Logistic regression, instead of fitting a regression line, we fit an “S”
shaped logistic function, which predicts two maximum values (0 or 1).
In logistic regression, we use the concept of a threshold value, which
defines the boundary between the classes 0 and 1: values above the
threshold tend to 1, and values below the threshold tend to 0.
Logistic regression predicts the output of a categorical dependent
variable.
− Once we have a probability, how do we actually classify the data?
Choose a threshold depending on the type of classification
problem we're solving:
y = 0 if g(z) < 0.5
y = 1 if g(z) ≥ 0.5
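A small sketch of the sigmoid and the 0.5 threshold rule, assuming NumPy; the β values below are illustrative placeholders rather than fitted coefficients.

import numpy as np

def sigmoid(z):
    # Maps any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([-1.0, 0.8, 0.5])  # illustrative beta0, beta1, beta2 (not fitted)
x = np.array([1.0, 2.0, 1.5])      # leading 1 multiplies the intercept beta0
p = sigmoid(beta @ x)              # probability the instance belongs to class 1
y = 1 if p >= 0.5 else 0           # threshold rule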
Logistic Regression
Advantages
Simple and Interpretable: Logistic regression is relatively simple to
understand and implement.
Computational Efficiency: Logistic regression is computationally
efficient and can handle large datasets efficiently.
Works well for Binary Classification: It is particularly effective
when the dependent variable is binary (two classes).
Less Prone to Overfitting: Logistic regression tends to be less
prone to overfitting compared to more complex models, making it a
good choice for smaller datasets.
Disadvantages
Assumption of Linearity
Sensitive to Outliers
Not Ideal for Multi-Class Classification
Check Your Progress
1 What is the primary objective of logistic regression?
A. Minimizing mean squared error
B. Maximizing likelihood
C. Minimizing regularization
D. Maximizing R-squared
2 If the output of a logistic regression model is 0.8 for a given
instance, how would you predict the class?
A. Class 0
B. Class 1
C. Insufficient information to determine
D. Class probability is not relevant for prediction
Decision Tree Based Classification
A decision tree is a flow-chart-like tree structure, where
− each internal node (nonleaf node) denotes a test on an attribute,
− each branch represents an outcome of the test, and
− each leaf node (or terminal node) holds a class label
Decision tree performs classification by constructing a tree based on
training instances with leaves having class labels.
− The tree is traversed for each test instance to find a leaf, and the class
of the leaf is the predicted class
Widely used learning method as it has been applied to:
− classify medical patients based on the disease,
− equipment malfunction by cause,
− loan applicant by likelihood of payment.
Decision tree learning: Algorithm
Aim: find a small tree consistent with the training examples
Idea: (recursively) choose ”most significant” attribute as root of
(sub)tree
− Tree is constructed in a top-down recursive divide-and-conquer manner
− At start, all the training examples/tuples are at the root
− Attributes are categorical (if continuous-valued, they are discretized in
advance)
− Examples are partitioned recursively based on selected attributes
− Optimal attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Conditions for stopping partitioning:
− All samples (tuples) for a given node belong to the same class
− There are no remaining attributes on which the tuples may be further
partitioned
− There are no samples (tuples) left for a given branch
Choosing the Splitting Attribute
At each node, the best attribute is selected for splitting the training
examples using a Goodness function
− The best attribute is the one that separates the classes of the training
examples fastest, resulting in the smallest tree
Typical goodness functions: information gain and GINI index
Information Gain: Select the attribute with the highest information
gain, i.e., the one that creates the smallest average disorder
− First, compute the disorder using Entropy; the expected information
needed to classify objects into classes
− Second, measure the Information Gain; to calculate by how much the
disorder of a set would reduce by knowing the value of a particular
attribute.
GINI index
− An alternative to information gain that measure impurity of attributes
in the classification task
− Select the attribute with the smallest GINI value.
Entropy
The Entropy measures the disorder of a set S containing a total of n
examples of which n+ are positive and n− are negative and it is given
by:
D(n+, n−) = −(n+/n) log2(n+/n) − (n−/n) log2(n−/n) = Entropy(S)   (1)
Some useful properties of the Entropy:
− D(n, m) = D(m, n)
− D(0, m) = D(m, 0) = 0
D(S)=0 means that all the examples in S have the same class
− D(m, m) = 1
D(S)=1 means that half the examples in S are of one class and half are
in the opposite class
Information Gain
Information gain is defined as the difference between the original
information requirement (i.e., based on just the proportion of classes)
and the new requirement (i.e., obtained after partitioning on A)
The Information Gain measures the expected reduction in entropy due
to splitting on an attribute A
GAINsplit = Entropy(S) − Σi=1..k (ni/n) · Entropy(i)   (2)
The parent node S is split into k partitions; ni is the number of records
in partition i
The attribute with the highest information gain is chosen as the
splitting attribute for node N
− This attribute minimizes the information needed to classify the tuples
in the resulting partitions and reflects the least randomness or
“impurity” in these partitions
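Equations (1) and (2) in code form; a minimal sketch using Python's math module, with the (pos, neg) class counts chosen purely for illustration.

from math import log2

def entropy(pos, neg):
    # D(n+, n-) from Eq. (1); terms with a zero count contribute 0.
    n = pos + neg
    return -sum(c / n * log2(c / n) for c in (pos, neg) if c > 0)

def info_gain(parent, partitions):
    # Eq. (2): parent and each partition are given as (pos, neg) counts.
    n = sum(parent)
    avg_disorder = sum((p + q) / n * entropy(p, q) for p, q in partitions)
    return entropy(*parent) - avg_disorder

# Illustrative: a 3+/5- parent split into partitions of 2+/2-, 1+/0-, 0+/3-
g = info_gain((3, 5), [(2, 2), (1, 0), (0, 3)])   # 0.954 - 0.5 = 0.454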
Example: The problem of “Sunburn”
You want to predict whether another person is likely to get sunburned
if they return to the beach. How can you do this?
Data collected: we predict based on the observed properties of the
people (hair color, height, weight, and lotion use)
Example: The problem of “Sunburn”
1 Attribute Selection by Information Gain to construct the
optimal decision tree
Entropy: the disorder of Sunburned
D({“Sarah”, “Dana”, “Alex”, “Annie”, “Emily”, “Pete”, “John”, “Katie”})
= D(3+, 5−) = −(3/8) log2(3/8) − (5/8) log2(5/8) = 0.954
2 Calculate the Average Disorder Associated with Hair Color
The first term of the sum: D(Sblonde) =
D({“Sarah”, “Annie”, “Dana”, “Katie”}) = D(2+, 2−) = 1
(|Sblonde| / |S|) · D(Sblonde) = (4/8) · 1 = 0.5
3 The second and third terms of the sum:
Sred = {“Emily”}
Sbrown = {“Alex”, “Pete”, “John”}
− These are both 0 because within each set all the examples have the
same class
− So the average disorder created when splitting on “hair color” is
0.5 + 0 + 0 = 0.5
Example: The problem of “Sunburn”
Which decision variable minimizes the disorder?
Test | Average Disorder
hair | 0.50
height | 0.69
weight | 0.94
lotion | 0.61
Which decision variable maximizes the Info Gain then?
Remember, it's the one which minimizes the average disorder.
− Gain(hair) = 0.954 − 0.50 = 0.454
− Gain(height) = 0.954 − 0.69 = 0.264
− Gain(weight) = 0.954 − 0.94 = 0.014
− Gain(lotion) = 0.954 − 0.61 = 0.344
The best decision tree?
Example: The problem of “Sunburn”
Once we have finished with hair colour we then need to calculate the
remaining branches of the decision tree.
Which attribute is best to classify the remaining examples?
This is the simplest and optimal one possible and it makes a lot of
sense.
It classifies 4 of the people on just the hair color alone.
Example: The problem of “Sunburn”
You can view Decision Tree as an IF-THEN-ELSE statement which
tells us whether someone will suffer from sunburn.
if (Hair-Colour=“red”) then
return(sunburned = yes)
else if (hair-colour=“blonde” and lotion-used=“No”) then
return(sunburned = yes)
else
return(sunburned = no)
end if
Rule Extraction from a decision tree
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
Exercise: Decision Tree for “buy computer or not”. Use the training
dataset given below to construct a decision tree.
Why decision tree in Machine Learning?
Relatively faster learning speed (than other classification methods)
Convertible to simple and easy to understand classification if-then-else
rules
Comparable classification accuracy with other methods
Does not require any prior knowledge of data distribution, works well
on noisy data.
Pros
Reasonable training time
Easy to implement
Easy to interpret
Can handle mixed data types
Cons
Cannot handle complicated relationships between features
Simple decision boundaries
Problems with lots of missing data
Check Your Progress
1 What is the role of the root node in a decision tree?
A It is the node with the highest impurity
B It is the node where predictions are made
C It is the node where the tree is pruned
D It is the topmost node where the first decision is made
2 Which technique helps to reduce the complexity of a decision
tree and prevent overfitting?
A Feature scaling
B Regularization
C Pruning
D Data normalization
3 Which of the following is an advantage of decision trees?
A Sensitivity to feature scales
B Difficulty handling non-linear relationships
C Limited interpretability
D Easy to interpret and understand
Metrics used to Evaluate Classifiers
Here are some key metrics commonly used for classifier evaluation:
1 Accuracy:
Definition: The proportion of correctly classified instances out of the
total instances.
Formula: Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
Use Case: Suitable for balanced datasets but may be misleading in the
presence of imbalanced classes.
2 Precision:
Definition: The proportion of true positive predictions out of all
positive predictions.
Formula: Precision = True Positives / (True Positives + False Positives)
Use Case: Emphasizes the accuracy of positive predictions, useful
when the cost of false positives is high.
3 Recall (Sensitivity or True Positive Rate):
Definition: The proportion of true positive predictions out of all actual
positives.
Formula: Recall = True Positives / (True Positives + False Negatives)
Use Case: Emphasizes capturing all positive instances, useful when
the cost of false negatives is high.
Metrics used to Evaluate Classifiers
4 F1 Score:
Definition: The harmonic mean of precision and recall.
Formula: F1 Score = (2 × Precision × Recall) / (Precision + Recall)
Use Case: Balances precision and recall, especially in imbalanced
datasets.
5 Specificity (True Negative Rate):
Definition: The proportion of true negative predictions out of all
actual negatives.
Formula: Specificity = True Negatives / (True Negatives + False Positives)
Use Case: Emphasizes the accuracy of negative predictions.
6 Area Under the ROC Curve (AUC-ROC):
Definition: The area under the Receiver Operating Characteristic
(ROC) curve.
Use Case: Provides a comprehensive measure of a classifier’s
performance across different decision thresholds.
7 Confusion Matrix:
Definition: A table summarizing true positives, true negatives, false
positives, and false negatives.
Use Case: Provides a detailed breakdown of classifier performance.
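The formulas above reduce to a few lines of code. The sketch below takes the four confusion-matrix counts directly; the function name is illustrative. It can be applied to the confusion matrix in the exercise that follows.

def classifier_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                              # sensitivity / true positive rate
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return accuracy, precision, recall, f1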
Metrics used to Evaluate Classifiers
Consider a binary classification problem where a model is trained to
predict whether an online transaction is fraudulent (Positive class) or not
fraudulent (Negative class). The model is evaluated on a test dataset
using a confusion matrix:
                           Actual Fraudulent | Actual Not Fraudulent
Predicted Fraudulent               120       |          10
Predicted Not Fraudulent            15       |         850
Compute the following metrics of the model:
Accuracy:
Precision:
Recall:
F1 Score:
  • 5. How does supervised learning work? Supervised learning involve two steps: Model construction and Model usage Model construction A pair of input objects (tuples) and desired output value used to train the model The set of tuples used for model construction is training set The model can be represented as classification rules, decision trees, or mathematical formulae. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 5 / 76
  • 6. How does supervised learning work? Model Usage For predicting future or unknown objects The known label of test sample is compared with the predicted result from the model Accuracy rate is the percentage of test set samples that are correctly predicted by the model Test set is independent of training set Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 6 / 76
  • 7. How does supervised learning work? Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 7 / 76
  • 8. Outline 1 Introduction 2 Regression 3 Classification 4 k- Nearest Neighbors (KNN) 5 Logistic Regression 6 Decision Tree Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 8 / 76
one or more independent variables. − Logistic Regression: Predict the probability that an instance belongs to a particular category (binary classification). Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 9 / 76
  • 10. Linear regression Linear regression, a staple of classical statistical modeling, is one of the simplest algorithms for doing supervised learning. It serves as a good starting point for more advanced approaches, and it is important to have a good understanding of linear regression before studying more complex learning methods. Simple Linear Regression Model The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (X). The relationship between the variables is described by a linear function: a change in one variable causes the other variable to change. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 10 / 76
  • 11. Linear Regression The linear regression model provides a sloped straight line representing the relationship between the variables. Mathematically, we can represent a linear regression as: $y = a_0 + a_1 x + \varepsilon$, where y = dependent variable (target variable), X = independent variable (predictor variable), $a_0$ = intercept of the line (gives an additional degree of freedom), $a_1$ = linear regression coefficient (scale factor applied to each input value), and $\varepsilon$ = random error. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 11 / 76
  • 12. Linear Regression Function $y = a_0 + a_1 x$. Slope of the regression line: $a_1 = r \frac{S_y}{S_x}$, where $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$, $S_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$, $S_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$. Y-intercept of the regression line: $a_0 = \bar{y} - a_1 \bar{x}$. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 12 / 76
  • 13. Simple Linear Regression by Hand - Method 1 There are just a handful of steps in linear regression. 1 Calculate the average (mean) of your X variable. 2 Calculate the difference between each X and the average $\bar{x}$. 3 Square the differences and add them all up. 4 Calculate the average (mean) of your Y variable. 5 Calculate the difference between each Y and the average $\bar{y}$. 6 Square the differences and add them all up. 7 Multiply the differences (of X and Y from their respective averages) and add them all together. 8 Calculate r. 9 Calculate $a_1$ using r, $S_y$ and $S_x$. 10 Calculate $a_0$ using $\bar{y}$, $a_1$ and $\bar{x}$. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 13 / 76
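To make Method 1 concrete, here is a minimal Python sketch (not part of the original slides) that follows the ten steps above; the arrays x and y are hypothetical values chosen for illustration:

```python
import numpy as np

# Hypothetical data: x = predictor, y = response
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()      # steps 1 and 4: means of X and Y
dx, dy = x - x_bar, y - y_bar          # steps 2 and 5: deviations from the means

# steps 3, 6, 7: sums of squared deviations and cross products
ss_x, ss_y, ss_xy = (dx**2).sum(), (dy**2).sum(), (dx * dy).sum()

r = ss_xy / np.sqrt(ss_x * ss_y)       # step 8: correlation coefficient
s_x = np.sqrt(ss_x / (n - 1))          # sample standard deviation of X
s_y = np.sqrt(ss_y / (n - 1))          # sample standard deviation of Y

a1 = r * s_y / s_x                     # step 9: slope
a0 = y_bar - a1 * x_bar                # step 10: intercept
print(f"y = {a0:.3f} + {a1:.3f}x  (r = {r:.3f})")
```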
  • 14. Simple Linear Regression by Hand - Method 2 Step 1: Calculate $X \cdot Y$, $X^2$, and $Y^2$. Step 2: Calculate $\Sigma X$, $\Sigma Y$, $\Sigma XY$, $\Sigma X^2$, and $\Sigma Y^2$. Step 3: Calculate $a_0$: $a_0 = \frac{(\Sigma Y)(\Sigma X^2) - (\Sigma X)(\Sigma XY)}{n(\Sigma X^2) - (\Sigma X)^2}$. Step 4: Calculate $a_1$: $a_1 = \frac{n(\Sigma XY) - (\Sigma X)(\Sigma Y)}{n(\Sigma X^2) - (\Sigma X)^2}$. Step 5: Place $a_0$ and $a_1$ in the estimated linear regression equation: $\hat{y} = a_0 + a_1 x$. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 14 / 76
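The same coefficients can be obtained directly from the raw sums of Method 2. A short sketch under the same assumptions (hypothetical data, NumPy used only for convenience):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor values
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])   # hypothetical response values
n = len(X)

# Steps 1-2: the required sums
sum_x, sum_y = X.sum(), Y.sum()
sum_xy, sum_x2 = (X * Y).sum(), (X**2).sum()

# Steps 3-4: closed-form solutions for the intercept and slope
a0 = (sum_y * sum_x2 - sum_x * sum_xy) / (n * sum_x2 - sum_x**2)
a1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)

# Step 5: the estimated regression equation
print(f"estimated equation: y-hat = {a0:.3f} + {a1:.3f}x")
```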
  • 15. Example For this example, we use the salary.csv dataset. The dataset contains the following two variables for 30 employees. Years of Experience Salary We want to create a simple linear regression model using years of experience as the explanatory variable and salary as the response variable. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 15 / 76
  • 16. How to Make Predictions with Linear Regression? Linear regression is a method we can use to quantify the relationship between one or more predictor variables and a response variable. One of the most common reasons for fitting a regression model is to use the model to predict the values of new observations. Steps to make predictions with a regression model 1 Collect the data. 2 Fit a regression model to the data. 3 Verify that the model fits the data well. 4 Use the fitted regression equation to predict the values of new observations. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 16 / 76
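As an illustration of these four steps in code, here is a sketch using scikit-learn (an assumption of this example, not a requirement of the chapter); the experience/salary numbers below are made up rather than taken from salary.csv:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Step 1: collect the data (hypothetical years-of-experience / salary pairs)
X = np.array([[1], [3], [5], [7], [10]])   # predictors must be 2-D for sklearn
y = np.array([40000, 52000, 63000, 75000, 98000])

# Step 2: fit a regression model to the data
model = LinearRegression().fit(X, y)

# Step 3: a quick check that the model fits the data reasonably well
print("R^2 on training data:", model.score(X, y))

# Step 4: use the fitted equation to predict a new observation (6 years)
print("predicted salary:", model.predict([[6]])[0])
```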
  • 17. Exercise 1 The following dataset gives 12 months of marketing budget (Spend) and Sales. Create a simple linear regression model using Spend as the explanatory variable and Sales as the response variable.
Month Spend Sales
1 1000 9914
2 4000 40487
3 5000 54324
4 4500 50044
5 3000 34719
6 4000 42551
7 9000 94871
8 11000 118914
9 15000 158484
10 12000 131348
11 7000 78504
12 3000 36284
Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 17 / 76
  • 18. Exercise 2 The following dataset consists of students' study hours and corresponding academic scores. Create a linear regression model to predict marks (scores) based on the time spent studying. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 18 / 76
  • 19. Multiple Linear Regression Multiple linear regression is used to model the relationship between multiple independent variables (also called predictors or features) and a single dependent variable. In contrast to simple linear regression, which involves only one independent variable, multiple linear regression considers two or more predictors. The general form of a multiple linear regression model with p predictors is given by the equation: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon$ Here: − Y is the dependent variable (the variable you are trying to predict). − $\beta_0$ is the intercept term (the value of Y when all predictors are zero). − $\beta_1, \beta_2, \ldots, \beta_p$ are the coefficients, representing the change in Y for a one-unit change in the corresponding predictor. − $X_1, X_2, \ldots, X_p$ are the independent variables (predictors). − $\varepsilon$ is the error term, representing the unobserved factors that affect Y but are not included in the model. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 19 / 76
  • 20. Multiple Linear Regression by Hand Suppose we have a dataset with one response variable y and two predictor variables $X_1$ and $X_2$. Use the following steps to fit a multiple linear regression model. − Step 1: Calculate $X_1^2$, $X_2^2$, $X_1 y$, $X_2 y$, and $X_1 X_2$. − Step 2: Calculate the regression sums: $\sum x_1^2 = \sum X_1^2 - (\sum X_1)^2 / n$, $\sum x_2^2 = \sum X_2^2 - (\sum X_2)^2 / n$, $\sum x_1 y = \sum X_1 y - (\sum X_1)(\sum y) / n$, $\sum x_2 y = \sum X_2 y - (\sum X_2)(\sum y) / n$, $\sum x_1 x_2 = \sum X_1 X_2 - (\sum X_1)(\sum X_2) / n$. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 20 / 76
  • 21. Multiple Linear Regression by Hand Step 3: Calculate $b_0$, $b_1$, and $b_2$. The formula to calculate $b_1$ is: $b_1 = \frac{(\sum x_2^2)(\sum x_1 y) - (\sum x_1 x_2)(\sum x_2 y)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2}$. The formula to calculate $b_2$ is: $b_2 = \frac{(\sum x_1^2)(\sum x_2 y) - (\sum x_1 x_2)(\sum x_1 y)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2}$. The formula to calculate $b_0$ is: $b_0 = \bar{y} - b_1 \bar{X}_1 - b_2 \bar{X}_2$. Step 4: Place $b_0$, $b_1$, and $b_2$ in the estimated linear regression equation: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 21 / 76
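A small sketch of these steps for two predictors; the dataset below is hypothetical and constructed so the true relationship is exactly y = 1 + 2*x1 + 1*x2:

```python
import numpy as np

# Hypothetical dataset: two predictors and one response
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([5.0, 6.0, 11.0, 12.0, 16.0])
n = len(y)

# Step 2: regression sums (deviation form)
sx1x1 = (X1**2).sum() - X1.sum()**2 / n
sx2x2 = (X2**2).sum() - X2.sum()**2 / n
sx1y  = (X1 * y).sum() - X1.sum() * y.sum() / n
sx2y  = (X2 * y).sum() - X2.sum() * y.sum() / n
sx1x2 = (X1 * X2).sum() - X1.sum() * X2.sum() / n

# Step 3: coefficients from the two-predictor normal equations
den = sx1x1 * sx2x2 - sx1x2**2
b1 = (sx2x2 * sx1y - sx1x2 * sx2y) / den
b2 = (sx1x1 * sx2y - sx1x2 * sx1y) / den
b0 = y.mean() - b1 * X1.mean() - b2 * X2.mean()

print(f"y-hat = {b0:.3f} + {b1:.3f}*x1 + {b2:.3f}*x2")  # recovers 1, 2, 1
```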
  • 22. Exercise 3 The following dataset consists of employees' age, experience, and income. Create a linear regression model to predict income based on age and experience. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 22 / 76
  • 23. Check Your Progress 1 If the intercept (a0) in a simple linear regression model is 5, what does this value represent in the context of the data? 2 What is the fundamental difference between simple linear regression and multiple linear regression? A The number of predictors B The complexity of the model C The type of response variable D The presence of interactions 3 Why might one prefer multiple linear regression over simple linear regression when modeling relationships between variables? Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 23 / 76
  • 24. Polynomial Regression Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial. In other words, instead of fitting a straight line (as in simple linear regression) or a plane (as in multiple linear regression), polynomial regression fits a curve to the data. The general equation for polynomial regression of degree n is: $Y = a_0 + a_1 X + a_2 X^2 + \ldots + a_n X^n + \varepsilon$ Here: Y is the dependent variable. X is the independent variable. $a_0, a_1, \ldots, a_n$ are the coefficients of the polynomial terms. $\varepsilon$ is the error term. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 24 / 76
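As an illustration, a degree-2 polynomial can be fitted with NumPy's polyfit; the data below is hypothetical and chosen to be roughly quadratic:

```python
import numpy as np

# Hypothetical data with a clearly curved (quadratic) relationship
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.0, 1.8, 5.1, 10.2, 17.0, 26.1])

# Fit Y = a0 + a1*X + a2*X^2; polyfit returns coefficients highest degree first
a2, a1, a0 = np.polyfit(X, Y, deg=2)
print(f"Y-hat = {a0:.2f} + {a1:.2f}*X + {a2:.2f}*X^2")

# Predict at a new point using the fitted curve
x_new = 6.0
print("prediction at x = 6:", a0 + a1 * x_new + a2 * x_new**2)
```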
  • 25. Why do we use polynomial regression? Polynomial regression is used when the relationship between the independent variable(s) and the dependent variable is not linear but can be better represented by a polynomial curve. Here are some common scenarios where polynomial regression can be beneficial: 1 Non-linear Relationships: When the relationship between the variables is curvilinear or follows a pattern that cannot be captured by a straight line. 2 Higher Order Patterns: Some relationships may exhibit higher-order patterns, such as quadratic, cubic, or higher-degree behavior. Polynomial regression can capture these patterns by including terms like $X^2$, $X^3$, etc. 3 Flexibility in Modeling: Polynomial regression provides a flexible framework to model a wide range of relationships. By adjusting the degree of the polynomial, you can fine-tune the model to better represent the characteristics of the data. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 25 / 76
  • 26. When to Use Polynomial Regression? We use polynomial regression when the relationship between a predictor and response variable is nonlinear. The easiest way to detect a nonlinear relationship is to create a scatterplot of the response vs. predictor variable. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 26 / 76
  • 27. Check Your Progress 1 How does polynomial regression differ from linear regression? A Polynomial regression uses a nonlinear relationship B Polynomial regression can handle multiple predictors C Polynomial regression is always more accurate D Polynomial regression assumes homoscedasticity 2 Provide examples of real-world scenarios where polynomial regression might outperform linear regression. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 27 / 76
  • 28. The Problem of Overfitting Overfitting is an undesirable machine learning behavior that occurs when the machine learning model gives accurate predictions for training data but not for new data. Overfitting is a common issue in machine learning where a model learns not only the underlying patterns in the training data but also captures noise and random fluctuations that are specific to that data. In other words, an overfit model performs well on the training data but fails to generalize effectively to new, unseen data. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 28 / 76
  • 29. The Problem of Overfitting Underfitting A model is said to underfit when it is too simple to capture the complexities of the data. It represents the inability of the model to learn the training data effectively, resulting in poor performance on both the training and testing data. − In simple terms, an underfit model's predictions are inaccurate, especially when applied to new, unseen examples. It mainly happens when we use a very simple model with overly simplified assumptions. To address underfitting, we need to use more complex models with enhanced feature representation and less regularization. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 29 / 76
  • 30. The Problem of Overfitting Reasons for Underfitting The model is too simple, so it may not be capable of representing the complexities in the data. The input features used to train the model are not adequate representations of the underlying factors influencing the target variable. The size of the training dataset is not large enough. Features are not scaled. Techniques to Reduce Underfitting Increase model complexity. Increase the number of features by performing feature engineering. Remove noise from the data. Increase the number of epochs or the duration of training to get better results. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 30 / 76
  • 31. Overfitting A model is said to be overfitted when it does not make accurate predictions on testing data. When a model gets trained with so much data, it starts learning from the noise and inaccurate data entries in our data set. − Testing with test data then results in high variance. − The model does not categorize the data correctly, because of too many details and noise. The causes of overfitting are the non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model based on the dataset and can therefore build unrealistic models. In a nutshell, overfitting is a problem where the evaluation of machine learning algorithms on training data differs from that on unseen data. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 31 / 76
  • 32. Addressing Overfitting Question Our goal when creating a model is to be able to use the model to predict outcomes correctly for new examples. A model which does this is said to generalize well. When a model fits the training data well but does not work well with new examples that are not in the training set, this is an example of —. Reducing the problem of overfitting in machine learning involves applying various techniques to ensure that the model generalizes well to new, unseen data. Here are some common techniques to address overfitting: − Collect more training examples − Select features to include /exclude − Regularization Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 32 / 76
  • 33. Regularization Regularization is a technique used in machine learning to prevent overfitting and improve the generalization performance of models. It reduces the size of the parameters − effectively shrinking the parameter vector by driving some of its components toward zero. In regularization, the coefficient values are shrunk toward zero. The regularization approach reduces the magnitude of the independent factors' coefficients while maintaining the same number of variables. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 33 / 76
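As a sketch of how regularization shrinks coefficients in practice, the example below compares plain least squares with scikit-learn's Ridge and Lasso on synthetic data; the alpha values are illustrative choices, not recommendations:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                 # 5 hypothetical features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=30)  # only 2 matter

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    # Ridge shrinks all coefficients; Lasso can drive irrelevant ones to exactly 0
    print(type(model).__name__, np.round(model.coef_, 2))
```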
  • 34. Check Your Progress 1 What is underfitting in the context of machine learning models? A Model fits the training data too closely. B Model generalizes well to new, unseen data. C Model fails to capture the underlying patterns in the data. D Model exhibits high training accuracy. 2 Which of the following scenarios is indicative of overfitting? A Low training error and low testing error. B High training error and low testing error. C Low training error and high testing error. D High training error and high testing error. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 34 / 76
  • 35. Metrics used to evaluate regression In regression problems, the prediction error is used to define the model performance. − The prediction error is also referred to as residuals and it is defined as the difference between the actual and predicted values. The regression model tries to fit a line that produces the smallest difference between predicted and actual(measured) values. Residual = actual value − predicted value error(e) = y − ŷ Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 35 / 76
  • 36. Metrics used to evaluate regression There are several evaluation metrics commonly used to assess the performance of regression models. Mean Absolute Error (MAE) It is the average of the absolute differences between the actual and predicted values. $MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$ Mean Squared Error (MSE) It is the average of the squared differences between the actual and the predicted values. $MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ Root Mean Squared Error (RMSE) It is the square root of the average squared difference between the actual and predicted values. $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$ By taking the square root of MSE, we get the Root Mean Square Error. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 36 / 76
  • 37. Metrics used to evaluate regression R-squared (Coefficient of Determination) R-squared explains to what extent the variance of one variable explains the variance of the second variable. In other words, it measures the proportion of variance of the dependent variable explained by the independent variable. R-squared is a popular metric for identifying model accuracy. − A larger R-squared value indicates a better fit. $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$ Adjusted R-Square Adjusted $R^2$ is the same as standard $R^2$ except that it penalizes models when additional features are added. It measures the variation explained by only the independent variables that actually affect the dependent variable. Adjusted $R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$ Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 37 / 76
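All of these metrics can be computed in a few lines; a minimal sketch with hypothetical actual and predicted values (k is the assumed number of predictors):

```python
import numpy as np

y     = np.array([10, 15, 20, 25, 30], dtype=float)  # hypothetical actual values
y_hat = np.array([12, 14, 21, 24, 31], dtype=float)  # hypothetical predictions
n, k  = len(y), 1                                    # n observations, k predictors

residuals = y - y_hat
mae  = np.mean(np.abs(residuals))                    # Mean Absolute Error
mse  = np.mean(residuals**2)                         # Mean Squared Error
rmse = np.sqrt(mse)                                  # Root Mean Squared Error
r2   = 1 - (residuals**2).sum() / ((y - y.mean())**2).sum()
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)        # penalizes extra features

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f} adjR2={adj_r2:.3f}")
```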
  • 38. Exercise Suppose you have a dataset with 10 observations and two variables, X (independent variable) and Y (dependent variable). You have built a linear regression model and obtained the following predictions (Ŷ ) and actual values (Y): Y=[10,15,20,25,30,35,40,45,50,55] Ŷ =[12,18,22,28,33,37,41,47,51,56] Compute the following regression evaluation metrics: 1 MAE 2 MSE 3 RMSE 4 R2 What is the primary characteristic of MAE? A Emphasizes larger errors B Sensitive to outliers C Uses squared differences D Provides a percentage measure Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 38 / 76
  • 39. Outline 1 Introduction 2 Regression 3 Classification 4 k- Nearest Neighbors (KNN) 5 Logistic Regression 6 Decision Tree Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 39 / 76
  • 40. Definition Classification is a machine learning technique used to predict group membership for data instances. Given a collection of records (training set), each record contains a set of attributes, one of the attributes is the class. − Each record is characterized by a tuple (X,y), where X is the attribute set and y is the class label. X: attribute, predictor, independent variable, input y: class, response, dependent variable, output Task: Learn a model that maps each attribute set X into one of the predefined class label y − Find a model for class attribute as a function of the values of other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. − A test set is used to determine the accuracy of the model. − Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. For example, one may use classification to predict whether the weather on a particular day will be “sunny”, “rainy” or “cloudy”. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 40 / 76
  • 41. Examples of Classification Tasks Email Spam Detection − Given an email, predict whether it is spam or not Handwritten Digit Recognition − Given an image of a handwritten digit, classify it into one of the digits from 0 to 9. Disease Diagnosis − Based on medical test results and patient information, predict whether a patient has a specific disease (e.g., diabetes, cancer). Sentiment Analysis − Analyze a piece of text (e.g., a product review) and classify it as positive, negative, or neutral sentiment. Image Recognition − Classify objects in an image into predefined categories (e.g., cat, dog, car). Credit Scoring − Determine whether a person is likely to default on a loan based on their credit history, income, and other relevant factors. Customer Churn Prediction − Predict whether a customer is likely to churn (stop using) a service based on their usage patterns and demographics. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 41 / 76
  • 42. Illustrating Classification Task Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 42 / 76
  • 43. Classification Techniques There are various classification methods. Popular classification techniques include the following. K-nearest neighbor: KNN is a non-parametric, lazy learning algorithm where an instance is classified by the majority class of its k-nearest neighbors. Logistic Regression: Despite its name, logistic regression is used for binary classification problems. It models the probability of an instance belonging to a particular class. Decision tree classifier: divides the decision space into piecewise constant regions. Random Forest: Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the class that is the mode of the classes from individual trees. Support Vector Machines (SVM): SVM is a linear model that creates a line or a hyperplane which separates the data into classes. Neural networks: partition by non-linear boundaries. Bayesian network: a probabilistic model. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 43 / 76
  • 44. Check Your Progress Question 1: What is the primary goal of a classification algorithm? A Minimize computational complexity B Predict continuous values C Predict categorical labels for new instances D Maximize feature dimensionality Question 2: What is the primary goal of a classification algorithm in handwritten digit recognition? A Predicting colors of digits B Identifying shapes of digits C Predicting continuous values D Recognizing the digit class (0-9) Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 44 / 76
  • 45. Outline 1 Introduction 2 Regression 3 Classification 4 k- Nearest Neighbors (KNN) 5 Logistic Regression 6 Decision Tree Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 45 / 76
  • 46. What is KNN K-nearest neighbors is a simple algorithm used for both classification and regression. − It belongs to the family of lazy learning algorithms, as it doesn't build an explicit model during training. It basically stores all available cases and classifies new cases by a majority vote of their k neighbors. − It memorizes the training instances and makes predictions based on the majority class (for classification) or the average (for regression) of the k-nearest neighbors in the feature space. A case is assigned to the class most common amongst its K nearest neighbors, measured by a distance function (Euclidean, Manhattan, ...). If K = 1, then the case is simply assigned to the class of its nearest neighbor. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 46 / 76
  • 47. KNN Steps involved in the KNN algorithm 1 Define the Value of K 2 Select the Distance Metric 3 Prepare the Data 4 Split the Dataset 5 Calculate Distances 6 Majority Voting (for Classification) or Weighted Average (for Regression) 7 Make Predictions 8 Evaluate the Model Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 47 / 76
  • 48. KNN Example: Assume we have a dataset which can be plotted as follows. Next, we need to classify a new data point (black dot) into the blue or red class. We are assuming K = 3, i.e., the algorithm finds the three nearest data points. We can see in the diagram the three nearest neighbors of the black dot. Among those three, two of them lie in the red class, hence the black dot will also be assigned to the red class. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 48 / 76
  • 49. KNN: Example Let L = {age = 7, sex = F}; what is the class label of L's recreation? Determine L's recreation class based on the following data records. $ED = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ because we have only 2 attributes. We should represent attribute values numerically, so let female = 0 and male = 1, with k = 3. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 49 / 76
  • 50. KNN: Example Find the ED of each and every person with L. A: $(56 - 7)^2 + (0 - 0)^2 = 2401$, then $\sqrt{2401} = 49$. B: $(34 - 7)^2 + (1 - 0)^2 = 730$, then $\sqrt{730} \approx 27.02$. C: $(25 - 7)^2 + (0 - 0)^2 = 324$, then $\sqrt{324} = 18$. D: $(40 - 7)^2 + (1 - 0)^2 = 1090$, then $\sqrt{1090} \approx 33.02$. E: $(35 - 7)^2 + (1 - 0)^2 = 785$, then $\sqrt{785} \approx 28.02$. F: $(32 - 7)^2 + (0 - 0)^2 = 625$, then $\sqrt{625} = 25$. G: $(40 - 7)^2 + (0 - 0)^2 = 1089$, then $\sqrt{1089} = 33$. H: $(20 - 7)^2 + (1 - 0)^2 = 170$, then $\sqrt{170} \approx 13.04$. With k = 3, take the three smallest ED values: 13.04 (H, Neither), 18 (C, Neither), and 25 (F, Football). By majority vote, L belongs to the class Neither. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 50 / 76
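The same calculation can be reproduced in a short Python sketch; the encoding (female = 0, male = 1) follows the slide, and only the class labels the slide states for H, C, and F are used:

```python
import numpy as np

query = np.array([7, 0])  # L: age = 7, sex = F (encoded as 0)

# (age, sex) pairs from the table
points = {"A": (56, 0), "B": (34, 1), "C": (25, 0), "D": (40, 1),
          "E": (35, 1), "F": (32, 0), "G": (40, 0), "H": (20, 1)}
labels = {"C": "Neither", "F": "Football", "H": "Neither"}  # as given on the slide

# Euclidean distance of every person to L
dists = {name: float(np.linalg.norm(np.array(p) - query))
         for name, p in points.items()}
for name, d in sorted(dists.items(), key=lambda kv: kv[1]):
    print(f"{name}: {d:.2f}")

# k = 3 nearest are H, C, F; two of the three votes are "Neither"
k3 = sorted(dists, key=dists.get)[:3]
votes = [labels[n] for n in k3]
print("3-NN:", k3, "-> majority:", max(set(votes), key=votes.count))
```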
  • 51. KNN: Key Concepts Distance Metric: KNN relies on a distance metric (e.g., Euclidean distance) to measure the similarity or dissimilarity between instances in the feature space. Parameter ’K’: The parameter ’K’ represents the number of neighbors to consider when making a prediction. It is a crucial factor that affects the model’s performance. Decision Rule: For classification, the majority class among the k-nearest neighbors determines the predicted class. For regression, the average of the target values is used. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 51 / 76
  • 52. Advantages and Disadvantages of KNN Advantages Simple and easy to understand. No training phase, making it computationally efficient during training. Versatile and effective in a wide range of applications. Disadvantage Computationally expensive during prediction, especially with large datasets. Sensitive to irrelevant or redundant features. Optimal choice of ’K’ may be task-dependent. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 52 / 76
  • 53. Check Your Progress Question 1: What is the primary principle behind the K-Nearest Neighbors (KNN) algorithm? A Minimizing error residuals B Finding centroids C Predicting based on the majority class among k-neighbors D Maximizing entropy Question 2: In terms of computational complexity, during which phase does KNN generally become more computationally expensive? A Training phase B Testing phase C Feature extraction phase D Model interpretation phase Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 53 / 76
  • 54. Outline 1 Introduction 2 Regression 3 Classification 4 k- Nearest Neighbors (KNN) 5 Logistic Regression 6 Decision Tree Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 54 / 76
  • 55. What is Logistic Regression? Logistic regression is a supervised machine learning algorithm used for classification tasks where the goal is to predict the probability that an instance belongs to a given class or not. Logistic regression is used for binary classification, where we use the sigmoid function, which takes the independent variables as input and produces a probability value between 0 and 1. Sigmoid Function/Logistic Function The sigmoid function is a mathematical function used to map the predicted values to probabilities. It maps any real value into another value within a range of 0 and 1. $g(z) = \frac{1}{1 + e^{-z}}$ where $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n$ and e is the base of the natural logarithm (approximately 2.71828). Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 55 / 76
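A one-function sketch of the sigmoid, showing how it squashes any real input into (0, 1):

```python
import numpy as np

def sigmoid(z):
    """Map any real value z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative z -> near 0, z = 0 -> exactly 0.5, large positive z -> near 1
for z in (-5.0, 0.0, 5.0):
    print(f"g({z}) = {sigmoid(z):.4f}")
```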
  • 56. Sigmoid Function Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 56 / 76
  • 57. Logistic Regression-Key Points The value of the logistic regression must be between 0 and 1, and cannot go beyond this limit, so it forms a curve like the "S" form. In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1). In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0. Logistic regression predicts the output of a categorical dependent variable. − Once we have a probability, how do we actually classify the data? Choose a probability threshold depending on the type of classification problem we're solving: $y = 0$ if $g(z) < 0.5$, and $y = 1$ if $g(z) \geq 0.5$. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 57 / 76
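Combining the sigmoid with the 0.5 threshold gives a complete, if toy, classifier; the coefficients below are hypothetical rather than fitted values:

```python
import numpy as np

def predict_class(x, beta0=-4.0, beta1=1.5, threshold=0.5):
    """Logistic-regression prediction for one feature x (hypothetical coefficients)."""
    z = beta0 + beta1 * x                # linear combination of the inputs
    p = 1.0 / (1.0 + np.exp(-z))         # sigmoid -> probability in (0, 1)
    return (1 if p >= threshold else 0), p

for x in (1.0, 3.0, 5.0):
    label, prob = predict_class(x)
    print(f"x = {x}: p = {prob:.3f} -> class {label}")
```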
  • 58. Logistic Regression Advantages Simple and Interpretable: Logistic regression is relatively simple to understand and implement. Computational Efficiency: Logistic regression is computationally efficient and can handle large datasets efficiently. Works well for Binary Classification: It is particularly effective when the dependent variable is binary (two classes). Less Prone to Overfitting: Logistic regression tends to be less prone to overfitting compared to more complex models, making it a good choice for smaller datasets. Disadvantage Assumption of Linearity Sensitive to Outliers Not Ideal for Multi-Class Classification Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 58 / 76
  • 59. Check Your Progress 1 What is the primary objective of logistic regression? A. Minimizing mean squared error B. Maximizing likelihood C. Minimizing regularization D. Maximizing R-squared 2 If the output of a logistic regression model is 0.8 for a given instance, how would you predict the class? A. Class 0 B. Class 1 C. Insufficient information to determine D. Class probability is not relevant for prediction Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 59 / 76
  • 60. Outline 1 Introduction 2 Regression 3 Classification 4 k- Nearest Neighbors (KNN) 5 Logistic Regression 6 Decision Tree Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 60 / 76
  • 61. Decision Tree Based Classification A decision tree is a flow-chart-like tree structure, where − each internal node (nonleaf node) denotes a test on an attribute, − each branch represents an outcome of the test, and − each leaf node (or terminal node) holds a class label Decision tree performs classification by constructing a tree based on training instances with leaves having class labels. − The tree is traversed for each test instance to find a leaf, and the class of the leaf is the predicted class Widely used learning method as it has been applied to: − classify medical patients based on the disease, − equipment malfunction by cause, − loan applicant by likelihood of payment. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 61 / 76
  • 62. Decision tree learning: Algorithm Aim: find a small tree consistent with the training examples Idea: (recursively) choose ”most significant” attribute as root of (sub)tree − Tree is constructed in a top-down recursive divide-and-conquer manner − At start, all the training examples/tuples are at the root − Attributes are categorical (if continuous-valued, they are discretized in advance) − Examples are partitioned recursively based on selected attributes − Optimal attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning: − All samples (tuples) for a given node belong to the same class − There are no remaining attributes on which the tuples may be further partitioned − There are no samples (tuples) left for a given branch Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 62 / 76
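In practice, libraries implement this recursive partitioning for us; below is a sketch using scikit-learn's DecisionTreeClassifier on hypothetical integer-encoded data, with criterion="entropy" so that splits are chosen by information gain as described above:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoded training data: each row is an attribute set X, y is the class
X = [[0, 1], [1, 0], [0, 0], [1, 1], [2, 0], [2, 1]]
y = [0, 1, 0, 1, 1, 0]

# criterion="entropy" selects splitting attributes by information gain
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Print the learned tree as nested if-then rules, then classify a new tuple
print(export_text(tree, feature_names=["attr1", "attr2"]))
print("prediction for [0, 1]:", tree.predict([[0, 1]]))
```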
  • 63. Choosing the Splitting Attribute At each node, the best attribute is selected for splitting the training examples using a goodness function. − The best attribute is the one that separates the classes of the training examples fastest, such that it results in the smallest tree. Typical goodness functions: information gain and GINI index. Information Gain: Select the attribute with the highest information gain, i.e., the one that creates the smallest average disorder. − First, compute the disorder using Entropy, the expected information needed to classify objects into classes. − Second, measure the Information Gain: calculate by how much the disorder of a set would reduce by knowing the value of a particular attribute. GINI index − An alternative to information gain that measures the impurity of attributes in the classification task. − Select the attribute with the smallest GINI value. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 63 / 76
  • 64. Entropy The Entropy measures the disorder of a set S containing a total of n examples, of which $n_+$ are positive and $n_-$ are negative, and it is given by: $D(n_+, n_-) = -\frac{n_+}{n}\log_2\frac{n_+}{n} - \frac{n_-}{n}\log_2\frac{n_-}{n} = Entropy(S)$ (1) Some useful properties of the Entropy: − $D(n, m) = D(m, n)$ − $D(0, m) = D(m, 0) = 0$; D(S) = 0 means that all the examples in S have the same class. − $D(m, m) = 1$; D(S) = 1 means that half the examples in S are of one class and half are in the opposite class. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 64 / 76
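A direct translation of the entropy formula, using the convention that 0 · log2 0 = 0:

```python
import math

def entropy(n_pos, n_neg):
    """Disorder D(n+, n-) of a set with n_pos positive and n_neg negative examples."""
    n = n_pos + n_neg
    d = 0.0
    for count in (n_pos, n_neg):
        if count > 0:                  # 0 * log2(0) is taken to be 0
            p = count / n
            d -= p * math.log2(p)
    return d

print(entropy(0, 5))   # 0.0    -> all examples share one class
print(entropy(4, 4))   # 1.0    -> evenly split set
print(entropy(3, 5))   # ~0.954 -> the value used in the sunburn example below
```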
  • 65. Information Gain Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). The Information Gain measures the expected reduction in entropy due to splitting on an attribute A: $Gain_{split} = Entropy(S) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)$ (2) The parent node S is split into k partitions; $n_i$ is the number of records in partition i. The attribute with the highest information gain is chosen as the splitting attribute for node N. − This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in these partitions. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 65 / 76
  • 66. Example: The problem of “Sunburn” You want to predict whether another person is likely to get sunburned if he is back to the beach. How can you do this? Data Collected: predict based on the observed properties of the people Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 66 / 76
  • 67. Example: The problem of "Sunburn" 1 Attribute Selection by Information Gain to construct the optimal decision tree. Entropy: the disorder of Sunburned. D({"Sarah", "Dana", "Alex", "Annie", "Emily", "Pete", "John", "Katie"}) = $D(3_+, 5_-) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} = 0.954$ 2 Calculate the average disorder associated with hair color. The first term of the sum: $D(S_{blonde})$ = D({"Sarah", "Annie", "Dana", "Katie"}) = $D(2_+, 2_-) = 1$, so $\frac{|S_{blonde}|}{|S|} \cdot D(S_{blonde}) = \frac{4}{8} \cdot 1 = 0.5$ 3 The second and third terms of the sum: $S_{red}$ = {"Emily"}, $S_{brown}$ = {"Alex", "Pete", "John"}. − These are both 0 because within each set all the examples have the same class. − So the average disorder created when splitting on "hair color" is 0.5 + 0 + 0 = 0.5. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 67 / 76
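A small self-contained sketch verifying these numbers (and anticipating the hair-color gain computed on the next slide):

```python
import math

def entropy(n_pos, n_neg):
    n = n_pos + n_neg
    d = 0.0
    for count in (n_pos, n_neg):
        if count > 0:
            d -= (count / n) * math.log2(count / n)
    return d

# Whole set: 3 sunburned, 5 not sunburned
print(round(entropy(3, 5), 3))                       # 0.954

# Average disorder after splitting on hair colour:
# blonde = (2+, 2-), red = (1+, 0-), brown = (0+, 3-)
avg = (4/8) * entropy(2, 2) + (1/8) * entropy(1, 0) + (3/8) * entropy(0, 3)
print(avg)                                           # 0.5
print(round(entropy(3, 5) - avg, 3))                 # Gain(hair) = 0.454
```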
  • 68. Example: The problem of "Sunburn" Which decision variable minimizes the disorder?
Test Average Disorder
hair 0.50
height 0.69
weight 0.94
lotion 0.61
Which decision variable maximizes the Info Gain then? Remember it's the one which minimizes the average disorder. − Gain(hair) = 0.954 - 0.50 = 0.454 − Gain(height) = 0.954 - 0.69 = 0.264 − Gain(weight) = 0.954 - 0.94 = 0.014 − Gain(lotion) = 0.954 - 0.61 = 0.344 The best decision tree? Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 68 / 76
  • 69. Example: The problem of "Sunburn" Once we have finished with hair colour, we then need to calculate the remaining branches of the decision tree. Which attribute is best to classify the remaining examples? The resulting tree is the simplest and optimal one possible, and it makes a lot of sense: it classifies 4 of the people on hair color alone. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 69 / 76
  • 70. Example: The problem of "Sunburn" You can view the decision tree as an IF-THEN-ELSE statement which tells us whether someone will suffer from sunburn. if (hair-colour = "red") then return(sunburned = yes) else if (hair-colour = "blonde" and lotion-used = "no") then return(sunburned = yes) else return(sunburned = no) end if Rule Extraction from a decision tree One rule is created for each path from the root to a leaf node. To form a rule antecedent, each splitting criterion is logically ANDed. The leaf node holds the class prediction, forming the rule consequent. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 70 / 76
  • 71. Exercise: Decision Tree for “buy computer or not”. Use the training dataset given below to construct decision tree Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 71 / 76
  • 72. Why decision tree in Machine Learning? Relatively fast learning speed (compared with other classification methods). Convertible to simple and easy-to-understand if-then-else classification rules. Comparable classification accuracy with other methods. Does not require any prior knowledge of data distribution, and works well on noisy data. Pros Reasonable training time Easy to implement Easy to interpret Can handle mixed data types Cons Cannot handle complicated relationships between features Simple decision boundaries Problems with lots of missing data Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 72 / 76
  • 73. Check Your Progress 1 What is the role of the root node in a decision tree? A It is the node with the highest impurity B It is the node where predictions are made C It is the node where the tree is pruned D It is the topmost node where the first decision is made 2 Which technique helps to reduce the complexity of a decision tree and prevent overfitting? A Feature scaling B Regularization C Pruning D Data normalization 3 Which of the following is an advantage of decision trees? A Sensitivity to feature scales B Difficulty handling non-linear relationships C Limited interpretability D Easy to interpret and understand Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 73 / 76
  • 74. Metrics used to Evaluate Classifiers Here are some key metrics commonly used for classifier evaluation: 1 Accuracy: Definition: The proportion of correctly classified instances out of the total instances. Formula: $Accuracy = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$ Use Case: Suitable for balanced datasets but may be misleading in the presence of imbalanced classes. 2 Precision: Definition: The proportion of true positive predictions out of all positive predictions. Formula: $Precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$ Use Case: Emphasizes the accuracy of positive predictions, useful when the cost of false positives is high. 3 Recall (Sensitivity or True Positive Rate): Definition: The proportion of true positive predictions out of all actual positives. Formula: $Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$ Use Case: Emphasizes capturing all positive instances, useful when the cost of false negatives is high. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 74 / 76
  • 75. Metrics used to Evaluate Classifiers 4 F1 Score: Definition: The harmonic mean of precision and recall. Formula: $F1\ Score = \frac{2 \times Precision \times Recall}{Precision + Recall}$ Use Case: Balances precision and recall, especially in imbalanced datasets. 5 Specificity (True Negative Rate): Definition: The proportion of true negative predictions out of all actual negatives. Formula: $Specificity = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$ Use Case: Emphasizes the accuracy of negative predictions. 6 Area Under the ROC Curve (AUC-ROC): Definition: The area under the Receiver Operating Characteristic (ROC) curve. Use Case: Provides a comprehensive measure of a classifier's performance across different decision thresholds. 7 Confusion Matrix: Definition: A table summarizing true positives, true negatives, false positives, and false negatives. Use Case: Provides a detailed breakdown of classifier performance. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 75 / 76
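A sketch computing these metrics from raw confusion-matrix counts; the counts below are placeholders, not the answer to the exercise that follows:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common classifier metrics from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)        # sensitivity / true positive rate
    specificity = tn / (tn + fp)        # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# Placeholder counts (hypothetical):
acc, prec, rec, spec, f1 = classification_metrics(tp=50, fp=5, fn=10, tn=100)
print(f"acc={acc:.3f} precision={prec:.3f} recall={rec:.3f} "
      f"specificity={spec:.3f} F1={f1:.3f}")
```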
  • 76. Metrics used to Evaluate Classifiers Consider a binary classification problem where a model is trained to predict whether an online transaction is fraudulent (positive class) or not fraudulent (negative class). The model is evaluated on a test dataset using a confusion matrix:
(blank) Actual Fraudulent Actual Not Fraudulent
Predicted Fraudulent 120 10
Predicted Not Fraudulent 15 850
Compute the following metrics of the model: Accuracy, Precision, Recall, F1 Score. Shumet Tadesse (Computer Science) ML Chapter 2 Supervised learning February 2024 76 / 76