Modelling and Evaluation
Prakash G Khaire
The Mandvi Education Society Institute
of Computer Studies
Introduction
• A structured representation of raw input data as a
meaningful pattern is called a model.
• Which model to select for a specific data set is decided
by the learning task, based on the problem to be solved
and the type of data.
– Example: When the problem is one of prediction and the target
field is numeric and continuous, a regression model is assigned.
• The process of assigning a model and fitting that model
to a data set is called model training.
Selecting a Model
• Machine Learning is an optimization problem.
• We try to define a model and tune the
parameters to find the most suitable solution
to a problem.
• We need a way to evaluate the quality or
optimality of a solution, using an “Objective
Function”.
Predictive Model
• A model that helps to predict a certain value using
the values in an input data set.
• The learning model attempts to establish a relation
between the target feature and the predictor features.
• Predictive models have a clear focus on what they
want to learn and how they want to learn it.
Predictive Model
• Examples of Predictive Model
– Win/Loss in a cricket match.
– Whether a transaction is fraudulent.
– Whether a customer will move to another
product.
Predictive Model
• Models which are used for prediction of
target features of categorical value are known
as classification models.
• Popular classification models are k-Nearest
Neighbour (kNN), Naïve Bayes, and Decision
Tree.
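As an illustration of the classification idea, here is a bare-bones k-Nearest Neighbour sketch in plain Python. The toy points and labels are invented for the example; a real project would use a library implementation such as scikit-learn's:

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    # Classify x by majority vote among its k nearest training
    # points (squared Euclidean distance).
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, x)), label)
        for p, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two made-up clusters of labelled points.
train_X = [(0, 0), (0, 1), (5, 5), (6, 5)]
train_y = ["blue", "blue", "red", "red"]
```

A query point near (1, 0) would be voted "blue", one near (5, 6) would be voted "red".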
Predictive Model
• Predictive models may also be used to predict numerical
values of the target feature based on the predictor features.
– Prediction of revenue growth in the succeeding year
– Prediction of rainfall amount in the coming monsoon
– Prediction of potential flu patients and demand for flu
shots next winter
• Models used for prediction of the numerical value of the
target feature of a data instance are known as regression models.
– A popular regression model is Linear Regression. (Logistic
Regression, despite its name, is used for classification.)
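The regression idea can be sketched with one-dimensional ordinary least squares. The data points below are invented and noise-free, so the fit recovers the underlying line exactly:

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b in one dimension.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Noise-free data lying on y = 2x + 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

With real, noisy data the fitted `a` and `b` would only approximate the true relationship.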
Descriptive Models
• Models used to describe a data set or gain insight
from a data set are called descriptive models; the
learning involved is unsupervised.
• There is no target feature or single feature of
interest in unsupervised learning.
• Based on the values of all features, interesting
patterns or insights are derived about the data
set.
Descriptive Models
• Descriptive models which group together similar data
instances, i.e. data instances having similar values of
the different features, are called clustering models.
• Examples of clustering include
– Customer grouping or segmentation based on social,
demographic, ethnic, etc factors.
– Grouping of music based on different aspects like genre,
language, time-period, etc.
– Grouping of commodities in an inventory
• The most popular model for clustering is k-Means.
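A minimal one-dimensional k-Means sketch (toy points and hand-picked initial centroids; real use would rely on a library implementation):

```python
def k_means(points, centroids, iters=10):
    # Repeat: assign each point to its nearest centroid, then move
    # each centroid to the mean of its assigned cluster.
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Two obvious groups, around 1 and around 10; initial guesses 0 and 5.
cents = sorted(k_means([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], [0.0, 5.0]))
```

The centroids settle on the means of the two groups, here 1.0 and 10.0.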
Training a Model
(For supervised learning)
K-fold Cross Validation
• K-Fold CV is where a given data set is split into a K number
of sections/folds where each fold is used as a testing set at
some point.
• Let’s take the scenario of 5-fold cross validation (k = 5).
Here, the data set is split into 5 folds.
• In the first iteration, the first fold is used to test the model
and the rest are used to train the model.
• In the second iteration, the 2nd fold is used as the testing set
while the rest serve as the training set. This process is
repeated until each of the 5 folds has been used as
the testing set.
K-fold Cross Validation
• The value of ‘k’ in k-fold cross-validation can be
set to any number. There are two approaches
which are extremely popular.
– 10 fold cross-validation (10 fold CV)
• One of the folds is used as the test data for validating model
performance trained based on the remaining 9 folds.
– Leave-one-out Cross Validation (LOOCV)
• Uses one record or data instance at a time as the test
data; k equals the number of data instances.
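The fold bookkeeping described above can be sketched as follows; note that LOOCV is simply the special case where k equals the number of instances:

```python
def k_fold_splits(n_samples, k):
    # Partition indices 0..n_samples-1 into k folds; each fold serves
    # once as the test set while the remaining folds form the training set.
    folds = [list(range(n_samples))[i::k] for i in range(k)]
    return [
        ([idx for j, f in enumerate(folds) if j != i for idx in f], folds[i])
        for i in range(k)
    ]

splits = k_fold_splits(10, 5)    # 5-fold CV on 10 instances
loocv = k_fold_splits(10, 10)    # LOOCV: k = n, one instance per test set
```

Each instance appears in exactly one test fold, so every data point is used for both training and testing across the k iterations.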
Bootstrap sampling
• It is a popular way to identify training and test
data sets from the input data set.
• It uses Simple Random Sampling with
Replacement, a well-known technique in
sampling theory for drawing random samples.
• This technique is particularly useful for input
data sets of small size, i.e. with very few
data instances.
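A minimal sketch of bootstrap sampling with Python's standard library. By convention, the instances never drawn into the training sample (the "out-of-bag" instances) serve as the test set:

```python
import random

def bootstrap_split(data, seed=0):
    # Draw len(data) samples WITH replacement as the training set;
    # instances never drawn form the test set.
    rng = random.Random(seed)
    train = [rng.choice(data) for _ in range(len(data))]
    test = [x for x in data if x not in train]
    return train, test

data = list(range(10))
train, test = bootstrap_split(data)
```

Because sampling is with replacement, the training set is the same size as the original data but typically contains duplicates.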
Model Representation and
Interpretability
• The goal of supervised machine learning is to
learn or derive a target function which can
best determine the target variable from the
set of input variables.
• The fitness of the target function approximated
by a learning algorithm determines how correctly
it is able to classify a set of data it has never
seen.
Underfitting
• A statistical model or a machine learning algorithm is said to be
underfitting when it cannot capture the underlying trend of the data. (It’s
just like trying to fit into undersized pants!)
• Underfitting destroys the accuracy of our machine learning model.
• Its occurrence simply means that our model or the algorithm does not fit
the data well enough.
• It usually happens when we have too little data to build an accurate model,
or when we try to fit a linear model to non-linear data.
• In such cases the rules of the machine learning model are too simple to
capture the patterns in the data, and the model will probably make a lot of
wrong predictions.
• Underfitting can be avoided by using more training data and by increasing
model complexity, e.g. through effective feature engineering.
Overfitting
• A statistical model is said to be overfitted when it learns the training data
too closely, capturing noise as if it were signal (just like fitting ourselves
into oversized pants!).
• When a model is trained this way, it starts learning from
the noise and inaccurate data entries in our data set.
• The model then fails to categorize new data correctly, because of too
much detail and noise.
• Overfitting is most common with non-parametric and non-linear methods,
because these types of machine learning algorithms have more freedom
in building the model based on the dataset and therefore can
build unrealistic models.
• Solutions to avoid overfitting include re-sampling techniques like k-fold
cross validation, holding back a validation data set, and removing the nodes
which have little or no predictive power for the given machine learning
problem.
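The train-versus-test gap that characterizes overfitting can be demonstrated with a deliberately overfitted "model" that memorizes its training data. All data below is synthetic, generated from the true trend y = x plus noise:

```python
import random

rng = random.Random(0)
# Noisy samples of the true trend y = x.
train = [(x, x + rng.gauss(0, 0.5)) for x in range(100)]
test = [(x, x + rng.gauss(0, 0.5)) for x in range(100)]

lookup = dict(train)

def overfit_predict(x):
    # Memorizes every training pair, noise and all.
    return lookup[x]

def simple_predict(x):
    # The true underlying trend, ignoring the noise.
    return x

def mse(data, predict):
    return sum((y - predict(x)) ** 2 for x, y in data) / len(data)

train_err_overfit = mse(train, overfit_predict)  # zero: noise memorized
test_err_overfit = mse(test, overfit_predict)    # large: memorized noise misleads
test_err_simple = mse(test, simple_predict)      # small: only test noise remains
```

The memorizing model looks perfect on the training set yet is beaten on the test set by the far simpler model of the true trend.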
Bias – Variance trade off
• In supervised learning, the class value
assigned by the learning model built based on
the training data may differ from the actual
class value.
• This error in learning can be of two types –
errors due to ‘bias’ and errors due to ‘variance’.
Errors due to ‘Bias’
• The error due to bias is taken as the difference between the
expected (or average) prediction of our model and the correct value
which we are trying to predict.
• Of course, you only have one model, so talking about expected or
average prediction values might seem a little strange.
• However, imagine you could repeat the whole model building
process more than once: each time you gather new data and run a
new analysis creating a new model.
• Due to randomness in the underlying data sets, the resulting
models will have a range of predictions.
• Bias measures how far off in general these models' predictions are
from the correct value.
Errors due to ‘Variance’
• The error due to variance is taken as the
variability of a model prediction for a given data
point.
• Again, imagine you can repeat the entire model
building process multiple times.
• The variance is how much the predictions for a
given point vary between different realizations of
the model.
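The "repeat the whole model building process" thought experiment can be run literally in code. Here the "model" is just a sample mean estimating a known true value; the setup is invented for illustration:

```python
import random

def simulate(n_models=200, n_points=20, true_value=5.0, seed=1):
    # Rebuild the "model" (a sample mean) many times on fresh noisy
    # data, then measure bias and variance of its predictions.
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        sample = [true_value + rng.gauss(0, 1) for _ in range(n_points)]
        preds.append(sum(sample) / n_points)   # one trained model's prediction
    avg = sum(preds) / n_models
    bias = avg - true_value
    variance = sum((p - avg) ** 2 for p in preds) / n_models
    return bias, variance

bias, variance = simulate()
```

The sample mean is an unbiased estimator, so the measured bias is near zero, while the variance reflects how much the prediction jumps around between rebuilds.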
Bias – Variance trade off
• The problem in training a model can happen
because either
– the model is too simple and hence fails to capture
even the gross structure of the data, or
– the model is extremely complex and magnifies even
small differences in the training data.
• The bias–variance trade-off:
– Increasing the bias will decrease the variance.
– Increasing the variance will decrease the bias.
Supervised Learning - Classification
