Modelling and Evaluation
Prakash G Khaire
The Mandvi Education Society Institute
of Computer Studies
Introduction
• A structured representation of raw input data as a
meaningful pattern is called a model.
• Which model to select for a specific data set is decided
by the learning task, based on the problem to be solved
and the type of data.
– Example: When the problem is one of prediction and the target
field is numeric and continuous, a regression model is assigned.
• The process of assigning a model and fitting that model
to a data set is called model training.
Selecting a Model
• Machine Learning is an optimization problem.
• We try to define a model and tune the
parameters to find the most suitable solution
to a problem.
• We need a way to evaluate the quality or
optimality of a solution, using an “Objective
Function”.
Predictive Model
• A model that helps to predict a certain value using
the values in an input data set.
• The learning model attempts to establish a relation
between the target feature and the predictor features.
• Predictive models have a clear focus on what they
want to learn and how they want to learn it.
Predictive Model
• Examples of Predictive Model
– Win/Loss in a cricket match.
– Whether a transaction is fraudulent.
– Whether a customer will move to another
product.
Predictive Model
• Models which are used for prediction of
target features of categorical value are known
as classification models.
• Popular classification models are k-Nearest
Neighbour (kNN), Naïve Bayes, and Decision
Tree.
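As an illustration of the classification idea, here is a bare-bones k-Nearest Neighbour sketch in plain Python. The toy points and labels are invented for the example; a real project would use a library implementation such as scikit-learn's:

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    # Classify x by majority vote among its k nearest training
    # points (squared Euclidean distance).
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, x)), label)
        for p, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two made-up clusters of labelled points.
train_X = [(0, 0), (0, 1), (5, 5), (6, 5)]
train_y = ["blue", "blue", "red", "red"]
```

A query point near (1, 0) would be voted "blue", one near (5, 6) would be voted "red".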
Predictive Model
• Predictive models may also be used to predict numerical
values of the target feature based on the predictor features.
– Prediction of revenue growth in the succeeding year
– Prediction of rainfall amount in the coming monsoon
– Prediction of potential flu patients and demand for flu
shots next winter
• Models used for prediction of the numerical value of the
target feature of a data instance are known as regression models.
– A popular regression model is Linear Regression. (Logistic
Regression, despite its name, is used for classification.)
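The regression idea can be sketched with one-dimensional ordinary least squares. The data points below are invented and noise-free, so the fit recovers the underlying line exactly:

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b in one dimension.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Noise-free data lying on y = 2x + 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

With real, noisy data the fitted `a` and `b` would only approximate the true relationship.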
Descriptive Models
• Models used to describe a data set or gain insight
from a data set are called descriptive models; the
learning involved is unsupervised.
• There is no target feature or single feature of
interest in unsupervised learning.
• Based on the values of all features, interesting
patterns or insights are derived about the data
set.
Descriptive Models
• Descriptive models which group together similar data
instances, i.e. data instances having similar values of
the different features, are called clustering models.
• Examples of clustering include
– Customer grouping or segmentation based on social,
demographic, ethnic, etc factors.
– Grouping of music based on different aspects like genre,
language, time-period, etc.
– Grouping of commodities in an inventory
• The most popular model for clustering is k-Means.
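A minimal one-dimensional k-Means sketch (toy points and hand-picked initial centroids; real use would rely on a library implementation):

```python
def k_means(points, centroids, iters=10):
    # Repeat: assign each point to its nearest centroid, then move
    # each centroid to the mean of its assigned cluster.
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Two obvious groups, around 1 and around 10; initial guesses 0 and 5.
cents = sorted(k_means([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], [0.0, 5.0]))
```

The centroids settle on the means of the two groups, here 1.0 and 10.0.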
Training a Model
(For supervised learning)
K-fold Cross Validation
• K-Fold CV is where a given data set is split into a K number
of sections/folds where each fold is used as a testing set at
some point.
• Let’s take the scenario of 5-fold cross validation (k = 5).
Here, the data set is split into 5 folds.
• In the first iteration, the first fold is used to test the model
and the rest are used to train the model.
• In the second iteration, the 2nd fold is used as the testing set
while the rest serve as the training set. This process is
repeated until each of the 5 folds has been used as
the testing set.
K-fold Cross Validation
• The value of ‘k’ in k-fold cross-validation can be
set to any number. There are two approaches
which are extremely popular.
– 10 fold cross-validation (10 fold CV)
• One of the folds is used as the test data for validating model
performance trained based on the remaining 9 folds.
– Leave-one-out Cross Validation (LOOCV)
• Uses one record or data instance at a time as the test
data; k equals the number of data instances.
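The fold bookkeeping described above can be sketched as follows; note that LOOCV is simply the special case where k equals the number of instances:

```python
def k_fold_splits(n_samples, k):
    # Partition indices 0..n_samples-1 into k folds; each fold serves
    # once as the test set while the remaining folds form the training set.
    folds = [list(range(n_samples))[i::k] for i in range(k)]
    return [
        ([idx for j, f in enumerate(folds) if j != i for idx in f], folds[i])
        for i in range(k)
    ]

splits = k_fold_splits(10, 5)    # 5-fold CV on 10 instances
loocv = k_fold_splits(10, 10)    # LOOCV: k = n, one instance per test set
```

Each instance appears in exactly one test fold, so every data point is used for both training and testing across the k iterations.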
Bootstrap sampling
• It is a popular way to identify training and test
data sets from the input data set.
• It uses Simple Random Sampling with
Replacement, a well-known technique in
sampling theory for drawing random samples.
• This technique is particularly useful for input
data sets of small size, i.e. with very few
data instances.
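A minimal sketch of bootstrap sampling with Python's standard library. By convention, the instances never drawn into the training sample (the "out-of-bag" instances) serve as the test set:

```python
import random

def bootstrap_split(data, seed=0):
    # Draw len(data) samples WITH replacement as the training set;
    # instances never drawn form the test set.
    rng = random.Random(seed)
    train = [rng.choice(data) for _ in range(len(data))]
    test = [x for x in data if x not in train]
    return train, test

data = list(range(10))
train, test = bootstrap_split(data)
```

Because sampling is with replacement, the training set is the same size as the original data but typically contains duplicates.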
Model Representation and
Interpretability
• The goal of supervised machine learning is to
learn or derive a target function which can
best determine the target variable from the
set of input variables.
• The fitness of the target function approximated
by a learning algorithm determines how correctly
it is able to classify a set of data it has never
seen.
Underfitting
• A statistical model or a machine learning algorithm is said to be
underfitting when it cannot capture the underlying trend of the data. (It’s
just like trying to fit into undersized pants!)
• Underfitting destroys the accuracy of our machine learning model.
• Its occurrence simply means that our model or the algorithm does not fit
the data well enough.
• It usually happens when we have too little data to build an accurate model,
or when we try to fit a linear model to non-linear data.
• In such cases the rules of the machine learning model are too simple to
capture the patterns in the data, and the model will probably make a lot of
wrong predictions.
• Underfitting can be avoided by using more training data and by increasing
model complexity, e.g. through effective feature engineering.
Overfitting
• A statistical model is said to be overfitted when it learns the training data
too closely, capturing noise as if it were signal (just like fitting ourselves
into oversized pants!).
• When a model is trained this way, it starts learning from
the noise and inaccurate data entries in our data set.
• The model then fails to categorize new data correctly, because of too
much detail and noise.
• Overfitting is most common with non-parametric and non-linear methods,
because these types of machine learning algorithms have more freedom
in building the model based on the dataset and therefore can
build unrealistic models.
• Solutions to avoid overfitting include re-sampling techniques like k-fold
cross validation, holding back a validation data set, and removing the nodes
which have little or no predictive power for the given machine learning
problem.
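The train-versus-test gap that characterizes overfitting can be demonstrated with a deliberately overfitted "model" that memorizes its training data. All data below is synthetic, generated from the true trend y = x plus noise:

```python
import random

rng = random.Random(0)
# Noisy samples of the true trend y = x.
train = [(x, x + rng.gauss(0, 0.5)) for x in range(100)]
test = [(x, x + rng.gauss(0, 0.5)) for x in range(100)]

lookup = dict(train)

def overfit_predict(x):
    # Memorizes every training pair, noise and all.
    return lookup[x]

def simple_predict(x):
    # The true underlying trend, ignoring the noise.
    return x

def mse(data, predict):
    return sum((y - predict(x)) ** 2 for x, y in data) / len(data)

train_err_overfit = mse(train, overfit_predict)  # zero: noise memorized
test_err_overfit = mse(test, overfit_predict)    # large: memorized noise misleads
test_err_simple = mse(test, simple_predict)      # small: only test noise remains
```

The memorizing model looks perfect on the training set yet is beaten on the test set by the far simpler model of the true trend.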
Bias – Variance trade off
• In supervised learning, the class value
assigned by the learning model built based on
the training data may differ from the actual
class value.
• This error in learning can be of two types –
errors due to ‘bias’ and errors due to ‘variance’.
Errors due to ‘Bias’
• The error due to bias is taken as the difference between the
expected (or average) prediction of our model and the correct value
which we are trying to predict.
• Of course, you only have one model, so talking about expected or
average prediction values might seem a little strange.
• However, imagine you could repeat the whole model building
process more than once: each time you gather new data and run a
new analysis creating a new model.
• Due to randomness in the underlying data sets, the resulting
models will have a range of predictions.
• Bias measures how far off in general these models' predictions are
from the correct value.
Errors due to ‘Variance’
• The error due to variance is taken as the
variability of a model prediction for a given data
point.
• Again, imagine you can repeat the entire model
building process multiple times.
• The variance is how much the predictions for a
given point vary between different realizations of
the model.
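The "repeat the whole model building process" thought experiment can be run literally in code. Here the "model" is just a sample mean estimating a known true value; the setup is invented for illustration:

```python
import random

def simulate(n_models=200, n_points=20, true_value=5.0, seed=1):
    # Rebuild the "model" (a sample mean) many times on fresh noisy
    # data, then measure bias and variance of its predictions.
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        sample = [true_value + rng.gauss(0, 1) for _ in range(n_points)]
        preds.append(sum(sample) / n_points)   # one trained model's prediction
    avg = sum(preds) / n_models
    bias = avg - true_value
    variance = sum((p - avg) ** 2 for p in preds) / n_models
    return bias, variance

bias, variance = simulate()
```

The sample mean is an unbiased estimator, so the measured bias is near zero, while the variance reflects how much the prediction jumps around between rebuilds.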
Bias – Variance trade off
• The problem in training a model can happen
because either
– the model is too simple and hence fails to capture
even the gross structure of the data, or
– the model is extremely complex and magnifies even
small differences in the training data.
• The bias–variance trade-off:
– Increasing the bias will decrease the variance.
– Increasing the variance will decrease the bias.
Supervised Learning - Classification
