5. Our aim is to predict the health status of a new data point whose status is still unknown.
6. What is the best machine learning model that may produce a generalised model of the data?
• k-nearest neighbours?
• Support Vector Machine?
• Logistic Regression?
• Etc.
What is the method to ensure that the generated model will be a GENERALISED one?
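As a minimal sketch of these candidates (assuming scikit-learn; the hyperparameters here are illustrative, not prescribed by the slides):

```python
# Instantiate the candidate models named above, assuming scikit-learn.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

candidates = {
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "support vector machine": SVC(kernel="rbf"),
    "logistic regression": LogisticRegression(max_iter=1000),
}
# Which of these generalises best cannot be read off the code; it has to be
# measured, which is what the rest of these slides builds towards.
```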
7. Phase 1: Estimate the parameters for the machine learning method
• Use any suitable machine learning method on 'some of the data' to estimate the shape/curve of the data
• Estimating these parameters = TRAINING the machine learning algorithm on the data
8. Phase 2: Evaluate how well the machine learning method works
• We need to find out whether the estimated 'curve' will do a good job of categorising the data
• This evaluation method = TESTING the machine learning algorithm
9. Thus…
• Using the machine learning algorithm, some of the data is used to
• Train the machine learning method
• Test the machine learning method
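In scikit-learn terms, the two phases might look like the following minimal sketch (the toy health data here is entirely hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: patients with 2 measurements each, binary health status.
X_train = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 4.2],
                    [2.8, 4.0], [1.1, 2.0], [3.1, 4.1]])
y_train = np.array([0, 0, 1, 1, 0, 1])
X_test = np.array([[1.0, 2.2], [2.9, 4.1]])  # held-out data
y_test = np.array([0, 1])

model = LogisticRegression(max_iter=1000)

# Phase 1: TRAINING - estimate the parameters (the 'shape/curve') from some of the data
model.fit(X_train, y_train)

# Phase 2: TESTING - check whether the fitted curve categorises held-out data well
print("test accuracy:", model.score(X_test, y_test))
```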
10. Why not use all of the data?
• Using all of the data to estimate the parameters (train the algorithm) is the worst approach
• If we use all of it, there is no data left to test the machine learning model
• Reusing the same data for training and testing is a bad idea: there is no way to check whether the ML model works on data it wasn't trained on
12. A slightly better idea would be to first partition the dataset into a TRAINING dataset, keeping the remainder as the TESTING dataset:
[ TRAINING | TRAINING | TRAINING | TESTING ]
Train on the TRAINING dataset and evaluate the selected machine learning method's performance on the TESTING dataset.
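A hedged sketch of this partitioning with scikit-learn's train_test_split; the 25% test fraction mirrors the three-blocks-training, one-block-testing picture above, and the generated dataset is a stand-in for the health data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in dataset; in the slides this would be the health data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 25% as the TESTING dataset (one block of four); train on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on the TESTING dataset:", model.score(X_test, y_test))
```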
13. Question
• How do we know that taking the first sub-blocks as the TRAINING dataset and the remaining sub-block as the TESTING dataset is the best way to partition the whole dataset?
• What about taking the first sub-block as the TESTING dataset and the remainder as the TRAINING dataset?
[ TESTING | TRAINING | TRAINING | TRAINING ]
15. CROSS-VALIDATION METHOD
• Use this method by letting each sub-block of the dataset take a turn as the TESTING dataset, with the remaining sub-blocks as the TRAINING dataset
• Summarise the model's performance across all of these splits
• Evaluate the best model
17. k-Fold Cross-Validation
• In our example, we use 4-fold cross-validation
• In practice, the number of blocks k is somewhat arbitrary
• Commonly, the data is divided into 10 blocks (10-fold cross-validation)
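A minimal sketch with scikit-learn's cross_val_score (same stand-in dataset as before); cv=4 matches the 4-fold example, and cv=10 would give the common 10-fold variant:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 4-fold cross-validation: each of the 4 blocks serves once as the TESTING data.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=4)
print("per-fold accuracy:", scores)
print("mean accuracy across folds:", scores.mean())
```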
18. The procedure of k-Fold Cross-Validation
• The original dataset is randomly partitioned into k sub-samples (e.g. k = 4)
• For each fold, a model is estimated on k − 1 sub-samples, with the k-th sub-sample serving as the VALIDATION sample; the process repeats until every sub-sample has served as VALIDATION data, and the model results are AVERAGED across folds
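The same procedure written out by hand with scikit-learn's KFold (a sketch; shuffle=True gives the random assignment to sub-samples described above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

kf = KFold(n_splits=4, shuffle=True, random_state=0)  # k = 4 random sub-samples
fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Estimate the model on k-1 sub-samples ...
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    # ... and score it on the k-th sub-sample, the VALIDATION sample.
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# Every sub-sample has served as VALIDATION data; average the results across folds.
print("fold scores:", fold_scores)
print("average:", np.mean(fold_scores))
```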