5. Our aim is to predict the health status of a new data point whose status is still unknown.
6. What is the best machine learning model that may produce a generalised model of the data?
• k-nearest neighbours?
• Support Vector Machine?
• Logistic Regression?
• Etc.
What is the method to ensure that the generated model will be a GENERALISED one?
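As a minimal sketch of these candidates (assuming scikit-learn; the hyperparameters here are illustrative, not prescribed by the slides):

```python
# Instantiate the candidate models named above, assuming scikit-learn.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

candidates = {
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "support vector machine": SVC(kernel="rbf"),
    "logistic regression": LogisticRegression(max_iter=1000),
}
# Which of these generalises best cannot be read off the code; it has to be
# measured, which is what the rest of these slides builds towards.
```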
7. Phase 1: Estimate the parameters for the machine learning method
• Use any suitable machine learning method on 'some of the data' to estimate the shape/curve of the data
• Estimating these parameters = TRAINING the machine learning algorithm on the data
8. Phase 2: Evaluate how well the machine learning method works
• We need to find out whether the estimated 'curve' will do a good job of categorising the data
• This evaluation method = TESTING the machine learning algorithm
9. Thus…
• Using the machine learning algorithm, some of the data is used to
• Train the machine learning method
• Test the machine learning method
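In scikit-learn terms, the two phases might look like the following minimal sketch (the toy health data here is entirely hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: patients with 2 measurements each, binary health status.
X_train = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 4.2],
                    [2.8, 4.0], [1.1, 2.0], [3.1, 4.1]])
y_train = np.array([0, 0, 1, 1, 0, 1])
X_test = np.array([[1.0, 2.2], [2.9, 4.1]])  # held-out data
y_test = np.array([0, 1])

model = LogisticRegression(max_iter=1000)

# Phase 1: TRAINING - estimate the parameters (the 'shape/curve') from some of the data
model.fit(X_train, y_train)

# Phase 2: TESTING - check whether the fitted curve categorises held-out data well
print("test accuracy:", model.score(X_test, y_test))
```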
10. Why not use all of the data?
• Using all of the data to estimate the parameters (train the algorithm) is the worst approach
• If we use all of it, there is no data left to test the machine learning model
• Reusing the same data for training and testing is a bad idea: there is no way to check whether the ML model works on data it wasn't trained on
12. A slightly better idea would be to first partition the dataset into a TRAINING dataset, keeping the remainder as the TESTING dataset:
[ TRAINING | TRAINING | TRAINING | TESTING ]
Train on the TRAINING dataset and evaluate the selected machine learning method's performance on the TESTING dataset.
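A hedged sketch of this partitioning with scikit-learn's train_test_split; the 25% test fraction mirrors the three-blocks-training, one-block-testing picture above, and the generated dataset is a stand-in for the health data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in dataset; in the slides this would be the health data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 25% as the TESTING dataset (one block of four); train on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on the TESTING dataset:", model.score(X_test, y_test))
```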
13. Question
• How do we know that taking the first sub-blocks as the TRAINING dataset and the remaining sub-block as the TESTING dataset is the best way to partition the whole dataset?
• What about taking the first sub-block as the TESTING dataset and the remainder as the TRAINING dataset?
[ TESTING | TRAINING | TRAINING | TRAINING ]
15. CROSS-VALIDATION METHOD
• Use this method by letting each sub-block of the dataset take a turn as the TESTING dataset, with the remaining sub-blocks as the TRAINING dataset
• Summarise the model's performance across all of these splits
• Evaluate the best model
17. k-Fold Cross-Validation
• In our example, we use 4-fold cross-validation
• In practice, the number of blocks k is somewhat arbitrary
• Commonly, the data is divided into 10 blocks (10-fold cross-validation)
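A minimal sketch with scikit-learn's cross_val_score (same stand-in dataset as before); cv=4 matches the 4-fold example, and cv=10 would give the common 10-fold variant:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 4-fold cross-validation: each of the 4 blocks serves once as the TESTING data.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=4)
print("per-fold accuracy:", scores)
print("mean accuracy across folds:", scores.mean())
```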
18. The procedure of k-Fold Cross-Validation
• The original dataset is randomly partitioned into k sub-samples (e.g. k = 4)
• For each fold, a model is estimated on k − 1 sub-samples, with the k-th sub-sample serving as the VALIDATION sample; the process repeats until every sub-sample has served as VALIDATION data, and the model results are AVERAGED across folds
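The same procedure written out by hand with scikit-learn's KFold (a sketch; shuffle=True gives the random assignment to sub-samples described above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

kf = KFold(n_splits=4, shuffle=True, random_state=0)  # k = 4 random sub-samples
fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Estimate the model on k-1 sub-samples ...
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    # ... and score it on the k-th sub-sample, the VALIDATION sample.
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# Every sub-sample has served as VALIDATION data; average the results across folds.
print("fold scores:", fold_scores)
print("average:", np.mean(fold_scores))
```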