Cross-Validation in Machine
Learning
• Cross-validation is a technique for validating a model's performance by
training it on one subset of the input data and testing it on a previously unseen
subset of the same data. Put another way, it is a technique for checking
how well a statistical model generalizes to an independent dataset.
• In machine learning there is always a need to test the stability of the
model. We cannot judge a model only by how well it fits the training
dataset. For this purpose, we reserve a particular
sample of the dataset that is not part of the training dataset. After
that, we test our model on that sample before deployment, and this
complete process comes under cross-validation. This is something
different from the general train-test split.
• Hence the basic steps of cross-validation are:
• Reserve a subset of the dataset as a validation set.
• Train the model using the training dataset.
• Evaluate model performance using the validation set. If the model
performs well on the validation set, proceed to the next step; otherwise,
check for issues.
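• The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy one-feature data, the hypothetical fit_threshold helper, and the midpoint-threshold "model" are invented for the example and are not part of the original slides.

```python
import random

# Toy data (assumed): one feature, label 1 when the feature is large.
random.seed(0)
data = [(x, int(x > 5.0)) for x in (random.uniform(0, 10) for _ in range(100))]

# Step 1: reserve a subset of the dataset as a validation set.
random.shuffle(data)
validation, training = data[:20], data[20:]

# Step 2: train the model on the training set only.
def fit_threshold(rows):
    """Hypothetical 'model': the midpoint between the two class means."""
    class0 = [x for x, y in rows if y == 0]
    class1 = [x for x, y in rows if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

threshold = fit_threshold(training)

# Step 3: evaluate on the validation set, which the model has never seen.
correct = sum(int(x > threshold) == y for x, y in validation)
accuracy = correct / len(validation)
print(f"validation accuracy: {accuracy:.2f}")
```

• If the validation accuracy is acceptable, the model moves on to the next step; if not, the data or the model is revisited.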
Key aspects of evaluating the quality of the model
are –
• How accurate the model is
• How generalized the model is
• When we start building a model and train it with the ‘entire’ dataset, we can very well calculate its accuracy on this training data set. But we
cannot test how this model will behave with new data which is not present in the training set, hence its generalization cannot be determined.
• Hence we need techniques to make use of the same data set for both training and testing of the models.
• In machine learning, Cross-Validation is the technique to evaluate how well the model has generalized and its overall accuracy. For this purpose,
it randomly samples data from the dataset to create training and testing sets. There are multiple cross-validation approaches as follows –
• 1. Hold Out Approach
• 2. Leave One Out Cross-Validation
• 3. K-Fold Cross-Validation
• 4. Stratified K-Fold Cross-Validation
• 5. Repeated Random Test-Train Split
• 1. Hold Out Approach
• In the hold-out approach, the data set is split into a train set and a test set by random sampling. The
train set is used for training the model and the test set is used to test its accuracy on unseen data. If
the training accuracy and the test accuracy are almost the same, the model is said to have generalized well. It is
common to use 80% of the data for training and the remaining 20% for testing.
• Advantages
• It is simple and easy to implement
• The execution time is low, because the model is trained only once.
• Disadvantages
• If the dataset itself is small, setting aside a portion for testing reduces the robustness of the
model, because the remaining training sample may not be representative of the entire dataset.
• The evaluation metrics may vary due to the randomness of the split between the train and test set.
• Although an 80-20 train-test split is widely followed, there is no rule of thumb for the split, and hence
the results can vary based on how the train-test split is done.
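• The randomness mentioned above can be seen in a small sketch. The hold_out_split helper below is hypothetical and written only for illustration; it returns row indices rather than data, and the seeds 1 and 2 are arbitrary choices.

```python
import random

def hold_out_split(n_samples, test_fraction=0.2, seed=None):
    """Return (train_indices, test_indices) from one random shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# Two 80-20 splits of the same 100 rows, differing only in the seed.
train_a, test_a = hold_out_split(100, test_fraction=0.2, seed=1)
train_b, test_b = hold_out_split(100, test_fraction=0.2, seed=2)

print(len(train_a), len(test_a))
print(len(set(test_a) & set(test_b)), "test rows are shared by the two splits")
```

• Because the two test sets contain different rows, a model scored on each can report a different accuracy even though nothing else has changed.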
• 2. Leave One Out Cross Validation (LOOCV)
• In this technique, if there are n observations in the dataset, only one observation is
reserved for testing and the remaining n-1 data points are used for training. This is repeated n
times, with a different observation held out in each iteration, until every data point has been
used for testing exactly once. Finally, the average accuracy is calculated by combining the
accuracies of all iterations.
• Advantages
• Since every data point is used for both training and testing, the overall accuracy is more
reliable.
• It is very useful when the dataset is small.
• Disadvantages
• LOOCV is not practical when the number of observations n is huge. For example,
a dataset with 500,000 records requires 500,000 models to be trained, which is
not really feasible.
• There is a huge computational and time cost associated with the LOOCV approach.
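• A minimal sketch of how LOOCV enumerates its splits, assuming only row indices are needed. The leave_one_out generator and the per-iteration scores are hypothetical and serve only to show the mechanics.

```python
def leave_one_out(n_samples):
    """Yield (train_indices, test_index) for each of the n iterations."""
    for test_index in range(n_samples):
        train_indices = [i for i in range(n_samples) if i != test_index]
        yield train_indices, test_index

splits = list(leave_one_out(5))
for train_indices, test_index in splits:
    print("train:", train_indices, "test:", test_index)

# Each iteration tests a single row, so its accuracy is either 0 or 1.
# Hypothetical per-iteration scores, averaged into the final estimate:
per_iteration = [1.0, 1.0, 0.0, 1.0, 1.0]
average_accuracy = sum(per_iteration) / len(per_iteration)
print(average_accuracy)
```

• One model is fitted per yielded split, so the number of fits equals the number of rows, which is where the cost noted above comes from.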
• 3. K-Fold Cross-Validation
• In the K-Fold Cross-Validation approach, the dataset is split into K folds. In the first iteration, the first
fold is reserved for testing and the model is trained on the data of the remaining K-1 folds.
• In the next iteration, the second fold is reserved for testing and the remaining folds are used for
training. This continues until the K-th iteration. The accuracy obtained in each iteration is used to
derive the overall average accuracy for the model.
• Advantages
• K-Fold cross-validation is useful when the dataset is small and cannot be split into a
train-test set (hold-out approach) without losing useful data for training.
• It gives a more robust performance estimate than a single hold-out split, as every data point is used for both training and testing.
• Disadvantages
• The major disadvantage of K-Fold Cross-Validation is that the training needs to be done K times and
hence it consumes more time and resources.
• It is not recommended for sequential time-series data, because the training folds can contain observations that come after the test fold.
• When the dataset is imbalanced, K-fold cross-validation may not give good results. This is because
some folds may have just a few or no records for the minority class.
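• The fold rotation described above can be sketched with row indices only. The k_fold_indices helper is hypothetical; it uses contiguous folds without shuffling, which is an assumption of this sketch.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for i, (train, test) in enumerate(folds, start=1):
    print(f"iteration {i}: test={test} train={train}")
```

• Every row appears in exactly one test fold and in the training data of the other K-1 iterations.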
• 4. Stratified K-Fold Cross-Validation
• Stratified K-fold cross-validation is useful when the data is imbalanced.
While sampling data into K-folds it makes sure that the distribution of
all classes in each fold is maintained. For example, if in the dataset 98%
of data belongs to class B and 2% to class A, the stratified sampling will
make sure each fold contains the two classes in the same ratio of 98%
to 2%.
• Advantage
• Stratified K-fold cross-validation is recommended when the dataset is
imbalanced.
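• A rough sketch of the idea behind stratified sampling, using the 98% / 2% example above. The stratified_folds helper is hypothetical and deals rows out class by class in round-robin order; it is not a description of how scikit-learn implements StratifiedKFold.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign each sample to one of k folds, class by class, round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# 98% of the rows belong to class B and 2% to class A.
labels = ["B"] * 98 + ["A"] * 2
folds = stratified_folds(labels, k=2)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

• Each of the two folds ends up with 49 rows of class B and 1 row of class A, so the 98% to 2% ratio is preserved.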
• 5. Repeated Random Test-Train Split
• Repeated random test-train split is a hybrid of traditional train-test
splitting and the k-fold cross-validation method. In this technique, we
create random splits of the data into the training-test set and then
repeat this process multiple times, just like the cross-validation
method.
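• A minimal sketch of repeated random splitting on row indices. The repeated_random_splits helper and its seed parameter are hypothetical and chosen only for illustration.

```python
import random

def repeated_random_splits(n_samples, n_splits, test_fraction, seed=0):
    """Yield n_splits independent random (train, test) index splits."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]

splits = list(repeated_random_splits(n_samples=10, n_splits=3, test_fraction=0.3))
for train, test in splits:
    print("test:", sorted(test), "train:", sorted(train))
```

• Unlike K-fold, the splits are drawn independently, so a row may land in several test sets or in none of them.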
Examples of Cross-Validation in Sklearn Library
• About Dataset
• We will be using Parkinson’s disease dataset for all examples of cross-validation in the Sklearn library. The goal is to predict whether or not a
particular patient has Parkinson’s disease. We will be using the decision tree algorithm in all the examples.
• The dataset has 21 attributes and 195 rows. The various fields of the Parkinson’s Disease dataset are as follows –
• MDVP:Fo(Hz) – Average vocal fundamental frequency
• MDVP:Fhi(Hz) – Maximum vocal fundamental frequency
• MDVP:Flo(Hz) – Minimum vocal fundamental frequency
• MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP – Several measures of variation in fundamental frequency
• MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA – Several measures of variation in amplitude
• NHR, HNR – Two measures of ratio of noise to tonal components in the voice
• status – Health status of the subject: (one) – Parkinson’s, (zero) – healthy
• RPDE, D2 – Two nonlinear dynamical complexity measures
• DFA – Signal fractal scaling exponent
• spread1, spread2, PPE – Three nonlinear measures of fundamental frequency variation
• Importing Necessary Libraries
• We first load the libraries required to build our model.
• import pandas as pd
• import numpy as np
• from sklearn.tree import DecisionTreeClassifier
• from sklearn.model_selection import train_test_split
• from sklearn.model_selection import KFold
• from sklearn.model_selection import StratifiedKFold
• Reading CSV Data into Pandas
• Next, we load the dataset from the CSV file into a pandas DataFrame
and check the top 5 rows.
• df = pd.read_csv("Parkinsson disease.csv")
• df.head()
• Data Preprocessing
• The “name” column is not going to add any value in training the model
and can be discarded, so we are dropping it below.
• df.drop(df.columns[0], axis = 1, inplace = True)
• Next, we will separate the feature and target matrix as shown below.
• #Independent And dependent features
• X=df.drop('status', axis=1)
• y=df['status']
Hold out Approach in Sklearn
• The hold-out approach can be applied by using the train_test_split function of
sklearn.model_selection.
• In the example below, we split the dataset so that the test data is 30% of the rows
and the train data is 70%. Fixing random_state makes the split
deterministic in every run.
• from sklearn.model_selection import train_test_split
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=4)
• model = DecisionTreeClassifier()
• model.fit(X_train, y_train)
• result = model.score(X_test, y_test)
• print(result)
• Out[38]:
• 0.7796610169491526
K-Fold Cross-Validation
• K-Fold Cross-Validation in Sklearn can be applied by passing a KFold splitter to the cross_val_score function of sklearn.model_selection.
• In the example below, 10 folds are used, producing 10 accuracy scores from which we calculate the mean
score.
• In [40]:
• from sklearn.model_selection import cross_val_score
• model=DecisionTreeClassifier()
• kfold_validation=KFold(10)
• results=cross_val_score(model,X,y,cv=kfold_validation)
• print(results)
• print(np.mean(results))
• Out[40]:
• [0.7 0.8 0.8 0.8 0.8 0.78947368
• 0.84210526 1. 0.68421053 0.36842105]
• 0.758421052631579
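• The scores above differ a lot between folds. One possible cause, offered here only as a guess, is row order: KFold does not shuffle by default, so each fold is a contiguous block of rows. The sketch below shows a shuffled variant. The synthetic data from make_classification is an assumption that stands in for the Parkinson's CSV, and random_state=0 is an arbitrary choice, so the numbers it prints are not comparable with the output above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 195 rows, default 20 features.
X, y = make_classification(n_samples=195, random_state=0)

model = DecisionTreeClassifier(random_state=0)
# shuffle=True mixes the rows before they are cut into 10 folds.
kfold_shuffled = KFold(n_splits=10, shuffle=True, random_state=0)
results = cross_val_score(model, X, y, cv=kfold_shuffled)

print(results)
print(np.mean(results))
```

• Whether shuffling changes the picture on the real dataset would have to be checked by running it there.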
• Stratified K-fold Cross-Validation
• In Sklearn, stratified K-fold cross-validation can be applied by
using the StratifiedKFold class of sklearn.model_selection.
• In the example below, the dataset is divided into 5 splits or folds. It returns 5
accuracy scores from which we calculate the final mean score.
• from sklearn.model_selection import StratifiedKFold
• skfold=StratifiedKFold(n_splits=5)
• model=DecisionTreeClassifier()
• scores=cross_val_score(model,X,y,cv=skfold)
• print(scores)
• print(np.mean(scores))
• Out[41]:
• array([0.61538462, 0.79487179, 0.71794872, 0.74358974, 0.71794872])
• 0.717948717948718
Leave One Out Cross Validation(LOOCV)
• In Sklearn, Leave One Out Cross Validation (LOOCV) can be applied by using the LeaveOneOut class of sklearn.model_selection.
• from sklearn.model_selection import LeaveOneOut
• model=DecisionTreeClassifier()
• leave_validation=LeaveOneOut()
• results=cross_val_score(model,X,y,cv=leave_validation)
• results
• Out[22]:
• array([1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1.,
• 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1.,
• 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1.,
• 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
• 0., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., ...])
• print(np.mean(results))
• Out[44]:
• 0.8358974358974359
Repeated Random Test-Train Splits
• In Sklearn, repeated random test-train splits can be applied by using the ShuffleSplit class of
sklearn.model_selection.
• In [45]:
• from sklearn.model_selection import ShuffleSplit
• model=DecisionTreeClassifier()
• ssplit=ShuffleSplit(n_splits=10,test_size=0.30)
• results=cross_val_score(model,X,y,cv=ssplit)
• print(results)
• print(np.mean(results))
• Out[45]:
• array([0.79661017, 0.71186441, 0.79661017, 0.88135593, 0.72881356,
• 0.84745763, 0.83050847, 0.77966102, 0.83050847, 0.81355932])

  • 2.
    • Cross-validation is a technique for estimating how well a model performs by training it on one subset of the input data and testing it on a previously unseen subset of the same data. Put another way, it is a technique for checking how well a statistical model generalizes to an independent dataset.
    • In machine learning there is always a need to test the stability of the model. We cannot judge a model only by how well it fits the training dataset. For this purpose, we reserve a sample of the dataset that is not part of the training data. We then test the model on that sample before deployment, and this complete process comes under cross-validation. This is somewhat different from the general one-time train-test split.
  • 3.
    • The basic steps of cross-validation are:
      • Reserve a subset of the dataset as a validation set.
      • Train the model using the training dataset.
      • Evaluate model performance using the validation set. If the model performs well on the validation set, proceed to the next step; otherwise, check for issues.
  • 4. Key aspects of evaluating the quality of the model are:
    • How accurate the model is
    • How generalized the model is
    • When we start building a model and train it with the 'entire' dataset, we can very well calculate its accuracy on this training data set. But we cannot test how this model will behave with new data which is not present in the training set, hence its generalization cannot be determined.
    • Hence we need techniques to make use of the same data set for both training and testing of the models.
    • In machine learning, Cross-Validation is the technique to evaluate how well the model has generalized and its overall accuracy. For this purpose, it randomly samples data from the dataset to create training and testing sets. There are multiple cross-validation approaches, as follows:
      • 1. Hold Out Approach
      • 2. Leave One Out Cross-Validation
      • 3. K-Fold Cross-Validation
      • 4. Stratified K-Fold Cross-Validation
      • 5. Repeated Random Train Test Split
  • 5. 1. Hold Out Approach
    • In the hold-out approach, the data set is split into a train set and a test set by random sampling. The train set is used for training the model and the test set is used to test its accuracy on unseen data. If the training accuracy and the test accuracy are almost the same, the model is said to have generalized well. It is common to use 80% of the data for training and the remaining 20% for testing.
    • Advantages
      • It is simple and easy to implement.
      • The execution time is low.
    • Disadvantages
      • If the dataset itself is small, setting aside a portion for testing reduces the robustness of the model, because the remaining training sample may not be representative of the entire dataset.
      • The evaluation metrics may vary due to the randomness of the split between the train and test set.
      • Although an 80-20 train-test split is widely followed, there is no rule of thumb for the split, and hence the results can vary based on how the split is done.
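The hold-out split described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the helper name `hold_out_split`, the seed values, and the use of row indices in place of real data are assumptions made for the example, not something taken from the slides.

```python
import random

def hold_out_split(n_samples, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) for one random hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random sampling without replacement
    n_test = round(n_samples * test_fraction)  # size of the held-out test set
    return indices[n_test:], indices[:n_test]

# 195 rows with an 80/20 split (hypothetical row count chosen for illustration)
train_idx, test_idx = hold_out_split(195, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))

# A different seed selects a different test set, which is one reason
# hold-out metrics can change from run to run.
_, other_test_idx = hold_out_split(195, test_fraction=0.2, seed=1)
print(len(set(test_idx) & set(other_test_idx)), "rows shared by the two test sets")
```

Fixing the seed makes a single split reproducible, but it does not remove the dependence of the score on which rows happened to land in the test set.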
  • 6. 2. Leave One Out Cross Validation (LOOCV)
    • In this technique, if there are n observations in the dataset, only one observation is reserved for testing and the remaining n-1 data points are used for training. This is repeated n times, so that every data point is used for testing exactly once. Finally, the overall accuracy is calculated by averaging the accuracies of all iterations.
    • Advantages
      • Since every data point participates in both training and testing, the overall accuracy is more reliable.
      • It is very useful when the dataset is small.
    • Disadvantages
      • LOOCV is not practical when the number of observations n is huge. For example, a dataset with 500,000 records requires 500,000 models to be trained, which is not really feasible.
      • There is a huge computational and time cost associated with the LOOCV approach.
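To make the LOOCV loop concrete, here is a minimal sketch in plain Python. The toy data, the helper names, and the 1-nearest-neighbour rule standing in for a real model are all assumptions chosen for illustration.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_indices, test_indices) with exactly one test row per iteration."""
    for test_index in range(n_samples):
        train_indices = [i for i in range(n_samples) if i != test_index]
        yield train_indices, [test_index]

def predict_nearest(train_points, x):
    """Toy 1-nearest-neighbour rule: copy the label of the closest training value."""
    _, nearest_label = min(train_points, key=lambda point: abs(point[0] - x))
    return nearest_label

# Hypothetical toy data: (feature value, class label)
data = [(0.9, 0), (1.0, 0), (1.2, 0), (2.5, 0), (2.9, 1), (3.0, 1), (3.2, 1)]

scores = []
for train_idx, test_idx in leave_one_out_splits(len(data)):
    train_points = [data[i] for i in train_idx]
    x, true_label = data[test_idx[0]]
    # Each iteration scores a single row, so its accuracy is either 1.0 or 0.0
    scores.append(1.0 if predict_nearest(train_points, x) == true_label else 0.0)

loocv_accuracy = sum(scores) / len(scores)  # average over the n iterations
print(scores, loocv_accuracy)
```

Because each iteration tests one row, the final LOOCV accuracy is simply the fraction of rows predicted correctly when left out, and the number of model fits equals the number of rows.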
  • 7. 3. K-Fold Cross-Validation
    • In the K-Fold Cross-Validation approach, the dataset is split into K folds. In the 1st iteration, the first fold is reserved for testing and the model is trained on the data of the remaining K-1 folds.
    • In the next iteration, the second fold is reserved for testing and the remaining folds are used for training. This continues until the K-th iteration. The accuracies obtained in the iterations are averaged to derive the overall accuracy of the model.
    • Advantages
      • K-Fold cross-validation is useful when the dataset is small and cannot be split into a train-test set (hold-out approach) without losing useful data for training.
      • It helps to build a robust model and gives a performance estimate with lower variance than a single split, since every observation is used for both training and testing across the iterations.
    • Disadvantages
      • The major disadvantage of K-Fold Cross Validation is that the training needs to be done K times, and hence it consumes more time and resources.
      • It is not recommended for sequential time series data, because arbitrary folds ignore the time order of the observations.
      • When the dataset is imbalanced, K-fold cross-validation may not give good results, because some folds may have just a few or no records for the minority class.
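The fold rotation can be sketched as an index generator. This is a simplified illustration: the function name `k_fold_splits` is hypothetical, the folds are consecutive and not shuffled, and the remainder rows are given to the first folds.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive, non-shuffled folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 195 rows split into 10 folds (hypothetical sizes chosen for illustration)
fold_sizes = [len(test) for _, test in k_fold_splits(195, 10)]
print(fold_sizes)  # [20, 20, 20, 20, 20, 19, 19, 19, 19, 19]
```

Every row lands in exactly one test fold, so K models are trained and each row is scored once. With consecutive folds, any ordering in the data (for example, rows sorted by class) carries straight into the folds.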
  • 8. 4. Stratified K-Fold Cross-Validation
    • Stratified K-fold cross-validation is useful when the data is imbalanced. While sampling data into K folds, it makes sure that the distribution of all classes in each fold is maintained. For example, if 98% of the data belongs to class B and 2% to class A, stratified sampling makes sure each fold contains the two classes in the same ratio of 98% to 2%.
    • Advantage
      • Stratified K-fold cross-validation is recommended when the dataset is imbalanced.
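The 98% to 2% example can be reproduced with a small sketch. It uses a simplified round-robin assignment per class, which is an assumption made for illustration and not necessarily how any particular library implements stratification; the function name and the 500-row dataset are hypothetical.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign row indices to k folds so that each class is spread evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        # Deal the rows of each class across the folds like cards
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds

# Hypothetical imbalanced dataset: 98% class B, 2% class A, sorted by class
labels = ["B"] * 490 + ["A"] * 10

folds = stratified_folds(labels, k=5)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
print(fold_counts)   # every fold holds 98 B rows and 2 A rows

# Plain consecutive folds on the same data, for comparison
plain_counts = [Counter(labels[start:start + 100]) for start in range(0, len(labels), 100)]
print(plain_counts)  # the first four folds contain no A rows at all
```

The comparison shows the failure mode that stratification avoids: with plain consecutive folds on class-sorted data, most test folds never see the minority class.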
  • 9. 5. Repeated Random Test-Train Split
    • Repeated random test-train split is a hybrid of traditional train-test splitting and the k-fold cross-validation method. In this technique, we create a random split of the data into a training set and a test set, and then repeat this process multiple times, just like the cross-validation method.
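A minimal sketch of repeated random splitting follows. The helper name, the seed, and the choice to round the test size up are assumptions made for the example.

```python
import math
import random
from collections import Counter

def repeated_random_splits(n_samples, n_repeats, test_fraction, seed=0):
    """Yield n_repeats independent random (train_indices, test_indices) splits."""
    rng = random.Random(seed)
    n_test = math.ceil(n_samples * test_fraction)  # assumption: round the test size up
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                       # a fresh shuffle for every repetition
        yield indices[n_test:], indices[:n_test]

splits = list(repeated_random_splits(195, n_repeats=10, test_fraction=0.30))
print([(len(train), len(test)) for train, test in splits])

# Unlike K-fold, the test sets are drawn independently, so a row may be
# tested several times or not at all across the repetitions.
times_tested = Counter(i for _, test in splits for i in test)
print(min(times_tested.get(i, 0) for i in range(195)),
      max(times_tested.values()))
```

The number of repetitions and the test size can be chosen independently of each other, which is the main practical difference from K-fold, where the test size is fixed by K.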
  • 10. Examples of Cross-Validation in Sklearn Library
    • About Dataset
    • We will be using the Parkinson's disease dataset for all examples of cross-validation in the Sklearn library. The goal is to predict whether or not a particular patient has Parkinson's disease. We will be using the decision tree algorithm in all the examples.
    • The dataset has 21 attributes and 195 rows. The various fields of the Parkinson's Disease dataset are as follows:
      • MDVP:Fo(Hz): Average vocal fundamental frequency
      • MDVP:Fhi(Hz): Maximum vocal fundamental frequency
      • MDVP:Flo(Hz): Minimum vocal fundamental frequency
      • MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP: Several measures of variation in fundamental frequency
      • MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA: Several measures of variation in amplitude
      • NHR, HNR: Two measures of the ratio of noise to tonal components in the voice
      • status: Health status of the subject; 1 = Parkinson's, 0 = healthy
      • RPDE, D2: Two nonlinear dynamical complexity measures
      • DFA: Signal fractal scaling exponent
      • spread1, spread2, PPE: Three nonlinear measures of fundamental frequency variation
  • 11. Importing Necessary Libraries
    • We first load the libraries required to build our model.
        import pandas as pd
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.model_selection import KFold
        from sklearn.model_selection import StratifiedKFold
  • 12. Reading CSV Data into Pandas
    • Next, we load the dataset from the CSV file into a pandas dataframe and check the top 5 rows.
        df = pd.read_csv("Parkinsson disease.csv")
        df.head()
  • 13. Data Preprocessing
    • The "name" column is not going to add any value in training the model and can be discarded, so we drop it below.
        df.drop(df.columns[0], axis=1, inplace=True)
    • Next, we separate the feature matrix and the target vector as shown below.
        # Independent and dependent features
        X = df.drop('status', axis=1)
        y = df['status']
  • 14. Hold out Approach in Sklearn
    • The hold-out approach can be applied by using the train_test_split function of sklearn.model_selection.
    • In the example below, we split the dataset to create test data with a size of 30% and train data with a size of 70%. The random_state number ensures the split is deterministic in every run.
        from sklearn.model_selection import train_test_split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=4)
        model = DecisionTreeClassifier()
        model.fit(X_train, y_train)
        result = model.score(X_test, y_test)
        print(result)
      Out[38]:
        0.7796610169491526
  • 15. K-Fold Cross-Validation
    • K-Fold Cross-Validation in Sklearn can be applied by passing a KFold splitter to the cross_val_score function of sklearn.model_selection.
    • In the example below, 10 folds are used, which produce 10 accuracy scores, from which we calculate the mean score.
      In [40]:
        from sklearn.model_selection import cross_val_score
        model = DecisionTreeClassifier()
        kfold_validation = KFold(10)
        results = cross_val_score(model, X, y, cv=kfold_validation)
        print(results)
        print(np.mean(results))
      Out[40]:
        [0.7 0.8 0.8 0.8 0.8 0.78947368
         0.84210526 1. 0.68421053 0.36842105]
        0.758421052631579
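The mean alone hides how much the ten fold scores differ. As a small worked example, the sketch below summarizes the scores copied from the printed output above using only the standard library; the variable names are illustrative.

```python
import statistics

# Fold accuracies copied from the printed K-fold output
fold_scores = [0.7, 0.8, 0.8, 0.8, 0.8, 0.78947368,
               0.84210526, 1.0, 0.68421053, 0.36842105]

mean_score = statistics.mean(fold_scores)
std_score = statistics.stdev(fold_scores)  # sample standard deviation across folds
print(f"mean={mean_score:.3f} std={std_score:.3f} "
      f"min={min(fold_scores):.3f} max={max(fold_scores):.3f}")
```

Reporting the spread alongside the mean makes it visible that one fold scored about 0.37 while another scored 1.0. A gap like that can be a sign that consecutive, non-shuffled folds do not all resemble the full dataset, though the slides do not say how the rows are ordered.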
  • 16. Stratified K-fold Cross-Validation
    • In Sklearn, stratified K-fold cross-validation can be applied by using the StratifiedKFold class of sklearn.model_selection.
    • In the example below, the dataset is divided into 5 splits or folds. It returns 5 accuracy scores, from which we calculate the final mean score.
        from sklearn.model_selection import StratifiedKFold
        skfold = StratifiedKFold(n_splits=5)
        model = DecisionTreeClassifier()
        scores = cross_val_score(model, X, y, cv=skfold)
        print(scores)
        print(np.mean(scores))
      Out[41]:
        array([0.61538462, 0.79487179, 0.71794872, 0.74358974, 0.71794872])
        0.717948717948718
  • 17. Leave One Out Cross Validation (LOOCV)
    • In Sklearn, Leave One Out Cross Validation (LOOCV) can be applied by using the LeaveOneOut class of sklearn.model_selection.
        from sklearn.model_selection import LeaveOneOut
        model = DecisionTreeClassifier()
        leave_validation = LeaveOneOut()
        results = cross_val_score(model, X, y, cv=leave_validation)
        results
      Out[22]:
        array([1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1.,
               1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1.,
               1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1.,
               1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
               0., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., ...])
        print(np.mean(results))
      Out[44]:
        0.8358974358974359
  • 18. Repeated Random Test-Train Splits
    • In Sklearn, repeated random test-train splits can be applied by using the ShuffleSplit class of sklearn.model_selection.
      In [45]:
        from sklearn.model_selection import ShuffleSplit
        model = DecisionTreeClassifier()
        ssplit = ShuffleSplit(n_splits=10, test_size=0.30)
        results = cross_val_score(model, X, y, cv=ssplit)
        print(results)
        print(np.mean(results))
      Out[45]:
        array([0.79661017, 0.71186441, 0.79661017, 0.88135593, 0.72881356,
               0.84745763, 0.83050847, 0.77966102, 0.83050847, 0.81355932])
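The printed output stops after the array, so the mean is not shown. As a worked check, the sketch below averages the ten scores exactly as printed. The conversion back to counts assumes a test set of 59 rows (30% of 195, rounded up), which is an assumption about how the split size was rounded.

```python
import math

# Scores copied from the printed ShuffleSplit output
split_scores = [0.79661017, 0.71186441, 0.79661017, 0.88135593, 0.72881356,
                0.84745763, 0.83050847, 0.77966102, 0.83050847, 0.81355932]

mean_score = sum(split_scores) / len(split_scores)
print(round(mean_score, 4))  # 0.8017

# Assumption: each test set holds ceil(195 * 0.30) = 59 rows
n_test = math.ceil(195 * 0.30)
correct_counts = [round(score * n_test) for score in split_scores]
print(correct_counts)  # [47, 42, 47, 52, 43, 50, 49, 46, 49, 48]
```

Since no random_state is passed to ShuffleSplit in the example, rerunning it would be expected to give different scores and a different mean.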