Machine Learning - Validation
Chode Amarnath
Learning Objective
→ Validation and overfitting.
→ Validation Strategies.
→ Data Splitting Strategies.
→ Problems occurring during Validation
Validation And Overfitting
We want to check whether the model gives the expected results on unseen data.
→ We divide the data we have into two parts:
→ Train part
→ Validation part
→ We fit the model on the train part and check its quality on the validation part.
→ In the end, the model will be checked against unseen data in the future, and that data can differ from the data we have.
So we want the model to capture patterns in the data, but only those patterns that generalize well to both train and test data.
To choose the best model, we want to avoid underfitting on the one side and overfitting on the other.
Let's understand this concept with a very simple example of a binary classification task.
→ We will use simple models defined by the formulas under the picture and visualize the models' predictions.
→ The picture below shows that if the model is too simple, it can't capture the underlying relationship and we get poor results.
→ If we want the results to improve, we can increase the complexity of the model, and we will undoubtedly find that quality on the training data goes up.
On the other hand, we can make the model too complicated, as in the picture below.
→ It starts describing noise in the train data that doesn't generalize to the test data.
→ This leads to a decrease in model quality on unseen data, which is called overfitting.
So we want a model that sits between underfitting and overfitting.
→ In general, we say a model is overfitted if its quality on the train set is better than on the test set.
→ In competitions, we often say that a model is overfitted only when its quality on the test set is worse than we expected.
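The gap between train and validation quality can be sketched on toy data. This is only an illustration under stated assumptions: the k-nearest-neighbours classifier, the synthetic data and the 20% label noise are choices made here, not the models shown in the pictures. A small k plays the role of the overly complex model.

```python
import random

def make_data(n, rng, noise=0.2):
    """Points in the unit square; label is 1 when x + y > 1, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        label = int(x + y > 1)
        if rng.random() < noise:
            label = 1 - label  # label noise that a good model should NOT learn
        data.append(((x, y), label))
    return data

def knn_predict(train, point, k):
    """Majority vote among the k training points closest to `point` (k should be odd)."""
    nearest = sorted(
        train,
        key=lambda s: (s[0][0] - point[0]) ** 2 + (s[0][1] - point[1]) ** 2,
    )[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)

def accuracy(train, samples, k):
    """Share of `samples` whose label is predicted correctly from `train`."""
    hits = sum(knn_predict(train, p, k) == label for p, label in samples)
    return hits / len(samples)

rng = random.Random(0)
train = make_data(200, rng)
valid = make_data(200, rng)

for k in (1, 15):
    print("k =", k,
          "train accuracy:", accuracy(train, train, k),
          "validation accuracy:", accuracy(train, valid, k))
```

With k = 1 every training point is its own nearest neighbour, so train accuracy is perfect because the model has memorized the noise. The validation accuracy is the number to compare between models, and it is typically lower for k = 1 than for a smoother model on data like this.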
Validation Strategies
Validation helps us select the model that will perform best on unseen data.
→ The main difference between these validation strategies is the number of splits being done.
→ The three validation types are
→ Holdout
→ K-fold
→ Leave-one-out
Hold-out
It's a simple split that divides the data into two parts:
→ Train dataframe
→ Validation dataframe
Each sample goes either to train or to validation, never to both.
→ So the train and validation samples do not overlap; if they do, we can't trust our validation.
→ When repeated samples end up in both parts, predictions for them look better than they should, and the validation score is too optimistic.
→ Holdout in a competition is a good idea when we have enough data.
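A holdout split can be sketched with the standard library alone. The function name `holdout_split`, the 20% validation share and the fixed seed are assumptions made for this example.

```python
import random

def holdout_split(n_samples, valid_fraction=0.2, seed=0):
    """Shuffle row indices once and cut them into train and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = int(n_samples * valid_fraction)
    # The two slices never share an index, so the parts cannot overlap.
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = holdout_split(10)
print("train:", sorted(train_idx))
print("valid:", sorted(valid_idx))
```

The split works on row indices, so the same indices can be applied to features and targets alike. Duplicated rows would still need to be removed beforehand, because identical rows at different indices can land on both sides.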
K-fold
K-fold can be viewed as a repeated holdout: we split the data into K parts and iterate through them, using every part as a validation set exactly once.
→ After this procedure, we average the scores over the K folds.
→ With a holdout repeated K times on random splits, some samples may never get into validation while others get there several times; in K-fold, every sample is validated exactly once.
→ This method is a good choice when we have a limited amount of data.
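The K-fold procedure can be sketched as follows. The helper names `kfold_indices` and `cross_val_mean`, and the `score_fold` callback, are hypothetical. The callback stands in for "fit on the train indices, score on the validation indices", which depends on the model in use.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs; fold sizes differ by at most one."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

def cross_val_mean(n_samples, k, score_fold):
    """Average a per-fold score; `score_fold(train_idx, valid_idx)` is supplied by the caller."""
    scores = [score_fold(tr, va) for tr, va in kfold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(kfold_indices(10, 3))
for train_idx, valid_idx in folds:
    print("valid:", valid_idx, "train size:", len(train_idx))
```

The folds here are contiguous blocks, so the data should be shuffled first if row order carries information.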
Leave-one-out
It's a special case of K-fold where K equals the number of samples in our data.
→ This means we iterate through every sample, each time validating on that one sample and training on all the others.
→ This method can be helpful if we have too little data.
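As a minimal sketch, leave-one-out produces as many splits as there are samples. The function name `leave_one_out` is an assumption made for this example.

```python
def leave_one_out(n_samples):
    """Yield (train_idx, valid_idx) pairs where validation holds exactly one sample."""
    for i in range(n_samples):
        train_idx = [j for j in range(n_samples) if j != i]
        yield train_idx, [i]

splits = list(leave_one_out(4))
for train_idx, valid_idx in splits:
    print("valid:", valid_idx, "train:", train_idx)
```

The model is refit once per sample, which is why this approach is usually reserved for small datasets.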
Stratification
It is a way to ensure we get a similar target distribution across the different folds.
→ If we split data with a balanced binary target into four folds with stratification, the average target in each fold will be equal to one half.
We usually use holdout or K-fold on shuffled data.
→ By shuffling the data we are trying to reproduce a random train/validation split.
→ But sometimes, especially if we do not have enough samples for some class, a random split can fail.
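A stratified split into folds can be sketched by dealing the samples of each class across the folds in turn. The function name `stratified_kfold` and the eight-sample toy target are assumptions. Shuffling within each class is left out to keep the sketch short.

```python
from collections import defaultdict

def stratified_kfold(labels, k):
    """Return k lists of indices, dealing each class round-robin across the folds."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    return folds

labels = [0, 0, 0, 0, 1, 1, 1, 1]  # balanced binary target
folds = stratified_kfold(labels, 4)
for fold in folds:
    mean_target = sum(labels[i] for i in fold) / len(fold)
    print("fold:", fold, "mean target:", mean_target)
```

For this toy target every fold contains one sample of each class, so the mean target of each fold is 0.5.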
Data Splitting strategies
The features that are most useful for a model under one kind of split can be useless for a model under another kind of split.
→ If we carefully generate features that draw attention to time-based patterns, we will not get a reliable validation with a random split.
→ If we create features that are useful for a time-based split and useless for a random split, it would be incorrect to validate them with a random split.
That means, to be able to find smart ideas for feature generation and consistently improve the model, we absolutely want to identify the train/test split made by the organizers and mimic it in our validation.
Splitting data into train and validation
Most splits fall into three categories:
→ Random, row-wise
→ Time-wise
→ By ID
Random Split
The most common way of making a train/test split is to split the data randomly by rows.
→ This usually means that the rows are independent of each other.
Example:
We have the task of predicting whether a client will pay off a loan.
1) Each row represents a person, and these rows are fairly independent of each other.
2) Still, there can be some dependency, for example between family members or people who work in the same company.
3) If a husband can repay a loan, his wife probably can too.
4) By some misfortune, the husband may be present in the test set and the wife in the train set, and we can devise a …
Time-wise (time-based split)
We generally take everything before a particular date as training data and everything after that date as test data.
→ This can be a signal to use a special approach to feature generation.
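A time-based validation split that mimics such a train/test split can be sketched as below. The function name `time_split`, the `(date, value)` row layout and the sample rows are assumptions. Rows dated exactly on the cutoff go to validation here, which is a choice made for this example.

```python
from datetime import date

def time_split(rows, cutoff):
    """Rows dated before `cutoff` go to train; rows on or after it go to validation."""
    train = [r for r in rows if r[0] < cutoff]
    valid = [r for r in rows if r[0] >= cutoff]
    return train, valid

rows = [
    (date(2024, 1, 5), 10),
    (date(2024, 2, 1), 12),
    (date(2024, 3, 10), 9),
    (date(2024, 4, 2), 14),
]
train, valid = time_split(rows, cutoff=date(2024, 3, 1))
print("train:", train)
print("valid:", valid)
```

Every validation row is later than every train row, so the model is never scored on a period it has already seen.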
ID-based split
The ID can be a unique identifier of a user, a shop or any other entity.
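One common reading of an ID-based split is that all rows of one entity go to the same side, so the model is validated on entities it has never seen. The sketch below assumes that reading. The function name `id_split`, the 25% share of IDs sent to validation and the sample IDs are hypothetical.

```python
import random

def id_split(ids, valid_fraction=0.25, seed=0):
    """Split row indices so that each ID lands entirely in train or entirely in validation."""
    unique_ids = sorted(set(ids))
    random.Random(seed).shuffle(unique_ids)
    n_valid = max(1, int(len(unique_ids) * valid_fraction))
    valid_ids = set(unique_ids[:n_valid])
    train_idx = [i for i, g in enumerate(ids) if g not in valid_ids]
    valid_idx = [i for i, g in enumerate(ids) if g in valid_ids]
    return train_idx, valid_idx

ids = ["u1", "u1", "u2", "u3", "u3", "u3", "u4", "u4"]  # one ID per row
train_idx, valid_idx = id_split(ids)
print("train ids:", sorted({ids[i] for i in train_idx}))
print("valid ids:", sorted({ids[i] for i in valid_idx}))
```

The split is made over unique IDs rather than rows, so the number of validation rows depends on how many rows the chosen IDs have.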
Validation and Over fitting , Validation strategies