Machine Learning - Validation
Chode Amarnath
Learning Objectives
→ Validation and overfitting.
→ Validation strategies.
→ Data splitting strategies.
→ Problems occurring during validation.
Validation And Overfitting
We want to check whether the model gives the expected results on unseen data.
→ We divide the data we have into two parts:
→ Train part
→ Validation part
→ We fit our model on the train part and check its quality on the validation part.
→ Our model will be checked against unseen data in the future, and that data can differ from the data we have.
So, we want our model to be able to capture patterns in the data, but only those patterns that generalize well between the train and test data.
To choose the best model, we basically want to avoid underfitting on the one side and overfitting on the other.
Let's understand this concept with a very simple example of a binary classification task.
→ We will use simple models defined by the formulas under each picture and visualize the results of the models' predictions.
→ We can see in the picture below that if the model is too simple, it can't capture the underlying relationship and we will get poor results.
→ If we want our results to improve, we can increase the complexity of the model, and we will undoubtedly find that the quality on the training data goes up.
But on the other hand, if we make the model too complicated, like in the picture below:
→ It will describe noise in the train data that doesn't generalize to the test data.
→ This leads to a decrease in model quality; this is called overfitting.
So, we want our model to sit in between underfitting and overfitting (see the sketch below).
→ We say a model is overfitted if its quality on the train set is better than on the test set.
→ In competitions, we often say that a model is overfitted only when its quality on the test set is worse than we expected.
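Here is a minimal sketch of the trade-off, assuming scikit-learn and its make_moons toy dataset (neither appears in the original slides); the polynomial degree stands in for the "formulas under the picture", and we compare train and validation accuracy:

from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy binary classification task with some label noise.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for degree in (1, 3, 15):  # too simple -> about right -> too complex
    model = make_pipeline(PolynomialFeatures(degree),
                          LogisticRegression(max_iter=5000))
    model.fit(X_tr, y_tr)
    # Train accuracy tends to keep rising with degree, while validation
    # accuracy peaks and then drops: that gap is the overfitting signal.
    print(degree, model.score(X_tr, y_tr), model.score(X_val, y_val))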
Validation Strategies
Validation helps us select the model which will perform best on unseen data.
→ The main difference between these validation strategies is the number of
splits being done.
→ The main validation types are:
→ Holdout
→ K-fold
→ Leave-one-out
Holdout
It's a simple data split which divides the data into two parts:
→ Train dataframe
→ Validation dataframe
One sample can go either to train or to validation.
→ So the samples in train and validation do not overlap; if they do, we can't trust our validation.
→ When we have repeated samples in the data, we'll get an unrealistically good prediction for those samples.
→ Using a holdout in a competition is a good idea when we have enough data (see the sketch below).
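A minimal holdout sketch, assuming scikit-learn; X and y here are placeholders for your features and target, not data from the slides:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # placeholder features
y = np.arange(10) % 2             # placeholder binary target

# One split: every sample lands in exactly one of the two parts,
# so train and validation never overlap.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=42)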
K-fold
K-fold can be viewed as a repeated holdout: we split our data into K parts and iterate through them, using every part as a validation set exactly once.
→ After this procedure, we average the scores over the K folds.
→ Note the difference from a holdout repeated K times: with repeated holdout some samples may never get into validation while others appear there several times, whereas K-fold validates every sample exactly once.
→ This method is a good choice when we have only a moderate amount of data (see the sketch below).
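A minimal K-fold sketch, assuming scikit-learn; the make_classification dataset is a toy stand-in, not part of the original slides:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)  # toy stand-in data

# Split into 5 parts; each part serves as the validation set exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores.mean())  # the score averaged over the 5 folds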
Leave-one-out
It's a special case of K-fold, where K equals the number of samples in our data.
→ This means we iterate through every sample, training on all the others and validating on the one held out.
→ This method can be helpful if we have very little data (see the sketch below).
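Leave-one-out is available directly in scikit-learn; a sketch on a deliberately tiny toy dataset (an assumption, not from the slides):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=30, random_state=0)  # very little data

# K = 30 here: one model per sample, each validated on a single held-out row.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(scores.mean())  # fraction of held-out samples predicted correctly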
Stratification
It is just a way to ensure we get a similar target distribution over the different folds.
→ If we split a balanced binary dataset into four folds with stratification, the average target of each fold will be equal to one half.
We usually use holdout or K-fold on shuffled data.
→ By shuffling the data we are trying to reproduce a random train/validation split.
→ But sometimes, especially if you have only a few samples of some class, a random split can fail; stratification prevents this (see the sketch below).
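A sketch with scikit-learn's StratifiedKFold on a hypothetical balanced binary target, matching the one-half claim above:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(48).reshape(24, 2)   # placeholder features
y = np.array([0] * 12 + [1] * 12)  # balanced binary target

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for _, val_idx in skf.split(X, y):
    # Each fold keeps the class ratio of the full data: the mean is 0.5 every time.
    print(y[val_idx].mean())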
Data Splitting Strategies
The features that are most useful for one kind of split can be useless for another.
→ If we carefully generate features that draw attention to time-based patterns, we'll get a reliable validation only with a time-based split.
→ If we create features which are useful for a random split and useless for a time-based split, we'll be correct to use a random split.
That means, to be able to find smart ideas for feature generation and consistently improve our model, we absolutely need to identify the train/test split made by the organizers and reproduce it in our validation.
Splitting data into train and validation
Most splits can be grouped into three categories:
→ Random, rowwise
→ Timewise
→ By id
Random Split
The most common way of making a train/test split is to split the data randomly by rows.
→ This works when the rows are independent of each other.
Example:
We have the task of predicting whether a client will pay off a loan.
1) Each row represents a person, and these rows are fairly independent of each other.
2) But there is some dependency between family members or people who work in the same company.
3) If a husband can pay off a credit, probably his wife can do it too.
4) If, by some misfortune, the husband is present in the test set and the wife in the train set, a model can exploit this dependency, and our validation score becomes misleadingly optimistic (a group-aware split, sketched below, avoids this).
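The slides don't show a fix, but a standard remedy is a group-aware split; here is a sketch using scikit-learn's GroupShuffleSplit with a hypothetical family_id grouping:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(16).reshape(8, 2)                 # placeholder features
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])          # placeholder target
family_id = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # hypothetical group labels

# Whole families go to one side or the other, never both.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(gss.split(X, y, groups=family_id))
assert set(family_id[train_idx]).isdisjoint(family_id[val_idx])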
Time wise (time-based split)
We generally take everything before a particular date as training data, and everything after that date as test data.
→ This can be a signal to use a special approach to feature generation; a minimal sketch follows.
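A minimal time-based split sketch, assuming pandas; the date column and the cutoff are hypothetical:

import pandas as pd

# Hypothetical dataframe with one row per day and a target column.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "target": range(10),
})

cutoff = pd.Timestamp("2024-01-08")
train = df[df["date"] < cutoff]   # everything before the date
test = df[df["date"] >= cutoff]   # everything on or after the date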
ID-based split
An ID can be a unique identifier of a user, a shop, or any other entity; train and test then contain disjoint sets of IDs.
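Cross-validation can mimic such a split with scikit-learn's GroupKFold, using the entity ID as the group; user_id here is a hypothetical name:

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(24).reshape(12, 2)                           # placeholder features
y = np.arange(12) % 2                                      # placeholder target
user_id = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5])  # hypothetical IDs

gkf = GroupKFold(n_splits=3)
for train_idx, val_idx in gkf.split(X, y, groups=user_id):
    # Each user appears on only one side of every split.
    assert set(user_id[train_idx]).isdisjoint(user_id[val_idx])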