2. Learning Objective
→ Validation and overfitting.
→ Validation strategies.
→ Data splitting strategies.
→ Problems occurring during validation.
3. Validation And Overfitting
We want to check if the model gives expected results on unseen data.
→ We divide the data we have into two parts:
→ Train part
→ Validation part
→ We fit our model on the train part and check its quality on the validation part.
→ Our model will eventually be checked against unseen data in the future, and that data can differ from the data we have.
→ To choose the best model, we basically want to avoid underfitting on the one side and overfitting on the other.
4. So, we want our model to be able to capture patterns in the data, but only those patterns that generalize well between train and test data.
To choose the best model, we basically want to avoid underfitting on the one side and overfitting on the other.
5. Let's understand this concept with a very simple example of a binary classification task.
→ We will be using simple models defined by the formulas under the picture and visualizing the results of the models' predictions.
→ We can see in the picture below that if the model is too simple, it can't capture the underlying relationship and we will get poor results.
→ If we want our results to improve, we can increase the complexity of the model, and we will undoubtedly find that quality on the training data goes up.
7. But on the other hand, if we make the model too complicated, as in the picture below,
→ it will describe noise in the train data that doesn't generalize to the test data.
→ This leads to a decrease in model quality; this is called overfitting.
9. So, we want our model to sit in between underfitting and overfitting.
→ In general, we say a model is overfitted if its quality on the train set is better than on the test set.
→ In competitions, we often say that the model is overfitted only when quality on the test set turns out to be worse than we expected.
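To make this trade-off concrete, here is a minimal sketch (not from the slides; the synthetic dataset, polynomial degrees, and seeds are arbitrary choices) that raises model complexity through polynomial features and compares train and validation accuracy:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Synthetic binary classification data (placeholder values).
    rng = np.random.RandomState(0)
    X = rng.rand(300, 2)
    y = ((X[:, 0] - 0.5) ** 2 + (X[:, 1] - 0.5) ** 2 < 0.08).astype(int)

    X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

    # Higher degree = more complex model. A widening gap between
    # train and validation accuracy signals overfitting.
    for degree in (1, 3, 10):
        model = make_pipeline(PolynomialFeatures(degree),
                              LogisticRegression(max_iter=5000))
        model.fit(X_tr, y_tr)
        print(degree, model.score(X_tr, y_tr), model.score(X_va, y_va))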
11. Validation Strategies
Validation helps us select a model which will perform best on unseen data.
→ The main difference between these validation strategies is the number of splits being done.
→ The three validation types are
→ Holdout
→ K-fold
→ Leave-one-out
12. Hold-out
It's a simple data split which divides the data into two parts:
→ Train dataframe
→ Validation dataframe
Each sample goes either to train or to validation.
→ So, the samples in train and validation do not overlap; if they do, we can't trust our validation.
→ When we have repeated samples in the data, we'll get a better prediction for these samples, and the validation score will be overly optimistic.
→ Using a holdout in a competition is a good idea when we have enough data, so that a single split is representative.
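As an illustration, a holdout split can be made with scikit-learn's train_test_split; this is a minimal sketch, and the 80/20 ratio and random seed are arbitrary example choices:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Toy data: 100 samples, 5 features (placeholder values).
    X = np.random.rand(100, 5)
    y = np.random.randint(0, 2, size=100)

    # One split into non-overlapping train and validation parts.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.2, random_state=42)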
14. K-fold
K-fold can be viewed as a repeated holdout, because we split our data into K parts and iterate through them, using every part as a validation set only once.
→ After this procedure, we average the scores over these K folds.
→ Note the difference from holdout repeated K times: there, some samples may never get into validation while others appear multiple times, whereas in K-fold every sample is in validation exactly once.
→ This method is a good choice when we have a minimal amount of data, so a single holdout score would be too noisy.
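A minimal K-fold sketch along these lines (K=5 and logistic regression are arbitrary illustrative choices):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    X = np.random.rand(100, 5)
    y = np.random.randint(0, 2, size=100)

    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    scores = []
    for train_idx, valid_idx in kf.split(X):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        # Each part serves as the validation set exactly once.
        scores.append(model.score(X[valid_idx], y[valid_idx]))

    # Average the scores over the K folds.
    print(np.mean(scores))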
17. Leave-one-out
It's a special case of K-fold, where K = the number of samples in our data.
→ This means that we iterate through every sample, holding it out as the validation set.
→ This method can be helpful if we have very little data and the model is fast enough to retrain.
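scikit-learn exposes this scheme directly as LeaveOneOut; a tiny sketch with placeholder data:

    import numpy as np
    from sklearn.model_selection import LeaveOneOut

    X = np.random.rand(10, 3)  # tiny toy data (placeholder values)

    loo = LeaveOneOut()  # equivalent to KFold with n_splits = len(X)
    for train_idx, valid_idx in loo.split(X):
        # Exactly one sample is held out for validation each iteration.
        assert len(valid_idx) == 1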
19. Stratification
It is just a way to ensure we get a similar target distribution over the different folds.
→ If we split data into four folds with stratification, then for a balanced binary target the average target of each fold will be equal to one half.
We usually use holdout or K-fold on shuffled data.
→ By shuffling the data we are trying to reproduce a random train/validation split.
→ But sometimes, especially if we don't have enough samples for some class, a random split can fail.
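A sketch using scikit-learn's StratifiedKFold, with four folds to match the example above (the data values are placeholders):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.random.rand(100, 5)
    y = np.array([0, 1] * 50)  # balanced binary target

    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
    for train_idx, valid_idx in skf.split(X, y):
        # Each fold preserves the overall class proportions,
        # so the mean target in every fold is close to 0.5.
        print(y[valid_idx].mean())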
23. Data Splitting Strategies
Note that the features which are most useful for one kind of train/test split can be useless for another.
→ If we carefully generate features that draw attention to time-based patterns, we'll get a reliable validation only with a corresponding time-based split.
→ If we create features which are useful for a time-based split and useless for a random split, we would be correct to use a time-based split, not a random one.
25. That means, to be able to find smart ideas for feature generation and consistently improve our model, we absolutely want to identify the train/test split made by the organizers and mimic it in our validation.
27. Splitting data into train and validation
Most splits can be united into three categories:
→ Random, row-wise
→ Timewise
→ By ID
28. Random Split
The most common way of making a train/test split is to split the data randomly by rows.
→ This works when rows are independent of each other.
Example:
We have a task of predicting whether a client will pay off a loan.
1) Each row represents a person, and at first glance these rows are fairly independent of each other.
2) But there is some dependency between family members or people who work in the same company.
3) If a husband can pay off a credit, probably his wife can do it too.
4) If, by some misfortune, the husband is present in the test set and the wife in the train set, we can devise a feature that exploits this family connection, and our validation score will be misleadingly optimistic.
29. Timewise (time-based split)
We generally take everything before a particular date as training data, and everything after that date as test data.
→ This can be a signal to use a special approach to feature generation, especially features based on past target values.
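One way to set up a time-based validation is scikit-learn's TimeSeriesSplit; in this sketch the rows are assumed to be already sorted by date, and the number of splits is arbitrary:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    # Toy data, assumed already sorted by date.
    X = np.arange(100).reshape(-1, 1)

    tscv = TimeSeriesSplit(n_splits=5)
    for train_idx, valid_idx in tscv.split(X):
        # Every training index precedes every validation index in time.
        print(train_idx.max(), "<", valid_idx.min())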
30. ID-based split
An ID can be a unique identifier of a user, shop, or any other entity.
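For an ID-based split, scikit-learn's GroupKFold keeps all rows sharing an ID on one side of each split; a minimal sketch where the group values stand in for hypothetical user IDs:

    import numpy as np
    from sklearn.model_selection import GroupKFold

    X = np.random.rand(12, 3)
    y = np.random.randint(0, 2, size=12)
    groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])  # hypothetical user IDs

    gkf = GroupKFold(n_splits=3)
    for train_idx, valid_idx in gkf.split(X, y, groups=groups):
        # No ID appears in both train and validation.
        assert set(groups[train_idx]).isdisjoint(groups[valid_idx])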