MODEL PERFORMANCE
Presented by Megan Verbakel
Quick refresher
Today we will focus on the simple case of a binary classifier.
A binary classifier is a predictive model where the target can take the value of
0 or 1 (e.g. predicting whether a customer will reject (0) or accept (1) an offer).
0 and 1 are called classes, where 1 is the positive class (outcome of interest).
We start by taking a historical data set where each row represents one
instance (e.g. a customer) and each column is a feature (e.g. income).
In addition, we need a target column (e.g. an outcome for each customer).
Next, we apply a machine learning algorithm to learn patterns from the
features to predict the probability of each class for each instance (row).
The target values are known for the historical data, so we can use them to
understand how the model will perform when applied to new data where
we don't yet know the outcomes.
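For illustration, a minimal sketch of such a data set in Python (all column names and values here are made up):
import pandas as pd

# Toy historical data set: one row per customer (instance), feature columns,
# and a known 0/1 target column (names and values are illustrative only)
data = pd.DataFrame({
    'income': [52000, 31000, 87000, 45000],
    'age': [34, 51, 29, 42],
    'target': [1, 0, 1, 0],  # 1 = accepted the offer, 0 = rejected
})
features, target = data[['income', 'age']], data['target']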
THEORY
Bias-variance trade-off
Over/under fitting
Finding the performance sweet spot
Data preparation
Performance metrics
Performance plots
PRACTICAL
Walk-through in Python
CONTENT
OUTLINE
Bias-Variance Trade-off
Prediction errors can be split into error due to bias and error due to variance
Error due to bias is how far off the predictions are from the true values 
Error due to variance is the variability of model predictions for a given point
As we decrease model bias by increasing complexity, variance increases,
creating a trade-off as we try to minimise both
By thinking of a model with perfect predictions as the bull's-eye, we can
visualise the four scenarios of bias and variance using the below targets
(a small simulation sketch follows the labels):
Low Variance + Low Bias
Low Variance + High Bias
High Variance + High Bias
High Variance + Low Bias
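To make the decomposition concrete, one way to estimate bias and variance empirically is to refit the same model on many resampled training sets, then measure how predictions for fixed evaluation points spread (variance) and how far their average sits from the truth (bias). A rough sketch using synthetic data (all names here are illustrative):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_eval, y_eval = X[:200], y[:200]  # fixed evaluation points
X_pool, y_pool = X[200:], y[200:]  # pool to resample training sets from

rng = np.random.default_rng(0)
preds = []
for _ in range(50):  # 50 bootstrapped training sets
    idx = rng.integers(0, len(X_pool), size=len(X_pool))
    model = DecisionTreeClassifier()  # fully grown tree: low bias, high variance
    model.fit(X_pool[idx], y_pool[idx])
    preds.append(model.predict_proba(X_eval)[:, 1])

preds = np.array(preds)
variance = preds.var(axis=0).mean()  # spread of predictions across refits
bias_sq = ((preds.mean(axis=0) - y_eval) ** 2).mean()  # rough proxy: labels stand in for true probabilities
print(f'bias^2 ~ {bias_sq:.3f}, variance ~ {variance:.3f}')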
Over and Under Fitting
Over-fitting occurs when you learn too much detail from the
training data. The model doesn't generalise well, so error increases
when you apply it to new data. 
E.g. if you have one red ice cream in your training data with low
sales, you may incorrectly predict all red ice creams will have low
sales.
Under-fitting is when you don't learn enough detail, so error is high
in both your training and test sets.
Over-fitting increases as you increase complexity (e.g. add more
features, increase depth of trees), resulting in low bias, but high
variance. 
As you decrease complexity, bias increases but variance decreases.
Our job is to find the optimal level of complexity that minimises
error, and balances bias and variance.
Finding the sweet spot 
To find the sweet spot between under- and over-fitting, test different
levels of model complexity and minimise the total error.
There will always be a trade-off, so you must decide how much of an
increase in variance you will accept for a decrease in bias.
Take into consideration how similar the new data will be to the
training data.
If it is very similar, you can create a more complex model without worrying
too much about how it will generalise to slightly different data.
If there is more variation, reduce complexity to improve the stability
of the performance on new data sets.
You must also take into account the importance of 'explainability'. If you
need to be able to explain the model to business stakeholders, a
simpler model may be preferred.
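As a concrete way to search for the sweet spot, sweep a complexity parameter and compare train and test error at each level (a sketch on synthetic data; the parameter grid is illustrative):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

for depth in [1, 2, 4, 8, 16, None]:  # None = grow trees fully (most complex)
    clf = RandomForestClassifier(max_depth=depth, random_state=42)
    clf.fit(x_train, y_train)
    train_err = 1 - clf.score(x_train, y_train)
    test_err = 1 - clf.score(x_test, y_test)
    # a widening train/test gap signals over-fitting; high error on both signals under-fitting
    print(f'depth={depth}: train error {train_err:.3f}, test error {test_err:.3f}')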
Data: Train/Test Split
To test the performance of a model, split the data into a train set
and a test set. Common splits are 80/20 and 70/30.
Train the model on the training data then apply to the test data to
check if the model works on new (unseen) data (i.e. does it
generalise).
When comparing models, select the model that minimises the
prediction error in the test data.
However, we also want to minimise the performance gap between
the train and test sets (a big gap indicates over-fitting; low
performance in both indicates under-fitting).
Stratify on the target to ensure the proportion of values in each class
is the same in both the train and test set. This is to maintain the
representation of the original data.
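In sklearn this is one line (an 80/20 split, stratified on the target; X and y are assumed to hold the features and target):
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)  # stratify keeps the class mix in both sets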
Data: Cross-validation
Cross-validation helps test for over-fitting by checking how the model holds
up when trained and tested on different subsets of the data.
The most common method is k-fold cross-validation, where k is the number
of subsets to create (typically between 5 and 10). k-1 subsets are used to train
the model, which is then tested on the held-out set.
At the end, check the mean and standard deviation of the error across the folds.
If comparing models, select the model with the lowest mean error and lowest
standard deviation (i.e. minimise both bias and variance).
Again, make sure you stratify by your target to ensure the proportion in each
class remains consistent.
https://en.wikipedia.org/wiki/Cross-validation_(statistics)
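A sketch of stratified k-fold cross-validation in sklearn (model choice and scoring are illustrative; X and y as before):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # stratified folds keep class proportions
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv, scoring='accuracy')
print(f'mean accuracy {scores.mean():.3f} +/- {scores.std():.3f}')  # compare mean and spread across folds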
Performance Plots
Confusion Matrix - A cross tabulation of predicted labels and true labels, used to calculate
recall, precision, and accuracy. Objective in this example: we want to predict which
customers will accept the offer so we can minimise the cost of calling potential customers.
Recall (positive)* = 937 / (937+121) = 0.89 
We correctly predicted 89% of 'accepts'
*Also called True Positive Rate and Sensitivity
Precision (positive) = 937 / (937+212) = 0.82
Of the cases we said would accept, 82% did
Recall (negative)* = 846 / (846+212) = 0.80
* Also called True Negative Rate and Specificity
Precision (negative) = 846 / (846+121) = 0.87
Accuracy = (846+937) / (846+212+121+937) = 0.84
We correctly predicted the label for 84% of cases
Caution: Accuracy is a poor metric if you have class imbalance. If 90% of cases reject, we could be 90% accurate by just
predicting everything will reject. This doesn't help us achieve our objective of understanding which customers will accept. We
therefore have to look at other metrics such as recall and precision for the positive class to understand the prediction error.
                 Predicted 0 (reject)    Predicted 1 (accept)
Actual 0:        True Negative (846)     False Positive (212)
Actual 1:        False Negative (121)    True Positive (937)
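The same numbers can be reproduced in sklearn (assuming y_test and predicted_class from a fitted model, as in the practical at the end):
from sklearn.metrics import classification_report, confusion_matrix

# rows are true labels, columns are predicted: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, predicted_class).ravel()
print(f'recall (positive) = {tp / (tp + fn):.2f}')
print(f'precision (positive) = {tp / (tp + fp):.2f}')
print(classification_report(y_test, predicted_class))  # per-class precision/recall/F1 in one call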
Performance Plots
ROC (Receiver Operating Characteristic) - Originally developed for radar signal detection, shows the trade-off between
the true positive rate (positive class recall) and the false positive rate (1 - negative class recall) at
different probability thresholds. We want to maximise the TPR to capture as much of the positive
class as possible, while minimising the FPR which is our error or wasted effort.
Area under the curve (AUC) - Measures the area underneath the ROC curve.
0.5 (straight diagonal line) = random ranking (TPR and FPR are equal at every threshold)
1 (curve reaching the top-left corner) = perfect predictions
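A sketch of the ROC curve and AUC in sklearn (assuming y_test and probabilities as in the practical at the end):
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, thresholds = roc_curve(y_test, probabilities[:, 1])  # column 1 = positive class probability
print(f'AUC = {roc_auc_score(y_test, probabilities[:, 1]):.3f}')

plt.plot(fpr, tpr, label='model')
plt.plot([0, 1], [0, 1], linestyle='--', label='random (AUC = 0.5)')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate (recall)')
plt.legend()
plt.show()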
Performance Plots
Lift & Gain - compares the model to random selection when the data is ordered by the
positive class probability (high to low). For each 10% of the population, the left graph
(lift) shows the proportion in the positive class, and the right graph (gain) shows the
cumulative proportion of positives captured.
As hoped, the majority of cases assigned a high probability for the
positive class were in the positive class: if we call only the top 10%,
over 90% will accept the offer.
With the model we can capture > 80% of customers who will accept the
offer while calling only 50% of the total group, compared to 50% if we
called randomly.
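The inputs for both charts can be computed by hand (a sketch assuming y_test and probabilities as before):
import pandas as pd

df = pd.DataFrame({'y': y_test, 'p': probabilities[:, 1]})
df = df.sort_values('p', ascending=False).reset_index(drop=True)
df['decile'] = df.index * 10 // len(df) + 1  # decile 1 = highest-probability 10%
summary = df.groupby('decile')['y'].agg(['mean', 'sum'])
summary['lift'] = summary['mean'] / df['y'].mean()  # response rate vs the random baseline
summary['cum_gain'] = summary['sum'].cumsum() / df['y'].sum()  # cumulative share of positives captured
print(summary)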
Performance Metrics
Accuracy:
Where y_hat_i is the predicted value of the ith sample, and y_i is the true value, the
proportion of correct predictions can be expressed as:
accuracy = (1/n) * sum_i 1(y_hat_i = y_i)
Precision & Recall:
Where tp is the number of true positive predictions (correct positive), fp is the number
of false positive predictions (negative predicted as positive), and fn is the number of
false negative predictions (positive predicted as negative):
precision = tp / (tp + fp)
recall = tp / (tp + fn)
F1 Score:
Harmonic mean of precision and recall:
F1 = 2 * (precision * recall) / (precision + recall)
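All four metrics are available in sklearn.metrics (assuming y_test and predicted_class as before):
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

print(f'accuracy = {accuracy_score(y_test, predicted_class):.3f}')
print(f'precision = {precision_score(y_test, predicted_class):.3f}')  # tp / (tp + fp)
print(f'recall = {recall_score(y_test, predicted_class):.3f}')  # tp / (tp + fn)
print(f'f1 = {f1_score(y_test, predicted_class):.3f}')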
Performance Metrics
ROC_AUC: The area under the ROC curve is the probability that a randomly chosen positive
instance is ranked higher in probability than a randomly chosen negative instance. The area
under the ROC curve is calculated using the formula for the area of a trapezoid:
AUC = sum_i (FPR_(i+1) - FPR_i) * (TPR_i + TPR_(i+1)) / 2
Gini coefficient:
The Gini coefficient is the ratio of the area A between the diagonal line
(perfect equality) and the Lorenz curve (cumulative positive class
proportion) to the total area:
Gini = A / (A + B)
Since A is equal to ROC_AUC - 0.5 and A + B = 0.5, the Gini coefficient
can be derived from ROC AUC:
Gini = (AUC - 0.5) * 2
Log Loss (binary):
Where y is the true label and p = Pr(y=1) is the probability estimate, the log loss per sample is
the negative log-likelihood of the classifier given the true label:
log_loss = -(y * log(p) + (1 - y) * log(1 - p))
https://en.wikipedia.org/wiki/Gini_coefficient
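These three can also be computed directly (assuming y_test and probabilities as before; the Gini line just applies the identity above):
from sklearn.metrics import log_loss, roc_auc_score

auc = roc_auc_score(y_test, probabilities[:, 1])
gini = (auc - 0.5) * 2  # identity from the slide above
ll = log_loss(y_test, probabilities[:, 1])  # averages the per-sample negative log-likelihood
print(f'AUC {auc:.3f}, Gini {gini:.3f}, log loss {ll:.3f}')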
Metrics Summary
Accuracy (0-1) - maximise - of all predictions, the proportion correctly predicted
Recall (0-1) - maximise - of the instances actually in a class, the proportion correctly
predicted as that class (i.e. how many you pick up)
Precision (0-1) - maximise - of the instances predicted to be a class, the proportion
that were correct (i.e. 1-precision is the error or incorrect predictions)
F1 score (0-1) - maximise - harmonic mean of precision and recall (for binary
classifiers, computed for the positive class)
ROC_AUC (0-1) - maximise - area under the ROC curve
Gini (0-1) - maximise - a measure of inequality, where a high value indicates a
disproportionate amount of the positive class is represented in the cases with a high
probability (good!)
Log loss (0-∞) - minimise - log loss increases as the predicted probability diverges
from the actual label (penalises the model based on how sure it was)
Python Practical
To calculate the performance metrics and create the plots discussed, all you
need is the probabilities for each class, the predicted class (assign a threshold to
the probabilities), and the actual outcomes.
If you are using an sklearn algorithm, these can be easily obtained after you
have fitted the model, using the predict and predict_proba methods:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(x_train, y_train)  # fit on the training split first
predicted_class = clf.predict(x_test)  # 0/1 labels at the default 0.5 threshold
probabilities = clf.predict_proba(x_test)  # one column of probabilities per class
A range of performance metrics are available in the sklearn.metrics module: 
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
