Introduction: What is Model Monitoring?
Chapter 1: Measuring Model Performance
Chapter 2: Feature Quality Monitoring
Chapter 3: Data Drift Monitoring
Chapter 4: Unstructured Model Monitoring
Chapter 5: Class Imbalance in Monitoring
Conclusion
If you work with machine learning (ML) models, you probably know that model monitoring is crucial
throughout the ML lifecycle — and you might even have a sneaky feeling that you should be doing
more about it. But it’s not always easy to know what you should be monitoring and how.
In this guide to ML model performance, we’ll discuss:
• How ML model performance is measured for the most common types of ML models
• How to catch data pipeline errors before they affect your models, as part of feature quality
monitoring
• How to detect when real-world data distribution has shifted, causing your models to degrade
in performance, as part of data drift monitoring
• How to monitor unstructured Natural Language Processing and Computer Vision models, as
part of unstructured model monitoring
• How to address the common problem of class imbalance — when you see a very low
percentage of a certain type of model prediction (such as fraudulent transactions) — and how
to account for drift in this minority class
Whether you’re just beginning your ML journey with off-the-shelf models or have a large data
science team, how you monitor your models is just as important as how you develop and train them.
Introduction: What is Model Monitoring?
We’re all familiar with the concept of monitoring in software. ML model monitoring has the same goals: to detect problems proactively and help teams resolve them. Ideally, model monitoring is integrated into your overall ML workflow, regardless of your MLOps stack.
ML model monitoring isn’t like traditional software monitoring
Software monitoring, also called APM or application performance management for DevOps, is a good
parallel to ML model monitoring in terms of its objectives. And, just like regular software, you should
absolutely monitor the service metrics for your models—such as traffic, latency, and error codes.
However, monitoring ML is very different from monitoring software. Software uses hard-coded
logic to return results. On the other hand, models use a sophisticated form of statistical analysis
on their inputs, making judgments about new data based on historical data. If software performs
badly, you know there’s probably a bug somewhere in the logic. If models start to misbehave, it’s
not as clear why:
• Maybe the historical data the model was trained on no longer reflects reality (e.g. models
trained on data prior to COVID-19, suddenly misbehaving after COVID-19)
◦ In this case, you have to decide whether you need to retrain your model completely or whether it’s just a passing trend
• Maybe an upstream data pipeline broke, causing your model to see unusual values in
production, skewing its predictions
• Maybe an adversary is sending your model anomalous inputs to try to alter its behavior
• Maybe your model was biased against certain classes and you need to collect broader training
data
Why model monitoring matters
Production ML models don’t generally fail by crashing or throwing errors in the same way as
software; instead, they tend to silently degrade over time. Without continuous ML monitoring, your
business is in the dark when a model starts to make inaccurate predictions, which can have major
business impacts.
Uncaught issues can affect your end-users and your bottom line, and you can even end up in
violation of AI regulations or model compliance, which increasingly require teams to explain why
models gave certain predictions. This is impossible without ML monitoring.
Chapter 1: Measuring Model Performance
The first step to monitoring model drift is to understand how model performance is measured. After all, model performance metrics tell us how well a model is doing.
ML model performance is typically evaluated on a set of “ground truth” data. Whether this comes from live feedback from users or an annotated dataset, ground truth data reflects the “real world” that you’re trying to model. One of the reasons it’s hard to monitor models is that the real world is always changing: a model that performs well today might not perform well tomorrow.
Model performance is measured differently based on the type of model.
• Classification models label inputs as belonging to one or more categories. An example is a
model that classifies loans as high-risk, moderate-risk, or low-risk.
◦ Binary classification models detect just one kind of input (e.g. flagging fraudulent transactions).
◦ Multi-class classification models sort inputs into one of several buckets.
• Regression models estimate the statistical relationship between inputs and a continuous output. A common application is forecasting trends for analytics.
• Ranking models take a list of inputs and return a ranking. They are often used for
recommendation systems. A common example is ranking search results.
Classification models
To understand metrics for classification model performance, you’ll first need some basic terminology. Let’s look at a type of table called a confusion matrix:
[Figure: a 2×2 confusion matrix comparing predicted labels against actual labels]
Since this is a binary classifier, there are only two options, represented by 0 or 1. When the model
gets its classification right, it returns a true negative or positive. When the model gets it wrong, it
returns a false negative or positive.
With that background, we’re ready to move on to the most important metrics for classification models. While we will mostly focus on binary classifiers, these metrics have extensions for multi-class models as well. With more than two classes, you’ll have to decide whether you want to apply weights to per-class metrics (i.e. because you care more about certain labels).
Accuracy
Accuracy captures the fraction of predictions your model got right:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Beware that accuracy can be deceiving when you’re working with an imbalanced dataset (we’ll get
to this later). For instance, if the majority of your inputs are positive, you could just classify all as
positive and score well on your accuracy metric.
Note that accuracy extends easily to multi-class models: simply sum the correct predictions for each class (the diagonal of the confusion matrix) and divide by the total number of inputs.
True positive rate (recall)
The true positive rate, also known as recall or sensitivity, is the fraction of positives that your model classified correctly, out of all the positives:
Recall = TP / (TP + FN)
Recall is a metric you really care about if you want coverage. In other words, you don’t want to
miss inputs that your model should have counted. For a use case like fraud protection, recall would
likely be one of your most important metrics.
False positive rate
The false positive rate shows how often your model classifies something as positive when it was really a negative (these are also called “Type I errors”):
FPR = FP / (FP + TN)
Thinking about a fraud detection model, you likely care more about recall than false positives,
but at a certain point, you need to balance that or risk annoying your customers and hurting your
team’s productivity due to false alarms.
Precision
Precision, also called “positive predictive value,” measures how many of the inputs your model classified as positive were in fact positive:
Precision = TP / (TP + FP)
This metric is often used in conjunction with recall. If you increase recall, you’re likely to decrease
precision, because you’ll make your model less choosy. And, in turn, this will increase your false
positive rate!
F1 score
The F1 score is the harmonic mean of precision and recall. Mathematically, this combines precision and recall into one metric that weights them both equally:
F1 = 2 · (Precision · Recall) / (Precision + Recall)
The F1 score is commonly used as an overall metric for model quality, since it captures both
wanting to have good coverage and wanting to not make too many errors.
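As a concrete illustration, all of the metrics above can be computed directly from the four confusion-matrix counts. This is a minimal sketch in Python; the function name and the counts are hypothetical, not from any particular library:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the confusion-matrix metrics described above."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)          # true positive rate
    fpr = fp / (fp + tn)             # false positive rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall, "fpr": fpr,
            "precision": precision, "f1": f1}

# Hypothetical counts from a fraud model's confusion matrix
print(classification_metrics(tp=80, fp=40, tn=860, fn=20))
```

Note how a model with high recall (0.8 here) can still have modest precision, which is the trade-off discussed above.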
Log loss
Log loss measures the errors a classification model makes (ideally, it’s 0). Unlike metrics such as accuracy, log loss accounts for the model’s uncertainty in its predictions. For example, if the model predicts a true positive with 60% confidence, it is penalized more than when predicting a true positive with 90% confidence, even though both predictions were correct:
Log loss = −(1/N) · Σ [ yᵢ · log(pᵢ) + (1 − yᵢ) · log(1 − pᵢ) ]
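Log loss can be sketched in a few lines; the probabilities below are hypothetical, chosen to mirror the 60% vs. 90% confidence example above:

```python
import math

def log_loss(y_true, y_prob):
    """Binary log loss: penalizes confident wrong predictions most heavily."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob)) / n

# Two correct positive predictions: the 60%-confidence one incurs more loss
print(log_loss([1], [0.6]), log_loss([1], [0.9]))
```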
Area under curve (AUC)
Area under curve is considered a key metric for classification models because it lets you visualize how your model performs across classification thresholds. The underlying ROC curve plots the model’s true positive rate (TPR) against its false positive rate (FPR).
If the model has a high area under the curve, it achieves high recall at a low false positive rate, which is ideal.
Regression models
Regression models have long been used in statistical analysis, and there are many different methods for measuring their quality. Without getting into the mathematics in depth, we’ll cover a few metrics briefly.
Mean squared error (MSE)
Mean squared error measures the average squared amount by which the model deviates from the observed data (the ideal, but unrealistic, value is 0). Because it squares each term, MSE heavily weights outliers.
MSE = (1/n) · Σ (yᵢ − ŷᵢ)²
Mean absolute error (MAE)
Mean absolute error is simply the average vertical distance between each data point and the regression line. It’s “absolute” because it considers all distances to be positive, avoiding the issue of positive and negative errors canceling out.
MAE = (1/n) · Σ |yᵢ − ŷᵢ|
Mean absolute percentage error (MAPE)
MAPE is a very common metric in statistics. You can think of it as MAE represented as a percentage, which makes it more intuitive to reason about. However, because MAPE divides each error by the actual value, it can blow up when some of the actual values are zero or near zero.
MAPE = (100/n) · Σ |yᵢ − ŷᵢ| / |yᵢ|
Weighted mean absolute percentage error (WMAPE)
WMAPE weights the percentage errors by the actual values. This can be useful for something like modeling customer orders, where on some days orders might be low or nonexistent. It can also be used when you want your model to care more about certain days or types of products.
WMAPE = 100 · Σ |yᵢ − ŷᵢ| / Σ |yᵢ|
Coefficient of determination (R-squared)
R-squared measures, statistically, how well a regression model fits the real data; an R-squared of 1 indicates a perfect fit. Essentially, it compares the model’s squared errors against the total variance of the data, measuring the fraction of variance the model explains.
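The regression metrics above can be sketched in plain Python; the values are hypothetical, and a real pipeline would typically use a library implementation instead:

```python
def regression_metrics(y_true, y_pred):
    """MSE, MAE, MAPE, and WMAPE as described above (pure-Python sketch)."""
    n = len(y_true)
    mse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / n
    mae = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / n
    # MAPE divides by each actual value, so it is undefined when one is 0
    mape = 100 * sum(abs(a - p) / abs(a) for a, p in zip(y_true, y_pred)) / n
    # WMAPE weights errors by the actual values instead
    wmape = 100 * sum(abs(a - p) for a, p in zip(y_true, y_pred)) / sum(abs(a) for a in y_true)
    return {"mse": mse, "mae": mae, "mape": mape, "wmape": wmape}

# Hypothetical daily order counts: actuals vs. forecasts
print(regression_metrics([100, 200, 50], [110, 190, 40]))
```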
Ranking models
How do you measure the quality of a recommendation system? For ranking models, it’s common to use metrics like precision, recall, and F-score, computed at a certain cut-off rank. E.g. “precision at 1” would find the precision for the first result; “F-score at k” would find the F-score assuming you stop after k results. Here are a few additional metrics that are important for monitoring ranking models:
Mean average precision (MAP)
The mean average precision for ranking models sums the average precision from 1…k (your cut-off rank) and divides it by the total number of queries you’re looking at. So, if you got the ranking perfect every time, you would have a MAP of 1. To calculate the average precision, it’s common to apply a relevance weight so that you can penalize misses in the higher positions of the ranking.
Normalized discounted cumulative gain (NDCG)
Normalized discounted cumulative gain (NDCG) is commonly used for recommendation systems. It’s based on the concept of cumulative gain (CG), a way of measuring the total relevance of all the results in a ranking. By itself, CG doesn’t factor in the importance of a more relevant result appearing higher in the ranking. So, a discount is added that penalizes rankings where highly relevant documents appear lower in the results: the relevance value of each result is reduced logarithmically in proportion to its position. When normalized across queries, we get NDCG, which returns a result between 0 and 1 to grade the quality of a ranking.
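As an illustration, NDCG for a single query can be sketched as follows; the relevance scores are hypothetical:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each relevance is discounted by
    log2 of its (1-indexed) position + 1, so later results count less."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize DCG by the DCG of the ideal (descending-sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A hypothetical ranking where the most relevant item (relevance 3)
# appears second instead of first, so NDCG is below 1
print(ndcg([2, 3, 0, 1]))
```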
Chapter 2: Feature Quality Monitoring
The reality of data is that it isn’t always delivered on time, completely, or accurately. Furthermore, streaming and historical data must be transformed to turn them into features that models can use. This involves spinning up complex, dynamic data pipelines that are prone to bugs and breakage.
Meanwhile, models continue to make use of more and more third-party data sources, spurred by the democratization of data and the enterprise adoption of the modern data stack. As a result, it’s harder for enterprises to keep an eye on the quality of model features.
When a data error happens, the model won’t crash and throw an error, since that would ruin the user experience or put the business at risk. Instead, the model keeps serving predictions for malformed inputs, often without anyone noticing that the predictions have degraded. Without additional model monitoring, these errors tend to go undetected and eat away at the model’s performance over time.
Types of feature quality issues
ML models in production face three types of data integrity problems: missing values, range violations, and type mismatches.
Missing values
Missing values occur when a feature input is null or unavailable at inference time. Even when a feature allows missing values, a model can see far more of them at inference than it did in the training set.
Examples:
• A previously optional field is now sending a null value to the model
• A third-party data source stopped providing a certain column in their dataset
• The model is missing an entire chunk of data (e.g. customer details) due to a bug
Range violation
A range violation occurs when a feature input is outside its expected bounds or takes a known-bad value. It is quite common for categorical inputs with typos or cardinality mismatches to cause this problem, e.g. free-form text entry for categories or numerical fields like age.
Examples:
• An invalid ZIP code
• An unknown product SKU
• The naming of categories changed because of a pipeline migration
Type mismatch
A type mismatch happens when the model expects data of a certain type (e.g. an integer) based on its training data, but is provided data of a different type (e.g. a string) at inference time. While it might seem surprising that type mismatches occur often, it’s common for columns to get out of order or for the data schema to change when data undergoes pipeline transformations.
Examples:
• Providing a floating-point number when the model expects an integer
• A data object changed (e.g. CustomerReport became CustomerReports) and is no longer valid
Detecting feature quality issues
Data should be consistent and free of issues throughout the model’s lifecycle, and monitoring is key to ensuring this. An early warning system should be in place to catch and address feature quality issues immediately.
Data observability and data quality tools now exist as part of the MLOps stack. However, a thorough approach also means having checks on the models themselves to catch issues at runtime. As your feature set grows, these missing-value, type, and range checks become tedious to add to production code by hand, and we recommend using model monitoring tooling that has them built in.
A good starting point is to use your training data to form your “rules”. Set up a job that regularly checks the data your model sees against the training data, flagging any mismatches. Also keep an eye out for outliers, which can point to a feature quality issue.
But beware of mitigating feature quality issues by substituting a fallback value when the data is missing or out of range. While this might seem like a reasonable way to keep your model from failing, it can shift your model’s expected feature distribution over time, causing model drift.
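A minimal sketch of rule-based runtime checks for all three issue types, deriving per-feature type and range rules from training data; the function names and the toy schema are hypothetical, and real monitoring tools track far more (cardinality, value frequencies, outliers):

```python
def build_rules(train_rows):
    """Derive simple per-feature rules (type and numeric range) from training data."""
    rules = {}
    for name in train_rows[0]:
        values = [r[name] for r in train_rows if r[name] is not None]
        numeric = isinstance(values[0], (int, float))
        rules[name] = {
            "type": type(values[0]),
            "min": min(values) if numeric else None,
            "max": max(values) if numeric else None,
        }
    return rules

def check_row(row, rules):
    """Flag missing values, type mismatches, and range violations at inference time."""
    issues = []
    for name, rule in rules.items():
        v = row.get(name)
        if v is None:
            issues.append((name, "missing value"))
        elif not isinstance(v, rule["type"]):
            issues.append((name, "type mismatch"))
        elif rule["min"] is not None and not (rule["min"] <= v <= rule["max"]):
            issues.append((name, "range violation"))
    return issues

train = [{"age": 25, "zip": "94103"}, {"age": 61, "zip": "10001"}]
rules = build_rules(train)
print(check_row({"age": 130, "zip": None}, rules))
```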
Chapter 3: Data Drift Monitoring
Data drift occurs when the data the model sees in production no longer resembles the historical data it was trained on. The real world is constantly changing, so some drift is inevitable. It can be abrupt, like the shopping behavior changes brought on by the COVID-19 pandemic. It can be gradual, like shopping preferences shifting with cultural trends. Or it can be cyclical, like shopping behavior around the holidays and in different seasons.
Regardless of how the drift occurs, it’s critical to identify these shifts quickly to maintain model accuracy and reduce business impact. Sometimes it’s hard to measure your model’s performance directly using labeled ground truth data; for a model that approves credit loans, an actual default event might happen after months or years (or never). In these cases, data drift is a good proxy metric: if your data is drifting, it’s likely that your model’s performance is declining, even if you don’t see the evidence yet.
Furthermore, monitoring data drift keeps you informed about how your data is changing. This is important for iterating on models and discovering new features, and can help drive business decisions even outside of ML.
Types of drift
There are three kinds of drift that you should pay attention to for your models, keeping in mind that
models may exhibit multiple types of drift at once.
Concept drift
[Figure: training data with a decision boundary, modeling P(Y|X), the probability of output Y given input X. Under concept drift, P(Y|X) changes because reality or behavior changes: the relationships change, not the input. Concept drift occurs when actual behavior changes.]
Concept drift happens when there’s a change in the underlying relationships between features and
outcomes: the probability of Y output given X input or P(Y|X).
Example: There’s a recession, and loan applications are riskier to issue in general even though the
applicants’ characteristics like income and credit scores haven’t changed.
Feature drift
Feature drift occurs when there are changes in the distribution of a model’s inputs or P(X).
Example: A loan application model suddenly starts to see more data points from younger people.
Label drift
When label drift happens, there’s been a change in a model’s output distribution or P(Y).
Example: During a human-in-the-loop process that creates labeled data for the model, a new pool
of human reviewers tends to deny more loan applications.
[Figure: label drift and feature drift occur when the data distribution changes while the fundamental relationships do not. Under feature drift the input data shifts and P(X) changes; under label drift the output data shifts and P(Y) changes.]
Detecting data drift
To detect data drift, it’s important to track a metric that lets you know when the gap between the model’s assumptions and the real data distribution becomes significant. In this section, we’ll explain several common metrics used to calculate drift.
KL divergence
Kullback–Leibler (KL) divergence is a popular drift metric for ML models. It measures the statistical difference between two probability distributions: in this case, between the model’s assumptions Q and the actual distribution P. When Q and P are close, the KL divergence is small. KL divergence is the sum, over all events, of the probability of the event in P multiplied by the log of the probability of the event in P over its probability in Q:
D_KL(P ‖ Q) = Σ P(x) · log( P(x) / Q(x) )
Intuitively, you can think of KL as reporting a large drift if there’s a high probability of something happening in P but a low probability for the same event in Q. When things are reversed (a low probability in P and a high probability in Q), a large drift will also be reported, but not as large. KL divergence is not symmetric: the divergence of P given Q is not the same as the divergence of Q given P.
Another way to think about KL divergence, from an information theory perspective, is that it encodes the number of additional bits (with log base 2) you would need to represent an event drawn from P when you assume a distribution of Q. KL divergence is also called “relative entropy.”
JS divergence
Jensen–Shannon (JS) divergence builds upon KL divergence to address the issue of asymmetry,
which can be somewhat unintuitive to reason about. With JS divergence, the divergence of P given
Q will be the same as the divergence of Q given P. This is achieved by normalizing and smoothing
the KL divergence.
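Both divergences can be sketched over pre-binned histograms; the baseline and production distributions below are hypothetical:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; p and q are aligned probability histograms."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric, smoothed variant: average the KL of p and q
    against their mixture m, so JS(p, q) == JS(q, p)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

baseline = [0.1, 0.4, 0.5]
production = [0.3, 0.4, 0.3]
print(kl_divergence(baseline, production), js_divergence(baseline, production))
```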
Population stability index (PSI)
PSI is a counting-based method of calculating drift. It divides the distribution into bins and counts the number of expected vs. actual inputs in those bins, estimating the two distribution curves and how they differ:
PSI = Σ (ActualProp(b) − ExpectedProp(b)) · ln( ActualProp(b) / ExpectedProp(b) ), summed over bins b = 1…B
Where:
• B is the total number of bins
• ActualProp(b) is the proportion of counts within bin b from the target distribution
• ExpectedProp(b) is the proportion of counts within bin b from the reference distribution
PSI ranges from zero to infinity and has a value of zero when the two distributions exactly match. To avoid divide-by-zero problems, the bins are often given a starting count of 1.
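PSI can be sketched over pre-binned counts, applying the starting count of 1 to every bin; the counts below are hypothetical:

```python
import math

def psi(expected_counts, actual_counts):
    """Population stability index over pre-binned counts.
    Each bin starts at 1 to avoid divide-by-zero."""
    exp = [c + 1 for c in expected_counts]
    act = [c + 1 for c in actual_counts]
    exp_prop = [c / sum(exp) for c in exp]
    act_prop = [c / sum(act) for c in act]
    return sum((a - e) * math.log(a / e) for a, e in zip(act_prop, exp_prop))

# Identical distributions give PSI = 0; a shifted one gives a positive score
print(psi([100, 200, 100], [100, 200, 100]))
print(psi([100, 200, 100], [250, 100, 50]))
```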
Root-causing data drift
When drift happens, it might be due to feature quality issues or changes in the data distribution
because of external factors. It’s not only important to monitor for drift, but also to determine why
data is drifting in order to understand what type of action to take.
To root-cause data drift, ML teams need to understand the relative impact of each feature on the model’s predictions. Explainable AI tools can help by letting you drill down into the features responsible for the drift, deriving per-prediction explanations of each feature’s contribution.
Feature quality issues can often be addressed by fixing a bug or updating a pipeline, while concept
drift may be out of your control, requiring you to retrain your model on new data.
Chapter 4: Unstructured Model Monitoring
A growing number of organizations build natural language processing (NLP) and computer vision (CV) models on unstructured data, including text and images. These types of models support product and service innovation, as well as simplifying business operations.
However, common performance and drift monitoring methods, such as estimating univariate distributions using binned histograms and applying standard distributional distance metrics (e.g. Jensen–Shannon divergence), are not directly applicable to the high-dimensional vectors that represent unstructured data. Binning in high-dimensional spaces is a challenging problem whose complexity grows exponentially with the number of dimensions. Furthermore, organizations gain more actionable insight and context by detecting distributional shifts of high-dimensional vectors as a whole, rather than marginal shifts in individual vector elements.
To monitor drift in unstructured models precisely, ML teams can use a cluster-based drift detection method for high-dimensional vectors. With such a method, ML teams detect regions of high density (clusters) in the data space and track how the relative density of those regions changes at production time.
In a cluster-based drift detection method, bins are defined as regions of high density in the data space. The density-based bins are detected automatically using standard clustering algorithms such as K-means. Once histogram bins are obtained for both baseline and production data, any of the distributional distance metrics can be applied to measure the discrepancy between the two histograms.
Figure 1 shows an example where the vector data points are 2-dimensional. Comparing the baseline data (left plot) with the example production data (right plot), there is a shift in the data distribution: more data points are located around the center of the plot. Note that in practice the vector dimensions are usually much larger than 2, making such a visual diagnosis impossible.
[Figure 1: baseline (left) vs. production (right) data distributions]
The first step of the clustering-based drift detection algorithm is to detect regions of high density (data clusters) in the baseline data. This is achieved by taking all the baseline vectors and partitioning them into a fixed number of clusters using a variant of the K-means clustering algorithm. Figure 2 shows the output of the clustering step (K=3) applied to the baseline data, where data points are colored by their cluster assignments. After the baseline data are partitioned into clusters, the relative frequency of data points in each cluster (i.e. the relative cluster size) gives the size of the corresponding histogram bin. As a result, we have a 1-dimensional binned histogram of the high-dimensional baseline data.
[Figure 2: clustering-based binning of the baseline data (K=3)]
The goal of the clustering-based drift detection algorithm is to monitor for shifts in the data
distribution by tracking how the relative data density changes over time in different partitions
(clusters) of the space. Therefore, the number of clusters can be interpreted as the resolution
by which drift monitoring will be performed; the higher the number of clusters, the higher the
sensitivity to data drift.
After running K-means clustering on the baseline data with a given number of clusters K, the K cluster centroids are obtained. These cluster centroids are then used to generate the binned histogram of the production data: keeping the centroids detected from the baseline data fixed, each incoming data point is assigned to the bin whose cluster centroid is closest to it.
By applying this procedure to the example production data shown in Figure 1 and normalizing the
bins, we can create the following cluster frequency histogram for the production data, as shown in
Figure 3.
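The whole clustering-based procedure can be sketched with scikit-learn and SciPy. The synthetic vectors, the cluster count of 3, and the injected shift are illustrative assumptions, not values from the text:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=(500, 8))    # training-time vectors
production = rng.normal(loc=0.5, scale=1.0, size=(500, 8))  # shifted vectors

# Step 1: detect density-based bins (clusters) on the baseline data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(baseline)

# Step 2: build normalized histograms; production points are assigned
# to the nearest baseline centroid
base_bins = np.bincount(km.labels_, minlength=3) / len(baseline)
prod_bins = np.bincount(km.predict(production), minlength=3) / len(production)

# Step 3: compare the two binned histograms with a distance measure
drift = jensenshannon(base_bins, prod_bins)
print(f"JS drift metric: {drift:.3f}")
```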
[Figure 3: clustering-based binning of the production data]
Using a conventional distance measure like JS divergence between the baseline and production histograms gives us a final drift metric, as shown in Figure 4. This drift metric helps identify any changes in the relative density of cluster partitions over time. Similar to univariate tabular data, users can be alerted when there is a significant shift in the data the model sees in production.
[Figure 4: comparing the two binned histograms to produce a single drift metric]
With the growing prevalence of unstructured ML models across all industries, having a monitoring solution in place is more important than ever to ensure proper machine learning performance.
Chapter 5: Class Imbalance in Monitoring
It’s common for ML models to be used to detect rare cases that don’t actually occur that frequently in real-world data. Consider the following examples:
• In advertising, fewer than 1% of users might click on an ad
• Fraud might occur for <0.1% of transactions
• For an e-commerce store with a wide catalog, you might want to recommend a certain niche product to a very small percentage of users
When one type of prediction is very rare, we say there are “imbalanced classes”. Handling
imbalanced classes is hard enough when training a model, as you have to collect enough data
samples for the model to learn about these rare cases. It poses an even greater problem for
monitoring ML models. For example, detecting data drift in an imbalanced class is like finding a
needle in a haystack. Consider this example where the model was trained on data with a small
amount of fraud and suddenly starts to see 20x the amount of fraud in production.
The model is likely to perform poorly and undercount fraudulent transactions since the fraud
distribution has drifted so dramatically. Ideally, our data drift monitoring would pick this up and
alert us right away.
However, drift metrics will have trouble detecting this change. They are designed to look at
the distribution as a whole, and overall the distribution hasn’t changed much, since fraudulent
transactions make up such a tiny fraction of predictions.
Monitoring strategies for class imbalance
What can we do to solve the class imbalance problem in model monitoring? We have a few options
to make sure that we can monitor the minority classes we care about.
Segmentation
One approach is to segment our drift monitoring. There are two ways to do segmentation: we can segment either on the model’s prediction score or on the ground truth labels. Segmentation lets us create buckets for different classes and examine our minority class individually. However, this comes at the expense of having an overall drift metric for the model.
Weighting model predictions
A simpler and more popular approach than segmentation is to weight the data — essentially, make
the minority class count more and correct the imbalance. Given the label for the events, we would
multiply each event by an associated scale factor.
Applying weights gives ML teams a way to improve operational visibility for advanced use cases involving class imbalance. Returning to the previous example, after applying weights the spike in fraud is now accurately reflected in our drift metrics.
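A minimal sketch of this weighting idea, with hypothetical labels and scale factors; the distribution produced here is what a drift metric such as PSI or JS divergence would then consume:

```python
from collections import Counter

def weighted_class_distribution(labels, weights):
    """Up-weight minority-class events before computing the class
    distribution, so drift in a rare class is not drowned out."""
    totals = Counter()
    for label in labels:
        totals[label] += weights[label]
    total = sum(totals.values())
    return {label: w / total for label, w in totals.items()}

# 2 fraud events among 1,000; fraud is up-weighted 100x so a spike
# in fraud moves the distribution noticeably
labels = ["fraud"] * 2 + ["legit"] * 998
print(weighted_class_distribution(labels, {"fraud": 100.0, "legit": 1.0}))
```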
Conclusion
Monitoring is an essential requirement across the ML lifecycle, from development and training to production deployment. Like any piece of software that your business depends on, models can fail in production; but unlike other kinds of software, models fail silently, still generating predictions but making them based on broken assumptions.
In this guide, you’ve learned:
• How model performance is measured, including key metrics for classification, regression, and ranking models
• What can go wrong with feature quality and best practices for monitoring your data integrity
• How to monitor for data drift and keep your models from degrading in production
• How to monitor unstructured models, such as text and images with high-dimensional vectors
• What to be aware of when working with class imbalance use cases
We’ve also touched on what to do when you detect a problem and how to root-cause issues with your models. Now you’re ready to keep your models consistent, accurate, and reliable through monitoring!
Fiddler is a pioneer in Model Performance Management for responsible AI. The unified
environment provides a common language, centralized controls, and actionable
insights to operationalize ML/AI with trust. Model monitoring, explainable AI, analytics,
and fairness capabilities address the unique challenges of building stable and
secure MLOps systems in-house at scale.
Unlike observability solutions, Fiddler integrates deep XAI and analytics to help you
grow into advanced capabilities over time and build a framework for responsible
AI practices. Fortune 500 organizations use Fiddler across training and production
models to accelerate AI time-to-value and scale, build trusted AI solutions, and
increase revenue by improving predictions with context to business value.
fiddler.ai sales@fiddler.ai