Model validation techniques in machine learning.pdf

1/23
Model validation techniques in machine learning
leewayhertz.com/model-validation-in-machine-learning
In the rapidly evolving digital landscape, the power of data and machine learning has
transformed the way we approach complex problems and make critical decisions. From
healthcare and finance to marketing and autonomous vehicles, machine learning models
have become indispensable tools for extracting insights from vast datasets and predicting
outcomes with remarkable accuracy. Every decision, every strategic move, and every
prediction made by businesses today hinges on the strength and reliability of these models.
But how can we ensure these models are reliable? The answer lies in a critical, often
underrated step in model development: validation. Once a model has been built, the
subsequent step is not immediate implementation but validation, making it an indispensable
part of the model development process. While it may seem counterintuitive, the time invested
in model validation often surpasses the time spent on the model’s development. Why? A
model’s efficacy, robustness, and accuracy can only be ascertained through thorough and
rigorous validation.
In today’s high-stakes business landscape, an unvalidated or poorly validated model can
lead to imprecise predictions, ineffective strategies, and business failures. As businesses
lean heavily on data-driven decisions, it’s not an exaggeration to say that a company’s
success may very well hinge on the strength of its model validation techniques.

2/23
These validation methods test a model’s accuracy and evaluate its adaptability to different
scenarios, resilience under stress, and applicability to real-world situations. These aspects
collectively ensure that the model is accurate in its predictions, versatile and robust, ready to
drive your business success in a fast-paced, dynamic marketplace. Given the crucial role of
model validation, investing significant time and resources in this phase of the model
development process is not just a recommended best practice; it’s a business imperative.
Hence, understanding and implementing various model validation techniques are key to
ensuring business success.
This article will delve into different model validation techniques in machine learning,
illustrating their significance and underscoring their indispensable role in shaping successful
business strategies.
What is model validation in machine learning?
Why is model validation important?
Various machine learning models and their validation necessities
The cross-validation technique
Different types of model validation techniques
How to validate machine learning models using TensorFlow Model Analysis (TFMA)?
Challenges in ML model validation
What is model validation in machine learning?
Model validation in machine learning represents an indispensable step in the development of
AI models. It involves verifying the efficacy of an AI model by assessing its performance
against certain predefined standards. This process does not merely involve feeding data to a
model, training it, and deploying it. Rather, model validation in machine learning necessitates
a rigorous check on the model’s results to ascertain it aligns with our expectations.
Numerous model validation techniques are available, each designed to evaluate and validate
a model according to its distinct characteristics and behaviors.
In the realm of machine learning, the quality and quantity of data, as well as the ability to
manipulate it effectively, are crucial. More often than not, this involves gathering data,
cleaning it, preprocessing it, applying the appropriate algorithm, and finally determining the
most suitable model. However, the process doesn’t end there. Validating a model is as
critical as training it.
Deploying a model without validation is not a viable approach, especially in sensitive sectors
such as healthcare, where the stakes are considerably high. In scenarios where real-world
predictions have life-altering implications, the margin for error in a model is virtually non-
existent. Therefore, an uncompromising model validation process is crucial to ensuring AI
systems’ reliability and accuracy.

3/23
Why is model validation important?
Model validation is essential in creating any machine learning or artificial intelligence system
as it assures the model’s functionality and ability at the optimum level to manage unseen
data. Without it, confidence in the model’s capacity to accurately interpret new data is
uncertain. Moreover, validation aids in identifying the most suitable model, parameters, and
accuracy measures for a particular task.
Through proper model validation, potential issues can be detected and rectified early on. It
also facilitates the comparison of different models, enabling us to select the most fitting one
for our requirements. Further, it helps establish the model’s accuracy when dealing with new
data.
Notably, model validation is carried out impartially, often by a separate team or third party, to
guarantee that the model complies with the required regulations and standards. This builds
user trust in the model, reassuring them of its dependability.
Here are some of the consequences of improper model validation:
Inadequate model performance
If model validation isn’t done correctly, the model might not perform optimally when exposed
to unseen data, which is, after all, the ultimate goal of any predictive model. Various
validation techniques exist, with the most critical ones being in-time and out-of-time
validations.
In-time validation involves setting aside a portion of the development dataset. The model is
then tested against this data to assess how it performs on unseen data from the same time
frame used for its creation.
In contrast, out-of-time validation entails obtaining a dataset from a different time period and
testing the model’s response to this new, unseen data.
Both validation methods ensure developers’ confidence in their model’s performance.
Without them, the model might not be able to deliver accurate results consistently, leading to
potential problems down the line.
Questionable robustness
The effectiveness of a machine learning model hinges on its proper validation. When a
model undergoes thorough validation, developers gain confidence in its performance and
capability to act robustly in future scenarios. It ensures the model’s robustness and reliability,
which leads to trustworthy outcomes.

4/23
During the validation process, sensitivity analysis tests are performed. These tests involve
modifying the independent model variables to a certain degree to account for potential
fluctuations, such as economic changes. The goal is to ensure that these changes don’t
cause a severe impact on the dependent variable that could render the model ineffective.
Without this critical step, the model’s results could become dubious. In other words, the
model might not handle future variables well and produce questionable or even incorrect
results, which could undermine the model’s value in real-world applications.
Inability to handle stress scenarios
Machine learning models need to be resilient and adaptable, especially in extreme scenarios
such as economic recessions or global health crises like a pandemic. If a model isn’t
validated properly, it may struggle to adapt and deliver reliable predictions amidst such
volatility.
This is where stress testing measures during model validation play a crucial role. By
incorporating these tests, the model’s performance under strenuous conditions is evaluated,
and its ability to withstand unexpected shocks is strengthened.
If the validation process encompasses these stress-testing measures, it aids in deploying a
model version that is already been rigorously tested for turbulent scenarios. Consequently,
the model is less likely to fail when encountering real-world adversities, thus ensuring it
continues delivering valuable insights during crucial times.
Untrustworthy model outputs
Without proper model validation, particularly ‘out of time’ validation tests, there’s a risk that a
machine learning model may be overfitted. An overfitted model performs exceptionally well
on the data it was trained on, the development sample, but fails to generalize and perform
well on unseen data. This discrepancy can lead to unreliable and untrustworthy outputs
when the model is applied in the real world scenarios. To circumvent this issue, it’s crucial to
ensure thorough validation is carried out on your machine learning models. This way, you
can ensure that your model’s predictions remain reliable and accurate when presented with
new data.
Various machine learning models and their validation necessities
Supervised learning models
Supervised learning models hold a crucial place in machine learning, being extensively used
for predictive analysis based on data processing.

5/23
This category encompasses several model types, such as linear regression, logistic
regression, support vector machines, decision trees and random forests, and artificial neural
networks, each having unique validation needs.
Models like linear and logistic regression necessitate scrutiny for potential overfitting and
underfitting situations. Support vector machines, decision trees and random forests all
demand a split approach where the data is divided into training and test sets. The model
learns from the training set, while the test set aids in assessing its performance.
Incorporating a separate validation set is essential when dealing with artificial neural
networks. This validation set allows evaluating and comparing different models’ performance,
ensuring the most accurate model is utilized.
Unsupervised learning models
Models falling under the unsupervised learning umbrella are distinguished by their ability to
discern patterns in data devoid of external labels. These encompass varied models like
clustering, anomaly detection, neural networks, and self-organizing maps.
To validate these models, different criteria are used based on the specific model type. For
example, clustering models necessitate performance assessment measures like the
silhouette coefficient or Davies-Bouldin Index.
Anomaly detection models, on the other hand, typically rely on precision-recall curves and
ROC curves to gauge performance. Neural networks can be examined using hold-out
validation and k-fold cross-validation methods. Lastly, self-organizing maps need to be
checked using metrics like topographic or quantization errors.
Hybrid models
Hybrid models in machine learning bring together multiple techniques to optimize predictive
performance. Validating these models is crucial, as the fusion of various models can
enhance both accuracy and efficiency.
Confirming the reliability and consistency of hybrid models is another integral part of
validation. During this phase, the model is tested against data it hasn’t encountered before to
assess its accuracy and performance metrics.
Validation is key to gaining insights into the machine learning model’s capabilities and
confirming that the hybrid models neither overfit nor underfit the data.
Further, validation serves as a tool to detect any existing biases and data leakage in the
model and to identify modifications necessary for model improvement.
Deep learning models

6/23
Deep learning models, a robust variant of artificial intelligence, can be harnessed for diverse
tasks, spanning image recognition and natural language processing to powering autonomous
vehicles.
To ensure their optimal functionality, these models require rigorous validation. This step
confirms the model’s capacity to identify objects, categorize data, or forecast outcomes
precisely.
Commonly used deep learning models, such as Convolutional Neural Networks (CNNs),
deployed for image classification, need to undergo validation against known object datasets
to confirm their identification accuracy.
Similarly, Recurrent Neural Networks (RNNs), employed for natural language processing,
require testing against a text corpus to ascertain their ability to dissect text accurately and
yield correct results.
Lastly, reinforcement learning models designed for autonomous vehicles need to be trialed
against driving simulators to ensure their competence in accurately processing and reacting
to their environment.
Random Forest models
The Random Forest model, an ensemble machine learning method, amalgamates numerous
decision trees to form a highly precise and stable model. Its role in model validation is
pivotal, given its capacity to curtail the overfitting risk, thus delivering a more accurate model
performance prediction.
Arbitrarily selecting samples from the training dataset creates multiple decision trees, each
offering a prediction. The ultimate prediction is a mean of all tree predictions, offering a more
precise outcome than any individual tree.
The ability of this approach to promote better generalization of the model comes in handy
during model validation. It heightens the probability of the model delivering an accurate result
when deployed on fresh data.
Support Vector Machines (SVMs)
A Support Vector Machine (SVM) is a widely employed machine learning model for
validation, owing to its unique capability to augment the gap between different class data
points.
It can identify the best hyperplane that distinguishes data points from varying classes,
facilitating the accurate and dependable classification of these points.

7/23
Moreover, an SVM’s utility extends to identifying outliers, uncovering non-linear correlations
in data, and dealing with regression and classification issues. This versatility enhances its
popularity and makes it a go-to choice for model validation.
Neural network models
Artificial neural network models are a subset of machine learning models that emulate the
workings of a human brain. They can independently learn and make decisions without being
limited by pre-set parameters or prior knowledge. Certain validation requisites must be met
for these models to perform effectively and provide accurate results.
Primarily, these models need substantial training data to make accurate decisions and build
links between the diverse inputs and outputs. The training data should reflect the data the
model will encounter in production, as inconsistencies between the training and production
data could lead to inaccurate results.
Additionally, the data should be normalized to ensure all variables are on a consistent scale,
directly influencing the model’s performance.
Moreover, the model should be subjected to various parameters and data types to verify its
capability of handling diverse inputs and outputs.
Lastly, the model’s performance should be measured against various metrics to confirm that
it delivers the expected level of accuracy. These metrics can encompass accuracy scores,
precision, recall, and F1 scores. Evaluating the model against various metrics helps identify if
the model’s performance aligns with expectations and if any modifications are necessary to
enhance its performance.
k-nearest neighbors models
The k-Nearest Neighbors (k-NN) model is a widely used supervised learning algorithm
primarily used for tackling classification and regression tasks. Its simplicity and ease of
implementation make it a favored choice for model validation.
k-NN operates by locating the k-nearest neighbors, that is, the data points most similar to a
given input, and then classifying the input based on the most common label amongst these
k-nearest neighbors. This allows the model to make reliable predictions without needing prior
training on the data.
Furthermore, k-NN is relatively simpler when compared to other machine learning models,
hence making it an optimal selection for validation processes.
Being a non-parametric model, k-NN is not influenced by the number of features or the size
of the dataset. This distinctive attribute further bolsters k-NN’s suitability for validation as it
can accurately forecast a model’s performance on unseen data.

8/23
Bayesian models
Bayesian models are a category of probabilistic models that employ Bayes’ theorem to
calculate the likelihood of a given hypothesis based on certain data. They typically rely on
prior knowledge and often hinge on the data scientist’s preliminary assumptions. Bayesian
models are employed to predict and gauge the likely distributions of unknown variables.
There are three main types of Bayesian models: Bayesian parameter estimation models,
Bayesian network models, and Bayesian non-parametric models.
Bayesian parameter estimation models are used when there’s a need to estimate a
probabilistic model’s uncertain or unknown parameters. These models are utilized to deduce
the posterior distribution of a parameter set within a probabilistic model based on the
observed data.
Bayesian network models are probabilistic graphical models that illustrate the relationships
between various variables. They predict the value of a variable based on the values of other
variables in the system.
Bayesian non-parametric models, on the other hand, are probabilistic models that make no
assumptions about the data’s underlying distribution. They are primarily used to estimate the
likelihood of a hypothesis without having to define the parameters of the distribution.
Bayesian models are beneficial for modeling intricate systems and predicting a system’s
behavior using observed data. They have found extensive applications in machine learning,
AI, medical research, and more.
Clustering models
Clustering models need to be validated to confirm that the produced clusters are significant
and that the model is dependable.
Several prerequisites must be satisfied when utilizing this methodology:
Evaluation of the produced clusters’ quality is crucial.
Comparing the clusters generated by different algorithms provides important insights.
It is important to assess the stability of clusters over multiple iterations.
The scalability of the clustering model needs to be tested to ensure it can handle large
datasets.
Reviewing the results of the clustering model is essential to ensure they are
meaningful, reliable, and reflect the characteristics of the original data.
The cross-validation technique

9/23
Let’s imagine you have created a machine learning model and trained it on a specific
dataset. The model’s accuracy on this training data is about 95%. Does this mean your
model is superbly trained and the best model due to its high accuracy? Not necessarily! Your
model is knowledgeable about the training data, and it may have even captured minor
variations, thus over-generalizing from this specific data. However, when faced with entirely
new, unseen data, it might not predict with the same accuracy. This problem is known as
overfitting.
On the other hand, there could be instances where the model fails to train effectively on the
training set as it’s unable to identify patterns. Consequently, the model might not perform well
on the test set either. This problem is known as underfitting.
To overcome these issues of overfitting and underfitting, we use a technique known as cross-
validation.
Cross-validation is a method of resampling that segments the dataset into two portions – one
for training and the other for testing. The training data is used to train the model, while the
unseen test data assesses its predictive capability. If the model performs well on the test
data with a high level of accuracy, it suggests that the model has not overfit the training data
and can be used for future predictions.
Cross-validation is a statistical process utilized to determine machine learning models’
performance (or accuracy). It aids in preventing overfitting in a predictive model, particularly
in situations where data may be limited. The data is divided into a fixed number of partitions
or folds during cross-validation. The model analysis is run on each fold, and the overall error
estimate is averaged.
When embarking on a machine learning task, you must first correctly define the problem to
select the most fitting algorithm to yield the best results. But how do we compare different
models? For instance, you have trained a model with available data, and now you want to
assess its performance. One method could be to test the model on the same dataset used
for training. However, this is not always the best practice.
Testing the model on the training dataset could lead us to presume that the training data
represents all possible real-world scenarios, which is rarely the case. Our main goal is to
ensure the model performs well on real-world data. Although the training dataset is derived
from real-world data, it only represents a small subset of all possible data points (examples)
out there.
Therefore, to truly gauge the model’s capabilities, it should be tested on data it has never
seen before, often referred to as the test dataset. However, by splitting our data into training
and testing data, aren’t we potentially missing out on some important information that the test
dataset may hold? Let’s explore the different types of cross-validation to find the answers to
this question.

10/23
Different types of model validation techniques
We will illustrate the following strategies for validation:
Splitting data into training and testing sets
Cross-validation using k-folds
Leave-one-out cross-validation method
Leave-one-group-out cross-validation
Nested cross-validation technique
Cross-validation for time-series data
Stratified k-fold cross-validation
Splitting data into training and testing sets
Train, Test Split Dataset
Holdout Holdout
Holdout
Training Data
Training Data (for Fitting)
Original Available Data
Validation Data
(Tuning Hyper-parameter)
Testing Data
(Evaluating Performance)
Testing Data
(Evaluating Performance)
Serve the same purpose for K-Fold,
Cross Validation
LeewayHertz
The crux of all validation methods is the division of your data during model training. This step
is crucial to simulate how your model would react to data it has never encountered before.

11/23
Training and testing data division: The most fundamental approach involves dividing
your data into training and testing sets. Generally, you randomly split your data into two
parts, approximately 70% for training the model and 30% for testing its performance.
The advantage of this method lies in its ability to evaluate the model’s response to new,
unseen data. Nevertheless, this approach could pose an issue if a specific subset of
our data only contains individuals with a certain age or income bracket, leading to a
situation commonly referred to as sampling bias. We could employ methods like k-fold
cross-validation to mitigate this bias, which we will delve into later. However, before
that, let’s examine the concept of a ‘holdout set.’
Holdout set: When adjusting the hyperparameters of your model, there’s a possibility
of overfitting if you optimize using just the train/test split. The reason is that the model
attempts to find hyperparameters that fit the specific train/test split you have made. To
address this problem, an additional holdout set can be created.
Train Test Holdout
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
LeewayHertz
The holdout method is one of the most straightforward and commonly implemented
evaluation techniques in machine learning projects. This approach splits the entire dataset
into two subsets – a training set and a test set. The division ratio can vary based on the use
case – common splits include 70-30, 60-40, 75-25, 80-20, or even 50-50, with the training
data usually forming the larger proportion. The split is done randomly, and unless a specific
‘random_state’ is specified, we have no control over which data points fall into the training or
testing buckets. This could result in considerable variance, as each change in the split can
lead to variations in accuracy.
However, this method has certain limitations:
The test error rates in the holdout method can be highly variable and can greatly depend on
which observations land in the training and testing sets. This can lead to high variance. The
model is only trained on a portion of the data, which might not be ideal when the available
data is not abundant. This could lead to overestimating the test error, indicating a high bias.
One of the primary advantages of the holdout method is that it is computationally less
demanding compared to other cross-validation techniques, making it a cost-effective option.
If you are only using a train/test split, comparing the distributions of your training and testing
sets is advisable. If the distributions significantly differ, you might encounter generalization
problems. You can use tools like Facets to compare their distributions easily.

12/23
Cross-validation using k-folds (k-fold CV)
Test
Test
Test
Test
Test
Train
Train
Train Train
Train
Train
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Train
Train
LeewayHertz
Cross-validation using k-folds is an effective technique used to mitigate the issue of sampling
bias in machine learning. Instead of making a single split, it involves creating numerous splits
and then conducting validation across all these different splits.
The k-fold cross-validation method divides the data into ‘k’ number of subsets, or folds. The
model is then trained on k-1 of these folds and tested on the remaining folds set aside. This
process repeats for each unique combination, ensuring every fold acts as a testing set once.
The results of each iteration are then averaged.
One of the key advantages of this approach is that every data point is used both for training
and validation, and each one is used for validation precisely once. A common practice is to
choose k=5 or k=10, as these values provide a good balance between computational
efficiency and validation accuracy.
The scores obtained from each fold in the cross-validation process can offer more insights
than just the average performance. Examining these scores’ variance or standard deviation
can provide information about the model’s stability across varying data inputs.
With k-fold cross-validation, k is smaller than the number of observations in the dataset. This
method generates the mean outcomes from k-fitted models, which demonstrate marginally
lower correlation due to the reduced intersection among the training sets incorporated in
each model. This results in reduced variance.

13/23
The k-fold method balances bias, as each training set contains fewer observations than the
leave-one-out method but more than the holdout method. This leads to an intermediate level
of bias.
Typically, k-fold cross validation is performed with k set to either 5 or 10, as these values
have been proven, through practical evidence, to provide test error estimates that are not
overly impacted by high bias or variance.
The primary drawback of this method is its computational expense. The model must be run
from scratch k times, which requires more computational resources than the holdout method,
but it’s still more efficient than the leave-one-out method.
Leave-one-out Cross-validation method
1 2 3
1 2 3
1 2 3
1 2 3
LeewayHertz
Leave-one-out Cross-validation (LOOCV) is a nuanced variant of the k-fold cross-validation
method. In LOOCV, each data point in the dataset serves as a separate test set, while all the
remaining data points form the training set. This method is identical to k-fold cross-validation,
where k equals the total number of observations.
Though LOOCV is highly thorough, it can be computationally demanding since the model
must be trained for every data point in the dataset. This method is only recommended if the
dataset is small or the computational resources can accommodate such intensive
calculations.

14/23
The process involves selecting one observation as the test data and treating all remaining
observations as training data. The model is then trained, and this procedure is repeated for
every observation in the dataset. The test error is estimated by averaging the errors across
all iterations.
When it comes to estimating test errors, LOOCV is known to offer unbiased estimates due to
its comprehensive approach to testing on each data point. However, bias isn’t the only factor
to consider in estimation problems – variance must also be accounted for. LOOCV tends to
have high variance as it averages the outputs of multiple models, each trained on nearly
identical data. As such, the outputs are highly correlated with each other.
LOOCV is a computationally expensive method since the model must be run as often as
there are data points. This issue is addressed in other methods which aim to strike a better
balance between bias and variance.
Group 1
Group 1
Group 1
Group 2
Group 2
Group 2
Group 3
Group 3
Group 3
Group 4 Group n
Group n
Group n
Group 4
Group 4
LeewayHertz
Leave-one-group-out Cross-validation (LOGOCV) is a specialized form of k-fold cross-
validation designed to handle data grouped in distinct categories. Suppose you have a
dataset containing information about multiple companies and their clients, and your goal is to
predict the companies’ success. In a standard k-fold CV, each fold may contain data from
multiple companies. However, with LOGOCV, you construct each fold to include data from a
single company. This way, you maintain the integrity of company-specific data within each
fold.
The process is akin to a hybrid of k-fold CV and LOOCV, but with a twist: instead of leaving
out a single data point (as in LOOCV) or random subsets of data (as in k-fold CV), you
exclude all data pertaining to one company or ‘group.’ This ensures each validation step
tests the model’s performance on unseen data from a completely different company.

15/23
This method is particularly advantageous when the aim is to understand how well the model
generalizes across different groups or categories in the data, as it allows for an unbiased
assessment of the model’s predictive capability on entirely new groups.
Nested cross-validation technique
Test Fold Train Folds
Validation Fold
Inner Loop
Tune Parameters
Outer Loop
Validate with Best
Parameters of
Inner Loop
Training Fold
LeewayHertz
Nested cross-validation is a robust method used in scenarios where we need to optimize
hyperparameters and estimate model performance simultaneously yet want to avoid
overfitting. This approach ingeniously integrates two loops of k-fold cross-validation – an
inner loop and an outer loop.
The inner loop primarily deals with hyperparameter optimization. It helps determine the best
hyperparameters that would tune the model for optimal performance. However, estimating
the model’s accuracy on the same data used for selecting hyperparameters can lead to
overfitting.
This is where the outer loop comes into play. The role of the outer loop is to validate the
model’s performance. It provides an unbiased estimate of model accuracy using data that
the model has not seen during the hyperparameter tuning phase in the inner loop.
This arrangement ensures that the model’s performance is assessed on unseen data,
mitigating the risk of overfitting. The nested structure of these two loops, which gives it the
name ‘nested cross-validation,’ provides a more robust and fair assessment of the model’s
predictive power and generalization ability.
Cross-validation for time-series data

16/23
Validating time-series data demands a distinctive approach, unlike other forms of data. When
employing cross-validation on time-series data, preserving the temporal order of the
observations is essential. This means the usual random partitioning strategies like k-fold CV
may lead to overfitting as they can inadvertently leak future information into the training set.
Properly handling cross-validation for time-series data involves structuring your folds so that
all the training data occurs chronologically before your test data. This technique, often called
time-series cross-validation, ensures that the model is trained on a past ‘slice’ of data at
each fold and validated on a subsequent, future ‘slice.’
This preserves the temporal structure and dependency inherent in time-series data, offering
a more reliable and realistic estimate of the model’s predictive performance on unseen,
future data. It effectively curbs the risk of information leakage from the future, keeping the
model evaluation process in line with how time-series forecasting models are intended to
work in real-world applications.
Stratified k-fold cross-validation
Stratified k-fold cross-validation offers a slight modification to the typical k-fold cross-
validation method, utilizing the concept of ‘stratified sampling’ as an alternative to ‘random
sampling.’
Before we delve into the specifics of stratified sampling, let’s first distinguish it from random
sampling. Imagine a dataset consisting of product reviews from a skincare product both
genders use. Through random sampling to segregate data into training and testing subsets,
we might inadvertently cause a disproportionate representation of one gender in the training
data while having a larger representation in the testing data. If the model is trained on data
that doesn’t truly represent the full diversity of the actual population, it’s likely to deliver less
accurate predictions when tested.
To circumvent this issue, we employ stratified sampling. In this method, the data is
partitioned to maintain the proportion of each class from the overall population.
For instance, consider a dataset containing 1000 customer reviews for a product, with a
gender ratio of 60% females and 40% males. If we wish to divide the data into training and
testing subsets following an 80:20 ratio, stratified sampling ensures that gender
proportionality is preserved in each subset. Therefore, the 800-sample training set would
include 480 reviews from females and 320 from males. Similarly, the 200-sample testing set
would maintain the same gender distribution.
This is the essence of Stratified k-fold cross-validation – creating k-folds that preserve the
original class proportionality. Consequently, it addresses the imbalance issue potentially
introduced by the Holdout and k-fold methods, enhancing the model’s ability to generalize
across the entire dataset.

17/23
How to validate machine learning models using TensorFlow Model
Analysis (TFMA)?
TensorFlow Model Analysis (TFMA) is a robust tool developed by Google, intended to
present an in-depth evaluation of the performance of machine learning models. By
leveraging Apache Beam’s capabilities, TFMA conducts computations in a distributed
manner over large-scale datasets. The unique strength of TFMA lies in its capacity to provide
deep insights into the variations in model performance across different data segments. It is
capable of calculating metrics that were utilized during the training phase, known as built-in
metrics, as well as metrics determined after the model was saved as part of the TFMA
configuration settings.
In this example, we focus on a comprehensive evaluation and analysis of a machine learning
model that has been previously trained. The complete code of the model is available in this
Github location, leveraging the Taxi Trips dataset released by the city of Chicago.
Pre-requisites
1. Basic proficiency in Apache Beam.
2. An elementary comprehension of how machine learning models function.
3. A fresh Google Colab notebook to execute the Python code in the Google Drive.
Installation of TensorFlow Model Analysis (TFMA)
Once the Google Colab notebook is set up, the initial step involves importing all the
necessary dependencies. This operation may require some time to complete.
Rename the file from Untitled.ipynb to TFMA.ipynb.
Next execute the below code to install the dependencies:
!pip install -U pip
!pip install tensorflow-model-analysis`
!pip install apache_beam
!pip install apache-beam
The initial command upgrades pip, the package management system responsible for the
installation and management of Python software packages. The term “pip” signifies
“preferred installer program.” Subsequently, the TensorFlow Model Analysis, TFMA, is
installed via the second command.
Following the completion of this process, it is crucial to reset the runtime environment before
proceeding to execute the forthcoming cells.

18/23
import sys
assert sys.version_info.major==3
import tensorflow as tf
import apache_beam as beam
import tensorflow_model_analysis as tfma
This segment of code brings in the necessary libraries such as ‘sys’, ‘tensorflow’,
‘apache_beam’, and ‘tensorflow_model_analysis’. The command ‘assert
sys.version_info.major==3’ is used to confirm that Python 3 is the version being employed to
execute the notebook.
Loading the data set
You need to download the data set stored as a tar file and extract it. Use the following code
for the same:
import io, os, tempfile
TAR_NAME = 'saved_models-2.2'
BASE_DIR = tempfile.mkdtemp()
DATA_DIR = os.path.join(BASE_DIR, TAR_NAME, 'data')
MODELS_DIR = os.path.join(BASE_DIR, TAR_NAME, 'models')
SCHEMA = os.path.join(BASE_DIR, TAR_NAME, 'schema.pbtxt')
OUTPUT_DIR = os.path.join(BASE_DIR, 'output')
!curl -O https://storage.googleapis.com/artifacts.tfx-oss-
public.appspot.com/datasets/{TAR_NAME}.tar
!tar xf {TAR_NAME}.tar
!mv {TAR_NAME} {BASE_DIR}
!rm {TAR_NAME}.tar
The downloaded dataset comes in the form of a tar file. This tar file contains a variety of
resources including the training datasets, evaluation datasets, the data schema, as well as
the saved models for both training and serving, alongside the saved models for evaluation.
Parsing the schema
The downloaded schema has to be parsed to be effectively utilized with TFMA.
import tensorflow as tf
from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2
from tensorflow.core.example import example_pb2
schema = schema_pb2.Schema()
contents = file_io.read_file_to_string(SCHEMA)
schema = text_format.Parse(contents, schema)

19/23
The schema will be parsed by utilizing the text_format function from the google.protobuf
library, which transforms the protobuf message into a textual format, along with TensorFlow’s
schema_pb2.
Using the schema to create TFRecords
The subsequent step involves providing TFMA with access to the dataset. To accomplish
this, a TFRecords file must be established. Utilizing the existing schema to do this is
advantageous because it ensures each feature is assigned the accurate type.
import csv
datafile = os.path.join(DATA_DIR, 'eval', 'data.csv')
reader = csv.DictReader(open(datafile, 'r'))
examples = []
for line in reader:
example = example_pb2.Example()
for feature in schema.feature:
key = feature.name
if feature.type == schema_pb2.FLOAT:
example.features.feature[key].float_list.value[:] = (
[float(line[key])] if len(line[key]) > 0 else [])
elif feature.type == schema_pb2.INT:
example.features.feature[key].int64_list.value[:] = (
[int(line[key])] if len(line[key]) > 0 else [])
elif feature.type == schema_pb2.BYTES:
example.features.feature[key].bytes_list.value[:] = (
[line[key].encode('utf8')] if len(line[key]) > 0 else [])
# Add a new column 'big_tipper' that indicates if the tip was > 20% of the fare.
# TODO(b/157064428): Remove after label transformation is supported for Keras.
big_tipper = float(line['tips']) > float(line['fare']) * 0.2
example.features.feature['big_tipper'].float_list.value[:] = [big_tipper]
examples.append(example)
tfrecord_file = os.path.join(BASE_DIR, 'train_data.rio')
with tf.io.TFRecordWriter(tfrecord_file) as writer:
for example in examples:
writer.write(example.SerializeToString())
!ls {tfrecord_file}
It is important to highlight that TFMA supports various model types, such as TF Keras
models, models built on generic TF2 signature APIs, and TF estimator-based models.
Here we will explore how we to configure a Keras-based model.
In configuring the Keras model, the metrics and plots will be manually integrated as part of
the setup process.
Set up and run TFMA using Keras
Import tfma using the below code

20/23
import tensorflow_model_analysis as tfma
You will need to call and use the previously imported instance of TensorFlow Model Analysis
(TFMA).
# You will setup tfma.EvalConfig settings
keras_eval_config = text_format.Parse("""
## Model information
model_specs {
# For keras (and serving models) we need to add a `label_key`.
label_key: "big_tipper"
}
## You will post training metric information. These will be merged with any built-in
## metrics from training.
metrics_specs {
metrics { class_name: "ExampleCount" }
metrics { class_name: "BinaryAccuracy" }
metrics { class_name: "BinaryCrossentropy" }
metrics { class_name: "AUC" }
metrics { class_name: "AUCPrecisionRecall" }
metrics { class_name: "Precision" }
metrics { class_name: "Recall" }
metrics { class_name: "MeanLabel" }
metrics { class_name: "MeanPrediction" }
metrics { class_name: "Calibration" }
metrics { class_name: "CalibrationPlot" }
metrics { class_name: "ConfusionMatrixPlot" }
# ... add additional metrics and plots ...
}
## You will slice the information
slicing_specs {} # overall slice
slicing_specs {
feature_keys: ["trip_start_hour"]
}
slicing_specs {
feature_keys: ["trip_start_day"]
}
slicing_specs {
feature_values: {
key: "trip_start_month"
value: "1"
}
}
slicing_specs {
feature_keys: ["trip_start_hour", "trip_start_day"]
}
""", tfma.EvalConfig())
It’s crucial to establish a TensorFlow Model Analysis (TFMA) EvalSharedModel that
references the Keras model.

21/23
keras_model_path = os.path.join(MODELS_DIR, 'keras', '2')
keras_eval_shared_model = tfma.default_eval_shared_model(
eval_saved_model_path=keras_model_path,
eval_config=keras_eval_config)
keras_output_path = os.path.join(OUTPUT_DIR, 'keras')
Finally run TFMA.
keras_eval_result = tfma.run_model_analysis(
eval_shared_model=keras_eval_shared_model,
eval_config=keras_eval_config,
data_location=tfrecord_file,
output_path=keras_output_path)
Upon completing the evaluation, the next step is to examine the output through the lens of
TensorFlow Model Analysis (TFMA) visualizations.
To showcase metrics, the function tfma.view.render_slicing_metrics can be employed. By
default, this function presents an “Overall” slice. If the aim is to view a specific slice, one can
input the column name (by setting slicing_column) or supply a tfma.SlicingSpec.
Track your model’s performance
Model training will be executed utilizing the training dataset, which should ideally echo the
characteristics of the test dataset and data destined for production deployment.
However, it’s worth noting that even though the inference request data may mirror the
training data initially, over time, it’s likely to deviate. This can influence the model’s
performance.
Therefore, it is crucial to continuously monitor and gauge the model’s performance to stay
alert to and mitigate any potential changes.
This is where TensorFlow Model Analysis (TFMA) steps in as an invaluable tool.

22/23
output_paths = []
for i in range(3):
# Create a tfma.EvalSharedModel that points to our saved model.
eval_shared_model = tfma.default_eval_shared_model(
eval_saved_model_path=os.path.join(MODELS_DIR, 'keras', str(i)),
eval_config=keras_eval_config)
output_path = os.path.join(OUTPUT_DIR, 'time_series', str(i))
output_paths.append(output_path)
# Run TFMA
tfma.run_model_analysis(eval_shared_model=eval_shared_model,
eval_config=keras_eval_config,
data_location=tfrecord_file,
output_path=output_path)
eval_results_from_disk = tfma.load_eval_results(output_paths[:2])
tfma.view.render_time_series(eval_results_from_disk)
With TensorFlow Model Analysis (TFMA), it’s possible to assess and verify machine learning
models across various data segments. This provides the ability to comprehend the
performance of models across different data conditions and scenarios
Challenges in ML model validation
Data density: With the advancement of new ML and AI techniques, there is a growing
need for large, diverse, and detailed datasets to build effective models. This data can
also often be less structured, which makes validation challenging. Thus, developing
innovative tools becomes essential for data integrity and suitability checks.
Theoretical clarity: A major challenge in the ML/AI domain lies in understanding its
methodologies. They are not as comprehensively grasped by practitioners as more
traditional techniques. This lack of understanding extends to the specifics of a certain
method and how well-suited a procedure is for a given modeling scenario. This makes
it harder for model developers to substantiate the appropriateness of the theoretical
framework in a specific modeling or business context.
Model documentation and coding: An integral part of model validation is evaluating
the extent and completeness of the model documentation. The ideal documentation
would be exhaustive and standalone, enabling third-party reviewers to recreate the
model without accessing the model code. This level of detail is challenging even with
standard modeling techniques and even more so in the context of ML/AI, making model
validation a demanding task.

23/23
Evaluation of results and model testing: With ML/AI, financial institutions may need
to reconsider standard backtesting methods, like measures of model fit and other
analytics such as sensitivity analysis or stability analysis. More computationally taxing
techniques, like k-fold cross-validation, might be necessary to test the accuracy and
robustness of ML/AI models. Unlike traditional methods like ordinary regression,
sensitivity analysis may vary significantly depending on the model type due to the less
direct correlation between inputs and outputs.
Third-party models: In the world of ML and AI, it’s often more common for companies
to use models created by external vendors. Regulatory standards require these
externally sourced models to undergo the same rigorous checks as those built in-
house, and this could be even more important with ML/AI models. Traditional methods
of checking model performance can be more difficult in the context of ML/AI. Therefore,
companies may rely more on less formal methods of validation like regular monitoring
of models and ensuring they are based on solid theory. These checks would involve a
thorough review of documents that detail how the model was customized, how it was
developed, and how suitable it is for the company’s needs.
Endnote
Model validation is a critical step in machine learning, determining the accuracy and reliability
of various models, including supervised, unsupervised, deep learning models, and many
others. The importance of model validation lies in its ability to ensure that models perform
adequately, are robust, and can handle stress scenarios.
Techniques such as cross-validation, splitting data into training and testing sets, and tools
like TensorFlow Model Analysis (TFMA) help in this process. Despite the challenges, model
validation is essential. Without it, the outputs of the models may not be reliable, which can
negatively affect business decisions.
Model validation safeguards the integrity of data-driven decision-making and is pivotal to the
success of businesses in a data-centric world. It’s not just an option; it’s a necessity.
Want reliable, high-performing machine learning models? Collaborate with LeewayHertz’s
team of ML experts who specialize in developing robust ML models capable of withstanding
stress scenarios and delivering consistent results. Contact us today to elevate your machine
learning initiatives!
Start a conversation by filling the form

Model validation techniques in machine learning.pdf

Recommended

Recommended

More Related Content

Similar to Model validation techniques in machine learning.pdf

Similar to Model validation techniques in machine learning.pdf (20)

More from AnastasiaSteele10

More from AnastasiaSteele10 (9)

Recently uploaded

Recently uploaded (20)

Model validation techniques in machine learning.pdf