Supervised Machine Learning Guide

Supervised Learning
• Supervised learning is the most common sub-branch of machine
learning today.
• Typically, new machine learning practitioners begin their journey
with supervised learning algorithms. Therefore, the first post in this
three-part series covers supervised learning.
• Supervised machine learning algorithms are designed to learn by
example.
• The name “supervised” learning originates from the idea that training
this type of algorithm is like having a teacher supervise the whole
process.

Supervised Learning
• When training a supervised learning algorithm, the training data will
consist of inputs paired with the correct outputs.
• During training, the algorithm will search for patterns in the data that
correlate with the desired outputs.
• After training, a supervised learning algorithm takes in new, unseen
inputs and determines which label to assign to them based on the
prior training data.
• The objective of a supervised learning model is to predict the correct
label for newly presented input data.

Supervised Learning
• In its most basic form, a supervised learning algorithm can be written
simply as:
Y = f(x)
• Where Y is the predicted output, determined by a mapping
function f that assigns a class to an input value x.
• The function used to connect input features to a predicted output is
created by the machine learning model during training.
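As a minimal sketch of the idea that training produces the mapping function, the hypothetical `train` helper below learns a single threshold from labeled examples and returns f. The one-dimensional inputs, the 0/1 labels, and the midpoint rule are assumptions chosen for illustration only.

```python
def train(examples):
    """Learn a mapping function f from (x, label) pairs.

    Assumption for illustration: inputs are single numbers and the two
    labels are 0 and 1, with class 1 tending to have larger x values.
    """
    xs0 = [x for x, y in examples if y == 0]
    xs1 = [x for x, y in examples if y == 1]
    # Place the decision threshold midway between the two class means.
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

    def f(x):
        # The learned mapping: assigns a class to an input value x.
        return 1 if x > threshold else 0

    return f


training_data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
f = train(training_data)
print(f(2.5), f(8.5))
```

The returned `f` is created entirely from the training data, which is the point the slide makes: the practitioner supplies examples, not the rule.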

Types of Supervised Learning
• Supervised learning can be split into
two subcategories: Classification and
regression.
• Classification:
• During training, a classification
algorithm will be given data points
with an assigned category. The job of a
classification algorithm is to then take
an input value and assign it a class, or
category, that it fits into based on the
training data provided.

• Classification:
• The most common example of classification is determining if an email is
spam or not.
• With two classes to choose from (spam, or not spam), this problem is
called a binary classification problem. The algorithm will be given training
data with emails that are both spam and not spam.
• The model will find the features within the data that correlate to either
class and create the mapping function mentioned earlier: Y=f(x).
• Then, when provided with an unseen email, the model will use this
function to determine whether or not the email is spam.
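The spam example can be sketched with a deliberately simple word-count heuristic. This is not a production spam filter and not one of the named algorithms; the function names, the "ham" label for non-spam, and the four training emails are all made up for illustration.

```python
from collections import Counter


def train_spam_filter(emails):
    """Count how often each word appears in spam and in non-spam emails."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in emails:
        counts[label].update(text.lower().split())
    return counts


def classify(counts, text):
    """Label an email by which class its words appeared in more often."""
    words = text.lower().split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"


# Hypothetical labeled training data: both spam and non-spam ("ham").
emails = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train_spam_filter(emails)
print(classify(model, "claim your free prize"))
print(classify(model, "agenda for the team meeting"))
```

Here the word counts play the role of the features that correlate with each class, and `classify` plays the role of Y=f(x) for an unseen email.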

• Classification:
• Classification problems can be solved with a large number of
algorithms. Which algorithm you choose depends on the
data and the situation. Here are a few popular classification
algorithms:
• Linear Classifiers
• Support Vector Machines
• Decision Trees
• K-Nearest Neighbor
• Random Forest
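To make one of the listed algorithms concrete, here is a small sketch of K-Nearest Neighbors using only the standard library. The two-dimensional points, the labels "A" and "B", and the choice k=3 are illustrative assumptions.

```python
from collections import Counter
import math


def knn_predict(train_points, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest points."""
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(train_points, key=lambda p: math.dist(p[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated clusters.
points = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]
print(knn_predict(points, (1.8, 1.7)))
print(knn_predict(points, (6.4, 6.6)))
```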

• Regression
• Regression is a predictive statistical process in which the model
attempts to find the relationship between dependent and
independent variables. The goal of a regression algorithm is to predict
a continuous number such as sales, income, or test scores. The
equation for basic linear regression can be written as follows:
y = w[0]*x[0] + w[1]*x[1] + ... + w[i]*x[i] + b
• Where x[i] are the features of the data, and w[i] and b are
parameters that are learned during training.
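The weighted sum described above (each feature x[i] multiplied by its weight w[i], plus b) can be sketched as a short function. The name `predict` and the example numbers are assumptions for illustration; in practice the weights would come from training.

```python
def predict(weights, features, bias):
    """Compute w[0]*x[0] + w[1]*x[1] + ... + w[n]*x[n] + b."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


w = [2.0, -1.0, 0.5]   # hypothetical learned weights
b = 3.0                # hypothetical learned bias
x = [1.0, 4.0, 2.0]    # one data point with three features
print(predict(w, x, b))  # 2*1 - 1*4 + 0.5*2 + 3 = 2.0
```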

• For simple linear regression models with only one feature in the data,
the formula looks like this:
y = w*x + b
• Where w is the slope, x is the single feature, and b is the y-intercept.
Familiar?
• For simple regression problems such as this, the model’s predictions
are represented by the line of best fit.
• For models using two features, a plane is used. Finally, for a
model using more than two features, a hyperplane is used.

• Imagine we want to determine a student’s test
grade based on how many hours they studied the
week of the test. Let’s say the plotted data with a
line of best fit looks like this:
• There is a clear positive correlation between
hours studied (independent variable) and the
student’s final test score (dependent variable).
• A line of best fit can be drawn through the data
points to show the model’s predictions when
given a new input.
• Say we wanted to know how well a student would
do with five hours of studying. We can use the
line of best fit to predict the test score based on
other students’ performance.
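A rough sketch of the study-hours example: fit a least-squares line to a handful of points, then read off the prediction for five hours. The hours and scores below are invented for illustration and are not the data from the original plot.

```python
def fit_line(xs, ys):
    """Return slope w and intercept b of the least-squares line y = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w = covariance / variance
    b = mean_y - w * mean_x
    return w, b


# Hypothetical data: hours studied and the resulting test scores.
hours = [1, 2, 3, 4, 6, 7]
scores = [52, 58, 65, 70, 82, 87]

w, b = fit_line(hours, scores)
predicted = w * 5 + b  # prediction for a student who studied five hours
print(f"slope={w:.2f}, intercept={b:.2f}, predicted score={predicted:.1f}")
```

A positive slope corresponds to the positive correlation between hours studied and test score.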

• There are many different types of regression algorithms. Three of the
most common are listed below:
• Linear Regression
• Logistic Regression (despite its name, typically used for classification)
• Polynomial Regression

Summary
• Supervised learning is the simplest subcategory of machine learning
and serves as an introduction to machine learning for many
practitioners.
• Supervised learning is the most commonly used form of machine
learning, and has proven to be an excellent tool in many fields.

Learning Curves
• A learning curve is a plot showing how a specific learning-related
metric progresses with experience during the training of a machine
learning model.
• Learning curves are simply a mathematical representation of the
learning process.

Single Curves
• The most popular example of a learning
curve is loss over time. Loss (or cost)
measures our model error, or “how badly
our model is doing”.
• So, for now, the lower our loss becomes,
the better our model performance will
be.
• In the picture below, we can see the
expected behavior of the learning
process:
• Despite the fact it has slight ups and
downs, in the long term, the loss
decreases over time, so the model is
learning.
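A loss curve is simply the list of loss values recorded during training. The sketch below, assuming a one-feature linear model trained by plain gradient descent on made-up data, records the mean squared error after every epoch; the learning rate and epoch count are arbitrary illustrative choices.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train_and_record_loss(xs, ys, epochs=50, learning_rate=0.05):
    """Run gradient descent and return the loss after every epoch."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(epochs):
        # Gradients of the MSE with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        losses.append(mse(w, b, xs, ys))
    return losses


# Hypothetical training data lying close to a straight line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]
loss_curve = train_and_record_loss(xs, ys)
print([round(v, 3) for v in loss_curve[:5]], "...", round(loss_curve[-1], 3))
```

Plotting `loss_curve` against the epoch index would give the kind of single curve described above. With full-batch gradient descent on this small problem the curve is smooth; the slight ups and downs mentioned in the slide typically come from noisier training setups.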

Single Curves
• Other examples of very popular
learning curves are accuracy,
precision, and recall.
• All of these capture model
performance, so the higher they are,
the better our model becomes.
• See below an example of a typical
accuracy curve over time:
• The model performance is growing
over time, which means the model is
improving with experience (it’s
learning).
• We also see it grows at the beginning,
but over time it reaches a plateau,
meaning it’s not able to learn
anymore.
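For reference, the three performance metrics named above can be computed from true and predicted labels as follows. This is a sketch for binary labels; the function name and the sample labels are illustrative.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```

Evaluating these after each epoch and storing the values yields the accuracy, precision, or recall curve over time.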

Multiple Curves
• One of the most widely used metric
combinations is training loss plus validation loss
over time.
• The training loss indicates how well the model
is fitting the training data, while the validation
loss indicates how well the model fits new
data.
• We will see this combination later on, but for
now, see below a typical plot showing both
metrics:
• Another common practice is to plot multiple
metrics in the same chart, as well as the same
metrics for different models.

Two Main Types
• We often see these two types of learning
curves appearing in charts:
• Optimization Learning Curves: Learning
curves calculated on the metric by which
the parameters of the model are being
optimized, such as loss or Mean Squared
Error
• Performance Learning Curves: Learning
curves calculated on the metric by which
the model will be evaluated and selected,
such as accuracy, precision, recall, or F1
score
• Below you can see an example in Machine
Translation showing BLEU (a performance
score) together with the loss (optimization
score) for two different models (orange and
green):

How to Detect Model Behavior?
• High Bias/Underfitting
• Bias: High bias occurs when the learning algorithm does not take into
account all the relevant information, making it unable to capture the
data’s richness and complexity
• Underfitting: When the algorithm is not able to model either training data
or new data, consistently obtaining high error values that don’t decrease
over time
• We can see they are closely tied: the more biased a model is, the more it
underfits the data.
• Let’s imagine our data are the blue dots below, and we want to come up
with a linear model for regression purposes:

• Suppose we’re very lazy machine learning practitioners and we propose this line
as a model:
• Clearly, a straight line like that doesn’t represent the pattern of our dots. It lacks
some complexity to describe the nature of the given data. We can see how the
biased model doesn’t take into account relevant information, which leads to
underfitting.

• It’s doing a terrible job with the training data already, so what would
be the performance for a new example?
• It’s pretty obvious it performs as poorly with the new example as it
does with the training data:

• Now, how can we use learning curves to detect
that our model is underfitting? See an example
showing validation and training cost (loss)
curves:
• The cost (loss) function is high and doesn’t
decrease with the number of iterations, both for
the validation and training curves
• We could actually use just the training curve and
check that the loss is high and that it doesn’t
decrease, to see that it’s underfitting
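The check described above (training loss is high and does not decrease) can be written as a small heuristic. The function name and both thresholds are assumptions for illustration; what counts as a "high" loss depends entirely on the loss scale of the problem.

```python
def looks_underfit(train_losses, high_loss=1.0, min_improvement=0.05):
    """Heuristic check: the training loss stays high and barely decreases.

    `high_loss` and `min_improvement` are illustrative thresholds that
    would need tuning for a real loss scale.
    """
    first, last = train_losses[0], train_losses[-1]
    relative_improvement = (first - last) / first if first else 0.0
    return last >= high_loss and relative_improvement < min_improvement


# Hypothetical training loss curves.
flat_curve = [2.50, 2.48, 2.49, 2.47, 2.48, 2.47]
healthy_curve = [2.50, 1.60, 0.90, 0.50, 0.30, 0.20]
print(looks_underfit(flat_curve), looks_underfit(healthy_curve))
```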

High Variance/Overfitting
• Variance: High variance happens when the model is too complex and
doesn’t represent the simpler real patterns existing in the data
• Overfitting: The algorithm captures the training data well, but it
performs poorly on new data, so it’s not able to generalize
• These are also directly related concepts: The higher the variance of a
model, the more it overfits the training data.
• Let’s take the same example as before, where we wanted a linear
model to approximate these blue dots:

• Well, we understand intuitively that this line is not what we wanted,
either. Indeed, it fits the data, but it doesn’t represent the real
pattern in it.
• When a new example appears, the model will struggle to handle it. See a new
example (in orange):
• The overfitted model won’t predict the new example well:

• How could we use learning curves to detect that a
model is overfitting? We’ll need both the
validation and training loss curves:
• The training loss goes down over time, achieving
low error values
• The validation loss goes down until a turning
point is found, and there it starts going up again.
That point represents the beginning of
overfitting
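Given recorded loss curves, the turning point can be located as the epoch with the lowest validation loss. The sketch below uses made-up loss values in which the training loss keeps falling while the validation loss turns upward; the function name is hypothetical.

```python
def overfitting_start(val_losses):
    """Return the epoch index with the lowest validation loss.

    After this index the validation loss has turned upward, which is
    the point described above as the beginning of overfitting.
    """
    return min(range(len(val_losses)), key=val_losses.__getitem__)


# Hypothetical curves: training loss keeps decreasing, validation loss does not.
train_losses = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18, 0.12, 0.08]
val_losses = [1.10, 0.82, 0.65, 0.58, 0.56, 0.60, 0.68, 0.79]

turning_point = overfitting_start(val_losses)
gap = val_losses[-1] - train_losses[-1]
print(turning_point, round(gap, 2))
```

The growing gap between the two curves after the turning point is the visual signature of overfitting.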

Finding the Right Bias/Variance Tradeoff
• The solution to the bias/variance problem is to find a sweet spot
between them.
• In the example given above:
• a good linear model for the data would be a line like this:
• So, when a new example appears:
• We will make a better prediction:

Finding the Right Bias/Variance Tradeoff
• We can use the validation and training
loss curves to find the right bias/variance
tradeoff:
• The training process should be stopped
when the validation error trend changes
from descending to ascending
• If we stop the process before that point,
the model will underfit
• If we stop the process after that point, the
model will overfit
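Stopping when the validation error trend changes from descending to ascending is often implemented as early stopping. The sketch below adds a `patience` parameter, which is an assumption not mentioned in the slides: it tolerates a few non-improving epochs so that a single noisy value does not end training too soon.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch to keep when stopping after `patience` bad epochs.

    Training halts once the validation loss has failed to improve for
    `patience` consecutive epochs; the best epoch seen so far is returned.
    """
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation loss is trending upward: stop training
    return best_epoch


# Hypothetical validation losses, one value per epoch.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.57, 0.63]
print(early_stopping_epoch(val_losses))  # 3
```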

Training, Validation and Test.
• Training data. This type of data builds up the machine learning
algorithm. The data scientist feeds the algorithm input data, which
corresponds to an expected output.
• The model evaluates the data repeatedly to learn more about the
data’s behavior and then adjusts itself to serve its intended purpose.
• Validation data. During training, validation data introduces data
that the model hasn’t evaluated before. Validation data provides
the first test against unseen data, allowing data scientists to evaluate
how well the model makes predictions based on the new data.
• Not all data scientists use validation data, but it can provide some
helpful information to optimize hyperparameters, which influence
how the model assesses data.

Training, Validation and Test.
• Test data. After the model is built, testing data once again validates
that it can make accurate predictions.
• If training and validation data include labels to monitor performance
metrics of the model, the testing data should be presented to the model
without its labels; any labels are held back and used only to score the
predictions. Test data provides a final, real-world check on an unseen
dataset to confirm that the ML algorithm was trained effectively.
• While each of these three datasets has its place in creating and
training ML models, it’s easy to see some overlap between them.
• The difference between training data vs. test data is clear: one trains
a model, the other confirms it works correctly, but confusion can pop
up between the functional similarities and differences of other types
of datasets.

Training data vs. validation data
• ML algorithms require training data to achieve an objective. The algorithm
will analyze this training dataset, classify the inputs and outputs, then
analyze it again. Trained enough, an algorithm will essentially memorize all
of the inputs and outputs in a training dataset — this becomes a problem
when it needs to consider data from other sources, such as real-world
customers.
• Here is where validation data is useful. Validation data provides an initial
check that the model can return useful predictions in a real-world setting,
which training data cannot do. The ML algorithm can assess training data
and validation data at the same time.
• Validation data is an entirely separate segment of data, though a data
scientist might carve out part of the training dataset for validation — as
long as the datasets are kept separate throughout the entirety of training
and testing.
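Carving validation and test sets out of one dataset while keeping them separate can be sketched as a single shuffle followed by slicing. The fractions, the seed, and the function name are illustrative assumptions.

```python
import random


def split_dataset(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into disjoint training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


data = list(range(100))  # stand-in for 100 labeled examples
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))  # 60 20 20
```

Because every example lands in exactly one of the three slices, no validation or test example can leak into training.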

• For example, let’s say an ML algorithm is supposed to analyze a
picture of a vertebrate and provide its scientific classification.
• The training dataset would include lots of pictures of mammals, but
not all pictures of all mammals, let alone all pictures of all
vertebrates. So, when the validation data provides a picture of a
squirrel, an animal the model hasn’t seen before, the data scientist
can assess how well the algorithm performs in that task.
• This is a check against an entirely different dataset than the one it was
trained on.

• Based on the accuracy of the predictions after the validation stage,
data scientists can adjust hyperparameters such as learning rate,
input features and hidden layers. These adjustments help prevent
overfitting, in which the algorithm makes excellent determinations
on the training data but can’t effectively adjust predictions for
additional data.
• The opposite problem, underfitting, occurs when the model isn’t
complex enough to make accurate predictions against either training
data or new data.
• In short, when you see good predictions on both the training datasets
and validation datasets, you can have confidence that the algorithm
works as intended on new data, not just a small subset of data.
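As one possible illustration of tuning a hyperparameter against validation data, the sketch below picks the number of neighbors k for a tiny k-nearest-neighbors classifier. The choice of k-NN, the one-dimensional data, and the noisy training label are all assumptions made for this example; the slides themselves mention learning rate, input features, and hidden layers.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, dataset, k):
    """Fraction of `dataset` labeled correctly by k-NN fitted on `train`."""
    hits = sum(1 for x, y in dataset if knn_predict(train, x, k) == y)
    return hits / len(dataset)


# Hypothetical 1-D data: class 0 lies near 0-4, class 1 near 6-10,
# with one noisy training label at x=3.5.
train = [(0, 0), (1, 0), (2, 0), (3, 0), (3.5, 1), (4, 0),
         (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]
validation = [(0.5, 0), (2.5, 0), (3.4, 0), (3.6, 0), (6.5, 1), (8.5, 1)]

# Evaluate each candidate hyperparameter value on the validation set.
scores = {k: accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With k=1 the model memorizes the noisy training point and mislabels nearby validation examples; a larger k smooths that out, which is the kind of adjustment the validation stage is meant to guide.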

Validation data vs. testing data
• Not all data scientists rely on both validation data and testing data. To
some degree, both datasets serve the same purpose: make sure the
model works on real data.
• However, there are some practical differences between validation
data and testing data.
• If you opt to include a separate stage for validation data analysis, this
dataset is typically labeled so the data scientist can collect metrics
that they can use to better train the model.

Validation data vs. testing data
• In this sense, validation occurs as part of the model training
process.
• Conversely, the model acts as a black box when you run testing data
through it. Thus, validation data tunes the model, whereas testing
data simply confirms that it works.
• There is some semantic ambiguity between validation data and
testing data.
• Some organizations call testing datasets “validation datasets.”
Ultimately, if there are three datasets to tune and check ML
algorithms, validation data typically helps tune the algorithm and
testing data provides the final assessment.

Generalization
• In machine learning, generalization describes how well a trained
model classifies or forecasts unseen data.
• Training a generalized machine learning model means, in general, that it
works for all subsets of unseen data.
• An example is training a model to distinguish between dogs and
cats.
• If the model is trained on a dog image dataset containing only two
breeds, it may still obtain good performance.

Generalization
• But it may get a low classification score when it is also tested on
other breeds of dogs.
• This issue can cause an actual dog image from the unseen dataset to be
classified as a cat.
• Therefore, data diversity is a very important factor in making
good predictions.
• In the example above, the model may obtain an 85% performance score
when it is tested on only two dog breeds, and 70% if it is trained on
all breeds.
• However, the first model may get a very low score (e.g. 45%) if it is
evaluated on an unseen dataset containing all dog breeds.

Generalization
• The score of the latter model may remain unchanged, given that it has
been trained on highly diverse data including all possible breeds.
• It should be taken into account that data diversity is not the only
factor that matters for a generalized model.
• Generalization is also affected by the nature of the machine learning
algorithm and by poor hyperparameter configuration.

Variance-bias trade-off
• The prediction results of a machine learning model fall somewhere
among:
• a) low bias, low variance
• b) low bias, high variance
• c) high bias, low variance
• d) high bias, high variance
• A low-bias, high-variance model is said to overfit, and a high-bias,
low-variance model is said to underfit.

Variance-bias trade-off
• Through generalization, we look for the best trade-off between underfitting and
overfitting so that a trained model obtains the best performance.
• An overfit model obtains a high prediction score on seen data and a low
one on unseen datasets. An underfit model has low performance on
both seen and unseen datasets.
Figure: three models showing underfitting (left), a good fit (middle), and overfitting (right).
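The rule of thumb above (overfit: high score on seen data, low on unseen data; underfit: low on both) can be sketched as a small decision function. The two thresholds are arbitrary illustrative values, not standard ones.

```python
def diagnose_fit(train_score, unseen_score, good=0.8, max_gap=0.1):
    """Classify a model as underfit, overfit or a good fit from two scores.

    `good` and `max_gap` are illustrative thresholds, not standard values.
    """
    if train_score < good:
        return "underfit"   # low performance even on seen data
    if train_score - unseen_score > max_gap:
        return "overfit"    # high on seen data, low on unseen data
    return "good fit"


print(diagnose_fit(0.60, 0.58))  # underfit
print(diagnose_fit(0.99, 0.70))  # overfit
print(diagnose_fit(0.90, 0.87))  # good fit
```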

Determining factors in training generalized models
• Dataset
• To train a classifier that generalizes well, the dataset used should
be diverse. Note that this does not mean a huge dataset, but one that
covers all the different kinds of samples.
• This helps the classifier learn from more than one specific subset of
data, so it generalizes better.
• In addition, during training, it is recommended to use cross-validation
techniques such as K-fold or Monte Carlo cross-validation. These
techniques make better use of all portions of the data and
help avoid producing an overfit model.
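A minimal sketch of how K-fold cross-validation partitions the data: every sample is used for validation exactly once and for training in the other folds. The function name is hypothetical, and shuffling before splitting (common in practice) is omitted to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", train_idx)
```

A model would be trained and scored once per fold, and the scores averaged to estimate how well it generalizes.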

• Machine learning algorithm
• Machine learning algorithms differ in how prone they are to overfitting
and underfitting.
• Overfitting is more likely with nonlinear, non-parametric machine
learning algorithms.
• For instance, a decision tree is a non-parametric machine learning
algorithm, which makes its models more likely to overfit.
• On the other hand, some machine learning models are too simple to
capture complex underlying patterns in data.
• This leads to an underfit model. Examples are linear and logistic
regression.

• Model complexity
• When a machine learning model becomes too complex, it is usually prone
to overfitting. There are methods that help to make the model simpler.
• These are called regularization methods, explained next.
• Regularization
• Regularization is a collection of methods for making a machine learning
model simpler.
• To this end, different approaches are applied to different machine learning
algorithms: for instance, pruning for decision trees, dropout for
neural networks, and adding a penalty term to the cost function in
regression.
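To illustrate the last item, adding a penalty to a regression cost function, the sketch below adds an L2 penalty (the sum of squared weights, as in ridge regression) to the mean squared error. The function name, the penalty strength, and the data are illustrative assumptions.

```python
def ridge_cost(weights, bias, rows, targets, penalty=0.1):
    """Mean squared error plus an L2 penalty on the weights.

    `penalty` (often written lambda) is the illustrative penalty parameter;
    penalty=0 recovers the plain, unregularized cost.
    """
    errors = [
        sum(w * x for w, x in zip(weights, row)) + bias - y
        for row, y in zip(rows, targets)
    ]
    mse = sum(e * e for e in errors) / len(errors)
    return mse + penalty * sum(w * w for w in weights)


# Hypothetical data: three samples with two features each.
rows = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
targets = [5.0, 4.0, 8.0]
weights, bias = [2.0, 1.0], 1.0

plain = ridge_cost(weights, bias, rows, targets, penalty=0.0)
regularized = ridge_cost(weights, bias, rows, targets, penalty=0.1)
print(plain, regularized)
```

Because large weights now increase the cost, minimizing it pushes the model toward smaller weights, which is one way of making it simpler.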
