Core Machine Learning Algorithms

Philosophies of Modeling
The simplest explanation is the best explanation.
In modeling, if we are given two models that
predict equally well, then we should always
choose the simpler one. 
Machine Learning India 1

Algorithm #1:
Least Squares Fitting

Scatterplot of your data:

What is the plot good for?

Prediction? BAM!

How do you do that?

You fit a line!

But is this the best line?

Or does the new line fit our data better?

How about a horizontal line?

How do you judge whether or not
a line is a good fit?

By seeing how close it is to the
data points? BAM!

Back to the horizontal line.

Residual:

Total Error = Sum of Squared Residuals
= $(b - y_1)^2 + (b - y_2)^2 + \dots + (b - y_n)^2$

What if we rotate the line a whole lot?

So there is a sweet spot between
a horizontal and a vertical line!

Line:
y = mx + c
where m is the slope and c is the y-intercept.

We will have to find the optimal values
of ‘m’ and ‘c’, in order to minimize the
sum of squared residuals.

Since we want to fit a line that will give us
the least amount of ‘sum of squares’, this
method for finding the best values of ‘m’
and ‘c’ is called least squares.

Plotting the ‘sum of squared residuals’
versus each rotation…

Big Important Concept #1:
We have to minimize the difference
between the observed values (target
values) and the line (output values).

We do this by taking the derivative and
finding where the value of the derivative
equals zero.
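Setting the derivatives of the sum of squared residuals with respect to 'm' and 'c' to zero gives a closed-form answer for both. Here is a minimal sketch in plain Python. The data points and the function names `fit_line` and `sum_squared_residuals` are made up for illustration.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # These two sums come from setting d(SSR)/dm = 0 and d(SSR)/dc = 0.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def sum_squared_residuals(xs, ys, m, c):
    """Total error of the line y = m*x + c on the data."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Made-up data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

m, c = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, m, c)
# The horizontal line through the mean of y, for comparison.
horizontal = sum_squared_residuals(xs, ys, 0.0, sum(ys) / len(ys))
print(m, c, best, horizontal)
```

The fitted line should have a much smaller total error than the horizontal line, and nudging 'm' or 'c' away from the fitted values should only make the error grow.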

Reducible and Irreducible error!

And you’re done!

Algorithm #2:
Linear Regression

Fitting a linear model:
1. Use least squares.
2. Calculate R².
3. Calculate the p-value for R².

Before understanding R², let us understand
what variance, standard deviation,
covariance and correlation mean.

Variance is the average of the squared
differences from the mean.

Standard Deviation:
• It is a measure of how much the
members of a group differ from the
mean value of the group.
• It is a measure of how spread out the
members are.
• It is the square root of variance.

For the entire population.

For a sample from the population.
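The two cases differ only in the divisor: N for the entire population, n - 1 for a sample (the usual convention, assumed here). A small sketch with made-up numbers, cross-checked against Python's `statistics` module:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]

# Population: divide by N.
pop_var = sum(squared_diffs) / len(data)
pop_std = math.sqrt(pop_var)

# Sample: divide by n - 1.
samp_var = sum(squared_diffs) / (len(data) - 1)
samp_std = math.sqrt(samp_var)

print(pop_var, pop_std, samp_var, samp_std)
print(statistics.pstdev(data), statistics.stdev(data))  # library equivalents
```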

Covariance is the measure of the joint
variability of two random variables.
The sign of covariance shows the tendency
of the linear relationship between variables.

Formula for covariance:
Over the entire
population

Formula for covariance:
Over a sample
from population

Correlation is a statistical technique that
can show whether and how strongly pairs
of variables are related.
For example, height and weight are related;
taller people tend to be heavier than
shorter people.

Difference?

Covariance provides the direction of the
linear relationship, while correlation
provides the direction as well as strength.

Covariance has no upper or lower bounds,
and the value is dependent on the scale of
the variable, while…
Correlation is always between -1 and +1,
and is scale independent.
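A quick way to see the difference is to change the units of one variable. In this sketch (made-up heights and weights, sample formulas assumed), converting metres to centimetres multiplies the covariance by 100 but leaves the correlation untouched.

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]   # made-up data
weights_kg = [52.0, 58.0, 69.0, 75.0, 88.0]
heights_cm = [h * 100 for h in heights_m]

print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```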

Guidelines:
• First find out the pattern that the data is
exhibiting, by looking at a scatterplot.
• Correlation is only applicable to linear
relationships.
• Correlation is not causation.
• A strong correlation is not necessarily
statistically significant.

Guess the correlation coefficients!

How about these?

Pearson’s Correlation Coefficient:
In statistics, the Pearson correlation coefficient
(PCC) is a measure of the linear correlation
between two variables X and Y.

Revision!

How can we more objectively state
whether or not a relationship exists
between two variables?

Relationship rule of thumb:
If |r| >= 2 / (√n)
Then, a relationship exists.
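The rule of thumb is a one-liner in code. The function name `relationship_exists` is ours, and the numbers are only examples.

```python
import math


def relationship_exists(r, n):
    """Rule of thumb: |r| >= 2 / sqrt(n), where n is the number of data points."""
    return abs(r) >= 2 / math.sqrt(n)


# With 25 points the threshold is 2 / 5 = 0.4.
print(relationship_exists(0.5, 25))
print(relationship_exists(0.3, 25))
# With 100 points the threshold drops to 0.2.
print(relationship_exists(-0.3, 100))
```

Note how the same r can pass with more data and fail with less: the threshold shrinks as n grows.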

Coming back to:
2. Calculate R².
3. Calculate the p-value for R².

r² : R² : R-Squared
It is a measure of how well a model fits the
data. It measures the goodness-of-fit.
It can also be seen as a statistical measure
of how close the data is fitted to the line.

r² : R² : R-Squared
In general, the higher the R², the better the
model fits your data. R² can be expressed as a
percentage or as a decimal value
between 0 and 1.

r² : R² : R-Squared

$R^2 = \frac{\mathrm{Var}(\text{mean}) - \mathrm{Var}(\text{line})}{\mathrm{Var}(\text{mean})}$

If R² turns out to be 80%, then it
means that there is 80% less variation
around the line than around the mean.

R² gives the percentage of variation
explained by the relationship between two
variables.

If someone gives you the value of the plain
old R (PCC), just square it!
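Both routes can be checked against each other. This sketch (made-up data, a single predictor assumed) computes R² from the variation around the mean and around the least squares line, then compares it with the squared Pearson coefficient.

```python
import math


def r_squared(xs, ys):
    """R^2 = (Var(mean) - Var(line)) / Var(mean) for a least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    var_mean = sum((y - mean_y) ** 2 for y in ys) / n
    var_line = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / n
    return (var_mean - var_line) / var_mean


def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


hours = [1, 2, 3, 4, 5, 6]          # made-up data
scores = [52, 55, 61, 64, 70, 71]

r2 = r_squared(hours, scores)
r = pearson_r(hours, scores)
print(r2, r ** 2)
```

The "just square it" shortcut holds for a straight-line fit with one predictor, which is the case shown here.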

Adjusted R²
The adjusted R-squared is a modified
version of R-squared that has been
adjusted for the number of predictors in
the model.

Adjusted R²
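A sketch of the commonly used formula, assumed here: Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), with n data points and p predictors. The function name is ours.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R^2 for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Same raw R^2 of 0.80 on 21 data points, with more and more predictors.
for p in (1, 2, 5, 10):
    print(p, adjusted_r_squared(0.80, 21, p))
```

With the raw R² held fixed, every extra predictor pulls the adjusted value down, so a new predictor has to earn its place.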

P-value
When you perform a hypothesis test in statistics, a
p-value helps you determine the significance of
your results. It answers the question, “Does this
result provide enough evidence that something
is wrong with my assumptions, or could this
result come out just because of luck?”

The smaller the p-value, the less
likely it is that the result we got is an
outcome of luck.

Process:
1. Assume that the null hypothesis is true.
2. Take a sample and compute the statistic.
3. Work out how likely it is to get a statistic
like this, by calculating the p-value.
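One simple way to carry out these three steps is a permutation test, sketched here as an assumption (the slides do not say which test they use). The null hypothesis is "x and y are unrelated", so we shuffle y many times and count how often chance alone gives a correlation at least as strong as the observed one.

```python
import math
import random


def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Fraction of shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))   # step 2: the statistic
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)           # step 1: pretend the null is true
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / trials                # step 3: the p-value


xs = list(range(1, 11))
ys_related = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.1]  # made up
ys_unrelated = [5, 1, 4, 2, 3, 5, 1, 4, 2, 3]                          # made up

p_related = permutation_p_value(xs, ys_related)
p_unrelated = permutation_p_value(xs, ys_unrelated)
print(p_related, p_unrelated)
```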

If ‘p’ is low, NULL must GO!


If ‘p’ is high, the NULL will fly!
(We fail to reject the null hypothesis; this does
not prove that the alternative is false.)

Coming back to:
2. Calculate R².
3. Calculate the p-value for R².
Done!

Linear Regression Visualization

Overfitting and Underfitting!

One of the major aspects of training your
machine learning model is avoiding
overfitting. An overfitted model scores well on
the training data but has low accuracy on new,
unseen data. This happens because the model is
trying too hard to capture the noise in your
training dataset.

By noise we mean the data points that don’t
really represent the true properties of your data,
but are due to random chance. Learning such data points
makes your model more flexible, at the risk of
overfitting. The concept of balancing bias and
variance is helpful in understanding the
phenomenon of overfitting.

Bias Variance Tradeoff:
The inability of a machine learning model to
capture the true relationship is called bias.
The difference in fits between datasets is
called variance. The goal is to achieve low
bias and low variance.
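A toy experiment makes the tradeoff visible. This sketch uses synthetic data (an assumed true relationship y = 2x + 1 plus noise) and three hand-picked models: a constant (high bias), a least squares line, and a 1-nearest-neighbor lookup that memorizes the training set (high variance).

```python
import random

rng = random.Random(42)


def make_data(n):
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]  # assumed true relationship plus noise
    return xs, ys


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = make_data(100)
test_x, test_y = make_data(100)

# High bias: always predict the training mean, ignoring x.
mean_y = sum(train_y) / len(train_y)
def mean_model(x):
    return mean_y

# Least squares line.
mean_x = sum(train_x) / len(train_x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x
def line_model(x):
    return slope * x + intercept

# High variance: copy the y of the closest training point, noise included.
def nearest_neighbor_model(x):
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

for name, model in [("mean", mean_model), ("line", line_model),
                    ("1-NN", nearest_neighbor_model)]:
    print(name, mse(model, train_x, train_y), mse(model, test_x, test_y))

train_error_nn = mse(nearest_neighbor_model, train_x, train_y)
test_error_nn = mse(nearest_neighbor_model, test_x, test_y)
test_error_line = mse(line_model, test_x, test_y)
test_error_mean = mse(mean_model, test_x, test_y)
```

The memorizing model gets a perfect score on the training data and then does noticeably worse on new data, which is the signature of overfitting. The constant model is poor everywhere, which is underfitting.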

Bias Variance Tradeoff

No Free Lunch Theorem:
No single machine learning algorithm is
better than all others on all problems. It is
common to try multiple models and find
the one that works the best for that
particular problem.

Algorithm #3:
Multiple Linear Regression

Multiple Linear Regression is just an
extension of simple linear regression.
It is used to determine a mathematical
relationship among a number of random
variables. In other words, MLR examines how
multiple independent variables are related to one
dependent variable.

The equation:
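In its standard form (assumed here, with p independent variables; the symbols b_0 ... b_p and the error term are our notation):

```latex
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p + \varepsilon
```

With a single variable this reduces to the earlier line y = mx + c, with b_1 = m and b_0 = c.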

Alert:
• Having more independent variables can make
the model complicated.
• Adding more independent variables does not
guarantee a better prediction model.

Alert:
The model must be checked for multicollinearity.
Multicollinearity is the phenomenon where one or
more independent variables in a regression model
strongly predict one or more other independent
variables. The dummy-variable trap is one common cause.
Homework!

Regularization:
This is a form of regression that constrains,
regularizes, or shrinks the coefficient estimates
towards zero. In other words, this technique
discourages learning a more complex or flexible
model, so as to avoid the risk of overfitting.
• Ridge Regression
• Lasso Regression
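The shrinking effect is easiest to see with one predictor. In this sketch (made-up data, ridge penalty only, intercept left unpenalized), the ridge slope is sxy / (sxx + lambda), so a larger penalty lambda pulls the slope towards zero. Lasso uses an absolute-value penalty instead and is not shown.

```python
def ridge_slope(xs, ys, lam):
    """Slope minimizing sum of squared residuals + lam * slope**2 (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # made-up data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, ridge_slope(xs, ys, lam))
```

With lam = 0 this is plain least squares; as lam grows the coefficient shrinks but never flips sign.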

How do we estimate which parameters
are actually important for our model?

• Have domain knowledge.
• Use Subset Selection Methods.
– All-in method
– Backward Elimination
– Forward Selection
– Bidirectional Elimination
– Score Comparison

General Intuition:

Algorithm #4:
Polynomial Regression

Polynomial Regression:
In statistics, polynomial regression is a form of
regression analysis in which the relationship
between the independent variable x and the
dependent variable y is modeled as an nth
degree polynomial in x.
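Polynomial regression is still least squares: the columns are just 1, x, x², and so on. A minimal pure-Python sketch, assuming the normal equations are solved by Gaussian elimination (function names are ours; a numerical library would normally do this step).

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least squares coefficients [b0, b1, ..., bn] of y = b0 + b1*x + ... + bn*x^n."""
    size = degree + 1
    # Normal equations (X^T X) b = X^T y, where column j of X holds x**j.
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    xty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, x):
    return sum(b * x ** i for i, b in enumerate(coeffs))


# Made-up data lying exactly on y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]

coeffs = fit_polynomial(xs, ys, 2)
print(coeffs, predict(coeffs, 4.0))
```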

The equation:

The fit:

The fit in 3 dimensions:

Woah, we had a great time
predicting continuous values!

What if I want to predict
discrete values?

Algorithm #5:
Logistic Regression

Logistic regression is a predictive analysis. It is
used to describe data and to explain the
relationship between one dependent binary
variable and one or more independent variables.

Logistic regression is intended for binary
(two-class) classification problems.

y = mx + c
where m is the slope and c is the y-intercept.

Logistic Function
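The logistic function squashes the line's output into a value between 0 and 1, which can be read as a probability. A small sketch; the values of m and c and the 0.5 threshold are assumptions for illustration.

```python
import math


def logistic(z):
    """1 / (1 + e^(-z)), written to stay stable for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_class(x, m, c, threshold=0.5):
    """Feed the line m*x + c through the logistic function, then threshold."""
    return 1 if logistic(m * x + c) >= threshold else 0


m, c = 1.5, -3.0   # made-up slope and intercept
for x in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(x, logistic(m * x + c), predict_class(x, m, c))
```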

Evaluating a classification model with the
help of metrics! Choosing the right metric is
paramount in judging how well the model is
performing.

A confusion matrix is a table that is often
used to describe the performance of a
classification model (or "classifier") on a set
of test data for which the true values are
known. The confusion matrix itself is
relatively simple to understand, but the
related terminology can be confusing.
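For a binary classifier the table has just four cells. This sketch counts them from made-up labels and derives three common metrics; the dictionary keys TP, FP, FN, TN are the usual abbreviations (true/false positives/negatives).

```python
def confusion_matrix(y_true, y_pred):
    """Count the four outcomes for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def metrics(cm):
    total = cm["TP"] + cm["FP"] + cm["FN"] + cm["TN"]
    return {
        "accuracy": (cm["TP"] + cm["TN"]) / total,
        "precision": cm["TP"] / (cm["TP"] + cm["FP"]),
        "recall": cm["TP"] / (cm["TP"] + cm["FN"]),
    }


y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up true labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predictions

cm = confusion_matrix(y_true, y_pred)
print(cm)
print(metrics(cm))
```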

Woah, we had a great time
predicting binary discrete values!

What if I want to predict n-ary
discrete values?

Algorithm #5:
Softmax Regression

Softmax regression (or multinomial logistic
regression) is a generalization of logistic
regression to the case where we want to handle
multiple classes.

In logistic regression we assumed that the
labels were binary: $y^{(i)} \in \{0, 1\}$. We used such
a classifier to distinguish between two
categories. Softmax regression allows us to
handle $y^{(i)} \in \{1, \dots, K\}$, where K is the number
of classes.
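The softmax function turns K raw scores into K probabilities that sum to 1. A minimal sketch with made-up scores; subtracting the maximum first is a common numerical safeguard and does not change the result.

```python
import math


def softmax(scores):
    """Convert a list of K scores into K probabilities."""
    shift = max(scores)  # avoids overflow in exp for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]            # made-up scores for K = 3 classes
probs = softmax(scores)
predicted_class = probs.index(max(probs)) + 1   # classes numbered 1..K
print(probs, predicted_class)
```

With K = 2 and scores [z, 0], the first probability equals the logistic function of z, which is why softmax is called a generalization of logistic regression.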

What if we have a huge
number of classes?

Algorithm #6:
Linear Discriminant Analysis

Algorithm #6:
Let us first understand Principal Component Analysis

Algorithm #6:
Principal Component Analysis

In real-world data analysis tasks we analyze
complex data, i.e., multi-dimensional data.

As the dimensions of data increase, the difficulty
of visualizing it and performing computations on
it also increases. How do we deal with it?
• Remove the redundant dimensions.
• Keep only the most important dimensions.

Principal component analysis (PCA) to the
rescue! It is a technique used to emphasize
variation and bring out strong patterns in a
dataset. It's often used to make data easy to
explore and visualize.
It is used for dimensionality reduction.

Too much visualization.
StatQuest to our rescue!

https://www.youtube.com/watch?v=FgakZw6K1QQ

The main idea of principal component analysis (PCA) is
to reduce the dimensionality of a data set consisting
of many variables correlated with each other, either
heavily or lightly, while retaining the variation present
in the dataset, up to the maximum extent.

The same is done by transforming the variables to a
new set of variables, which are known as the
principal components (or simply, the PCs) and are
orthogonal, ordered such that the retention of
variation present in the original variables decreases as
we move down in the order.

So, in this way, the 1st principal component retains
maximum variation that was present in the original
components. The principal components are the
eigenvectors of a covariance matrix, and hence they
are orthogonal.
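For 2D data the whole procedure fits in a few lines, because the eigenvectors of a 2x2 covariance matrix have a closed form. A sketch with made-up points (function names are ours; real projects would use a linear algebra library):

```python
import math


def covariance_matrix(points):
    """Entries (a, b, c) of the sample covariance matrix [[a, b], [b, c]]."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    return a, b, c


def principal_components(points):
    """Return [(variance, direction), ...], ordered from most to least variance."""
    a, b, c = covariance_matrix(points)
    mid = (a + c) / 2
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + half_gap, mid - half_gap      # eigenvalues
    if b != 0:
        v = (lam1 - c, b)                            # eigenvector for lam1
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)     # already axis-aligned
    norm = math.hypot(v[0], v[1])
    pc1 = (v[0] / norm, v[1] / norm)
    pc2 = (-pc1[1], pc1[0])                          # orthogonal to pc1
    return [(lam1, pc1), (lam2, pc2)]


points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]  # made up

(lam1, pc1), (lam2, pc2) = principal_components(points)
print(pc1, lam1 / (lam1 + lam2))   # first PC and its share of the variation

# Reduce 2D to 1D: keep only each point's coordinate along pc1.
mx = sum(p[0] for p in points) / len(points)
my = sum(p[1] for p in points) / len(points)
reduced = [(p[0] - mx) * pc1[0] + (p[1] - my) * pc1[1] for p in points]
```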

Puzzle!
Suppose you want to reduce the dimensionality of
data from 2D to 1D while classifying it into two
categories. How will you do it?

Algorithm #7:
Linear Discriminant Analysis

Linear discriminant analysis (LDA) is similar to
PCA: both can help us reduce the
dimensionality, but LDA also focuses on
maximizing the linear
separability between classes in the data.

PCA and LDA both rank the new axes in
order of importance. PCA accounts for
the most variation in data, while LDA
accounts for the most separability in
data.
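For two classes in 2D, the LDA axis can be sketched with Fisher's formula, assumed here: w = Sw⁻¹ (mean1 - mean0), where Sw is the within-class scatter matrix. The data is made up so that both classes stretch along the same diagonal and differ only by a vertical shift.

```python
def mean_vector(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter_matrix(points, mean):
    """Entries (sxx, sxy, syy) of the scatter matrix around the class mean."""
    sxx = sum((p[0] - mean[0]) ** 2 for p in points)
    sxy = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in points)
    syy = sum((p[1] - mean[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fisher_direction(class0, class1):
    """Direction w = Sw^-1 (m1 - m0) that best separates two classes in 2D."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0 = scatter_matrix(class0, m0)
    s1 = scatter_matrix(class1, m1)
    sxx, sxy, syy = (s0[i] + s1[i] for i in range(3))  # within-class scatter
    det = sxx * syy - sxy * sxy
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # Closed-form inverse of the 2x2 matrix [[sxx, sxy], [sxy, syy]].
    return ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)


def project(points, w):
    """Reduce 2D points to 1D coordinates along w."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]


class0 = [(1, 2), (2, 3), (3, 3), (4, 5), (5, 5)]   # made-up data
class1 = [(1, 0), (2, 1), (3, 1), (4, 3), (5, 3)]   # same shape, shifted down by 2

w = fisher_direction(class0, class1)
proj0 = project(class0, w)
proj1 = project(class1, w)
print(w)
print(sorted(proj0), sorted(proj1))
```

Along w the two classes no longer overlap, whereas projecting onto the x-axis alone would leave them indistinguishable.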

An eigenvector is a vector whose direction remains
unchanged when a linear transformation is applied to
it. Consider the image below in which three vectors
are shown. The green square is only drawn to illustrate
the linear transformation that is applied to each of
these three vectors.
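The definition can be checked numerically. In this sketch the 2x2 matrix and the three test vectors are made up (they are not the ones in the slide's image); a vector keeps its direction exactly when the transformed vector is a scalar multiple of the original.

```python
def apply(matrix, v):
    """Apply a 2x2 linear transformation to a 2D vector."""
    return (matrix[0][0] * v[0] + matrix[0][1] * v[1],
            matrix[1][0] * v[0] + matrix[1][1] * v[1])


def keeps_direction(matrix, v, tol=1e-9):
    """True if matrix * v stays on the same line through the origin as v."""
    w = apply(matrix, v)
    # The 2D cross product is zero exactly when w is a scalar multiple of v.
    return abs(v[0] * w[1] - v[1] * w[0]) < tol


A = [[2, 1],
     [1, 2]]

for v in [(1, 1), (1, -1), (1, 0)]:
    print(v, apply(A, v), keeps_direction(A, v))
```

Here (1, 1) is stretched by a factor of 3 and (1, -1) by a factor of 1 (those factors are the eigenvalues), while (1, 0) gets knocked off its line and so is not an eigenvector.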

More about eigenvectors at:
www.visiondummy.com/2014/03/eigenvalues-eigenvectors/

Algorithm #8:
Support Vector Machine

A Support Vector Machine (SVM) is a
discriminative classifier formally defined
by a separating hyperplane.
It is an algorithm for linearly separable
binary sets.

In other words, given labeled training data
(supervised learning), the algorithm outputs an
optimal hyperplane which categorizes new
examples. In two-dimensional space this
hyperplane is a line dividing a plane into two
parts, with each class lying on either side.

The goal of the SVM is to classify all the
training vectors into two classes.

Confusing? Don’t worry, we shall learn it in
layman’s terms.

Suppose you are given a plot of two labeled classes
on a graph, as shown in the image. Can you
decide on a separating line for the classes?

Any point to the left of the line falls into the black circle
class, and any point to the right falls into the blue square class.
Separation of classes: that is what SVM does.

So far so good. Now, what if we had
data as shown in the image below?

We apply a transformation and add one more
dimension, which we call the z-axis. Now can you
draw a separating hyperplane? Yes!

When we transform this boundary back to the original
plane, it maps to a circular boundary, as shown in the
image. These transformations are called
kernels.

Kernel functions:
These are functions that take a low-dimensional input
space and transform it into a higher-dimensional space,
i.e., they convert a non-separable problem into a separable
problem. They are mostly
useful in non-linear separation problems. Simply put, a kernel
does some extremely complex data transformations and
then finds a way to separate the data based on
the labels or outputs you have defined.
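The idea can be sketched with an explicit transformation. The data is made up: one class sits on a small ring and the other on a larger ring around it, so no straight line separates them in 2D. Adding z = x² + y² as a third coordinate (our choice of transformation for this example) makes a flat plane enough.

```python
import math


def ring(radius, count):
    """Points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / count),
             radius * math.sin(2 * math.pi * k / count)) for k in range(count)]


inner = ring(1.0, 8)   # class A, surrounded by class B
outer = ring(3.0, 8)   # class B


def lift(point):
    """Map (x, y) to (x, y, z) with z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)


inner_z = [lift(p)[2] for p in inner]
outer_z = [lift(p)[2] for p in outer]
print(max(inner_z), min(outer_z))

# The horizontal plane z = 5 now separates the classes.
# Back in 2D, z = 5 is the circle x^2 + y^2 = 5.
threshold = 5.0
```

Practical SVM kernels usually avoid computing the higher-dimensional coordinates explicitly, so treat this as a picture of the idea rather than of an implementation.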

A bit complicated!

Which one do you think is appropriate?

Well, both the answers are correct. The first
one tolerates some outlier points. The second
one is trying to achieve 0 tolerance with perfect
partition.

But there is a trade-off. In real-world
applications, finding a perfect separation for millions
of training samples takes a lot
of time. Therefore we define two terms:
regularization parameter and gamma. These
are tuning parameters of the SVM classifier.

By varying them, we can achieve a
non-linear classification boundary with good
accuracy in a reasonable amount of time.

The regularization parameter (often termed the
C parameter) tells the SVM optimizer the
extent to which you want to avoid
misclassifying each training example.

For large values of C, the optimization will choose a
smaller-margin hyperplane if that hyperplane does a
better job of getting all the training points classified
correctly. Conversely, a very small value of C will cause
the optimizer to look for a larger-margin separating
hyperplane, even if that hyperplane misclassifies more
points.

The gamma parameter defines how far the influence
of a single training example reaches, with low values
meaning ‘far’ and high values meaning ‘close’.

In other words, with low gamma, points far away from
the plausible separation line are considered in the calculation
of the separation line, whereas high gamma means
that only the points close to the plausible line are considered in
the calculation.
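Gamma is usually discussed with the RBF (Gaussian) kernel, which we assume here: k(a, b) = exp(-gamma * ||a - b||²). The value says how much one point "counts" for another. The points and gamma values are made up.

```python
import math


def rbf_kernel(a, b, gamma):
    """Similarity between two points under the RBF kernel."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


near = ((0.0, 0.0), (0.5, 0.0))
far = ((0.0, 0.0), (2.0, 0.0))

for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(*near, gamma), rbf_kernel(*far, gamma))
```

With low gamma even the far pair keeps a sizeable similarity, so distant points still influence the boundary. With high gamma the far pair drops to almost zero and only close neighbors matter.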

How do we find out the right
hyperplane?

Identify the right hyperplane (scenario #1):

Rule #1:
Select the hyper-plane which segregates the
two classes better.

Rule #2:
Maximizing the distance between the nearest data point
(of either class) and the hyper-plane helps us to decide the
right hyper-plane.

Rule #3:
SVM selects the hyper-plane which classifies the
classes accurately prior to maximizing margin.

SVM has a feature to ignore outliers and find the
hyper-plane that has maximum margin. Hence, we can
say, SVM is robust to outliers.

Algorithm
1. Define an optimal hyperplane: maximize margin.
2. Extend the above definition for non-linearly separable
problems: have a penalty term for misclassifications.
3. Map data to high dimensional space where it is easier
to classify with linear decision surfaces: reformulate
problem so that data is mapped implicitly to this space.

To define an optimal hyperplane we need
to maximize the width of the margin (w).

We find w and b by solving the following objective
function using Quadratic Programming.
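The standard soft-margin formulation is assumed here, with labels $y_i \in \{-1, +1\}$, weight vector $w$, offset $b$, slack variables $\xi_i$ for misclassified or in-margin points, and the regularization parameter $C$:

```latex
\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```

In this formulation the margin width is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert^2$ maximizes the margin, and the $C$ term is the penalty for misclassifications.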

Algorithm #9:
Naïve Bayes
Cutest

The Naive Bayes classifier technique is based on
Bayes' theorem and is
particularly suited when the dimensionality of
the inputs is high. Despite its simplicity, Naive
Bayes can often outperform more sophisticated
classification methods.

As indicated, the objects can be classified as
either GREEN or RED. Our task is to classify new
cases as they arrive, i.e., decide to which class
label they belong, based on the currently existing
objects.

Since there are twice as many GREEN objects as
RED, it is reasonable to believe that a new case
(which hasn't been observed yet) is twice as
likely to have membership GREEN rather than
RED. In the Bayesian analysis, this belief is
known as the prior probability.

Since there is a total of 60 objects, 40 of which are
GREEN and 20 RED, our prior probabilities for class
membership are:

Since the objects are well clustered, it is
reasonable to assume that the more GREEN (or
RED) objects in the vicinity of X (test point), the
more likely that it belongs to that particular
color. To measure this likelihood, we draw a
circle around X which encompasses a number
(to be chosen a priori) of points irrespective of
their class labels.

Then we calculate the number of points in the circle
belonging to each class label. From this we calculate
the likelihood:

Although the prior probabilities indicate that X may
belong to GREEN (given that there are twice as many
GREEN compared to RED) the likelihood indicates
otherwise; that the class membership of X is RED
(given that there are more RED objects in the vicinity of
X than GREEN). In the Bayesian analysis, the final
classification is produced by combining both sources
of information, i.e., the prior and the likelihood, to
form a posterior probability using the so-called Bayes'
rule

Finally, we classify X as RED since its class
membership achieves the largest posterior
probability.
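The arithmetic can be worked through with exact fractions. The class totals (40 GREEN, 20 RED) come from the text above. The counts inside the circle (1 GREEN, 3 RED) are hypothetical, chosen only so that RED dominates the neighborhood as described.

```python
from fractions import Fraction

total = 60
counts = {"GREEN": 40, "RED": 20}       # from the text
in_circle = {"GREEN": 1, "RED": 3}      # hypothetical neighborhood counts

# Prior: share of each class among all objects.
prior = {label: Fraction(n, total) for label, n in counts.items()}

# Likelihood of X given a class: points of that class in the circle
# divided by the total number of points of that class.
likelihood = {label: Fraction(in_circle[label], counts[label]) for label in counts}

# Posterior (up to a common normalizing constant): prior * likelihood.
posterior = {label: prior[label] * likelihood[label] for label in counts}

predicted = max(posterior, key=posterior.get)
print(prior, likelihood, posterior, predicted)
```

With these numbers GREEN scores 2/3 × 1/40 = 1/60 and RED scores 1/3 × 3/20 = 1/20, so the neighborhood evidence outweighs the prior.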

Algorithm #10:
K-Nearest Neighbors
These neighbors
are not annoying.

“Birds of a feather flock together.”

K-Nearest Neighbors is one of the most basic
yet essential classification algorithms in Machine
Learning. It belongs to the supervised learning
domain and finds extensive application in pattern
recognition, data mining and intrusion
detection.

An understanding of how we calculate the
distance between points on a graph is necessary
before moving on. If you are unfamiliar with, or
need a refresher on, how this calculation is done:
Homework!
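As a head start, here is a sketch of the straight-line (Euclidean) distance and a bare-bones K-Nearest Neighbors vote built on it. The points, labels, and the choice of k = 3 are made up.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two points of the same dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: euclidean_distance(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(euclidean_distance((0, 0), (3, 4)))
print(knn_predict(train_points, train_labels, (2, 2)))
print(knn_predict(train_points, train_labels, (6.5, 6.5)))
```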

Algorithm #11:
K-Means Clustering

K-means clustering is a type of unsupervised learning,
which is used when you have unlabeled data (i.e., data
without defined categories or groups). The goal of this
algorithm is to find groups in the data, with the
number of groups represented by the variable K. The
algorithm works iteratively to assign each data point to
one of K groups based on the features that are
provided. Data points are clustered based on feature
similarity.

The results of the K-means clustering algorithm are:
• The centroids of the K clusters, which can be used to
label new data
• Labels for the training data (each data point is
assigned to a single cluster)

Rather than defining groups before looking at the data,
clustering allows you to find and analyze the groups
that have formed organically.

Each centroid of a cluster is a collection of
feature values which define the resulting groups.
Examining the centroid feature weights can be
used to qualitatively interpret what kind of
group each cluster represents.
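The iteration can be sketched in a few lines: assign every point to its nearest centroid, move each centroid to the mean of its points, repeat until nothing changes. The data and the starting centroids are made up, and starting centroids are passed in by hand here; real implementations usually pick them randomly.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, centroids, iterations=10):
    """Return (centroids, labels) after alternating assignment and update steps."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                dims = len(members[0])
                new_centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                           for d in range(dims)))
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


points = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
centroids, labels = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids, labels)
```

The returned centroids can label new data the same way: a new point gets the label of whichever centroid is closest.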

BAM! You guys are pros at regression, classification,
dimensionality reduction and clustering!!
Feeling like a data-scientist, eh?
