CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
Core Machine Learning Algorithms
1. Philosophies of Modeling
The simplest explanation is the best explanation.
In modeling, if we are given two models that
predict equally well, then we should always
choose the simpler one.
Machine Learning India 1
21. We will have to find the optimal values
of ‘m’ and ‘c’, in order to minimize the
sum of squared residuals.
22. Since we want to fit a line that will give us
the least amount of ‘sum of squares’, this
method for finding the best values of ‘m’
and ‘c’ is called least squares.
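A minimal sketch of least squares for a single input variable, using the standard closed-form solution. The function names and the data points are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (m, c) that minimize the sum of squared residuals for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


def sum_squared_residuals(xs, ys, m, c):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, c = fit_line(xs, ys)
ssr = sum_squared_residuals(xs, ys, m, c)
```

Any other choice of m and c gives a larger sum of squared residuals on the same data.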
23. Plotting the ‘sum of squared residuals’
versus each rotation…
28. Big Important Concept #1:
We have to minimize the difference
between the observed values (target
values) and the line (output values).
29. Big Important Concept #2:
We do this by taking the derivative and
finding where the value of the derivative
equals zero.
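As a worked sketch of this step, assume the line is y = m x + c, the observed points are (x_i, y_i), and x-bar and y-bar denote the sample means (this notation is an assumption, the slides only name m and c). Setting both partial derivatives of the sum of squared residuals to zero gives:

```latex
\begin{aligned}
\mathrm{SSR}(m, c) &= \sum_{i=1}^{n} \bigl(y_i - (m x_i + c)\bigr)^2 \\
\frac{\partial\,\mathrm{SSR}}{\partial c} &= -2 \sum_{i=1}^{n} \bigl(y_i - m x_i - c\bigr) = 0
  \;\Longrightarrow\; c = \bar{y} - m\,\bar{x} \\
\frac{\partial\,\mathrm{SSR}}{\partial m} &= -2 \sum_{i=1}^{n} x_i \bigl(y_i - m x_i - c\bigr) = 0
  \;\Longrightarrow\; m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{aligned}
```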
30. Big Important Concept #3:
Reducible and Irreducible error!
37. Standard Deviation:
• It is a measure of how much the
members of a group differ from the
mean value of the group.
• It is a measure of how spread out the
members are.
• It is the square root of variance.
40. For a sample from the population.
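A small sketch of the standard deviation for a whole population and for a sample, using Python's built-in statistics module. The list of values is hypothetical. The only difference between the two versions is the divisor: n for a population, n - 1 for a sample.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical group of measurements

mean = statistics.mean(values)
# Variance: average squared distance of the members from the mean.
variance = sum((v - mean) ** 2 for v in values) / len(values)
# Standard deviation: the square root of variance.
std_dev = variance ** 0.5

population_sd = statistics.pstdev(values)  # divides by n
sample_sd = statistics.stdev(values)       # divides by n - 1
```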
41. Covariance is the measure of the joint
variability of two random variables.
The sign of covariance shows the tendency
of the linear relationship between variables.
44. Correlation is a statistical technique that
can show whether and how strongly pairs
of variables are related.
For example, height and weight are related;
taller people tend to be heavier than
shorter people.
46. Covariance provides the direction of the
linear relationship, while correlation
provides the direction as well as strength.
47. Covariance has no upper or lower bounds,
and the value is dependent on the scale of
the variable, while…
Correlation is always between -1 and +1,
and is scale independent.
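A quick sketch of the scale point above, reusing the height and weight idea with hypothetical numbers. Changing the unit of height from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
def covariance(xs, ys):
    """Sample covariance of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = covariance(xs, xs) ** 0.5
    sd_y = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sd_x * sd_y)


# Hypothetical data: the same heights in two different units.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
heights_cm = [h * 100 for h in heights_m]
weights_kg = [52.0, 58.0, 67.0, 75.0, 80.0]

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)  # depends on the unit
r_m = pearson_r(heights_m, weights_kg)
r_cm = pearson_r(heights_cm, weights_kg)     # does not depend on the unit
```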
48. Guidelines:
• First find out the pattern that the data is
exhibiting, by looking at a scatterplot.
• Correlation is only applicable to linear
relationships.
• Correlation is not causation.
• Correlation strength does not necessarily mean
that correlation is statistically significant.
51. Pearson’s Correlation Coefficient:
In statistics, the Pearson correlation coefficient
(PCC), is a measure of the linear correlation
between two variables X and Y.
55. How can we more objectively state
whether or not a relationship exists
between two variables?
56. Relationship rule of thumb:
If |r| >= 2 / (√n)
Then, a relationship exists.
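The rule of thumb above is a one-liner in code. The function name is hypothetical, and this is only the rough check from the slide, not a formal significance test.

```python
import math


def relationship_exists(r, n):
    """Rule of thumb: |r| >= 2 / sqrt(n), where n is the number of data points."""
    return abs(r) >= 2 / math.sqrt(n)
```

For example, with n = 25 the threshold is 2 / 5 = 0.4, so r = 0.5 passes the check and r = 0.3 does not.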
57. Coming back to fitting a linear model:
1. Use least squares.
2. Calculate R².
3. Calculate p-value for R².
58. r² : R² : R-Squared
It is a measure of how well a model fits to
data. It measures the goodness-of-fit.
It can also be seen as a statistical measure
of how close the data is fitted to the line.
59. r² : R² : R-Squared
In general, the higher the R², the better the
model fits your data. R² can be expressed as
a percentage or as a decimal value between
0 and 1.
62. If R² turns out to be 80%, then it
means that there is 80% less variation
around the line than around the mean.
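A minimal sketch of that comparison: R-squared is the share of the variation around the mean that disappears once we measure variation around the fitted line instead. The data and the predictions (taken from the least-squares line y = 0.8x + 0.5 at x = 1, 2, 3, 4) are hypothetical.

```python
def r_squared(ys, predictions):
    """(variation around the mean - variation around the line) / variation around the mean."""
    mean_y = sum(ys) / len(ys)
    ss_mean = sum((y - mean_y) ** 2 for y in ys)
    ss_line = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return (ss_mean - ss_line) / ss_mean


ys = [1.0, 3.0, 2.0, 4.0]           # hypothetical observed values
predictions = [1.3, 2.1, 2.9, 3.7]  # values on the fitted line
r2 = r_squared(ys, predictions)     # about 0.64: 64% less variation around the line
```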
63. Big Important Concept #4:
R² gives the percentage of variation
explained by the relationship between two
variables.
64. Big Important Concept #5:
If someone gives you the value of the plain
old r (PCC) for a simple linear regression,
just square it to get R²!
65. Adjusted R²
The adjusted R-squared is a modified
version of R-squared that has been
adjusted for the number of predictors in
the model.
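A short sketch of the usual adjusted R-squared formula, 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of samples and p the number of predictors. The function and argument names are hypothetical.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)
```

With the same R² and the same number of samples, a model that uses more predictors gets a lower adjusted value.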
67. P-value
When you perform a hypothesis test in statistics, a
p-value helps you determine the significance of
your results. It answers the question, “Does this
result provide enough evidence that something
is wrong with my assumptions, or could this
result come out just because of luck?”
68. The smaller the p-value, the less
likely it is that the result we got is an
outcome of luck.
69. Process:
1. Assuming that the null hypothesis is true.
2. Taking a sample and getting the statistic.
3. Working out how likely it is to get a statistic
like this, by calculating the p-value.
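The three steps above can be sketched with a hypothetical coin-flip experiment (not from the slides): the null hypothesis is that the coin is fair, the statistic is the number of heads, and the p-value here is the one-sided probability of seeing at least that many heads if the null hypothesis is true.

```python
from math import comb


def p_value_at_least(heads, flips, p_heads=0.5):
    """P(number of heads >= observed), assuming the null hypothesis p_heads is true."""
    return sum(
        comb(flips, k) * p_heads ** k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )


# Step 1: assume the coin is fair (null hypothesis).
# Step 2: take a sample: 9 heads in 10 flips (hypothetical).
# Step 3: work out how likely a result at least this extreme is.
p_value = p_value_at_least(9, 10)  # 11/1024, roughly 0.011
```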
70. If ‘p’ is low, NULL must GO!
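71. If ‘p’ is high, alternative
hypothesis is a lie!
(Strictly, a high p-value only means we fail to
reject the null hypothesis; it does not prove
that the alternative is false.)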
72. Coming back to fitting a linear model:
1. Use least squares.
2. Calculate R².
3. Calculate p-value for R².
Done!
75. One of the major aspects of training your
machine learning model is avoiding
overfitting. An overfitted model will have low
accuracy on new, unseen data. This happens
because the model is trying too hard to
capture the noise in your training dataset.
76. By noise we mean the data points that don’t
really represent the true properties of your data,
but arise from random chance. Learning such data
points makes your model more flexible, at the risk
of overfitting. The concept of balancing bias and
variance is helpful in understanding the
phenomenon of overfitting.
77. Big Important Concept #7:
Bias Variance Tradeoff:
The inability of a machine learning model to
capture the true relationship is called bias.
The difference in fits between datasets is
called variance. The goal is to achieve low
bias and low variance.
80. Big Important Concept #8:
No Free Lunch Theorem:
No single machine learning algorithm is
better than all others on all problems. It is
common to try multiple models and find
the one that works the best for that
particular problem.
82. Multiple Linear Regression is just an
extension of simple linear regression.
It is used to determine a mathematical
relationship among a number of random
variables. In other terms, MLR examines how
multiple independent variables are related to one
dependent variable.
86. Alert:
• Having more independent variables can make
the model complicated.
• Adding more independent variables does not
guarantee a better prediction model.
87. Alert:
Lack of multicollinearity must be checked for.
Multicollinearity is the phenomenon where one or
more independent variables in a regression model
strongly predict one or more other independent
variables. The dummy-variable trap is one common
way it arises.
Homework!
88. Regularization:
This is a form of regression that constrains,
regularizes, or shrinks the coefficient estimates
towards zero. In other words, this technique
discourages learning a more complex or flexible
model, so as to avoid the risk of overfitting.
Ridge Regression
Lasso Regression
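A minimal sketch of the shrinking effect for ridge regression with a single input variable. It assumes the common formulation that adds a penalty lam * m² to the sum of squared residuals and leaves the intercept unpenalized; the function name, the penalty name lam and the data are hypothetical. Lasso uses an absolute-value penalty instead and is not shown here.

```python
def ridge_fit_line(xs, ys, lam):
    """Minimize sum((y - (m*x + c))**2) + lam * m**2 for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    # The penalty enlarges the denominator, which pulls the slope towards zero.
    m = s_xy / (s_xx + lam)
    c = mean_y - m * mean_x
    return m, c


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # hypothetical data on y = 2x + 1

m_plain, c_plain = ridge_fit_line(xs, ys, lam=0.0)  # ordinary least squares
m_ridge, c_ridge = ridge_fit_line(xs, ys, lam=5.0)  # shrunken slope
```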
89. How do we estimate which parameters
are actually important for our model?
90. • Have domain knowledge.
• Use Subset Selection Methods.
– All-in method
– Backward Elimination
– Forward Selection
– Bidirectional Elimination
– Score Comparison
93. Polynomial Regression:
In statistics, polynomial regression is a form of
regression analysis in which the relationship
between the independent variable x and the
dependent variable y is modeled as an nth
degree polynomial in x.
101. Logistic regression is a predictive analysis. It is
used to describe data and to explain the
relationship between one dependent binary
variable and one or more independent variables.
102. Logistic regression is intended for binary
(two-class) classification problems.
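A minimal sketch of how a fitted logistic regression model turns one input into a two-class decision. The weight, bias and 0.5 threshold are hypothetical values picked for illustration; in practice the weight and bias are learned from training data.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weight, bias):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(weight * x + bias)


def predict_class(x, weight, bias, threshold=0.5):
    """Binary decision: class 1 if the probability reaches the threshold, else class 0."""
    return 1 if predict_proba(x, weight, bias) >= threshold else 0
```

With weight = 1.5 and bias = -3.0, the decision flips at x = 2: inputs above 2 are assigned to class 1, inputs below 2 to class 0.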
113. Big Important Concept #9:
Evaluating classification model with the
help of metrics! Choosing the right metric is
paramount in judging how well the model is
performing.
114. A confusion matrix is a table that is often
used to describe the performance of a
classification model (or "classifier") on a set
of test data for which the true values are
known. The confusion matrix itself is
relatively simple to understand, but the
related terminology can be confusing.
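A small sketch of the four cells of a binary confusion matrix and three common metrics built from them. The true labels and predictions are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return tp, fp, fn, tn


y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical known values
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical classifier output

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
```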
122. Softmax regression (or multinomial logistic
regression) is a generalization of logistic
regression to the case where we want to handle
multiple classes.
123. In logistic regression we assumed that the
labels were binary: y(i) ∈ {0,1}. We used such
a classifier to distinguish between two
categories. Softmax regression allows us to
handle y(i) ∈ {1, …, K} where K is the number
of classes.
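A minimal sketch of the softmax function, which turns K class scores into K probabilities that sum to 1. The scores are hypothetical; subtracting the largest score first is a common trick for numerical stability and does not change the result.

```python
import math


def softmax(scores):
    """Convert a list of K real-valued scores into K probabilities."""
    shift = max(scores)  # for numerical stability only
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])  # hypothetical scores for K = 3 classes
predicted_class = probabilities.index(max(probabilities)) + 1  # classes numbered 1..K
```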
131. In real world data analysis tasks we analyze
complex data i.e. multi-dimensional data.
132. As the dimensions of data increase, the difficulty
to visualize it and to perform computations on
the data also increases. How do we do it?
Remove the redundant dimensions.
Only keep the most important dimensions.
133. Principal component analysis (PCA) to the
rescue! It is a technique used to emphasize
variation and bring out strong patterns in a
dataset. It's often used to make data easy to
explore and visualize.
It is used for dimensionality reduction.
134. Too much of visualization.
StatQuest to our rescue!
136. The main idea of principal component analysis (PCA) is
to reduce the dimensionality of a data set consisting
of many variables correlated with each other, either
heavily or lightly, while retaining the variation present
in the dataset, up to the maximum extent.
137. The same is done by transforming the variables to a
new set of variables, which are known as the
principal components (or simply, the PCs) and are
orthogonal, ordered such that the retention of
variation present in the original variables decreases as
we move down in the order.
138. So, in this way, the 1st principal component retains
maximum variation that was present in the original
components. The principal components are the
eigenvectors of a covariance matrix, and hence they
are orthogonal.
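A sketch of PCA on two-dimensional data in plain Python, using the closed-form eigenvalues and eigenvectors of a 2x2 covariance matrix. The function name and the data points are hypothetical, and the sketch assumes the two variables are correlated (non-zero covariance).

```python
import math


def pca_2d(points):
    """Return (eigenvalues, components, scores), largest variance first."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix [[a, b], [b, d]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    d = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    if b == 0:
        raise ValueError("uncorrelated variables: the original axes already serve as PCs")

    # Eigenvalues of a symmetric 2x2 matrix: the variance kept by each PC.
    half_trace = (a + d) / 2
    spread = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    eigenvalues = [half_trace + spread, half_trace - spread]

    # Unit eigenvector for each eigenvalue: the direction of that PC.
    components = []
    for lam in eigenvalues:
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        components.append((vx / norm, vy / norm))

    # Reduce 2D to 1D by projecting the centered data onto the first PC.
    pc1 = components[0]
    scores = [x * pc1[0] + y * pc1[1] for x, y in centered]
    return eigenvalues, components, scores


points = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]  # hypothetical
eigenvalues, components, scores = pca_2d(points)
explained_by_pc1 = eigenvalues[0] / sum(eigenvalues)
```

For these nearly collinear points the first principal component keeps almost all of the variation, so dropping the second one loses very little.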
139. Puzzle!
Suppose you want to reduce the dimensionality of
data from 2D to 1D while classifying it into two
categories. How will you do it?
141. Linear discriminant analysis is similar to
PCA, both can help us reduce the
dimensionality, but LDA also focuses on
increasing or maximizing the linear
separability between classes, in data.
148. PCA and LDA both rank the new axes in
order of importance. PCA accounts for
the most variation in data, while LDA
accounts for the most separability in
data.
149. An eigenvector is a vector whose direction remains
unchanged when a linear transformation is applied to
it. Consider the image below in which three vectors
are shown. The green square is only drawn to illustrate
the linear transformation that is applied to each of
these three vectors.
150. More about Eigenvectors on:
www.visiondummy.com/2014/03/eigenvalues-eigenvectors/
152. A Support Vector Machine (SVM) is a
discriminative classifier formally defined
by a separating hyperplane.
It is an algorithm for linearly separable
binary sets.
153. In other words, given labeled training data
(supervised learning), the algorithm outputs an
optimal hyperplane which categorizes new
examples. In two-dimensional space this
hyperplane is a line dividing a plane into two
parts, with each class lying on either side.
159. Suppose you are given plot of two label classes
on graph as shown in the image. Can you
decide a separating line for the classes?
160. Any point to the left of the line falls into the black
circle class, and any point to the right falls into the blue
square class. Separation of classes. That’s what SVM does.
161. So far so good. Now consider what if we had
data as shown in image below?
162. We apply transformation and add one more
dimension as we call it z-axis. Now can you
draw a separating hyperplane? Yes!
167. When we transform this line back to the original
plane, it maps to a circular boundary as shown in the
image. These transformations are called
kernels.
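A sketch of the idea of adding a z-axis, assuming the transformation z = x² + y² (the slides do not state the formula, so this choice is an assumption). The points and the separating value z = 1 are hypothetical. In the new space a flat plane separates the two classes; back in the original plane that plane corresponds to the circle x² + y² = 1.

```python
def add_z_axis(point):
    """Map a 2D point (x, y) to 3D by adding z = x**2 + y**2 (assumed transformation)."""
    x, y = point
    return (x, y, x * x + y * y)


# Hypothetical data: one class near the origin, the other surrounding it.
inner_class = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer_class = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

separating_z = 1.0  # hypothetical hyperplane z = 1 in the transformed space

inner_below = all(add_z_axis(p)[2] < separating_z for p in inner_class)
outer_above = all(add_z_axis(p)[2] > separating_z for p in outer_class)
```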
168. Kernel functions:
These are functions which take a low-dimensional input
space and transform it into a higher-dimensional space,
i.e. they convert a non-separable problem into a separable
problem. They are mostly useful in non-linear separation
problems. Simply put, a kernel does some extremely complex
data transformations, then finds out how to separate the
data based on the labels or outputs you’ve defined.
170. Which one do you think is appropriate?
171. Well, both the answers are correct. The first
one tolerates some outlier points. The second
one is trying to achieve 0 tolerance with perfect
partition.
172. But there is a trade-off. In real-world
applications, finding perfect classes for millions
of samples from the training data set takes a lot
of time. Therefore we define two terms:
the regularization parameter and gamma. These
are tuning parameters in the SVM classifier.
173. By varying those, we can achieve a considerably
non-linear classification line with more
accuracy in a reasonable amount of time.
174. The regularization parameter (often termed the
C parameter) tells the SVM optimization the
extent to which you want to avoid
misclassifying each training example.
175. For large values of C, the optimization will choose a
smaller-margin hyperplane if that hyperplane does a
better job of getting all the training points classified
correctly. Conversely, a very small value of C will cause
the optimizer to look for a larger-margin separating
hyperplane, even if that hyperplane misclassifies more
points.
176. The gamma parameter defines how far the influence
of a single training example reaches, with low values
meaning ‘far’ and high values meaning ‘close’.
177. In other words, with low gamma, points far away from
the plausible separation line are considered in the calculation
of the separation line, whereas high gamma means
that only the points close to the plausible line are considered
in the calculation.
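A sketch of how gamma controls reach, assuming the widely used RBF (Gaussian) kernel exp(-gamma * distance²). The slides do not name the kernel, so that formula is an assumption, and the function name and numbers are hypothetical.

```python
import math


def rbf_influence(distance, gamma):
    """Influence of a training example on a point at the given distance (RBF kernel)."""
    return math.exp(-gamma * distance ** 2)


far_point = 2.0  # hypothetical distance from a training example

low_gamma_influence = rbf_influence(far_point, gamma=0.1)    # still clearly felt
high_gamma_influence = rbf_influence(far_point, gamma=10.0)  # practically zero
```

With low gamma the far point still receives a noticeable influence, so the reach is 'far'. With high gamma the influence has all but vanished at the same distance, so the reach is 'close'.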
179. How do we find out the right
hyperplane?
180. Identify the right hyperplane (scenario #1):
181. Rule #1:
Select the hyper-plane which segregates the
two classes better.
182. Identify the right hyperplane (scenario #2):
183. Rule #2:
Maximizing the distance between the nearest data point
(of either class) and the hyper-plane helps us decide the
right hyper-plane.
184. Identify the right hyperplane (scenario #3):
185. Rule #3:
SVM selects the hyper-plane which classifies the
classes accurately prior to maximizing margin.
186. SVM has a feature to ignore outliers and find the
hyper-plane that has maximum margin. Hence, we can
say, SVM is robust to outliers.
187. Algorithm
1. Define an optimal hyperplane: maximize margin.
2. Extend the above definition for non-linearly separable
problems: have a penalty term for misclassifications.
3. Map data to high dimensional space where it is easier
to classify with linear decision surfaces: reformulate
problem so that data is mapped implicitly to this space.
188. To define an optimal hyperplane we need
to maximize the width of the margin (w).
192. The Naive Bayes Classifier technique is based on
Bayes' theorem and is
particularly suited when the dimensionality of
the inputs is high. Despite its simplicity, Naive
Bayes can often outperform more sophisticated
classification methods.
193. As indicated, the objects can be classified as
either GREEN or RED. Our task is to classify new
cases as they arrive, i.e., decide to which class
label they belong, based on the currently existing
objects.
194. Since there are twice as many GREEN objects as
RED, it is reasonable to believe that a new case
(which hasn't been observed yet) is twice as
likely to have membership GREEN rather than
RED. In the Bayesian analysis, this belief is
known as the prior probability.
196. Since there is a total of 60 objects, 40 of which are
GREEN and 20 RED, our prior probabilities for class
membership are:
Prior probability of GREEN = 40/60
Prior probability of RED = 20/60
197. Since the objects are well clustered, it is
reasonable to assume that the more GREEN (or
RED) objects in the vicinity of X (test point), the
more likely that it belongs to that particular
color. To measure this likelihood, we draw a
circle around X which encompasses a number
(to be chosen a priori) of points irrespective of
their class labels.
199. Then we calculate the number of points in the circle
belonging to each class label. From this we calculate
the likelihood:
201. Although the prior probabilities indicate that X may
belong to GREEN (given that there are twice as many
GREEN compared to RED) the likelihood indicates
otherwise; that the class membership of X is RED
(given that there are more RED objects in the vicinity of
X than GREEN). In the Bayesian analysis, the final
classification is produced by combining both sources
of information, i.e., the prior and the likelihood, to
form a posterior probability using the so-called Bayes'
rule.
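A worked sketch of combining the prior and the likelihood. The class totals (40 GREEN, 20 RED) come from the example; the counts inside the circle around X (1 GREEN, 3 RED) are hypothetical, since the figure is not reproduced here. The likelihood of X given a class is taken as the number of that class's objects in the circle divided by the total number of objects of that class, which is also an assumption about the missing formula.

```python
from fractions import Fraction

total_green, total_red = 40, 20   # from the example
green_near_x, red_near_x = 1, 3   # hypothetical counts inside the circle around X

# Prior: how common each class is overall.
prior_green = Fraction(total_green, total_green + total_red)
prior_red = Fraction(total_red, total_green + total_red)

# Likelihood of X given each class (assumed definition, see above).
likelihood_green = Fraction(green_near_x, total_green)
likelihood_red = Fraction(red_near_x, total_red)

# Bayes' rule: posterior is proportional to prior times likelihood.
score_green = prior_green * likelihood_green
score_red = prior_red * likelihood_red
evidence = score_green + score_red
posterior_green = score_green / evidence
posterior_red = score_red / evidence

predicted_class = "RED" if posterior_red > posterior_green else "GREEN"
```

With these counts the likelihood outweighs the prior, and X is classified as RED.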
206. “Birds of a feather flock together.”
207. K-Nearest Neighbors is one of the most basic
yet essential classification algorithms in Machine
Learning. It belongs to the supervised learning
domain and finds intense application in pattern
recognition, data mining and intrusion
detection.
209. An understanding of how we calculate the
distance between points on a graph is necessary
before moving on. If you are unfamiliar with this
calculation or need a refresher, treat it as
homework.
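A minimal sketch of the distance calculation and of a K-Nearest Neighbors prediction built on it, assuming Euclidean distance and a simple majority vote. The function names, training points, labels and k = 3 are hypothetical.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(training_points, training_labels, query, k=3):
    """Label the query point by majority vote among its k nearest training points."""
    order = sorted(
        range(len(training_points)),
        key=lambda i: euclidean_distance(training_points[i], query),
    )
    nearest_labels = [training_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
training_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
training_labels = ["a", "a", "a", "b", "b", "b"]

label_near_a = knn_predict(training_points, training_labels, (2, 2))
label_near_b = knn_predict(training_points, training_labels, (6, 5))
```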
214. K-means clustering is a type of unsupervised learning,
which is used when you have unlabeled data (i.e., data
without defined categories or groups). The goal of this
algorithm is to find groups in the data, with the
number of groups represented by the variable K. The
algorithm works iteratively to assign each data point to
one of K groups based on the features that are
provided. Data points are clustered based on feature
similarity.
216. The results of the K-means clustering algorithm are:
• The centroids of the K clusters, which can be used to
label new data
• Labels for the training data (each data point is
assigned to a single cluster)
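A compact sketch of the iterative procedure, assuming the standard assign-then-update loop with starting centroids supplied by the caller. The function names, data points and starting centroids are hypothetical. It returns the two results listed above: the centroids and a label for every training point.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point; also used to label new data."""
    return min(range(len(centroids)), key=lambda j: squared_distance(point, centroids[j]))


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and recomputing the centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [nearest_centroid(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: the assignments can no longer change
        centroids = new_centroids
    return centroids, labels


# Hypothetical unlabeled data with K = 2 and hand-picked starting centroids.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
new_point_label = nearest_centroid((2.0, 2.0), centroids)
```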
217. Rather than defining groups before looking at the data,
clustering allows you to find and analyze the groups
that have formed organically.
218. Each centroid of a cluster is a collection of
feature values which define the resulting groups.
Examining the centroid feature weights can be
used to qualitatively interpret what kind of
group each cluster represents.
222. BAM! You guys are pros at regression, classification,
dimensionality reduction and clustering!!
Feeling like a data-scientist, eh?