3. Supervised Learning
→ The machine is trained on labeled data.
First, we train the machine with inputs and their corresponding outputs, and then we ask
the machine to predict the output for a test dataset.
Let's understand supervised learning with an example.
→ Suppose we have an input dataset of cat and dog images.
→ First, we train the machine to understand the images using features such as the
shape and size of the tail, the shape of the eyes, colour, and height (dogs are
typically taller, cats smaller). After training, we input the picture of a cat and ask
the machine to identify the object and predict the output. The trained machine checks
the features of the object, such as height, shape, colour, eyes, ears, and tail, and
finds that it is a cat.
4. Advantages and Disadvantages
Advantages:
→ Since supervised learning works with a labelled dataset, we can have an
exact idea about the classes of objects.
→ These algorithms are helpful in predicting the output on the basis of prior
experience.
Disadvantages:
→ These algorithms may not be able to solve very complex tasks.
→ They may predict the wrong output if the test data is different from the training data.
→ Training the algorithm can require a lot of computational time.
5. Unsupervised Learning
→ In unsupervised machine learning, the machine is trained on an unlabeled dataset.
→ The main aim of an unsupervised learning algorithm is to group or categorize
the unsorted dataset according to similarities, patterns, and differences.
→ Machines are instructed to find the hidden patterns in the input dataset.
6. Clustering
→ Clustering is the process of partitioning a set of data (or
objects) into a set of meaningful sub-classes.
→ One application is market segmentation, where you have a
database of customers and want to group them into different market segments so you
can sell to them separately or serve each segment better.
→ K-Means is by far the most popular and most widely used clustering algorithm.
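The K-Means idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the customer data is made up, the function name `kmeans` is hypothetical, and the starting centroids are simply assumed to be the first k points.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K-Means on a list of (x, y) tuples; returns (centroids, labels)."""
    # Assumption: use the first k points as starting centroids (simple but naive).
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    (sum(m[0] for m in members) / len(members),
                     sum(m[1] for m in members) / len(members))
                )
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # nothing moved, so the algorithm has converged
        centroids = new_centroids
    return centroids, labels

# Made-up customers: (annual spend in thousands, visits per month)
customers = [(1.0, 2.0), (9.0, 8.0), (1.5, 1.8), (8.5, 9.0), (1.2, 2.2), (9.2, 8.4)]
centroids, labels = kmeans(customers, k=2)
print(centroids)
print(labels)
```

Each customer ends up with the label of the market segment whose centroid is closest.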
7. Dimensionality Reduction
There are a couple of different reasons why one might want to do dimensionality
reduction.
→ One is data compression.
→ Data compression not only lets the data use less computer memory or disk
space, it also lets us speed up our learning algorithms.
8. → Linear regression
→ Logistic regression
→ Decision tree
→ SVM algorithm
→ Naive Bayes algorithm
→ KNN algorithm
→ K-Means
→ Random Forest algorithm
→ Dimensionality reduction algorithm
9. Important Links
Code Video - https://www.youtube.com/watch?v=rw84t7QU2O0
https://www.fireblazeaischool.in/blogs/assumptions-of-linear-regression/#Multivariate_Normality
10. The Five Assumptions of Linear Regression
Linear regression is a useful statistical method for understanding the relationship
between two variables, x and y.
A linear model makes the following assumptions:
1) Linear relationship between the variable and the target (linearity)
2) Multivariate normality
3) No or little collinearity
4) Homoscedasticity
5) No autocorrelation
11. 1. The Two Variables Should Be in a Linear Relationship
A linear relationship refers to the relation between the independent variables X and the target
Y.
12. The assumption of a linear relationship can be easily visualized using scatter plots,
where we plot the independent variable X on the x-axis and the dependent variable Y on
the y-axis.
15. Multivariate Normality - Histogram
Multivariate normality means that every independent variable X follows a Gaussian
(normal) distribution.
1) Normality can be assessed with histograms and Q-Q plots.
2) Normality can be tested statistically, for example with the Kolmogorov-Smirnov test.
3) When a variable is not normally distributed, a non-linear transformation (e.g. a
log transformation) may fix the issue.
Multiple regression assumes that the residuals are normally distributed.
16. Tests to check Multivariate Normality
Q-Q plots -- If the data is normally distributed, the points fall on a fairly straight line.
→ If it is not normal, the points deviate from the straight line.
19. No or Low Multicollinearity
The next assumption of linear regression is that there should be little or no
multicollinearity in the given dataset.
→ Multicollinearity occurs when the features or independent variables of a
given dataset are highly correlated with each other.
→ In a model with correlated variables, it becomes difficult to determine
which variable is contributing to the prediction of the target variable. In addition, the
standard errors tend to increase due to the presence of correlated variables.
20. Methods to handle Multicollinearity
→ Drop one of the features that are highly correlated in the given data.
→ Derive a new feature from the collinear features and drop the original features that
were used to build it.
21. Multicollinearity can be detected via various methods. We will focus on the most
common one: VIF (Variance Inflation Factor).
→ VIF measures the strength of the correlation between the independent
variables. It is computed by taking a variable and regressing it against every other
variable.
https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/
→ The R^2 value of that regression shows how well an independent variable is
described by the other independent variables. A high R^2 means that the
variable is highly correlated with the other variables. This is captured by the VIF,
defined below.
→ VIF = 1 / (1 - R^2)
→ So, the closer the R^2 value is to 1, the higher the VIF and the higher
the multicollinearity with that independent variable.
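As a rough illustration of the formula, the sketch below computes the VIF for the simplest case of only two predictors, where the R^2 of regressing one on the other equals the squared Pearson correlation. The helper names and the data are hypothetical; with more predictors, R^2 would have to come from a multiple regression instead.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors, R^2 is the squared correlation."""
    r_squared = pearson_r(x1, x2) ** 2
    # Note: a perfect correlation (R^2 = 1) would divide by zero here.
    return 1.0 / (1.0 - r_squared)

# Made-up feature columns
x1 = [1, 2, 3, 4, 5]
uncorrelated = [1, -1, 0, -1, 1]             # correlation with x1 is 0
moderately_related = [2, 1, 4, 3, 5]         # correlation with x1 is 0.8
nearly_duplicate = [1.1, 1.9, 3.2, 3.9, 5.0]

print(vif_two_predictors(x1, uncorrelated))
print(vif_two_predictors(x1, moderately_related))
print(vif_two_predictors(x1, nearly_duplicate))
```

The uncorrelated column gives a VIF of 1, while the near-duplicate column gives a VIF far above the usual 5 or 10 cut-off.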
22. VIF starts at 1 and has no upper limit
→ VIF = 1, no correlation between the independent variable and the other variables.
→ VIF exceeding 5 or 10 indicates high multicollinearity between this independent
variable and the others.
25. Multicollinearity in categorical Variables
→ For ordinal variables, multicollinearity can be detected with the Spearman rank
correlation coefficient.
→ For nominal variables, use the chi-square test.
These notes compare binary variables (categories coded 1 and 0). The chi-square test of
independence itself also applies to variables with more than 2 categories, as long as
the expected count in each cell is large enough.
26. Chi-Square Test
The chi-square test is a statistical test used to find out
→ the difference between the observed and the expected data, and
→ whether an association between categorical variables is due to chance or to a
real relationship between them.
27. Homoscedasticity
Important video link https://www.youtube.com/watch?v=35jMqo2IroE
→ To understand homoscedasticity, we must first understand the residuals of the
dependent variable in regression analysis.
→ A residual is the difference between the actual and the predicted value.
→ Homoscedasticity refers to whether these residuals have the same spread across all
values, or whether they cluster tightly at some values and spread far apart at
others.
→ If the spread of the residuals is constant, it is called homoscedasticity.
→ If the spread of the residuals changes across values, it is called
heteroscedasticity.
28. If we run a regression analysis and plot the distribution of the residuals,
homoscedastic residuals are spread uniformly, and any clusters that form are spread uniformly as well.
29. Heteroscedasticity
→ From left to right, the distribution takes a triangular shape.
→ On the left the values lie very close together; moving from left to
right, they spread farther apart.
30. To Check
Draw a regplot of the residuals against the predicted values to check homoscedasticity.
31. No Autocorrelation
When you build a linear regression model for forecasting, you will come
across a problem called autocorrelation, which creates multiple issues in interpreting the
results.
Autocorrelation
→ A linear model assumes that the error terms are independent.
→ When you build a regression model, the error terms need to be completely
independent of each other.
→ The error term is the difference between the expected price at a particular
time and the price that was actually observed.
32. → When the error terms are uncorrelated, they are randomly scattered
around zero and show no pattern.
33. Outliers
An outlier is a data point which is significantly different from the remaining data.
Algorithms susceptible to outliers
1) Linear models
2) Adaboost
37. Detecting Outliers
Normal Distribution
→ About 99.7% of the observations of a normally distributed variable lie within the mean
± 3 * standard deviation.
→ Values outside the mean ± 3 * standard deviation are considered outliers.
Skewed Distribution
→ The general approach is to calculate the quantiles and then the interquartile
range (IQR).
→ IQR = 75th Quantile - 25th Quantile
→ Upper Limit = 75th Quantile + IQR * 1.5
→ Lower Limit = 25th Quantile - IQR * 1.5
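Both detection rules can be sketched with Python's standard `statistics` module. The readings are made up and the function names are hypothetical. The lower IQR limit used here (25th quantile - 1.5 * IQR) mirrors the upper limit, and the quartiles use the module's default interpolation method, so other tools may give slightly different cut-offs.

```python
import statistics

def three_sigma_outliers(values):
    """Flag values farther than 3 standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) > 3 * std]

def iqr_limits(values):
    """Return (lower_limit, upper_limit) based on 1.5 * IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Made-up readings: twenty ordinary values and one extreme value
readings = [10, 11, 12, 13] * 5 + [95]

sigma_outliers = three_sigma_outliers(readings)
lower, upper = iqr_limits(readings)
iqr_outliers = [v for v in readings if v < lower or v > upper]

print(sigma_outliers)
print(lower, upper)
print(iqr_outliers)
```

On this data both rules single out the value 95.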
39. Linear regression
→ In linear regression, a relationship is established between the independent and
dependent variables by fitting a line.
→ The line is represented by y (dependent variable) = a * x (independent variable) +
b (intercept), where a is the slope.
→ Linear regression is used to solve regression problems.
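A minimal sketch of fitting the line y = a * x + b by ordinary least squares, using only plain Python. The function name `fit_line` and the data are hypothetical; the data was chosen to lie exactly on y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a * x + b (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]   # made-up data lying exactly on y = 2x + 1

a, b = fit_line(xs, ys)
prediction = a * 6 + b  # predict y for an unseen x
print(a, b, prediction)
```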
40. Advantages of Linear Regression
https://www.geeksforgeeks.org/ml-advantages-and-disadvantages-of-linear-regression/
→ Linear regression is simple to implement, and its output coefficients are easy to
interpret.
Explanation:
→ When you know that the independent and dependent variables have a linear
relationship, this algorithm is the best to use
because it is less complex than other models.
→ Linear regression is susceptible to overfitting, but this can be avoided using
dimensionality reduction techniques, regularization (L1 and L2), and
cross-validation.
41. Disadvantages of Linear regression
→ On the other hand, in linear regression outliers can have a huge
effect on the fit, and the boundaries are linear.
Explanation: Linear regression assumes a linear relationship between the
dependent and independent variables, that is, it assumes a straight-line
relationship between them.
Summary:
→ Linear regression is a great tool for analyzing the relationships among
variables, but it is not recommended for most practical applications.
42. Assumptions of Logistic Regression
Logistic regression does not make many of the key assumptions of linear regression
and general linear models that are based on ordinary least squares algorithms,
particularly regarding linearity, normality, homoscedasticity, and measurement level.
→ First, logistic regression does not require a linear relationship between the
dependent and independent variables.
→ Second, the error terms (residuals) do not need to be normally distributed.
→ Third, homoscedasticity is not required.
→ Finally, the dependent variable in logistic regression is not measured on an
interval or ratio scale.
43. Supervised learning
→ In supervised learning, we are given a data set and already know what our
correct output should look like, having the idea that there is a relationship between the
input and the output.
→ Supervised learning problems are categorized into
1) Regression.
2) Classification problems.
44. Classification
→ In a classification problem, we are instead trying to predict results in a
discrete output. In other words, we are trying to map input variables into discrete
categories.
→ The main goal of classification is to predict the target class (Yes/No).
→ The classification problem is just like the regression problem, except that
the values we now want to predict take on only a small number of discrete values.
For now, we will focus on the binary classification problem in which y can take on
only two values, 0 and 1.
45. Types of classification:
Binary classification.
When there are only two classes to predict, usually coded as 1 or 0.
Multi-Class Classification
When there are more than two class labels to predict, we call it a multi-class classification task.
48. Logistic regression model
https://www.javatpoint.com/logistic-regression-in-machine-learning
→ Logistic regression predicts the output of a categorical dependent variable.
→ Therefore the outcome must be a categorical or discrete value. It can be either
Yes or No, 0 or 1, True or False, etc.
→ But instead of giving the exact value as 0 or 1, it gives probabilistic values
which lie between 0 and 1.
→ In logistic regression, instead of fitting a regression line, we fit an "S"-shaped
logistic function whose output is bounded by two extreme values (0 and 1).
49. Logistic regression uses the same predictive-modeling concept as regression, which is
why it is called logistic regression, but it is used to classify samples, so it falls under
classification algorithms.
50. Logistic Function (Sigmoid Function):
→ The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
It maps any real value to another value in the range between 0 and 1.
→ In logistic regression, we use a threshold value to turn the probability into a
class of 0 or 1: values above the threshold are mapped to 1,
and values below the threshold are mapped to 0.
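The sigmoid and the thresholding step can be sketched as follows. The function names are hypothetical, and the threshold of 0.5 is an assumed default rather than something fixed by the method.

```python
import math

def sigmoid(z):
    """Map any real value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Turn the sigmoid probability into a class label of 0 or 1."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-5.0, -2.0, 0.0, 2.0, 5.0):
    print(z, round(sigmoid(z), 4), predict_class(z))
```

Raising the threshold makes the model more reluctant to predict class 1; lowering it has the opposite effect.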
53. Regression
→ Linear regression is used to handle regression problems,
→ whereas logistic regression is used to handle classification problems.
→ Linear regression provides a continuous output.
→ Logistic regression provides a discrete output.
→ The purpose of linear regression is to find the best-fitted line,
while logistic regression goes one step further and fits the line values to the
sigmoid curve.
54. When do you use linear regression vs Decision
Trees?
→ Linear regression is a linear model, which means it works really nicely when the
data has a linear shape.
→ But, when the data has a non-linear shape, then a linear model cannot capture the
non-linear features.
→ So in this case, you can use the decision trees, which do a better job at capturing
the non-linearity in the data by dividing the space into smaller sub-spaces depending on
the questions asked.
55. Support Vector Machines
→ SVMs are considered by many to be the most powerful 'black box' learning
algorithm and, by posing a cleverly chosen optimization objective, one of the most
widely used learning algorithms today.
→ Compared to both logistic regression and neural networks, the Support Vector
Machine (SVM) sometimes gives a cleaner and sometimes more powerful way of
learning **complex nonlinear functions**.
→ SVMs are also called Large Margin Classifiers.
58. Large Margin Intuition
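An SVM does not settle for Theta(transpose)X being just a little bigger than zero for
positive examples and just a little less than zero for negative ones; it asks for a clear gap on both sides.
This builds an extra safety factor, or safety margin, into the SVM.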
60. Consider the case where "C" is set to a very large value.
If "C" is very large, then when minimizing the optimization objective we are
highly motivated to choose a value such that the first term equals zero.
What would it take to make this first term in the objective equal to zero?
→ Whenever we have a training example with Y = 1, making the first term
zero requires a value of theta
→ such that Theta(transpose)Xi >= 1.
→ Whenever we have an example with label 0, it requires
→ that Theta(transpose)Xi <= -1.
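The notes do not spell out the "first term", so the sketch below assumes the hinge-style costs used in common course formulations of the SVM objective: the cost is zero once Theta(transpose)X is at least 1 for a positive example, and once it is at most -1 for a negative one. All function names and scores are hypothetical.

```python
def cost_y1(z):
    """Assumed cost for a positive example (y = 1): zero once z >= 1."""
    return max(0.0, 1.0 - z)

def cost_y0(z):
    """Assumed cost for a negative example (y = 0): zero once z <= -1."""
    return max(0.0, 1.0 + z)

def first_term(scores, labels, C):
    """C times the summed per-example cost, where each score is Theta^T x."""
    return C * sum(
        cost_y1(z) if y == 1 else cost_y0(z)
        for z, y in zip(scores, labels)
    )

labels = [1, 1, 0, 0]
safe_scores = [1.0, 2.5, -1.0, -3.0]   # every example is on the right side of its margin
tight_scores = [0.5, 2.5, -0.5, -3.0]  # two examples fall inside the margin

print(first_term(safe_scores, labels, C=100))
print(first_term(tight_scores, labels, C=100))
```

With a large C, even two examples slightly inside the margin produce a large penalty, which is why the optimizer is pushed toward the zero-cost region.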
62. The SVM will instead choose the decision boundary shown in black, which seems like a
better decision boundary.
→ This black decision boundary has a larger distance to the nearest training
examples; that distance is called the "margin" of the SVM.
→ This gives the SVM robustness, because it tries to separate the data with as large a
margin as possible.
67. Definition and Objective
The objective of the support vector machine algorithm is to find a hyperplane in an N-
dimensional space(N — the number of features) that distinctly classifies the data
points.
The SVM algorithm can tolerate some outliers and still find the hyperplane that has the
maximum margin. Hence, we can say that SVM classification is fairly robust to outliers.
SVM often offers high accuracy compared to other classifiers such as logistic regression
and decision trees, although this depends on the dataset.
It is used in a variety of applications such as face detection, intrusion detection, and the
classification of emails, news articles, and web pages.
70. Support Vector, Hyperplane
Support vectors
Support vectors are the data points, which are closest to the hyperplane and influence
the position and orientation of the hyperplane, using these support vectors, we
maximize the margin of the classifier
Hyperplane
A hyperplane is a decision plane which separates between a set of objects having
different class memberships.
Margin
A margin is a gap between the two lines on the closest class points. If the margin is
larger in between the classes, then it is considered a good margin, a smaller margin is a
71. How does SVM work?
The main objective is to segregate the given dataset in the best possible way. The
objective is to select a hyperplane with the maximum possible margin between support
vectors in the given dataset.
Dealing with non-linear and inseparable planes
Some problems can’t be solved using linear hyperplane, as shown in the figure below
In such situation, SVM uses a kernel trick to transform the input space to a higher
dimensional
72. SVM Kernels
A kernel takes a low-dimensional input space and transforms it into a higher-dimensional
space. In other words, it converts a non-separable problem into a separable
problem by adding more dimensions to it. It is most useful in non-linear separation
problems. The kernel trick helps you build a more accurate classifier.
Non-linear data
74. Tuning Hyperparameters
Kernel: The main function of the kernel is to transform the input data into
the required form. Polynomial and RBF kernels are useful for non-linear hyperplanes;
they compute the separation line in a higher dimension. This
transformation can lead to more accurate classifiers.
Regularization: In Python's Scikit-learn, the C parameter controls
regularization. A smaller value of C tolerates more misclassified points and creates a
larger-margin hyperplane, while a larger value of C penalizes misclassification more
heavily and creates a smaller-margin hyperplane.
Gamma: A lower value of gamma fits the training dataset loosely, whereas a higher
value of gamma fits the training dataset very closely, which can cause over-fitting. In other
words, a high value of gamma considers only nearby points in calculating the
separation line, while a low value of gamma also considers points that are far
away.
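The effect of gamma can be illustrated with the RBF kernel itself, k(a, b) = exp(-gamma * ||a - b||^2). In this small sketch (the function name and the points are hypothetical), a larger gamma makes the similarity between two points fall off much faster with distance, so only close neighbours still count.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF similarity between two points: exp(-gamma * squared distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

origin = (0.0, 0.0)
near_point = (0.5, 0.0)
far_point = (3.0, 0.0)

for gamma in (0.1, 10.0):
    print(gamma,
          rbf_kernel(origin, near_point, gamma),
          rbf_kernel(origin, far_point, gamma))
```

With gamma = 0.1 the far point still has a noticeable similarity to the origin; with gamma = 10 its similarity is practically zero.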
76. Important Decision trees best
→ https://www.youtube.com/watch?v=eKD5gxPPeY0&list=PLBv09BD7ez_4temBw7vLA19p3tdQH6FYO&index=1
77. Decision Tree(https://www.youtube.com/watch?v=nWuUahhK3Oc&t=1126s)
A supervised learning technique in data mining, which can be used to predict both
numeric and non-numeric target variables.
Trees in general use a divide-and-conquer strategy to divide the training data into
smaller and smaller subsets.
The algorithm goes through all the predictors to see which one is the most
predictive of the target feature, and that feature becomes the root of the tree.
So you have the root at the top, then splits, and then decision
nodes.
A node where the tree terminates is called a terminal node.
78. But the questions you should ask (and should know the answer to) are:
→ How do you split a decision tree?
→ What are the different splitting criteria?
→ What is the difference between Gini and Information Gain?
LINK - https://www.analyticsvidhya.com/blog/2020/06/4-ways-split-decision-tree/
80. Parent and Child Node: A node that gets divided into sub-nodes is known as Parent
Node, and these sub-nodes are known as Child Nodes. Since a node can be divided into
multiple sub-nodes, therefore a node can act as a parent node of numerous child nodes
Root Node: The top-most node of a decision tree. It does not have any parent node. It
represents the entire population or sample
Leaf / Terminal Nodes: Nodes that do not have any child node are known as
Terminal/Leaf Nodes
A decision tree makes decisions by splitting nodes into sub-nodes. This process is performed multiple times during
the training process until only homogenous nodes are left.
81. How to choose what feature to split on at each node?
This choice has to be made at the root node, as well as on the left and right branches of the decision tree.
Important points to consider:
→ A decision is needed whenever the examples at a node are a
mix of cats and dogs.
→ The decision tree chooses the feature to split on so as to maximize
purity.
82. Purity
Purity means getting to subsets that are as close as possible to containing data
samples of a single class.
Example: If we had a feature that said whether the animal has cat DNA (we do not have this
feature, but if we did), we could split on it at the root node.
83. Two categories based on the type of target
variable
1. Continuous Target Variable
→ Reduction in Variance
2. Categorical Target Variable
→ Gini Impurity
→ Information Gain
→ Chi-Square
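For a categorical target, Gini impurity and information gain can be sketched as below. The formulas are the standard ones (Gini = 1 - sum of squared class proportions; information gain = parent entropy minus the weighted entropy of the children); the function names and the cat/dog labels are made up for illustration.

```python
import math
from collections import Counter

def gini_impurity(labels):
    """1 - sum of squared class proportions; 0 means a perfectly pure node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits; 0 means a perfectly pure node."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["cat", "cat", "dog", "dog"]
perfect_split = [["cat", "cat"], ["dog", "dog"]]
useless_split = [["cat", "dog"], ["cat", "dog"]]

print(gini_impurity(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that separates the classes completely gains the full parent entropy, while a split that leaves both children mixed gains nothing.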
84. Variance
→ Variance is a measure of spread: it tells us how far the data is spread
from the mean.
→ A low variance leads to purer nodes.
→ A high variance leads to more impure nodes.
→ Prefer low variance when splitting.
85. Properties of Variance
→ Used when the target is continuous.
→ Split with lower Variance is selected.
86. Decision Tree Splitting Method #1: Reduction in Variance
Reduction in Variance is a method for splitting a node, used when the target variable is
continuous, i.e., in regression problems.
It is so called because it uses variance as the measure for deciding the feature on which
a node is split into child nodes.
→ Variance = sum((x - MU)^2) / n
→ x is a sample, MU is the mean, and n is the number of samples.
→ A lower variance means purer nodes.
89. Here are the steps to split a decision tree using reduction in variance:
For each split,
1) Calculate the variance of each child node individually.
2) Calculate the variance of the split as the weighted average variance of the child
nodes.
3) Select the split with the lowest variance.
4) Repeat steps 1-3 until completely homogeneous nodes are achieved.
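The steps above can be sketched for a single numeric feature as follows. This is a simplified illustration, not a full tree builder: it evaluates one level of splitting only, the function names are hypothetical, and the data is made up.

```python
def variance(values):
    """Population variance: mean of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def split_variance(left, right):
    """Weighted average variance of the two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * variance(left) + len(right) / n * variance(right)

def best_threshold(xs, ys):
    """Try every candidate threshold on x and keep the lowest-variance split."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x: its right child would be empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = split_variance(left, right)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Made-up data: the target jumps once the feature exceeds 3
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]

threshold, score = best_threshold(xs, ys)
print(threshold, score)
```

A full regression tree would apply the same search recursively to each child node.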
97. Example
To select a feature to split further, we need to know how pure or impure that split will be.
→ A pure sub-split means that you get only "yes" or only "no". Suppose this is our dataset.
https://www.analyticsvidhya.com/blog/2021/10/an-introduction-to-random-forest-algorithm-for-beginners/
98. Random Forest(Bagging Algorithm)
Random Forest is a versatile machine learning method capable of performing both
regression and classification tasks.
It can also help with dimensionality reduction, missing values, outlier
values, and other essential steps of data exploration, and does a fairly good job. It is a
type of ensemble learning method, where a group of weak models combine to form a
powerful model.
→ We combine multiple trees to get the final output, hence the name
forest.
But why is it called Random Forest?
→ Because each tree is trained on a random bootstrap sample.
99. A random forest creates decision trees on randomly selected data samples, gets a
prediction from each tree, and selects the best solution by means of voting. It also
provides a pretty good indicator of feature importance.
101. How does the algorithm work?
It works in four steps:
1) From a given dataset, create multiple bootstrap samples.
The number of bootstrap samples depends on the number of models we want to train.
Eg: To build 10 models, create 10
bootstrap samples.
2) Construct a decision tree for each bootstrap sample and get a
predicted value from each decision tree.
3) Perform a vote over the predicted results.
4) Select the prediction with the most votes as the final
prediction.
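The four steps can be sketched as below. To keep the example short, one-feature decision stumps stand in for full decision trees and no random feature selection is done, so this is really a bagging sketch rather than a complete random forest. The function names and data are hypothetical, and the stump assumes that class 1 tends to have larger feature values.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Step 1: draw len(data) rows with replacement."""
    return rng.choices(data, k=len(data))

def train_stump(sample):
    """Step 2: a one-split 'tree' on a single feature (assumes class 1 has larger x)."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        only_class = 0 if zeros else 1
        return lambda x: only_class  # the sample held one class only
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0

def train_forest(data, n_trees, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_trees)]

def predict(forest, x):
    """Steps 3 and 4: collect one vote per tree and return the majority class."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Made-up (feature, label) rows
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (8.0, 1), (8.5, 1), (9.0, 1), (9.5, 1)]

forest = train_forest(data, n_trees=10)
print(predict(forest, 1.0), predict(forest, 9.0))
```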
102. Resample
Resampling is the process of creating new samples based on an observed sample.
→ Permutation tests
→ Bootstrapping -- Bootstrapping is a statistical procedure that resamples a single
dataset to create many simulated samples.
103. Bootstrapping
Bootstrapping is a powerful, non-parametric resampling technique used to assess
the uncertainty of an estimator.
→ In bootstrapping, a large number of samples of the same size are drawn
repeatedly from an original sample.
→ Observations are drawn with replacement, so a given observation can appear more
than once in a sample and in more than one sample.
→ Each sample is of identical size.
→ The larger n is, the closer the set of samples will be to the ideal bootstrap sample.
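As a small illustration of using the bootstrap to assess uncertainty, the sketch below resamples a made-up dataset many times and uses the spread of the resampled means as an estimate of the standard error of the mean. The function name, the data, and the choice of 1000 resamples are assumptions made for the example.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples, seed=0):
    """Mean of each bootstrap resample (same size as the sample, drawn with replacement)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.mean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    ]

sample = [12, 15, 9, 14, 11, 13, 10, 16]   # made-up observations
means = bootstrap_means(sample, n_resamples=1000)

# The spread of the resampled means estimates the uncertainty of the sample mean.
standard_error = statistics.stdev(means)
print(statistics.mean(sample), standard_error)
```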
105. Bootstrap aggregation
Bootstrap aggregation, also known as bagging, is a powerful ensemble method that was
proposed to prevent overfitting.
→ The concept behind bagging is to combine the predictions of several base
learners to create a more accurate output.
→ Algorithms such as neural networks and decision trees are examples of
unstable learning algorithms.
→ Bagging supports both classification and regression problems.
→ Bootstrapping is effective on small datasets.
109. Advantages:
A random forest is considered a highly accurate and robust method because of the
number of decision trees participating in the process.
It is less prone to overfitting than a single decision tree. The main reason is that it
averages the predictions of many trees, which reduces the variance.
The algorithm can be used in both classification and regression problems.
Random forests can also handle missing values. There are two ways to handle these:
using median values to replace continuous variables, and computing the
proximity-weighted average of missing values.
You can get the relative feature importance, which helps in selecting the most
contributing features for the classifier.
110. Disadvantages:
A random forest is slow in generating predictions because it has multiple decision
trees. Whenever it makes a prediction, all the trees in the forest have to make a
prediction for the same input and then vote on it. This whole process
is time-consuming.
The model is difficult to interpret compared to a decision tree, where you can easily
make a decision by following the path in the tree.
111. Finding important features
Random forests also offer a good feature selection indicator. Scikit-learn exposes this
through the fitted model's feature_importances_ attribute, which shows the relative
importance or contribution of each feature in the prediction.
112. Random Forests vs Decision Trees
A random forest is a set of multiple decision trees.
Deep decision trees may suffer from overfitting, but random forests reduce overfitting
by creating trees on random subsets.
Decision trees are computationally faster.
A random forest is difficult to interpret, while a decision tree is easily interpretable and
can be converted to rules.
113. Gradient Boosting Algorithm
GBM is a boosting algorithm used when we deal with plenty of data and need
predictions with high predictive power.
Boosting
→ Boosting is an ensemble technique that combines the
predictions of several base estimators in order to improve robustness over a single
estimator.
→ It combines multiple weak or average predictors to build a strong predictor.
114. XGBoost is a powerful machine learning algorithm, especially where speed and accuracy are
required.
XGBoost requires parameter tuning to improve and fully leverage its
advantages over other algorithms.
115. XGBoost, or extreme gradient boosting, is one of the well-known gradient boosting
techniques (ensemble), having enhanced performance and speed in tree-based
116. Algorithm Working Process
Linear Regression - A relationship is established between the independent and dependent
variables by fitting a straight line.
Logistic Regression - Instead of fitting a straight line, an S-shaped
sigmoid function is fitted; its output is bounded by two extreme values, 0 and 1, and is
converted to a discrete class.
Linear regression - The loss function used in linear regression is the
mean squared error,
Logistic regression - whereas logistic regression is fitted by maximum likelihood
estimation.
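The two criteria can be compared side by side. For logistic regression, maximizing the likelihood is equivalent to minimizing the log loss shown here. The function names and the numbers are hypothetical, and the predicted probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference, the usual loss for linear regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred):
    """Negative average log-likelihood for 0/1 labels (probabilities must be in (0, 1))."""
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, p_pred)
    ) / len(y_true)

print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))

labels = [1, 0]
confident_and_right = [0.9, 0.1]
confident_and_wrong = [0.1, 0.9]
print(log_loss(labels, confident_and_right))
print(log_loss(labels, confident_and_wrong))
```

Log loss punishes confident wrong probabilities much more heavily than confident correct ones are rewarded.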