1. Machine Learning
Department of Computer Science & Engineering
Course Code:    Semester: V
Course Title: Machine Learning    Year: 2022
Faculty Name: Prof. Arunkumar, Dr. Tina Babu, Dr. Revathi V, Dr. Geetha, Prof. Ranjini
2. MODULE 3
Syllabus – Supervised Learning
Introduction to Supervised Learning; Introduction to the Perceptron model and its adaptive learning algorithms (Gradient Descent and Stochastic Gradient Descent); Introduction to Classification; Naive Bayes classification; Binary and Multi-class Classification; Decision Trees and Random Forest; Regression (methods of function estimation) – Linear Regression and Non-linear Regression, Logistic Regression; Introduction to Kernel-Based Methods of Machine Learning: K-Nearest Neighbours, Kernel Functions, SVM; Introduction to Ensemble-Based Learning Methods
3. Introduction to Supervised Learning
• Machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output.
– Labelled data means the input data is already tagged with the correct output.
• The training data provided to the machine works as the supervisor that teaches the machine to predict the output correctly.
• Supervised learning is the process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).
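A minimal sketch of this mapping idea, assuming scikit-learn as the library (the data and model choice here are illustrative, not from the slides):

```python
# A toy supervised-learning run: the labelled pairs (X_train, y_train)
# act as the supervisor, and the model learns a mapping f(x) -> y.
from sklearn.linear_model import LogisticRegression

X_train = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]  # inputs x
y_train = [0, 0, 1, 1]                                      # correct outputs y

model = LogisticRegression().fit(X_train, y_train)
print(model.predict([[3.5, 3.5]]))  # predicted label for an unseen input
```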
7. Introduction to Supervised Learning
1. Regression
• Used if there is a relationship between the input
variable and the output variable.
• It is used for the prediction of continuous variables, such as weather forecasting, market trends, etc.
8. Introduction to Supervised Learning
2. Classification
• Used when the output variable is categorical, i.e., the data can be divided into classes such as Yes-No, Male-Female, True-False, etc.
15. Introduction to Perceptron Model
• What is a perceptron?
• Neural Network In 5 Minutes | What Is A Neural Network? | How Neural Networks Work | Simplilearn – YouTube
• A perceptron is a building block of an Artificial Neural Network.
• The perceptron is a linear machine learning algorithm used in supervised learning for binary classifiers.
• This algorithm enables the neuron to learn and process elements of the training set one at a time.
16. Introduction to Perceptron Model
• What is the Perceptron model in Machine Learning?
• A perceptron is also understood as an artificial neuron, or neural network unit, that performs certain computations on input data to detect features.
• The perceptron model is one of the simplest types of artificial neural networks. It is a supervised learning algorithm for binary classifiers.
18. Introduction to Perceptron Model
• Basic Components of the Perceptron
• The perceptron can be viewed as a single-layer neural network with four main components:
– input values,
– weights and bias,
– net sum,
– an activation function.
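A small sketch of these four components in code (NumPy is assumed; the input, weight, and bias values are illustrative):

```python
import numpy as np

def perceptron_output(x, w, b):
    """Combine the four components: inputs x, weights w and bias b,
    the net sum, and a step activation function."""
    net = np.dot(w, x) + b        # net sum = weighted inputs + bias
    return 1 if net > 0 else 0    # step activation

x = np.array([1.0, 0.5])          # input values
w = np.array([0.4, -0.2])         # weights
b = -0.1                          # bias
print(perceptron_output(x, w, b)) # -> 1, since 0.4 - 0.1 - 0.1 = 0.2 > 0
```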
22. Introduction to Perceptron Model
• Why do we need weights and bias?
• As the network gets trained, it adjusts both parameters to achieve the desired values and the correct output.
• Weights – weights measure the importance of each feature in predicting the output value.
• Features whose weights are close to zero have less significance in the prediction process than features whose weights are further from zero (larger in magnitude).
• Besides high-weighted features having greater predictive power than low-weighted ones, a weight can also be positive or negative.
23. Introduction to Perceptron Model
• Why do we need weights and bias?
• Bias – the bias delays the trigger of the activation function. It acts like the intercept in a linear equation.
• Bias is a constant used to adjust the output and help the model provide the best fit for the given data.
28. Introduction to Perceptron Model
• Learning Rate – It’s a positive constant that is
used to moderate the degree to which weights
are changed at each step.
• What is Perceptron: A Beginners Guide for
Perceptron [Updated] (simplilearn.com)
35. Introduction to Perceptron Model
• Example 1 – AND GATE Perceptron Training Rule | Artificial Neural Networks Machine Learning by Mahesh Huddar – YouTube (see the sketch below)
• Example 2 – OR GATE Perceptron Training Rule | Artificial Neural Networks Machine Learning by Mahesh Huddar – YouTube
• Example 3 – Perceptron Rule to design XOR Logic Gate Solved Example ANN Machine Learning by Mahesh Huddar – YouTube
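As a companion to Example 1 above, here is a hedged sketch of the perceptron training rule wi ← wi + η(t − o)xi applied to the AND gate (initial weights, learning rate, and epoch count are illustrative choices):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # AND-gate inputs
t = np.array([0, 0, 0, 1])                      # AND-gate targets

w, b, eta = np.zeros(2), 0.0, 0.1               # weights, bias, learning rate
for epoch in range(10):                         # a few passes suffice for AND
    for xi, ti in zip(X, t):
        o = 1 if np.dot(w, xi) + b > 0 else 0   # current perceptron output
        w += eta * (ti - o) * xi                # w_i <- w_i + eta*(t - o)*x_i
        b += eta * (ti - o)                     # bias updated the same way

print([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])  # -> [0, 0, 0, 1]
```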
36. Introduction to Perceptron Model
• 1. Gradient Descent | Delta Rule | Delta Rule
Derivation Nonlinearly Separable Data by
Mahesh Huddar - YouTube
44. Introduction to Perceptron Model
• Limitations of Gradient Descent
1) converging to a local minimum can sometimes
be quite slow (i.e., it can require many
thousands of gradient descent steps)
2) if there are multiple local minima in the error
surface, then there is no guarantee that the
procedure will find the global minimum.
45. Introduction to Perceptron Model
• Stochastic Gradient Descent, also known as Incremental Gradient Descent:
– it approximates the gradient descent search by updating the weights incrementally, following the calculation of the error for each individual example.
47. Introduction to Perceptron Model
• One way to view this stochastic gradient descent is to consider a distinct error function defined for each individual training example d as follows:
Ed(w) = ½ (td – od)²
where td is the target output and od the unit's output for training example d.
• Stochastic gradient descent iterates over the training examples d in D, at each iteration altering the weights according to the gradient with respect to Ed(w), giving the update rule Δwi = η (td – od) xid.
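A toy sketch of this incremental update for a linear unit, assuming the delta-rule form above (NumPy assumed; the synthetic data and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = X @ np.array([2.0, -1.0]) + 0.5     # targets from a known linear rule

w, b, eta = np.zeros(2), 0.0, 0.05
for epoch in range(20):
    for xd, td in zip(X, t):
        od = np.dot(w, xd) + b          # linear-unit output for example d
        w += eta * (td - od) * xd       # delta rule: w_i += eta*(t_d - o_d)*x_id
        b += eta * (td - od)            # weights updated after *each* example

print(np.round(w, 2), round(b, 2))      # approaches [ 2. -1.] and 0.5
```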
48. Introduction to Perceptron Model
• The key differences between standard gradient descent and stochastic gradient descent are:
– In standard gradient descent, the error is summed over all examples before updating the weights, whereas in stochastic gradient descent the weights are updated upon examining each training example.
– Summing over multiple examples requires more computation per weight update step; on the other hand, because it uses the true gradient, standard gradient descent is often used with a larger step size per update.
– Where there are multiple local minima of the error surface, stochastic gradient descent can sometimes avoid falling into them, because it follows the various gradients of Ed(w) rather than the single gradient of E(w).
53. Classification in Machine Learning
• Classification is a supervised machine learning method where the model tries to predict the correct label of a given input. The model is fully trained using the training data and evaluated on test data before being used to make predictions on new, unseen data.
• For instance, an algorithm can learn to predict whether a given email is spam or ham (not spam).
54. Lazy Learners vs. Eager Learners
• Eager learners are machine learning algorithms that first build a model from the training dataset before making any predictions on future data. They spend more time during the training process, learning parameters such as weights to achieve better generalization, but they require less time to make predictions.
• Most machine learning algorithms are eager learners; some examples:
• Logistic Regression.
• Support Vector Machine.
• Decision Trees.
• Artificial Neural Networks.
55. Lazy Learners Vs. Eager Learners
• Lazy learners or instance-based learners, on the other hand, do not
create any model immediately from the training data, and this is where
the lazy aspect comes from.
• They just memorize the training data, and each time there is a need to
make a prediction, they search for the nearest neighbor from the whole
training data, which makes them very slow during prediction. Some
examples of this kind are:
• K-Nearest Neighbor.
• Case-based reasoning.
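A brief sketch contrasting the two styles, assuming scikit-learn (the dataset is synthetic and illustrative): the eager learner does its work in fit, the lazy learner defers it to predict.

```python
# The eager learner pays its cost in fit(), learning weights; the lazy
# learner just stores the data and searches neighbours in predict().
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

eager = LogisticRegression(max_iter=1000).fit(X, y)  # builds a model up front
lazy = KNeighborsClassifier().fit(X, y)              # essentially memorizes X, y

print(eager.predict(X[:1]), lazy.predict(X[:1]))
```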
57. Machine Learning Classification in
Real Life
Healthcare
Training a machine learning model on historical patient data can help healthcare
specialists accurately analyze their diagnoses:
• During the COVID-19 pandemic, machine learning models were implemented to
efficiently predict whether a person had COVID-19 or not.
Education
• Education is one of the domains dealing with the most textual, video, and audio
data. This unstructured information can be analyzed with the help of Natural
Language technologies to perform different tasks such as:
• The classification of documents per category.
Sustainable agriculture
• Agriculture is one of the most valuable pillars of human survival. Introducing sustainability can help improve farmers' productivity at different levels without damaging the environment:
• By using classification models to predict which type of land is suitable for a given
type of seed.
58. Different Types of Classification
Binary Classification
The goal is to classify the input data into two mutually exclusive categories. The training data in such a situation is labeled in a binary format: true and false; positive and negative; 0 and 1; spam and not spam; etc., depending on the problem being tackled. For instance, we might want to detect whether a given image is of a truck or a boat.
Logistic Regression and Support Vector Machine algorithms are natively designed for binary classification. However, other algorithms such as K-Nearest Neighbors and Decision Trees can also be used for binary classification.
59. Multi-Class Classification
Multi-class classification, on the other hand, has more than two mutually exclusive class labels, and the goal is to predict which class a given input example belongs to. For example, the model might correctly classify an image as a plane.
• Most binary classification algorithms can also be used for multi-class classification. These algorithms include, but are not limited to:
• Random Forest
• Naive Bayes
• K-Nearest Neighbors
• Gradient Boosting
• SVM
• Logistic Regression.
60. Multi-Class Classification
• Didn't you say that SVM and Logistic Regression do not support multi-class classification by default?
• → That's correct. However, we can apply binary transformation approaches such as one-versus-one and one-versus-rest (also called one-versus-all) to adapt natively binary classification algorithms to multi-class classification tasks.
• One-versus-one: this strategy trains as many classifiers as there are pairs of labels. If we have a 3-class classification problem, we will have three pairs of labels, and thus three classifiers.
• For N labels, we will have N × (N − 1) / 2 classifiers. Each classifier is trained on a single binary dataset, and the final class is predicted by a majority vote among all the classifiers. The one-versus-one approach works best for SVM and other kernel-based algorithms.
61. Multi-Class Classification
• One-versus-rest: here, we consider each label in turn as an independent class and combine all the remaining labels into a single "rest" class. With 3 classes, we will again have three classifiers.
• In general, for N labels, we will have N binary classifiers (see the sketch below).
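A sketch of both strategies using scikit-learn's wrappers around a natively binary SVM (the Iris dataset is used purely as a convenient 3-class example):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)           # a convenient 3-class dataset

ovo = OneVsOneClassifier(SVC()).fit(X, y)   # N*(N-1)/2 = 3 binary classifiers
ovr = OneVsRestClassifier(SVC()).fit(X, y)  # N = 3 binary classifiers

print(len(ovo.estimators_), len(ovr.estimators_))  # -> 3 3
```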
62. Multi-Label Classification
• In multi-label classification tasks, we try to predict 0 or more classes for each input example. In this case, there is no mutual exclusion, because an input example can have more than one label.
• Such scenarios occur in different domains, such as auto-tagging in Natural Language Processing, where a given text can contain multiple topics. Similarly, in computer vision, an image can contain multiple objects.
63. Multi-Label Classification
• It is not possible to use multi-class or binary classification models directly to perform multi-label classification. However, most algorithms used for those standard classification tasks have specialized versions for multi-label classification (see the sketch below). We can cite:
• Multi-label Decision Trees
• Multi-label Gradient Boosting
• Multi-label Random Forests
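A minimal multi-label sketch, assuming scikit-learn (synthetic data; a Random Forest accepts a binary indicator matrix with one column per label):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

# Y is a binary indicator matrix: one column per label, and an example
# may switch on several columns at once (no mutual exclusion).
X, Y = make_multilabel_classification(n_classes=3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X, Y)
print(clf.predict(X[:1]))  # e.g. [[1 0 1]]: labels 0 and 2 both apply
```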
64. Imbalanced Classification
• In imbalanced classification, the number of examples is unevenly distributed across the classes, meaning that the training data can contain far more of one class than of the others. Consider the following 3-class classification scenario where the training data contains 60% trucks, 25% planes, and 15% boats.
65. Imbalanced Classification
• The imbalanced classification problem can occur in scenarios such as:
• Fraudulent transaction detection in the financial industry
• Rare disease diagnosis
• Customer churn analysis
• Conventional predictive models such as Decision Trees, Logistic Regression, etc. may not be effective when dealing with an imbalanced dataset, because they can be biased toward predicting the class with the highest number of observations and treat those with fewer observations as noise.
• So, does that mean such problems are left unsolved?
• Of course not! We can use multiple approaches to tackle imbalance in a dataset. The most commonly used approaches include sampling techniques and cost-sensitive algorithms (see the sketch below).
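One hedged illustration of the cost-sensitive route, using scikit-learn's class_weight option on a synthetic imbalanced dataset (sampling techniques would be an alternative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A roughly 90% / 10% imbalanced dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

plain = LogisticRegression().fit(X, y)
costly = LogisticRegression(class_weight="balanced").fit(X, y)  # cost-sensitive

# The cost-sensitive model is pushed to pay attention to the minority class.
print(plain.predict(X).sum(), costly.predict(X).sum())
```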
66. Classification: Meaning
• The process of arranging data into homogeneous (similar) groups according to their common characteristics.
• Raw data cannot be easily understood, and it is not fit for further analysis and interpretation. Arrangement of data helps users in comparison and analysis.
• For example,
– the population of a town can be grouped according to sex, age, marital status, etc.
• "Classification is the process of arranging data into sequences according to their common characteristics or separating them into different related parts." – Prof. Secrist
67. Classification of Data
• The method of arranging data into "homogeneous classes" according to the common features present in the data is known as classification.
• A planned data-classification system makes the underlying data easy to find and retrieve.
– This can be of particular interest for legal discovery, risk management, and compliance.
– Written procedures and guidelines for data classification should determine what levels and measures the company will use to organise data and should define the roles of employees within the business regarding input stewardship.
– Once a data-classification scheme has been designed, the security standards that stipulate appropriate handling practices for each category and the storage criteria that define the data's lifecycle requirements should be discussed.
68. Classification of Data: Objectives
• To condense the volume of data in such a way that similarities and differences can be quickly understood. Figures can consequently be ordered in sections with common traits.
• To aid comparison.
• To point out the important characteristics of the data at a glance.
• To give prominence to the important data collected while separating out the optional elements.
• To allow statistical treatment of the material gathered.
80. Naïve Bayes Classifier
• Example 1: Solved Example Naive Bayes Classifier to classify New Instance | PlayTennis Example by Mahesh Huddar – YouTube
• Example 2: Solved Example Naive Bayes Classifier to classify New Instance | Species Example by Mahesh Huddar – YouTube
• Example 3: Solved Example Naive Bayes Classifier to classify New Instance | Car Example by Mahesh Huddar – YouTube
81. Bayes theorem in Multi-class classification
• Exact Bayesian classification is technically impractical when we have many evidence variables (predictors) in our dataset. As the number of predictors increases, many records that we want to classify will no longer have an exact match in the training data.
• For example, with three evidence variables X1, X2, X3 and class C, Bayes' theorem gives
P(C | X1, X2, X3) = P(X1, X2, X3 | C) · P(C) / P(X1, X2, X3)
and even with only 3 evidence variables it is not easy to find an exact match for every combination of values.
82. Bayes theorem in Multi-class classification
• The naive assumption is that the variables are independent given the class, so we can calculate the conditional probability as follows:
P(X1, X2, X3 | C) = P(X1 | C) · P(X2 | C) · P(X3 | C)
• By assuming conditional independence between the variables, we convert the Bayes equation into a simpler, "naive" one. Even though assuming independence between variables sounds simplistic, the Naive Bayes algorithm performs quite well in many classification tasks.
• For more detail: https://www.geeksforgeeks.org/naive-bayes-classifiers/
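A small sketch of Naive Bayes on categorical data, assuming scikit-learn's CategoricalNB (the PlayTennis-style integer encoding below is illustrative):

```python
from sklearn.naive_bayes import CategoricalNB

# Columns: Outlook (0=Sunny, 1=Overcast, 2=Rain) and Wind (0=Weak, 1=Strong).
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1]]
y = [0, 0, 1, 1, 0, 1]        # 0 = "No", 1 = "Yes"

clf = CategoricalNB().fit(X, y)
# Picks the class maximizing P(C) * P(outlook | C) * P(wind | C).
print(clf.predict([[1, 0]]))
```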
85. Decision Trees
❑ The decision tree is a powerful and popular tool for classification and prediction. A decision tree is a flowchart-like tree structure where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
https://youtu.be/RmajweUFKvM
87. Important Terminology related
to Decision Trees
1. Root Node: It represents the entire population or sample and
this further gets divided into two or more homogeneous sets.
2. Splitting: It is a process of dividing a node into two or more
sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
4. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
5. Pruning: Removing the sub-nodes of a decision node is called pruning. It can be seen as the opposite of splitting.
6. Branch / Sub-Tree: A subsection of the entire tree is called
branch or sub-tree.
7. Parent and Child Node: A node, which is divided into sub-
nodes is called a parent node of sub-nodes whereas sub-nodes
are the child of a parent node.
89. Assumptions while creating
Decision Tree
• In the beginning, the whole training set is
considered as the root.
• Feature values are preferred to be categorical. If
the values are continuous then they are
discretized prior to building the model.
• Records are distributed recursively on the basis
of attribute values.
• The order in which attributes are placed as the root or as internal nodes of the tree is decided using a statistical approach.
90. Decision trees expressivity
• Decision Trees follow a Sum of Products (SOP) representation, also known as Disjunctive Normal Form. For a given class, every branch from the root of the tree to a leaf node carrying that class is a conjunction (product) of attribute values, and the different branches ending in that class form a disjunction (sum).
91. A Decision Tree for the concept PlayTennis
• This tree classifies Saturday mornings according to
whether or not they are suitable for playing tennis.
92. Decision trees expressivity
• Decision trees represent a disjunction of conjunctions of constraints on the attribute values. For the PlayTennis tree above:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
94. How do Decision Trees work?
• Decision trees use multiple algorithms to decide
to split a node into two or more sub-nodes. The
creation of sub-nodes increases the
homogeneity of resultant sub-nodes. In other
words, we can say that the purity of the node
increases with respect to the target variable.
95. Algorithms used in Decision Trees:
• ID3 → Iterative Dichotomiser 3
• C4.5 → successor of ID3
• CART → Classification And Regression Trees
• CHAID → Chi-square Automatic Interaction Detection (performs multi-level splits when computing classification trees)
• MARS → Multivariate Adaptive Regression Splines
96. Steps in the ID3 algorithm:
1. It begins with the original set S as the root node.
2. On each iteration, the algorithm iterates through every unused attribute of the set S and calculates the Entropy (H) and Information Gain (IG) of that attribute.
3. It then selects the attribute which has the smallest entropy or, equivalently, the largest information gain (see the sketch after this list).
4. The set S is then split by the selected attribute to produce subsets of the data.
5. The algorithm continues to recurse on each subset, considering only attributes never selected before.
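As a hedged illustration (not ID3 itself): scikit-learn's decision tree is CART-based, but setting criterion="entropy" makes it choose splits by information gain, in the spirit of steps 2 and 3 above (the Iris dataset and feature names are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="entropy" makes each split maximize information gain,
# mirroring steps 2-3 of the ID3 procedure above.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["sl", "sw", "pl", "pw"]))
```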
97. Attribute Selection Measures
• If the dataset consists of N attributes, then deciding which attribute to place at the root or at the different levels of the tree as internal nodes is a complicated step. Just randomly selecting a node to be the root will not solve the issue; a random approach may give us bad results with low accuracy.
99. Entropy
• Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information. Flipping a coin is an example of an action that provides random information.
100.
E(S) = – Σi pi log2(pi)
• Where S → the current state, and pi → the probability of an event i in state S, or the percentage of class i in a node of state S.
101. Entropy definition
• If the target attribute can take on c different values, then the entropy of S relative to this c-wise classification is defined as
Entropy(S) ≡ Σ (i = 1 to c) – pi log2(pi)
103. Entropy in binary classification
• Entropy measures the impurity of a collection of examples. It depends on the distribution of the random variable p.
– S is a collection of training examples
– p+ is the proportion of positive examples in S
– p– is the proportion of negative examples in S
Entropy(S) ≡ – p+ log2(p+) – p– log2(p–)    [0 log2(0) = 0]
Entropy([14+, 0–]) = – (14/14) log2(14/14) – 0 log2(0) = 0
Entropy([9+, 5–]) = – (9/14) log2(9/14) – (5/14) log2(5/14) = 0.94
Entropy([7+, 7–]) = – (7/14) log2(7/14) – (7/14) log2(7/14) = 1/2 + 1/2 = 1    [log2(1/2) = –1]
Note: the logarithm of a number < 1 is negative; 0 ≤ p ≤ 1 and 0 ≤ entropy ≤ 1.
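A quick numeric check of these three values (a plain Python sketch):

```python
import math

def entropy(pos, neg):
    """Binary entropy of a collection with pos positive and neg negative examples."""
    e, total = 0.0, pos + neg
    for k in (pos, neg):
        p = k / total
        if p > 0:                  # convention: 0 * log2(0) = 0
            e -= p * math.log2(p)
    return e

print(entropy(14, 0))              # 0.0
print(round(entropy(9, 5), 2))     # 0.94
print(entropy(7, 7))               # 1.0
```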
105. Information Gain
• Information gain or IG is a statistical property
that measures how well a given attribute
separates the training examples according to
their target classification. Constructing a
decision tree is all about finding an attribute that
returns the highest information gain and the
smallest entropy.
108. Entropy calculation
– Here the percentage of students who play cricket is 0.5, and the percentage of students who do not play cricket is of course also 0.5.
– Since the log of 0.5 to base two is –1, the entropy of this node is 1.
109. Entropy calculation in a pure node
• Entropy is zero here.
• Lower entropy means a purer node; higher entropy means a less pure node.
110. Information gain as entropy reduction
• A measure of the effectiveness of an attribute in classifying the training data is called information gain.
• It is the expected reduction in entropy caused by partitioning the examples according to this attribute.
• The information gain, Gain(S, A), of an attribute A relative to a collection of examples S is defined as
Gain(S, A) ≡ Entropy(S) – Σ (v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)
where Values(A) is the set of all possible values of attribute A and Sv is the subset of S for which A has value v.
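A small sketch computing Gain(S, A) exactly as defined above (the toy attribute and labels are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain(rows, labels, a):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    g = entropy(labels)
    for v in set(r[a] for r in rows):
        sv = [lab for r, lab in zip(rows, labels) if r[a] == v]
        g -= (len(sv) / len(labels)) * entropy(sv)
    return g

# Toy example: attribute 0 is Wind.
rows = [["Weak"], ["Weak"], ["Weak"], ["Strong"], ["Strong"], ["Strong"]]
play = ["Yes", "Yes", "Yes", "No", "No", "Yes"]
print(round(gain(rows, play, 0), 3))   # 0.459: splitting on Wind reduces entropy
```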
111.
• https://youtu.be/coOTEc-0OGw
• Decision Tree | ID3 Algorithm | Solved Numerical Example: https://youtu.be/fs0wsU2sSPQ
• How to build a Decision Tree for a Boolean Function
115. Gini Index
• The Gini index is calculated by subtracting the sum of the squared probabilities of each class from one:
Gini = 1 – Σi pi²
• It favours larger partitions and is easy to implement, whereas information gain favours smaller partitions with distinct values.
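A one-function sketch of this definition (the labels are illustrative):

```python
def gini(labels):
    """Gini = 1 - sum of squared class probabilities."""
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

print(gini(["Yes"] * 7 + ["No"] * 7))  # 0.5, maximally impure for two classes
print(gini(["Yes"] * 14))              # 0.0, a pure node
```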
116. How to avoid/counter
Overfitting in Decision Trees?
• Building trees that “adapt too much” to the
training examples may lead to “overfitting”.
• Here are two ways to reduce overfitting:
1. Pruning Decision Trees.
2. Random Forest
117. Pruning Decision Trees
• The splitting process results in fully grown
trees until the stopping criteria are reached.
But, the fully grown tree is likely to overfit
the data, leading to poor accuracy on unseen
data.
119. Pruning Decision Trees
• In pruning, you trim off branches of the tree, i.e., remove decision nodes starting from the leaf nodes, such that the overall accuracy is not disturbed. This is done by segregating the actual training set into two sets: a training data set D and a validation data set V. Build the decision tree using the segregated training data set D, and then keep trimming the tree so as to optimize the accuracy on the validation data set V (see the sketch below).
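The slides describe validation-set pruning; as a hedged stand-in, scikit-learn exposes cost-complexity pruning via ccp_alpha, and a validation set V can select the pruning strength (the dataset and parameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# Segregate the data into training set D and validation set V, as above.
X_D, X_V, y_D, y_V = train_test_split(X, y, random_state=0)

# Candidate pruning strengths from the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_D, y_D)

# Keep the pruned tree that scores best on the validation set V.
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_D, y_D)
         for a in path.ccp_alphas]
best = max(trees, key=lambda t: t.score(X_V, y_V))
print(best.get_n_leaves(), round(best.score(X_V, y_V), 3))
```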
120. Pruning Decision Trees
• In the example, the 'Age' attribute on the left-hand side of the tree has been pruned, as it has more importance on the right-hand side of the tree; this removes overfitting.
121. Random Forest
• Random Forest is an example of ensemble learning, in which we combine multiple machine learning models to obtain better predictive performance.
• Why the name “Random”?
• Two key concepts that give it the name random:
1. A random sampling of training data set when building trees.
2. Random subsets of features considered when splitting nodes.
• The random forest algorithm solves the above challenge by combining the
predictions made by multiple decision trees and returning a single output. This
is done using an extension of a technique called bagging, or bootstrap
aggregation.
122. Random Forest
• Bagging is a procedure applied to reduce the variance of machine learning models. It relies on the fact that averaging a set of observations reduces variance.
• https://youtu.be/eM4uJ6XGnSM
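A sketch tying the two random ingredients together, assuming scikit-learn (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees, each grown on a bootstrap sample
    max_features="sqrt",  # random subset of features tried at each split
    bootstrap=True,       # sample the training data with replacement
    random_state=0,
).fit(X, y)

print(forest.predict(X[:3]))  # class decided by majority vote over the trees
```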
124. Bootstrap
• If we had more than one training dataset, we
could train multiple decision trees on each
dataset and average the results.
• However, since we usually only have one
training dataset in most real-world scenarios, a
statistical technique called bootstrap is used to
sample the dataset with replacement.
• Then, multiple decision trees are created, and
each tree is trained on a different data sample:
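A minimal illustration of bootstrap sampling with replacement (NumPy assumed; the data is a stand-in for a real training set):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # stand-in for the single available training dataset

# Three bootstrap samples: drawn with replacement, same size as the original.
for _ in range(3):
    print(rng.choice(data, size=len(data), replace=True))
    # each sample repeats some rows and omits others
```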
126. Aggregation
• In this step, the prediction of each decision tree will be
combined to come up with a single output.
• In the case of a classification problem, a majority class
prediction is made:
127. Why do we randomly sample variables in the
random forest algorithm?
• In the random forest algorithm, it is not only
rows that are randomly sampled, but
variables too.
• This is because if we were to build multiple decision trees with the same features, every tree would be similar and highly correlated, potentially yielding the same result. This would again lead to the issue of high variance.
128. Decision Trees vs. Random Forests - Which
One Is Better and Why?
• Random forests typically perform better than decision trees due
to the following reasons:
• Random forests solve the problem of overfitting because they
combine the output of multiple decision trees to come up with a
final prediction.
• When you build a decision tree, a small change in data leads to a
huge difference in the model’s prediction. With a random forest,
this problem does not arise since the data is sampled many times
before generating a prediction.
• In terms of speed, however, random forests are slower, since more time is taken to construct multiple decision trees. Adding more trees to a random forest model improves its accuracy to a certain extent, but it also increases the computation time.
129. Decision Trees vs. Random Forests - Which One Is
Better and Why?
• Decision trees are also easier to interpret than random forests
since they are straightforward. It is easy to visualize a decision tree
and understand how the algorithm reached its outcome. A
random forest is harder to deconstruct since it is more complex
and combines the output of multiple decision trees to make a
prediction.
130. Example: Random Forest
– Suppose there is a dataset that contains multiple fruit images, and this dataset is given to the Random Forest classifier. The dataset is divided into subsets, one for each decision tree. During the training phase each decision tree produces a prediction result, and when a new data point arrives, the Random Forest classifier predicts the final decision based on the majority of the results.
132. Applications of Random Forest
• Banking: The banking sector mostly uses this algorithm to identify loan risk.
• Medicine: With the help of this algorithm,
disease trends and risks of the disease can be
identified.
• Land Use: We can identify the areas of similar
land use by this algorithm.
• Marketing: Marketing trends can be identified
using this algorithm.
133. Issues in decision trees learning
• determining how deeply to grow the decision tree,
• handling continuous attributes,
• choosing an appropriate attribute selection measure,
• handling training data with missing attribute values,
• handling attributes with differing costs,
• improving computational efficiency.