MACHINE LEARNING (INTEGRATED)
(21ISE62)
Module 3
Dr. Shivashankar
Professor
Department of Information Science & Engineering
GLOBAL ACADEMY OF TECHNOLOGY-Bengaluru
GLOBAL ACADEMY OF TECHNOLOGY
Ideal Homes Township, Rajarajeshwari Nagar, Bengaluru – 560 098
Department of Information Science & Engineering
Course Outcomes
After completion of the course, the student will be able to:
 Illustrate Regression Techniques and Decision Tree Learning
Algorithm.
 Apply SVM, ANN and KNN algorithm to solve appropriate problems.
 Apply Bayesian Techniques and derive effective learning rules.
 Illustrate performance of AI and ML algorithms using evaluation
techniques.
 Understand reinforcement learning and its application in real world
problems.
Text Book:
1. Tom M. Mitchell, Machine Learning, McGraw Hill Education, India Edition, 2013.
2. Ethem Alpaydın, Introduction to Machine Learning, MIT Press, Second Edition.
3. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Pearson, First Impression, 2014.
Module 3: Bayesian Learning: Conditional probability
INTRODUCTION
• Conditional probability is the probability of an event given that a previous result or event has occurred.
• It helps us understand how events are related to each other.
• When the probability of one event happening does not influence the probability of another, the events are called independent; otherwise they are dependent.
• Conditional probability is defined as the probability of an event occurring when another event has already occurred.
• In other words, it is the probability of one event happening given that a certain condition is satisfied.
• It is written P(A | B), meaning the probability of A given that B has already happened.
Cont…
Conditional Probability Formula:
• When the intersection of two events happen, then the formula for conditional
probability for the occurrence of two events is given by;
• P(A|B) = N(A∩B)/N(B) or
• P(B|A) = N(A∩B)/N(A)
• Where P(A|B) represents the probability of occurrence of A given B has occurred.
• N(A ∩ B) is the number of elements common to both A and B.
• N(B) is the number of elements in B, and it cannot be equal to zero.
• Let N represent the total number of elements in the sample space.
• N(A ∩ B)/N can be written as P(A ∩ B) and N(B)/N as P(B).
P(A|B) = P(A ∩ B)/P(B) = [P(B|A) P(A)] / P(B)
• Therefore, P(A ∩ B) = P(B) P(A|B) if P(B) ≠ 0
• = P(A) P(B|A) if P(A) ≠ 0
• Similarly, the probability of occurrence of B when A has already occurred is given by,
• P(B|A) = P(B ∩ A)/P(A)
Cont…
How to Calculate Conditional Probability?
To calculate the conditional probability, we can use the following method:
Step 1: Identify the Events. Let’s call them Event A and Event B.
Step 2: Determine the Probability of Event A i.e., P(A)
Step 3: Determine the Probability of Event B i.e., P(B)
Step 4: Determine the Probability of Event A and B i.e., P(A∩B).
Step 5: Apply the Conditional Probability Formula and calculate the required
probability.
Conditional Probability of Independent Events
For independent events, A and B, the conditional probability of A and B with respect to
each other is given as follows:
P(B|A) = P(B)
P(A|B) = P(A)
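The formula and the steps above can be checked with a few lines of Python; this is a minimal sketch, and the counts used are hypothetical examples, not taken from the slides.

# Conditional probability from counts: P(A|B) = N(A ∩ B) / N(B)
def conditional_probability(n_a_and_b, n_b):
    if n_b == 0:
        raise ValueError("N(B) must be non-zero")
    return n_a_and_b / n_b

# Hypothetical counts: 20 elements in A ∩ B, 40 elements in B
print(conditional_probability(20, 40))   # 0.5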
Cont…
Problem 1: Two dice are thrown simultaneously, and the sum of the numbers obtained is
found to be 7. What is the probability that the number 3 has appeared at least once?
Solution:
• Event A indicates the combination in which 3 has appeared at least once.
• Event B indicates the combination of the numbers which sum up to 7.
• A = {(3, 1), (3, 2), (3, 3)(3, 4)(3, 5)(3, 6)(1, 3)(2, 3)(4, 3)(5, 3)(6, 3)}
• B = {(1, 6)(2, 5)(3, 4)(4, 3)(5, 2)(6, 1)}
• P(A) = 11/36
• P(B) = 6/36
• A ∩ B = {(3, 4), (4, 3)}, so n(A ∩ B) = 2
• P(A ∩ B) = 2/36
• Applying the conditional probability formula we get,
• P(A|B) = P(A∩B)/P(B) = (2/36)/(6/36) = ⅓
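This result can also be verified by brute-force enumeration of the 36 outcomes; a small Python sketch:

from itertools import product

# B = "the two dice sum to 7"; A = "a 3 appears at least once"
space = list(product(range(1, 7), repeat=2))    # all 36 outcomes
B = [o for o in space if sum(o) == 7]           # 6 outcomes
A_and_B = [o for o in B if 3 in o]              # {(3, 4), (4, 3)}
print(len(A_and_B) / len(B))                    # 0.333... = 1/3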
Cont…
Problem 2: In a group of 100 computer buyers, 40 bought a CPU, 30 purchased a monitor,
and 20 purchased both a CPU and a monitor. If a computer buyer is chosen at random and has bought a
CPU, what is the probability that they also bought a monitor?
Solution:
As per the first event, 40 out of 100 bought CPU,
So, P(A) = 40% or 0.4
Now, according to the question, 20 buyers purchased both CPU and monitors. So, this is
the intersection of the happening of two events. Hence,
P(A∩B) = 20% or 0.2
The conditional probability is
P(B|A) = P(A∩B)/P(A)
P(B|A) = 0.2/0.4 = 2/4 = ½ = 0.5
The probability that a buyer bought a monitor, given that they purchased a CPU, is 50%.
Cont…
Question 7: In a survey among a group of students, 70% play football, 60% play
basketball, and 40% play both sports. If a student is chosen at random and it is
known that the student plays basketball, what is the probability that the
student also plays football?
Solution:
Let’s assume there are 100 students in the survey.
Number of students who play football = n(A) = 70
Number of students who play basketball = n(B) = 60
Number of students who play both sports = n(A ∩ B) = 40
To find the probability that a student plays football given that they play
basketball, we use the conditional probability formula:
P(A|B) = n(A ∩ B) / n(B)
Substituting the values, we get:
P(A|B) = 40 / 60 = 2/3
Therefore, probability that a randomly chosen student who plays basketball also
plays football is 2/3.
BAYES THEOREM
• Bayes’ theorem describes the probability of occurrence of an event related to any
condition.
• Bayes’ Theorem is used to determine the conditional probability of an event.
• Bayesian methods provide the basis for probabilistic learning methods that
accommodate knowledge about the prior probabilities of alternative hypotheses.
• To define Bayes theorem precisely:
• P(h) to denote the initial probability that hypothesis h holds.
• P(h) is often called the prior probability of h and may reflect any background
knowledge we have about the chance that h is a correct hypothesis.
• P(D) to denote the prior probability that training data D will be observed (i.e., the
probability of D given no knowledge about which hypothesis holds).
• P(D|h) to denote the probability of observing data D given some world in which
hypothesis h holds.
• P (h|D) is called the posterior-probability of h, because it reflects our confidence that h
holds.
• Notice the posterior probability P(h|D) reflects the influence of the training data D, in
contrast to the prior probability P(h) , which is independent of D.
BAYES THEOREM
• If A and B are two events, then the formula for the Bayes theorem is given by:
• P(A|B) = [P(B|A) × P(A)] / P(B)
• where P(A|B) is the probability of event A occurring given that event B has already occurred.
P(A) – Probability of event A
P(B) – Probability of event B
P(A|B) – Probability of A given B
P(B|A) – Probability of B given A
From the definition of conditional probability, Bayes theorem can be derived for events as given
below:
P(A|B) = P(A ⋂ B)/ P(B), where P(B) ≠ 0
P(B|A) = P(B ⋂ A)/ P(A), where P(A) ≠ 0
• Since P(A ∩ B) and P(B ∩ A) are equal,
• P(A|B) × P(B) = P(B|A) × P(A)
∴ P(A|B) = [P(B|A) × P(A)] / P(B), which is Bayes' theorem.
In this formula, P(B|A) is the likelihood, P(A|B) is the posterior probability, P(B) is the marginal probability (evidence), and P(A) is the prior probability.
Cont…
Problem 1: A patient takes a lab test for cancer diagnosis. There are two possible
outcomes: ⊕ (positive) and ⊖ (negative). The test returns a correct positive result in only 98%
of the cases in which the disease is actually present, and a correct negative result in only 97%
of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.
Compute the following values:
1) P(cancer)  2) P(¬cancer)  3) P(⊕|cancer)  4) P(⊖|cancer)  5) P(⊕|¬cancer)  6) P(⊖|¬cancer)
Solution:
• P(cancer) = 0.008, P(¬cancer) = 0.992
• P(⊕|cancer) = 0.98, P(⊖|cancer) = 0.02
• P(⊕|¬cancer) = 0.03, P(⊖|¬cancer) = 0.97, and
• P(⊕|cancer) P(cancer) = 0.98 × 0.008 = 0.0078
• P(⊕|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298
• Thus, h_MAP = ¬cancer.
• The exact posterior probabilities can be determined by normalizing the above
quantities so that they sum to 1, e.g. P(cancer|⊕) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21
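The same MAP comparison and normalization can be written as a short Python sketch using the probabilities given above:

# Bayes-rule comparison for the cancer test
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

num_cancer = p_pos_given_cancer * p_cancer          # ≈ 0.0078
num_not = p_pos_given_not * p_not_cancer            # ≈ 0.0298

print("h_MAP:", "cancer" if num_cancer > num_not else "not cancer")   # not cancer
print("P(cancer | +):", num_cancer / (num_cancer + num_not))          # ≈ 0.21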
Cont…
Problem 3: Using the table below, given that a person passed the exam, what is the probability that the person is a woman?
Answer:
P(passed | woman): probability that a woman passes the exam = 92/100 = 0.92
P(woman): probability that a randomly chosen person is a woman = 100/200 = 0.5
P(passed): probability of passing the exam = 169/200 = 0.845
P(woman | passed) = (0.92 × 0.5) / 0.845 ≈ 0.54
Check: 92/169 ≈ 0.54 as well.
           Did not pass the exam   Passed the exam   Total
Women                8                    92           100
Men                 23                    77           100
Total               31                   169           200
Cont…
Problem 4: Covid-19 has taken over the world, and Covid-19 tests remain relevant to block
the spread of the virus and protect our families.
Suppose the Covid-19 infection rate is 10% of the population, and the test available in
Algeria detects 95% of infected people (true positives) with a 5% false positive rate.
What is the probability that a person is really infected given that they test positive?
Solution :
Parameters:
• P(A) = 0.10 (infected)
• P(B|A) = 0.95 (test positive given infected)
• P(B|¬A) = 0.05 (false positive given not infected)
• P(¬A) = 0.90 (not infected)
• Multiply the probability of infection (10%) by the probability of testing positive given
infection (95%), then divide by the total probability of a positive test: the infected term
plus the not-infected term (90%) multiplied by the false positive rate (5%).
P(A|B) = P(A) P(B|A) / [P(A) P(B|A) + P(¬A) P(B|¬A)]
P(A|B) = (0.1 × 0.95) / [(0.95 × 0.1) + (0.05 × 0.90)]
P(A|B) = 0.095 / (0.095 + 0.045)
P(A|B) ≈ 0.68
Cont…
2. Let A denote the event that a “patient has liver disease”, and B the event that a “patient
is an alcoholic”. It is known from experience that 10% of the patients entering the clinic
have liver disease and 5% of the patients are alcoholics.
Also, among those patients diagnosed with liver disease, 7% are alcoholic. Given that a
patient is alcoholic, what is the probability that he will have liver disease?
Solution:
A-”patient has liver disease”.
B-”patient is an alcoholic”.
P(A)=10%=0.1
P(B)=5%=0.05
P(B|A)=7%=0.07
P(A|B) = [P(B|A) × P(A)] / P(B) = (0.07 × 0.10) / 0.05 = 0.14
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR
HYPOTHESES
• Many learning approaches such as neural network learning, linear regression, and
polynomial curve fitting try to learn a continuous valued target function.
• Under certain assumptions any learning algorithm that minimizes the squared error
between output hypothesis predictions and the training data will output a MAXIMUM
LIKELIHOOD HYPOTHESIS.
• The significance of this result is that it provides a Bayesian justification (under certain
assumptions) for many neural network and other curve fitting methods that attempt to
minimize the sum of squared errors over the training data.
• To find the maximum likelihood hypothesis in Bayesian learning for a continuous-valued
target function, we start from the maximum likelihood hypothesis definition, using
lower-case p to refer to the probability density function:
h_ML = argmax_{h∈H} p(D|h)
(argmax returns the argument h ∈ H that gives the maximum value of the expression.)
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR
HYPOTHESES
• h_ML = argmax_{h∈H} p(D|h), where p is a probability density function.
• Assume a fixed set of training instances (x1, x2, x3, ..., xm) and let D = (d1, d2, ..., dm) be the corresponding sequence of target values.
• Assuming the training examples are mutually independent given h, p(D|h) is the product of the p(di|h):
  h_ML = argmax_{h∈H} Π_{i=1..m} p(di|h)
• Assume the target values are normally distributed around the true target value, with density
  f(x|μ, σ²) = (1/√(2πσ²)) · e^(−(x−μ)²/(2σ²))
• Substituting this density, with the mean μ given by the hypothesis output h(xi):
  h_ML = argmax_{h∈H} Π_{i=1..m} (1/√(2πσ²)) · e^(−(1/(2σ²))(di − μ)²)
  h_ML = argmax_{h∈H} Π_{i=1..m} (1/√(2πσ²)) · e^(−(1/(2σ²))(di − h(xi))²)
(Here μ is the mean, σ the standard deviation, σ² the variance, di the target value of the i-th training example, and h(xi) the hypothesis output for the i-th input.)
Conti..
Rather than maximizing the expression above, we choose to maximize its (less
complicated) logarithm. This is justified because ln p is a monotonic function of p, so maximizing ln p also
maximizes p.
h_ML = argmax_{h∈H} Σ_{i=1..m} [ ln(1/√(2πσ²)) − (1/(2σ²))(di − h(xi))² ]
The first term is a constant independent of h, so it can be discarded:
h_ML = argmax_{h∈H} Σ_{i=1..m} −(1/(2σ²))(di − h(xi))²
Maximizing this negative quantity is equivalent to minimizing the corresponding positive
quantity:
h_ML = argmin_{h∈H} Σ_{i=1..m} (1/(2σ²))(di − h(xi))²
Finally, we can again discard constants that are independent of h:
h_ML = argmin_{h∈H} Σ_{i=1..m} (di − h(xi))²
Least-squared error hypothesis: in Bayesian learning, the maximum likelihood hypothesis
for a continuous-valued target function is the one that minimizes the sum of squared errors.
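As a small illustration of this conclusion, the sketch below fits a linear hypothesis by minimizing the sum of squared errors; the data set and noise level are hypothetical, chosen only for the example.

import numpy as np

# Hypothetical data: d = 2x + 1 plus Gaussian noise on the targets
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
d = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

# h_ML for a linear hypothesis h(x) = w0 + w1*x is the least-squares fit
w1, w0 = np.polyfit(x, d, 1)                 # returns [slope, intercept]
sse = np.sum((d - (w0 + w1 * x)) ** 2)       # the quantity h_ML minimizes
print(w0, w1, sse)                           # w0 ≈ 1, w1 ≈ 2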
NAIVE BAYES CLASSIFIER
• Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem (helps to determine the likelihood that one event will occur
with unclear information while another has already happened) and used for
solving classification problems.
• It is mainly used in text classification that includes a high-dimensional training
dataset.
• Naïve Bayes Classifier is one of the simple and most effective Classification
algorithms which helps in building the fast machine learning models that can
make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
• Naïve: It assumes that the occurrence of a certain feature is independent of
the occurrence of other features.
• Example: If a fruit is identified on the basis of color, shape, and taste, then a
red, spherical, and sweet fruit is recognized as an apple. Each feature
individually contributes to identifying it as an apple, without depending on
the other features.
Conti..
Naïve Bayes algorithm: it is a way to calculate P(A|B) from knowledge of P(B|A).
Working of the Naïve Bayes classifier:
 Convert the given dataset into frequency tables (count how often each attribute value
occurs with each class).
 Generate a likelihood table by computing the probabilities of the given features for each
class.
 Now, use Bayes theorem to calculate the posterior probability.
Steps to implement:
• Step 1: Data Pre-processing step
• Step 2: Fitting Naive Bayes to the Training set
• Step 3: Predicting the test result
• Step 4: Test accuracy of the result
• Step 5: Visualizing the test set result.
NAIVE BAYES CLASSIFIER
• One highly practical Bayesian learning method is the naive Bayes learner, often called
the Naive Bayes classifier.
• The naive Bayes classifier applies to learning tasks where each instance x is described
by a conjunction of attribute values and where the target function f (x) can take on any
value from some finite set V.
• A set of training examples of the target function is provided, and a new instance is
presented, described by the tuple of attribute values 𝑎1, 𝑎2, 𝑎3, … , 𝑎𝑛 .
• The learner is asked to predict the target value, or classification, for this new instance.
• The naive Bayes classifier is based on the simplifying assumption that the attribute
values are conditionally independent given the target value.
• Naive Bayes classifier:
  v_NB = argmax_{vj ∈ V} P(vj) Π_i P(ai | vj)
• where 𝑉𝑁𝐵 denotes the target value output by the Naive Bayes classifier.
• Notice that in a naive Bayes classifier the number of distinct 𝑃(𝑎𝑖|𝑣𝑗) terms that must
be estimated from the training data is just the number of distinct attribute values times
the number of distinct target values-a much smaller number than if we were to
estimate the P(𝑎1, 𝑎2, …, 𝑎𝑛 |𝑣𝑗) terms as first contemplated.
NAÏVE BAYES CLASSIFIER
From Bayes' theorem:
P(A|B) = [P(B|A) × P(A)] / P(B)
Data set: X = {x1, x2, ..., xn} are the input features used to compute the output y.
With multiple features, each record has the form
f1, f2, f3, y
x1, x2, x3, y1 --- (record 1)
x1, x2, x3, y2 --- (record 2)
For this kind of data set, Bayes theorem becomes
P(y | x1, x2, ..., xn) = [P(x1|y) · P(x2|y) · ... · P(xn|y) · P(y)] / [P(x1) P(x2) ... P(xn)]
                       = [P(y) · Π_{i=1..n} P(xi|y)] / [P(x1) P(x2) ... P(xn)]
Since the denominator is the same for every class,
P(y | x1, x2, ..., xn) ∝ P(y) · Π_{i=1..n} P(xi|y)
y = argmax_y [ P(y) · Π_{i=1..n} P(xi|y) ]
Problem 1: Calculate whether Play = Yes for TODAY, where TODAY = (Outlook = Sunny, Temperature = Hot), using the frequency tables below.
Solution: The naive Bayes classifier is defined by
v_NB = argmax_{vj ∈ {yes, no}} P(vj) Π_i P(ai|vj)
     = argmax_{vj ∈ {yes, no}} P(vj) · P(Outlook = Sunny | vj) · P(Temperature = Hot | vj)
v_NB(Yes) = P(Yes | Today) = P(Today | Yes) · P(Yes) / P(Today)
          = P(Sunny | Yes) · P(Hot | Yes) · P(Yes) / P(Today)   (P(Today) is the same for every class, so it is dropped)
          = 2/9 × 2/9 × 9/14 = 0.031
v_NB(No) = P(No | Today) = P(Sunny | No) · P(Hot | No) · P(No) / P(Today)
         = 3/5 × 2/5 × 5/14 = 0.0857
Normalizing so the two values sum to one:
v_NB(Yes) = 0.031 / (0.031 + 0.0857) ≈ 0.27
v_NB(No) = 0.0857 / (0.031 + 0.0857) ≈ 0.73
∴ v_NB(No) is higher, so for TODAY (Sunny, Hot) the prediction is Play = No.
Outlook      Yes   No   P(·|Yes)   P(·|No)
Sunny         2     3     2/9        3/5
Overcast      4     0     4/9        0/5
Rainy         3     2     3/9        2/5
Total         9     5     100%       100%

Temperature  Yes   No   P(·|Yes)   P(·|No)
Hot           2     2     2/9        2/5
Mild          4     2     4/9        2/5
Cool          3     1     3/9        1/5
Total         9     5     100%       100%
Problem 1: Apply the naive Bayes classifier to a concept learning problem: classify days
according to whether someone will play tennis, for the new instance {Outlook = Sunny,
Temperature = Cool, Humidity = High, Wind = Strong}.
Day Outlook Temperature Humidity Wind Play_Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Cont…
Solution:
{Outlook=sunny, temperature=cool, Humidity=high, Wind=strong}
P(Play Tennis=yes)=9/14=0.6428
P(Play Tennis=No)=5/14=0.3571
v_NB = argmax_{vj ∈ {yes, no}} P(vj) Π_i P(ai|vj)
     = argmax_{vj ∈ {yes, no}} P(vj) · P(Outlook = Sunny | vj) · P(Temperature = Cool | vj) ·
       P(Humidity = High | vj) · P(Wind = Strong | vj)
v_NB(Yes) = P(Sunny|Yes) · P(Cool|Yes) · P(High|Yes) · P(Strong|Yes) · P(Yes)
          = 2/9 × 3/9 × 3/9 × 3/9 × 0.6428 = 0.0053
v_NB(No) = P(Sunny|No) · P(Cool|No) · P(High|No) · P(Strong|No) · P(No)
         = 3/5 × 1/5 × 4/5 × 3/5 × 0.3571 = 0.0206
Normalizing:
v_NB(Yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
v_NB(No) = 0.0206 / (0.0053 + 0.0206) = 0.795
Therefore, 𝑉𝑁𝐵(No)= 0.795 > 0.205, Play Tennis: No
Outlook     Yes   No        Temperature  Yes   No
Sunny       2/9   3/5       Hot          2/9   2/5
Overcast    4/9   0         Mild         4/9   2/5
Rain        3/9   2/5       Cool         3/9   1/5

Humidity    Yes   No        Wind         Yes   No
High        3/9   4/5       Strong       3/9   3/5
Normal      6/9   1/5       Weak         6/9   2/5
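The tally-and-multiply computation used in this example can be written directly in Python; this is a minimal sketch of the unsmoothed counting scheme (no m-estimate), with the PlayTennis table hard-coded.

from collections import Counter, defaultdict

data = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis) for D1..D14
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

class_count = Counter(row[-1] for row in data)       # {'Yes': 9, 'No': 5}
value_count = defaultdict(Counter)                   # (attribute index, class) -> value counts
for row in data:
    for i, value in enumerate(row[:-1]):
        value_count[(i, row[-1])][value] += 1

def v_nb(instance):
    scores = {}
    for v in class_count:                            # P(v) * prod_i P(a_i | v)
        p = class_count[v] / len(data)
        for i, a in enumerate(instance):
            p *= value_count[(i, v)][a] / class_count[v]
        scores[v] = p
    return scores

print(v_nb(("Sunny", "Cool", "High", "Strong")))     # Yes ≈ 0.0053, No ≈ 0.0206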
Cont…
Problem 2: Estimate the conditional probabilities of each attributes {color, legs, height,
smelly} for the species classes {M,H} using the data set given in the table. Using these
probabilities estimate the probability values for the new instance {color=green, legs=2,
height=tall and smelly=No}.
No Color Legs Height Smelly Species
1 White 3 Short Yes M
2 Green 2 Tall No M
3 Green 3 Short Yes M
4 White 3 Short Yes M
5 Green 2 Short No H
6 White 2 Tall No H
7 White 2 Tall No H
8 White 2 Short Yes H
Cont…
Solution : {color=green, legs=2, height=tall and smelly=No},
P(M)=4/8=0.5, P(H)=4/8=0.5
P(M | new instance) = P(M) · P(Color=Green|M) · P(Legs=2|M) · P(Height=Tall|M) · P(Smelly=No|M)
                    = 0.5 × 2/4 × 1/4 × 1/4 × 1/4 = 0.0039
P(H | new instance) = P(H) · P(Color=Green|H) · P(Legs=2|H) · P(Height=Tall|H) · P(Smelly=No|H)
                    = 0.5 × 1/4 × 4/4 × 2/4 × 3/4 = 0.0469
Since P(H/New instance) > P(M/New instance)
Hence the new instance {color=green, legs=2, height=tall and smelly=No} belongs to H
Color M H
White 2/4 3/4
Green 2/4 1/4
Legs M H
2 1/4 4/4
3 3/4 0/4
Height M H
Short 3/4 2/4
Tall 1/4 2/4
Smelly M H
Yes 3/4 1/4
No 1/4 3/4
Cont…
1. Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|-), P(B|-) and P(C|-)
2. Use the estimated conditional probabilities to predict the class label for a test
sample (A=0, B=1, C=0), using the Naïve Bayes approach.
3. Estimate the conditional probabilities using the m-estimate approach with P=1/2 and m=4.
solution:
1. Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+)
P(A|-), P(B|-), P(C|-)
P(A=0|-)= 3/5=0.6
P(A=0|+)= 2/5=0.4
P(B=0|-)= 3/5=0.6
P(B=0|+)= 4/5=0.8
P(C=0|-)= 0/5=0.0
P(C=0|+)= 3/5=0.6
2. Classify the new instance (A=0, B=1, C=0):
P(Ci | x1, x2, ..., xn) = P(x1, x2, ..., xn | Ci) · P(Ci) / P(x1, x2, ..., xn)
P(+ | A=0, B=1, C=0) = P(A=0|+) · P(B=1|+) · P(C=0|+) · P(+) / P(A=0, B=1, C=0)
                     = (0.4 × 0.2 × 0.6 × 0.5) / K = 0.024 / K
Record A B C Class
1 0 0 0 +
2 0 0 1 -
3 0 1 1 -
4 0 1 1 -
5 0 0 0 +
6 1 0 0 +
7 1 0 1 -
8 1 0 1 -
9 1 1 1 +
10 1 0 1 +
Cont…
P(− | A=0, B=1, C=0) = P(A=0|−) · P(B=1|−) · P(C=0|−) · P(−) / P(A=0, B=1, C=0) = 0 / K
The class label should be + since 0.024/K > 0/K.
3. Estimate the conditional probabilities using the m-estimate approach
with p = 1/2 and m = 4.
The conditional probability using the m-estimate is
Prob(A|B) = (nc + m·p) / (n + m)
where nc is the number of times A and B occurred together, and
n is the number of times B occurred in the training data.
P(A=0|+) = (2 + 2)/(5 + 4) = 4/9      P(A=0|−) = (3 + 2)/(5 + 4) = 5/9
P(B=1|+) = (1 + 2)/(5 + 4) = 3/9      P(B=1|−) = (2 + 2)/(5 + 4) = 4/9
P(C=0|+) = (3 + 2)/(5 + 4) = 5/9      P(C=0|−) = (0 + 2)/(5 + 4) = 2/9
P(A=1|−) = 0.4     P(A=0|−) = 0.6
P(A=1|+) = 0.6     P(A=0|+) = 0.4
P(B=1|−) = 0.4     P(B=0|−) = 0.6
P(B=1|+) = 0.2     P(B=0|+) = 0.8
P(C=1|−) = 1.0     P(C=0|−) = 0
P(C=1|+) = 0.4     P(C=0|+) = 0.6
Cont…
Problem 3: Classify the new instance (A=0, B=1, C=0) using the m-estimate approach with
p = 1/2 and m = 4.
P(+ | A=0, B=1, C=0) = P(A=0|+) · P(B=1|+) · P(C=0|+) · P(+) / P(A=0, B=1, C=0)
                     = (4/9 × 3/9 × 5/9 × 0.5) / K = 0.0412 / K
P(− | A=0, B=1, C=0) = P(A=0|−) · P(B=1|−) · P(C=0|−) · P(−) / P(A=0, B=1, C=0)
                     = (5/9 × 4/9 × 2/9 × 0.5) / K = 0.0274 / K
The class label should be + since 0.0412/K > 0.0274/K.
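The m-estimate computation above translates into a few lines of Python (a sketch with p = 1/2 and m = 4, using the counts from the table):

def m_estimate(n_c, n, p=0.5, m=4):
    """m-estimate of a conditional probability: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Counts from the 10-record table (n = 5 examples per class)
p_pos = 0.5 * m_estimate(2, 5) * m_estimate(1, 5) * m_estimate(3, 5)  # P(A=0|+) P(B=1|+) P(C=0|+) P(+)
p_neg = 0.5 * m_estimate(3, 5) * m_estimate(2, 5) * m_estimate(0, 5)  # P(A=0|-) P(B=1|-) P(C=0|-) P(-)
print(round(p_pos, 4), round(p_neg, 4))      # ≈ 0.0412 and ≈ 0.0274
print("+" if p_pos > p_neg else "-")         # +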
Artificial Neural Networks
• The motivation behind neural networks is the human brain, often called the best
processor even though it works more slowly than modern computers.
• Human brain cells, called neurons, form a complex, highly interconnected network and
send electrical signals to each other to help humans process information.
• Similarly, an artificial neural network is made of artificial neurons that work together to
solve real-world problems.
• Artificial neurons are software modules, called nodes, and artificial neural networks
are software programs or algorithms that, at their core, use computing systems to
perform mathematical calculations.
Fig 3.3: Artificial Neural Networks
Conti..
Input Layer
• This is the first layer in a typical neural network.
• Input layer neurons receive information from the outside world, process it through a
mathematical (activation) function, and transmit output to the next layer's neurons based on a
comparison with a preset threshold value.
• We pre-process text, image, audio, video, and other types of data to derive their numeric representation.
Hidden Layer
• Hidden layers take their input from the input layer or from other hidden layers; a network can have a
large number of hidden layers. Each hidden unit contains a summation and an activation function.
• Each hidden layer analyzes the output from the previous layer, processes it further, and passes it on to
the next layer. Here too, the data is multiplied by edge weights as it is transmitted to the next layer.
Output Layer
• The output layer gives the final result of all the data processing by the artificial neural network. It can
have single or multiple nodes.
• For instance, if we have a binary (yes/no) classification problem, the output layer will have one output
node, which will give the result as 1 or 0.
• However, if we have a multi-class classification problem, the output layer might consist of more than one
output node.
Conti..
• It is usually a computational network based on biological neural networks that
construct the structure of the human brain.
• Similar to a human brain has neurons interconnected to each other, artificial
neural networks also have neurons that are linked to each other in various
layers of the networks.
• These neurons are known as nodes.
• Artificial neural networks (ANNs) provide a general, practical method for
learning real-valued, discrete-valued, and vector-valued functions from
examples.
• ANN learning is robust to errors in the training data and has been successfully
applied to problems such as interpreting visual scenes, speech recognition,
and learning robot control strategies.
• The fastest neuron switching times are known to be on the order of 10^-3
seconds, quite slow compared to computer switching speeds of 10^-10
seconds.
Biological Motivation
• The term "Artificial Neural Network(ANN)" refers to a biologically inspired sub-field of
artificial intelligence modeled after the brain.
• ANNs have been inspired by biological learning systems, which are made up of a
complex web of interconnected neurons.
• An ANN is likewise built from interconnected artificial neurons, analogous to biological neurons.
• Each neuron is capable of taking a number of inputs and producing an output.
• One motivation for ANNs is that the brain carries out tasks such as recognition through many
parallel processes.
Consider the human brain:
• Number of neurons ~ 10^11
• Connections per neuron ~ 10^4 to 10^5
• Neuron switching time ~ 10^-3 seconds (0.001)
• Computer switching time ~ 10^-10 seconds
• Scene recognition time ~ 10^-1 seconds (0.1)
NEURAL NETWORK REPRESENTATIONS
• In an artificial neural network, a neuron is a logistic unit:
 it is fed inputs via input wires,
 the logistic unit does its computation, and
 it sends the result down its output wires.
• That logistic computation is just like our earlier logistic regression hypothesis
calculation.
Example (ALVINN, an autonomous vehicle steering network):
• Input: a 30 × 32 grid of pixel intensities from a forward-facing camera.
• Output: the direction in which the vehicle is steered.
• Training: observing the steering commands of a human driving the vehicle.
• 960 inputs feed 30 output units; the output unit with the strongest activation gives the steering command recommended most.
• The ALVINN network is a directed acyclic graph.
PERCEPTRONS
• One type of ANN system is based on a unit called a perceptron.
• A perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs,
then outputs 1 if the result is greater than some threshold and −1 otherwise.
• More precisely, given inputs x1 through xn, the output o(x1, ..., xn) computed by the perceptron is
  o(x1, ..., xn) = 1 if w0 + w1·x1 + w2·x2 + ... + wn·xn > 0, and −1 otherwise,
• where each wi is a real-valued constant, or weight, that determines the contribution of input xi to
the perceptron output.
• We will sometimes write the perceptron function as
  o(x⃗) = sgn(w⃗ · x⃗), where sgn(y) = 1 if y > 0 and −1 otherwise.
• Learning a perceptron involves choosing values for the weights w0, ..., wn. Therefore, the space H
of candidate hypotheses considered in perceptron learning is the set of all possible real-valued
weight vectors:
  H = { w⃗ | w⃗ ∈ ℝ^(n+1) }
Representational Power of Perceptron
• We can view the perceptron as representing a hyperplane decision surface in the n-
dimensional space of instances (i.e., points).
• The perceptron outputs 1 for instances lying on one side of the hyperplane and
−1 for instances lying on the other side.
• The equation of this decision hyperplane is w⃗ · x⃗ = 0.
• Of course, some sets of positive and negative examples cannot be separated by any
hyperplane.
• Those that can be separated are called linearly separable sets of examples.
• A single perceptron can be used to represent many boolean functions.
Cont…
• AND and OR can be viewed as special cases of m-of-n functions: that is, functions
where at least m of the n inputs to the perceptron must be true.
• The OR function corresponds to m = 1 and the AND function to m = n.
• Any m-of-n function is easily represented using a perceptron by setting all input
weights to the same value (e.g., 0.5) and then setting the threshold t accordingly.
• Perceptron can represent all of the primitive boolean functions AND, OR, NAND (¬
AND), and NOR (¬ OR).
• The ability of perceptron to represent AND, OR, NAND, and NOR is important because
every boolean function can be represented by some network of interconnected units
based on these primitives.
The Perceptron Training Rule
• The learning problem is to determine a weight vector that causes the perceptron to
produce the correct output for each of the given training examples.
• One way to learn an acceptable weight vector is to begin with random weights, then
iteratively apply the perceptron to each training example, modifying the perceptron
weights whenever it misclassifies an example.
• This process is repeated, iterating through the training examples as many times as
needed until the perceptron classifies all training examples correctly.
• At every step of feeding a training example, when the perceptron fails to produce the
correct +1/-1, we revise every weight 𝑤𝑖 associated with every input 𝑥𝑖, according to
the following rule:
w_i ← w_i + Δw_i
where Δw_i = η (t − o) x_i,
t is the target output for the current training example,
o is the output generated by the perceptron, and
η is a positive constant called the learning rate. The role of the learning rate is to moderate
the degree to which weights are changed at each step; Δw_i is the resulting update (step) applied to weight w_i.
The Perceptron Training Rule
• In order to train the perceptron f(x, W):
  w_i ← w_i + Δw_i, where Δw_i = η (t − o) x_i
 Initialize the weight, W, randomly.
 For as many times as necessary:
For each training examples x𝝐𝑿
 Compute f(x,W)
 If x is misclassified:
Modify the weight, 𝑤𝑖 associated with every 𝑥𝑖 in x.
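A minimal Python sketch of this loop is shown below; the ±1 target encoding, the zero starting weights, and η = 0.1 are illustrative choices, and each x includes a constant 1 so that w[0] plays the role of w0.

def predict(w, x):
    # sgn(w . x): output 1 if the weighted sum is positive, -1 otherwise
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train_perceptron(examples, eta=0.1, epochs=100):
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        updated = False
        for x, t in examples:
            o = predict(w, x)
            if o != t:                                   # misclassified example
                w = [wi + eta * (t - o) * xi             # w_i <- w_i + eta*(t - o)*x_i
                     for wi, xi in zip(w, x)]
                updated = True
        if not updated:                                  # all examples classified correctly
            break
    return w

# AND function with x = [1, x1, x2] and targets in {+1, -1}
and_examples = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
print(train_perceptron(and_examples))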
Problem
• Problem 6: Compute the AND gate using the single perceptron training rule.
• Solution: the unit outputs
  Y = 1 if w·x + b > 0, and 0 if w·x + b ≤ 0
• Assume w1 = 1, w2 = 1 and bias b = −1.
• Perceptron training rule: y = w1·x1 + w2·x2 + b
• x1 = 0, x2 = 0: 0 + 0 − 1 = −1 → y = 0
  x1 = 0, x2 = 1: 0 + 1 − 1 = 0 → y = 0
  x1 = 1, x2 = 0: 1 + 0 − 1 = 0 → y = 0
  x1 = 1, x2 = 1: 1 + 1 − 1 = 1 → y = 1
• All four outputs match the AND targets, so no weight update is needed.
A   B   Y = A·B
0   0   0
0   1   0
1   0   0
1   1   1
(Perceptron: inputs x1, x2 with weights w1 = w2 = 1 and bias b = −1 implement Y = A·B.)
Problems
• Problem 7: Compute the OR gate using the single perceptron training rule.
• Solution:
  Y = 1 if w·x + b > 0, and 0 if w·x + b ≤ 0
• Assume w1 = 1, w2 = 1 and bias b = −1.
• Perceptron training rule: y = w1·x1 + w2·x2 + b
• x1 = 0, x2 = 0: 0 + 0 − 1 = −1 → y = 0 (target 0, correct)
  x1 = 0, x2 = 1: 0 + 1 − 1 = 0 → y = 0
But the output is 0 and the target is 1: misclassification, so let us change w2 to 2.
Then, with w1 = 1, w2 = 2, b = −1:
  (0,0): y = 0 + 0 − 1 = −1 → 0
  (0,1): y = 1·0 + 2·1 − 1 = 1 → 1
  (1,0): y = 1·1 + 2·0 − 1 = 0 → 0, but the target is 1:
misclassification, so let us change w1 to 2.
With w1 = 2, w2 = 2, b = −1:
  (0,0): y = 0 + 0 − 1 = −1 → 0
  (0,1): y = 2·0 + 2·1 − 1 = 1 → 1
  (1,0): y = 2·1 + 2·0 − 1 = 1 → 1
  (1,1): y = 2·1 + 2·1 − 1 = 3 → 1
A   B   Y = A + B
0   0   0
0   1   1
1   0   1
1   1   1
(Final perceptron: inputs x1, x2 with weights w1 = w2 = 2 and bias b = −1 implement Y = A + B.)
Problems
• Problem 7: Compute the NAND gate using the single perceptron training rule.
• Solution:
• Assume w1 = 1, w2 = 1 and bias b = −1.
• If x1 = 0, x2 = 0: 0 + 0 − 1 = −1 → output 0, but the target is 1: misclassification.
• Change w1 = 1, w2 = 1 and bias b = 1:
  (0,0): y = 1 √   (0,1): y = 2 √   (1,0): y = 2 √   (1,1): y = 3 X (target is 0)
• Change w1 = −1, w2 = −1 and bias b = 2:
  (0,0): y = 2 √   (0,1): y = 1 √   (1,0): y = 1 √   (1,1): y = 0 √
A   B   Y = (A·B)′ (NAND)
0   0   1
0   1   1
1   0   1
1   1   0
(Final perceptron: weights w1 = w2 = −1 and bias b = 2 implement the NAND function.)
Problem
• Problem 6: Compute the NOR gate using the single perceptron training rule.
• Solution:
• Assume w1 = −1, w2 = −1 and bias b = 1.
• Perceptron training rule: y = w1·x1 + w2·x2 + b
• (0,0): 0 + 0 + 1 = 1 → output 1 (target 1)
  (0,1): 0 − 1 + 1 = 0 → output 0
  (1,0): −1 + 0 + 1 = 0 → output 0
  (1,1): −1 − 1 + 1 = −1 → output 0
• All four outputs already match the NOR targets.
A   B   Y = (A + B)′ (NOR)
0   0   1
0   1   0
1   0   0
1   1   0
(Perceptron: weights w1 = w2 = −1 and bias b = 1 implement the NOR function.)
Problem
• Problem 8: Compute the NOT gate using the single perceptron training rule.
• Solution:
• y = o = w·x + b, with w = 1 and b = −1.
• For x = 0: y = 1·0 − 1 = −1 → output 0, but the target is 1: misclassification. Change b to 1
(changing w alone does not help here, since x = 0).
• With w = 1, b = 1: if x = 0, y = 0 + 1 = 1; output and target match.
• If x = 1, y = w·x + b = 1 + 1 = 2 → output 1, but the target is 0: misclassification.
• So change w = −1 and b = 1:
  x = 0: y = −1·0 + 1 = 1 √
  x = 1: y = −1·1 + 1 = 0 √ ; both outputs now match the targets.
(Final perceptron: a single input x with weight w = −1 and bias b = +1 implements the NOT function.)
Problem
Problem 1: Assume 𝑤1 = 0.6 𝑎𝑛𝑑 𝑤2 = 0.6, 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 = 1 𝑎𝑛𝑑 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔 𝑟𝑎𝑡𝑒 ƞ=0.5.
Compute OR gate using perceptron training rule.
Solution : 1. A=0, B=0 and target=0
𝑤𝑖𝑥𝑖 = 𝑤1𝑥1+𝑤2𝑥2
=0.6*0+0.6*0=0
This is not greater than the threshold value of 1.
So the output =0
2. A=0, B=1 and target =1
𝑤𝑖𝑥𝑖 = 0.6 ∗ 0 +0.6*1= 0.6
This is not greater than the threshold value of 1. So the output =0.
𝑤𝑖=𝑤𝑖+ƞ(t-o) 𝑥𝑖
𝑤1=0.6+0.5(1-0)0=0.6
𝑤2=0.6+0.5(1-0)1=1.1
Now 𝒘𝟏=0.6, 𝒘𝟐=1.1, threshold = 1 and learning rate ƞ=0.5
A B Y=A+B
(Target)
0 0 0
0 1 1
1 0 1
1 1 1
Problem
• Now 𝒘𝟏=0.6, 𝒘𝟐=1.1, threshold = 1 and learning rate ƞ=0.5
1. A=0, B=0 and target=0
𝑤𝑖𝑥𝑖 = 𝑤1𝑥1+𝑤2𝑥2
=0.6*0+1.1*0=0
This is not greater than the threshold value of 1.
So the output =0
2. A=0, B=1 and target =1
𝑤𝑖𝑥𝑖 = 0.6 ∗ 0 +1.1*1= 1.1
This is greater than the threshold value of 1.
So the output =1.
3. A=1, B=0 and target =1
𝑤𝑖𝑥𝑖 = 0.6 ∗ 1 +1.1*0= 0.6
This is not greater than the threshold value of 1.
So the output =0.
w_i = w_i + η(t − o)·x_i
w1 = 0.6 + 0.5·(1 − 0)·1 = 1.1
w2 = 1.1 + 0.5·(1 − 0)·0 = 1.1
Problem
• Now 𝒘𝟏=1.1, 𝒘𝟐=1.1, threshold = 1 and learning rate ƞ=0.5
1. A=0, B=0 and target=0
𝑤𝑖𝑥𝑖 = 𝑤1𝑥1+𝑤2𝑥2
=1.1*0+1.1*0=0
This is not greater than the threshold value of 1.
So the output =0
2. A=0, B=1 and target =1
𝑤𝑖𝑥𝑖 = 1.1 ∗ 0 +1.1*1= 1.1
This is greater than the threshold value of 1.
So the output =1.
3. A=1, B=0 and target =1
𝑤𝑖𝑥𝑖 = 1.1 ∗ 1 +1.1*0= 1.1
This is greater than the threshold value of 1.
So the output =1.
4. A=1, B=1 and target = 1
Σ w_i·x_i = 1.1·1 + 1.1·1 = 2.2
This is greater than the threshold value of 1.
So the output =1.
(Final perceptron for OR: inputs A, B with weights w1 = w2 = 1.1 and threshold θ = 1 produce the output.)
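A quick check of the learned OR weights in Python (a sketch using the values just obtained):

# Verify w1 = w2 = 1.1, threshold = 1 against the OR truth table
w1, w2, theta = 1.1, 1.1, 1.0
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", 1 if w1 * a + w2 * b > theta else 0)   # 0, 1, 1, 1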
Problem
Problem 2: Assume 𝑤1 = 1.2 𝑎𝑛𝑑 𝑤2 = 0.6, 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 = 1 𝑎𝑛𝑑 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔 𝑟𝑎𝑡𝑒 ƞ=0.5.
Compute AND gate using perceptron training rule.
Solution : 1. A=0, B=0 and target=0
𝑤𝑖𝑥𝑖 = 𝑤1𝑥1+𝑤2𝑥2
=1.2*0+0.6*0=0
This is not greater than the threshold value of 1.
So the output =0
2. A=0, B=1 and target =0
𝑤𝑖𝑥𝑖 = 1.2 ∗ 0 +0.6*1= 0.6
This is not greater than the threshold value of 1, So the output =0.
3. A=1, B=0 and target =0
𝑤𝑖𝑥𝑖 = 1.2 ∗ 1 +0.6*0= 1.2
This is greater than the threshold value of 1, So the output =1.
𝑤𝑖=𝑤𝑖+ƞ(t-o) 𝑥𝑖
𝑤1=1.2+0.5(0-1)1=0.7
𝑤2=0.6+0.5(0-1)0=0.6
Now 𝒘𝟏=0.7, 𝒘𝟐=0.6, threshold = 1 and learning rate ƞ=0.5
A   B   Y = A·B (Target)
0   0   0
0   1   0
1   0   0
1   1   1
Problems
For 𝒘𝟏=0.7, 𝒘𝟐=0.6, threshold = 1 and learning rate ƞ=0.5
1. A=0, B=0 and target=0
𝑤𝑖𝑥𝑖 = 𝑤1𝑥1+𝑤2𝑥2
=0.7*0+0.6*0=0
This is not greater than the threshold value of 1.
So the output =0
2. A=0, B=1 and target =0
𝑤𝑖𝑥𝑖 = 0.7 ∗ 0 +0.6*1= 0.6
This is not greater than the threshold value of 1.
So the output =0.
3. A=1, B=0 and target =0
𝑤𝑖𝑥𝑖 = 0.7 ∗ 1 +0.6*0= 0.7
This is not greater than the threshold value of 1.
So the output =0.
4. A=1, B=1 and target =1
𝑤𝑖𝑥𝑖 = 0.7 ∗ 1 +0.6*1= 1.3
This is greater than the threshold value of 1.
So the output =1.
(Perceptron: inputs A, B with weights w1 = 0.7, w2 = 0.6 feed the weighted sum, which is compared with threshold θ = 1 to give the output.)
Problem
• Problem 3: consider X-OR gate, compute Perceptron training rule with threshold =1 and
learning rate=1.5.
• Solution: y = x1·x̄2 + x̄1·x2
• Y = Z1 + Z2
• where Z1 = x1·x̄2 (function 1),
• Z2 = x̄1·x2 (function 2),
• Y = Z1 OR Z2 (function 3).
• First function: Z1 = x1·x̄2
• Assume the initial weights are w11 = w21 = 1,
• threshold = 1 and learning rate = 1.5.
𝑥1 𝑥2 y
0 0 0
0 1 1
1 0 1
1 1 0
(Network: inputs x1, x2 feed hidden units Z1 and Z2 through weights w11, w21, w12, w22; Z1 and Z2 feed the output unit y.)
𝑥1 𝑥2 𝑍1
0 0 0
0 1 0
1 0 1
1 1 0
Problem
(0,0) 𝑍1𝑖𝑛=𝑤𝑖,𝑗 ∗ 𝑥𝑖 = 1 ∗ 0 + 1 ∗ 0 = 0 (output=0)
(0,1) 𝑍1𝑖𝑛=1 ∗ 0 + 1 ∗ 1 = 1 (output=1)
𝑤𝑖,𝑗=𝑤𝑖,𝑗+ƞ(t-o)𝑥𝑖
𝑤11=1+1.5(0-1)0=1
𝑤21=1+1.5(0-1)1=-0.5
Now, 𝑤11=1, 𝑤21=-0.5, threshold=1 and learning rate=1.5
(0,0) 𝑍1𝑖𝑛=𝑤𝑖,𝑗 ∗ 𝑥𝑖 = 1 ∗ 0 + (−0.5) ∗ 0 = 0 (output=0)
(0,1) 𝑍1𝑖𝑛=𝑤𝑖,𝑗 ∗ 𝑥𝑖 = 1 ∗ 0 + −0.5 ∗ 1 = −0.5 (output=0)
(1,0) 𝑍1𝑖𝑛=𝑤𝑖,𝑗 ∗ 𝑥𝑖 = 1 ∗ 1 + (−0.5) ∗ 0 = 1 (output=1)
(1,1) 𝑍1𝑖𝑛=𝑤𝑖,𝑗 ∗ 𝑥𝑖 = 1 ∗ 1 + (−0.5) ∗ 1 = 0.5 (output=0)
……………………………………………………………………………………………………………………………………
Second function: Z2 = x̄1·x2
• Assume the initial weights are 𝑊12=𝑊22=1
• Threshold =1 and Learning rate=1.5
• (0,0) 𝑍2𝑖𝑛=𝑤𝑖,𝑗 ∗ 𝑥𝑖 = 1 ∗ 0 + 1 ∗ 0 = 0 (output=0)
• (0,1) 𝑍2𝑖𝑛=1 ∗ 0 + 1 ∗ 1 = 1 (output=1)
• (1,0) 𝑍2𝑖𝑛=1 ∗ 1 + 1 ∗ 0 = 1 (output=1)
x1   x2   Z2
0    0    0
0    1    1
1    0    0
1    1    0
Problem
𝑤𝑖,𝑗=𝑤𝑖,𝑗+ƞ(t-o)𝑥𝑖
𝑤12=1+1.5(0-1)1= -0.5
𝑤22=1+1.5(0-1)0= 1
Now, 𝑤12= -0.5, 𝑤22= 1, threshold=1, learning rate=1.5
• (0,0) 𝑍2𝑖𝑛=𝑤𝑖,𝑗 ∗ 𝑥𝑖 = −0.5 ∗ 0 + 1 ∗ 0 = 0 (output=0)
• (0,1) 𝑍2𝑖𝑛=(-0.5) ∗ 0 + 1 ∗ 1 = 1 (output=1)
• (1,0) 𝑍2𝑖𝑛= −0.5 ∗ 1 + 1 ∗ 0 = −0.5 (output=0)
• (1,1) 𝑍2𝑖𝑛= −0.5 ∗ 1 + 1 ∗ 1 = 0.5 (output=0)
• Y = Z1 OR Z2, with y_in = Z1·v1 + Z2·v2
• Assume the initial weights v1 = v2 = 1, threshold = 1, learning rate = 1.5 (XOR table shown alongside).
• (Z1, Z2) = (0, 0): y_in = 1·0 + 1·0 = 0 (output = 0)
• (0, 1): y_in = 1·0 + 1·1 = 1 (output = 1)
• (1, 0): y_in = 1·1 + 1·0 = 1 (output = 1)
• (0, 0): y_in = 1·0 + 1·0 = 0 (output = 0)
• All outputs match the XOR targets, so v1 and v2 need no update.
• ∴ 𝑤11 = 1, 𝑤12 = −0.5, 𝑤21 = −0.5, 𝑤22 = 1
• 𝑣1 = 𝑣2 = 1.
𝑥1 𝑥2 𝑍1 𝑍2 𝑦𝑖𝑛
0 0 0 0 0
0 1 0 1 1
1 0 1 0 1
1 1 0 0 0
Problem
• Problem 4: Consider NAND gate, compute Perceptron training rule with W1=1.2,
W2=0.6 threshold =-1 and learning rate=1.5.
• Solution:
A   B   Y = (A·B)′ (NAND)
0 0 1
0 1 1
1 0 1
1 1 0
Problem
• Problem 5: Consider the NOR gate; compute the perceptron training rule with W1 = 0.6, W2 = 1,
threshold = −0.5 and learning rate = 1.5.
• Solution:
A   B   Y = (A + B)′ (NOR)
0 0 1
0 1 0
1 0 0
1 1 0
Gradient Descent and the Delta Rule
• The delta rule is also important because gradient descent can serve as the basis for learning algorithms that
must search through hypothesis spaces.
• The delta training rule is best understood by considering the task of training an unthresholded
perceptron; that is, a linear unit for which the output o is given by
  o = w0 + w1·x1 + w2·x2 + ... + wn·xn, i.e. o(x⃗) = w⃗ · x⃗
Thus, a linear unit corresponds to the first stage of a perceptron, without the threshold.
 Although there are many ways to define this error, one common measure that turns out to be
especially convenient is
  E(w⃗) = ½ Σ_{d∈D} (t_d − o_d)²
 where D is the set of training examples, t_d is the target output for training example d, and o_d is the
output of the linear unit for training example d.
With gradient descent and the delta rule, each weight is changed by
Δw_ji = η·δ_j·o_i        (o_i is the output of the unit feeding weight w_ji)
δ_j = o_j(1 − o_j)(t_j − o_j)          if j is an output unit
δ_j = o_j(1 − o_j) Σ_k δ_k·w_kj        if j is a hidden unit
where η is a constant called the learning rate,
t_j is the correct (teacher) output for unit j, and
δ_j is the error measure for unit j.
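A small Python sketch of batch gradient descent for a single linear unit with this squared-error objective follows; the training data, learning rate, and epoch count are hypothetical choices for illustration.

def gradient_descent(examples, eta=0.05, epochs=200):
    """Batch delta rule for a linear unit o = w . x, minimising E = 1/2 * sum (t - o)^2."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        delta = [0.0] * len(w)
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))     # unthresholded linear output
            for i, xi in enumerate(x):
                delta[i] += eta * (t - o) * xi           # gradient step accumulated over D
        w = [wi + di for wi, di in zip(w, delta)]
    return w

# Hypothetical data from t = 1 + 2x; x = [1, x] so w[0] acts as w0
data = [([1.0, k / 10.0], 1.0 + 2.0 * (k / 10.0)) for k in range(11)]
print(gradient_descent(data))     # approaches [1.0, 2.0]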
The Backpropagation Algorithm
• Backpropagation is an effective algorithm used to train artificial neural networks, especially in
feed-forward neural networks.
• It is an iterative algorithm that helps minimize the cost function by determining which weights
and biases should be adjusted to reduce the loss, moving down the gradient of the
error.
Considering networks with multiple output units rather than a single unit as before, we begin by
redefining E to sum the errors over all of the network output units:
E(w⃗) = ½ Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)²
where outputs is the set of output units in the network, and t_kd and o_kd are the target and output
values associated with the k-th output unit and training example d.
Case 1: Compute and derive the increment (∆) for output unit weight in The
Backpropagation Algorithm (𝒐𝒋)
Derivation:
∂E_d/∂net_j = (∂E_d/∂o_j) · (∂o_j/∂net_j)
∂E_d/∂o_j = ∂/∂o_j [ ½ Σ_{k∈outputs} (t_k − o_k)² ]
          = ∂/∂o_j [ ½ (t_j − o_j)² ]          (only the k = j term depends on o_j)
          = ½ · 2(t_j − o_j) · ∂(t_j − o_j)/∂o_j
          = −(t_j − o_j)
Since o_j = σ(net_j) for a sigmoid unit, ∂o_j/∂net_j = o_j(1 − o_j). Therefore
∂E_d/∂net_j = −(t_j − o_j) · o_j(1 − o_j) = −o_j(1 − o_j)(t_j − o_j)
and, defining δ_j = −∂E_d/∂net_j,
δ_j = o_j(1 − o_j)(t_j − o_j)
Δw_ji = η·δ_j·x_ji = η·o_j(1 − o_j)(t_j − o_j)·x_ji
The Backpropagation Algorithm
BACKPROPAGATION(training_examples, η, n_in, n_out, n_hidden)
Each training example is a pair of the form (x⃗, t⃗), where x⃗ is the vector of network input values
and t⃗ is the vector of target network output values.
η is the learning rate (e.g., 0.05); n_in is the number of network inputs, n_hidden the number of
units in the hidden layer, and n_out the number of output units.
The input from unit i into unit j is denoted x_ji, and the weight from unit i to unit j is
denoted w_ji.
 Create a feed-forward network with n_in inputs, n_hidden hidden units, and n_out output units.
 Until the termination condition is met, do
   For each (x⃗, t⃗) in training_examples, do
     Propagate the input forward through the network:
     1. Input the instance x⃗ to the network and compute the output o_u of
        every unit u in the network:  a_j = Σ_i w_ji·x_i  and  o_j = F(a_j) = 1 / (1 + e^(−a_j))
     Propagate the errors backward through the network:
     2. For each network output unit k, calculate its error term δ_k:
        δ_k ← o_k (1 − o_k)(t_k − o_k)
     3. For each hidden unit h, calculate its error term δ_h:
        δ_h ← o_h (1 − o_h) Σ_{k∈outputs} w_kh·δ_k
     4. Update each network weight w_ji:
        w_ji ← w_ji + Δw_ji, where Δw_ji = η·δ_j·x_ji
Problems
Problem 1: Assume that the neurons have a sigmoid activation function, perform a forward pass and
backward pass on the network. Assume that the actual output of y is 0.5 and learning rate is 1.
Perform another forward pass.
Solution: Forward pass: compute output for 𝑦3, 𝑦4 and 𝑦5
a_j = Σ_i w_ij·x_i,   y_j = F(a_j) = 1 / (1 + e^(−a_j))
a3 = w13·x1 + w23·x2 = 0.1×0.35 + 0.8×0.9 = 0.755
y3 = f(a3) = 1 / (1 + e^(−0.755)) = 0.68
a4 = w14·x1 + w24·x2 = 0.4×0.35 + 0.6×0.9 = 0.68
y4 = f(a4) = 1 / (1 + e^(−0.68)) = 0.6637
a5 = w35·y3 + w45·y4 = 0.3×0.68 + 0.9×0.6637 = 0.801
(Network: inputs x1 = 0.35, x2 = 0.9 feed hidden units H3 and H4, which feed output unit O5. Initial weights: w13 = 0.1, w14 = 0.4, w23 = 0.8, w24 = 0.6, w35 = 0.3, w45 = 0.9. Unit outputs: y3, y4, y5.)
Conti..
y5 = f(a5) = 1 / (1 + e^(−0.801)) = 0.69 (network output)
∴ Error = y_target − y5 = 0.5 − 0.69 = −0.19
…………………………………………………………………………………………………………………………………………
Each weight is changed by
Δw_ji = η·δ_j·o_i
δ_j = o_j(1 − o_j)(t_j − o_j)          if j is an output unit
δ_j = o_j(1 − o_j) Σ_k δ_k·w_kj        if j is a hidden unit
where η is the learning rate, t_j is the correct (teacher) output for unit j, and δ_j is the error measure for unit j.
Backward pass: compute δ3, δ4 and δ5.
For the output unit:
δ5 = y5(1 − y5)(y_target − y5) = 0.69 × (1 − 0.69) × (0.5 − 0.69) = −0.0406
For the hidden units:
δ3 = y3(1 − y3)(w35 · δ5) = 0.68 × (1 − 0.68) × (0.3 × −0.0406) = −0.00265
δ4 = y4(1 − y4)(w45 · δ5) = 0.6637 × (1 − 0.6637) × (0.9 × −0.0406) = −0.0082
Conti..
Compute new weights:
Δw_ji = η·δ_j·o_i
∆𝑤45=ƞ𝛿5𝑦4= 1 * -0.0406*0.6637= -0.0269
𝑤45(new)=∆𝑤45+𝑤45(old) = -0.0269 +0.9= 0.8731
∆𝑤14=ƞ𝛿4𝑥1= 1 * -0.0082 * 0.35 = -0.00287
𝑤14(𝑛𝑒𝑤)= ∆𝑤14+𝑤14(𝑜𝑙𝑑)= -0.00287+0.4= 0.3971
Similarly, update all other weights
i   j   w_ij   δ_j        x_i      η    Updated w_ij
1 3 0.1 -0.00265 0.35 1 0.0991
2 3 0.8 -0.00265 0.9 1 0.7976
1 4 0.4 -0.0082 0.35 1 0.3971
2 4 0.6 -0.0082 0.9 1 0.5926
3 5 0.3 -0.0406 0.68 1 0.2724
4 5 0.9 -0.0406 0.6637 1 0.8731
Conti..
Updated network
2nd forward pass: compute the outputs y3, y4 and y5 with the updated weights.
a_j = Σ_i w_ij·x_i,   y_j = F(a_j) = 1 / (1 + e^(−a_j))
a3 = w13·x1 + w23·x2 = 0.0991×0.35 + 0.7976×0.9 = 0.7525
y3 = f(a3) = 1 / (1 + e^(−0.7525)) = 0.6797
a4 = w14·x1 + w24·x2 = 0.3971×0.35 + 0.5926×0.9 = 0.6723
y4 = f(a4) = 1 / (1 + e^(−0.6723)) = 0.6620
a5 = w35·y3 + w45·y4 = 0.2724×0.6797 + 0.8731×0.6620 = 0.7631
y5 = f(a5) = 1 / (1 + e^(−0.7631)) = 0.6820 (network output)
Error = y_target − y5 = 0.5 − 0.6820 = −0.182
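For reference, the forward and backward pass worked through above can be reproduced with a short Python sketch (weights, inputs, target, and η = 1 as given in Problem 1):

import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

x1, x2, target, eta = 0.35, 0.9, 0.5, 1.0
w13, w23, w14, w24, w35, w45 = 0.1, 0.8, 0.4, 0.6, 0.3, 0.9

# Forward pass
y3 = sigmoid(w13 * x1 + w23 * x2)              # ≈ 0.680
y4 = sigmoid(w14 * x1 + w24 * x2)              # ≈ 0.664
y5 = sigmoid(w35 * y3 + w45 * y4)              # ≈ 0.690 (network output)

# Backward pass: output delta, then hidden deltas
d5 = y5 * (1 - y5) * (target - y5)             # ≈ -0.0406
d3 = y3 * (1 - y3) * (w35 * d5)                # ≈ -0.00265
d4 = y4 * (1 - y4) * (w45 * d5)                # ≈ -0.0082

# Weight updates: w_ji <- w_ji + eta * delta_j * (input feeding that weight)
w35 += eta * d5 * y3;  w45 += eta * d5 * y4
w13 += eta * d3 * x1;  w23 += eta * d3 * x2
w14 += eta * d4 * x1;  w24 += eta * d4 * x2
print(round(w45, 4), round(w14, 4))            # ≈ 0.873 and ≈ 0.3971, as in the table above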
Conti..
Problem 2: Assume that the neurons have a sigmoid activation function, perform a
forward pass and a backward pass on the network. Assume that the actual output of y is 1
and learning rate is 0.9. Perform another forward pass.
Solution:
Forward pass: Compute output for 𝑦4, 𝑦5 and 𝑦6
(Network: inputs x1 = 1, x2 = 0, x3 = 1 feed hidden units H4 and H5, which feed output unit O6. Weights: w14 = 0.2, w24 = 0.4, w34 = −0.5, w15 = −0.3, w25 = 0.1, w35 = 0.2, w46 = −0.3, w56 = −0.2. Biases: θ4 = −0.4, θ5 = 0.2, θ6 = 0.1. Actual (target) output = 1.)
Conti..
a_j = Σ_i w_ij·x_i,   y_j = F(a_j) = 1 / (1 + e^(−a_j))
a4 = w14·x1 + w24·x2 + w34·x3 + θ4 (bias) = (0.2×1) + (0.4×0) + (−0.5×1) + (−0.4) = −0.7
o(H4) = y4 = f(a4) = 1 / (1 + e^(0.7)) = 0.332
a5 = w15·x1 + w25·x2 + w35·x3 + θ5 = (−0.3×1) + (0.1×0) + (0.2×1) + 0.2 = 0.1
o(H5) = y5 = f(a5) = 1 / (1 + e^(−0.1)) = 0.525
a6 = w46·y4 + w56·y5 + θ6 = (−0.3×0.332) + (−0.2×0.525) + 0.1 = −0.105
o(O6) = y6 = f(a6) = 1 / (1 + e^(0.105)) = 0.474
Error = y_target − y6 = 1 − 0.474 = 0.526
.................................................................................................................................................
Backward pass:
For the output unit:
δ6 = y6(1 − y6)(y_target − y6) = 0.474 × (1 − 0.474) × (1 − 0.474) = 0.1311
For the hidden units:
δ5 = y5(1 − y5)·w56·δ6 = 0.525 × (1 − 0.525) × (−0.2 × 0.1311) = −0.0065
δ4 = y4(1 − y4)·w46·δ6 = 0.332 × (1 − 0.332) × (−0.3 × 0.1311) = −0.0087
Conti..
Compute the new weights:
Δw_ij = η·δ_j·o_i
Δw46 = η·δ6·y4 = 0.9 × 0.1311 × 0.332 = 0.0392
w46(new) = Δw46 + w46(old) = 0.0392 + (−0.3) = −0.261
Δw14 = η·δ4·x1 = 0.9 × (−0.0087) × 1 = −0.0078
w14(new) = Δw14 + w14(old) = −0.0078 + 0.2 = 0.192
i   j   w_ij   δ_j       x_i     η     Updated w_ij
4 6 -0.3 0.1311 0.332 0.9 -0.261
5 6 -0.2 0.1311 0.525 0.9 -0.138
1 4 0.2 -0.0087 1 0.9 0.192
1 5 -0.3 -0.0065 1 0.9 -0.306
2 4 0.4 -0.0087 0 0.9 0.4
2 5 0.1 -0.0065 0 0.9 0.1
3 4 -0.5 -0.0087 1 0.9 -0.508
3 5 0.2 -0.0065 1 0.9 0.194
Conti..
Updated network:
2nd forward pass: compute the outputs y4, y5 and y6 with the updated weights and biases.
a4 = w14·x1 + w24·x2 + w34·x3 + θ4 = (0.192×1) + (0.4×0) + (−0.508×1) + (−0.408) = −0.724
o(H4) = y4 = f(a4) = 1 / (1 + e^(0.724)) = 0.327
a5 = w15·x1 + w25·x2 + w35·x3 + θ5 = (−0.306×1) + (0.1×0) + (0.194×1) + 0.194 = 0.082
o(H5) = y5 = f(a5) = 1 / (1 + e^(−0.082)) = 0.520
a6 = w46·y4 + w56·y5 + θ6 = (−0.261×0.327) + (−0.138×0.520) + 0.218 = 0.061
o(O6) = y6 = f(a6) = 1 / (1 + e^(−0.061)) = 0.515 (network output)
Error = y_target − y6 = 1 − 0.515 = 0.485
Bayesian Learning: Conditional probability
• Conditional probability is the probability that depends on a
previous result or event.
• It help us understand how events are related to each other.
• When the probability of one event happening doesn’t influence
the probability of any other event, then events are called
independent, otherwise dependent events.
• It is defined as the probability of any event occurring when
another event has already occurred.
• In other words, it calculates the probability of one event
happening given that a certain condition is satisfied.
• It is represented as P (A | B) which means the probability of A
when B has already happened.
8/13/2024 71
Dr. Shivashankar, ISE, GAT
Cont…
Conditional Probability Formula:
• When the intersection of two events happen, then the formula for conditional
probability for the occurrence of two events is given by;
• P(A|B) = N(A∩B)/N(B) or
• P(B|A) = N(A∩B)/N(A)
• Where P(A|B) represents the probability of occurrence of A given B has occurred.
• N(A ∩ B) is the number of elements common to both A and B.
• N(B) is the number of elements in B, and it cannot be equal to zero.
• Let N represent the total number of elements in the sample space.
• N(A ∩ B)/N can be written as P(A ∩ B) and N(B)/N as P(B).
𝑃 𝐴 𝐵 =
𝑃 𝐵 𝐴 𝑃(𝐴)
𝑃(𝐵)
• Therefore, P(A ∩ B) = P(B) P(A|B) if P(B) ≠ 0
• = P(A) P(B|A) if P(A) ≠ 0
• Similarly, the probability of occurrence of B when A has already occurred is given by,
• P(B|A) = P(B ∩ A)/P(A)
8/13/2024 72
Dr. Shivashankar, ISE, GAT
Cont…
How to Calculate Conditional Probability?
To calculate the conditional probability, we can use the following method:
Step 1: Identify the Events. Let’s call them Event A and Event B.
Step 2: Determine the Probability of Event A i.e., P(A)
Step 3: Determine the Probability of Event B i.e., P(B)
Step 4: Determine the Probability of Event A and B i.e., P(A∩B).
Step 5: Apply the Conditional Probability Formula and calculate the required
probability.
Conditional Probability of Independent Events
For independent events, A and B, the conditional probability of A and B with respect to
each other is given as follows:
P(B|A) = P(B)
P(A|B) = P(A)
8/13/2024 73
Dr. Shivashankar, ISE, GAT
Cont…
Problem 1: Two dies are thrown simultaneously, and the sum of the numbers obtained is
found to be 7. What is the probability that the number 3 has appeared at least once?
Solution:
• Event A indicates the combination in which 3 has appeared at least once.
• Event B indicates the combination of the numbers which sum up to 7.
• A = {(3, 1), (3, 2), (3, 3)(3, 4)(3, 5)(3, 6)(1, 3)(2, 3)(4, 3)(5, 3)(6, 3)}
• B = {(1, 6)(2, 5)(3, 4)(4, 3)(5, 2)(6, 1)}
• P(A) = 11/36
• P(B) = 6/36
• A ∩ B = 2
• P(A ∩ B) = 2/36
• Applying the conditional probability formula we get,
• P(A|B) = P(A∩B)/P(B) = (2/36)/(6/36) = ⅓
8/13/2024 74
Dr. Shivashankar, ISE, GAT
Cont…
Problem 2: In a group of 100 computer buyers, 40 bought CPU, 30 purchased monitor,
and 20 purchased CPU and monitors. If a computer buyer chose at random and bought a
CPU, what is the probability they also bought a Monitor?
Solution:
As per the first event, 40 out of 100 bought CPU,
So, P(A) = 40% or 0.4
Now, according to the question, 20 buyers purchased both CPU and monitors. So, this is
the intersection of the happening of two events. Hence,
P(A∩B) = 20% or 0.2
Conditional probability is
P(B|A) = P(A∩B)/P(B)
P(B|A) = 0.2/0.4 = 2/4 = ½ = 0.5
The probability that a buyer bought a monitor, given that they purchased a CPU, is 50%.
8/13/2024 75
Dr. Shivashankar, ISE, GAT
Cont…
Question 7: In a survey among a group of students, 70% play football, 60% play
basketball, and 40% play both sports. If a student is chosen at random and it is
known that the student plays basketball, what is the probability that the
student also plays football?
Solution:
Let’s assume there are 100 students in the survey.
Number of students who play football = n(A) = 70
Number of students who play basketball = n(B) = 60
Number of students who play both sports = n(A ∩ B) = 40
To find the probability that a student plays football given that they play
basketball, we use the conditional probability formula:
P(A|B) = n(A ∩ B) / n(B)
Substituting the values, we get:
P(A|B) = 40 / 60 = 2/3
Therefore, probability that a randomly chosen student who plays basketball also
plays football is 2/3.
8/13/2024 76
Dr. Shivashankar, ISE, GAT
BAYES THEOREM
• Bayes’ theorem describes the probability of occurrence of an event related to any
condition.
• Bayes’ Theorem is used to determine the conditional probability of an event.
• Bayesian methods provide the basis for probabilistic learning methods that
accommodate knowledge about the prior probabilities of alternative hypotheses.
• To define Bayes theorem precisely:
• P(h) to denote the initial probability that hypothesis h holds.
• P(h) is often called the prior-probability of h and may reflect any background
knowledge, chance that h is a correct hypothesis.
• P(D) to denote the prior probability that training data D will be observed (i.e., the
probability of D given no knowledge about which hypothesis holds).
• P(D|h) to denote the probability of observing data D given some world in which
hypothesis h holds.
• P (h|D) is called the posterior-probability of h, because it reflects our confidence that h
holds.
• Notice the posterior probability P(h|D) reflects the influence of the training data D, in
contrast to the prior probability P(h) , which is independent of D.
8/13/2024 77
Dr. Shivashankar, ISE, GAT
BAYES THEOREM
• If A and B are two events, then the formula for the Bayes theorem is given by:
• P(A|B) =
P(B|A) X P(A)
𝑃(𝐵)
• Where P(A|B) is the probability of condition when event A is occurring while event B has
already occurred.
P(A) – Probability of event A
P(B) – Probability of event B
P(A|B) – Probability of A given B
P(B|A) – Probability of B given A
From the definition of conditional probability, Bayes theorem can be derived for events as given
below:
P(A|B) = P(A ⋂ B)/ P(B), where P(B) ≠ 0
P(B|A) = P(B ⋂ A)/ P(A), where P(A) ≠ 0
• Since P(A∩ 𝐵)𝑎𝑛𝑑 𝑃(𝐵 ∩ 𝐴) are equal
• P(A|B) X P(B) = P(B|A) X P(A)
∴ P(A|B) =
P(B|A) X P(A)
𝑃(𝐵)
this is the Bays theorem.
8/13/2024 78
Dr. Shivashankar, ISE, GAT
Likelihood
probability
Posterior
probability
Marginal
probability
Prior probability
Cont…
Problem 1: A patient takes a lab test for cancer diagnosis. There are two possible
outcomes in this case: ⊕(positive) and ⊖ (negative). The test returns a correct positive
results in only 98%. If the cases in which the diseases is actually present and a correct
negative result in only 97% of the cases in which the disease in present. Furthermore,
0.008 of the entire population have this cancer.
Compute the following values.
1). P(Cancer) 2). P(¬𝐶𝑎𝑛𝑐𝑒𝑟) 3). P(+ve Cancer)
4). P(-ve Cancer) 5). P(+| (¬𝐶𝑎𝑛𝑐𝑒𝑟) 4). P(-| (¬𝐶𝑎𝑛𝑐𝑒𝑟)
Solution:
• P(cancer)= 0.008 P(¬cancer)= 0.992
• P(⊕/cancer)=0.98 P(⊖/cancer)= 0.02
• P(⊕/¬cancer)=0.03 P(⊖/¬cancer)= 0.97 and
• P(⊕/cancer) P(cancer)=0.98 X 0.008 = 0.0078
• P(⊕/-cancer) P(-cancer)=0.03 X 0.992=0.0298
• Thus, ℎ𝑀𝐴𝑃 = -cancer.
• The exact posterior probabilities can also be determined by normalizing the above
quantities so that they sum to 1 (e.g., P(cancer/ ⊕) =
0.0078
0.0078+0.0298
= 0.207
8/13/2024 79
Dr. Shivashankar, ISE, GAT
Cont…
Problem 3: From the table below, given that a person passed the exam, what is the probability that it is a woman?
Answer: Let A be the event "the person is a woman" and B the event "the person passed the exam".
P(B|A): probability of passing the exam given a woman = 92/100 = 0.92
P(A): probability of being a woman = 100/200 = 0.5
P(B): probability of passing the exam = 169/200 = 0.845
P(A|B) = P(B|A) × P(A) / P(B) = (0.92 × 0.5) / 0.845
P(A|B) ≈ 0.54
Check: 92/169 ≈ 0.54 as well.
8/13/2024 80
Dr. Shivashankar, ISE, GAT
         Did not pass   Passed the   Total
         the exam       exam
Women         8             92        100
Men          23             77        100
Total        31            169        200
Cont…
4. Covid-19 has taken over the world and the use of Covid19 tests is still relevant to block
the spread of the virus and protect our families.
If the Covid19 infection rate is 10% of the population, and thanks to the tests we have in
Algeria, 95% of infected people are positive with 5% false positive.
What would be the probability that I am really infected if I test positive?
Solution :
Parameters :
• P(A) = 0.10 (infected), so P(¬A) = 0.90 (not infected)
• P(B|A) = 0.95 (test positive given infected)
• P(B|¬A) = 0.05 (false positive: test positive given not infected)
• We multiply the probability of infection (10%) by the probability of testing positive
given infection (95%), then divide by the total probability of testing positive, which also
counts the false positives among the 90% who are not infected.
P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|¬A) P(¬A)]
P(A|B) = (0.95 × 0.1) / [(0.95 × 0.1) + (0.05 × 0.90)]
P(A|B) = 0.095 / (0.095 + 0.045)
P(A|B) ≈ 0.679
8/13/2024 81
Dr. Shivashankar, ISE, GAT
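The denominator here is the law of total probability; a small Python sketch (illustrative, not from the slides) of the same calculation:

```python
p_infected = 0.10
p_pos_given_infected = 0.95
p_pos_given_healthy = 0.05  # false positive rate

# Law of total probability for the evidence P(positive)
p_positive = (p_pos_given_infected * p_infected
              + p_pos_given_healthy * (1 - p_infected))

p_infected_given_pos = p_pos_given_infected * p_infected / p_positive
print(round(p_infected_given_pos, 3))  # ~0.679
```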
Cont…
2. Let A denote the event that a “patient has liver disease”, and B the event that a “patient
is an alcoholic”. It is known from experience that 10% of the patients entering the clinic
have liver disease and 5% of the patients are alcoholics.
Also, among those patients diagnosed with liver disease, 7% are alcoholic. Given that a
patient is alcoholic, what is the probability that he will have liver disease?
Solution:
A-”patient has liver disease”.
B-”patient is an alcoholic”.
P(A)=10%=0.1
P(B)=5%=0.05
P(B|A)=7%=0.07
P(A|B) = P(B|A) × P(A) / P(B) = (0.07 × 0.10) / 0.05 = 0.14
8/13/2024 82
Dr. Shivashankar, ISE, GAT
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR
HYPOTHESES
• Many learning approaches such as neural network learning, linear regression, and
polynomial curve fitting try to learn a continuous valued target function.
• Under certain assumptions any learning algorithm that minimizes the squared error
between output hypothesis predictions and the training data will output a MAXIMUM
LIKELIHOOD HYPOTHESIS.
• The significance of this result is that it provides a Bayesian justification (under certain
assumptions) for many neural network and other curve fitting methods that attempt to
minimize the sum of squared errors over the training data.
• In order to find the Maximum Likelihood Hypothesis in Bayesian learning for
continuous valued target function, we start with Maximum Likelihood Hypothesis
definition, but using lower case p to refer to the Probability Density Function
h_ML = argmax_{h∈H} p(D|h)
8/13/2024 83
Dr. Shivashankar, ISE, GAT
(argmax: the argument that gives the maximum value of the target function)
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR
HYPOTHESES
• h_ML = argmax_{h∈H} p(D|h), where p is a probability density function.
• Assume a fixed set of training instances (x1, x2, x3, …, xn); the data D is the
corresponding sequence of target values D = (d1, d2, …, dn).
• Assuming the training examples are mutually independent given h, p(D|h) is the
product of the p(di|h):
    h_ML = argmax_{h∈H} ∏_{i=1}^{n} p(di|h)
Assume the target values are normally distributed around the true value, with density
    f(x|μ) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}
Substituting this density (with mean μ = h(xi) for the i-th example):
    h_ML = argmax_{h∈H} ∏_{i=1}^{n} (1/√(2πσ²)) e^{−(1/(2σ²)) (di − μ)²}
    h_ML = argmax_{h∈H} ∏_{i=1}^{n} (1/√(2πσ²)) e^{−(1/(2σ²)) (di − h(xi))²}
8/13/2024 84
Dr. Shivashankar, ISE, GAT
(Here μ is the mean, σ the standard deviation and σ² the variance of the noise; di is the target value of the i-th input and h(xi) is the hypothesis output for the i-th input.)
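As a concrete, illustrative sketch of h_ML = argmax_h p(D|h), the snippet below grid-searches candidate means of a Gaussian and keeps the one with the highest likelihood of the observed data; the candidate grid, noise level, and data values are assumptions made only for this example.

```python
import numpy as np

data = np.array([2.1, 1.9, 2.4, 2.0, 2.2])   # observed targets d_i (assumed)
candidates = np.linspace(0.0, 4.0, 401)       # candidate hypotheses (means)
sigma = 0.5                                   # assumed known noise std-dev

def log_likelihood(mu):
    # sum of log N(d_i | mu, sigma^2) over the training targets
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu) ** 2 / (2 * sigma**2))

h_ml = candidates[np.argmax([log_likelihood(mu) for mu in candidates])]
print(h_ml)  # close to the sample mean, as expected
```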
Conti..
Rather than maximizing the above expression, we choose to maximize its (less
complicated) logarithm.
This is justified because ln p is a monotonic function of p; therefore maximizing ln p also
maximizes p.
    h_ML = argmax_{h∈H} Σ_{i=1}^{m} [ ln (1/√(2πσ²)) − (1/(2σ²)) (di − h(xi))² ]
The first term is a constant independent of h, so it can be discarded:
    h_ML = argmax_{h∈H} Σ_{i=1}^{m} − (1/(2σ²)) (di − h(xi))²
Maximizing this negative quantity is equivalent to minimizing the corresponding positive
quantity:
    h_ML = argmin_{h∈H} Σ_{i=1}^{m} (1/(2σ²)) (di − h(xi))²
Finally, we can again discard constants that are independent of h:
    h_ML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²
8/13/2024 85
Dr. Shivashankar, ISE, GAT
Least-squared error hypothesis: in Bayesian learning, the maximum likelihood hypothesis
for a continuous-valued target (under normally distributed noise) is the one that minimizes
the sum of squared errors over the training data.
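A brief Python sketch (added for illustration) showing that the maximum likelihood line under Gaussian noise is exactly the least-squares fit; the synthetic data and the use of numpy's polyfit are assumptions of this example, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
d = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.shape)  # targets with Gaussian noise

# Least-squares fit = maximum likelihood hypothesis under the Gaussian-noise assumption
w1, w0 = np.polyfit(x, d, deg=1)
sum_sq_error = np.sum((d - (w1 * x + w0)) ** 2)

print(round(w1, 2), round(w0, 2), round(sum_sq_error, 2))
```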
NAIVE BAYES CLASSIFIER
• The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes
theorem (which gives the probability of one event given that another event has
already occurred) and is used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional training
dataset.
• Naïve Bayes Classifier is one of the simple and most effective Classification
algorithms which helps in building the fast machine learning models that can
make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object belonging to each class.
• Naïve: it assumes that the occurrence of a certain feature is independent of
the occurrence of the other features.
• Example: if a fruit is identified on the basis of color, shape, and taste, then a
red, spherical, and sweet fruit is recognized as an apple. Each feature
individually contributes to identifying it as an apple, without depending on
the others.
8/13/2024 86
Dr. Shivashankar, ISE, GAT
Conti..
Bayes: the theorem provides a way to calculate the posterior probability P(A|B) from
the known quantities P(B|A), P(A) and P(B).
Working of the Naïve Bayes classifier:
 Step 1: Convert the given dataset into frequency tables (counts of each attribute
value per class).
 Step 2: Generate the likelihood table by converting these counts into conditional
probabilities of the features (for continuous features, via parameters such as μ and σ
per class).
 Step 3: Now, use Bayes theorem to calculate the posterior probability.
Steps to implement:
• Data Pre-processing step
• Fitting Naive Bayes to the Training set
• Predicting the test result
• Test accuracy of the result
• Visualizing the test set result.
8/13/2024 87
Dr. Shivashankar, ISE, GAT
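The implementation steps listed above can be sketched in Python with scikit-learn; this is a generic, illustrative pipeline (the Iris dataset and GaussianNB are stand-ins chosen for the example, not part of the slides).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Data pre-processing step (here: load and split a toy dataset)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fitting Naive Bayes to the training set
model = GaussianNB().fit(X_train, y_train)

# Predicting the test result and checking accuracy
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
```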
NAIVE BAYES CLASSIFIER
• One highly practical Bayesian learning method is the naive Bayes learner, often called
the Naive Bayes classifier.
• The naive Bayes classifier applies to learning tasks where each instance x is described
by a conjunction of attribute values and where the target function f (x) can take on any
value from some finite set V.
• A set of training examples of the target function is provided, and a new instance is
presented, described by the tuple of attribute values 𝑎1, 𝑎2, 𝑎3, … , 𝑎𝑛 .
• The learner is asked to predict the target value, or classification, for this new instance.
• The naive Bayes classifier is based on the simplifying assumption that the attribute
values are conditionally independent given the target value.
• Naive Bayes classifier:
V_NB = argmax_{vj∈V} P(vj) ∏_i P(ai|vj)
• where 𝑉𝑁𝐵 denotes the target value output by the Naive Bayes classifier.
• Notice that in a naive Bayes classifier the number of distinct 𝑃(𝑎𝑖|𝑣𝑗) terms that must
be estimated from the training data is just the number of distinct attribute values times
the number of distinct target values-a much smaller number than if we were to
estimate the P(𝑎1, 𝑎2, …, 𝑎𝑛 |𝑣𝑗) terms as first contemplated.
8/13/2024 88
Dr. Shivashankar, ISE, GAT
NAÏVE BAYES CLASSIFIER
From Bayes' theorem:
P(A|B) = P(B|A) × P(A) / P(B)
Data set
X = {x1, x2, …, xn} are the feature values used to compute the output (class) y.
With multiple features, each record has the form (f1, f2, f3, y):
x1, x2, x3, y1 --- (record 1)
x1, x2, x3, y2 --- (record 2)
For this kind of data set, Bayes' theorem for computing y from the features becomes
P(y|x1, x2, …, xn) = [P(x1|y) P(x2|y) P(x3|y) ⋯ P(xn|y) P(y)] / [P(x1) P(x2) ⋯ P(xn)]
                   = P(y) ∏_{i=1}^{n} P(xi|y) / [P(x1) P(x2) ⋯ P(xn)]
Since the denominator is the same for every class,
P(y|x1, x2, …, xn) ∝ P(y) ∏_{i=1}^{n} P(xi|y)
ŷ = argmax_y [ P(y) ∏_{i=1}^{n} P(xi|y) ]
8/13/2024 89
Dr. Shivashankar, ISE, GAT
Problem 1: Calculate play for TODAY
Check for dataset TODAY (outlook=Sunny, temperature= Hot)
Solution: The Naive Bayes classifier is defined by
V_NB = argmax_{vj∈{yes,no}} P(vj) ∏_i P(ai|vj)
     = argmax_{vj∈{yes,no}} P(vj) P(Outlook = Sunny|vj) P(Temperature = Hot|vj)
V_NB(Yes) = P(Yes|Today) = P(Today|Yes) P(Yes) / P(Today)
          = P(Sunny|Yes) P(Hot|Yes) P(Yes) / P(Today)
(P(Today) is the same for every record, so it can be skipped)
          = 2/9 × 2/9 × 9/14 = 0.031
V_NB(No) = P(No|Today) = P(Sunny|No) P(Hot|No) P(No) / P(Today)
         = 3/5 × 2/5 × 5/14 = 0.0857
To express P(Yes) for the Today condition, normalize the two values so they sum to one:
V_NB(Yes) = V_NB(Yes) / [V_NB(Yes) + V_NB(No)] = 0.031 / (0.031 + 0.0857) ≈ 0.27
V_NB(No) = 0.0857 / (0.031 + 0.0857) ≈ 0.73
∴ V_NB(No) is higher, so for TODAY (Sunny, Hot) — Play is No
8/13/2024 90
Dr. Shivashankar, ISE, GAT
Outlook      Yes   No   P(·|Yes)   P(·|No)
Sunny         2     3     2/9        3/5
Overcast      4     0     4/9        0/5
Rainy         3     2     3/9        2/5
Total         9     5

Temperature  Yes   No   P(·|Yes)   P(·|No)
Hot           2     2     2/9        2/5
Mild          4     2     4/9        2/5
Cool          3     1     3/9        1/5
Total         9     5
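A compact Python check of this calculation (illustrative only; the probabilities are taken from the tables above):

```python
p_yes, p_no = 9/14, 5/14
v_yes = (2/9) * (2/9) * p_yes   # P(Sunny|Yes) * P(Hot|Yes) * P(Yes)
v_no = (3/5) * (2/5) * p_no     # P(Sunny|No) * P(Hot|No) * P(No)

total = v_yes + v_no
print(round(v_yes / total, 2), round(v_no / total, 2))  # ~0.27 and ~0.73
```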
Problem 1: Apply the naive Bayes classifier to a concept learning problem, classifying days
according to whether someone will play tennis, for the new instance {outlook=sunny,
temperature=cool, humidity=high, wind=strong}.
8/13/2024 91
Dr. Shivashankar, ISE, GAT
Day Outlook Temperature Humidity Wind Play_Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Cont…
Solution:
{Outlook=sunny, temperature=cool, Humidity=high, Wind=strong}
P(Play Tennis=yes)=9/14=0.6428
P(Play Tennis=No)=5/14=0.3571
V_NB = argmax_{vj∈{yes,no}} P(vj) ∏_i P(ai|vj)
     = argmax_{vj∈{yes,no}} P(vj) P(Outlook = Sunny|vj) × P(Temperature = Cool|vj) ×
       P(Humidity = High|vj) × P(Wind = Strong|vj)
V_NB(Yes) = P(Sunny|Yes) × P(Cool|Yes) × P(High|Yes) × P(Strong|Yes) × P(Yes)
          = 2/9 × 3/9 × 3/9 × 3/9 × 0.6428 = 0.0053
V_NB(No) = P(Sunny|No) × P(Cool|No) × P(High|No) × P(Strong|No) × P(No)
         = 3/5 × 1/5 × 4/5 × 3/5 × 0.3571 = 0.0206
Normalizing:
V_NB(Yes) = V_NB(Yes) / [V_NB(Yes) + V_NB(No)] = 0.0053 / (0.0053 + 0.0206) = 0.205
V_NB(No) = V_NB(No) / [V_NB(Yes) + V_NB(No)] = 0.0206 / (0.0053 + 0.0206) = 0.795
Therefore, 𝑉𝑁𝐵(No)= 0.795 > 0.205, Play Tennis: No
8/13/2024 92
Dr. Shivashankar, ISE, GAT
Outlook      Yes   No
Sunny        2/9   3/5
Overcast     4/9   0/5
Rain         3/9   2/5

Temperature  Yes   No
Hot          2/9   2/5
Mild         4/9   2/5
Cool         3/9   1/5

Humidity     Yes   No
High         3/9   4/5
Normal       6/9   1/5

Wind         Yes   No
Strong       3/9   3/5
Weak         6/9   2/5
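A from-scratch Python sketch (illustrative, not from the slides) that recomputes these conditional probabilities directly from the Play Tennis table and classifies the new instance:

```python
from collections import Counter, defaultdict

# (Outlook, Temperature, Humidity, Wind, PlayTennis) rows from the table
data = [
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

class_counts = Counter(row[-1] for row in data)
cond_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
for row in data:
    for i, value in enumerate(row[:-1]):
        cond_counts[(i, row[-1])][value] += 1

def score(instance, label):
    p = class_counts[label] / len(data)          # prior P(v_j)
    for i, value in enumerate(instance):
        p *= cond_counts[(i, label)][value] / class_counts[label]  # P(a_i | v_j)
    return p

new = ("Sunny", "Cool", "High", "Strong")
scores = {label: score(new, label) for label in class_counts}
print(scores, "->", max(scores, key=scores.get))  # 'No' wins, matching the slide
```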
Cont…
Problem 2: Estimate the conditional probabilities of each attribute {color, legs, height,
smelly} for the species classes {M, H} using the data set given in the table. Using these
probabilities, estimate the probability values for the new instance {color=green, legs=2,
height=tall and smelly=No}.
8/13/2024 93
Dr. Shivashankar, ISE, GAT
No Color Legs Height Smelly Species
1 White 3 Short Yes M
2 Green 2 Tall No M
3 Green 3 Short Yes M
4 White 3 Short Yes M
5 Green 2 Short No H
6 White 2 Tall No H
7 White 2 Tall No H
8 White 2 Short Yes H
Cont…
Solution : {color=green, legs=2, height=tall and smelly=No},
P(M)=4/8=0.5, P(H)=4/8=0.5
P(M/New instance)= P(M)*P(color=green/M) * P(legs=2/M) * P(Height=Tall/M) * P(Smelly=No/M)
= 0.5* 2/4 * 1/4 * 1/4 * 1/4 = 0.0039
P(H/New instance)= P(H)*P(color=green/H) * P(legs=2/H) * P(Height=Tall/H) * P(Smelly=No/H)
= 0.5 * 1/4 * 4/4 * 2/4 * 3/4 = 0.0469
Since P(H/New instance) > P(M/New instance)
Hence the new instance {color=green, legs=2, height=tall and smelly=No} belongs to H
8/13/2024 94
Dr. Shivashankar, ISE, GAT
Color M H
White 2/4 3/4
Green 2/4 1/4
Legs M H
2 1/4 4/4
3 3/4 0/4
Height M H
Short 3/4 2/4
Tall 1/4 2/4
Smelly M H
Yes 3/4 1/4
No 1/4 3/4
Cont…
1. Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|-), P(B|-) and P(C|-)
2. Use the estimate of Conditional probabilities given to predict the class label for a test
sample (A=0, B=1, C=0), using the Naïve Bayes approach.
3. Estimate the conditional probabilities using the m-estimate approach with P=1/2 and m=4.
solution:
1. Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+)
P(A|-), P(B|-), P(C|-)
P(A=0|-)= 3/5=0.6
P(A=0|+)= 2/5=0.4
P(B=0|-)= 3/5=0.6
P(B=0|+)= 4/5=0.8
P(C=0|-)= 0/5=0.0
P(C=0|+)= 3/5=0.6
2. Classify the new instance (A=0, B=1, C=0)
P(Ci|x1, x2, …, xn) = P(x1, x2, …, xn|Ci) P(Ci) / P(x1, x2, …, xn)
P(+|A=0, B=1, C=0) = P(A=0|+) × P(B=1|+) × P(C=0|+) × P(+) / P(A=0, B=1, C=0)
                   = (0.4 × 0.2 × 0.6 × 0.5) / K = 0.024 / K
8/13/2024 95
Dr. Shivashankar, ISE, GAT
Record A B C Class
1 0 0 0 +
2 0 0 1 -
3 0 1 1 -
4 0 1 1 -
5 0 0 0 +
6 1 0 0 +
7 1 0 1 -
8 1 0 1 -
9 1 1 1 +
10 1 0 1 +
Cont…
P(-|A=0, B=1, C=0) = P(A=0|-) × P(B=1|-) × P(C=0|-) × P(-) / P(A=0, B=1, C=0)
                   = (0.6 × 0.4 × 0 × 0.5) / K = 0 / K
The class label should be + since 0.024/K > 0/K.
3. Estimate the conditional probabilities using the m-estimate approach
with p = 1/2 and m = 4.
The conditional probability using the m-estimate:
Prob(A|B) = (nc + m·p) / (n + m)
Where, nc: no. of times A and B happened together
n: no. of times B happened in the training data
Here m·p = 4 × 1/2 = 2, and each class has n = 5 examples.
P(A=0|+) = (2 + 2) / (5 + 4) = 4/9
P(A=0|-) = (3 + 2) / (5 + 4) = 5/9
P(B=1|+) = (1 + 2) / (5 + 4) = 3/9
P(B=1|-) = (2 + 2) / (5 + 4) = 4/9
P(C=0|+) = (3 + 2) / (5 + 4) = 5/9
P(C=0|-) = (0 + 2) / (5 + 4) = 2/9
8/13/2024 96
Dr. Shivashankar, ISE, GAT
P(A=1|-) = 0.4    P(A=0|-) = 0.6
P(A=1|+) = 0.6    P(A=0|+) = 0.4
P(B=1|-) = 0.4    P(B=0|-) = 0.6
P(B=1|+) = 0.2    P(B=0|+) = 0.8
P(C=1|-) = 1      P(C=0|-) = 0
P(C=1|+) = 0.4    P(C=0|+) = 0.6
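An illustrative Python sketch of the m-estimate (the function name and structure are my own; the counts correspond to the table above):

```python
def m_estimate(n_c, n, p=0.5, m=4):
    """Smoothed conditional probability (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Example: P(C=0|-) has n_c = 0 occurrences among n = 5 negative examples.
# The raw estimate would be 0, but the m-estimate keeps it non-zero.
print(m_estimate(0, 5))   # 2/9 ~ 0.222
print(m_estimate(2, 5))   # 4/9 ~ 0.444, e.g. P(A=0|+)
```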
Cont…
Problem 3: Classify the new instance (A=0, B=1, C=0) using the m-estimate approach with
p = 1/2, m = 4.
P(+|A=0, B=1, C=0) = P(A=0|+) × P(B=1|+) × P(C=0|+) × P(+) / P(A=0, B=1, C=0)
                   = (4/9 × 3/9 × 5/9 × 0.5) / K = 0.0412 / K
P(-|A=0, B=1, C=0) = P(A=0|-) × P(B=1|-) × P(C=0|-) × P(-) / P(A=0, B=1, C=0)
                   = (5/9 × 4/9 × 2/9 × 0.5) / K = 0.0274 / K
The class label should be + since 0.0412/K > 0.0274/K.
8/13/2024 97
Dr. Shivashankar, ISE, GAT
  • 30.
    Artificial Neural Networks •Motivation behind neural network is human brain. Human brain is called as the best processor even though it works slower than other computers. • Human brain cells, called neurons, form a complex, highly interconnected network and send electrical signals to each other to help humans process information. • Similarly, an artificial neural network is made of artificial neurons that work together to solve a real world problems. • Artificial neurons are software modules, called nodes, and artificial neural networks are software programs or algorithms that, at their core, use computing systems to solve mathematical calculations. 8/13/2024 30 Dr. Shivashankar, ISE, GAT Fig 3.3: Artificial Neural Networks
  • 31.
    Conti.. Input Layer • Thisis the first layer in a typical neural network. • Input layer neurons receive the input information from the outside world enters the artificial neural network, process it through a mathematical function (activation function), and transmit output to the next layer’s neurons based on comparison with a preset threshold value. • We pre-process text, image, audio, video, and other types of data to derive their numeric representation. Hidden Layer • Hidden layers take their input from the input layer or other hidden layers and have a large number of hidden layers. It contains the summation and activation function. • Each hidden layer analyzes the output from the previous layer, processes it further, and passes it on to the next layer. Here also, we multiply the data by edge weights as it is transmitted to the next layer. Output Layer • The output layer gives the final result of all the data processing by the artificial neural network. It can have single or multiple nodes. • For instance, if we have a binary (yes/no) classification problem, the output layer will have one output node, which will give the result as 1 or 0. • However, if we have a multi-class classification problem, the output layer might consist of more than one output node. 8/13/2024 31 Dr. Shivashankar, ISE, GAT
  • 32.
    Conti.. • It isusually a computational network based on biological neural networks that construct the structure of the human brain. • Similar to a human brain has neurons interconnected to each other, artificial neural networks also have neurons that are linked to each other in various layers of the networks. • These neurons are known as nodes. • Artificial neural networks (ANNs) provide a general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples. • ANN learning is robust to errors in the training data and has been successfully applied to problems such as interpreting visual scenes, speech recognition, and learning robot control strategies. • The fastest neuron switching times are known to be on the order of 10−3 seconds--quite slow compared to computer switching speeds of 10−10 seconds. 8/13/2024 32 Dr. Shivashankar, ISE, GAT
  • 33.
    Biological Motivation • Theterm "Artificial Neural Network(ANN)" refers to a biologically inspired sub-field of artificial intelligence modeled after the brain. • ANN has been inspired by biological learning system biological learning system is made up of complex web of interconnected neurons. • Artificial interconnected neurons like biological neurons making up an ANN. • Each biological neuron is capable of taking a number of inputs and produce output. • One motivation for ANN is that to work for a particular task identification through many parallel processes. Consider human brain: • Number of neurons ~ 1011 neurons • Connections per neurons ~ 104−5 • Neurons switching time ~ 10−3 seconds (0.001) • Computer switching time ~ 10−10seconds • Scene recognition time ~ 10−1seconds (0.1) 8/13/2024 33 Dr. Shivashankar, ISE, GAT
  • 34.
    NEURAL NETWORK REPRESENTATIONS •In an artificial neural network, a neurone is a logistic unit.  Feed input via input wires.  Logistic unit does computation.  Sends output down output wires • That logistic computation is just like our previous logistic regression hypothesis calculation. • Input – 30* 32 grid – camera. • Output – Vehicle is steered. • Training – Observing steering commands of human driving the vehicle. • 960 inputs – 30 output units – Steering command recommended most. • ALVINN – acyclic graph. 8/13/2024 34 Dr. Shivashankar, ISE, GAT
  • 35.
    PERCEPTRONS • One typeof ANN system is based on a unit called a perceptron. • A perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs, then outputs a 1 if the result is greater than some threshold and -1 otherwise. • More precisely, given inputs 𝑥1 through 𝑥2, the output o(𝑥1, … … 𝑥𝑛) computed by the perceptron is o(𝑥1, … … 𝑥𝑛) =ቊ 1 𝑖𝑓 𝑤0 + 𝑤1𝑥1 + 𝑤2𝑥2 + ⋯ + 𝑤𝑛𝑥𝑛 > 0 −1 𝑜𝑡ℎ𝑟𝑒𝑤𝑖𝑠𝑒 • where each 𝒘𝒊 is a real-valued constant, or weight, that determines the contribution of input 𝒙𝒊 to the perceptron output. • We will sometimes write the perceptron function as 𝑎( Ԧ 𝑥) =sgn 𝑤. Ԧ 𝑥 • Where, sgn (y)=ቊ 1 𝑖𝑓 𝑦 > 0 −1 𝑜𝑡ℎ𝑟𝑒𝑤𝑖𝑠𝑒 • Learning a perceptron involves choosing values for the weights 𝑤0, … . . 𝑤𝑛. Therefore, the space H of candidate hypotheses considered in perceptron learning is the set of all possible real-valued weight vectors. • H= 𝑤| 𝑤 𝜖 𝜏 𝑛+1 8/13/2024 35 Dr. Shivashankar, ISE, GAT
  • 36.
    Representational Power ofPerceptron • We can view the perceptron as representing a hyperplane decision surface in the n- dimensional space of instances (i.e., points). • The perceptron outputs a 1 for instances lying on one side of the hyperplane and outputs and a -1 for instances lying on the other side. • The equation for this decision hyperplane is 𝑤. Ԧ 𝑥 = 0. • Of course, some sets of positive and negative examples cannot be separated by any hyperplane. • Those that can be separated are called linearly separable sets of examples. • A single perceptron can be used to represent many boolean functions. 8/13/2024 36 Dr. Shivashankar, ISE, GAT
  • 37.
    Cont… • AND andOR can be viewed as special cases of m-of-n functions: that is, functions where at least m of the n inputs to the perceptron must be true. • The OR function corresponds to m = 1 and the AND function to m = n. • Any m-of-n function is easily represented using a perceptron by setting all input weights to the same value (e.g., 0.5) and then setting the threshold t accordingly. • Perceptron can represent all of the primitive boolean functions AND, OR, NAND (¬ AND), and NOR (¬ OR). • The ability of perceptron to represent AND, OR, NAND, and NOR is important because every boolean function can be represented by some network of interconnected units based on these primitives. 8/13/2024 37 Dr. Shivashankar, ISE, GAT
  • 38.
    The Perceptron TrainingRule • learning problem is to determine a weight vector that causes the perceptron to produce the correct output for each of the given training examples. • One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example. • This process is repeated, iterating through the training examples as many times as needed until the perceptron classifies all training examples correctly. • At every step of feeding a training example, when the perceptron fails to produce the correct +1/-1, we revise every weight 𝑤𝑖 associated with every input 𝑥𝑖, according to the following rule: 𝑤𝑖 ← 𝑤𝑖 + ∆𝑤𝑖 Where ∆𝑤𝑖=ƞ 𝑡 − 𝑜 𝑥𝑖 t is the target output for the current training example, o is the output generated by the perceptron, and ƞ is a positive constant called the learning rate. The role of the learning rate is to moderate the degree to which weights are changed at each step. ∆ : This is the learning rate, or the step size. 8/13/2024 38 Dr. Shivashankar, ISE, GAT
  • 39.
    The Perceptron TrainingRule • In order to train the Perceptron f(X<W): 𝑤𝑖 ← 𝑤𝑖 + ∆𝑤𝑖 Where ∆𝑤𝑖=ƞ 𝑡 − 𝑜 𝑥𝑖 8/13/2024 39 Dr. Shivashankar, ISE, GAT  Initialize the weight, W, randomly.  For as many times as necessary: For each training examples x𝝐𝑿  Compute f(x,W)  If x is misclassified: Modify the weight, 𝑤𝑖 associated with every 𝑥𝑖 in x.
  • 40.
    Problem • Problem 6:Compute AND gate using single perceptron training rule. • Solution: A B Y=ቊ 1 𝑖𝑓 𝑤𝑥 + 𝑏 > 0 0 𝑖𝑓 𝑤𝑥 + 𝑏 ≤ 0 • Assume w1=1, w2=1 and bias= -1 • Perceptron training rule : y= w1x1+w2x2+b • If x1=0, x2=0, then 0+0-1= -1 0 1 0+1-1= 0 -1 1 0 1+0-1= 0 1 y=1 1 1 1+1-1= 1 1 8/13/2024 40 Dr. Shivashankar, ISE, GAT A B Y=A.B 0 0 0 0 1 0 1 0 0 1 1 1 Y=A.B AND b x1 x2
  • 41.
    Problems • Problem 7:Compute OR gate using single perceptron training rule. • Solution: Y=ቊ 1 𝑖𝑓 𝑤𝑥 + 𝑏 > 0 0 𝑖𝑓 𝑤𝑥 + 𝑏 ≤ 0 • Assume w1=1, w2=1 and bias= -1 • Perceptron training rule : y= w1x1+w2x2+b • If x1=0, x2=0, then 0+0-1= -1 0 1 0+1-1= 0 But output y= 0 and target =1, misclassification, let us change the w1=1, w2=2. Then, y= w1x1+w2x2+b and w1=1, w2=2, b=-1 For (0,0), y= 0+0 -1= -1 (0,1), y= 1x0 +2x1-1 = 1 (1,0), y= 1x1+2x0 -1=0, But output = 0 and target =1, misclassification, so let us change the w1=2 and w2=2 (0,0), y= 0+0 -1= -1 (0,1), y= 2x0 +2x1-1 = 1 (1,0), y= 2x1+0x2-1 = 1 (1,1), y= 2x1+2x1-1= 3 8/13/2024 41 Dr. Shivashankar, ISE, GAT A B Y=A+B 0 0 0 0 1 1 1 0 1 1 1 1 OR b x1 x2 2 2 Y=1 -1
  • 42.
    Problems • Problem 7:Compute NAND gate using single perceptron training rule. • Solution: • Assume w1=1, w2=1 and bias= -1 • If x1=0, x2=0, then 0+0-1= -1 • Change w1=1, w2=1 and bias= 1 if (0,0), y= 1√ (0,1), y= 2√ (1,0), y= 2√ (1,1), y= 3 X Change w1= -1, w2= -1 and bias= 2 if (0,0), y= 2√ (0,1), y= 1√ (1,0), y= 1√ (1,1), y= 0 √ 8/13/2024 42 Dr. Shivashankar, ISE, GAT A B Y=𝐴. 𝐵 0 0 1 0 1 1 1 0 1 1 1 0 NAND b x1 x2 -1 -1 Y=1 2
  • 43.
    Problem • Problem 6:Compute NOR gate using single perceptron training rule. • Solution: • Assume w1=-1, w2=-1 and bias= 1 • Perceptron training rule : y= w1x1+w2x2+b • If x1=0, x2=0, then 0+0+1= -1 0 1 0-1+1= 0 1 0 -1+0+1= 0 1 1 -1-1+1= 1 8/13/2024 43 Dr. Shivashankar, ISE, GAT A B Y=𝐴 + 𝐵 0 0 1 0 1 0 1 0 0 1 1 0 NOR b x1 x2 1 -1 -1 Y=1
  • 44.
    Problem • Problem 8:Compute NOT gate using single perceptron training rule. • Solution: • Y=O=wx+b, when w=1 and b=-1 • When x=0, y=1X1-1=0, misclassification, change b=1 if we change w value it doesn't reflect any changes. W=1, b=1, now if x=0, y=0+1=1. both output and target are mapping. If x=1, y=wx+b=1+1=2, misclassification of the output and target value. So change w=-1 and b=1 If x=0, y=1x0+1=1 x=-1, y= -1x1+1=0, both output and target values are mapping. 8/13/2024 44 Dr. Shivashankar, ISE, GAT NOT b x +1 1 y
  • 45.
    Problem Problem 1: Assume𝑤1 = 0.6 𝑎𝑛𝑑 𝑤2 = 0.6, 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 = 1 𝑎𝑛𝑑 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔 𝑟𝑎𝑡𝑒 ƞ=0.5. Compute OR gate using perceptron training rule. Solution : 1. A=0, B=0 and target=0 𝑤𝑖𝑥𝑖 = 𝑤1𝑥1+𝑤2𝑥2 =0.6*0+0.6*0=0 This is not greater than the threshold value of 1. So the output =0 2. A=0, B=1 and target =1 𝑤𝑖𝑥𝑖 = 0.6 ∗ 0 +0.6*1= 0.6 This is not greater than the threshold value of 1. So the output =0. 𝑤𝑖=𝑤𝑖+ƞ(t-o) 𝑥𝑖 𝑤1=0.6+0.5(1-0)0=0.6 𝑤2=0.6+0.5(1-0)1=1.1 Now 𝒘𝟏=0.6, 𝒘𝟐=1.1, threshold = 1 and learning rate ƞ=0.5 8/13/2024 45 Dr. Shivashankar, ISE, GAT A B Y=A+B (Target) 0 0 0 0 1 1 1 0 1 1 1 1
  • 46.
    Problem • Now 𝒘𝟏=0.6,𝒘𝟐=1.1, threshold = 1 and learning rate ƞ=0.5 1. A=0, B=0 and target=0 𝑤𝑖𝑥𝑖 = 𝑤1𝑥1+𝑤2𝑥2 =0.6*0+1.1*0=0 This is not greater than the threshold value of 1. So the output =0 2. A=0, B=1 and target =1 𝑤𝑖𝑥𝑖 = 0.6 ∗ 0 +1.1*1= 1.1 This is greater than the threshold value of 1. So the output =1. 3. A=1, B=0 and target =1 𝑤𝑖𝑥𝑖 = 0.6 ∗ 1 +1.1*0= 0.6 This is not greater than the threshold value of 1. So the output =0. 𝑤𝑖=𝑤𝑖+ƞ(t-0) 𝑥𝑖 𝑤1=0.6+0.5(1-0)1=1.1 𝑤2=1.1+0.5(1-0)0=1.1 8/13/2024 46 Dr. Shivashankar, ISE, GAT
  • 47.
    Problem • Now 𝒘𝟏=1.1,𝒘𝟐=1.1, threshold = 1 and learning rate ƞ=0.5 1. A=0, B=0 and target=0 𝑤𝑖𝑥𝑖 = 𝑤1𝑥1+𝑤2𝑥2 =1.1*0+1.1*0=0 This is not greater than the threshold value of 1. So the output =0 2. A=0, B=1 and target =1 𝑤𝑖𝑥𝑖 = 1.1 ∗ 0 +1.1*1= 1.1 This is greater than the threshold value of 1. So the output =1. 3. A=1, B=0 and target =1 𝑤𝑖𝑥𝑖 = 1.1 ∗ 1 +1.1*0= 1.1 This is greater than the threshold value of 1. So the output =1. 4. A=1. B=1 and target =1 𝑤𝑖=1.1*1+1.1*1=2.2 This is greater than the threshold value of 1. So the output =1. 8/13/2024 47 Dr. Shivashankar, ISE, GAT B A 1.1 1.1 ∈ 𝜃 = 1 𝑂𝑢𝑡𝑝𝑢𝑡
  • 48.
    Problem Problem 2: Assume𝑤1 = 1.2 𝑎𝑛𝑑 𝑤2 = 0.6, 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 = 1 𝑎𝑛𝑑 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔 𝑟𝑎𝑡𝑒 ƞ=0.5. Compute AND gate using perceptron training rule. Solution : 1. A=0, B=0 and target=0 𝑤𝑖𝑥𝑖 = 𝑤1𝑥1+𝑤2𝑥2 =1.2*0+0.6*0=0 This is not greater than the threshold value of 1. So the output =0 2. A=0, B=1 and target =0 𝑤𝑖𝑥𝑖 = 1.2 ∗ 0 +0.6*1= 0.6 This is not greater than the threshold value of 1, So the output =0. 3. A=1, B=0 and target =0 𝑤𝑖𝑥𝑖 = 1.2 ∗ 1 +0.6*0= 1.2 This is greater than the threshold value of 1, So the output =1. 𝑤𝑖=𝑤𝑖+ƞ(t-o) 𝑥𝑖 𝑤1=1.2+0.5(0-1)1=0.7 𝑤2=0.6+0.5(0-1)0=0.6 Now 𝒘𝟏=0.7, 𝒘𝟐=0.6, threshold = 1 and learning rate ƞ=0.5 8/13/2024 48 Dr. Shivashankar, ISE, GAT A B Y=A+B (Target) 0 0 0 0 1 0 1 0 0 1 1 1
  • 49.
    Problems For 𝒘𝟏=0.7, 𝒘𝟐=0.6,threshold = 1 and learning rate ƞ=0.5 1. A=0, B=0 and target=0 𝑤𝑖𝑥𝑖 = 𝑤1𝑥1+𝑤2𝑥2 =0.7*0+0.6*0=0 This is not greater than the threshold value of 1. So the output =0 2. A=0, B=1 and target =0 𝑤𝑖𝑥𝑖 = 0.7 ∗ 0 +0.6*1= 0.6 This is not greater than the threshold value of 1. So the output =0. 3. A=1, B=0 and target =0 𝑤𝑖𝑥𝑖 = 0.7 ∗ 1 +0.6*0= 0.7 This is not greater than the threshold value of 1. So the output =0. 4. A=1, B=1 and target =1 𝑤𝑖𝑥𝑖 = 0.7 ∗ 1 +0.6*1= 1.3 This is greater than the threshold value of 1. So the output =1. 8/13/2024 49 Dr. Shivashankar, ISE, GAT A B 0.7 0.6 ∈ 𝜃 = 1 Weighted sum Output
  • 50.
    Problem • Problem 3:consider X-OR gate, compute Perceptron training rule with threshold =1 and learning rate=1.5. • Solution: y=𝑥1 ҧ 𝑥2 + ҧ 𝑥1𝑥2 • Y=𝑍1+𝑍2 • Where, 𝑍1 = 𝑥1 ҧ 𝑥2(𝐹𝑢𝑛𝑐𝑡𝑖𝑜𝑛 1), • 𝑍2= ҧ 𝑥1𝑥2 (Function 2) • Y=𝑍1 OR 𝑍2(Function 3) • First function: 𝑍1 = 𝑥1 ҧ 𝑥2 • Assume the initial weights are 𝑊11=𝑊21=1 • Threshold =1 and Learning rate=1.5 8/13/2024 50 Dr. Shivashankar, ISE, GAT 𝑥1 𝑥2 y 0 0 0 0 1 1 1 0 1 1 1 0 𝑋1 𝑋2 𝑍1 Y 𝑍2 𝑥1 𝑤11 𝑥2 𝑤12 𝑤21 𝑤22 𝑦1 𝑦2 y 𝑥1 𝑥2 𝑍1 0 0 0 0 1 0 1 0 1 1 1 0
  • 51.
    Problem (0,0) 𝑍1𝑖𝑛=𝑤𝑖,𝑗 ∗𝑥𝑖 = 1 ∗ 0 + 1 ∗ 0 = 0 (output=0) (0,1) 𝑍1𝑖𝑛=1 ∗ 0 + 1 ∗ 1 = 1 (output=1) 𝑤𝑖,𝑗=𝑤𝑖,𝑗+ƞ(t-o)𝑥𝑖 𝑤11=1+1.5(0-1)0=1 𝑤21=1+1.5(0-1)1=-0.5 Now, 𝑤11=1, 𝑤21=-0.5, threshold=1 and learning rate=1.5 (0,0) 𝑍1𝑖𝑛=𝑤𝑖,𝑗 ∗ 𝑥𝑖 = 1 ∗ 0 + (−0.5) ∗ 0 = 0 (output=0) (0,1) 𝑍1𝑖𝑛=𝑤𝑖,𝑗 ∗ 𝑥𝑖 = 1 ∗ 0 + −0.5 ∗ 1 = −0.5 (output=0) (1,0) 𝑍1𝑖𝑛=𝑤𝑖,𝑗 ∗ 𝑥𝑖 = 1 ∗ 1 + (−0.5) ∗ 0 = 1 (output=1) (1,1) 𝑍1𝑖𝑛=𝑤𝑖,𝑗 ∗ 𝑥𝑖 = 1 ∗ 1 + (−0.5) ∗ 1 = 0.5 (output=0) …………………………………………………………………………………………………………………………………… Second function: 𝑍2= ҧ 𝑥1𝑥2 • Assume the initial weights are 𝑊12=𝑊22=1 • Threshold =1 and Learning rate=1.5 • (0,0) 𝑍2𝑖𝑛=𝑤𝑖,𝑗 ∗ 𝑥𝑖 = 1 ∗ 0 + 1 ∗ 0 = 0 (output=0) • (0,1) 𝑍2𝑖𝑛=1 ∗ 0 + 1 ∗ 1 = 1 (output=1) • (1,0) 𝑍2𝑖𝑛=1 ∗ 1 + 1 ∗ 0 = 1 (output=1) 8/13/2024 51 Dr. Shivashankar, ISE, GAT 𝑥1 𝑥2 𝑧2 0 0 0 0 1 1 1 0 0 0 0 0
  • 52.
    Problem 𝑤𝑖,𝑗=𝑤𝑖,𝑗+ƞ(t-o)𝑥𝑖 𝑤12=1+1.5(0-1)1= -0.5 𝑤22=1+1.5(0-1)0= 1 Now,𝑤12= -0.5, 𝑤22= 1, threshold=1, learning rate=1.5 • (0,0) 𝑍2𝑖𝑛=𝑤𝑖,𝑗 ∗ 𝑥𝑖 = −0.5 ∗ 0 + 1 ∗ 0 = 0 (output=0) • (0,1) 𝑍2𝑖𝑛=(-0.5) ∗ 0 + 1 ∗ 1 = 1 (output=1) • (1,0) 𝑍2𝑖𝑛= −0.5 ∗ 1 + 1 ∗ 0 = −0.5 (output=0) • (1,1) 𝑍2𝑖𝑛= −0.5 ∗ 1 + 1 ∗ 1 = 0.5 (output=0) • Y=𝑍1 OR 𝑍2 𝑦𝑖𝑛 = 𝑍1𝑣1 + 𝑍2𝑣2 • Assume the initial weights are XOR table • 𝑣1 = 𝑣2 = 1, threshold=1, learning rate=1.5 • (0,0) 𝑦𝑖𝑛=𝑣𝑖 ∗ 𝑧𝑖 = 1 ∗ 0 + 1 ∗ 0 = 0 (output=0) • (0,1) 𝑦𝑖𝑛=1 ∗ 0 + 1 ∗ 1 = 1 (output=1) • (1,0) 𝑦𝑖𝑛=1 ∗ 1 + 1 ∗ 0 =1 (output=1) • (0,0) 𝑦𝑖𝑛=1 ∗ 0 + 1 ∗ 0 = 0 (output=0) • ∴ 𝑤11 = 1, 𝑤12 = −0.5, 𝑤21 = −0.5, 𝑤22 = 1 • 𝑣1 = 𝑣2 = 1. 8/13/2024 52 Dr. Shivashankar, ISE, GAT 𝑥1 𝑥2 𝑍1 𝑍2 𝑦𝑖𝑛 0 0 0 0 0 0 1 0 1 1 1 0 1 0 1 1 1 0 0 0
  • 53.
    Problem • Problem 4:Consider NAND gate, compute Perceptron training rule with W1=1.2, W2=0.6 threshold =-1 and learning rate=1.5. • Solution: 8/13/2024 53 Dr. Shivashankar, ISE, GAT A B Y=𝐴. 𝐵 0 0 1 0 1 1 1 0 1 1 1 0
  • 54.
    Problem • Problem 5:Consider NOR gate, compute Perceptron training rule with W1=0.6, W1=1. threshold =-0.5 and learning rate=1.5. • Solution: 8/13/2024 54 Dr. Shivashankar, ISE, GAT A B Y=𝐴 + 𝐵 0 0 1 0 1 0 1 0 0 1 1 0
  • 55.
    Gradient Descent andthe Delta Rule • It is also important because gradient descent can serve as the basis for learning algorithms that must search through hypothesis spaces. • The delta training rule is best understood by considering the task of training an unthresholded perceptron; that is, a linear unit for which the output o is given by o=𝑤0 + 𝑤1𝑥1 + 𝑤2𝑥2 +…………..+𝑤𝑛𝑥𝑛 O( Ԧ 𝑥)=𝑤. Ԧ 𝑥 Thus, a linear unit corresponds to the first stage of a perceptron, without the threshold.  Although there are many ways to define this error, one common measure that will turn out to be especially convenient is  𝐸 𝑤 = 1 2 σ𝑑𝜖𝐷 𝑡𝑑 − 𝑜𝑑 2  where D is the set of training examples, 𝑡𝑑 is the target output for training example d, and 𝑜𝑑is the output of the linear unit for training example d. Gradient Descent and the Delta Rule for each weight changed by ∆𝑤𝑗𝑖=ƞ𝛿𝑗𝑂𝑗 𝛿𝑗=𝑂𝑗(1=𝑂𝑗) (𝑡𝑗 − 𝑂𝑗) if j is an output unit 𝛿𝑗=𝑂𝑗(1=𝑂𝑗)σ𝑘 𝛿𝑘𝑤𝑘𝑗 if j is a hidden unit Where ƞ is a constant called the learning rate 𝑡𝑗is the correct teacher output for unit j 𝛿𝑗is the error measure for unit j 8/13/2024 55 Dr. Shivashankar, ISE, GAT
  • 56.
    The Backpropagation Algorithm •Backpropagation is an effective algorithm used to train artificial neural networks, especially in feed-forward neural networks. • Its an iterative algorithm, that helps to minimize the cost function by determining which weights and biases should be adjusted to minimize the loss by moving down towards the gradient of the error. Let us Consider networks with multiple output units rather than single units as before, we begin by redefining E to sum the errors over all of the network output units 𝐸 𝑤 = 1 2 ෍ 𝑑𝜖𝐷 ෍ 𝑘𝜖𝑜𝑢𝑡𝑝𝑢𝑡𝑠 𝑡𝑘𝑑 − 𝑜𝑘𝑑 2 Where, outputs is the set of output units in the network, and 𝑡𝑘𝑑 and 𝑜𝑘𝑑 are the target and output values associated with the kth output unit and training example d. 8/13/2024 56 Dr. Shivashankar, ISE, GAT
  • 57.
    Case 1: Computeand derive the increment (∆) for output unit weight in The Backpropagation Algorithm (𝒐𝒋) Derivation: 𝜕𝐸𝑑 𝑗𝑛𝑒𝑡𝑗 = 𝜕𝐸𝑑𝜕0𝑗 𝜕0𝑗 𝑗𝑛𝑒𝑡𝑗 𝜕𝑜𝑗 𝑗𝑛𝑒𝑡𝑗 = 𝜕 𝜕𝑜𝑗 1 2 σ𝑘𝜖𝑜𝑢𝑡𝑝𝑢𝑡𝑠 𝑡𝑘 − 𝑜𝑘 2 = 𝜕 𝜕𝑜𝑗 1 2 𝑡𝑗 − 𝑜𝑗 2 = 1 2 * 2(𝑡𝑗 − 𝑜𝑗) 𝜕(𝑡𝑗−𝑜𝑗) 𝑗𝑜𝑗 = -(𝑡𝑗−𝑜𝑗)  𝜕𝐸𝑑 𝑗𝑛𝑒𝑡𝑗 = -(𝑡𝑗−𝑜𝑗) 𝑜𝑗(1 − 𝑜𝑗) = -𝑜𝑗 (1 − 𝑜𝑗) (𝑡𝑗−𝑜𝑗) And 𝛿𝑗 ≠ − 𝜕𝐸𝑑 𝑗𝑛𝑒𝑡𝑗 = 𝑜𝑗 (1 − 𝑜𝑗) (𝑡𝑗−𝑜𝑗) . ∆𝑤𝑗𝑖 = ƞ𝛿𝑗𝑥𝑗𝑖= ƞ𝑜𝑗 (1 − 𝑜𝑗) (𝑡𝑗−𝑜𝑗) 𝑥𝑗𝑖 . 𝒐𝒋 8/13/2024 57 Dr. Shivashankar, ISE, GAT Type equation here. ∈ + 𝑛𝑒𝑡𝑗 𝛿
  • 58.
    The Backpropagation Algorithm BACKPROPOGATION(training_examples, ƞ, 𝒏𝒊𝒏, 𝒏𝒐𝒖𝒕, 𝒏𝒉𝒐𝒅𝒅𝒆𝒏) Each training example is a pair of the form ( Ԧ 𝑥,Ԧ 𝑡), where Ԧ 𝑥 is the vector of network input values, and 𝑡 is the vector of target network output values. ƞ is the learning rate (e.g., .O5). 𝑛𝑖𝑛, is the number of network inputs, 𝑛ℎ𝑖𝑑𝑑𝑒𝑛 the number of units in the hidden layer, and 𝑛𝑜𝑢𝑡, the number of output units. The input from unit i into unit j is denoted 𝑥𝑖,𝑗, and the weight from unit i to unit j is denoted 𝑤𝑖,𝑗.  Create a feed-forward network with 𝑛𝑖𝑛 inputs, 𝑛ℎ𝑖𝑑𝑑𝑒𝑛 hidden units, and 𝑛𝑜𝑢𝑡 output units.  Until the termination condition is met, do  For each ( Ԧ 𝑥,Ԧ 𝑡) in training_examples, do Propagate the input forward through the network: 1, Input the instance Ԧ 𝑥 to the network and compute the output 𝑜𝑢 of every unit u in the network: 𝑎𝑗 = σ𝑗 𝑤𝑖𝑗 ∗ 𝑥𝑖 𝑎𝑛𝑑 𝑦𝑗 = 𝐹 𝑎𝑗 = 1 1+𝑒 −𝑎𝑗 Propagate the errors backward through the network: 2. For each network output unit k, calculate its error term 𝛿𝑘. 𝛿𝑘 ← 𝑜𝑘 1 − 𝑜𝑘 𝑡𝑘 − 𝑜𝑘 3. For each hidden unit h, calculate its error term 𝛿ℎ. 𝛿𝑘 ← 𝑜ℎ 1 − 𝑜ℎ ෍ 𝑘𝜖𝑜𝑢𝑡𝑝𝑢𝑡𝑠 𝑤𝑘ℎ𝛿𝑘 4. Update each network weight 𝑤𝑗𝑖 𝑤𝑗𝑖 ← 𝑤𝑗𝑖+∆𝑤𝑗𝑖 where , ∆𝑤𝑗𝑖=ƞ𝛿𝑗𝑥𝑗𝑖 8/13/2024 58 Dr. Shivashankar, ISE, GAT
Problems
Problem 1: Assume that the neurons have a sigmoid activation function. Perform a forward pass and a backward pass on the network. Assume that the actual (target) output y is 0.5 and the learning rate is 1. Perform another forward pass.
Network (from the figure): inputs x1 = 0.35, x2 = 0.9; hidden units H3, H4; output unit O5; weights w13 = 0.1, w23 = 0.8, w14 = 0.4, w24 = 0.6, w35 = 0.3, w45 = 0.9.
Solution:
Forward pass: compute the outputs y3, y4 and y5 using a_j = Σ_i w_ij x_i and y_j = F(a_j) = 1 / (1 + e^(-a_j))
  a3 = w13·x1 + w23·x2 = 0.1*0.35 + 0.8*0.9 = 0.755,  y3 = f(a3) = 1/(1 + e^(-0.755)) = 0.68
  a4 = w14·x1 + w24·x2 = 0.4*0.35 + 0.6*0.9 = 0.68,   y4 = f(a4) = 1/(1 + e^(-0.68)) = 0.6637
  a5 = w35·y3 + w45·y4 = 0.3*0.68 + 0.9*0.6637 = 0.801
Conti..
  y5 = f(a5) = 1/(1 + e^(-0.801)) = 0.69  (network output)
  ∴ Error = y_target - y5 = 0.5 - 0.69 = -0.19
Each weight is changed by Δw_ji = η δ_j o_i, where
  δ_j = o_j (1 - o_j)(t_j - o_j)        if j is an output unit
  δ_j = o_j (1 - o_j) Σ_k δ_k w_kj      if j is a hidden unit
  η is the learning rate, t_j the correct (teacher) output for unit j, and δ_j the error term for unit j.
Backward pass: compute δ3, δ4 and δ5
  Output unit:  δ5 = y5(1 - y5)(y_target - y5) = 0.69 * (1 - 0.69) * (0.5 - 0.69) = -0.0406
  Hidden units: δ3 = y3(1 - y3)(w35·δ5) = 0.68 * (1 - 0.68) * (0.3 * -0.0406) = -0.00265
                δ4 = y4(1 - y4)(w45·δ5) = 0.6637 * (1 - 0.6637) * (0.9 * -0.0406) = -0.0082
Conti..
Compute the new weights: Δw_ji = η δ_j o_i
  Δw45 = η δ5 y4 = 1 * (-0.0406) * 0.6637 = -0.0269,   w45(new) = Δw45 + w45(old) = -0.0269 + 0.9 = 0.8731
  Δw14 = η δ4 x1 = 1 * (-0.0082) * 0.35 = -0.00287,    w14(new) = Δw14 + w14(old) = -0.00287 + 0.4 = 0.3971
Similarly, update all the other weights:
  i  j   w_ij   δ_j        o_i (= x_i)  η   updated w_ij
  1  3   0.1    -0.00265   0.35         1   0.0991
  2  3   0.8    -0.00265   0.9          1   0.7976
  1  4   0.4    -0.0082    0.35         1   0.3971
  2  4   0.6    -0.0082    0.9          1   0.5926
  3  5   0.3    -0.0406    0.68         1   0.2724
  4  5   0.9    -0.0406    0.6637       1   0.8731
Conti..
Updated network, 2nd forward pass: compute the outputs y3, y4 and y5
  a3 = w13·x1 + w23·x2 = 0.0991*0.35 + 0.7976*0.9 = 0.7525,  y3 = f(a3) = 1/(1 + e^(-0.7525)) = 0.6797
  a4 = w14·x1 + w24·x2 = 0.3971*0.35 + 0.5926*0.9 = 0.6723,  y4 = f(a4) = 1/(1 + e^(-0.6723)) = 0.6620
  a5 = w35·y3 + w45·y4 = 0.2724*0.6797 + 0.8731*0.6620 = 0.7631,  y5 = f(a5) = 1/(1 + e^(-0.7631)) = 0.6820  (network output)
  Error = y_target - y5 = 0.5 - 0.6820 = -0.182
The error has reduced in magnitude from -0.19 to -0.182 after one round of weight updates.
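The arithmetic in this worked example can be reproduced with a few lines of Python. This sketch (added for illustration, not from the slides) hard-codes the network of Problem 1 and prints the quantities computed above; small differences in the last digit come from rounding in the slides.

```python
import numpy as np

sig = lambda a: 1 / (1 + np.exp(-a))

x1, x2, y_target, eta = 0.35, 0.9, 0.5, 1.0
w13, w23, w14, w24, w35, w45 = 0.1, 0.8, 0.4, 0.6, 0.3, 0.9

for step in (1, 2):
    # forward pass
    y3 = sig(w13 * x1 + w23 * x2)
    y4 = sig(w14 * x1 + w24 * x2)
    y5 = sig(w35 * y3 + w45 * y4)
    print(f"pass {step}: y5 = {y5:.4f}, error = {y_target - y5:.4f}")
    # backward pass (error terms)
    d5 = y5 * (1 - y5) * (y_target - y5)
    d3 = y3 * (1 - y3) * (w35 * d5)
    d4 = y4 * (1 - y4) * (w45 * d5)
    # weight updates: w_ji += eta * delta_j * (input to unit j)
    w35 += eta * d5 * y3;  w45 += eta * d5 * y4
    w13 += eta * d3 * x1;  w23 += eta * d3 * x2
    w14 += eta * d4 * x1;  w24 += eta * d4 * x2
```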
Conti..
Problem 2: Assume that the neurons have a sigmoid activation function. Perform a forward pass and a backward pass on the network. Assume that the actual (target) output y is 1 and the learning rate is 0.9. Perform another forward pass.
Network (from the figure): inputs x1 = 1, x2 = 0, x3 = 1; hidden units H4, H5; output unit O6; weights w14 = 0.2, w24 = 0.4, w34 = -0.5, w15 = -0.3, w25 = 0.1, w35 = 0.2, w46 = -0.3, w56 = -0.2; biases θ4 = -0.4, θ5 = 0.2, θ6 = 0.1; actual (target) output = 1.
Solution: Forward pass: compute the outputs y4, y5 and y6.
Conti..
Using a_j = Σ_i w_ij x_i + θ_j and y_j = F(a_j) = 1 / (1 + e^(-a_j)):
  a4 = w14·x1 + w24·x2 + w34·x3 + θ4 = (0.2*1) + (0.4*0) + (-0.5*1) + (-0.4) = -0.7,  o(H4) = y4 = f(a4) = 1/(1 + e^(0.7)) = 0.332
  a5 = w15·x1 + w25·x2 + w35·x3 + θ5 = (-0.3*1) + (0.1*0) + (0.2*1) + 0.2 = 0.1,      o(H5) = y5 = f(a5) = 1/(1 + e^(-0.1)) = 0.525
  a6 = w46·y4 + w56·y5 + θ6 = (-0.3*0.332) + (-0.2*0.525) + 0.1 = -0.105,             o(O6) = y6 = f(a6) = 1/(1 + e^(0.105)) = 0.474
  Error = y_target - y6 = 1 - 0.474 = 0.526
Backward pass:
  Output unit:  δ6 = y6(1 - y6)(y_target - y6) = 0.474 * (1 - 0.474) * (1 - 0.474) = 0.1311
  Hidden units: δ5 = y5(1 - y5)(w56·δ6) = 0.525 * (1 - 0.525) * (-0.2 * 0.1311) = -0.0065
                δ4 = y4(1 - y4)(w46·δ6) = 0.332 * (1 - 0.332) * (-0.3 * 0.1311) = -0.0087
Conti..
Compute the new weights: Δw_ij = η δ_j o_i
  Δw46 = η δ6 y4 = 0.9 * 0.1311 * 0.332 = 0.03917,   w46(new) = Δw46 + w46(old) = 0.03917 + (-0.3) = -0.261
  Δw14 = η δ4 x1 = 0.9 * (-0.0087) * 1 = -0.0078,    w14(new) = Δw14 + w14(old) = -0.0078 + 0.2 = 0.192
Similarly, update all the other weights (each bias θ_j is updated the same way, using input 1):
  i  j   w_ij   δ_j       o_i (= x_i)  η    updated w_ij
  4  6   -0.3   0.1311    0.332        0.9  -0.261
  5  6   -0.2   0.1311    0.525        0.9  -0.138
  1  4   0.2    -0.0087   1            0.9  0.192
  1  5   -0.3   -0.0065   1            0.9  -0.306
  2  4   0.4    -0.0087   0            0.9  0.4
  2  5   0.1    -0.0065   0            0.9  0.1
  3  4   -0.5   -0.0087   1            0.9  -0.508
  3  5   0.2    -0.0065   1            0.9  0.194
Conti..
Updated network, 2nd forward pass: compute the outputs y4, y5 and y6 (updated biases: θ4 = -0.408, θ5 = 0.194, θ6 = 0.218)
  a4 = w14·x1 + w24·x2 + w34·x3 + θ4 = (0.192*1) + (0.4*0) + (-0.508*1) + (-0.408) = -0.724,  o(H4) = y4 = 1/(1 + e^(0.724)) = 0.327
  a5 = w15·x1 + w25·x2 + w35·x3 + θ5 = (-0.306*1) + (0.1*0) + (0.194*1) + 0.194 = 0.082,      o(H5) = y5 = 1/(1 + e^(-0.082)) = 0.520
  a6 = w46·y4 + w56·y5 + θ6 = (-0.261*0.327) + (-0.138*0.520) + 0.218 = 0.061,                o(O6) = y6 = 1/(1 + e^(-0.061)) = 0.515  (network output)
  Error = y_target - y6 = 1 - 0.515 = 0.485, smaller than the 0.526 obtained before the update.
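As with Problem 1, the numbers above can be checked with a short script (illustrative, not from the slides); here the biases are treated as extra weights whose input is fixed at 1.

```python
import numpy as np

sig = lambda a: 1 / (1 + np.exp(-a))

x = np.array([1.0, 0.0, 1.0])                 # x1, x2, x3
eta, y_target = 0.9, 1.0
w4 = np.array([0.2, 0.4, -0.5]);  b4 = -0.4   # weights into H4 and its bias
w5 = np.array([-0.3, 0.1, 0.2]);  b5 = 0.2    # weights into H5 and its bias
w6 = np.array([-0.3, -0.2]);      b6 = 0.1    # weights into O6 and its bias

for step in (1, 2):
    y4, y5 = sig(w4 @ x + b4), sig(w5 @ x + b5)
    y6 = sig(w6 @ np.array([y4, y5]) + b6)
    print(f"pass {step}: y6 = {y6:.3f}, error = {y_target - y6:.3f}")
    d6 = y6 * (1 - y6) * (y_target - y6)
    d4 = y4 * (1 - y4) * (w6[0] * d6)
    d5 = y5 * (1 - y5) * (w6[1] * d6)
    w6 += eta * d6 * np.array([y4, y5]);  b6 += eta * d6
    w4 += eta * d4 * x;                   b4 += eta * d4
    w5 += eta * d5 * x;                   b5 += eta * d5
```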
Bayesian Learning: Conditional probability
• Conditional probability is the probability of an event that depends on a previous result or event.
• It helps us understand how events are related to each other.
• When the probability of one event happening does not influence the probability of another event, the events are called independent; otherwise they are dependent events.
• Conditional probability is defined as the probability of an event occurring given that another event has already occurred.
• In other words, it is the probability of one event happening given that a certain condition is satisfied.
• It is written P(A|B), the probability of A given that B has already happened.
Cont…
Conditional Probability Formula:
• When two events intersect, the conditional probability for the occurrence of one event given the other is:
  P(A|B) = N(A∩B)/N(B)  or  P(B|A) = N(A∩B)/N(A)
• where P(A|B) is the probability of A occurring given that B has occurred,
• N(A∩B) is the number of elements common to both A and B,
• N(B) is the number of elements in B, which cannot be zero.
• Let N be the total number of elements in the sample space; then N(A∩B)/N can be written as P(A∩B) and N(B)/N as P(B), so
  P(A|B) = P(A∩B)/P(B) = P(B|A) P(A) / P(B)
• Therefore P(A∩B) = P(B) P(A|B) if P(B) ≠ 0, and P(A∩B) = P(A) P(B|A) if P(A) ≠ 0.
• Similarly, the probability of B occurring when A has already occurred is P(B|A) = P(B∩A)/P(A).
Cont…
How to Calculate Conditional Probability?
To calculate the conditional probability, we can use the following method:
Step 1: Identify the events; call them Event A and Event B.
Step 2: Determine the probability of Event A, i.e., P(A).
Step 3: Determine the probability of Event B, i.e., P(B).
Step 4: Determine the probability of Event A and B together, i.e., P(A∩B).
Step 5: Apply the conditional probability formula and calculate the required probability.
Conditional Probability of Independent Events
For independent events A and B, the conditional probability of each with respect to the other is:
P(B|A) = P(B)
P(A|B) = P(A)
Cont…
Problem 1: Two dice are thrown simultaneously, and the sum of the numbers obtained is found to be 7. What is the probability that the number 3 has appeared at least once?
Solution:
• Event A: the combinations in which 3 appears at least once.
• Event B: the combinations of numbers that sum to 7.
• A = {(3,1), (3,2), (3,3), (3,4), (3,5), (3,6), (1,3), (2,3), (4,3), (5,3), (6,3)}
• B = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}
• P(A) = 11/36,  P(B) = 6/36
• A∩B = {(3,4), (4,3)}, so N(A∩B) = 2 and P(A∩B) = 2/36
• Applying the conditional probability formula: P(A|B) = P(A∩B)/P(B) = (2/36)/(6/36) = 1/3
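This result is easy to confirm by brute-force enumeration of the 36 equally likely outcomes; the following sketch (added for illustration) does exactly that.

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))      # all 36 ordered rolls of two dice
B = [o for o in outcomes if sum(o) == 7]             # event B: sum is 7
A_and_B = [o for o in B if 3 in o]                   # A∩B: 3 appears at least once and sum is 7

print(Fraction(len(A_and_B), len(B)))                # P(A|B) = 1/3
```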
Cont…
Problem 2: In a group of 100 computer buyers, 40 bought a CPU, 30 purchased a monitor, and 20 purchased both a CPU and a monitor. If a computer buyer chosen at random bought a CPU, what is the probability that they also bought a monitor?
Solution:
As per the first event, 40 out of 100 bought a CPU, so P(A) = 40% = 0.4.
20 buyers purchased both a CPU and a monitor; this is the intersection of the two events, so P(A∩B) = 20% = 0.2.
The required conditional probability is P(B|A) = P(A∩B)/P(A) = 0.2/0.4 = 1/2 = 0.5.
The probability that a buyer bought a monitor, given that they purchased a CPU, is 50%.
Cont…
Question 7: In a survey among a group of students, 70% play football, 60% play basketball, and 40% play both sports. If a student is chosen at random and it is known that the student plays basketball, what is the probability that the student also plays football?
Solution: Assume there are 100 students in the survey.
Number of students who play football: n(A) = 70
Number of students who play basketball: n(B) = 60
Number of students who play both sports: n(A∩B) = 40
To find the probability that a student plays football given that they play basketball, we use the conditional probability formula:
P(A|B) = n(A∩B)/n(B) = 40/60 = 2/3
Therefore, the probability that a randomly chosen student who plays basketball also plays football is 2/3.
BAYES THEOREM
• Bayes' theorem describes the probability of occurrence of an event related to a given condition, and is used to determine the conditional probability of an event.
• Bayesian methods provide the basis for probabilistic learning methods that accommodate knowledge about the prior probabilities of alternative hypotheses.
• To state Bayes theorem precisely:
• P(h) denotes the initial probability that hypothesis h holds. It is often called the prior probability of h and may reflect any background knowledge about the chance that h is a correct hypothesis.
• P(D) denotes the prior probability that training data D will be observed (i.e., the probability of D given no knowledge about which hypothesis holds).
• P(D|h) denotes the probability of observing data D given some world in which hypothesis h holds.
• P(h|D) is called the posterior probability of h, because it reflects our confidence that h holds after the data D has been seen.
• Notice that the posterior probability P(h|D) reflects the influence of the training data D, in contrast to the prior probability P(h), which is independent of D.
BAYES THEOREM
• If A and B are two events, then Bayes theorem is:
  P(A|B) = P(B|A) P(A) / P(B)
  where P(A|B) is the posterior probability (probability of A when B has already occurred), P(B|A) the likelihood probability, P(A) the prior probability, and P(B) the marginal probability.
  P(A) – probability of event A;  P(B) – probability of event B;  P(A|B) – probability of A given B;  P(B|A) – probability of B given A.
• From the definition of conditional probability, Bayes theorem can be derived for events as follows:
  P(A|B) = P(A∩B)/P(B), where P(B) ≠ 0
  P(B|A) = P(B∩A)/P(A), where P(A) ≠ 0
• Since P(A∩B) and P(B∩A) are equal,
  P(A|B) P(B) = P(B|A) P(A)
  ∴ P(A|B) = P(B|A) P(A) / P(B); this is Bayes theorem.
Cont…
Problem 1: A patient takes a lab test for cancer diagnosis. There are two possible outcomes: ⊕ (positive) and ⊖ (negative). The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer. Compute the following values:
1) P(cancer)  2) P(¬cancer)  3) P(⊕|cancer)  4) P(⊖|cancer)  5) P(⊕|¬cancer)  6) P(⊖|¬cancer)
Solution:
• P(cancer) = 0.008,  P(¬cancer) = 0.992
• P(⊕|cancer) = 0.98,  P(⊖|cancer) = 0.02
• P(⊕|¬cancer) = 0.03,  P(⊖|¬cancer) = 0.97
Suppose the patient's test comes back positive:
• P(⊕|cancer) P(cancer) = 0.98 × 0.008 = 0.0078
• P(⊕|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298
• Thus h_MAP = ¬cancer.
• The exact posterior probabilities can be determined by normalizing the above quantities so that they sum to 1, e.g., P(cancer|⊕) = 0.0078 / (0.0078 + 0.0298) = 0.21.
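A small script (illustrative, not from the slides) makes the normalization step explicit and confirms that the MAP hypothesis for a positive test is ¬cancer.

```python
# Posterior over {cancer, not cancer} after a positive test, via Bayes theorem.
p_cancer = 0.008
p_pos_given_cancer = 0.98        # correct positive rate
p_pos_given_not = 0.03           # false-positive rate

joint_cancer = p_pos_given_cancer * p_cancer         # 0.00784
joint_not = p_pos_given_not * (1 - p_cancer)          # 0.02976

p_cancer_given_pos = joint_cancer / (joint_cancer + joint_not)
print(f"P(cancer | +) = {p_cancer_given_pos:.3f}")    # about 0.21, so MAP is 'not cancer'
```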
Cont…
Problem 3: Given that a person passed the exam, what is the probability that the person is a woman?
          Did not pass   Passed   Total
  Women   8              92       100
  Men     23             77       100
  Total   31             169      200
Answer: Let A be "the person is a woman" and B be "the person passed the exam".
P(B|A): probability of passing the exam given a woman = 92/100 = 0.92
P(A): probability of choosing a woman = 100/200 = 0.5
P(B): probability of passing the exam = 169/200 = 0.845
P(A|B) = P(B|A) × P(A) / P(B) = (0.92 × 0.5) / 0.845 = 0.54
Directly from the table, 92/169 = 0.54 as well.
Cont…
4. Covid-19 has taken over the world, and the use of Covid-19 tests is still relevant to block the spread of the virus and protect our families. If the Covid-19 infection rate is 10% of the population, and, thanks to the tests we have in Algeria, 95% of infected people test positive with a 5% false-positive rate, what is the probability that I am really infected if I test positive?
Solution: Parameters:
• P(A) = 10%: infected (so 90% are not infected)
• P(B|A) = 95%: test positive while infected
• P(B|¬A) = 5%: false positive while not infected
We multiply the probability of infection (10%) by the probability of testing positive given infection (95%), then divide by the total probability of testing positive: the same product plus the probability of not being infected (90%) multiplied by the false-positive rate (5%).
P(A|B) = P(A) P(B|A) / [P(A) P(B|A) + P(¬A) P(B|¬A)]
P(A|B) = (0.1 × 0.95) / ((0.95 × 0.1) + (0.05 × 0.90))
P(A|B) = 0.095 / (0.095 + 0.045) = 0.678
Cont…
2. Let A denote the event that a "patient has liver disease", and B the event that a "patient is an alcoholic". It is known from experience that 10% of the patients entering the clinic have liver disease and 5% of the patients are alcoholics. Also, among those patients diagnosed with liver disease, 7% are alcoholics. Given that a patient is an alcoholic, what is the probability that he will have liver disease?
Solution: A – "patient has liver disease"; B – "patient is an alcoholic".
P(A) = 10% = 0.1,  P(B) = 5% = 0.05,  P(B|A) = 7% = 0.07
P(A|B) = P(B|A) × P(A) / P(B) = (0.07 × 0.10) / 0.05 = 0.14
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES
• Many learning approaches, such as neural network learning, linear regression, and polynomial curve fitting, try to learn a continuous-valued target function.
• Under certain assumptions, any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood hypothesis.
• The significance of this result is that it provides a Bayesian justification (under certain assumptions) for many neural network and other curve-fitting methods that attempt to minimize the sum of squared errors over the training data.
• To find the maximum likelihood hypothesis in Bayesian learning for a continuous-valued target function, we start from the maximum likelihood hypothesis definition, using lower-case p to refer to the probability density function:
  h_ML = argmax_{h∈H} p(D|h)
  (argmax returns the argument that gives the maximum value of the function.)
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES
• h_ML = argmax_{h∈H} p(D|h), where p is a probability density function.
• Assume a fixed set of training instances (x1, x2, ..., xm) and that the data D is the corresponding sequence of target values D = (d1, d2, ..., dm).
• Treating the examples as independent given h, p(D|h) is the product of the individual densities p(d_i|h):
  h_ML = argmax_{h∈H} Π_{i=1..m} p(d_i|h)
• Assume the target values are normally distributed, with density
  f(x|μ, σ²) = (1/√(2πσ²)) e^(-(x-μ)²/(2σ²))   (μ is the mean, σ the standard deviation, σ² the variance)
• Taking the mean μ = h(x_i), the output of the hypothesis for the i-th input, and d_i the target for the i-th input:
  h_ML = argmax_{h∈H} Π_{i=1..m} (1/√(2πσ²)) e^(-(d_i - h(x_i))²/(2σ²))
Conti..
Rather than maximizing the expression above, we choose to maximize its (less complicated) logarithm. This is justified because ln p is a monotonic function of p, so maximizing ln p also maximizes p:
  h_ML = argmax_{h∈H} Σ_{i=1..m} [ ln(1/√(2πσ²)) - (1/(2σ²))(d_i - h(x_i))² ]
The first term is a constant independent of h, so it can be discarded:
  h_ML = argmax_{h∈H} Σ_{i=1..m} -(1/(2σ²))(d_i - h(x_i))²
Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity:
  h_ML = argmin_{h∈H} Σ_{i=1..m} (1/(2σ²))(d_i - h(x_i))²
Finally, we can again discard constants that are independent of h:
  h_ML = argmin_{h∈H} Σ_{i=1..m} (d_i - h(x_i))²
This is the least-squared error hypothesis: in Bayesian learning for a continuous-valued target (under the normal-noise assumption), the maximum likelihood hypothesis is the one that minimizes the sum of squared errors.
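The equivalence can be illustrated numerically. The sketch below (added for illustration; the data, noise level, and candidate hypotheses are assumed) fits a linear hypothesis by least squares and shows that, among a few candidates, the least-squares fit also has the highest Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
x = rng.uniform(0, 5, 40)
d = 1.5 * x + 2.0 + rng.normal(0, sigma, x.size)    # targets with Gaussian noise

def log_likelihood(w, b):
    """Gaussian log-likelihood of the data under hypothesis h(x) = w*x + b."""
    r = d - (w * x + b)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - r**2 / (2 * sigma**2))

def sse(w, b):
    """Sum of squared errors of the same hypothesis."""
    return np.sum((d - (w * x + b)) ** 2)

w_ls, b_ls = np.polyfit(x, d, 1)                     # least-squares fit (closed form)

for name, (w, b) in {"least-squares": (w_ls, b_ls),
                     "alternative 1": (1.0, 2.5),
                     "alternative 2": (2.0, 1.0)}.items():
    print(f"{name:14s} SSE = {sse(w, b):7.2f}  log-likelihood = {log_likelihood(w, b):8.2f}")
```

The hypothesis with the smallest SSE always has the largest log-likelihood, since for fixed σ the log-likelihood is a decreasing function of the SSE.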
NAIVE BAYES CLASSIFIER
• The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes theorem (which determines the likelihood that one event will occur given that another has already happened) and is used for solving classification problems.
• It is mainly used in text classification, which involves high-dimensional training datasets.
• The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps build fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Naïve: it assumes that the occurrence of a certain feature is independent of the occurrence of the other features.
• Example: if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.
Conti..
Naïve Bayes algorithm: it is a way to calculate one conditional probability, e.g. P(A|B), from knowledge of the other, P(B|A).
Working of the Naïve Bayes classifier:
 Step 1: Convert the given dataset into frequency tables (count occurrences per attribute value and per class).
 Step 2: Generate a likelihood table by finding the probabilities of the given features (for continuous features, parameters such as μ and σ per class).
 Step 3: Use Bayes theorem to calculate the posterior probability of each class.
Steps to implement:
• Data pre-processing step
• Fitting Naive Bayes to the training set
• Predicting the test result
• Testing the accuracy of the result
• Visualizing the test set result.
NAIVE BAYES CLASSIFIER
• One highly practical Bayesian learning method is the naive Bayes learner, often called the naive Bayes classifier.
• The naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and the target function f(x) can take on any value from some finite set V.
• A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values (a1, a2, a3, ..., an).
• The learner is asked to predict the target value, or classification, for this new instance.
• The naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value.
• Naive Bayes classifier:
  v_NB = argmax_{v_j∈V} P(v_j) Π_i P(a_i|v_j)
  where v_NB denotes the target value output by the naive Bayes classifier.
• Notice that in a naive Bayes classifier the number of distinct P(a_i|v_j) terms that must be estimated from the training data is just the number of distinct attribute values times the number of distinct target values, a much smaller number than if we were to estimate the P(a1, a2, ..., an|v_j) terms as first contemplated.
NAÏVE BAYES CLASSIFIER
From Bayes theorem: P(A|B) = P(B|A) P(A) / P(B).
For a dataset with features X = (x1, x2, ..., xn) and output y (each record is one row x1, x2, ..., xn, y), the theorem becomes:
  P(y|x1, x2, ..., xn) = [P(x1|y) · P(x2|y) · ... · P(xn|y) · P(y)] / [P(x1) P(x2) ... P(xn)]
                       = [P(y) Π_{i=1..n} P(x_i|y)] / [P(x1) P(x2) ... P(xn)]
Since the denominator is the same for every class, only the numerator matters:
  P(y|x1, x2, ..., xn) ∝ P(y) Π_{i=1..n} P(x_i|y)
  y = argmax_y [ P(y) Π_{i=1..n} P(x_i|y) ]
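The classification rule above can be coded directly from frequency counts. The following sketch (an illustration, not from the slides) implements a categorical naive Bayes classifier with plain maximum-likelihood estimates; the tiny dataset in the usage example is hypothetical.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(y) and P(x_i | y) from categorical data by relative frequencies."""
    class_counts = Counter(labels)
    n = len(labels)
    cond_counts = defaultdict(Counter)            # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            cond_counts[(i, y)][value] += 1
    prior = {y: c / n for y, c in class_counts.items()}

    def predict(x):
        scores = {}
        for y in class_counts:
            p = prior[y]
            for i, value in enumerate(x):
                p *= cond_counts[(i, y)][value] / class_counts[y]
            scores[y] = p
        total = sum(scores.values()) or 1.0       # avoid division by zero if all scores are 0
        return max(scores, key=scores.get), {y: s / total for y, s in scores.items()}
    return predict

# Usage with a hypothetical two-attribute dataset (Outlook, Temperature) -> Play
rows = [("Sunny", "Hot"), ("Sunny", "Mild"), ("Rain", "Cool"), ("Overcast", "Hot")]
labels = ["No", "No", "Yes", "Yes"]
predict = train_nb(rows, labels)
print(predict(("Sunny", "Hot")))                  # most probable class and normalized posteriors
```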
Problem 1: Calculate Play for TODAY = (Outlook = Sunny, Temperature = Hot) from the dataset.
Frequency/likelihood tables (from the 14-example dataset):
  Outlook    Yes  No   P(·|Yes)  P(·|No)
  Sunny      2    3    2/9       3/5
  Overcast   4    0    4/9       0/5
  Rainy      3    2    3/9       2/5
  Total      9    5

  Temperature  Yes  No   P(·|Yes)  P(·|No)
  Hot          2    2    2/9       2/5
  Mild         4    2    4/9       2/5
  Cool         3    1    3/9       1/5
  Total        9    5
Solution: The naive Bayes classifier is
  v_NB = argmax_{v_j∈{yes,no}} P(v_j) P(Outlook = Sunny|v_j) P(Temperature = Hot|v_j)
  v_NB(Yes) ∝ P(Sunny|Yes) · P(Hot|Yes) · P(Yes) = 2/9 · 2/9 · 9/14 = 0.031
  v_NB(No)  ∝ P(Sunny|No) · P(Hot|No) · P(No)   = 3/5 · 2/5 · 5/14 = 0.0857
  (the denominator P(Today) is the same for every class, so it is skipped)
Normalizing so the two values sum to one:
  v_NB(Yes) = 0.031 / (0.031 + 0.0857) ≈ 0.27
  v_NB(No)  = 0.0857 / (0.031 + 0.0857) ≈ 0.73
∴ v_NB(No) has the higher probability, so for TODAY (Sunny, Hot) Play is No.
Problem 1: Apply the naive Bayes classifier to a concept learning problem: classify days according to whether someone will play tennis, for the new instance {Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong}.
  Day  Outlook   Temperature  Humidity  Wind    Play_Tennis
  D1   Sunny     Hot          High      Weak    No
  D2   Sunny     Hot          High      Strong  No
  D3   Overcast  Hot          High      Weak    Yes
  D4   Rain      Mild         High      Weak    Yes
  D5   Rain      Cool         Normal    Weak    Yes
  D6   Rain      Cool         Normal    Strong  No
  D7   Overcast  Cool         Normal    Strong  Yes
  D8   Sunny     Mild         High      Weak    No
  D9   Sunny     Cool         Normal    Weak    Yes
  D10  Rain      Mild         Normal    Weak    Yes
  D11  Sunny     Mild         Normal    Strong  Yes
  D12  Overcast  Mild         High      Strong  Yes
  D13  Overcast  Hot          Normal    Weak    Yes
  D14  Rain      Mild         High      Strong  No
Cont…
Solution: new instance {Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong}
P(PlayTennis = Yes) = 9/14 = 0.6428,  P(PlayTennis = No) = 5/14 = 0.3571
Conditional probability tables:
  Outlook    Yes  No        Temperature  Yes  No
  Sunny      2/9  3/5       Hot          2/9  2/5
  Overcast   4/9  0         Mild         4/9  2/5
  Rain       3/9  2/5       Cool         3/9  1/5

  Humidity   Yes  No        Wind         Yes  No
  High       3/9  4/5       Strong       3/9  3/5
  Normal     6/9  1/5       Weak         6/9  2/5
v_NB = argmax_{v_j∈{yes,no}} P(v_j) P(Outlook = Sunny|v_j) · P(Temperature = Cool|v_j) · P(Humidity = High|v_j) · P(Wind = Strong|v_j)
  v_NB(Yes) = P(Sunny|Yes)·P(Cool|Yes)·P(High|Yes)·P(Strong|Yes)·P(Yes) = 2/9 · 3/9 · 3/9 · 3/9 · 0.6428 = 0.0053
  v_NB(No)  = P(Sunny|No)·P(Cool|No)·P(High|No)·P(Strong|No)·P(No)      = 3/5 · 1/5 · 4/5 · 3/5 · 0.3571 = 0.0206
Normalizing:
  v_NB(Yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
  v_NB(No)  = 0.0206 / (0.0053 + 0.0206) = 0.795
Therefore, v_NB(No) = 0.795 > 0.205, so PlayTennis = No.
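The PlayTennis calculation can be reproduced with the classifier sketch given earlier, or with the self-contained lines below (illustrative; the data is the 14-day table above).

```python
data = [
    ("Sunny","Hot","High","Weak","No"),          ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),      ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),       ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),      ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),    ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),    ("Rain","Mild","High","Strong","No"),
]
query = ("Sunny", "Cool", "High", "Strong")

scores = {}
for v in ("Yes", "No"):
    rows_v = [row for row in data if row[-1] == v]
    p = len(rows_v) / len(data)                          # prior P(v)
    for i, value in enumerate(query):                    # product of P(a_i | v)
        p *= sum(1 for row in rows_v if row[i] == value) / len(rows_v)
    scores[v] = p

total = sum(scores.values())
for v, s in scores.items():
    print(f"v_NB({v}) = {s:.4f}  normalized = {s / total:.3f}")   # 0.0053/0.205 vs 0.0206/0.795
```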
Cont…
Problem 2: Estimate the conditional probabilities of each attribute {Color, Legs, Height, Smelly} for the species classes {M, H} using the dataset given in the table. Using these probabilities, estimate the probability values for the new instance {Color = Green, Legs = 2, Height = Tall, Smelly = No}.
  No  Color  Legs  Height  Smelly  Species
  1   White  3     Short   Yes     M
  2   Green  2     Tall    No      M
  3   Green  3     Short   Yes     M
  4   White  3     Short   Yes     M
  5   Green  2     Short   No      H
  6   White  2     Tall    No      H
  7   White  2     Tall    No      H
  8   White  2     Short   Yes     H
Cont…
Solution: new instance {Color = Green, Legs = 2, Height = Tall, Smelly = No};  P(M) = 4/8 = 0.5, P(H) = 4/8 = 0.5
Conditional probability tables:
  Color   M    H        Legs  M    H
  White   2/4  3/4      2     1/4  4/4
  Green   2/4  1/4      3     3/4  0/4

  Height  M    H        Smelly  M    H
  Short   3/4  2/4      Yes     3/4  1/4
  Tall    1/4  2/4      No      1/4  3/4
P(M|new instance) ∝ P(M)·P(Green|M)·P(Legs=2|M)·P(Tall|M)·P(Smelly=No|M) = 0.5 · 2/4 · 1/4 · 1/4 · 1/4 = 0.0039
P(H|new instance) ∝ P(H)·P(Green|H)·P(Legs=2|H)·P(Tall|H)·P(Smelly=No|H) = 0.5 · 1/4 · 4/4 · 2/4 · 3/4 = 0.047
Since P(H|new instance) > P(M|new instance), the new instance {Green, 2, Tall, No} belongs to class H.
Cont…
Given the training data:
  Record  A  B  C  Class
  1       0  0  0  +
  2       0  0  1  -
  3       0  1  1  -
  4       0  1  1  -
  5       0  0  0  +
  6       1  0  0  +
  7       1  0  1  -
  8       1  0  1  -
  9       1  1  1  +
  10      1  0  1  +
1. Estimate the conditional probabilities P(A|+), P(B|+), P(C|+), P(A|-), P(B|-) and P(C|-).
2. Use the estimated conditional probabilities to predict the class label for a test sample (A=0, B=1, C=0), using the naive Bayes approach.
3. Estimate the conditional probabilities using the m-estimate approach with p = 1/2 and m = 4.
Solution:
1. Estimated conditional probabilities:
   P(A=0|-) = 3/5 = 0.6    P(A=0|+) = 2/5 = 0.4
   P(B=0|-) = 3/5 = 0.6    P(B=0|+) = 4/5 = 0.8
   P(C=0|-) = 0/5 = 0.0    P(C=0|+) = 3/5 = 0.6
2. Classify the new instance (A=0, B=1, C=0). With P(C_i|x1, ..., xn) = P(x1, ..., xn|C_i) P(C_i) / P(x1, ..., xn):
   P(+|A=0, B=1, C=0) = P(A=0|+)·P(B=1|+)·P(C=0|+)·P(+) / P(A=0, B=1, C=0) = (0.4 × 0.2 × 0.6 × 0.5)/K = 0.024/K
Cont…
   P(-|A=0, B=1, C=0) = P(A=0|-)·P(B=1|-)·P(C=0|-)·P(-) / P(A=0, B=1, C=0) = 0/K
   The class label should be +, since 0.024/K > 0/K.
3. Estimate the conditional probabilities using the m-estimate approach with p = 1/2 and m = 4.
   The m-estimate of a conditional probability is Prob(A|B) = (n_c + m·p)/(n + m),
   where n_c is the number of times A and B occurred together and n is the number of times B occurred in the training data.
   P(A=0|+) = (2+2)/(5+4) = 4/9      P(A=0|-) = (3+2)/(5+4) = 5/9
   P(B=1|+) = (1+2)/(5+4) = 3/9      P(B=1|-) = (2+2)/(5+4) = 4/9
   P(C=0|+) = (3+2)/(5+4) = 5/9      P(C=0|-) = (0+2)/(5+4) = 2/9
For reference, the full set of simple (maximum-likelihood) estimates from part 1 is:
   P(A=1|-) = 0.4   P(A=1|+) = 0.6      P(A=0|-) = 0.6   P(A=0|+) = 0.4
   P(B=1|-) = 0.4   P(B=1|+) = 0.2      P(B=0|-) = 0.6   P(B=0|+) = 0.8
   P(C=1|-) = 1.0   P(C=1|+) = 0.4      P(C=0|-) = 0.0   P(C=0|+) = 0.6
Cont…
Problem 3: Classify the new instance (A=0, B=1, C=0) using the m-estimate probabilities with p = 1/2, m = 4.
  P(+|A=0, B=1, C=0) = P(A=0|+)·P(B=1|+)·P(C=0|+)·P(+) / P(A=0, B=1, C=0) = (4/9 · 3/9 · 5/9 · 0.5)/K = 0.0412/K
  P(-|A=0, B=1, C=0) = P(A=0|-)·P(B=1|-)·P(C=0|-)·P(-) / P(A=0, B=1, C=0) = (5/9 · 4/9 · 2/9 · 0.5)/K = 0.0274/K
  The class label is +, since 0.0412/K > 0.0274/K.
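A small helper (illustrative, not from the slides) implements the m-estimate and reproduces the smoothed probabilities and the final comparison above.

```python
from fractions import Fraction

def m_estimate(n_c, n, p=Fraction(1, 2), m=4):
    """m-estimate of P(attribute value | class): (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Counts for class '+' (5 examples) and class '-' (5 examples) from the table above.
plus  = {"A=0": 2, "B=1": 1, "C=0": 3}
minus = {"A=0": 3, "B=1": 2, "C=0": 0}

p_plus = Fraction(1, 2) * m_estimate(plus["A=0"], 5) * m_estimate(plus["B=1"], 5) * m_estimate(plus["C=0"], 5)
p_minus = Fraction(1, 2) * m_estimate(minus["A=0"], 5) * m_estimate(minus["B=1"], 5) * m_estimate(minus["C=0"], 5)

print(float(p_plus), float(p_minus))     # about 0.0412 vs 0.0274, so predict '+'
```

Note how the m-estimate gives P(C=0|-) = 2/9 instead of 0, so a single unseen attribute value no longer forces an entire class posterior to zero.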