2. “Goal - Become a Data Scientist”
“A Dream becomes a Goal when action is taken towards its achievement” - Bo Bennett
“The Plan”
“A Goal without a Plan is just a wish”
3. Agenda
● Introduction to Probability
● Conditional Probability
● Independent Events
● Bayes’ Theorem
● Estimation - MLE, MAP
● Joint Probability
● Naive Bayes
● Gaussian Naive Bayes
4. Probability - The Chance
● How likely something is to happen
● Quantified as a number between 0 and 1
● What is the probability of rolling a 6 with a die?
○ A fair die is not biased: every face has an equal chance
○ The probability of getting any particular number is 1/6
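A minimal sketch, assuming plain Python with the standard library, that checks this empirically by simulating die rolls:

import random

# Simulate many fair die rolls and count how often a 6 comes up.
rolls = 100_000
sixes = sum(1 for _ in range(rolls) if random.randint(1, 6) == 6)
print(f"Estimated P(6) = {sixes / rolls:.3f}")  # converges to 1/6, about 0.167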
5. Conditional Probability
● Captures the dependence of event A on event B
● P(A and B) is the joint probability of A and B
● P(A | B) is the probability of A given that event B has happened:
P(A | B) = P(A and B) / P(B)
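A quick worked example with one fair die roll: let A = “roll a 2” and B = “roll an even number”. Then P(A and B) = 1/6 and P(B) = 1/2, so P(A | B) = (1/6) / (1/2) = 1/3.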
6. Independent Events
● The occurrence of event A does not depend on event B
● So the joint probability is the product of the individual probabilities:
P(A and B) = P(A) × P(B)
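For example, two fair dice rolled together are independent, so P(both show 6) = (1/6) × (1/6) = 1/36.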
7. Bayes’ Theorem
● Describes the probability of an event based on prior knowledge:
P(A | B) = P(B | A) P(A) / P(B)
● P(A | B) - conditional probability; the posterior
● P(B | A) - conditional probability; the likelihood
● P(A) and P(B) - marginal probabilities; P(A) is the prior and P(B) is the evidence
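A small sketch, with hypothetical spam-filter numbers (all values assumed for illustration), showing the posterior computed from the formula:

# A = "email is spam", B = "email contains the word 'free'" (hypothetical numbers).
p_a = 0.30          # prior P(A): fraction of all email that is spam
p_b_given_a = 0.60  # likelihood P(B | A): spam emails containing 'free'
p_b = 0.25          # evidence P(B): all emails containing 'free'

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(spam | 'free') = {p_a_given_b:.2f}")  # 0.72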
8. Joint Probability Distribution
Gender   Hours_Worked   Wealth   Probability
Female   <40.5          poor     0.253122
Female   <40.5          rich     0.0245895
Female   >40.5          poor     0.0421768
Female   >40.5          rich     0.0116293
Male     <40.5          poor     0.331313
Male     <40.5          rich     0.0971295
Male     >40.5          poor     0.134106
Male     >40.5          rich     0.105933
Total probability: 0.9999991 (≈ 1; the gap is rounding error)
Gender   Hours_Worked   P(rich | G,HW)   P(poor | G,HW)
F        <40.5          0.09             0.91
F        >40.5          0.21             0.79
M        <40.5          0.23             0.77
M        >40.5          0.38             0.62
● To learn P(Y | X1, X2) directly, we need one estimate per combination of feature values - on the order of 2^n estimates for n binary features
● How are the P(Y | X1, X2) values calculated? Each entry comes from the joint table:
P(rich | G, HW) = P(G, HW, rich) / [P(G, HW, rich) + P(G, HW, poor)]
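A minimal sketch, assuming plain Python, of how such a conditional table is derived from a joint distribution (numbers taken from the joint table above):

# Joint distribution P(Gender, Hours_Worked, Wealth) from the table above.
joint = {
    ("F", "<40.5", "poor"): 0.253122,  ("F", "<40.5", "rich"): 0.0245895,
    ("F", ">40.5", "poor"): 0.0421768, ("F", ">40.5", "rich"): 0.0116293,
    ("M", "<40.5", "poor"): 0.331313,  ("M", "<40.5", "rich"): 0.0971295,
    ("M", ">40.5", "poor"): 0.134106,  ("M", ">40.5", "rich"): 0.105933,
}
for g in ("F", "M"):
    for hw in ("<40.5", ">40.5"):
        p_g_hw = joint[(g, hw, "poor")] + joint[(g, hw, "rich")]  # P(G, HW)
        p_rich = joint[(g, hw, "rich")] / p_g_hw                  # P(rich | G, HW)
        print(f"{g} {hw}: P(rich)={p_rich:.2f}, P(poor)={1 - p_rich:.2f}")

The printed values may differ slightly from the conditional table above, which was presumably computed from the original raw data rather than from these rounded joint probabilities.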
9. Maximum Likelihood Estimation
● Data: an observed set D of h heads and t tails
P(D | 𝜽) = P(h, t | 𝜽) = 𝜽^h (1 - 𝜽)^t
● Optimization problem: learning 𝜽
● Objective function:
○ MLE: choose the 𝜽 that maximizes the probability of the observed data
𝜽̂ = arg max_𝜽 P(D | 𝜽)
𝜽̂ = h / (h + t)
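A minimal sketch, with hypothetical coin-flip counts, checking numerically that h / (h + t) maximizes the likelihood:

h, t = 7, 3  # hypothetical counts: 7 heads, 3 tails

def likelihood(theta: float) -> float:
    # P(D | theta) = theta^h * (1 - theta)^t
    return theta**h * (1 - theta) ** t

# Grid search over candidate values of theta.
best = max((i / 1000 for i in range(1, 1000)), key=likelihood)
print(best, h / (h + t))  # both ~0.7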
10. Maximum A Posteriori (MAP)
● MLE is not a good estimate when there is little data
● Prior information about the parameter gives a better estimate
● P(𝜽) is the prior information
● The prior is assumed to be a Beta distribution
P(𝜽 | D) ∝ P(D | 𝜽) P(𝜽)
𝜽̂ = (h + 𝛃1) / [(h + 𝛃1) + (t + 𝛃2)]
𝛃1 = prior information (pseudo-counts) for heads
𝛃2 = prior information (pseudo-counts) for tails
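A minimal sketch, with assumed counts and prior pseudo-counts, contrasting MLE and MAP on a tiny sample:

h, t = 2, 0      # tiny dataset: two heads, no tails
b1, b2 = 5, 5    # assumed Beta prior pseudo-counts for heads and tails

theta_mle = h / (h + t)                       # 1.0 - overconfident on so little data
theta_map = (h + b1) / ((h + b1) + (t + b2))  # pulled toward the prior mean of 0.5
print(theta_mle, theta_map)  # 1.0 vs ~0.583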
11. Naive Bayes - The Hero
● How does it get away with fewer estimates?
● The assumption of conditional independence: conditioned on Y, the features X1 to Xn are independent
P(X1, ..., Xn | Y) = 𝚷 P(Xi | Y)
P(X1, ..., Xn | Y) = P(X1 | Y) P(X2 | Y) P(X3 | Y) ... P(Xn | Y)
● If each Xi is a binary feature, only 2n + 1 parameters need to be estimated, instead of the 2^n needed without the assumption
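A minimal sketch, with hypothetical probabilities for two binary features, showing the factorized posterior in action:

# Hypothetical model: binary class Y, binary features X1, X2.
p_y1 = 0.5                      # P(Y = 1)
p_xi_given_y = {1: [0.8, 0.3],  # P(Xi = 1 | Y = 1) for i = 1, 2
                0: [0.2, 0.6]}  # P(Xi = 1 | Y = 0) for i = 1, 2
x = [1, 0]                      # observed feature vector

def score(y: int) -> float:
    # P(Y = y) * prod_i P(Xi = xi | Y = y): the naive Bayes factorization
    p = p_y1 if y == 1 else 1 - p_y1
    for xi, pi in zip(x, p_xi_given_y[y]):
        p *= pi if xi == 1 else 1 - pi
    return p

posterior_1 = score(1) / (score(0) + score(1))  # normalize via Bayes' theorem
print(f"P(Y=1 | X) = {posterior_1:.2f}")  # 0.88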
12. Pros of Naive Bayes
● In spite of its over-simplified independence assumption, naive Bayes classifiers have worked quite well in practice
● Widely used for document classification and spam filtering
● Requires only a small amount of training data to estimate the necessary parameters
● Extremely fast compared to more sophisticated methods
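The agenda also lists Gaussian Naive Bayes, which handles continuous features by modelling each P(Xi | Y) as a normal distribution. A minimal sketch using scikit-learn’s GaussianNB (assuming scikit-learn is installed; the iris dataset is just a stand-in):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Each continuous feature is modelled as a per-class Gaussian.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")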