SlideShare a Scribd company logo
1 of 22
Download to read offline
Classification: Naive Bayes
APAM E4990
Modeling Social Data
Jake Hofman
Columbia University
February 27, 2015
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 1 / 16
Learning by example
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 2 / 16
Learning by example
• How did you solve this problem?
• Can you make this process explicit (e.g. write code to do so)?
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 2 / 16
Diagnoses a la Bayes1
• You’re testing for a rare disease:
• 1% of the population is infected
• You have a highly sensitive and specific test:
• 99% of sick patients test positive
• 99% of healthy patients test negative
• Given that a patient tests positive, what is probability the
patient is sick?
1
Wiggins, SciAm 2006
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 3 / 16
Diagnoses a la Bayes
Population
10,000 ppl
1% Sick
100 ppl
99% Test +
99 ppl
1% Test -
1 per
99% Healthy
9900 ppl
1% Test +
99 ppl
99% Test -
9801 ppl
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 4 / 16
Diagnoses a la Bayes
Population
10,000 ppl
1% Sick
100 ppl
99% Test +
99 ppl
1% Test -
1 per
99% Healthy
9900 ppl
1% Test +
99 ppl
99% Test -
9801 ppl
So given that a patient tests positive (198 ppl), there is a 50%
chance the patient is sick (99 ppl)!
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 4 / 16
Diagnoses a la Bayes
Population
10,000 ppl
1% Sick
100 ppl
99% Test +
99 ppl
1% Test -
1 per
99% Healthy
9900 ppl
1% Test +
99 ppl
99% Test -
9801 ppl
The small error rate on the large healthy population produces
many false positives.
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 4 / 16
Natural frequencies a la Gigerenzer2
2
http://bit.ly/ggbbc
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 5 / 16
Inverting conditional probabilities
Bayes’ Theorem
Equate the far right- and left-hand sides of product rule
p (y|x) p (x) = p (x, y) = p (x|y) p (y)
and divide to get the probability of y given x from the probability
of x given y:
p (y|x) =
p (x|y) p (y)
p (x)
where p (x) = y∈ΩY
p (x|y) p (y) is the normalization constant.
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 6 / 16
Diagnoses a la Bayes
Given that a patient tests positive, what is probability the patient
is sick?
p (sick|+) =
99/100
p (+|sick)
1/100
p (sick)
p (+)
99/1002+99/1002=198/1002
=
99
198
=
1
2
where p (+) = p (+|sick) p (sick) + p (+|healthy) p (healthy).
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 7 / 16
(Super) Naive Bayes
We can use Bayes’ rule to build a one-word spam classifier:
p (spam|word) =
p (word|spam) p (spam)
p (word)
where we estimate these probabilities with ratios of counts:
ˆp(word|spam) =
# spam docs containing word
# spam docs
ˆp(word|ham) =
# ham docs containing word
# ham docs
ˆp(spam) =
# spam docs
# docs
ˆp(ham) =
# ham docs
# docs
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 8 / 16
(Super) Naive Bayes
$ ./enron_naive_bayes.sh meeting
1500 spam examples
3672 ham examples
16 spam examples containing meeting
153 ham examples containing meeting
estimated P(spam) = .2900
estimated P(ham) = .7100
estimated P(meeting|spam) = .0106
estimated P(meeting|ham) = .0416
P(spam|meeting) = .0923
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 9 / 16
(Super) Naive Bayes
$ ./enron_naive_bayes.sh money
1500 spam examples
3672 ham examples
194 spam examples containing money
50 ham examples containing money
estimated P(spam) = .2900
estimated P(ham) = .7100
estimated P(money|spam) = .1293
estimated P(money|ham) = .0136
P(spam|money) = .7957
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 10 / 16
(Super) Naive Bayes
$ ./enron_naive_bayes.sh enron
1500 spam examples
3672 ham examples
0 spam examples containing enron
1478 ham examples containing enron
estimated P(spam) = .2900
estimated P(ham) = .7100
estimated P(enron|spam) = 0
estimated P(enron|ham) = .4025
P(spam|enron) = 0
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 11 / 16
Naive Bayes
Represent each document by a binary vector x where xj = 1 if the
j-th word appears in the document (xj = 0 otherwise).
Modeling each word as an independent Bernoulli random variable,
the probability of observing a document x of class c is:
p (x|c) =
j
θ
xj
jc (1 − θjc)1−xj
where θjc denotes the probability that the j-th word occurs in a
document of class c.
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 12 / 16
Naive Bayes
Using this likelihood in Bayes’ rule and taking a logarithm, we have:
log p (c|x) = log
p (x|c) p (c)
p (x)
=
j
xj log
θjc
1 − θjc
+
j
log(1 − θjc) + log
θc
p (x)
where θc is the probability of observing a document of class c.
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 13 / 16
Naive Bayes
We can eliminate p (x) by calculating the log-odds:
log
p (1|x)
p (0|x)
=
j
xj log
θj1(1 − θj0)
θj0(1 − θj1)
wj
+
j
log
1 − θj1
1 − θj0
+ log
θ1
θ0
w0
which gives a linear classifier of the form w · x + w0
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 14 / 16
Naive Bayes
We train by counting words and documents within classes to
estimate θjc and θc:
ˆθjc =
njc
nc
ˆθc =
nc
n
and use these to calculate the weights ˆwj and bias ˆw0:
ˆwj = log
ˆθj1(1 − ˆθj0)
ˆθj0(1 − ˆθj1)
ˆw0 =
j
log
1 − ˆθj1
1 − ˆθj0
+ log
ˆθ1
ˆθ0
.
We we predict by simply adding the weights of the words that
appear in the document to the bias term.
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 15 / 16
Naive Bayes
In practice, this works better than one might expect given its
simplicity3
3
http://www.jstor.org/pss/1403452
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 16 / 16
Naive Bayes
Training is computationally cheap and scalable, and the model is
easy to update given new observations3
3
http://www.springerlink.com/content/wu3g458834583125/
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 16 / 16
Naive Bayes
Performance varies with document representations and
corresponding likelihood models3
3
http://ceas.cc/2006/15.pdf
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 16 / 16
Naive Bayes
It’s often important to smooth parameter estimates (e.g., by
adding pseudocounts) to avoid overfitting
ˆθjc =
njc + α
nc + α + β
Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 16 / 16

More Related Content

Viewers also liked

02. naive bayes classifier revision
02. naive bayes classifier   revision02. naive bayes classifier   revision
02. naive bayes classifier revisionJeonghun Yoon
 
"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love Bucharest"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love BucharestStefan Adam
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes ClassifierYiqun Hu
 
Introduction to text classification using naive bayes
Introduction to text classification using naive bayesIntroduction to text classification using naive bayes
Introduction to text classification using naive bayesDhwaj Raj
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningParas Kohli
 
Tweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTrilok Sharma
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Dev Sahu
 
Supervised Learning
Supervised LearningSupervised Learning
Supervised Learningbutest
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learningTonmoy Bhagawati
 
Cs221 lecture5-fall11
Cs221 lecture5-fall11Cs221 lecture5-fall11
Cs221 lecture5-fall11darwinrlo
 
How Sentiment Analysis works
How Sentiment Analysis worksHow Sentiment Analysis works
How Sentiment Analysis worksCJ Jenkins
 
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
2013-1 Machine Learning Lecture 03 - Naïve Bayes ClassifiersDongseo University
 
Classification with Naive Bayes
Classification with Naive BayesClassification with Naive Bayes
Classification with Naive BayesJosh Patterson
 
Zeynep ausace 2011 tahrir presentation
Zeynep ausace 2011 tahrir presentationZeynep ausace 2011 tahrir presentation
Zeynep ausace 2011 tahrir presentationZeynep Tufekci
 
Mobile Analytics
Mobile AnalyticsMobile Analytics
Mobile AnalyticsSamuel Chow
 
Jim sterne terametric_twitter-webinar_120910
Jim sterne terametric_twitter-webinar_120910Jim sterne terametric_twitter-webinar_120910
Jim sterne terametric_twitter-webinar_120910Terametric
 

Viewers also liked (20)

02. naive bayes classifier revision
02. naive bayes classifier   revision02. naive bayes classifier   revision
02. naive bayes classifier revision
 
"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love Bucharest"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love Bucharest
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes Classifier
 
Naive Bayes
Naive Bayes Naive Bayes
Naive Bayes
 
Introduction to text classification using naive bayes
Introduction to text classification using naive bayesIntroduction to text classification using naive bayes
Introduction to text classification using naive bayes
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Tweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVM
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier
 
Supervised Learning
Supervised LearningSupervised Learning
Supervised Learning
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learning
 
Lecture10 - Naïve Bayes
Lecture10 - Naïve BayesLecture10 - Naïve Bayes
Lecture10 - Naïve Bayes
 
Cs221 lecture5-fall11
Cs221 lecture5-fall11Cs221 lecture5-fall11
Cs221 lecture5-fall11
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
How Sentiment Analysis works
How Sentiment Analysis worksHow Sentiment Analysis works
How Sentiment Analysis works
 
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
 
Classification with Naive Bayes
Classification with Naive BayesClassification with Naive Bayes
Classification with Naive Bayes
 
Zeynep ausace 2011 tahrir presentation
Zeynep ausace 2011 tahrir presentationZeynep ausace 2011 tahrir presentation
Zeynep ausace 2011 tahrir presentation
 
Mobile Analytics
Mobile AnalyticsMobile Analytics
Mobile Analytics
 
Jim sterne terametric_twitter-webinar_120910
Jim sterne terametric_twitter-webinar_120910Jim sterne terametric_twitter-webinar_120910
Jim sterne terametric_twitter-webinar_120910
 

Similar to Modeling Social Data, Lecture 6: Classification with Naive Bayes

Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classificationjakehofman
 
Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03jakehofman
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classificationjakehofman
 
2.statistical DEcision makig.pptx
2.statistical DEcision makig.pptx2.statistical DEcision makig.pptx
2.statistical DEcision makig.pptxImpanaR2
 
Introduction to Bayesian Statistics.ppt
Introduction to Bayesian Statistics.pptIntroduction to Bayesian Statistics.ppt
Introduction to Bayesian Statistics.pptLong Dang
 
blood_red_demo.pptx
blood_red_demo.pptxblood_red_demo.pptx
blood_red_demo.pptxBruceHuang41
 
Causal Inference for Everyone
Causal Inference for EveryoneCausal Inference for Everyone
Causal Inference for EveryoneAlokSwain16
 
Bayesian decision making in clinical research
Bayesian decision making in clinical researchBayesian decision making in clinical research
Bayesian decision making in clinical researchBhaswat Chakraborty
 
Complements and Conditional Probability, and Bayes' Theorem
 Complements and Conditional Probability, and Bayes' Theorem Complements and Conditional Probability, and Bayes' Theorem
Complements and Conditional Probability, and Bayes' TheoremLong Beach City College
 
Notes_5.2.pptx
Notes_5.2.pptxNotes_5.2.pptx
Notes_5.2.pptxjaywarven1
 
Notes_5.2 (1).pptx
Notes_5.2 (1).pptxNotes_5.2 (1).pptx
Notes_5.2 (1).pptxjaywarven1
 
Notes_5.2 (1).pptx
Notes_5.2 (1).pptxNotes_5.2 (1).pptx
Notes_5.2 (1).pptxjaywarven1
 
Probability&Bayes theorem
Probability&Bayes theoremProbability&Bayes theorem
Probability&Bayes theoremimran iqbal
 
Probably, Definitely, Maybe
Probably, Definitely, MaybeProbably, Definitely, Maybe
Probably, Definitely, MaybeJames McGivern
 

Similar to Modeling Social Data, Lecture 6: Classification with Naive Bayes (20)

Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
 
Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classification
 
Lecture3
Lecture3Lecture3
Lecture3
 
bayesjaw.ppt
bayesjaw.pptbayesjaw.ppt
bayesjaw.ppt
 
2.statistical DEcision makig.pptx
2.statistical DEcision makig.pptx2.statistical DEcision makig.pptx
2.statistical DEcision makig.pptx
 
Probabilty1.pptx
Probabilty1.pptxProbabilty1.pptx
Probabilty1.pptx
 
Introduction to Bayesian Statistics.ppt
Introduction to Bayesian Statistics.pptIntroduction to Bayesian Statistics.ppt
Introduction to Bayesian Statistics.ppt
 
Basic concepts of probability
Basic concepts of probabilityBasic concepts of probability
Basic concepts of probability
 
blood_red_demo.pptx
blood_red_demo.pptxblood_red_demo.pptx
blood_red_demo.pptx
 
Causal Inference for Everyone
Causal Inference for EveryoneCausal Inference for Everyone
Causal Inference for Everyone
 
Bayesian decision making in clinical research
Bayesian decision making in clinical researchBayesian decision making in clinical research
Bayesian decision making in clinical research
 
Complements and Conditional Probability, and Bayes' Theorem
 Complements and Conditional Probability, and Bayes' Theorem Complements and Conditional Probability, and Bayes' Theorem
Complements and Conditional Probability, and Bayes' Theorem
 
Notes_5.2.pptx
Notes_5.2.pptxNotes_5.2.pptx
Notes_5.2.pptx
 
Notes_5.2 (1).pptx
Notes_5.2 (1).pptxNotes_5.2 (1).pptx
Notes_5.2 (1).pptx
 
Notes_5.2 (1).pptx
Notes_5.2 (1).pptxNotes_5.2 (1).pptx
Notes_5.2 (1).pptx
 
Warmup_New.pdf
Warmup_New.pdfWarmup_New.pdf
Warmup_New.pdf
 
Probability&Bayes theorem
Probability&Bayes theoremProbability&Bayes theorem
Probability&Bayes theorem
 
Probably, Definitely, Maybe
Probably, Definitely, MaybeProbably, Definitely, Maybe
Probably, Definitely, Maybe
 
BS 723_Class 6(5).pptx
BS 723_Class 6(5).pptxBS 723_Class 6(5).pptx
BS 723_Class 6(5).pptx
 

More from jakehofman

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2jakehofman
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1jakehofman
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networksjakehofman
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationjakehofman
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1jakehofman
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scalejakehofman
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in Rjakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overviewjakehofman
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systemsjakehofman
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scalejakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studiesjakehofman
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Sciencejakehofman
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regressionjakehofman
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experimentsjakehofman
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wranglingjakehofman
 
Computational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIComputational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIjakehofman
 
Computational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part IComputational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part Ijakehofman
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIjakehofman
 

More from jakehofman (20)

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalization
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scale
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in R
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systems
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scale
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regression
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experiments
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
 
Computational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIComputational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part II
 
Computational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part IComputational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part I
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part II
 

Recently uploaded

dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxRizalinePalanog2
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)Areesha Ahmad
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Monika Rani
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxBhagirath Gogikar
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Mohammad Khajehpour
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Silpa
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oManavSingh202607
 

Recently uploaded (20)

dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptx
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 o
 

Modeling Social Data, Lecture 6: Classification with Naive Bayes

  • 1. Classification: Naive Bayes APAM E4990 Modeling Social Data Jake Hofman Columbia University February 27, 2015 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 1 / 16
  • 2. Learning by example Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 2 / 16
  • 3. Learning by example • How did you solve this problem? • Can you make this process explicit (e.g. write code to do so)? Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 2 / 16
  • 4. Diagnoses a la Bayes1 • You’re testing for a rare disease: • 1% of the population is infected • You have a highly sensitive and specific test: • 99% of sick patients test positive • 99% of healthy patients test negative • Given that a patient tests positive, what is probability the patient is sick? 1 Wiggins, SciAm 2006 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 3 / 16
  • 5. Diagnoses a la Bayes Population 10,000 ppl 1% Sick 100 ppl 99% Test + 99 ppl 1% Test - 1 per 99% Healthy 9900 ppl 1% Test + 99 ppl 99% Test - 9801 ppl Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 4 / 16
  • 6. Diagnoses a la Bayes Population 10,000 ppl 1% Sick 100 ppl 99% Test + 99 ppl 1% Test - 1 per 99% Healthy 9900 ppl 1% Test + 99 ppl 99% Test - 9801 ppl So given that a patient tests positive (198 ppl), there is a 50% chance the patient is sick (99 ppl)! Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 4 / 16
  • 7. Diagnoses a la Bayes Population 10,000 ppl 1% Sick 100 ppl 99% Test + 99 ppl 1% Test - 1 per 99% Healthy 9900 ppl 1% Test + 99 ppl 99% Test - 9801 ppl The small error rate on the large healthy population produces many false positives. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 4 / 16
  • 8. Natural frequencies a la Gigerenzer2 2 http://bit.ly/ggbbc Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 5 / 16
  • 9. Inverting conditional probabilities Bayes’ Theorem Equate the far right- and left-hand sides of product rule p (y|x) p (x) = p (x, y) = p (x|y) p (y) and divide to get the probability of y given x from the probability of x given y: p (y|x) = p (x|y) p (y) p (x) where p (x) = y∈ΩY p (x|y) p (y) is the normalization constant. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 6 / 16
  • 10. Diagnoses a la Bayes Given that a patient tests positive, what is probability the patient is sick? p (sick|+) = 99/100 p (+|sick) 1/100 p (sick) p (+) 99/1002+99/1002=198/1002 = 99 198 = 1 2 where p (+) = p (+|sick) p (sick) + p (+|healthy) p (healthy). Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 7 / 16
  • 11. (Super) Naive Bayes We can use Bayes’ rule to build a one-word spam classifier: p (spam|word) = p (word|spam) p (spam) p (word) where we estimate these probabilities with ratios of counts: ˆp(word|spam) = # spam docs containing word # spam docs ˆp(word|ham) = # ham docs containing word # ham docs ˆp(spam) = # spam docs # docs ˆp(ham) = # ham docs # docs Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 8 / 16
  • 12. (Super) Naive Bayes $ ./enron_naive_bayes.sh meeting 1500 spam examples 3672 ham examples 16 spam examples containing meeting 153 ham examples containing meeting estimated P(spam) = .2900 estimated P(ham) = .7100 estimated P(meeting|spam) = .0106 estimated P(meeting|ham) = .0416 P(spam|meeting) = .0923 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 9 / 16
  • 13. (Super) Naive Bayes $ ./enron_naive_bayes.sh money 1500 spam examples 3672 ham examples 194 spam examples containing money 50 ham examples containing money estimated P(spam) = .2900 estimated P(ham) = .7100 estimated P(money|spam) = .1293 estimated P(money|ham) = .0136 P(spam|money) = .7957 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 10 / 16
  • 14. (Super) Naive Bayes $ ./enron_naive_bayes.sh enron 1500 spam examples 3672 ham examples 0 spam examples containing enron 1478 ham examples containing enron estimated P(spam) = .2900 estimated P(ham) = .7100 estimated P(enron|spam) = 0 estimated P(enron|ham) = .4025 P(spam|enron) = 0 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 11 / 16
  • 15. Naive Bayes Represent each document by a binary vector x where xj = 1 if the j-th word appears in the document (xj = 0 otherwise). Modeling each word as an independent Bernoulli random variable, the probability of observing a document x of class c is: p (x|c) = j θ xj jc (1 − θjc)1−xj where θjc denotes the probability that the j-th word occurs in a document of class c. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 12 / 16
  • 16. Naive Bayes Using this likelihood in Bayes’ rule and taking a logarithm, we have: log p (c|x) = log p (x|c) p (c) p (x) = j xj log θjc 1 − θjc + j log(1 − θjc) + log θc p (x) where θc is the probability of observing a document of class c. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 13 / 16
  • 17. Naive Bayes We can eliminate p (x) by calculating the log-odds: log p (1|x) p (0|x) = j xj log θj1(1 − θj0) θj0(1 − θj1) wj + j log 1 − θj1 1 − θj0 + log θ1 θ0 w0 which gives a linear classifier of the form w · x + w0 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 14 / 16
  • 18. Naive Bayes We train by counting words and documents within classes to estimate θjc and θc: ˆθjc = njc nc ˆθc = nc n and use these to calculate the weights ˆwj and bias ˆw0: ˆwj = log ˆθj1(1 − ˆθj0) ˆθj0(1 − ˆθj1) ˆw0 = j log 1 − ˆθj1 1 − ˆθj0 + log ˆθ1 ˆθ0 . We we predict by simply adding the weights of the words that appear in the document to the bias term. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 15 / 16
  • 19. Naive Bayes In practice, this works better than one might expect given its simplicity3 3 http://www.jstor.org/pss/1403452 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 16 / 16
  • 20. Naive Bayes Training is computationally cheap and scalable, and the model is easy to update given new observations3 3 http://www.springerlink.com/content/wu3g458834583125/ Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 16 / 16
  • 21. Naive Bayes Performance varies with document representations and corresponding likelihood models3 3 http://ceas.cc/2006/15.pdf Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 16 / 16
  • 22. Naive Bayes It’s often important to smooth parameter estimates (e.g., by adding pseudocounts) to avoid overfitting ˆθjc = njc + α nc + α + β Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 16 / 16