Naïve Bayes
Chapter 4, DDS
Introduction
• We discussed Bayes' Rule last class. Here is
its derivation from first principles of probability:
– P(A|B) = P(A&B)/P(B)
– P(B|A) = P(A&B)/P(A), so P(B|A)P(A) = P(A&B)
– P(A|B) = P(B|A)P(A)/P(B)
• Now let's look at a very common application of
Bayes' Rule in supervised learning for classification:
spam filtering
Classification
• Training set → design a model
• Test set → validate the model
• Classify data set using the model
• Goal of classification: to label the items in the
set to one of the given/known classes
• For spam filtering it is a binary classification: spam or not
spam (ham)
Why not use methods in ch.3?
• Linear regression is about continuous
variables, not binary class
• K-nn can accommodate multiple features, but it suffers from the curse
of dimensionality: 1 distinct word → 1
feature, so 10000 words → 10000 features!
• What are we going to use? Naïve Bayes
Let's Review
• A rare disease affects 1% of the population
• We have a highly sensitive and specific test that is
– 99% positive for sick patients
– 99% negative for non-sick patients
• If a patient tests positive, what is the probability that
he/she is sick?
• Approach: let "sick" mean the patient is sick and "+" mean the test is positive
• P(sick|+) = P(+|sick)P(sick)/P(+) =
0.99*0.01/(0.99*0.01 + 0.01*0.99) =
0.0099/(2*0.0099) = ½ = 0.5
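The disease example can be checked with a few lines of Python. This is a minimal sketch; the function and parameter names (`posterior_sick_given_positive`, `prevalence`, `sensitivity`, `specificity`) are chosen here for illustration and do not come from the slides.

```python
def posterior_sick_given_positive(prevalence, sensitivity, specificity):
    """Return P(sick | +) using Bayes' Rule."""
    p_pos_given_sick = sensitivity             # P(+ | sick)
    p_pos_given_healthy = 1.0 - specificity    # P(+ | not sick), the false-positive rate
    # Total probability of a positive test over both groups.
    p_pos = (p_pos_given_sick * prevalence
             + p_pos_given_healthy * (1.0 - prevalence))
    return p_pos_given_sick * prevalence / p_pos


# Numbers from the slide: 1% prevalence, 99% sensitivity, 99% specificity.
p = posterior_sick_given_positive(prevalence=0.01, sensitivity=0.99, specificity=0.99)
print(round(p, 4))
```

Even with a 99% accurate test, a positive result means only a 50% chance of being sick, because the disease is rare.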
Spam Filter for individual words
Classifying mail into spam and not spam: binary
classification
Let's say we get a mail that says: you have won a
“lottery”. Right away you know it is spam.
We will assume that if a word qualifies as
spam, then the email is spam…
P(spam|word) = P(word|spam)P(spam)/P(word)
Further discussion
• Let's call good emails “ham”
• P(ham) = 1 - P(spam)
• P(word) = P(word|spam)P(spam) + P(word|ham)P(ham)
Sample data
• Enron data: https://www.cs.cmu.edu/~enron
• Enron employee emails
• A small subset chosen for EDA
• 1500 spam, 3672 ham
• The test word is “meeting”… that is, your goal is to label an
email containing the word “meeting” as spam or ham (not spam)
• Run a simple shell script and find that “meeting” appears in 16
spam emails and 153 ham emails
• Right away what is your intuition? Now prove it using
Bayes
Calculations
• P(spam) = 1500/(1500+3672) = 0.29
• P(ham) = 0.71
• P(meeting|spam) = 16/1500 ≈ 0.0106
• P(meeting|ham) = 153/3672 ≈ 0.0416
• P(meeting) = P(meeting|spam)P(spam) +
P(meeting|ham)P(ham) = 0.0106*0.29 + 0.0416*0.71 = 0.03261
• P(spam|meeting) = P(meeting|spam)*P(spam)/P(meeting)
= 0.0106*0.29/0.03261 = 0.094 → 9.4%
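The same calculation can be reproduced from the raw counts. The sketch below uses the counts quoted on the Sample data slide (1500 spam, 3672 ham, 16 and 153 emails containing “meeting”); the variable names are chosen here for illustration. Because it keeps full precision instead of truncating the intermediate values, it gives about 9.5% rather than the slide's 9.4%.

```python
# Counts from the Sample data slide.
n_spam, n_ham = 1500, 3672
meeting_in_spam, meeting_in_ham = 16, 153

# Priors.
p_spam = n_spam / (n_spam + n_ham)
p_ham = 1.0 - p_spam

# Per-class probability that an email contains "meeting".
p_meeting_given_spam = meeting_in_spam / n_spam
p_meeting_given_ham = meeting_in_ham / n_ham

# Law of total probability, then Bayes' Rule.
p_meeting = p_meeting_given_spam * p_spam + p_meeting_given_ham * p_ham
p_spam_given_meeting = p_meeting_given_spam * p_spam / p_meeting

print(f"P(spam|meeting) = {p_spam_given_meeting:.4f}")
```

The posterior simplifies to 16/(16+153): of all emails containing “meeting”, only the 16 spam ones count toward the numerator.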
Simulation using bash shell script
• On to demo
• This code is available on pages 105-106 … good
luck with the typos… figure it out
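The counting step of the demo can also be sketched in Python. This is not the book's bash script; it is a stand-in that counts how many emails contain a given word. The function name `count_emails_with_word`, the crude tokenization, and the toy emails are all assumptions made for illustration.

```python
def count_emails_with_word(emails, word):
    """Return how many emails contain `word` (case-insensitive, whole word)."""
    target = word.lower()
    count = 0
    for text in emails:
        # Crude tokenization: split on whitespace, strip surrounding punctuation.
        tokens = {tok.strip(".,;:!?\"'()").lower() for tok in text.split()}
        if target in tokens:
            count += 1
    return count


# Hypothetical toy emails, not taken from the Enron data.
toy_spam = ["You have won a lottery!", "Cheap meds, no meeting required"]
toy_ham = ["Meeting moved to 3pm.", "Lunch after the meeting?", "See attached report"]

spam_hits = count_emails_with_word(toy_spam, "meeting")
ham_hits = count_emails_with_word(toy_ham, "meeting")
print(spam_hits, ham_hits)
```

Each email is counted at most once, which matches the way the counts are divided by the number of emails in the Calculations slide.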
A spam filter that combines words: Naïve
Bayes
• Let's transform the one-word algorithm into a model
that considers all words…
• Form a bit vector of words for each email: x,
where xj is 1 if word j is present in the email and 0 if it is
absent
• Let c denote the class (here, spam)
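• Then $P(x|c) = \prod_j \theta_{jc}^{x_j} (1 - \theta_{jc})^{1 - x_j}$, where $\theta_{jc}$ is the probability that word j appears in an email of class c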
• Let's understand this with an example… and also
turn the product into a summation… by taking the log…
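A small worked example of the product formula and its log form, as a sketch. Here `theta` holds, for one class c, the probability that each word appears in an email of that class (the per-word parameter in the formula above). The three-word vocabulary and the probability values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def likelihood(x, theta):
    """P(x|c): product over words of theta_j^x_j * (1 - theta_j)^(1 - x_j)."""
    p = 1.0
    for xj, tj in zip(x, theta):
        p *= (tj ** xj) * ((1.0 - tj) ** (1 - xj))
    return p


def log_likelihood(x, theta):
    """log P(x|c): the same quantity, with the product turned into a sum."""
    total = 0.0
    for xj, tj in zip(x, theta):
        total += xj * math.log(tj) + (1 - xj) * math.log(1.0 - tj)
    return total


# Hypothetical values: P(word j present | spam) for a 3-word vocabulary.
theta_spam = [0.30, 0.01, 0.20]
x = [1, 0, 1]  # words 1 and 3 present, word 2 absent

p = likelihood(x, theta_spam)             # 0.30 * (1 - 0.01) * 0.20
log_p = log_likelihood(x, theta_spam)
print(p, math.exp(log_p))
```

Working with sums of logs avoids the numerical underflow that a product over thousands of small probabilities would cause.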
Multi-word (contd.)
• …
• $\log P(x|c) = \sum_j x_j w_j + w_0$
• The x values vary with each email, the weights do not… can we
compute using MR?
• Once we know P(x|c), we can estimate P(c|x)
using Bayes' Rule (P(c) and P(x) can be
computed as before); we can also use MR to
compute P(x) for various words (KEY)
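The last two bullets can be sketched end to end. Writing the per-word probability for class c as theta_j, the weights in the linear form are w_j = log(theta_j / (1 - theta_j)) and w_0 = sum_j log(1 - theta_j), which follows from expanding the log of the product. The function names, the three-word vocabulary, and the probability values below are hypothetical; the prior 0.29 is P(spam) from the Calculations slide.

```python
import math


def weights(theta):
    """Return (w, w0) such that log P(x|c) = sum_j x_j * w_j + w0."""
    w = [math.log(t / (1.0 - t)) for t in theta]
    w0 = sum(math.log(1.0 - t) for t in theta)
    return w, w0


def log_likelihood(x, w, w0):
    """Linear form of log P(x|c): only x changes from email to email."""
    return sum(xj * wj for xj, wj in zip(x, w)) + w0


def posterior_spam(x, theta_spam, theta_ham, p_spam):
    """P(spam|x) by Bayes' Rule, with P(x) summed over both classes."""
    joint_spam = math.exp(log_likelihood(x, *weights(theta_spam))) * p_spam
    joint_ham = math.exp(log_likelihood(x, *weights(theta_ham))) * (1.0 - p_spam)
    return joint_spam / (joint_spam + joint_ham)


# Hypothetical per-word probabilities for a 3-word vocabulary.
theta_spam = [0.30, 0.01, 0.20]
theta_ham = [0.01, 0.04, 0.10]
x = [1, 0, 1]

post = posterior_spam(x, theta_spam, theta_ham, p_spam=0.29)
print(round(post, 3))
```

Because the weights depend only on the training counts, they can be computed once per class and reused for every incoming email.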
Wrangling
• The rest of the chapter deals with data
wrangling
• Very important… what we are doing now with
project 1 and project 2
• Connect to an API and extract data
• DDS chapter 4 shows an example that uses
NYT data to classify articles.
Summary
• Learn Naïve Bayes Rule
• Application to spam filtering in emails
• Work through and understand the examples
discussed in class: the disease test and the spam filter
• Possible question: problem statement →
classification model using Naïve Bayes
