Naïve Bayes
Chapter 4, DDS
Introduction
• We discussed Bayes' Rule last class. Here is its
derivation from first principles of probability:
– P(A|B) = P(A&B)/P(B)
– P(B|A) = P(A&B)/P(A), so P(B|A) P(A) = P(A&B)
– Therefore P(A|B) = P(B|A) P(A) / P(B)
• Now let's look at a very common application of
Bayes' Rule to supervised learning for classification:
spam filtering
Classification
• Training set → design a model
• Test set → validate the model
• Classify the data set using the model
• Goal of classification: to assign each item in the
set to one of the given/known classes
• For spam filtering the classification is binary: spam or
not spam (ham)
Why not use the methods in Ch. 3?
• Linear regression is about continuous
variables, not binary classes
• k-NN can accommodate multiple features, but it suffers from the curse
of dimensionality: 1 distinct word → 1
feature, so 10,000 words → 10,000 features!
• What are we going to use? Naïve Bayes
Let's Review
• A rare disease affects 1% of the population
• We have a highly sensitive and specific test that is
– 99% positive for sick patients
– 99% negative for non-sick patients
• If a patient tests positive, what is the probability that
he/she is sick?
• Notation: "sick" means the patient is sick, "+" means the test is positive
• P(sick|+) = P(+|sick) P(sick) / P(+)
= P(+|sick) P(sick) / [P(+|sick) P(sick) + P(+|not sick) P(not sick)]
= 0.99*0.01 / (0.99*0.01 + 0.01*0.99)
= 0.0099 / (2*0.0099) = ½ = 0.5
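The review calculation can be checked numerically. This is a minimal sketch, assuming the 1% figure is the prior P(sick); the function name `posterior` and its parameter names are illustrative and do not come from the slides.

```python
def posterior(prior, sensitivity, specificity):
    """Return P(sick | +) using Bayes' Rule."""
    # P(+ | not sick) is the false positive rate.
    false_positive_rate = 1.0 - specificity
    # Law of total probability:
    # P(+) = P(+|sick) P(sick) + P(+|not sick) P(not sick)
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive


p_sick_given_positive = posterior(prior=0.01, sensitivity=0.99, specificity=0.99)
print(round(p_sick_given_positive, 3))  # 0.5
```

Even with a 99% accurate test, a positive result is only a coin flip here, because the disease is so rare.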
Spam Filter for individual words
Classifying mail into spam and not spam: binary
classification
Let's say we get an email saying you have won a
“lottery”: right away you know it is spam.
We will assume that if a word qualifies as
spam then the email is spam…
P(spam|word) = P(word|spam) P(spam) / P(word)
Further discussion
• Let's call good emails “ham”
• P(ham) = 1 - P(spam)
• P(word) = P(word|spam)P(spam) + P(word|ham)P(ham)
Sample data
• Enron data: https://www.cs.cmu.edu/~enron
• Enron employee emails
• A small subset chosen for EDA
• 1500 spam, 3672 ham
• Test word is “meeting”… that is, your goal is to label an
email containing the word “meeting” as spam or ham (not spam)
• Run a simple shell script and find that “meeting” appears in 16
spam emails and 153 ham emails
• Right away, what is your intuition? Now prove it using
Bayes
Calculations
• P(spam) = 1500/(1500+3672) = 0.29
• P(ham) = 0.71
• P(meeting|spam) = 16/1500 = 0.0106
• P(meeting|ham) = 153/3672 = 0.0416
• P(meeting) = P(meeting|spam)P(spam) +
P(meeting|ham)P(ham) = 0.0106*0.29 + 0.0416*0.71 = 0.03261
• P(spam|meeting) = P(meeting|spam)*P(spam)/P(meeting)
= 0.0106*0.29/0.03261 = 0.094 → 9.4%
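The same calculation can be written as a short function of the four counts. This is a sketch, with the function and parameter names chosen here for illustration. Because it keeps full precision instead of rounding the intermediate probabilities, it gives about 0.095 rather than the 0.094 obtained by hand.

```python
def p_spam_given_word(n_spam, n_ham, n_word_spam, n_word_ham):
    """Return P(spam | word) from email counts, via Bayes' Rule."""
    p_spam = n_spam / (n_spam + n_ham)
    p_ham = 1.0 - p_spam
    p_word_given_spam = n_word_spam / n_spam
    p_word_given_ham = n_word_ham / n_ham
    # P(word) = P(word|spam) P(spam) + P(word|ham) P(ham)
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Counts from the sample data: 1500 spam, 3672 ham,
# "meeting" in 16 spam emails and 153 ham emails.
p = p_spam_given_word(n_spam=1500, n_ham=3672, n_word_spam=16, n_word_ham=153)
print(f"P(spam|meeting) = {p:.3f}")
```

Algebraically the result reduces to 16/(16+153): of all the emails that contain "meeting", the fraction that are spam.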
Simulation using bash shell script
• On to demo
• The code is available on pages 105-106 … good
luck with the typos… figure it out
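The counting step of the demo can also be sketched in Python. This is not the book's bash script. It assumes, hypothetically, that each email is stored as its own text file and that spam and ham sit in separate directories; the function name and the directory layout are illustrative.

```python
import re
from pathlib import Path


def count_emails_with_word(directory, word):
    """Return (emails containing word, total emails) for files in directory."""
    # Match the word as a whole token, ignoring case.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    matches = 0
    total = 0
    for path in Path(directory).iterdir():
        if not path.is_file():
            continue
        total += 1
        # Email corpora often contain odd bytes; skip undecodable characters.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if pattern.search(text):
            matches += 1
    return matches, total
```

Calling it once on the spam directory and once on the ham directory would yield the four counts needed for the Bayes calculation. Note that whole-token matching counts "meeting" but not "meetings"; whether to merge such variants is a modeling choice.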
A spam filter that combines words: Naïve
Bayes
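• Let's transform the one-word algorithm into a model
that considers all words…
• Form a bit vector of words for each email: $x$,
where $x_j$ is 1 if word j is present and 0 if it is
absent in the email
• Let c denote the event that the email is spam
• Then, assuming words occur independently given the class (the "naïve" assumption), $P(x \mid c) = \prod_j \theta_{jc}^{x_j} (1 - \theta_{jc})^{1 - x_j}$, where $\theta_{jc}$ is the probability that word j appears in a spam email
• Let's understand this with an example… and also
turn the product into a summation by taking the log…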
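To make the product-to-sum step concrete, here is a small sketch with a made-up three-word vocabulary. The θ values are invented for illustration and are not estimated from the Enron data; the function names are hypothetical.

```python
import math


def likelihood(x, theta):
    """P(x|c) as a product over words: theta_j^x_j * (1 - theta_j)^(1 - x_j)."""
    p = 1.0
    for xj, tj in zip(x, theta):
        p *= tj ** xj * (1.0 - tj) ** (1 - xj)
    return p


def log_likelihood(x, theta):
    """log P(x|c) as a sum; assumes every theta_j is strictly between 0 and 1."""
    return sum(
        xj * math.log(tj) + (1 - xj) * math.log(1.0 - tj)
        for xj, tj in zip(x, theta)
    )


theta_spam = [0.2, 0.05, 0.01]  # hypothetical P(word j present | spam)
x = [1, 0, 1]                   # words 0 and 2 present, word 1 absent

print(likelihood(x, theta_spam))                # product form
print(math.exp(log_likelihood(x, theta_spam)))  # same value via the log sum
```

With 10,000 words the product of many small numbers can underflow to zero in floating point, which is one practical reason to work with the sum of logs.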
Multi-word (contd.)
• …
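• $\log P(x \mid c) = \sum_j x_j w_j + w_0$, where $w_j = \log\frac{\theta_{jc}}{1 - \theta_{jc}}$ and $w_0 = \sum_j \log(1 - \theta_{jc})$
• The $x_j$ values vary with each email, while the weights $w_j$ do not… can we
compute them using MapReduce (MR)?
• Once we know P(x|c), we can estimate P(c|x)
using Bayes' Rule (P(c) and P(x) can be
computed as before); we can also use MR for
the P(x) computation for various words (KEY)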
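The last step, going from P(x|c) to P(c|x), can be sketched for the two-class case as follows. The word probabilities are invented for illustration, only the prior 0.29 comes from the sample data, and all function names are hypothetical. The sketch assumes every θ is strictly between 0 and 1.

```python
import math


def weights(theta):
    """Return (w, w0) such that log P(x|c) = sum_j x_j * w_j + w0."""
    w = [math.log(t / (1.0 - t)) for t in theta]
    w0 = sum(math.log(1.0 - t) for t in theta)
    return w, w0


def log_p_x_given_c(x, theta):
    w, w0 = weights(theta)
    return sum(xj * wj for xj, wj in zip(x, w)) + w0


def p_spam_given_x(x, theta_spam, theta_ham, p_spam):
    """P(spam|x) by Bayes' Rule, with P(x) = P(x|spam)P(spam) + P(x|ham)P(ham)."""
    log_joint_spam = log_p_x_given_c(x, theta_spam) + math.log(p_spam)
    log_joint_ham = log_p_x_given_c(x, theta_ham) + math.log(1.0 - p_spam)
    # Subtract the larger log before exponentiating to avoid underflow.
    m = max(log_joint_spam, log_joint_ham)
    e_spam = math.exp(log_joint_spam - m)
    e_ham = math.exp(log_joint_ham - m)
    return e_spam / (e_spam + e_ham)


theta_spam = [0.30, 0.01]  # hypothetical P("lottery"|spam), P("meeting"|spam)
theta_ham = [0.01, 0.04]   # hypothetical P("lottery"|ham), P("meeting"|ham)
result = p_spam_given_x([1, 0], theta_spam, theta_ham, p_spam=0.29)
print(result)
```

The weights depend only on the training data, so they can be computed once and reused for every incoming email; only the bit vector x changes.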
Wrangling
• Rest of the chapter deals with wrangling of
data
• Very important… what we are doing now with
project 1 and project 2
• Connect to an API and extract data
• Chapter 4 of DDS shows an example that uses NYT
data to classify articles.
Summary
• Learn Naïve Bayes Rule
• Application to spam filtering in emails
• Work through and understand the examples
discussed in class: the disease test and the spam filter
• Possible question: problem statement →
classification model using Naïve Bayes
