SlideShare a Scribd company logo
1 of 14
Naïve Bayes
Chapter 4, DDS
Introduction
• We discussed the Bayes Rule last class: Here is a
its derivation from first principles of probabilities:
– P(A|B) = P(A&B)/P(B)
P(B|A) = P(A&B)/P(A)P(B|A) P(A) =P(A&B)
P(A|B) =
P(B|A)P(A)
P(B)
• Now lets look a very common application of
Bayes, for supervised learning in classification,
spam filtering
Classification
• Training set  design a model
• Test set  validate the model
• Classify data set using the model
• Goal of classification: to label the items in the
set to one of the given/known classes
• For spam filtering it is binary class: spam or nit
spam(ham)
Why not use methods in ch.3?
• Linear regression is about continuous
variables, not binary class
• K-nn can accommodate multi-features: curse
of dimensionality: 1 distinct word 1
feature 10000 words 10000 features!
• What are we going to use? Naïve Bayes
Lets Review
• A rare disease where 1%
• We have highly sensitive and specific test that is
– 99% positive for sick patients
– 99% negative for non-sick
• If a patients test positive, what is probability that
he/she is sick?
• Approach: patient is sick : sick, tests positive +
• P(sick/+) = P(+/sick) P(sick)/P(+)=
0.99*0.01/(0.99*0.01+0.99*0.01) =
0.099/2*(0.099) = ½ = 0.5
Spam Filter for individual words
Classifying mail into spam and not spam: binary
classification
Lets say if we get a mail with --- you have won a
“lottery” right away you know it is a spam.
We will assume that is if a word qualifies to be a
spam then the email is a spam…
P(spam|word) =
P(word|spam)P(spam)
P(word)
Further discussion
• Lets call good emails “ham”
• P(ham) = 1- P(spam)
• P(word) = P(word|spam)P(spam) + P(word|ham)P(ham)
Sample data
• Enron data: https://www.cs.cmu.edu/~enron
• Enron employee emails
• A small subset chosen for EDA
• 1500 spam, 3672 ham
• Test word is “meeting”…that is, your goal is label a
email with word “meeting” as spam or ham (not spam)
• Run an simple shell script and find out that 16
“meeting”s in spam, 153 “meetings” in ham
• Right away what is your intuition? Now prove it using
Bayes
Calculations
• P(spam) = 1500/(1500+3672) = 0.29
• P(ham) = 0.71
• P(meeting|spam) = 16/1500= 0.0106
• P(meeting|ham) = 15/3672 = 0.0416
• P(meeting) = P(meeting|spam)P(spam) +
P(meeting|ham)P(ham) = 0.0106 *0.29 + 0.0416+0.71= 0.03261
• P(spam|meeting) = P(meeting|spam)*P(spam)/P(meeting)
= 0.0106*0.29/0.03261 = 0.094  9.4%
Simulation using bash shell script
• On to demo
• This code is available in pages 105-106 … good
luck with the typos… figure it out
A spam that combines words: Naïve
Bayes
• Lets transform one word algorithm to a model
that considers all words…
• Form an bit vector for words with each email: X
with xj is 1 if the word is present, 0 if the word is
absent in the email
• Let c denote it is spam
• Then 𝑃 𝑥 𝑐 = 𝑗(∅ 𝑗𝑐)xj (1 - ∅ 𝑗𝑐) (1-xj)
• Lets understand this with an example..and also
turn product into summation..by using log..
Multi-word (contd.)
• …
• log(p(x|c)) = 𝑗 𝑋𝑗 𝑊𝑗 + 𝑤0
• The x weights vary with email… can we
compute using MR?
• Once you know P(x|c), we can estimate P(c|x)
using Bayes Rule (P(c), and P(x) can be
computed as before); we can also use MR for
P(x) computation for various words (KEY)
Wrangling
• Rest of the chapter deals with wrangling of
data
• Very important… what we are doing now with
project 1 and project 2
• Connect to an API and extract data
• The DDS chapter 4 shows an example with
NYT data and classifies the articles.
Summary
• Learn Naïve Bayes Rule
• Application to spam filtering in emails
• Work the example/understand the example
discussed in class: disease one, a spam filter..
• Possible question problem statement 
classification model using Naïve Bayes

More Related Content

What's hot

Ch03 Mining Massive Data Sets stanford
Ch03 Mining Massive Data Sets  stanfordCh03 Mining Massive Data Sets  stanford
Ch03 Mining Massive Data Sets stanfordSakthivel C R
 
PROLOG: Recursion And Lists In Prolog
PROLOG: Recursion And Lists In PrologPROLOG: Recursion And Lists In Prolog
PROLOG: Recursion And Lists In PrologDataminingTools Inc
 
The Ring programming language version 1.6 book - Part 182 of 189
The Ring programming language version 1.6 book - Part 182 of 189The Ring programming language version 1.6 book - Part 182 of 189
The Ring programming language version 1.6 book - Part 182 of 189Mahmoud Samir Fayed
 
Introduction to prolog
Introduction to prologIntroduction to prolog
Introduction to prologHarry Potter
 
10 logic+programming+with+prolog
10 logic+programming+with+prolog10 logic+programming+with+prolog
10 logic+programming+with+prologbaran19901990
 
Python strings presentation
Python strings presentationPython strings presentation
Python strings presentationVedaGayathri1
 
Strings In OOP(Object oriented programming)
Strings In OOP(Object oriented programming)Strings In OOP(Object oriented programming)
Strings In OOP(Object oriented programming)Danial Virk
 
Python001 training course_mumbai
Python001 training course_mumbaiPython001 training course_mumbai
Python001 training course_mumbaivibrantuser
 
Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationPierre de Lacaze
 
Some tips for taking the High School AP Java college board exam
Some tips for taking the High School  AP Java college board examSome tips for taking the High School  AP Java college board exam
Some tips for taking the High School AP Java college board examMichael Scaman
 

What's hot (15)

Ch03 Mining Massive Data Sets stanford
Ch03 Mining Massive Data Sets  stanfordCh03 Mining Massive Data Sets  stanford
Ch03 Mining Massive Data Sets stanford
 
haskell_fp1
haskell_fp1haskell_fp1
haskell_fp1
 
PROLOG: Recursion And Lists In Prolog
PROLOG: Recursion And Lists In PrologPROLOG: Recursion And Lists In Prolog
PROLOG: Recursion And Lists In Prolog
 
String Handling
String HandlingString Handling
String Handling
 
BDACA - Lecture3
BDACA - Lecture3BDACA - Lecture3
BDACA - Lecture3
 
The Ring programming language version 1.6 book - Part 182 of 189
The Ring programming language version 1.6 book - Part 182 of 189The Ring programming language version 1.6 book - Part 182 of 189
The Ring programming language version 1.6 book - Part 182 of 189
 
Introduction to prolog
Introduction to prologIntroduction to prolog
Introduction to prolog
 
10 logic+programming+with+prolog
10 logic+programming+with+prolog10 logic+programming+with+prolog
10 logic+programming+with+prolog
 
Python strings presentation
Python strings presentationPython strings presentation
Python strings presentation
 
Strings In OOP(Object oriented programming)
Strings In OOP(Object oriented programming)Strings In OOP(Object oriented programming)
Strings In OOP(Object oriented programming)
 
Prolog
PrologProlog
Prolog
 
Python001 training course_mumbai
Python001 training course_mumbaiPython001 training course_mumbai
Python001 training course_mumbai
 
Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and Representation
 
Introduction to Prolog
Introduction to PrologIntroduction to Prolog
Introduction to Prolog
 
Some tips for taking the High School AP Java college board exam
Some tips for taking the High School  AP Java college board examSome tips for taking the High School  AP Java college board exam
Some tips for taking the High School AP Java college board exam
 

Viewers also liked

Chap 4 Review.pdf
Chap 4 Review.pdfChap 4 Review.pdf
Chap 4 Review.pdfbwlomas
 
Function operations and compositions
Function operations and compositionsFunction operations and compositions
Function operations and compositionsbwlomas
 
Abstract class
Abstract classAbstract class
Abstract classJames Wong
 
Extending burp with python
Extending burp with pythonExtending burp with python
Extending burp with pythonJames Wong
 
3-1 to 3-3 Review Day.pdf
3-1 to 3-3 Review Day.pdf3-1 to 3-3 Review Day.pdf
3-1 to 3-3 Review Day.pdfbwlomas
 
Res epre61 13
Res epre61 13Res epre61 13
Res epre61 13EPRE
 
Object Oriented Analysis & Design
Object Oriented Analysis & DesignObject Oriented Analysis & Design
Object Oriented Analysis & Designkpthakershy
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and pythonYoung Alista
 
Hash mac algorithms
Hash mac algorithmsHash mac algorithms
Hash mac algorithmsYoung Alista
 
Test driven development
Test driven developmentTest driven development
Test driven developmentYoung Alista
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingYoung Alista
 
Crypto passport authentication
Crypto passport authenticationCrypto passport authentication
Crypto passport authenticationYoung Alista
 
Google appenginejava.ppt
Google appenginejava.pptGoogle appenginejava.ppt
Google appenginejava.pptYoung Alista
 

Viewers also liked (19)

Chap 4 Review.pdf
Chap 4 Review.pdfChap 4 Review.pdf
Chap 4 Review.pdf
 
Lab 1
Lab 1Lab 1
Lab 1
 
Function operations and compositions
Function operations and compositionsFunction operations and compositions
Function operations and compositions
 
Abstract class
Abstract classAbstract class
Abstract class
 
Polymorphism
PolymorphismPolymorphism
Polymorphism
 
Decision tree
Decision treeDecision tree
Decision tree
 
Extending burp with python
Extending burp with pythonExtending burp with python
Extending burp with python
 
3-1 to 3-3 Review Day.pdf
3-1 to 3-3 Review Day.pdf3-1 to 3-3 Review Day.pdf
3-1 to 3-3 Review Day.pdf
 
Res epre61 13
Res epre61 13Res epre61 13
Res epre61 13
 
Object Oriented Analysis & Design
Object Oriented Analysis & DesignObject Oriented Analysis & Design
Object Oriented Analysis & Design
 
Ooad overview
Ooad overviewOoad overview
Ooad overview
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and python
 
Hash mac algorithms
Hash mac algorithmsHash mac algorithms
Hash mac algorithms
 
Test driven development
Test driven developmentTest driven development
Test driven development
 
Basic dns-mod
Basic dns-modBasic dns-mod
Basic dns-mod
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Decision tree
Decision treeDecision tree
Decision tree
 
Crypto passport authentication
Crypto passport authenticationCrypto passport authentication
Crypto passport authentication
 
Google appenginejava.ppt
Google appenginejava.pptGoogle appenginejava.ppt
Google appenginejava.ppt
 

Similar to Naïve bayes

NLP - Sentiment Analysis
NLP - Sentiment AnalysisNLP - Sentiment Analysis
NLP - Sentiment AnalysisRupak Roy
 
Data simulation basics
Data simulation basicsData simulation basics
Data simulation basicsDorothy Bishop
 
Supervised learning: Types of Machine Learning
Supervised learning: Types of Machine LearningSupervised learning: Types of Machine Learning
Supervised learning: Types of Machine LearningLibya Thomas
 
Cs221 lecture5-fall11
Cs221 lecture5-fall11Cs221 lecture5-fall11
Cs221 lecture5-fall11darwinrlo
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Jinpyo Lee
 
Learn from Example and Learn Probabilistic Model
Learn from Example and Learn Probabilistic ModelLearn from Example and Learn Probabilistic Model
Learn from Example and Learn Probabilistic ModelJunya Tanaka
 
Understanding the Machine Learning Algorithms
Understanding the Machine Learning AlgorithmsUnderstanding the Machine Learning Algorithms
Understanding the Machine Learning AlgorithmsRupak Roy
 
IR-lec17-probabilistic-ir.pdf
IR-lec17-probabilistic-ir.pdfIR-lec17-probabilistic-ir.pdf
IR-lec17-probabilistic-ir.pdfhimarusti
 
Naive_hehe.pptx
Naive_hehe.pptxNaive_hehe.pptx
Naive_hehe.pptxMahimMajee
 
Model Selection and Validation
Model Selection and ValidationModel Selection and Validation
Model Selection and Validationgmorishita
 
Functions, List and String methods
Functions, List and String methodsFunctions, List and String methods
Functions, List and String methodsPranavSB
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier ananth
 
An introduction to Bayesian Statistics using Python
An introduction to Bayesian Statistics using PythonAn introduction to Bayesian Statistics using Python
An introduction to Bayesian Statistics using Pythonfreshdatabos
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Kira
 
Classifying text with Bayes Models
Classifying text with Bayes ModelsClassifying text with Bayes Models
Classifying text with Bayes ModelsValentin Mihov
 

Similar to Naïve bayes (20)

tutorial.ppt
tutorial.ppttutorial.ppt
tutorial.ppt
 
NLP - Sentiment Analysis
NLP - Sentiment AnalysisNLP - Sentiment Analysis
NLP - Sentiment Analysis
 
Naive.pdf
Naive.pdfNaive.pdf
Naive.pdf
 
Naive Bayes
Naive Bayes Naive Bayes
Naive Bayes
 
Data simulation basics
Data simulation basicsData simulation basics
Data simulation basics
 
Supervised learning: Types of Machine Learning
Supervised learning: Types of Machine LearningSupervised learning: Types of Machine Learning
Supervised learning: Types of Machine Learning
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
Cs221 lecture5-fall11
Cs221 lecture5-fall11Cs221 lecture5-fall11
Cs221 lecture5-fall11
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)
 
Learn from Example and Learn Probabilistic Model
Learn from Example and Learn Probabilistic ModelLearn from Example and Learn Probabilistic Model
Learn from Example and Learn Probabilistic Model
 
Understanding the Machine Learning Algorithms
Understanding the Machine Learning AlgorithmsUnderstanding the Machine Learning Algorithms
Understanding the Machine Learning Algorithms
 
IR-lec17-probabilistic-ir.pdf
IR-lec17-probabilistic-ir.pdfIR-lec17-probabilistic-ir.pdf
IR-lec17-probabilistic-ir.pdf
 
Naive_hehe.pptx
Naive_hehe.pptxNaive_hehe.pptx
Naive_hehe.pptx
 
Model Selection and Validation
Model Selection and ValidationModel Selection and Validation
Model Selection and Validation
 
Functions, List and String methods
Functions, List and String methodsFunctions, List and String methods
Functions, List and String methods
 
NLP Project Full Cycle
NLP Project Full CycleNLP Project Full Cycle
NLP Project Full Cycle
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier
 
An introduction to Bayesian Statistics using Python
An introduction to Bayesian Statistics using PythonAn introduction to Bayesian Statistics using Python
An introduction to Bayesian Statistics using Python
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)
 
Classifying text with Bayes Models
Classifying text with Bayes ModelsClassifying text with Bayes Models
Classifying text with Bayes Models
 

More from James Wong

Multi threaded rtos
Multi threaded rtosMulti threaded rtos
Multi threaded rtosJames Wong
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data miningJames Wong
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryJames Wong
 
Big picture of data mining
Big picture of data miningBig picture of data mining
Big picture of data miningJames Wong
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching worksJames Wong
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsJames Wong
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherenceJames Wong
 
Abstract data types
Abstract data typesAbstract data types
Abstract data typesJames Wong
 
Abstraction file
Abstraction fileAbstraction file
Abstraction fileJames Wong
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cacheJames Wong
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysisJames Wong
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with javaJames Wong
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithmsJames Wong
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and pythonJames Wong
 

More from James Wong (20)

Data race
Data raceData race
Data race
 
Multi threaded rtos
Multi threaded rtosMulti threaded rtos
Multi threaded rtos
 
Recursion
RecursionRecursion
Recursion
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data mining
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Cache recap
Cache recapCache recap
Cache recap
 
Big picture of data mining
Big picture of data miningBig picture of data mining
Big picture of data mining
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching works
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherence
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
 
Abstraction file
Abstraction fileAbstraction file
Abstraction file
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cache
 
Object model
Object modelObject model
Object model
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysis
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with java
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and python
 
Inheritance
InheritanceInheritance
Inheritance
 
Api crash
Api crashApi crash
Api crash
 

Recently uploaded

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 

Recently uploaded (20)

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Naïve bayes

  • 2. Introduction • We discussed the Bayes Rule last class: Here is a its derivation from first principles of probabilities: – P(A|B) = P(A&B)/P(B) P(B|A) = P(A&B)/P(A)P(B|A) P(A) =P(A&B) P(A|B) = P(B|A)P(A) P(B) • Now lets look a very common application of Bayes, for supervised learning in classification, spam filtering
  • 3. Classification • Training set  design a model • Test set  validate the model • Classify data set using the model • Goal of classification: to label the items in the set to one of the given/known classes • For spam filtering it is binary class: spam or nit spam(ham)
  • 4. Why not use methods in ch.3? • Linear regression is about continuous variables, not binary class • K-nn can accommodate multi-features: curse of dimensionality: 1 distinct word 1 feature 10000 words 10000 features! • What are we going to use? Naïve Bayes
  • 5. Lets Review • A rare disease where 1% • We have highly sensitive and specific test that is – 99% positive for sick patients – 99% negative for non-sick • If a patients test positive, what is probability that he/she is sick? • Approach: patient is sick : sick, tests positive + • P(sick/+) = P(+/sick) P(sick)/P(+)= 0.99*0.01/(0.99*0.01+0.99*0.01) = 0.099/2*(0.099) = ½ = 0.5
  • 6. Spam Filter for individual words Classifying mail into spam and not spam: binary classification Lets say if we get a mail with --- you have won a “lottery” right away you know it is a spam. We will assume that is if a word qualifies to be a spam then the email is a spam… P(spam|word) = P(word|spam)P(spam) P(word)
  • 7. Further discussion • Lets call good emails “ham” • P(ham) = 1- P(spam) • P(word) = P(word|spam)P(spam) + P(word|ham)P(ham)
  • 8. Sample data • Enron data: https://www.cs.cmu.edu/~enron • Enron employee emails • A small subset chosen for EDA • 1500 spam, 3672 ham • Test word is “meeting”…that is, your goal is label a email with word “meeting” as spam or ham (not spam) • Run an simple shell script and find out that 16 “meeting”s in spam, 153 “meetings” in ham • Right away what is your intuition? Now prove it using Bayes
  • 9. Calculations • P(spam) = 1500/(1500+3672) = 0.29 • P(ham) = 0.71 • P(meeting|spam) = 16/1500= 0.0106 • P(meeting|ham) = 15/3672 = 0.0416 • P(meeting) = P(meeting|spam)P(spam) + P(meeting|ham)P(ham) = 0.0106 *0.29 + 0.0416+0.71= 0.03261 • P(spam|meeting) = P(meeting|spam)*P(spam)/P(meeting) = 0.0106*0.29/0.03261 = 0.094  9.4%
  • 10. Simulation using bash shell script • On to demo • This code is available in pages 105-106 … good luck with the typos… figure it out
  • 11. A spam that combines words: Naïve Bayes • Lets transform one word algorithm to a model that considers all words… • Form an bit vector for words with each email: X with xj is 1 if the word is present, 0 if the word is absent in the email • Let c denote it is spam • Then 𝑃 𝑥 𝑐 = 𝑗(∅ 𝑗𝑐)xj (1 - ∅ 𝑗𝑐) (1-xj) • Lets understand this with an example..and also turn product into summation..by using log..
  • 12. Multi-word (contd.) • … • log(p(x|c)) = 𝑗 𝑋𝑗 𝑊𝑗 + 𝑤0 • The x weights vary with email… can we compute using MR? • Once you know P(x|c), we can estimate P(c|x) using Bayes Rule (P(c), and P(x) can be computed as before); we can also use MR for P(x) computation for various words (KEY)
  • 13. Wrangling • Rest of the chapter deals with wrangling of data • Very important… what we are doing now with project 1 and project 2 • Connect to an API and extract data • The DDS chapter 4 shows an example with NYT data and classifies the articles.
  • 14. Summary • Learn Naïve Bayes Rule • Application to spam filtering in emails • Work the example/understand the example discussed in class: disease one, a spam filter.. • Possible question problem statement  classification model using Naïve Bayes