SlideShare a Scribd company logo
1 of 25
Download to read offline
P1WU
UNIT – III: CLASSIFICATION
Topic 3: NAÏVE TEXT CLASSIFICATION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or
Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional
Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
NAÏVE TEXT CLASSIFICATION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO NAÏVE TEXT CLASSIFICATION
• Naive Bayes classifiers are a collection of classification
algorithms based on Bayes Theorem.
• It is not a single algorithm but a family of algorithms where all
of them share a common principle, i.e. every pair of features
being classified is independent of each other.
• Naive Bayes classifiers have been heavily used
for text classification and text analysis machine learning
problems.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO NAÏVE TEXT CLASSIFICATION
• Text Analysis is a major application field for machine learning
algorithms.
• However the raw data,
• a sequence of symbols (i.e. strings) cannot be fed directly to the algorithms
themselves as most of them expect numerical feature vectors with a fixed size
rather than the raw text documents with variable length.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
The Naive Bayes algorithm
• Naive Bayes classifiers are a collection of classification
algorithms based on Bayes’ Theorem.
• It is not a single algorithm but a family of algorithms where all
of them share a common principle,
• i.e. every pair of features being classified is independent of each other.
• The dataset is divided into two parts, namely,
feature matrix and the response/target vector.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
The Naive Bayes algorithm
• The Feature matrix (X) contains all the vectors(rows) of the
dataset in which each vector consists of the value of
dependent features. The number of features is d i.e. X =
(x1,x2,x2, xd).
• The Response/target vector (y) contains the value of
class/group variable for each row of feature matrix.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
The Bayes’ Theorem
Bayes’ Theorem finds the probability of an event
occurring given the probability of another event that
has already occurred.
Bayes’ theorem is stated mathematically as follows:
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
The Bayes’ Theorem
• where:
• A and B are called events.
• P(A | B) is the probability of event A, given the event B is true (has occured)
• Event B is also termed as evidence.
P(A) is the priori of A (the prior independent probability, i.e. probability of event
before evidence is seen).
• P(B | A) is the probability of B given event A, i.e. probability of event B after evidence
A is seen.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
The Bayes’ Theorem
• Summary
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Dealing with text data
• Text Analysis is a major application field for machine learning
algorithms.
However the raw data, a sequence of symbols (i.e. strings) cannot be fed
directly to the algorithms themselves as most of them expect
numerical feature vectors with a fixed size rather than the raw text
documents with variable length.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Dealing with text data
• In order to address this, scikit-learn provides utilities for the most
common ways to extract numerical features from text content,
namely:
• tokenizing strings and giving an integer id for each possible token, for
instance by using w ite-spaces and punctuation as token separators.
• counting the occurrences of tokens in each document.
• In this scheme, features and samples are defined as follows:
• each individual token occurrence frequency is treated as a feature.
• the vector of all the token frequencies for a given document is considered a multivariate
sample.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 1 : Using the Naive Bayesian Classifier
• We will consider the following training set.
• The data samples are described by attributes age, income, student,
and credit.
• The class label attribute, buy, tells whether the person buys a
computer, has two distinct values, yes (class C1) and no (class C2).
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 1 : Using the Naive Bayesian Classifier
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
RID Age Income student Credit Ci: buy
1 Youth High no Fair C2: no
2 Youth High no Excellent C2: no
3 middle-aged High no Fair C1: yes
4 Senior medium no Fair C1: yes
5 Senior Low yes Fair C1: yes
6 Senior Low yes Excellent C2: no
7 middle-aged Low yes Excellent C1: yes
8 Youth medium no Fair C2: no
9 Youth Low yes Fair C1: yes
10 Senior medium yes Fair C1: yes
11 Youth medium yes Excellent C1: yes
12 middle-aged medium no Excellent C1: yes
13 middle-aged High yes Fair C1: yes
14 Senior medium no Excellent C2: no
Example 1 : Using the Naive Bayesian Classifier
• The sample we wish to classify is
• X = (age = youth, income = medium, student = yes, credit = fair)
• We need to maximize P (X|Ci)P (Ci), for i = 1, 2. P (Ci), the a priori
probability of each class, can be estimated based on the training
samples:
• P(buy =yes ) = 9 /14
• P(buy =no ) = 5 /14
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 1 : Using the Naive Bayesian Classifier
• To compute P (X|Ci), for i = 1, 2, we compute the following conditional
probabilities:
• P(age=youth | buy =yes ) = 2/9
• P(income =medium | buy =yes ) = 4/9
• P(student =yes | buy =yes ) = 6/9
• P(credit =fair | buy =yes ) = 6/9
• P(age=youth | buy =no ) = 3/5
• P(income =medium | buy =no ) = 2/5
• P(student =yes | buy =no ) = 1/5
• P(credit =fair | buy =no ) = 2/5
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 1 : Using the Naive Bayesian Classifier
• Using the above probabilities, we obtain
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 1 : Using the Naive Bayesian Classifier
• Similarly
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
To find the class that maximizes P (X|Ci)P (Ci), we compute
Thus the naive Bayesian classifier predicts buy = yes for sample X
Example 2: Predicting a class label using naïve Bayesian classification
• Predicting a class label using naïve Bayesian classification.
• The training data set is given below:
• The data tuples are described by the attributes Owns Home?, Married,
Gender and Employed.
• The class label attribute Risk Class has three distinct values.
• Let C1 corresponds to the class A, and C2 corresponds to the class B
and C3 corresponds to the class C.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 1 : Using the Naive Bayesian Classifier
• The tuple is to classify is,
• X = (Owns Home = Yes, Married = No, Gender = Female, Employed = Yes)
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Owns Home Married Gender Employed Risk Class
Yes Yes Male Yes B
No No Female Yes A
Yes Yes Female Yes C
Yes No Male No B
No Yes Female Yes C
No No Female Yes A
No No Male No B
Yes No Female Yes A
No Yes Female Yes C
Yes Yes Female Yes C
Example 2: Predicting a class label using naïve Bayesian classification
• Solution
• There are 10 samples and three classes.
• Risk class A = 3 Risk class B = 3 Risk class C = 4
•
• The prior probabilities are obtained by dividing these frequencies by
the total number in the training data,
• P(A) = 3/10 = 0.3 P(B) = 3/10 = 0.3 P(C) = 4/10 = 0.4
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 2: Predicting a class label using naïve Bayesian classification
• To compute P(X/Ci) =P {yes, no, female, yes}/Ci) for each of the classes, the conditional probabilities for each:
• P(Owns Home = Yes/A) = 1/3 =0.33
• P(Married = No/A) = 3/3 =1
• P(Gender = Female/A) = 3/3 = 1
• P(Employed = Yes/A) = 3/3 = 1
•
• P(Owns Home = Yes/B) = 2/3 =0.67
• P(Married = No/B) = 2/3 =0.67
• P(Gender = Female/B) = 0/3 = 0
• P(Employed = Yes/B) = 1/3 = 0.33
•
• P(Owns Home = Yes/C) = 2/4 =0.5
• P(Married = No/C) = 0/4 =0
• P(Gender = Female/C) = 4/4 = 1
• P(Employed = Yes/C) = 4/4 = 1
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 2: Predicting a class label using naïve Bayesian classification
• Using the above probabilities, we obtain
• P(X/A)= P(Owns Home = Yes/A) X
• P(Married = No/A) x
• P(Gender = Female/A) X
• P(Employed = Yes/A)
= 0.33 x 1 x 1 x 1 = 0.33
• Similarly, P(X/B)= 0 , P(X/C) =0
•
• To find the class, G, that maximizes, P(X/Ci)P(Ci), we compute,
• P(X/A) P(A) = 0.33 X 0.3 = 0.099
• P(X/B) P(B) =0 X 0.3 = 0
• P(X/C) P(C) = 0 X 0.4 = 0.0
• Therefore x is assigned to class A
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Advantages and Disadvantages
• Advantages:
a) Have the minimum error rate in comparison to all other classifiers.
b) Easy to implement
c) Good results obtained in most of the cases.
d) They provide theoretical justification for other classifiers that do not
explicitly use
• Disadvantages:
a) Lack of available probability data.
b) Inaccuracies in the assumption.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES

More Related Content

What's hot

Unit II Engineering Ethics (GE8076 Professional Ethics in Engineering)
Unit II Engineering Ethics (GE8076 Professional Ethics in Engineering)Unit II Engineering Ethics (GE8076 Professional Ethics in Engineering)
Unit II Engineering Ethics (GE8076 Professional Ethics in Engineering)Dr. SELVAGANESAN S
 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...Simplilearn
 
Professional Ethics Unit2
Professional Ethics Unit2Professional Ethics Unit2
Professional Ethics Unit2LovelitJose
 
Unit III GE8076 Professional Ethics in Engineering by Dr.Selvaganesan
Unit III GE8076 Professional Ethics in Engineering by Dr.SelvaganesanUnit III GE8076 Professional Ethics in Engineering by Dr.Selvaganesan
Unit III GE8076 Professional Ethics in Engineering by Dr.SelvaganesanDr. SELVAGANESAN S
 
Cross validation
Cross validationCross validation
Cross validationRidhaAfrawe
 
Machine learning and types
Machine learning and typesMachine learning and types
Machine learning and typesPadma Metta
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithmhadifar
 
Unit I Human Values (GE8076 Professional Ethics in Engineering)
Unit I Human Values (GE8076 Professional Ethics in Engineering)Unit I Human Values (GE8076 Professional Ethics in Engineering)
Unit I Human Values (GE8076 Professional Ethics in Engineering)Dr. SELVAGANESAN S
 
Collaborative filtering
Collaborative filteringCollaborative filtering
Collaborative filteringNeha Kulkarni
 
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 
Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms Xin-She Yang
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and RegressionMegha Sharma
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networkDEEPASHRI HK
 
Classification and Clustering
Classification and ClusteringClassification and Clustering
Classification and ClusteringEng Teong Cheah
 

What's hot (20)

Unit II Engineering Ethics (GE8076 Professional Ethics in Engineering)
Unit II Engineering Ethics (GE8076 Professional Ethics in Engineering)Unit II Engineering Ethics (GE8076 Professional Ethics in Engineering)
Unit II Engineering Ethics (GE8076 Professional Ethics in Engineering)
 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
 
GE8076_PEE_UNIT - I T8 T9 CARING SHARING.pptx
GE8076_PEE_UNIT - I T8 T9 CARING SHARING.pptxGE8076_PEE_UNIT - I T8 T9 CARING SHARING.pptx
GE8076_PEE_UNIT - I T8 T9 CARING SHARING.pptx
 
Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 
Professional Ethics Unit2
Professional Ethics Unit2Professional Ethics Unit2
Professional Ethics Unit2
 
Unit III GE8076 Professional Ethics in Engineering by Dr.Selvaganesan
Unit III GE8076 Professional Ethics in Engineering by Dr.SelvaganesanUnit III GE8076 Professional Ethics in Engineering by Dr.Selvaganesan
Unit III GE8076 Professional Ethics in Engineering by Dr.Selvaganesan
 
Cross validation
Cross validationCross validation
Cross validation
 
Machine learning and types
Machine learning and typesMachine learning and types
Machine learning and types
 
GE8076_PEE_UNIT - I T11 COURAGE.pptx
GE8076_PEE_UNIT - I T11 COURAGE.pptxGE8076_PEE_UNIT - I T11 COURAGE.pptx
GE8076_PEE_UNIT - I T11 COURAGE.pptx
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
 
GE8076_PEE_UNIT - I T18 SPIRITUALITY.pptx
GE8076_PEE_UNIT - I T18  SPIRITUALITY.pptxGE8076_PEE_UNIT - I T18  SPIRITUALITY.pptx
GE8076_PEE_UNIT - I T18 SPIRITUALITY.pptx
 
Unit I Human Values (GE8076 Professional Ethics in Engineering)
Unit I Human Values (GE8076 Professional Ethics in Engineering)Unit I Human Values (GE8076 Professional Ethics in Engineering)
Unit I Human Values (GE8076 Professional Ethics in Engineering)
 
GE6075 PROFESSIONAL ETHICS IN ENGINEERING Unit 3
GE6075 PROFESSIONAL ETHICS IN ENGINEERING Unit 3GE6075 PROFESSIONAL ETHICS IN ENGINEERING Unit 3
GE6075 PROFESSIONAL ETHICS IN ENGINEERING Unit 3
 
Collaborative filtering
Collaborative filteringCollaborative filtering
Collaborative filtering
 
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
 
Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Classification and Clustering
Classification and ClusteringClassification and Clustering
Classification and Clustering
 
Senses of engineering ethics1
Senses of engineering ethics1Senses of engineering ethics1
Senses of engineering ethics1
 

Similar to CS8080_IRT_UNIT - III T3 NAIVE TEXT CLASSIFICATION.pdf

IRJET- Stabilization of Black Cotton Soil using Rice Husk Ash and Lime
IRJET- Stabilization of Black Cotton Soil using Rice Husk Ash and LimeIRJET- Stabilization of Black Cotton Soil using Rice Husk Ash and Lime
IRJET- Stabilization of Black Cotton Soil using Rice Husk Ash and LimeIRJET Journal
 
IRJET- Student Placement Prediction using Machine Learning
IRJET- Student Placement Prediction using Machine LearningIRJET- Student Placement Prediction using Machine Learning
IRJET- Student Placement Prediction using Machine LearningIRJET Journal
 
Adversarial Learning_Rupam Bhattacharya
Adversarial Learning_Rupam BhattacharyaAdversarial Learning_Rupam Bhattacharya
Adversarial Learning_Rupam BhattacharyaRupam Bhattacharya
 
CS 402 DATAMINING AND WAREHOUSING -MODULE 3
CS 402 DATAMINING AND WAREHOUSING -MODULE 3CS 402 DATAMINING AND WAREHOUSING -MODULE 3
CS 402 DATAMINING AND WAREHOUSING -MODULE 3NIMMYRAJU
 
Machine Learning for Aerospace Training
Machine Learning for Aerospace TrainingMachine Learning for Aerospace Training
Machine Learning for Aerospace TrainingMikhail Klassen
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for EveryoneAly Abdelkareem
 

Similar to CS8080_IRT_UNIT - III T3 NAIVE TEXT CLASSIFICATION.pdf (20)

CS8080_IRT_UNIT - III T10 ACCURACY AND ERROR.pdf
CS8080_IRT_UNIT - III T10  ACCURACY AND ERROR.pdfCS8080_IRT_UNIT - III T10  ACCURACY AND ERROR.pdf
CS8080_IRT_UNIT - III T10 ACCURACY AND ERROR.pdf
 
CS8080 IRT UNIT - III SLIDES IN PDF.pdf
CS8080  IRT UNIT - III  SLIDES IN PDF.pdfCS8080  IRT UNIT - III  SLIDES IN PDF.pdf
CS8080 IRT UNIT - III SLIDES IN PDF.pdf
 
CS8080_IRT_UNIT - III T1 A CHARACTERIZATION OF TEXT CLASSIFICATION.pdf
CS8080_IRT_UNIT - III T1 A CHARACTERIZATION OF TEXT CLASSIFICATION.pdfCS8080_IRT_UNIT - III T1 A CHARACTERIZATION OF TEXT CLASSIFICATION.pdf
CS8080_IRT_UNIT - III T1 A CHARACTERIZATION OF TEXT CLASSIFICATION.pdf
 
CS8080_IRT_UNIT - III T5 DECISION TREES.pdf
CS8080_IRT_UNIT - III T5  DECISION TREES.pdfCS8080_IRT_UNIT - III T5  DECISION TREES.pdf
CS8080_IRT_UNIT - III T5 DECISION TREES.pdf
 
CS8080_IRT_UNIT - III T9 EVALUATION METRICS.pdf
CS8080_IRT_UNIT - III T9 EVALUATION METRICS.pdfCS8080_IRT_UNIT - III T9 EVALUATION METRICS.pdf
CS8080_IRT_UNIT - III T9 EVALUATION METRICS.pdf
 
CS8080_IRT_UNIT - III T8 FEATURE SELECTION OR DIMENSIONALITY REDUCTION.pdf
CS8080_IRT_UNIT - III T8  FEATURE SELECTION OR DIMENSIONALITY REDUCTION.pdfCS8080_IRT_UNIT - III T8  FEATURE SELECTION OR DIMENSIONALITY REDUCTION.pdf
CS8080_IRT_UNIT - III T8 FEATURE SELECTION OR DIMENSIONALITY REDUCTION.pdf
 
CS8080_IRT_UNIT - III T13 INVERTED INDEXES.pdf
CS8080_IRT_UNIT - III T13 INVERTED  INDEXES.pdfCS8080_IRT_UNIT - III T13 INVERTED  INDEXES.pdf
CS8080_IRT_UNIT - III T13 INVERTED INDEXES.pdf
 
CS8080_IRT_UNIT - III T14 SEQUENTIAL SEARCHING.pdf
CS8080_IRT_UNIT - III T14 SEQUENTIAL SEARCHING.pdfCS8080_IRT_UNIT - III T14 SEQUENTIAL SEARCHING.pdf
CS8080_IRT_UNIT - III T14 SEQUENTIAL SEARCHING.pdf
 
CS8080_IRT_UNIT - III T15 MULTI-DIMENSIONAL INDEXING.pdf
CS8080_IRT_UNIT - III T15 MULTI-DIMENSIONAL INDEXING.pdfCS8080_IRT_UNIT - III T15 MULTI-DIMENSIONAL INDEXING.pdf
CS8080_IRT_UNIT - III T15 MULTI-DIMENSIONAL INDEXING.pdf
 
IRJET- Stabilization of Black Cotton Soil using Rice Husk Ash and Lime
IRJET- Stabilization of Black Cotton Soil using Rice Husk Ash and LimeIRJET- Stabilization of Black Cotton Soil using Rice Husk Ash and Lime
IRJET- Stabilization of Black Cotton Soil using Rice Husk Ash and Lime
 
IRJET- Student Placement Prediction using Machine Learning
IRJET- Student Placement Prediction using Machine LearningIRJET- Student Placement Prediction using Machine Learning
IRJET- Student Placement Prediction using Machine Learning
 
Data Mining 101
Data Mining 101Data Mining 101
Data Mining 101
 
Adversarial Learning_Rupam Bhattacharya
Adversarial Learning_Rupam BhattacharyaAdversarial Learning_Rupam Bhattacharya
Adversarial Learning_Rupam Bhattacharya
 
CS 402 DATAMINING AND WAREHOUSING -MODULE 3
CS 402 DATAMINING AND WAREHOUSING -MODULE 3CS 402 DATAMINING AND WAREHOUSING -MODULE 3
CS 402 DATAMINING AND WAREHOUSING -MODULE 3
 
Lecture 1.pptx
Lecture 1.pptxLecture 1.pptx
Lecture 1.pptx
 
CS8080_IRT_UNIT - III T12 INDEXING AND SEARCHING.pdf
CS8080_IRT_UNIT - III T12 INDEXING AND SEARCHING.pdfCS8080_IRT_UNIT - III T12 INDEXING AND SEARCHING.pdf
CS8080_IRT_UNIT - III T12 INDEXING AND SEARCHING.pdf
 
Machine Learning for Aerospace Training
Machine Learning for Aerospace TrainingMachine Learning for Aerospace Training
Machine Learning for Aerospace Training
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for Everyone
 
CS8080_IRT_UNIT - III T11 ORGANIZING THE CLASSES.pdf
CS8080_IRT_UNIT - III T11 ORGANIZING THE CLASSES.pdfCS8080_IRT_UNIT - III T11 ORGANIZING THE CLASSES.pdf
CS8080_IRT_UNIT - III T11 ORGANIZING THE CLASSES.pdf
 
CS8080_IRT_UNIT - III T11 ORGANIZING THE CLASSES.pdf
CS8080_IRT_UNIT - III T11 ORGANIZING THE CLASSES.pdfCS8080_IRT_UNIT - III T11 ORGANIZING THE CLASSES.pdf
CS8080_IRT_UNIT - III T11 ORGANIZING THE CLASSES.pdf
 

More from AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING

More from AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING (13)

JAVA PROGRAM CONSTRUCTS OR LANGUAGE BASICS.pptx
JAVA PROGRAM CONSTRUCTS OR LANGUAGE BASICS.pptxJAVA PROGRAM CONSTRUCTS OR LANGUAGE BASICS.pptx
JAVA PROGRAM CONSTRUCTS OR LANGUAGE BASICS.pptx
 
INTRO TO PROGRAMMING.ppt
INTRO TO PROGRAMMING.pptINTRO TO PROGRAMMING.ppt
INTRO TO PROGRAMMING.ppt
 
CS3391 OOP UT-I T4 JAVA BUZZWORDS.pptx
CS3391 OOP UT-I T4 JAVA BUZZWORDS.pptxCS3391 OOP UT-I T4 JAVA BUZZWORDS.pptx
CS3391 OOP UT-I T4 JAVA BUZZWORDS.pptx
 
CS3391 OOP UT-I T1 OVERVIEW OF OOP
CS3391 OOP UT-I T1 OVERVIEW OF OOPCS3391 OOP UT-I T1 OVERVIEW OF OOP
CS3391 OOP UT-I T1 OVERVIEW OF OOP
 
CS3391 OOP UT-I T3 FEATURES OF OBJECT ORIENTED PROGRAMMING
CS3391 OOP UT-I T3 FEATURES OF OBJECT ORIENTED PROGRAMMINGCS3391 OOP UT-I T3 FEATURES OF OBJECT ORIENTED PROGRAMMING
CS3391 OOP UT-I T3 FEATURES OF OBJECT ORIENTED PROGRAMMING
 
CS3391 OOP UT-I T2 OBJECT ORIENTED PROGRAMMING PARADIGM.pptx
CS3391 OOP UT-I T2 OBJECT ORIENTED PROGRAMMING PARADIGM.pptxCS3391 OOP UT-I T2 OBJECT ORIENTED PROGRAMMING PARADIGM.pptx
CS3391 OOP UT-I T2 OBJECT ORIENTED PROGRAMMING PARADIGM.pptx
 
CS3391 -OOP -UNIT – V NOTES FINAL.pdf
CS3391 -OOP -UNIT – V NOTES FINAL.pdfCS3391 -OOP -UNIT – V NOTES FINAL.pdf
CS3391 -OOP -UNIT – V NOTES FINAL.pdf
 
CS3391 -OOP -UNIT – IV NOTES FINAL.pdf
CS3391 -OOP -UNIT – IV NOTES FINAL.pdfCS3391 -OOP -UNIT – IV NOTES FINAL.pdf
CS3391 -OOP -UNIT – IV NOTES FINAL.pdf
 
CS3391 -OOP -UNIT – III NOTES FINAL.pdf
CS3391 -OOP -UNIT – III  NOTES FINAL.pdfCS3391 -OOP -UNIT – III  NOTES FINAL.pdf
CS3391 -OOP -UNIT – III NOTES FINAL.pdf
 
CS3391 -OOP -UNIT – II NOTES FINAL.pdf
CS3391 -OOP -UNIT – II  NOTES FINAL.pdfCS3391 -OOP -UNIT – II  NOTES FINAL.pdf
CS3391 -OOP -UNIT – II NOTES FINAL.pdf
 
CS3391 -OOP -UNIT – I NOTES FINAL.pdf
CS3391 -OOP -UNIT – I  NOTES FINAL.pdfCS3391 -OOP -UNIT – I  NOTES FINAL.pdf
CS3391 -OOP -UNIT – I NOTES FINAL.pdf
 
CS3251-_PIC
CS3251-_PICCS3251-_PIC
CS3251-_PIC
 
CS8080 IRT UNIT I NOTES.pdf
CS8080 IRT UNIT I  NOTES.pdfCS8080 IRT UNIT I  NOTES.pdf
CS8080 IRT UNIT I NOTES.pdf
 

Recently uploaded

Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 

Recently uploaded (20)

Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 

CS8080_IRT_UNIT - III T3 NAIVE TEXT CLASSIFICATION.pdf

  • 1. P1WU UNIT – III: CLASSIFICATION Topic 3: NAÏVE TEXT CLASSIFICATION AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 2. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 3. NAÏVE TEXT CLASSIFICATION AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 4. INTRODUCTION TO NAÏVE TEXT CLASSIFICATION • Naive Bayes classifiers are a collection of classification algorithms based on Bayes Theorem. • It is not a single algorithm but a family of algorithms where all of them share a common principle, i.e. every pair of features being classified is independent of each other. • Naive Bayes classifiers have been heavily used for text classification and text analysis machine learning problems. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 5. INTRODUCTION TO NAÏVE TEXT CLASSIFICATION • Text Analysis is a major application field for machine learning algorithms. • However the raw data, • a sequence of symbols (i.e. strings) cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 6. The Naive Bayes algorithm • Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. • It is not a single algorithm but a family of algorithms where all of them share a common principle, • i.e. every pair of features being classified is independent of each other. • The dataset is divided into two parts, namely, feature matrix and the response/target vector. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 7. The Naive Bayes algorithm • The Feature matrix (X) contains all the vectors(rows) of the dataset in which each vector consists of the value of dependent features. The number of features is d i.e. X = (x1,x2,x2, xd). • The Response/target vector (y) contains the value of class/group variable for each row of feature matrix. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 8. The Bayes’ Theorem Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as follows: AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 9. The Bayes’ Theorem • where: • A and B are called events. • P(A | B) is the probability of event A, given the event B is true (has occured) • Event B is also termed as evidence. P(A) is the priori of A (the prior independent probability, i.e. probability of event before evidence is seen). • P(B | A) is the probability of B given event A, i.e. probability of event B after evidence A is seen. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 10. The Bayes’ Theorem • Summary AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 11. Dealing with text data • Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols (i.e. strings) cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 12. Dealing with text data • In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely: • tokenizing strings and giving an integer id for each possible token, for instance by using w ite-spaces and punctuation as token separators. • counting the occurrences of tokens in each document. • In this scheme, features and samples are defined as follows: • each individual token occurrence frequency is treated as a feature. • the vector of all the token frequencies for a given document is considered a multivariate sample. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 13. Example 1 : Using the Naive Bayesian Classifier • We will consider the following training set. • The data samples are described by attributes age, income, student, and credit. • The class label attribute, buy, tells whether the person buys a computer, has two distinct values, yes (class C1) and no (class C2). AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 14. Example 1 : Using the Naive Bayesian Classifier AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES RID Age Income student Credit Ci: buy 1 Youth High no Fair C2: no 2 Youth High no Excellent C2: no 3 middle-aged High no Fair C1: yes 4 Senior medium no Fair C1: yes 5 Senior Low yes Fair C1: yes 6 Senior Low yes Excellent C2: no 7 middle-aged Low yes Excellent C1: yes 8 Youth medium no Fair C2: no 9 Youth Low yes Fair C1: yes 10 Senior medium yes Fair C1: yes 11 Youth medium yes Excellent C1: yes 12 middle-aged medium no Excellent C1: yes 13 middle-aged High yes Fair C1: yes 14 Senior medium no Excellent C2: no
  • 15. Example 1 : Using the Naive Bayesian Classifier • The sample we wish to classify is • X = (age = youth, income = medium, student = yes, credit = fair) • We need to maximize P (X|Ci)P (Ci), for i = 1, 2. P (Ci), the a priori probability of each class, can be estimated based on the training samples: • P(buy =yes ) = 9 /14 • P(buy =no ) = 5 /14 AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 16. Example 1 : Using the Naive Bayesian Classifier • To compute P (X|Ci), for i = 1, 2, we compute the following conditional probabilities: • P(age=youth | buy =yes ) = 2/9 • P(income =medium | buy =yes ) = 4/9 • P(student =yes | buy =yes ) = 6/9 • P(credit =fair | buy =yes ) = 6/9 • P(age=youth | buy =no ) = 3/5 • P(income =medium | buy =no ) = 2/5 • P(student =yes | buy =no ) = 1/5 • P(credit =fair | buy =no ) = 2/5 AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 17. Example 1 : Using the Naive Bayesian Classifier • Using the above probabilities, we obtain AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 18. Example 1 : Using the Naive Bayesian Classifier • Similarly AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES To find the class that maximizes P (X|Ci)P (Ci), we compute Thus the naive Bayesian classifier predicts buy = yes for sample X
  • 19. Example 2: Predicting a class label using naïve Bayesian classification • Predicting a class label using naïve Bayesian classification. • The training data set is given below: • The data tuples are described by the attributes Owns Home?, Married, Gender and Employed. • The class label attribute Risk Class has three distinct values. • Let C1 corresponds to the class A, and C2 corresponds to the class B and C3 corresponds to the class C. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 20. Example 1 : Using the Naive Bayesian Classifier • The tuple is to classify is, • X = (Owns Home = Yes, Married = No, Gender = Female, Employed = Yes) AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES Owns Home Married Gender Employed Risk Class Yes Yes Male Yes B No No Female Yes A Yes Yes Female Yes C Yes No Male No B No Yes Female Yes C No No Female Yes A No No Male No B Yes No Female Yes A No Yes Female Yes C Yes Yes Female Yes C
  • 21. Example 2: Predicting a class label using naïve Bayesian classification • Solution • There are 10 samples and three classes. • Risk class A = 3 Risk class B = 3 Risk class C = 4 • • The prior probabilities are obtained by dividing these frequencies by the total number in the training data, • P(A) = 3/10 = 0.3 P(B) = 3/10 = 0.3 P(C) = 4/10 = 0.4 AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 22. Example 2: Predicting a class label using naïve Bayesian classification • To compute P(X/Ci) =P {yes, no, female, yes}/Ci) for each of the classes, the conditional probabilities for each: • P(Owns Home = Yes/A) = 1/3 =0.33 • P(Married = No/A) = 3/3 =1 • P(Gender = Female/A) = 3/3 = 1 • P(Employed = Yes/A) = 3/3 = 1 • • P(Owns Home = Yes/B) = 2/3 =0.67 • P(Married = No/B) = 2/3 =0.67 • P(Gender = Female/B) = 0/3 = 0 • P(Employed = Yes/B) = 1/3 = 0.33 • • P(Owns Home = Yes/C) = 2/4 =0.5 • P(Married = No/C) = 0/4 =0 • P(Gender = Female/C) = 4/4 = 1 • P(Employed = Yes/C) = 4/4 = 1 AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 23. Example 2: Predicting a class label using naïve Bayesian classification • Using the above probabilities, we obtain • P(X/A)= P(Owns Home = Yes/A) X • P(Married = No/A) x • P(Gender = Female/A) X • P(Employed = Yes/A) = 0.33 x 1 x 1 x 1 = 0.33 • Similarly, P(X/B)= 0 , P(X/C) =0 • • To find the class, G, that maximizes, P(X/Ci)P(Ci), we compute, • P(X/A) P(A) = 0.33 X 0.3 = 0.099 • P(X/B) P(B) =0 X 0.3 = 0 • P(X/C) P(C) = 0 X 0.4 = 0.0 • Therefore x is assigned to class A AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 24. Advantages and Disadvantages • Advantages: a) Have the minimum error rate in comparison to all other classifiers. b) Easy to implement c) Good results obtained in most of the cases. d) They provide theoretical justification for other classifiers that do not explicitly use • Disadvantages: a) Lack of available probability data. b) Inaccuracies in the assumption. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 25. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES