Text classification

Text Classification and Naïve Bayes
An example of text classification
Definition of a machine learning problem
A refresher on probability
The Naive Bayes classifier
1

Different ways for classification
Human labor (people assign categories to every incoming
article)
Hand-crafted rules for automatic classification
 If article contains: stock, Dow, share, Nasdaq, etc.  Business
 If article contains: set, breakpoint, player, Federer, etc.  Tennis
Machine learning algorithms
3

What is Machine Learning?
4
Definition: A computer program is said to learn from
experience E when its performance P at a task T
improves with experience E.
Tom Mitchell, Machine Learning, 1997
Examples:
- Learning to recognize spoken words
- Learning to drive a vehicle
- Learning to play backgammon

Components of a ML System (1)
Experience (a set of examples that combines together
input and output for a task)
 Text categorization: document + category
 Speech recognition: spoken text + written text
Experience is referred to as Training Data. When training
data is available, we talk of Supervised Learning.
Performance metrics
 Error or accuracy in the Test Data
 Test Data are not present in the Training Data
 When there are few training data, methods like ‘leave-one-out’ or
‘ten-fold cross validation’ are used to measure error.
5

Components of a ML System (2)
Type of knowledge to be learned (known as the target
function, that will map between input and output)
Representation of the target function
 Decision trees
 Neural networks
 Linear functions
The learning algorithm
 C4.5 (learns decision trees)
 Gradient descent (learns a neural network)
 Linear programming (learns linear functions)
6
Task

Defining Text Classification
7
XdX∈d
},,,{ 21 Jccc =C
D cd,
C×∈Xcd,
C→X:γ
γ=Γ D)(
the document in the multi-dimensional space
a set of classes (categories, or labels)
the training set of labeled documents
Target function:
Learning algorithm:
=cd, “Beijing joins the World Trade Organization”, China
cd =)(γ =)(dγ China

Naïve Bayes Learning
8
∏≤≤∈∈
==
dnk
k
CcCc
MAP ctPcPdcPc
1
)|(ˆ)(ˆmaxarg)|(ˆmaxarg
cd =)(γ
Learning Algorithm: Naïve Bayes
Target Function:
)|()(maxarg)|(maxarg cdPcPdcPc
CcCc
MAP
∈∈
==
)(cP
)|( cdP
The generative process:
)|( dcP
a priori probability, of choosing a category
the cond. prob. of generating d, given the fixed c
a posteriori probability that c generated d

Visualizing probability
A is a random variable that denotes an uncertain event
 Example: A = “I’ll get an A+ in the final exam”
P(A) is “the fraction of possible worlds where A is true”
10
Worlds in
which A
is true
Slide: Andrew W. Moore
Worlds in which A is false
Event space of all possible
worlds. Its area is 1.
P(A) = Area of the blue
circle.

Axioms and Theorems of Probability
Axioms:
 0 <= P(A) <= 1
 P(True) = 1
 P(False) = 0
 P(A or B) = P(A) + P(B) – P(A and B)
Theorems:
 P(not A) = P(~A) = 1 – P(A)
 P(A) = P(A ^ B) + P(A ^ ~B)
11

Conditional Probability
P(A|B) = the probability of A being true, given that we
know that B is true
12
F
H
H = “I have a headache”
F = “Coming down with flu”
P(H) = 1/10
P(F) = 1/40
P(H/F) = 1/2
Slide: Andrew W. Moore
Headaches are rare and flu
even rarer, but if you got that flu,
there is a 50-50 chance you’ll
have a headache.

Back to the Naïve Bayes Classifier
14

Estimating parameters for the
target function
We are looking for the estimates and
16
)(ˆ cP )|(ˆ cdP
P(c) is the fraction of possible worlds where c is true.
N
N
cP c
=)(ˆ N – number of all documents
Nc – number of documents in class c
d is a vector in the space X
)|,,,()|( 2 ctttPcdP dni =
where each dimension is a term:
)()|()( BPBAPBAP =∧By using the chain rule: we have:
(P
),,...,(),,...,|()|,,,( 2212 cttPctttPctttP ddd nnni =
...=

Naïve assumptions of independence
1. All attribute values are independent of each other given
the class. (conditional independence assumption)
2. The conditional probabilities for a term are the same
independent of position in the document.
We assume the document is a “bag-of-words”.
17
∏≤≤
==
d
d
nk
kni ctPctttPcdP
1
2 )|()|,,,()|( 
∏≤≤∈∈
==
dnk
k
CcCc
MAP ctPcPdcPc
1
)|(ˆ)(ˆmaxarg)|(ˆmaxarg
Finally, we get the target function of Slide 8:

Again about estimation
18
For each term, t, we need to estimate P(t|c)
∑ ∈
=
Vt ct
ct
T
T
ctP
' '
)|(ˆ
Because an estimate will be 0 if a term does not appear with a class
in the training data, we need smoothing:
||)(
1
)1(
1
)|(ˆ
' '' ' VT
T
T
T
ctP
Vt ct
ct
Vt ct
ct
∑∑ ∈∈
+
+
=
+
+
=Laplace
Smoothing
|V| is the number of terms in the vocabulary
Tct is the count of term t in all documents of class c

An Example of classification with
Naïve Bayes
19

Example 13.1 (Part 1)
20
Training
set
docID c = China?
1 Chinese Beijing Chinese Yes
2 Chinese Chinese Shangai Yes
3 Chinese Macao Yes
4 Tokyo Japan Chinese No
Test set 5 Chinese Chinese Chinese Tokyo Japan ?
Two classes: “China”, “not China”
N = 4 4/3)(ˆ =cP 4/1)(ˆ =cP
V = {Beijing, Chinese, Japan, Macao, Tokyo}

Example 13.1 (Part 1)
21
Training
set
docID c = China?
1 Chinese Beijing Chinese Yes
2 Chinese Chinese Shangai Yes
3 Chinese Macao Yes
4 Tokyo Japan Chinese No
Test set 5 Chinese Chinese Chinese Tokyo Japan ?
7/3)68/()15()|Chinese(ˆ =++=cP
14/1)68/()10()|Japan(ˆ)|Tokyo(ˆ =++== cPcP
9/2)63/()11()|Chinese(ˆ =++=cP
9/2)63/()11()|Japan(ˆ)|Tokyo(ˆ =++== cPcP
Estimation Classification
∏≤≤
∝
dnk
k ctPcPdcP
1
)|()()|(
0001.09/29/2)9/2(4/1)|(
0003.014/114/1)7/3(4/3)|(
3
5
3
5
≈⋅⋅⋅∝
≈⋅⋅⋅∝
dcP
dcP

Summary: Miscellanious
Naïve Bayes is linear in the time is takes to scan the data
When we have many terms, the product of probabilities
with cause a floating point underflow, therefore:
For a large training set, the vocabulary is large. It is better
to select only a subset of terms. For that is used “feature
selection” (Section 13.5).
22
∑≤≤∈
+=
dnk
k
Cc
MAP ctPcPc
1
)|(log)(ˆ[logmaxarg

Text classification

More Related Content

What's hot

Viewers also liked

Similar to Text classification

More from Tony Nguyen

Recently uploaded

Text classification

Editor's Notes