1. Probabilistic Methods for Structured Document Classification at INEX'07
Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete, Alfonso E. Romero
Departamento de Ciencias de la Computación e Inteligencia Artificial
Universidad de Granada
{lci,jmfluna,jhg,aeromero}@decsai.ugr.es
INEX 2007, Dagstuhl (Germany) December 18, 2007
Abstract
We present the results of our participation in the
Document Mining track at INEX'07. We submitted several
runs to this track (classification task only). This is the first year
we have participated, with relatively good results.
Our aim
Find good XML-to-flat-document transformations...
...in order to apply probabilistic flat-text classifiers and
to improve the results of flat classifiers.
The models: Multinomial Naive Bayes
Extensively used in text classification (see the paper by
McCallum).
The posterior probability of a class is computed using
Bayes' rule:
$$p(c_i \mid d_j) = \frac{p(c_i)\, p(d_j \mid c_i)}{p(d_j)} \propto p(c_i)\, p(d_j \mid c_i) \qquad (1)$$
The probabilities $p(d_j \mid c_i)$ are assumed to follow a multinomial
distribution:
$$p(d_j \mid c_i) \propto \prod_{t_k \in d_j} p(t_k \mid c_i)^{n_{jk}} \qquad (2)$$
The needed values are estimated in the following way:
$p(t_k \mid c_i) = \frac{N_{ik} + 1}{N_i + M}$ and $p(c_i) = \frac{N_{i,doc}}{N_{doc}}$.
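As an illustration of equations (1) and (2) and the estimates above, here is a minimal sketch, not the authors' implementation. It assumes that N_ik is the number of occurrences of term t_k in the training documents of class c_i, N_i the total number of term occurrences in that class, M the vocabulary size, N_{i,doc} the number of training documents of class c_i, and N_doc the total number of training documents. The function names `train_multinomial_nb` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels):
    """Estimate log p(c_i) and log p(t_k | c_i) from tokenized documents.

    docs: list of token lists; labels: list of class names (same length).
    Symbol meanings (N_ik, N_i, M, ...) are assumptions stated in the lead-in.
    """
    vocab = {t for tokens in docs for t in tokens}
    term_counts = defaultdict(Counter)      # term_counts[c][t] plays the role of N_ik
    doc_counts = Counter(labels)            # doc_counts[c] plays the role of N_{i,doc}
    for tokens, c in zip(docs, labels):
        term_counts[c].update(tokens)

    m = len(vocab)                          # M
    n_doc = len(docs)                       # N_doc
    log_prior = {c: math.log(n / n_doc) for c, n in doc_counts.items()}
    log_cond = {}
    for c in doc_counts:
        n_i = sum(term_counts[c].values())  # N_i
        log_cond[c] = {
            t: math.log((term_counts[c][t] + 1) / (n_i + m)) for t in vocab
        }
    return log_prior, log_cond


def classify(tokens, log_prior, log_cond):
    """Return the class maximizing log p(c_i) + sum_k n_jk * log p(t_k | c_i)."""
    freqs = Counter(tokens)                 # n_jk for the document to classify
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for t, n_jk in freqs.items():
            if t in log_cond[c]:            # terms outside the vocabulary are skipped
                score += n_jk * log_cond[c][t]
        scores[c] = score
    return max(scores, key=scores.get), scores
```

Working in log space avoids numerical underflow when documents are long; the proportionality in (1) means p(d_j) never has to be computed.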
The OR Gate Bayesian Network Classiﬁer
Tries to model rules of the following kind:
IF (ti ∈ d) ∨ (tj ∈ d) ∨ ... THEN classify as c.
The canonical model used is a noisy OR gate, which is
related to the notion of causality.
The appearance of certain terms is the cause of the
assignment of a certain class.
The general expression of the probability distributions is:
$$p(c_i \mid pa(C_i)) = 1 - \prod_{T_k \in R(pa(C_i))} \bigl(1 - w(T_k, C_i)\bigr)$$
$$p(\bar{c}_i \mid pa(C_i)) = 1 - p(c_i \mid pa(C_i))$$
The OR Gate Bayesian Network Classiﬁer (2)
We instantiate all terms of a document:
$p(t_k \mid d_j) = 1$ if $t_k \in d_j$, and 0 otherwise.
We replicate each term node $T_k$ as many times as its frequency
$n_{jk}$ in the document.
The computation of the posterior probability is very easy
(it depends only on the terms appearing in the document):
$$p(c_i \mid d_j) = 1 - \prod_{T_k \in Pa(C_i) \cap d_j} \bigl(1 - w(T_k, C_i)\bigr)^{n_{jk}} \qquad (3)$$
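A minimal sketch of the posterior computation in (3) for a single class, assuming the weights w(T_k, C_i) are already available as a dictionary keyed by term. The function name `or_gate_posterior` and the dictionary layout are illustrative choices, not taken from the paper.

```python
from collections import Counter


def or_gate_posterior(tokens, weights):
    """Compute p(c_i | d_j) = 1 - prod_k (1 - w(T_k, C_i)) ** n_jk.

    tokens:  the terms of document d_j (with repetitions).
    weights: dict mapping each parent term of class C_i to w(T_k, C_i).
    Terms of the document that are not parents of the class are ignored,
    which corresponds to the intersection Pa(C_i) ∩ d_j in the product.
    """
    product = 1.0
    for term, n_jk in Counter(tokens).items():
        w = weights.get(term)
        if w is not None:
            product *= (1.0 - w) ** n_jk
    return 1.0 - product
```

A document would then be assigned to the class with the highest posterior, computing `or_gate_posterior` once per class with that class's weights.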
Two estimation schemes for the weights:
Maximum likelihood: $w(T_k, C_i) = \frac{N_{ik}}{N_k}$.
Better approximation: $w(T_k, C_i) = \frac{N_{ik}}{N_k} \times \prod_{h \neq k} \frac{(N_i - N_{ih})\, N}{(N - N_h)\, N_i}$.
See paper for details!
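The two weighting schemes can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: N_ik is taken to be the number of occurrences of term T_k in documents of class C_i, N_k its occurrences in the whole collection, N_i the total occurrences in class C_i, and N the total occurrences overall. The product over h ≠ k is taken here over the terms occurring in class C_i (used as the parent set); the slides do not state its range. All function names are hypothetical.

```python
from collections import Counter


def count_terms(docs, labels, target):
    """Return (n_ik, n_k): term counts in class `target` and in all documents."""
    n_ik, n_k = Counter(), Counter()
    for tokens, c in zip(docs, labels):
        n_k.update(tokens)
        if c == target:
            n_ik.update(tokens)
    return n_ik, n_k


def ml_weights(n_ik, n_k):
    """Maximum likelihood: w(T_k, C_i) = N_ik / N_k."""
    return {k: n_ik[k] / n_k[k] for k in n_ik}


def better_weights(n_ik, n_k):
    """w(T_k, C_i) = N_ik / N_k * prod_{h != k} (N_i - N_ih) N / ((N - N_h) N_i).

    Assumes N_i > 0 and N > N_h for every term h (otherwise a division by
    zero occurs). The double loop is quadratic in the number of terms, which
    is fine for a sketch but not for a large vocabulary.
    """
    n_i = sum(n_ik.values())   # N_i
    n = sum(n_k.values())      # N
    factor = {
        h: (n_i - n_ik[h]) * n / ((n - n_k[h]) * n_i) for h in n_ik
    }
    weights = {}
    for k in n_ik:
        product = 1.0
        for h in n_ik:
            if h != k:
                product *= factor[h]
        weights[k] = n_ik[k] / n_k[k] * product
    return weights
```

Note that nothing in the second formula, as written on the slide, bounds the result to [0, 1] on very small collections; how the paper handles such cases is not covered here.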
Document representation: example XML ﬁle
<book>
<title>El ingenioso hidalgo Don Quijote
de la Mancha</title>
<author>Miguel de Cervantes Saavedra</author>
<contents>
<chapter>Uno</chapter>
<text>En un lugar de La Mancha de cuyo
nombre no quiero
acordarme...</text>
</contents>
</book>
Document representation
1 Only text (removing all tags).
Quijote
El ingenioso hidalgo Don Quijote de la Mancha Miguel de
Cervantes Saavedra Uno En un lugar de La Mancha de cuyo
nombre no quiero acordarme...
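The "only text" transformation can be sketched with Python's standard `xml.etree.ElementTree` module. This is a minimal illustration on the example file, not the tool used for the submitted runs; the function name `only_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<book>
<title>El ingenioso hidalgo Don Quijote
de la Mancha</title>
<author>Miguel de Cervantes Saavedra</author>
<contents>
<chapter>Uno</chapter>
<text>En un lugar de La Mancha de cuyo
nombre no quiero
acordarme...</text>
</contents>
</book>"""


def only_text(xml_string):
    """Drop all tags and return the text content with whitespace normalized."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text fragment in document order; joining with a
    # space keeps words from adjacent elements apart, and split() collapses
    # the line breaks and repeated spaces.
    return " ".join(" ".join(root.itertext()).split())


if __name__ == "__main__":
    print(only_text(SAMPLE_XML))
```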
Document representation
5 Text replication (the frequency of each term is multiplied by a
replication value that depends on the tag containing the term).
Replication values
title: 1, author: 0, chapter: 0, text: 2
Quijote
El ingenioso hidalgo Don Quijote de la Mancha En En un un
lugar lugar de de La La Mancha Mancha de de cuyo cuyo
nombre nombre no no quiero quiero acordarme acordarme...
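A minimal sketch of text replication on the example file, using the replication values shown above. It assumes that a tag missing from the table (here `book` and `contents`) gets a factor of 1, and that text following a child element is weighted by the parent's tag; neither detail is stated on the slide. The function name `replicate_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<book>
<title>El ingenioso hidalgo Don Quijote
de la Mancha</title>
<author>Miguel de Cervantes Saavedra</author>
<contents>
<chapter>Uno</chapter>
<text>En un lugar de La Mancha de cuyo
nombre no quiero
acordarme...</text>
</contents>
</book>"""


def replicate_text(xml_string, replication, default=1):
    """Flatten an XML document, repeating each token according to its tag.

    replication: dict mapping a tag name to its replication value
                 (0 removes the text of that tag).
    default:     factor for tags absent from the table (an assumption).
    """
    out = []

    def emit(text, factor):
        for token in (text or "").split():
            out.extend([token] * factor)

    def walk(elem):
        factor = replication.get(elem.tag, default)
        emit(elem.text, factor)
        for child in elem:
            walk(child)
            # Text after a child's closing tag belongs to the parent element.
            emit(child.tail, factor)

    walk(ET.fromstring(xml_string))
    return " ".join(out)


if __name__ == "__main__":
    values = {"title": 1, "author": 0, "chapter": 0, "text": 2}
    print(replicate_text(SAMPLE_XML, values))
```

Because tokens are split on whitespace only, trailing punctuation stays attached to its word, so the last token of the `text` element is repeated together with its ellipsis.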
Results
Five runs were submitted to the track:
(1) Naive Bayes, only text, no term selection. Microaverage:
0.77630. Macroaverage: 0.58536.
(2) Naive Bayes, replication (id=2), no term selection.
Microaverage: 0.78107 (+0.6%). Macroaverage: 0.6373
(+8.9%).
(3) OR gate, maximum likelihood, replication (id=8), selection
by MI. Microaverage: 0.75097. Macroaverage: 0.61973.
(4) OR gate, maximum likelihood, replication (id=5), selection
by MI. Microaverage: 0.75354. Macroaverage: 0.61298.
(5) OR gate, better approximation, only text, ≥ 2.
Microaverage: 0.78998. Macroaverage: 0.76054.
See paper for text replication values (id=2,5,8).
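To make the difference between the two averages concrete, here is a small sketch. The slides do not say which measure is averaged; this example assumes single-label classification and per-class recall, and the function name `micro_macro_recall` is hypothetical.

```python
from collections import Counter


def micro_macro_recall(true_labels, predicted_labels):
    """Return (microaverage, macroaverage) of per-class recall.

    Micro: pool all documents, so large classes dominate.
    Macro: average the per-class recalls, so every class counts equally.
    """
    total = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    micro = sum(correct.values()) / len(true_labels)
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro
```

For example, with eight documents of a large class all classified correctly and two documents of a small class of which one is misclassified, the microaverage is 0.9 while the macroaverage is 0.75: errors on sparsely populated classes weigh much more on the macroaverage.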
Conclusions
These conclusions were also reached in our previous experiments:
Naive Bayes performs poorly on sparsely populated categories (low
macroaverage).
Tagging and adding do not seem to work well without feature
selection.
Text replication improves the macroaverage (good for Naive
Bayes!).
The OR gate with ML weights needs a per-class feature selection
method (mutual information).
The better approximation for the OR gate was our best
classifier in previous experiments over flat text.
Thank you very much!
Questions or comments?