1. PhD DissertationDocument Classiﬁcation Models TC Notation Representationbased on Bayesian networks Particularities EvaluationAlfonso E. Romero Bayesian networks Deﬁnition Storage problems Canonical modelsApril 27, 2010 OR Gate Models Experiments Thesaurus Deﬁnitions Unsupervised model Supervised model Experiments Structured Structured documents Advisors: Luis M. de Campos, and Transformations Experiments Juan M. Fernández-Luna Link-based categorization Multiclass model Department of Computer Science and A.I. Experiments Multilabel model University of Granada Experiments Remarks PhD Dissertation.1
2. A brief overview: the problem I We shall solve problems in document categorization... TC Notation Representation Particularities Evaluation New HSL New HSL Madrid-Valencia Madrid-Valencia Bayesian networks to open in December, 2010 ? to open in December, 2010 Deﬁnition Storage problems Canonical models National OR Gate Models Experiments Thesaurus Deﬁnitions Unsupervised model Supervised model Experiments Structured Structured documents Sports Society World National Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.2
3. A brief overview: the problem II ... in particular, automatic document categorization... TC Notation Representation Particularities Evaluation New HSL New HSL Madrid-Valencia Madrid-Valencia Bayesian networks to open in December, 2010 ? to open in December, 2010 Deﬁnition Storage problems Canonical models National OR Gate Models Experiments Thesaurus Deﬁnitions Unsupervised model Supervised model Experiments Structured Structured documents Sports Society World National Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.3
4. A brief overview: the problem III ... concretely, supervised document categorization. TC Notation Representation Particularities The Alhambra of Learning procedure Evaluation Granada, most visited National Bayesian networks monument in 2009 Deﬁnition Storage problems Canonical models Prime minister OR Gate Zapatero to give a Algorithm (classiﬁer) Models speech on the National Experiments Parliament Thesaurus Deﬁnitions Unsupervised model AVE train to run at Supervised model 350 km/h in 2010 Experiments from Madrid to National Structured Barcelona Structured documents Transformations Experiments Labeled corpus Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.4
5. A brief overview: the methods I Which learning method shall we use? TC Notation Representation Neural networks Particularities Evaluation Support Vector Machines Bayesian networks Deﬁnition Storage problems k-NN�methods Canonical models OR Gate Bayesian networks Models Experiments and probabilistic methods Thesaurus Deﬁnitions Unsupervised model Decision trees Supervised model Experiments Structured Evolutive algorithms Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.5
6. A brief overview: the methods II The answer: TC Notation Representation Neural networks Particularities Evaluation Support Vector Machines Bayesian networks Deﬁnition Storage problems k-NN�methods Canonical models OR Gate Bayesian networks Models Experiments and probabilistic methods Thesaurus Deﬁnitions Unsupervised model Decision trees Supervised model Experiments Evolutive algorithms Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.6
7. A brief overview: the methods III TC But, why? Notation Representation Particularities Evaluation Bayesian networks • Strong theoretical foundation (probability theory). Deﬁnition Storage problems Canonical models OR Gate • Models for (uncertain) knowledge representation. Models Experiments Thesaurus Deﬁnitions • Great success in related tasks (IR). Unsupervised model Supervised model Experiments Structured • Our background at the group UTAI. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.7
8. Outline 1 Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Deﬁnition Storage problems 3 An OR Gate-Based Text Classiﬁer Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Deﬁnitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.8
9. Outline ⇒ Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Deﬁnition Storage problems 3 An OR Gate-Based Text Classiﬁer Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Deﬁnitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.9
10. Supervised Text Categorization I Provided TC Notation 1 Set of labeled documents DTr (training). Representation Particularities 2 C, set of categories/labels. Evaluation Bayesian networks The goal is to build a model f (classiﬁer) capable of Deﬁnition Storage problems predicting categories (of C) of documents in D. Canonical models OR Gate Models Different kinds of labeling: Experiments Thesaurus • f : D → {c, c} (binary). Deﬁnitions Unsupervised model • f : D → {c1 , c2 , . . . , cn } (multiclass). Supervised model Experiments • f : D × C → {0, 1} (multilabel). Structured Structured documents Transformations A multilabel problem reduces to |C| binary problems Experiments Link-based categorization C = {c, c}. We often change the codomain from {0, 1} Multiclass model Experiments (hard classiﬁcation) to [0, 1] (soft classiﬁcation). Multilabel model Experiments Remarks PhD Dissertation.10
11. Supervised Text Categorization IIDocument representation: As in Information Retrieval TC Stopwords removal + stemming + Vector Notation representation (Frequency, binary or tf-idf). Representation Particularities Evaluation From (preprocessed) document to vector Bayesian networks Deﬁnition Storage problems Canonical models term ⇔ dimension OR Gate Models Experiments Thesaurus Deﬁnitions Unsupervised model Example (beginning of John Miltons "Lost Paradise"): Supervised model Experiments Of Mans First Disobedience, and the Fruit Of Mans First Disobedience, and the Fruit Structured Of that Forbidden Tree, whose mortal tast Of that Forbidden Tree, whose mortal tast Structured documents Brought Death into the World, and all our woe, Brought Death into the World, and all our woe, With loss of Eden, till one greater Man... With loss of Eden, till one greater Man... Transformations Experiments Link-based categorization Multiclass model 2 1 1 1 1 1 1 1 1 1 1 1 1 1 Experiments man obedience fruit forbid tree mortal tast bring death world woe loss eden great Multilabel model Experiments Remarks PhD Dissertation.11
12. Supervised Text Categorization IIIWhat are the particularities of the problem? It differs from a “classic” Machine Learning problem in: TC Notation Representation Particularities Evaluation • High dimensionality (easily > 10000). Bayesian networks Deﬁnition Storage problems Canonical models • Very unbalanced datasets. OR Gate Models Experiments • |C| 0. Thesaurus Deﬁnitions Unsupervised model Supervised model Experiments • Sometimes, there is a hierarchy in the set C. Structured Structured documents Transformations Experiments • Sometimes, explicit relationships among Link-based categorization Multiclass model documents are given. Experiments Multilabel model Experiments Remarks PhD Dissertation.12
13. Supervised Text Categorization IVEvaluation How to measure the correctness of documents TC assigned to set of categories? Notation Representation • Binary/multiclass: Particularities Evaluation TP Bayesian networks • Hard categorization Precision: TP+FP and Recall: Deﬁnition Storage problems TP 2PR TP+FN , F1 : P+R . Canonical models OR Gate • Soft categorization Precision/Recall BEP. Models Experiments Thesaurus Deﬁnitions • Multilabel: micro and macro averages. Unsupervised model Supervised model Experiments Structured • Also, average precision on the 11 std. recall points Structured documents Transformations (category ranking). Experiments Link-based categorization Multiclass model Experiments Multilabel model • Standard corpora: Reuters, Ohsumed, 20 NG. . . Experiments Remarks PhD Dissertation.13
14. Outline 1 Text Categorization TC Notation Representation ⇒ Bayesian networks Particularities Evaluation Bayesian networks Deﬁnition Storage problems 3 An OR Gate-Based Text Classiﬁer Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Deﬁnitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.14
15. Bayesian networks IDeﬁnition and characteristics A set of random variables X1 , . . . , XN in a DAG, verifying P(X1 , . . . , Xn ) = n P(Xi |Pa(Xi )) ⇒ the graph i=1 TC Notation represents independences. Representation Particularities Evaluation Causal interpretation. Bayesian networks Deﬁnition Storage problems Canonical models Learning and inference methods available. OR Gate Models Experiments Thesaurus Deﬁnitions Unsupervised model Supervised model Experiments Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.15
16. Bayesian networks IIIEstimation and storage problems TC The problem Notation Representation Particularities • One value for each conﬁguration of the parents. Evaluation Bayesian networks • General case: exponential number of parameters Deﬁnition Storage problems on the number of parents. Canonical models OR Gate Models Experiments The solution Thesaurus Deﬁnitions 1 Few parents per node (not realistic in text). Unsupervised model Supervised model Experiments 2 Write the probability of a node as a deterministic Structured function of the conﬁguration (canonical model). Set Structured documents Transformations of parameters with linear size on the number of Experiments Link-based categorization parents. Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.16
17. Bayesian networks: canonical modelsComponents and examples of canonical models • X = {Xi }: parents (causes), Y : child (effect). TC Notation Representation • Xi in {xi , xi }, Yi in {yi , yi } (occurrence or not). Particularities Evaluation Bayesian networks 1 Noisy-OR gate model: Deﬁnition p(y |X) = 1 − Xi ∈R(x) (1 − wOR (Xi , Y )). Storage problems Canonical models 2 Additive model: p(y |X) = Xi ∈R(x) wadd (Xi , Y ). OR Gate Models Experiments Thesaurus ... Deﬁnitions X1 X2 XL Unsupervised model Supervised model Experiments Structured Structured documents Transformations Experiments Y Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.17
18. Outline 1 Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Deﬁnition ⇒ An OR Gate-Based Text Classiﬁer Storage problems Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Deﬁnitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.18
19. An OR Gate-Based Text ClassiﬁerWhy using OR gates? • The OR gate is a simple model (and fast for TC Notation inference). Representation Particularities • It has gained great success in knowledge Evaluation Bayesian networks representation. Deﬁnition Storage problems • Discriminative classiﬁer (models directly p(ci |dj )) Canonical models (NB generative). OR Gate Models • Seems natural the term are causes and category Experiments Thesaurus the effect. Deﬁnitions Unsupervised model Supervised model Experiments C C Structured Structured documents Transformations Experiments Link-based categorization T1 T2 ... Tn T1 T2 ... Tn Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.19
20. An OR Gate-Based Text ClassiﬁerThe model Equations for the OR gate classiﬁer TC Notation Probability distributions: Representation Particularities Evaluation pi (ci |pa(Ci )) = 1 − (1 − w(Tk , Ci )) , Bayesian networks Deﬁnition Tk ∈R(pa(Ci )) Storage problems Canonical models pi (c i |pa(Ci )) = 1 − pi (ci |pa(Ci )). OR Gate Models Experiments Inference: Thesaurus Deﬁnitions Unsupervised model pi (ci |dj ) = 1 − 1 − w(Tk , Ci ) p(tk |dj ) Supervised model Experiments Tk ∈Pa(Ci ) Structured Structured documents = 1− (1 − w(Tk , Ci )) . Transformations Experiments Tk ∈Pa(Ci )∩dj Link-based categorization Multiclass model Experiments Multilabel model Experiments Model characterized by the w(Tk , Ci ) formula. Remarks PhD Dissertation.20
21. An OR Gate-Based Text ClassiﬁerWeight estimation formulae Weight w(Tk , Ci ) means TC Notation Representation ˆ pi (ci |tk , t h ∀Th ∈ Pa(Ci ), Th = Tk ). Particularities Evaluation Bayesian networks ˆ Nik +1 1 ML: w(Tk , Ci ) as p(ci |tk ), w(Tk , Ci ) = N•k +2 (using Deﬁnition Storage problems Laplace). Canonical models OR Gate Models Experiments 2 TI: assuming term independence, given the Thesaurus (Ni• −N category, w(Tk , Ci ) = ntNik•k h=k (N−N•hih )N . Deﬁnitions Unsupervised model iN )Ni• Supervised model Experiments Structured Notation: N number of words in the training documents. Structured documents Transformations N•k times that term tk appears in training documents Experiments Link-based categorization (N•k = ci Nik ), Ni• is the total number of words in Multiclass model Experiments documents of class ci (Ni• = tk Nik ), M vocabulary size. Multilabel model Experiments Remarks PhD Dissertation.21
22. An OR Gate-Based Text ClassiﬁerPruning independent terms TC Notation Representation Particularities • The model can be improved pruning terms which Evaluation are independent with the category. Bayesian networks Deﬁnition Storage problems Canonical models • We run an independence test for each pair OR Gate Models term/category, at a certain conﬁdence level. Only Experiments terms which pass it are kept. Thesaurus Deﬁnitions Unsupervised model Supervised model Experiments • The size of the parent set is highly reduced, but Structured classiﬁcation is often improved. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.22
23. An OR Gate-Based Text ClassiﬁerExperiments I: Experimental setting TC Notation Representation Particularities • We compare a Multinomial NB, a Rocchio, OR TI, Evaluation OR ML, and both OR with pruning at levels Bayesian networks Deﬁnition {0.9, 0.99, 0.999}. Storage problems Canonical models OR Gate Models • We made experiments on Ohsumed-23, Reuters Experiments and 20 NG. Thesaurus Deﬁnitions Unsupervised model Supervised model Experiments • Soft categorization, evaluated with macro/micro Structured BEP, 11 avg std prec, and macro/micro F1 @{1, 3, 5}. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.23
24. An OR Gate-Based Text ClassiﬁerExperiments II: Experimental setting TC Notation Representation Results (# of wins on 9 measures): Particularities Evaluation Bayesian networks Deﬁnition • Reuters: OR-TI-0.999 (7), OR-TI (1), OR-ML-0.999 Storage problems Canonical models (1). OR Gate Models Experiments Thesaurus • Ohsumed: OR-TI-0.999 (8), OR-ML-0.99 (1). Deﬁnitions Unsupervised model Supervised model Experiments • 20 NG: OR-TI (2), OR-ML (2), OR-TI-0.999 (1), NB Structured (4). Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.24
25. An OR Gate-Based Text Classiﬁer Conclusions TC Notation Representation • A new text categorization model, based on noisy OR Particularities Evaluation gates. Bayesian networks Deﬁnition • Simple and computationally affordable. Storage problems Canonical models • Results improves Naïve Bayes noticeably. OR Gate Models Experiments Thesaurus Future work Deﬁnitions Unsupervised model Supervised model • Use more advanced noisy OR models (leaky). Experiments Structured • Use another canonical models. Structured documents Transformations • Explore another alternatives for term pruning. Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.25
26. Outline 1 Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Deﬁnition Storage problems 3 An OR Gate-Based Text Classiﬁer Canonical models OR Gate Models Experiments ⇒ Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Deﬁnitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.26
27. Automatic Indexing From a Thesaurus Using BayesianNetworks: Thesauri IDeﬁnitions TC A thesaurus Notation Representation A list of terms representing concepts, grouped those with Particularities Evaluation the same meaning, with hierarchical relationships among Bayesian networks them. Deﬁnition Storage problems Canonical models ND:psychiatric OR Gate ND:outpatient ND:clinic ND:hospital hospital ND:health ND:dispensary Models clinic care centre Experiments Thesaurus Deﬁnitions D:psychiatric D:medical D:medical ND:medical Unsupervised model institution institution centre service Supervised model Experiments Structured ND:health ND:health D:health Structured documents protection service Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model D:health Experiments policy Remarks PhD Dissertation.27
28. Automatic Indexing From a Thesaurus Using BayesianNetworks: Thesauri IIIndexing with a thesauri TC • Problem: associating descriptors (keywords) to Notation Representation documents (scientiﬁc, medical, legal,. . . ). Particularities Evaluation • Manually: expensive and time consuming work (due Bayesian networks to thousand of descriptors! [EUROVOC > 6000]). Deﬁnition Storage problems Besides: Canonical models OR Gate • How many descriptors should we assign? Models • Which descriptor should assign in the hierarchy? Experiments Thesaurus Deﬁnitions We propose an automatic solution Unsupervised model Supervised model Experiments 1 TC problem (categories ≡ descriptors). Structured Structured documents 2 Makes use of the meta-information of the thesaurus. Transformations Experiments Link-based categorization 3 Unsupervised and supervised case. Multiclass model Experiments 4 Based on BNs. Multilabel model Experiments Remarks PhD Dissertation.28
29. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case IFrom the thesaurus... TC Notation Representation Particularities Evaluation ND:psychiatric ND:outpatient ND:clinic ND:hospital hospital ND:health ND:dispensary clinic care centre Bayesian networks Deﬁnition Storage problems D:psychiatric D:medical D:medical ND:medical centre Canonical models institution institution service OR Gate Models ND:health ND:health D:health protection service Experiments Thesaurus Deﬁnitions Unsupervised model D:health Supervised model policy Experiments Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.29
30. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case II... to the Bayesian network. TC policy protection service health psychiatric outpatient hospital clinic institution medical care centre dispensary Notation Representation Particularities Evaluation ND:psychiatric ND:outpatient ND:clinic ND:hospital hospital ND:health ND:dispensary clinic care centre Bayesian networks Deﬁnition Storage problems D:psychiatric D:medical D:medical ND:medical Canonical models institution institution centre service OR Gate Models ND:health ND:health D:health protection service Experiments Thesaurus Deﬁnitions Unsupervised model D:health Supervised model policy Experiments Structured Structured documents Transformations • Binary variables (relevant/not relevant), for each Experiments Link-based categorization term, descriptor and non descriptor. Multiclass model Experiments • Problem: information of different nature mixed in Multilabel model Experiments descriptor nodes! Remarks PhD Dissertation.30
31. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case IIIIntroducing concept and virtual nodes TC protection policy health service care dispensary centre medical outpatient clinic hospital institution psychiatric Notation Representation Particularities ND:health ND:health D:health ND:medical D:health ND:health ND:dispensary D:medical D:medical ND:outpatient ND:clinic ND:hospital D:psychiatric ND:psychiatric protection policy service service care centre centre institution clinic institution hospital Evaluation Bayesian networks E:health E:health E:medical E:medical E:psychiatric Deﬁnition policy service centre institution institution Storage problems Canonical models C:medical C:medical C:psychiatric centre institution institution OR Gate Models Experiments H:health service C:health Thesaurus service Deﬁnitions Unsupervised model H:health policy Supervised model Experiments C:health policy Structured Structured documents Transformations Experiments • New concept nodes, C. Link-based categorization Multiclass model • New Hierarchy nodes, HC . Experiments Multilabel model Experiments • New Equivalence nodes, EC . Remarks PhD Dissertation.31
32. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case IVProbability distributions We already have the structure, but we have to specify TC Notation the distributions on each node. Representation Particularities Many parents ⇒ canonical models. Evaluation Bayesian networks • D and ND nodes (SUM): Deﬁnition Storage problems p(d + |pa(D)) = T ∈R(pa(D)) w(T , D). Canonical models OR Gate • HC nodes (SUM): Models + Experiments p(hc |pa(HC )) = C ∈R(pa(HC )) w(C , HC ). Thesaurus • EC nodes (OR): Deﬁnitions Unsupervised model + p(ec |pa(EC )) = 1 − D∈R(pa(EC )) (1 − w(D, C)). Supervised model Experiments • C nodes (OR): Structured Structured documents Transformations + + 8 > 1 − (1 − w(EC , C))(1 − w(HC , C)) > if ec = ec , hc = hc Experiments > < w(E , C) + − Link-based categorization + C if ec = ec , hc = hc p(c |{ec , hc }) = − + Multiclass model > w(HC , C) > if ec = ec , hc = hc Experiments > − − 0 if ec = ec , hc = hc : Multilabel model Experiments Remarks PhD Dissertation.32
33. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case VInference TC Classiﬁcation is seen as inference. Notation Representation Particularities • Given a document d, we set the term variables Evaluation T ∈ d to t + , t − otherwise. Bayesian networks Deﬁnition Storage problems Canonical models • Exact propagation is carried out: OR Gate Models • First, to descriptor nodes. Experiments Thesaurus Deﬁnitions • Then, to EC nodes. Unsupervised model Supervised model Experiments • Following a topological order, probabilities Structured + p(hc |pa(HC )) are computed after their parents Structured documents Transformations + p(c |pa(C)) are set. Experiments Link-based categorization Multiclass model Experiments • Final p(c + |pa(C)), ∀C values are returned. Multilabel model Experiments Remarks PhD Dissertation.33
34. Automatic Indexing From a Thesaurus Using BayesianNetworks: Supervised model IChanges from the unsupervised case TC Notation Representation Particularities Evaluation • Concept also receives information from labeled Bayesian networks Deﬁnition documents in the training set. Storage problems Canonical models OR Gate Models • We add a training node TC as new parent of the Experiments concept one. The node is an OR gate ML classiﬁer. Thesaurus Deﬁnitions Unsupervised model Supervised model Experiments • Distributions of C nodes are modiﬁed Structured consequently. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.34
35. Automatic Indexing From a Thesaurus Using BayesianNetworks: Supervised model IIGraphically: unsupervised TC Notation Representation Particularities Evaluation protection policy health service care dispensary centre medical outpatient clinic hospital institution psychiatric Bayesian networks Deﬁnition ND:health ND:health D:health ND:medical D:health ND:health ND:dispensary D:medical D:medical ND:outpatient ND:clinic ND:hospital D:psychiatric ND:psychiatric protection policy service service care centre centre institution clinic institution hospital Storage problems Canonical models E:health policy E:health service E:medical centre E:medical institution E:psychiatric OR Gate institution Models Experiments C:medical C:medical C:psychiatric centre institution institution Thesaurus Deﬁnitions H:health Unsupervised model service Supervised model C:health service Experiments H:health Structured policy Structured documents C:health Transformations policy Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.35
36. Automatic Indexing From a Thesaurus Using BayesianNetworks: Supervised model IIGraphically: supervised TC Notation Representation Particularities Evaluation protection policy health service care dispensary centre medical outpatient clinic hospital institution psychiatric Bayesian networks Deﬁnition ND:health ND:health D:health ND:medical D:health ND:health ND:dispensary D:medical D:medical ND:outpatient ND:clinic ND:hospital D:psychiatric ND:psychiatric protection policy service service care centre centre institution clinic institution hospital Storage problems Canonical models T:health policy E:health policy T:health service E:health service T:medical centre E:medical centre T:medical institution E:medical institution T:psychiatric E:psychiatric OR Gate institution institution Models Experiments C:medical C:medical C:psychiatric centre institution institution Thesaurus Deﬁnitions H:health Unsupervised model service Supervised model C:health service Experiments H:health Structured policy Structured documents C:health Transformations policy Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.36
37. Automatic Indexing From a Thesaurus Using BayesianNetworks: Experiments IDescription of the collection TC Notation • Database from the Andalusian Parliament at Spain, Representation Particularities containing 7933 parliamentary resolutions. Evaluation Bayesian networks • Classiﬁed on the Eurovoc thesaurus (5080 Deﬁnition Storage problems categories). Canonical models OR Gate • From 1 to 14 descriptors assigned (average 3.8). Models Experiments • Each initiative 1 to 3 lines of text. Thesaurus Deﬁnitions Unsupervised model Experimentation Supervised model Experiments Structured 1 Unsupervised: our model Vs. VSM and HVSM. Structured documents Transformations 2 Supervised: our model Vs. SVM, Rocchio and NB. Experiments Link-based categorization Micro-macro BEP, F1@5, AV Prec. Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.37
40. Automatic Indexing From a Thesaurus Using BayesianNetworks: Experiments IVSupervised experiments: Micro Recall for incremental number ofcategories TC Notation Representation Naive Bayes Particularities Standalone OR-Gate Evaluation 0.9 Rocchio SBN (conf 1) SBN (conf 2) Bayesian networks SVM Deﬁnition Storage problems Canonical models 0.8 OR Gate Models Experiments 0.7 Thesaurus Deﬁnitions Unsupervised model Supervised model 0.6 Experiments Structured Structured documents Transformations 0.5 Experiments Link-based categorization Multiclass model Experiments Multilabel model 0.4 Experiments 5 10 15 20 Remarks PhD Dissertation.40
41. Automatic Indexing From a Thesaurus Using BayesianNetworks: Experiments VSupervised experiments: Micro F1 at ﬁve computed for incrementalpercentage of training data TC Notation 0.7 Representation Naive Bayes Particularities Standalone OR-Gate Evaluation Rocchio Supervised BN (conf. 1) Supervised BN (conf. 2) Bayesian networks SVM Deﬁnition Storage problems 0.6 Canonical models OR Gate Models Experiments 0.5 Thesaurus Deﬁnitions Unsupervised model Supervised model Experiments 0.4 Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments 0.3 Multilabel model Experiments 10 20 30 40 50 60 70 80 90 100 Remarks PhD Dissertation.41
42. Automatic Indexing From a Thesaurus Using BayesianNetworks Conclusions TC Notation Representation • A BN-based model for document classiﬁcation Particularities Evaluation indexed from a thesaurus. Bayesian networks • Very good results in unsupervised (300% VSM Deﬁnition Storage problems results). Canonical models OR Gate • Outstanding results in supervised (above SVM). Models Experiments Thesaurus Deﬁnitions Future work Unsupervised model Supervised model Experiments • Consider associative relationships. Structured Structured documents • Take context into account. Transformations Experiments Link-based categorization • Test on other thesauri (MeSH, Agrovoc), build testing Multiclass model Experiments corpora. Multilabel model Experiments Remarks PhD Dissertation.42
43. Outline 1 Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Deﬁnition Storage problems 3 An OR Gate-Based Text Classiﬁer Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Deﬁnitions Unsupervised model Supervised model Experiments ⇒ Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.43
44. Structured Document Categorization Using BayesianNetworksIntroduction TC Notation What is structure in document collections? Representation Particularities Evaluation • Internal structure (inside each document): XML Bayesian networks (“structured”) documents . Deﬁnition Storage problems Canonical models • External structure (outside documents, graph in the OR Gate collection): linked-based collections. Models Experiments Outline of this part: Thesaurus Deﬁnitions 1 Contributions in Structured TC (XML). Unsupervised model Supervised model Experiments 2 A model for link-based document categorization Structured (multiclass). Structured documents Transformations 3 A model for link-based document categorization Experiments Link-based categorization (multilabel). Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.44
45. Structured Document Categorization Using BayesianNetworks IStructured Documents TC Notation Representation Particularities Evaluation Bayesian networks Deﬁnition Scientific paper Storage problems Canonical models Title OR Gate Models Bibliography Experiments Section Thesaurus Abstract Section Section Subsection Deﬁnitions Unsupervised model Section Title Subsection Section Title Supervised model Experiments Section Title Subsection Subsection Subsection Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.45
46. Structured Document Categorization Using BayesianNetworks IITransformations I TC Notation Representation Particularities Evaluation Structured TC: same as “plain” TC, corpora of structured Bayesian networks documents. Deﬁnition Storage problems Canonical models OR Gate Models Experiments Thesaurus Deﬁnitions Our approach Unsupervised model Supervised model Convert with transformations structured documents to Experiments Structured plain documents, and test plain classiﬁers on them. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.46
47. Structured Document Categorization Using BayesianNetworks IIITransformations II TC Notation <book> Representation Particularities <title>El ingenioso hidalgo Don Quijote Evaluation Bayesian networks de la Mancha</title> Deﬁnition Storage problems <author>Miguel de Cervantes Saavedra Canonical models </author><contents> OR Gate Models <chapter>Uno</chapter> Experiments <text>En un lugar de La Mancha de Thesaurus Deﬁnitions cuyo nombre no quiero acordarme... Unsupervised model Supervised model </text> </contents> Experiments </book> Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Figure: “Quijote”, XML Fragment used for examples, with Experiments header removed. Multilabel model Experiments Remarks PhD Dissertation.47
48. Structured Document Categorization Using BayesianNetworks IIITransformations III TC Notation Representation Particularities Evaluation Bayesian networks Deﬁnition El ingenioso hidalgo Don Quijote de la Storage problems Canonical models Mancha Miguel de Cervantes Saavedra Uno OR Gate En un lugar de La Mancha de cuyo nombre Models Experiments no quiero acordarme... Thesaurus Deﬁnitions Unsupervised model Supervised model Experiments Figure: “Quijote”, with “only text” approach. Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.48
49. Structured Document Categorization Using BayesianNetworks IIITransformations IV TC Notation Representation Particularities Evaluation title_El title_ingenioso title_hidalgo Bayesian networks title_Don title_Quijote title_de Deﬁnition Storage problems title_la title_Mancha author_Miguel Canonical models author_de author_Cervantes OR Gate Models author_Saavedra chapter_Uno text_En Experiments text_un text_lugar text_de text_La Thesaurus Deﬁnitions text_Mancha text_de text_cuyo text_nombre Unsupervised model Supervised model text_no text_quiero text_acordarme... Experiments Structured Structured documents Transformations Experiments Figure: “Quijote”, with “tagging_1”. Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.49
50. Structured Document Categorization Using BayesianNetworks IIITransformations V TC Our contribution Notation Representation Particularities Evaluation title: 1, author: 0, chapter: 0, text: 2 Bayesian networks Deﬁnition Storage problems Canonical models El ingenioso hidalgo Don Quijote de OR Gate Models la Mancha En En un un lugar lugar de Experiments de La La Mancha Mancha de de cuyo cuyo Thesaurus Deﬁnitions nombre nombre no no quiero quiero Unsupervised model Supervised model acordarme acordarme... Experiments Structured Structured documents Transformations Experiments Figure: “Quijote”, with “replication” method, using values Link-based categorization proposed before. Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.50
51. Structured Document Categorization Using BayesianNetworks IIIExperimentation TC • Experiments on the INEX 2007 XML dataset. Notation Representation Particularities • 96611 documents, 21 categories, 50% training/test Evaluation split. Bayesian networks Deﬁnition • Replication improves macro measures on Naïve Storage problems Canonical models Bayes a lot. OR Gate Models • Other transformations are not useful here. Experiments Thesaurus Deﬁnitions Unsupervised model Method Reduction Selection? µBEP MBEP µF1 MF1 Supervised model Naïve Bayes Only text no 0.76160 0.58608 0.78139 0.64324 Experiments Naïve Bayes Only text ≥ 2 docs. 0.72269 0.67379 0.77576 0.69309 Naïve Bayes Only text ≥ 3 docs. 0.69753 0.67467 0.76191 0.68856 Structured Naïve Bayes Repl. (id=2) None 0.76005 0.64491 0.78233 0.66635 Structured documents Naïve Bayes Repl. (id=2) ≥ 2 docs. 0.71270 0.68386 0.61321 0.73780 Transformations Experiments Naïve Bayes Repl. (id=2) ≥ 3 docs. 0.70916 0.68793 0.73270 0.65697 Link-based categorization Naïve Bayes Repl. (id=3) None 0.75809 0.67327 0.77622 0.67101 Multiclass model Naïve Bayes Repl. (id=4) None 0.75921 0.69176 0.76968 0.67013 Experiments Naïve Bayes Repl. (id=5) None 0.75976 0.70045 0.76216 0.66412 Multilabel model Naïve Bayes Repl. (id=8) None 0.74406 0.69865 0.72728 0.61602 Experiments Naïve Bayes Repl. (id=11) None 0.72722 0.67965 0.71422 0.60451 Remarks PhD Dissertation.51
52. Structured Document Categorization Using BayesianNetworks III TC Notation Representation Particularities Conclusions Evaluation Bayesian networks Deﬁnition • Several XML transformation (one original). Storage problems Canonical models • Good results with “replication” + NB. OR Gate Models Experiments Thesaurus Future work Deﬁnitions Unsupervised model Supervised model • More extensive experimentation. Experiments Structured • New transformations. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.52
53. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization A set of documents with a graph structure among them. TC Notation The goal is to label a document using both its content Representation Particularities and the graph structure (labels of the neighbors?). Evaluation Bayesian networks Deﬁnition Storage problems Canonical models OR Gate Models Experiments Thesaurus Deﬁnitions Unsupervised model Supervised model Experiments Structured Structured documents Transformations Experiments Link-based categorization Multiclass model ? Experiments Multilabel model Experiments Remarks PhD Dissertation.53
54. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization TC Typically, scatterplots like this: Notation Representation Particularities Evaluation Bayesian networks 0.12 Deﬁnition Storage problems 0.1 Canonical models 0.08 OR Gate Models Experiments 0.06 Thesaurus 0.04 Deﬁnitions Unsupervised model 0.02 Supervised model Experiments 0 Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Encyclopedia regularity (a document of category Ci Experiments Multilabel model tends to links documents on the same category). Experiments Remarks PhD Dissertation.54
55. Structured Document Categorization Using BayesianNetworks IVlink-based categorization: multiclass I TC Document d0 , linked to documents d1 , . . . , dm . Notation Representation Particularities Evaluation Random variables C0 , C1 , . . . , Cm , in {c0 , c1 , . . . , cn }. Bayesian networks Deﬁnition Storage problems Variables ei , evidence of the classiﬁcation (content) of Canonical models OR Gate document di . Models Experiments Thesaurus Given the true class of the document to classify Deﬁnitions Unsupervised model (independences): Supervised model Experiments 1 the categories of the linked documents are Structured independent among each other, and Structured documents Transformations Experiments 2 the evidence about this category due to the Link-based categorization Multiclass model document content is independent of the original Experiments Multilabel model category of the document we want to classify. Experiments Remarks PhD Dissertation.55
56. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization: multiclass II TC Notation e1 ... ... em Representation Particularities Evaluation Bayesian networks Deﬁnition Storage problems ... ... Canonical models C1 Cm OR Gate Models Experiments Thesaurus Deﬁnitions Unsupervised model Supervised model Experiments C0 Structured Structured documents Transformations Experiments Link-based categorization Multiclass model e0 Experiments Multilabel model Experiments Remarks PhD Dissertation.56
57. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization: multiclass III TC Notation With some computation: Representation Particularities 0 1 Evaluation m ” p(C = c |e ) i j i C “ Bayesian networks Y X p(C0 = c0 |e) ∝ p(C0 = c0 |e0 ) p Ci = cj |C0 = c0 B @ A i=1 cj ={c0 ,...,cn } p(Ci = cj ) Deﬁnition Storage problems Canonical models Where: OR Gate Models • p(C0 = c0 |e) ﬁnal evidence that the document Experiments Thesaurus belongs to C0 . Deﬁnitions Unsupervised model • p(Ci = cj |ei ) obtained with a “local” (content) Supervised model Experiments classiﬁer (NB). Structured • p(Ci = ci ) (prior) and p(Ci = ci |C0 = c0 ) (probability Structured documents Transformations a document of Ci links another of C0 ), obtained Experiments Link-based categorization from training data. Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.57
58. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization: multiclass IV TC Notation Representation Experiments: INEX 2008 corpus: Particularities Evaluation Bayesian networks • A classical Naïve Bayes algorithm on the ﬂat text Deﬁnition Storage problems documents obtained 0.67674 of recall. Canonical models • Our proposal using the previous Naïve Bayes as the OR Gate Models base classiﬁer obtained 0.6787 of recall (using Experiments Thesaurus outlinks). Deﬁnitions Unsupervised model • Our model (inlinks): 0.67894 of recall. Supervised model Experiments • Our model (neighbours): 0.68273 of recall. Structured Structured documents The model works better in a “ideal environment” (knowing Transformations Experiments the labels of all neighbors). Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.58
59. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization: multiclass V TC Notation Representation Particularities Conclusions Evaluation Bayesian networks • A new model for classiﬁcation of multiclass linked Deﬁnition Storage problems documents, based on BNs. Canonical models OR Gate • Good performance in an ideal environment. Models Experiments Thesaurus Deﬁnitions Future work Unsupervised model Supervised model Experiments • Use a base classiﬁer (probabilistic) with a better Structured performance (Logistic? SVM with probabilistic Structured documents Transformations outputs?). Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.59
60. Structured Document Categorization Using BayesianNetworks VLinked-document categorization: multilabel I TC • Previous model was not ﬂexible. Structure of BN Notation Representation imposed. Particularities Evaluation • We learn the interactions among categories from Bayesian networks Deﬁnition data, no ﬁxed structure, but any which is learnt Storage problems Canonical models from the set of categories. OR Gate Models • Variables: categories Ci (one for category), Experiments categories of incoming links Ej (one for category) Thesaurus Deﬁnitions and terms Tk (many). Unsupervised model Supervised model • We will search for p(ci |ej , dj ). Experiments Structured • Main assumption: Structured documents Transformations Experiments p(dj , ej |ci ) = p(dj |ci ) p(ej |ci ). Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.60
61. Structured Document Categorization Using BayesianNetworks VLinked-document categorization: multilabel II TC Notation Representation With a few computations: Particularities Evaluation Bayesian networks p(ci |dj ) p(ci |ej ) / p(ci ) Deﬁnition p(ci |dj , ej ) = Storage problems p(ci |dj )p(ci |ej )/p(ci ) + p(c i |dj )p(c i |ej )/p(c i ) Canonical models OR Gate Models Experiments • p(ci |dj ): output of a probabilistic classiﬁer. Any Thesaurus probabilistic classiﬁer. Deﬁnitions Unsupervised model Supervised model Experiments • p(ci |ej ): probability of being of Ci considering the set Structured Structured documents of the categories of the incoming (known) links. This Transformations Experiments is modeled by the BN. Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.61
62. Structured Document Categorization Using BayesianNetworks VLinked-document categorization: multilabel III TC Experimentation INEX 2009 corpus: 54572 documents, Notation Representation test/train split of a 20/80%. 39 categories. Particularities Evaluation Bayesian networks Measures Accuracy (ACC), Area under Roc curve Deﬁnition Storage problems (ROC), F1 measure (PRF) and Avg prec on 11 std Canonical models OR Gate (MAP). Models Experiments • Learning Bayesian Network, using WEKA package. Thesaurus • Hillclimbing algorithm (easy and fast) + BDeu Deﬁnitions Unsupervised model metric (3 parents max. per node). Supervised model Experiments • Propagation, using Elvira Structured Structured documents • Compute p(ci ) (once), and p(ci |ej ) (for each Transformations Experiments document j). Exact propagation is slow for so many Link-based categorization categories! ⇒ Importance Sampling algorithm Multiclass model Experiments (approximate). Multilabel model Experiments Remarks PhD Dissertation.62
63. Structured Document Categorization Using BayesianNetworks VLinked-document categorization: multilabel IV TC Notation Representation Particularities Evaluation Bayesian networks Results Deﬁnition Storage problems Canonical models MACC µACC MROC µROC MPRF µPRF MAP OR Gate N. Bayes 0.95142 0.93284 0.80260 0.81992 0.49613 0.52670 0.64097 Models N. Bayes + BN 0.95235 0.93386 0.80209 0.81974 0.50015 0.53029 0.64235 Experiments OR gate 0.92932 0.92612 0.92526 0.92163 0.45966 0.50407 0.72955 Thesaurus OR gate + BN 0.96607 0.95588 0.92810 0.92739 0.51729 0.55116 0.72508 Deﬁnitions Unsupervised model Supervised model Our method clearly improves both baselines. Experiments Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.63
64. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization: multilabel V TC Notation Conclusions Representation Particularities Evaluation • A new model for classiﬁcation of multilabel linked Bayesian networks Deﬁnition documents, based on BNs. Storage problems Canonical models • Very ﬂexible. OR Gate Models • Any learning procedure is usable. Experiments Thesaurus • Very promising results Deﬁnitions Unsupervised model Supervised model Experiments Future work Structured Structured documents Transformations • Use different baselines. Experiments Link-based categorization • More extensive experimentation. Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.64
65. Outline 1 Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Deﬁnition Storage problems 3 An OR Gate-Based Text Classiﬁer Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Deﬁnitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments ⇒ Final Remarks Multilabel model Experiments Remarks PhD Dissertation.65
66. Final RemarksSummaryMost relevant contributions TC Notation Representation • A new text classiﬁer (OR gate), better than NB. Particularities Evaluation Bayesian networks Deﬁnition Storage problems • Deﬁnition of a new problem (thesaurus indexing). Canonical models Two models, outstanding results. OR Gate Models Experiments Thesaurus • Some minor contributions in XML classiﬁcation Deﬁnitions Unsupervised model (“text replication”). Supervised model Experiments Structured Structured documents • Two models of link-based document Transformations Experiments categorization. Promising results in the Multilabel Link-based categorization Multiclass model one. Experiments Multilabel model Experiments Remarks PhD Dissertation.66
67. Final Remarks IIList of Publications supporting this work: TC Notation Representation Particularities Evaluation In the thesis: Bayesian networks Deﬁnition See pages 170-173 Storage problems Canonical models OR Gate Models or... Experiments Thesaurus Deﬁnitions In the www: Unsupervised model Supervised model Experiments Visit http://decsai.ugr.es/~aeromero Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.67
68. Final Remarks IIISoftware: • DauroLab. A toolbox for Machine Learning. Written TC Notation in Java (by me!). Representation Particularities • Free (libre) software (GPL v3). Evaluation Bayesian networks • http://sf.net/projects/daurolab. Deﬁnition Storage problems Canonical models OR Gate Models Experiments Thesaurus Deﬁnitions Unsupervised model Supervised model Experiments Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.68
Be the first to comment