Document classification models based on Bayesian networks

  • 1,455 views
Uploaded on

Slides of my PhD thesis, presented in April 2010.

Slides of my PhD thesis, presented in April 2010.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
1,455
On Slideshare
0
From Embeds
0
Number of Embeds
1

Actions

Shares
Downloads
43
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. PhD DissertationDocument Classification Models TC Notation Representationbased on Bayesian networks Particularities EvaluationAlfonso E. Romero Bayesian networks Definition Storage problems Canonical modelsApril 27, 2010 OR Gate Models Experiments Thesaurus Definitions Unsupervised model Supervised model Experiments Structured Structured documents Advisors: Luis M. de Campos, and Transformations Experiments Juan M. Fernández-Luna Link-based categorization Multiclass model Department of Computer Science and A.I. Experiments Multilabel model University of Granada Experiments Remarks PhD Dissertation.1
  • 2. A brief overview: the problem I We shall solve problems in document categorization... TC Notation Representation Particularities Evaluation New HSL New HSL Madrid-Valencia Madrid-Valencia Bayesian networks to open in December, 2010 ? to open in December, 2010 Definition Storage problems Canonical models National OR Gate Models Experiments Thesaurus Definitions Unsupervised model Supervised model Experiments Structured Structured documents Sports Society World National Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.2
  • 3. A brief overview: the problem II ... in particular, automatic document categorization... TC Notation Representation Particularities Evaluation New HSL New HSL Madrid-Valencia Madrid-Valencia Bayesian networks to open in December, 2010 ? to open in December, 2010 Definition Storage problems Canonical models National OR Gate Models Experiments Thesaurus Definitions Unsupervised model Supervised model Experiments Structured Structured documents Sports Society World National Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.3
  • 4. A brief overview: the problem III ... concretely, supervised document categorization. TC Notation Representation Particularities The Alhambra of Learning procedure Evaluation Granada, most visited National Bayesian networks monument in 2009 Definition Storage problems Canonical models Prime minister OR Gate Zapatero to give a Algorithm (classifier) Models speech on the National Experiments Parliament Thesaurus Definitions Unsupervised model AVE train to run at Supervised model 350 km/h in 2010 Experiments from Madrid to National Structured Barcelona Structured documents Transformations Experiments Labeled corpus Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.4
  • 5. A brief overview: the methods I Which learning method shall we use? TC Notation Representation Neural networks Particularities Evaluation Support Vector Machines Bayesian networks Definition Storage problems k-NN�methods Canonical models OR Gate Bayesian networks Models Experiments and probabilistic methods Thesaurus Definitions Unsupervised model Decision trees Supervised model Experiments Structured Evolutive algorithms Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.5
  • 6. A brief overview: the methods II The answer: TC Notation Representation Neural networks Particularities Evaluation Support Vector Machines Bayesian networks Definition Storage problems k-NN�methods Canonical models OR Gate Bayesian networks Models Experiments and probabilistic methods Thesaurus Definitions Unsupervised model Decision trees Supervised model Experiments Evolutive algorithms Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.6
  • 7. A brief overview: the methods III TC But, why? Notation Representation Particularities Evaluation Bayesian networks • Strong theoretical foundation (probability theory). Definition Storage problems Canonical models OR Gate • Models for (uncertain) knowledge representation. Models Experiments Thesaurus Definitions • Great success in related tasks (IR). Unsupervised model Supervised model Experiments Structured • Our background at the group UTAI. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.7
  • 8. Outline 1 Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Definition Storage problems 3 An OR Gate-Based Text Classifier Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Definitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.8
  • 9. Outline ⇒ Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Definition Storage problems 3 An OR Gate-Based Text Classifier Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Definitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.9
  • 10. Supervised Text Categorization I Provided TC Notation 1 Set of labeled documents DTr (training). Representation Particularities 2 C, set of categories/labels. Evaluation Bayesian networks The goal is to build a model f (classifier) capable of Definition Storage problems predicting categories (of C) of documents in D. Canonical models OR Gate Models Different kinds of labeling: Experiments Thesaurus • f : D → {c, c} (binary). Definitions Unsupervised model • f : D → {c1 , c2 , . . . , cn } (multiclass). Supervised model Experiments • f : D × C → {0, 1} (multilabel). Structured Structured documents Transformations A multilabel problem reduces to |C| binary problems Experiments Link-based categorization C = {c, c}. We often change the codomain from {0, 1} Multiclass model Experiments (hard classification) to [0, 1] (soft classification). Multilabel model Experiments Remarks PhD Dissertation.10
  • 11. Supervised Text Categorization IIDocument representation: As in Information Retrieval TC Stopwords removal + stemming + Vector Notation representation (Frequency, binary or tf-idf). Representation Particularities Evaluation From (preprocessed) document to vector Bayesian networks Definition Storage problems Canonical models term ⇔ dimension OR Gate Models Experiments Thesaurus Definitions Unsupervised model Example (beginning of John Miltons "Lost Paradise"): Supervised model Experiments Of Mans First Disobedience, and the Fruit Of Mans First Disobedience, and the Fruit Structured Of that Forbidden Tree, whose mortal tast Of that Forbidden Tree, whose mortal tast Structured documents Brought Death into the World, and all our woe, Brought Death into the World, and all our woe, With loss of Eden, till one greater Man... With loss of Eden, till one greater Man... Transformations Experiments Link-based categorization Multiclass model 2 1 1 1 1 1 1 1 1 1 1 1 1 1 Experiments man obedience fruit forbid tree mortal tast bring death world woe loss eden great Multilabel model Experiments Remarks PhD Dissertation.11
  • 12. Supervised Text Categorization IIIWhat are the particularities of the problem? It differs from a “classic” Machine Learning problem in: TC Notation Representation Particularities Evaluation • High dimensionality (easily > 10000). Bayesian networks Definition Storage problems Canonical models • Very unbalanced datasets. OR Gate Models Experiments • |C| 0. Thesaurus Definitions Unsupervised model Supervised model Experiments • Sometimes, there is a hierarchy in the set C. Structured Structured documents Transformations Experiments • Sometimes, explicit relationships among Link-based categorization Multiclass model documents are given. Experiments Multilabel model Experiments Remarks PhD Dissertation.12
  • 13. Supervised Text Categorization IVEvaluation How to measure the correctness of documents TC assigned to set of categories? Notation Representation • Binary/multiclass: Particularities Evaluation TP Bayesian networks • Hard categorization Precision: TP+FP and Recall: Definition Storage problems TP 2PR TP+FN , F1 : P+R . Canonical models OR Gate • Soft categorization Precision/Recall BEP. Models Experiments Thesaurus Definitions • Multilabel: micro and macro averages. Unsupervised model Supervised model Experiments Structured • Also, average precision on the 11 std. recall points Structured documents Transformations (category ranking). Experiments Link-based categorization Multiclass model Experiments Multilabel model • Standard corpora: Reuters, Ohsumed, 20 NG. . . Experiments Remarks PhD Dissertation.13
  • 14. Outline 1 Text Categorization TC Notation Representation ⇒ Bayesian networks Particularities Evaluation Bayesian networks Definition Storage problems 3 An OR Gate-Based Text Classifier Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Definitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.14
  • 15. Bayesian networks IDefinition and characteristics A set of random variables X1 , . . . , XN in a DAG, verifying P(X1 , . . . , Xn ) = n P(Xi |Pa(Xi )) ⇒ the graph i=1 TC Notation represents independences. Representation Particularities Evaluation Causal interpretation. Bayesian networks Definition Storage problems Canonical models Learning and inference methods available. OR Gate Models Experiments Thesaurus Definitions Unsupervised model Supervised model Experiments Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.15
  • 16. Bayesian networks IIIEstimation and storage problems TC The problem Notation Representation Particularities • One value for each configuration of the parents. Evaluation Bayesian networks • General case: exponential number of parameters Definition Storage problems on the number of parents. Canonical models OR Gate Models Experiments The solution Thesaurus Definitions 1 Few parents per node (not realistic in text). Unsupervised model Supervised model Experiments 2 Write the probability of a node as a deterministic Structured function of the configuration (canonical model). Set Structured documents Transformations of parameters with linear size on the number of Experiments Link-based categorization parents. Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.16
  • 17. Bayesian networks: canonical modelsComponents and examples of canonical models • X = {Xi }: parents (causes), Y : child (effect). TC Notation Representation • Xi in {xi , xi }, Yi in {yi , yi } (occurrence or not). Particularities Evaluation Bayesian networks 1 Noisy-OR gate model: Definition p(y |X) = 1 − Xi ∈R(x) (1 − wOR (Xi , Y )). Storage problems Canonical models 2 Additive model: p(y |X) = Xi ∈R(x) wadd (Xi , Y ). OR Gate Models Experiments Thesaurus ... Definitions X1 X2 XL Unsupervised model Supervised model Experiments Structured Structured documents Transformations Experiments Y Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.17
  • 18. Outline 1 Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Definition ⇒ An OR Gate-Based Text Classifier Storage problems Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Definitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.18
  • 19. An OR Gate-Based Text ClassifierWhy using OR gates? • The OR gate is a simple model (and fast for TC Notation inference). Representation Particularities • It has gained great success in knowledge Evaluation Bayesian networks representation. Definition Storage problems • Discriminative classifier (models directly p(ci |dj )) Canonical models (NB generative). OR Gate Models • Seems natural the term are causes and category Experiments Thesaurus the effect. Definitions Unsupervised model Supervised model Experiments C C Structured Structured documents Transformations Experiments Link-based categorization T1 T2 ... Tn T1 T2 ... Tn Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.19
  • 20. An OR Gate-Based Text ClassifierThe model Equations for the OR gate classifier TC Notation Probability distributions: Representation Particularities Evaluation pi (ci |pa(Ci )) = 1 − (1 − w(Tk , Ci )) , Bayesian networks Definition Tk ∈R(pa(Ci )) Storage problems Canonical models pi (c i |pa(Ci )) = 1 − pi (ci |pa(Ci )). OR Gate Models Experiments Inference: Thesaurus Definitions Unsupervised model pi (ci |dj ) = 1 − 1 − w(Tk , Ci ) p(tk |dj ) Supervised model Experiments Tk ∈Pa(Ci ) Structured Structured documents = 1− (1 − w(Tk , Ci )) . Transformations Experiments Tk ∈Pa(Ci )∩dj Link-based categorization Multiclass model Experiments Multilabel model Experiments Model characterized by the w(Tk , Ci ) formula. Remarks PhD Dissertation.20
  • 21. An OR Gate-Based Text ClassifierWeight estimation formulae Weight w(Tk , Ci ) means TC Notation Representation ˆ pi (ci |tk , t h ∀Th ∈ Pa(Ci ), Th = Tk ). Particularities Evaluation Bayesian networks ˆ Nik +1 1 ML: w(Tk , Ci ) as p(ci |tk ), w(Tk , Ci ) = N•k +2 (using Definition Storage problems Laplace). Canonical models OR Gate Models Experiments 2 TI: assuming term independence, given the Thesaurus (Ni• −N category, w(Tk , Ci ) = ntNik•k h=k (N−N•hih )N . Definitions Unsupervised model iN )Ni• Supervised model Experiments Structured Notation: N number of words in the training documents. Structured documents Transformations N•k times that term tk appears in training documents Experiments Link-based categorization (N•k = ci Nik ), Ni• is the total number of words in Multiclass model Experiments documents of class ci (Ni• = tk Nik ), M vocabulary size. Multilabel model Experiments Remarks PhD Dissertation.21
  • 22. An OR Gate-Based Text ClassifierPruning independent terms TC Notation Representation Particularities • The model can be improved pruning terms which Evaluation are independent with the category. Bayesian networks Definition Storage problems Canonical models • We run an independence test for each pair OR Gate Models term/category, at a certain confidence level. Only Experiments terms which pass it are kept. Thesaurus Definitions Unsupervised model Supervised model Experiments • The size of the parent set is highly reduced, but Structured classification is often improved. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.22
  • 23. An OR Gate-Based Text ClassifierExperiments I: Experimental setting TC Notation Representation Particularities • We compare a Multinomial NB, a Rocchio, OR TI, Evaluation OR ML, and both OR with pruning at levels Bayesian networks Definition {0.9, 0.99, 0.999}. Storage problems Canonical models OR Gate Models • We made experiments on Ohsumed-23, Reuters Experiments and 20 NG. Thesaurus Definitions Unsupervised model Supervised model Experiments • Soft categorization, evaluated with macro/micro Structured BEP, 11 avg std prec, and macro/micro F1 @{1, 3, 5}. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.23
  • 24. An OR Gate-Based Text ClassifierExperiments II: Experimental setting TC Notation Representation Results (# of wins on 9 measures): Particularities Evaluation Bayesian networks Definition • Reuters: OR-TI-0.999 (7), OR-TI (1), OR-ML-0.999 Storage problems Canonical models (1). OR Gate Models Experiments Thesaurus • Ohsumed: OR-TI-0.999 (8), OR-ML-0.99 (1). Definitions Unsupervised model Supervised model Experiments • 20 NG: OR-TI (2), OR-ML (2), OR-TI-0.999 (1), NB Structured (4). Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.24
  • 25. An OR Gate-Based Text Classifier Conclusions TC Notation Representation • A new text categorization model, based on noisy OR Particularities Evaluation gates. Bayesian networks Definition • Simple and computationally affordable. Storage problems Canonical models • Results improves Naïve Bayes noticeably. OR Gate Models Experiments Thesaurus Future work Definitions Unsupervised model Supervised model • Use more advanced noisy OR models (leaky). Experiments Structured • Use another canonical models. Structured documents Transformations • Explore another alternatives for term pruning. Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.25
  • 26. Outline 1 Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Definition Storage problems 3 An OR Gate-Based Text Classifier Canonical models OR Gate Models Experiments ⇒ Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Definitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.26
  • 27. Automatic Indexing From a Thesaurus Using BayesianNetworks: Thesauri IDefinitions TC A thesaurus Notation Representation A list of terms representing concepts, grouped those with Particularities Evaluation the same meaning, with hierarchical relationships among Bayesian networks them. Definition Storage problems Canonical models ND:psychiatric OR Gate ND:outpatient ND:clinic ND:hospital hospital ND:health ND:dispensary Models clinic care centre Experiments Thesaurus Definitions D:psychiatric D:medical D:medical ND:medical Unsupervised model institution institution centre service Supervised model Experiments Structured ND:health ND:health D:health Structured documents protection service Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model D:health Experiments policy Remarks PhD Dissertation.27
  • 28. Automatic Indexing From a Thesaurus Using BayesianNetworks: Thesauri IIIndexing with a thesauri TC • Problem: associating descriptors (keywords) to Notation Representation documents (scientific, medical, legal,. . . ). Particularities Evaluation • Manually: expensive and time consuming work (due Bayesian networks to thousand of descriptors! [EUROVOC > 6000]). Definition Storage problems Besides: Canonical models OR Gate • How many descriptors should we assign? Models • Which descriptor should assign in the hierarchy? Experiments Thesaurus Definitions We propose an automatic solution Unsupervised model Supervised model Experiments 1 TC problem (categories ≡ descriptors). Structured Structured documents 2 Makes use of the meta-information of the thesaurus. Transformations Experiments Link-based categorization 3 Unsupervised and supervised case. Multiclass model Experiments 4 Based on BNs. Multilabel model Experiments Remarks PhD Dissertation.28
  • 29. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case IFrom the thesaurus... TC Notation Representation Particularities Evaluation ND:psychiatric ND:outpatient ND:clinic ND:hospital hospital ND:health ND:dispensary clinic care centre Bayesian networks Definition Storage problems D:psychiatric D:medical D:medical ND:medical centre Canonical models institution institution service OR Gate Models ND:health ND:health D:health protection service Experiments Thesaurus Definitions Unsupervised model D:health Supervised model policy Experiments Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.29
  • 30. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case II... to the Bayesian network. TC policy protection service health psychiatric outpatient hospital clinic institution medical care centre dispensary Notation Representation Particularities Evaluation ND:psychiatric ND:outpatient ND:clinic ND:hospital hospital ND:health ND:dispensary clinic care centre Bayesian networks Definition Storage problems D:psychiatric D:medical D:medical ND:medical Canonical models institution institution centre service OR Gate Models ND:health ND:health D:health protection service Experiments Thesaurus Definitions Unsupervised model D:health Supervised model policy Experiments Structured Structured documents Transformations • Binary variables (relevant/not relevant), for each Experiments Link-based categorization term, descriptor and non descriptor. Multiclass model Experiments • Problem: information of different nature mixed in Multilabel model Experiments descriptor nodes! Remarks PhD Dissertation.30
  • 31. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case IIIIntroducing concept and virtual nodes TC protection policy health service care dispensary centre medical outpatient clinic hospital institution psychiatric Notation Representation Particularities ND:health ND:health D:health ND:medical D:health ND:health ND:dispensary D:medical D:medical ND:outpatient ND:clinic ND:hospital D:psychiatric ND:psychiatric protection policy service service care centre centre institution clinic institution hospital Evaluation Bayesian networks E:health E:health E:medical E:medical E:psychiatric Definition policy service centre institution institution Storage problems Canonical models C:medical C:medical C:psychiatric centre institution institution OR Gate Models Experiments H:health service C:health Thesaurus service Definitions Unsupervised model H:health policy Supervised model Experiments C:health policy Structured Structured documents Transformations Experiments • New concept nodes, C. Link-based categorization Multiclass model • New Hierarchy nodes, HC . Experiments Multilabel model Experiments • New Equivalence nodes, EC . Remarks PhD Dissertation.31
  • 32. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case IVProbability distributions We already have the structure, but we have to specify TC Notation the distributions on each node. Representation Particularities Many parents ⇒ canonical models. Evaluation Bayesian networks • D and ND nodes (SUM): Definition Storage problems p(d + |pa(D)) = T ∈R(pa(D)) w(T , D). Canonical models OR Gate • HC nodes (SUM): Models + Experiments p(hc |pa(HC )) = C ∈R(pa(HC )) w(C , HC ). Thesaurus • EC nodes (OR): Definitions Unsupervised model + p(ec |pa(EC )) = 1 − D∈R(pa(EC )) (1 − w(D, C)). Supervised model Experiments • C nodes (OR): Structured Structured documents Transformations + + 8 > 1 − (1 − w(EC , C))(1 − w(HC , C)) > if ec = ec , hc = hc Experiments > < w(E , C) + − Link-based categorization + C if ec = ec , hc = hc p(c |{ec , hc }) = − + Multiclass model > w(HC , C) > if ec = ec , hc = hc Experiments > − − 0 if ec = ec , hc = hc : Multilabel model Experiments Remarks PhD Dissertation.32
  • 33. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case VInference TC Classification is seen as inference. Notation Representation Particularities • Given a document d, we set the term variables Evaluation T ∈ d to t + , t − otherwise. Bayesian networks Definition Storage problems Canonical models • Exact propagation is carried out: OR Gate Models • First, to descriptor nodes. Experiments Thesaurus Definitions • Then, to EC nodes. Unsupervised model Supervised model Experiments • Following a topological order, probabilities Structured + p(hc |pa(HC )) are computed after their parents Structured documents Transformations + p(c |pa(C)) are set. Experiments Link-based categorization Multiclass model Experiments • Final p(c + |pa(C)), ∀C values are returned. Multilabel model Experiments Remarks PhD Dissertation.33
  • 34. Automatic Indexing From a Thesaurus Using BayesianNetworks: Supervised model IChanges from the unsupervised case TC Notation Representation Particularities Evaluation • Concept also receives information from labeled Bayesian networks Definition documents in the training set. Storage problems Canonical models OR Gate Models • We add a training node TC as new parent of the Experiments concept one. The node is an OR gate ML classifier. Thesaurus Definitions Unsupervised model Supervised model Experiments • Distributions of C nodes are modified Structured consequently. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.34
  • 35. Automatic Indexing From a Thesaurus Using BayesianNetworks: Supervised model IIGraphically: unsupervised TC Notation Representation Particularities Evaluation protection policy health service care dispensary centre medical outpatient clinic hospital institution psychiatric Bayesian networks Definition ND:health ND:health D:health ND:medical D:health ND:health ND:dispensary D:medical D:medical ND:outpatient ND:clinic ND:hospital D:psychiatric ND:psychiatric protection policy service service care centre centre institution clinic institution hospital Storage problems Canonical models E:health policy E:health service E:medical centre E:medical institution E:psychiatric OR Gate institution Models Experiments C:medical C:medical C:psychiatric centre institution institution Thesaurus Definitions H:health Unsupervised model service Supervised model C:health service Experiments H:health Structured policy Structured documents C:health Transformations policy Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.35
  • 36. Automatic Indexing From a Thesaurus Using BayesianNetworks: Supervised model IIGraphically: supervised TC Notation Representation Particularities Evaluation protection policy health service care dispensary centre medical outpatient clinic hospital institution psychiatric Bayesian networks Definition ND:health ND:health D:health ND:medical D:health ND:health ND:dispensary D:medical D:medical ND:outpatient ND:clinic ND:hospital D:psychiatric ND:psychiatric protection policy service service care centre centre institution clinic institution hospital Storage problems Canonical models T:health policy E:health policy T:health service E:health service T:medical centre E:medical centre T:medical institution E:medical institution T:psychiatric E:psychiatric OR Gate institution institution Models Experiments C:medical C:medical C:psychiatric centre institution institution Thesaurus Definitions H:health Unsupervised model service Supervised model C:health service Experiments H:health Structured policy Structured documents C:health Transformations policy Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.36
  • 37. Automatic Indexing From a Thesaurus Using BayesianNetworks: Experiments IDescription of the collection TC Notation • Database from the Andalusian Parliament at Spain, Representation Particularities containing 7933 parliamentary resolutions. Evaluation Bayesian networks • Classified on the Eurovoc thesaurus (5080 Definition Storage problems categories). Canonical models OR Gate • From 1 to 14 descriptors assigned (average 3.8). Models Experiments • Each initiative 1 to 3 lines of text. Thesaurus Definitions Unsupervised model Experimentation Supervised model Experiments Structured 1 Unsupervised: our model Vs. VSM and HVSM. Structured documents Transformations 2 Supervised: our model Vs. SVM, Rocchio and NB. Experiments Link-based categorization Micro-macro BEP, F1@5, AV Prec. Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.37
  • 38. Automatic Indexing From a Thesaurus Using BayesianNetworks: Experiments IIUnsupervised experiments TC Notation Representation Particularities Evaluation Bayesian networks Definition Storage problems Models Micro BEP Macr BEP Av. prec. Micro F1@5 Macro F1@5 Canonical models BN, 0.8, 0 0.26244 0.20394 0.29967 0.30811 0.17661 BN, 0.9, 0 0.28241 0.20234 0.30700 0.31419 0.18419 OR Gate BN, 0.8, 0.8 0.26068 0.21208 0.30500 0.30845 0.17521 Models BN, 0.9, 0.9 0.26881 0.20903 0.31321 0.31473 0.18433 Experiments BN, 0.9, 1.0 0.26636 0.20880 0.31261 0.31381 0.18265 Thesaurus BN, 1.0, 1.0 0.25584 0.20768 0.27870 0.30963 0.18865 Definitions VSM 0.15127 0.18772 0.18061 0.20839 0.17016 Unsupervised model HVSM 0.13326 0.17579 0.17151 0.20052 0.14587 Supervised model Experiments Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.38
  • 39. Automatic Indexing From a Thesaurus Using BayesianNetworks: Experiments IIISupervised experiments TC Notation Representation Particularities Models Micro BEP Macr BEP Av. prec. Micro F1@5 Macro F1@5 Evaluation Naïve Bayes 0.42924 0.17787 0.61840 0.50050 0.20322 Bayesian networks Rocchio 0.34158 0.35796 0.43516 0.40527 0.33980 Definition OR gate 0.40338 0.44855 0.56236 0.41367 0.24629 Storage problems SVM 0.63972 0.47890 0.69695 0.57268 0.40841 Canonical models SBN 0.0, 0.9 0.54825 0.43361 0.66834 0.54066 0.33414 SBN 0.0, 0.8 0.55191 0.43388 0.67149 0.54294 0.33781 OR Gate SBN 0.0, 0.5 0.55617 0.43269 0.67571 0.54578 0.34088 Models SBN 0.0, 0.1 0.55735 0.43282 0.67761 0.54652 0.34228 Experiments SBN 0.9, 0.0 0.55294 0.47207 0.65998 0.56940 0.36761 Thesaurus SBN 0.8, 0.0 0.57936 0.47820 0.68185 0.58163 0.38589 Definitions SBN 0.5, 0.0 0.58372 0.48497 0.70176 0.57875 0.38009 Unsupervised model SBN 0.1, 0.0 0.56229 0.46171 0.68715 0.55390 0.35123 Supervised model SBN 0.8, 0.1 0.57887 0.47809 0.68187 0.58144 0.38610 Experiments SBN 0.5, 0.1 0.58343 0.48487 0.70197 0.57887 0.38146 SBN 0.5, 0.5 0.58285 0.48716 0.70096 0.57859 0.37868 Structured SBN 0.8, 0.8 0.56801 0.47946 0.67358 0.57508 0.37300 Structured documents SBN 0.9, 0.9 0.53963 0.47200 0.64957 0.56278 0.35742 Transformations SBN 1.0, 1.0 0.49084 0.45875 0.59042 0.53235 0.32173 Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.39
  • 40. Automatic Indexing From a Thesaurus Using BayesianNetworks: Experiments IVSupervised experiments: Micro Recall for incremental number ofcategories TC Notation Representation Naive Bayes Particularities Standalone OR-Gate Evaluation 0.9 Rocchio SBN (conf 1) SBN (conf 2) Bayesian networks SVM Definition Storage problems Canonical models 0.8 OR Gate Models Experiments 0.7 Thesaurus Definitions Unsupervised model Supervised model 0.6 Experiments Structured Structured documents Transformations 0.5 Experiments Link-based categorization Multiclass model Experiments Multilabel model 0.4 Experiments 5 10 15 20 Remarks PhD Dissertation.40
  • 41. Automatic Indexing From a Thesaurus Using BayesianNetworks: Experiments VSupervised experiments: Micro F1 at five computed for incrementalpercentage of training data TC Notation 0.7 Representation Naive Bayes Particularities Standalone OR-Gate Evaluation Rocchio Supervised BN (conf. 1) Supervised BN (conf. 2) Bayesian networks SVM Definition Storage problems 0.6 Canonical models OR Gate Models Experiments 0.5 Thesaurus Definitions Unsupervised model Supervised model Experiments 0.4 Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments 0.3 Multilabel model Experiments 10 20 30 40 50 60 70 80 90 100 Remarks PhD Dissertation.41
  • 42. Automatic Indexing From a Thesaurus Using BayesianNetworks Conclusions TC Notation Representation • A BN-based model for document classification Particularities Evaluation indexed from a thesaurus. Bayesian networks • Very good results in unsupervised (300% VSM Definition Storage problems results). Canonical models OR Gate • Outstanding results in supervised (above SVM). Models Experiments Thesaurus Definitions Future work Unsupervised model Supervised model Experiments • Consider associative relationships. Structured Structured documents • Take context into account. Transformations Experiments Link-based categorization • Test on other thesauri (MeSH, Agrovoc), build testing Multiclass model Experiments corpora. Multilabel model Experiments Remarks PhD Dissertation.42
  • 43. Outline 1 Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Definition Storage problems 3 An OR Gate-Based Text Classifier Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Definitions Unsupervised model Supervised model Experiments ⇒ Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.43
  • 44. Structured Document Categorization Using BayesianNetworksIntroduction TC Notation What is structure in document collections? Representation Particularities Evaluation • Internal structure (inside each document): XML Bayesian networks (“structured”) documents . Definition Storage problems Canonical models • External structure (outside documents, graph in the OR Gate collection): linked-based collections. Models Experiments Outline of this part: Thesaurus Definitions 1 Contributions in Structured TC (XML). Unsupervised model Supervised model Experiments 2 A model for link-based document categorization Structured (multiclass). Structured documents Transformations 3 A model for link-based document categorization Experiments Link-based categorization (multilabel). Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.44
  • 45. Structured Document Categorization Using BayesianNetworks IStructured Documents TC Notation Representation Particularities Evaluation Bayesian networks Definition Scientific paper Storage problems Canonical models Title OR Gate Models Bibliography Experiments Section Thesaurus Abstract Section Section Subsection Definitions Unsupervised model Section Title Subsection Section Title Supervised model Experiments Section Title Subsection Subsection Subsection Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.45
  • 46. Structured Document Categorization Using BayesianNetworks IITransformations I TC Notation Representation Particularities Evaluation Structured TC: same as “plain” TC, corpora of structured Bayesian networks documents. Definition Storage problems Canonical models OR Gate Models Experiments Thesaurus Definitions Our approach Unsupervised model Supervised model Convert with transformations structured documents to Experiments Structured plain documents, and test plain classifiers on them. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.46
  • 47. Structured Document Categorization Using BayesianNetworks IIITransformations II TC Notation <book> Representation Particularities <title>El ingenioso hidalgo Don Quijote Evaluation Bayesian networks de la Mancha</title> Definition Storage problems <author>Miguel de Cervantes Saavedra Canonical models </author><contents> OR Gate Models <chapter>Uno</chapter> Experiments <text>En un lugar de La Mancha de Thesaurus Definitions cuyo nombre no quiero acordarme... Unsupervised model Supervised model </text> </contents> Experiments </book> Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Figure: “Quijote”, XML Fragment used for examples, with Experiments header removed. Multilabel model Experiments Remarks PhD Dissertation.47
  • 48. Structured Document Categorization Using BayesianNetworks IIITransformations III TC Notation Representation Particularities Evaluation Bayesian networks Definition El ingenioso hidalgo Don Quijote de la Storage problems Canonical models Mancha Miguel de Cervantes Saavedra Uno OR Gate En un lugar de La Mancha de cuyo nombre Models Experiments no quiero acordarme... Thesaurus Definitions Unsupervised model Supervised model Experiments Figure: “Quijote”, with “only text” approach. Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.48
  • 49. Structured Document Categorization Using BayesianNetworks IIITransformations IV TC Notation Representation Particularities Evaluation title_El title_ingenioso title_hidalgo Bayesian networks title_Don title_Quijote title_de Definition Storage problems title_la title_Mancha author_Miguel Canonical models author_de author_Cervantes OR Gate Models author_Saavedra chapter_Uno text_En Experiments text_un text_lugar text_de text_La Thesaurus Definitions text_Mancha text_de text_cuyo text_nombre Unsupervised model Supervised model text_no text_quiero text_acordarme... Experiments Structured Structured documents Transformations Experiments Figure: “Quijote”, with “tagging_1”. Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.49
  • 50. Structured Document Categorization Using BayesianNetworks IIITransformations V TC Our contribution Notation Representation Particularities Evaluation title: 1, author: 0, chapter: 0, text: 2 Bayesian networks Definition Storage problems Canonical models El ingenioso hidalgo Don Quijote de OR Gate Models la Mancha En En un un lugar lugar de Experiments de La La Mancha Mancha de de cuyo cuyo Thesaurus Definitions nombre nombre no no quiero quiero Unsupervised model Supervised model acordarme acordarme... Experiments Structured Structured documents Transformations Experiments Figure: “Quijote”, with “replication” method, using values Link-based categorization proposed before. Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.50
  • 51. Structured Document Categorization Using BayesianNetworks IIIExperimentation TC • Experiments on the INEX 2007 XML dataset. Notation Representation Particularities • 96611 documents, 21 categories, 50% training/test Evaluation split. Bayesian networks Definition • Replication improves macro measures on Naïve Storage problems Canonical models Bayes a lot. OR Gate Models • Other transformations are not useful here. Experiments Thesaurus Definitions Unsupervised model Method Reduction Selection? µBEP MBEP µF1 MF1 Supervised model Naïve Bayes Only text no 0.76160 0.58608 0.78139 0.64324 Experiments Naïve Bayes Only text ≥ 2 docs. 0.72269 0.67379 0.77576 0.69309 Naïve Bayes Only text ≥ 3 docs. 0.69753 0.67467 0.76191 0.68856 Structured Naïve Bayes Repl. (id=2) None 0.76005 0.64491 0.78233 0.66635 Structured documents Naïve Bayes Repl. (id=2) ≥ 2 docs. 0.71270 0.68386 0.61321 0.73780 Transformations Experiments Naïve Bayes Repl. (id=2) ≥ 3 docs. 0.70916 0.68793 0.73270 0.65697 Link-based categorization Naïve Bayes Repl. (id=3) None 0.75809 0.67327 0.77622 0.67101 Multiclass model Naïve Bayes Repl. (id=4) None 0.75921 0.69176 0.76968 0.67013 Experiments Naïve Bayes Repl. (id=5) None 0.75976 0.70045 0.76216 0.66412 Multilabel model Naïve Bayes Repl. (id=8) None 0.74406 0.69865 0.72728 0.61602 Experiments Naïve Bayes Repl. (id=11) None 0.72722 0.67965 0.71422 0.60451 Remarks PhD Dissertation.51
  • 52. Structured Document Categorization Using BayesianNetworks III TC Notation Representation Particularities Conclusions Evaluation Bayesian networks Definition • Several XML transformation (one original). Storage problems Canonical models • Good results with “replication” + NB. OR Gate Models Experiments Thesaurus Future work Definitions Unsupervised model Supervised model • More extensive experimentation. Experiments Structured • New transformations. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.52
  • 53. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization A set of documents with a graph structure among them. TC Notation The goal is to label a document using both its content Representation Particularities and the graph structure (labels of the neighbors?). Evaluation Bayesian networks Definition Storage problems Canonical models OR Gate Models Experiments Thesaurus Definitions Unsupervised model Supervised model Experiments Structured Structured documents Transformations Experiments Link-based categorization Multiclass model ? Experiments Multilabel model Experiments Remarks PhD Dissertation.53
  • 54. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization TC Typically, scatterplots like this: Notation Representation Particularities Evaluation Bayesian networks 0.12 Definition Storage problems 0.1 Canonical models 0.08 OR Gate Models Experiments 0.06 Thesaurus 0.04 Definitions Unsupervised model 0.02 Supervised model Experiments 0 Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Encyclopedia regularity (a document of category Ci Experiments Multilabel model tends to links documents on the same category). Experiments Remarks PhD Dissertation.54
  • 55. Structured Document Categorization Using BayesianNetworks IVlink-based categorization: multiclass I TC Document d0 , linked to documents d1 , . . . , dm . Notation Representation Particularities Evaluation Random variables C0 , C1 , . . . , Cm , in {c0 , c1 , . . . , cn }. Bayesian networks Definition Storage problems Variables ei , evidence of the classification (content) of Canonical models OR Gate document di . Models Experiments Thesaurus Given the true class of the document to classify Definitions Unsupervised model (independences): Supervised model Experiments 1 the categories of the linked documents are Structured independent among each other, and Structured documents Transformations Experiments 2 the evidence about this category due to the Link-based categorization Multiclass model document content is independent of the original Experiments Multilabel model category of the document we want to classify. Experiments Remarks PhD Dissertation.55
  • 56. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization: multiclass II TC Notation e1 ... ... em Representation Particularities Evaluation Bayesian networks Definition Storage problems ... ... Canonical models C1 Cm OR Gate Models Experiments Thesaurus Definitions Unsupervised model Supervised model Experiments C0 Structured Structured documents Transformations Experiments Link-based categorization Multiclass model e0 Experiments Multilabel model Experiments Remarks PhD Dissertation.56
  • 57. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization: multiclass III TC Notation With some computation: Representation Particularities 0 1 Evaluation m ” p(C = c |e ) i j i C “ Bayesian networks Y X p(C0 = c0 |e) ∝ p(C0 = c0 |e0 ) p Ci = cj |C0 = c0 B @ A i=1 cj ={c0 ,...,cn } p(Ci = cj ) Definition Storage problems Canonical models Where: OR Gate Models • p(C0 = c0 |e) final evidence that the document Experiments Thesaurus belongs to C0 . Definitions Unsupervised model • p(Ci = cj |ei ) obtained with a “local” (content) Supervised model Experiments classifier (NB). Structured • p(Ci = ci ) (prior) and p(Ci = ci |C0 = c0 ) (probability Structured documents Transformations a document of Ci links another of C0 ), obtained Experiments Link-based categorization from training data. Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.57
  • 58. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization: multiclass IV TC Notation Representation Experiments: INEX 2008 corpus: Particularities Evaluation Bayesian networks • A classical Naïve Bayes algorithm on the flat text Definition Storage problems documents obtained 0.67674 of recall. Canonical models • Our proposal using the previous Naïve Bayes as the OR Gate Models base classifier obtained 0.6787 of recall (using Experiments Thesaurus outlinks). Definitions Unsupervised model • Our model (inlinks): 0.67894 of recall. Supervised model Experiments • Our model (neighbours): 0.68273 of recall. Structured Structured documents The model works better in a “ideal environment” (knowing Transformations Experiments the labels of all neighbors). Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.58
  • 59. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization: multiclass V TC Notation Representation Particularities Conclusions Evaluation Bayesian networks • A new model for classification of multiclass linked Definition Storage problems documents, based on BNs. Canonical models OR Gate • Good performance in an ideal environment. Models Experiments Thesaurus Definitions Future work Unsupervised model Supervised model Experiments • Use a base classifier (probabilistic) with a better Structured performance (Logistic? SVM with probabilistic Structured documents Transformations outputs?). Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.59
  • 60. Structured Document Categorization Using BayesianNetworks VLinked-document categorization: multilabel I TC • Previous model was not flexible. Structure of BN Notation Representation imposed. Particularities Evaluation • We learn the interactions among categories from Bayesian networks Definition data, no fixed structure, but any which is learnt Storage problems Canonical models from the set of categories. OR Gate Models • Variables: categories Ci (one for category), Experiments categories of incoming links Ej (one for category) Thesaurus Definitions and terms Tk (many). Unsupervised model Supervised model • We will search for p(ci |ej , dj ). Experiments Structured • Main assumption: Structured documents Transformations Experiments p(dj , ej |ci ) = p(dj |ci ) p(ej |ci ). Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.60
  • 61. Structured Document Categorization Using BayesianNetworks VLinked-document categorization: multilabel II TC Notation Representation With a few computations: Particularities Evaluation Bayesian networks p(ci |dj ) p(ci |ej ) / p(ci ) Definition p(ci |dj , ej ) = Storage problems p(ci |dj )p(ci |ej )/p(ci ) + p(c i |dj )p(c i |ej )/p(c i ) Canonical models OR Gate Models Experiments • p(ci |dj ): output of a probabilistic classifier. Any Thesaurus probabilistic classifier. Definitions Unsupervised model Supervised model Experiments • p(ci |ej ): probability of being of Ci considering the set Structured Structured documents of the categories of the incoming (known) links. This Transformations Experiments is modeled by the BN. Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.61
  • 62. Structured Document Categorization Using BayesianNetworks VLinked-document categorization: multilabel III TC Experimentation INEX 2009 corpus: 54572 documents, Notation Representation test/train split of a 20/80%. 39 categories. Particularities Evaluation Bayesian networks Measures Accuracy (ACC), Area under Roc curve Definition Storage problems (ROC), F1 measure (PRF) and Avg prec on 11 std Canonical models OR Gate (MAP). Models Experiments • Learning Bayesian Network, using WEKA package. Thesaurus • Hillclimbing algorithm (easy and fast) + BDeu Definitions Unsupervised model metric (3 parents max. per node). Supervised model Experiments • Propagation, using Elvira Structured Structured documents • Compute p(ci ) (once), and p(ci |ej ) (for each Transformations Experiments document j). Exact propagation is slow for so many Link-based categorization categories! ⇒ Importance Sampling algorithm Multiclass model Experiments (approximate). Multilabel model Experiments Remarks PhD Dissertation.62
  • 63. Structured Document Categorization Using BayesianNetworks VLinked-document categorization: multilabel IV TC Notation Representation Particularities Evaluation Bayesian networks Results Definition Storage problems Canonical models MACC µACC MROC µROC MPRF µPRF MAP OR Gate N. Bayes 0.95142 0.93284 0.80260 0.81992 0.49613 0.52670 0.64097 Models N. Bayes + BN 0.95235 0.93386 0.80209 0.81974 0.50015 0.53029 0.64235 Experiments OR gate 0.92932 0.92612 0.92526 0.92163 0.45966 0.50407 0.72955 Thesaurus OR gate + BN 0.96607 0.95588 0.92810 0.92739 0.51729 0.55116 0.72508 Definitions Unsupervised model Supervised model Our method clearly improves both baselines. Experiments Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.63
  • 64. Structured Document Categorization Using BayesianNetworks IVLinked-document categorization: multilabel V TC Notation Conclusions Representation Particularities Evaluation • A new model for classification of multilabel linked Bayesian networks Definition documents, based on BNs. Storage problems Canonical models • Very flexible. OR Gate Models • Any learning procedure is usable. Experiments Thesaurus • Very promising results Definitions Unsupervised model Supervised model Experiments Future work Structured Structured documents Transformations • Use different baselines. Experiments Link-based categorization • More extensive experimentation. Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.64
  • 65. Outline 1 Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Definition Storage problems 3 An OR Gate-Based Text Classifier Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Definitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments ⇒ Final Remarks Multilabel model Experiments Remarks PhD Dissertation.65
  • 66. Final RemarksSummaryMost relevant contributions TC Notation Representation • A new text classifier (OR gate), better than NB. Particularities Evaluation Bayesian networks Definition Storage problems • Definition of a new problem (thesaurus indexing). Canonical models Two models, outstanding results. OR Gate Models Experiments Thesaurus • Some minor contributions in XML classification Definitions Unsupervised model (“text replication”). Supervised model Experiments Structured Structured documents • Two models of link-based document Transformations Experiments categorization. Promising results in the Multilabel Link-based categorization Multiclass model one. Experiments Multilabel model Experiments Remarks PhD Dissertation.66
  • 67. Final Remarks IIList of Publications supporting this work: TC Notation Representation Particularities Evaluation In the thesis: Bayesian networks Definition See pages 170-173 Storage problems Canonical models OR Gate Models or... Experiments Thesaurus Definitions In the www: Unsupervised model Supervised model Experiments Visit http://decsai.ugr.es/~aeromero Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.67
  • 68. Final Remarks IIISoftware: • DauroLab. A toolbox for Machine Learning. Written TC Notation in Java (by me!). Representation Particularities • Free (libre) software (GPL v3). Evaluation Bayesian networks • http://sf.net/projects/daurolab. Definition Storage problems Canonical models OR Gate Models Experiments Thesaurus Definitions Unsupervised model Supervised model Experiments Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.68
  • 69. Thank you for your attention