Document classification models based on Bayesian networks

2,252 views
2,097 views

Published on

Slides of my PhD thesis, presented in April 2010.

Published in: Education, Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
2,252
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
57
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Document classification models based on Bayesian networks

  1. 1. PhD DissertationDocument Classification Models TC Notation Representationbased on Bayesian networks Particularities EvaluationAlfonso E. Romero Bayesian networks Definition Storage problems Canonical modelsApril 27, 2010 OR Gate Models Experiments Thesaurus Definitions Unsupervised model Supervised model Experiments Structured Structured documents Advisors: Luis M. de Campos, and Transformations Experiments Juan M. Fernández-Luna Link-based categorization Multiclass model Department of Computer Science and A.I. Experiments Multilabel model University of Granada Experiments Remarks PhD Dissertation.1
  2. 2. A brief overview: the problem I We shall solve problems in document categorization... TC Notation Representation Particularities Evaluation New HSL New HSL Madrid-Valencia Madrid-Valencia Bayesian networks to open in December, 2010 ? to open in December, 2010 Definition Storage problems Canonical models National OR Gate Models Experiments Thesaurus Definitions Unsupervised model Supervised model Experiments Structured Structured documents Sports Society World National Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.2
  3. 3. A brief overview: the problem II ... in particular, automatic document categorization... TC Notation Representation Particularities Evaluation New HSL New HSL Madrid-Valencia Madrid-Valencia Bayesian networks to open in December, 2010 ? to open in December, 2010 Definition Storage problems Canonical models National OR Gate Models Experiments Thesaurus Definitions Unsupervised model Supervised model Experiments Structured Structured documents Sports Society World National Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.3
  4. 4. A brief overview: the problem III ... concretely, supervised document categorization. TC Notation Representation Particularities The Alhambra of Learning procedure Evaluation Granada, most visited National Bayesian networks monument in 2009 Definition Storage problems Canonical models Prime minister OR Gate Zapatero to give a Algorithm (classifier) Models speech on the National Experiments Parliament Thesaurus Definitions Unsupervised model AVE train to run at Supervised model 350 km/h in 2010 Experiments from Madrid to National Structured Barcelona Structured documents Transformations Experiments Labeled corpus Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.4
  5. 5. A brief overview: the methods I Which learning method shall we use? TC Notation Representation Neural networks Particularities Evaluation Support Vector Machines Bayesian networks Definition Storage problems k-NN�methods Canonical models OR Gate Bayesian networks Models Experiments and probabilistic methods Thesaurus Definitions Unsupervised model Decision trees Supervised model Experiments Structured Evolutive algorithms Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.5
  6. 6. A brief overview: the methods II The answer: TC Notation Representation Neural networks Particularities Evaluation Support Vector Machines Bayesian networks Definition Storage problems k-NN�methods Canonical models OR Gate Bayesian networks Models Experiments and probabilistic methods Thesaurus Definitions Unsupervised model Decision trees Supervised model Experiments Evolutive algorithms Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.6
  7. 7. A brief overview: the methods III TC But, why? Notation Representation Particularities Evaluation Bayesian networks • Strong theoretical foundation (probability theory). Definition Storage problems Canonical models OR Gate • Models for (uncertain) knowledge representation. Models Experiments Thesaurus Definitions • Great success in related tasks (IR). Unsupervised model Supervised model Experiments Structured • Our background at the group UTAI. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.7
  8. 8. Outline 1 Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Definition Storage problems 3 An OR Gate-Based Text Classifier Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Definitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.8
  9. 9. Outline ⇒ Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Definition Storage problems 3 An OR Gate-Based Text Classifier Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Definitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.9
  10. 10. Supervised Text Categorization I Provided TC Notation 1 Set of labeled documents DTr (training). Representation Particularities 2 C, set of categories/labels. Evaluation Bayesian networks The goal is to build a model f (classifier) capable of Definition Storage problems predicting categories (of C) of documents in D. Canonical models OR Gate Models Different kinds of labeling: Experiments Thesaurus • f : D → {c, c} (binary). Definitions Unsupervised model • f : D → {c1 , c2 , . . . , cn } (multiclass). Supervised model Experiments • f : D × C → {0, 1} (multilabel). Structured Structured documents Transformations A multilabel problem reduces to |C| binary problems Experiments Link-based categorization C = {c, c}. We often change the codomain from {0, 1} Multiclass model Experiments (hard classification) to [0, 1] (soft classification). Multilabel model Experiments Remarks PhD Dissertation.10
  11. 11. Supervised Text Categorization IIDocument representation: As in Information Retrieval TC Stopwords removal + stemming + Vector Notation representation (Frequency, binary or tf-idf). Representation Particularities Evaluation From (preprocessed) document to vector Bayesian networks Definition Storage problems Canonical models term ⇔ dimension OR Gate Models Experiments Thesaurus Definitions Unsupervised model Example (beginning of John Miltons "Lost Paradise"): Supervised model Experiments Of Mans First Disobedience, and the Fruit Of Mans First Disobedience, and the Fruit Structured Of that Forbidden Tree, whose mortal tast Of that Forbidden Tree, whose mortal tast Structured documents Brought Death into the World, and all our woe, Brought Death into the World, and all our woe, With loss of Eden, till one greater Man... With loss of Eden, till one greater Man... Transformations Experiments Link-based categorization Multiclass model 2 1 1 1 1 1 1 1 1 1 1 1 1 1 Experiments man obedience fruit forbid tree mortal tast bring death world woe loss eden great Multilabel model Experiments Remarks PhD Dissertation.11
  12. 12. Supervised Text Categorization IIIWhat are the particularities of the problem? It differs from a “classic” Machine Learning problem in: TC Notation Representation Particularities Evaluation • High dimensionality (easily > 10000). Bayesian networks Definition Storage problems Canonical models • Very unbalanced datasets. OR Gate Models Experiments • |C| 0. Thesaurus Definitions Unsupervised model Supervised model Experiments • Sometimes, there is a hierarchy in the set C. Structured Structured documents Transformations Experiments • Sometimes, explicit relationships among Link-based categorization Multiclass model documents are given. Experiments Multilabel model Experiments Remarks PhD Dissertation.12
  13. 13. Supervised Text Categorization IVEvaluation How to measure the correctness of documents TC assigned to set of categories? Notation Representation • Binary/multiclass: Particularities Evaluation TP Bayesian networks • Hard categorization Precision: TP+FP and Recall: Definition Storage problems TP 2PR TP+FN , F1 : P+R . Canonical models OR Gate • Soft categorization Precision/Recall BEP. Models Experiments Thesaurus Definitions • Multilabel: micro and macro averages. Unsupervised model Supervised model Experiments Structured • Also, average precision on the 11 std. recall points Structured documents Transformations (category ranking). Experiments Link-based categorization Multiclass model Experiments Multilabel model • Standard corpora: Reuters, Ohsumed, 20 NG. . . Experiments Remarks PhD Dissertation.13
  14. 14. Outline 1 Text Categorization TC Notation Representation ⇒ Bayesian networks Particularities Evaluation Bayesian networks Definition Storage problems 3 An OR Gate-Based Text Classifier Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Definitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.14
  15. 15. Bayesian networks IDefinition and characteristics A set of random variables X1 , . . . , XN in a DAG, verifying P(X1 , . . . , Xn ) = n P(Xi |Pa(Xi )) ⇒ the graph i=1 TC Notation represents independences. Representation Particularities Evaluation Causal interpretation. Bayesian networks Definition Storage problems Canonical models Learning and inference methods available. OR Gate Models Experiments Thesaurus Definitions Unsupervised model Supervised model Experiments Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.15
  16. 16. Bayesian networks IIIEstimation and storage problems TC The problem Notation Representation Particularities • One value for each configuration of the parents. Evaluation Bayesian networks • General case: exponential number of parameters Definition Storage problems on the number of parents. Canonical models OR Gate Models Experiments The solution Thesaurus Definitions 1 Few parents per node (not realistic in text). Unsupervised model Supervised model Experiments 2 Write the probability of a node as a deterministic Structured function of the configuration (canonical model). Set Structured documents Transformations of parameters with linear size on the number of Experiments Link-based categorization parents. Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.16
  17. 17. Bayesian networks: canonical modelsComponents and examples of canonical models • X = {Xi }: parents (causes), Y : child (effect). TC Notation Representation • Xi in {xi , xi }, Yi in {yi , yi } (occurrence or not). Particularities Evaluation Bayesian networks 1 Noisy-OR gate model: Definition p(y |X) = 1 − Xi ∈R(x) (1 − wOR (Xi , Y )). Storage problems Canonical models 2 Additive model: p(y |X) = Xi ∈R(x) wadd (Xi , Y ). OR Gate Models Experiments Thesaurus ... Definitions X1 X2 XL Unsupervised model Supervised model Experiments Structured Structured documents Transformations Experiments Y Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.17
  18. 18. Outline 1 Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Definition ⇒ An OR Gate-Based Text Classifier Storage problems Canonical models OR Gate Models Experiments 4 Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Definitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.18
  19. 19. An OR Gate-Based Text ClassifierWhy using OR gates? • The OR gate is a simple model (and fast for TC Notation inference). Representation Particularities • It has gained great success in knowledge Evaluation Bayesian networks representation. Definition Storage problems • Discriminative classifier (models directly p(ci |dj )) Canonical models (NB generative). OR Gate Models • Seems natural the term are causes and category Experiments Thesaurus the effect. Definitions Unsupervised model Supervised model Experiments C C Structured Structured documents Transformations Experiments Link-based categorization T1 T2 ... Tn T1 T2 ... Tn Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.19
  20. 20. An OR Gate-Based Text ClassifierThe model Equations for the OR gate classifier TC Notation Probability distributions: Representation Particularities Evaluation pi (ci |pa(Ci )) = 1 − (1 − w(Tk , Ci )) , Bayesian networks Definition Tk ∈R(pa(Ci )) Storage problems Canonical models pi (c i |pa(Ci )) = 1 − pi (ci |pa(Ci )). OR Gate Models Experiments Inference: Thesaurus Definitions Unsupervised model pi (ci |dj ) = 1 − 1 − w(Tk , Ci ) p(tk |dj ) Supervised model Experiments Tk ∈Pa(Ci ) Structured Structured documents = 1− (1 − w(Tk , Ci )) . Transformations Experiments Tk ∈Pa(Ci )∩dj Link-based categorization Multiclass model Experiments Multilabel model Experiments Model characterized by the w(Tk , Ci ) formula. Remarks PhD Dissertation.20
  21. 21. An OR Gate-Based Text ClassifierWeight estimation formulae Weight w(Tk , Ci ) means TC Notation Representation ˆ pi (ci |tk , t h ∀Th ∈ Pa(Ci ), Th = Tk ). Particularities Evaluation Bayesian networks ˆ Nik +1 1 ML: w(Tk , Ci ) as p(ci |tk ), w(Tk , Ci ) = N•k +2 (using Definition Storage problems Laplace). Canonical models OR Gate Models Experiments 2 TI: assuming term independence, given the Thesaurus (Ni• −N category, w(Tk , Ci ) = ntNik•k h=k (N−N•hih )N . Definitions Unsupervised model iN )Ni• Supervised model Experiments Structured Notation: N number of words in the training documents. Structured documents Transformations N•k times that term tk appears in training documents Experiments Link-based categorization (N•k = ci Nik ), Ni• is the total number of words in Multiclass model Experiments documents of class ci (Ni• = tk Nik ), M vocabulary size. Multilabel model Experiments Remarks PhD Dissertation.21
  22. 22. An OR Gate-Based Text ClassifierPruning independent terms TC Notation Representation Particularities • The model can be improved pruning terms which Evaluation are independent with the category. Bayesian networks Definition Storage problems Canonical models • We run an independence test for each pair OR Gate Models term/category, at a certain confidence level. Only Experiments terms which pass it are kept. Thesaurus Definitions Unsupervised model Supervised model Experiments • The size of the parent set is highly reduced, but Structured classification is often improved. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.22
  23. 23. An OR Gate-Based Text ClassifierExperiments I: Experimental setting TC Notation Representation Particularities • We compare a Multinomial NB, a Rocchio, OR TI, Evaluation OR ML, and both OR with pruning at levels Bayesian networks Definition {0.9, 0.99, 0.999}. Storage problems Canonical models OR Gate Models • We made experiments on Ohsumed-23, Reuters Experiments and 20 NG. Thesaurus Definitions Unsupervised model Supervised model Experiments • Soft categorization, evaluated with macro/micro Structured BEP, 11 avg std prec, and macro/micro F1 @{1, 3, 5}. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.23
  24. 24. An OR Gate-Based Text ClassifierExperiments II: Experimental setting TC Notation Representation Results (# of wins on 9 measures): Particularities Evaluation Bayesian networks Definition • Reuters: OR-TI-0.999 (7), OR-TI (1), OR-ML-0.999 Storage problems Canonical models (1). OR Gate Models Experiments Thesaurus • Ohsumed: OR-TI-0.999 (8), OR-ML-0.99 (1). Definitions Unsupervised model Supervised model Experiments • 20 NG: OR-TI (2), OR-ML (2), OR-TI-0.999 (1), NB Structured (4). Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.24
  25. 25. An OR Gate-Based Text Classifier Conclusions TC Notation Representation • A new text categorization model, based on noisy OR Particularities Evaluation gates. Bayesian networks Definition • Simple and computationally affordable. Storage problems Canonical models • Results improves Naïve Bayes noticeably. OR Gate Models Experiments Thesaurus Future work Definitions Unsupervised model Supervised model • Use more advanced noisy OR models (leaky). Experiments Structured • Use another canonical models. Structured documents Transformations • Explore another alternatives for term pruning. Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.25
  26. 26. Outline 1 Text Categorization TC Notation Representation Particularities 2 Bayesian networks Evaluation Bayesian networks Definition Storage problems 3 An OR Gate-Based Text Classifier Canonical models OR Gate Models Experiments ⇒ Automatic Indexing From a Thesaurus Using Thesaurus Bayesian Networks Definitions Unsupervised model Supervised model Experiments 5 Structured Document Categorization Using Bayesian Structured Structured documents Networks Transformations Experiments Link-based categorization Multiclass model Experiments 6 Final Remarks Multilabel model Experiments Remarks PhD Dissertation.26
  27. 27. Automatic Indexing From a Thesaurus Using BayesianNetworks: Thesauri IDefinitions TC A thesaurus Notation Representation A list of terms representing concepts, grouped those with Particularities Evaluation the same meaning, with hierarchical relationships among Bayesian networks them. Definition Storage problems Canonical models ND:psychiatric OR Gate ND:outpatient ND:clinic ND:hospital hospital ND:health ND:dispensary Models clinic care centre Experiments Thesaurus Definitions D:psychiatric D:medical D:medical ND:medical Unsupervised model institution institution centre service Supervised model Experiments Structured ND:health ND:health D:health Structured documents protection service Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model D:health Experiments policy Remarks PhD Dissertation.27
  28. 28. Automatic Indexing From a Thesaurus Using BayesianNetworks: Thesauri IIIndexing with a thesauri TC • Problem: associating descriptors (keywords) to Notation Representation documents (scientific, medical, legal,. . . ). Particularities Evaluation • Manually: expensive and time consuming work (due Bayesian networks to thousand of descriptors! [EUROVOC > 6000]). Definition Storage problems Besides: Canonical models OR Gate • How many descriptors should we assign? Models • Which descriptor should assign in the hierarchy? Experiments Thesaurus Definitions We propose an automatic solution Unsupervised model Supervised model Experiments 1 TC problem (categories ≡ descriptors). Structured Structured documents 2 Makes use of the meta-information of the thesaurus. Transformations Experiments Link-based categorization 3 Unsupervised and supervised case. Multiclass model Experiments 4 Based on BNs. Multilabel model Experiments Remarks PhD Dissertation.28
  29. 29. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case IFrom the thesaurus... TC Notation Representation Particularities Evaluation ND:psychiatric ND:outpatient ND:clinic ND:hospital hospital ND:health ND:dispensary clinic care centre Bayesian networks Definition Storage problems D:psychiatric D:medical D:medical ND:medical centre Canonical models institution institution service OR Gate Models ND:health ND:health D:health protection service Experiments Thesaurus Definitions Unsupervised model D:health Supervised model policy Experiments Structured Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.29
  30. 30. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case II... to the Bayesian network. TC policy protection service health psychiatric outpatient hospital clinic institution medical care centre dispensary Notation Representation Particularities Evaluation ND:psychiatric ND:outpatient ND:clinic ND:hospital hospital ND:health ND:dispensary clinic care centre Bayesian networks Definition Storage problems D:psychiatric D:medical D:medical ND:medical Canonical models institution institution centre service OR Gate Models ND:health ND:health D:health protection service Experiments Thesaurus Definitions Unsupervised model D:health Supervised model policy Experiments Structured Structured documents Transformations • Binary variables (relevant/not relevant), for each Experiments Link-based categorization term, descriptor and non descriptor. Multiclass model Experiments • Problem: information of different nature mixed in Multilabel model Experiments descriptor nodes! Remarks PhD Dissertation.30
  31. 31. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case IIIIntroducing concept and virtual nodes TC protection policy health service care dispensary centre medical outpatient clinic hospital institution psychiatric Notation Representation Particularities ND:health ND:health D:health ND:medical D:health ND:health ND:dispensary D:medical D:medical ND:outpatient ND:clinic ND:hospital D:psychiatric ND:psychiatric protection policy service service care centre centre institution clinic institution hospital Evaluation Bayesian networks E:health E:health E:medical E:medical E:psychiatric Definition policy service centre institution institution Storage problems Canonical models C:medical C:medical C:psychiatric centre institution institution OR Gate Models Experiments H:health service C:health Thesaurus service Definitions Unsupervised model H:health policy Supervised model Experiments C:health policy Structured Structured documents Transformations Experiments • New concept nodes, C. Link-based categorization Multiclass model • New Hierarchy nodes, HC . Experiments Multilabel model Experiments • New Equivalence nodes, EC . Remarks PhD Dissertation.31
  32. 32. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case IVProbability distributions We already have the structure, but we have to specify TC Notation the distributions on each node. Representation Particularities Many parents ⇒ canonical models. Evaluation Bayesian networks • D and ND nodes (SUM): Definition Storage problems p(d + |pa(D)) = T ∈R(pa(D)) w(T , D). Canonical models OR Gate • HC nodes (SUM): Models + Experiments p(hc |pa(HC )) = C ∈R(pa(HC )) w(C , HC ). Thesaurus • EC nodes (OR): Definitions Unsupervised model + p(ec |pa(EC )) = 1 − D∈R(pa(EC )) (1 − w(D, C)). Supervised model Experiments • C nodes (OR): Structured Structured documents Transformations + + 8 > 1 − (1 − w(EC , C))(1 − w(HC , C)) > if ec = ec , hc = hc Experiments > < w(E , C) + − Link-based categorization + C if ec = ec , hc = hc p(c |{ec , hc }) = − + Multiclass model > w(HC , C) > if ec = ec , hc = hc Experiments > − − 0 if ec = ec , hc = hc : Multilabel model Experiments Remarks PhD Dissertation.32
  33. 33. Automatic Indexing From a Thesaurus Using BayesianNetworks: Unsupervised case VInference TC Classification is seen as inference. Notation Representation Particularities • Given a document d, we set the term variables Evaluation T ∈ d to t + , t − otherwise. Bayesian networks Definition Storage problems Canonical models • Exact propagation is carried out: OR Gate Models • First, to descriptor nodes. Experiments Thesaurus Definitions • Then, to EC nodes. Unsupervised model Supervised model Experiments • Following a topological order, probabilities Structured + p(hc |pa(HC )) are computed after their parents Structured documents Transformations + p(c |pa(C)) are set. Experiments Link-based categorization Multiclass model Experiments • Final p(c + |pa(C)), ∀C values are returned. Multilabel model Experiments Remarks PhD Dissertation.33
  34. 34. Automatic Indexing From a Thesaurus Using BayesianNetworks: Supervised model IChanges from the unsupervised case TC Notation Representation Particularities Evaluation • Concept also receives information from labeled Bayesian networks Definition documents in the training set. Storage problems Canonical models OR Gate Models • We add a training node TC as new parent of the Experiments concept one. The node is an OR gate ML classifier. Thesaurus Definitions Unsupervised model Supervised model Experiments • Distributions of C nodes are modified Structured consequently. Structured documents Transformations Experiments Link-based categorization Multiclass model Experiments Multilabel model Experiments Remarks PhD Dissertation.34

×