Link-based document classification using Bayesian Networks
Transcript

  • 1. Link-based text classification using Bayesian networks. Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete, Andrés R. Masegosa, Alfonso E. Romero ({lci,jmfluna,jhg,andrew,aeromero}@decsai.ugr.es). Departamento de Ciencias de la Computación e Inteligencia Artificial, E.T.S.I. Informática y de Telecomunicación, CITIC-UGR, Universidad de Granada, 18071 Granada, Spain. INEX 2009 Workshop, Brisbane. Outline: Introduction; Our solution; The Bayesian network model; Results; Conclusions and future works.
  • 2. Our participation: Universidad de Granada at INEX 2009. This is the third year we participate in the XML Mining (classification) track. As on previous occasions, we are interested in Bayesian networks, and we have provided a new solution to this problem. Sorry, no Ad Hoc track this year.
  • 3.–4. The problem itself: a text (XML) categorization problem. Training/test corpus (same as previous years). Multilabel: more than one category per document (new this year!). Links among files (training and test) given in a matrix (same as 2008). Vectors of indexed terms (normalized tf-idf) provided. The eternal question: what about the XML?
  • 5. Our solution (2008). Encyclopedia regularity: a document of category Ci tends to link to documents of the same category (graphically verified on the training set). In 2008 we combined a flat-text classifier (Naïve Bayes) with a Bayesian network of fixed structure which modelled the interaction among categories, using learnt probabilities P(ci | cj). Results were modest (the worst of the three models, and improvements over our baseline were not significant).
  • 6. Our starting point (2009). We detected the same regularity on categories (no matrix plot this year). There may be a hidden hierarchy (for example Portal:Religion, Portal:Christianity and Portal:Catholicism). This year we learn the interactions among categories from data: no fixed structure, but any structure over the set of categories.
  • 7. Modeling link structure I. Variables: categories Ci (39), categories of incoming links Ej (39) and terms Tk (many). We assume there is a global probability distribution over all these variables, and we model it with a Bayesian network. Main assumption: the content of a document and the categories of the files that link to it are independent given the category. Symbolically: p(dj, ej | ci) = p(dj | ci) p(ej | ci).
  • 8.–10. Modeling link structure II. We then search for the conditional probability p(ci | dj, ej). Applying Bayes' rule and the independence assumption:

    p(ci | dj, ej) = p(dj, ej | ci) p(ci) / p(dj, ej)
                   = p(dj | ci) p(ej | ci) p(ci) / p(dj, ej)
                   = [p(ci | dj) p(dj) / p(ci)] [p(ci | ej) p(ej) / p(ci)] p(ci) / p(dj, ej)
                   = [p(dj) p(ej) / p(dj, ej)] · p(ci | dj) p(ci | ej) / p(ci).

    Hence p(ci | dj, ej) ∝ p(ci | dj) p(ci | ej) / p(ci), and normalising over ci and its complement c̄i:

    p(ci | dj, ej) = [p(ci | dj) p(ci | ej) / p(ci)] / [p(ci | dj) p(ci | ej) / p(ci) + p(c̄i | dj) p(c̄i | ej) / p(c̄i)].
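The normalised combination above reduces to simple arithmetic on the two posteriors and the prior. A minimal sketch in Python (illustrative only, not the authors' code; the function name and example values are ours):

```python
def combine(p_c_d, p_c_e, p_c):
    """Combine a content-based posterior p(c|d) and a link-based
    posterior p(c|e) into p(c|d, e), assuming d and e are
    independent given the category."""
    # Unnormalised scores for the category and its complement.
    pos = p_c_d * p_c_e / p_c
    neg = (1.0 - p_c_d) * (1.0 - p_c_e) / (1.0 - p_c)
    return pos / (pos + neg)

# When both sources agree the category is likely, the combined
# posterior exceeds either individual estimate:
print(combine(0.7, 0.7, 0.5))  # ≈ 0.845
```

Note that when the two sources conflict symmetrically (e.g. 0.9 and 0.1 with a uniform prior), the combination falls back to 0.5, as one would expect.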
  • 11. Modeling link structure III. p(ci | dj): the output of a probabilistic classifier (any probabilistic classifier). p(ci | ej): the probability of belonging to Ci given the set of categories of the (known) incoming links; this is modeled by the Bayesian network. The problem then reduces to the following [see next slide].
  • 12. Modeling link structure IV. For each document we have a vector of 39+39 binary variables: 39 for the categories (1 if the document is of that category, 0 if not), and 39 more (1 if the document is linked to by documents of that category, 0 if not). With a learning algorithm, we learn a Bayesian network from these data. For each document to classify and each category Ci, we compute its content probability p(ci | dj) (with the base classifier) and the probability of belonging to Ci given the categories of known neighbours, p(ci | ej) (with the learnt Bayesian network). We combine them using the normalised equation above.
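The 39+39 binary training records can be built directly from the category labels and the link matrix. A hedged sketch (the helper name and the 6-category toy size are ours, not from the paper):

```python
def make_record(doc_cats, inlink_cats, n_cats=39):
    """One training record: n_cats bits for the document's own
    categories, then n_cats bits for the categories of the
    documents that link to it."""
    row = [0] * (2 * n_cats)
    for c in doc_cats:        # 1 if the document is of category c
        row[c] = 1
    for c in inlink_cats:     # 1 if some in-link comes from category c
        row[n_cats + c] = 1
    return row

# Toy example with 6 categories instead of 39:
record = make_record({0, 3}, {3, 5}, n_cats=6)
# record == [1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
```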
  • 13.–16. Learning link structure. We learn the Bayesian network using the WEKA package: hill-climbing algorithm (easy and fast), BDeu metric, three parents maximum per node. Propagation is done with Elvira (WEKA does not include propagation algorithms): we compute p(ci) once, and p(ci | ej) for each document j. Exact propagation was slow, so we use an (approximate) importance sampling algorithm.
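The idea behind the approximate inference step can be illustrated with likelihood weighting (a simple importance-sampling scheme) on a toy two-node network C → E. This is a sketch of the general technique only, not Elvira's actual algorithm, and the network and probabilities are invented for illustration:

```python
import random

def likelihood_weighting(p_c, p_e_given_c, e_obs, n=20000, seed=0):
    """Estimate P(C=1 | E=e_obs) in the network C -> E by sampling C
    from its prior and weighting each sample by the likelihood of
    the observed evidence E."""
    rng = random.Random(seed)
    weights = {0: 0.0, 1: 0.0}
    for _ in range(n):
        c = 1 if rng.random() < p_c else 0
        # Weight the sample by how well it explains the evidence.
        w = p_e_given_c[c] if e_obs else 1.0 - p_e_given_c[c]
        weights[c] += w
    return weights[1] / (weights[0] + weights[1])

# Exact answer here: 0.3*0.9 / (0.3*0.9 + 0.7*0.2) ≈ 0.659
est = likelihood_weighting(0.3, {0: 0.2, 1: 0.9}, e_obs=True)
```

With enough samples the estimate converges to the exact posterior while avoiding the cost of exact propagation.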
  • 17. Base classifiers. We have used the Multinomial Naïve Bayes (binary) and the Bayesian OR gate (a model presented by our group at INEX 2007). Both are extensively described in the paper (read it if you want to learn more about these two classifiers). Any other probabilistic classifier can be used to obtain p(ci | dj) first (any suggestions or preferences?).
  • 18.–19. Results.

    Initial results:

                     MACC     µACC     MROC     µROC     MPRF     µPRF     MAP
    N. Bayes         0.95142  0.93284  0.80260  0.81992  0.49613  0.52670  0.64097
    N. Bayes + BN    0.95235  0.93386  0.80209  0.81974  0.50015  0.53029  0.64235
    OR gate          0.75420  0.67806  0.92526  0.92163  0.25310  0.26268  0.72955
    OR gate + BN     0.84768  0.81891  0.92810  0.92739  0.31611  0.36036  0.72508

    Problem with the OR gate! The evaluation assumes dj ∈ Ci ⇔ p(ci | dj) > 0.5. This is not, in general, true for the OR gate, which needs some scaling procedure (like the SCut strategy).

    Scaled results (see the paper for details):

                     MACC     µACC     MROC     µROC     MPRF     µPRF     MAP
    OR gate          0.92932  0.92612  0.92526  0.92163  0.45966  0.50407  0.72955
    OR gate + BN     0.96607  0.95588  0.92810  0.92739  0.51729  0.55116  0.72508
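The scaling the OR gate needs can be done with a per-category decision threshold tuned on held-out data, as in the SCut strategy. A minimal sketch under our own simplifying assumptions (grid search, F1 as the target measure; not necessarily the exact procedure used in the paper):

```python
def scut_threshold(scores, labels):
    """For one category, pick the decision threshold that maximises
    F1 on validation data, instead of using a fixed 0.5 cut-off."""
    grid = [i / 100 for i in range(1, 100)]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# Perfectly separable validation scores: any threshold between the
# negatives and the positives achieves F1 = 1.
t = scut_threshold([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

Applying the chosen threshold per category restores the dj ∈ Ci ⇔ score > threshold correspondence that the evaluation assumes.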
  • 20.–24. Conclusions. The model is new, parametrizable (learning algorithm, parameters of the algorithm, base classifier, ...) and valuable by itself (it always improves its baseline). Using the Bayesian network on top of the OR gate provides a 10% improvement in some measures. Good results on ROC (ranked third). Future work: other base classifiers (SVM with probabilistic outputs, logistic regression, ...), and more experiments for the final version of the paper!
  • 25. Thank you for your attention! Questions, comments, criticism? <SPAM>Expecting to defend my PhD by April 2010; searching for a PostDoc position (in Europe) for 2010 on ML/IR-related topics. Any offers?</SPAM>
