                     Probabilistic Methods for Structured Document Classification at INEX'07

               Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete, Alfonso E. Romero

                    Departamento de Ciencias de la Computación e Inteligencia Artificial
                                        Universidad de Granada
                                   {lci,jmfluna,jhg,aeromero}@decsai.ugr.es


                INEX 2007, Dagstuhl (Germany) December 18, 2007
                                                       Abstract

               We present the results of our participation in the
               Document Mining track at INEX'07. We submitted several
               runs for this track (classification task only). This is the first year
               we participate, with relatively good results.
      Our aim




               Finding good XML-to-flat-document transformations...
               ...in order to apply probabilistic flat-text classifiers and
               to improve their results.
      The models: Multinomial Naive Bayes

               Extensively used in text classification (see the paper by
               McCallum).
               The posterior probability of a class is computed using the
               Bayes rule:

               $$p(c_i \mid d_j) = \frac{p(c_i)\, p(d_j \mid c_i)}{p(d_j)} \propto p(c_i)\, p(d_j \mid c_i) \qquad (1)$$

               The probabilities p(d_j | c_i) are supposed to follow a multinomial
               distribution:

               $$p(d_j \mid c_i) \propto \prod_{t_k \in d_j} p(t_k \mid c_i)^{n_{jk}} \qquad (2)$$

               The estimation of the needed values is carried out in the
               following way: $p(t_k \mid c_i) = \frac{N_{ik} + 1}{N_i + M}$ and $p(c_i) = \frac{N_{i,doc}}{N_{doc}}$.
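
               As an illustration only (not code from the paper), a minimal Python sketch of
               multinomial Naive Bayes with the Laplace-smoothed estimates above. It assumes
               tokenized documents with class labels and takes M to be the vocabulary size;
               names mirror the slide notation (N_ik, N_i, n_jk).

               import math
               from collections import Counter, defaultdict

               def train_multinomial_nb(docs, labels):
                   """docs: list of token lists; labels: parallel list of class ids."""
                   class_docs = Counter(labels)                 # N_{i,doc}
                   term_counts = defaultdict(Counter)           # N_{ik} per class
                   vocab = set()
                   for tokens, c in zip(docs, labels):
                       term_counts[c].update(tokens)
                       vocab.update(tokens)
                   M = len(vocab)                               # assumed: M = vocabulary size
                   model = {}
                   for c in class_docs:
                       N_i = sum(term_counts[c].values())       # total term occurrences in class i
                       model[c] = {
                           "prior": class_docs[c] / len(docs),  # p(c_i) = N_{i,doc} / N_doc
                           "cond": {t: (term_counts[c][t] + 1) / (N_i + M) for t in vocab},
                           "unseen": 1.0 / (N_i + M),           # smoothed value for unseen terms
                       }
                   return model

               def classify_multinomial_nb(model, tokens):
                   freqs = Counter(tokens)                      # n_{jk}
                   scores = {}
                   for c, m in model.items():
                       log_p = math.log(m["prior"])
                       for t, n_jk in freqs.items():
                           log_p += n_jk * math.log(m["cond"].get(t, m["unseen"]))
                       scores[c] = log_p                        # log of p(c_i) p(d_j | c_i)
                   return max(scores, key=scores.get)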
      The OR Gate Bayesian Network Classifier


               It tries to model rules of the following kind:
               IF (t_i ∈ d) ∨ (t_j ∈ d) ∨ ... THEN classify under c.
               The canonical model used is a noisy OR gate, which is
               related to the notion of causality:
               the appearance of certain terms is the cause of the
               assignment of a certain class.
               The general expression of the probability distributions is:
                       $$p(c_i \mid pa(C_i)) = 1 - \prod_{T_k \in R(pa(C_i))} \bigl(1 - w(T_k, C_i)\bigr)$$

                       $$p(\bar{c}_i \mid pa(C_i)) = 1 - p(c_i \mid pa(C_i))$$
      The OR Gate Bayesian Network Classifier (2)

               We instantiate all terms of a document: p(t_k | d_j) = 1 if
               t_k ∈ d_j, and 0 otherwise.
               We replicate each term node T_k according to its frequency n_jk in the
               document.
               The computation of the posterior probability is very easy
               (it depends only on the terms appearing in the document):

               $$p(c_i \mid d_j) = 1 - \prod_{T_k \in Pa(C_i) \cap d_j} \bigl(1 - w(T_k, C_i)\bigr)^{n_{jk}} \qquad (3)$$

               Two estimation schemes for the weights:
                       Maximum likelihood: $w(T_k, C_i) = \frac{N_{ik}}{N_k}$.
                       Better approximation: $w(T_k, C_i) = \frac{N_{ik}}{N_k} \times \prod_{h \neq k} \frac{(N_i - N_{ih})\, N}{(N - N_h)\, N_i}$.
               See paper for details!
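
               Again only as a hedged sketch (not the authors' implementation), Eq. (3) with
               maximum-likelihood weights in Python; N_ik and N_k are interpreted as document
               counts, and per-class term selection (Pa(C_i)) is approximated by simply
               skipping terms without an estimated weight.

               from collections import Counter

               def ml_or_gate_weights(docs, labels, target_class):
                   """Maximum-likelihood weights w(T_k, C_i) = N_ik / N_k (document counts)."""
                   N_k, N_ik = Counter(), Counter()
                   for tokens, c in zip(docs, labels):
                       for t in set(tokens):
                           N_k[t] += 1
                           if c == target_class:
                               N_ik[t] += 1
                   return {t: N_ik[t] / N_k[t] for t in N_k}

               def or_gate_posterior(weights, tokens):
                   """p(c_i | d_j) = 1 - prod over document terms of (1 - w(T_k, C_i))^{n_jk}."""
                   prob_no_cause = 1.0
                   for t, n_jk in Counter(tokens).items():
                       w = weights.get(t, 0.0)   # terms outside Pa(C_i) leave the product unchanged
                       prob_no_cause *= (1.0 - w) ** n_jk
                   return 1.0 - prob_no_cause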
      Document representation: example XML file


       <book>
         <title>El ingenioso hidalgo Don Quijote
           de la Mancha</title>
         <author>Miguel de Cervantes Saavedra</author>
         <contents>
           <chapter>Uno</chapter>
           <text>En un lugar de La Mancha de cuyo
             nombre no quiero acordarme...</text>
         </contents>
       </book>
      Document representation: “only text”

           1 Only text (removing all tags).

       Quijote
       El ingenioso hidalgo Don Quijote de la Mancha Miguel de
       Cervantes Saavedra Uno En un lugar de La Mancha de cuyo
       nombre no quiero acordarme...
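
       A possible implementation of this transformation (a sketch assuming well-formed
       XML input, not the code used for the track):

       import xml.etree.ElementTree as ET

       def only_text(xml_string):
           # Keep the concatenated text content of all elements and drop the markup.
           root = ET.fromstring(xml_string)
           return " ".join(" ".join(root.itertext()).split())

       Applied to the book example above, it yields the flat text shown in the box.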
      Document representation: “text replication”



           5 Text replication (terms are replicated a number of times
             that depends on the tag containing them).

       Replication values
       title = 1, author = 0, chapter = 0, text = 2

       Quijote
       El ingenioso hidalgo Don Quijote de la Mancha En En un un
       lugar lugar de de La La Mancha Mancha de de cuyo cuyo
       nombre nombre no no quiero quiero acordarme acordarme...
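
       A sketch of how such a replication could be implemented (illustrative only; the
       replication values actually used at INEX are those with id=2, 5 and 8 in the paper):

       import xml.etree.ElementTree as ET

       # Replication values from the example above (0 drops the element's text entirely).
       REPLICATION = {"title": 1, "author": 0, "chapter": 0, "text": 2}

       def replicate_text(xml_string, default=1):
           root = ET.fromstring(xml_string)
           out = []
           for elem in root.iter():
               times = REPLICATION.get(elem.tag, default)
               if elem.text:
                   for token in elem.text.split():
                       out.extend([token] * times)   # repeat each term as many times as its tag says
           return " ".join(out)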
      Results

               Five runs were submitted to the track:
        (1) Naive Bayes, only text, no term selection. Microaverage:
            0.77630. Macroaverage: 0.58536.
        (2) Naive Bayes, replication (id=2), no term selection.
            Microaverage: 0.78107 (+0.6%). Macroaverage: 0.6373
            (+8.9%).
        (3) OR gate, maximum likelihood, replication (id=8), selection
            by MI. Microaverage: 0.75097. Macroaverage: 0.61973.
        (4) OR gate, maximum likelihood, replication (id=5), selection
            by MI. Microaverage: 0.75354. Macroaverage: 0.61298.
        (5) OR gate, better approximation, only text, ≥ 2.
            Microaverage: 0.78998. Macroaverage: 0.76054.
               See paper for text replication values (id=2,5,8).
      Conclusions

       These conclusions were also reached in our previous experiments!
               Naive Bayes works poorly with scarcely populated categories
               (low macroaverage).
               Tagging and adding seem not to work well without feature
               selection.
               Text replication improves the macroaverage (good for Naive
               Bayes!).
               OR gates with maximum likelihood need a per-class feature
               selection method (mutual information).
               The better approximation for the OR gate is our best
               classifier in previous experiments over flat text.
      Thank you very much!




                     Questions or comments?
