Automated Information Retrieval and Text Categorization: The RIKS Demonstrator

Acknowledge final event
Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment The Riks Demonstrator

  1. Automated Information Retrieval and Text Categorization: The RIKS Demonstrator
     Acknowledge final event, November 25, 2008
     Marie-Francine Moens, Erik Boiy, Javier Arias (HMDB-LIIR)
     Saskia Debergh (i.Know)
     Philippe De Lombaerde, Birger Fühne (UNU-CRIS)

     Overview
     • UNU-CRIS: The RIKS Demonstrator
     • K.U.Leuven:
       – Content extraction from multilingual Web pages
       – Text categorization: machine learning approach
       – Search engine and indexing infrastructure
       – Interfacing the Acknowledge platform
     • i.Know:
       – Information forensics
  2. The RIKS Demonstrator
     • United Nations University – Comparative Regional Integration Studies (UNU-CRIS)
     • Issues addressed in research and capacity building:
       – (i) emergence of regional (= supra-national) governance level
       – (ii) linkages with other governance levels (national, global/UN)
       – (iii) building of regional institutions
       – (iv) growing regional interdependence, etc.
     • RIKS = Regional Integration Knowledge System (UNU-CRIS and GARNET NoE)
  3. The RIKS Demonstrator
     Issues addressed in the demonstrator: how to automate retrieval and processing (cleaning, search, categorization, presentation) of particular types of relevant information in an e-learning environment?
       – 'News': short texts, various formats, dynamic collection, short life cycle, role of news in the e-learning application
       – 'Documentation': heterogeneous texts (scientific articles, theses, essays, ...), rather static collection
       – Treaty texts: long and complex texts, static collection, issue of accessibility

     RIKS example output
  4. Demo

     K.U.Leuven: Content extraction from multilingual Web pages
     • = Extracting the main content from a Web page and removing extraneous data (navigation menus, advertisements, etc.)
     • Requirements of the tool:
       – Accurate
       – Generic
       – Multilingual
       – Fast
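The content-extraction task on slide 4 can be illustrated with a small, language-independent heuristic. The sketch below is not the K.U.Leuven tool and not the method of [Arias et al. submitted]; it is a minimal illustration that assumes main content sits in text blocks with many words and few link characters. The tag sets, the thresholds `min_words` and `max_link_density`, and all function names are assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; a real tool would be more careful.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts link characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars)
        self._skip_depth = 0    # > 0 while inside a SKIP_TAGS element
        self._link_depth = 0    # > 0 while inside an <a> element
        self._text = []
        self._link_chars = 0

    def _flush(self):
        # Close the current block, normalising whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._text.append(data)
        if self._link_depth:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_main_content(html, min_words=8, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        if len(text.split()) < min_words:
            continue
        if link_chars / len(text) > max_link_density:
            continue
        kept.append(text)
    return "\n".join(kept)


SAMPLE = """<html><body>
<nav><ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul></nav>
<div id="menu"><a href="/a">Trade agreements</a> <a href="/b">Regional bodies</a> <a href="/c">Treaties and other documents online</a></div>
<p>The regional trade agreement entered into force after ratification by all member states of the union.</p>
<p><a href="/ad">Buy now</a></p>
<script>var x = "ignore me please, this is not content at all ok";</script>
</body></html>"""

print(extract_main_content(SAMPLE))
```

Because the heuristic relies only on markup and word counts, it does not depend on the page language, which is one plausible way to approach the "generic" and "multilingual" requirements listed on the slide.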
  5. [Arias et al. submitted]
  6. [Arias et al. submitted]; [5] = [Gottron 2008]

     K.U.Leuven: Text categorization
     • Heterogeneous documentation and Google News classified into 27 categories (e.g., trade, poverty, ...)
     • Supervised classifier: Multinomial Naïve Bayes, Support Vector Machine, ...
     • Features:
       – different features: unigrams, bigrams, feature item sets, ...
     • Additional feature selection:
       – Chi Square, Information Gain, Linear Classifier Weights, Orthogonal Centroid Feature Selection
     • Different test setups
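One of the classifier and feature combinations named on slide 6 (Multinomial Naïve Bayes over unigram and bigram features) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the RIKS implementation: the four training texts are invented, only two of the 27 categories are used, and Laplace smoothing with `alpha=1.0` is an assumed default. No feature selection step is included.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Unigrams plus bigrams from a whitespace-tokenised, lower-cased text."""
    tokens = text.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class MultinomialNB:
    """Minimal Multinomial Naive Bayes with Laplace smoothing (alpha)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = features(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            total = sum(counts.values())
            score = math.log(n / n_docs)  # log prior
            for feat in features(text):
                if feat not in self.vocab:
                    continue  # ignore features never seen in training
                score += math.log(
                    (counts[feat] + self.alpha) / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data; the real system used 27 categories.
train_texts = [
    "tariff cuts boost regional trade",
    "free trade agreement signed",
    "poverty rates fall in rural areas",
    "aid programme targets extreme poverty",
]
train_labels = ["trade", "trade", "poverty", "poverty"]

model = MultinomialNB().fit(train_texts, train_labels)
print(model.predict("new trade agreement on tariff cuts"))
print(model.predict("extreme poverty in rural areas"))
```

Feature selection methods such as Chi Square would slot in between feature extraction and training by pruning `vocab` to the highest-scoring features.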
  7. K.U.Leuven: Text categorization

     RIKS
     K.U.Leuven: search engine
  8. Demo
  9. Knowing that you don't know what you should know
     1. Information Forensics – Smart Indexing
     • more than just an index
     • distinguishes between concepts and relations
     • starts from unstructured text (bottom-up instead of top-down)
     • recognises word groups as meaningful units
     • Top-down layers: knowledge / keywords / text
     • Bottom-up layers: knowledge / concepts and relations / text
     © i.Know NV - All rights reserved.

     1. Information Forensics – Smart Indexing
     Input: De Fortis Bank werd overgenomen door BNP Paribas. (English: "Fortis Bank was taken over by BNP Paribas.")
     Traditional indexing (keywords); processing steps shown: stopwords, stemming, calculation, correlation

     Keyword       Index
     Fortis        0.23
     Bank          0.38
     werd          0.08
     overgenomen   0.21
     door          0.12
     BNP           0.34
     Paribas       0.27
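The traditional keyword indexing described on slide 9 can be approximated as stopword removal followed by a per-term weight calculation. The sketch below is an assumption-laden illustration: it uses a TF-IDF style weight, which will not reproduce the index values on the slide (the slide does not state its formula), it skips stemming, and the stopword list and the two extra corpus sentences are invented for the example.

```python
import math
import re
from collections import Counter

# Tiny, hypothetical Dutch stopword list for this example only.
STOPWORDS = {"de", "het", "een", "werd", "door"}


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def keyword_index(doc, corpus):
    """Return {term: weight} using term frequency times smoothed IDF."""
    tokens = [t for t in tokenize(doc) if t not in STOPWORDS]
    tf = Counter(tokens)
    n_docs = len(corpus)
    doc_sets = [set(tokenize(d)) for d in corpus]
    index = {}
    for term, count in tf.items():
        df = sum(term in s for s in doc_sets)            # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1      # smoothed IDF
        index[term] = (count / len(tokens)) * idf
    return index


doc = "De Fortis Bank werd overgenomen door BNP Paribas."
corpus = [
    doc,
    "De bank verhoogt de rente.",            # invented sentence
    "BNP Paribas publiceert jaarcijfers.",   # invented sentence
]

for term, weight in sorted(keyword_index(doc, corpus).items()):
    print(f"{term:12s} {weight:.2f}")
```

Each word is weighted in isolation, so "Fortis" and "Bank" end up as separate entries. That is the limitation the Smart Indexing slides contrast against.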
  10. 1. Information Forensics – Smart Indexing
      Input: De Fortis Bank werd overgenomen door BNP Paribas.
      Smart Indexing (concepts and relations), using concept detection and relation detection

      Smart Index
      Concept    Fortis Bank
      Relation   werd overgenomen door
      Concept    BNP Paribas

      2. Categorisation based on Smart Indexing
      • Preconditions:
        – Pre-defined taxonomy/ontology
        – Top-down processing
      • Advantages of Smart Indexing:
        – Smart Indexing results can be used to fill and enrich the taxonomy, thus ensuring the entries are relevant, precise and complete
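The Smart Index output for the Fortis example (concept, relation, concept) can be mimicked with a toy splitter. This is only a sketch of the output shape: the i.Know engine is proprietary and its detection method is not described in the slides. The relation lexicon `RELATIONS`, the article list and the function names are hypothetical.

```python
# Hypothetical relation lexicon; a real system would detect relations itself.
RELATIONS = ["werd overgenomen door"]
ARTICLES = {"de", "het", "een"}


def strip_article(phrase):
    """Drop a leading Dutch article so the concept is the bare word group."""
    words = phrase.split()
    if words and words[0].lower() in ARTICLES:
        words = words[1:]
    return " ".join(words)


def smart_index(sentence, relations=RELATIONS):
    """Split a sentence into (kind, text) pairs around a known relation."""
    body = sentence.strip().rstrip(".")
    for relation in relations:
        left, sep, right = body.partition(f" {relation} ")
        if sep:
            return [
                ("Concept", strip_article(left)),
                ("Relation", relation),
                ("Concept", strip_article(right)),
            ]
    # No known relation: treat the whole sentence as one concept.
    return [("Concept", strip_article(body))]


for kind, text in smart_index("De Fortis Bank werd overgenomen door BNP Paribas."):
    print(f"{kind:9s} {text}")
```

Unlike the keyword index, the word groups "Fortis Bank" and "BNP Paribas" stay intact as single units.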
  11. 2. Categorisation
      Input: The Agreement will be applied with the European Union and with the EFTA states.
      Smart Indexing (concepts and relations): The Agreement will be applied with the European Union and with the EFTA states.
      Categorisation: EU, EFTA

      RIKS
      i.Know: news categorization
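The categorisation step on slide 11 maps concepts found in the text onto entries of a pre-defined taxonomy. A minimal sketch, assuming a hand-written taxonomy of category labels and surface phrases, is given below. The `TAXONOMY` contents and the phrase-matching approach are assumptions for illustration, not the i.Know method.

```python
import re

# Hypothetical taxonomy: category label -> phrases that signal it.
TAXONOMY = {
    "EU": ["European Union", "EU"],
    "EFTA": ["European Free Trade Association", "EFTA"],
}


def categorise(text, taxonomy=TAXONOMY):
    """Return the categories whose phrases occur in the text as whole words."""
    found = []
    for category, phrases in taxonomy.items():
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.append(category)
                break  # one matching phrase is enough for this category
    return found


sentence = ("The Agreement will be applied with the European Union "
            "and with the EFTA states.")
print(categorise(sentence))
```

Matching on whole phrases rather than single keywords is what lets "European Union" map to the category EU without also firing on every text that mentions "union".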
  12. RIKS
      i.Know: news categorization
  13. Demo
  14. Thank you
