
Learning Better Context Characterizations: An Intelligent Information Retrieval Approach

This paper proposes an incremental method that an intelligent system can use to learn better descriptions of a thematic context. The method starts with a small number of terms selected from a simple description of the topic under analysis and uses this description as the initial search context. From these terms, a set of queries is built and submitted to a search engine. New documents and terms are then used to refine the learned vocabulary. Evaluations performed on a large number of topics indicate that the learned vocabulary is considerably more effective than the original one for constructing queries that retrieve relevant material.
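A minimal sketch of this incremental loop, assuming a hypothetical search-engine wrapper search(query_terms) and simple frequency-based term scoring (the paper's actual weighting is based on descriptive and discriminating power; this is only an illustration):

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase word tokens; a stand-in for real preprocessing."""
    return re.findall(r"[a-z]+", text.lower())

def refine_context(initial_description, search, iterations=5, query_size=3, vocab_size=10):
    """Incrementally refine a topic vocabulary.

    search(query_terms) is a hypothetical callable returning a list of
    document texts for a list of query terms; it is not part of the paper.
    """
    vocabulary = Counter(tokenize(initial_description))  # initial search context
    for _ in range(iterations):
        # Build a query from the currently highest-weighted terms.
        query_terms = [term for term, _ in vocabulary.most_common(query_size)]
        documents = search(query_terms)
        # Use the retrieved documents to reinforce and extend the vocabulary.
        for doc in documents:
            vocabulary.update(tokenize(doc))
        # Keep only the strongest terms as the refined context.
        vocabulary = Counter(dict(vocabulary.most_common(vocab_size)))
    return vocabulary
```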


  1. Learning Better Context Characterizations: an Intelligent Information Retrieval Approach. Carlos M. Lorenzetti, Ana G. Maguitman [email_address] [email_address]. Universidad Nacional del Sur, Av. L.N. Alem 1253, Bahía Blanca, Argentina. Grupo de Investigación en Recuperación de Información y Gestión del Conocimiento, Laboratorio de Investigación y Desarrollo en Inteligencia Artificial. CONICET, AGENCIA.
  2. Information Retrieval limitations
  3. Information Retrieval limitations: Java as an island
  4. Information Retrieval limitations: Java as a programming language
  5. Problems: ambiguity. Java?
  6. Problems: ambiguity. Java? Animals, Computers, Consumables, Entertainment, Geography, Flora, Ships
  7. Proposed solutions: context-specific term identification; finding relevant information sources; incrementally building queries.
  8. Context characterization. Diagram: a context (articles, newspapers and other sources) is characterized by a list of weighted terms T1 p1, T2 p2, T3 p3, T4 p4, ..., Tn pn.
  9. Context characterization. Term relevance: the simplest scheme counts term occurrences in the documents and penalizes very common terms. Does this work on the Web? (See the term-weighting sketch after the slide list.)
  10. Different roles of terms. Descriptors: terms that appear often in documents related to the given topic; they answer the question "What is this topic about?". Discriminators: terms that appear only in documents related to the given topic; they answer the question "Which terms are useful for seeking similar information?".
  11. Descriptors and discriminators computation: an example.
  12. Descriptors and discriminators. Topic: Java Virtual Machine. Candidate terms: Java, Language, Applets, Code, NetBeans, Computers, JVM, Ruby, Programming, JDK, Virtual Machine.
  13. Descriptors and discriminators. The same term set, with the good descriptors highlighted.
  14. Descriptors and discriminators. The same term set, with the good discriminators highlighted.
  15. Documents, descriptors and discriminators. Topic: Java Virtual Machine. Initial context H, with four example pages: (1) espressotec.com, (2) netbeans.org, (3) sun.com, (4) wikitravel.org. Table: number of occurrences of term j in document i, over the terms java, machine, virtual, language, programming, coffee, island, province, jvm, jdk.
  16. Document descriptors. Topic: Java Virtual Machine. Descriptive power of a term in a document. Values shown: java 0.718, machine 0.359, virtual 0.180, language 0.180, programming 0.539, coffee 0.000, island 0.000, province 0.000, jvm 0.000, jdk 0.000.
  17. Document discriminators. Topic: Java Virtual Machine. Discriminating power of a term in a document. Values shown: java 0.447, machine 0.500, virtual 0.577, language 0.500, programming 0.577, coffee 0.000, island 0.000, province 0.000, jvm 0.000, jdk 0.000. (A computation sketch appears after the slide list.)
  18. Document comparison criterion. Document similarity is measured with the cosine similarity between document term vectors (diagram: documents d1 and d2 represented over terms k1, k2, k3). (A cosine-similarity sketch appears after the slide list.)
  19. Topic descriptors. Topic: Java Virtual Machine. Descriptive power of a term for the topic. Values shown: java 0.385, machine 0.158, virtual 0.124, language 0.089, programming 0.064, coffee 0.055, island 0.040, province 0.040, jvm 0.032, jdk 0.014.
  20. Topic discriminators. Topic: Java Virtual Machine. Discriminating power of a term for the topic. Values shown: jvm 0.848, jdk 0.848, virtual 0.566, programming 0.566, machine 0.524, language 0.517, java 0.493, coffee 0.385, island 0.385, province 0.385.
  21. Proposed algorithm (diagram). The context is a list of weighted terms w1 ... wm; queries query_01 ... query_n are built from these terms by roulette-wheel selection and submitted, yielding result sets result_01 ... result_n; from the results, weighted DESCRIPTORS and DISCRIMINATORS term lists are computed and fed back to refine the context. (A roulette-selection sketch appears after the slide list.)
  22. Evaluation
  23. Evaluation setup. Local index built from the DMOZ / ODP project: about 500 topics, more than 350,000 pages, English language, topics taken from the 3rd level of the hierarchy, each with at least 100 URLs (diagram: 1st, 2nd and 3rd hierarchy levels, e.g. Top; Home, Science, Arts; Cooking, Family, Childcare). Measures: initial context, similarity, precision, recall. (A precision/recall sketch appears after the slide list.)
  24. Evaluation – similarity. Example topic: Top/Computers/Open_Source/Software. Plot: novelty-driven similarity (0 to 1) versus iteration (0 to 180), with maximum, average and minimum curves. Reported means: 0.0661 at the 1st iteration (95% CI [0.0618; 0.0704]) and 0.5970 at the best iteration (95% CI [0.5866; 0.6073]).
  25. Evaluation – similarity. The same plot, highlighting the context-update step.
  26. Evaluation – similarity. The same plot, highlighting the query formulation and retrieval process.
  27. Evaluation – precision. Scatter plot of first-iteration precision versus best-iteration precision (both axes from 0 to 1). An improvement was observed for 89.18% of the topics; the remaining points show no improvement.
  28. Evaluation – recall. Scatter plot of first-iteration recall versus best-iteration recall (both axes from 0 to 0.35). An improvement was observed for 89.38% of the topics; the remaining points show no improvement.
  29. Conclusions. We presented an intelligent IR approach for learning context-specific terms that takes advantage of the user context. Our evaluations show the effectiveness of the incremental method. Future work: investigate different parameters; develop methods to learn and adjust parameters; run additional tests to compare with other methods.
  30. Thank you! CONICET, AGENCIA. Laboratorio de Investigación y Desarrollo en Inteligencia Artificial, lidia.cs.uns.edu.ar. Universidad Nacional del Sur, Bahía Blanca, www.uns.edu.ar
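Slides 15-17 compute descriptive and discriminating power at the document level. The descriptive-power values on slide 16 are consistent with a length-normalized term frequency (e.g. java: 4 / sqrt(4^2 + 2^2 + 1^2 + 1^2 + 3^2) ≈ 0.718); the sketch below reproduces that computation and adds a document-frequency-based discriminating score only as a stand-in, since the exact document-level formula is not recoverable from the slide.

```python
import math
from collections import Counter

def descriptive_power(doc_tokens):
    """Descriptive power of each term in a document, computed here as the
    length-normalized term frequency (consistent with slide 16, where
    java -> 4 / sqrt(31) ~ 0.718)."""
    counts = Counter(doc_tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}

def discriminating_power(term, docs_tokens):
    """ASSUMPTION: an inverse-document-frequency-style stand-in,
    1 / sqrt(number of documents containing the term); the paper's exact
    document-level formula is not readable from the slide."""
    df = sum(1 for tokens in docs_tokens if term in tokens)
    return 1.0 / math.sqrt(df) if df else 0.0

# Example mirroring the term counts paired with slide 16's values:
doc = ["java"] * 4 + ["machine"] * 2 + ["virtual", "language"] + ["programming"] * 3
print(round(descriptive_power(doc)["java"], 3))  # 0.718
```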
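Slide 18's comparison criterion, cosine similarity between term-frequency vectors, as a short self-contained sketch:

```python
import math
from collections import Counter

def cosine_similarity(doc1_tokens, doc2_tokens):
    """Cosine of the angle between the term-frequency vectors of two documents."""
    v1, v2 = Counter(doc1_tokens), Counter(doc2_tokens)
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Illustrative documents only:
print(cosine_similarity(["java", "virtual", "machine"],
                        ["java", "coffee", "island"]))  # ~0.33
```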
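Slide 21 builds queries by roulette-wheel selection over the weighted descriptor and discriminator lists. A minimal sketch of that selection step; the weights and query length below are illustrative, not taken from the paper:

```python
import random

def roulette_query(weighted_terms, query_length=3):
    """Pick query terms with probability proportional to their weights
    (roulette-wheel selection), without repeating a term within one query."""
    pool = dict(weighted_terms)
    query = []
    while pool and len(query) < query_length:
        total = sum(pool.values())
        spin = random.uniform(0, total)
        cumulative = 0.0
        for term, weight in pool.items():
            cumulative += weight
            if spin <= cumulative:
                query.append(term)
                del pool[term]  # avoid duplicate terms in the same query
                break
    return query

# Illustrative weights only:
print(roulette_query({"java": 0.5, "jvm": 0.25, "programming": 0.15, "coffee": 0.1}))
```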
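Slide 23 evaluates the retrieved results with precision and recall against the relevant pages indexed for each ODP topic. A minimal sketch of the two measures over sets of retrieved and relevant URLs (the URLs below are illustrative only):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall(["netbeans.org", "sun.com", "wikitravel.org"],
                       ["netbeans.org", "sun.com", "java.com"]))  # (~0.67, ~0.67)
```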
