Tuning Topical Queries through Context Vocabulary Enrichment: A Corpus-Based Approach

Context-based Web search has become an important research area and many strategies have been proposed to reflect contextual information in search queries. Despite the success of some of these proposals they still have serious limitations due to their inability to bridge the terminology gap existing between the user context description and the relevant documents' vocabulary. This paper presents a quantitative technique to learn vocabularies useful for describing the theme of a context under analysis. The enriched vocabulary allows the formulation of search queries to identify resources with higher precision than those identified using the initial vocabulary. Rigorous experimentation leads us to conclude that the proposed technique is superior to a baseline and other well-known query reformulation techniques.


Slide notes
  • Context-based search is the process of seeking information related to a user’s thematic context. Meaningful automatic context-based search can only be achieved if the semantics of the terms in the context under analysis is reflected in the search queries. For example, if a user is searching using their own words ...
  • He or she could find many different topics related to this search.
  • An information request is usually initiated or generated within a task. For example, if the user is editing or reading a document on a specific topic, he or she may wish to explore new material related to that topic. Topical queries can be formed using small sets of terms from the user’s context.
  • In this way we can disambiguate a query whose terms belong to more than one topic.
  • Query tuning is usually achieved by replacing or extending the terms of a query, or by adjusting the weights of a query vector. Relevance feedback is a query refinement mechanism used to tune queries based on relevance assessments of the query’s results. A driving hypothesis for relevance feedback methods is that it may be difficult to formulate a good query when the collection of documents isn’t known in advance, but it is easy to judge particular documents, so it makes sense to engage in an iterative query refinement process. A typical relevance feedback scenario involves the following steps: (1) a query is formulated; (2) the system returns an initial set of results; (3) a relevance assessment on the returned results is issued; (4) the system computes a better representation of the information need based on this feedback; (5) the system returns a revised set of results. Depending on the level of automation of step 3 we can distinguish three forms of feedback. Supervised feedback requires explicit feedback, typically obtained from users who indicate the relevance of each retrieved document. Unsupervised feedback applies blind relevance feedback, typically assuming that the top k documents returned by a search process are relevant. In semi-supervised feedback, the relevance of a document is inferred by the system; a common approach is to monitor user behavior (e.g., which documents are selected for viewing, or the time spent viewing a document). Provided that the information-seeking process is performed within a thematic context, another automatic way to infer the relevance of a document is to compute its similarity to the user’s current context. This is the approach we use in this work.
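A minimal sketch of this semi-supervised loop, with relevance inferred from similarity to the thematic context; `search` and `similarity` are hypothetical caller-supplied callables, not part of the original system:

```python
# Semi-supervised feedback loop (sketch): a document is judged relevant
# when its similarity to the thematic context crosses a threshold, and
# relevant documents refine the context. `search` and `similarity` are
# illustrative placeholders.

def feedback_loop(context, search, similarity, iterations=5, threshold=0.5):
    # Step 1: form a query from the top-weighted context terms.
    query = sorted(context, key=context.get, reverse=True)[:3]
    for _ in range(iterations):
        results = search(query)                         # Steps 2/5: (revised) results
        relevant = [doc for doc in results              # Step 3: inferred relevance
                    if similarity(context, doc) >= threshold]
        if not relevant:
            break
        for doc in relevant:                            # Step 4: refine the context
            for term in doc:
                context[term] = context.get(term, 0) + 1
        query = sorted(context, key=context.get, reverse=True)[:3]
    return query
```

Each pass reformulates the query from the refined context, so terms reinforced by context-similar documents gradually dominate.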
  • Much work has addressed the problem of computing the informativeness of a term across a corpus, and a good deal of research has focused on computing the descriptive and discriminating power of a term in a document with respect to a corpus. All of this work, however, has been based on a predefined collection of documents, independently of a thematic context. In previous work we proposed to study the descriptive and discriminating power of a term based on its distribution across the pages returned by a search engine for a given topic. To distinguish topic descriptors from topic discriminators, we argue that good topic descriptors can be found by looking for terms that occur often in documents related to the given topic, while good topic discriminators can be found by looking for terms that occur only in documents related to the given topic. Both topic descriptors and discriminators are important as query terms.
  • Because topic descriptors occur often in relevant pages, using them as query terms may improve recall.
  • Similarly, good topic discriminators occur primarily in relevant pages, and therefore using them as query terms may improve precision.
  • Now we’ll see a simple example to assess the potential of these kinds of terms.
  • For example, if we choose the topic Java Virtual Machine, we could take the following words in our context:
  • So, intuitively, and in line with the definition, we could say that good descriptors would be words such as Java, Machine or Virtual, and …
  • Good discriminators would be: JVM and JDK.
  • More concretely, here is a practical example. As this slide shows, we build a matrix of documents against terms. Our initial context is the first column of the matrix, and the remaining columns are pages obtained through a search engine by making queries with the initial context’s terms. Each cell of the matrix holds the number of occurrences of a term in a document. In this example we have four pages: two about the island and its coffee, and two about Java as a programming language.
  • We define the descriptive power of a term in a document with this expression, and we can see the values of the terms. Note that the values of the terms that don’t belong to the initial context are zero.
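The slide formulas themselves are not reproduced in this transcript. One plausible reading, consistent with the example values on slide 16 (e.g., 0.718 for java in the initial context), normalizes a term’s frequency by the document’s Euclidean norm for descriptive power, and by the term’s norm across documents for the discriminating power defined next. The sketch below works under that assumption and is not necessarily the paper’s exact definition:

```python
import math

# Assumed forms of the two term weights, matching the slide example:
# descriptive power divides a term's frequency by the document's (row)
# Euclidean norm; discriminating power divides it by the term's (column)
# norm across all documents. m maps document id -> {term: frequency}.

def descriptive_power(m, doc, term):
    row_norm = math.sqrt(sum(f * f for f in m[doc].values()))
    return m[doc].get(term, 0) / row_norm if row_norm else 0.0

def discriminating_power(m, doc, term):
    col_norm = math.sqrt(sum(d.get(term, 0) ** 2 for d in m.values()))
    return m[doc].get(term, 0) / col_norm if col_norm else 0.0

# Initial context of the Java Virtual Machine example:
m = {"context": {"java": 4, "machine": 2, "virtual": 1,
                 "language": 1, "programming": 3}}
# java: 4 / sqrt(4^2 + 2^2 + 1^2 + 1^2 + 3^2) ≈ 0.718, as on the slide.
```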
  • We also define the discriminating power of a term in a document with this other expression, and see the results. Since our objective is to learn the user’s needs, instead of extracting the descriptors and discriminators of individual documents (such as the user context), we need to find descriptors and discriminators of the user context’s topic. This term identification calls for an incremental method that identifies which documents are similar to the user context. So, we need …
  • … a document comparison criterion, and we choose cosine similarity. It is the simplest way to compare documents and the most common method in IR. We don’t explain the method here, but with this criterion we define the notion of a topic: a topic is a group of documents with high pairwise cosine similarity.
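Cosine similarity over bag-of-words vectors, as used here to decide which retrieved documents belong to the context’s topic (the standard IR formulation, not specific to this work):

```python
import math

# Cosine similarity between two bag-of-words documents: the dot product of
# their term-frequency vectors divided by the product of their norms.

def cosine_similarity(a, b):
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Under the definition above, a topic is then a group of documents whose pairwise cosine similarity exceeds a chosen threshold.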
  • Using the previous definitions, we define the term descriptive power in the topic of a document with this equation. We again see the weights reached by each term, and we note that java and machine are good topic descriptors, as mentioned before.
  • We also define the notion of term discriminating power in the topic of a document, and here we note one of the most important results: terms like jvm and jdk, which don’t belong to the initial context, are excellent topic discriminators, as anticipated. Incremental search methods are useful for collecting information from diverse information sources. The incremental identification of context-specific terms can guide the search process through huge repositories of potentially useful material, helping to filter out irrelevant content.
  • Our proposal is to approximate the terms' descriptive and discriminating power for the thematic context under analysis, with the purpose of generating good queries. Our approach adapts the typical relevance feedback mechanism to account for a thematic context as follows. First, we extract terms from the user context. With these terms we form queries, and the system returns an initial set of results. With the obtained results and the context, the descriptor and discriminator lists are built. These steps are repeated until no improvements are observed; then the context characterization is updated with the newly learned material and the process starts again. In outline:
    Step 1: A query is formulated based on C.
    Step 2: The system returns an initial set of results.
    Step 3: Repeat for at least v iterations or until no improvements are registered:
      Step 3.1: A relevance assessment on the returned results is issued based on C.
      Step 3.2: After a certain number of trials, and depending on the relevance assessments, the system computes a better representation of the thematic context (phase change).
      Step 3.3: The system formulates new queries and returns a revised set of results.
    In order to learn better characterizations of the thematic context, the system undergoes a series of phases.
At the end of each phase, the context characterization is updated with the newly learned material. Each phase evolves through a sequence of trials, where each trial consists of the formulation of a set of queries, the analysis of the retrieved results, the adjustment of the terms' weights, and the discovery of new potentially useful terms. To form queries during phase i we implemented a roulette selection mechanism, where the probability of choosing a particular term t to form a query is proportional to the weight of the term at phase i. Roulette selection is a technique typically used by genetic algorithms to choose potentially useful solutions for recombination, where the fitness level is used to assign a probability of selection. This approach results in a non-deterministic exploration of the term space that favors the fittest terms. The system monitors the effectiveness achieved at each iteration; in our approach we use novelty-driven similarity as an estimate of the retrieval effectiveness (we’ll explain it later). If after a number of trials the retrieval effectiveness has not crossed a given threshold (i.e., no significant improvements are observed after a certain number of trials), the system forces a phase change to explore new, potentially useful regions of the vocabulary landscape. A phase change can be regarded as a vocabulary leap, that is, a significant transformation (typically an improvement) of the context characterization.
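The roulette (fitness-proportionate) selection described here can be sketched as follows; the weight values and query size in the usage below are illustrative:

```python
import random

# Roulette-wheel selection of query terms: each spin picks a term with
# probability proportional to its weight, then removes it from the pool
# so a query contains no duplicates (sampling without replacement).

def roulette_query(weights, size, rng=random):
    pool, query = dict(weights), []
    for _ in range(min(size, len(pool))):
        total = sum(pool.values())
        spin = rng.uniform(0, total)
        cumulative = 0.0
        for term, weight in pool.items():
            cumulative += weight
            if spin <= cumulative:
                query.append(term)
                del pool[term]
                break
    return query
```

Because selection is probabilistic rather than greedy, low-weight terms still enter queries occasionally, which is what lets the system explore new regions of the vocabulary.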
  • Now we’ll see an evaluation of the proposed solution.
  • We compare the proposed method against two other methods. The first is a baseline that submits queries taken directly from the thematic context and doesn’t apply any refinement mechanism. The second is the Bo1-DFR method, which is based on the well-known Rocchio method. To perform our tests we used 448 topics from the Open Directory Project (ODP), and a number of constraints were imposed on this selection to ensure the quality of our test set: the topics were selected from the third level of the ODP taxonomy, the minimum size for each selected topic was 100 pages, and the language was restricted to English. For each topic we collected all of its URLs as well as those in its subtopics, for a total of more than 350,000 pages. In our tests we used the ODP description of each selected topic to create an initial context description. To compare the implemented methods we used three measures of query performance. Novelty-driven similarity is based on cosine similarity but disregards the terms that form the query, overcoming the bias introduced by those terms and favoring the exploration of new material. Precision measures the fraction of retrieved documents that are known to be relevant; the relevant set for each analyzed topic was the collection of its URLs as well as those in its subtopics. Semantic precision is a measure that accounts for inter-topic similarity, because other topics in the ontology can be semantically similar (and therefore partially relevant) to the topic of the given context; to compute it we used a semantic similarity measure for generalized ontologies proposed in previous work.
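Novelty-driven similarity, as described, is cosine similarity computed after discarding the query terms; a sketch of that formulation, inferred from the description above:

```python
import math

# Novelty-driven similarity: cosine similarity between the context and a
# retrieved document after removing the terms that formed the query, so a
# result is not rewarded merely for echoing the query terms back.

def novelty_similarity(context, doc, query_terms):
    ctx = {t: w for t, w in context.items() if t not in query_terms}
    d = {t: w for t, w in doc.items() if t not in query_terms}
    dot = sum(w * d.get(t, 0) for t, w in ctx.items())
    norm_c = math.sqrt(sum(w * w for w in ctx.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_c * norm_d) if norm_c and norm_d else 0.0
```

A page that only repeats the query terms scores zero, while a page that adds context-related vocabulary scores high, which is the behavior the metric is designed to reward.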
  • As an example of the algorithm’s behaviour, we show here its evolution on a representative topic. The chart shows the evolution of novelty-driven similarity and its behaviour at the different execution steps.
  • Now, in this chart we see novelty-driven similarity again, this time in a comparative chart where each of the 448 topics is represented by a point. It is interesting to note that for all the tested cases the incremental method was superior to the other two methods.
  • Here we see the comparison for the precision metric. In this case the incremental method was strictly superior for 66.96% of the evaluated topics.
  • Finally, we see the Semantic Precision metric. Here the incremental method was superior in 65.18% of the topics.
  • The vocabulary problem is a main challenge in human-system communication. In this work we propose a solution to the semantic sensitivity problem, that is, the limitation that arises when documents with similar context but different term vocabulary are not associated, resulting in a false negative match. Our method operates by incrementally learning better vocabularies from a large external corpus such as the Web. We have shown that an incremental context refinement method can perform better than a baseline method, which submits queries taken directly from the initial context, and than the Bo1-DFR method, which doesn’t refine queries based on context. This points to the usefulness of simultaneously exploiting the terms in the current thematic context and an external corpus to learn better vocabularies and to automatically tune queries.
  • Transcript

    • 1. Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based Approach Carlos M Lorenzetti Ana G Maguitman [email_address] [email_address] Universidad Nacional del Sur Av. L.N. Alem 1253 Bahía Blanca - Argentina Grupo de Investigación en Recuperación de Información y Gestión del Conocimiento Laboratorio de Investigación y Desarrollo en Inteligencia Artificial CONICET AGENCIA
    • 2. Introduction
    • 3. Context–Based Search Java?
    • 4. Context–Based Search Java? Animals Computers Consumables Entertainment Geography Flora Ships
    • 5. Context–Based Search Context Articles Newspapers Others
    • 6. Context–Based Search Java? Context Articles Newspapers Others Geography
    • 7. Query tuning
      • Step 1 : Query
      • Step 2 : Initial set of results
      • Step 3 : Relevance assessment
        • Supervised feedback
        • Unsupervised feedback
        • Semi-supervised feedback
      • Step 4 : Better representation
      • Step 5 : Revised set of results
    • 8. Different Role of Terms
      • Descriptors
        • Terms that appear often in documents
        • related to the given topic
        • What is this topic about?
      • Discriminators
        • Terms that appear only in documents
        • related to the given topic
        • What terms are useful to seek similar information?
    • 9. Different Role of Terms
      • Descriptors
        • Terms that appear often in documents
        • related to the given topic
        • What is this topic about?
      • Discriminators
        • Terms that appear only in documents
        • related to the given topic
        • What terms are useful to seek similar information?
      Recall
    • 10. Different Role of Terms
      • Descriptors
        • Terms that appear often in documents
        • related to the given topic
        • What is this topic about?
      • Discriminators
        • Terms that appear only in documents
        • related to the given topic
        • What terms are useful to seek similar information?
      Precision Recall
    • 11.
      • Descriptors and Discriminators Computation:
      • an example
    • 12. Descriptors and Discriminators Java Language Applets Code Topic: Java Virtual Machine NetBeans Computers JVM Ruby Programming JDK Virtual Machine
    • 13. Descriptors and Discriminators Java Language Applets Code Topic: Java Virtual Machine NetBeans Computers JVM Ruby Programming JDK Virtual Machine Good descriptors
    • 14. Descriptors and Discriminators Java Language Applets Code Topic: Java Virtual Machine NetBeans Computers JVM Ruby Programming JDK Virtual Machine Good discriminators
    • 15. Documents Descriptors and Discriminators. Topic: Java Virtual Machine. A term-document matrix H records the number of occurrences of term j in document i; the initial context is one column, alongside four retrieved pages:
      • espressotec.com
      • netbeans.org
      • sun.com
      • wikitravel.org
      Initial context term counts: java 4, machine 2, virtual 1, language 1, programming 3; coffee, island, province, jvm and jdk 0.
    • 16. Documents Descriptors. Topic: Java Virtual Machine. Descriptive power of a term in a document, computed for the initial context: java 0.718, machine 0.359, virtual 0.180, language 0.180, programming 0.539; coffee, island, province, jvm and jdk 0.000.
    • 17. Documents Discriminators. Topic: Java Virtual Machine. Discriminating power of a term in a document, computed for the initial context: java 0.447, machine 0.500, virtual 0.577, language 0.500, programming 0.577; coffee, island, province, jvm and jdk 0.000.
    • 18. Documents comparison criterion: documents similarity via cosine similarity (vector-space sketch of documents d1 and d2 over term axes k1, k2, k3).
    • 19. Topics Descriptors. Topic: Java Virtual Machine. Term descriptive power in the topic of a document: java 0.385, machine 0.158, virtual 0.124, language 0.089, programming 0.064, coffee 0.055, island 0.040, province 0.040, jvm 0.032, jdk 0.014.
    • 20. Topics Discriminators. Topic: Java Virtual Machine. Term discriminating power in the topic of a document: jvm 0.848, jdk 0.848, virtual 0.566, programming 0.566, machine 0.524, language 0.517, java 0.493, coffee 0.385, island 0.385, province 0.385.
    • 21. Proposed Algorithm. Context terms (w1, w2, …, wm) feed a roulette selection that forms queries (query 1 … query n); the retrieved results are analyzed to build two weighted term lists, DESCRIPTORS and DISCRIMINATORS, which feed back into the context.
    • 22.
      • Evaluation
    • 23. Evaluation
      • Local Index
      • DMOZ – ODP Project
        • ~ 500 Topics
        • 3rd level of the hierarchy
        • 100 URLs at least
        • English language
        • > 350,000 pages
      • Initial Context
      • vs. Bo1 & Baseline
        • Novelty-Driven Similarity
        • Precision
        • Semantic Precision
      ODP hierarchy example: 1st level Top; 2nd level Home, Science, Arts; 3rd level Cooking, Family, Childcare.
    • 24. Evaluation – Novelty-Driven Similarity. Topic: Top/Computers/Open_Source/Software. Chart: maximum, average and minimum novelty-driven similarity over the iterations of the query formulation and retrieval process, with context updates marked. Mean novelty-driven similarity [95% CI]: best 0.5970 [0.5866; 0.6073]; 1st 0.0661 [0.0618; 0.0704].
    • 25. Evaluation – Novelty-Driven Similarity. Mean [95% CI]: Incremental 0.597 [0.5866; 0.6073]; Baseline 0.087 [0.0822; 0.0924]; Bo1-DFR 0.075 [0.0710; 0.0803].
    • 26. Evaluation – Precision. Best method per topic: Incremental 66.96%, Bo1-DFR 24.33%, Baseline 8.7%. Mean precision [95% CI]: Incremental 0.354 [0.3325; 0.3764]; Bo1-DFR 0.307 [0.2859; 0.3298]; Baseline 0.266 [0.2461; 0.2863].
    • 27. Evaluation – Semantic Precision. Best method per topic: Incremental 65.18%, Bo1-DFR 27.90%, Baseline 6.92%. Mean semantic precision [95% CI]: Incremental 0.622 [0.6068; 0.6372]; Bo1-DFR 0.590 [0.5750; 0.6066]; Baseline 0.553 [0.5383; 0.5679].
    • 28. Conclusions
      • We presented an intelligent IR approach for learning context-specific terms.
        • Take advantage of the user context .
      • We have shown evaluations and the effectiveness of incremental methods.
      • Future Work
        • Investigate different parameters.
        • Develop methods to learn and adjust parameters.
        • Run additional tests using other IR metrics.
    • 29. Thank you! CONICET AGENCIA Laboratorio de Investigación y Desarrollo en Inteligencia Artificial lidia.cs.uns.edu.ar Universidad Nacional del Sur Bahía Blanca www.uns.edu.ar
