Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval


Slides for the paper "Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval" by Alberto Tonon, Gianluca Demartini, and Philippe Cudré-Mauroux, presented at SIGIR 2012.

Slide notes:
  • A lot of search engine queries are about entities (more than half). There is the task...
  • Mention that literals are strings attached to a node.
  • Cover only the one scoring function.
  • Explain what sameAs is.
  • The data is a graph: the inverted index gives us an entry point, and then we walk
  • TREC-like collection/test set with depth-10 pooling; everyone here knows it!
  • Say that the simple index is “or”, while UL, LA, and ULA are “and”. Mention the disappointment with the first BM25 results: we tried using just the II, but it didn't work, so we decided to go for the graph… NO GOOGLE
  • Compare JUST s_1 with s_2 (lower recall but higher precision).
  • s2_3 doesn't follow wikilinks. Indices and database were resident on the machine. We didn't focus on efficiency.

    1. Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval
       Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux
       eXascale Infolab - University of Fribourg - Switzerland
       {firstname.lastname}
       SIGIR 2012 - Monday, August 13th 2012
    2. Motivation
       • Many search engine queries are about entities.
       • An increasingly large amount of entity data is available online.
         • Often represented as huge graphs,
         • e.g. the LOD cloud, Google Knowledge Graph, Facebook social graph.
       • Entities have globally unique identifiers (e.g., URIs).
         • These are hard to discover and/or memorize.
    3. Ad-hoc Object Retrieval (informal definition)
       • “Given the description of an entity, give me back its identifier”
       • The description can be keywords (e.g., “Harry Potter”).
       • There can be more than one identifier per entity (e.g., DBpedia + Freebase).
       • How to evaluate returned results?
    4. Ad-hoc Object Retrieval (formal definition by Pound et al.)
       • Input: unstructured query q and data graph G.
       • Output: ranked list of resource identifiers (URIs) from G.
       • Evaluation: results (URIs) scored by a judge with access to all the information contained in or linked to the resource.
       • Standard collections exist.
    5. Overview of Our Solution
       • Inverted indices on the LOD Cloud...
       • ...and an RDF store containing the data.
       • Simple NLP techniques, autocompletion, pseudo-relevance feedback.
       • BM25, BM25F.
    6. A Simple Example
       [Figure: processing pipeline for the query “SIGIR”]
       • Query processing: NLP techniques, query auto-completion, pseudo-relevance feedback.
       • II + ranking function(s) return: 1. …/SIGIR, 2. …/IRAQ, 3. …
       • Graph traversals + final ranking function return: 1. …/SIGIR, 2. …/sigir, 3. …/IRAQ, …
       • Open questions: How to build the II? Which properties should we follow? How to rank new results?
    7. Outline
       1. Inverted Indices
       2. Graph-Based Entity Search
          1. Object Properties vs Datatype Properties
          2. Properties to Follow
       3. Experimental Results
          1. Experimental Setting
          2. IR Techniques: Experimental Results
          3. Evaluation of the Hybrid Approaches
          4. Overhead of the Graph Traversal
    8. 1. Inverted Indices (IIs)
       • Simple inverted index:
         • index all literals attached to each node in the input graph.
         • “movie” → http://…types/film
       • Structured inverted index with three fields:
         • URI - tokenized URIs identifying entities.
         • Label - manually selected datatype properties pointing to textual descriptions of the entity (e.g., label, title, name, full-name, …).
         • Attributes - all other literals.
       • Techniques applied on top: BM25(F), query auto-completion, query extension, relevance feedback.
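The three-field structure described on this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the `LABEL_PREDICATES` set, the example URIs, and the "or" lookup semantics are all assumptions made for illustration, and ranking (BM25/BM25F) is left out.

```python
import re
from collections import defaultdict

# Hypothetical set of label-like predicates; the slide only says they were
# manually selected (e.g., label, title, name, full-name).
LABEL_PREDICATES = {"label", "title", "name", "full-name"}
FIELDS = ("uri", "label", "attributes")


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def build_structured_index(literal_triples):
    """Build a mapping field -> term -> {entity URI: term frequency}.

    literal_triples: iterable of (entity_uri, predicate_name, literal).
    """
    index = {f: defaultdict(lambda: defaultdict(int)) for f in FIELDS}
    seen = set()
    for uri, predicate, literal in literal_triples:
        if uri not in seen:  # index the tokenized URI once per entity
            seen.add(uri)
            for term in tokenize(uri):
                index["uri"][term][uri] += 1
        field = "label" if predicate in LABEL_PREDICATES else "attributes"
        for term in tokenize(literal):
            index[field][term][uri] += 1
    return index


def lookup(index, query, fields=FIELDS):
    """Return URIs matching any query term in any given field ("or" semantics)."""
    hits = set()
    for term in tokenize(query):
        for field in fields:
            hits.update(index[field].get(term, {}))
    return hits


# Toy data with made-up example.org URIs.
triples = [
    ("http://example.org/resource/Harry_Potter", "label", "Harry Potter"),
    ("http://example.org/resource/Harry_Potter", "abstract", "A series of fantasy novels"),
    ("http://example.org/resource/SIGIR", "name", "SIGIR conference"),
]
index = build_structured_index(triples)
print(lookup(index, "harry potter", fields=("label",)))
```

Keeping the fields separate is what allows a field-weighted function such as BM25F to treat a match in the label differently from a match in an arbitrary attribute.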
    9. 2. Graph-Based Entity Search
       [Figure: from IR results to merged, re-ranked results]
       • Take the top-N docs from the IR results.
       • Follow links/properties (p1, p2, …, p_m) and get new URIs.
       • Filter new results by text similarity w.r.t. the user query: sim(e, q) > τ?
       • Assign scores (0.284, 1.428, 0.556 in the figure), then merge and re-rank the results.
       • Scoring functions: count sim > τ, avg sim > τ, sum sim, avg sim, sum BM25 - ε.
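The steps on this slide can be sketched roughly as follows. This is only one reading of the pipeline: the similarity measure (Python's `difflib.SequenceMatcher`), the default values of `tau` and `epsilon`, the toy graph, and the interpretation of "sum BM25 - ε" as "sum of the BM25 scores of the entry points that reach a new entity, minus ε" are assumptions, not details confirmed by the slides.

```python
from difflib import SequenceMatcher


def sim(a, b):
    """Stand-in text similarity in [0, 1]; the slides do not name the measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def expand_and_rerank(query, ir_results, graph, labels, n=3, tau=0.5, epsilon=0.01):
    """Expand the top-n IR results through the graph and merge the new URIs.

    ir_results: list of (uri, bm25_score), sorted by descending score.
    graph:      uri -> list of (property, neighbour_uri) for properties to follow.
    labels:     uri -> textual description used by the similarity filter.
    """
    known = {uri for uri, _ in ir_results}
    found = {}
    for seed_uri, seed_score in ir_results[:n]:       # take top-N docs
        for _prop, neighbour in graph.get(seed_uri, []):  # follow properties
            if neighbour in known:
                continue
            if sim(labels.get(neighbour, ""), query) > tau:  # filter by sim > tau
                # assumed "sum BM25" scoring: accumulate the seeds' scores
                found[neighbour] = found.get(neighbour, 0.0) + seed_score
    merged = list(ir_results) + [(u, s - epsilon) for u, s in found.items()]
    return sorted(merged, key=lambda pair: pair[1], reverse=True)


# Toy example with made-up identifiers and scores.
ir_results = [("ex:SIGIR", 2.0), ("ex:IRAQ", 1.0)]
graph = {"ex:SIGIR": [("sameAs", "ex2:sigir"), ("seeAlso", "ex:Geneva")]}
labels = {"ex2:sigir": "SIGIR", "ex:Geneva": "Geneva"}
ranked = expand_and_rerank("sigir", ir_results, graph, labels)
print([uri for uri, _ in ranked])
```

Under this reading, subtracting ε places an entity reached through the graph just below the entry point that led to it, which matches the ordering shown in the "A Simple Example" slide.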
    10. 2.1. Object Properties vs Datatype Properties
        • Object properties:
          • connect different entities
          • explore all the graph
        • Datatype properties:
          • give additional info about entities
          • explore just the neighborhood of a node
    11. 2.2. Properties to Follow
        • RDF graph queried with SPARQL queries.
        • Scope 1 queries vs Scope 2 queries.
        • Set of predicates to follow selected using:
          • Common sense (e.g., sameAs)
          • Statistics from the data
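The slide does not define the two scopes. Assuming that a Scope 1 query retrieves entities one hop away from the entry point and a Scope 2 query retrieves entities two hops away, the queries could be generated along these lines. The helper names, the example entry URI, and the restriction to outgoing edges are hypothetical; only the `owl:sameAs` IRI is a standard identifier.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def _values(predicates):
    """Render a list of predicate IRIs for a SPARQL VALUES clause."""
    return " ".join(f"<{p}>" for p in predicates)


def scope1_query(entry_uri, predicates):
    """SPARQL for IRIs one hop from entry_uri via any listed predicate (assumed meaning)."""
    return (
        "SELECT DISTINCT ?new WHERE { "
        f"VALUES ?p {{ {_values(predicates)} }} "
        f"<{entry_uri}> ?p ?new . "
        "FILTER(isIRI(?new)) }"
    )


def scope2_query(entry_uri, predicates):
    """SPARQL for IRIs two hops from entry_uri via listed predicates (assumed meaning)."""
    return (
        "SELECT DISTINCT ?new WHERE { "
        f"VALUES ?p1 {{ {_values(predicates)} }} "
        f"VALUES ?p2 {{ {_values(predicates)} }} "
        f"<{entry_uri}> ?p1 ?mid . ?mid ?p2 ?new . "
        "FILTER(isIRI(?new)) }"
    )


# The entry point would come from the inverted index; this URI is made up.
entry = "http://example.org/resource/SIGIR"
q1 = scope1_query(entry, [OWL_SAME_AS])
q2 = scope2_query(entry, [OWL_SAME_AS])
print(q1)
print(q2)
```

The strings are only built here, not executed; sending them to an RDF store would require a SPARQL 1.1 endpoint, since `VALUES` is a 1.1 feature.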
    12. Properties to Follow: Two Examples
        • Entry point given by the II.
    13. 3. Experimental Results
    14. 3.1. Experimental Setting
        • SemSearch 2010 and 2011 test sets:
          • Billion Triple Challenge 2009 (BTC2009): 1.3 billion RDF triples crawled from the LOD cloud.
          • 92 and 50 queries, respectively.
        • Evaluation of systems with depth-10 pooling by means of crowdsourcing.
        • Measures taken into consideration: Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), early Precision (P10).
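For reference, the three measures named on this slide can be computed per query as in the sketch below. The function names and the toy judgments are illustrative, and the DCG formula shown (gain divided by log2 of rank + 1) is one common variant; the slides do not say which variant was used. MAP is the mean of `average_precision` over all queries.

```python
import math


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant (P10 when k=10)."""
    return sum(1 for uri in ranked[:k] if uri in relevant) / k


def average_precision(ranked, relevant):
    """Average of the precision values at the rank of each relevant result."""
    hits, total = 0, 0.0
    for rank, uri in enumerate(ranked, start=1):
        if uri in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg(ranked, gains, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    actual = dcg([gains.get(uri, 0) for uri in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Toy judgments: "a" is highly relevant (gain 2), "c" is relevant (gain 1).
ranked = ["a", "b", "c", "d"]
gains = {"a": 2, "c": 1}
relevant = set(gains)
print(precision_at_k(ranked, relevant), average_precision(ranked, relevant), ndcg(ranked, gains))
```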
    15. Completing Relevance Judgements by Crowdsourcing
        • We obtained relevance judgments for unjudged entities in the top-10 results of our runs by using Amazon MTurk.
        • To be fair, we used the same design and settings that were used for the AOR task of SemSearch.
    16. 3.2. IR Techniques: Experimental Results
        • Our baseline.
    17. 3.3. Evaluation of Hybrid Approaches
        • N = 3, τ = 0, score = sumBM25 - ε
    18. 3.4. Overhead of the Graph Traversal
        • Time in milliseconds needed for each part of the hybrid approaches.
        • Measures taken on a single machine with cold cache.
        • Surprisingly small overhead (17% for best results).
    19. Conclusions
        • AOR = “Given the description of an entity, give me back its identifier”
        • Disappointing results using simple IR techniques for the AOR task.
        • Hybrid system for AOR:
          • combining classic IR techniques + a structured database storing graph data.
        • Our evaluation shows that the new approach leads to significantly better results (up to +25% MAP over the BM25 baseline).
        • For the best working configuration found, the overhead caused by the graph traversal part is limited (17% more than running the chosen baseline).
    20. Thank you for your attention
        • You can find the new relevance judgments at
        • More info at
        • In the following days you’ll find our paper, this presentation, and the new crowdsourced relevance judgements at