Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval
 

Slides for the paper "Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval" by Alberto Tonon, Gianluca Demartini, and Philippe Cudré-Mauroux, presented at SIGIR 2012.



Speaker Notes

  • A lot of search-engine queries are about entities (more than half); this motivates the task...
  • Mention that literals are strings attached to some node.
  • Note that this is the only scoring function.
  • Explain what sameAs is.
  • The data is a graph; the inverted index gives us an entry point, and then we walk the graph.
  • A TREC-like collection/test set with depth-10 pooling; everyone here knows it!
  • Say that the simple index is "OR", while UL, LA, and ULA are "AND". Mention the disappointment with the first BM25 results: we tried to use just the II but it didn't work, so we decided to go for the graph… NO GOOGLE.
  • Compare JUST s_1 with s_2 (lower recall but higher precision).
  • s2_3 doesn't follow wikilinks. The indices and the database were resident on the machine; we didn't focus on efficiency.

Presentation Transcript

  • Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval. Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux. eXascale Infolab, University of Fribourg, Switzerland. {firstname.lastname}@unifr.ch. SIGIR 2012, Monday, August 13th, 2012.
  • 2 Motivation • A large fraction of search-engine queries (more than half) are about entities. • There is an increasingly large amount of entity data online, often represented as huge graphs (e.g., the LOD cloud, the Google Knowledge Graph, the Facebook social graph). • Entities have globally unique identifiers (e.g., URIs), which are hard to discover and/or memorize.
  • 3 Ad-hoc Object Retrieval (informal definition) • "Given the description of an entity, give me back its identifier." • The description can be keywords (e.g., "Harry Potter"). • There can be more than one identifier per entity (e.g., DBpedia + Freebase). • How do we evaluate returned results?
  • 4 Ad-hoc Object Retrieval (formal definition by Pound et al.) • Input: an unstructured query q and a data graph G. • Output: a ranked list of resource identifiers (URIs) from G. • Evaluation: results (URIs) are scored by a judge with access to all the information contained in or linked to the resource. • Standard collections exist. • Example ranked results for the "harry potter" query, from two systems: 1. http://ex.plode.us/tag/harry+potter 2. http://www.vox.com/explore/interests/harry%20potter 3. http://www.flickr.com/groups/harrypotterandthedeathlyhallows/ 4. http://harrypotter.wizards.pro/ and: 1. http://dbpedia.org/resource/Harry_Potter_and_the_Deathly_Hallows 2. http://www.aktors.org/scripts/EdinburghPeople.dome#stephenp.aiai.ed.ac.uk 3. http://harrypotter.wizards.pro/ 4. http://ebiquity.umbc.edu/person/html/Harry/Chen/ 5. http://dbpedia.org/resource/Ceramist
  • 5 Overview of Our Solution • Inverted indices on the LOD Cloud, plus an RDF store containing the data. • Query processing: simple NLP techniques, query auto-completion, pseudo-relevance feedback. • Ranking: BM25, BM25F.
  • 6 A Simple Example • The query "SIGIR" goes through pseudo-relevance feedback, NLP techniques, and query auto-completion. • II + ranking function(s): 1. http://dbpedia.org/…/SIGIR 2. http://dbpedia.org/…/IRAQ 3. … • After graph traversals and the final ranking function: 1. http://dbpedia.org/…/SIGIR 2. http://freebase.com/…/sigir 3. http://dbpedia.org/…/IRAQ … • Open questions: How to build the II? Which properties should we follow? How to rank new results?
  • 7 Outline 1. Inverted Indices 2. Graph-Based Entity Search 2.1. Object Properties vs Datatype Properties 2.2. Properties to Follow 3. Experimental Results 3.1. Experimental Setting 3.2. IR Techniques: Experimental Results 3.3. Evaluation of the Hybrid Approaches 3.4. Overhead of the Graph Traversal
  • 8 1. Inverted Indices (IIs) • Simple inverted index: index all literals attached to each node in the input graph, e.g., "movie" → http://…types/film. • Structured inverted index with three fields: • URI: tokenized URIs identifying entities. • Label: manually selected datatype properties pointing to textual descriptions of the entity (e.g., label, title, name, full-name, …). • Attributes: all other literals. • Ranking over these indices: BM25(F), query auto-completion, query expansion, pseudo-relevance feedback.
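A minimal Python sketch of the simple inverted index described above, assuming a toy (subject, predicate, object, is_literal) triple layout; the function names and example URI are illustrative, not the paper's implementation:

```python
from collections import defaultdict

def tokenize(text):
    # naive whitespace tokenizer; the real pipeline is more involved
    return text.lower().split()

def build_simple_index(triples):
    """triples: iterable of (subject_uri, predicate, obj, is_literal)."""
    index = defaultdict(set)
    for subj, _pred, obj, is_literal in triples:
        if is_literal:  # the simple II indexes all literals attached to a node
            for token in tokenize(obj):
                index[token].add(subj)
    return index

# toy data mirroring the slide's example: "movie" maps to a film-type node
triples = [("http://example.org/types/film", "rdfs:label", "movie", True)]
index = build_simple_index(triples)
print(index["movie"])  # {'http://example.org/types/film'}
```

The structured variant keeps three such fields (URI, Label, Attributes) per entity and scores them jointly, e.g., with BM25F.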
  • 9 2. Graph-Based Entity Search • Take the top-N documents from the IR results. • Follow links/properties (p_1, …, p_m) to get new URIs. • Filter the new results by text similarity with respect to the user query: keep an entity e only if sim(e, q) > τ. • Assign scores and merge into the re-ranked result list. • Scoring functions: count sim > τ, avg sim > τ, sum sim, avg sim, sum BM25 - ε.
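The traversal-and-rerank loop on this slide admits a hedged sketch; the graph is assumed to be an adjacency map from URIs to linked URIs, and all names here are illustrative rather than the paper's code:

```python
def hybrid_rerank(ir_results, query, graph, n=3, tau=0.0, epsilon=1e-3):
    """ir_results: list of (uri, bm25_score) pairs, best first.
    The defaults mirror the configuration reported on slide 18
    (N = 3, tau = 0, score = sumBM25 - epsilon)."""
    candidates = dict(ir_results)
    for uri, score in ir_results[:n]:             # 1. take the top-N docs
        for neighbour in graph.get(uri, []):      # 2. follow selected properties
            if neighbour in candidates:
                continue
            if similarity(neighbour, query) > tau:  # 3. keep only if sim(e, q) > tau
                # "sumBM25 - epsilon": rank just below the entity it came from
                candidates[neighbour] = score - epsilon
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)

def similarity(uri, query):
    # placeholder similarity: token overlap between URI and query terms
    q = set(query.lower().split())
    u = set(uri.lower().replace("/", " ").replace("_", " ").split())
    return len(q & u) / max(len(q), 1)
```

The other scoring functions listed above (count sim > τ, avg sim, sum sim) would replace the score assignment in step 3.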
  • 10 2.1. Object Properties vs Datatype Properties • Object properties: connect different entities; following them explores the whole graph. • Datatype properties: give additional information about entities; following them explores just the neighborhood of a node. (See the sketch below.)
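A toy sketch of that split, reusing the illustrative triple layout from the index example above:

```python
def split_properties(triples):
    """Separate object properties (URI -> URI edges) from
    datatype properties (URI -> literal edges)."""
    object_edges, datatype_edges = [], []
    for subj, pred, obj, is_literal in triples:
        bucket = datatype_edges if is_literal else object_edges
        bucket.append((subj, pred, obj))
    return object_edges, datatype_edges
```

Only the object edges can lead the traversal to new candidate entities; the datatype edges stay in a node's neighborhood and carry its textual descriptions.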
  • 11 2.2. Properties to Follow • The RDF graph is queried with SPARQL queries. • Scope 1 queries vs Scope 2 queries. • The set of predicates to follow is selected using: • common sense (e.g., sameAs); • statistics from the data.
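As one possible illustration of a scope-2 expansion, the sketch below uses rdflib to follow owl:sameAs up to two hops from an entry point; the data file, entry-point URI, and the restriction to a single predicate are assumptions made for the example:

```python
from rdflib import Graph

g = Graph()
g.parse("btc_sample.nt", format="nt")  # assumed local slice of the RDF data

# scope 1: direct neighbours; scope 2: also neighbours of neighbours
SCOPE2_QUERY = """
SELECT DISTINCT ?neighbour WHERE {
  { <%(uri)s> <http://www.w3.org/2002/07/owl#sameAs> ?neighbour . }
  UNION
  { <%(uri)s> <http://www.w3.org/2002/07/owl#sameAs> ?mid .
    ?mid <http://www.w3.org/2002/07/owl#sameAs> ?neighbour . }
}
"""

entry_point = "http://dbpedia.org/resource/SIGIR"  # entry point from the II
for row in g.query(SCOPE2_QUERY % {"uri": entry_point}):
    print(row.neighbour)
```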
  • 12 Properties to Follow: Two Examples • Entry point given by the II. (The slide shows two example traversals from this entry point.)
  • 13 3. Experimental Results
  • 14 3.1. Experimental Setting • SemSearch 2010 and 2011 test sets: • Billion Triple Challenge 2009 (BTC2009): 1.3 billion RDF triples crawled from the LOD cloud. • 92 and 50 queries, respectively. • Systems evaluated with depth-10 pooling by means of crowdsourcing. • Measures taken into consideration: Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), early precision (P@10). Minimal implementations of these measures are sketched below.
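For reference, minimal Python versions of the three measures, assuming binary relevance judgments in a rels dict keyed by URI (MAP is the average of average_precision over all queries; the rank-10 cutoffs match the depth-10 pooling):

```python
import math

def precision_at_10(ranking, rels):
    return sum(rels.get(uri, 0) for uri in ranking[:10]) / 10

def average_precision(ranking, rels):
    hits, score = 0, 0.0
    for rank, uri in enumerate(ranking, start=1):
        if rels.get(uri, 0):
            hits += 1
            score += hits / rank
    relevant = sum(rels.values())
    return score / relevant if relevant else 0.0

def ndcg_at_10(ranking, rels):
    dcg = sum(rels.get(uri, 0) / math.log2(rank + 1)
              for rank, uri in enumerate(ranking[:10], start=1))
    ideal = sorted(rels.values(), reverse=True)[:10]
    idcg = sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```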
  • 15 Completing Relevance Judgments by Crowdsourcing • We obtained relevance judgments for unjudged entities in the top-10 results of our runs by using Amazon MTurk. • To be fair, we used the same design and settings that were used for the AOR task of SemSearch.
  • 16 3.2. IR Techniques: Experimental Results • (The slide shows a table of results; one run is marked as our baseline.)
  • 18 3.3. Evaluation of the Hybrid Approaches • Configuration: N = 3, τ = 0, score = sumBM25 - ε.
  • 19 3.4. Overhead of the Graph Traversal • Time in milliseconds needed for each part of the hybrid approaches. • Measurements taken on a single machine with a cold cache. • Surprisingly small overhead (17% for the best results).
  • 20 Conclusions • AOR = "Given the description of an entity, give me back its identifier." • Simple IR techniques give disappointing results on the AOR task. • Our hybrid system for AOR combines classic IR techniques with a structured database storing graph data. • Our evaluation shows that the new approach leads to significantly better results (up to +25% MAP over the BM25 baseline). • For the best configuration found, the overhead caused by the graph traversal is limited (17% more than running the chosen baseline).
  • 21 Thank You for Your Attention • You can find the new relevance judgments at http://diuf.unifr.ch/xi/HybridAOR. • More info at www.exascale.info. • In the following days you will find our paper, this presentation, and the new crowdsourced relevance judgments at www.exascale.info/AOR.