Collection Ranking & Selectionfor Federated Entity SearchKrisztian Balog, Robert Neumayer, and Kjetil NørvågNorwegian Univ...
Motivation- Entities are ubiquitous  Many information needs revolve around entities  (people, products, organisations, pla...
Top searches of 2011*       Web search                  Mobile search       1. iPhone                   1. iPhone 5       ...
Top searches of 2011*       Web search                  Mobile search       1. iPhone                   1. iPhone 5       ...
The Web of Data           # Linking Open Data datasets  300  200  100    0    2007   2008       2009        2010    2011
LOD in 2007                                      Fresh-       ECS                                                  South- ...
LOD in 2011                                                                                                               ...
Ad-hoc entity searchAt the 2010/11 Semantic Search Challenge - Task   Given a keyword query, targeting a particular entity...
In this talk- Address the ad-hoc entity retrieval task in a  distributed setting  - The Web of Data is inherently distribu...
Federated searchA typical broker-based architecture1   Collection ranking   Central broker                             Sum...
Federated searchA typical broker-based architecture1   Collection ranking     Central broker2   Collection selection      ...
Federated searchA typical broker-based architecture1   Collection ranking     Central broker2   Collection selection3   Re...
Next: baseline modelsfor collection ranking and selection1   Collection ranking     Central broker2   Collection selection...
Collection rankingCollection-centric (CC) - Lexicon-based method   Treat and score each collection as if it was a single, ...
Collection rankingEntity-centric (EC) - Document-surrogate method   Model and query individual documents (entities) and   ...
Collection selectionTop-K selection - Choosing a fixed cutoff (K) ahead of time   K is usually set between 5 and 20        ...
Our method: AENN“All that an Entity Needs is a Name” - Exploit that entities are searched by their name - The central brok...
AENN collection ranking- Key observation  Different methods—collection-centric (CC) vs. entity-  centric (EC)—work best fo...
AENN collection rankingExample              CC         EC         AENN          A   0.35   A   0.65   A    0.5          D ...
AENN collection selection- Key observation  CC has higher recall, while EC has better precision- Idea  Use the collection ...
AENN collection selectionPrecision-oriented selection (AENN(p)) - Only select collections returned by EC              CC  ...
AENN collection selectionRecall-oriented selection (AENN(r)) - Include collections from CC until all from EC are   covered...
AENN collection selectionBalanced selection (AENN(b)) - Include collections from AENN until all from EC   are covered     ...
AENN collection selectionComparison of approaches              AENN(p)     AENN(r)     AENN(b)              A   0.5     A ...
Experimental setupBased on the 2010/11 Semantic Search Challenge - Distributed environment   Top 100 largest second-level ...
Test collections                                          BTC                                 BTC             DBpedia     ...
ResultsCollection ranking (BTC)                   CC         EC   AENN           0.75           0.50     MAP           0.2...
Results      Different collection selection strategies (BTCDBpedia)              Precision                             Rec...
Results      Collection selection (DBpedia)              Precision                             Recall                     ...
Summary- Addressed the task of ad-hoc entity retrieval in a  distributed setting- Introduced AENN, a novel collection rank...
Questions?Resources are available at http://bit.ly/OzfYK2                                               Contact           ...
Upcoming SlideShare
Loading in …5
×

Collection Ranking and Selection for Federated Entity Search

578 views
516 views

Published on

Presented at the 19th International Symposium on String Processing and Information Retrieval (SPIRE 2012).

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
578
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide
  • \n
  • \n
  • \n
  • \n
  • \n
  • This is how the Linking Open Data cloud looked like sometime in 2007\n
  • And this is how it looks like more-or-less now.\n
  • Main lessons learned: fielded variants of standard document retrieval methods work well on this task\nAll approaches, however, assumed that a centralised index encompassing all entities is available\n
  • \n
  • As I said earlier, existing approaches build document-based representations of entities. Therefore, federated search techniques for document retrieval can easily be applied.\n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • Acronym comes from the very insightful observation that ...\n\nPrevious methods rely on sampling\n\n\n
  • Note that the central broker contains only the names of entities\n
  • Ok, let’s familiarise ourselves with these acronyms, because we’ll be using them a lot in the following slides.\n
  • \n
  • Recall that EC operates on a local ranking of entities; we expect this to be the final list of results\nConsequently, only collection that contribute to this list will be selected.\n
  • We start at the top of the CC ranking and keep including collections until we have all from EC covered. \n
  • We start at the top of the CC ranking and keep including collections until we have all from EC covered. \n
  • \n
  • No standard test collection exists for this task.\n
  • \n
  • \n
  • \n
  • CC-N: lower bound\nEC-C: upper bound (essentially, a centralised index of all entities with full content)\n
  • \n
  • \n
  • Collection Ranking and Selection for Federated Entity Search

    1. 1. Collection Ranking & Selectionfor Federated Entity SearchKrisztian Balog, Robert Neumayer, and Kjetil NørvågNorwegian University of Science and Technology19th International Symposium on String Processing and Information Retrieval (SPIRE 2012)Cartagena de Indias, Colombia, October 2012
    2. 2. Motivation- Entities are ubiquitous Many information needs revolve around entities (people, products, organisations, places, etc.)- Growing amount of structured data Again, organised around entities- Entities are often searched by their name If we know (heard of) it, we just ask for it by the name
    3. 3. Top searches of 2011* Web search Mobile search 1. iPhone 1. iPhone 5 2. Casey Anthony 2. Powerball 3. Kim Kardashian 3. MLB 4. Katy Perry 4. Scrabble cheat 5. Jennifer Lopez 5. Casey Anthony 6. Lindsay Lohan 6. Hurricane Irene 2011 7. American Idol 7. Kim Kardashian 8. Jennifer Aniston 8. Translator 9. Japan earthquake 9. Amy Winehouse 10. Osama bin Laden 10. May 21, 2011 Rapture* http://yearinreview.yahoo.com/
    4. 4. Top searches of 2011* Web search Mobile search 1. iPhone 1. iPhone 5 2. Casey Anthony 2. Powerball “users have learned that search engine relevance 3. Kim Kardashian 3. MLB decreases with longer queries and have grown 4. Katy Perry 4. Scrabble cheat accustomed to reducing their query Anthony 5. Jennifer Lopez 5. Casey (at least initially) to the name of an entity” 6. Lindsay Lohan 6. Hurricane Irene 2011 7. American Idol 7. Kim Kardashian 8. Jennifer Aniston [Blanco et al., 2011] 8. Translator 9. Japan earthquake 9. Amy Winehouse 10. Osama bin Laden 10. May 21, 2011 Rapture* http://yearinreview.yahoo.com/
    5. 5. The Web of Data # Linking Open Data datasets 300 200 100 0 2007 2008 2009 2010 2011
    6. 6. LOD in 2007 Fresh- ECS South- SIOC meat ampton BBC Later + NEW! Sem- NEW! TOTP Web- SW Central Onto- Conference world Corpus Music- Magna- FOAF brainz tune Open- Guides Revyu RDF Book Jamendo Geo- Mashup names DBpedia DBLP Berlin US World NEW! Census Fact- Data book NEW! lingvoj Euro- flickr stat wrappr Wiki- Open DBLP company Cyc Hannover Gov- Track Project W3C Guten- WordNet berg
    7. 7. LOD in 2011 Linked LOV User Slideshare tags2con Audio Feedback 2RDF delicious Moseley Scrobbler Bricklink Sussex Folk (DBTune) Reading St. GTAA Magna- Lists Andrews Klapp- tune stuhl- Resource NTU DB club Lists Resource Tropes Lotico Semantic yovisto John Music Man- Lists Music Tweet chester Hellenic Peel Brainz NDL (DBTune) (Data Brainz Reading subjects FBD (zitgist) Lists Open EUTC Incubator) Linked Hellenic Library Open t4gm Produc- Crunch- PD Surge RDF info tions Discogs base Library Radio Ontos Source Code Crime ohloh Plymouth (Talis) (Data News LEM Ecosystem Reading RAMEAU Reports business Incubator) Crime data.gov. Portal Linked Data Lists SH UK Music Jamendo (En- uk Brainz (DBtune) LinkedL Ox AKTing) FanHubz gnoss ntnusc (DBTune) SSW CCN Points Thesau- Last.FM Poké- Thesaur Popula- artists pédia Didactal us rus W LIBRIS tion (En- (DBTune) Last.FM ia theses. LCSH Rådata reegle research patents MARC AKTing) (rdfize) my fr nå! data.gov. data.go Codes Ren. NHS uk v.uk Good- Experi- Classical List Energy (En- win flickr ment (DB Pokedex Norwe- Genera- AKTing) Mortality BBC Family wrappr Sudoc PSH Tune) gian (En- tors Program MeSH AKTing) semantic mes BBC IdRef GND CO2 educatio OpenEI web.org SW Energy Sudoc ndlna Emission n.data.g Music Dog VIAF EEA (En- Chronic- Linked (En- ov.uk Portu- Food UB AKTing) ling Event MDB AKTing) guese Mann- Europeana BBC America Media DBpedia Calames heim Ord- Recht- Wildlife Deutsche Open Revyu DDC Openly spraak. Finder Bio- lobid Election nance legislation Local nl RDF graphie Resources NSZL Swedish Data Survey Tele- data Ulm EU New Book Project data.gov.uk graphis bnf.fr Catalog Open Insti- York URI Open Mashup Cultural tutions Times Greek P20 UK Post- Burner Calais Heritage codes DBpedia ECS Wiki statistics lobid GovWILD data.gov. Taxon iServe South- Organi- LOIUS BNBBrazilian uk Concept ECS ampton sations Geo World OS BibBase STW GESIS Poli- ESD South- ECS Names Fact- (RKB ticians stan- reference ampton data.gov.uk book Freebase Explorer) Budapest dards data.gov. NASA EPrints uk intervals Project OAI Lichfield transport (Data DBpedia data Guten- Pisa Spen- data.gov. Incu- dcs RESEX Scholaro- ISTAT ding bator) Fishes berg DBLP DBLP uk Geo meter Immi- Scotland of Texas (FU (L3S) Pupils & Uberblic DBLP gration Species Berlin) IRIT Exams Euro- dbpedia data- (RKB London TCM ACM stat lite open- Explorer) NVD Gazette (FUB) Gene IBM Traffic Geo ac-uk Scotland TWC LOGD Eurostat Daily DIT Linked UN/ Data UMBEL Med ERA Data LOCODE DEPLOY Gov.ie CORDIS YAGO New- lingvoj Disea- (RKB some SIDER RAE2001 castle LOCAH CORDIS Explorer) Linked Eurécom Eurostat Drug CiteSeer Roma (FUB) Sensor Data GovTrack (Ontology (Kno.e.sis) Open Bank Pfam Course- Central) riese Enipedia Cyc Lexvo LinkedCT ware Linked PDB UniProt VIVO EURES EDGAR dotAC US SEC Indiana ePrints IEEE (Ontology totl.net (rdfabout) Central) WordNet RISKS (VUA) Taxono UniProt US Census EUNIS Twarql HGNC Semantic Cornetto (Bio2RDF) (rdfabout) my VIVO FTS XBRL PRO- ProDom STITCH Cornell LAAS SITE KISTI NSF Scotland Geo- GeoWord LODE graphy Net WordNet WordNet JISC (W3C) (RKB Climbing Linked Affy- KEGG SMC Explorer) SISVU Pub VIVO UF Piedmont GeoData metrix Drug ECCO- Finnish Journals PubMed Gene SGD Chem Munici- Accomo- El AGROV Ontology TCP Media dations Alpine bible palities Viajero OC Ski ontology Tourism KEGG Ocean Austria Enzyme PBAC Geographic Metoffice GEMET ChEMBL Italian Drilling OMIM KEGG Weather Open public Codices AEMET Linked MGI Pathway schools Forecasts Data Open InterPro GeneID Publications EARTh Thesau- KEGG Turismo rus Colors Reaction de Zaragoza Product Smart KEGG User-generated content Weather DB Link Medi Glycan Janus Stations Product Care KEGG AMP UniParc UniRef UniSTS Government Types Italian Homolo Com- Yahoo! Airports Museums pound Ontology Google Gene Geo Art Planet National wrapper Chem2 Cross-domain Radio- Bio2RDF activity UniPath JP Sears Open Linked OGOLOD way Life sciences Corpo- Amster- Reactome dam medu- Open rates Numbers Museum cator As of September 2011
    8. 8. Ad-hoc entity searchAt the 2010/11 Semantic Search Challenge - Task Given a keyword query, targeting a particular entity, provide a ranked list of relevant entities (i.e., URIs) - Queries Sampled from web search engine logs (142 in total) - Data collection Billion Triple Challenge 2009 (BTC) dataset - Relevance judgments On a 3-point scale, collected using crowdsourcing
    9. 9. In this talk- Address the ad-hoc entity retrieval task in a distributed setting - The Web of Data is inherently distributed - Some data sources may not be crawleable at all- Specifically, our focus is on the collection ranking and collection selection steps
    10. 10. Federated searchA typical broker-based architecture1 Collection ranking Central broker Summary A Summary B Summary C Q 1 A C B
    11. 11. Federated searchA typical broker-based architecture1 Collection ranking Central broker2 Collection selection Summary A Summary B Summary C Collection A Q Q 1 A Collection B C 2 B Q Collection C
    12. 12. Federated searchA typical broker-based architecture1 Collection ranking Central broker2 Collection selection3 Result merging Summary A Summary B Summary C Collection A Q Q 1 A Collection B C 2 B Q 3 Collection C
    13. 13. Next: baseline modelsfor collection ranking and selection1 Collection ranking Central broker2 Collection selection3 Result merging Summary A Summary B Summary C Collection A Q Q 1 A Collection B C 2 B Q 3 Collection C
    14. 14. Collection rankingCollection-centric (CC) - Lexicon-based method Treat and score each collection as if it was a single, large document Collection A Q Collection C Collection B ... ... Y P (c|q) / P (c) · P (t|✓c ) t2q
    15. 15. Collection rankingEntity-centric (EC) - Document-surrogate method Model and query individual documents (entities) and aggregate their relevance scores Collection A Collection C Q Collection B ... ... X P (c|q) / P (e|q) e2c,r(e,q)<
    16. 16. Collection selectionTop-K selection - Choosing a fixed cutoff (K) ahead of time K is usually set between 5 and 20 A D C E B F
    17. 17. Our method: AENN“All that an Entity Needs is a Name” - Exploit that entities are searched by their name - The central broker maintains a complete dictionary of entity names (and corresponding identifiers) - Utilise this information in the collection selection step to dynamically adjust the #collections selected
    18. 18. AENN collection ranking- Key observation Different methods—collection-centric (CC) vs. entity- centric (EC)—work best for different queries- Idea Combination should give better results than any of the two methods alone AENN(c, q) = (1 ) · CC(c, q) + · EC(c, q)
    19. 19. AENN collection rankingExample CC EC AENN A 0.35 A 0.65 A 0.5 D 0.3 B 0.3 B 0.2 C 0.2 C 0.05 D 0.15 E 0.15 C 0.125 B 0.1 E 0.075 F 0.05 F 0.025
    20. 20. AENN collection selection- Key observation CC has higher recall, while EC has better precision- Idea Use the collection rankings generated by EC and/or CC to dynamically adjust the set of collections selected - Precision-oriented selection - Recall-oriented selection - Balanced selection
    21. 21. AENN collection selectionPrecision-oriented selection (AENN(p)) - Only select collections returned by EC CC EC AENN AENN(p) A 0.35 A 0.65 A 0.5 A 0.5 D 0.3 B 0.3 B 0.2 B 0.2 C 0.2 C 0.05 D 0.15 C 0.125 E 0.15 C 0.125 B 0.1 E 0.075 F 0.05 F 0.025
    22. 22. AENN collection selectionRecall-oriented selection (AENN(r)) - Include collections from CC until all from EC are covered. This defines the cutoff point for AENN CC EC AENN AENN(r) A 0.35 A 0.65 A 0.5 A 0.5 D 0.3 B 0.3 B 0.2 B 0.2 C 0.2 C 0.05 D 0.15 D 0.15 E 0.15 C 0.125 C 0.125 B 0.1 E 0.075 E 0.075 F 0.05 F 0.025
    23. 23. AENN collection selectionBalanced selection (AENN(b)) - Include collections from AENN until all from EC are covered CC EC AENN AENN(b) A 0.35 A 0.65 A 0.5 A 0.5 D 0.3 B 0.3 B 0.2 B 0.2 C 0.2 C 0.05 D 0.15 D 0.15 E 0.15 C 0.125 C 0.125 B 0.1 E 0.075 F 0.05 F 0.025
    24. 24. AENN collection selectionComparison of approaches AENN(p) AENN(r) AENN(b) A 0.5 A 0.5 A 0.5 B 0.2 B 0.2 B 0.2 C 0.125 D 0.15 D 0.15 C 0.125 C 0.125 E 0.075
    25. 25. Experimental setupBased on the 2010/11 Semantic Search Challenge - Distributed environment Top 100 largest second-level domains from BTC - Three sets with different handling of DBpedia - Relevance Considered the #relevant entities from each collection - Metrics - Collection ranking: Standard IR metrics (MAP, MRR, nDCG) - Collection selection: Analogues of precision and recall, plus the avg. #coll. selected
    26. 26. Test collections BTC BTC DBpedia DBpedia#Entities 68.8M 60.5M 8.8M#Collections 100 99 100#Queries 136 116 130Avg. #rel. entities/query 14.9 4.8 10.1Avg. #rel. entities/collection 3.4 2.8 9.4
    27. 27. ResultsCollection ranking (BTC) CC EC AENN 0.75 0.50 MAP 0.25 0 Name-only Full content
    28. 28. Results Different collection selection strategies (BTCDBpedia) Precision Recall Avg. #coll. selected0.9 1.0 500.6 0.9 330.3 0.7 17 0 0.6 0 1 3 10 20 50 100 1 3 10 20 50 100 1 3 10 20 50 100 K K K AENN(p) AENN(r) AENN(b)
    29. 29. Results Collection selection (DBpedia) Precision Recall Avg. #coll. selected0.5 1.0 1000.3 0.7 670.2 0.3 33 0 0 0 1 3 10 20 50 100 1 3 10 20 50 100 1 3 10 20 50 100 K K K CC-N EC-C AENN(b)
    30. 30. Summary- Addressed the task of ad-hoc entity retrieval in a distributed setting- Introduced AENN, a novel collection ranking and selection method based on a lean name-based entity representation- Showed experimentally that AENN can outperform standard baselines that consider all entity content- Further, AENN can be geared towards high precision, high recall, or a balanced setting
    31. 31. Questions?Resources are available at http://bit.ly/OzfYK2 Contact @krisztianbalog krisztianbalog.com

    ×