Collection Ranking and Selection for Federated Entity Search

Presented at the 19th International Symposium on String Processing and Information Retrieval (SPIRE 2012).

  1. 1. Collection Ranking & Selection for Federated Entity Search
     Krisztian Balog, Robert Neumayer, and Kjetil Nørvåg
     Norwegian University of Science and Technology
     19th International Symposium on String Processing and Information Retrieval (SPIRE 2012)
     Cartagena de Indias, Colombia, October 2012
  2. 2. Motivation
     - Entities are ubiquitous: many information needs revolve around entities (people, products, organisations, places, etc.)
     - Growing amount of structured data: again, organised around entities
     - Entities are often searched by their name: if we know (heard of) it, we just ask for it by the name
  3. 3. Top searches of 2011*
     Rank | Web search       | Mobile search
     1    | iPhone           | iPhone 5
     2    | Casey Anthony    | Powerball
     3    | Kim Kardashian   | MLB
     4    | Katy Perry       | Scrabble cheat
     5    | Jennifer Lopez   | Casey Anthony
     6    | Lindsay Lohan    | Hurricane Irene 2011
     7    | American Idol    | Kim Kardashian
     8    | Jennifer Aniston | Translator
     9    | Japan earthquake | Amy Winehouse
     10   | Osama bin Laden  | May 21, 2011 Rapture
     * http://yearinreview.yahoo.com/
  4. 4. Top searches of 2011* (same lists as on slide 3, overlaid with a quote)
     “users have learned that search engine relevance decreases with longer queries and have grown accustomed to reducing their query (at least initially) to the name of an entity” [Blanco et al., 2011]
     * http://yearinreview.yahoo.com/
  5. 5. The Web of Data
     [Chart: number of Linking Open Data datasets per year, 2007 to 2011; y-axis from 0 to 300]
  6. 6. LOD in 2007
     [Figure: Linking Open Data cloud diagram for 2007, showing a few dozen interlinked datasets, including DBpedia, DBLP, FOAF, Geonames, MusicBrainz, Jamendo, Revyu, WordNet, US Census Data, World Factbook, Eurostat and Project Gutenberg]
  7. 7. LOD in 2011
     [Figure: Linking Open Data cloud diagram as of September 2011, showing several hundred interlinked datasets grouped by domain: media, geographic, publications, user-generated content, government, cross-domain, and life sciences]
  8. 8. Ad-hoc entity search
     At the 2010/11 Semantic Search Challenge
     - Task: given a keyword query, targeting a particular entity, provide a ranked list of relevant entities (i.e., URIs)
     - Queries: sampled from web search engine logs (142 in total)
     - Data collection: Billion Triple Challenge 2009 (BTC) dataset
     - Relevance judgments: on a 3-point scale, collected using crowdsourcing
  9. 9. In this talk
     - Address the ad-hoc entity retrieval task in a distributed setting
       - The Web of Data is inherently distributed
       - Some data sources may not be crawlable at all
     - Specifically, our focus is on the collection ranking and collection selection steps
  10. 10. Federated search: a typical broker-based architecture
      - Step 1: collection ranking
      [Diagram: query Q reaches the central broker, which holds Summary A, Summary B and Summary C and produces a ranking of the collections (A, C, B)]
  11. 11. Federated search: a typical broker-based architecture
      - Step 1: collection ranking
      - Step 2: collection selection
      [Diagram: as on the previous slide, with Q now also forwarded from the broker to selected collections among Collection A, Collection B and Collection C]
  12. 12. Federated search: a typical broker-based architecture
      - Step 1: collection ranking
      - Step 2: collection selection
      - Step 3: result merging
      [Diagram: as on the previous slide, with step 3 (result merging) added]
  13. 13. Next: baseline models for collection ranking and selection
      - Step 1: collection ranking
      - Step 2: collection selection
      - Step 3: result merging
      [Diagram: the broker-based architecture from the previous slide, repeated]
  14. 14. Collection ranking: collection-centric (CC)
      - Lexicon-based method: treat and score each collection as if it was a single, large document
      - $P(c|q) \propto P(c) \cdot \prod_{t \in q} P(t|\theta_c)$
      [Diagram: query Q scored against Collection A, Collection B, Collection C, ...]
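A minimal sketch of the collection-centric score, computed in log space. The slide gives only the product form; the smoothing against a background model (weight `lam`), the toy collections, and all function and variable names here are assumptions made for illustration.

```python
import math
from collections import Counter

def cc_score(query_terms, collection_text, background, prior=1.0, lam=0.1):
    """log P(c) + sum over query terms of log P(t | theta_c).

    The collection is treated as one large document. Linear smoothing
    with a background model is an assumption, not taken from the slides.
    """
    tf = Counter(collection_text.split())
    length = sum(tf.values())
    score = math.log(prior)
    for t in query_terms:
        p_ml = tf[t] / length if length else 0.0   # maximum-likelihood estimate
        p_bg = background.get(t, 1e-9)             # tiny floor for unseen terms
        score += math.log((1 - lam) * p_ml + lam * p_bg)
    return score

# Toy collections (hypothetical content).
collections = {
    "A": "iphone apple iphone phone",
    "B": "kim kardashian celebrity news",
}
all_tokens = [tok for text in collections.values() for tok in text.split()]
background = {t: n / len(all_tokens) for t, n in Counter(all_tokens).items()}

cc_ranking = sorted(
    collections,
    key=lambda c: cc_score(["iphone"], collections[c], background),
    reverse=True,
)
```

With these toy inputs, collection A ranks above B for the query term "iphone", since only A contains it.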
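  15. 15. Collection ranking: entity-centric (EC)
      - Document-surrogate method: model and query individual documents (entities) and aggregate their relevance scores
      - $P(c|q) \propto \sum_{e \in c,\; r(e,q) < \ldots} P(e|q)$, i.e. the sum runs over the entities e in c whose r(e,q) is below a cutoff
      [Diagram: query Q scored against the entities of Collection A, Collection B, Collection C, ...]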
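The entity-centric aggregation can be sketched as follows, reading r(e,q) as the rank of entity e in a global entity ranking for q. The parameter name `cutoff`, the entity identifiers, and the scores are hypothetical.

```python
from collections import defaultdict

def ec_scores(ranked_entities, cutoff):
    """Sum P(e|q) per collection over entities with rank < cutoff.

    ranked_entities: list of (entity_id, collection_id, p_e_given_q),
    best first; ranks start at 1.
    """
    totals = defaultdict(float)
    for rank, (_, coll, p) in enumerate(ranked_entities, start=1):
        if rank >= cutoff:
            break
        totals[coll] += p
    return dict(totals)

# Toy entity ranking for one query (hypothetical values).
ranked = [
    ("e1", "A", 0.40),
    ("e2", "B", 0.30),
    ("e3", "A", 0.25),
    ("e4", "C", 0.05),
]
scores = ec_scores(ranked, cutoff=5)
```

Collections that contribute no entity above the cutoff receive no score at all, so an entity-centric ranking can be shorter than a collection-centric one.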
  16. 16. Collection selection: top-K selection
      - Choosing a fixed cutoff (K) ahead of time; K is usually set between 5 and 20
      [Diagram: ranked list of collections A, D, C, E, B, F]
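Top-K selection amounts to truncating the collection ranking at a fixed position. A minimal sketch, with toy scores and a hypothetical helper name `top_k`:

```python
def top_k(scores, k):
    """Return the k highest-scoring collections, best first."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking[:k]

# Toy collection scores (hypothetical values).
toy_scores = {"A": 0.35, "D": 0.3, "C": 0.2, "E": 0.15, "B": 0.1, "F": 0.05}
selected = top_k(toy_scores, 3)
```

The cutoff is the same for every query, regardless of how many collections actually hold relevant entities.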
  17. 17. Our method: AENN, “All that an Entity Needs is a Name”
      - Exploit that entities are searched by their name
      - The central broker maintains a complete dictionary of entity names (and corresponding identifiers)
      - Utilise this information in the collection selection step to dynamically adjust the number of collections selected
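One way to picture the broker's name dictionary is a term index over entity names. The slides do not say how names are matched or scored, so the all-terms matching rule, the example URIs, and the helper names below are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical dictionary entries: (entity URI, collection, name).
entity_names = [
    ("http://example.org/a/iphone", "A", "iPhone"),
    ("http://example.org/b/iphone-5", "B", "iPhone 5"),
    ("http://example.org/b/katy-perry", "B", "Katy Perry"),
]

def build_name_index(entries):
    """Map each lowercased name term to the (URI, collection) pairs using it."""
    index = defaultdict(set)
    for uri, coll, name in entries:
        for term in name.lower().split():
            index[term].add((uri, coll))
    return index

def lookup(index, query):
    """Return entities whose name contains every query term (assumed rule)."""
    terms = query.lower().split()
    matches = defaultdict(int)
    for t in terms:
        for entry in index.get(t, ()):
            matches[entry] += 1
    return sorted(e for e, n in matches.items() if n == len(terms))

name_index = build_name_index(entity_names)
hits = lookup(name_index, "iphone")
collections_hit = {coll for _, coll in hits}
```

The set of collections that own the matching entities is the kind of signal the selection step can use to decide how many collections to contact.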
  18. 18. AENN collection ranking
      - Key observation: different methods, collection-centric (CC) vs. entity-centric (EC), work best for different queries
      - Idea: a combination should give better results than either of the two methods alone
      - $\mathrm{AENN}(c, q) = (1 - \lambda) \cdot \mathrm{CC}(c, q) + \lambda \cdot \mathrm{EC}(c, q)$
  19. 19. AENN collection ranking: example
      Rank | CC     | EC     | AENN
      1    | A 0.35 | A 0.65 | A 0.5
      2    | D 0.3  | B 0.3  | B 0.2
      3    | C 0.2  | C 0.05 | D 0.15
      4    | E 0.15 |        | C 0.125
      5    | B 0.1  |        | E 0.075
      6    | F 0.05 |        | F 0.025
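The example scores can be reproduced with a linear mixture of the two rankings. The sketch below assumes equal weights (0.5 each), which is inferred from the example figures rather than stated on the slide, and treats a collection missing from one ranking as scoring 0 there; `aenn_scores` and `lam` are hypothetical names.

```python
def aenn_scores(cc, ec, lam=0.5):
    """Mix CC and EC scores: (1 - lam) * CC + lam * EC.

    A collection absent from one ranking is assumed to score 0 in it.
    """
    collections = set(cc) | set(ec)
    return {c: (1 - lam) * cc.get(c, 0.0) + lam * ec.get(c, 0.0) for c in collections}

# Scores from the example slide.
cc = {"A": 0.35, "D": 0.3, "C": 0.2, "E": 0.15, "B": 0.1, "F": 0.05}
ec = {"A": 0.65, "B": 0.3, "C": 0.05}

aenn = aenn_scores(cc, ec)
aenn_ranking = sorted(aenn, key=aenn.get, reverse=True)
```

For instance, B scores 0.5 · 0.1 + 0.5 · 0.3 = 0.2 and moves from fifth place under CC to second place in the combined ranking.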
  20. 20. AENN collection selection
      - Key observation: CC has higher recall, while EC has better precision
      - Idea: use the collection rankings generated by EC and/or CC to dynamically adjust the set of collections selected
        - Precision-oriented selection
        - Recall-oriented selection
        - Balanced selection
  21. 21. AENN collection selection: precision-oriented selection (AENN(p))
      - Only select collections returned by EC
      Rank | CC     | EC     | AENN    | AENN(p)
      1    | A 0.35 | A 0.65 | A 0.5   | A 0.5
      2    | D 0.3  | B 0.3  | B 0.2   | B 0.2
      3    | C 0.2  | C 0.05 | D 0.15  | C 0.125
      4    | E 0.15 |        | C 0.125 |
      5    | B 0.1  |        | E 0.075 |
      6    | F 0.05 |        | F 0.025 |
  22. 22. AENN collection selection: recall-oriented selection (AENN(r))
      - Include collections from CC until all from EC are covered. This defines the cutoff point for AENN
      Rank | CC     | EC     | AENN    | AENN(r)
      1    | A 0.35 | A 0.65 | A 0.5   | A 0.5
      2    | D 0.3  | B 0.3  | B 0.2   | B 0.2
      3    | C 0.2  | C 0.05 | D 0.15  | D 0.15
      4    | E 0.15 |        | C 0.125 | C 0.125
      5    | B 0.1  |        | E 0.075 | E 0.075
      6    | F 0.05 |        | F 0.025 |
  23. 23. AENN collection selection: balanced selection (AENN(b))
      - Include collections from AENN until all from EC are covered
      Rank | CC     | EC     | AENN    | AENN(b)
      1    | A 0.35 | A 0.65 | A 0.5   | A 0.5
      2    | D 0.3  | B 0.3  | B 0.2   | B 0.2
      3    | C 0.2  | C 0.05 | D 0.15  | D 0.15
      4    | E 0.15 |        | C 0.125 | C 0.125
      5    | B 0.1  |        | E 0.075 |
      6    | F 0.05 |        | F 0.025 |
  24. 24. AENN collection selection: comparison of approaches
      Rank | AENN(p) | AENN(r) | AENN(b)
      1    | A 0.5   | A 0.5   | A 0.5
      2    | B 0.2   | B 0.2   | B 0.2
      3    | C 0.125 | D 0.15  | D 0.15
      4    |         | C 0.125 | C 0.125
      5    |         | E 0.075 |
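The three selection strategies compared on this slide can be sketched as short list operations over the CC, EC and AENN rankings. The function names are hypothetical, and the recall-oriented variant follows one reading of the slides: the position in the CC ranking at which all EC collections are covered is used as the cutoff on the AENN ranking.

```python
def select_precision(aenn_ranking, ec_collections):
    """AENN(p): keep only collections that EC returned, in AENN order."""
    return [c for c in aenn_ranking if c in ec_collections]

def select_recall(cc_ranking, aenn_ranking, ec_collections):
    """AENN(r): walk down CC until all EC collections are covered,
    then cut the AENN ranking at that depth (assumed reading)."""
    remaining = set(ec_collections)
    cutoff = 0
    for c in cc_ranking:
        if not remaining:
            break
        cutoff += 1
        remaining.discard(c)
    return aenn_ranking[:cutoff]

def select_balanced(aenn_ranking, ec_collections):
    """AENN(b): walk down AENN until all EC collections are covered."""
    remaining = set(ec_collections)
    selected = []
    for c in aenn_ranking:
        if not remaining:
            break
        selected.append(c)
        remaining.discard(c)
    return selected

# Rankings from the example slides.
cc_ranking = ["A", "D", "C", "E", "B", "F"]
ec_ranking = ["A", "B", "C"]
aenn_ranking = ["A", "B", "D", "C", "E", "F"]

aenn_p = select_precision(aenn_ranking, ec_ranking)
aenn_r = select_recall(cc_ranking, aenn_ranking, ec_ranking)
aenn_b = select_balanced(aenn_ranking, ec_ranking)
```

On the example rankings this selects three, five and four collections respectively, so the number of collections contacted varies with the query instead of being fixed ahead of time.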
  25. 25. Experimental setup
      Based on the 2010/11 Semantic Search Challenge
      - Distributed environment: top 100 largest second-level domains from BTC
        - Three sets with different handling of DBpedia
      - Relevance: considered the number of relevant entities from each collection
      - Metrics
        - Collection ranking: standard IR metrics (MAP, MRR, nDCG)
        - Collection selection: analogues of precision and recall, plus the average number of collections selected
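A rough sketch of how such metrics could be computed for a single query. It assumes binary collection relevance (a collection counts as relevant if it holds at least one relevant entity) and plain set-based precision and recall for selection; the exact definitions used in the experiments are not given on the slide, and the relevant set below is hypothetical.

```python
def average_precision(ranking, relevant):
    """Average precision of a collection ranking (binary relevance)."""
    hits, total = 0, 0.0
    for i, c in enumerate(ranking, start=1):
        if c in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant collection, or 0 if none is ranked."""
    for i, c in enumerate(ranking, start=1):
        if c in relevant:
            return 1.0 / i
    return 0.0

def selection_precision_recall(selected, relevant):
    """Set-based precision and recall of the selected collections."""
    hit = len(set(selected) & set(relevant))
    precision = hit / len(selected) if selected else 0.0
    recall = hit / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ground truth: collections holding relevant entities.
relevant = {"A", "C"}
ap = average_precision(["A", "B", "D", "C", "E", "F"], relevant)
rr = reciprocal_rank(["A", "B", "D", "C", "E", "F"], relevant)
prec, rec = selection_precision_recall(["A", "B", "D", "C"], relevant)
```

MAP and MRR are the means of these per-query values over all queries; nDCG would additionally need graded gains, for example the number of relevant entities per collection.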
  26. 26. Test collections
                                     | BTC   | BTC DBpedia | DBpedia
      #Entities                      | 68.8M | 60.5M       | 8.8M
      #Collections                   | 100   | 99          | 100
      #Queries                       | 136   | 116         | 130
      Avg. #rel. entities/query      | 14.9  | 4.8         | 10.1
      Avg. #rel. entities/collection | 3.4   | 2.8         | 9.4
  27. 27. Results: collection ranking (BTC)
      [Chart: MAP (y-axis, 0 to 0.75) of CC, EC and AENN, for name-only and full-content representations]
  28. 28. Results: different collection selection strategies (BTC DBpedia)
      [Three plots: precision, recall, and average number of collections selected, each against K (1, 3, 10, 20, 50, 100), for AENN(p), AENN(r) and AENN(b)]
  29. 29. Results: collection selection (DBpedia)
      [Three plots: precision, recall, and average number of collections selected, each against K (1, 3, 10, 20, 50, 100), for CC-N, EC-C and AENN(b)]
  30. 30. Summary
      - Addressed the task of ad-hoc entity retrieval in a distributed setting
      - Introduced AENN, a novel collection ranking and selection method based on a lean name-based entity representation
      - Showed experimentally that AENN can outperform standard baselines that consider all entity content
      - Further, AENN can be geared towards high precision, high recall, or a balanced setting
  31. 31. Questions?
      Resources are available at http://bit.ly/OzfYK2
      Contact: @krisztianbalog, krisztianbalog.com