Entity Retrieval (WWW 2013 tutorial)
This is Part II of the tutorial "Entity Linking and Retrieval" given at WWW 2013 (together with E. Meij and D. Odijk). For the complete tutorial material (including slides for the other parts) visit http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/

Transcript

  • 1. Part II: Entity Retrieval. Krisztian Balog, University of Stavanger. Half-day tutorial at the WWW’13 conference | Rio de Janeiro, Brazil, 2013
  • 2. What is an entity?
    - Uniquely identifiable “thing” or “object”
    - Properties: ID, name(s), type(s), attributes, relationships to other entities
  • 3. Entity retrieval tasks
    - Ad-hoc entity retrieval
    - List completion
    - Question answering (factual questions, list questions)
    - Related entity finding
    - Type-restricted variations (people, blogs, products, movies, etc.)
  • 4. What’s so special about it?
    - Entities are not always directly represented
      - Recognise and disambiguate entities in text
      - Collect and aggregate information about a given entity from multiple documents and even multiple data collections
    - More structure
      - Types (from some taxonomy)
      - Attributes (from some ontology)
      - Relationships to other entities (“typed links”)
  • 5. In this part
    - Focus on the ad-hoc entity retrieval task
    - Mainly probabilistic models, specifically language models
  • 6. Outline for Part II
    - Crash course into probability theory
    - Ranking with ready-made entity descriptions
    - Ranking without explicit entity representations
    - Evaluation initiatives
    - Future directions
  • 7. Ad-hoc entity retrieval
    - Input: unconstrained natural language query; “telegraphic” queries (neither well-formed nor grammatically correct sentences or questions)
    - Output: ranked list of entities
    - Collection: unstructured and/or semi-structured documents
  • 8. Ranking with ready-made entity descriptions
  • 9. This is not unrealistic...
  • 10. Document-based entity representations
    - Each entity is described by a document (unstructured or semi-structured)
    - Ranking entities much like ranking documents
  • 11. Standard Language Modeling approach
    - Rank documents d according to their likelihood of being relevant given a query q:
      P(d|q) = P(q|d) P(d) / P(q) ∝ P(q|d) P(d)
    - P(d): document prior, the probability of the document being relevant to any query
    - P(q|d): query likelihood, the probability that query q was “produced” by document d:
      P(q|d) = ∏_{t∈q} P(t|θ_d)^{n(t,q)}
  • 12. Standard Language Modeling approach (2)
    - Document language model θ_d: a multinomial probability distribution over the vocabulary of terms, obtained by smoothing the empirical document model with a collection model:
      P(t|θ_d) = (1 − λ) P(t|d) + λ P(t|C)
    - Maximum likelihood estimates:
      P(t|d) = n(t,d) / |d|
      P(t|C) = Σ_d n(t,d) / Σ_d |d|
    - n(t,q): number of times t appears in q; λ: smoothing parameter
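The query-likelihood scoring of slides 11-12 can be sketched in a few lines of plain Python. This is a minimal illustration (documents as term lists, Jelinek-Mercer smoothing with parameter `lam` standing in for λ), not an efficient retrieval implementation:

```python
from collections import Counter

def query_likelihood(query_terms, doc, collection_docs, lam=0.1):
    """P(q|theta_d) = prod_t P(t|theta_d)^n(t,q), with
    P(t|theta_d) = (1 - lam) * P(t|d) + lam * P(t|C)."""
    doc_tf = Counter(doc)
    coll_tf = Counter(t for d in collection_docs for t in d)
    coll_len = sum(coll_tf.values())
    score = 1.0
    for t in query_terms:
        p_td = doc_tf[t] / len(doc)    # maximum likelihood estimate P(t|d)
        p_tc = coll_tf[t] / coll_len   # collection model P(t|C)
        score *= (1 - lam) * p_td + lam * p_tc
    return score
```

Smoothing guarantees a non-zero score for documents that miss some query terms; in practice the product is computed in log space to avoid underflow.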
  • 13. Here, documents = entities, so
      P(e|q) ∝ P(e) P(q|θ_e) = P(e) ∏_{t∈q} P(t|θ_e)^{n(t,q)}
    - P(e): entity prior, the probability of the entity being relevant to any query
    - θ_e: entity language model, a multinomial probability distribution over the vocabulary of terms
  • 14. Semi-structured entity representation
    - Entity description documents are rarely unstructured
    - Representing entities as
      - Fielded documents (the IR approach)
      - Graphs (the DB/SW approach)
  • 15. Example: dbpedia:Audi_A4
      foaf:name                 Audi A4
      rdfs:label                Audi A4
      rdfs:comment              The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...]
      dbpprop:production        1994, 2001, 2005, 2008
      rdf:type                  dbpedia-owl:MeanOfTransportation, dbpedia-owl:Automobile
      dbpedia-owl:manufacturer  dbpedia:Audi
      dbpedia-owl:class         dbpedia:Compact_executive_car
      owl:sameAs                freebase:Audi A4
      is dbpedia-owl:predecessor of  dbpedia:Audi_A5
      is dbpprop:similar of          dbpedia:Cadillac_BLS
  • 16. Mixture of Language Models [Ogilvie & Callan 2003]
    - Build a separate language model for each field
    - Take a linear combination of them:
      P(t|θ_d) = Σ_{j=1..m} µ_j P(t|θ_{d_j}),  with Σ_j µ_j = 1
    - Field language models θ_{d_j}: smoothed with a collection model built from all document representations of the same type in the collection
    - µ_j: field weights
    P. Ogilvie and J. Callan. Combining document representations for known item search. SIGIR’03.
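The field mixture above is a straightforward weighted sum. A minimal sketch (unsmoothed maximum-likelihood field models to keep it short; Ogilvie & Callan additionally smooth each field model with a per-field collection model):

```python
from collections import Counter

def mixture_lm_term_prob(t, fields, weights):
    """P(t|theta_d) = sum_j mu_j * P(t|theta_{d_j}).
    `fields`: field name -> list of terms in that field of the document.
    `weights`: field name -> mu_j (the weights should sum to 1)."""
    p = 0.0
    for name, terms in fields.items():
        tf = Counter(terms)
        p_field = tf[t] / len(terms) if terms else 0.0  # ML field model
        p += weights[name] * p_field
    return p
```

The per-term probabilities then feed into the same query-likelihood product as in the unstructured case.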
  • 17. Setting field weights
    - Heuristically: proportional to the length of text content in that field, to the field’s individual performance, etc.
    - Empirically (using training queries)
    - Problems
      - The number of possible fields is huge; it is not possible to optimise their weights directly
      - Entities are sparse w.r.t. different fields; most entities have only a handful of predicates
  • 18. Predicate folding
    - Idea: reduce the number of fields by grouping them together
    - Grouping based on
      - Type [Pérez-Agüera et al. 2010]
      - Manually determined importance [Blanco et al. 2011]
    R. Blanco, P. Mika, and S. Vigna. Effective and efficient entity search in RDF data. ISWC’11.
    J. R. Pérez-Agüera, J. Arroyo, J. Greenberg, J. P. Iglesias, and V. Fresno. Using BM25F for semantic search. SemSearch’10.
  • 19. Hierarchical Entity Model [Neumayer et al. 2012]
    - Organise fields into a two-level hierarchy
      - Field types (4) on the top level
      - Individual fields of that type on the bottom level
    - Estimate field weights
      - Using training data for field types
      - Using heuristics for bottom-level fields
    R. Neumayer, K. Balog, and K. Nørvåg. On the modeling of entities for ad-hoc entity search in the web of data. ECIR’12.
  • 20. Two-level hierarchy (the dbpedia:Audi_A4 example, with predicates grouped into the four field types)
      Name:          foaf:name, rdfs:label
      Attributes:    rdfs:comment, dbpprop:production
      Out-relations: rdf:type, dbpedia-owl:manufacturer, dbpedia-owl:class, owl:sameAs
      In-relations:  is dbpedia-owl:predecessor of, is dbpprop:similar of
  • 21. Formally
      P(t|θ_d) = Σ_F P(t|F, d) P(F|d)
      P(t|F, d) = Σ_{d_f∈F} P(t|d_f, F) P(d_f|F, d)
    - Field type importance: P(F|d) = P(F), taken to be the same for all entities
    - Field generation P(d_f|F, d): uniform or estimated heuristically (based on length, popularity, etc.)
    - Term generation P(t|d_f, F): the importance of a term is jointly determined by the field it occurs in as well as all fields of that type (smoothed with a collection-level model):
      P(t|d_f, F) = (1 − λ) P(t|d_f) + λ P(t|θ_F)
  • 22. Comparison of models (graphical model diagrams): the unstructured document model (d → t), the fielded document model (d → d_f → t), and the hierarchical document model (d → F → d_f → t)
  • 23. Probabilistic Retrieval Model for Semistructured data [Kim et al. 2009]
    - Extension to the Mixture of Language Models
    - Find which document field each query term may be associated with
    - The static field weights are replaced by a mapping probability P(d_j|t), estimated for each query term:
      P(t|θ_d) = Σ_{j=1..m} µ_j P(t|θ_{d_j})   becomes   P(t|θ_d) = Σ_{j=1..m} P(d_j|t) P(t|θ_{d_j})
    J. Kim, X. Xue, and W. B. Croft. A probabilistic retrieval model for semistructured data. ECIR’09.
  • 24. Estimating the mapping probability
      P(d_j|t) = P(t|d_j) P(d_j) / Σ_{d_k} P(t|d_k) P(d_k)
    - Term likelihood: probability of a query term occurring in a given field type, estimated over the collection:
      P(t|C_j) = Σ_d n(t, d_j) / Σ_d |d_j|
    - Prior field probability P(d_j): probability of mapping the query term to this field before observing collection statistics
  • 25. Example. Query: “meg ryan war”
      meg:  cast 0.407, team 0.382, title 0.187
      ryan: cast 0.601, team 0.381, title 0.017
      war:  genre 0.927, title 0.070, location 0.002
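Mapping probabilities like the ones in the example above can be estimated from collection-wide field statistics. A sketch under simplifying assumptions (uniform field priors by default; the field names and toy data in the test are invented for illustration):

```python
from collections import Counter

def mapping_probs(t, field_collections, priors=None):
    """P(d_j|t) proportional to P(t|C_j) * P(d_j), normalised over fields.
    `field_collections`: field name -> concatenated terms of that field
    across the whole collection (used for the term likelihood P(t|C_j))."""
    n = len(field_collections)
    priors = priors or {f: 1.0 / n for f in field_collections}
    scores = {}
    for f, terms in field_collections.items():
        p_t_cj = Counter(terms)[t] / len(terms) if terms else 0.0
        scores[f] = p_t_cj * priors[f]
    z = sum(scores.values())  # normalisation over all candidate fields
    return {f: s / z for f, s in scores.items()} if z > 0 else scores
```

A term such as “war” that is frequent in one field and rare elsewhere ends up mapped almost entirely to that field, which is exactly the behaviour the example illustrates.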
  • 26. The usual suspects from document retrieval...
    - Priors
      - HITS, PageRank
      - Document link indegree [Kamps & Koolen 2008]
    - Pseudo relevance feedback
      - Document-centric vs. entity-centric [Macdonald & Ounis 2007; Serdyukov et al. 2007]: sampling expansion terms from top-ranked documents and/or (profiles of) top-ranked candidates
      - Field-based [Kim & Croft 2011]
    J. Kamps and M. Koolen. The importance of link evidence in Wikipedia. ECIR’08.
    C. Macdonald and I. Ounis. Expertise drift and query expansion in expert search. CIKM’07.
    P. Serdyukov, S. Chernov, and W. Nejdl. Enhancing expert search through query modeling. ECIR’07.
    J. Y. Kim and W. B. Croft. A field relevance model for structured document retrieval. ECIR’12.
  • 27. So far...
    - Ranking (fielded) documents
    - What is special about entities?
      - Type(s)
      - Relationships with other entities
  • 28. Entity types. Example: rdf:type dbpedia-owl:MeanOfTransportation, dbpedia-owl:Automobile
  • 29. Using target types (assuming they have been identified...)
    - Constraining results
      - Soft/hard filtering
      - Different ways to measure type similarity (between target types and the types associated with the entity): set-based, content-based, lexical similarity of type labels
    - Query expansion: adding terms from type names to the query
    - Entity expansion: categories as a separate metadata field
  • 30. Modeling terms and categories [Balog et al. 2011]
    - Term-based representation: query model p(t|θ_q^T) and entity model p(t|θ_e^T), compared via KL(θ_q^T || θ_e^T)
    - Category-based representation: query model p(c|θ_q^C) and entity model p(c|θ_e^C), compared via KL(θ_q^C || θ_e^C)
    - Combined ranking:
      P(e|q) ∝ P(q|e) P(e),  P(q|e) = (1 − λ) P(θ_q^T|θ_e^T) + λ P(θ_q^C|θ_e^C)
    K. Balog, M. Bron, and M. de Rijke. Query modeling for entity search based on terms, categories and examples. TOIS’11.
  • 31. Identifying target types
    - Types of top-ranked entities [Vallet & Zaragoza 2008]
    - Direct term-based vs. indirect entity-based representations [Balog & Neumayer 2012]
    - The hierarchical case is difficult...
    D. Vallet and H. Zaragoza. Inferring the most important types of a query: a semantic approach. SIGIR’08.
    K. Balog and R. Neumayer. Hierarchical target type identification for entity-oriented queries. CIKM’12.
    U. Sawant and S. Chakrabarti. Learning joint query interpretation and response ranking. WWW’13.
  • 32. Expanding target types
    - Pseudo relevance feedback
    - Based on hierarchical structure
    - Using lexical similarity of type labels
  • 33. Ranking without explicit entity representations
  • 34. Scenario
    - Entity descriptions are not readily available
    - Entity occurrences are annotated
  • 35. The basic idea: use documents to get from queries to entities
    - Query-document association: the document’s relevance
    - Document-entity association: how well the document characterises the entity
  • 36. Two principal approaches
    - Profile-based methods: create a textual profile for entities, then rank them (by adapting document retrieval techniques)
    - Document-based methods: indirect representation based on mentions identified in documents; first rank documents (or snippets), then aggregate evidence for associated entities
  • 37. Profile-based methods (illustration)
  • 38. Document-based methods (illustration)
  • 39. Many possibilities in terms of modeling
    - Generative probabilistic models
    - Discriminative probabilistic models
    - Voting models
    - Graph-based models
  • 40. Generative probabilistic models
    - Candidate generation models (P(e|q))
      - Two-stage language model
    - Topic generation models (P(q|e))
      - Candidate model, a.k.a. Model 1
      - Document model, a.k.a. Model 2
      - Proximity-based variations
    - Both families of models can be derived from the Probability Ranking Principle [Fang & Zhai 2007]
    H. Fang and C. Zhai. Probabilistic models for expert finding. ECIR’07.
  • 41. Candidate models (“Model 1”) [Balog et al. 2006]
      P(q|θ_e) = ∏_{t∈q} P(t|θ_e)^{n(t,q)}
    - Smoothing with a collection-wide background model:
      P(t|θ_e) = (1 − λ) P(t|e) + λ P(t)
    - Term-candidate co-occurrence, aggregated over documents:
      P(t|e) = Σ_d P(t|d, e) P(d|e)
    - P(t|d, e): term-candidate co-occurrence in a particular document; in the simplest case P(t|d)
    - P(d|e): document-entity association
    K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert finding in enterprise corpora. SIGIR’06.
  • 42. Document models (“Model 2”) [Balog et al. 2006]
      P(q|e) = Σ_d P(q|d, e) P(d|e)
    - Document relevance P(q|d, e): how well document d supports the claim that e is relevant to q:
      P(q|d, e) = ∏_{t∈q} P(t|d, e)^{n(t,q)}
    - Simplifying assumption (t and e are conditionally independent given d): P(t|d, e) = P(t|θ_d)
    - P(d|e): document-entity association
    K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert finding in enterprise corpora. SIGIR’06.
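Model 2 above can be sketched directly from its formula. A toy version under the simplifying assumption P(t|d,e) = P(t|θ_d), with Jelinek-Mercer smoothed document models (the dictionary-based data layout is an assumption for the example, not part of the original models):

```python
from collections import Counter

def model2_score(query_terms, docs, assoc, lam=0.1):
    """P(q|e) = sum_d P(q|d) * P(d|e).
    `docs`: doc id -> term list; `assoc`: doc id -> P(d|e)
    (non-zero only for documents associated with entity e)."""
    coll_tf = Counter(t for terms in docs.values() for t in terms)
    coll_len = sum(coll_tf.values())
    total = 0.0
    for d, p_de in assoc.items():
        terms = docs[d]
        tf = Counter(terms)
        p_qd = 1.0
        for t in query_terms:  # smoothed query likelihood P(q|theta_d)
            p_qd *= (1 - lam) * tf[t] / len(terms) + lam * coll_tf[t] / coll_len
        total += p_qd * p_de
    return total
```

Model 1 differs only in the order of aggregation: it first builds one entity language model from the associated documents and then computes a single query likelihood against it.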
  • 43. Document-entity associations
    - Boolean (or set-based) approach
    - Weighted by the confidence in entity linking
    - Consider other entities mentioned in the document
  • 44. Proximity-based variations
    - So far, a conditional independence assumption between candidates and terms when computing the probability P(t|d, e)
      - The relationship between terms and entities that appear in the same document is ignored
      - An entity is equally strongly associated with everything discussed in that document
    - Let’s capture the dependence between entities and terms, using their distance in the document
  • 45. Using proximity kernels [Petkova & Croft 2007]
      P(t|d, e) = (1/Z) Σ_{i=1..N} δ_d(i, t) k(t, e)
    - δ_d(i, t): indicator function, 1 if the term at position i is t, 0 otherwise
    - Z: normalising constant
    - k: proximity-based kernel (constant function, triangle kernel, Gaussian kernel, step function)
    D. Petkova and W. B. Croft. Proximity-based document representation for named entity retrieval. CIKM’07.
  • 46. Figure taken from D. Petkova and W.B. Croft. Proximity-based document representation for named entityretrieval. CIKM07.
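The kernel-weighted estimate on slide 45 can be sketched with one of the listed kernel shapes. A minimal version using a Gaussian kernel over term positions (a single entity-mention position and the normalisation over all positions are simplifying assumptions for this example):

```python
import math

def proximity_term_prob(t, doc_terms, entity_pos, sigma=10.0):
    """P(t|d,e) proportional to sum_i delta(i,t) * k(i, entity_pos),
    with a Gaussian kernel over the distance between each term position
    and the entity mention; Z normalises over all positions."""
    weights = [math.exp(-((i - entity_pos) ** 2) / (2 * sigma ** 2))
               for i in range(len(doc_terms))]
    z = sum(weights)  # normalising constant
    return sum(w for i, w in enumerate(weights)
               if doc_terms[i] == t) / z
```

With a very large `sigma` the kernel degenerates to the constant function and the estimate falls back to the ordinary document term frequency.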
  • 47. Many possibilities in terms of modeling
    - Generative probabilistic models
    - Discriminative probabilistic models
    - Voting models
    - Graph-based models
  • 48. Discriminative models
    - Vs. generative models:
      - Fewer assumptions (e.g., term independence); “let the data speak”
      - Sufficient amounts of training data required
    - Incorporating more document features, multiple signals for document-entity associations
    - Estimating P(r=1|e, q) directly (instead of P(e, q|r=1))
    - Optimisation can get trapped in a local optimum
  • 49. Arithmetic Mean Discriminative (AMD) model [Fang et al. 2010]
      P_θ(r = 1|e, q) = Σ_d P(r_1 = 1|q, d) P(r_2 = 1|e, d) P(d)
    - P(d): document prior
    - Query-document relevance: P(r_1 = 1|q, d) = σ(Σ_{i=1..N_f} α_i f_i(q, d))
    - Document-entity relevance: P(r_2 = 1|e, d) = σ(Σ_{j=1..N_g} β_j g_j(e, d))
    - σ: standard logistic function; α, β: weight parameters (learned); f, g: features
    Y. Fang, L. Si, and A. P. Mathur. Discriminative models of integrating document evidence and document-candidate associations for expert search. SIGIR’10.
  • 50. Learning to rank
    - Pointwise
      - AMD, GMD [Fang et al. 2010]
      - Multilayer perceptrons, logistic regression [Sorg & Cimiano 2011]
      - Additive Groves [Moreira et al. 2011]
    - Pairwise
      - Ranking SVM [Yang et al. 2009]
      - RankBoost, RankNet [Moreira et al. 2011]
    - Listwise
      - AdaRank, Coordinate Ascent [Moreira et al. 2011]
    P. Sorg and P. Cimiano. Finding the right expert: Discriminative models for expert retrieval. KDIR’11.
    C. Moreira, P. Calado, and B. Martins. Learning to rank for expert search in digital libraries of academic publications. PAI’11.
    Z. Yang, J. Tang, B. Wang, J. Guo, J. Li, and S. Chen. Expert2bole: From expert finding to bole search. KDD’09.
  • 51. Voting models [Macdonald & Ounis 2006]
    - Inspired by techniques from data fusion
    - Combining evidence from different sources
    - Documents ranked w.r.t. the query are seen as “votes” for the entity
    C. Macdonald and I. Ounis. Voting for candidates: Adapting data fusion techniques for an expert search task. CIKM’06.
  • 52. Voting models: many different variants, including...
    - Votes: the number of documents mentioning the entity
      Score(e, q) = |M(e) ∩ R(q)|
    - Reciprocal Rank: the sum of inverse ranks of documents
      Score(e, q) = Σ_{d ∈ M(e) ∩ R(q)} 1 / rank(d, q)
    - CombSUM: the sum of scores of documents
      Score(e, q) = Σ_{d ∈ M(e) ∩ R(q)} s(d, q)
    (M(e): documents mentioning entity e; R(q): documents retrieved for query q)
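All three variants above aggregate the same intersection M(e) ∩ R(q) in different ways, so they can be computed in a single pass over the ranked list. A small sketch (the input layout is an assumption for the example):

```python
def voting_scores(ranked_docs, mentions):
    """Votes, Reciprocal Rank and CombSUM scores for one entity.
    `ranked_docs`: list of (doc_id, retrieval_score) for query q, best first.
    `mentions`: set of doc ids mentioning the entity, i.e. M(e)."""
    votes, rr, comb_sum = 0, 0.0, 0.0
    for rank, (doc_id, score) in enumerate(ranked_docs, start=1):
        if doc_id in mentions:
            votes += 1         # |M(e) intersect R(q)|
            rr += 1.0 / rank   # sum of inverse ranks
            comb_sum += score  # sum of document retrieval scores
    return votes, rr, comb_sum
```

Running this per candidate entity and sorting by any one of the three scores yields the corresponding voting-model ranking.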
  • 53. Graph-based models [Serdyukov et al. 2008]
    - One particular way of constructing graphs
      - Vertices are documents and entities
      - Only document-entity edges
    - Search can be approached as a random walk on this graph
      - Pick a random document or entity
      - Follow links to entities or other documents
      - Repeat a number of times
    P. Serdyukov, H. Rode, and D. Hiemstra. Modeling multi-step relevance propagation for expert finding. CIKM’08.
  • 54. Infinite random walk model [Serdyukov et al. 2008]
      P_i(d) = λ P_J(d) + (1 − λ) Σ_{e→d} P(d|e) P_{i−1}(e)
      P_i(e) = Σ_{d→e} P(e|d) P_{i−1}(d)
      P_J(d) = P(d|q)
    P. Serdyukov, H. Rode, and D. Hiemstra. Modeling multi-step relevance propagation for expert finding. CIKM’08.
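The update equations above alternate between document and entity probabilities. A sketch with simplifying assumptions (uniform edge probabilities P(e|d) and P(d|e), a fixed number of iterations instead of running to convergence):

```python
def random_walk(doc_query_rel, doc_entity, lam=0.3, iterations=50):
    """Alternating walk over a bipartite document-entity graph with
    query-dependent jumps: P_i(d) = lam * P(d|q) + (1-lam) * incoming mass.
    `doc_query_rel`: doc -> P(d|q) (should sum to 1).
    `doc_entity`: doc -> list of entities it mentions."""
    p_doc = dict(doc_query_rel)
    entities = {e for es in doc_entity.values() for e in es}
    # reverse index: entity -> documents mentioning it
    ent_docs = {e: [d for d, es in doc_entity.items() if e in es]
                for e in entities}
    p_ent = {}
    for _ in range(iterations):
        # P_i(e) = sum_{d->e} P(e|d) P_{i-1}(d), uniform P(e|d)
        p_ent = {e: sum(p_doc[d] / len(doc_entity[d]) for d in ent_docs[e])
                 for e in entities}
        # P_i(d) = lam * P(d|q) + (1-lam) * sum_{e->d} P(d|e) P_{i-1}(e)
        p_doc = {d: lam * doc_query_rel[d] +
                    (1 - lam) * sum(p_ent[e] / len(ent_docs[e])
                                    for e in doc_entity[d])
                 for d in doc_entity}
    return p_ent
```

The jump term keeps the stationary distribution tied to the query: without it, the walk would converge to a query-independent centrality measure over the graph.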
  • 55. Further reading: K. Balog, Y. Fang, M. de Rijke, P. Serdyukov, and L. Si. Expertise Retrieval. FnTIR’12.
  • 56. Evaluation initiatives
  • 57. Test collections

    Campaign                       Task                 Collection                         Entity repr.  #Topics
    TREC Enterprise (2005-08)      Expert finding       Enterprise intranets (W3C, CSIRO)  Indirect      99 (W3C), 127 (CSIRO)
    TREC Entity (2009-11)          Rel. entity finding  Web crawl (ClueWeb09)              Indirect      120
    TREC Entity (2009-11)          List completion      Web crawl (ClueWeb09)              Indirect      70
    INEX Entity Ranking (2007-09)  Entity search        Wikipedia                          Direct        55
    INEX Entity Ranking (2007-09)  List completion      Wikipedia                          Direct        55
    SemSearch Chall. (2010-11)     Entity search        Semantic Web crawl (BTC2009)       Direct        142
    SemSearch Chall. (2010-11)     List search          Semantic Web crawl (BTC2009)       Direct        50
    INEX Linked Data (2012-13)     Ad-hoc search        Wikipedia + RDF (Wikipedia-LOD)    Direct        100 (’12), 144 (’13)
  • 58. Test collections (2)
    - Entity search as Question Answering
      - TREC QA track
      - QALD-2 challenge
      - INEX-LD Jeopardy task
  • 59. Entity search in DBpedia [Balog & Neumayer 2013]
    - Synthesising queries and relevance assessments from previous evaluation campaigns
    - From short keyword queries to natural language questions
    - 485 queries in total
    - Results are mapped to DBpedia
    K. Balog and R. Neumayer. A test collection for entity search in DBpedia. SIGIR’13.
  • 60. Open challenges
    - Combining text and structure: knowledge bases and unstructured Web documents
    - Query understanding and modeling
      - See [Sawant & Chakrabarti 2013] at the main conference
    - Result presentation: how to interact with entities
    U. Sawant and S. Chakrabarti. Learning joint query interpretation and response ranking. WWW’13.
  • 61. Resources
    - Complete tutorial material: http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
    - Referred papers: http://www.mendeley.com/groups/3339761/entity-linking-and-retrieval-tutorial-at-www-2013-and-sigir-2013/papers/
