Retrieval using Document Structure and Annotations


I used these slides for my thesis defense. They cover my work on using language models in the Inference Network model for search. The work focuses on using document structure (titles, in-link text, etc.) and linguistic annotations (semantic predicates) to improve retrieval effectiveness for a variety of tasks.



  Slide 1: Retrieval using Document Structure and Annotations. Paul Ogilvie, Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Committee: Jamie Callan (chair), Christos Faloutsos, Yiming Yang, W. Bruce Croft (University of Massachusetts, Amherst). June 18, 2010.
  Slide 2: Outline: Introduction, Related Work, Extensions to the Inference Network model, Results, Contributions.
  Slide 3: Effective use of document structure and annotations is critical for successful retrieval in a wide range of applications.
  Slide 4: Result-universal: retrieve any element, document, or annotation; mix result types in a single ranking. May wish to bias results by type or length.
  Slide 5: Structure-aware: some fields are more representative of content; there are multiple representations of content. Title and in-link text form alternative representations of a web page.
  Slide 6: Structure-expressive: express structural constraints in the query language, e.g. articles about suicide bombings with an image of investigators.
  Slide 7: Structure-expressive: express structural constraints in the query language, e.g. sentences with a semantic predicate whose target verb is "train" and whose arg1 annotation matches "suicide bombers". Example: [ARG1 Most Afghani suicide bombers] were [TARGET trained] [ARGM-LOC in neighboring Pakistan.]
  Slide 8: Annotation-robust: text processing tools are not perfect; be robust to noisy document structure, mislabeled annotations, and boundary errors. Example: [ARG0 George] [TARGET saw] [ARG1 the astronomer] [ARGM-MNR with a telescope.]
  Slide 9: Outline: Introduction, Related Work, Extensions to the Inference Network model, Results, Contributions.
  Slide 10: Long history. [Timeline figure, 1979-2010, grouping systems into Vector Space, Probabilistic Model, and Other Approaches, and tagging each with its properties: RU = Result Universal, SE = Structure Expressive, SA = Structure Aware, AR = Annotation Robust. Systems include NCCC, p-Norm, SCAT-IR, Fox, CODER, Inference Networks (Turtle & Croft), Burkowski, Fuller et al., Wilkinson, BM25, Proximal Nodes, language models (Ponte & Croft, Hiemstra), XPRES, Justsystem, JuruXML, Kraaij et al., Ogilvie & Callan, BM25F, Indri, Sigurbjornsson, BM25E, Tijah, Bilotti et al., and Kim et al.]
  Slide 11: Mixture of multinomials [Ogilvie SIGIR 03]. Rank documents by the probability that the query is generated: $P(q|d) = \prod_{i=1}^{|q|} P(q_i|d)$. Estimate $P(q_i|d)$ using a mixture of representations (in-model combination): $P(q_i|d) = \sum_{r \in R} \lambda_r P(q_i|\theta_r)$. Each representation is estimated from field counts and a collection model: $P(q_i|\theta_r) = \alpha_r \frac{tf(q_i, d_r)}{|d_r|} + (1 - \alpha_r) \frac{tf(q_i, C_r)}{|C_r|}$.
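As a concrete illustration of the mixture above, here is a minimal Python sketch. The dictionary layout, field names, and parameter values are hypothetical (this is not the thesis's implementation); only the math of the mixture follows the slide.

```python
import math

def rep_prob(tf_field, field_len, tf_coll, coll_len, alpha):
    """P(q_i | theta_r): field estimate smoothed with the collection model."""
    return alpha * (tf_field / field_len) + (1 - alpha) * (tf_coll / coll_len)

def query_log_likelihood(query, doc, lambdas, alphas):
    """log P(q|d): product over query terms of a lambda-mixture of representations."""
    score = 0.0
    for term in query:
        p = sum(
            lambdas[r] * rep_prob(doc[r]["tf"].get(term, 0), doc[r]["len"],
                                  doc[r]["ctf"].get(term, 1), doc[r]["clen"],
                                  alphas[r])
            for r in lambdas
        )
        score += math.log(p)
    return score

# Hypothetical document with two representations: title and body fields.
doc = {
    "title": {"tf": {"suicide": 1}, "len": 2,
              "ctf": {"suicide": 5, "bomber": 3}, "clen": 1000},
    "body":  {"tf": {"suicide": 2, "bomber": 1}, "len": 100,
              "ctf": {"suicide": 50, "bomber": 20}, "clen": 100000},
}
score = query_log_likelihood(["suicide", "bomber"], doc,
                             lambdas={"title": 0.6, "body": 0.4},
                             alphas={"title": 0.5, "body": 0.5})
```

Because every per-term probability lies strictly between 0 and 1, the log-likelihood is a finite negative number.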
  Slide 12: Inference Network model. [Diagram: document node d with hyperparameters (α, β); model nodes θ_t(d) and θ_d; concept nodes for "suicide" and "bomber" in the title and in the body; query nodes combined by #WSUM and #AND into the information need node I_d.] Example query: #AND( #WSUM( 0.6 suicide.(title) 0.4 suicide ) #WSUM( 0.6 bomber.(title) 0.4 bomber ) )
  Slide 13: Inference Network model: query nodes. The belief at the information need node is computed from the query node beliefs: $P(I_d = \text{true}|d, \alpha, \beta) = P(I_d|q_{d,i}, q_{d,j})$.
  Slide 14: Query operator belief combination, where bel(b_i) is shorthand for P(b_i = true|d, α, β):
      #AND(b_1 ... b_n): $\prod_{i=1}^{n} bel(b_i)$
      #NOT(b): $1 - bel(b)$
      #OR(b_1 ... b_n): $1 - \prod_{i=1}^{n} (1 - bel(b_i))$
      #WAND(w_1 b_1 ... w_n b_n): $\prod_{i=1}^{n} bel(b_i)^{w_i}$
      #MAX(b_1 ... b_n): $\max(bel(b_1), \ldots, bel(b_n))$
      #WSUM(w_1 b_1 ... w_n b_n): $\sum_{i=1}^{n} w_i \, bel(b_i)$
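The combination functions in this table are straightforward to state in code. A sketch (function names are mine, not Indri's API):

```python
import math

def b_and(beliefs):            # #AND: product of beliefs
    return math.prod(beliefs)

def b_not(b):                  # #NOT: complement
    return 1 - b

def b_or(beliefs):             # #OR: noisy-OR, 1 - prod(1 - b_i)
    return 1 - math.prod(1 - b for b in beliefs)

def b_wand(weights, beliefs):  # #WAND: weighted geometric combination
    return math.prod(b ** w for w, b in zip(weights, beliefs))

def b_max(beliefs):            # #MAX: best single belief
    return max(beliefs)

def b_wsum(weights, beliefs):  # #WSUM: weighted arithmetic combination
    return sum(w * b for w, b in zip(weights, beliefs))
```

For example, #AND over beliefs 0.5 and 0.5 yields 0.25, while #OR over the same beliefs yields 0.75.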
  Slide 15: Inference Network model: model and concept nodes. [Diagram: the model nodes θ_t(d) and θ_d are multiple Bernoulli; each concept node computes P(c_{t,j}|θ_t(d)).]
  Slide 16: Concept nodes use multiple Bernoullis (Metzler et al., Model B): $P(c_i|\theta_r) = \frac{tf(c_i, d_r) + \alpha_{i,r} - 1}{|d_r| + \alpha_{i,r} + \beta_{i,r} - 2}$. The common settings $\alpha_{i,r} = \mu \frac{tf(c_i, C_r)}{|C_r|} + 1$ and $\beta_{i,r} = \mu \left(1 - \frac{tf(c_i, C_r)}{|C_r|}\right) + 1$ yield multinomials smoothed using Dirichlet priors: $P(c_i|\theta_r) = \frac{tf(c_i, d_r) + \mu \frac{tf(c_i, C_r)}{|C_r|}}{|d_r| + \mu}$.
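The equivalence on this slide can be checked numerically: with the stated settings of α and β, the Beta-prior form reduces to the Dirichlet-smoothed form, since α + β - 2 = μ. A small sketch with made-up counts:

```python
def beta_prior_prob(tf_d, doc_len, alpha, beta):
    """Model B estimate: (tf + alpha - 1) / (|d| + alpha + beta - 2)."""
    return (tf_d + alpha - 1) / (doc_len + alpha + beta - 2)

def dirichlet_prob(tf_d, doc_len, tf_c, coll_len, mu=2500):
    """Dirichlet-smoothed multinomial: (tf + mu * p_c) / (|d| + mu)."""
    return (tf_d + mu * tf_c / coll_len) / (doc_len + mu)

# Hypothetical counts for one term in one field.
tf_d, doc_len, tf_c, coll_len, mu = 3, 120, 40, 10_000, 2500
p_c = tf_c / coll_len
a = beta_prior_prob(tf_d, doc_len, mu * p_c + 1, mu * (1 - p_c) + 1)
b = dirichlet_prob(tf_d, doc_len, tf_c, coll_len, mu)
```

Both expressions give the same probability for any counts, because the +1/-1 terms cancel and α + β - 2 collapses to μ.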
  Slide 17: Indri query language support for structure:
      Extent retrieval specifies result types and can be nested for structural constraints: #AND[sentence]( suicide bombers trained )
      Field evaluation creates a language model for a field type: suicide.(title)
      Field restriction restricts counts to a field type: grant.person
      Prior probabilities access indexed prior beliefs of relevance for documents: #PRIOR(urltype)
  Slide 18: Limitations of Inference Networks:
      In-model combination produces verbose queries, with some model parameters in the query and some in parameter files.
      Representation construction is based on containment, yet it is common to index extra document representations with the document (e.g. in-link text).
      The Indri query language does not support parent/child constraints in extent retrieval or field evaluation.
      The model is not sufficiently annotation robust.
      Nested extent retrieval is confusing.
  Slide 19: Belief combination for nested extent retrieval is critical. [Diagram: one pair of nodes per figure caption; with #AND[fgc], the beliefs of all matching captions are multiplied into #AND[article].] Query: #AND[article]( suicide bombings #AND[fgc]( investigators ) )
  Slide 20: Belief combination for nested extent retrieval is critical. [Diagram: with #MAX, the best caption belief is selected before combination at #AND[article].] Query: #AND[article]( suicide bombings #MAX( #AND[fgc]( investigators ) ) )
  Slide 21: Outline: Introduction, Related Work, Extensions to the Inference Network model, Results, Contributions.
  Slide 22: Collection structure can be represented as a graph: typed edges, typed nodes; nodes anchored in text to preserve containment. [Diagram: web pages with title and link nodes.]
  Slide 23: Example annotation graph for "[ARG1 Most Afghani suicide bombers] were [TARGET trained] [ARGM-LOC in neighboring Pakistan.]": SENTENCE, TARGET, ARG1, ARGM-LOC, and LOCATION nodes anchored over the text "Most Afghani suicide bombers were trained in neighboring Pakistan."
  Slide 24: Model representation layer. The existing network is needlessly complex for a conceptually simple operation, and its verbose queries are prone to error and confusion. Proposal: move model combination into a model representation layer and simplify query construction.
  Slide 25: Model representation layer: mixture of multiple Bernoullis + Inference Networks. Observed nodes (t_1, ..., d_k, ..., d_{n-1}, d_n): all collection elements exist as observation nodes. Representation nodes (φ_t(d_k), φ_s(d_k), φ_{C_d}(d_k)) are multiple Bernoulli and may be connected to many elements. Model nodes are mixtures of multiple Bernoullis and combine multiple representations: $P(c_i|\theta_{d_k}) = \sum_{f \in F} \lambda_f P(c_i|\phi_{f(d_k)})$. Concept and query nodes are unchanged, e.g. #AND( suicide bomber ).
  Slide 26: Representation functions connect observed elements to representation nodes: $t(d_k) = \{t_1\}$ (the title), $s(d_k) = \{d_k\}$ (the element itself), and $C_d(d_k) = \{d_1, d_2, \ldots, d_n\}$ (the document collection).
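A minimal sketch of such representation functions over a toy collection graph (the graph encoding and names are hypothetical, chosen only to mirror the three functions on the slide): each function maps an observed element to the set of elements whose text feeds the corresponding representation node.

```python
def t(elem, graph):
    """t(d_k): title elements attached to d_k."""
    return graph["titles"].get(elem, [])

def s(elem, graph):
    """s(d_k): the element itself."""
    return [elem]

def C_d(elem, graph):
    """C_d(d_k): every document in the collection."""
    return list(graph["docs"])

# Toy collection: three documents, one of which has a title element.
graph = {"docs": ["d1", "d2", "d3"], "titles": {"d2": ["t1"]}}
```

A model node for d2 would then mix the representation nodes built over t("d2"), s("d2"), and C_d("d2") with weights λ_f, as on Slide 25.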
  Slide 27: Collection structure can be represented as a graph: typed edges, typed nodes; nodes anchored in text to preserve containment.
  Slide 28: Model representation layer summary: observed nodes; multiple Bernoulli representation nodes; model nodes that are mixtures of multiple Bernoullis, $P(c_i|\theta_{d_k}) = \sum_{f \in F} \lambda_f P(c_i|\phi_{f(d_k)})$.
  Slide 29: People don't get nested extent retrieval: with one pair of nodes per figure caption, they forget to combine the caption beliefs (e.g. with #MAX) in #AND[article]( suicide bombings #MAX( #AND[fgc]( investigators ) ) ).
  Slide 30: Scope operator for extent retrieval: #SCOPE[RESULT:article]( #AND( suicide bombings #SCOPE[MAX:fgc]( investigators ) ) ). Move extent retrieval into a scope operator and force a choice of belief combination: AVG, MAX, MIN, OR = $1 - \prod_b (1 - b)$, or AND = $\prod_b b$.
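The choice the scope operator forces can be written as one small dispatch function. A sketch (not Indri's implementation; the method names follow the slide):

```python
import math

def scope_combine(beliefs, method):
    """Combine the beliefs of all matching extents inside a #SCOPE operator."""
    if not beliefs:
        return 0.0
    if method == "AVG":
        return sum(beliefs) / len(beliefs)
    if method == "MAX":
        return max(beliefs)
    if method == "MIN":
        return min(beliefs)
    if method == "OR":
        return 1 - math.prod(1 - b for b in beliefs)
    if method == "AND":
        return math.prod(beliefs)
    raise ValueError(f"unknown combination method: {method}")

# Two figure captions in one article, with different match beliefs.
caption_beliefs = [0.2, 0.8]
```

With these two captions, AND punishes the weak caption (0.16), MAX keeps only the strong one (0.8), and AVG sits in between (0.5), which is why forcing an explicit choice matters.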
  Slide 31: The scope operator makes belief combination explicit. [Diagram: one pair of nodes per figure caption, combined by #SCOPE[MAX:fgc] before #AND and #SCOPE[RESULT:article].] Query: #SCOPE[RESULT:article]( #AND( suicide bombings #SCOPE[MAX:fgc]( investigators ) ) )
  Slide 32: Additional support for structural constraints. New structural operators in paths ('*' may be substituted for an element type to select all that match the constraint):
      ./type   children with type
      .type    parent with type
      .//type  descendants with type
      .type    ancestors with type
      Example: #SCOPE[AVG:target]( #AND( trained #SCOPE[AVG:./arg1]( #AND( suicide bombers ) ) ) )
  Slide 33: Padding annotation boundaries with weighted term occurrences. Some annotation boundaries may be wrong, and surrounding text could provide additional context. [Example: graded weights assigned to the tokens around the ARG1 boundary in "George saw the astronomer with a telescope."]
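One way to implement padded boundaries, assuming token-index spans and a single fractional weight for the padded positions (the slide shows graded weights; this sketch uses one level as a simplification):

```python
def padded_counts(tokens, spans, pad=1, pad_weight=0.5):
    """Weighted term occurrences for one annotation type: tokens inside a
    span count fully; tokens within `pad` positions of a boundary count at
    `pad_weight`, hedging against annotation boundary errors."""
    counts = {}
    for start, end in spans:                       # half-open [start, end)
        for i, tok in enumerate(tokens):
            if start <= i < end:
                w = 1.0                            # inside the annotation
            elif start - pad <= i < start or end <= i < end + pad:
                w = pad_weight                     # just outside a boundary
            else:
                continue
            counts[tok] = counts.get(tok, 0.0) + w
    return counts

tokens = "George saw the astronomer with a telescope".split()
counts = padded_counts(tokens, spans=[(2, 4)])     # ARG1 = "the astronomer"
```

Here "the" and "astronomer" receive full weight while the adjacent "saw" and "with" receive half weight, so a slightly misplaced ARG1 boundary still contributes evidence.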
  Slide 34: New model summary. Representation functions and the representation layer increase structure-awareness, allow richer representations, and simplify queries and parameters. The scope operator increases structure-expressivity and forces a choice of belief combination. Extensions support annotation-robustness.
  Slide 35: Grid search. No customization of code or computation of gradients; easy to parallelize; optimizes for any measure; gives a better understanding of the parameter space, per-query analysis, and estimates of confidence intervals.
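The approach on this slide is just an exhaustive sweep. A minimal sketch with a toy objective standing in for a retrieval measure (the parameter names echo the thesis's λ and β but the objective is invented):

```python
import itertools

def grid_search(evaluate, grid):
    """Exhaustive sweep: evaluate every point on the grid, keep the best.
    `evaluate` may compute any measure (MAP, MRR, ...), and grid points are
    independent, so the sweep is trivially parallelizable."""
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy concave objective peaking at lam_doc = 0.4, beta = 1.0.
best, score = grid_search(
    lambda p: -(p["lam_doc"] - 0.4) ** 2 - (p["beta"] - 1.0) ** 2,
    {"lam_doc": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], "beta": [0.0, 1.0, 2.0, 3.0]},
)
```

Because no gradients are needed, the same loop works for non-smooth measures like MAP, and keeping the full score surface (not just the best point) is what enables the per-query and confidence-interval analyses mentioned above.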
  Slide 36: [Plots: parameter estimates for the i2004k and i2005k topics. MAP as a function of λ_element, λ_document, λ_collection, and the length prior β, comparing the best MAP against 25-step and 10-step grids.]
  Slide 37: Outline: Introduction, Related Work, Extensions to the Inference Network model, Results, Contributions.
  Slide 38: Known-item finding: retrieve the best document for a query (e.g. "IRS 1040 instructions"), evaluated using mean reciprocal rank (MRR). Testbeds:
      WT10G: 1,692,096 documents, 10 GB, html, homepage finding; topic sets t10ep samp. (100 topics) and t10ep off. (145 topics).
      .GOV: 1,247,753 documents, 18 GB, html/doc/pdf/ps, homepage and named-page finding; topic sets t12ki (300 topics) and t13mi (150 topics).
      Queries wrap the query terms in an #AND operator and include a prior probability of relevance based on URL type.
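For reference, the MRR measure used on these testbeds is simple to compute: the reciprocal of the rank of the first relevant document, averaged over topics. A sketch:

```python
def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant document; 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over topics; runs = [(ranking, relevant_set)]."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Two toy topics: the known item at rank 2, then at rank 1.
mrr = mean_reciprocal_rank([(["a", "b"], {"b"}), (["c", "d"], {"c"})])
```

MRR suits known-item finding because each topic has essentially one correct answer, so only the position of that answer matters.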
  Slide 39: Known-item finding results. Estimated parameters (with ranges) for t10ep samp. / t10ep off. / t12ki / t13mi:
      document 0.3 (0.1, 0.6) / 0.4 (0.2, 0.5) / 0.2 (0.1, 0.4) / 0.3 (0.1, 0.3)
      link 0.2 (0.1, 0.5) / 0.2 (0.1, 0.2) / 0.2 (0.2, 0.4) / 0.3 (0.1, 0.5)
      title 0.2 (0.1, 0.7) / 0.2 (0.1, 0.3) / 0.2 (0.1, 0.3) / 0.2 (0.1, 0.3)
      header 0.1 (0.0, 0.4) / 0.0 (0.0, 0.5) / 0.3 (0.0, 0.4) / 0.1 (0.0, 0.4)
      meta 0.0 (0.0, 0.0) / 0.0 (0.0, 0.2) / 0.0 (0.0, 0.1) / 0.0 (0.0, 0.2)
      collection 0.2 (0.0, 0.4) / 0.2 (0.1, 0.5) / 0.1 (0.1, 0.3) / 0.1 (0.1, 0.6)
      Performance in MRR, t10ep samp. / t10ep off. / t12ki / t13mi:
      doc + collection 0.756 / 0.654 / 0.403 / 0.372
      train 0.905 / 0.829 / 0.704 / 0.671
      test 0.891 / 0.821 / 0.702 / 0.650
      Best from TREC - / 0.774 / 0.727 (mixtures of multinomials + URL type prior [Ogilvie TREC 12]) / 0.738 (Okapi BM25F + PageRank)
  Slide 40: Element retrieval. Keyword queries retrieve any element; keyword + structure queries, e.g. //article[about(., suicide bombings) and about(.//fgc, investigators)]. Evaluated using MAP [Ogilvie CIKM 06]. Testbeds (CS journal articles):
      IEEE v1.4: 11,980 documents, 531 MB; keyword topics i2004k (34), keyword + structure topics i2003s (30).
      IEEE v1.8: 17,000 documents, 764 MB; keyword topics i2005k (29), keyword + structure topics i2004s (26).
  Slide 41: Element retrieval, keyword + structure queries: NEXI queries are converted to Indri queries (inference networks). //article[about(., suicide bombings) and about(.//fgc, investigators)] becomes #SCOPE[AVG:article]( #AND( suicide bombings #SCOPE[AVG:.//fgc]( investigators ) ) ).
  Slide 42: Element retrieval, keyword queries. Estimated parameters (with ranges) for i2004k / i2005k:
      self 0.1 (0.1, 0.3) / 0.3 (0.1, 0.4)
      collection 0.7 (0.4, 0.7) / 0.3 (0.1, 0.6)
      document 0.2 (0.2, 0.3) / 0.4 (0.2, 0.6)
      fig 0.0 (0.0, 0.1) / 0.0 (0.0, 0.2)
      titles 0.0 (0.0, 0.0) / 0.0 (0.0, 0.1)
      length 0.9 (0.9, 1.2) / 1.2 (0.9, 1.5)
      Performance in MAP, i2004k / i2005k:
      self + collection 0.179 / 0.099
      train 0.239 / 0.116
      test 0.234 / 0.112
      Best from INEX 0.235 (mixture model + pseudo relevance feedback) / 0.104
  Slide 43: Element retrieval, keyword + structure queries. Estimated parameters from i2003s for AND / AVG / MAX / MIN / OR:
      self 0.4 / 0.3 / 0.4 / 0.4 / 0.3
      collection 0.0 / 0.2 / 0.2 / 0.2 / 0.2
      document 0.5 / 0.5 / 0.4 / 0.4 / 0.5
      fig 0.0 / 0.0 / 0.0 / 0.0 / 0.0
      titles 0.1 / 0.0 / 0.0 / 0.0 / 0.0
      length 0.9 / 0.9 / 1.2 / 1.2 / 0.9
      Performance in MAP, i2003s train / i2003s test / i2004s train / i2004s test:
      self + collection (AVG) 0.369 / 0.369 / 0.272 / 0.270
      AND 0.282 / 0.273 / 0.224 / 0.174
      AVG 0.403 / 0.401 / 0.294 / 0.290
      MAX 0.386 / 0.384 / 0.286 / 0.280
      MIN 0.407 / 0.403 / 0.291 / 0.285
      OR 0.403 / 0.400 / 0.290 / 0.284
      Best from INEX - / 0.379 / - / 0.352 (mixture model + term propagation)
  Slide 44: Question answering experiments. AQUAINT collection (~1 million news articles); MIT set of 109 questions with exhaustive document judgments and sentence judgments. Corpus tagged with ASSERT (semantic predicates) and BBN Identifinder (named entities). Task: retrieve sentences containing the answer to the question, measured by mean average precision (MAP) with 5-fold cross-validation (same folds as [Bilotti thesis]).
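Since sentence retrieval here is measured by MAP rather than MRR, it is worth spelling out the measure: the precision at each rank where a relevant sentence appears, averaged over all relevant sentences, then averaged over questions. A sketch:

```python
def average_precision(ranking, relevant):
    """AP: mean of precision@k over the ranks k holding relevant items."""
    hits, ap = 0, 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            ap += hits / rank
    return ap / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over questions; runs = [(ranking, relevant_set)]."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy question: relevant sentences s1 and s3 ranked 1st and 3rd.
ap = average_precision(["s1", "s2", "s3"], {"s1", "s3"})
```

Unlike MRR, MAP rewards retrieving every answer-bearing sentence highly, which matters because a question can have many supporting sentences in the corpus.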
  Slide 45: Question conversion, structured queries. "Where are suicide bombers trained?" becomes #SCOPE[RESULT:sentence]( #AND( #SCOPE[AVG:target]( #AND( trained #SCOPE[AVG:./arg1]( #AND( suicide bombers ) ) #ANY:./argm-loc ) ) ) )
  Slide 46: Question conversion, keyword + named entity queries. "Where are suicide bombers trained?" becomes #SCOPE[RESULT:sentence]( #AND( trained suicide bombers #ANY:location ) )
  Slide 47: Question answering results. Estimated parameters (with ranges) across folds 1-5 for structured + sentence (AVG):
      element 0.1 (0.0, 0.2) / 0.1 (0.1, 0.3) / 0.1 (0.0, 0.1) / 0.1 (0.0, 0.2) / 0.1 (0.0, 0.1)
      collection 0.4 (0.2, 0.7) / 0.4 (0.2, 0.7) / 0.4 (0.3, 0.7) / 0.4 (0.2, 0.8) / 0.5 (0.4, 0.8)
      document 0.2 (0.1, 0.2) / 0.2 (0.1, 0.3) / 0.2 (0.1, 0.3) / 0.2 (0.1, 0.3) / 0.2 (0.0, 0.2)
      sentence 0.3 (0.1, 0.4) / 0.3 (0.1, 0.4) / 0.3 (0.1, 0.3) / 0.3 (0.1, 0.5) / 0.2 (0.1, 0.3)
      length 2.1 (1.2, 2.1) / 2.1 (0.0, 2.4) / 2.1 (0.0, 2.4) / 2.1 (0.0, 2.4) / 2.1 (0.0, 2.4)
      MAP averaged across test folds, All / Shallow / Deep / Shallow + Deep:
      keyword + named-entity 0.218 / 0.197 / 0.232 / 0.211
      structured 0.201 / 0.197 / 0.206 / 0.201
      structured + padding 0.206 / 0.197 / 0.210 / 0.202
      structured + sentence 0.240 / 0.197 / 0.303 / 0.240
      Bilotti thesis 0.233 / 0.201 / 0.279 / 0.233
      The AVG combination is even stronger on this testbed.
  Slide 48: Question 1494: Who wrote "East is east, west is west and never the twain shall meet"? Query: #SCOPE[RESULT:sentence]( #AND( #SCOPE[AVG:target]( #AND( wrote #SCOPE[AVG:./arg1]( #AND( east east west west never twain shall meet ) ) ) ) #ANY:person ) ). Matching sentence: [ARGM-TMP One hundred years ago,] [PERSON Kipling] [TARGET wrote,] "Oh, East is East, and West is West, and never the twain shall meet."
  Slide 49: Results summary. The extensions to the Inference Network model plus grid search provide strong results; the scope AVG combination method is robust; a good choice of representations can improve annotation-robustness.
  Slide 50: Outline: Introduction, Related Work, Extensions to the Inference Network model, Results, Contributions.
  Slide 51: Contributions. Standardized the use of mixtures of language models for multiple representations [Ogilvie SIGIR 03]. Pushed the state of the art in query languages, index structures, and retrieval models. Introduced a vocabulary for discussing retrieval models with support for document structure and annotations. Demonstrated the promise of annotation-robust models. Showed that grid search is a viable parameter estimation method. Took a broader view of structure than prior work, shaping our understanding of what's important. Validated these models on many tasks. Explicitly recognized the role of the query language.
