CMU TREC 2007 Blog Track

Presentation for the TREC 2007 Blog Track


  1. Retrieval and Feedback Models for Blog Distillation: CMU at the TREC 2007 Blog Track. Jonathan Elsas, Jaime Arguello, Jamie Callan, Jaime Carbonell.
  2. CMU’s Blog Distillation Focus • Two research questions: What is the appropriate unit of retrieval, feeds or entries? How can we do effective pseudo-relevance feedback for blog distillation? • Our four submissions investigate these two dimensions.
  3. Outline • Corpus preprocessing tasks • Two feedback models • Two retrieval models • Results • Discussion
  4. Corpus Preprocessing • Used only FEED documents (vs. PERMALINK or HOMEPAGE documents). • For each FEEDNO, extracted each new entry across all feed fetches. • Aggregated into one document per FEEDNO, retaining structural elements from the FEED documents: title, description, entry, entry title, etc. • Very helpful: the Python Universal Feed Parser (see the sketch below).
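A minimal sketch of this aggregation step in Python, using the Universal Feed Parser named on the slide. The input layout (a FEEDNO mapped to its list of raw XML fetches) and the output tag names are illustrative assumptions, not the track's actual intermediate format:

```python
# Sketch: merge all fetches of one feed into a single structured pseudo-document.
# Assumes feedparser (the Universal Feed Parser) is installed: pip install feedparser
import feedparser

def aggregate_feed(feedno, fetches):
    """Collect each new entry across all fetches of one FEEDNO."""
    seen = {}                      # entry key -> (title, text); dedupes refetched entries
    feed_title, feed_desc = "", ""
    for raw_xml in fetches:        # fetches: list of raw feed XML strings (assumed input)
        parsed = feedparser.parse(raw_xml)
        feed_title = parsed.feed.get("title", feed_title)
        feed_desc = parsed.feed.get("description", feed_desc)
        for entry in parsed.entries:
            key = entry.get("id") or entry.get("link") or entry.get("title", "")
            if key and key not in seen:          # keep each new entry once
                seen[key] = (entry.get("title", ""), entry.get("summary", ""))
    # One document per FEEDNO, retaining feed structure as fields (tag names illustrative).
    parts = [f"<DOCNO>{feedno}</DOCNO>",
             f"<feedtitle>{feed_title}</feedtitle>",
             f"<description>{feed_desc}</description>"]
    for title, text in seen.values():
        parts.append(f"<entry><entrytitle>{title}</entrytitle> {text}</entry>")
    return "\n".join(parts)
```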
  5. Two Pseudo-Relevance Feedback Models • Indri’s built-in pseudo-relevance feedback (Lavrenko’s relevance model): run Metzler’s dependence-model query on the full feed documents; this produces a weighted unigram PRF query, QRM = #weight( w1 t1 w2 t2 … w50 t50 ) (a formatting sketch follows). • Wikipedia-based pseudo-relevance feedback: focus on anchor text linking to highly ranked documents with respect to a baseline bag-of-words (BOW) query.
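The #weight form above is ordinary Indri query syntax over (weight, term) pairs; a trivial, purely illustrative formatter (not part of the submitted system):

```python
def to_weight_query(weighted_terms):
    """Format (weight, term) pairs as an Indri #weight query,
    e.g. Q_RM = #weight( w1 t1 w2 t2 ... w50 t50 )."""
    inner = " ".join(f"{w:.4f} {t}" for w, t in weighted_terms)
    return f"#weight( {inner} )"

# Example: to_weight_query([(0.31, "photography"), (0.12, "camera")])
# -> '#weight( 0.3100 photography 0.1200 camera )'
```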
  6. Wikipedia-based PRF [diagram]: the baseline BOW query is run against Wikipedia.
  7. Wikipedia-based PRF [diagram]: the top T Wikipedia results form the Relevance Set; the top N form the Working Set.
  8. Wikipedia-based PRF [diagram, build continues].
  9. Wikipedia-based PRF: extract anchor text from the Working Set (top N) that points into the Relevance Set (top T). Qwiki = #weight( w1 a-text1 w2 a-text2 … w20 a-text20 ), where wi ~ sum over anchors ai with that text of ( T - rank(target(ai)) ) (a sketch of this weighting follows).
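A sketch of the slide-9 weighting in Python. The input structures are assumptions for illustration: a list of (anchor text, target page) pairs harvested from the Working Set, and a page-to-rank lookup for the Relevance Set:

```python
from collections import defaultdict

def wikipedia_prf_query(working_set_anchors, relevance_rank, T, k=20):
    """Score anchor texts as on slide 9: w_i ~ sum over anchors a_i with that
    text of (T - rank(target(a_i))), keeping anchors whose target page sits in
    the top-T Relevance Set; return the top-k texts as a Q_wiki #weight query."""
    scores = defaultdict(float)
    for text, target in working_set_anchors:     # anchors found in the Working Set
        rank = relevance_rank.get(target)        # rank 0 = best Relevance Set page
        if rank is not None and rank < T:        # target must be in the top T
            scores[text.lower()] += T - rank     # higher-ranked targets count more
    top = sorted(scores.items(), key=lambda kv: -kv[1])[:k]
    total = sum(w for _, w in top) or 1.0
    inner = " ".join(f"{w / total:.4f} {t}" for t, w in top)
    return f"#weight( {inner} )"
```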
  10. Wikipedia-based PRF example: query “Apple iPod” [figure].
  11. Two Pseudo-Relevance Feedback Models. Expansion terms for Q983, “Photography”:
      Wikipedia PRF: photography, photographer, depth of field, digital camera, photograph, pulitzer prize, photographic film, photojournalism, cinematography, shutter speed
      Indri’s Relevance Model: photography, aerial, full, resource, stock, free, information, art, wedding, great
  12. Two Pseudo-Relevance Feedback Models. Expansion terms for Q995, “Ruby on Rails”:
      Wikipedia PRF: pokémon ruby, pokémon ruby and sapphire, pokémon emerald, ruby programming language, pokémon firered and leafgreen, pokémon adventures, pokémon video games, standard gauge, tram
      Indri’s Relevance Model: rail, ruby, kakutani, que, article, php, weblog, activestate, rubyonrail, new, develop, dontstopmusic
  13. Two Retrieval Models • Large Document model: the entire feed is the unit of retrieval. • Small Document model: the individual entry is the unit of retrieval; ranked entries are aggregated into a feed ranking.
  14. Large Document model [diagram]: the collection of aggregated feed documents.
  15. Large Document model [diagram]: the query Q is run against the feed document collection.
  16. Large Document model [diagram]: retrieval returns ranked feeds directly.
  17. Small Document model [diagram]: the collection of individual entry documents.
  18. Small Document model [diagram]: the query Q is run against the entry document collection.
  19. Small Document model [diagram]: retrieval returns ranked entries.
  20. Small Document model [diagram]: ranked entries are aggregated into ranked feeds.
  21. Large Document Model • Language-modeling retrieval using the feed document structure. • Features used for the Large Document model: P(Feed | Qfeed-title), P(Feed | Qentry-text), P(Feed | QRM), P(Feed | Qwiki).
  22. Large Document Model: the corresponding Large Document Indri query [figure; a sketch of its general shape follows].
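The actual Indri query on this slide is an image and is not recoverable from the transcript; below is a hedged reconstruction of its general shape only, a #weight combination of the four feature queries. The λ values and field names are placeholders, not the submitted settings:

```python
def large_doc_query(q_title, q_text, q_rm, q_wiki, lams=(0.2, 0.4, 0.2, 0.2)):
    """Illustrative shape of the large-document query: a #weight mixture of
    the four feature queries, field-restricted where appropriate. The lambdas
    and field names (feedtitle, entry) are assumptions, not the run's values."""
    parts = [
        f"{lams[0]} #combine[feedtitle]( {q_title} )",  # P(Feed | Q_feed-title)
        f"{lams[1]} #combine[entry]( {q_text} )",       # P(Feed | Q_entry-text)
        f"{lams[2]} {q_rm}",                            # P(Feed | Q_RM)
        f"{lams[3]} {q_wiki}",                          # P(Feed | Q_wiki)
    ]
    return "#weight( " + " ".join(parts) + " )"
```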
  23. Small Document Model • Feed ranking in blog distillation is analogous to resource ranking in federated search: Feed ~ Resource, Entry ~ Document. • Created a sampled collection: sampled (with replacement) 100 entries from each feed, to control for dramatically different feed lengths.
  24. Small Document Model [build of slide 23; a sampling sketch follows].
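A minimal sketch of the sampling step, assuming feeds are held in memory as a FEEDNO-to-entry-list mapping:

```python
import random

def sample_collection(feeds, n=100, seed=0):
    """Build the sampled entry collection: draw n entries per feed with
    replacement, so long and short feeds contribute equally.
    `feeds` maps FEEDNO -> list of entry texts (assumed input layout)."""
    rng = random.Random(seed)                  # fixed seed for reproducibility
    return {feedno: rng.choices(entries, k=n)  # choices() samples with replacement
            for feedno, entries in feeds.items() if entries}
```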
  25. Small Document Model • Adapted Relevant Document Distribution Estimation (ReDDE) resource ranking. ReDDE is a well-known, state-of-the-art federated search algorithm.
  26. Small Document Model [build]: assuming a uniform prior and equal feed lengths, the ReDDE estimate simplifies [formula].
  27. Small Document Model [build continues; a simplified scoring sketch follows].
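A sketch of the simplified ReDDE scoring under the assumptions stated on slides 26 and 27 (uniform prior, equal feed lengths). The top-n cutoff and unit votes are an illustrative simplification of the full ReDDE estimator, which weights votes by the collection-to-sample size ratio:

```python
from collections import defaultdict

def redde_feed_ranking(ranked_entries, top_n=1000):
    """With a uniform prior and equal feed lengths, each sampled entry in the
    top-n retrieval casts an equal vote for its source feed, so a feed's score
    is simply its share of the top-ranked sampled entries.
    `ranked_entries` is a retrieval ranking of (feedno, score) pairs."""
    votes = defaultdict(float)
    for feedno, _score in ranked_entries[:top_n]:
        votes[feedno] += 1.0                    # equal-length feeds: unit votes
    return sorted(votes.items(), key=lambda kv: -kv[1])   # ranked feeds
```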
  28. Small Document Model • Features used in the small document model: P(Feed | Qentry-text), P(Feed | QRM), P(Feed | Qwiki).
  29. Small Document Model: the corresponding Small Document Indri query [figure].
  30. Parameter Setting • Selecting feature weights (λ’s) required training data. • Relevance judgments were produced for a small subset of the queries (6+2): a BOW title query, with 50 documents judged per query. • A simple grid search chose the parameters that maximized MAP (see the sketch below).
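A minimal sketch of such a grid search. `evaluate_map` is a hypothetical callback that runs retrieval with a given λ setting and returns MAP on the judged training queries:

```python
import itertools

def grid_search_lambdas(features, evaluate_map, step=0.1):
    """Exhaustive grid over feature weights (lambdas) that sum to 1,
    keeping the setting with the best MAP on the training judgments."""
    points = [round(i * step, 2) for i in range(int(round(1 / step)) + 1)]
    best_lams, best_map = None, -1.0
    for lams in itertools.product(points, repeat=len(features)):
        if abs(sum(lams) - 1.0) > 1e-9:          # only convex combinations
            continue
        score = evaluate_map(dict(zip(features, lams)))
        if score > best_map:
            best_lams, best_map = lams, score
    return best_lams, best_map

# Example (hypothetical feature names matching the slides):
# grid_search_lambdas(["feed-title", "entry-text", "RM", "wiki"], evaluate_map)
```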
  31. Results [chart]: our best run was Wikipedia expansion + the Large Document model.
  32. Results [chart]: per-query performance, with queries ordered by CMUfeedW performance.
  33. Feature Elimination [chart comparing CMUfeedW and CMUfeed].
  34. Conclusions • What worked: preprocessing the corpus and using only the feed XML; a simple retrieval model with appropriate features; Wikipedia expansion; a (small amount of) training data. • What didn’t (or what we tried without success): the Small Document model (but we think it can work); spam/splog detection, anchor text, URL segmentation.
  35. Thank You
  36. Feature Elimination:

                                          +Training   -Training
      +All (CMUfeedW)                       0.3695      0.3536
      -Dependence Model                     0.3676      0.3457
      -Wiki (CMUfeed)                       0.3385      0.3152
      -Wiki, -Indri’s PRF                   0.2989
      -Wiki, -Indri’s PRF, -Dep Model       0.2907
