Retrieval and Feedback Models for Blog Feed Search

SIGIR 2008 Presentation

  • Retrieval and Feedback Models for Blog Feed Search

    1. 1. Retrieval and Feedback Models for Blog Feed Search. SIGIR 2008, Singapore. Jonathan Elsas, Jaime Arguello, Jamie Callan & Jaime Carbonell, LTI/SCS/CMU
    2. 2. Outline
        • The task
            • Overview of Blogs & Blog Search
            • Challenges in Blog Search
        • Our approach
            • Retrieval Models
            • Query Expansion Models
        • Conclusion
    3. 3. Background
    4. 4. What is a Blog?
    5. 5. What is a Feed? <xml> <feed> <entry> <author>Peter …</> <title>Good, Evil…</> <content>I’ve said…</> </entry> <entry> <author>Peter …</> <title>Agreeing…</> <content>Some peo…</> </entry> …
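To make the feed structure on this slide concrete, here is a minimal sketch that splits a feed into its entries. The sample XML is hypothetical (the slide's own fragment is elided), and real Atom or RSS feeds usually carry XML namespaces that this toy parser does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed shaped like the fragment on the slide; the text is made up.
SAMPLE_FEED = """<feed>
  <entry><author>A. Blogger</author><title>First post</title><content>Hello world</content></entry>
  <entry><author>A. Blogger</author><title>Second post</title><content>More text here</content></entry>
</feed>"""


def parse_entries(feed_xml):
    """Return one dict per <entry>, mapping child tag names to their text."""
    root = ET.fromstring(feed_xml)
    return [
        {child.tag: (child.text or "") for child in entry}
        for entry in root.iter("entry")
    ]


entries = parse_entries(SAMPLE_FEED)
for entry in entries:
    print(entry["author"], "|", entry["title"])
```

Each returned dict corresponds to one post, which is the unit the later "small document" model indexes.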
    6. 6. Blog-Feed Correspondence Blog Feed Post Entry HTML XML
    7. 7. Why are Blogs important?
        • Technorati currently tracking:
            • > 112.8 Million Blogs
            • > 175,000 new Blogs per day
            • > 1.6 Million posts per day
        • [http://www.technorati.com/about/]
    8. 8. The Task
    9. 9. Feed Search at TREC (a.k.a. Blog Distillation)
        • Ranking Blogs/Feeds (collections of posts) in response to a user’s query, [X]
        • “A relevant feed should have a principle and recurring interest in X” (TREC 2007 Blog Track)
    10. 10. Feed Search at TREC
        • Example queries: [Gardening], [Apple iPod], [Violence in Sudan], [Gun Control], [Food], [Wine]
        • Represent Ongoing Information Needs
        • Frequently Very General
    11. 11. Challenges in Feed Search
    12. 12. Challenges in Feed Search
        • 1. A feed is a collection of documents
        • (Diagram: a feed drawn as a sequence of entries over time)
    13. 13. Challenges in Feed Search
        • 1. A feed is a collection of documents
            • How does relevance at the entry level correspond to relevance at the feed level?
        • (Diagram: a feed drawn as a sequence of entries over time)
    14. 14. Challenges in Feed Search
        • 2. Even a topical feed is topically diverse
        • (Diagram: entries of a "Space Exploration" feed plotted by time and topic: NASA, China’s plans for the moon, shuttle launch, My dog, Mars rover, Boeing)
    15. 15. Challenges in Feed Search
        • 2. Even a topical feed is topically diverse
            • Can we favor entries close to the central topic of the feed?
        • (Diagram: "Space Exploration" feed entries plotted by time and topic)
    16. 16. Challenges in Feed Search
        • 3. Feeds are noisy
            • Spam blogs, spam & off-topic comments
    17. 17. Challenges in Feed Search
        • 4. General & Ongoing Information Needs
            • [Mac]: … post regularly about new products, features, or application software of Apple Mac computers.
            • [Music]: … describing songs, biographies of musicians, musical styles and their influences of music on people are discussed.
            • [Wine]: … such as tastings, reviews, food matching or pairing, and oenophile news and events.
            • [Food]: … describing experiences eating cuisines, culinary delights, recipes, nutrition plans.
    18. 18. Our Approach
    19. 19. Challenges / Our Approach
        • Challenges
            • Feeds: Topically Diverse, Noisy, Collections
            • Information Needs: General & Ongoing
        • Our Approach
            • Retrieval Models
            • Feedback Models
    20. 20. Retrieval Models
        • Challenge: ranking topically diverse collections
        • Representation: feed vs. entry
        • Model topical relationship between entries
    21. 21. Large Document (Feed) Model
        • (Diagram: [Q] → Feed Document Collection, one XML document per feed holding all of its entries → Ranked Feeds)
        • Rank by Indri’s standard retrieval model [Metzler and Croft, 2004; 2005]
    22. 22. Large Document (Feed) Model
        • Advantages:
            • A straightforward application of existing retrieval techniques
        • Potential Pitfalls:
            • Large entries dominate a feed’s language model
            • Ignores relationship among entries
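The large document model can be sketched as follows: concatenate all of a feed's entries into one document and rank feeds by query likelihood. This is only an illustration under stated assumptions. It uses plain Dirichlet-smoothed query likelihood as a simplified stand-in for Indri's retrieval model, and the function names, the toy feeds, and the value of `mu` are hypothetical.

```python
import math
from collections import Counter


def dirichlet_log_likelihood(query_terms, doc_terms, coll_counts, coll_len, mu):
    """Log query likelihood of one document with Dirichlet smoothing."""
    doc_counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        p_coll = coll_counts.get(term, 0) / coll_len
        p = (doc_counts[term] + mu * p_coll) / (len(doc_terms) + mu)
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


def rank_feeds_large_document(query_terms, feeds, mu=2500.0):
    """feeds maps feed id -> list of entries, each entry a list of tokens."""
    # One "large document" per feed: all of its entries concatenated.
    feed_docs = {fid: [t for entry in entries for t in entry]
                 for fid, entries in feeds.items()}
    coll_counts = Counter(t for doc in feed_docs.values() for t in doc)
    coll_len = sum(coll_counts.values())
    scores = {fid: dirichlet_log_likelihood(query_terms, doc, coll_counts, coll_len, mu)
              for fid, doc in feed_docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data, made up for illustration.
feeds = {
    "space": [["nasa", "shuttle", "launch"], ["mars", "rover", "nasa"]],
    "food": [["wine", "tasting"], ["recipe", "wine", "pasta"]],
}
ranking = rank_feeds_large_document(["nasa"], feeds)
print(ranking)
```

Because entries are concatenated before scoring, one very long entry contributes most of the feed's term counts, which is the first pitfall listed on the slide.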
    23. 23. Small Document (Entry) Model
        • (Diagram: [Q] → Entry Document Collection, where document = entry → Ranked Entries → apply some rank aggregation function → Ranked Feeds)
    24. 24. Small Document (Entry) Model
        • Rank by:
            • Query Likelihood
            • Entry Centrality
            • Feed Prior: favors longer feeds
        • ReDDE Federated Search Algorithm [Si & Callan, 2003]
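The slide names three components but the transcript does not preserve how they are combined. The sketch below assumes one plausible combination: a feed's score is its prior times the sum over entries of query likelihood weighted by centrality. The uniform centrality, the `log(1 + n)` prior (chosen here so that single-entry feeds do not score zero), the smoothing parameter, and all names are assumptions for illustration, not the authors' exact formulation.

```python
import math
from collections import Counter


def query_likelihood(query_terms, entry_terms, coll_counts, coll_len, mu=10.0):
    """Dirichlet-smoothed probability of the query given one entry."""
    counts = Counter(entry_terms)
    p = 1.0
    for term in query_terms:
        p_coll = coll_counts.get(term, 0) / coll_len
        p *= (counts[term] + mu * p_coll) / (len(entry_terms) + mu)
    return p


def score_feed(query_terms, entries, coll_counts, coll_len, log_prior=True):
    """Assumed aggregation: prior(F) * sum_E P(Q|E) * centrality(E, F)."""
    n = len(entries)
    if n == 0:
        return 0.0
    centrality = 1.0 / n  # uniform centrality: every entry counts equally
    prior = math.log(1 + n) if log_prior else 1.0  # favors longer feeds
    return prior * sum(
        query_likelihood(query_terms, e, coll_counts, coll_len) * centrality
        for e in entries
    )


# Toy data, made up for illustration.
feeds = {
    "space": [["nasa", "shuttle", "launch"], ["mars", "rover", "nasa"]],
    "food": [["wine", "tasting"], ["recipe", "wine", "pasta"]],
}
coll_counts = Counter(t for es in feeds.values() for e in es for t in e)
coll_len = sum(coll_counts.values())
scores = {fid: score_feed(["nasa"], es, coll_counts, coll_len)
          for fid, es in feeds.items()}
print(scores)
```

Scoring each entry separately before aggregating is what keeps a single long entry from dominating the feed.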
    25. 25. Entry Centrality
        • Uniform:
        • Geometric Mean:
        • (Diagram: entries plotted by time and topic)
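The two centrality formulas did not survive in the transcript, so the following is a sketch of what the two options could look like, not a reconstruction. Uniform centrality gives every entry weight 1/N. For the geometric mean variant, the sketch assumes the score of an entry is the geometric mean of the probabilities of its terms under a feed language model, estimated here by simple maximum likelihood over the feed's concatenated entries, then normalized within the feed.

```python
import math
from collections import Counter


def uniform_centrality(entries):
    """Every entry in the feed gets the same weight."""
    return [1.0 / len(entries)] * len(entries)


def geometric_mean_centrality(entries):
    """Weight each entry by the geometric mean of P(term | feed) over its tokens.

    P(term | feed) is an assumed maximum-likelihood estimate from all of the
    feed's entries, so every term of an entry has non-zero probability.
    """
    feed_counts = Counter(t for e in entries for t in e)
    feed_len = sum(feed_counts.values())
    raw = []
    for entry in entries:
        if not entry:
            raw.append(0.0)
            continue
        mean_log = sum(math.log(feed_counts[t] / feed_len) for t in entry) / len(entry)
        raw.append(math.exp(mean_log))
    total = sum(raw)
    if total == 0.0:
        return uniform_centrality(entries)
    return [r / total for r in raw]


# Toy feed about space exploration with one off-topic entry (made up).
entries = [
    ["nasa", "shuttle", "launch"],
    ["nasa", "mars", "rover"],
    ["nasa", "moon", "china"],
    ["my", "dog", "walk"],
]
weights = geometric_mean_centrality(entries)
print(weights)
```

In the toy feed the off-topic entry shares no vocabulary with the others, so it receives the smallest weight.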
    26. 26. Small Document (Entry) Model
        • Advantages:
            • Controls for differing entry length
            • Models topical relationship among entries
        • Disadvantages:
            • Centrality computation is slow(er)
        • Not only improves speed, also performance
    27. 27. Retrieval Model Results
    28. 28. Retrieval Model Results
        • 45 Queries from the TREC 2007 Blog Distillation Task
        • BLOG06 test collection, XML feeds only
        • 5-Fold Cross Validation for all retrieval model smoothing parameters
    29. 29. Retrieval Model Results
        • (Chart: Mean Average Precision; Large Document (Feed) Model vs. Small Document (Entry) Models)
    30. 30. Retrieval Model Results
        • (Chart: Mean Average Precision; labels: Uniform, Log(Feed Length), Uniform, Log; table fragment: Prior, MAP 0.188)
    31. 31. Retrieval Model Results
        • (Chart: Mean Average Precision; labels: Uniform, Log(Feed Length), Uniform, n/a)
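The charts report Mean Average Precision. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition; the feed ids and relevance judgments are made up.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs, qrels):
    """runs: query -> ranked list of ids; qrels: query -> set of relevant ids."""
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)


# Toy example with two queries.
runs = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
qrels = {"q1": {"a", "c"}, "q2": {"y"}}
map_score = mean_average_precision(runs, qrels)
print(round(map_score, 4))
```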
    32. 32. Feedback Models
        • Challenge: Noisy collection with general & ongoing information needs
        • Use a cleaner external collection for query expansion (Wikipedia)
        • With an expansion technique designed to identify multiple query facets
    33. 33. Query Expansion (PRF)
        • (Diagram: [Q] → BLOG06 Collection → Related Terms from top K documents → [Q + Terms])
        • [Lavrenko & Croft, 2001]
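Pseudo-relevance feedback as drawn on this slide can be sketched roughly as below: take the top K documents for the original query and weight each candidate term by how often it occurs in those documents, with each document's contribution scaled by its retrieval score. This is a simplified relevance-model-style estimate; the function name, input format, and toy scores are assumptions.

```python
from collections import Counter


def relevance_model_terms(ranked_docs, k=10, n_terms=5):
    """ranked_docs: list of (tokens, query_likelihood) pairs, best first.

    Returns the n_terms highest-weighted expansion terms, where
    weight(w) = sum over top-k docs of P(w | D) * normalized score of D.
    """
    top = ranked_docs[:k]
    norm = sum(score for _, score in top)
    weights = Counter()
    if norm == 0:
        return []
    for tokens, score in top:
        if not tokens:
            continue
        for term, tf in Counter(tokens).items():
            weights[term] += (tf / len(tokens)) * (score / norm)
    return weights.most_common(n_terms)


# Toy retrieval results for the query [photography]; scores are made up.
ranked = [
    (["photography", "camera", "lens"], 0.6),
    (["photography", "film", "camera"], 0.3),
    (["free", "download"], 0.1),
]
terms = relevance_model_terms(ranked, k=3, n_terms=3)
print(terms)
```

The expansion terms are only as clean as the top K documents, so spam near the top of the ranking feeds directly into the expanded query.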
    34. 34. Query Expansion Example: [Photography]
        • Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography
        • PRF: photography, nude, erotic, art, girl, free, teen, fashion, women
    35. 35. Feedback Model Results
        • (Chart: Mean Average Precision; None vs. PRF)
    36. 36. Query Expansion (Wikipedia PRF)
        • (Diagram: [Q] → Wikipedia → Related Terms from top K documents → [Q + Terms] → BLOG06 Collection)
        • [Lavrenko & Croft, 2001]; [Diaz & Metzler, 2006]
    37. 37. Query Expansion Example: [Photography]
        • Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography
        • PRF: photography, nude, erotic, art, girl, free, teen, fashion, women
        • Wikipedia PRF: photography, director, special, film, art, camera, music, cinematographer, photographic
    38. 38. Feedback Model Results
        • (Chart: Mean Average Precision; None, PRF, Wiki. PRF)
    39. 39. Query Expansion (Wikipedia Link)
        • (Diagram: [Q] → Wikipedia → Related Terms from link structure → [Q + Terms] → BLOG06 Collection)
    40. 40. Wikipedia Link-Based Query Expansion
    41. 41. Wikipedia Link-Based Expansion
        • (Diagram: Q is run against Wikipedia)
    42. 42. Wikipedia Link-Based Expansion
        • Relevance Set, Top R = 100
        • Working Set, Top W = 1000
    43. 43. Wikipedia Link-Based Expansion
        • Relevance Set, Top R = 100
        • Working Set, Top W = 1000
    44. 44. Wikipedia Link-Based Expansion
        • Relevance Set, Top R = 100
        • Working Set, Top W = 1000
        • Extract anchor text from the Working Set that links to the Relevance Set.
    45. 45. Wikipedia Link-Based Expansion
        • Relevance Set, Top R = 500
        • Working Set, Top W = 1000
        • Extract anchor text from the Working Set that links to the Relevance Set.
        • Combines relevance and popularity
            • Relevance: An anchor phrase that links to a highly ranked article gets a high score
            • Popularity: An anchor phrase that links many times to a mid-ranked article also gets a high score
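The link-based procedure on these slides can be sketched as follows. The slides do not give the scoring formula, so the sketch assumes each link from a Working Set article into the Relevance Set adds `R - rank(target)` to its anchor phrase, which rewards both highly ranked targets (relevance) and frequently used anchors (popularity). The function name, the input format, and the toy link graph are hypothetical.

```python
from collections import defaultdict


def link_based_expansion(ranked_articles, links, r=100, w=1000, n_terms=20):
    """Score anchor phrases that link from the working set into the relevance set.

    ranked_articles: Wikipedia article ids for the query, best first.
    links: source article id -> list of (anchor_text, target article id).
    Assumed scoring: each qualifying link adds (r - zero_based_rank_of_target).
    """
    working_set = ranked_articles[:w]
    relevance_rank = {a: i for i, a in enumerate(ranked_articles[:r])}
    scores = defaultdict(float)
    for source in working_set:
        for anchor, target in links.get(source, []):
            if target in relevance_rank:
                scores[anchor.lower()] += r - relevance_rank[target]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n_terms]


# Toy link graph, made up for illustration; tiny R and W keep it readable.
ranked = ["Photography", "Camera", "Film", "Painting"]
links = {
    "Photography": [("camera", "Camera")],
    "Camera": [("photography", "Photography"), ("film", "Film")],
    "Film": [("photography", "Photography")],
    "Painting": [("photography", "Photography"), ("art", "Art")],
}
expansion = link_based_expansion(ranked, links, r=2, w=4)
print(expansion)
```

Because anchor text is written by Wikipedia editors as a phrase, the expansion terms come out as multi-word concepts rather than isolated tokens.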
    46. 46. Query Expansion Example: [Photography]
        • Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography
        • PRF: photography, nude, erotic, art, girl, free, teen, fashion, women
        • Wikipedia Link-Based: photography, photographer, digital photography, photographic, depth of field, feature photography, film, photographic film, photojournalism
    47. 47. Feedback Model Results
        • (Chart: Mean Average Precision; None, PRF, Wiki. PRF, Wiki. Link)
    48. 48. Conclusion
        • Feed Search Challenges:
            • Feeds are topically diverse, noisy collections
            • Ranked against ongoing & general information needs
        • Novel Retrieval Models:
            • Ranking collections, sensitive to topical relationship among entries
        • Novel Feedback Models:
            • Discover multiple query facets & robust to collection noise
    49. 49. Thank You! Student Travel Grant funding from: ACM SIGIR, Amit Singhal, Microsoft Research
    50. 50. Entry Centrality GM Derivation where Entry Generation Likelihood: |E|
    51. 51. Query Expansion Examples: [Music]
        • PRF: Music, Country, Download, Free, MP3, Mp3andmore, Lyric, Listen, Song
        • Wikipedia Expansion: Music, Folk music, Electronic music, Folk, Music video, World music, Ambient, Electronic, Country music
    52. 52. Query Expansion Examples: [Scottish Independence]
        • PRF: scotland, independence, party, convention, politics, snp, national, people, scot
        • Wikipedia Expansion: scotland, scottish parliament, scottish, scottish national party, wars of scottish independence, scottish independence, william wallace, glasgow, scottish socialist party
    53. 53. Query Expansion Examples: [Machine Learning]
        • PRF: learn, machine, credit, card, karaoke, journal, sex, model, sew
        • Wikipedia Expansion: machine learning, learning, artificial intelligence, turing machine, machine gun, neural network, support vector machine, supervised learning, artificial neural network
    54. 54. Query Generality Characteristics
        • Query Length:
            • BLOG: 1.9 words
            • TB04: 3.2 words
            • TB05: 3.0 words
        • ODP Depth:
            • BLOG: 4.7 levels
            • TB04: 5.2 levels
            • TB05: 5.3 levels
    55. 55. Relevance Set Cohesiveness
        • Relevance Set, Top R = 100 (Wikipedia)
        • $\mathrm{Cohesiveness} = \dfrac{|L_{in}|}{|L_{in} \cup L_{out}|}$
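The ratio on this slide is straightforward to compute once the two link sets are defined. The transcript does not define them, so this sketch assumes L_in is the set of links from Relevance Set articles to other Relevance Set articles and L_out is the set of links from Relevance Set articles to articles outside it. The function name and the toy graph are hypothetical.

```python
def relevance_set_cohesiveness(relevance_set, links):
    """Fraction of the relevance set's outgoing links that stay inside the set.

    relevance_set: iterable of article ids.
    links: source article id -> list of target article ids.
    """
    members = set(relevance_set)
    l_in, l_out = set(), set()
    for source in members:
        for target in links.get(source, []):
            if target in members:
                l_in.add((source, target))
            else:
                l_out.add((source, target))
    total = len(l_in | l_out)
    return len(l_in) / total if total else 0.0


# Toy graph: A and B link to each other, A also links outside the set.
cohesion = relevance_set_cohesiveness(["A", "B"], {"A": ["B", "C"], "B": ["A"]})
print(cohesion)
```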
    56. 56. Relevance Set Cohesiveness
    57. 57. Is it the Queries?
        • Feed Search Queries ≠ TB Adhoc Queries
        • But none of these measures predict whether Wikipedia expansion helps…
