Retrieval and Feedback Models for Blog Feed Search SIGIR 2008 Singapore Jonathan Elsas, Jaime Arguello, Jamie Callan & Jaime Carbonell LTI/SCS/CMU
Outline <ul><li>The task </li></ul><ul><ul><li>Overview of Blogs & Blog Search </li></ul></ul><ul><ul><li>Challenges in Blog Search </li></ul></ul><ul><li>Our approach </li></ul><ul><ul><li>Retrieval Models </li></ul></ul><ul><ul><li>Query Expansion Models </li></ul></ul><ul><li>Conclusion </li></ul>
Background
What is a Blog?
What is a Feed? <xml> <feed> <entry> <author>Peter …</> <title>Good, Evil…</> <content>I’ve said…</> </entry> <entry> <author>Peter …</> <title>Agreeing…</> <content>Some peo…</> </entry> …
Blog-Feed Correspondence Blog Feed Post Entry HTML XML
Why are Blogs important? <ul><li>Technorati currently tracking > 112.8 Million Blogs > 175,000 new Blogs per day > 1.6 Million posts per day </li></ul>[http://www.technorati.com/about/]
The Task
Feed Search at TREC <ul><li>Ranking Blogs/Feeds (collections of posts) in response to a user’s query, [X] </li></ul><ul><li>“ A relevant feed should have a principle and recurring interest in X ” </li></ul><ul><li>— TREC 2007 Blog Track </li></ul>(a.k.a. Blog Distillation)
Feed Search at TREC <ul><li>[Gardening] </li></ul><ul><li>[Apple iPod] </li></ul><ul><li>[Violence in Sudan] </li></ul><ul><li>[Gun Control] </li></ul><ul><li>[Food] </li></ul><ul><li>[Wine] </li></ul>Represent Ongoing Information Needs Frequently Very General
Challenges in Feed Search
Challenges in Feed Search <ul><li>A feed is a collection of documents  </li></ul>entries time feed
<ul><li>A feed is a collection of documents </li></ul><ul><ul><li>How does relevance at the entry level correspond to relevance at the feed level? </li></ul></ul>Challenges in Feed Search entries time feed
Challenges in Feed Search <ul><li>2. Even a topical feed is topically diverse </li></ul>time Space Exploration topic NASA China’s plans for the moon shuttle launch My dog Mars rover Boeing
Challenges in Feed Search <ul><li>2. Even a topical feed is topically diverse </li></ul><ul><ul><li>Can we favor entries close to the central topic of the feed? </li></ul></ul>Space Exploration time topic
Challenges in Feed Search <ul><li>3. Feeds are noisy </li></ul><ul><ul><li>Spam blogs, Spam & off-topic comments </li></ul></ul>time
Challenges in Feed Search <ul><li>4. General & Ongoing Information Needs </li></ul>[Mac] [Music] [Food] [Wine] … post regularly about new products, features, or application software of Apple Mac computers. … describing songs, biographies of musicians, musical styles and the influences of music on people. … such as tastings, reviews, food matching or pairing, and oenophile news and events. … describing experiences eating cuisines, culinary delights, recipes, nutrition plans.
Our Approach
Feeds: <ul><li>Topically Diverse </li></ul><ul><li>Noisy </li></ul><ul><li>Collections </li></ul>Information Needs: General & Ongoing Challenges Our Approach Retrieval Models Feedback Models
Retrieval Models <ul><li>Challenge: ranking topically diverse collections </li></ul><ul><li>Representation: feed vs. entry </li></ul><ul><li>Model topical relationship between entries </li></ul>
Large Document (Feed) Model [Q] <?xml… … </…> <?xml… … </…> <?xml… … </…> <?xml… <feed> <entry> <entry> <entry> <entry> <entry> … </…> <?xml… … </…> <?xml… … </…> <?xml… … </…> <?xml… <feed> <entry> <entry> <entry> <entry> <entry> … </…> Feed Document Collection Ranked Feeds Rank by Indri’s standard retrieval model [Metzler and Croft, 2004; 2005]
Large Document (Feed) Model <ul><li>Advantages: </li></ul><ul><li>A straightforward application of existing retrieval techniques </li></ul><ul><li>Potential Pitfalls: </li></ul><ul><li>Large entries dominate a feed’s language model </li></ul><ul><li>Ignores relationship among entries </li></ul>Feed Entry E E Entry Entry E
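As a rough illustration of the large-document model above, a feed's entries can be concatenated into one bag of words and scored by Dirichlet-smoothed query likelihood. This is a toy sketch of the idea, not Indri itself; the function and parameter names are illustrative.

```python
import math
from collections import Counter

def ld_feed_score(query_terms, feed_entries, collection_lm, mu=2500):
    """Query likelihood of a feed treated as one large document:
    all entries are concatenated and scored with Dirichlet smoothing.
    feed_entries: list of token lists; collection_lm: term -> P(w|C)."""
    feed_text = [w for entry in feed_entries for w in entry]
    tf = Counter(feed_text)
    n = len(feed_text)
    score = 0.0
    for q in query_terms:
        # Dirichlet-smoothed term probability under the feed's language model
        p = (tf[q] + mu * collection_lm.get(q, 1e-9)) / (n + mu)
        score += math.log(p)
    return score
```

Note the pitfall from the slide: because all entries share one term-frequency table, a single very long entry can dominate the feed's language model.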
Small Document (Entry) Model Ranked Entries [Q] <entry> <entry> <entry> <entry> <?xml… <entry> Entry Document Collection <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> Ranked Feeds document = entry Apply some rank aggregation function Rank By
Small Document (Entry) Model <ul><li>Query Likelihood </li></ul><ul><li>Entry Centrality </li></ul><ul><li>Feed Prior: favors longer feeds </li></ul>ReDDE Federated Search Algorithm [Si & Callan, 2003]
Entry Centrality <ul><li>Uniform : </li></ul><ul><li>Geometric Mean : </li></ul>time topic
Small Document (Entry) Model <ul><li>Advantages: </li></ul><ul><ul><li>Controls for differing entry length </li></ul></ul><ul><ul><li>Models topical relationship among entries </li></ul></ul><ul><li>Disadvantages: </li></ul><ul><ul><li>Centrality computation is slow(er) </li></ul></ul>Not only improves speed, Also performance Q
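The small-document combination of query likelihood, entry centrality, and a feed prior might be sketched as follows. This is a toy rendering under assumptions: the geometric-mean centrality and log feed-length prior follow the slides, but the exact way they are combined here is illustrative, not the paper's implementation.

```python
import math

def entry_centrality_gm(entry, feed_lm):
    """Geometric-mean centrality: likelihood of the feed's language model
    generating the entry's terms, normalized by entry length |E|."""
    log_p = sum(math.log(feed_lm.get(w, 1e-9)) for w in entry)
    return math.exp(log_p / len(entry))

def sd_feed_score(query_likelihood, entries, feed_lm):
    """ReDDE-style small-document score: each entry's query likelihood
    weighted by its centrality within the feed, summed over entries,
    with a log feed-length prior favoring longer feeds (an assumed
    combination for illustration)."""
    weighted = sum(query_likelihood(e) * entry_centrality_gm(e, feed_lm)
                   for e in entries)
    return math.log(len(entries)) + math.log(weighted)
```

Entries whose terms the feed's language model generates well (central entries) contribute more to the feed's score than off-topic entries.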
Retrieval Model Results
Retrieval Model Results <ul><li>45 Queries from the TREC 2007 Blog Distillation Task </li></ul><ul><li>BLOG06 test collection, XML feeds only </li></ul><ul><li>5-Fold Cross Validation for all retrieval model smoothing parameters </li></ul>
Retrieval Model Results [chart: Mean Average Precision for the Large Document (Feed) Model vs. Small Document (Entry) Models, under Uniform and Log(Feed Length) priors with Uniform and Log Prior centrality variants; best MAP 0.188]
Feedback Models <ul><li>Challenge: Noisy collection with general & ongoing information needs </li></ul><ul><li>Use a cleaner external collection for query expansion (Wikipedia) </li></ul><ul><li>With an expansion technique designed to identify multiple query facets </li></ul>
Query Expansion (PRF) [Q] BLOG06 Collection Related Terms from top K documents [Q + Terms] [Lavrenko & Croft, 2001]
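A minimal sketch of relevance-model style pseudo-relevance feedback in the spirit of Lavrenko & Croft: weight candidate terms by how often they occur in the top-retrieved documents, scaled by each document's query score. The data structures and term-selection cutoff are assumptions for illustration; the real relevance model also normalizes and interpolates with the original query.

```python
from collections import Counter

def rm1_expansion(top_docs, k_terms=10):
    """PRF term selection sketch: each term's weight is the sum over
    top-ranked documents of P(w|D) * score(D).
    top_docs: list of (token_list, query_score) pairs."""
    weights = Counter()
    for tokens, score in top_docs:
        n = len(tokens)
        tf = Counter(tokens)
        for w, c in tf.items():
            weights[w] += (c / n) * score  # P(w|D) weighted by doc score
    return [w for w, _ in weights.most_common(k_terms)]
```

The [Photography] example on the next slide shows why running this directly on a noisy blog collection goes wrong: spam terms dominate the top-retrieved documents.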
Query Expansion Example <ul><li>Ideal </li></ul><ul><li>digital photography </li></ul><ul><li>depth of field </li></ul><ul><li>photographic film </li></ul><ul><li>photojournalism </li></ul><ul><li>cinematography </li></ul>[Photography] PRF photography nude erotic art girl free teen fashion women
Feedback Model Results [chart: Mean Average Precision, no expansion vs. PRF]
Query Expansion (Wikipedia PRF) [Q] BLOG06 Collection [Q + Terms] [Lavrenko & Croft, 2001] Wikipedia [Diaz & Metzler, 2006] Related Terms from top K documents
Query Expansion Example <ul><li>Ideal </li></ul><ul><li>digital photography </li></ul><ul><li>depth of field </li></ul><ul><li>photographic film </li></ul><ul><li>photojournalism </li></ul><ul><li>cinematography </li></ul>[Photography] PRF photography nude erotic art girl free teen fashion women Wikipedia PRF photography director special film art camera music cinematographer photographic
Feedback Model Results [chart: Mean Average Precision, no expansion vs. PRF vs. Wikipedia PRF]
Query Expansion (Wikipedia Link) [Q] BLOG06 Collection [Q + Terms] Wikipedia Related Terms from  link structure
Wikipedia Link-Based Query Expansion
Wikipedia Link-Based Expansion Wikipedia … Q
Wikipedia Link-Based Expansion … Relevance Set,  Top R = 100 Working Set,  Top W = 1000 Q Wikipedia
Wikipedia Link-Based Expansion … Wikipedia Q Relevance Set,  Top R = 100 Working Set,  Top W = 1000
Wikipedia Link-Based Expansion Relevance Set, Top R = 100 Working Set, Top W = 1000 … Wikipedia Extract anchor text from the Working Set that links to the Relevance Set. Q
Wikipedia Link-Based Expansion Relevance Set, Top R = 100 Working Set, Top W = 1000 … Wikipedia Extract anchor text from the Working Set that links to the Relevance Set. Q Combines relevance and popularity Relevance: An anchor phrase that links to a high-ranked article gets a high score Popularity: An anchor phrase that links many times to mid-ranked articles also gets a high score
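The relevance-plus-popularity idea above can be sketched like this. The per-link score of (R − rank of target) is an assumption chosen to match the two properties on the slide, not the paper's exact formula: one link to a top-ranked article earns a lot (relevance), and many links to mid-ranked articles accumulate (popularity).

```python
def anchor_expansion_scores(ranked_articles, links, r=100, w=1000):
    """Score anchor phrases on links going from the working set
    (top w Wikipedia articles for the query) into the relevance set
    (top r). ranked_articles: article ids in rank order;
    links: (source_id, anchor_phrase, target_id) triples."""
    rank = {a: i for i, a in enumerate(ranked_articles[:w])}
    rel = set(ranked_articles[:r])
    scores = {}
    for src, phrase, tgt in links:
        if src in rank and tgt in rel:
            # higher-ranked targets earn more; repeated links accumulate
            scores[phrase] = scores.get(phrase, 0) + (r - rank[tgt])
    return scores
```

Because many different anchor phrases can point into the relevance set, this naturally extracts a diversity of expansion phrases rather than terms from a single document neighborhood.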
Query Expansion Example <ul><li>Wikipedia Link-Based </li></ul><ul><li>photography </li></ul><ul><li>photographer </li></ul><ul><li>digital photography </li></ul><ul><li>photographic </li></ul><ul><li>depth of field </li></ul><ul><li>feature photography </li></ul><ul><li>film </li></ul><ul><li>photographic film </li></ul><ul><li>photojournalism </li></ul>[Photography] PRF photography nude erotic art girl free teen fashion women Ideal digital photography depth of field photographic film photojournalism cinematography
Feedback Model Results [chart: Mean Average Precision, no expansion vs. PRF vs. Wikipedia PRF vs. Wikipedia Link]
Conclusion <ul><li>Feed Search Challenges: </li></ul><ul><ul><li>Feeds are topically diverse, noisy collections </li></ul></ul><ul><ul><li>Ranked against ongoing & general information needs </li></ul></ul><ul><li>Novel Retrieval Models: </li></ul><ul><ul><li>Ranking collections, sensitive to topical relationship among entries </li></ul></ul><ul><li>Novel Feedback Models: </li></ul><ul><ul><li>Discover multiple query facets & robust to collection noise </li></ul></ul>
Thank You! Student Travel Grant funding from: ACM SIGIR, Amit Singhal, Microsoft Research
Entry Centrality GM Derivation: the geometric-mean centrality normalizes the entry generation likelihood by entry length, $P(E \mid F) \propto \big( \prod_{w \in E} P(w \mid \theta_F) \big)^{1/|E|}$, where $|E|$ is the number of terms in the entry.
Query Expansion Examples <ul><li>Wikipedia Expansion </li></ul><ul><li>Music </li></ul><ul><li>Folk music </li></ul><ul><li>Electronic music </li></ul><ul><li>Folk </li></ul><ul><li>Music video </li></ul><ul><li>World music </li></ul><ul><li>Ambient </li></ul><ul><li>Electronic </li></ul><ul><li>Country music </li></ul>[Music] PRF Music Country Download Free MP3 Mp3andmore Lyric Listen Song
Query Expansion Examples <ul><li>Wikipedia Expansion </li></ul><ul><li>scotland </li></ul><ul><li>scottish parliament </li></ul><ul><li>scottish </li></ul><ul><li>scottish national party </li></ul><ul><li>wars of scottish independence </li></ul><ul><li>scottish independence </li></ul><ul><li>william wallace </li></ul><ul><li>glasgow </li></ul><ul><li>scottish socialist party </li></ul>[Scottish Independence] PRF scotland independence party convention politics snp national people scot
Query Expansion Examples <ul><li>Wikipedia Expansion </li></ul><ul><li>machine learning </li></ul><ul><li>learning </li></ul><ul><li>artificial intelligence </li></ul><ul><li>turing machine </li></ul><ul><li>machine gun </li></ul><ul><li>neural network </li></ul><ul><li>support vector machine </li></ul><ul><li>supervised learning </li></ul><ul><li>artificial neural network </li></ul>[Machine Learning] PRF learn machine credit card karaoke journal sex model sew
Query Generality Characteristics <ul><li>Query Length: </li></ul><ul><ul><li>BLOG: 1.9 words </li></ul></ul><ul><ul><li>TB04: 3.2 words </li></ul></ul><ul><ul><li>TB05: 3.0 words </li></ul></ul><ul><li>ODP Depth </li></ul><ul><ul><li>BLOG: 4.7 levels </li></ul></ul><ul><ul><li>TB04: 5.2 levels </li></ul></ul><ul><ul><li>TB05: 5.3 levels </li></ul></ul>
Relevance Set Cohesiveness … Relevance Set, Top R = 100 Wikipedia Cohesiveness = |L_in| / |L_in ∪ L_out|
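The cohesiveness ratio on this slide is straightforward to compute given the link structure. A sketch, assuming links are represented as (source, target) article-id pairs: L_in is the set of links from the relevance set that stay inside it, and L_out the links that leave it.

```python
def cohesiveness(relevance_set, links):
    """Cohesiveness = |L_in| / |L_in ∪ L_out|: the fraction of links
    leaving relevance-set articles whose targets are also in the
    relevance set. links: iterable of (source_id, target_id) pairs."""
    rel = set(relevance_set)
    outgoing = [(s, t) for s, t in links if s in rel]  # L_in ∪ L_out
    l_in = [(s, t) for s, t in outgoing if t in rel]
    return len(l_in) / len(outgoing) if outgoing else 0.0
```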
Relevance Set Cohesiveness [chart]
Is it the Queries? <ul><li>Feed Search Queries </li></ul><ul><li>≠ </li></ul><ul><li>TB Adhoc Queries </li></ul>But none of these measures predicts whether Wikipedia expansion helps…
  • From Wikipedia: regular entries, displayed in reverse chronological order
  • Important to note: FEEDS ARE COLLECTIONS
  • Just to make sure we’re all on the same page, I’d like to ground some terminology. A blog has a one-to-one correspondence with a feed. A blog is a website to which an author or set of authors contribute content regularly in the form of blog posts. A feed is the XML equivalent of a blog. So, a blog corresponds to a feed and a blog post corresponds to a feed entry.
  • NYTimes has > 60 blogs now
  • New at TREC 2007
  • Why is this task different and interesting, compared to ad-hoc retrieval?
  • A feed is a collection of documents, the feed’s entries. So, we are actually ranking document collections rather than individual documents.
  • And in doing so, we must think about how relevance at the entry level corresponds to relevance at the feed level.
  • The second property is that a topically coherent feed is inherently devoted to a multifaceted topic. This is a mile-high view of a blog or feed. One or more authors that share a common interest, say “Space Exploration” contribute content on a regular basis. Each post is going to focus on a different aspect of “Space exploration”. It wouldn’t be an interesting blog if each post covered the same topic. So, they might talk about a “shuttle launch”, “the mars rover”, China’s plans for the moon, etc. A search user has to represent these different subtopics in a one to four word query. This is an impoverished representation of a multifaceted topic.
  • Finally, blogs are noisy. The blogosphere has no shortage of spam blogs. Comments made in response to blog posts vary in quality and topicality, and even the posts vary in subject matter. It’s naive to assume that for a feed to be relevant to a query, all its posts must be relevant.
  • Queries represent an ongoing interest in a complex topic: impoverished representations of an ongoing, multifaceted & often general information need
  • Retrieval models: models for ranking collections, account for the topical diversity within the collection Feedback models: overcome noise in the blog collection, aimed at addressing multifaceted information needs
  • Adapt & extend existing federated search models to the task of feed search. Take into account both multiple levels of representation -- the individual entry vs. the feed -- and also the issue of “centrality”
  • Feed is a concatenation of entries
  • As before, we may want to favor some entries over others. We will attempt to address all of these in the following retrieval models
  • Again, Indri’s standard retrieval model. Rank aggregation: rank by max entry, count(entries) in top K, etc.
  • Extends ReDDE, a well-known, state-of-the-art federated search resource ranking algorithm. In this model, we only use the feed prior to correct for overly-optimistic scoring of short feeds, with few entries
  • Recall our diagram of a feed’s entries and some amount of topical drift. Centrality serves two purposes: ONE, to balance scoring across feeds with differing numbers of entries; TWO, to favor some entries over others based on their similarity to the feed’s language model
  • As we increase the complexity of the SD model, it outperforms LD model
  • I’ll start by presenting the expansion methods we compare
  • Jackpot! But…. It doesn’t generalize
  • Jackpot! But…. It doesn’t generalize
  • Aims at extracting a DIVERSITY OF EXPANSION PHRASES
  • Jackpot! But…. It doesn’t generalize