Trec2009blog overview v9


  1. 1. Overview of the TREC 2009 Blog Track<br />Iadh Ounis, Craig Macdonald, Ian Soboroff<br />trecblog-organisers@dcs.gla.ac.uk<br />1<br />
  2. 2. Outline<br />Blog Track: Background<br />TREC Blog Track 2009 Overview<br /><ul><li>Blogs08 collection
  3. 3. Faceted blog distillation task
  4. 4. Top stories identification task</li></ul>Conclusions<br />2<br />
  5. 5. Blog Track @ TREC<br />Introduced in TREC 2006<br /><ul><li>Explores the information seeking behaviour in the blogosphere</li></ul>The Blog track adopted an incremental approach<br /><ul><li>From core and simple retrieval tasks to more complex search scenarios</li></ul>Thus far, two main search tasks have been addressed:<br /><ul><li>Opinion-finding task [2006-2008]</li></ul>“Find me posts about what people think of X”<br /><ul><li>Blog distillation task [2007-2008]</li></ul>“Find me blogs with a principal, recurring interest in X”<br />3<br />
  6. 6. Blog Track 2009<br />In 2009, the Blog track has been markedly revamped<br /><ul><li>Addresses more refined and complex search scenarios using a larger sample of the blogosphere</li></ul>An up-to-date sample of the blogosphere: Blogs08<br /><ul><li>One order of magnitude larger than the older Blogs06 (28M posts, 1.3M feeds)
  7. 7. A much longer timespan: 13 months from Jan 08 to Feb 09</li></ul>Two new search tasks:<br /><ul><li>Faceted blog distillation</li></ul>Addresses the quality aspect of the retrieved blogs<br /><ul><li>Top stories identification task</li></ul>Addresses the news-related dimension of the blogosphere<br />4<br />
  8. 8. The New Blogs08 Collection<br />Crawled from the blogosphere over a 13-month period from 14th Jan 08 to 10th Feb 09<br /><ul><li>Includes spam, non-English documents, and non-blogs</li></ul>Facilitates addressing the temporal/chronological aspect of the blogosphere<br /><ul><li>e.g. news and filtering tasks</li></ul>Follows a similar structure to the older Blogs06 collection:<br /><ul><li>808GB feeds (>1.3M blogs)
  9. 9. 1445GB permalinks (28M documents)</li></ul>A single post and its comments <br /><ul><li>56GB homepages</li></ul>Created by the Univ. of Glasgow and distributed since April 2009<br />5<br />
  10. 10. Outline<br />Blog Track: Background<br />TREC Blog Track 2009 Overview<br /><ul><li>Blogs08 collection
  11. 11. Faceted blog distillation task
  12. 12. Top stories identification task</li></ul>Conclusions<br />6<br />
  13. 13. Blog Distillation Task<br />Blog search users often wish to identify blogs about a given topic<br /><ul><li>Blogs that they can subscribe to and read on a regular basis</li></ul>Filtering: subscribe to a repeated search in their RSS reader<br />Distillation: add blog feeds with a recurring central interest to their RSS reader<br />Blog distillation task [2007-2008]<br /><ul><li>“Find me a blog with a principal, recurring interest in X”</li></ul>The TREC 2007 and 2008 incarnations focused on topical relevance<br /><ul><li>The task did not address the “quality” aspect of the retrieved blogs</li></ul>7<br />
  14. 14. Faceted Blog Search<br />New task mimics an exploratory search task<br /><ul><li>“Find me a quality blog to follow/read about X”
  15. 15. Quality aspect is addressed through the use of facets in the search interface (Hearst et al., SSM 2008)</li></ul>Faceted search allows the users to explore the attributes of those blogs they might wish to follow and read:<br /><ul><li>In-depth/shallow analysis
  16. 16. Humorous/serious style
  17. 17. Expert/novice viewpoint
  18. 18. etc.</li></ul>8<br />
  19. 19. Task Definition<br />For operationalising at TREC <br /><ul><li>Each topic has a facet of interest attached to it
  20. 20. Blogs do not have facet attributes</li></ul>For TREC 2009, we used an initial set of 3 facets of varying difficulty:<br /><ul><li>Opinionated: ‘opinionated’ vs ‘factual’ blogs
  21. 21. Personal: ‘personal’ vs. ‘official’ blogs
  22. 22. Indepth: ‘in-depth’ vs. ‘shallow’ blogs</li></ul>Each facet is binary<br />Using the Opinionated facet allowed us to leverage past track work on opinion-finding<br />9<br />
  23. 23. Topics<br />One appropriate facet added to each topic<br /><query> hugo chavez </query> <br /><desc> I am looking for blogs that talk about Venezuelan<br />president Hugo Chavez and his politics. </desc><br /><facet> indepth </facet> <br /><narr>I want to follow blogs that talk about Hugo Chavez,<br />the president of Venezuela. Blogs that follow his role in<br />Venezuelan politics are relevant, as well as those that<br />discuss non-political stories and activities. I am more<br />interested in blogs about Chavez than blogs about<br />Venezuelan politics generally.</narr><br />50 new topics were created by TREC assessors:<br /><ul><li>21 Opinionated
  24. 24. 10 Personal
  25. 25. 19 Indepth </li></ul>10<br />
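The topic format above can be read with a few lines of code. This is a minimal sketch, assuming the fields appear as simple `<tag> ... </tag>` pairs as on the slide; `parse_topic` and `FACET_INCLINATIONS` are hypothetical names, and the inclination pairs are copied from the Task Definition slide.

```python
import re

# Shortened copy of the example topic from the slide (the <narr> field is omitted here).
TOPIC = """<query> hugo chavez </query>
<desc> I am looking for blogs that talk about Venezuelan
president Hugo Chavez and his politics. </desc>
<facet> indepth </facet>"""

# Facet name -> (1st inclination, 2nd inclination), as listed on the Task Definition slide.
FACET_INCLINATIONS = {
    "opinionated": ("opinionated", "factual"),
    "personal": ("personal", "official"),
    "indepth": ("in-depth", "shallow"),
}


def parse_topic(text):
    """Extract the topic fields that are present, collapsing internal whitespace."""
    fields = {}
    for tag in ("query", "desc", "facet", "narr"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if match:
            fields[tag] = " ".join(match.group(1).split())
    return fields


topic = parse_topic(TOPIC)
first, second = FACET_INCLINATIONS[topic["facet"]]
print(topic["query"], "->", first, "vs", second)
```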
  26. 26. Runs<br />Retrieval unit:<br /><ul><li>Blogs from the Feeds component of Blogs08</li></ul>For each topic, a run consists of three rankings of 100 blogs:<br /><ul><li>One with the 1st inclination of facet enabled
  27. 27. One with the 2nd inclination of facet enabled
  28. 28. One with no facet inclination enabled (akin to topic-relevance baseline)</li></ul>Example: For a topic with Personal facet<br /><ul><li>1st ranking should have 100 ‘personal’ blogs
  29. 29. 2nd ranking should have 100 ‘official’ blogs
  30. 30. 3rd ranking should have 100 relevant blogs </li></ul>11<br />
  31. 31. Assessment Procedure<br />How does one assess a blog?<br /><ul><li>By reading some of its posts</li></ul>Assessment scale:<br /><ul><li>[0]: Not relevant
  32. 32. [1]: Relevant but not clearly inclined to a facet inclination
  33. 33. [2]: Relevant and clearly inclined towards the 1st facet inclination (opinionated, personal, indepth)
  34. 34. [3]: Relevant and clearly inclined towards the 2nd facet inclination (factual, official, shallow)</li></ul>Topic-relevance baseline runs<br /><ul><li>Measure using NR={0}, R={1,2,3}</li></ul>Faceted blog search runs<br /><ul><li>Measure using NR={0,1}, R={2|3}
  35. 35. Measure MAP for all facet inclination rankings (2 inclinations for each topic)</li></ul>12<br />
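The grading scheme above can be turned into binary relevance and average precision as follows. This is a sketch under one reading of the slide: for the faceted rankings, grade 2 counts as relevant for the 1st inclination and grade 3 for the 2nd. The function names and the toy judgements are hypothetical.

```python
def is_relevant(grade, mode):
    """Map a 0-3 assessment grade to binary relevance.

    mode is 'baseline' (R={1,2,3}), 'first' (R={2}) or 'second' (R={3}).
    """
    if mode == "baseline":
        return grade in (1, 2, 3)
    if mode == "first":
        return grade == 2
    if mode == "second":
        return grade == 3
    raise ValueError(f"unknown mode: {mode}")


def average_precision(ranking, qrels, mode):
    """Average precision of one ranking; unjudged blogs count as grade 0."""
    total_relevant = sum(1 for g in qrels.values() if is_relevant(g, mode))
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, blog in enumerate(ranking, start=1):
        if is_relevant(qrels.get(blog, 0), mode):
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


# Toy judgements for one topic (illustrative only, not track data).
qrels = {"b1": 2, "b2": 1, "b3": 3, "b4": 0}
ranking = ["b1", "b2", "b3", "b4"]
ap_baseline = average_precision(ranking, qrels, "baseline")
ap_first = average_precision(ranking, qrels, "first")
ap_second = average_precision(ranking, qrels, "second")
```

MAP is then the mean of these per-topic values over all topics (and, for the faceted measure, over both inclinations of each topic).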
  36. 36. Runs and Pooling<br />Each group permitted up to 4 runs<br /><ul><li>9 groups took part in the faceted blog distillation task
  37. 37. 29 submitted runs, including 24 title-only runs
  38. 38. All runs pooled (and all 3 rankings in each run) to depth 30</li></ul>13<br />
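Depth-k pooling for a single topic can be sketched as below. The run layout (a dict with one list per ranking) and the name `build_pool` are assumptions made for illustration; the track used depth 30 over all three rankings of every run.

```python
def build_pool(runs, depth=30):
    """Union of the top-`depth` blogs from every ranking of every run (one topic)."""
    pool = set()
    for run in runs:
        for ranking in run.values():
            pool.update(ranking[:depth])
    return pool


# Two toy runs, each with the three rankings described on the Runs slide.
runs = [
    {"first": ["a", "b", "c"], "second": ["d", "e", "f"], "none": ["a", "g", "h"]},
    {"first": ["b", "x", "y"], "second": ["z"], "none": []},
]
pool = build_pool(runs, depth=2)
```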
  39. 39. Overview of Results<br />Baseline retrieval performances are lower than expected<br /><ul><li>96% of the pooled blogs were judged irrelevant</li></ul>Facet performances are low<br /><ul><li>Performance across facets differs
  40. 40. E.g. Indepth vs Opinionated</li></ul>Task complexity, early-stage techniques, or difficult topics?<br />14<br />
  41. 41. Baseline run results: 39 topics; Top 5 Groups; Title-only (ranked by MAP)<br />Topic relevance model and expansion using terms from <desc> and <narr> topic fields.<br />Blog posts ranked using BM25, then scores aggregated to blogs<br />Fuzzy aggregation methods to combine regularized blog post scores into blog scores.<br /><ul><li>Most of the groups indexed only the Permalinks component of Blogs08
  42. 42. Almost all deployed retrieval techniques scored a blog based on the scores of its corresponding relevant posts</li></ul>15<br />
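The common pattern of scoring a blog from the scores of its retrieved posts can be sketched as follows. Summation is used here purely as an assumed aggregation function; the participating groups used their own methods (for example, fuzzy aggregation), and the post scores below are made-up values standing in for BM25 output.

```python
from collections import defaultdict


def aggregate_to_blogs(post_scores, post_to_blog, top_k=100):
    """Sum the scores of each blog's retrieved posts and rank blogs by the total."""
    blog_scores = defaultdict(float)
    for post, score in post_scores.items():
        blog_scores[post_to_blog[post]] += score
    # Sort by descending score, breaking ties by blog id for determinism.
    ranked = sorted(blog_scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


post_scores = {"p1": 2.0, "p2": 1.5, "p3": 3.0}       # hypothetical post scores
post_to_blog = {"p1": "blogA", "p2": "blogA", "p3": "blogB"}
blog_ranking = aggregate_to_blogs(post_scores, post_to_blog)
```

With summation, a blog with many moderately scored posts can outrank a blog with one strong post, which matches the "recurring interest" notion of the distillation task.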
  43. 43. Faceted blog search runs results: 39 topics; Top 5 Groups; Ranked by ALL (MAP) <br />Indepth facet: posts scored using Cross Entropy. For other facets: Mutual Information is used to weight terms in posts, using various lexicons.<br />Did not attempt faceted search. Post scores are altered using temporal information before being aggregated into blog scores.<br />Learned a classifier for the Indepth facet. For other facets, they used heuristics to score blog posts before aggregation.<br /><ul><li>Faceted search proved to be particularly challenging
  44. 44. For all groups, and in almost all cases: applying faceted search decreases performance relative to the faceted performance of the baseline ranking</li></ul>16<br />
  45. 45. Outline<br />Blog Track: Background<br />TREC Blog Track 2009 Overview<br /><ul><li>Blogs08 collection
  46. 46. Faceted blog distillation task
  47. 47. Top stories identification task</li></ul>Conclusions<br />17<br />
  48. 48. Top Stories Identification Task<br />Many blog search engine queries are news-related<br />The new task’s main research question: How well does the blogosphere respond to real-world events?<br />Facilitated by the Blogs08 test collection – 54 weeks in length, including<br /><ul><li>US election cycle
  49. 49. China earthquake
  50. 50. etc.</li></ul>18<br />
  51. 51. Task Definition<br />For a given unit of time (“query date”), identify the top news stories on that date<br /><ul><li>And also identify blog posts related to each headline, covering its various/diverse aspects</li></ul>News stories are represented by headlines broadcast by NY Times<br /><ul><li>For entire timespan of Blogs08
  52. 52. Distributed with kind permission of NYT</li></ul>[Figure: example ranking of headlines, with “Federal takeover of Fannie Mae and Freddie Mac” at rank 1 and ✔/✗ marks]<br />19<br />
  53. 53. Task Details<br />Example query:<br /><top><br /><num> TS09-33 </num><br /><date> 2008-08-25 </date><br /></top><br />Provide a ranking of news headlines in the range <date> ± 1 day<br /><ul><li>e.g. If a story happens early on day d in Europe, it will be reported by an American broadcaster (NYT) on day d-1</li></ul>For each ranked news headline, suggest relevant, diverse blog posts<br /><ul><li>Relevant blog posts may occur anytime after the date of the event</li></ul>The task is of the Retrospective Event Detection (RED) type<br />20<br />
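Selecting the candidate headlines for a query date can be sketched like this. The one-day window on either side follows the slide; the `(date, headline_id)` tuple layout and the name `candidate_headlines` are assumptions for illustration.

```python
from datetime import date, timedelta


def candidate_headlines(headlines, query_date, slack_days=1):
    """Keep headlines dated within query_date +/- slack_days (inclusive)."""
    earliest = query_date - timedelta(days=slack_days)
    latest = query_date + timedelta(days=slack_days)
    return [h for h in headlines if earliest <= h[0] <= latest]


# Toy headline records: (publication date, hypothetical headline id).
headlines = [
    (date(2008, 8, 23), "h1"),
    (date(2008, 8, 24), "h2"),
    (date(2008, 8, 25), "h3"),
    (date(2008, 8, 26), "h4"),
    (date(2008, 8, 27), "h5"),
]
candidates = candidate_headlines(headlines, date(2008, 8, 25))
```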
  54. 54. Topic Development<br />The organisers selected 55 dates as topics <br /><ul><li>Covering various global, political, economic, cultural, sports and technology events</li></ul>These included dates related to events such as:<br /><ul><li>Chinese Earthquake
  55. 55. Obama’s inauguration
  56. 56. Banking crisis
  57. 57. Beijing Olympics
  58. 58. Oscars
  59. 59. Microsoft/Yahoo (aborted) deal
  60. 60. etc.</li></ul>21<br />
  61. 61. Runs and Assessments<br />A run consists of a ranking of 100 headlines, each supported by up to 10 diverse blog posts<br /><ul><li>Runs use the SUPPORTing run format developed for the Enterprise track expert search task
  62. 62. 25 runs by 7 groups: pooled top 20 headlines from each run</li></ul>Two phases of participant community judging:<br /><ul><li>Top news story judging: Identify important news stories for each day
  63. 63. Blog post judging: Identify relevant and diverse blog posts for relevant headlines</li></ul>22<br />
  64. 64. Phase 1: Top News Story Judging<br />We asked assessors to take the role of a newspaper editor<br /><ul><li>What stories would they put on the front page of a newspaper or news website?
  65. 65. Assess whether the headline actually occurred on the query day, and judge each headline story as “Important” or “Not Important”
  66. 66. Could consider their own recollection of events, or refer to external Web resources</li></ul>Editorial factors to consider: Timing, Significance, Prominence, Human Interest, Proximity<br />Interface provided pool of headlines to judge, headline and snippet of story, and link to actual NYT news article<br />23<br />
  67. 67. Phase 2: Blog Post Judging<br />Once headlines were judged, a sample of the important ones was selected for blog post judging<br /><ul><li>Two-phase judging avoids judging blog posts at the same time as headlines
  68. 68. Assessors only have to read blog posts for judged important headlines</li></ul>Blog posts were judged “Relevant” or “Not Relevant” to the headline<br />When judging, assessors defined “aspects” to group relevant blog posts<br /><ul><li>e.g. for a headline on the Oscars, the assessor defined aspects such as “liveblogs”, “factual”, “opinionated”, “accuracy of predictions”
  69. 69. Aspects are used during diversity evaluation</li></ul>24<br />
  70. 70. Relevance Assessments<br />Top news story identification was hard:<br />Blog post judging, less so:<br />Result reporting in two phases: Top news story identification, then diverse blog post retrieval<br />25<br />
  71. 71. Identifying Top News Stories<br /><ul><li>All 25 submitted runs were automatic
  72. 72. Task was fairly difficult: retrieval performances were rather low </li></ul>26<br />
  73. 73. Identifying Top News Stories: Runs <br />Voting Model: Number of blog posts mentioning a headline. <br />Probabilistic: Combination of query generating headline probability and headline prior calculated from time- or term-based evidence<br />Two probabilistic approaches: news to blogs or blogs to news.<br /><ul><li>All groups indexed only the Permalinks component of Blogs08 (exceptions are UAms & USI)</li></ul>27<br />
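A voting-style ranking, where each blog post that mentions a headline counts as one vote, can be sketched as below. How a "mention" is detected is not stated on the slide; the term-overlap threshold used here (`min_overlap`) is an assumption, as are the function names and the toy posts.

```python
import re


def tokens(text):
    """Lower-cased set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def votes(headline, posts, min_overlap=0.5):
    """Count posts that share at least `min_overlap` of the headline's terms."""
    headline_terms = tokens(headline)
    if not headline_terms:
        return 0
    return sum(
        1
        for post in posts
        if len(headline_terms & tokens(post)) / len(headline_terms) >= min_overlap
    )


def rank_headlines(headlines, posts):
    """Order headlines by descending vote count (ties keep input order)."""
    return sorted(headlines, key=lambda h: -votes(h, posts))


posts = [
    "Thoughts on the Fannie Mae and Freddie Mac takeover",
    "Freddie Mac and Fannie Mae: what the federal takeover means",
    "My holiday photos",
]
headlines = [
    "Local bakery wins award",
    "Federal takeover of Fannie Mae and Freddie Mac",
]
ranked = rank_headlines(headlines, posts)
```

In practice the posts considered would be restricted to those published around the query date.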
  74. 74. Identifying Blog Posts<br /><ul><li>Runs with high top story recall have more chance to identify relevant blog posts
  75. 75. Moreover, systems found identifying blog posts for a headline easier
  76. 76. Evaluation measures are diversity-based, from the Web track:</li></ul>α-NDCG@10 (α=0.5)<br />IA-P@10<br />See Charlie’s talk for Web track<br />28<br />
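For reference, α-NDCG@10 with α=0.5 can be computed along the following lines. This is a sketch, not the official evaluation tool: aspect judgements are treated as binary, the ideal ranking is approximated greedily (the exact ideal is expensive to compute), and the aspect labels are borrowed from the Oscars example on the judging slide.

```python
import math


def novelty_gain(doc, aspects, seen, alpha):
    """Gain of a document: each aspect is discounted by how often it was already seen."""
    return sum((1 - alpha) ** seen.get(a, 0) for a in aspects.get(doc, ()))


def alpha_dcg(ranking, aspects, alpha=0.5, k=10):
    seen = {}
    total = 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        total += novelty_gain(doc, aspects, seen, alpha) / math.log2(rank + 1)
        for a in aspects.get(doc, ()):
            seen[a] = seen.get(a, 0) + 1
    return total


def greedy_ideal(aspects, alpha=0.5, k=10):
    """Greedy approximation of the ideal ranking over the judged documents."""
    remaining = set(aspects)
    seen = {}
    ideal = []
    while remaining and len(ideal) < k:
        best = max(sorted(remaining), key=lambda d: novelty_gain(d, aspects, seen, alpha))
        ideal.append(best)
        remaining.remove(best)
        for a in aspects[best]:
            seen[a] = seen.get(a, 0) + 1
    return ideal


def alpha_ndcg(ranking, aspects, alpha=0.5, k=10):
    ideal_score = alpha_dcg(greedy_ideal(aspects, alpha, k), aspects, alpha, k)
    return alpha_dcg(ranking, aspects, alpha, k) / ideal_score if ideal_score else 0.0


# Toy judgements: relevant post -> set of aspects it covers.
aspects = {"p1": {"liveblogs"}, "p2": {"liveblogs"}, "p3": {"opinionated"}}
diverse = alpha_ndcg(["p1", "p3", "p2"], aspects)
redundant = alpha_ndcg(["p1", "p2", "p3"], aspects)
```

The ranking that covers a second aspect early scores higher than the one that repeats the same aspect first, which is the behaviour the diversity measure is meant to reward.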
  77. 77. Identifying Blog Posts: Runs<br />Divergence From Randomness DPH ranking and MMR<br />Latent Dirichlet Relevance Model, but applied no diversification<br /><ul><li>Means calculated over all 258 judged headlines
  78. 78. However, ranking of runs not identical to top story identification evaluation
  79. 79. Some swaps between groups, and between runs for a given group</li></ul>29<br />
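Maximal Marginal Relevance (MMR), mentioned above as one diversification approach, can be sketched as follows. The relevance scores stand in for the output of a ranking model such as DPH; the Jaccard similarity, the trade-off `lam`, and the toy posts are assumptions for illustration and are not taken from any participant's system.

```python
def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr(relevance, doc_terms, lam=0.7, k=10):
    """Greedily pick documents that are relevant but dissimilar to those already chosen."""
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:

        def score(doc):
            redundancy = max(
                (jaccard(doc_terms[doc], doc_terms[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[doc] - (1 - lam) * redundancy

        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


relevance = {"p1": 1.0, "p2": 0.9, "p3": 0.8}  # hypothetical relevance scores
doc_terms = {
    "p1": {"oscars", "winner", "list"},
    "p2": {"oscars", "winner", "list"},
    "p3": {"oscars", "red", "carpet"},
}
reranked = mmr(relevance, doc_terms, lam=0.5)
```

Here the near-duplicate post `p2` is pushed below `p3` even though its relevance score is higher.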
  80. 80. Conclusions<br />In 2009, the Blog track has been markedly revamped<br /><ul><li>Two new pilot search tasks that go beyond topical relevance and simple adhoc retrieval</li></ul>The results on both tasks confirm the complexities of faceted blog search and top stories identification<br /><ul><li>There is a large scope for further research and improvements</li></ul>Blog track will run in 2010<br /><ul><li> Same tasks
  81. 81. … but with a few proposed refinements intended to facilitate research into considering the blogosphere as a time stream</li></ul>More at the Blog track workshop on Friday <br />30<br />
