Building a Microblog Corpus for Search Result Diversification

Talk given by Ke Tao at AIRS 2013, titled "Building a Microblog Corpus for Search Result Diversification".

Speaker notes:
  • Three topics were dropped: the G20 finance ministers meeting, the UEFA Champions League, and North Korea nullifying the armistice.
  • Basic statistics: annotators spent 6.6 seconds per tweet on average (slower at the start, faster later); most tweets are assigned exactly one subtopic.
  • Unlike the TREC 2011/12 subtopics, no timestamps were considered when building this corpus.
  • Diversity difficulty for comparison: TREC 2010 averaged 0.727 (range 0.449–0.994); TREC 2011 averaged 0.809 (range 0.643–0.977).

    1. Building a Microblog Corpus for Search Result Diversification. AIRS 2013, Singapore, December 10. Ke Tao, Claudia Hauff, Geert-Jan Houben. Web Information Systems, Delft University of Technology (TU Delft), the Netherlands.
    2. Research Challenges. 1. Diversification is needed: users tend to issue short, underspecified queries when searching microblogs. 2. Lack of a corpus for diversification studies: how can one build a microblog corpus for evaluating search result diversification? [Diagram: query → search result (tweets) → diversification strategy → diversified result, evaluated against diversity judgments]
    3. Methodology Overview. 1. Data Source: how can we find a good, representative Twitter dataset? 2. Topic Selection: how do we select the search topics? 3. Tweets Pooling: which tweets are we going to annotate? 4. Diversity Annotation: how do we annotate the tweets with diversity characteristics?
    4. Methodology – Data Source. From where? The Twitter sampling API, which yields roughly 1% of the full Twitter stream. Duration: February 1st to March 31st, 2013, coinciding with the TREC 2013 Microblog Track. Tools: the Twitter Public Stream Sampling Tools by @lintool, run on Amazon EC2 in the EU. TREC 2013 Microblog guidelines: https://github.com/lintool/twitter-tools/wiki/TREC-2013-Track-Guidelines. Twitter Public Stream Sampling Tool: https://github.com/lintool/twitter-tools/wiki/Sampling-the-public-Twitter-stream
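For readers who want to work with such a sample, the sketch below shows one way to consume the collected stream. It assumes line-delimited tweet JSON stored in gzipped files, which is the typical output of such samplers; the directory name and helper function are made up for illustration, not part of the talk.

```python
import gzip
import json
from pathlib import Path

def iter_tweets(data_dir: str):
    """Yield (id, text, created_at) from gzipped, line-delimited tweet JSON.

    One JSON object per line is assumed; the .gz layout is an assumption.
    """
    for path in sorted(Path(data_dir).glob("*.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                try:
                    tweet = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip truncated records
                if "delete" in tweet:  # the sample stream interleaves deletion notices
                    continue
                yield tweet["id_str"], tweet["text"], tweet["created_at"]

if __name__ == "__main__":
    total = sum(1 for _ in iter_tweets("sample-2013-02"))  # hypothetical directory
    print(f"tweets in sample: {total}")
```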
    5. Methodology – Topic Selection. How do we select the search topics? Candidates come from the Wikipedia Current Events Portal, which ensures enough importance and more than local interest. Temporal characteristics: topics are evenly distributed over the two-month period, which enables further analysis of temporal characteristics. Selected: 50 topics on trending news events. Wikipedia Current Events Portal: http://en.wikipedia.org/wiki/Portal:Current_events
    6. Methodology – Tweets Pooling – 1/2. Maximize coverage, minimize effort. Challenge in adopting existing pooling solutions: we lack access to multiple retrieval systems. Topic expansion: a manually created query for each topic, aiming at maximum coverage of tweets relevant to the topic. Duplicate filtering: duplicate tweets (cosine similarity > 0.9) are filtered out, as sketched below.
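A minimal sketch of the cosine-similarity duplicate filter mentioned above. The talk only specifies the 0.9 threshold, so the bag-of-words tokenization and unweighted term counts here are simplifying assumptions; the quadratic pairwise comparison is fine at the scale of 500 tweets per topic.

```python
import math
import re
from collections import Counter

def term_vector(text: str) -> Counter:
    """Lowercased bag-of-words vector; the tokenizer is a simplifying assumption."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def filter_duplicates(tweets: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a tweet only if it is not >threshold-similar to any earlier kept tweet."""
    kept, vectors = [], []
    for text in tweets:
        vec = term_vector(text)
        if all(cosine(vec, v) <= threshold for v in vectors):
            kept.append(text)
            vectors.append(vec)
    return kept
```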
    7. Methodology – Tweets Pooling – 2/2. Topic expansion example: "Hillary Clinton steps down as United States Secretary of State" can be expressed in many different ways, and the expanded query must cover this variety.
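As an illustration of what covering the "variety of expressions" might look like for this topic, here is a hypothetical matcher. The actual expansion terms the authors used are not given in the talk; the term lists below are invented for the example.

```python
# Hypothetical surface forms; the authors' real expansion terms are not in the talk.
PERSON = ["hillary clinton", "clinton", "hillary"]
EVENT = ["steps down", "stepping down", "resigns", "resignation", "leaves office"]

def matches_topic(text: str) -> bool:
    """A tweet is a pooling candidate if it mentions the person and the event."""
    t = text.lower()
    return any(p in t for p in PERSON) and any(e in t for e in EVENT)

print(matches_topic("Hillary Clinton resigns as Secretary of State"))  # True
```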
    8. Methodology – Diversity Annotation. Annotation effort: 500 tweets per topic, with no identification of subtopics beforehand. Tweets about the general topic only (no added value) are judged non-relevant. URLs in tweets are not checked further, since they may become unavailable over time. The 50 topics were split between 2 annotators; annotation is a subjective process, so comparative results follow later. 3 topics were dropped, e.g. for not having enough diversity or relevant documents.
    9. Topic Analysis – The Topics and Subtopics 1/2.

        #Subtopics per topic    Avg.   Std. dev.   Min.   Max.
        All topics              9.27   3.88        2      21
        Annotator 1             8.59   5.11        2      21
        Annotator 2             9.88   2.14        6      13

       On average, we found 9 subtopics per topic. The subjectivity of the annotation is confirmed by the difference between the two annotators in the standard deviation of the number of subtopics per topic.
    10. Topic Analysis – The Topics and Subtopics 2/2. The annotators spent 6.6 seconds on average to annotate a tweet. Most tweets are assigned exactly one subtopic.
    11. Topic Analysis – The Relevance Judgments 1/2. Topics differ in diversity: 25 topics have fewer than 100 tweets with subtopics, while 6 topics have more than 350. The 2 annotators also differ: on average, 96 vs. 181 tweets with subtopic assignments. [Chart: number of relevant vs. non-relevant documents (out of 500) per topic]
    12. Topic Analysis – The Relevance Judgments 2/2. Temporal persistence: some topics are active during the entire timespan (e.g. the Northern Mali conflict, the Syrian civil war), while others last as little as 24 hours (e.g. the BBC Twitter account being hacked, or the Eiffel Tower evacuation due to a bomb threat). [Chart: timespan in days between first and last relevant tweet, per topic]
    13. Topic Analysis – Diversity Difficulty. The difficulty of diversifying the search results depends on the ambiguity or under-specification of the topics and on the diverse content available in the corpus. Golbus et al. proposed the diversity difficulty measure dd: dd > 0.9 means an arbitrary ranked list is likely to cover all subtopics, while dd < 0.5 means subtopics are hard to discover with an untuned retrieval system.

        Diversity difficulty    Avg.   Std. dev.
        All topics              0.71   0.07
        Annotator 1             0.72   0.06
        Annotator 2             0.70   0.07

       Golbus et al.: Increasing evaluation sensitivity to diversity. Information Retrieval (2013) 16
    14. Topic Analysis – Diversity Difficulty (continued). Difference between long- and short-term topics: topics with a longer timespan (>50 days) are easier to diversify (dd 0.73 vs. 0.70). Golbus et al.: Increasing evaluation sensitivity to diversity. Information Retrieval (2013) 16
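The exact dd formula is defined in Golbus et al.; purely as an intuition pump, the Monte Carlo sketch below estimates how often a random top-k ranking of the judged tweets covers every subtopic, which tracks the stated reading that dd > 0.9 means an arbitrary ranked list likely covers all subtopics. This is not the published measure, and the qrels data structure is an assumption.

```python
import random

def random_coverage(qrels: dict[str, set[int]], k: int = 20, trials: int = 1000) -> float:
    """Fraction of random top-k rankings that cover every subtopic.

    qrels maps tweet id -> set of subtopic ids (empty set = non-relevant).
    An intuition-building proxy, NOT the dd formula of Golbus et al.
    """
    ids = list(qrels)
    all_subtopics = set().union(*qrels.values())
    hits = 0
    for _ in range(trials):
        sample = random.sample(ids, min(k, len(ids)))
        covered = set().union(*(qrels[i] for i in sample))
        if covered >= all_subtopics:
            hits += 1
    return hits / trials
```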
    15. Diversification by De-Duplicating – 1/6. Lower redundancy, but higher diversity? In previous work, we were motivated by the fact that about 20% of search results duplicate information to some extent, so we proposed removing duplicates to lower the redundancy of the top-k results. This was implemented with a machine learning framework that uses syntactical, semantic, and contextual features and eliminates the lower-ranked member of each identified duplicate pair from the search result. The question here: can this also achieve higher diversity? Tao et al.: Groundhog Day: Near-duplicate Detection on Twitter. In Proceedings of the 22nd International World Wide Web Conference.
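The talk names the three feature families but not the individual features, so the pairwise features below are illustrative guesses at syntactical and contextual signals; the semantic features of the published framework (e.g. from entity linking) are omitted here because they would require an external service. A classifier over such features would flag the lower-ranked tweet of a duplicate pair for removal.

```python
import re

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def pair_features(t1: str, t2: str) -> dict[str, float]:
    """Illustrative syntactical/contextual features for a tweet pair (guesses)."""
    w1 = set(re.findall(r"\w+", t1.lower()))
    w2 = set(re.findall(r"\w+", t2.lower()))
    return {
        "word_overlap": jaccard(w1, w2),  # syntactical
        "len_ratio": min(len(t1), len(t2)) / max(len(t1), len(t2), 1),  # syntactical
        "hashtag_overlap": jaccard(set(re.findall(r"#\w+", t1.lower())),
                                   set(re.findall(r"#\w+", t2.lower()))),  # contextual
        "url_overlap": jaccard(set(re.findall(r"https?://\S+", t1)),
                               set(re.findall(r"https?://\S+", t2))),  # contextual
    }
```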
    16. Diversification by De-Duplicating – 2/6. Measures: we adopt alpha-(n)DCG, Precision-IA, Subtopic Recall, and Redundancy (see the sketch below). Clarke et al.: Novelty and Diversity in Information Retrieval Evaluation. In Proceedings of SIGIR, 2008. Agrawal et al.: Diversifying Search Results. In Proceedings of WSDM, 2009. Zhai et al.: Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval. In Proceedings of SIGIR, 2003.
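A sketch of two of these measures, following the alpha-DCG gain definition of Clarke et al. (2008): each occurrence of a subtopic is discounted by (1 - alpha) for every earlier document that already covered it. Note that alpha-nDCG additionally normalizes by a greedily built ideal ranking, omitted here for brevity, and the qrels data structure is an assumption.

```python
import math

def alpha_dcg(ranking: list[str], qrels: dict[str, set[int]],
              alpha: float = 0.5, k: int = 20) -> float:
    """alpha-DCG@k: repeated subtopics contribute geometrically less gain."""
    seen: dict[int, int] = {}  # subtopic -> times covered so far
    score = 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        subs = qrels.get(doc, set())
        gain = sum((1 - alpha) ** seen.get(s, 0) for s in subs)
        score += gain / math.log2(rank + 1)
        for s in subs:
            seen[s] = seen.get(s, 0) + 1
    return score

def subtopic_recall(ranking: list[str], qrels: dict[str, set[int]], k: int = 20) -> float:
    """Fraction of all judged subtopics covered within the top k."""
    all_subs = set().union(*qrels.values())
    if not all_subs:
        return 0.0
    covered = set().union(*(qrels.get(d, set()) for d in ranking[:k]))
    return len(covered) / len(all_subs)
```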
    17. Diversification by De-Duplicating – 3/6. Baseline and de-duplication strategies. Baselines: an automatic run using standard queries (no more than 3 terms); a filtered automatic run that filters out duplicates by cosine similarity; and a manual run with manually created complex queries plus automatic filtering. De-duplication strategies combine feature sets (Sy = syntactical, Se = semantic, Co = contextual) into four strategies: Sy, SyCo, SySe, SySeCo.
    18. Diversification by De-Duplicating – 4/6. Overall comparison: the de-duplication strategies did achieve lower redundancy, but they did not achieve higher diversity.
    19. Diversification by De-Duplicating – 5/6. Influence of annotator subjectivity.
    20. Diversification by De-Duplicating – 5/6. Influence of annotator subjectivity: the same general trends hold for both annotators. The alpha-nDCG scores are higher for Annotator 2, which can be explained by Annotator 2 judging more documents as relevant on average.
    21. Diversification by De-Duplicating – 6/6. Influence of temporal persistence.
    22. Diversification by De-Duplicating – 6/6. Influence of temporal persistence: the de-duplication strategies can help for long-term topics, because their vocabulary is richer, whereas only a small set of terms is used for short-term topics.
    23. Conclusions. What we have done: created a microblog corpus for search result diversification; conducted a comprehensive analysis and showed its suitability; and confirmed considerable subjectivity among annotators, although the trends w.r.t. the different evaluation measures were largely independent of the annotators. We have made the corpus available at http://wis.ewi.tudelft.nl/airs2013/. What we will do: apply diversification approaches that have been shown to perform well in the Web search setting, and propose diversification approaches specifically designed for search on microblogging platforms.
    24. Thank you! @wisdelft http://ktao.nl Ke Tao @taubau
