Arcomem training diversification


This presentation on Diversification is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.

Published in: Technology, Career

  1. Diversifying User Comments on News Articles
     Athena Research and Innovation Center; Yahoo! Research
  2. Problem
     Problem description: given a news article and its set of user comments, return a subset of the most diverse comments.
     Perception of a diverse set of comments: a set of comments that
       • represents different opinions and sentiments,
       • is expressed by users with different demographic characteristics,
       • covers different aspects of the news article.
     Motivation:
       • The article's content alone is not always enough to form a complete view of a topic.
       • Public opinion complements the article and represents the "wisdom of the crowds".
  3. Example
     Given a political article:
       • Find all the subtopics handled
           - Related persons
           - Events (election, bill voting)
       • Find all opinions and sentiments expressed
           - Positive/negative/neutral
           - On the whole article or on specific subtopics
       • Find the different kinds of users commenting
           - Different demographics
           - Different commenting history on previous articles
     Present a set of comments that best represents the diversity of the above dimensions.
  4. Motivation
       • Several articles are very popular (>10000 comments)
       • Articles get aggregated → even more comments
       • Impossible for the reader to review them all
       • Current comment sorting options are based on simpler criteria: date, votes, replies
  5. Method outline
       • Define diversification criteria
           - Dimensions → Content, Sentiment, Named Entities, User co-commenting behavior
       • Define a (dis)similarity function that produces a diversity score based on the criteria
           - Quantify the dissimilarity of comments
           - Weighted sum of cosine similarities on diversity feature vectors
       • Apply an iterative heuristic algorithm that, at each step, selects the candidate comment that maximizes a diversity objective
  6. Method description - Criteria: Content
       • Baseline diversity criterion
       • Used in the rest of the literature to diversify search results
       • Objective → obtain comments with diverse content
       • Processing
           - Comments' text → term vectors
           - Document length-normalized tf values
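The content criterion above can be illustrated with a minimal sketch. It assumes that "document length-normalized tf" means dividing each term count by the number of tokens in the comment; the slides do not spell out the normalization. The regex tokenizer and the function name `tf_vector` are hypothetical choices for illustration only.

```python
import re
from collections import Counter


def tf_vector(text):
    """Return a sparse term vector {term: tf}, with tf = count / number of tokens.

    Assumption: length normalization divides by the token count of the comment.
    """
    # Hypothetical tokenizer: lowercase alphanumeric runs (apostrophes kept).
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Usage: "taxes" accounts for 3 of the 5 tokens in this comment.
example = tf_vector("Taxes, taxes and more taxes")
```

The same representation could be reused for any criterion whose features are countable tokens, which is why a sparse dictionary is used here rather than a fixed-size array.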
  7. Method description - Criteria: Named Entities (NEs)
       • Persons, Organizations, Locations
       • News articles often revolve around NEs
           - Even when an article talks about events or situations, usually one or more Persons or Locations are involved
       • Objective → obtain comments that cover (uniformly) as many different NEs as possible
       • Processing
           - Extraction of NEs in comments (Stanford NER)
           - Comments' NEs → term vectors
           - Document length-normalized tf values
  8. Method description - Criteria: Sentiment
       • 9 classes of sentiment within the interval [-4, 4] (-4 → very negative, 0 → neutral, 4 → very positive)
       • Expresses users' opinions on the news articles' topics
       • Objective → obtain comments that cover (uniformly) different classes of sentiment
       • Processing
           - Sentiment analysis of the comment's text (SentiStrength)
           - Construct sentiment vectors
           - Each vector value represents a sentiment class
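As a rough illustration of the sentiment vectors described above, the sketch below maps a single integer score in [-4, 4] to a 9-dimensional vector with one position per class. It assumes the score has already been produced by a sentiment tool; how the raw tool output is reduced to one class per comment is not stated in the slides, and the one-hot encoding and the name `sentiment_vector` are assumptions.

```python
NUM_CLASSES = 9  # sentiment classes -4 .. 4, as listed on the slide


def sentiment_vector(score):
    """Return a 9-dimensional one-hot vector for an integer score in [-4, 4].

    Assumption: each comment carries exactly one sentiment class.
    """
    if not isinstance(score, int) or not -4 <= score <= 4:
        raise ValueError("score must be an integer in [-4, 4]")
    vector = [0.0] * NUM_CLASSES
    vector[score + 4] = 1.0  # index 0 is -4, index 4 is neutral, index 8 is +4
    return vector


# Usage: a very negative and a neutral comment occupy different positions.
very_negative = sentiment_vector(-4)
neutral = sentiment_vector(0)
```

With this encoding, two comments in the same class have cosine similarity 1 and comments in different classes have similarity 0, so a diverse set spreads across classes.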
  9. Comment scoring
       • Cosine similarity function between
           - A pair of comments
           - A comment and a set of comments
       • Apply the similarity function for each criterion separately
       • Produce a final diversity score as a weighted sum of all criterion scores
       • Produce a final score that incorporates comment-to-article similarity
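The scoring scheme above can be sketched as follows. The sketch assumes each comment is a dictionary mapping a criterion name to a sparse feature vector, and that per-criterion dissimilarity is taken as 1 minus cosine similarity; the criterion names, the weights, and the function names are hypothetical, and the comment-to-article term from the last bullet is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {feature: weight}."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def diversity_score(comment_a, comment_b, weights):
    """Weighted sum over criteria of (1 - cosine similarity).

    Assumption: dissimilarity is 1 - cosine; the slides only say "(dis)similarity".
    """
    return sum(
        weight * (1.0 - cosine(comment_a[criterion], comment_b[criterion]))
        for criterion, weight in weights.items()
    )


# Usage with two hypothetical criteria and illustrative weights.
weights = {"content": 0.7, "entities": 0.3}
a = {"content": {"tax": 1.0}, "entities": {"senate": 1.0}}
b = {"content": {"vote": 1.0}, "entities": {"senate": 1.0}}
score = diversity_score(a, b, weights)  # differs in content only
```

Keeping one vector per criterion lets the weights be tuned without recomputing features, which matches applying the similarity function to each criterion separately.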
  10. Algorithm (MAXSUM)
      Initially:
        • Empty diverse result set → all comments belong to the candidate set
        • Arbitrary insertion of a candidate comment into the result set
      Greedy construction heuristic:
        • Compare each candidate comment with the centroid (average) of the current result set
        • Finish after (k-1) iterations → k comments are inserted
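The greedy construction above might look like the following minimal sketch. It assumes a single criterion with dense feature vectors, takes the first comment as the "arbitrary" initial pick, and selects at each step the candidate least similar (by cosine) to the centroid of the current result set. The slides combine several criteria into the objective; that combination and the function names here are simplifications and assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise average of a non-empty list of dense vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def greedy_diverse(vectors, k):
    """Return the indices of up to k comments chosen greedily for diversity."""
    if k <= 0 or not vectors:
        return []
    candidates = list(range(len(vectors)))
    # Arbitrary first insertion: this sketch simply takes the first comment.
    selected = [candidates.pop(0)]
    while candidates and len(selected) < k:
        center = centroid([vectors[i] for i in selected])
        # Most diverse candidate = least similar to the current centroid.
        best = min(candidates, key=lambda i: cosine(vectors[i], center))
        candidates.remove(best)
        selected.append(best)
    return selected


# Usage: comment 1 is nearly a duplicate of comment 0, so it is skipped.
comments = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
chosen = greedy_diverse(comments, 3)
```

Comparing against the centroid instead of every selected comment keeps each iteration linear in the number of candidates, at the cost of only approximating the pairwise objective.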
  11. Evaluation
        • Comparison of the methods by their coverage of the different information nuggets their results contain
            - Baseline: diversification based only on content
            - Proposed method: combination of multiple criteria
        • Proposed methods outperform the baseline
  12. Framework - Implementations
        • A desktop Java application retrieving news articles and comments
            - Comments stored in a MySQL database
            - News and comments obtained via the NY Times API
        • ARCOMEM offline module for calculating diverse WebObjects of WebResources