Arcomem training diversification

This presentation on Diversification is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.

  • 1. Diversifying User Comments on News Articles
    Athena Research and Innovation Center; Yahoo! Research
  • 2. 2 Problem Problem description: Given a news article and the respective set of user comments, return a subset of the most diverse comments Perception of a diverse set of comments: A set of comments that represents different opinions and sentiments, …expressed by users with different demographic characteristics, …covering different aspects of the news article. Motivation Article’s content itself is not always enough to form a complete view over a topic The public opinion complements the article and represents the “wisdom of the crowds”
  • 3. 3 Example Given a political article: Find all the subtopics handled  Persons related  Events (election, bill voting) Find all opinions and sentiments expressed  Positive/negative/neutral  On the whole article/on specific subtopics Find different kinds of users commenting  Different demographics  Different commenting history on previous articles Present a set of comments that better represents the diversity of the above dimensions
  • 4. 4 Motivation Several articles are very popular (>10000 comments) Articles get aggregated  even more comments Impossible for the reader to review Current comment sorting options are based on more simple criteria Date Votes Replies
  • 5. 5 Method outline Define diversification criteria Dimensions  Content, Sentiment, Named Entities, User co-commenting behavior Define a (dis)similarity function that produces a diversity score based on the criteria Quantify the dissimilarity of comments Weighted sum of cosine similarities on diversity feature vectors Apply and iterative heuristic algorithm that, at each step, selects the candidate comment that maximizes a diversity objective
  • 6. 6 Method description - Criteria Content Baseline diversity criterion Used in the rest of the literature to diversify search results. Objective  obtain comments with diverse content. Processing  Comments’ text  term vectors  Document length-normalized tf values
  • 7. 7 Method description - Criteria Named Entities (Nes) Person, Organizations, Locations Many times news articles revolve around Nes  Even when an article talks about events or situations, usually one or more Persons or Locations are involved Objective  obtain comments that cover (uniformly) as many different NEs as possible Processing  Extraction of Nes in comments (Stanford NER)  Comments’ Nes  term vectors  Document length-normalized tf values
  • 8. 8 Method description - Criteria Sentiment 9 classes of sentiment within the interval [-4, 4]  -4  very negative  4  very positive  0  neutral Expresses users’ opinions on the news articles’ topics. Objective  obtain comments that cover (uniformly) different classes of sentiment Processing  Sentiment analysis of the comment’s text (SentiStrength)  Construct sentiment vectors  Each vector value represents a sentiment class
  • 9. 9 Comment scoring Cosine similarity function between A pair of comments A comment and a set of comments Apply the similarity function for each criterion separately Produce a final diversity score as a weighted sum of all criterion scores Produce a final score that incorporates comment-to-article similarity
  • 10. 10 Algorithm (MAXSUM) Initially Empty diverse result set  all comments belong to the candidate set Arbitrary insertion of a candidate comment into the result set Greedy construction heuristic Compare each candidate comment with the centroid (average) of the current result set Finish after (k-1) iterations  k comments are inserted
  • 11. 11 Evaluation Comparison of methods’ coverage on different information nuggets they contain Baseline diversification based only on content Proposed method (combination of multiple criteria) Proposed methods outperform the baseline
  • 12. Framework - Implementations
    A desktop Java application that retrieves news articles and comments
      Comments stored in a MySQL database
      News and comments obtained through the NY Times API
    ARCOMEM offline module for calculating diverse WebObjects of WebResources