  1. Picking the NYT Picks: Editorial Criteria and Automation in the Curation of Online News Comments. Nicholas Diakopoulos, University of Maryland, College Park, College of Journalism. @ndiakopoulos
  2. “NYT Picks is the most popular comment queue. We spend a lot of time tweaking that and getting that right.” What are the criteria for selection? How can we augment moderator capability to consider more comments?
  3. Criteria from Literature. Negative / Exclusion: personal attacks, profanity, abusive behavior. Positive / Inclusion: Internal Coherence, Thoughtfulness, Brevity / Length, Relevance, Fairness / Diversity, Novelty, Argument Quality, Criticality, Emotionality, Entertainment Value, Readability, Personal Experience.
  4. Crowdsourcing. RQ1: Do “NYT Picks” comments reflect the positive editorial criteria identified in the literature?
  5. Automation. RQ2: Can algorithmic approaches to assessing the criteria be developed?
  6. Automated scores point toward scalable opportunities for moderation and UX…
  7. But automation also raises questions about over-generalization across contexts and algorithmic transparency.
  8. Questions? Contact: Nick Diakopoulos, University of Maryland, College of Journalism. Twitter: @ndiakopoulos. More Info: N. Diakopoulos. The Editor’s Eye: Curation and Comment Relevance on the New York Times. Proc. CSCW, March 2015.

Editor's Notes

  • On September 11, 2013, Vladimir Putin published an op-ed in the NYT. Among other things, he questioned American exceptionalism, and if there’s one thing you shouldn’t do in ‘merica, it’s that. He was prodding the American public.

    In response, comments flooded in, 6,367 of them in fact. Of those, 4,447 were published along with the piece.
  • How could you possibly organize thousands of comments and find the interesting or insightful ones?

    Like other commenting systems, the NYT lets users vote up a comment by recommending it. Comments are sorted oldest first, or they can be filtered by their recommendation scores.

    Of the published comments, 85 were deemed NYT Picks, which garner a little badge and reflect the “most interesting and thoughtful” comments.

    What makes this most impressive, though, is that each of those comments was read by a human moderator, a trained journalist, at the NYT before being published. That is, the NYT practices pre-moderation, in contrast to many other publications, which only look at comments after they’re published.

  • In fact, they’re read by a team led by Bassey Etim, the community manager at the NYT. He and his team of 13 moderators read almost every comment before it’s posted to the site.

    Part of that job is choosing the NYT Picks comments. “NYT Picks is the most popular comment queue. We spend a lot of time tweaking that and getting that right.”

    As a baseline, they’re looking for about 5 Picks per 100 comments. Outside of blogs they do about 22 queues a day, but they’d like to open comments on more articles. So how could we help them scale up?

    Talk about the potential benefits of selecting comments: it signals norms and expectations for behavior, creating a beneficial feedback loop.
  • Positive criteria considered in the literature come from studies of letters to the editor, online comments for print publications, and on-air radio comments.

    Readability: style, clarity, adherence to standard grammar, and the degree to which a comment is well-articulated.

    Stress that operationalizing these is hard and there are many challenges for future work.

  • The focus of this work is initially on crowdsourcing ratings for 9 of these dimensions, excluding relevance, fairness, and novelty, since those are much more difficult to measure via crowdsourcing, and a previous paper of mine looked at relevance explicitly.

    The crowdsourcing approach collected human ratings of 8 of the 9 criteria (because length is trivial to measure by counting words): 500 comments, 250 each of NYT Picks and non-Picks, rated on a scale from 1 to 5 on Amazon Mechanical Turk, with 3 independent ratings of each comment. We restricted the task to workers with a reliable and substantial history, located in the US or Canada, and collected 1,500 ratings from 89 different workers.

    We measured Krippendorff’s alpha, a measure of inter-rater reliability, and got slight to moderate agreement among the 3 raters except for entertainment value (so people couldn’t agree on what was funny).
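    As a rough illustration of the reliability measure, here is a minimal sketch of Krippendorff’s alpha for nominal ratings. (The study’s 1-to-5 ratings would more properly use the ordinal or interval variant; this simplified nominal version just shows the mechanics and is not the paper’s implementation.)

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal ratings.

    `units` is a list of rating lists, one per item (e.g. one list of
    rater scores per comment); items with fewer than two ratings are
    skipped because they contribute no pairable values.
    """
    coincidence = Counter()  # coincidence matrix o(c, k)
    n = 0                    # total number of pairable values
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        n += m
        for i, a in enumerate(ratings):
            for j, b in enumerate(ratings):
                if i != j:
                    coincidence[(a, b)] += 1 / (m - 1)
    agreement = sum(v for (a, b), v in coincidence.items() if a == b)
    marginals = Counter()
    for (a, _), v in coincidence.items():
        marginals[a] += v
    d_o = (n - agreement) / n  # observed disagreement
    d_e = (n * n - sum(v * v for v in marginals.values())) / (n * (n - 1))
    if d_e == 0:  # every rating identical: no expected disagreement
        return 1.0
    return 1 - d_o / d_e

# Perfect agreement on two items yields alpha = 1.0
print(krippendorff_alpha_nominal([[1, 1], [2, 2]]))
```

    Alpha runs from 1.0 (perfect agreement) down through 0 (agreement at chance level) to negative values for systematic disagreement.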

  • Eventually we would like to compute scores for all of these criteria automatically, but for now we do three of them.

    Readability is the reading level according to the SMOG index, an index that measures the usage of more complex words. There was a high correlation between the SMOG index and the crowdsourced ratings of readability.
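    The SMOG grade can be sketched in a few lines. The published formula is 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291; the vowel-group syllable counter below is a rough heuristic, not the exact tokenizer used in the paper.

```python
import math
import re

def count_syllables(word: str) -> int:
    """Rough syllable count: runs of vowels, minus a silent trailing 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def smog_index(text: str) -> float:
    """SMOG grade: higher values mean more polysyllabic (complex) words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / max(len(sentences), 1)) + 3.1291
```

    A comment with no three-syllable words scores the formula’s floor of about 3.13; dense, polysyllabic prose pushes the grade up.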

    Personal experience is based on the proportion of words from LIWC dictionaries that reflect first-person pronouns as well as family and friend relationships. Comment tokens are stemmed to match the dictionary.
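    A minimal sketch of that score, with a tiny stand-in word list and a crude suffix stripper. The real system uses the LIWC dictionaries and proper stemming; the word list and suffix rules below are illustrative assumptions only.

```python
import re

# Illustrative stand-in for LIWC's first-person-pronoun and family/friend
# categories; the actual dictionaries are much larger (assumed word list).
PERSONAL_STEMS = {"i", "me", "my", "mine", "we", "our", "us",
                  "mother", "father", "famil", "friend", "son", "daughter"}

def _stem(token: str) -> str:
    """Crude suffix stripping as a stand-in for a real stemmer."""
    for suffix in ("ies", "es", "s", "ing", "ed", "y"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def personal_experience_score(comment: str) -> float:
    """Proportion of tokens whose stem matches the personal-word list."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if _stem(t) in PERSONAL_STEMS)
    return hits / len(tokens)
```

    A first-person anecdote like “My family and my friends helped me” scores high because most tokens match, while news-style third-person prose scores near zero.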
  • So I found a statistically significant difference for all criteria except entertainment value and emotionality, and emotionality was marginally significant at p = 0.08.

    Several of these criteria also correlated fairly well, such as thoughtfulness with readability, and argument quality with thoughtfulness. Future work might look at scaling up the data collection and at dimensionality reduction techniques.
  • All statistically significant at p = 0.05 or lower.
  • Editorial selections (NYT Picks) do reflect many of the editorial criteria articulated in the literature, showing continuity of professional criteria into the online space (except for brevity).

    Online spaces don’t have the same space constraints, and we found NYT editors preferred longer comments for Picks. This raises the question of how well that serves users from their perspective.

    The scores we computed, in particular the personal experience score, could have some really nice applications for amplifying the value of comments for moderators as well as reporters. In some follow-up work we’ve shown this to comment moderators, and they’re excited about the possibilities.

    Automation could also enable new end-user experiences, where users adapt their own view of the comments based on automatically computed scores along journalistically interesting lines.
  • Over-generalization: different communities or topics (e.g. sports) require different treatment, so algorithmic solutions can’t be one-size-fits-all. Is it always better to highlight a highly readable comment, and when does that come into tension with diversity or fairness of perspectives?

    Do Picks affect community or individual behavior?
  • Mention the CommentIQ project at UMD, funded by the Knight Foundation.

    We’re going to be hiring a fellow or fellows, so if you’re interested in joining the lab, please come speak to me. We work on everything from data visualization to algorithmic accountability and transparency, as well as data mining of things like online comments. If you want to combine data and computing with design, in the context of journalism, please come talk.