Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Finding High-Quality Content in Social Media

7,085 views

Published on

Eugene Agichtein, Carlos Castillo, Debora Donato, Aris Gionis and Gilad Mishne. Presented at WSDM 2008.

  • Be the first to comment

  • Be the first to like this

Finding High-Quality Content in Social Media

  1. 1. Eugene Agichtein Emory University Atlanta, USA Carlos Castillo Debora Donato Aris Gionis Yahoo! Research Barcelona, Spain Gilad Mishne Yahoo! S&A Sciences Santa Clara, CA, USA
  2. 2. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. User-generated content Traditional publishing≠
  3. 3. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Chris Anderson: “The Long Tail”. Hyperion, 2006. Frequency Quality Traditional publishing User- generated
  4. 4. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Chris Anderson: “The Long Tail”. Hyperion, 2006. Quantity Quality User- generated Traditional publishing
  5. 5. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. <!--
  6. 6. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Quantity Quality ?
  7. 7. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Chris Martin from Coldplay in The Rolling Stone, Fortieth Aniversary, July 2007. Quantity Quality “We think it's all about quality over quantity now, because there's so much noise everywhere, there's no point in putting anything out unless it's fucking amazing.”
  8. 8. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Quantity Quality AUser- generated Traditional publishing
  9. 9. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Quantity Quality F.A.User- generated Traditional publishing
  10. 10. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. -->
  11. 11. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Quantity Quality ?
  12. 12. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Quantity Quality (Hard) problem
  13. 13. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08.
  14. 14. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Best answer Picked by votes -or- Picked by asker All answers + “Thumbs up” + “Thumbs down” Question + “Stars”
  15. 15. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. ¼ questions want an opinion: informal polls ¾ questions seek for information or advice
  16. 16. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Q. Su, D. Pavlov, J.-H. Chow, W. C. Baker. “Internet-scale collection of human-reviewed data”.WWW'07. 17%-45% of answers were correct 65%-90% of questions had at least one correct answer
  17. 17. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. There are top contributors ... ... but they don't have all the answers
  18. 18. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Quantity Quality Task: find high-quality items
  19. 19. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Existing tools ● Link-based ranking methods ● Propagation of trust/distrust ● Automatic text analysis ● Usage mining ● ...
  20. 20. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Sources of information ● Content analysis ● Usage data (clicks) ● Community ratings
  21. 21. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Sources of information ● Content analysis (with errors) ● Clicks (with noise) ● Community ratings (sparse, with spam)
  22. 22. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Text analysis Clicks Community
  23. 23. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Text analysis Clicks Community
  24. 24. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Language modeling Text analysis Readability statistics
  25. 25. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Text analysis Language modelingReadability statistics Punctuation density Capitalization errors Number of words + spacing density, sylablles per word,...
  26. 26. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Text analysis Language modelingReadability statistics G. Mishne, D. Carmel, R. Lempel: “Blocking blog spam with language model disagreement”. AIRWeb'05 Language model disagreement Distributions of word n-grams and part-of-speech sequences when|how|why -- “to” -- verb “how to identify ...” when|how|why – verb – verb – pronoun – verb “how do I remove ...”
  27. 27. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Text analysis Clicks Community
  28. 28. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Clicks If we know that a question is clicked 100 times, and another question is clicked 10,000 times ... ... we still know nothing
  29. 29. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Clicks Per-category average Clicks Per-category stdev. Question age
  30. 30. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Text analysis Clicks Community
  31. 31. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Power laws
  32. 32. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. P. Jurczyk, E. Agichtein: “Discovering authorities in Q.A. communities by using link analysis” CIKM'07 Askers Answerers
  33. 33. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Community answers votes + votes - picks as best
  34. 34. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Community Degree-based metrics # answers given # answers received # votes + given # votes + received etc...
  35. 35. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Community Propagation-based metrics 1. Pagerank score 2. HITS hub score 3. HITS authority score Computed on each graph
  36. 36. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Text analysis Clicks Community Relations Learning Training labels
  37. 37. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Text analysis Clicks Community Relations Learning Training labels
  38. 38. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. High Medium Low High 15% Medium 76% Low 9% 100% Answer quality Question quality
  39. 39. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. High Medium Low High 15% 8% Medium 76% 74% Low 9% 18% 100% 100% Answer quality Question quality
  40. 40. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. High Medium Low High 41% 15% 8% Medium 53% 76% 74% Low 6% 9% 18% 100% 100% 100% Answer quality Question quality
  41. 41. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. High Medium Low High 41% 15% 8% Medium 53% 76% 74% Low 6% 9% 18% 100% 100% 100% Answer quality Question quality Question quality and answer quality are not independent
  42. 42. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Relations: questions A Q V Q A A AQ A V U Answers to the question being evaluated User asking question Question being evaluated Questions asked Answers given Votes given QAAnswers to questions asked U U U Answerers of question being evaluated
  43. 43. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. A Q V A Q A A A U Q A V U Other answers to the same question Asker of question being answered Question being answered Answerer Answer being evaluated Questions asked Answers given Votes given QA Answers to questions asked Relations: answers
  44. 44. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Text analysis Clicks Community Relations Learning Training labels
  45. 45. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Text analysis Clicks Community Relations Learning: stochastic gradient boosted trees Labeled data: 6K questions 8K answers J. H. Friedman: “Stochastic gradient boosting”. Comp. Stat. Data. Anal., 38(4), 367-378, 2002. Evaluation: Precision, Recall (F1); Area under ROC curve
  46. 46. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Precision Recall AUC N-grams (N) 65% 48% 0.52 N+ text analysis 76% 65% 0.65 N+ clicks 68% 57% 0.58 N+ relations 74% 65% 0.66 All 79% 77% 0.76 Task: high-quality questions
  47. 47. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Precision Recall AUC N-grams (N) 67% 86% 0.81 N + text analysis 71% 93% 0.88 N + clicks - - - N + relations 69% 85% 0.82 All 73% 91% 0.87 Task: high-quality answers
  48. 48. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. In the paper ... ● Framework for quality estimation in social media ● Graph-based model of contributor relationships ● Details on the relative importance of (sets of) features
  49. 49. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. What did we learn? ● Human assessments for this task – ... have relatively low agreement ● Classifying questions/answers – ... is substantially different from document classification ● Look at orthogonal feature spaces
  50. 50. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Text analysis Clicks Community Relations Learning Future work Relational learning
  51. 51. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08. Thank you!
  52. 52. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08.
  53. 53. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08.
  54. 54. ROC curve: high-quality questions Best N-grams
  55. 55. ROC curve: high-quality answers Best N-grams

×