Article Popularity Prediction

Published in: Technology

  1. News Data Science Integration
     Data Science in News - Popularity Prediction of News Articles
     Shuguang Wang, Big Data @ WaPo, Oct 27, 2015
  2. Goal
     Better news - better reading experience and engagement
  3. Reading Experience and User Engagement
     • Break news fast
     • Premium-quality content
     • ...
  5. Challenges
     • What are the popular articles?
     • Resources are limited: which articles deserve more work?
     • Article refinements: more text, images, videos, or follow-up articles
  7. Hypothesis
     Figure: Page views for different articles
         def hypothesis():
             if isPopular("https://www.washingtonpost.com/..."):
                 considerRefinement()
  9. What we have
     • Click stream from Omniture
         www.washingtonpost.com/local/... 26DFB82685159CA7|o|60000171001F4BEF 1430369692542 t.co/8dZKZtCbV9
         www.washingtonpost.com/news/... 2A11255B85012C0E|o|40001603C00213A1 1430369692573 m.facebook.com/
         www.washingtonpost.com/politics/... 29E638AB85160E6C|o|400001A4200087AC 1430369692590 drudgereport.com/...
     • Article content from page builder
         {"date": "2015-07-15T22:57:39Z",
          "html_md5": "be5e36dcde6a76c4b7e24630d82c6111",
          "story": {
            "id": "http://www.washingtonpost.com/sports/was-dez-bryant-bluffing-about-sitting-out-the-cowboys-dont-have-to-find-out/2015/07/15/68715962-05a5-48fb-b5ba-e6d4cab72bfc_story.html",
            "canonical_url": "http://www.washingtonpost.com/sports/was-dez-bryant-bluffing-about-sitting-out-the-cowboys-dont-have-to-find-out/2015/07/15/68715962-05a5-48fb-b5ba-e6d4cab72bfc_story.html",
            "clavis_keywords": [{"id": "Dallas Cowboys", "frequency": 11, "score": 0.5743849751851194, "term-freq": 11},
  10. Data Exploration
      • Page builder: time
      Figure: Total number of page views on the WaPo site over a couple of days
  11. Data Exploration
      • Page builder: sections
      Figure: Total number of page views on the WaPo site, broken down by section
  12. Data Exploration
      • Click stream: page views
      Figure: Number of page views on two articles as time series
  14. Problem Statement
      • We want to predict article popularity as soon as possible.
      • Data is collected for 30 minutes before a prediction is made.
      • The life span of a news article is short.
      • The prediction target is the number of clicks in the first 24 hours.
  15. Tools
      • Several regression models: Multiple Linear Regression, Lasso, Ridge Regression, and Tree Regression
      • Spark Streaming, Kafka, HBase, R, scikit-learn, Splunk, VADER sentiment analyzer, Stanford NLP, Lucene, Flesch-Kincaid Readability Index, ...
  16. System
  17. From ... To ...
      Figure panels: (a) Baseline, (b) All features
  18. For U
          def applyDS():
              if hasQuestions(yourTeam) and dataAvailable:
                  considerDS()
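
The click-stream sample on the "What we have" slide and the 30-minute window on the "Problem Statement" slide suggest an early-traffic feature: page views per article during its first 30 minutes. The sketch below is one way that count might be computed, not the pipeline from the talk. It assumes each record has four whitespace-separated fields (article URL, visitor ID, epoch-millisecond timestamp, referrer), which is only an inference from the sample, and it treats an article's first observed click as the start of its window. The function names are hypothetical.

```python
from collections import Counter

# 30-minute observation window from the Problem Statement slide, in milliseconds.
WINDOW_MS = 30 * 60 * 1000


def parse_record(line):
    """Split one record into (url, visitor_id, timestamp_ms, referrer).

    Assumed layout: four whitespace-separated fields. Records with a
    missing referrer would need extra handling.
    """
    url, visitor_id, timestamp, referrer = line.split()
    return url, visitor_id, int(timestamp), referrer


def early_page_views(lines, window_ms=WINDOW_MS):
    """Count page views per URL within window_ms of that URL's first click."""
    records = sorted((parse_record(line) for line in lines), key=lambda r: r[2])
    first_seen = {}   # url -> timestamp of its earliest click
    views = Counter()
    for url, _visitor, ts, _referrer in records:
        start = first_seen.setdefault(url, ts)
        if ts - start <= window_ms:
            views[url] += 1
    return views
```

Calling `early_page_views` on a list of record strings returns a `Counter` keyed by URL. In a streaming setting the same logic would run incrementally rather than over a sorted batch.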
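
The "Tools" slide lists Ridge Regression among the models used to predict 24-hour clicks. As a rough illustration of the idea, here is a single-feature ridge fit written in plain Python instead of scikit-learn. The choice of feature (page views in the first 30 minutes), the helper names, and the idea of a one-feature model are all assumptions made for this sketch. The talk compares a baseline with an "all features" model but does not list the features.

```python
def fit_ridge_1d(xs, ys, alpha=1.0):
    """Fit y ~ b + w * x, minimizing squared error plus alpha * w**2.

    The intercept b is not penalized. With alpha = 0 this reduces to
    ordinary least squares (and fails if all xs are identical).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / (sxx + alpha)   # the penalty shrinks the slope toward zero
    b = y_mean - w * x_mean
    return w, b


def predict(model, x):
    """Predicted 24-hour clicks for an article with x early page views."""
    w, b = model
    return b + w * x
```

A model could be fitted on historical (early views, 24-hour clicks) pairs and then applied to a new article once its 30-minute window closes. Larger values of `alpha` pull the slope toward zero, so predictions move toward the historical mean.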
