Transcript of "Demographics and Weblog Targeting"

  1. Demographics and Weblog Hackathon – Case Study. 5.3% of Motley Fool visitors are subscribers. Design a classification model for insight into which variables are important for strategies to increase the subscription rate. Learn by doing. (Copyright, All Rights Reserved: Doug Chang, dougc at stanfordalumni dot org)
  2. http://www.meetup.com/HandsOnProgrammingEvents/
  3. Data Mining Hackathon
  4. Funded by Rapleaf
     • With Motley Fool's data
     • App note for Rapleaf/Motley Fool
     • Template for other hackathons
     • Did not use AWS; R on individual PCs
     • Logistics: Rapleaf funded prizes and food for 2 weekends for ~20-50 people; the venue was free
  5. Getting more subscribers
  6. Headline Data, Weblog
  7. Demographics
  8. Cleaning Data
     • training.csv (201,000 rows), headlines.tsv (811 MB), entry.tsv (100k rows), demographics.tsv
     • Feature engineering
     • Github:
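     A minimal R sketch of loading these files; everything beyond the file names (the uid column, the pageV feature) is a hypothetical illustration of the feature-engineering step, not the deck's actual code:

         # Load the hackathon files (file names from the slide; column layouts are assumptions)
         training     <- read.csv("training.csv")         # ~201,000 rows: user id + subscriber label
         demographics <- read.delim("demographics.tsv")   # sparse demographic attributes
         entry        <- read.delim("entry.tsv")          # ~100k entry-page records
         # headlines.tsv is ~811 MB; use a faster reader or chunking if memory is tight
         headlines    <- read.delim("headlines.tsv", quote = "", comment.char = "")

         # Hypothetical feature: page views per user, counted from the weblog
         pageV <- as.data.frame(table(headlines$uid))
         names(pageV) <- c("uid", "pageV")
         train <- merge(training, pageV, by = "uid", all.x = TRUE)
         train <- merge(train, demographics, by = "uid", all.x = TRUE)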
  9. Ensemble Methods
     • Bagging, boosting, randomForests
     • Overfitting
     • Stability (small changes in the data can make large changes in the predictions)
     • Previously none of these worked at scale
     • Small-scale results using R; large-scale versions exist only in proprietary implementations (Google, Amazon, etc.)
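     A small-scale sketch in R of the ensemble methods named above; the data frame train and its columns (subscriber, pageV, loc, income, age) are the hypothetical ones built in the previous sketch:

         library(randomForest)   # bagging / random forests
         library(gbm)            # boosting

         # Random forest classifier on the 0/1 subscriber label
         rf <- randomForest(factor(subscriber) ~ pageV + loc + income + age,
                            data = train, ntree = 500, na.action = na.omit)

         # Boosted trees with built-in cross-validation
         brt <- gbm(subscriber ~ pageV + loc + income + age,
                    data = train, distribution = "bernoulli",
                    n.trees = 2000, interaction.depth = 5,
                    shrinkage = 0.01, cv.folds = 10)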
  10. ROC Curves (binary classifiers only!)
  11. Paid Subscriber ROC curve, ~61%
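     A sketch of producing an ROC curve and AUC for a binary classifier in R with the pROC package; the model and data names are the hypothetical ones above, and the deck does not say which plotting tool it used:

         library(pROC)
         n.best <- gbm.perf(brt, method = "cv")              # CV-chosen tree count
         pred   <- predict(brt, newdata = train, n.trees = n.best,
                           type = "response")                # predicted subscription probability
         r <- roc(train$subscriber, pred)
         plot(r)      # ROC curve
         auc(r)       # area under the curve: 0.5 = random, 1.0 = perfect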
  12. Boosted Regression Trees Performance
     • Training data ROC score = 0.745
     • CV ROC score = 0.737; se = 0.002
     • 5.5% less performance than the winning score, without doing any data processing
     • Random is 50%, i.e. 0.50; at 0.737 - 0.50 we are 23.7 points better than random
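     The training/CV ROC scores with a standard error look like the output of a cross-validated boosted regression tree fit; the tc/lr/number-of-trees language on later slides matches dismo::gbm.step, though whether that exact tool was used is an assumption. A sketch:

         library(dismo)
         brt.cv <- gbm.step(data = train,
                            gbm.x = c("pageV", "loc", "income", "age"),  # predictor columns (hypothetical)
                            gbm.y = "subscriber",                        # 0/1 response (hypothetical)
                            family = "bernoulli",
                            tree.complexity = 5,      # tc
                            learning.rate = 0.01,     # lr
                            bag.fraction = 0.5)
         brt.cv$self.statistics$discrimination      # training ROC (the slide reports 0.745)
         brt.cv$cv.statistics$discrimination.mean   # cv ROC (the slide reports 0.737)
         brt.cv$cv.statistics$discrimination.se     # cv ROC standard error (the slide reports 0.002)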
  13. Contribution of predictor variables
  14. Predictive Importance
     • Friedman's measure: the number of times a variable is selected for splitting, weighted by the squared-error improvement to the model; also a measure of sparsity in the data
     • Fit plots average out the other model variables
     • Relative contributions:
        1.  pageV     74.0567852
        2.  loc       11.0801383
        3.  income     4.1565597
        4.  age        3.1426519
        5.  residlen   3.0813927
        6.  home       2.3308287
        7.  marital    0.6560258
        8.  sex        0.6476549
        9.  prop       0.3817017
        10. child      0.2632598
        11. own        0.2030012
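     The ranked contributions above are Friedman-style relative influence values. A sketch of retrieving them in R: with dismo::gbm.step they are stored on the fitted object, and plain gbm reports the same measure via summary:

         brt.cv$contributions                                  # data frame: variable, relative influence (%)
         summary(brt, n.trees = gbm.perf(brt, method = "cv"))  # same measure from the plain gbm fit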
  15. Behavioral vs. Demographics
     • Demographics are sparse
     • Behavioral weblogs are the best source; most sites aren't using this information correctly
     • There is no single correct answer: trial and error on features. The features are more important than the algorithm
     • Linear vs. nonlinear
  16. Fitted Values (Crappy)
  17. Fitted Values (Better)
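     One way to produce fitted-value plots like the ones on slides 16-17 is a partial-dependence plot per predictor, which averages out the other variables; a sketch (whether the deck used these exact functions is an assumption):

         gbm.plot(brt.cv, n.plots = 4, write.title = FALSE)   # dismo: fitted functions for the top predictors
         plot(brt, i.var = "pageV",
              n.trees = gbm.perf(brt, method = "cv"))         # plain gbm equivalent for a single predictor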
  18. Predictor Variable Interaction
     • Adjusting variable interactions
  19. Variable Interactions
  20. Plot Interactions: age, loc
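     A sketch of quantifying and plotting pairwise predictor interactions with dismo (an assumption about the tooling; the x/y arguments are positions in the gbm.x list, so 4 = age and 2 = loc in the hypothetical fit above):

         ints <- gbm.interactions(brt.cv)
         ints$rank.list                      # strongest pairwise interactions, ranked

         # 3-D fitted surface over two interacting predictors, e.g. age and loc
         gbm.perspec(brt.cv, x = 4, y = 2)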
  21. Trees vs. other methods
     • Multiple levels are visible, which suits trees. Do the other variables match this? Simplify the model or add more features; iterate to a better model
     • No math required; readable by an analyst
  22. Number of Trees
  23. Data Set Number of Trees
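     A sketch of how the number of trees gets chosen: gbm.step grows trees in steps and keeps the count that minimises cross-validated deviance, while plain gbm exposes gbm.perf (the accessor name follows the dismo BRT tutorial and is an assumption about the deck's workflow):

         brt.cv$gbm.call$best.trees              # tree count selected by gbm.step
         best.n <- gbm.perf(brt, method = "cv")  # CV-optimal count for the plain gbm fit
         best.n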
  24. Hackathon Results
  25. Weblogs only: 68.15%, 18% better than random
  26. Demographics add 1%
  27. AWS Advantages
     • Running multiple instances with different algorithms and parameters using R
     • Add tutorial, install Screen, R GUI bugs
     • http://amazonlabs.pbworks.com/w/page/28036646/FrontPage
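     The deck ran separate AWS instances for different algorithms and parameter settings; a related sketch on a single multi-core machine is a parallel sweep over tc/lr settings with the base parallel package (the settings grid and column names are hypothetical):

         library(parallel)
         settings <- expand.grid(tc = c(2, 5), lr = c(0.01, 0.005))
         fits <- mclapply(seq_len(nrow(settings)), function(i) {
           gbm.step(data = train,
                    gbm.x = c("pageV", "loc", "income", "age"),
                    gbm.y = "subscriber",
                    family = "bernoulli",
                    tree.complexity = settings$tc[i],
                    learning.rate = settings$lr[i],
                    silent = TRUE)
         }, mc.cores = 4)
         sapply(fits, function(f) f$cv.statistics$discrimination.mean)   # compare cv ROC across settings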
  28. Conclusion
     • Data mining at scale requires more development in visualization, MR algorithms, and MR data preprocessing
     • Tuning relies on visualization; there are three parameters to tune (tc, lr, number of trees), and two of the three weren't covered here
     • This isn't reproducible in Hadoop/Mahout or any open-source code I know of
     • Other use cases: predicting which item will sell (eBay), search engine ranking
     • Be careful with MR paradigms: Hadoop MR != Couchbase MR