MLconf NYC 0xdata


Published in: Technology, Education

  1. 4/23/13 Movie Night: Data Science - Even On Our Night Off
     May 27, 2014
  2. Anqi and Irene (H2O)
     • Anqi is the in-house R expert and is responsible for K-means and PCA.
     • Irene is the pencil-and-paper stats nerd and technical writer.
     • Both are part of a data science team that’s 75% women, on a technical team that’s 23% women (well above average).
     Sergei (Collective)
     • VP, Data Sciences at Collective, where he is responsible for the architecture, development and scaling of data-driven technology products for digital advertising.
  3. What is H2O?
     • Same statistics, new volumes of data.
     • On a distributed cluster, models on a terabyte of data can finish in minutes.
     • Provides an interface that gives more people the power of data science.
     • H2O also hooks into R and Scala.
  4. Overview
     • Walk through the practical problem of what movie to go see together.
     • Examine the workflow from data to prediction, and let the best model inform our choice.
     • Extend to production-setting applications with a customer use case.
  5. MovieLens Data
     Data is the 100,000-observation MovieLens data set.
     Demographic features:
     • State: factor, 62 levels; largest class: California
     • Age: integer, range (7, 73); mean: 32.9
     • Occupation: factor, 21 levels; largest class: Student
     • Gender: factor, 2 levels; M:F is about 3:1
  6. Movie Classes
     Movies are classified by type; types are not mutually exclusive.
  7. Dependent Variable
     Users rated movies on a Likert scale of 1 to 5. We converted this to a binomial indicator:
     • Ratings >= 4: recoded to 1, indicating the user liked the movie
     • Ratings < 4: recoded to 0, indicating the user disliked the movie
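The recoding on slide 7 can be sketched in a few lines of Python. This is only an illustration of the rule stated on the slide; the function name `recode_rating` and the sample ratings are hypothetical, not taken from the talk.

```python
def recode_rating(rating):
    """Map a 1-5 Likert rating to the binary 'liked' indicator from slide 7."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    # Ratings of 4 or 5 count as "liked" (1); ratings of 1 to 3 as "disliked" (0).
    return 1 if rating >= 4 else 0


# Hypothetical sample of raw ratings, one of each possible value.
sample_ratings = [1, 2, 3, 4, 5]
liked = [recode_rating(r) for r in sample_ratings]
print(liked)
```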
  8. Super Models
     • Both models predict the same dependent variable as a function of the same set of features.
     • First model: tree-based GBM. Start simple and let the model get as complex as it needs to with depth.
     • Alternative model: regularized GLM. Start with complexity and let the model generalize with regularization.
  9. WWIM
     • Using gradient boosted classification on two classes
     • GBM is nonparametric, great when there’s no theoretical model
     • Accounts for complex interactions
     • Control overfitting with the learning rate
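To make the role of the learning rate on slide 9 concrete, here is a minimal sketch of gradient boosting for two classes. It assumes a single numeric feature and depth-one trees (stumps) fitted to the residuals of the logistic loss; it is not H2O's GBM implementation, and the toy ages, labels, and all function names are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Pick the threshold split that best fits the residuals in squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # a split must leave points on both sides
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return t, lm, rm


def boost(xs, ys, rounds, learning_rate):
    """Add one stump per round, each scaled down by the learning rate."""
    scores = [0.0] * len(xs)  # log-odds, starting from "no information"
    stumps = []
    for _ in range(rounds):
        # Negative gradient of the logistic loss: label minus predicted probability.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        scores = [s + learning_rate * (lm if x <= t else rm)
                  for x, s in zip(xs, scores)]
    return stumps, scores


def log_loss(ys, scores):
    total = 0.0
    for y, s in zip(ys, scores):
        p = sigmoid(s)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(ys)


# Toy data (hypothetical): ages and whether the user liked a movie.
xs = [18, 22, 25, 30, 35, 40, 50, 60]
ys = [1, 1, 1, 0, 1, 0, 0, 0]

for rate in (0.1, 0.5):
    _, final_scores = boost(xs, ys, rounds=20, learning_rate=rate)
    print(rate, round(log_loss(ys, final_scores), 3))
```

A smaller learning rate shrinks each tree's contribution, so the model needs more rounds to fit the training data as closely; that is the lever the slide refers to for controlling overfitting.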
  10. WWAM: Alternative – Logistic GLM
      • Logistic binomial regression
      • The final model is interpretable
      • Control overfitting by introducing a penalty into the objective function; this aids feature selection and generalizability
      • Ridge regression: all L2 penalty
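The penalized objective on slide 10 can be sketched as the average logistic loss plus an L2 (ridge) term on the coefficients. This is a minimal illustration, not H2O's GLM code: the scaling of the penalty (`lam` times the sum of squared weights, intercept unpenalized) is one common convention assumed here, and the data, weights, and function name are hypothetical.

```python
import math


def penalized_log_loss(weights, bias, rows, labels, lam):
    """Average logistic loss plus an L2 penalty of lam * sum(w^2)."""
    total = 0.0
    for x, y in zip(rows, labels):
        z = bias + sum(w * xi for w, xi in zip(weights, x))
        p = 1.0 / (1.0 + math.exp(-z))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    data_term = total / len(rows)
    # The intercept is left out of the penalty (assumed convention).
    penalty = lam * sum(w * w for w in weights)
    return data_term + penalty


# Hypothetical toy data: two features per user, binary "liked" label.
rows = [[0.2, 1.0], [0.9, 0.0], [0.4, 1.0], [0.7, 0.0]]
labels = [1, 0, 1, 0]
weights, bias = [2.0, -1.0], 0.1

unpenalized = penalized_log_loss(weights, bias, rows, labels, lam=0.0)
ridge = penalized_log_loss(weights, bias, rows, labels, lam=0.5)
print(ridge - unpenalized)  # the gap is exactly the penalty term
```

Because the penalty grows with the squared size of the coefficients, minimizing this objective pulls the weights toward zero, trading some fit on the training data for better generalization.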
  11. Rubber, Meet Road
      Comparison of error rates on the holdout set:

                              GBM Model   GLM Model
      Error on Dislike (0)    28%         30%
      Error on Like (1)       18%         50%
      Overall                 22%         40%
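The three rows in the slide 11 comparison (error on each class, plus overall error) can be computed from holdout labels and predictions as in the sketch below. The labels and predictions here are hypothetical toy values, not the talk's data, and `class_error_rates` is an illustrative name.

```python
def class_error_rates(actual, predicted):
    """Return ({class: error rate}, overall error rate) for binary labels."""
    errors = {0: 0, 1: 0}
    counts = {0: 0, 1: 0}
    for a, p in zip(actual, predicted):
        counts[a] += 1
        if a != p:
            errors[a] += 1
    # Error on a class = share of that class's true members predicted wrongly.
    per_class = {c: errors[c] / counts[c] for c in counts if counts[c]}
    overall = sum(errors.values()) / sum(counts.values())
    return per_class, overall


# Hypothetical holdout set: 4 dislikes (0) and 6 likes (1).
actual = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

per_class, overall = class_error_rates(actual, predicted)
print(per_class, overall)
```

The overall rate is a weighted average of the per-class rates, weighted by how many holdout observations fall in each class.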
  12. GBM Predictions
      • Like: 300, Her, Need For Speed
      • Dislike: Frozen, Peabody
      GLM Predictions
      • Like: 300, Her, Capt. America
      • Dislike: Frozen, Divergent
  13. Lights Out - Some Closing Points
      We didn't address a serious problem here, but this is the general process used in a production environment. To give you a sense of a real-world implementation, we’ve asked one of our users to share his use case with you.
  14. "Stories change people, while statistics give them something to argue about." - Bernie Siegel
  15. [Diagram labels: Ad Server (publisher), Ad Server (advertiser), Agency, Browser, Brands, Publishers, Content, Inventory, Ads, Audience]
  16. Audience Modeling
      1. Build the Audience Cloud of stable cookies.
      2. Define the target audience using cookie-level data.
      3. Assemble 1,000s of features on every cookie.
      4. Build a predictive model using machine learning.
      5. Score every cookie in the Audience Cloud.
      6. Create a targetable segment with the top X users.
      7. Adjust X daily to optimize delivery & performance.
      8. Rebuild models weekly (daily if warranted).
      • Audience Extension: audiences (age 25-40, buys toys, watches TNT)
      • Audience Optimization: actions (clicks, online purchases)
      [Diagram labels: Audience Cloud (200M+ Stable Cookies); Target Audience (100K Cookies); 1M Cookies; 3M Cookies]
      Preprint of paper submitted to KDD’14
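Steps 5 and 6 on slide 16 (score every cookie, then keep the top X) can be sketched as follows. This is an assumption-laden illustration rather than Collective's pipeline: the cookie IDs, the two features, and `toy_model` are hypothetical stand-ins for the real Audience Cloud and the trained model.

```python
import heapq


def score_cookies(features_by_cookie, model):
    """Step 5: apply the model to every cookie's feature record."""
    return {cookie: model(features)
            for cookie, features in features_by_cookie.items()}


def top_x_segment(scores, x):
    """Step 6: keep the x highest-scoring cookies, best first."""
    top = heapq.nlargest(x, scores.items(), key=lambda item: item[1])
    return [cookie for cookie, _ in top]


def toy_model(features):
    # Hypothetical stand-in for a trained model: a fixed weighted sum.
    return 0.6 * features["clicks"] + 0.4 * features["purchases"]


# Hypothetical miniature "Audience Cloud".
audience_cloud = {
    "cookie_a": {"clicks": 0.1, "purchases": 0.0},
    "cookie_b": {"clicks": 0.9, "purchases": 0.8},
    "cookie_c": {"clicks": 0.5, "purchases": 0.2},
    "cookie_d": {"clicks": 0.7, "purchases": 0.9},
}

scores = score_cookies(audience_cloud, toy_model)
segment = top_x_segment(scores, 2)
print(segment)
```

Keeping X as a plain parameter mirrors step 7: the segment size can be adjusted daily without rescoring or rebuilding the model.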
  17. Modeling Platform
      • MODEL BUILDING: Computing predictive models (columns: Current, Future)
      • DATA SIZES: Size of data
      • ALGORITHM: Complexity and performance
      • SCORING: Predicting outcomes (Batch, Real Time)
      [Values shown on the slide: GBM, glmnet; 1 million, 1,000, 1 billion, 100,000; + H2O]