MLconf NYC 0xdata

Published in: Technology, Education

  1. 4/23/13. Movie Night: Data Science - Even On Our Night Off. May 27, 2014
  2. Anqi and Irene (H2O)
      • Anqi is the in-house R expert and is responsible for K-means and PCA.
      • Irene is the pencil-and-paper stats nerd and technical writer.
      • Part of a data science team that’s 75% women, and on a technical team that’s 23% women (well above average).
      Sergei (Collective)
      • VP, Data Sciences at Collective, where he is responsible for the architecture, development and scaling of data-driven technology products for digital advertising.
  3. What is H2O?
      • Same statistics, new volumes of data.
      • On a distributed cluster, models on a terabyte of data can finish in minutes.
      • Provides an interface that gives more people the power of data science.
      • H2O also hooks into R and Scala.
  4. Overview
      • Walk through the practical problem of what movie to go see together.
      • Examine the workflow from data to prediction, and let the best model inform our choice.
      • Extend to production-setting applications with a customer use case.
  5. MovieLens Data
      The data is the 100,000-observation MovieLens data set.
      Demographic features:
      Feature      Type      Summary
      State        Factor    Levels: 62; largest class: California
      Age          Integer   Range (7, 73); mean: 32.9
      Occupation   Factor    Levels: 21; largest class: Student
      Gender       Factor    Levels: 2; M:F is about 3:1
  6. Movie Classes
      Movies are classified by type; types are not mutually exclusive.
  7. Dependent Variable
      Users rated movies on a Likert scale of 1 to 5. We converted this to a binomial indicator:
      • Ratings >= 4: recoded to 1, indicating the user liked the movie
      • Ratings < 4: recoded to 0, indicating the user disliked the movie
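The recoding on slide 7 can be sketched in a few lines of Python. This is a minimal illustration that assumes ratings arrive as plain integers; the function name `recode_rating` is hypothetical and is not part of H2O.

```python
def recode_rating(rating: int) -> int:
    """Map a 1-5 Likert rating to the binomial indicator from the slide."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    # Ratings >= 4 count as "liked" (1); ratings < 4 count as "disliked" (0).
    return 1 if rating >= 4 else 0


ratings = [1, 2, 3, 4, 5]
labels = [recode_rating(r) for r in ratings]
print(labels)  # [0, 0, 0, 1, 1]
```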
  8. Super Models
      Both models predict the same dependent variable as a function of the same set of features.
      • First model: tree-based GBM. Start simple and let the model get as complex as it needs to with depth.
      • Alternative model: regularized GLM. Start with complexity and let the model generalize with regularization.
  9. WWIM
      • Uses gradient boosted classification on two classes.
      • GBM is nonparametric, which is great when there’s no theoretical model.
      • Accounts for complex interactions.
      • Control overfitting with the learning rate.
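To make the learning-rate point on slide 9 concrete, here is a toy sketch of gradient boosting for two classes, written in plain Python rather than with H2O's GBM. Everything in it is an assumption for illustration: the one-feature data, the depth-1 trees (stumps), and the helper names `fit_stump`, `boost`, and `predict_proba`. Each round fits a stump to the log-loss residuals and adds it to the score, scaled by the learning rate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Least-squares regression stump: (threshold, left value, right value)."""
    best = None
    # Skip the largest x as a threshold: it would leave the right side empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds, learning_rate):
    """Fit stumps to log-loss residuals, shrinking each by the learning rate."""
    base_rate = sum(ys) / len(ys)
    f0 = math.log(base_rate / (1.0 - base_rate))  # initial log-odds
    scores = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        scores = [s + learning_rate * (lv if x <= t else rv)
                  for x, s in zip(xs, scores)]
    return f0, learning_rate, stumps


def predict_proba(model, x):
    f0, lr, stumps = model
    score = f0 + sum(lr * (lv if x <= t else rv) for t, lv, rv in stumps)
    return sigmoid(score)


# Hypothetical toy data: one numeric feature, noisy like/dislike labels.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
fast = boost(xs, ys, n_rounds=5, learning_rate=1.0)
slow = boost(xs, ys, n_rounds=5, learning_rate=0.1)
print(predict_proba(fast, 8), predict_proba(slow, 8))
```

With the same number of rounds, the smaller learning rate leaves predictions closer to the base rate, which is the sense in which it restrains how quickly the model fits the training data.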
  10. WWAM: Alternative – Logistic GLM
      • Logistic binomial regression.
      • The end model is interpretable.
      • Control for overfitting by introducing a penalty into the objective function; this aids in feature selection and generalizability.
      • Ridge regression: all L2 penalty.
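The effect of the L2 penalty on slide 10 can be seen in a small sketch. This is not H2O's GLM solver; it is plain gradient descent on the mean log loss plus `(l2_penalty / 2) * w**2`, with hypothetical toy data and a hypothetical `fit_logistic` helper. The intercept is left unpenalized, which is a common convention but an assumption here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, l2_penalty, learning_rate=0.1, n_steps=2000):
    """Gradient descent on mean log loss + (l2_penalty / 2) * w**2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n + l2_penalty * w
        grad_b = sum(errors) / n  # intercept is left unpenalized
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data: one centered feature, labels that are not separable.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w_plain, _ = fit_logistic(xs, ys, l2_penalty=0.0)
w_ridge, _ = fit_logistic(xs, ys, l2_penalty=1.0)
print(w_plain, w_ridge)
```

The penalized coefficient comes out smaller in magnitude than the unpenalized one: the ridge term trades some fit on the training data for a more conservative model.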
  11. Rubber; Meet Road
      Comparison of error rates on the holdout set:
                              GBM Model   GLM Model
      Error on Dislike (0)    28%         30%
      Error on Like (1)       18%         50%
      Overall                 22%         40%
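For readers who want to reproduce a table like the one on slide 11, the three rows can be computed from holdout labels and predictions as sketched below. The labels and predictions here are made up for illustration and are not the talk's data; `error_rates` is a hypothetical helper.

```python
def error_rates(y_true, y_pred):
    """Return (error on class 0, error on class 1, overall error)."""
    def rate(pairs):
        pairs = list(pairs)
        return sum(t != p for t, p in pairs) / len(pairs)

    pairs = list(zip(y_true, y_pred))
    err_dislike = rate(pair for pair in pairs if pair[0] == 0)
    err_like = rate(pair for pair in pairs if pair[0] == 1)
    return err_dislike, err_like, rate(pairs)


# Hypothetical holdout labels and model predictions (0 = dislike, 1 = like).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
err_dislike, err_like, err_overall = error_rates(y_true, y_pred)
print(err_dislike, err_like, err_overall)
```

The overall error is the average of the two per-class errors weighted by each class's share of the holdout set, so it always falls between them.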
  12. Predictions
      GBM predictions: Like: 300, Her, Need For Speed. Dislike: Frozen, Peabody.
      GLM predictions: Like: 300, Her, Capt. America. Dislike: Frozen, Divergent.
  13. Lights Out - Some Closing Points
      We didn't address a serious problem here, but this is the general process used in a production environment. To give you a sense of the real-world implementation, we’ve asked one of our users to share his use case with you.
  14. “Stories change people, while statistics gives them something to argue about.” - Bernie Siegel
  15. [Diagram labels: Ad Server (publisher), Ad Server (advertiser), Agency, Browser, Brands, Publishers, Content, Inventory, Ads, Audience]
  16. Audience Modeling
      1. Build the Audience Cloud of stable cookies.
      2. Define target audience using cookie-level data.
      3. Assemble 1,000s of features on every cookie.
      4. Build a predictive model using machine learning.
      5. Score every cookie in the Audience Cloud.
      6. Create a targetable segment with the top X users.
      7. Adjust X daily to optimize delivery & performance.
      8. Rebuild models weekly (daily if warranted).
      Diagram labels: Audience Cloud (200M+ stable cookies); Target Audience (100K cookies); 1M cookies; 3M cookies.
      Audience Extension: audiences (age 25-40, buys toys, watches TNT).
      Audience Optimization: actions (clicks, online purchases).
      bit.ly/MLatScale: preprint of paper submitted to KDD’14.
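Steps 5 and 6 of the Audience Modeling slide (score every cookie, then keep the top X) can be sketched as follows. The cookie IDs and scores are invented for illustration, and `top_x_segment` is a hypothetical helper, not Collective's implementation.

```python
import heapq


def top_x_segment(scores, x):
    """Return the x cookie IDs with the highest model scores, best first."""
    return heapq.nlargest(x, scores, key=scores.get)


# Hypothetical model scores keyed by cookie ID.
cookie_scores = {
    "cookie_a": 0.91,
    "cookie_b": 0.12,
    "cookie_c": 0.67,
    "cookie_d": 0.45,
}
segment = top_x_segment(cookie_scores, x=2)
print(segment)  # ['cookie_a', 'cookie_c']
```

Because X is just a parameter of the selection, adjusting it daily (step 7) does not require rescoring or rebuilding the model.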
  17. Modeling Platform (+ H2O)
      Model building: computing predictive models, Current vs. Future.
      Data sizes (size of data): 1 million, 1,000, 1 billion, 100,000
      Algorithm (complexity and performance): GBM, glmnet
      Scoring (predicting outcomes): Batch, Real Time
