Missing Values in Recommender Models
Parmeshwar Khurd, Ehsan Saberian & Maryam Esmaeili
ML Platform Meetup, 6/20/2019
Talk Outline
● Problem Statement: Missing Features in Recommender Systems (RS)
● Handling Missing Features in GBDTs
● Handling Missing Features in NNs
● Conclusion
01
Problem Statement
Missing Observations vs. Missing Features
● Scientists and engineers in the mathematical sciences have long dealt with
the problem of missing observations
● Typical patterns in physics:
a. Astronomers fill in missing observations for orbits via least-squares: the
Ceres example
b. Models to explain all observations, including missing / future ones
i. A physicist proposes a new model explaining past observations that previous models
cannot adequately explain
ii. She realizes the new model predicts events for which past observations do not exist
iii. New observations are collected to validate the new model
Physics Example from 100 Years Ago
● Einstein proposed general relativity model for
gravitation in 1915, an improvement over Newtonian
models, with two striking examples:
○ It better explained known observed shifts in
perihelion (closest point to Sun) of Mercury’s orbit
○ It predicted the as-yet-unmeasured bending of light
from distant stars, e.g., during a solar eclipse:
a bending of ~1.75 arc-seconds, twice the Newtonian
prediction. An arms race to validate it experimentally
followed; Eddington succeeded in May 1919
Non-parametric / Big-data Correlational Models
● We have already talked about several complex models:
○ Correlational: assume a time-dependent elliptical functional form for the planetary orbit, fit/regress
parameters assuming normal noise to fill in missing past coordinates and predict future motion
○ Causal: Newton's and Einstein's general causal models for gravitation yield PDEs for planetary motion and
functional forms of orbits / perihelion shifts, and suggested new observations no one had thought to measure
● In the rest of the talk, we focus on correlational models, but ours are statistical and more complex:
○ trained on more data (both more features and samples)
○ non-parametric (decision trees) or with many parameters (neural networks)
● Here, the observation is not missing, only a part of it:
○ an incomplete observation is called an observation with missing data
○ if the input is incomplete, it is an observation with missing features
Improving Correlational ML Models in RS
● Given a context, a predictive ML model in a recommender system (RS) needs
to match users with items they might enjoy
● Thankfully, as ML engineers in the recommendation space, we need less
creativity and labor than Einstein / Eddington to improve models
● In supervised ML models, we can time-travel our (features, labels) to see
if our newer predictive models improve performance on historical offline
metrics [Netflix Delorean ML blog]
● Model improvements come from leveraging
○ business information (more appropriate metrics or inputs)
○ ML models: BERT, CatBoost, Factorization Machines, etc.
Problem of Missing Data in RS - I
● ML models in RS need to deal with missing data patterns for cases such as:
○ New users
○ New contexts (e.g., country, time-zone, language, device, row-type)
○ New items
○ Timeouts and failures in data microservices
○ Modeling causal impact of recommendations
○ Intent-to-treat
● Unfortunately, the last two problems are similar to the Einstein/Eddington example:
solutions involve causal models / contextual bandits and are discussed elsewhere [Netflix talk]
● We do not handle missing labels: optimizing an RS for longer-term reward (label) is a harder
problem [Netflix talk]
Problem of Missing Data in RS - II
Guiding principles in this talk for RS cold-start and other correlational missing-feature problems
● Let ML models handle missing values rather than imputing and/or adding
features (via models or simple statistics)
○ Both GBDTs and NNs allow this
● ML models are generally better at interpolation than extrapolation
○ Many past examples of the service handling new users, items and contexts
○ For robust extrapolation during timeouts or data service failures, add simulated
examples in training and/or impose feature monotonicity constraints (see the sketch below)
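For instance, XGBoost exposes per-feature monotonicity constraints directly. A minimal sketch, with two hypothetical features (popularity, days-since-release) and illustrative monotone directions, not the talk's actual model:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
# Hypothetical features: [item_popularity, days_since_release]
X = rng.random((1000, 2))
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(1000)

# Constrain the score to be non-decreasing in popularity (+1) and
# non-increasing in age (-1), so extrapolation stays sane when an
# upstream timeout substitutes an extreme default value.
model = xgb.XGBRegressor(n_estimators=50, monotone_constraints="(1,-1)")
model.fit(X, y)
```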
New Users - I
● New users join Netflix every minute
New Users - II
● We get some taste information in
the sign-up flow
● But clearly, we don’t know enough
(what they have watched elsewhere,
their broader tastes, etc.) to
personalize well
● Rather than try to extrapolate into
the past, personalize progressively
better as they interact with our
service
New Contexts
● ML models in search / recommender systems need to respect user
language choice
● As new languages are supported, these choices will grow
New Items - I
● New items are added to the Netflix service every day
[Image: newly added titles, e.g., SNL]
New Items - II
● New items lack any features based on engagement data
● “Coming Soon” tab shows
trailers
○ This tab needs a
personalized ranker as well
02
Handling Missing Features in
GBDTs
GBDT for RS
● Several packages to train GBDTs: XGBoost, R’s GBM, CatBoost,
LightGBM, Cognitive Foundry, sklearn, etc.
● XGBoost won several structured data Kaggle competitions
● Netflix talk on fast scoring of XGBoost models
● Dwell-time for Yahoo homepage recommender (RecSys 2014 Best Paper)
Source: XGBoost
(S)GBDT Background - I
Training Stochastic Gradient Boosted Decision Trees (SGBDTs) for (logistic) loss
minimization consists of one main algorithm (greedily learn the ensemble) and two
sub-algorithms (learn an individual tree, learn the split at each node of a tree).
Per boosting round:
● Get the gradient of the (logistic) loss per example w.r.t. the current ensemble
● Learn the tree structure
● Learn the leaf coefficients by one iteration of Newton-Raphson
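A minimal sketch of this loop for logistic loss, with sklearn regression trees standing in for the tree sub-algorithm (simplified relative to any production SGBDT):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_sgbdt_logistic(X, y, n_rounds=100, lr=0.1, subsample=0.8):
    """y in {0, 1}; returns the list of (tree, leaf_values) stages."""
    rng = np.random.default_rng(0)
    F = np.zeros(len(y))                      # current ensemble scores (log-odds)
    stages = []
    for _ in range(n_rounds):
        p = 1.0 / (1.0 + np.exp(-F))          # sigmoid of current ensemble
        grad = p - y                          # gradient of logistic loss w.r.t. F
        hess = p * (1.0 - p)                  # Hessian, used for the Newton step
        idx = rng.random(len(y)) < subsample  # the "stochastic" in SGBDT
        # Sub-algorithm 1: learn the tree structure on the negative gradient
        tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], -grad[idx])
        # Sub-algorithm 2: one Newton-Raphson iteration per leaf:
        # leaf value = -sum(grad) / sum(hess) over the examples in the leaf
        leaf = tree.apply(X)
        value = {l: -grad[leaf == l].sum() / hess[leaf == l].sum()
                 for l in np.unique(leaf)}
        F += lr * np.array([value[l] for l in leaf])
        stages.append((tree, value))
    return stages
```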
(S)GBDT Background - II
Per node of a tree:
● Find the best split via variance reduction
● Learn the left and right subtrees recursively
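A minimal sketch of the Best-Split sub-algorithm for one continuous feature; the O(n²) scan is kept naive for clarity (real learners maintain running sums):

```python
import numpy as np

def best_split(x, g):
    """Return (threshold, variance_reduction) for targets g on feature x."""
    order = np.argsort(x)
    x, g = x[order], g[order]
    n = len(g)
    before = n * g.var()                    # total variance before splitting
    best_t, best_red = None, 0.0
    for i in range(1, n):
        if x[i] == x[i - 1]:
            continue                        # cannot split between equal values
        after = i * g[:i].var() + (n - i) * g[i:].var()
        if before - after > best_red:
            best_t, best_red = (x[i - 1] + x[i]) / 2.0, before - after
    return best_t, best_red
```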
Missing Value Handling w GBDTs: Taxonomy
● ESL-II (Section 9.6) mentions three ways to handle missing values:
○ Discard observations with any missing values
○ Impute all missing values before training via models or simple statistics:
item popularities may be initialized randomly, to zero, or via weighted averaging, where
weights may indicate similarity determined via meta-data
○ Rely on the learning algorithm to deal with missing values in its training
phase, e.g., via surrogate splits or the non-strict split usage in the tree below
(a small encoding sketch follows this list):
■ Categoricals can include one more “missing” category
■ Continuous / categorical:
● Send example left or right for missing value appropriately (XGBoost)
● Use ternary split with missing branch (R’s GBM)
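For example, these in-training conventions look like the following on a toy frame (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["US", None, "FR"],                # categorical feature
    "days_since_release": [3.0, np.nan, 120.0],   # continuous feature
})
# Categorical: add one more explicit "missing" category
df["country"] = df["country"].fillna("(missing)").astype("category")
# Continuous: leave NaN in place; XGBoost then learns a default direction
# per split, while R's GBM routes NaN down a third "missing" branch
```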
Missing Value Handling w R’s GBM
● Use a ternary split with a missing branch:
○ The weighted variance reduction in the Best-Split algorithm is updated to
include the variance of the missing branch (sketch below)
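A minimal sketch of the updated objective: the weighted variance now includes a third, missing-value branch (a hypothetical helper, not R's GBM source):

```python
import numpy as np

def ternary_split_variance(g, go_left, is_missing):
    """Weighted variance after a {left, right, missing} ternary split.
    Variance reduction = parent weighted variance minus this value."""
    branches = [g[go_left & ~is_missing],    # left branch
                g[~go_left & ~is_missing],   # right branch
                g[is_missing]]               # missing branch
    return sum(len(b) * b.var() for b in branches if len(b) > 0)
```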
Missing Value Handling w XGBoost
Always send examples with missing values left or right, whichever is better:
● Evaluate the best threshold and variance reduction in the Best-Split
algorithm when missing values are sent left and when they are sent right
(post-hoc), then pick the better choice (sketch below)
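A minimal sketch of that post-hoc choice for one candidate threshold (a hypothetical helper mirroring the idea, not XGBoost's source):

```python
import numpy as np

def gain_with_default_direction(g, x, threshold):
    """Variance reduction at `threshold`, trying NaN-left vs. NaN-right."""
    missing = np.isnan(x)
    parent = len(g) * g.var()
    best_gain, best_dir = -np.inf, None
    for direction in ("left", "right"):
        # Missing values follow the candidate default direction
        go_left = np.where(missing, direction == "left", x < threshold)
        left, right = g[go_left], g[~go_left]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = parent - (len(left) * left.var() + len(right) * right.var())
        if gain > best_gain:
            best_gain, best_dir = gain, direction
    return best_gain, best_dir
```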
03
Handling Missing Features in
NNs
Recurrent Neural Network (NN) for RS
● YouTube's latent cross recurrent NN, WSDM 2018 (sketch below)
● Trained with TensorFlow/Keras
○ Other options include PyTorch, MXNet, CNTK, etc.
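A minimal Keras sketch of the latent-cross idea, where a context embedding multiplies the recurrent hidden state element-wise as (1 + w_context) * h (vocabulary sizes and dimensions are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

ITEM_VOCAB, CTX_VOCAB, DIM = 10000, 50, 64

items = layers.Input(shape=(None,), dtype="int32")    # watch-history ids
context = layers.Input(shape=(), dtype="int32")       # e.g., device or country

h = layers.Embedding(ITEM_VOCAB, DIM)(items)
h = layers.GRU(DIM)(h)                                # sequence representation

w = layers.Embedding(CTX_VOCAB, DIM)(context)         # context embedding
h = h * (1.0 + w)                                     # the latent cross

logits = layers.Dense(ITEM_VOCAB)(h)                  # next-item scores
model = tf.keras.Model([items, context], logits)
```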
Missing Value Handling w NNs: Taxonomy
● Similar taxonomy as in the case of GBDTs
○ Discard observations with any missing values
■ Dropout-style: drop connections with missing values, scale up the others
○ Impute all missing values before training via models or simple
statistics: Item embeddings may be initialized randomly or to zeros or via weighted
averaging, where weights may indicate similarity determined via meta-data
○ Rely on the learning algorithm to deal with missing values in its
training phase via hidden layers
■ Categoricals: a single "missing" item / hidden embedding, or DropoutNet (NIPS17)
■ Continuous / Categorical: impute continuous values and include a "missing"
embedding, or have the first hidden layer compute an "average" activation for the
missing feature or item (NIPS18)
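A minimal sketch of the single "missing" embedding: reserve one id (here 0, an illustrative convention) that all absent or never-seen categorical values share:

```python
import numpy as np
import tensorflow as tf

MISSING_ID = 0                    # reserved row of the embedding table

def to_id(vocab, item):
    """Unknown, new, or missing items all map to MISSING_ID."""
    return vocab.get(item, MISSING_ID)

vocab = {"stranger_things": 1, "the_crown": 2}        # real ids start at 1
ids = np.array([to_id(vocab, t)
                for t in ["the_crown", "brand_new_title", None]])  # [2, 0, 0]
emb = tf.keras.layers.Embedding(input_dim=3, output_dim=4)
vectors = emb(ids)    # cold-start items all receive the learned "missing" row
```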
Missing Value Handling w DropoutNet
An auto-encoder-style network in which the item/user preference vector is randomly
retained or set to zero/average during training (sketch below)
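A minimal sketch of the training-time input dropout (rates and shapes are illustrative, not the paper's exact recipe):

```python
import numpy as np

def dropoutnet_batch(pref, content, p_drop=0.5, rng=np.random.default_rng(0)):
    """Randomly zero the CF preference vectors to simulate cold starts,
    forcing the network to score from content features alone."""
    keep = (rng.random(len(pref)) >= p_drop).astype(pref.dtype)
    return np.concatenate([pref * keep[:, None], content], axis=1)
```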
Missing Value Handling w Hidden “Average”
A partly closed-form "average" of the first hidden-layer activation, with missing
features modeled by a GMM (sketch below)
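The key closed form: if the pre-activation w·x + b is Gaussian N(m, s²) after marginalizing the missing coordinates, then E[ReLU] = m·Φ(m/s) + s·φ(m/s). A minimal sketch with a single diagonal Gaussian per example rather than the paper's full GMM:

```python
import numpy as np
from scipy.stats import norm

def expected_relu(w, b, x, mu, var, missing):
    """E[ReLU(w.x + b)] with missing coords of x drawn from N(mu, diag(var))."""
    xm = np.where(missing, mu, x)               # means fill the missing dims
    m = w @ xm + b                              # mean of the pre-activation
    s = np.sqrt(np.sum(w**2 * var * missing))   # std comes from missing dims
    if s == 0.0:                                # nothing missing: plain ReLU
        return max(m, 0.0)
    return m * norm.cdf(m / s) + s * norm.pdf(m / s)
```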
Conclusion
● There are a variety of ways to handle missing values in recommender models
● We only presented the subset of approaches that do not modify / impute
inputs and instead treat missing values within the training algorithm
● The optimal approach for a given problem is likely dataset-dependent!
References
● How Gauss Determined the Orbit of Ceres, J. Tennenbaum, et al.
● Why Beauty Is Truth: A History of Symmetry, I. Stewart
● May 29, 1919: A Major Eclipse, Relatively Speaking, L. Buchen, Wired
● Delorean, H. Taghavi, et al., Netflix
● Bandits for Recommendations, J. Kawale, et al., Netflix
● Longer-term outcomes, B. Rostykus, et al., Netflix
● Speeding up XGBoost Scoring, D. Parekh, et al., Netflix
● Beyond Clicks: Dwell Time for Personalization, X. Yi, et al.
● Latent Cross: Making Use of Context in Recurrent Recommender Systems, A. Beutel, et al.
● ESL-II: The Elements of Statistical Learning, T. Hastie, R. Tibshirani, J. Friedman
● R's GBM
● XGBoost
● Processing of Missing Data by Neural Networks, M. Smieja, et al.
● DropoutNet: Addressing Cold Start in Recommender Systems, M. Volkovs, et al.
● Inference and Missing Data, D. B. Rubin, Biometrika, 63, 581–592
Acknowledgments
The presenters wish to thank J. Basilico, H. Taghavi, Y. Raimond, S. Das, J. Kim, A. Deoras, C. Alvino and several
others for discussions and contributions
Thank You!
