
Neural Field aware Factorization Machine

NFFM is a model that predicts user behaviour, such as install rate, video-completion rate, or click-through rate, when users engage with mobile ads.


  1. Why Neural Field-aware Factorization Machines are able to break new ground in digital-behaviour prediction. Presenter: Gunjan Sharma. Co-author: Varun Kumar Modi
  2. About the authors. Presenter: Gunjan Sharma - System Architect @ InMobi (3 years), SE @ Facebook (2.5 years), DPE @ Google (1 year). Twitter handle: @gunjan_1409. LinkedIn: https://www.linkedin.com/in/gunjan-sharma-a6794414/ Co-author: Varun Kumar Modi - Sr Research Scientist @ InMobi (5 years). LinkedIn: https://www.linkedin.com/in/varun-modi-33800652/
  3. Content 1) The problem and context 2) The Motivation 3) Building the model theory: piece by piece 4) Results of the 2 use cases 5) Understanding exactly why it works 6) Implementation at InMobi scale
  4. Content 1) The problem and context 2) The Motivation 3) Building the model theory: piece by piece 4) Results of the 2 use cases 5) Understanding exactly why it works 6) Implementation at InMobi scale
  5. InMobi is one of the largest advertising platforms at scale globally. InMobi reaches >2 billion MAU across the world and is specialised in mobile in-app advertising. [World map of regions: North America (video only), Latin America, EMEA, Africa, India + SEA, China, Japan, Korea, ANZ, APAC.] Consolidation has taken place to clean up the ecosystem, so few advertising platforms at scale exist. A very limited number of players have a presence in Asia, where InMobi is dominant; a few players control each component of the chain, with no presence of global players except InMobi.
  6. Problem statement and why it matters ● What are the problems: Use case 1 - Conversion ratio (CVR) prediction: CVR = install rate of users = probability of an install given a click. Usage: CPM = CTR * CVR * CPI (see the worked example below). Use case 2 - Video completion rate (VCR) prediction: the completion rate of users watching advertising videos, given a click ● Why are they important: ○ This is a performance business based on arbitrage, so the model directly determines the margin/profit of the business and the ability of a campaign to achieve significant scale => multi-million-dollar businesses!
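To make the arbitrage concrete, here is a toy evaluation of the slide's pricing formula; every number below is invented purely for illustration:

    # Toy numbers plugged into the slide's formula CPM = CTR * CVR * CPI.
    ctr = 0.02             # predicted click-through rate
    cvr = 0.01             # predicted install rate - the quantity NFFM estimates
    cpi = 2.50             # advertiser's cost per install, in dollars
    cpm = ctr * cvr * cpi  # expected value per impression under the slide's formula
    print(cpm)             # 0.0005 - scaling to a per-mille price would multiply by 1000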
  7. Existing context and challenges ● Models traditionally used: Linear/Logistic Regression and tree-based models ● Both have their strengths and weaknesses when used in production ● What we need is a model that sits somewhere in the middle and brings in the best of both worlds:
      LR                                  | Tree-based
      Generalises to unseen combinations  | Could not, for our use cases
      Potentially underfits at times      | Potentially overfits at times
      Requires less RAM                   | Can bloat RAM usage, especially with high-cardinality features
  8. Content 1) The problem and context 2) The Motivation 3) Building the model theory: piece by piece 4) Results of the 2 use cases 5) Understanding exactly why it works 6) Implementation at InMobi scale
  9. Why think of NNs for CVR/VCR prediction ● Using cross features in LR wasn't cutting it for us ● Plus, at some point it becomes cumbersome at both training and prediction time ● All the major predictions noted here follow a complex curve ● LR left much to be desired compared to tree-based models, for example, because its interaction terms are limited ● We tried a couple of other models that were also not able to beat the tree-based models. We all agreed that neural nets are a suitable technology for finding higher-order interactions between our features, while also having the power to generalise to unseen combinations.
  10. Challenges involved ● Traditionally, NNs are mostly used for classification problems, whereas we want to model our predictions as a regression problem ● Most of the features are categorical, which means we need one-hot encoding ● This makes NNs produce very poor results, as they need a lot of data to train efficiently ● Plus, the cardinality of some features is very high, which makes life more troublesome ● The model should be easy to productionise, both for training and serving ● Spark isn't suited to custom NN architectures ● The model should be as debuggable as possible, to be able to explain changes to the business ● The long-standing resistance to using NNs came from the lack of understanding of their internals
  11. Content 1) The problem and context 2) The Motivation 3) Building the model theory: piece by piece 4) Results of the 2 use cases 5) Understanding exactly why it works 6) Implementation at InMobi scale
  12. Consider the following dummy dataset:
      Publisher | Advertiser | Gender | CVR
      ESPN      | Nike       | Male   | 0.01
      CNBC      | Nike       | Male   | 0.0004
      ESPN      | Adidas     | Female | 0.008
      Sony      | Coke       | Female | 0.0005
      Sony      | P&G        | Male   | 0.002
  13. Factorization Machines (FM) - what are those? [Diagram: the row (Publisher=ESPN, Advertiser=Nike, Gender=Male, CVR=0.01) one-hot encoded over ESPN/CNBC/Sony, Adidas/Nike/Coke/P&G, Male/Female, i.e. 1 0 0 | 0 1 0 | 1 0; each active value maps to a latent vector.] PV = Publisher latent vector (X0, X1, X2), AV = Advertiser latent vector (Y0, Y1, Y2), GV = Gender latent vector (Z0, Z1, Z2). PVᵀ*AV + AVᵀ*GV + GVᵀ*PV = pCVR. NOTE: all vectors are K-dimensional, where K is a hyperparameter of the algorithm.
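As a sanity check on the notation, a minimal NumPy sketch of this score; K, the vector values, and the names are all illustrative:

    import numpy as np

    K = 4                                                 # latent dimension (a hyperparameter)
    rng = np.random.default_rng(0)
    PV, AV, GV = (rng.normal(size=K) for _ in range(3))   # latent vectors of the active values

    # PV^T*AV + AV^T*GV + GV^T*PV = pCVR
    p_cvr = PV @ AV + AV @ GV + GV @ PV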
  14. Factorization Machines (FM) - what are those? ● A K-dimensional representation for every feature value ● Captures second-order interactions across all the features (AᵀB = |A|*|B|*cos(θ)) ● Essentially a combination of hyperbolas summed up to form the final prediction ● Works better than LR, but tree-based models are still more powerful ● E.g., predicting a movie's revenue: the original features are Movie, City, Gender; the latent features might be Horror, Comedy, Action, Romance. Second-order intuition: for every latent feature and every pair of original features, how much does this latent feature affect revenue when considering that pair? The final predicted revenue is a linear sum over all latent features.
  15. Field-aware Factorization Machines (FFM) [Diagram: the same one-hot row, but each feature value now has one latent vector per other field: publisher vectors PV_A (XA0, XA1, XA2, toward Advertiser) and PV_G (XG0, XG1, XG2, toward Gender); advertiser vectors AV_P (YP0, YP1, YP2) and AV_G (YG0, YG1, YG2); gender vectors GV_P (ZP0, ZP1, ZP2) and GV_A (ZA0, ZA1, ZA2).] PV_Aᵀ*AV_P + AV_Gᵀ*GV_A + GV_Pᵀ*PV_G = pCVR. NOTE: all vectors are K-dimensional, where K is a hyperparameter of the algorithm.
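A minimal NumPy sketch of the field-aware version; the six vectors and their names are illustrative stand-ins for the diagram's PV_A, PV_G, AV_P, AV_G, GV_P, GV_A:

    import numpy as np

    K = 4
    rng = np.random.default_rng(0)
    PV_A, PV_G = rng.normal(size=(2, K))   # ESPN's vectors toward the Advertiser / Gender fields
    AV_P, AV_G = rng.normal(size=(2, K))   # Nike's vectors toward the Publisher / Gender fields
    GV_P, GV_A = rng.normal(size=(2, K))   # Male's vectors toward the Publisher / Advertiser fields

    # PV_A^T*AV_P + AV_G^T*GV_A + GV_P^T*PV_G = pCVR
    p_cvr = PV_A @ AV_P + AV_G @ GV_A + GV_P @ PV_G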
  16. Field-aware Factorization Machines (FFM) ● We have a K-dimensional vector for every feature value, for every other feature type ● Still second-order interactions, but with more degrees of freedom than FM ● Intuition: latent features interact with every other cross feature differently ● Works significantly better than FM, but at certain cuts it was still not able to beat the tree-based model
  17. Deep neural net with Factorisation Machines: DeepFM. Sigmoid(FM + NeuralNet(PV :+ AV :+ GV)) = pCVR, where :+ denotes vector concatenation.
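A toy NumPy sketch of this combination, with one invented hidden layer; the shapes and sizes are illustrative, not the production graph:

    import numpy as np

    rng = np.random.default_rng(0)
    K = 4
    PV, AV, GV = (rng.normal(size=K) for _ in range(3))
    W1, b1 = rng.normal(size=(8, 3 * K)), np.zeros(8)      # one toy ReLU hidden layer
    w2, b2 = rng.normal(size=8), 0.0

    fm_score = PV @ AV + AV @ GV + GV @ PV                 # FM branch (second order)
    deep_in = np.concatenate([PV, AV, GV])                 # :+ concatenation of the latent vectors
    deep_score = w2 @ np.maximum(W1 @ deep_in + b1, 0.0) + b2
    p_cvr = 1.0 / (1.0 + np.exp(-(fm_score + deep_score))) # Sigmoid(FM + NeuralNet(...))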
  18. DeepFM ● Now we are entering the neural-net world ● This model is a combination of FM and an NN, and the final prediction is the sum of the outputs of the two models ● Here we optimize the entire graph together ● It performs better than taking the latent vectors from FM and then running them through a neural net as a secondary optimization (FNN) ● It performs better than FM, but not better than FFM ● Intuition: FM finds the second-order interactions, while the neural net uses the latent vectors to find higher-order nonlinear interactions.
  19. Neural Factorization Machines: NFM. NeuralNet(PV.*AV .+ AV.*GV .+ GV.*PV) = pCVR, where .* is the element-wise product and .+ the element-wise sum.
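A toy sketch of this bi-interaction idea; note the net's input is a single K-dimensional vector rather than the 3K-dimensional concatenation DeepFM uses (sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    K = 4
    PV, AV, GV = (rng.normal(size=K) for _ in range(3))
    W1, b1 = rng.normal(size=(8, K)), np.zeros(8)
    w2 = rng.normal(size=8)

    bi = PV * AV + AV * GV + GV * PV             # bi-interaction: still K-dimensional
    p_cvr = w2 @ np.maximum(W1 @ bi + b1, 0.0)   # NN on the second-order vector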
  20. NFM ● In this architecture, only the second-order interactions are run through the NN, instead of the raw latent vectors ● Intuition: the neural net takes the second-order interactions and uses them to find higher-order nonlinear interactions ● Performs better than DeepFM, mostly attributable to two facts: ○ the net is smaller, hence it converges faster ○ the neural net can easily convert the second-order interactions into higher-order ones ● Still not better than FFM, though
  21. InMobi spec: DeepFFM [Architecture diagram: sparse features Feature1-Feature3 → field-aware dense embeddings (F1E, F2E, F3E, one per interacting field) → two branches, an FF machine and hidden layers with activations → combined into Ypred.]
  22. InMobi spec: DeepFFM ● A simple upgrade to DeepFM ● Performs better than both DeepFM and FFM ● Training is slower ● The FFM part does the majority of the prediction heavy lifting, evidently due to faster gradient convergence ● Intuition: take the latent vectors and run them through an NN for the higher-order interactions, and use FFM for the second-order interactions.
  23. InMobi spec: NFFM [Architecture diagram: sparse features Feature1-Feature3 → field-aware dense embeddings (F1E, F2E, F3E per interacting field) → FF machine producing K second-order interaction inputs → hidden layers → Ypred.]
  24. InMobi spec: NFFM ● A simple upgrade to NFM ● Does significantly better than all the other models ● Converges faster than DeepFFM ● Intuition: take the second-order interactions from FFM and run them through a neural net to find the higher-order nonlinear interactions (see the sketch below).
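Putting the two previous sketches together, a toy NFFM forward pass; all names and sizes are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    K = 4
    PV_A, PV_G = rng.normal(size=(2, K))
    AV_P, AV_G = rng.normal(size=(2, K))
    GV_P, GV_A = rng.normal(size=(2, K))
    W1, b1 = rng.normal(size=(8, K)), np.zeros(8)
    w2 = rng.normal(size=8)

    bi = PV_A * AV_P + AV_G * GV_A + GV_P * PV_G   # field-aware second-order vector (the K inputs)
    y_pred = w2 @ np.maximum(W1 @ bi + b1, 0.0)    # hidden layers → Ypred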
  25. Content 1) The problem and context 2) The Motivation 3) Building the model theory: piece by piece 4) Results of the 2 use cases 5) Understanding exactly why it works 6) Implementation at InMobi scale
  26. Use case 1 - Results: CVR. Accuracy function (transcribed as code below):
      Accuracy basis = (Σᵢ Wᵢ * |Yactᵢ - Ypredᵢ|) / Σᵢ Wᵢ
      Model                                                | FFM | DeepFM | DeepFFM | NFFM
      Accuracy % improvement over linear model (small DS)  | 44% | 35%    | 48%     | 64%
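The accuracy basis expressed as a function (a direct reading of the formula; array shapes are assumptions):

    import numpy as np

    def weighted_abs_error(w, y_act, y_pred):
        # sum(W_i * |Yact_i - Ypred_i|) / sum(W_i), element-wise over parallel arrays
        w, y_act, y_pred = map(np.asarray, (w, y_act, y_pred))
        return (w * np.abs(y_act - y_pred)).sum() / w.sum()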
  27. Use case 1 - Results: CVR.
      Training data dates | Test date | Accuracy % improvement over linear model
      T1-T7               | T7        | 21%
      T1-T7               | T8        | 14%
      T2-T8               | T8        | 20%
      T2-T8               | T9        | 14%
      % improvement over tree model: Cut1 21.7%, Cut2 18.5%
  28. Use case 2 - Results: VCR. Error function, AEPV (Absolute Error Per View), transcribed as code below:
      AEPV = (Σᵢ [(Viewsᵢ - Cmpltdᵢ) * |Ypredᵢ| + Cmpltdᵢ * |1 - Ypredᵢ|]) / Σᵢ Viewsᵢ
      % AEPV improvement by country-OS cut over the last-7-day average:
      Cut  | Logistic Reg | Logistic Reg (2nd-order autoregressive features) | LR (GBT-based feature engineering) | NFFM
      Cut1 | -3.71%       | 2.30%                                            | 2.51%                              | 3.00%
      Cut2 | -2.16%       | 3.05%                                            | 4.48%                              | 28.83%
      Cut3 | -0.31%       | -0.56%                                           | 5.65%                              | 12.47%
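The AEPV formula as a function: non-completed views are scored against 0 and completed views against 1 (array shapes are assumptions):

    import numpy as np

    def aepv(views, completed, y_pred):
        # per row: (views - completed) * |Ypred| + completed * |1 - Ypred|, normalised by total views
        views, completed, y_pred = map(np.asarray, (views, completed, y_pred))
        err = (views - completed) * np.abs(y_pred) + completed * np.abs(1.0 - y_pred)
        return err.sum() / views.sum()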
  29. Use case 2 - Results: VCR ● LR with L2 regularisation ● 2nd-order features were selected based on an information-gain criterion ● The GBT package in Spark MLlib was used (numTrees = 400, maxDepth = 8, sampling = 0.5, minInstancePerNode = 10) ○ The training process was too slow, even with large enough resources ○ XGBoost with Spark (tried later) was faster and resulted in further improvements ● NFFM: increasing the number of layers up to 3 resulted in a further 20% improvement in validation error, with no significant improvement after that
  30. Content 1) The problem and context 2) The Motivation 3) Building the model theory: piece by piece 4) Results of the 2 use cases 5) Understanding exactly why it works 6) Implementation at InMobi scale
  31. Building the full intuition. Factorisation machine: ● Handles categorical features and a sparse data matrix ● Extracts latent variables, e.g., identifying non-explicit segment profiles in the population. Field-aware: ● Dimensionality reduction (high-cardinality features to a K-dimensional representation) ● Increases degrees of freedom (compared to FM, in terms of field-specific values) to enable an exhaustive set of second-order interactions. Neural network: ● Explores and weights higher-order interactions - we went up to 3 layers of interaction successfully ● Generates the numerical prediction ● Trains the factors based on the performance of both the FM machine and the neural net (instead of training them separately, which would leave the latent vectors limited by the power of FM)
  32. Content 1) The problem and context 2) The Motivation 3) Building the model theory: piece by piece 4) Results of the 2 use cases 5) Understanding exactly why it works 6) Implementation at InMobi scale
  33. Implementation details ● Hyperparameters are k, lambda, the number of layers, the number of nodes per layer, and the activation functions ● Implemented in TensorFlow ● Adam optimizer ● L2 regularization; no dropout ● No batch normalization ● 1 layer of 100 nodes performs well enough and saves compute ● ReLU activations (converge faster) ● k = 16 (try powers of 2) ● Weighted RMSE as the loss function for both use cases (see the sketch below)
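A minimal Keras sketch of this configuration; the input stands in for the K-dimensional FFM interaction vector, and the L2 factor is an invented value, not InMobi's actual graph:

    import tensorflow as tf

    K, L2 = 16, 1e-6                                # k=16 as stated; L2 strength is a guess
    reg = tf.keras.regularizers.l2(L2)

    x = tf.keras.Input(shape=(K,))                  # placeholder for the second-order vector
    h = tf.keras.layers.Dense(100, activation="relu", kernel_regularizer=reg)(x)
    y = tf.keras.layers.Dense(1)(h)
    model = tf.keras.Model(x, y)

    def weighted_rmse(w, y_act, y_pred):
        # the slide's loss, written as a plain function over tensors
        return tf.sqrt(tf.reduce_sum(w * tf.square(y_act - y_pred)) / tf.reduce_sum(w))

    model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")  # weights enter via sample_weight in practice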
  34. Predicting for unseen feature values ● For unknown values, use the average of the latent feature interactions per feature (see the sketch below). [Diagram: for an unknown publisher alongside ESPN/CNBC/Sony, each field-aware vector is the mean of the known publishers' vectors, e.g. ((XA0+YA0+ZA0)/3, (XA1+YA1+ZA1)/3, (XA2+YA2+ZA2)/3) toward the Advertiser field, and ((XG0+YG0+ZG0)/3, (XG1+YG1+ZG1)/3, (XG2+YG2+ZG2)/3) toward the Gender field.]
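A NumPy sketch of this fallback; the names mirror the diagram, but the values are invented:

    import numpy as np

    rng = np.random.default_rng(0)
    K = 3
    X_A, Y_A, Z_A = rng.normal(size=(3, K))   # ESPN/CNBC/Sony vectors toward Advertiser
    X_G, Y_G, Z_G = rng.normal(size=(3, K))   # ESPN/CNBC/Sony vectors toward Gender

    unknown_A = (X_A + Y_A + Z_A) / 3         # stands in wherever a publisher-advertiser term is needed
    unknown_G = (X_G + Y_G + Z_G) / 3         # and likewise for publisher-gender terms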
  35. Implementing @ low latency, high scale ● MLeap: the MLeap framework supports models trained both in Spark and TensorFlow; it lets us train tree-based models in Spark and NN-based models in TF ● Offline training and challenges: we cannot train TF models on the YARN cluster, hence we use a GPU machine as a gateway to pull data from HDFS and train on the GPU ● Online serving challenges: TF Serving has fairly low throughput and wasn't scaling to our QPS, hence we use a local LRU cache with a decent TTL to scale TF Serving (a sketch follows below)
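A hedged sketch of the kind of TTL'd LRU cache described; this is not InMobi's code, and an off-the-shelf structure such as cachetools.TTLCache would serve equally well:

    import time
    from collections import OrderedDict

    class TTLCache:
        """LRU cache whose entries also expire after a fixed TTL (seconds)."""

        def __init__(self, maxsize=100_000, ttl=60.0):
            self.maxsize, self.ttl, self.data = maxsize, ttl, OrderedDict()

        def get(self, key):
            item = self.data.get(key)
            if item is None:
                return None                       # miss: caller falls through to TF Serving
            value, expires = item
            if time.monotonic() > expires:
                del self.data[key]                # stale entry: drop and report a miss
                return None
            self.data.move_to_end(key)            # refresh LRU position
            return value

        def put(self, key, value):
            self.data[key] = (value, time.monotonic() + self.ttl)
            self.data.move_to_end(key)
            if len(self.data) > self.maxsize:
                self.data.popitem(last=False)     # evict the least-recently-used entry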
  36. Future research that we are currently pursuing... ● Hybrid binning NFFM ● Distributed training and serving ● Dropout & batch normalization ● Methods to interpret the latent vectors (using methods like t-Distributed Stochastic Neighbour Embedding (t-SNE), etc.)
  37. References
      FM: https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
      FFM: http://research.criteo.com/ctr-prediction-linear-model-field-aware-factorization-machines/
      DeepFM: https://arxiv.org/pdf/1703.04247.pdf
      NFM: https://arxiv.org/pdf/1708.05027.pdf
      GBT-based feature engineering: http://quinonero.net/Publications/predicting-clicks-facebook.pdf
  38. Thank You!
