
AWS re:Invent 2016: Using MXNet for Recommendation Modeling at Scale (MAC306)

For many companies, recommendation systems solve important machine learning problems. But as recommendation systems grow to millions of users and millions of items, they pose significant challenges when deployed at scale. The user-item matrix can have trillions of entries (or more), most of which are zero. Making common ML techniques practical on such sparse data requires special handling. Learn how to use MXNet to build neural network models for recommendation systems that can scale efficiently to large, sparse datasets.


  1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. November 30, 2016. Using MXNet for Recommendation Modeling at Scale (MAC306). Leo Dirac, Principal Engineer, AWS Deep Learning
  2. What to Expect from the Session: background on recommender systems and machine learning; how to implement them in MXNet using P2 instances and the AWS Deep Learning AMI; several types of recommender systems, including advanced deep learning ideas; and tricks for handling sparse data in MXNet.
  3. Background: Recommender Systems & Machine Learning
  4. Netflix Prize, 2006-2009: $1,000,000
  5. Recommending Movies:
/* Predict what star rating user u will give movie m */
float predictRating(User u, Movie m) {
    // How???
}
  6. Q: How??? A: Machine learning: learn the code from data.
float predictRating(User u, Movie m) {
    return mlModel.run(u, m);
}
  7. (Diagram: training data → training → model; input data → model → predictions.)
  8.-12. (Diagram, built up across five slides: all labelled data is split 75% / 25% into training data and test data; training data → training → trial model; the trial model is evaluated against the test data, yielding an accuracy result.)
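As a minimal sketch of the 75/25 split in that diagram (numpy only; the names here are illustrative, not from the talk's demo notebooks):

import numpy as np

def split_labelled_data(ratings, train_fraction=0.75, seed=0):
    # Shuffle all labelled examples, then cut 75% for training, 25% for test.
    # `ratings` is assumed to be a numpy array of labelled examples.
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(ratings))
    cut = int(len(ratings) * train_fraction)
    return ratings[order[:cut]], ratings[order[cut:]]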
  13. Sparse Data
  14. User-Item Ratings Matrix
  15. Size of the user-item ratings matrix. Sample dataset: MovieLens 20M. (27,000 movies) × (138,000 users) = 3,700,000,000 possible ratings, but only 20,000,000 ratings are available: 99.5% of ratings are unknown. http://grouplens.org/datasets/movielens/20m/
  16. Storing the matrix. Dense: 3.7B entries at 1 byte per rating = 3.7 GB. Sparse: 20M non-zero entries, each holding a 1-byte rating, a 32-bit movie_id, and a 32-bit user_id = 180 MB. Sparse is ~20x smaller.
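A sketch of the same storage math using scipy's COO (coordinate) sparse format; the random arrays below are stand-ins for real MovieLens ratings:

import numpy as np
from scipy.sparse import coo_matrix

n_items, n_users, n_ratings = 27000, 138000, 20000000
rows = np.random.randint(0, n_items, n_ratings, dtype=np.int32)  # movie_id per rating
cols = np.random.randint(0, n_users, n_ratings, dtype=np.int32)  # user_id per rating
vals = np.random.randint(1, 6, n_ratings, dtype=np.int8)         # 1-byte rating

sparse = coo_matrix((vals, (rows, cols)), shape=(n_items, n_users))
# 20M * (1 + 4 + 4) bytes ≈ 180 MB, versus ~3.7 GB for the dense matrix.
print(sparse.data.nbytes + sparse.row.nbytes + sparse.col.nbytes)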
  17. Matrix Factorization
  18. MF as Math: the sparse I×U behavior matrix (items × users) is approximated by the product of an I×D matrix of item embeddings and a D×U matrix of user embeddings.
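Restated as an equation (my notation, not the slide's):

R_{I \times U} \approx P_{I \times D} \, Q_{D \times U}, \qquad \hat{r}_{iu} = p_i^\top q_u

where R is the sparse item-by-user behavior matrix, the rows p_i of P are the D-dimensional item embeddings, and the columns q_u of Q are the user embeddings.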
  19. Embeddings: Emb(“The Karate Kid”) = [-3.168 -0.136 3.770 4.767 3.558 -4.168 0.464 2.034 3.411 … 0.866]
  20. Embeddings:
Emb(“The Karate Kid”) = [-3.168 -0.136 3.770 4.767 3.558 -4.168 0.464 2.034 3.411 … 0.866]
Emb(“Ferris Bueller”) = [-3.101 -0.057 3.800 4.862 3.632 -4.157 0.549 2.064 3.428 … 0.884]
D(Emb(“K.Kid”) – Emb(“Ferris”)) = 0.138
D(Emb(“K.Kid”) – Emb(“My Little Pony”)) = 1.572
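A sketch of the distance comparison above, using only the leading values shown on the slide (the slide's 0.138 uses the full, truncated-here vectors, so the number below is merely illustrative):

import numpy as np

emb = {
    "The Karate Kid": np.array([-3.168, -0.136, 3.770, 4.767, 3.558]),
    "Ferris Bueller": np.array([-3.101, -0.057, 3.800, 4.862, 3.632]),
}

def D(a, b):
    # Euclidean distance between two embedding vectors.
    return np.linalg.norm(a - b)

print(D(emb["The Karate Kid"], emb["Ferris Bueller"]))  # small => similar movies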
  21. MXNet
  23. p2.xlarge: 4,300,000,000,000 (4.3 trillion) 32-bit floating point operations per second.
  24. GPUs: feeding the beast. (Diagram: GPU cores ↔ GPU RAM at 240 GB/s; GPU ↔ CPU over PCI at ~10 GB/s; CPU ↔ network over Ethernet at 2.5 GB/s.)
  25. p2.16xlarge. (Diagram: 16 GPUs attached to the CPU over PCIe at ~10 GB/s, with Ethernet at 2.5 GB/s.)
  26. MXNet scaling. (Chart.)
  27. MF as a neural network (NN). (Diagram: user (1-hot) → user embedding; item (1-hot) → item embedding; dot product of the two embeddings → rating.)
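A minimal sketch of this network in MXNet's symbolic API (dimension and variable names are my assumptions; the talk's actual code is in demo1-MF.ipynb):

import mxnet as mx

def mf_net(n_users, n_items, dim=64):
    user = mx.symbol.Variable('user')    # integer user ids
    item = mx.symbol.Variable('item')    # integer item ids
    label = mx.symbol.Variable('score')  # observed ratings
    u = mx.symbol.Embedding(data=user, input_dim=n_users, output_dim=dim)
    i = mx.symbol.Embedding(data=item, input_dim=n_items, output_dim=dim)
    # Dot product of the two embeddings predicts the rating.
    pred = mx.symbol.sum_axis(data=(u * i), axis=1)
    pred = mx.symbol.Flatten(data=pred)
    return mx.symbol.LinearRegressionOutput(data=pred, label=label)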
  28. Deep Learning AMI with P2. Pre-installed: • MXNet & other popular deep learning frameworks • GPU drivers, CUDA, cuDNN • Jupyter notebook & Python libraries
  29. MF Demo in MXNet: demo1-MF.ipynb
  30. Binary Predictions
  31. Why binary?
  32. Binary user-item matrix
  33. Original data
  34. Predicting binary:
float predictScore(User u, Movie m) {
    return 1.0; // trivially "perfect" when only positive interactions are recorded
}
  35. Original data
  36. Negative sampling
  37. Negative sampling:
from mxreco import NegativeSamplingDataIter
train_data = NegativeSamplingDataIter(train_data, sample_ratio=5)
More details: BlackOut: Speeding up RNNLM with Very Large Vocabularies. Shihao Ji, S. V. N. Vishwanathan, Nadathur Satish, Michael J. Anderson, Pradeep Dubey.
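mxreco is the helper library from the talk's demos; a rough numpy sketch of what such a sampler does (treating random, presumably unobserved pairs as negatives):

import numpy as np

def negative_sample(users, items, n_items, sample_ratio=5, seed=0):
    # Keep every observed (user, item) pair as a positive (label 1), and draw
    # sample_ratio random items per positive as presumed negatives (label 0).
    rng = np.random.RandomState(seed)
    neg_users = np.repeat(users, sample_ratio)
    neg_items = rng.randint(0, n_items, size=len(neg_users))
    labels = np.concatenate([np.ones(len(users)), np.zeros(len(neg_users))])
    return (np.concatenate([users, neg_users]),
            np.concatenate([items, neg_items]),
            labels)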
  38. Negative Sampling Demo: demo2-binary.ipynb
  39. Content Features
  40. What do we know? Behavioral interactions between users & items; names of items; pictures of items; what users searched for.
  41. How to represent these in a NN? Unique identifier: embedding. Images: ConvNet (a.k.a. CNN). Text: LSTM or bag of words.
  42. Deep Structured Semantic Model (DSSM). (Diagram: left object → deep net → embedding; right object → deep net → embedding; the similarity of the two embeddings is trained against the label.)
  43. CosineLoss layer:
import mxreco
pred = mxreco.CosineLoss(a=user, b=item, label=label)
(Figure: example vector pairs with loss L ≈ 0, L ≈ 1, and L ≈ 2.)
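CosineLoss here is mxreco's own layer; a rough equivalent in plain MXNet symbols (my construction, assuming labels of +1 for matching pairs and -1 for sampled negatives):

import mxnet as mx

def cosine_loss(a, b, label):
    # Normalize both embeddings, take their dot product (cosine similarity),
    # and penalize disagreement with the label: the loss spans ~0 (aligned
    # positive pair) through ~1 (orthogonal) to ~2 (aligned negative pair).
    a = mx.symbol.L2Normalization(data=a)
    b = mx.symbol.L2Normalization(data=b)
    cos = mx.symbol.sum_axis(data=(a * b), axis=1)
    return mx.symbol.MakeLoss(1 - label * cos)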
  44. Content Features DSSM Demo: demo3-dssm.ipynb
  45. Inspirational References:
• Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, Larry Heck. October 2013.
• Deep Neural Networks for YouTube Recommendations. Paul Covington, Jay Adams, Emre Sargin. 2016.
• Order-Embeddings of Images and Language. Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun. March 2016.
  46. User-Level Models
  47. Predicting with embeddings:
def movies_for_user(u):
    scores = {}
    for m in movies:
        scores[m.id] = predictScore(u, m)
    top_movies = sorted(scores.items()…)
    return top_movies
  48. All content at once:
def movies_for_user(u):
    scores = userModel.predict(u)
    top_movies = sorted(scores.items()…)
    return top_movies
(Diagram, repeated from slide 24: GPU cores ↔ GPU RAM at 240 GB/s; CPU over PCI at ~10 GB/s; Ethernet at 2.5 GB/s.)
  49. Multi-label neural network. (Diagram: sparse input bag of movies → hidden units → movie probabilities → loss & gradient against the sparse output bag of movies, with U×N and N×U weight matrices.)
  50. Storing indexes. Conceptually:
• Predict: 1882, 2808, 24, 160, 1831, 2668
• Inputs: 2986, 329, 2012, 442, 512, 1544, 2615, 1037, 1876, 1917, 2532, 196, 1375, 1779, 2054, 2530, 2628, 1909, 2407, 316, 1356, 1603, 2046, 2428
  51. Storing sparse data: simpler if fixed width; pad to the end with “-1”.
• Predict: 1882, 2808, 24, 160, 1831, 2668, -1, -1, -1, -1
• Inputs: 2986, 329, 2012, 442, 512, 1544, 2615, 1037, 1876, 1917, 2532, 196, 1375, 1779, 2054, 2530, 2628, 1909, 2407, 316, 1356, 1603, 2046, 2428, -1, -1, -1, -1, -1, -1, -1, -1
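A sketch of that padding scheme (numpy; the fixed width is illustrative):

import numpy as np

def pad_to_width(index_lists, width, pad_value=-1):
    # Right-pad each variable-length list of movie indexes with -1 so every
    # row has the same fixed width, truncating any list longer than `width`.
    out = np.full((len(index_lists), width), pad_value, dtype=np.int32)
    for row, idxs in enumerate(index_lists):
        out[row, :len(idxs)] = idxs[:width]
    return out

print(pad_to_width([[1882, 2808, 24, 160, 1831, 2668]], width=10))
# [[1882 2808   24  160 1831 2668   -1   -1   -1   -1]]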
  52. Trying It Yourself
  53. Trying it yourself: launch the Deep Learning AMI (https://aws.amazon.com/marketplace/pp/B01M0AXXQB), then try the examples in https://github.com/dmlc/mxnet/example/recommender
  54. Thank you!
  55. Remember to complete your evaluations!
