
Abstract summary

Personalization and Scalable Deep Learning with MXNet: User return times and movie preferences are inherently time-dependent. In this talk I will show how this can be modeled efficiently using deep learning by employing an LSTM (Long Short-Term Memory) network. Moreover, I will show how to train large-scale distributed parallel models efficiently using MXNet. This includes a brief overview of the key components of defining networks and of optimization, and a walkthrough of the steps required to allocate machines and to train a model.


- 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Alexander Smola AWS Machine Learning Personalization and Scalable Deep Learning with MXNET
- 2. Outline • Personalization • Latent Variable Models • User Engagement and Return Times • Deep Recommender Systems • MXNet • Basic concepts • Launching a cluster in a minute • ImageNet for beginners
- 3. Personalization
- 4. Latent Variable Models • Temporal sequence of observations Purchases, likes, app use, e-mails, ad clicks, queries, ratings • Latent state to explain behavior • Clusters (navigational, informational queries in search) • Topics (interest distributions for users over time) • Kalman Filter (trajectory and location modeling) Action Explanation
- 5. Latent Variable Models • Temporal sequence of observations Purchases, likes, app use, e-mails, ad clicks, queries, ratings • Latent state to explain behavior • Clusters (navigational, informational queries in search) • Topics (interest distributions for users over time) • Kalman Filter (trajectory and location modeling) Action Explanation Are the parametric models really true?
- 6. Latent Variable Models • Temporal sequence of observations Purchases, likes, app use, e-mails, ad clicks, queries, ratings • Latent state to explain behavior • Nonparametric model / spectral • Use data to determine shape • Sidestep approximate inference • State update h_t = f(x_{t-1}, h_{t-1}), emission x_t = g(x_{t-1}, h_t)
- 7. Latent Variable Models • Temporal sequence of observations Purchases, likes, app use, e-mails, ad clicks, queries, ratings • Latent state to explain behavior • Plain deep network = RNN • Deep network with attention = LSTM / GRU … (learn when to update state, how to read out) x h
- 8. Long Short Term Memory (Hochreiter and Schmidhuber, 1997) i_t = σ(W_i(x_t, h_t) + b_i) f_t = σ(W_f(x_t, h_t) + b_f) z_{t+1} = f_t · z_t + i_t · tanh(W_z(x_t, h_t) + b_z) o_t = σ(W_o(x_t, h_t, z_{t+1}) + b_o) h_{t+1} = o_t · tanh(z_{t+1})
- 9. Long Short Term Memory (Hochreiter and Schmidhuber, 1997) (z_{t+1}, h_{t+1}, o_t) = LSTM(z_t, h_t, x_t) Treat it as a black box
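The update equations on slide 8 can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation: the `lstm_step` helper, weight shapes, and random initialization are assumptions for the sketch, not code from the talk.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(z, h, x, params):
    """One LSTM update: (z_t, h_t, x_t) -> (z_{t+1}, h_{t+1}, o_t)."""
    xh = np.concatenate([x, h])
    i = sigmoid(params["Wi"] @ xh + params["bi"])                  # input gate
    f = sigmoid(params["Wf"] @ xh + params["bf"])                  # forget gate
    z_next = f * z + i * np.tanh(params["Wz"] @ xh + params["bz"]) # cell state
    o = sigmoid(params["Wo"] @ np.concatenate([x, h, z_next]) + params["bo"])
    h_next = o * np.tanh(z_next)                                   # hidden state
    return z_next, h_next, o

# toy dimensions and parameters
d_x, d_h = 3, 4
rng = np.random.default_rng(0)
params = {
    "Wi": rng.normal(size=(d_h, d_x + d_h)), "bi": np.zeros(d_h),
    "Wf": rng.normal(size=(d_h, d_x + d_h)), "bf": np.zeros(d_h),
    "Wz": rng.normal(size=(d_h, d_x + d_h)), "bz": np.zeros(d_h),
    "Wo": rng.normal(size=(d_h, d_x + d_h + d_h)), "bo": np.zeros(d_h),
}
z, h = np.zeros(d_h), np.zeros(d_h)
z, h, o = lstm_step(z, h, rng.normal(size=d_x), params)
```

Treating this step function as a black box, as the slide suggests, is exactly how the session models later in the deck use it.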
- 10. User Engagement [figure: app session frame (toutiao.com) with open times 9:01, 8:55, 11:50, 12:30; will the user return next week, or never?]
- 11. User Engagement Modeling • User engagement is gradual • Daily average users? • Weekly average users? • Number of active users? • Number of users? • Abandonment is passive • The last time you tweeted? Pinned? Liked? Skyped? • Churn models assume active abandonment (insurance, phone, bank)
- 12. User Engagement Modeling • User engagement is gradual • Model user returns • Context of activity • World events (elections, Super Bowl, …) • User habits (morning reader, night owl) • Previous reading behavior (poor-quality content will discourage return)
- 13. Survival Analysis 101 • Model a population where something dramatic happens • Cancer patients (death; efficacy of a drug) • Atoms (radioactive decay) • Japanese women (marriage) • Users (opens app) • Survival probability Pr(survival ≥ T) = exp(−∫_0^T λ(t) dt), where λ(t) is the hazard rate
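The survival probability above is easy to evaluate numerically. A small sketch, assuming an illustrative constant hazard rate; the `survival_prob` helper is hypothetical, not from the talk:

```python
import numpy as np

def survival_prob(hazard, T, n=100_000):
    """Pr(survival >= T) = exp(-integral_0^T hazard(t) dt), via a Riemann sum."""
    t = np.linspace(0.0, T, n)
    integral = np.sum(hazard(t)) * (T / n)
    return float(np.exp(-integral))

# Constant hazard lambda(t) = 0.1/hour: survival over 24 hours is exp(-0.1 * 24).
rate = 0.1
p = survival_prob(lambda t: np.full_like(t, rate), T=24.0)
```

For a constant hazard this reproduces the familiar exponential survival curve; the deep models later in the deck simply replace the constant with a learned, time- and user-dependent λ(t).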
- 14. Session Model [timeline: start … end] • User activity is a sequence of times • b_i when the app is opened • e_i when the app is closed • In between, wait for user return • Model user activity likelihood
- 15. Session Model [Fig. 1: personalized time-aware architecture for survival analysis; one-hot UserID and TimeID pass through lookup tables to produce user and time embeddings, which are combined with external features in hidden layers to predict the (quantized) rate values for the next session]
- 16. Personalized LSTM [Fig. 2: unfolded LSTM network over 3 sessions; the input vector for session s concatenates the user embedding, the time-slot embedding, and session features] • LSTM for global state update • LSTM for individual state update • Update both of them • Learn using backprop and SGD Jing and Smola, WSDM'17
- 17. Perplexity (quality of prediction)
  [Fig. 6: histogram of the time between two sessions; Toutiao (top) and Last.fm (bottom). The small bump around 24 hours corresponds to users with a daily habit of opening the app at the same time.]
  Baseline models:
  • global constant: a static model with a single parameter, assuming the rate is constant throughout the time frame for all users
  • global+user constant: a static model assuming the rate is an additive function of a global constant and a user-specific constant
  • piecewise constant: a more flexible static model that learns a parameter for each discretized bin
  • Hawkes process: a self-exciting point process that respects past sessions
  • integrated model: a combination of all the above components
  • DNN: assumes the rate is a function of time, user, and session features, parameterized by a deep neural network
  • LSTM: a recurrent neural network that incorporates past activities
  For completeness, we also report results for Cox's model, where the hazard rate is λ_u(t) = λ_0(t) exp(⟨λ, x_u(t)⟩). (28)
  Evaluation: perp = exp(−(1/M) Σ_{u=1}^{m} Σ_{i=1}^{m_u} log p({b_i, e_i}; λ)), (29) where M is the total number of sessions in the test set. The lower the value, the better the model explains the test data; perplexity measures the amount of surprise in a user's behavior relative to our prediction.
  Table 1. Average perplexity evaluated on the test set (lower is better):
  model               Toutiao  Last.fm
  Cox Model           27.13    28.31
  global constant     45.29    59.98
  user constant       28.74    45.44
  piecewise constant  26.88    26.12
  Hawkes process      22.58    30.80
  integrated model    21.56    26.06
  DNN                 18.87    20.62
  LSTM                18.10    19.80
  There is a big gap between the linear models and the two deep models; the Cox model is inferior to the integrated model and significantly worse than the deep networks.
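The perplexity measure on slide 17 reduces to a one-liner. A minimal sketch with toy per-session log-likelihoods (the `perplexity` helper and the example values are illustrative assumptions):

```python
import math

def perplexity(log_likelihoods):
    """Average per-session perplexity: exp(-(1/M) * sum of log-likelihoods),
    where M is the number of test sessions."""
    M = len(log_likelihoods)
    return math.exp(-sum(log_likelihoods) / M)

# Toy example: a model assigning probability 0.25 to each of 4 test sessions
# has perplexity 4 (as much "surprise" as a uniform choice over 4 outcomes).
p = perplexity([math.log(0.25)] * 4)  # -> 4.0
```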
- 18. Perplexity (quality of prediction) [Fig. 7: average test perplexity as a function of the fraction of observed sessions, and relative improvements (%) of LSTMs over the integrated and Cox models; left column Toutiao, right column Last.fm] Jing and Smola, WSDM'17
- 19. [Fig. 9: six randomly sampled learned predictive rate functions, three from Toutiao (left) and three from Last.fm (right); each pair of plots shows the instantaneous rate λ(t) in purple, the survival function Pr(return ≥ t) in red, and the actual return time in blue]
- 20. Recommender Systems
- 21. Recommender systems, not recommender archaeology [diagram: users × items over time; use the past (before NOW) to predict the future, don't predict the past (archaeology)]
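In code, "use the past, predict the future" is simply a time-based split rather than a random one. A tiny sketch; the `temporal_split` helper and toy events are hypothetical, for illustration only:

```python
def temporal_split(events, cutoff):
    """Split (user, item, timestamp) interactions at a time cutoff:
    train on the past, evaluate on the future - never the reverse."""
    train = [e for e in events if e[2] < cutoff]
    test = [e for e in events if e[2] >= cutoff]
    return train, test

events = [("u1", "i1", 10), ("u1", "i2", 50), ("u2", "i1", 30), ("u2", "i3", 70)]
train, test = temporal_split(events, cutoff=40)  # 2 train events, 2 test events
```

A random split would leak future interactions into training, which is the "archaeology" the slide warns against.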
- 22. The Netflix contest got it wrong …
- 23. Getting it right [diagram: one LSTM models change in taste and expertise (users), another LSTM models change in perception and novelty (items)] Wu et al, WSDM'17
- 24. Wu et al, WSDM’17
- 25. Prizes
- 26. Sanity Check
- 27. Deep Learning with MXNet
- 28. Caffe Torch Theano Tensorflow CNTK Keras Paddle (image - Banksy/wikipedia) Why yet another deep networks tool?
- 29. Why yet another deep networks tool? • Frugality & resource efficiency: engineered for cheap GPUs with smaller memory and slow networks • Speed: linear scaling with #machines and #GPUs; high efficiency on a single machine, too (C++ backend) • Simplicity: mix declarative and imperative code; a single implementation of the backend system and common operators; performance guarantees regardless of which frontend language is used
- 30. Imperative Programs import numpy as np a = np.ones(10) b = np.ones(10) * 2 c = b * a print(c) d = c + 1 Easy to tweak with Python code Pro • Straightforward and flexible • Takes advantage of language-native features (loops, conditions, debugger) Con • Hard to optimize
- 31. Declarative Programs A = Variable('A') B = Variable('B') C = B * A D = C + 1 f = compile(D) d = f(A=np.ones(10), B=np.ones(10)*2) Pro • More chances for optimization • Cross different languages Con • Less flexible A B 1 + ⨉ C can share memory with D, because C is deleted later
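The declarative idea on this slide (build the graph first, evaluate on demand later) can be mimicked in a few lines of plain Python. This toy `Var` class is a hypothetical sketch of the concept, not the MXNet API:

```python
import numpy as np

class Var:
    """A node in a tiny deferred computation graph."""
    def __init__(self, name=None, op=None, args=()):
        self.name, self.op, self.args = name, op, args

    def __mul__(self, other):
        return Var(op="mul", args=(self, other))

    def __add__(self, other):
        return Var(op="add", args=(self, other))

    def eval(self, env):
        if self.op is None:                      # leaf: look up the bound value
            return env[self.name]
        vals = [a.eval(env) if isinstance(a, Var) else a for a in self.args]
        return vals[0] * vals[1] if self.op == "mul" else vals[0] + vals[1]

A, B = Var("A"), Var("B")
D = B * A + 1                                    # nothing computed yet: just a graph
d = D.eval({"A": np.ones(10), "B": np.ones(10) * 2})
```

Because the whole expression is known before evaluation, an optimizer could inspect the graph, e.g. noticing (as the slide points out) that the intermediate C = B * A can share memory with D once C is no longer needed.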
- 32. Imperative vs. Declarative for Deep Learning • The computational graph of the deep architecture (forward/backward) needs heavy optimization: fits declarative programs • Updates and interactions with the graph (iteration loops, parameter updates w ← w − η·∂_w f(w), beam search, feature extraction, …) need mutation and more language-native features: good for imperative programs
- 33. Mixed Style Training Loop in MXNet executor = neuralnetwork.bind() for i in range(3): train_iter.reset() for dbatch in train_iter: args["data"][:] = dbatch.data[0] args["softmax_label"][:] = dbatch.label[0] executor.forward(is_train=True) executor.backward() for key in update_keys: args[key] -= learning_rate * grads[key] Imperative NDArray can be set as input nodes to the graph Executor is bound from declarative program that describes the network Imperative parameter update on GPU
- 34. Mixed API for Quick Extensions • Runtime switching between different graphs depending on the input • Useful for sequence modeling and image-size reshaping (e.g. bucketing variable-length sentences) • Imperative code in Python: 10 lines of additional Python code
- 35. 3D Image Construction Deep3D 100 lines of Python code https://github.com/piiswrong/deep3d
- 36. Distributed Deep Learning
- 37. Distributed Deep Learning
- 38. Distributed Deep Learning ## train num_gpus = 4 gpus = [mx.gpu(i) for i in range(num_gpus)] model = mx.model.FeedForward( ctx = gpus, symbol = softmax, num_round = 20, learning_rate = 0.01, momentum = 0.9, wd = 0.00001) model.fit(X = train, eval_data = val, batch_end_callback = mx.callback.Speedometer(batch_size=batch_size)) 2 lines for multi GPU
- 39. Scaling on p2.16xlarge [charts: average throughput per GPU, aggregate throughput, and GPU-GPU sync across #GPUs for AlexNet, Inception-v3, and ResNet-50; speedups of 108x and 75x shown]
- 40. Demo
- 41. Getting Started • Website http://mxnet.io/ • GitHub repository git clone --recursive git@github.com:dmlc/mxnet.git • Docker docker pull dmlc/mxnet • Amazon AWS Deep Learning AMI (with other toolkits & anaconda) https://aws.amazon.com/marketplace/pp/B01M0AXXQB http://bit.ly/deepami • CloudFormation Template https://github.com/dmlc/mxnet/tree/master/tools/cfn http://bit.ly/deepcfn
- 42. Acknowledgements • User engagement How Jing, Chao-Yuan Wu • Temporal recommenders Chao-Yuan Wu, Alex Beutel, Amr Ahmed • MXNet & Deep Learning AMI Mu Li, Tianqi Chen, Bing Xu, Eric Xie, Joseph Spisak, Naveen Swamy, Anirudh Subramanian and many more … We are hiring {smola, thakerb, spisakj}@amazon.com
