Alex Smola is the Manager of the Cloud Machine Learning Platform at Amazon. Before joining Amazon, Smola was a Professor in the Machine Learning Department at Carnegie Mellon University and cofounder and CEO of Marianas Labs. Earlier he worked at Google Strategic Technologies, Yahoo Research, and National ICT Australia, and before CMU he was a professor at UC Berkeley and the Australian National University. Alex obtained his PhD from TU Berlin in 1998. He has published over 200 papers and written or coauthored five books.
Abstract summary
Personalization and Scalable Deep Learning with MXNet: User return times and movie preferences are inherently time dependent. In this talk I will show how this can be modeled efficiently using deep learning by employing an LSTM (Long Short-Term Memory) network. Moreover, I will show how to train large-scale distributed parallel models efficiently using MXNet. This includes a brief overview of the key components of defining networks and optimization, and a walkthrough of the steps required to allocate machines and train a model.
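The gating mechanism at the heart of an LSTM can be sketched in a few lines of plain Python. This is an illustrative single-step cell with scalar state and made-up weights, not the MXNet code from the talk:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    # One LSTM step for scalar input/state, showing the four gates.
    # w maps each gate name to an (input-weight, recurrent-weight, bias) triple.
    def gate(name, squash):
        wx, wh, b = w[name]
        return squash(wx * x + wh * h_prev + b)
    i = gate("i", sigmoid)       # input gate: how much new information to admit
    f = gate("f", sigmoid)       # forget gate: how much old cell state to keep
    o = gate("o", sigmoid)       # output gate: how much state to expose
    g = gate("g", math.tanh)     # candidate cell state
    c = f * c_prev + i * g       # new cell state
    h = o * math.tanh(c)         # new hidden state
    return h, c

# Toy weights (arbitrary, for illustration only).
w = {k: (0.5, 0.1, 0.0) for k in ("i", "f", "o", "g")}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.3]:       # a short "event" sequence
    h, c = lstm_step(x, h, c, w)
```

A real model would use vector-valued states, learned weights, and a framework such as MXNet; the gate algebra, however, is exactly this.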
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon University (MLconf)
Fast, Cheap and Deep – Scaling Machine Learning: Distributed high throughput machine learning is both a challenge and a key enabling technology. Using a Parameter Server template we are able to distribute algorithms efficiently over multiple GPUs and in the cloud. This allows us to design very fast recommender systems, factorization machines, classifiers, and deep networks. This degree of scalability allows us to tackle computationally expensive problems efficiently, yielding excellent results e.g. in visual question answering.
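The push/pull pattern behind a Parameter Server can be sketched in a single process. The class and worker loop below are hypothetical simplifications (a real server shards keys across machines and applies worker updates asynchronously):

```python
class ParameterServer:
    """Toy in-process parameter server: workers pull weights, push gradients."""
    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.lr = lr

    def pull(self):
        return list(self.w)              # workers receive a copy of the weights

    def push(self, grad):
        # The server applies gradients as they arrive from workers.
        self.w = [wi - self.lr * gi for wi, gi in zip(self.w, grad)]

server = ParameterServer(dim=2)
shards = [[1.0, 2.0], [3.0, 4.0]]        # each "worker" sees one data shard
for _ in range(200):
    for target in shards:                # round-robin stands in for parallelism
        w = server.pull()
        # Gradient of the quadratic loss ||w - target||^2 on this shard.
        grad = [2.0 * (wi - ti) for wi, ti in zip(w, target)]
        server.push(grad)
```

The weights settle near the average of the shard targets, illustrating how independent workers jointly optimize shared parameters.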
Le Song, Assistant Professor, College of Computing, Georgia Institute of Technology (MLconf)
Understanding Deep Learning for Big Data: The complexity and scale of big data impose tremendous challenges for their analysis. Yet, big data also offer us great opportunities. Some nonlinear phenomena, features or relations, which are not clear or cannot be inferred reliably from small and medium data, now become clear and can be learned robustly from big data. Typically, the form of the nonlinearity is unknown to us, and needs to be learned from data as well. Being able to harness the nonlinear structures in big data could allow us to tackle problems which were impossible before, or obtain results far better than the previous state of the art.
Nowadays, deep neural networks are the methods of choice when it comes to large scale nonlinear learning problems. What makes deep neural networks work? Is there any general principle for tackling high dimensional nonlinear problems which we can learn from deep neural networks? Can we design competitive or better alternatives based on such knowledge? To make progress on these questions, my machine learning group performed both theoretical and experimental analyses of existing and new deep learning architectures, investigating three crucial aspects: the usefulness of the fully connected layers, the advantage of the feature learning process, and the importance of the compositional structures. Our results point to some promising directions for future research, and provide guidelines for building new deep learning models.
Sergei Vassilvitskii, Research Scientist, Google, at MLconf NYC - 4/15/16 (MLconf)
Teaching K-Means New Tricks: Over 50 years old, the k-means algorithm remains one of the most popular clustering algorithms. In this talk we’ll cover some recent developments, including better initialization, the notion of coresets, clustering at scale, and clustering with outliers.
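One of the initialization improvements the talk alludes to, k-means++ seeding, fits in a few lines: each new center is drawn with probability proportional to its squared distance from the nearest existing center. This is a generic sketch on 1-D points, not code from the talk:

```python
import random

def kmeans_pp_init(points, k, rng=random):
    """k-means++ seeding: favor points far from the centers chosen so far."""
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest current center.
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        total = sum(d2)
        r = rng.random() * total         # sample proportionally to d2
        acc = 0.0
        for p, weight in zip(points, d2):
            acc += weight
            if acc >= r:
                centers.append(p)
                break
    return centers

points = [0.0, 0.1, 0.2, 10.0, 10.1, 20.0]   # three well-separated clumps
random.seed(0)
centers = kmeans_pp_init(points, 3)
```

Compared with uniform random seeding, this makes picking one center per clump far more likely, which is what yields the well-known O(log k) approximation guarantee.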
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017 (MLconf)
Corinna Cortes is a Danish computer scientist known for her contributions to machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award for her work on theoretical foundations of support vector machines.
Cortes received her M.S. degree in physics from Copenhagen University in 1989. In the same year she joined AT&T Bell Labs as a researcher and remained there for about ten years. She received her Ph.D. in computer science from the University of Rochester in 1993. Cortes currently serves as the Head of Google Research, New York. She is an Editorial Board member of the journal Machine Learning.
Cortes’ research covers a wide range of topics in machine learning, including support vector machines and data mining. In 2008, she and Vladimir Vapnik jointly received the Paris Kanellakis Theory and Practice Award for the development of support vector machines (SVM), a highly effective algorithm for supervised learning. Today, SVM is one of the most frequently used algorithms in machine learning, with many practical applications, including medical diagnosis and weather forecasting.
Abstract Summary:
Harnessing Neural Networks:
Deep learning has demonstrated impressive performance gains in many machine learning applications. However, unveiling and realizing these performance gains is not always straightforward. Discovering the right network architecture is critical for accuracy and often requires a human in the loop. Some network architectures occasionally produce spurious outputs, which have to be restricted to meet the needs of an application. Finally, realizing the performance gains in a production system can be difficult because of long inference times.
In this talk we discuss methods for making neural networks efficient in production systems. We also discuss an efficient method for automatically learning the network architecture, called AdaNet. We provide theoretical arguments for the algorithm and present experimental evidence for its effectiveness.
Introduction to Deep Learning with Python (indico data)
A presentation by Alec Radford, Head of Research at indico Data Solutions, on deep learning with Python's Theano library.
The emphasis of the presentation is high performance computing, natural language processing (using recurrent neural nets), and large scale learning with GPUs.
Video of the talk available here: https://www.youtube.com/watch?v=S75EdAcXHKk
Part 2 of the Deep Learning Fundamentals Series, this session discusses Tuning Training (including hyperparameters, overfitting/underfitting), Training Algorithms (including different learning rates, backpropagation), Optimization (including stochastic gradient descent, momentum, Nesterov Accelerated Gradient, RMSprop, Adaptive algorithms - Adam, Adadelta, etc.), and a primer on Convolutional Neural Networks. The demos included in these slides are running on Keras with TensorFlow backend on Databricks.
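The update rules this session covers can be compared on a toy quadratic. The following plain-Python sketch is illustrative only (it is not the Keras/Databricks demo code; the hyperparameter values are arbitrary):

```python
# Minimize f(w) = (w - 3)^2 with three common update rules.
def grad(w):
    return 2.0 * (w - 3.0)

def sgd_momentum(steps=100, lr=0.1, mu=0.9):
    w, v = 0.0, 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(w)        # velocity accumulates past gradients
        w += v
    return w

def rmsprop(steps=200, lr=0.1, rho=0.9, eps=1e-8):
    w, s = 0.0, 0.0
    for _ in range(steps):
        g = grad(w)
        s = rho * s + (1 - rho) * g * g  # running mean of squared gradients
        w -= lr * g / ((s ** 0.5) + eps) # per-parameter scaled step
    return w

def adam(steps=200, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g        # first moment (momentum)
        v = b2 * v + (1 - b2) * g * g    # second moment (scaling)
        m_hat = m / (1 - b1 ** t)        # bias correction for the zero init
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / ((v_hat ** 0.5) + eps)
    return w
```

All three drive w toward the minimum at 3; the differences show up in how each scales and smooths the raw gradient, which is the point the slides elaborate.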
Daniel Shank, Data Scientist, Talla, at MLconf SF 2016 (MLconf)
Neural Turing Machines: Perils and Promise: Daniel Shank is a Senior Data Scientist at Talla, a company developing a platform for intelligent information discovery and delivery. His focus is on developing machine learning techniques to handle various business automation tasks, such as scheduling, polls, expert identification, as well as doing work on NLP. Before joining Talla as the company’s first employee in 2015, Daniel worked with TechStars Boston and did consulting work for ThriveHive, a small business focused marketing company in Boston. He studied economics at the University of Chicago.
Language translation with Deep Learning (RNN) with TensorFlow (S N)
The author is going to take you into the realm of Recurrent Neural Networks (RNNs). He will train a sequence-to-sequence model on a dataset of English and French sentences that can translate new (unseen) sentences from English to French.
This will be a walkthrough of an end-to-end technique for training a deep RNN model. You will learn to build the various components necessary for a sequence-to-sequence model.
You will learn the fundamentals of deep learning, mainly the RNN concepts required for this solution. Familiarity with deep learning concepts would be handy, but most of the concepts used in this example will be covered during the demo.
Technologies to be used:
Python, Jupyter, TensorFlow, FloydHub
Source code: https://github.com/syednasar/deeplearning/blob/master/language-translation/dlnd_language_translation.ipynb
...
Distributed implementation of an LSTM on Spark and TensorFlow (Emanuel Di Nardo)
An academic project developing an LSTM, distributing it on Spark and using TensorFlow for numerical operations.
Source code: https://github.com/EmanuelOverflow/LSTM-TensorSpark
How to win data science competitions with Deep Learning (Sri Ambati)
Note: Please download the slides first, otherwise some links won't work!
How to win Kaggle-style data science competitions and influence decisions with R, Deep Learning and H2O's fast algorithms.
We take a few public and Kaggle datasets and build models that win competitions on accuracy and scoring speed.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Slides from the presentation given at M^3 conference: http://www.mcubed.london/
The idea is to use three statements to describe and start working with the TensorFlow library.
Applying your Convolutional Neural Networks (Databricks)
Part 3 of the Deep Learning Fundamentals Series, this session starts with a quick primer on activation functions, learning rates, optimizers, and backpropagation. Then it dives deeper into convolutional neural networks discussing convolutions (including kernels, local connectivity, strides, padding, and activation functions), pooling (or subsampling to reduce the image size), and fully connected layer. The session also provides a high-level overview of some CNN architectures. The demos included in these slides are running on Keras with TensorFlow backend on Databricks.
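The convolution and pooling operations described above can be sketched in plain Python. This toy example (a hypothetical 2x2 edge-detecting kernel on a 4x4 image) is illustrative, not the Keras demo code from the slides:

```python
def conv2d_valid(img, kernel):
    """'Valid' 2-D convolution (strictly, cross-correlation, as in most
    deep learning libraries): slide the kernel over every position."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(img[i + a][j + b] * kernel[a][b]
                            for a in range(kh) for b in range(kw))
    return out

def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling halves each spatial dimension."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

# A vertical-edge kernel applied to an image with an edge down the middle.
img = [[0, 0, 1, 1] for _ in range(4)]
kernel = [[-1, 1], [-1, 1]]
feat = conv2d_valid(img, kernel)     # each row of feat is [0, 2, 0]
pooled = max_pool2x2(feat)           # pooling keeps the strongest response
```

The same local connectivity, stride, and subsampling ideas carry over directly to the multi-channel, learned-kernel layers a real CNN uses.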
AWS re:Invent 2016: Using MXNet for Recommendation Modeling at Scale (MAC306) (Amazon Web Services)
For many companies, recommendation systems solve important machine learning problems. But as recommendation systems grow to millions of users and millions of items, they pose significant challenges when deployed at scale. The user-item matrix can have trillions of entries (or more), most of which are zero. To make common ML techniques practical, sparse data requires special techniques. Learn how to use MXNet to build neural network models for recommendation systems that can scale efficiently to large sparse datasets.
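The sparsity issue can be illustrated with a minimal dictionary-of-dictionaries representation: store only nonzero ratings, and make every operation cost proportional to the nonzeros rather than to the full matrix. This is a generic sketch, not MXNet's sparse API:

```python
# Store only the nonzero entries of the user-item matrix: {user: {item: rating}}.
# The dense matrix would have len(users) * len(items) entries, almost all zero.
ratings = {
    "u1": {"i1": 5.0, "i3": 2.0},
    "u2": {"i1": 4.0, "i2": 1.0},
    "u3": {"i3": 3.0},
}

def sparse_dot(row_a, row_b):
    """Dot product of two sparse rows; the cost is proportional to the
    nonzeros of the smaller row, not to the total number of items."""
    if len(row_a) > len(row_b):
        row_a, row_b = row_b, row_a      # iterate over the shorter row
    return sum(v * row_b[k] for k, v in row_a.items() if k in row_b)

# Unnormalized similarity between two users: only shared items contribute.
sim_12 = sparse_dot(ratings["u1"], ratings["u2"])   # only i1 overlaps: 5 * 4
sim_13 = sparse_dot(ratings["u1"], ratings["u3"])   # only i3 overlaps: 2 * 3
```

Production systems use compressed formats (e.g. CSR) and hardware-friendly kernels for the same idea, which is what lets recommendation models scale to trillions of potential entries.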
by Vikram Madan, Sr. Product Manager, AWS Deep Learning
In this workshop, we will cover deep learning fundamentals and focus on the powerful and scalable Apache MXNet open source deep learning framework. At the end of this tutorial you’ll be able to train your own deep neural network and fine-tune existing state-of-the-art models for image and object recognition. We’ll also take a deep dive into setting up your deep learning infrastructure on AWS and model deployment on AWS Lambda.
Slides to support Austin Machine Learning Meetup, 1/19/2015.
An overview of techniques from recent Kaggle code for performing online logistic regression with FTRL-proximal (SGD, L1/L2 regularization) and the hashing trick.
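The FTRL-proximal update with the hashing trick can be sketched in plain Python, in the spirit of the well-known Kaggle scripts. The hyperparameters and the toy feature stream below are arbitrary; this is an illustration, not the code the slides review:

```python
import math
import zlib

D = 2 ** 20                      # size of the hashed feature space

def hash_features(raw):
    """Hashing trick: map raw 'field=value' strings to column indices."""
    return [zlib.crc32(f.encode()) % D for f in raw]

class FTRLProximal:
    """Per-coordinate FTRL-proximal for online logistic regression with
    L1/L2 regularization; weights are stored lazily and sparsely."""
    def __init__(self, alpha=0.5, beta=1.0, l1=0.1, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}              # accumulated shifted gradients
        self.n = {}              # accumulated squared gradients

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:    # L1 keeps weak coordinates at exactly zero
            return 0.0
        n = self.n.get(i, 0.0)
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        wtx = sum(self._weight(i) for i in x)
        return 1.0 / (1.0 + math.exp(-max(min(wtx, 35.0), -35.0)))

    def update(self, x, p, y):
        g = p - y                # log-loss gradient for binary features
        for i in x:
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g

model = FTRLProximal()
stream = [(["color=red"], 1.0), (["color=blue"], 0.0)]  # toy labeled events
for _ in range(200):
    for raw, y in stream:
        x = hash_features(raw)
        p = model.predict(x)     # predict first, then learn from the label
        model.update(x, p, y)
```

Because features are hashed into a fixed-size space and weights are materialized lazily, the same loop handles millions of raw feature strings in bounded memory.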
Introduction to deep learning @ Startup.ML by Andres Rodriguez (Intel Nervana)
Deep learning is unlocking tremendous economic value across various market sectors. Individual data scientists can draw from several open source frameworks and basic hardware resources during the very initial investigative phases, but quickly require significant hardware and software resources to build and deploy production models. Intel offers various software and hardware to support a diversity of workloads and user needs. Intel Nervana delivers a competitive deep learning platform to make it easy for data scientists to start from the iterative, investigatory phase and take models all the way to deployment. This platform is designed for speed and scale, and serves as a catalyst for all types of organizations to benefit from the full potential of deep learning. Examples of supported applications include, but are not limited to, automotive speech interfaces, image search, language translation, agricultural robotics and genomics, financial document summarization, and finding anomalies in IoT data.
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017 (MLconf)
Yi Wang is the tech lead of the AI Platform at Baidu. The team is a primary contributor to PaddlePaddle, the open source deep learning platform originally developed at Baidu. Before Baidu, he was a founding member of ScaledInference, a Palo Alto-based AI startup. Before that, he held a senior staff position at LinkedIn, was engineering director of the advertising system at Tencent, and was a researcher at Google.
Abstract Summary:
Fault-tolerable Deep Learning on General-purpose Clusters:
Researchers are used to running deep learning jobs on dedicated clusters. In industrial applications, however, AI is built on top of big data, and deep learning is only one stage of the data pipeline. That is where MPI-based clusters are not enough: general-purpose cluster management systems are necessary to run Web servers like Nginx; log collectors like Fluentd and Kafka; data processors on top of Hadoop, Spark, and Storm; and deep learning, which improves the quality of the Web service. This talk explains how we integrate PaddlePaddle and Kubernetes to provide an open source, fault-tolerable, large-scale deep learning platform.
Ben Lau, Quantitative Researcher, Hobbyist, at MLconf NYC 2017 (MLconf)
Ben Lau is a quantitative researcher at a macro hedge fund in Hong Kong, where he applies mathematical models and signal processing techniques to study the financial market. Prior to joining the financial industry, he used his mathematical modelling skills to probe the mysteries of the universe at the Stanford Linear Accelerator Center, a national accelerator laboratory, where he studied the asymmetry between matter and antimatter by analysing tens of billions of collision events created by particle accelerators. Ben was awarded his Ph.D. in Particle Physics from Princeton University and his undergraduate degree (with First Class Honours) from the Chinese University of Hong Kong.
Abstract Summary:
Deep Reinforcement Learning: Developing a robotic car with the ability to form long-term driving strategies is key to enabling fully autonomous driving in the future. Reinforcement learning has been considered a strong AI paradigm that can be used to teach machines through interaction with the environment and by learning from their mistakes. In this talk, we will discuss how to apply deep reinforcement learning techniques to train a self-driving car in TORCS, an open source racing car simulator. I will share how this is implemented and discuss various challenges in this project.
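The learning principle behind this can be illustrated with tabular Q-learning on a toy task; deep RL (as used for TORCS) replaces the table with a neural network over sensor inputs, but the update is the same. Everything below (the 1-D track, the constants) is a made-up sketch:

```python
import random

# A 1-D "track": start in cell 0, reach cell 4 for a reward of +1.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.3

def step(s, a):
    """Actions: 0 = move left, 1 = move right (walls clamp the position)."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def choose(q, s, rng):
    """Epsilon-greedy exploration with random tie-breaking."""
    if rng.random() < EPS or q[s][0] == q[s][1]:
        return rng.randrange(2)
    return 0 if q[s][0] > q[s][1] else 1

rng = random.Random(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(300):                     # episodes
    s = 0
    for _ in range(100):                 # cap episode length
        a = choose(q, s, rng)
        s2, r, done = step(s, a)
        # Q-learning: move Q(s, a) toward reward + discounted best next value.
        target = r if done else r + GAMMA * max(q[s2])
        q[s][a] += ALPHA * (target - q[s][a])
        s = s2
        if done:
            break

policy = [1 if q[s][1] > q[s][0] else 0 for s in range(N_STATES)]
```

The agent learns by trial and error to always drive "right"; a driving policy in TORCS is learned the same way, with a deep network estimating the values from raw observations.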
Alex Dimakis, Associate Professor, Dept. of Electrical and Computer Engineering (MLconf)
A Friendly Introduction To Causality: Causality has been studied under several frameworks in statistics and artificial intelligence. We will briefly survey Pearl’s Structural Equation model and explain how interventions can be used to discover causality. We will also present a novel information-theoretic framework for discovering causal directions from observational data when interventions are not possible. The starting point is conditional independence in joint probability distributions, and no prior knowledge of causal inference is required.
Ashrith Barthur, Security Scientist, H2O, at MLconf Seattle 2017 (MLconf)
Ashrith Barthur is a Security Scientist at H2O, currently working on algorithms that detect anomalous behaviour in user activities, network traffic, attacks, financial fraud, and global money movement. He has a PhD from Purdue University in the field of information security, specializing in anomalous behaviour in the DNS protocol.
Abstract summary
ML(Machine Learning) in AML (Anti Money Laundering):
AML, or anti money laundering, has been a consistent bane of multiple governments and banks. Strong pushes by countries to curb illegal money movement have resulted in only a significant yet extremely small fraction of money laundering being identified – an average success rate of about 2%. The more global a bank’s footprint, the lower the accuracy of its money laundering investigations. In the current mechanism, investigators analyse each money laundering alert and provide a subjective opinion on the case. Unfortunately this takes time, and still has a return rate of about 2% on average and 10% at the highest. What we design are AI algorithms that work on features tracking the monetary behaviour of every account. These features are essentially time-bound, making time a fundamental aspect of the algorithm design. The algorithms can improve identification to close to 70%, and certain exclusive features that are a function of time improve it much further.
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016 (MLconf)
DL4J and DataVec for Enterprise Deep Learning Workflows: Applications in NLP, sensor processing (IoT), image processing, and audio processing have all emerged as prime deep learning applications. In this session we will give a practical review of building secure Deep Learning workflows in the enterprise. We’ll see how DL4J’s DataVec tool enables scalable ETL and vectorization pipelines to be created for a single machine or scaled out to Spark on Hadoop. We’ll also see how deep networks such as recurrent neural networks are able to leverage DataVec to more quickly process data for modeling.
Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017 (MLconf)
Aaron Roth is an Associate Professor of Computer and Information Sciences at the University of Pennsylvania, affiliated with the Warren Center for Network and Data Science, and co-director of the Networked and Social Systems Engineering (NETS) program. Previously, he received his PhD from Carnegie Mellon University and spent a year as a postdoctoral researcher at Microsoft Research New England. He is the recipient of a Presidential Early Career Award for Scientists and Engineers (PECASE) awarded by President Obama in 2016, an Alfred P. Sloan Research Fellowship, an NSF CAREER award, and a Yahoo! ACE award. His research focuses on the algorithmic foundations of data privacy, algorithmic fairness, game theory and mechanism design, learning theory, and the intersections of these topics. Together with Cynthia Dwork, he is the author of the book “The Algorithmic Foundations of Differential Privacy.”
Abstract Summary:
Differential Privacy and Machine Learning:
In this talk, we will give a friendly introduction to Differential Privacy, a rigorous methodology for analyzing data subject to provable privacy guarantees, that has recently been widely deployed in several settings. The talk will specifically focus on the relationship between differential privacy and machine learning, which is surprisingly rich. This includes both the ability to do machine learning subject to differential privacy, and tools arising from differential privacy that can be used to make learning more reliable and robust (even when privacy is not a concern).
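As a concrete illustration of the kind of mechanism differential privacy builds on, the sketch below implements the classic Laplace mechanism for a counting query. This is a textbook example rather than code from the talk, and the data and predicate are made up:

```python
import math
import random

def laplace_noise(scale):
    """Sample from a zero-mean Laplace distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(values, predicate, epsilon):
    """Release a count under epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon gives epsilon-DP.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical data: ages of survey respondents.
ages = [23, 35, 41, 29, 52, 64, 38]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)
```

Smaller epsilon means more noise and stronger privacy; averaged over many releases the noisy count is unbiased, which is why the mechanism composes well with downstream learning.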
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016MLconf
Comparing TensorFlow NLP Options: word2vec, GloVe, RNN/LSTM, SyntaxNet, and Penn Treebank: Through code samples and demos, we’ll compare the architectures and algorithms of the various TensorFlow NLP options. We’ll explore both feed-forward and recurrent neural networks such as word2vec, GloVe, RNN/LSTM, SyntaxNet, and Penn Treebank using the latest TensorFlow libraries.
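To make the word2vec family concrete, here is a minimal sketch of how skip-gram (center, context) training pairs are generated from a sentence. It illustrates the general technique in plain Python rather than the TensorFlow APIs covered in the talk:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for a skip-gram model.

    Each word is paired with every word within `window` positions of it;
    a word2vec-style embedding is trained to predict context from center
    over exactly these pairs.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
# includes ('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ...
```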
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...MLconf
Andrew recently joined Lucidworks to head up their Advisory practice, and is a Committer and PMC member on the Apache Mahout project.
Abstract summary
Apache Mahout: Distributed Matrix Math for Machine Learning:
Machine learning and statistics tools like R and Scikit-learn are declarative, flexible, and extensible, but they scale poorly. “Big Data” tools such as Apache Spark, Apache Flink, and H2O distribute well, but have rudimentary functionality for machine learning and are not easily extensible. In this talk we present Apache Mahout, which provides a Scala-based, R-like DSL for doing linear algebra on distributed systems, letting practitioners quickly implement algorithms on distributed matrices. We will highlight new features in version 0.13 including the hybrid CPU/GPU-optimized engine, and a new framework for user-contributed methods and algorithms similar to R’s CRAN.
We will cover some history of Mahout, introduce the R-Like Scala DSL, provide an overview of how Mahout is able to operate on matrices distributed across multiple computers, and how it takes advantage of GPUs on each computer in a cluster creating a hybrid distributed/GPU-accelerated environment; then demonstrate the kinds of normally complex or unfeasible problems users can easily solve with Mahout; show an integration which allows Mahout to leverage the visualization packages of projects such as R, Python, and D3; and lastly explain how to develop algorithms and submit them to the Mahout project for other users to use.
Irina Rish, Researcher, IBM Watson, at MLconf NYC 2017MLconf
Irina Rish is a researcher at the AI Foundations department of the IBM T.J. Watson Research Center. She received an MS in Applied Mathematics from the Moscow Gubkin Institute, Russia, and a PhD in Computer Science from the University of California, Irvine. Her areas of expertise include artificial intelligence and machine learning, with a particular focus on probabilistic graphical models, sparsity and compressed sensing, active learning, and their applications to various domains, ranging from diagnosis and performance management of distributed computer systems (“autonomic computing”) to predictive modeling and statistical biomarker discovery in neuroimaging and other biological data. Irina has published over 60 research papers, several book chapters, two edited books, and a monograph on Sparse Modeling, taught several tutorials and organized multiple workshops at machine-learning conferences, including NIPS, ICML and ECML. She holds 24 patents and several IBM awards. Irina currently serves on the editorial board of the Artificial Intelligence Journal (AIJ). As an adjunct professor at the EE Department of Columbia University, she taught several advanced graduate courses on statistical learning and sparse signal modeling.
Abstract Summary:
Learning About the Brain and Brain-Inspired Learning:
Quantifying mental states and identifying statistical biomarkers of mental disorders from neuroimaging data is an exciting and rapidly growing research area at the intersection of neuroscience and machine learning, with a particular focus on the interpretability and reproducibility of learned models. We will discuss the promises and limitations of machine-learning methods in such applications, focusing on recent applications of deep learning methods such as recurrent convnets to the analysis of “brain movies” (EEG) data. Besides this “AI to Brain” direction, we will also discuss the reverse, “Brain to AI”: borrowing ideas from neuroscience to improve machine learning, with a specific focus on adult neurogenesis and online model adaptation in representation learning.
Anjuli Kannan, Software Engineer, Google at MLconf SF 2016MLconf
Smart Reply: Learning a Model of Conversation from Data: Smart Reply is a text assistance feature that was recently introduced to Inbox by Gmail. Given an incoming email message, the Smart Reply system analyzes its contents and suggests complete responses that the recipient can send with just one tap. This talk will cover how we built Smart Reply using a combination of deep learning and semantic clustering, as well as what we learned along the way and why we think it shows promise for the future of dialogue models.
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...MLconf
Local Search Optimization for Hyper-Parameter Tuning: Many machine learning algorithms are sensitive to their hyper-parameter settings, lacking good universal rule-of-thumb defaults. In this talk we discuss the use of black-box local search optimization (LSO) for machine learning hyper-parameter tuning. Viewed as black-box objective functions of their hyper-parameters, machine learning algorithms create a difficult class of optimization problems. The corresponding objective functions tend to be nonsmooth, discontinuous, and unpredictably computationally expensive, and require support for continuous, categorical, and integer variables. Further, evaluations can fail for a variety of reasons, such as early exits due to node failure or hitting a time limit. Additionally, not all hyper-parameter combinations are compatible (creating so-called “hidden constraints”). In this context, we apply a parallel hybrid derivative-free optimization algorithm that can make progress despite these difficulties, providing significantly improved results over default settings with minimal user interaction. Further, we will address efficient parallel paradigms for different types of machine learning problems, explore the importance of validation to avoid overfitting, and emphasize that, even for small data problems, the need to perform cross validation can create computationally intense functions that benefit from a distributed/threaded environment.
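A minimal sketch of the black-box local-search idea is below. The two-parameter objective and neighborhood function are invented stand-ins for cross-validated model error; the talk's actual algorithm is a parallel hybrid derivative-free method, which this does not reproduce:

```python
import random

def local_search(objective, start, neighbors, iters=100, seed=0):
    """Black-box local search: repeatedly try a random neighbor of the
    current hyper-parameter setting and move to it if it scores better.

    `objective` may fail on some settings (hidden constraints), so
    exceptions are treated as an infinitely bad score rather than
    aborting the search.
    """
    rng = random.Random(seed)
    best, best_score = start, objective(start)
    for _ in range(iters):
        cand = neighbors(best, rng)
        try:
            score = objective(cand)
        except Exception:
            continue  # incompatible combination: skip and keep searching
        if score < best_score:
            best, best_score = cand, score
    return best, best_score

# Toy objective standing in for the cross-validated error of a model
# with two hyper-parameters (e.g. learning rate, regularization).
def objective(p):
    lr, reg = p
    return (lr - 0.1) ** 2 + (reg - 1.0) ** 2

def neighbors(p, rng):
    lr, reg = p
    return (lr + rng.uniform(-0.05, 0.05), reg + rng.uniform(-0.3, 0.3))

best, score = local_search(objective, start=(0.5, 3.0), neighbors=neighbors)
```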
Nikhil Garg, Engineering Manager, Quora at MLconf SF 2016MLconf
Building a Machine Learning Platform at Quora: Each month, over 100 million people use Quora to share and grow their knowledge. Machine learning has played a critical role in enabling us to grow to this scale, with applications ranging from understanding content quality to identifying users’ interests and expertise. By investing in a reusable, extensible machine learning platform, our small team of ML engineers has been able to productionize dozens of different models and algorithms that power many features across Quora.
In this talk, I’ll discuss the core ideas behind our ML platform, as well as some of the specific systems, tools, and abstractions that have enabled us to scale our approach to machine learning.
Brian Lucena, Senior Data Scientist, Metis at MLconf SF 2016MLconf
Interpreting Black-Box Models with Applications to Healthcare: Complex and highly interactive models such as Random Forests, Gradient Boosting, and Deep Neural Networks demonstrate superior predictive power compared to their high-bias counterparts, Linear and Logistic Regression. However, these more complex and sophisticated methods lack the interpretability of the simpler alternatives. In some areas of application, such as healthcare, model interpretability is crucial both to build confidence in the model predictions as well as to explain the results on individual cases. This talk will discuss recent approaches to explaining “black-box” models and demonstrate some recently developed tools that aid this effort.
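One widely used model-agnostic technique in this space is permutation feature importance. The sketch below, with a made-up toy "black-box" model, shows the idea; the talk does not necessarily cover this exact tool:

```python
import random

def permutation_importance(predict, X, y, feature_idx, metric, seed=0):
    """Model-agnostic importance of one feature: shuffle that feature's
    values across rows and measure how much the model's score drops.
    A large drop means the model relied heavily on the feature."""
    rng = random.Random(seed)
    base = metric(y, [predict(row) for row in X])
    col = [row[feature_idx] for row in X]
    rng.shuffle(col)
    X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
              for row, v in zip(X, col)]
    return base - metric(y, [predict(row) for row in X_perm])

def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

# Hypothetical "black box": predicts 1 whenever feature 0 exceeds 0.5.
predict = lambda row: 1 if row[0] > 0.5 else 0
X = [[0.9, 0.2], [0.1, 0.8], [0.7, 0.5], [0.3, 0.9]] * 25
y = [predict(row) for row in X]

drop_f0 = permutation_importance(predict, X, y, 0, accuracy)
drop_f1 = permutation_importance(predict, X, y, 1, accuracy)
# drop_f0 > 0 (the model uses feature 0); drop_f1 == 0 (it ignores feature 1)
```

Because the method only needs predictions, it works identically for a Random Forest, a boosted ensemble, or a deep network.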
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017MLconf
Xin Luna Dong is a Principal Scientist at Amazon, leading the efforts of constructing the Amazon Product Graph. She was one of the major contributors to the Knowledge Vault project, and has led the Knowledge-based Trust project, which has been called the “Google Truth Machine” by The Washington Post. She has won the VLDB Early Career Research Contribution Award for “advancing the state of the art of knowledge fusion”, and the Best Demo award at SIGMOD 2005. She has co-authored the book “Big Data Integration”, published 65+ papers in top conferences and journals, and given 20+ keynotes/invited talks/tutorials. She is the PC co-chair for SIGMOD 2018 and WAIM 2015, and serves as an area chair for SIGMOD 2017, CIKM 2017, SIGMOD 2015, ICDE 2013, and CIKM 2011.
Abstract summary
Leave No Valuable Data Behind: the Crazy Ideas and the Business:
With the mission “leave no valuable data behind”, we developed techniques for knowledge fusion to guarantee the correctness of the knowledge. This talk starts by describing a few crazy ideas we have tested. The first, known as “Knowledge Vault”, used 15 extractors to automatically extract knowledge from 1B+ webpages, obtaining 3B+ distinct (subject, predicate, object) knowledge triples and predicting well-calibrated probabilities for extracted triples. The second, known as “Knowledge-Based Trust”, estimated the trustworthiness of 119M webpages and 5.6M websites based on the correctness of their factual information. We then present how we bring these ideas to business, filling the gap between the knowledge in existing knowledge bases and the knowledge in the world.
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016MLconf
Using Bayesian Optimization to Tune Machine Learning Models: In this talk we briefly introduce Bayesian Global Optimization as an efficient way to optimize machine learning model parameters, especially when evaluating different parameters is time-consuming or expensive. We will motivate the problem and give example applications.
We will also talk about our development of a robust benchmark suite for our algorithms including test selection, metric design, infrastructure architecture, visualization, and comparison to other standard and open source methods. We will discuss how this evaluation framework empowers our research engineers to confidently and quickly make changes to our core optimization engine.
We will end with an in-depth example of using these methods to tune the features and hyperparameters of a real world problem and give several real world applications.
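As a rough illustration of the Bayesian optimization loop (not SigOpt's engine), the sketch below fits a tiny Gaussian-process surrogate with an RBF kernel and chooses the next evaluation by a lower-confidence-bound rule, all in plain Python; the kernel lengthscale, acquisition rule, and toy objective are all assumptions made for the example:

```python
import math

def rbf(a, b, length=0.3):
    """Squared-exponential (RBF) kernel; rbf(x, x) == 1."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def bayes_opt(f, lo, hi, n_iter=12):
    """Minimize f on [lo, hi]: fit a GP surrogate to the points seen so
    far, then evaluate where the lower confidence bound (mean - 2*std)
    is smallest, trading off exploitation against exploration."""
    xs = [lo, (lo + hi) / 2.0, hi]
    ys = [f(x) for x in xs]
    grid = [lo + (hi - lo) * i / 100.0 for i in range(101)]
    for _ in range(n_iter):
        n = len(xs)
        K = [[rbf(xs[i], xs[j]) + (1e-6 if i == j else 0.0)
              for j in range(n)] for i in range(n)]
        alpha = solve(K, ys)  # K^-1 y, reused for the posterior mean
        def lcb(x):
            ks = [rbf(xi, x) for xi in xs]
            mean = sum(k * a for k, a in zip(ks, alpha))
            var = max(0.0, 1.0 - sum(k * w for k, w in zip(ks, solve(K, ks))))
            return mean - 2.0 * math.sqrt(var)
        x_next = min(grid, key=lcb)
        xs.append(x_next)
        ys.append(f(x_next))
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]

# Toy objective standing in for an expensive model-training run.
x_best, y_best = bayes_opt(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
```

The point of the surrogate is sample efficiency: each real evaluation may take hours of training, so the cheap model decides where to spend the next one.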
Ross Goodwin, Technologist, Sunspring, MLconf NYC 2017MLconf
Ross Goodwin, Technologist – Creator, Sunspring
Ross Goodwin is a creative technologist, artist, hacker, data scientist, and former White House ghostwriter. Ross helped conceive Sunspring, a 2016 experimental science fiction short film entirely written by an artificial intelligence bot using neural networks. He employs machine learning, natural language processing, and other computational tools to realize new forms and interfaces for written language.
Abstract Summary:
Narrated Reality:
Can machine intelligence enable new forms and interfaces for written language, or does it merely reveal an “uncanny valley” of text? Join Ross Goodwin as he discusses his work with neural networks for creative applications, including expressive image captioning, narration devices for your home and car, and a film (Sunspring) created from a computer generated screenplay.
Virginia Smith, Researcher, UC Berkeley at MLconf SF 2016MLconf
A General Framework for Communication-Efficient Distributed Optimization: Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In light of this, we propose a general framework, CoCoA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. Our framework enjoys strong convergence guarantees and exhibits state-of-the-art empirical performance in the distributed setting. We demonstrate this performance with extensive experiments in Apache Spark, achieving speedups of up to 50x compared to leading distributed methods on common machine learning objectives.
Sanjeev Satheesh, Research Scientist, Baidu at The AI Conference 2017MLconf
Sanjeev Satheesh leads the Deep Speech team at Baidu’s Silicon Valley AI Lab. Baidu SVAIL is focused on developing hard AI technologies to impact hundreds of millions of people.
The Story of End to End Models in Deep Learning
The past few years have seen the explosive entrance of end to end deep learning models - in computer vision, speech recognition, machine translation, text to speech and others. In this talk, we look at this trend to identify what has worked well, and try to make some predictions for the future based on the next set of unsolved problems.
Mayur Thakur, Managing Director, Goldman Sachs, at MLconf NYC 2017MLconf
Mayur is head of the Data Analytics Group in the Global Compliance Division. He joined Goldman Sachs as a managing director in 2014.
Prior to joining the firm, Mayur worked at Google, where he designed search algorithms for more than seven years. Previously, he was an assistant professor of computer science at the University of Missouri.
Mayur earned a PhD in Computer Science from the University of Rochester in 2004 and a BTech in Computer Science and Engineering from the Indian Institute of Technology, Delhi, in 1999.
Abstract Summary:
Surveillance platforms for bank compliance
Bank compliance uses models to look for outlier events such as insider trading, spoofing, front running, etc. With the exponential increase in the size of the data and a growing need to use such models, a key question is: How do we scale these models so they run efficiently and at the same time detect outlier events with good precision and recall?
In this talk, we will describe our experience building, from scratch, a Hadoop-based platform for surveillance.
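As a toy illustration of the kind of outlier check such a surveillance platform might run at scale (the data, feature, and threshold here are all invented for the example), a simple z-score rule over per-account activity looks like:

```python
import math

def zscore_outliers(values, threshold=3.0):
    """Flag indices whose value lies more than `threshold` standard
    deviations from the mean: the simplest form of outlier detection
    a surveillance model might apply per account or per trader."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

# Hypothetical daily trade counts for one account; the last day spikes.
volumes = [100 + (i % 7) for i in range(30)] + [10_000]
flagged = zscore_outliers(volumes)  # -> [30], the spike
```

Real surveillance models are far richer than a z-score, but the scaling question in the talk applies even here: this check must run over every account, every day, with good precision and recall.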
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016MLconf
Machine Learning with TensorFlow: TensorFlow has enabled cutting-edge machine learning research at the top AI labs in the world. At the same time it has made the technology accessible to a large audience leading to some amazing uses. TensorFlow is used for classification, recommendation, text parsing, sentiment analysis and more. This talk will go over the design that makes it fast, flexible, and easy to use, and describe how we continue to make it better.
Power System Simulation: History, State of the Art, and ChallengesLuigi Vanfretti
This talk will give an overview of power system simulation technology through several decades, aiming to provide an understanding of the modeling philosophy and approach that has led to the state of the art in (domain-specific) power system simulation tools. This historical perspective will contrast the de facto proprietary software development method used by the power engineering community against the open source development model. Aspects of resistance to change particular to the power system engineering community will be highlighted.
Given this particular context, power system simulation faces enormous challenges to adapt in order to satisfy simulation needs of both cyber-physical and sustainable system challenges. Such challenges will be highlighted during the talk.
There is, however, an opportunity for disruptive change in power system simulation technology emerging for the EU Smart Grid Mandate M/490, which requires "a set of consistent standards, which will support the information exchange (communication protocols and data models) and the integration of all users into the electric system operation." These regulatory aspects will be explained to highlight the importance of collaboration between the power system domain and computer system experts.
Open modeling and simulation standards may have a large role to play in the development of the European Smart Grid which will have to overcome challenges related to the design, operation and control of cyber-physical and sustainable electrical energy systems. To contribute to this role, the KTH SmarTS Lab research group has been applying the standardized Modelica language and the FMI standard for model exchange in order to couple the domain specific data exchange model (CIM) with the powerful and modern simulation technologies developed by the Modelica community. These efforts will be also discussed.
Modeling and Simulation of Electrical Power Systems using OpenIPSL.org and Gr...Luigi Vanfretti
Title:
Modeling and Simulation of Electrical Power Systems using OpenIPSL.org and GridDyn
Presenters:
Luigi Vanfretti (RPI) & Philip Top (LLNL)
luigi.vanfretti@gmail.com, top1@llnl.gov
Abstract:
The Modelica language, being standardized and equation-based, has proven valuable for model exchange, simulation, and even model validation applications in actual power systems. These important features have now been recognized by the European Network of Transmission System Operators, which has adopted the Modelica language for dynamic model exchange in the Common Grid Model Exchange Standard (v2.5, Annex F).
Following previous FP7 project results, within the ITEA 3 openCPS project, the presenters have continued the efforts of using the Modelica language for power system modeling and simulation, by developing and maintaining the OpenIPSL library: https://github.com/SmarTS-Lab/OpenIPSL
This seminar first gives an overview of the origins of OpenIPSL and its models, contrasts it against typical power system tools, and gives an introduction to the OpenIPSL library. The new project features that help with OpenIPSL maintenance (use of continuous integration, regression testing, documentation, etc.) are also described.
Finally, the seminar will present current work at LLNL that exploits OpenIPSL in coordination with other tools, including ongoing work integrating OpenIPSL models into GridDyn, an open-source power system simulation tool, as well as demos of the use of the OpenIPSL library in GridDyn.
Bios:
Luigi Vanfretti (SMIEEE’14) obtained the M.Sc. and Ph.D. degrees in electric power engineering at Rensselaer Polytechnic Institute, Troy, NY, USA, in 2007 and 2009, respectively.
He was with KTH Royal Institute of Technology, Stockholm, Sweden, as Assistant Professor (2010-2013) and as Associate Professor (Tenured) and Docent (2013-August 2017), where he led the SmarTS Lab research group. He also worked at Statnett SF, the Norwegian electric power transmission system operator, as a consultant (2011-2012) and Special Advisor in R&D (2013-2016).
He joined Rensselaer Polytechnic Institute in August 2017, to continue to develop his research at ALSETLab: http://alsetlab.com
His research interests are in the area of synchrophasor technology applications; and cyber-physical power system modeling, simulation, stability and control.
Philip Top (Lawrence Livermore National Laboratory)
PhD 2007, Purdue University. Currently a Research Engineer at Lawrence Livermore National Laboratory in Livermore, CA. Philip has been involved in several projects connected with the DOE effort on Grid Modernization, including projects on modeling and simulation, co-simulation, and smart grid data analytics. He is the principal developer of the open source power system simulation tool GridDyn, and a key contributor to the HELICS open source co-simulation framework.
[EUC2016] FFWD: latency-aware event stream processing via domain-specific loa...Matteo Ferroni
Tools and applications for event stream processing and real-time analytics are receiving huge hype these days across a wide range of application scenarios, from the smallest Internet of Things (IoT) embedded sensor to the most popular Social Network feed. Unfortunately, dealing with this kind of input raises issues that can easily undermine the real-time analysis requirement due to an unexpected overload of the system; this happens because the processing time may strongly depend on the content of each event, while the event arrival rate may vary unpredictably over time. In this work, we propose Fast Forward With Degradation (FFWD), a latency-aware load shedding framework that exploits performance degradation techniques to adapt the throughput of the application to the size of the input, allowing the system to respond quickly and reliably in case of overload. Moreover, we show how different domain-specific policies can guarantee reasonable accuracy of the aggregated output metrics.
Full paper: http://ieeexplore.ieee.org/document/7982234/
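The core load-shedding idea can be sketched in a few lines: admit an event only if its expected queueing latency stays within a bound, otherwise drop it. This is a deliberately simplified toy policy for illustration, not FFWD's actual degradation mechanism:

```python
def shed_load(events, service_time, max_latency):
    """Toy latency-aware load shedder: admit an event only if the work
    already queued ahead of it can be served within `max_latency`;
    otherwise drop it (degrading output) to protect response time.

    `events` is a list of arrival times in ascending order;
    `service_time` is the per-event processing cost.
    Returns (processed, dropped) counts.
    """
    processed = dropped = 0
    backlog_done = 0.0  # time when the currently queued work finishes
    for t in events:
        backlog_done = max(backlog_done, t)  # idle time has no backlog
        if backlog_done - t + service_time <= max_latency:
            backlog_done += service_time
            processed += 1
        else:
            dropped += 1
    return processed, dropped

# A burst of 10 events at t=0, each costing 1 time unit, 3-unit budget:
# only the first 3 fit within the latency bound; the rest are shed.
result = shed_load([0.0] * 10, service_time=1.0, max_latency=3.0)  # (3, 7)
```

FFWD's domain-specific policies decide *which* events to shed so that aggregated metrics stay accurate; the sketch above sheds blindly by arrival order.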
Approaches to online quantile estimationData Con LA
Data Con LA 2020
Description
This talk will explore and compare several compact data structures for estimation of quantiles on streams, including a discussion of how they balance accuracy against computational resource efficiency. A new approach providing more flexibility in specifying how computational resources should be expended across the distribution will also be explained. Quantiles (e.g., median, 99th percentile) are fundamental summary statistics of one-dimensional distributions. They are particularly important for SLA-type calculations and characterizing latency distributions, but unlike their simpler counterparts such as the mean and standard deviation, their computation is somewhat more expensive. The increasing importance of stream processing (in observability and other domains) and the impossibility of exact online quantile calculation together motivate the construction of compact data structures for estimation of quantiles on streams. In this talk we will explore and compare several such data structures (e.g., moment-based, KLL sketch, t-digest) with an eye towards how they balance accuracy against resource efficiency, theoretical guarantees, and desirable properties such as mergeability. We will also discuss a recent variation of the t-digest which provides more flexibility in specifying how computational resources should be expended across the distribution. No prior knowledge of the subject is assumed. Some familiarity with the general problem area would be helpful but is not required.
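As a baseline against the sketches discussed (moment-based, KLL, t-digest), the simplest streaming quantile estimator just keeps a uniform random sample of the stream via reservoir sampling; this standard baseline is included here for orientation and is not one of the talk's data structures:

```python
import random

class ReservoirQuantile:
    """Baseline streaming quantile estimator: keep a fixed-size uniform
    random sample of the stream (reservoir sampling) and answer quantile
    queries from the sorted sample. Accuracy scales like
    1/sqrt(capacity); sketches such as KLL or t-digest achieve far
    better accuracy per byte, which is the trade-off discussed above."""

    def __init__(self, capacity=1000, seed=0):
        self.capacity = capacity
        self.reservoir = []
        self.n = 0  # total items seen
        self.rng = random.Random(seed)

    def add(self, x):
        self.n += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(x)
        else:
            # Keep x with probability capacity / n, evicting uniformly.
            j = self.rng.randrange(self.n)
            if j < self.capacity:
                self.reservoir[j] = x

    def quantile(self, q):
        s = sorted(self.reservoir)
        return s[min(len(s) - 1, int(q * len(s)))]

est = ReservoirQuantile(capacity=500)
for i in range(100_000):
    est.add(i)
median = est.quantile(0.5)  # close to 50_000, within sampling error
```

Note that two reservoirs are not cleanly mergeable into an equivalent one of the same size, which is one reason mergeability is called out as a desirable property of the purpose-built sketches.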
Speaker
Joe Ross, Splunk, Principal Data Scientist
Modeling adoptions and the stages of the diffusion of innovationsNicola Barbieri
We study the data mining problem of modeling adoptions and the stages of the diffusion of an innovation. For our aim we propose a stochastic model which decomposes a diffusion trace (sequence of adoptions) in an ordered sequence of stages, where each stage is intuitively built around two dimensions: users and relative speed at which adoptions happen. Each stage is characterized by a specific rate of adoption and it involves different users to different extent, while the sequentiality in the diffusion is guaranteed by constraining the transition probabilities among stages.
An empirical evaluation on synthetic and real-world adoption logs shows the effectiveness of the proposed framework in summarizing the adoption process, enabling several analysis tasks such as the identification of adopter categories, clustering and characterization of diffusion traces, and prediction of which users will adopt an item in the near future.
A Multi-Agent System Approach to Load-Balancing and Resource Allocation for D...Soumya Banerjee
In this research we use a decentralized computing approach to allocate and schedule tasks on a massively distributed grid. Using emergent properties of multi-agent systems, the algorithm dynamically creates and dissociates clusters to serve the changing resource demands of a global task queue. The algorithm is compared to a standard First-In First-Out (FIFO) scheduling algorithm. Experiments done on a simulator show that the distributed resource allocation protocol (dRAP) algorithm outperforms the FIFO scheduling algorithm on time to empty the queue, average waiting time and CPU utilization. Such a decentralized computing approach holds promise for massively distributed processing scenarios like SETI@home and Google MapReduce.
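The FIFO baseline above can be sketched as a short simulation, assuming a single resource and tasks given as (arrival, service) pairs; this is a toy illustration of the metric, not the paper's simulator:

```python
def fifo_average_wait(tasks):
    """Simulate a single-queue FIFO scheduler.

    `tasks` is a list of (arrival_time, service_time) pairs sorted by
    arrival. Returns the average time tasks spend waiting before
    service starts, one of the metrics dRAP is compared against.
    """
    clock = 0.0
    total_wait = 0.0
    for arrival, service in tasks:
        start = max(clock, arrival)  # wait until the resource is free
        total_wait += start - arrival
        clock = start + service
    return total_wait / len(tasks)

# Three tasks arriving close together back up behind the first one.
avg = fifo_average_wait([(0, 4), (1, 2), (2, 1)])  # waits: 0, 3, 4 -> 7/3
```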
RSC: Mining and Modeling Temporal Activity in Social MediaAlceu Ferraz Costa
Presentation of the KDD 2015 paper describing the RSC model:
RSC: Mining and Modeling Temporal Activity in Social Media
Alceu Ferraz Costa, Yuto Yamaguchi, Agma Juci Machado Traina, Caetano Traina Jr., and Christos Faloutsos
The 21st SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2015
Similar to Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...MLconf
Understanding Human Impact: Social and Equity Assessments for AI Technologies
Social and Equity Impact Assessments have broad applications, but they can be a useful tool to explore and mitigate machine learning fairness issues, and can be applied to product-specific questions as a way to generate insights and learnings about users, as well as about impacts on society broadly, as a result of the deployment of new and emerging technologies.
In this presentation, my goal is to advocate for and highlight the need for community consultation and external stakeholder engagement to develop a new knowledge base and understanding of the human and social consequences of algorithmic decision making, and to introduce principles, methods, and processes for these types of impact assessments.
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingMLconf
The Brain’s Guide to Dealing with Context in Language Understanding
Like the visual cortex, the regions of the brain involved in understanding language represent information hierarchically. But whereas the visual cortex organizes things into a spatial hierarchy, the language regions encode information into a hierarchy of timescale. This organization is key to our uniquely human ability to integrate semantic information across narratives. More and more, deep learning-based approaches to natural language understanding embrace models that incorporate contextual information at varying timescales. This has not only led to state-of-the art performance on many difficult natural language tasks, but also to breakthroughs in our understanding of brain activity.
In this talk, we will discuss the important connection between language understanding and context at different timescales. We will explore how different deep learning architectures capture timescales in language and how closely their encodings mimic the brain. Along the way, we will uncover some surprising discoveries about what depth does and doesn’t buy you in deep recurrent neural networks. And we’ll describe a new, more flexible way to think about these architectures and ease design space exploration. Finally, we’ll discuss some of the exciting applications made possible by these breakthroughs.
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...MLconf
Applying Computer Vision to Reduce Contamination in the Recycling Stream
With China’s recent refusal of most foreign recyclables, North American waste haulers are scrambling to figure out how to make on-shore recycling cost-effective in order to continue providing recycling services. Recyclables that were once being shipped to China for manual sorting are now primarily being redirected to landfills or incinerators. Without a solution, a nearly $5 billion annual recycling market could come to a halt.
Purity in the recycling stream is key to this effort as contaminants in the stream can increase the cost of operations, damage equipment and reduce the ability to create pure commodities suitable for creating recycled goods. This market disruption as a result of China’s new regulations, however, provides us the chance to re-examine and improve our current disposal & collection habits with modern monitoring & artificial intelligence technology.
Using images from our in-dumpster cameras, Compology has developed an ML-based process that helps identify, measure and alert for contaminants in recycling containers before they are picked-up, helping keep the recycling stream clean.
Our convolutional neural network flags potential instances of contamination inside a dumpster, enabling garbage haulers to know which containers have the wrong type of material inside. This allows them to provide targeted, timely education, and when appropriate, assess fines, to improve recycling compliance at the businesses and residences they serve, helping keep recycling services financially viable.
In this presentation, we will walk through our ML-based contamination measurement and scoring process by showing how Waste Management, a national waste hauler, has achieved a 57% reduction in contamination across nearly 2,000 containers over six months. This progress shows significant strides towards financially viable recycling services.
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushMLconf
Quantum Computing: a Treasure Hunt, not a Gold Rush
Quantum computers promise a significant step up in computational power over conventional computers, but they also suffer from a number of counterintuitive limitations, both in their computational model and in leading lab implementations. In this talk, we review how quantum computers compete with conventional computers and how conventional computers try to hold their ground. Then we outline what stands in the way of successful quantum ML applications.
Josh Wills - Data Labeling as Religious ExperienceMLconf
Data Labeling as Religious Experience
One of the most common places to deploy a production machine learning system is as a replacement for a legacy rules-based system that is having a hard time keeping up with new edge cases and requirements. I'll walk through the process and tooling we used to design, train, and deploy a model to replace a set of static rules we had for handling invite spam at Slack, talk about what we learned, and discuss some problems to solve in order to make these migrations easier for everyone.
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...MLconf
Project GaitNet: Ushering in the ImageNet moment for human Gait kinematics
The emergence of the upright human bipedal gait can be traced back 4 to 2.8 million years ago, to the now extinct hominin Australopithecus afarensis. Fine-grained analysis of gait using the modern MEMS sensors found on all smartphones not only reveals a lot about a person's orthopedic and neuromuscular health status, but also carries enough idiosyncratic clues to be harnessed as a passive biometric. While the machine learning community has made many siloed attempts to model bipedal gait sensor data, these were done with small datasets, often collected in restricted academic environs. In this talk, we will introduce the ImageNet moment for human gait analysis by presenting 'Project GaitNet', the largest planet-scale motion-sensor-based human bipedal gait dataset ever curated. We'll also present the associated state-of-the-art results in classifying humans using novel deep neural architectures, and the related success stories we have enjoyed in transfer learning to disparate domains of human kinematics analysis.
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...MLconf
Machine Learning Methods in Detecting Alzheimer’s Disease from Speech and Language
Alzheimer's disease affects millions of people worldwide, and it is important to predict the disease as early and as accurately as possible. In this talk, I will discuss the development of novel ML models that help distinguish healthy people from those who develop Alzheimer's, using short samples of human speech. As input to the model, features of different modalities are extracted from speech audio samples and transcriptions: (1) syntactic measures, such as production rules extracted from syntactic parse trees; (2) lexical measures, such as features of lexical richness and complexity and lexical norms; and (3) acoustic measures, such as standard Mel-frequency cepstral coefficients. I will present an ML model that detects cognitive impairment by reaching agreement among modalities. The resulting model achieves state-of-the-art performance in both supervised and semi-supervised settings, using manual transcripts of human speech. Additionally, I will discuss potential limitations of any fully automated speech-based Alzheimer's detection model, focusing mostly on the impact of a not-so-accurate automatic speech recognition (ASR) system on classification performance. To illustrate this, I will present experiments with controlled amounts of artificially generated ASR errors and explain why deletion errors affect Alzheimer's detection performance the most, due to their impact on features of syntactic and lexical complexity.
Meghana Ravikumar - Optimized Image Classification on the CheapMLconf
Optimized Image Classification on the Cheap
In this talk, we anchor on building an image classifier trained on the Stanford Cars dataset to evaluate two approaches to transfer learning (fine-tuning and feature extraction) and the impact of hyperparameter optimization on these techniques. Once we identify the most performant transfer learning technique for Stanford Cars, we will double the size of the dataset through image augmentation to boost the classifier's performance. We will use Bayesian optimization to learn the hyperparameters associated with image transformations, using the downstream image classifier's performance as the guide. In conjunction with model performance, we will also examine the features of these augmented images and the downstream implications for our image classifier.
To both maximize model performance on a budget and explore the impact of optimization on these methods, we apply a particularly efficient implementation of Bayesian optimization to each of these architectures in this comparison. Our goal is to draw on a rigorous set of experimental results that can help us answer the question: how can resource-constrained teams make trade-offs between efficiency and effectiveness using pre-trained models?
Noam Finkelstein - The Importance of Modeling Data CollectionMLconf
The Importance of Modeling Data Collection
Data sets used in machine learning are often collected in a systematically biased way - certain data points are more likely to be collected than others. We call this "observation bias". For example, in health care, we are more likely to see lab tests when the patient is feeling unwell than otherwise. Failing to account for observation bias can, of course, result in poor predictions on new data. By contrast, properly accounting for this bias allows us to make better use of the data we do have.
In this presentation, we discuss practical and theoretical approaches to dealing with observation bias. When the nature of the bias is known, there are simple adjustments we can make to nonparametric function estimation techniques, such as Gaussian Process models. We also discuss the scenario where the data collection model is unknown. In this case, there are steps we can take to estimate it from observed data. Finally, we demonstrate that having a small subset of data points that are known to be collected at random - that is, in an unbiased way - can vastly improve our ability to account for observation bias in the rest of the data set.
My hope is that attendees of this presentation will be aware of the perils of observation bias in their own work, and be equipped with tools to address it.
The Uncanny Valley of ML
Every so often, the conundrum of the Uncanny Valley re-emerges as advanced technologies evolve from clearly experimental products to refined accepted technologies. We have seen its effects in robotics, computer graphics, and page load times. The debate of how to handle the new technology detracts from its benefits. When machine learning is added to human decision systems a similar effect can be measured in increased response time and decreased accuracy. These systems include radiology, judicial assignments, bus schedules, housing prices, power grids and a growing variety of applications. Unfortunately, the Uncanny Valley of ML can be hard to detect in these systems and can lead to degraded system performance when ML is introduced, at great expense. Here, we'll introduce key design principles for introducing ML into human decision systems to navigate around the Uncanny Valley and avoid its pitfalls.
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksMLconf
Deep Learning Architectures for Semantic Relation Detection Tasks
Recognizing and distinguishing specific semantic relations from other types of semantic relations is an essential part of language understanding systems. Identifying expressions with similar and contrasting meanings is valuable for NLP systems that go beyond recognizing semantic relatedness and must identify specific semantic relations. In this talk, I will first present novel techniques for creating the labelled datasets required for training deep learning models to classify semantic relations between phrases. I will then present various neural network architectures that integrate morphological features into combined path-based and distributional relation detection algorithms, and demonstrate that this model outperforms state-of-the-art models in distinguishing semantic relations and is capable of efficiently handling multi-word expressions.
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...MLconf
Building an Incrementally Trained, Local Taste Aware, Global Deep Learned Recommender System Model
At Netflix, our main goal is to maximize our members’ enjoyment of the selected show by minimizing the amount of time it takes for them to find it. We try to achieve this goal by personalizing almost all the aspects of our product -- from what shows to recommend, to how to present these shows and construct their home-pages to what images to select per show, among many other things. Everything is recommendations for us and as an applied Machine Learning group, we spend our time building models for personalization that will eventually increase the joy and satisfaction of our members. In this talk we will primarily focus our attention on a) making a global deep learned recommender model that is regional tastes and popularity aware and b) adapting this model to changing taste preferences as well as dynamic catalog availability.
We will first go through some standard recommender system models that use Matrix Factorization and Topic Models, and then compare and contrast them with more powerful, higher-capacity deep learning models such as sequence models that use recurrent neural networks. We will show what it entails to build a global model that is aware of regional taste preferences and catalog availability, and how models built on the simple Maximum Likelihood principle fail to do this. We will then describe one solution we have employed to enable global deep learned models to focus their attention on capturing regional taste preferences and a changing catalog. In the latter half of the talk, we will discuss how we do incremental learning of deep learned recommender system models. Why do we need to do that? Everything changes with time. Users' tastes change with time. What's available on Netflix and what's popular also change over time. Therefore, updating or improving recommendation systems over time is necessary to bring more joy to users. In addition to how we apply incremental learning, we will discuss some of the challenges we face involving large-scale data preparation, infrastructure setup for incremental model training, and pipeline scheduling. Incremental training enables us to serve fresher models trained on fresher and larger amounts of data. This helps our recommender system adapt quickly to catalog and taste changes, and improves overall performance.
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldMLconf
The Voice: New Challenges in a Zero UI World
The adoption of voice-enabled devices has seen an explosive growth in the last few years and music consumption is among the most popular use cases. Music personalization and recommendation plays a major role at Pandora in providing a daily delightful listening experience for millions of users. In turn, providing the same perfectly tailored listening experience through these novel voice interfaces brings new interesting challenges and exciting opportunities. In this talk we will describe how we apply personalization and recommendation techniques in three common voice scenarios which can be defined in terms of request types: known-item, thematic, and broad open-ended. We will describe how we use deep learning slot filling techniques and query classification to interpret the user intent and identify the main concepts in the query.
We will also present the differences and challenges regarding evaluation of voice powered recommendation systems. Since pure voice interfaces do not contain visual UI elements, relevance labels need to be inferred through implicit actions such as play time, query reformulations or other types of session level information. Another difference is that while the typical recommendation task corresponds to recommending a ranked list of items, a voice play request translates into a single item play action. Thus, some considerations about closed feedback loops need to be made. In summary, improving the quality of voice interactions in music services is a relatively new challenge and many exciting opportunities for breakthroughs still remain. There are many new aspects of recommendation system interfaces to address to bring a delightful and effortless experience for voice users. We will share a few open challenges to solve for the future.
2. Outline
• Personalization
• Latent Variable Models
• User Engagement and Return Times
• Deep Recommender Systems
• MXNet
• Basic concepts
• Launching a cluster in a minute
• ImageNet for beginners
4. Latent Variable Models
• Temporal sequence of observations
Purchases, likes, app use, e-mails, ad clicks, queries, ratings
• Latent state to explain behavior
• Clusters (navigational, informational queries in search)
• Topics (interest distributions for users over time)
• Kalman Filter (trajectory and location modeling)
[diagram: observed action vs. latent explanation]
5. Latent Variable Models (continued)
Are the parametric models really true?
6. Latent Variable Models (continued)
• Nonparametric model / spectral
• Use data to determine shape
• Sidestep approximate inference
h_t = f(x_{t−1}, h_{t−1})
x_t = g(x_{t−1}, h_t)
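The latent-state recursion h_t = f(x_{t−1}, h_{t−1}), x_t = g(x_{t−1}, h_t) can be rolled out with any choice of transition f and emission g. A minimal sketch, where the particular f and g below are made up purely for illustration:

```python
import numpy as np

def f(x_prev, h_prev):
    # made-up transition: fold the previous observation into the latent state
    return np.tanh(0.5 * h_prev + 0.5 * x_prev)

def g(x_prev, h):
    # made-up emission: generate the next observation from the latent state
    return 0.9 * h + 0.1 * x_prev

h, x = np.zeros(3), np.ones(3)
trajectory = []
for t in range(5):
    h = f(x, h)              # h_t = f(x_{t-1}, h_{t-1})
    x = g(x, h)              # x_t = g(x_{t-1}, h_t)
    trajectory.append(x.copy())
```

The point of the nonparametric/deep view is that f and g need not be fixed by hand as above; they can be learned from data.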
7. Latent Variable Models (continued)
• Plain deep network = RNN
• Deep network with attention = LSTM / GRU …
(learn when to update state, how to read out)
8. Long Short Term Memory
Hochreiter and Schmidhuber, 1997
i_t = σ(W_i(x_t, h_t) + b_i)
f_t = σ(W_f(x_t, h_t) + b_f)
z_{t+1} = f_t · z_t + i_t · tanh(W_z(x_t, h_t) + b_z)
o_t = σ(W_o(x_t, h_t, z_{t+1}) + b_o)
h_{t+1} = o_t · tanh(z_{t+1})
9. Long Short Term Memory
Hochreiter and Schmidhuber, 1997
(z_{t+1}, h_{t+1}, o_t) = LSTM(z_t, h_t, x_t)
Treat it as a black box
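Even treated as a black box, the update equations are short enough to sketch directly in NumPy. The weight shapes and the concatenation of (x, h) below are illustrative assumptions, not a production LSTM:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(z, h, x, params):
    """One LSTM update (z_t, h_t, x_t) -> (z_{t+1}, h_{t+1}, o_t),
    following the slide's equations."""
    Wi, bi, Wf, bf, Wz, bz, Wo, bo = params
    xh = np.concatenate([x, h])
    i = sigmoid(Wi @ xh + bi)                   # input gate
    f = sigmoid(Wf @ xh + bf)                   # forget gate
    z_next = f * z + i * np.tanh(Wz @ xh + bz)  # cell state
    o = sigmoid(Wo @ np.concatenate([x, h, z_next]) + bo)  # output gate
    h_next = o * np.tanh(z_next)                # hidden state
    return z_next, h_next, o

# smoke test with random parameters; dimensions are illustrative
rng = np.random.default_rng(0)
dx, dh = 4, 3
params = (rng.normal(size=(dh, dx + dh)), np.zeros(dh),
          rng.normal(size=(dh, dx + dh)), np.zeros(dh),
          rng.normal(size=(dh, dx + dh)), np.zeros(dh),
          rng.normal(size=(dh, dx + dh + dh)), np.zeros(dh))
z, h = np.zeros(dh), np.zeros(dh)
for x in rng.normal(size=(5, dx)):
    z, h, o = lstm_step(z, h, x, params)
```

Because the output gate multiplies a tanh, the hidden state stays bounded in (−1, 1), which is part of what makes the cell stable over long sequences.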
11. User Engagement Modeling
• User engagement is gradual
• Daily average users?
• Weekly average users?
• Number of active users?
• Number of users?
• Abandonment is passive
• The last time you tweeted? Pin? Like? Skype?
• Churn models assume active abandonment
(insurance, phone, bank)
12. User Engagement Modeling
• User engagement is gradual
• Model user returns
• Context of activity
• World events (elections, Super Bowl, …)
• User habits (morning reader, night owl)
• Previous reading behavior
(poor quality content will discourage return)
13. Survival Analysis 101
• Model population where something dramatic happens
• Cancer patients (death; efficacy of a drug)
• Atoms (radioactive decay)
• Japanese women (marriage)
• Users (opens app)
• Survival probability
It is well known that the differential equation can be solved by partial integration, i.e.
Pr(t_survival ≥ T) = exp( −∫₀ᵀ λ(t) dt )   (2)
where λ(t) is the hazard rate function.
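Equation (2) is straightforward to evaluate numerically for any hazard rate λ(t). A minimal sketch using midpoint integration; the constant 0.1/hour hazard is a made-up example:

```python
import numpy as np

def survival_prob(hazard, T, n=100_000):
    """Pr(t_survival >= T) = exp(-integral_0^T lambda(t) dt),
    approximated with the midpoint rule over n sub-intervals."""
    dt = T / n
    t = (np.arange(n) + 0.5) * dt        # midpoints of the sub-intervals
    return float(np.exp(-hazard(t).sum() * dt))

# a constant hazard of 0.1 per hour gives the familiar exponential survival curve
p = survival_prob(lambda t: np.full_like(t, 0.1), T=24.0)
```

For a constant hazard this reduces to exp(−0.1 · 24); a time-varying λ(t) (e.g. one with a daily bump) just changes the integrand.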
14. Session Model
• User activity is a sequence of times
• b_i: when the app is opened
• e_i: when the app is closed
• In between, wait for the user to return
• Model the likelihood of user activity
[diagram: session timeline with start and end markers]
16. Personalized LSTM
[Fig. 2: diagram omitted. Unfolded LSTM network for 3 sessions (s−2, s−1, s), each with Input, Hidden1 and Hidden2 layers. The input vector for session s is the concatenation of user embedding, time slot embedding and the …]
• LSTM for global state update
• LSTM for individual state update
• Update both of them
• Learn using backprop and SGD
Jing and Smola, WSDM’17
17. Perplexity (quality of prediction)
[Fig. 6: histograms omitted. The histogram of the time period between two sessions (next visit time, in hours); top: Toutiao, bottom: Last.fm. The small bump around 24 hours corresponds to users having a daily habit of using the app at the same time.]
• global constant model: a static model with only one parameter, assuming that the rate is constant throughout the time frame for all users.
• global+user constant model: a static model that assumes the rate is an additive function of a global constant and a user-specific constant.
• piecewise constant model: a more flexible static model that learns a parameter for each discretized bin.
• Hawkes process: a self-exciting point process that respects past sessions.
• integrated model: a combined model with all the above components.
• DNN: a model that assumes the rate is a function of time, user and session features, parameterized by a deep neural network.
• LSTM: a recurrent neural network that incorporates past activities.
For completeness, we also report the result for Cox's model, where the hazard rate is given by
λ_u(t) = λ_0(t) exp(⟨w, x_u(t)⟩)   (28)
The perplexity is
perp = exp( −(1/M) Σ_{u=1}^{m} Σ_{i=1}^{m_u} log p({b_i, e_i}; …) )   (29)
where M is the total number of sessions in the test set. The
lower the value, the better the model is at explaining the
test data. In other words, perplexity measures the amount of surprise in a user's behavior relative to our prediction; a good model predicts well, hence there is less surprise.
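Given per-session log-likelihoods, the perplexity of equation (29) is one line of NumPy. The example probabilities below are made up for illustration:

```python
import numpy as np

def perplexity(log_probs):
    """perp = exp(-(1/M) * sum of log-likelihoods):
    average surprise per test session; lower is better."""
    log_probs = np.asarray(log_probs, dtype=float)
    return float(np.exp(-log_probs.mean()))

# a model that assigns probability 1/4 to every held-out session has perplexity 4
p = perplexity(np.log([0.25] * 8))
```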
6.6 Model Comparison
The summarized results are shown in table 1. As can be seen
from the table, there is a big gap between linear models
and the two deep models. The Cox model is inferior to
our integrated model and significantly worse than the deep
networks.
model                Toutiao   Last.fm
Cox Model            27.13     28.31
global constant      45.29     59.98
user constant        28.74     45.44
piecewise constant   26.88     26.12
Hawkes process       22.58     30.80
integrated model     21.56     26.06
DNN                  18.87     20.62
LSTM                 18.10     19.80
TABLE 1. Average perplexity evaluated on the test set for different models.
18. Perplexity (quality of prediction)
[Fig. 7: plots omitted. Top row: average test perplexity as a function of the fraction of sessions (%), comparing the global constant, user constant, piecewise constant, Hawkes process, Integrated, Cox, DNN and LSTM models. Bottom row: relative improvements (%) of the LSTM over the Integrated and the Cox models. Left column: Toutiao dataset.]
Jing and Smola, WSDM'17
19. [Fig. 9: plots omitted. Six randomly sampled learned predictive rate functions, three from Toutiao (left) and three from Last.fm (right). Each pair of figures shows the instantaneous rate λ(t) (purple), the survival function Pr(return ≥ t) (red), and the actual return time (blue). Clearly, our deep model is …]
29. Why yet another deep network tool?
• Frugality & resource efficiency
Engineered for cheap GPUs with smaller memory and slow networks
• Speed
• Linear scaling with #machines and #GPUs
• High efficiency on a single machine, too (C++ backend)
• Simplicity
Mix declarative and imperative code
[diagram: language frontends over a shared backend: a single implementation of the backend system and common operators gives a performance guarantee regardless of which frontend language is used]
30. Imperative Programs
import numpy as np
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
print(c)
d = c + 1
Easy to tweak with Python code
Pro
• Straightforward and flexible
• Takes advantage of native language features (loops, conditions, debugger)
Con
• Hard to optimize
31. Declarative Programs
A = Variable('A')
B = Variable('B')
C = B * A
D = C + 1
f = compile(D)
d = f(A=np.ones(10),
      B=np.ones(10)*2)
Pro
• More chances for optimization
• Works across different languages
Con
• Less flexible
[diagram: computation graph with inputs A and B feeding ⨉ to produce C, then + 1 to produce D]
C can share memory with D, because C is deleted later
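The snippet above is pseudocode for the declarative style. A self-contained toy version in plain Python, where `Node`, `Variable` and `compile_graph` are made-up names for illustration (not the MXNet API), might look like:

```python
import numpy as np

class Node:
    """A node in a deferred computation graph: nothing runs at build time."""
    def __init__(self, op, *args):
        self.op, self.args = op, args
    def __mul__(self, other): return Node('mul', self, other)
    def __add__(self, other): return Node('add', self, other)

def Variable(name):
    return Node('var', name)

def compile_graph(node):
    """'Compile' the graph into a callable that evaluates it on fed values."""
    def run(**feed):
        def ev(n):
            if not isinstance(n, Node):
                return n                  # a plain constant such as the literal 1
            if n.op == 'var':
                return feed[n.args[0]]
            a, b = (ev(arg) for arg in n.args)
            return a * b if n.op == 'mul' else a + b
        return ev(node)
    return run

A, B = Variable('A'), Variable('B')
D = B * A + 1              # build the whole graph first ...
f = compile_graph(D)       # ... then compile ...
d = f(A=np.ones(10), B=np.ones(10) * 2)   # ... and finally feed values
```

Because the full graph is known before execution, a real backend can rewrite it, e.g. reusing C's memory for D as noted above.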
32. Imperative vs. Declarative for Deep Learning
• Computational graph of the deep architecture (forward / backward): needs heavy optimization, fits declarative programs
• Updates and interactions with the graph need mutation and more native language features, good for imperative programs:
• Iteration loops
• Parameter update: w ← w − η ∂_w f(w)
• Beam search
• Feature extraction …
33. Mixed Style Training Loop in MXNet
executor = neuralnetwork.bind()
for i in range(3):
    train_iter.reset()
    for dbatch in train_iter:
        args["data"][:] = dbatch.data[0]
        args["softmax_label"][:] = dbatch.label[0]
        executor.forward(is_train=True)
        executor.backward()
        for key in update_keys:
            args[key] -= learning_rate * grads[key]
• The executor is bound from a declarative program that describes the network
• Imperative NDArrays can be set as input nodes to the graph
• Imperative parameter update on the GPU
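The same three-part structure (forward pass, backward pass, imperative parameter update) can be reproduced in plain NumPy. This is an illustrative analogue on a made-up logistic-regression task, not the MXNet executor:

```python
import numpy as np

# made-up, linearly separable dataset standing in for train_iter
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = (X @ true_w > 0).astype(float)

w, learning_rate = np.zeros(5), 0.1
for epoch in range(3):                        # outer epoch loop
    for start in range(0, len(X), 32):        # mini-batches, like train_iter
        xb, yb = X[start:start + 32], y[start:start + 32]
        p = 1.0 / (1.0 + np.exp(-(xb @ w)))   # "forward": predicted probabilities
        grad = xb.T @ (p - yb) / len(xb)      # "backward": gradient of the loss
        w -= learning_rate * grad             # imperative parameter update

accuracy = ((X @ w > 0).astype(float) == y).mean()
```

In MXNet the forward/backward steps run over the compiled declarative graph, while the update line stays ordinary imperative code, which is exactly the mix the slide advocates.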
34. Mixed API for Quick Extensions
• Runtime switching between different graphs depending on the input
• Useful for sequence modeling and image size reshaping
• Implemented with imperative code in Python: ~10 additional lines
• Bucketing for variable-length sentences
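Bucketing itself is simple to sketch: group variable-length sentences so that every sequence in a bucket shares one padded length, and a single unrolled graph per bucket suffices. The bucket sizes and the '<pad>' token below are illustrative choices, not MXNet's implementation:

```python
from collections import defaultdict

def bucket(sentences, bucket_sizes=(10, 20, 30)):
    """Assign each sentence to the smallest bucket it fits in, padding to
    that bucket's length."""
    buckets = defaultdict(list)
    for sentence in sentences:
        for size in bucket_sizes:
            if len(sentence) <= size:
                padded = sentence + ['<pad>'] * (size - len(sentence))
                buckets[size].append(padded)
                break
    return dict(buckets)

b = bucket([['a'], ['b'] * 12, ['c'] * 3])
# sentences of length 1 and 3 land in bucket 10; length 12 lands in bucket 20
```

At training time the runtime then switches to the unrolled graph matching the batch's bucket, which is the graph-switching use case mentioned above.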
41. Getting Started
• Website
http://mxnet.io/
• GitHub repository
git clone --recursive git@github.com:dmlc/mxnet.git
• Docker
docker pull dmlc/mxnet
• Amazon AWS Deep Learning AMI (with other toolkits & anaconda)
https://aws.amazon.com/marketplace/pp/B01M0AXXQB
http://bit.ly/deepami
• CloudFormation Template
https://github.com/dmlc/mxnet/tree/master/tools/cfn
http://bit.ly/deepcfn
42. Acknowledgements
• User engagement
How Jing, Chao-Yuan Wu
• Temporal recommenders
Chao-Yuan Wu, Alex Beutel, Amr Ahmed
• MXNet & Deep Learning AMI
Mu Li, Tianqi Chen, Bing Xu, Eric Xie, Joseph Spisak,
Naveen Swamy, Anirudh Subramanian and many more …
We are hiring
{smola, thakerb, spisakj}@amazon.com