Understanding Deep Learning for Big Data: The complexity and scale of big data impose tremendous challenges on their analysis. Yet big data also offer us great opportunities. Some nonlinear phenomena, features, or relations, which are unclear or cannot be inferred reliably from small and medium-sized data, become clear and can be learned robustly from big data. Typically, the form of the nonlinearity is unknown to us and needs to be learned from data as well. Being able to harness the nonlinear structures in big data could allow us to tackle problems that were previously impossible, or to obtain results far better than the previous state of the art.
Nowadays, deep neural networks are the methods of choice for large-scale nonlinear learning problems. What makes deep neural networks work? Is there any general principle for tackling high-dimensional nonlinear problems that we can learn from deep neural networks? Can we design competitive or better alternatives based on such knowledge? To make progress on these questions, my machine learning group performed both theoretical and experimental analyses of existing and new deep learning architectures, investigating three crucial aspects: the usefulness of the fully connected layers, the advantage of the feature learning process, and the importance of the compositional structures. Our results point to some promising directions for future research and provide guidelines for building new deep learning models.
5. Traditional image features not learned end-to-end
[Pipeline diagram: divide image into patches → handcrafted feature extractor (e.g., SIFT) → combine features → learn classifier; a code sketch of this baseline follows]
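The diagram above is the pre-deep-learning baseline that the talk contrasts with end-to-end training: features are designed by hand and the classifier is learned separately. Below is a minimal sketch of such a pipeline, assuming HOG descriptors in place of SIFT and toy placeholder data; the patch size, dataset, and classifier are illustrative, not the talk's actual setup.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def handcrafted_features(image, patch=16):
    """Divide a grayscale image into patches and describe each with HOG (stand-in for SIFT)."""
    h, w = image.shape
    descriptors = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            d = hog(image[i:i + patch, j:j + patch],
                    pixels_per_cell=(8, 8), cells_per_block=(1, 1))
            descriptors.append(d)
    return np.concatenate(descriptors)          # "combine features" step

# Placeholder data: (N, 64, 64) grayscale images with labels in {0, ..., 9}.
images = np.random.rand(100, 64, 64)
labels = np.random.randint(0, 10, size=100)

X = np.stack([handcrafted_features(im) for im in images])
clf = LinearSVC().fit(X, labels)                # classifier trained separately, not end-to-end
```

The point of the contrast: nothing in this pipeline adapts the features to the task, which is exactly what end-to-end training changes.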
11. Kernel machine learns faster
ImageNet: 1.3M original images, 1000 classes
Random cropping and mirroring of images in a streaming fashion
[Plot: test top-1 error (%) vs. number of training samples (10^5 to 10^8), comparing jointly-trained neural net, fixed neural net, and doubly SGD; final errors of 47.8%, 44.5%, and 42.6%; training took about 1 week using a GPU; random guessing gives 99.9% error]
(a sketch of the doubly SGD update follows)
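The "doubly SGD" curve refers to doubly stochastic gradients for kernel machines: each update is stochastic in both the sampled data point and the sampled random feature used to approximate the kernel. Below is a minimal sketch of that core loop, assuming an RBF kernel, squared loss, and toy-scale data; none of these choices reflect the ImageNet configuration above.

```python
import numpy as np

def doubly_sgd(X, y, n_iters=1000, gamma=1.0, step=1.0):
    """Fit f(x) = sum_t alpha_t * phi_t(x) using one data point and one random feature per step."""
    n, d = X.shape
    alpha = np.zeros(n_iters)

    def feature(x, t):
        # Seeding by t lets us regenerate (w_t, b_t) on demand instead of storing them.
        rng = np.random.default_rng(t)
        w = rng.normal(scale=np.sqrt(2.0 * gamma), size=d)   # spectral measure of the RBF kernel
        b = rng.uniform(0.0, 2.0 * np.pi)
        return np.sqrt(2.0) * np.cos(x @ w + b)              # random Fourier feature

    for t in range(n_iters):
        i = np.random.randint(n)                              # stochastic in the data
        pred = sum(alpha[s] * feature(X[i], s) for s in range(t))
        grad = pred - y[i]                                    # squared-loss residual
        alpha[t] = -(step / (t + 1)) * grad * feature(X[i], t)  # stochastic in the feature

    def predict(x):
        return sum(alpha[t] * feature(x, t) for t in range(n_iters))
    return predict

# Toy regression example (purely illustrative):
X = np.random.randn(200, 5)
y = np.sin(X[:, 0])
f = doubly_sgd(X, y)
```

The seed-by-iteration trick means random features never need to be stored, which is what lets this kind of kernel machine scale in the number of features.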
12. Similar results with MNIST8M
Classification with handwritten digits
8M images, 10 classes
LeNet5
13. Similar results with CIFAR10
Classification with internet images
60K images, 10 classes
32. Deep learning with time-varying computation graph
[Figure: mini-batch 1 of interaction events at times t0 < t1 < t2 < t3 along the time axis]
Computation graph of the RNN determined by:
1. The bipartite interaction graph
2. The temporal ordering of events
(a code sketch of this construction follows)
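Below is a minimal sketch of how such a time-varying computation graph can be laid out for one mini-batch: two coupled recurrent cells walk through the events in temporal order, and each event updates the embeddings of the user and the item it connects in the bipartite graph. The cell type, dimensions, and names are illustrative assumptions in the spirit of the coevolution model, not the exact architecture from the talk.

```python
import torch
import torch.nn as nn

class CoevolveSketch(nn.Module):
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.users = nn.Parameter(torch.zeros(n_users, dim))   # initial user embeddings
        self.items = nn.Parameter(torch.zeros(n_items, dim))   # initial item embeddings
        self.user_cell = nn.GRUCell(dim, dim)   # user state driven by the item it touched
        self.item_cell = nn.GRUCell(dim, dim)   # item state driven by the user who touched it

    def forward(self, events):
        """events: [(user_id, item_id), ...] sorted by event time t0 < t1 < ..."""
        users, items = self.users, self.items
        for u, i in events:                      # temporal ordering defines the graph
            new_u = self.user_cell(items[i:i + 1], users[u:u + 1])
            new_i = self.item_cell(users[u:u + 1], items[i:i + 1])
            users = torch.cat([users[:u], new_u, users[u + 1:]])  # functional update keeps autograd intact
            items = torch.cat([items[:i], new_i, items[i + 1:]])
        return users, items

# One mini-batch of (user, item) interaction events, already time-ordered.
model = CoevolveSketch(n_users=5, n_items=7)
users, items = model([(0, 3), (2, 3), (0, 1), (4, 6)])
```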
33. Much improved prediction on Reddit dataset
[Two result panels: next item prediction and return time prediction]
1,000 users, 1,403 groups, ~10K interactions
MAR: mean absolute rank difference
MAE: mean absolute error (hours)
34. Predicting efficiency of solar panel materials
Task: predict the Power Conversion Efficiency (PCE, 0-12%) of organic solar panel materials

Dataset: Harvard Clean Energy Project
Data points: 2.3 million
Type: molecule
Atom types: 6
Avg # nodes: 28
Avg # edges: 33
36. Improved prediction with small model
Structure2vec gets ~4% relative error with a 10,000x smaller model!

Method         | Test MAE | Test RMSE | # parameters
Mean predictor | 1.986    | 2.406     | 1
WL level-3     | 0.143    | 0.204     | 1.6M
WL level-6     | 0.096    | 0.137     | 1,378M
structure2vec  | 0.085    | 0.117     | 0.1M

10% of the data held out for testing
37. Take Home Message:
Deep fully connected layers not the key
Exploit structure (CNN, Coevolution, Structure2vec)
Train end-to-end
Editor's Notes
Explain why we focus on the performance rather than interpreting the results
The task: classification (maybe one slide)
Have one slide for the neural networks.
The actual classification number
Not improving, finish it.
Make the meaning of convergence clearer: for a given number of samples, lower error; for the same error, fewer samples.
Emphasize what it means to be scalable (compare to alternative methods).
Take features from the last pooling layer of LeNet5 [LeCun'12]
Put H(x) on the same line. Too busy: remove the top, use a smaller figure, show fewer g's.
Need theory cited. Lower bound.
Here we tried a large dataset, where the task is to predict the power conversion efficiency (PCE) of molecules. Accurate prediction is essential for screening new forms of energy and materials. The dataset we used consists of 2.3 million samples from the Harvard Clean Energy Project, and the figure here shows that the PCE ranges from 0 to 11.
Now is the time to put them together. We start with the zero embeddings, and then perform one step of the fixed point equation update. For example, to update mu_2, we use its neighborhood embeddings and input features. Similarly, we can get updates for all other posterior marginal embeddings. As in traditional graphical model inference, we need to iterate the fixed point update several times. Intuitively, this allows each embedding to capture more and more neighborhood information. In the last step, we merge those marginal embeddings to get a vector representation of the entire structured data. This model can be trained in an end-to-end fashion. Also, the parameters in the embedding iteration layers are shared, which makes it similar to a recurrent neural network. We can simply extend it by using an LSTM to formulate the fixed point equation.
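Below is a minimal sketch of the iteration just described, assuming a simple ReLU form of the fixed point update and illustrative layer sizes (the published structure2vec update may differ in detail): embeddings start at zero, the shared-parameter update is applied a few times, and the node embeddings are then merged and passed to a readout.

```python
import torch
import torch.nn as nn

class Structure2VecSketch(nn.Module):
    def __init__(self, in_dim, dim=64, n_iters=4):
        super().__init__()
        self.n_iters = n_iters
        self.w_node = nn.Linear(in_dim, dim)    # lift input (e.g. atom) features
        self.w_neigh = nn.Linear(dim, dim)      # shared across iterations, RNN-like
        self.readout = nn.Linear(dim, 1)        # e.g. predict PCE as a scalar

    def forward(self, x, adj):
        """x: (n_nodes, in_dim) node features; adj: (n_nodes, n_nodes) adjacency matrix."""
        mu = torch.zeros(x.size(0), self.w_neigh.out_features)   # zero-initialized embeddings
        for _ in range(self.n_iters):                             # fixed point updates
            mu = torch.relu(self.w_node(x) + self.w_neigh(adj @ mu))
        graph_vec = mu.sum(dim=0)                                  # merge node embeddings
        return self.readout(graph_vec)                             # end-to-end trainable
```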
Here are the results we reported. We compared with the Weisfeiler-Lehman kernel of different degrees. Since the kernel matrix cannot be computed at this scale, we manually created a high-dimensional explicit feature map for it. Due to its high dimensionality, we can work with at most degree 6.
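For illustration, here is a sketch of such an explicit Weisfeiler-Lehman feature map: relabel each node with a hash of its own label and its sorted neighbor labels, and count label occurrences at every level up to the chosen degree. Hashing into a fixed-width index space is an assumption made here to keep the feature dimension bounded, not necessarily how it was done in the experiments.

```python
from collections import Counter

def wl_explicit_features(labels, adj_list, degree=3, width=2**20):
    """labels: node labels (e.g. atom types); adj_list: neighbor index lists per node."""
    counts = Counter()
    for _ in range(degree + 1):
        for lab in labels:
            counts[hash(lab) % width] += 1      # count labels at the current WL level
        # WL relabeling: new label = hash of (own label, sorted neighbor labels)
        labels = [hash((lab, tuple(sorted(labels[u] for u in nbrs))))
                  for lab, nbrs in zip(labels, adj_list)]
    return counts                                # sparse explicit feature map

# Tiny example graph: a path of three atoms C-O-C.
feats = wl_explicit_features(["C", "O", "C"], [[1], [0, 2], [1]], degree=2)
```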
We can see that we get about 4% relative error on this prediction task. Also, to get a comparable result, the Weisfeiler-Lehman kernel requires 1.3 billion parameters. We get better results with only 0.1M parameters, a model 10,000 times smaller than the alternatives.