Le Song, Assistant Professor, College of Computing, Georgia Institute of Technology at MLconf ATL 2016

Understanding Deep Learning for Big Data: The complexity and scale of big data pose tremendous challenges for analysis. Yet big data also offer great opportunities. Some nonlinear phenomena, features, or relations that are unclear or cannot be inferred reliably from small and medium data now become clear and can be learned robustly from big data. Typically, the form of the nonlinearity is unknown to us and needs to be learned from the data as well. Being able to harness the nonlinear structure in big data could allow us to tackle problems that were previously impossible, or to obtain results far better than the previous state of the art.

Nowadays, deep neural networks are the method of choice for large-scale nonlinear learning problems. What makes deep neural networks work? Is there a general principle for tackling high-dimensional nonlinear problems that we can learn from deep neural networks? Can we design competitive or better alternatives based on such knowledge? To make progress on these questions, my machine learning group performed both theoretical and experimental analyses of existing and new deep learning architectures, and investigated three crucial aspects: the usefulness of the fully connected layers, the advantage of the feature learning process, and the importance of compositional structure. Our results point to some promising directions for future research and provide guidelines for building new deep learning models.


  1. Understanding Deep Learning for Big Data. Le Song, http://www.cc.gatech.edu/~lsong/, College of Computing, Georgia Institute of Technology.
  2. AlexNet: a deep convolutional neural network. Pr(y|x) ∝ exp(W8 h(W7 h(W6 h(W5 h(W4 h(W3 h(W2 h(W1 x)))))))), mapping an image x to a label y (cat / bike / ...), with rectified linear unit h(u) = max{0, u}. [Architecture diagram: 224×224×3 input; convolution layers with 11×11, 5×5, and 3×3 filters producing 96, 256, 384, 384, and 256 feature maps (the last of size 13×13); two fully connected layers of 4096 units; a 1000-way output. The convolution layers hold about 3.7 million parameters, the fully connected layers about 58.6 million.]
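To make the composed form Pr(y|x) ∝ exp(W8 h(W7 h(... h(W1 x)))) concrete, here is a minimal NumPy sketch of the forward pass: eight weight matrices alternated with the ReLU h, followed by a softmax. The layer sizes are made up for illustration, and the sketch treats every layer as fully connected, ignoring that AlexNet's early layers are convolutions.

```python
import numpy as np

def relu(u):
    # Rectified linear unit h(u) = max{0, u}
    return np.maximum(0.0, u)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# Hypothetical layer widths standing in for W1, ..., W8 (not AlexNet's real shapes).
sizes = [3072, 512, 384, 384, 256, 256, 128, 64, 1000]
Ws = [rng.normal(scale=0.05, size=(n_out, n_in))
      for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def predict_proba(x, Ws):
    # Pr(y|x) ∝ exp(W8 h(W7 h(... h(W1 x)))): alternate linear maps and ReLU,
    # with no nonlinearity after the final weight matrix.
    u = x
    for W in Ws[:-1]:
        u = relu(W @ u)
    return softmax(Ws[-1] @ u)

p = predict_proba(rng.normal(size=sizes[0]), Ws)
print(p.shape, round(float(p.sum()), 3))   # (1000,) 1.0
```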
  3. A benchmark image classification problem (ImageNet): ~1.3 million examples, ~1 thousand classes.
  4. Training is end-to-end. Minimize the negative log-likelihood over m data points {(x_i, y_i)}_{i=1}^m: min_{f∈F} R(W1, ..., W8) := -(1/m) Σ_{i=1}^m log Pr(y_i|x_i), where Pr(y|x) ∝ exp(W8 h(W7 h(... h(W1 x)))). Optimize with (stochastic) gradient descent: W_k^{t+1} = W_k^t - η ∂R/∂W_k for k = 1, ..., 8. AlexNet achieves ~40% top-1 error.
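The objective and update rule on this slide can be sketched in a few lines. The toy below fits a single softmax layer by full-batch gradient descent on R = -(1/m) Σ log Pr(y_i|x_i); in the full model, backpropagation supplies ∂R/∂W_k for all eight weight matrices and stochastic gradient descent uses mini-batches. The data, sizes, and learning rate here are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)

m, d, K = 512, 20, 10                      # toy data standing in for (x_i, y_i)
X = rng.normal(size=(m, d))
y = rng.integers(0, K, size=m)

W = np.zeros((K, d))                       # a single "W" instead of W1, ..., W8
eta = 0.5                                  # learning rate η

def neg_log_likelihood(W):
    # R(W) = -(1/m) sum_i log Pr(y_i | x_i), with Pr(y|x) ∝ exp(W x)
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(m), y].mean()

for step in range(200):
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)      # Pr(k | x_i) for every class k
    P[np.arange(m), y] -= 1.0              # ∂R/∂logits for the cross-entropy
    W -= eta * (P.T @ X) / m               # W^{t+1} = W^t - η ∂R/∂W

print(round(neg_log_likelihood(W), 3))     # should end up below log(10) ≈ 2.303
```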
  5. Traditional image features are not learned end-to-end: divide the image into patches, apply a handcrafted feature extractor (e.g., SIFT), combine the features, then learn a classifier.
  6. Deep learning is not fully understood. [Same AlexNet diagram: about 3.7 million convolution parameters versus 58.6 million fully connected parameters.] Are the fully connected layers crucial? Are the convolution layers crucial? Is training end-to-end important? Pr(y|x) ∝ exp(W8 h(W7 h(... h(W1 x)))), with h(u) = max{0, u}.
  7. Experiments: 1. Fully connected layers crucial? 2. Convolution layers crucial? 3. Learn parameters end-to-end crucial?
  8. Kernel methods: an alternative nonlinear model. A combination of random basis functions k(w, x): f(x) = Σ_{i=1}^T α_i k(w_i, x). For the Gaussian kernel k(w_i, x) = exp(-||w_i - x||²), the picture shows f(x) = Σ_{i=1}^7 α_i exp(-||w_i - x||²), built from 7 random centers w_1, ..., w_7 with weights α_1, ..., α_7. [Dai et al., NIPS 2014]
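As a deliberately tiny illustration of f(x) = Σ_i α_i k(w_i, x), the sketch below draws T random Gaussian centers for a 1-D regression problem and fits only the weights α by ridge regression. This is just the fixed-random-basis idea in miniature; the scalable method cited on the slide (doubly stochastic gradients, Dai et al., NIPS 2014) streams random features and data instead, and the problem, sizes, and regularizer here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression target, plus T random centers w_1, ..., w_T.
n, T = 200, 50
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)
W = rng.uniform(-3, 3, size=(T, 1))

def gaussian_k(W, X):
    # k(w_i, x_j) = exp(-||w_i - x_j||^2) for every pair (i, j).
    sq = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq)                      # shape (n, T)

Phi = gaussian_k(W, X)
# Fit only the alphas (the basis stays fixed): ridge regression in closed form.
alpha = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(T), Phi.T @ y)

f = Phi @ alpha                             # f(x_j) = sum_i alpha_i k(w_i, x_j)
print(round(float(np.mean((f - y) ** 2)), 4))
```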
  9. Replace the fully connected layers with kernel methods. Three settings: I. jointly trained neural net (AlexNet), Pr(y|x) ∝ exp(W8 h7(W7 h6(... h1(W1 x)))), with all layers learned; II. fixed neural net, with the convolution layers fixed and only the top layers learned; III. scalable kernel method [Dai et al., NIPS 2014], with the convolution layers fixed and a kernel classifier learned on top.
  10. Learn classifiers from a benchmark subset of ~1.3 million examples, ~1 thousand classes.
  11. The kernel machine learns faster. ImageNet: 1.3M original images and 1000 classes, with random cropping and mirroring applied to the images in a streaming fashion. [Plot: test top-1 error (%) versus number of training samples (10^5 to 10^8) for the jointly trained neural net, the fixed neural net, and doubly SGD; training takes about 1 week on a GPU; final top-1 errors are 42.6% (jointly trained), 47.8% (fixed neural net), and 44.5% (doubly SGD); random guessing gives 99.9% error.]
  12. Similar results with MNIST8M: classification of handwritten digits, 8M images, 10 classes, using LeNet5.
  13. Similar results with CIFAR10: classification of internet images, 60K images, 10 classes.
  14. Experiments: 1. Fully connected layers crucial? No. 2. Convolution layers crucial? 3. Learn parameters end-to-end crucial?
  15. Kernel methods directly on the inputs? [Bar charts of test error with fixed convolutions versus without convolutions on MNIST (2 convolution layers), CIFAR10 (2 convolution layers), and ImageNet (5 convolution layers).]
  16. Kernel methods + random convolutions? [Bar charts of test error without convolutions, with random convolutions, and with fixed convolutions on MNIST (2 convolution layers) and CIFAR10 (2 convolution layers); matching the fixed convolutions requires far more random convolutions (# random conv ≫ # fixed conv).]
  17. Structured composition is useful. Not just fully connected layers and plain composition f(x) = h_n(h_{n-1}(... h_1(x))), but structured composition of nonlinear functions: f(x) = h_n(h_{n-1}(... h_1(x_patch_1), h_1(x_patch_2), ..., h_1(x_patch_m))), where the same function h_1 is applied to every patch.
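A minimal sketch of the distinction drawn on this slide: plain composition applies each h directly to the whole input, while structured (convolution-like) composition applies the same h_1 to every patch and combines the results before the next layer. All shapes and weights below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda u: np.maximum(0.0, u)

x = rng.normal(size=16)                  # toy "image", flattened
patches = x.reshape(4, 4)                # 4 non-overlapping patches of size 4

# Structured composition: the SAME h1 (weights W1) is applied to every patch.
W1 = rng.normal(size=(3, 4))             # shared first-layer map h1
W2 = rng.normal(size=(2, 4 * 3))         # second layer on the combined patch features
h1_per_patch = relu(patches @ W1.T)      # h1(x_patch1), ..., h1(x_patch4)
f_structured = relu(W2 @ h1_per_patch.reshape(-1))

# Plain composition: one unstructured map over the whole input.
W1_plain = rng.normal(size=(12, 16))
f_plain = relu(W2 @ relu(W1_plain @ x))

print(f_structured.shape, f_plain.shape)  # both (2,)
```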
  18. Experiments: 1. Fully connected layers crucial? No. 2. Convolution layers crucial? Yes. 3. Learn parameters end-to-end crucial?
  19. Lots of random features are used. [Diagram: AlexNet, 58M parameters (256×13×13 → 4096 → 4096 → 1000), 42.6% error, versus the scalable kernel method with fixed convolutions, 131M parameters (256×13×13 → 131K random features → 1000), 44.5% error.]
  20. Are 131M parameters needed? [Diagram: AlexNet, 58M parameters, 42.6% error, versus the scalable kernel method with fixed convolutions limited to 32M parameters (256×13×13 → 32K random features → 1000), 50.0% error.]
  21. Basis function adaptation is crucial. Integrated squared approximation error with T basis functions [Barron '93]: the error with adapted basis functions is ≤ 1/T, while the error with fixed basis functions is ≥ 1/T^(2/d) in input dimension d, which decreases extremely slowly in T when d is large. [Illustration: a fixed-basis fit f(x) = Σ_{i=1}^7 α_i k(x_i, x) needing 7 centers, versus an adapted-basis fit f(x) = Σ_{i=1}^2 α_i k_{θ_i}(x_i, x) matching the same function with only 2 adapted basis functions.]
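To see why the gap between the two rates quoted from [Barron '93] matters in high dimension, the quick numeric comparison below evaluates 1/T and 1/T^(2/d) for a few values of T and d (ignoring the constants the slide omits): for large d, the fixed-basis rate barely improves as T grows.

```python
# Adapted basis functions: error on the order of 1/T.
# Fixed basis functions:   error on the order of 1/T^(2/d).
for d in (2, 10, 100, 1000):
    for T in (10**2, 10**4, 10**6):
        print(f"d={d:4d}  T={T:7d}  1/T={1.0 / T:.1e}  1/T^(2/d)={T ** (-2.0 / d):.3f}")
```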
  22. Learning the random features helps a lot. [Diagram: AlexNet, 58M parameters, 42.6% error, versus the scalable kernel method with 32M parameters, fixed convolutions, and learned basis functions (basis adaptation) on 32K features, 43.7% error.]
  23. Learning the convolution layers together helps even more. [Diagram: AlexNet, 58M parameters, 42.6% error, versus the scalable kernel method with 32M parameters, basis adaptation, and the convolution layers learned jointly, 41.9% error.]
  24. Lesson learned: exploit structure and train end-to-end. Next: deep learning over (time-varying) graphs.
  25. Co-evolutionary features (slides 25-30, an animated sequence): users David, Alice, Christine, and Jacob interact with items; each user has an embedding f_u(t) and each item an embedding f_i(t), and the user-item interactions evolve over time.
  31. Co-evolutionary embedding [Dai et al., RecSys 2016]. Initialize the item embedding f_{i_n}(t_0) = h(V_0 · f_{i_n}^0) from the item's raw profile features and the user embedding f_{u_n}(t_0) = h(W_0 · f_{u_n}^0) from the user's raw profile features. For each interaction event (u_n, i_n, t_n, q_n): Update U2I (item side): f_{i_n}(t_n) = h(V_1 · f_{i_n}(t_n-) + V_2 · f_{u_n}(t_n-) + V_3 · q_n + V_4 · (t_n - t_{n-1})). Update I2U (user side): f_{u_n}(t_n) = h(W_1 · f_{u_n}(t_n-) + W_2 · f_{i_n}(t_n-) + W_3 · q_n + W_4 · (t_n - t_{n-1})). The slide labels these terms drift, context, evolution, and co-evolution.
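Below is a minimal sketch of one interaction event under the update rules on this slide: the item embedding is refreshed from its own past value, the interacting user's past value, the interaction context q_n, and the elapsed time, and the user embedding is refreshed symmetrically. The nonlinearity (tanh), the embedding and context dimensions, and the random parameter matrices are assumptions made for the sketch, not the parameterization used in [Dai et al., RecSys 2016].

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.tanh                               # stand-in for the slide's nonlinearity h

D, C = 8, 4                               # embedding and context dims (made up)
V1, V2, W1, W2 = (rng.normal(scale=0.1, size=(D, D)) for _ in range(4))
V3, W3 = (rng.normal(scale=0.1, size=(D, C)) for _ in range(2))
V4, W4 = (rng.normal(scale=0.1, size=D) for _ in range(2))

def interact(f_u, f_i, q_n, dt):
    """One event (u_n, i_n, t_n, q_n); dt = t_n - t_{n-1}.

    U2I: f_i(t_n) = h(V1 f_i(t_n-) + V2 f_u(t_n-) + V3 q_n + V4 dt)
    I2U: f_u(t_n) = h(W1 f_u(t_n-) + W2 f_i(t_n-) + W3 q_n + W4 dt)
    """
    new_f_i = h(V1 @ f_i + V2 @ f_u + V3 @ q_n + V4 * dt)
    new_f_u = h(W1 @ f_u + W2 @ f_i + W3 @ q_n + W4 * dt)
    return new_f_u, new_f_i

f_u = h(rng.normal(size=D))               # initialized user embedding f_u(t_0)
f_i = h(rng.normal(size=D))               # initialized item embedding f_i(t_0)
f_u, f_i = interact(f_u, f_i, q_n=rng.normal(size=C), dt=2.5)
print(f_u.shape, f_i.shape)               # (8,) (8,)
```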
  32. Deep learning with a time-varying computation graph. [Timeline t_0, t_1, t_2, t_3 grouped into mini-batch 1.] The computation graph of the RNN is determined by (1) the bipartite interaction graph and (2) the temporal ordering of events.
  33. Much improved prediction on the Reddit dataset: next-item prediction and return-time prediction; 1,000 users, 1,403 groups, ~10K interactions. Metrics: MAR (mean absolute rank difference) and MAE (mean absolute error, in hours).
  34. Predicting the efficiency of solar panel materials. Dataset: Harvard Clean Energy Project. Data points: 2.3 million. Type: molecule. Atom types: 6. Average node count: 28. Average edge count: 33. Target: power conversion efficiency (PCE), 0-12%, for organic solar panel materials.
  35. Structure2Vec [Dai et al., ICML 2016]. [Diagram: a graph with nodes X_1, ..., X_6 and node features χ; each node v keeps an embedding μ_v^(0), updated over iterations 1, ..., T to μ_v^(T); the node embeddings are aggregated, Σ_v μ_v^(T), into a graph-level embedding (a function of the parameters W and the features χ) and fed to a classification/regression layer with parameter V to predict the label y.]
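The sketch below captures the iterative idea pictured on this slide: each node keeps an embedding μ_v, repeatedly refreshed from its own features and its neighbors' current embeddings, and after T iterations the node embeddings are summed into a graph-level embedding used for prediction. The specific update rule, nonlinearity, and sizes here are simplified assumptions; the actual parameterization is in [Dai et al., ICML 2016].

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda u: np.maximum(0.0, u)

n, d_node, D, T = 6, 4, 8, 3              # nodes X1..X6; feature/embedding dims made up
chi = rng.normal(size=(n, d_node))        # raw node features χ
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
adj = np.zeros((n, n))
for a, b in edges:
    adj[a, b] = adj[b, a] = 1.0

W1 = rng.normal(scale=0.1, size=(D, d_node))
W2 = rng.normal(scale=0.1, size=(D, D))
V = rng.normal(scale=0.1, size=D)         # readout parameter for regression

mu = np.zeros((n, D))                     # μ_v^(0)
for _ in range(T):
    # μ_v^(t) = relu(W1 x_v + W2 · Σ_{u in N(v)} μ_u^(t-1))   (simplified update)
    mu = relu(chi @ W1.T + (adj @ mu) @ W2.T)

graph_embedding = mu.sum(axis=0)          # aggregate Σ_v μ_v^(T)
prediction = float(V @ graph_embedding)   # e.g., a predicted PCE for the molecule
print(round(prediction, 4))
```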
  36. Improved prediction with a small model: Structure2Vec gets ~4% relative error with a 10,000 times smaller model (10% of the data held out for testing). Mean predictor: test MAE 1.986, test RMSE 2.406, 1 parameter. WL level-3: test MAE 0.143, test RMSE 0.204, 1.6M parameters. WL level-6: test MAE 0.096, test RMSE 0.137, 1,378M parameters. Structure2Vec: test MAE 0.085, test RMSE 0.117, 0.1M parameters.
  37. Take-home message: deep fully connected layers are not the key; exploit structure (CNN, co-evolution, Structure2Vec); train end-to-end.
