Le Song, Assistant Professor, College of Computing, Georgia Institute of Technology at MLconf ATL 2016

Understanding Deep Learning for Big Data: The complexity and scale of big data impose tremendous challenges for analysis, yet big data also offer great opportunities. Some nonlinear phenomena, features, or relations that are unclear or cannot be inferred reliably from small and medium-sized data become clear and can be learned robustly from big data. Typically, the form of the nonlinearity is unknown to us and needs to be learned from data as well. Being able to harness these nonlinear structures in big data could allow us to tackle problems that were previously impossible, or to obtain results far better than the previous state of the art.

Nowadays, deep neural networks are the methods of choice for large-scale nonlinear learning problems. What makes deep neural networks work? Is there any general principle for tackling high-dimensional nonlinear problems that we can learn from deep neural networks? Can we design competitive or better alternatives based on such knowledge? To make progress on these questions, my machine learning group performed both theoretical and experimental analyses of existing and new deep learning architectures, investigating three crucial aspects: the usefulness of the fully connected layers, the advantage of the feature learning process, and the importance of compositional structures. Our results point to some promising directions for future research and provide guidelines for building new deep learning models.


  1. 1. Understanding Deep Learning for Big Data. Le Song, http://www.cc.gatech.edu/~lsong/, College of Computing, Georgia Institute of Technology.
  2. 2. AlexNet: a deep convolutional neural network. Input: image x (224×224×3); output: label y (cat/bike/...?). Rectified linear unit: h(u) = max{0, u}. Convolutional layers (feature maps 55×55×96 → 27×27×256 → 13×13×384 → 13×13×384 → 13×13×256, ~3.7 million parameters) feed fully connected layers (4096 → 4096 → 1000, ~58.6 million parameters), giving Pr(y|x) ∝ exp(W₈ h(W₇ h(W₆ h(W₅ h(W₄ h(W₃ h(W₂ h(W₁ x)))))))).
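
To make the composition on slide 2 concrete, here is a minimal NumPy sketch of an AlexNet-style forward pass; the layer sizes, weights, and the use of plain matrix products in place of convolutions are simplifying placeholders, not the actual AlexNet configuration.

```python
import numpy as np

def relu(u):
    # Rectified linear unit: h(u) = max{0, u}
    return np.maximum(0.0, u)

def forward(x, weights):
    # Nested composition h(W1 x), h(W2 h(W1 x)), ..., as on the slide;
    # convolutions are replaced by plain matrix products for brevity.
    a = x
    for W in weights[:-1]:
        a = relu(W @ a)
    logits = weights[-1] @ a            # W8 h(...)
    p = np.exp(logits - logits.max())   # Pr(y|x) ∝ exp(...)
    return p / p.sum()

# Placeholder sizes: a small flattened image shrinking to 1000 class scores.
sizes = [32 * 32 * 3, 256, 256, 128, 128, 128, 512, 512, 1000]
rng = np.random.default_rng(0)
weights = [0.01 * rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(8)]
x = rng.random(sizes[0])
print(forward(x, weights).shape)        # (1000,)
```
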
  3. 3. A benchmark image classification problem (ImageNet): ~1.3 million examples, ~1 thousand classes.
  4. 4. Training is end-to-end. Minimize the negative log-likelihood over m data points {(xᵢ, yᵢ)}, i = 1, …, m: min R(W₁, …, W₈) := −(1/m) Σᵢ log Pr(yᵢ|xᵢ), with Pr(y|x) ∝ exp(W₈ h(W₇ h(… h(W₁ x)))). (Stochastic) gradient descent: Wₖ⁽ᵗ⁺¹⁾ = Wₖ⁽ᵗ⁾ − η ∂R/∂Wₖ for k = 1, …, 8. AlexNet achieves ~40% top-1 error.
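
As a rough illustration of the objective and update rule on slide 4, the sketch below runs stochastic gradient descent on the negative log-likelihood of a single softmax layer, a stand-in for the full eight-layer network; the data, dimensions, and learning rate are placeholders.

```python
import numpy as np

def nll_and_grad(W, x, y):
    # Negative log-likelihood -log Pr(y|x) for one softmax layer, plus its gradient.
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss = -np.log(p[y])
    p[y] -= 1.0                  # d(-log p_y)/d(logits)
    return loss, np.outer(p, x)  # chain rule: d loss / dW

# Stochastic gradient descent: W <- W - eta * dR/dW, one sample at a time.
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((10, 64))
eta = 0.1
for step in range(1000):
    x = rng.standard_normal(64)
    y = int(rng.integers(10))
    loss, grad = nll_and_grad(W, x, y)
    W -= eta * grad
```
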
  5. 5. Traditional image features are not learned end-to-end: divide the image into patches, apply a handcrafted feature extractor (e.g. SIFT), combine the features, then learn a classifier.
  6. 6. Deep learning is not fully understood. For the same AlexNet architecture (ReLU h(u) = max{0, u}, ~3.7 million convolutional parameters, ~58.6 million fully connected parameters, Pr(y|x) ∝ exp(W₈ h(… h(W₁ x)))): are the fully connected layers crucial? Are the convolution layers crucial? Is training end-to-end important?
  7. 7. Experiments 1. Fully connected layers crucial? 2. Convolution layers crucial? 3. Learn parameters end-to-end crucial?
  8. 8. Kernel methods: an alternative nonlinear model, a combination of random basis functions k(w, x): f(x) = Σᵢ₌₁..T αᵢ k(wᵢ, x), e.g. with the Gaussian kernel k(wᵢ, x) = exp(−‖wᵢ − x‖²). [Dai et al. NIPS 14]
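
A minimal sketch of the kernel model on slide 8, with randomly drawn centers wᵢ and weights αᵢ fit by ridge regression; the toy data, number of basis functions, and regularization constant are placeholders, and the scalable doubly stochastic algorithm of Dai et al. is not reproduced here.

```python
import numpy as np

def basis(X, W):
    # k(w_i, x) = exp(-||w_i - x||^2) for every input x and random center w_i.
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))        # toy inputs
y = np.sin(X.sum(axis=1))                # toy regression targets
W = rng.standard_normal((100, 5))        # T = 100 random centers w_i

# f(x) = sum_i alpha_i k(w_i, x); fit alpha by ridge regression.
K = basis(X, W)
alpha = np.linalg.solve(K.T @ K + 1e-3 * np.eye(100), K.T @ y)
print(np.mean((K @ alpha - y) ** 2))     # training mean squared error
```
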
  9. 9. Replace the fully connected layers by kernel methods. Compare: I. jointly trained neural net (AlexNet), Pr(y|x) ∝ exp(W₈ h(W₇ h(… h(W₁ x)))), with all layers learned; II. fixed neural net, where the convolution layers are fixed and only the fully connected layers are learned; III. scalable kernel methods [Dai et al. NIPS 14] learned on top of the fixed convolution layers.
  10. 10. Learn classifiers from a benchmark subset of ~1.3 million examples and ~1 thousand classes.
  11. 11. Kernel machine learns faster. ImageNet: 1.3M original images, 1000 classes; random cropping and mirroring of images in a streaming fashion; about 1 week of training on a GPU. [Figure: test top-1 error (%) vs. number of training samples (10⁵ to 10⁸) for the jointly-trained neural net, the fixed neural net, and doubly SGD.] Final top-1 errors: jointly-trained neural net 42.6%, doubly SGD 44.5%, fixed neural net 47.8%; random guessing gives 99.9% error.
  12. 12. Similar results with MNIST8M: classification of handwritten digits, 8M images, 10 classes, LeNet5 architecture.
  13. 13. Similar results with CIFAR10: classification of internet images, 60K images, 10 classes.
  14. 14. Experiments 1. Fully connected layers crucial? No 2. Convolution layers crucial? 3. Learn parameters end-to-end crucial?
  15. 15. Kernel methods directly on the inputs? [Figure: test error with the fixed convolution layers vs. without any convolution, on MNIST (2 convolution layers), CIFAR10 (2 convolution layers), and ImageNet (5 convolution layers).]
  16. 16. Kernel methods + random convolutions? [Figure: test error with fixed convolution, random convolution, and no convolution, on MNIST (2 convolution layers) and CIFAR10 (2 convolution layers).] Random convolutions can work, but only with many more filters: # random conv ≫ # fixed conv.
  17. 17. Structured composition is useful: not just fully connected layers and plain composition f(x) = hₙ(hₙ₋₁(… h₁(x))), but structured composition of nonlinear functions f(x) = hₙ(hₙ₋₁(… (h₁(x_patch₁), h₁(x_patch₂), …, h₁(x_patchₘ)))), where the same function h₁ is applied to every patch.
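
The sketch below contrasts the two compositions on slide 17: a plain chain of nonlinear maps versus the same first-layer map h₁ applied to every patch of the input; the patch size, dimensions, and weights are placeholders.

```python
import numpy as np

def h(W, u):
    # A shared nonlinear map: h(u) = max{0, W u}
    return np.maximum(0.0, W @ u)

def plain_composition(x, Ws):
    # f(x) = h_n(h_{n-1}(... h_1(x)))
    a = x
    for W in Ws:
        a = h(W, a)
    return a

def structured_composition(x, W1, Ws, patch=8):
    # The same h_1 is applied to every patch, then composed as before.
    patches = x.reshape(-1, patch)
    a = np.concatenate([h(W1, p) for p in patches])
    for W in Ws:
        a = h(W, a)
    return a

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
W1 = rng.standard_normal((4, 8))                   # shared per-patch map
Ws = [rng.standard_normal((16, 32)), rng.standard_normal((8, 16))]
print(structured_composition(x, W1, Ws).shape)     # (8,)
```
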
  18. 18. Experiments 1. Fully connected layers crucial? No 2. Convolution layers crucial? Yes 3. Learn parameters end-to-end crucial?
  19. 19. Lots of random features used. AlexNet: 58M parameters, 42.6% error (256×13×13 convolution features → 4096 → 4096 → 1000). Scalable kernel method: 131M parameters, 44.5% error (fixed 256×13×13 convolution features → 131K random features → 1000).
  20. 20. Are 131M parameters needed? AlexNet: 58M parameters, 42.6% error. Scalable kernel method with only 32K random features: 32M parameters, 50.0% error (fixed 256×13×13 convolution features → 32K random features → 1000).
  21. 21. Basis function adaptation is crucial. Integrated squared approximation error with T basis functions [Barron '93]: error with adapted basis functions ≤ 1/T, error with fixed basis functions ≥ 1/T^(2/d), where d is the input dimension. Fixed: f(x) = Σᵢ αᵢ k(xᵢ, x) with many fixed centers xᵢ; adapted: f(x) = Σᵢ αᵢ k_θᵢ(xᵢ, x) with learned basis parameters θᵢ, needing far fewer terms.
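
Stated slightly more explicitly (a paraphrase of the rates cited on slide 21; the constants and the precise function class are as in Barron '93 and not reproduced here):

```latex
% Integrated squared approximation error with T basis functions (paraphrased):
% adapted (learned) basis functions k_{\theta_i}
\Big\| f - \sum_{i=1}^{T} \alpha_i\, k_{\theta_i}(x_i, \cdot) \Big\|_2^2 \;\lesssim\; \frac{1}{T},
\qquad
% fixed basis functions, input dimension d
\Big\| f - \sum_{i=1}^{T} \alpha_i\, k(x_i, \cdot) \Big\|_2^2 \;\gtrsim\; \frac{1}{T^{2/d}} .
```
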
  22. 22. Learning the random features helps a lot. AlexNet: 58M parameters, 42.6% error. Scalable kernel method with basis adaptation (learned features) on top of fixed convolutions: 32M parameters, 43.7% error (256×13×13 convolution features → 32K adapted features → 1000).
  23. 23. Learning the convolutions jointly helps even more. AlexNet: 58M parameters, 42.6% error. Scalable kernel method with basis adaptation and jointly learned convolutions: 32M parameters, 41.9% error (256×13×13 convolution features → 32K adapted features → 1000).
  24. 24. Lesson learned: exploit structure and train end-to-end. Next: deep learning over (time-varying) graphs.
  25. 25. Co-evolutionary features: users (David, Alice, Christine, Jacob) and items interact over time; each user has an embedding f_u(t) and each item an embedding f_i(t), and these embeddings evolve as user-item interactions occur.
  26. 26. Co-evolutionary features (animation frame: the same illustration as interactions arrive over time).
  27. 27. Co-evolutionary features (animation frame).
  28. 28. Co-evolutionary features (animation frame).
  29. 29. Co-evolutionary features (animation frame).
  30. 30. Co-evolutionary features (animation frame).
  31. 31. Co-evolutionary embedding [Dai et al. RecSys 16]. For each interaction event (u_n, i_n, t_n, q_n) (user, item, time, interaction features): initialize the user embedding f_{u_n}(t₀) = h(W₀ · f_{u_n}(0)) and the item embedding f_{i_n}(t₀) = h(V₀ · f_{i_n}(0)) from raw user and item profile features. Update user → item (U2I): f_{i_n}(t_n) = h(V₁ · f_{i_n}(t_n⁻) + V₂ · f_{u_n}(t_n⁻) + V₃ · q_n + V₄ · (t_n − t_{n−1})), combining evolution, co-evolution, context, and drift terms. Update item → user (I2U): f_{u_n}(t_n) = h(W₁ · f_{u_n}(t_n⁻) + W₂ · f_{i_n}(t_n⁻) + W₃ · q_n + W₄ · (t_n − t_{n−1})).
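
A minimal sketch of the two update rules on slide 31 for a single interaction event; the embedding dimension, the interaction-feature dimension, and all weights below are random placeholders.

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def coevolve_update(f_u, f_i, q, dt, W, V):
    # Item update (U2I): evolution + co-evolution + context + drift terms.
    f_i_new = relu(V[1] @ f_i + V[2] @ f_u + V[3] @ q + V[4] * dt)
    # User update (I2U): the same structure with the roles swapped.
    f_u_new = relu(W[1] @ f_u + W[2] @ f_i + W[3] @ q + W[4] * dt)
    return f_u_new, f_i_new

rng = np.random.default_rng(0)
d, dq = 16, 4                                              # embedding / feature dims (placeholders)
W = {1: rng.standard_normal((d, d)), 2: rng.standard_normal((d, d)),
     3: rng.standard_normal((d, dq)), 4: rng.standard_normal(d)}
V = {1: rng.standard_normal((d, d)), 2: rng.standard_normal((d, d)),
     3: rng.standard_normal((d, dq)), 4: rng.standard_normal(d)}

f_u, f_i = rng.standard_normal(d), rng.standard_normal(d)  # embeddings just before t_n
q, dt = rng.standard_normal(dq), 0.5                       # interaction features, t_n - t_{n-1}
f_u, f_i = coevolve_update(f_u, f_i, q, dt, W, V)
```
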
  32. 32. Deep learning with a time-varying computation graph: the computation graph of the RNN is determined by (1) the bipartite user-item interaction graph and (2) the temporal ordering of the events (t₀, t₁, t₂, t₃, …); mini-batches are formed along this ordering.
  33. 33. Much improved prediction on the Reddit dataset: 1,000 users, 1,403 groups, ~10K interactions. Tasks: next-item prediction (MAR: mean absolute rank difference) and return-time prediction (MAE: mean absolute error, in hours).
  34. 34. Predicting the efficiency of organic solar panel materials: predict the power conversion efficiency (PCE, 0–12%) of a molecule. Dataset: Harvard Clean Energy Project; ~2.3 million data points; data type: molecule; 6 atom types; ~28 nodes and ~33 edges per molecule on average.
  35. 35. Structure2Vec [Dai et al. ICML 16]: compute an embedding μᵥ for every node of the input graph χ (variables X₁, …, X₆), starting from μᵥ⁽⁰⁾ and updating it over iterations 1, …, T using messages from neighboring nodes; then aggregate the node embeddings, μ = Σᵥ μᵥ⁽ᵀ⁾, and feed the aggregate into a classification/regression layer with parameters V to predict the label y.
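
A rough sketch of this style of iterative graph embedding; the specific update below (a ReLU of node features plus summed neighbor embeddings) is a simplified stand-in for the exact structure2vec update, and the graph, features, and weights are placeholders.

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def graph_embed(adj, node_feats, W1, W2, T=3):
    # adj: (n, n) adjacency matrix; node_feats: (n, p) raw node features.
    n = adj.shape[0]
    mu = np.zeros((n, W1.shape[0]))              # mu_v^(0) = 0
    for _ in range(T):
        neighbor_sum = adj @ mu                  # messages: sum of neighbor embeddings
        mu = relu(node_feats @ W1.T + neighbor_sum @ W2.T)
    return mu.sum(axis=0)                        # aggregate: sum_v mu_v^(T)

rng = np.random.default_rng(0)
n, p, d = 6, 4, 8                                # nodes, feature dim, embedding dim
adj = np.triu((rng.random((n, n)) < 0.4).astype(float), 1)
adj = adj + adj.T                                # undirected toy graph
feats = rng.standard_normal((n, p))
W1, W2 = rng.standard_normal((d, p)), rng.standard_normal((d, d))

mu = graph_embed(adj, feats, W1, W2)
score = rng.standard_normal(d) @ mu              # linear regressor on the aggregate
print(score)
```
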
  36. 36. Improved prediction with a small model: structure2vec reaches ~4% relative error with a 10,000× smaller model (10% of the data held out for testing). Mean predictor: test MAE 1.986, test RMSE 2.406, 1 parameter. WL level-3: MAE 0.143, RMSE 0.204, 1.6M parameters. WL level-6: MAE 0.096, RMSE 0.137, 1378M parameters. structure2vec: MAE 0.085, RMSE 0.117, 0.1M parameters.
  37. 37. Take-home message: deep fully connected layers are not the key; exploit structure (CNN, coevolution, structure2vec) and train end-to-end.
