
Productionizing Deep Learning From the Ground Up

ODSC Boston Presentation by Adam Gibson



  1. PRODUCTIONIZING DEEP LEARNING FROM THE GROUND UP Adam Gibson. Open Data Science Conference, Boston 2015. @opendatasci
  2. Open DataSciCon, May 2015. Productionizing Deep Learning From the Ground Up
  3. Overview ● What is Deep Learning? ● Why is it hard? ● Problems to think about ● Conclusions
  4. What is Deep Learning? Pattern recognition on unlabeled & unstructured data.
  5. What is Deep Learning? ● Deep Neural Networks >= 3 Layers ● For media/unstructured data ● Automatic Feature Engineering ● Benefits From Complex Architectures ● Computationally Intensive ● Accelerates With Special Hardware
  6. Get why it’s hard yet?
  7. Deep Networks >= 3 Layers ● Backpropagation-era, old-school ANNs = 3 layers
  8. Deep Networks ● Neural networks themselves as hidden layers ● Different types of layers can be interchanged/stacked ● Multiple layer types, each with its own hyperparameters and loss functions
  9. What Are Common Layer Types?
  10. Feedforward 1. MLPs 2. AutoEncoders 3. RBMs
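A minimal sketch (plain NumPy, not DL4J or any framework from the talk) of what the simplest feedforward member, an MLP, computes on a forward pass. The 784-256-64-10 layer sizes and sigmoid activations are illustrative assumptions, not anything prescribed by the slides.

```python
# Sketch: forward pass of a small MLP in NumPy.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 784))           # a mini-batch of 32 flattened inputs

# Hypothetical layer sizes: 784 -> 256 -> 64 -> 10
W1, b1 = rng.normal(scale=0.01, size=(784, 256)), np.zeros(256)
W2, b2 = rng.normal(scale=0.01, size=(256, 64)),  np.zeros(64)
W3, b3 = rng.normal(scale=0.01, size=(64, 10)),   np.zeros(10)

h1 = sigmoid(x @ W1 + b1)                # hidden layer 1
h2 = sigmoid(h1 @ W2 + b2)               # hidden layer 2
logits = h2 @ W3 + b3                    # output layer (pre-softmax)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
```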
  11. Recurrent 1. MultiModal 2. LSTMs 3. Stateful
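For the recurrent family, a vanilla (non-LSTM, non-multimodal) recurrent step is enough to show the "stateful" point: the hidden state is carried forward between timesteps. The sizes below are arbitrary assumptions.

```python
# Sketch: one step of a plain recurrent layer, with state carried across steps.
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 16, 32           # illustrative sizes
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def step(x_t, h_prev):
    """One recurrent step: the new state depends on the input and the previous state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

h = np.zeros(hidden_dim)                 # initial state
for x_t in rng.normal(size=(10, input_dim)):   # a sequence of 10 timesteps
    h = step(x_t, h)                     # state persists across timesteps
```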
  12. Convolutional LeNet: mixes convolutional & subsampling layers
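A quick sketch of the LeNet-style shape arithmetic: alternating convolution and subsampling (pooling) layers shrink the input before the fully connected layers take over. The 28x28 input and 5x5/2x2 kernel sizes are classic LeNet-ish choices, assumed here for illustration.

```python
# Sketch: how conv + subsampling layers shrink a 28x28 input.
def conv_out(size, kernel, stride=1, pad=0):
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride=None):
    stride = stride or kernel
    return (size - kernel) // stride + 1

s = 28                      # e.g. a 28x28 grayscale image (assumption)
s = conv_out(s, kernel=5)   # conv 5x5   -> 24x24
s = pool_out(s, kernel=2)   # subsample  -> 12x12
s = conv_out(s, kernel=5)   # conv 5x5   -> 8x8
s = pool_out(s, kernel=2)   # subsample  -> 4x4
print(s * s, "spatial units per feature map feed the fully connected layers")
```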
  13. Recursive/Tree Uses a parser to form a tree structure
  14. Other kinds ● Memory Networks ● Deep Reinforcement Learning ● Adversarial Architectures ● New recursive ConvNet variant to come in 2016? ● Over 9,000 layers? (22 is already pretty common)
  15. Automatic Feature Engineering
  16. Automatic Feature Engineering (t-SNE) Visualizations are crucial: use t-SNE to render different kinds of data: http://lvdmaaten.github.io/tsne/
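A hedged sketch of that workflow: run learned features through t-SNE and plot them colored by label. It uses scikit-learn's TSNE as a stand-in for the implementation linked above, and random arrays as stand-ins for real network activations.

```python
# Sketch: projecting learned features to 2-D with t-SNE for inspection.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

features = np.random.rand(500, 128)      # stand-in for network activations
labels = np.random.randint(0, 10, 500)   # stand-in for class labels

coords = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5)
plt.title("t-SNE of learned features")
plt.show()
```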
  17. deeplearning4j.org presentation @ Google, Nov. 17, 2014: “TWO PIZZAS SITTING ON A STOVETOP”
  18. Benefits from Complex Architectures Google’s result combined: ● LSTMs (learning captions) ● Word embeddings ● Convolutional features from images (aligned to be the same size as the embeddings)
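A rough sketch of the "aligned to be the same size as the embeddings" idea: a learned linear map projects the ConvNet feature vector into the word-embedding space, so image features and word vectors can be fed to the same LSTM. The dimensions and names below are assumptions for illustration, not the actual Google model.

```python
# Sketch: projecting ConvNet features into the word-embedding dimension.
import numpy as np

rng = np.random.default_rng(0)
conv_dim, embed_dim = 4096, 512          # illustrative sizes
W_proj = rng.normal(scale=0.01, size=(conv_dim, embed_dim))  # learned during training

image_features = rng.normal(size=conv_dim)          # output of the ConvNet
image_token = image_features @ W_proj               # now embed_dim-sized

word_embeddings = rng.normal(size=(3, embed_dim))   # stand-ins for word vectors
sequence = np.vstack([image_token, word_embeddings])  # one sequence for the LSTM captioner
```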
  19. Computationally Intensive ● One iteration of ImageNet (a 1k-label dataset with over 1MM examples) takes 7 hours on GPUs ● Project Adam ● Google Brain
  20. Special Hardware Required Unlike most solutions, multiple GPUs are used today (not common in Java-based stacks!)
  21. Software Engineering Concerns ● Pipelines to deal with messy data, not canned problems (real life is not Kaggle, people) ● Scale/maintenance (clusters of GPUs aren’t handled well today) ● Different kinds of parallelism (model and data)
  22. Model vs. Data Parallelism ● Model parallelism shards the model across servers (HPC style) ● Data parallelism splits mini-batches across replicas
  23. Vectorizing Unstructured Data ● Data is stored in different databases ● Different kinds of (raw) files ● Deep learning works well on mixed signals
  24. Parallelism ● Model (HPC) ● Data (mini-batch parameter averaging)
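A toy sketch of the data-parallel, mini-batch parameter-averaging scheme: each worker makes local updates on its own shard of data, then parameters are averaged and redistributed. The "training" step is faked; only the synchronization pattern is the point.

```python
# Sketch: data parallelism via parameter averaging.
import numpy as np

def local_update(params, minibatch, lr=0.01):
    """Stand-in for a few SGD steps on one worker (the gradient is faked here)."""
    fake_grad = minibatch.mean(axis=0)
    return params - lr * fake_grad

rng = np.random.default_rng(0)
params = rng.normal(size=100)            # shared model parameters
workers = 4

for sync_round in range(10):
    replicas = []
    for w in range(workers):             # in production these run in parallel
        shard = rng.normal(size=(32, 100))           # that worker's mini-batch
        replicas.append(local_update(params.copy(), shard))
    params = np.mean(replicas, axis=0)   # average and broadcast back out
```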
  25. Production Stacks Today ● Hadoop/Spark are not enough ● GPUs are not friendly to the average programmer ● Cluster management of GPUs as a resource is not typically done ● Many frameworks don’t work well in a distributed environment (getting better, though)
  26. Problems With Neural Nets ● Loss functions ● Scaling data ● Mixing different neural nets ● Hyperparameter tuning
  27. Loss Functions ● Classification ● Regression ● Reconstruction
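The three loss families above, written out as a NumPy sketch; the function names and the small epsilon constant are my own, not from any particular framework.

```python
# Sketch: the three loss families from the slide.
import numpy as np

def cross_entropy(probs, labels):
    """Classification: negative log-likelihood of the true class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def mse(pred, target):
    """Regression: mean squared error."""
    return np.mean((pred - target) ** 2)

def reconstruction_error(x, x_hat):
    """Reconstruction (autoencoders/RBMs): how far the rebuilt input is from the original."""
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))
```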
  28. Scaling Data ● Zero mean and unit variance ● Zero to 1 ● Other forms of preprocessing relative to the distribution of the data ● Processing can also be column-wise (categorical?)
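A column-wise sketch of the two scalings named above; in production the statistics would be computed on the training set and reused at serving time.

```python
# Sketch: column-wise feature scaling.
import numpy as np

def zero_mean_unit_variance(X):
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

def min_max_zero_to_one(X):
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins + 1e-8)

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
print(zero_mean_unit_variance(X))
print(min_max_zero_to_one(X))
```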
  29. Mixing and Matching Neural Networks ● Video: ConvNet + recurrent ● Convolutional RBMs? ● Convolutional -> subsampling -> fully connected ● DBNs: different hidden and visible units for each layer
  30. Hyperparameter Tuning ● Underfit ● Overfit ● Overdescribe (your hidden layers) ● Layerwise interactions ● Which activation function? (Competing? ReLU? Good ol’ sigmoid?)
  31. Hyperparameter Tuning (2) ● Grid search for neural nets (don’t do it!) ● Bayesian (getting better; there are at least priors here) ● Gradient-based approaches (your hyperparameters are a neural net, so there are neural nets optimizing your neural nets...)
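A sketch of random search over log-uniform hyperparameters; random search is not on the slide, but it is the usual cheap middle ground between the grid search the slide warns against and full Bayesian optimization. The scoring function here is a hypothetical stand-in for training a network and reporting validation loss.

```python
# Sketch: random search over learning rate and L2 strength on a log scale.
import numpy as np

rng = np.random.default_rng(0)

def train_and_score(learning_rate, l2):
    """Hypothetical stand-in for training a net and returning validation loss."""
    return (np.log10(learning_rate) + 2.5) ** 2 + (np.log10(l2) + 4.0) ** 2

best = None
for trial in range(20):
    lr = 10 ** rng.uniform(-5, -1)       # sample on a log scale, not a grid
    l2 = 10 ** rng.uniform(-6, -2)
    score = train_and_score(lr, l2)
    if best is None or score < best[0]:
        best = (score, lr, l2)

print("best validation loss %.4f at lr=%.1e, l2=%.1e" % best)
```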
  32. Questions? Twitter: @agibsonccc GitHub: agibsonccc LinkedIn: /in/agibsonccc Email: adam@skymind.io (combo breaker!) Web: deeplearning4j.org
