(CMP305) Deep Learning on AWS Made Easy


Deep learning is making news across the country as one of the most promising techniques in machine learning research. However, these methods are complex to implement, finicky to tune, and state-of-the-art accuracy is only achieved by a few experts in the field. In this session, we give a beginner-friendly explanation of deep learning using neural networks—what it is, what it does, and how; and introduce the concept of deep features, which allows you to obtain great performance with reduced running times and data set sizes. We then show how these methods can easily be deployed on GPU instances (G2) on Amazon EC2.


  1. 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Danny Bickson, Co-founder DATO CMP305 Deep Learning on AWS Made Easy October 2015
  2. 2. Who is Dato? Seattle-based machine learning company, a team of 45+ and growing fast!
  3. 3. Deep learning example
  4. 4. Image classification. Input x: image pixels → Output y: predicted object
  5. 5. Neural networks: learning *very* non-linear features
  6. 6. Linear classifiers (binary): Score(x) = w0 + w1 x1 + w2 x2 + … + wd xd; predict class 1 when Score(x) > 0 and class 0 when Score(x) < 0.
  7. 7. Graph representation of classifier, useful for defining neural networks. [Diagram: inputs x1, x2, …, xd and a constant 1 feed a single output node y; output 1 if Score(x) > 0, output 0 if Score(x) < 0.] Score(x) = w0 + w1 x1 + w2 x2 + … + wd xd
  8. 8. What can a linear classifier represent? x1 OR x2: Score(x) = -0.5 + x1 + x2; x1 AND x2: Score(x) = -1.5 + x1 + x2 (inputs and outputs thresholded to 0 or 1; see the check below).
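  To see these linear units in action, here is a minimal check in plain Python with NumPy (not part of the deck), using the weights from the slide:

      import numpy as np

      def linear_unit(bias, weights, x):
          # Score(x) = w0 + w1*x1 + w2*x2, thresholded at 0
          return 1 if bias + np.dot(weights, x) > 0 else 0

      for x1 in (0, 1):
          for x2 in (0, 1):
              x_or = linear_unit(-0.5, [1, 1], [x1, x2])   # x1 OR x2
              x_and = linear_unit(-1.5, [1, 1], [x1, x2])  # x1 AND x2
              print(x1, x2, "OR:", x_or, "AND:", x_and)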
  9. 9. What can’t a simple linear classifier represent? XOR, the counterexample to everything. Need non-linear features.
  10. 10. Solving the XOR problem by adding a layer. XOR = (x1 AND NOT x2) OR (NOT x1 AND x2). With hidden units thresholded to 0 or 1: z1 fires when -0.5 + x1 - x2 > 0, z2 fires when -0.5 - x1 + x2 > 0, and the output y fires when -0.5 + z1 + z2 > 0 (verified in the sketch below).
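  The XOR construction can be verified directly; this small sketch (not part of the deck) wires up the slide's weights:

      import numpy as np

      def unit(bias, weights, x):
          return 1 if bias + np.dot(weights, x) > 0 else 0  # threshold at 0

      def xor_net(x1, x2):
          z1 = unit(-0.5, [1, -1], [x1, x2])   # x1 AND NOT x2
          z2 = unit(-0.5, [-1, 1], [x1, x2])   # NOT x1 AND x2
          return unit(-0.5, [1, 1], [z1, z2])  # z1 OR z2

      assert [xor_net(a, b) for a, b in ((0, 0), (0, 1), (1, 0), (1, 1))] == [0, 1, 1, 0]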
  11. 11. A neural network • Layers and layers and layers of linear models and non-linear transformations • Around for about 50 years • Big resurgence in the last few years - Impressive accuracy on several benchmark problems - Advances in hardware make the computation practical (e.g., AWS G2 instances). [Diagram: two-layer network, inputs x1, x2 → hidden units z1, z2 → output y]
  12. 12. Application of deep learning to computer vision
  13. 13. Feature detection, the traditional approach • Features = local detectors (eye, eye, nose, mouth → Face!) - Combined to make a prediction - (In reality, features are more low-level)
  14. 14. Many hand-created features exist for finding interest points: SIFT [Lowe ‘99] • Spin Images [Johnson & Herbert ‘99] • Textons [Malik et al. ‘99] • RIFT [Lazebnik ’04] • GLOH [Mikolajczyk & Schmid ‘05] • HoG [Dalal & Triggs ‘05] • …
  15. 15. Standard image classification approach: Input → Extract hand-created features → Use simple classifier (e.g., logistic regression, SVMs) → Face?
  16. 16. Many hand-created features exist for finding interest points: SIFT [Lowe ‘99] • Spin Images [Johnson & Herbert ‘99] • Textons [Malik et al. ‘99] • RIFT [Lazebnik ’04] • GLOH [Mikolajczyk & Schmid ‘05] • HoG [Dalal & Triggs ‘05] • … but very painful to design.
  17. 17. Deep learning: implicitly learns features. [Figure: example detectors learned and example interest points detected at Layer 1, Layer 2, Layer 3 → Prediction; Zeiler & Fergus ‘13]
  18. 18. Deep learning performance
  19. 19. Deep learning accuracy • German traffic sign recognition benchmark - 99.5% accuracy (IDSIA team) • House number recognition - 97.8% accuracy per character [Goodfellow et al. ’13]
  20. 20. ImageNet 2012 competition: 1.2M training images, 1000 categories. [Chart: error (best of 5 guesses) for the top 3 teams (SuperVision, ISI, OXFORD_VGG); SuperVision shows a huge gain over runners-up that exploited hand-coded features like SIFT.]
  21. 21. ImageNet 2012 competition: 1.2M training images, 1000 categories. Winning entry: SuperVision, 8 layers, 60M parameters [Krizhevsky et al. ’12]. Achieving these amazing results required: • New learning algorithms • GPU implementation
  22. 22. Deep learning performance • ImageNet: 1.2M images. [Chart: training running time in hours on g2.xlarge vs. g2.8xlarge]
  23. 23. Deep learning in computer vision
  24. 24. Scene parsing with deep learning [Farabet et al. ‘13]
  25. 25. Retrieving similar images Input Image Nearest neighbors
  26. 26. Deep learning usability
  27. 27. Designed a simple user interface:
      import graphlab
      # train the model
      model = graphlab.neuralnet.create(train_images)
      # predict classes for new images
      outcome = model.predict(test_images)
  28. 28. Deep learning demo
  29. 29. Challenges of deep learning
  30. 30. Deep learning score card Pros • Enables learning of features rather than hand tuning • Impressive performance gains - Computer vision - Speech recognition - Some text analysis • Potential for more impact
  31. 31. Deep learning workflow: lots of labeled data → split into training set and validation set → learn deep neural net → validate → adjust parameters, network architecture, …
  32. 32. Many tricks needed to work well… Different types of layers, connections,… needed for high accuracy [Krizhevsky et al. ’12]
  33. 33. Deep learning score card. Pros • Enables learning of features rather than hand tuning • Impressive performance gains - Computer vision - Speech recognition - Some text analysis • Potential for more impact. Cons • Requires a lot of data for high accuracy • Computationally really expensive • Extremely hard to tune - Choice of architecture - Parameter types - Hyperparameters - Learning algorithm - … Computational cost + so many choices = incredibly hard to tune.
  34. 34. Deep features: Deep learning + Transfer learning
  35. 35. Standard image classification approach: Input → Extract hand-created features → Use simple classifier (e.g., logistic regression, SVMs) → Face? Can we learn features from data, even when we don’t have much data or time?
  36. 36. What’s learned in a neural net (trained for Task 1: cat vs. dog): early layers are more generic and can be used as a feature extractor; later layers are very specific to Task 1 and should be ignored for other tasks.
  37. 37. Transfer learning in more detail… Take a neural net trained for Task 1 (cat vs. dog) and keep its generic early-layer weights fixed. For Task 2 (predicting 101 categories), learn only the end part of the neural net, using a simple classifier (e.g., logistic regression, SVMs, nearest neighbor, …) to predict the class.
  38. 38. Careful where you cut: later layers may be too task-specific. [Figure: Layer 1, Layer 2, Layer 3 → Prediction, with example detectors learned and interest points detected; early layers: use these! Later layers: too specific for the new task. Zeiler & Fergus ‘13]
  39. 39. Transfer learning with deep features workflow: some labeled data → extract features with a neural net trained on a different task → split into training set and validation set → learn simple classifier → validate (a minimal sketch follows).
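  As an illustration of this workflow, here is a minimal sketch in Python with scikit-learn (not from the talk; the frozen random projection stands in for the early layers of a real pretrained network, and the random data is a placeholder):

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(0)

      # Stand-in for a pretrained network's early layers: a fixed projection
      # plus ReLU. In practice these weights come from a net trained on a
      # different task (e.g., cat vs. dog) and are kept fixed.
      W = rng.normal(size=(4096, 256))

      def extract_deep_features(x):
          return np.maximum(x @ W, 0.0)  # frozen "generic" layers

      # Task 2: a small labeled dataset (placeholder random data).
      x_train = rng.normal(size=(200, 4096))
      y_train = rng.integers(0, 2, size=200)

      # Learn only a simple classifier on top of the frozen deep features.
      clf = LogisticRegression(max_iter=1000)
      clf.fit(extract_deep_features(x_train), y_train)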
  40. 40. How general are deep features?
  41. 41. Barcelona Buildings
  42. 42. Architectural transition
  43. 43. Deep learning in production on AWS
  44. 44. How to use deep learning in production? Predictive: understands input & takes actions or makes decisions. Interactive: responds in real time. Learning: improves its performance with experience.
  45. 45. Intelligent service at the core…
  46. 46. Your intelligent application sits on an intelligent backend service: real-time data comes in, predictions & decisions go out; behind it, historical data feeds a machine learning model. Most ML research happens on the model… but ML research is useless without a great serving solution…
  47. 47. Essential ingredients of an intelligent service. Responsive: intelligent applications are interactive → need low latency, high throughput & high availability. Adaptive: ML models are out of date the moment learning is done → need to constantly understand & improve end-to-end performance. Manageable: many thousands of models, created by hundreds of people → need versioning, attribution, provenance & reproducibility.
  48. 48. Responsive: now and always. Intelligent applications are interactive → need low latency, high throughput & high availability.
  49. 49. Addressing latency
  50. 50. Challenge: scoring latency. Compute predictions in < 20 ms for complex models, all while under heavy query load. Models, queries, top-K, features: SELECT * FROM users JOIN items, click_logs, pages WHERE …
  51. 51. The common solutions to latency. Faster online model scoring: “Execute Predict(query) in real time as queries arrive.” Pre-materialization and lookup: “Pre-compute Predict(query) for all queries and look up the answer at query time.” Dato Predictive Services does both.
  52. 52. Faster online model scoring: highly optimized machine learning • SFrame: native code, optimized data frame - available open source (BSD) • Model querying acceleration with native code, e.g., top-K and nearest-neighbor evaluation via LSH, ball trees, … (illustrated below).
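  As a rough illustration of accelerated top-K retrieval (not Dato's native-code implementation; scikit-learn's BallTree stands in here, and the embeddings are made up):

      import numpy as np
      from sklearn.neighbors import BallTree

      rng = np.random.default_rng(0)
      items = rng.normal(size=(100_000, 64))  # placeholder item embeddings
      tree = BallTree(items)                  # built once, offline

      query = rng.normal(size=(1, 64))
      dist, idx = tree.query(query, k=10)     # top-10 nearest items at query time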
  53. 53. The common solutions to latency (recap). Faster online model scoring: “Execute Predict(query) in real time as queries arrive.” Pre-materialization and lookup: “Pre-compute Predict(query) for all queries and look up the answer at query time.” Dato Predictive Services does both.
  54. 54. Smart materialization → caching. [Chart: query frequency vs. unique queries] Example: the top 10% of all unique queries cover 90% of all queries performed. Caching a small number of unique queries has a very large impact.
  55. 55. Distributed shared caching: a distributed shared cache (Redis) holds model query results and common features (e.g., product info); scale-out improves throughput and latency (sketched below).
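  A minimal sketch of the cache-then-score pattern, assuming a Redis instance on localhost and the redis-py client (the key naming, TTL, and scoring stub are made up; this is not Dato's API):

      import json
      import redis

      r = redis.Redis(host="localhost", port=6379)  # shared cache tier

      def predict_online(query):
          return {"score": 0.5}  # stub for real-time model scoring

      def predict(query):
          key = "pred:" + query
          hit = r.get(key)
          if hit is not None:                      # easy case: cache hit
              return json.loads(hit)
          result = predict_online(query)           # hard case: score the model
          r.set(key, json.dumps(result), ex=3600)  # cache for later queries
          return result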
  56. 56. Dato latency by the numbers. Easy case, cache hit: ~2 ms. Hard case, cache miss: • Simple linear models: 5–6 ms • Complex random forests: 7–8 ms - P99: ~15 ms [on an AWS m3.xlarge instance]
  57. 57. Challenge: availability. Heavy load → substantial delays. Frequent model updates → cache misses. Machine failures.
  58. 58. Scale-out: availability under load. Heavy load → Elastic Load Balancing load balancer.
  59. 59. Adaptive: accounting for constant change. ML models are out of date the moment learning is done → need to constantly understand & improve end-to-end performance.
  60. 60. Change at different scales and rates. [Chart: rate of change (months → minutes) vs. granularity of change (population → session); e.g., “shopping for Mom” vs. “shopping for me”]
  61. 61. Individual- and session-level change (same axes as the previous slide): small data → online learning, and bandits to assess models.
  62. 62. The dangerous feedback loop. I once looked at cameras on Amazon… and was shown similar cameras, accessories, and bags. If this is all they show, how would they learn that I also like bikes and shoes?
  63. 63. Exploration / exploitation tradeoff: systems that take actions can adversely affect future data. Exploration (random action): learn more about what is good and bad. Exploitation (best action): make the best use of what we believe is good. (A minimal bandit sketch follows.)
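  For background, here is a minimal epsilon-greedy bandit in plain Python (a generic sketch, not Dato's implementation): with probability epsilon take a random action (explore), otherwise take the best-known action (exploit).

      import random

      class EpsilonGreedy:
          def __init__(self, n_actions, epsilon=0.1):
              self.epsilon = epsilon
              self.counts = [0] * n_actions
              self.values = [0.0] * n_actions  # running mean reward per action

          def choose(self):
              if random.random() < self.epsilon:  # explore: random action
                  return random.randrange(len(self.values))
              # exploit: best action seen so far
              return max(range(len(self.values)), key=self.values.__getitem__)

          def update(self, action, reward):
              self.counts[action] += 1
              n = self.counts[action]
              self.values[action] += (reward - self.values[action]) / n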
  64. 64. Dato solution to adaptivity: rapid offline learning with GraphLab Create; online bandit adaptation in Predictive Services • Demo
  65. 65. Manageable: unification and simplification. Many thousands of models, created by hundreds of people → need versioning, attribution, provenance & reproducibility.
  66. 66. Ecosystem of intelligent services: data infrastructure (MySQL), serving, and data science; Model A, Model B; Table A, Table B; Service A, Service B. Complicated! Many systems with overlapping roles, and no single source of truth for the intelligent service.
  67. 67. Dato Predictive Services: responsive, adaptive, manageable.
  68. 68. Model management is like code management, but for the life cycle of intelligent applications. Provenance & reproducibility • Track changes & roll back • Cover code, model type, parameters, data… Collaboration • Review, blame • Share • Common feature engineering pipelines. Continuous integration • Deploy & update • Measure & improve • Avoid downtime and impact on end users.
  69. 69. The stack: Dato Predictive Services (responsive, adaptive, manageable) serves models and manages the machine learning lifecycle; GraphLab Create provides accurate, robust, and scalable model training.
  70. 70. GraphLab Create: sophisticated machine learning made easy. High-level ML toolkits. AutoML: tunes parameters, model selection, … → so you can focus on the creative parts. Reusable features: transferable feature engineering → accuracy with less data & less effort.
  71. 71. High-level ML toolkits: get started with 4 lines of code, then modify, blend, add yours… Recommender, image search, sentiment analysis, data matching, auto tagging, churn predictor, object detector, product sentiment, click prediction, fraud detection, user segmentation, data completion, anomaly detection, document clustering, forecasting, search ranking, summarization, …
      import graphlab as gl
      data = gl.SFrame.read_csv('my_data.csv')
      model = gl.recommender.create(data, user_id='user', item_id='movie', target='rating')
      recommendations = model.recommend(k=5)
  72. 72. SFrame & SGraph: sophisticated machine learning made scalable. SFrame ❤️ all ML tools.
  73. 73. Opportunity for out-of-core ML. [Chart: storage capacity vs. throughput: 0.1 TB at 1 GB/s (fast, but significantly limits data size); 1 TB at 0.5 GB/s and 10 TB at 0.1 GB/s (opportunity for big data on one machine, but for sequential reads only; random access is very slow).] The out-of-core ML opportunity is huge: the usual design leads to lots of random access, which is slow; instead, design to maximize sequential access for ML algorithm access patterns. GraphChi was an early example; SFrame is a data frame for ML (see the sketch below).
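  To illustrate the sequential-access idea, here is a toy sketch in Python (not SFrame's implementation; the file name and sizes are made up): stream an on-disk column in large sequential chunks instead of issuing random reads.

      import numpy as np

      # Create a toy on-disk column (~40 MB), then stream it sequentially.
      col = np.memmap("big_column.f32", dtype=np.float32, mode="w+",
                      shape=(10_000_000,))
      col[:] = 1.0
      col.flush()

      data = np.memmap("big_column.f32", dtype=np.float32, mode="r")
      total = 0.0
      chunk = 1 << 22  # ~4M floats per sequential read
      for i in range(0, data.shape[0], chunk):
          block = np.asarray(data[i:i + chunk])  # one large sequential read
          total += float(block.sum())
      mean = total / data.shape[0]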
  74. 74. Performance of SFrame/SGraph: connected components on the Twitter graph (41 million nodes, 1.4 billion edges). GraphLab Create (SGraph, 1 machine): 70 sec; GraphX (16 machines): 251 sec; Giraph (16 machines): 200 sec; Spark (16 machines): 2,128 sec. Source: Gonzalez et al. (OSDI 2014).
  75. 75. SFrame & SGraph: optimized out-of-core computation for ML. High performance: 1 machine can handle TBs of data and 100s of billions of edges. Optimized for ML: columnar transformation, feature creation, iterators, filter/join/group-by/aggregate, user-defined functions; easily extended through the SDK. Handles tables, graphs, text, and images. Open source ❤️ BSD license.
  76. 76. The Dato machine learning platform: Predictive Services to serve models and manage the machine learning lifecycle; GraphLab Create to train accurate, robust, and scalable models.
  77. 77. Our customers
