
Distributed Deep Learning with Hadoop and TensorFlow



Training deep neural nets can take a long time and heavy resources. By leveraging existing distributed versions of TensorFlow and Hadoop, you can train neural nets quickly and efficiently.

Published in: Business


  1. 1. Distributed Deep Learning with Hadoop and TensorFlow
  2. 2. Image Classification - 2016. Human performance: 95%. AI performance: 97%. The ability to understand the content of an image by using machine learning.
  3. 3. AI beats humans in games - 2016. Go: AlphaGo beats L. Sedol 4:1 in 2016. Chess: Komodo beats H. Nakamura 2:1 in 2016.
  4. 4. Breast Cancer Diagnoses - 2017. Pathologist performance: 73%. AI performance: 92%. Doctors often use additional tests to find or diagnose breast cancer. The pathologist ended up spending 30 hours on this task on 130 slides. Image: a closeup of a lymph node biopsy.
  5. 5. Google TPU
  6. 6. The power of 12 GB HBM2 memory and 640 Tensor Cores, delivering 110 TeraFLOPS of performance.
  7. 7. AI history → Perceptron. 1943: W. McCulloch, W. Pitts, “Neuron” as logical element. 1958: F. Rosenblatt, “Perceptron” model, neural networks. 1969: M. Minsky, S. Papert, trigger first AI winter. Figure labels: OR function, XOR function, feed forward.
  8. 8. AI history → AI winter. 1943: W. McCulloch, W. Pitts, neuron as logical element. 1958: F. Rosenblatt, Perceptron model, neural networks. 1969: M. Minsky, S. Papert, trigger first AI winter. 1980: boom of expert systems, Q&A using logical rules, Prolog. 1987-1993: the second AI winter, desktop computers, LISP machines expensive. 1993-2001: Moore’s law, Deep Blue chess-playing, Stanford DARPA challenge.
  9. 9. 12 Machine Learning Problem Types
  10. 10. Structured data 80% of world’s data is unstructured
  11. 11. Fishing in the sea versus fishing in the lake. Data Warehouse: structured data & schema-on-write; SQL-ish queries on database tables; Extract, Transform, Load; expensive for large data. Data Lake: any kind of data & schema-on-read; parallel processing on big data; Extract, Load, Transform-on-the-fly; low cost on commodity hardware. Business Intelligence helps find answers to questions you know. Data Science helps you find the question itself.
  12. 12. More Data + Bigger Models. Plot: accuracy vs. scale (data size, model size); curves: other approaches, neural networks; label: 1990s.
  13. 13. More Data + Bigger Models + More Computation. Plot: accuracy vs. scale (data size, model size); curves: other approaches, neural networks; labels: now, more compute.
  14. 14. More Data + Bigger Models + More Computation = Better Results in Machine Learning
  15. 15. Millions of “trip” events each day globally; 400+ billion viewing-related events per day; five billion data points for Price Tip feature. Use cases: movie recommendation, price optimization, routing and price optimization.
  16. 16. How to start?
  17. 17. Single machine, ML specialist, small data
  18. 18. Single machine, ML specialist, small data; single machine, ML specialist, small data
  19. 19. Single machine, ML specialist, small data; single machine, ML specialist, small data; X X
  20. 20. Single machine, ML specialist, big data; single machine, ML specialist, big data; X X
  21. 21. Train and evaluate machine learning models at scale. Single machine; data center. How to run more experiments faster and in parallel? How to share and reproduce research? How to go from research to real products?
  22. 22. Distributed Machine Learning. Axes: data size, model size. Single machine; data center. Model parallelism; data parallelism. Training very large models; exploring several model architectures, hyperparameter optimization, training several independent models; speeds up the training.
  23. 23. Compute Workload for Training and Evaluation I/O intensive Compute intensive Single machine Data center
  24. 24. I/O Workload for Simulation and Testing I/O intensive Compute intensive Single machine Data center
  25. 25. Distributed Machine Learning
  26. 26. Distributed Machine Learning X
  27. 27. The new rising star
  28. 28. 12/19/17. TensorFlow Standalone; TensorFlow On YARN; TensorFlow On multi-colored YARN; TensorFlow On Spark; TensorFrames; TensorFlow On Kubernetes; TensorFlow On Mesos. Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark -tensorflow-on-hadoop-mesos-kubernetes-spark
  29. 29. Data Parallel vs. Model Parallel Between-Graph Replication In-Graph Replication
  30. 30. Data Shards vs. Data Combined
  31. 31. Synchronous vs. Asynchronous
  32. 32. TensorFlow Standalone
  33. 33. TensorFlow Standalone: dedicated cluster; short & long running jobs; flexibility; manual scheduling of workers; no shared resources; hard to share data with other applications; no data locality.
  34. 34. TensorFlow On YARN (Intel) v3 released March 12, 2017 / YARN-6043
  35. 35. TensorFlow On YARN (Intel): shared cluster and data; optimised long running jobs; scheduling; data locality (not yet implemented); not easy to have rapid adoption from upstream; fault tolerance not yet implemented; GPU still not seen as a “native” resource on YARN; no use of YARN elasticity.
  36. 36. TensorFlow On multi-colored YARN (Hortonworks) v3 Not yet implemented!
  37. 37. TensorFlow On multi-colored YARN (Hortonworks): shared cluster; GPUs shared by multiple tenants and applications; centralised scheduling; YARN-3611 Docker support; YARN-4793 native processes; needs YARN wrapper of NVIDIA Docker (GPU driver); not implemented yet!
  38. 38. TensorFlow On Spark (Yahoo) v2 released January 22, 2017
  39. 39. TensorFlow On Spark (Yahoo): shared cluster and data; data locality through HDFS or other Spark sources; ad hoc training and evaluation; slice and dice data with Spark distributed transformations; scheduling not optimal; necessary to “convert” existing TensorFlow application, although a simple process; might need to restart Spark cluster; no GPU resource management.
  40. 40. TensorFrames (Databricks) v2 Scala binding to TF via JNI released Feb 28, 2016
  41. 41. TensorFrames (Databricks): possible shared cluster; TensorFrames infers the shapes for small tensors (no analysis required); data locality via RDD; experimental; still no centralised scheduling, TF and Spark need to be deployed and scheduled separately; TF and Spark might not be collocated; might need data transfer between some nodes.
  42. 42. TensorFlow On Kubernetes
  43. 43. TensorFlow On Kubernetes: shared cluster; centralised scheduling by Kubernetes; solved network orchestration, federation etc.; experimental support for managing NVIDIA GPUs (at this time better than YARN, however); fault tolerance; data locality.
  44. 44. TensorFlow On Mesos Marathon
  45. 45. TensorFlow On Mesos: shared cluster; GPU-based scheduling; short and long running jobs; memory footprint; number of services relative to Kubernetes; fault tolerance; data locality.
  46. 46. Hidden Technical Debt in Machine Learning Systems Google, 2015
  47. 47. Hidden Technical Debt in Machine Learning Systems Google, 2015
  48. 48. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform Google, 2017
  49. 49. Michelangelo: Uber’s Machine Learning Platform
  51. 51. Pricing for 890,000 real-time predictions w/o training (Q3, 2017). AWS: Compute Fees + Prediction Fees = $8.40 + $96.44 = $104.84 per month. Google: Prediction $0.10 per thousand predictions, plus $0.40 per hour = $377 per month. Azure: Packages $0, $100.13, $1,000.06, $9,999.98 = $1,000 per month.
  53. 53. High-level Development Process for Autonomous Vehicles. 1: Collect sensor data. 2: Model Engineering. 3: Autonomous Driving. Diagram labels: Data Logger, Control Unit, Big Data, Trained Model, Data Center. Agenda.
  54. 54. Sensors: Udacity Lincoln MKZ. Camera: 3x Blackfly GigE Camera, 20 Hz. Lidar: Velodyne HDL-32E, 9.5 Hz. IMU: Xsens, 400 Hz. GPS: 2x fixed, 1 Hz. CAN bus: 1.1 kHz. Robot Operating System. Data: 3 GB per minute.
  55. 55. Sensors Spec. Columns: sensor | blinding, sunlight, darkness | rain, fog, snow | non-metal objects | wind/high velocity | resolution | range | data. Ultrasonic | yes | yes | yes | no | + | + | +. Lidar | yes | no | yes | yes | +++ | ++ | +. Radar | yes | yes | no | yes | ++ | +++ | +. Camera | no | no | yes | yes | +++ | +++ | +++.
  56. 56. Machine Learning 101. Observations → State Estimation → Modeling & Prediction → Planning → Controls. f(x): Controls, Observations.
  57. 57. Machine Learning for Autonomous Driving. Sensor Fusion: clustering, segmentation, pattern recognition. Road: ego-motion, image processing and pattern recognition. Localization: simultaneous localization and mapping. Situation Understanding: detection and classification. Trajectory Planning: motion planning and control. Control Strategy: reinforcement and supervised learning. Driver Model: image processing and pattern recognition.
  58. 58. Machine Learning Cycle. Data collection for training/test; feature engineering; I/O workload; model development and architecture; compute workload; I/O workload; training and evaluation; re-simulation and testing; scaling and monitoring; model deployment; versioning; 1 2 3; model tuning.
  59. 59. Flux – Open Machine Learning Stack. Training & Test data; Compute + Network + Storage; Deploy model; ML Development & Catalog & REST API; ML-Specialists; Feature Engineering; Training; Evaluation; Re-Simulation; Testing; CaffeOnSpark. Sample, Model, Prediction, Batch, Regression, Cluster, Dataset, Correlation, Centroid, Anomaly, Test, Scores. ✓ Mainly open source ✓ No vendor lock-in ✓ Scale-out architecture ✓ Multi-user support ✓ Resource management ✓ Job scheduling ✓ Speed-up training ✓ Speed-up simulation
  60. 60. Feature Engineering. + Hadoop InputFormat and Record Reader for Rosbag. + Process Rosbag with Spark, Yarn, MapReduce, Hadoop Streaming API, … + Spark RDDs are cached and optimized for analysis. Diagram labels: Ros bag, Processing Engine, Computer, Network, Storage, Advanced Analytics, RDD, Record Reader, RDD, DataFrame, DataSet, SQL, Spark APIs, NumPy, Ros Msg.
  61. 61. Training & Evaluation. + TensorFlow ROSRecordDataset. + Protocol Buffers to serialize records. + Save time because data conversion is not needed. + Save storage because data duplication is not needed. Diagram labels: Training Engine, Machine Learning, Ros bag, Computer, Network, Storage, ROS Dataset, Ros msg.
  62. 62. Re-Simulation & Testing. + Use Spark for preprocessing, transformation, cleansing, aggregation, time window selection before publishing to ROS topics. + Use the Re-Simulation framework of choice to subscribe to the ROS topics. Diagram labels: Engine, Re-Simulation with framework of choice, Computer, Network, Storage, Ros bag, Ros topic, core, subscribe, publish.
  63. 63. Time Travel fold(left) t fold(right) reduce/ shuffle
  64. 64. HOW TO START?
  65. 65. Think Big; Business Strategy; Data Strategy; Technology Strategy; Agile Delivery Model; Business Case Validation; Prototypes, MVPs; Data Exploration; Data Acquisition; Start Small; Value Proposition.
  66. 66. Data Science, Machine Learning. + Classification, Regression, Clustering, Collaborative Filtering, Anomaly Detection. + Supervised/Unsupervised Reinforcement Learning, Deep Learning, CNN. + Model Training, Evaluation, Testing, Simulation, Inference. + Big Data Strategy, Consulting, Data Lab, Data Science as a Service. + Data Collection, Cleaning, Analyzing, Modeling, Validation, Visualization. + Business Case Validation, Prototyping, MVPs, Dashboards.
  67. 67. Data Engineering, Data Operations. + Architecture, DevOps, Cloud Building. + App. Management Hadoop Ecosystem. + Managed Infrastructure Services. + Compute, Network, Storage, Firewall, Loadbalancer, DDoS Protection. + Continuous Integration and Deployment. + Data Pipelines (Acquisition, Ingestion, Analytics, Visualization). + Distributed Data Architectures. + Data Processing Backend. + Hadoop Ecosystem. + Test Automation and Testing.
  68. 68. “Culture eats strategy for breakfast, technology for lunch, and products for dinner, and soon thereafter everything else too.” Peter Drucker
  69. 69. thank you
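
Slides 22 and 29 to 31 name data parallelism, data shards, and synchronous updates without showing what they look like. The following is a minimal sketch in plain Python, not taken from the deck and not using TensorFlow. It assumes a hypothetical toy problem (fit y = w * x), and the names `shard`, `local_gradient`, and `train_sync` are illustrative only. Each simulated worker computes a gradient on its own shard, and the gradients are averaged before one shared update.

```python
# Sketch of synchronous data-parallel SGD on a toy problem (assumption:
# a single weight w and squared error; workers are simulated sequentially).

def shard(data, num_workers):
    """Split the dataset into one shard per worker (round-robin)."""
    return [data[i::num_workers] for i in range(num_workers)]

def local_gradient(w, examples):
    """Gradient of the mean squared error on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in examples) / len(examples)

def train_sync(data, num_workers=4, lr=0.01, steps=200):
    """Average the per-shard gradients, then apply one shared update per step."""
    shards = shard(data, num_workers)
    w = 0.0
    for _ in range(steps):
        # In a real cluster these gradients would be computed in parallel.
        grads = [local_gradient(w, s) for s in shards]
        w -= lr * sum(grads) / len(grads)
    return w

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # generated with weight 3.0
w = train_sync(data)
print(round(w, 3))
```

An asynchronous variant would instead apply each worker's gradient as soon as it arrives, trading gradient freshness for less waiting.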
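
The monthly totals on slide 51 can be re-derived with a few lines of arithmetic. This sketch uses only the figures quoted on the slide. The 720 hours per month (30 days x 24 hours) is an assumption, chosen because it reproduces the quoted Google total. The function names are hypothetical.

```python
from decimal import Decimal

PREDICTIONS_PER_MONTH = 890_000
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours

def aws_monthly():
    """Compute fees plus prediction fees, as quoted on the slide."""
    return Decimal("8.40") + Decimal("96.44")

def google_monthly(predictions=PREDICTIONS_PER_MONTH, hours=HOURS_PER_MONTH):
    """$0.10 per thousand predictions plus $0.40 per hour."""
    per_thousand = Decimal("0.10")
    per_hour = Decimal("0.40")
    return per_thousand * (predictions // 1000) + per_hour * hours

print("AWS:", aws_monthly())
print("Google:", google_monthly())
```

`Decimal` is used instead of floats so that the currency sums stay exact.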