
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark


Run more experiments faster and in parallel. Share and reproduce research. Go from research to real products.


  1. 7 WAYS TO RUN TensorFlow IN PARALLEL
  2. Train and evaluate machine learning models at scale, from a single machine to a data center. How to run more experiments faster and in parallel? How to share and reproduce research? How to go from research to real products?
  3. More Data + Bigger Models. [Chart: accuracy vs. scale (data size, model size); neural networks vs. other approaches, 1990s]
  4. More Data + Bigger Models + More Computation. [Chart: accuracy vs. scale (data size, model size); neural networks vs. other approaches, now with more compute]
  5. Why Distributed Machine Learning? As data size and model size grow, work moves from a single machine to a data center. Model parallelism: training very large models. Data parallelism: exploring several model architectures, hyperparameter optimization, training several independent models; speeds up the training.
  6. Compute Workload for Training and Evaluation: compute-intensive (vs. I/O-intensive), from a single machine to a data center.
  7. I/O Workload for Simulation and Testing: I/O-intensive (vs. compute-intensive), from a single machine to a data center.
  8. Data Parallel vs. Model Parallel; Between-Graph Replication vs. In-Graph Replication.
  9. Data Shards vs. Data Combined
  10. Synchronous vs. Asynchronous
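The difference between the two update schemes can be illustrated with a toy simulation (plain Python, not the TensorFlow API): synchronous training averages all workers' gradients before applying one update, while asynchronous training applies each worker's update as it arrives, against whatever parameters are current at that moment.

```python
# Toy illustration of synchronous vs. asynchronous data-parallel updates.
# Each "worker" computes the gradient of f(w) = (w - 3)^2, i.e. 2*(w - 3).

def grad(w):
    return 2.0 * (w - 3.0)

def sync_step(w, num_workers, lr=0.1):
    # Synchronous: every worker sees the same parameters; the parameter
    # server averages the gradients and applies a single update.
    grads = [grad(w) for _ in range(num_workers)]
    return w - lr * sum(grads) / len(grads)

def async_step(w, num_workers, lr=0.1):
    # Asynchronous: workers apply their updates one after another, each
    # reading the (possibly already-updated) current parameters.
    for _ in range(num_workers):
        w = w - lr * grad(w)
    return w

w_sync = w_async = 0.0
for _ in range(50):
    w_sync = sync_step(w_sync, num_workers=4)
    w_async = async_step(w_async, num_workers=4)

print(round(w_sync, 4), round(w_async, 4))  # both converge toward 3.0
```

In this noise-free toy both schemes reach the optimum; in real training, the asynchronous path trades the straggler problem of synchronous updates for gradients computed against stale parameters.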
  11. (9/23/17) The seven options: TensorFlow Standalone; TensorFlow On YARN; TensorFlow On multi-colored YARN; TensorFlow On Spark; TensorFrames; TensorFlow On Kubernetes; TensorFlow On Mesos. Do not put all of your eggs into one basket.
  12. TensorFlow Standalone
  13. TensorFlow Standalone: dedicated cluster; short- and long-running jobs; flexibility. But: manual scheduling of workers; no shared resources; hard to share data with other applications; no data locality.
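In standalone mode, "manual scheduling of workers" means each process is started by hand with an explicit cluster definition. A minimal sketch using the TF 1.x-era distributed API (`tf.train.ClusterSpec` / `tf.train.Server`, current when this deck was written); the host addresses are made up:

```python
def make_cluster(ps_hosts, worker_hosts):
    # Explicit cluster map: the identical dict must be given to every process.
    return {"ps": ps_hosts, "worker": worker_hosts}

cluster_def = make_cluster(
    ps_hosts=["ps0.example.com:2222"],            # hypothetical hosts
    worker_hosts=["worker0.example.com:2222",
                  "worker1.example.com:2222"],
)

def start_task(job_name, task_index):
    # Run by hand on each machine with its own role, e.g.
    #   start_task("ps", 0) on the parameter server,
    #   start_task("worker", 0) and start_task("worker", 1) on the workers.
    import tensorflow as tf  # TF 1.x API
    cluster = tf.train.ClusterSpec(cluster_def)
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)
    if job_name == "ps":
        server.join()  # parameter servers block and serve variables
    return server
```

There is no scheduler in the loop: if a machine dies or the topology changes, the operator edits the cluster map and restarts the affected processes, which is exactly the drawback the later slides contrast against.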
  14. TensorFlow On YARN (Intel) v3
  15. TensorFlow On YARN (Intel): shared cluster and data; optimised long-running jobs; scheduling; data locality (not yet implemented). But: not easy to adopt upstream changes rapidly; fault tolerance not yet implemented; GPUs still not seen as a "native" resource on YARN; no use of YARN elasticity.
  16. TensorFlow On multi-colored YARN (Hortonworks) v3: YARN-3611 and YARN-4793 not yet implemented
  17. TensorFlow On multi-colored YARN (Hortonworks): shared cluster; GPUs shared by multiple tenants and applications; centralised scheduling. But: YARN-3611 and YARN-4793 not implemented yet; needs a YARN wrapper for NVIDIA Docker (GPU driver); not implemented yet!
  18. TensorFlow On Spark (Yahoo) v2
  19. TensorFlow On Spark (Yahoo): shared cluster and data; data locality through HDFS or other Spark sources; ad hoc training and evaluation; slice and dice data with Spark distributed transformations. But: scheduling not optimal; necessary to "convert" an existing TensorFlow application, although a simple process; might need to restart the Spark cluster.
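The "conversion" referred to is mostly mechanical: TensorFlowOnSpark wraps an ordinary TensorFlow main function in a `map_fun(args, ctx)` that it launches on each Spark executor, where `ctx` carries that process's job name and task index. A sketch under those assumptions (the `TFCluster.run` call follows the project's published API; the function bodies are illustrative):

```python
def map_fun(args, ctx):
    # Runs once per Spark executor; ctx tells this process its role in the
    # TensorFlow cluster (the real map_fun would build and train the graph).
    role = "parameter server" if ctx.job_name == "ps" else "worker"
    return "task %d runs as %s" % (ctx.task_index, role)

def launch(sc, num_executors):
    # Driver-side launch; requires a running Spark cluster and the
    # tensorflowonspark package installed on all executors.
    from tensorflowonspark import TFCluster
    cluster = TFCluster.run(sc, map_fun, None,
                            num_executors, num_ps=1, tensorboard=False,
                            input_mode=TFCluster.InputMode.SPARK)
    cluster.shutdown()
```

With `InputMode.SPARK`, training data arrives as an RDD fed into the workers, which is what gives the data-locality and "slice and dice with Spark transformations" benefits listed above.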
  20. TensorFrames (Databricks) v2: Scala binding to TF via JNI
  21. TensorFrames (Databricks): possible shared cluster; TensorFrames infers the shapes of small tensors (no analysis required); data locality via RDDs. But: experimental; still no centralised scheduling, so TF and Spark need to be deployed and scheduled separately; TF and Spark might not be collocated; might need data transfer between some nodes.
  22. TensorFlow On Kubernetes
  23. TensorFlow On Kubernetes: shared cluster; centralised scheduling by Kubernetes; network orchestration, federation, etc. already solved; experimental support for managing NVIDIA GPUs (at this time better than YARN's, however).
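On Kubernetes, the cluster layout is typically injected into each pod as the `TF_CONFIG` environment variable, which distributed TensorFlow reads to learn the cluster map and the pod's own role. A minimal sketch of producing and parsing it (the pod/service names are made up):

```python
import json
import os

# What a Kubernetes manifest might inject into each pod (hypothetical hosts):
# the cluster map is identical everywhere, only "task" differs per pod.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "ps":     ["tf-ps-0.default.svc:2222"],
        "worker": ["tf-worker-0.default.svc:2222",
                   "tf-worker-1.default.svc:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

tf_config = json.loads(os.environ["TF_CONFIG"])
cluster = tf_config["cluster"]   # full cluster map, same in every pod
task = tf_config["task"]         # this pod's role
print(task["type"], task["index"], len(cluster["worker"]))  # worker 1 2
```

Because the scheduler, not the operator, decides where pods land, the same manifest can be rescheduled after a node failure, which is what makes Kubernetes' centralised scheduling attractive compared with the standalone setup.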
  24. TensorFlow On Mesos (Marathon)
  25. TensorFlow On Mesos: shared cluster; GPU-based scheduling; memory footprint; number of services relative to Kubernetes.
  26. Thank you