
Distributed TensorFlow: Scaling Google's Deep Learning Library on Spark


We took Google's recently open-sourced deep learning library, TensorFlow, which ships as a single-machine implementation, and built a distributed implementation on Spark. We did this using the abstraction layer called Distributed DataFrame (DDF). TensorFlow started as an internal research project at Google and has since spread across the organization. Google open-sourced the project in Nov. 2015 but kept the distributed version closed. Other deep learning frameworks have been implemented on Spark, but TensorFlow's API is elegant and powerful and has proven to work at Google scale. Putting TensorFlow on Spark makes it easier for companies to harness the power of deep learning in their own businesses without sacrificing scalability.

KEY TAKE-AWAYS: (1) Google's TensorFlow release is a single-machine implementation; (2) we have built a distributed implementation on Spark that allows TensorFlow to scale horizontally.

This talk was presented at the Spark Summit East 2016; watch it here: https://www.youtube.com/watch?v=-QtcP3yRqyM&index=5&list=PLMpF-Q2WyoGudT3z-VPGs-6nnv45cod_R



  1. DISTRIBUTED TENSORFLOW: SCALING GOOGLE'S DEEP LEARNING LIBRARY ON SPARK | Christopher Nguyen, PhD | Chris Smith | Ushnish De | Vu Pham | Nanda Kishore | @pentagoniac @arimoinc #SparkSummit #DDF arimo.com
  2. Nov 2015: Distributed DL on Spark + Tachyon (Strata NYC) | Feb 2016: Distributed TensorFlow on Spark (Spark Summit East) | June 2016: TensorFlow Ensembles on Spark (Hadoop Summit)
  3. Outline: 1. Motivation 2. Architecture 3. Experimental Parameters 4. Results 5. Lessons Learned 6. Summary
  4. Motivation
  5. (1) Google TensorFlow (2) Spark Deployments
  6. TensorFlow Review
  7. Computational Graph
  8. Graph of a Neural Network
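Slides 7 and 8 carry only their titles in this transcript (the figures do not survive). As an illustrative stand-in, not taken from the talk, here is a minimal TensorFlow 1.x-style computational graph for a one-hidden-layer network; the layer sizes echo the MNIST configuration used later, and all names are our own:

    import tensorflow as tf  # TensorFlow 1.x graph API, in the spirit of the 2016 release

    # Graph inputs: nothing is computed until a Session evaluates the graph.
    x = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 MNIST images
    y = tf.placeholder(tf.float32, [None, 10])    # one-hot labels

    # One hidden layer plus a softmax output, expressed as graph ops.
    w1 = tf.Variable(tf.truncated_normal([784, 1024], stddev=0.1))
    b1 = tf.Variable(tf.zeros([1024]))
    hidden = tf.nn.relu(tf.matmul(x, w1) + b1)

    w2 = tf.Variable(tf.truncated_normal([1024, 10], stddev=0.1))
    b2 = tf.Variable(tf.zeros([10]))
    logits = tf.matmul(hidden, w2) + b2

    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
    train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # sess.run(train_step, feed_dict={x: batch_x, y: batch_y})  # one descent step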
  9. Why TensorFlow? Google uses TensorFlow for the following: ★ Search ★ Advertising ★ Speech Recognition ★ Photos ★ Maps ★ StreetView (blurring out house numbers) ★ Translate ★ YouTube
  10. DistBelief: "Large Scale Distributed Deep Networks," Jeff Dean et al., NIPS 2012. 1. Compute Gradients 2. Update Params (Descent)
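The two numbered steps are the DistBelief/Downpour-SGD loop: workers compute gradients on mini-batches and a parameter server applies the descent update w <- w - eta * dL/dw. A minimal sketch of the server-side step (our own illustration, not code from the talk):

    def apply_gradients(params, grads, learning_rate=0.01):
        # Step 2, "Update Params (Descent)": params and grads are dicts of numpy
        # arrays keyed by variable name. Workers send gradients asynchronously,
        # so a gradient may have been computed from slightly stale parameters.
        for name, g in grads.items():
            params[name] -= learning_rate * g
        return params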
  11. Arimo's 3 Pillars of Big Apps development: Predictive Analytics, Natural Interfaces, and Collaboration, built on Big Data + Big Compute
  12. Architecture
  13. TensorSpark Architecture
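Slide 13 carries only its title here. As a rough sketch of the worker side of a parameter-server design like the one described in the talk (hypothetical helper names; the actual adatao/tensorspark code may differ): each Spark partition repeatedly pulls the current parameters from the driver-hosted server, runs a TensorFlow gradient step on its shard, and pushes the gradients back.

    def train_partition(rows):
        # Runs on each Spark executor. pull_params / compute_gradients /
        # push_gradients are placeholders for the parameter-server transport
        # (HTTP in the talk) and the local TensorFlow gradient step.
        import numpy as np
        data = np.array(list(rows), dtype=np.float32)
        for batch in np.array_split(data, max(1, len(data) // 128)):  # mini-batches
            params = pull_params()                    # GET current weights from driver
            grads = compute_gradients(params, batch)  # run the local TF graph
            push_gradients(grads)                     # POST gradients back to server
        return []

    # training_rdd holds the training rows; count() just forces execution everywhere.
    training_rdd.mapPartitions(train_partition).count()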
  14. Spark & Tachyon Architectural Options: (1) Spark-only: model as a broadcast variable; (2) Tachyon-storage: model stored as a Tachyon file; (3) Param-server: model hosted in an HTTP server; (4) Tachyon-CoProc: model stored and updated by Tachyon
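Of these options, "model as a broadcast variable" is the easiest to sketch with stock Spark: broadcast the current weights, let each partition compute a gradient on its shard, and average on the driver. This is a synchronous sketch with an illustrative compute_gradients helper, not the implementation from the talk:

    def sgd_epoch(sc, training_rdd, params, learning_rate=0.01):
        # Spark-only option: ship the model to every executor as a broadcast variable.
        bc_params = sc.broadcast(params)

        def partition_gradient(rows):
            yield compute_gradients(bc_params.value, list(rows))  # illustrative helper

        # Average the per-partition gradients on the driver, then take one descent step.
        grads = training_rdd.mapPartitions(partition_gradient).collect()
        for name in params:
            params[name] -= learning_rate * sum(g[name] for g in grads) / len(grads)
        return params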
  15. Experimental Parameters
  16. Datasets
      Data Set  | Size                | Parameters | Hidden Layers | Hidden Units
      MNIST     | 50K x 784 = 39.2M   | 1.8M       | 2             | 1024
      Higgs     | 10M x 28 = 280M     | 8.5M       | 3             | 2048
      Molecular | 150K x 2871 = 430M  | 14.2M      | 3             | 2048
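As a rough check on the "Parameters" column (assuming fully connected layers of the stated widths and ignoring bias terms): MNIST 784*1024 + 1024*1024 + 1024*10 ≈ 1.86M; Higgs 28*2048 + 2*(2048*2048) ≈ 8.4M; Molecular 2871*2048 + 2*(2048*2048) ≈ 14.3M, which approximately match the 1.8M / 8.5M / 14.2M figures. The "Size" column is rows x features, i.e., the total number of input values.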
  17. Experimental Dimensions: 1. Dataset Size 2. Model Size 3. Compute Size 4. CPU vs. GPU
  18. Results
  19. (results chart)
  20. (results chart)
  21. Results: MNIST, Higgs, Molecular (charts)
  22. (results chart)
  23. (results chart)
  24. Lessons Learned
  25. GPU vs CPU ★ On a single machine, GPU is 8-10x faster ★ Distributed on Spark, GPU is 2-3x faster, with larger gains on larger datasets ★ GPU memory is limited; exceeding it hurts performance ★ AWS offers older GPUs, so TensorFlow must be built from source
  26. Distributed Gradients ★ An initial warm-up phase on the parameter server is needed ★ A "wobble" effect in convergence may still occur due to mini-batches; mitigations: ✦ Drop "obsolete" (stale) gradients ✦ Reduce the learning rate and initial parameter values ✦ Reduce the mini-batch size ★ Work to maximize network throughput, e.g., model compression
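Dropping "obsolete" gradients can be sketched as a staleness check on the parameter-server side; the threshold and message format below are our own illustration:

    MAX_STALENESS = 8  # illustrative threshold; tune per model and cluster

    def maybe_apply(server_step, params, grad_msg, learning_rate):
        # grad_msg = {"step": step at which the worker pulled the parameters,
        #             "grads": {name: numpy array}}
        if server_step - grad_msg["step"] > MAX_STALENESS:
            return server_step                 # too stale: drop the gradient
        for name, g in grad_msg["grads"].items():
            params[name] -= learning_rate * g
        return server_step + 1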
  27. Deployment Configuration ★ Run only one TensorFlow instance per machine ★ Compress gradients/parameters (e.g., send them as binaries) to reduce network overhead ★ Batch-size trade-off: convergence vs. communication overhead
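One simple way to send gradients and parameters "as binaries" rather than a text encoding is to ship raw float32 buffers, optionally zlib-compressed; these helpers are illustrative, not the talk's code:

    import zlib
    import numpy as np

    def encode(arr):
        # Raw little-endian float32 bytes are compact and compress well.
        return zlib.compress(arr.astype(np.float32).tobytes())

    def decode(blob, shape):
        flat = np.frombuffer(zlib.decompress(blob), dtype=np.float32)
        return flat.reshape(shape).copy()  # copy so the result is writable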
  28. Conclusions
  29. Conclusions ★ This implementation is best for large datasets and small models ✦ For large models, we are looking into a) model parallelism and b) model compression ★ If the data fits on a single machine, a single-machine GPU is best (if you have one) ★ For large datasets, leverage both speed and scale using a Spark cluster with GPUs ★ Future work: a) Tachyon b) Ensembles (Hadoop Summit, June 2016)
  30. We're OPEN SOURCING our work at https://github.com/adatao/tensorspark ★ Visit us at Booth K8 ★ Top 10 of the World's Most Innovative Companies in Data Science
