Tuning and Monitoring Deep Learning on Apache Spark

Deep Learning on Apache Spark has the potential for huge impact in research and industry. This talk will describe best practices for building deep learning pipelines with Spark.

Rather than comparing deep learning systems or specific optimizations, this talk will focus on issues that are common to many deep learning frameworks when running on a Spark cluster: optimizing cluster setup and data ingest, tuning the cluster, and monitoring long-running jobs. We will demonstrate the techniques we cover using Google’s popular TensorFlow library.

More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters. Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput. Interactive monitoring facilitates both configuration work and checking the stability of deep learning jobs.

Speaker: Tim Hunter

This talk was originally presented at Spark Summit East 2017.



  1. Tuning and Monitoring Deep Learning on Apache Spark. Tim Hunter, Databricks, Inc.
  2. About me
     • Software engineer at Databricks
     • Apache Spark contributor
     • Ph.D. in Machine Learning from UC Berkeley (and Spark user since Spark 0.2)
  3. Outline
     • Lessons (and challenges) learned from the field
     • Tuning Spark for Deep Learning and GPUs
     • Loading data in Spark
     • Monitoring
  4. Deep Learning and Spark
     • 2016: the year of emerging solutions for combining Spark and Deep Learning
       – 6 talks on Deep Learning, 3 talks on GPUs
     • Yet, no consensus:
       – Official MLlib support is limited (perceptron-like networks)
       – Each talk mentions a different framework
  5. Deep Learning and Spark
     • Deep learning frameworks with Spark bindings:
       – Caffe (CaffeOnSpark)
       – Keras (Elephas)
       – mxnet
       – Paddle
       – TensorFlow (TensorFlow on Spark, TensorFrames)
     • Running natively on Spark:
       – BigDL
       – DeepDist
       – DeepLearning4J
       – MLlib
       – SparkCL
       – SparkNet
     • Extensions to Spark for specialized hardware:
       – Blaze (UCLA & Falcon Computing Solutions)
       – IBM Conductor with Spark
     (Alphabetical order!)
  6. One Framework to Rule Them All?
     • Should we look for The One Deep Learning Framework?
  7. Databricks’ perspective
     • Databricks: a Spark vendor on top of a public cloud
     • Now provides GPU instances
     • Enables customers with compute-intensive workloads
     • This talk:
       – lessons learned when deploying GPU instances on a public cloud
       – what Spark could do better when running Deep Learning applications
  8. ML in a data pipeline
     • ML is a small part of the full pipeline
     [Figure: “Hidden Technical Debt in Machine Learning Systems”, Sculley et al., NIPS 2015]
  9. DL in a data pipeline (1)
     • Training tasks:
     [Pipeline diagram: Data collection → ETL → Featurization → Deep Learning → Validation → Export, Serving. The stages before and after deep learning are IO intensive and run on a large cluster with a high memory/CPU ratio; the deep learning stage is compute intensive and runs on a small cluster with a low memory/CPU ratio.]
  10. DL in a data pipeline (2)
     • Specialized data transform (feature extraction, …)
     [Figure: input photos of cats and dogs mapped to the output labels “cat”, “dog”, “dog”. Photo credit: Saulius Garalevicius, CC BY-SA 3.0]
  11. Recurring patterns
     • Spark as a scheduler (see the sketch below)
       – Embarrassingly parallel tasks
       – Data stored outside Spark
     • Embedded Deep Learning transforms
       – Data stored in DataFrames/RDDs
       – Follows the RDD guarantees
     • Specialized computations
       – Multiple passes over data
       – Specific communication patterns
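To make the first pattern concrete, here is a minimal PySpark sketch of “Spark as a scheduler”: Spark only distributes independent tasks, and the data itself never enters an RDD. The bucket paths and the body of `process_file` are hypothetical placeholders, not part of the original talk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scheduler-pattern").getOrCreate()
sc = spark.sparkContext

# Hypothetical list of inputs living outside Spark (e.g. objects in S3).
paths = ["s3://my-bucket/images/part-%05d" % i for i in range(100)]

def process_file(path):
    # Placeholder: download the file, run the deep learning model on it,
    # write the result back to external storage, return a small status record.
    return (path, "ok")

# One task per input; Spark handles scheduling and retries, but the image
# data itself is never materialized as an RDD.
statuses = sc.parallelize(paths, numSlices=len(paths)).map(process_file).collect()
```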
  12. Using GPUs through PySpark
     • A popular choice when there are a lot of independent tasks
     • A lot of DL packages have a Python interface: TensorFlow, Theano, Caffe, mxnet, etc.
     • Lifetime for Python packages: the process
     • Requires some configuration tweaks in Spark
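One way to apply these points (an illustrative sketch, not from the slides): pin each task to a single GPU before the framework initializes, and import the library inside the task so that it lives in the executor’s Python worker process. `input_rdd` and the number of GPUs per worker are assumptions.

```python
import os

NUM_GPUS_PER_WORKER = 4  # assumption: each worker machine has 4 GPUs

def score_partition(index, rows):
    # Restrict this task to one GPU *before* the DL framework starts,
    # so concurrent tasks on the same worker do not grab the same device.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index % NUM_GPUS_PER_WORKER)
    import tensorflow as tf  # loaded once per Python worker process
    # ... build or restore the model here, then score each element ...
    for row in rows:
        yield row  # placeholder: yield the model's output instead

# input_rdd: hypothetical RDD of input batches
scored = input_rdd.mapPartitionsWithIndex(score_partition)
```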
  13. PySpark recommendation
     • spark.executor.cores = 1
       – Gives the DL framework full access to all the resources
       – This may be important for some frameworks that attempt to optimize the processor pipelines
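For example, the setting can be applied when the context is created; an equivalent `--conf` flag on `spark-submit` or the cluster configuration would also work.

```python
from pyspark import SparkConf, SparkContext

# One core, hence one concurrent task, per executor: the deep learning
# framework running inside that task is free to use the whole machine.
conf = SparkConf().setAppName("dl-on-spark").set("spark.executor.cores", "1")
sc = SparkContext(conf=conf)
```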
  14. Cooperative frameworks
     • Use Spark for data input
     • Examples:
       – IBM GPU efforts
       – Skymind’s DeepLearning4J
       – DistML and other Parameter Server efforts
     [Diagram: an input RDD (partitions 1 … n) is handed to a black-box framework, which produces an output RDD (partitions 1 … m).]
  15. Cooperative frameworks
     • Bypass Spark for asynchronous / specific communication patterns across machines
     • Lose the benefits of RDDs and DataFrames, and of reproducibility/determinism
     • But these guarantees are not needed anyway when doing deep learning (stochastic gradient descent)
     • “Reproducibility is worth a factor of 2” (Leon Bottou, quoted by John Langford)
  16. Streaming data through DL
     • The general choices:
       – Cold layer (HDFS/S3/etc.)
       – Local storage: files, Spark’s on-disk persistence layer
       – In memory: Spark RDDs or Spark DataFrames
     • Find out if you are I/O-constrained or processor-constrained
       – How big is your dataset? MNIST or ImageNet?
     • If using PySpark:
       – All frameworks are heavily optimized for disk I/O
       – Use Spark’s broadcast for small datasets that fit in memory (sketch below)
       – Reading files is fast: use local files when the data does not fit in memory
     • If using a binding:
       – Each binding has its own I/O layer
       – In general, better performance by caching in files (for the same reasons)
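A sketch of the “small dataset that fits in memory” case from the PySpark bullet: broadcast the arrays once and read them from each task. An existing SparkContext `sc` is assumed, and `load_small_dataset` and `param_grid` are hypothetical placeholders.

```python
# Broadcast a small training set (e.g. MNIST-sized) once; every task then
# reads it from the executor's memory instead of re-reading it from storage.
train_x, train_y = load_small_dataset()      # hypothetical loader returning NumPy arrays
dataset_bc = sc.broadcast((train_x, train_y))

def train_one_config(params):
    x, y = dataset_bc.value                  # no extra I/O inside the task
    # ... train a model with these hyper-parameters on (x, y) ...
    return params, 0.0                       # placeholder: return the validation score

results = sc.parallelize(param_grid).map(train_one_config).collect()
```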
  17. Detour: simplifying the life of users
     • Deep Learning is commonly used with GPUs
     • A lot of work has gone into Spark’s dependencies:
       – Few dependencies on the local machine when compiling Spark
       – The build process works well in a large number of configurations (just Scala + Maven)
     • GPUs present challenges: CUDA, support libraries, drivers, etc.
       – Deep software stack that requires careful construction (hardware + drivers + CUDA + libraries)
       – Users expect all of these to just work
       – Turnkey stacks are only just starting to appear
  18. Simplifying the life of users
     • Provide a Docker image with all the GPU SDK (container: nvidia-docker, lxc, etc.)
     • Pre-install the GPU drivers on the instance
     [Diagram: the software stack, from the GPU hardware, Linux kernel and NV kernel driver, through CUDA (with the NV kernel driver userspace interface), CuBLAS and CuDNN, up to the deep learning libraries (TensorFlow, etc.), JCUDA and the Python / JVM clients.]
  19. Monitoring?
  20. Monitoring
     • How do you monitor the progress of your tasks?
     • It depends on the granularity:
       – Around tasks
       – Inside (long-running) tasks
  21. Monitoring: Accumulators
     • Good to check throughput or failure rate
     • Works for Scala
     • Limited use for Python (for now, SPARK-2868)
     • No “real-time” update

     batchesAcc = sc.accumulator(0)
     def processBatch(i):
         global batchesAcc
         batchesAcc += 1
         # Process image batch here
     images = sc.parallelize(…)
     images.map(processBatch).collect()
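For reference, a self-contained variant of the snippet above (the local session and placeholder batches are assumptions), showing when the counter actually becomes visible on the driver:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("acc-demo").getOrCreate()
sc = spark.sparkContext

batchesAcc = sc.accumulator(0)

def processBatch(batch):
    # ... process one image batch here ...
    batchesAcc.add(1)

images = sc.parallelize(range(100), numSlices=8)  # placeholder batches
images.foreach(processBatch)

# The aggregated count is only available once the job has finished:
# there is no "real-time" view of the accumulator while tasks run.
print(batchesAcc.value)  # -> 100
spark.stop()
```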
  22. Monitoring: external system
     • Plugs into an external system
     • Existing solutions: Grafana, Graphite, Prometheus, etc.
     • Most flexible, but more complex to deploy
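As an illustration of the “external system” option (not from the slides), a long-running task can push metrics to Graphite over its plaintext protocol while it runs; the host name and `batches_rdd` are assumptions.

```python
import socket
import time

GRAPHITE = ("graphite.internal", 2003)    # hypothetical Graphite/Carbon endpoint

def report_metric(name, value):
    # Graphite plaintext protocol: "<metric.path> <value> <unix timestamp>\n"
    line = "%s %f %d\n" % (name, value, int(time.time()))
    with socket.create_connection(GRAPHITE, timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

def train_partition(batches):
    for step, batch in enumerate(batches):
        loss = 0.0                        # placeholder: one optimization step on `batch`
        if step % 100 == 0:
            report_metric("dl.training.loss", loss)   # dashboards (e.g. Grafana) update live
        yield loss

# batches_rdd: hypothetical RDD of training batches
losses = batches_rdd.mapPartitions(train_partition)
```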
  23. Conclusion
     • Distributed deep learning: an exciting and fast-moving space
     • Most insights are specific to a task, a dataset and an algorithm: nothing replaces experiments
     • It is easy to get started with parallel jobs
     • Move up in complexity only when you have enough data and insufficient patience
  24. Challenges to address
     • For Spark developers:
       – Monitoring long-running tasks
       – Presenting and introspecting intermediate results
     • For DL developers:
       – What boundary to put between the algorithm and Spark?
       – How to integrate with Spark at the low level?
  25. Thank You.
     Check our blog post on Deep Learning and GPUs!
     https://docs.databricks.com/applications/deep-learning/index.html
     Office hours: today 3:50 at the Databricks booth