Deep Learning on Apache® Spark™: Workflows and Best Practices

The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.

Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:

* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.

We will demonstrate these techniques using Google’s popular TensorFlow library, focusing on the typical issues users encounter when integrating deep learning libraries with Spark clusters.

Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingestion improves job throughput, and monitoring helps both with tuning the configuration and with keeping long-running deep learning jobs stable.

Deep Learning on Apache® Spark™: Workflows and Best Practices

1. Deep Learning on Apache® Spark™: Workflows and Best Practices. Tim Hunter (Software Engineer), Jules S. Damji (Spark Community Evangelist). May 4, 2017.
2. Agenda
• Logistics
• Databricks overview
• Deep Learning on Apache® Spark™: Workflows and Best Practices
• Q & A
3. Logistics
• We can’t hear you...
• The recording will be available...
• Slides and notebooks will be available...
• Queue up questions...
• Use the orange button for tech support difficulties...
4. VISION: Empower anyone to innovate faster with big data. WHO WE ARE: Founded by the creators of Apache Spark; contributes 75% of the open source code, 10x more than any other company. PRODUCT: A data processing platform for data scientists, data engineers, and data analysts that simplifies data integration, real-time experimentation, machine learning, and deployment of production pipelines.
5. A New Paradigm
• FIRST GENERATION: Data warehouses. The ETL process is rigid, scaling out is expensive, and you are limited to SQL.
• SECOND GENERATION: Hadoop + data lake. Hard to centralize data and extract value with disparate tools.
• THE BEST OF BOTH WORLDS: Virtual analytics. Holistically analyze data from data warehouses, data lakes, and other data stores; utilize a single engine for batch, ML, streaming & real-time queries; enable enterprise-wide collaboration.
6. VIRTUAL ANALYTICS PLATFORM
• Platform layers: cluster tuning & management, interactive workspace, production pipeline automation, optimized data access, Databricks enterprise security
• YOUR TEAMS: data science, data engineering, BI analysts, and many others
• YOUR DATA: cloud storage, data warehouses, data lakes
7. Deep Learning on Apache® Spark™: Workflows and Best Practices. Tim Hunter (Software Engineer). May 4, 2017.
8. About Me: Tim Hunter
• Software engineer @ Databricks
• Ph.D. from UC Berkeley in Machine Learning
• Very early Spark user
• Contributor to MLlib
• Author of TensorFrames and GraphFrames
9. Deep Learning and Apache Spark
• Deep learning frameworks with Spark bindings: Caffe (CaffeOnSpark), Keras (Elephas), MXNet, Paddle, TensorFlow (TensorFlowOnSpark, TensorFrames)
• Extensions to Spark for specialized hardware: Blaze (UCLA & Falcon Computing Solutions), IBM Conductor with Spark
• Native Spark: BigDL, DeepDist, DeepLearning4J, MLlib, SparkCL, SparkNet
10. Deep Learning and Apache Spark
• 2016: the year of emerging solutions for Spark + deep learning
• No consensus; many approaches for libraries: integrate existing ones with Spark, build on top of Spark, or modify Spark itself
• Official Spark MLlib support is limited (perceptron-like networks)
11. One Framework to Rule Them All? Should we look for The One Deep Learning Framework?
12. Databricks’ perspective
• Databricks: hosted Spark platform on public cloud
• GPUs for compute-intensive workloads
• Customers use many deep learning frameworks: TensorFlow, MXNet, BigDL, Theano, Caffe, and more
This talk:
• Lessons learned from supporting many deep learning frameworks
• Multiple ways to integrate deep learning & Spark
• Best practices for these integrations
13. Outline
• Deep learning in data pipelines
• Recurring patterns in Spark + deep learning integrations
• Developer tips
• Monitoring
14. Outline: Deep learning in data pipelines
15. ML is a small part of data pipelines. (“Hidden Technical Debt in Machine Learning Systems,” Sculley et al., NIPS 2015)
16. DL in a data pipeline: Training. Stages: data collection → ETL → featurization → deep learning → validation → export/serving. The surrounding stages are I/O intensive and suit a large cluster with a high memory/CPU ratio; the deep learning stage is compute intensive and suits a small cluster with a low memory/CPU ratio.
17. DL in a data pipeline: Transformation. Specialized data transforms: feature extraction & prediction. [Figure: input images mapped to output labels (cat, dog, dog). Image: Saulius Garalevicius, CC BY-SA 3.0]
18. Outline: Recurring patterns in Spark + deep learning integrations
19. Recurring patterns
• Spark as a scheduler: data-parallel tasks; data stored outside Spark (see the sketch below)
• Embedded deep learning transforms: data-parallel tasks; data stored in DataFrames/RDDs
• Cooperative frameworks: multiple passes over data; heavy and/or specialized communication
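
As an illustration of the first pattern, here is a minimal PySpark sketch (not from the slides) of using Spark purely as a scheduler: the data never enters Spark, and Spark only distributes work items. The shard paths and the training function are hypothetical placeholders.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("dl-scheduler").getOrCreate().sparkContext

    # Hypothetical shard locations; the data itself stays outside Spark.
    shard_paths = ["/data/train/shard-%03d" % i for i in range(8)]

    def train_on_shard(path):
        # In a real job: read the shard from shared storage and train a model
        # locally. We return a placeholder (path, loss) pair so the sketch
        # runs end to end.
        return (path, 0.0)

    # One task per shard; Spark only schedules work and collects results.
    results = sc.parallelize(shard_paths, len(shard_paths)).map(train_on_shard).collect()
    print(results)
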
20. Streaming data through DL
• Primary storage choices: cold layer (HDFS/S3/etc.); local storage (files, Spark’s on-disk persistence layer); in memory (Spark RDDs or Spark DataFrames)
• Find out if you are I/O constrained or processor constrained: how big is your dataset, MNIST or ImageNet?
• If using PySpark: all frameworks are heavily optimized for disk I/O; use Spark’s broadcast for small datasets that fit in memory (see the sketch below); when the data does not fit, read local files, since reading files is fast
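
A minimal sketch of the broadcast advice above, assuming NumPy is available on the workers; the toy arrays stand in for a small dataset (MNIST-sized, not ImageNet-sized).

    import numpy as np
    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("dl-ingest").getOrCreate().sparkContext

    # Toy stand-in for a small dataset that fits in driver memory.
    X = np.random.rand(1000, 784)
    y = np.random.randint(0, 10, size=1000)
    data_bc = sc.broadcast((X, y))  # shipped once per executor, then cached

    def evaluate(learning_rate):
        X_local, y_local = data_bc.value  # local in-memory read, no re-fetch
        # Train and score one model per hyperparameter here; we return a
        # placeholder score so the sketch is runnable end to end.
        return (learning_rate, float(X_local.mean()))

    print(sc.parallelize([0.001, 0.01, 0.1]).map(evaluate).collect())
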
21. Cooperative frameworks
• Use Spark for data input (a sketch of this pattern follows)
• Examples: IBM GPU efforts, Skymind’s DeepLearning4J, DistML and other parameter server efforts
[Diagram: the partitions of an input RDD feed a black-box framework, which produces a new RDD.]
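
A sketch of the data-input pattern in the diagram; `external_fit` is a dummy stand-in for a cooperative framework’s black-box entry point (the real call would be framework-specific).

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("coop-input").getOrCreate().sparkContext

    def external_fit(batch):
        # Dummy stand-in for a black-box framework call (e.g. a
        # parameter-server client); returns a fake "update" summary.
        return {"n_examples": len(batch)}

    def fit_partition(rows):
        batch = list(rows)           # materialize this partition locally
        yield external_fit(batch)    # hand the data to the external framework

    data = sc.parallelize(range(1000), 4)
    print(data.mapPartitions(fit_partition).collect())
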
22. Cooperative frameworks
• Bypass Spark for asynchronous / specialized communication patterns across machines
• You lose the benefits of RDDs and DataFrames, and of reproducibility/determinism
• But these guarantees are rarely required for deep learning anyway (stochastic gradient descent)
• “Reproducibility is worth a factor of 2” (Leon Bottou, quoted by John Langford)
23. Outline: Developer tips
24. The GPU software stack
• Deep learning is commonly used with GPUs
• A lot of work has gone into Spark’s dependencies: few dependencies on the local machine when compiling Spark, and the build process works well in a large number of configurations (just Scala + Maven)
• GPUs present challenges: CUDA, support libraries, drivers, etc.
• The software stack is deep and requires careful construction (hardware + drivers + CUDA + libraries), all of which the user is expected to provide
• Turnkey stacks are just starting to appear
25. The GPU software stack, bottom to top: GPU hardware → Linux kernel → NV kernel driver → CUDA and the NV kernel driver’s userspace interface → cuBLAS / cuDNN → deep learning libraries (TensorFlow, etc.) and JCUDA → Python / JVM clients. Recommendations:
• Provide a Docker image with the full GPU SDK (container: nvidia-docker, lxc, etc.)
• Pre-install GPU drivers on the instance
26. Using GPUs through PySpark
• A popular choice for many independent tasks
• Many DL packages have Python interfaces: TensorFlow, Theano, Caffe, MXNet, etc.
• Python packages live for the lifetime of the worker process (see the sketch below)
• Requires some configuration tweaks in Spark
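
A sketch of the process-lifetime point above, assuming TensorFlow 2.x is installed on the workers: importing the package inside the function means each long-lived Python worker initializes it (and any GPU context) once, then reuses it across tasks scheduled on that process.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("pyspark-dl").getOrCreate().sparkContext

    def score_partition(rows):
        # Imported here so each Python worker process loads the package
        # once; Python caches the import for the life of the process.
        import tensorflow as tf
        for row in rows:
            yield float(tf.reduce_sum(tf.constant(row)).numpy())

    data = sc.parallelize([[1.0, 2.0], [3.0, 4.0]], 2)
    print(data.mapPartitions(score_partition).collect())
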
27. PySpark recommendation
• Set spark.executor.cores = 1
• Gives the DL framework full access to all of the executor’s resources
• Important for frameworks that optimize processor pipelines (a configuration sketch follows)
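
A minimal configuration sketch of this recommendation; the application name is arbitrary, and the setting can equally be passed via spark-submit or cluster config, as long as it is in place before the SparkContext is created.

    from pyspark.sql import SparkSession

    # One task slot per executor: the DL framework gets exclusive use of
    # that executor's GPU and processor pipeline while its task runs.
    spark = (SparkSession.builder
             .appName("gpu-dl-job")
             .config("spark.executor.cores", "1")
             .getOrCreate())
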
28. Outline: Monitoring
29. Monitoring?
30. Monitoring
• How do you monitor the progress of your tasks?
• It depends on the granularity: around tasks, or inside (long-running) tasks
31. Monitoring: Accumulators
• Good for checking throughput or failure rate
• Works for Scala
• Limited use for Python (for now, SPARK-2868)
• No “real-time” updates

    batchesAcc = sc.accumulator(0)

    def processBatch(i):
        global batchesAcc
        batchesAcc += 1   # count processed batches; readable on the driver
        # Process the image batch here
        return i

    images = sc.parallelize(...)
    images.map(processBatch).collect()
32. Monitoring: external system
• Plugs into an external system
• Existing solutions: Grafana, Graphite, Prometheus, etc.
• Most flexible, but more complex to deploy (see the sketch below)
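
As one hedged example of the external-system approach, a long-running task can push counters to Graphite’s plaintext protocol; the host and metric names here are hypothetical and deployment-specific.

    import socket
    import time

    def report_metric(name, value, host="graphite.example.internal", port=2003):
        # Graphite's plaintext protocol: "<name> <value> <unix-timestamp>\n"
        line = "%s %f %d\n" % (name, value, int(time.time()))
        with socket.create_connection((host, port), timeout=5) as conn:
            conn.sendall(line.encode("ascii"))

    # Called periodically from inside a long-running training loop, e.g.:
    # report_metric("dl.job42.batches_done", batches_done)
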
33. Conclusion
• Distributed deep learning: an exciting and fast-moving space
• Most insights are specific to a task, a dataset, and an algorithm: nothing replaces experiments
• Get started with data-parallel jobs
• Move to cooperative frameworks only when your data is too large
34. Challenges to address
For Spark developers:
• Monitoring long-running tasks
• Presenting and introspecting intermediate results
For DL developers:
• What boundary to put between the algorithm and Spark?
• How to integrate with Spark at a low level?
35. Resources
Recent blog posts (http://databricks.com/blog):
• TensorFrames
• GPU acceleration
• Getting started with Deep Learning
• Intel’s BigDL
Docs for Deep Learning on Databricks (http://docs.databricks.com):
• Getting started
• Spark integration
36. SPARK SUMMIT 2017: Data Science and Engineering at Scale. June 5–7 | Moscone Center | San Francisco | spark-summit.org/2017
37. Thank You! Questions? Happy Sparking & Deep Learning!
