Wait! Exclusive 60 day trial to the world's largest digital library.
The SlideShare family just got bigger. You now have unlimited* access to books, audiobooks, magazines, and more from Scribd.Cancel anytime.
Methods that scale with available computation are the future of AI. Distributed deep learning is one such method that enables data scientists to massively increase their productivity by (1) running parallel experiments over many devices (GPUs/TPUs/servers) and (2) massively reducing training time by distributing the training of a single network over many devices. Apache Spark is a key enabling platform for distributed deep learning, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end pipeline. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows to build distributed deep learning applications.
We will analyse the different frameworks for integrating Spark with Tensorflow, from Horovod to TensorflowOnSpark to Databrick’s Deep Learning Pipelines. We will also look at where you will find the bottlenecks when training models (in your frameworks, the network, GPUs, and with your data scientists) and how to get around them. We will look at how to use Spark Estimator model to perform hyper-parameter optimization with Spark/TensorFlow and model-architecture search, where Spark executors perform experiments in parallel to automatically find good model architectures.
The talk will include a live demonstration of training and inference for a Tensorflow application embedded in a Spark pipeline written in a Jupyter notebook on the Hops platform. We will show how to debug the application using both Spark UI and Tensorboard, and how to examine logs and monitor training. The demo will be run on the Hops platform, currently used by over 450 researchers and students in Sweden, as well as at companies such as Scania and Ericsson.