
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters


In recent releases, TensorFlow has been enhanced for distributed learning and HDFS access. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. Several community projects wire TensorFlow onto Apache Spark clusters. Unfortunately, they support only synchronous distributed learning and don’t allow TensorFlow servers to communicate with each other directly.

In this talk, we will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, which will be open-sourced in Q1 2017. The framework enables easy experimentation in algorithm design and supports scalable training & inference on Spark clusters. It supports all TensorFlow functionality, including synchronous & asynchronous learning, model & data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion into TensorFlow and for the network protocols used in server-to-server communication. With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application.



  1. TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters. Lee Yang, Andrew Feng. Yahoo Big Data ML Platform Team
  2. What is TensorFlowOnSpark?
  3. Why TensorFlowOnSpark at Yahoo? ▪ Major contributor to open-source Hadoop ecosystem • Originators of Hadoop (2006) • An early adopter of Spark (since 2013) • Open-sourced CaffeOnSpark (2016) • CaffeOnSpark Update: Recent Enhancements and Use Cases • Wednesday @ 12:20pm by Mridul Jain & Jun Shi ▪ Large investment in production clusters • Tens of clusters • Thousands of nodes per cluster ▪ Massive amounts of data • Petabytes of data
  4. Private ML Clusters: machine learning at scale
  5. Why TensorFlowOnSpark? Machine learning at scale
  6. Scaling: near-linear scaling
  7. RDMA Speedup over gRPC: 2.4X faster
  8. RDMA Speedup over gRPC
  9. RDMA Speedup over gRPC
  10. TensorFlowOnSpark Design Goals ▪ Scale up existing TF apps with minimal changes ▪ Support all current TensorFlow functionality • Synchronous/asynchronous training • Model/data parallelism • TensorBoard ▪ Integrate with existing HDFS data pipelines and ML algorithms • e.g. Hive, Spark, MLlib
  11. TensorFlowOnSpark ▪ PySpark wrapper of TF app code ▪ Launches distributed TF clusters using Spark executors ▪ Supports TF data ingestion modes • feed_dict – RDD.mapPartitions() • queue_runner – direct HDFS access from TF ▪ Supports TensorBoard during/after training ▪ Generally agnostic to Spark/TF versions
  12. Supported Environments ▪ Python 2.7 - 3.x ▪ Spark 1.6 - 2.x ▪ TensorFlow 0.12, 1.x ▪ Hadoop 2.x
  13. Architectural Overview
  14. TensorFlowOnSpark Basics 1. Launch TensorFlow cluster 2. Feed data to TensorFlow app 3. Shutdown TensorFlow cluster
  15. API Example
     cluster =, map_fun, args, num_executors, num_ps, tensorboard, input_mode)
     cluster.train(dataRDD, num_epochs=0)
     cluster.inference(dataRDD)
     cluster.shutdown()
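The map_fun handed to is an ordinary Python function that TFoS runs once per executor, passing it the app's arguments and a context object describing the node's role. A minimal sketch of its shape (the ctx attributes job_name and task_index match the TFoS context object; the body here is illustrative only, not a real training loop):

```python
def map_fun(argv, ctx):
    # ctx describes this executor's role in the TF cluster:
    # ctx.job_name is "ps" or "worker"; ctx.task_index is its index.
    if ctx.job_name == "ps":
        # parameter servers block here, serving variables to workers
        return "ps:%d serving" % ctx.task_index
    # workers build the graph and run training/inference steps
    return "worker:%d training" % ctx.task_index
```

In a real app the body would call TFNode.start_cluster_server(ctx) and then run the usual distributed TensorFlow code, as the conversion example above shows.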
  16. Conversion Example
     # diff -w
     20a21,27
     > from pyspark.context import SparkContext
     > from pyspark.conf import SparkConf
     > from tensorflowonspark import TFCluster, TFNode
     > import sys
     >
     > def main_fun(argv, ctx):
     27a35,36
     > sys.argv = argv
     >
     84,85d92
     <
     < def main(_):
     88a96,97
     > cluster_spec, server = TFNode.start_cluster_server(ctx)
     >
     191c200,204
     <
     ---
     > sc = SparkContext(conf=SparkConf().setAppName("eval_image_classifier"))
     > num_executors = int(sc._conf.get("spark.executor.instances"))
     > cluster =, main_fun, sys.argv, num_executors, 0, False, TFCluster.InputMode.TENSORFLOW)
     > cluster.shutdown()
  17. Input Modes ▪ InputMode.SPARK: HDFS → RDD.mapPartitions → feed_dict ▪ InputMode.TENSORFLOW: TFReader + QueueRunner ← HDFS
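In InputMode.SPARK, each RDD partition is pulled through a feeding function that groups records into feed_dict-sized batches for the TF worker on that executor. A stdlib-only sketch of that grouping step (function name and batch size are illustrative; in TFoS the actual hand-off goes through TFNode.DataFeed):

```python
def to_batches(partition, batch_size):
    """Group one partition's records into lists of at most batch_size,
    the granularity at which a TF worker consumes them via feed_dict."""
    batch = []
    for record in partition:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final short batch at partition end
        yield batch
```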
  18. [Diagram: InputMode.SPARK — RDDs feed per-executor queues; Python workers in each executor host ps:0, worker:0, worker:1]
  19. [Diagram: InputMode.TENSORFLOW — Python workers (ps:0, worker:0, worker:1) read directly from HDFS]
  20. Failure Recovery ▪ TF checkpoints written to HDFS ▪ InputMode.SPARK • TF worker runs in background • RDD data feeding tasks can be retried • However, TF worker failures will be “hidden” from Spark ▪ InputMode.TENSORFLOW • TF worker runs in foreground • TF worker failures will be retried as Spark task • TF worker restores from checkpoint
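The InputMode.TENSORFLOW recovery path above amounts to a retry loop: when a worker attempt dies, the next attempt resumes from the last HDFS checkpoint instead of step 0. A purely illustrative sketch (all names are hypothetical; in TFoS the retry is performed by Spark's task scheduler and the restore by TensorFlow's checkpointing, e.g. tf.train.Saver):

```python
def train_with_recovery(train_step, load_ckpt, save_ckpt, total_steps, max_attempts=3):
    """Run total_steps of training, resuming from the last
    checkpointed step after each failed attempt."""
    for _ in range(max_attempts):
        step = load_ckpt()            # 0 if no checkpoint exists yet
        try:
            while step < total_steps:
                train_step(step)      # may raise on worker failure
                step += 1
                save_ckpt(step)       # checkpoint progress (to HDFS in TFoS)
            return step
        except RuntimeError:
            continue                  # Spark would relaunch the task here
    raise RuntimeError("exceeded max attempts")
```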
  21. Failure Recovery ▪ Executor failures are problematic • e.g. pre-emption • TF cluster_spec is statically defined at startup • YARN does not re-allocate on same node • Even if possible, port may no longer be available ▪ Need dynamic cluster membership • Exploring options w/ TensorFlow team
  22.–24. TensorBoard (screenshots)
  25. TensorFlow App Development — Experimentation Phase • Single-node • Small-scale data • TensorFlow APIs: tf.Graph, tf.Session, tf.InteractiveSession
  26. TensorFlow App Development — Scaling Phase • Multi-node • Medium-scale data (local disk) • Distributed TensorFlow APIs: tf.train.ClusterSpec, tf.train.Server, tf.train.Saver, tf.train.Supervisor
  27. TensorFlow App Development — Production Phase • Cluster deployment • Upstream data pipeline • Model training w/ TensorFlowOnSpark APIs:, TFNode.start_cluster_server, TFCluster.shutdown • Production inference w/ TensorFlow Serving
  28. Example Usage
  29. Common Gotchas ▪ Single node/task per executor ▪ HDFS access (native libs/env) ▪ Why doesn’t algorithm X scale linearly?
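The first gotcha — TFoS expects exactly one TF node per executor — usually comes down to Spark configuration: each executor should expose a single task slot so that two TF nodes never land on the same executor. A small illustrative check (a plain dict stands in for SparkConf; the conf keys are standard Spark ones, but the helper itself is hypothetical):

```python
def one_task_per_executor(conf):
    """True if each executor offers exactly one task slot,
    i.e. spark.executor.cores == spark.task.cpus."""
    cores = int(conf.get("spark.executor.cores", "1"))
    task_cpus = int(conf.get("spark.task.cpus", "1"))
    return cores == task_cpus
```

With e.g. `--conf spark.executor.cores=1` (or cores matched by spark.task.cpus), Spark schedules one task per executor, which is the layout TFoS assumes.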
  30. What’s New? ▪ Community contributions • CDH compatibility • TFNode.DataFeed • Bug fixes ▪ RDMA merged into TensorFlow repository ▪ Registration server ▪ Spark streaming ▪ Pip packaging
  31. Spark Streaming (changes from the batch version)
     from pyspark.streaming import StreamingContext
     ssc = StreamingContext(sc, 10)
     # batch: images = sc.textFile(args.images).map(lambda ln: parse(ln))
     stream = ssc.textFileStream(args.images)
     imageRDD = ln: parse(ln))
     cluster =, map_fun, args, …)
     predictionRDD = cluster.inference(imageRDD)
     # batch: predictionRDD.saveAsTextFile(args.output)
     predictionRDD.saveAsTextFiles(args.output)
     ssc.start()
     cluster.shutdown(ssc)
  32. Pip Packaging
     pip install tensorflowonspark

     ${SPARK_HOME}/bin/spark-submit \
       --master ${MASTER} \
       --py-files ${TFoS_HOME}/examples/mnist/spark/ \
       --archives ${TFoS_HOME}/ \
       ${TFoS_HOME}/examples/mnist/spark/ \
       --cluster_size ${SPARK_WORKER_INSTANCES} \
       --images examples/mnist/csv/train/images \
       --labels examples/mnist/csv/train/labels \
       --format csv \
       --mode train \
       --model mnist_model
  33. Demo
  34. Next Steps ▪ TF/Keras Layers ▪ Failure recovery w/ dynamic cluster management (e.g. registration server)
  35. Summary TFoS brings deep learning to big-data clusters • TensorFlow: 0.12 - 1.x • Spark: 1.6 - 2.x • Cluster managers: YARN, Standalone, Mesos • EC2 image provided • RDMA in TensorFlow
  36. Thanks! And our open-source contributors!
  37. Questions?