
Distributed TensorFlow on Hops (Papis London, April 2018)

Distributed Deep Learning with TensorFlow on Hops

Published in: Technology

  1. Techniques for Distributed TensorFlow on Hops. Jim Dowling (jim_dowling): CEO, Logical Clocks AB; Assoc Prof, KTH Stockholm; Senior Researcher, RISE SICS. Europe 2018.
  2. ©2018 Logical Clocks AB. All Rights Reserved. AI Hierarchy of Needs. [Pyramid levels: DDL (Distributed Deep Learning); Deep Learning, RL, Automated ML; A/B Testing, Experimentation, ML; B.I. Analytics, Metrics, Aggregates, Features, Training/Test Data; Reliable Data Pipelines, ETL, Unstructured and Structured Data Storage, Real-Time Data Ingestion. Roles: Data Engineers, Data Scientists, Data Science-Engineers.]
  3. Distributed Deep Learning is Important. How to improve the state of the art in deep learning?* “We have three main lines of attack: 1. We can search for improved model architectures. 2. We can scale computation. 3. We can create larger training data sets.” *https://blog.acolyer.org/2018/03/28/deep-learning-scaling-is-predictable-empirically/
  4. More Data means Better Predictions. [Plot: Prediction Performance vs Amount Labelled Data, with curves for Traditional ML and Deep Neural Nets; annotation: Hand-crafted can outperform.]
  5. How much Labelled Data do we Need?* *https://arxiv.org/pdf/1712.00409.pdf
  6. More Data and Prediction Improvement* [Plot: Generalization Error.] *https://blog.acolyer.org/2018/03/28/deep-learning-scaling-is-predictable-empirically/
  7. Better Model and Prediction Improvement* [Plot: Generalization Error.] *https://blog.acolyer.org/2018/03/28/deep-learning-scaling-is-predictable-empirically/
  8. Get Better Models with More Compute. “Methods that scale with computation are the future of AI”* - Rich Sutton (A Founding Father of Reinforcement Learning). *https://www.youtube.com/watch?v=EeMCEQa85tw
  9. Parallel Experiments to Find Better Models. • Model Architecture Search*: explore on smaller datasets, then scale to larger datasets, which enables more searches. • SOTA on CIFAR10 (2.13% top 1), SOTA on ImageNet (3.8% top 5): - 450 GPU / 7 days - 900 TPU / 5 days. *https://arxiv.org/abs/1802.01548
  10. Parallel Experiments (1/4). The Outer Loop (hyperparameters): “I have to run a hundred experiments to find the best model,” he complained, as he showed me his Jupyter notebooks. “That takes time. Every experiment takes a lot of programming, because there are so many different parameters.” [Rants of a Data Scientist] [Diagram labels: Time, Hops.]
  11. Need for a Distributed Filesystem. [Diagram: Driver, Experiment 1 to Experiment N, Distributed FS, TensorBoard.] Training/test data, evaluation results, experiment configurations, etc.
  12. More on Why we need a Distributed Filesystem. • Datasets are getting larger • Model checkpointing • Model-architecture search • Hyperparameter search • Hierarchical Filesystems (fast): - HDFS / HopsFS - Ceph, GlusterFS • Object Stores (slow): - S3, GCS, WFS. PLUG for HopsFS* *http://www.logicalclocks.com/fixing-the-small-files-problem-in-hdfs/
  13. What about Distributed Training?
  14. More Compute should mean Faster Training. [Plot: Training Performance vs Available Compute, with curves for Single-Host and Distributed.]
  15. Distributed Training (2/4). The Inner Loop (training): “All these experiments took a lot of computation - we used hundreds of GPUs/TPUs for days. Much like a single modern computer can outperform thousands of decades-old machines, we hope that in the future these experiments will become household.” [Google SoTA ImageNet, Cifar-10, March18] [Diagram labels: Weeks, Time, Mins, Hops.]
  16. Reduce DNN Training Time. In 2017, Facebook reduced training time on ImageNet for a CNN from 2 weeks to 1 hour by scaling out to 256 GPUs using Ring-AllReduce on Caffe2. https://arxiv.org/abs/1706.02677
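
The paper cited on slide 16 (arXiv:1706.02677) attributes much of that speed-up to a linear learning-rate scaling rule combined with a gradual warmup. The following is only a rough sketch of that idea, not code from the talk or the paper: the function names are hypothetical, the warmup is applied per epoch for simplicity, and the later learning-rate decay schedule is left out.

```python
def scaled_lr(global_batch, base_lr=0.1, base_batch=256):
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * global_batch / base_batch


def warmup_lr(epoch, global_batch, warmup_epochs=5, base_lr=0.1, base_batch=256):
    """Ramp linearly from base_lr to the scaled rate over the warmup epochs."""
    target = scaled_lr(global_batch, base_lr, base_batch)
    if epoch >= warmup_epochs:
        return target
    return base_lr + (target - base_lr) * epoch / warmup_epochs


# 256 GPUs with 32 images per GPU give a global minibatch of 8192 images.
global_batch = 256 * 32
schedule = [warmup_lr(epoch, global_batch) for epoch in range(8)]
```

With these assumptions the rate starts at the single-node value and reaches 32 times that value once the warmup ends.
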
  17. Distributed Training: Theory and Practice. Image from @hardmaru on Twitter.
  18. Asynchronous vs Synchronous SGD. • Synchronous Stochastic Gradient Descent (SGD) is now dominant. “Revisiting Distributed Synchronous SGD”, Chen et al, ICLR 2016. https://research.google.com/pubs/pub45187.html
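
To make the distinction on slide 18 concrete, here is a minimal sketch on a toy one-parameter least-squares problem. It assumes plain data-parallel SGD; the names `grad`, `sync_step` and `async_step` are hypothetical, and the asynchronous variant is a deliberately crude simulation in which every worker computed its gradient from the same stale weight.

```python
# Toy model: minimise 0.5 * (w * x - y)**2, with one data shard per worker.

def grad(w, shard):
    """Mean gradient of 0.5 * (w*x - y)**2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def sync_step(w, shards, lr):
    """Synchronous SGD: all workers read the same w and gradients are averaged."""
    grads = [grad(w, shard) for shard in shards]
    return w - lr * sum(grads) / len(grads)


def async_step(w, shards, lr):
    """Simulated asynchronous SGD: gradients computed from a stale w are
    applied one after another, without averaging."""
    stale_w = w
    for shard in shards:
        w = w - lr * grad(stale_w, shard)
    return w


# Data generated from y = 3 * x, split evenly across 4 workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w_sync = 0.0
for _ in range(200):
    w_sync = sync_step(w_sync, shards, lr=0.01)
```

With equally sized shards, one synchronous step equals one full-batch gradient step, whereas the simulated asynchronous step behaves like a step with a learning rate four times larger.
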
  19. Synchronous Distributed SGD Algorithms not all Equal. [Plot: Training Performance vs Available Compute, with curves for Parameter Servers and AllReduce.]
  20. Ring-AllReduce vs Parameter Server. [Diagram: a ring of GPU 0, GPU 1, GPU 2, GPU 3, each with a send and a recv link; Param Server(s) connected to GPU 1, GPU 2, GPU 3, GPU 4.] Network Bandwidth is the Bottleneck for Distributed Training.
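
The ring pattern on slide 20 can be simulated in a few lines of pure Python. This is a sketch of the textbook ring all-reduce (a scatter-reduce phase followed by an all-gather phase), not the Horovod or Caffe2 implementation; `ring_allreduce` is a hypothetical name, and the sketch assumes the vector length is divisible by the number of workers.

```python
def ring_allreduce(vectors):
    """Sum equal-length vectors held by n workers over a simulated ring.

    vectors[i] is worker i's local gradient. Returns the per-worker results
    (each equal to the element-wise sum) and the number of values each
    worker sent.
    """
    n = len(vectors)
    size = len(vectors[0])
    assert size % n == 0, "sketch assumes the vector splits into n equal chunks"
    chunk = size // n
    # bufs[i][c] is worker i's copy of chunk c.
    bufs = [[list(v[c * chunk:(c + 1) * chunk]) for c in range(n)] for v in vectors]
    sent = 0

    # Phase 1, scatter-reduce: after n-1 steps worker i holds the complete
    # sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step happen together.
        outgoing = [(i, (i - step) % n, list(bufs[i][(i - step) % n]))
                    for i in range(n)]
        for src, c, payload in outgoing:
            dst = (src + 1) % n
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], payload)]
        sent += chunk

    # Phase 2, all-gather: pass the completed chunks around the ring.
    for step in range(n - 1):
        outgoing = [(i, (i + 1 - step) % n, list(bufs[i][(i + 1 - step) % n]))
                    for i in range(n)]
        for src, c, payload in outgoing:
            bufs[(src + 1) % n][c] = payload
        sent += chunk

    results = [[x for c in range(n) for x in bufs[i][c]] for i in range(n)]
    return results, sent


# Worker w holds eight copies of the value w + 1.
grads = [[float(w + 1)] * 8 for w in range(4)]
reduced, sent_per_worker = ring_allreduce(grads)
```

Each worker sends 2(N-1) chunks of size 1/N of the vector, so the traffic per worker stays below twice the vector size however many workers join the ring.
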
  21. AllReduce outperforms Parameter Servers. 16 servers with 4 P100 GPUs each (64 GPUs in total), connected by a RoCE-capable 25 Gbit/s network (synthetic data). [Chart: speed in images processed per second.*] For Bigger Models, Parameter Servers don’t scale. *https://github.com/uber/horovod
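
One way to see why the gap on slide 21 grows with model size is to count bytes per training step. The sketch below assumes a single, unsharded parameter server and the standard ring all-reduce volume of 2(N-1)/N times the gradient size per worker; the function names and the 100 MB gradient size are illustrative assumptions, not figures from the benchmark.

```python
def ring_bytes_per_worker(model_bytes, n_workers):
    """Bytes each worker sends per step with ring all-reduce."""
    return 2 * (n_workers - 1) * model_bytes / n_workers


def ps_bytes_at_server(model_bytes, n_workers):
    """Bytes a single parameter server receives per step (one gradient per
    worker); it sends the same amount back out as updated weights."""
    return n_workers * model_bytes


def seconds_on_link(n_bytes, gbit_per_s=25.0):
    """Transfer time over one link of the given bandwidth, ignoring latency."""
    return n_bytes * 8 / (gbit_per_s * 1e9)


model = 100e6                             # hypothetical 100 MB of gradients
ring = ring_bytes_per_worker(model, 64)   # just under 2x the gradient size
ps = ps_bytes_at_server(model, 64)        # 64x the gradient size on one link
```

Under these assumptions the busiest link carries roughly 0.06 s of traffic per step with the ring and roughly 2 s with a single parameter server. Sharding the parameter server narrows the gap but does not remove the dependence on the number of workers.
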
  22. ML in Production: Machine Learning Pipelines
  23. A Machine Learning Pipeline with TensorFlow. [Diagram. Stages: Data Collection, Experimentation, Training, Serving, Feature Extraction, Data Transformation & Verification, Test. Components: Spark, TensorFlow, TfServing, Kubernetes, Distributed FS, Message Queue, Resource Manager with GPU support. Roles: Data Engineering, Data Science, Ops.]
  24. Hops Small Data ML Pipeline. [Diagram. Platform: Hops (Kafka/HopsFS/Spark/TensorFlow/Kubernetes). Stages: Data Collection, Experimentation, Training, Serving, Feature Extraction, Data Transformation & Verification, Test. Components: TensorFlow, TfServing. Users: Project Teams (Data Engineers/Scientists).]
  25. Hops Big Data ML Pipeline. [Diagram. Platform: Hops (Kafka/HopsFS/Spark/TensorFlow/Kubernetes). Stages: Data Collection, Experimentation, Training, Serving, Feature Extraction, Data Transformation & Verification, Test. Components: PySpark, TensorFlow, TfServing. Users: Project Teams (Data Engineers/Scientists).]
  26. Why not Kubeflow? • Operational Reasons: - No Integrated Enterprise Security Framework (Encryption-in-Transit, Encryption-at-Rest) - Stateful services not designed for Kubernetes (Distributed Storage, Kafka, Databases) • Usability Reasons: - Not a Fully Managed Platform (write YML files and restart just to install a new Python library) - Slow startup times for applications/notebooks
  27. Machine Learning Pipelines in Code
  28. Small Data Preparation with tf.data API

          import tensorflow as tf

          def input_fn(batch_size):
              # IMAGES_DIR and parser_fn are defined elsewhere in the notebook.
              files = tf.data.Dataset.list_files(IMAGES_DIR)

              def tfrecord_dataset(filename):
                  return tf.data.TFRecordDataset(filename,
                                                 num_parallel_reads=32,
                                                 buffer_size=8*1024*1024)

              # In TensorFlow 1.x these two transformations live in tf.contrib.data.
              # Read 32 files concurrently; sloppy=True relaxes output ordering.
              dataset = files.apply(tf.contrib.data.parallel_interleave(
                  tfrecord_dataset, cycle_length=32, sloppy=True))
              dataset = dataset.apply(tf.contrib.data.map_and_batch(
                  parser_fn, batch_size, num_parallel_batches=4))
              dataset = dataset.prefetch(4)
              return dataset

      [Pipeline stage banner: Feature Extraction, Experimentation, Training, Test + Serve, Data Acquisition, Clean/Transform Data.]
  29. Big Data Preparation with PySpark

          from mmlspark import ImageTransformer

          # Load a 10% sample of the images under IMAGE_PATH and cache them.
          images = spark.readImages(IMAGE_PATH, recursive=True,
                                    sampleRatio=0.1).cache()

          # Resize to 200x200, then crop a 180x180 region starting at (0, 0).
          tr = (ImageTransformer()
                .setOutputCol("transformed")
                .resize(height=200, width=200)
                .crop(0, 0, height=180, width=180))
          smallImages = tr.transform(images).select("transformed")

      [Pipeline stage banner: Feature Extraction, Experimentation, Training, Test + Serve, Data Acquisition, Clean/Transform Data.]
  30. Hyperparam Opt. with Tf/Spark on Hops. Launch TF jobs in Spark Executors.

          def model_fn(learning_rate, dropout):
              import tensorflow as tf
              from hops import tensorboard, hdfs, devices
              # [TensorFlow Code here]

          from hops import experiment
          args_dict = {'learning_rate': [0.001, 0.005, 0.01],
                       'dropout': [0.5, 0.6]}
          experiment.launch(spark, model_fn, args_dict)

      [Pipeline stage banner: Feature Extraction, Experimentation, Training, Test + Serve, Data Acquisition, Clean/Transform Data.]
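
As a rough illustration of the grid search behind slide 30, the sketch below expands an `args_dict` into one set of keyword arguments per combination and evaluates each in turn. It assumes the launcher takes the Cartesian product of the listed values, which the slide does not state; `grid` is a hypothetical helper, and `model_fn` here is a stand-in that returns a made-up score instead of training a model.

```python
import itertools


def grid(args_dict):
    """Expand {'name': [values, ...]} into one kwargs dict per combination."""
    names = sorted(args_dict)
    for values in itertools.product(*(args_dict[name] for name in names)):
        yield dict(zip(names, values))


def model_fn(learning_rate, dropout):
    """Stand-in for a training function; returns a fake validation score."""
    return 1.0 - abs(learning_rate - 0.005) * 10 - abs(dropout - 0.5)


args_dict = {'learning_rate': [0.001, 0.005, 0.01], 'dropout': [0.5, 0.6]}
trials = list(grid(args_dict))

# Run sequentially here; on a cluster each trial could go to its own executor.
best = max(trials, key=lambda kwargs: model_fn(**kwargs))
```

Because the trials are independent, they parallelise trivially, which is what makes the outer loop a natural fit for Spark executors.
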
  31. HyperParam Opt. Visualization on TensorBoard. [Screenshot: Hyperparam Opt Results Visualization.]
  32. Distributed Training with Horovod on Hops. “Pure” TensorFlow code:

          import horovod.tensorflow as hvd

          def conv_model(feature, target, mode):
              ...  # model body elided on the slide

          hvd.init()
          opt = hvd.DistributedOptimizer(opt)
          if hvd.local_rank() == 0:
              hooks = [hvd.BroadcastGlobalVariablesHook(0)]  # further hooks elided on the slide
              ...
          else:
              hooks = [hvd.BroadcastGlobalVariablesHook(0)]  # further hooks elided on the slide
              ...

          from hops import allreduce
          allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')

      [Pipeline stage banner: Feature Extraction, Experimentation, Training, Test + Serve, Data Acquisition, Clean/Transform Data.]
  33. Hops API. • Python (also Java/Scala): - Manage TensorBoard, load/save models in HDFS - Horovod, TensorFlowOnSpark - Parallel experiments (Gridsearch; Model Architecture Search with Genetic Algorithms) - Secure Streaming Analytics with Kafka/Spark/Flink (SSL/TLS certs, Avro Schema, Endpoints for Kafka/Zookeeper/etc). [Pipeline stage banner: Feature Extraction, Experimentation, Training, Test + Serve, Data Acquisition, Clean/Transform Data.]
  34. TensorFlow Model Serving. [Pipeline stage banner: Feature Extraction, Experimentation, Training, Test + Serve, Data Acquisition, Clean/Transform Data.]
  35. Hops Data Platform
  36. Hops: Next Generation Hadoop*. Faster: 16x Throughput. Bigger: 37x Number of files. Scale Challenge Winner (2017). GPUs in YARN. *https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
  37. Project Model for Sensitive Data/GDPR. [Diagram labels: Engineering, Kafka Topic, Project-X, Project-42, Shared DB, Topic, Project-All, CompanyDB, FX Project, FX Topic, FX DB, FX Data Stream, Shared Analytics, FX team.] Ismail et al, Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata, ICDCS 2017.
  38. Hopsworks Data Platform
  39. A Project is a Grouping of Users and Data. [Diagram labels: Proj-42, Proj-X, Proj-All, Projects sandbox, Private Data, Shared Topic, Topic, /Projs/My/Data, CompanyDB.] Ismail et al, Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata, ICDCS 2017.
  40. How are Projects used? [Diagram labels: Engineering, Kafka Topic, FX Project, FX Topic, FX DB, FX Data Stream, Shared Interactive Analytics, FX team.]
  41. Python in the Cluster: Per-Project Conda Envs. Python libraries are usable by Spark/TensorFlow.
  42. TensorFlow in Hopsworks. [Diagram components: HopsFS, YARN, FeatureStore, TensorFlow Serving, TensorBoard, Experiments, Kafka, Hive. Deployment: Public Cloud or On-Premise.]
  44. Summary. • The future of Deep Learning is Distributed: https://www.oreilly.com/ideas/distributed-tensorflow • Hops is a new Data Platform with first-class support for Python / Deep Learning / ML / Data Governance / GPUs. “It is starting to look like deep learning workflows of the future feature autotuned architectures running with autotuned compute schedules across arbitrary backends.”* Andrej Karpathy - Head of AI @ Tesla. *https://twitter.com/karpathy/status/972701240017633281
  45. The Team. Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed. Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, ArunaKumari Yedurupaka, Tobias Johansson, Roberto Bampi, Roshan Sedar. www.hops.io @hopshadoop
  46. Spark Scikit-learn integration

          from sklearn import svm, datasets
          from spark_sklearn import GridSearchCV

          iris = datasets.load_iris()
          parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
          svr = svm.SVC()
          # sc is the SparkContext; candidate fits are distributed across the cluster.
          clf = GridSearchCV(sc, svr, parameters)
          clf.fit(iris.data, iris.target)
