
Scaling TensorFlow with Hops, Global AI Conference Santa Clara



Scaling TensorFlow to 100s of GPUs with Hops. TensorFlow inferencing demonstrated on Hopsworks.

Published in: Technology

  1. Scaling TensorFlow to 100s of GPUs with Spark and Hops Hadoop. Global AI Conference, Santa Clara, January 18th 2018. Jim Dowling: Associate Prof @ KTH, Senior Researcher @ RISE SICS, CEO @ Logical Clocks AB.
  2. AI Hierarchy of Needs. Levels: DDL (Distributed Deep Learning); Deep Learning, RL, Automated ML; A/B Testing, Experimentation, ML; B.I. Analytics, Metrics, Aggregates, Features, Training/Test Data; Reliable Data Pipelines, ETL, Unstructured and Structured Data Storage, Real-Time Data Ingestion. [Adapted from ]
  3. AI Hierarchy of Needs. Same levels as slide 2, annotated with the labels Analytics and Prediction. [Adapted from ]
  4. AI Hierarchy of Needs. Same levels as slide 2, annotated with the label Hops. [Adapted from ]
  5. More Data means Better Predictions. [Chart: Prediction Performance vs. Amount of Labelled Data for Traditional AI and Deep Neural Nets; annotation: "Hand-crafted can outperform"; timeline: 1980s, 1990s, 2000s, 2010s, 2020s?]
  6. What about More Compute? “Methods that scale with computation are the future of AI” - Rich Sutton (A Founding Father of Reinforcement Learning). 2018-01-19
  7. More Compute should mean Faster Training. [Chart: Training Performance vs. Available Compute for Single-Host and Distributed training; timeline: 2015, 2016, 2017, 2018?]
  8. Reduce DNN Training Time from 2 weeks to 1 hour. In 2017, Facebook reduced training time on ImageNet for a CNN from 2 weeks to 1 hour by scaling out to 256 GPUs using Ring-AllReduce on Caffe2.
  9. DNN Training Time and Researcher Productivity. Distributed Deep Learning: interactive analysis, instant gratification. Single-Host Deep Learning: suffer from Google-Envy (“My Model’s Training.”).
  10. Distributed Training: Theory and Practice. Image from @hardmaru on Twitter.
  11. Distributed Algorithms are not all Created Equal. [Chart: Training Performance vs. Available Compute for Parameter Servers and AllReduce]
  12. Ring-AllReduce vs Parameter Server(s). [Diagram: in Ring-AllReduce, GPU 0 to GPU 3 each send to and receive from a neighbour; in the parameter-server design, GPU 1 to GPU 4 connect to the Param Server(s)] Network Bandwidth is the Bottleneck for Distributed Training.
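
The Ring-AllReduce pattern on slide 12 can be sketched in plain Python. This is a minimal single-process simulation, not Horovod or NCCL code; the function name `ring_allreduce` and the chunking scheme are assumptions made for illustration. Each worker sends one chunk to its right-hand neighbour per step, first summing chunks around the ring (scatter-reduce) and then circulating the finished sums (all-gather).

```python
def ring_allreduce(vectors):
    """Simulate a sum all-reduce over a ring of workers.

    `vectors` holds one equal-length list of numbers per worker.
    Returns one list per worker; every list is the element-wise sum.
    """
    n = len(vectors)
    if n == 0:
        return []
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all workers must hold vectors of the same length")

    # Split the index range into n contiguous chunks (some may be empty).
    bounds = [(i * length) // n for i in range(n + 1)]
    data = [list(v) for v in vectors]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    def exchange(step, offset, combine):
        # Collect every worker's outgoing chunk first, so that all sends
        # in a step happen "simultaneously", then deliver to the neighbour.
        sends = []
        for rank in range(n):
            c = (rank + offset - step) % n
            sends.append(((rank + 1) % n, c, [data[rank][i] for i in chunk(c)]))
        for dst, c, payload in sends:
            for i, value in zip(chunk(c), payload):
                data[dst][i] = combine(data[dst][i], value)

    # Phase 1 (scatter-reduce): after n-1 steps, worker r holds the
    # fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(step, 0, lambda mine, received: mine + received)

    # Phase 2 (all-gather): circulate the finished chunks, overwriting.
    for step in range(n - 1):
        exchange(step, 1, lambda mine, received: received)

    return data


if __name__ == "__main__":
    workers = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for rank, result in enumerate(ring_allreduce(workers)):
        print(rank, result)
```

In this simulation every worker sends only one chunk (about 1/N of the vector) per step, to a single neighbour, so no worker's link carries more traffic than any other's.
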
  13. AllReduce outperforms Parameter Servers. For Bigger Models, Parameter Servers don’t scale. * Benchmark setup: 16 servers with 4 P100 GPUs each (64 GPUs), connected by a RoCE-capable 25 Gbit/s network (synthetic data). Speed is reported in images processed per second.
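
One way to see why a parameter server can become a bandwidth bottleneck as models grow is to count bytes per training step. The sketch below is back-of-the-envelope arithmetic under stated assumptions, not a model of the benchmark on slide 13: it assumes synchronous training, a single unsharded parameter server that receives a full gradient from every worker and returns a full model to each, and a bandwidth-optimal ring in which each worker sends 2(N-1)/N times the model size. The function names and the 100 MB model size are hypothetical.

```python
def ring_bytes_sent_per_worker(num_workers: int, model_bytes: float) -> float:
    """Bytes each worker sends per step in a bandwidth-optimal ring all-reduce."""
    return 2 * (num_workers - 1) * model_bytes / num_workers


def single_ps_bytes_through_server(num_workers: int, model_bytes: float) -> float:
    """Bytes crossing one parameter server's link per step:
    one full gradient in and one full model out for every worker."""
    return 2 * num_workers * model_bytes


if __name__ == "__main__":
    workers = 64             # matches the 64 GPUs in the benchmark setup
    model_mb = 100.0         # hypothetical model size, in MB
    print("ring, per worker (MB):", ring_bytes_sent_per_worker(workers, model_mb))
    print("single PS link (MB):  ", single_ps_bytes_through_server(workers, model_mb))
```

Under these assumptions the ring's per-worker traffic stays below twice the model size however many workers join, while the single server's traffic grows linearly with the worker count. Sharding parameters across several servers spreads that load, which this sketch does not model.
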
  14. Multiple GPUs on a Single Server
  15. NVLink vs PCI-E Single Root Complex. On a single host (distributed training), the bus can be the bottleneck. NVLink: 80 GB/s. PCI-E: 16 GB/s. [Images from: ]
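
As a rough illustration of the bus figures on slide 15, the snippet below estimates how long it takes to move a gradient buffer at each quoted peak bandwidth. It is idealized arithmetic only: the 1 GB payload and the function name are assumptions, and real transfers see lower effective throughput than the peak numbers.

```python
def transfer_seconds(payload_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time at a given peak bandwidth (1 GB = 1e9 bytes)."""
    return payload_bytes / (bandwidth_gb_per_s * 1e9)


if __name__ == "__main__":
    payload = 1e9  # hypothetical 1 GB of gradients
    for name, bandwidth in [("NVLink", 80.0), ("PCI-E", 16.0)]:
        millis = transfer_seconds(payload, bandwidth) * 1000
        print(f"{name}: {millis:.1f} ms")
```

At these peak rates the same payload takes five times longer over PCI-E than over NVLink.
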
  16. Scale: Remove Bus and Net B/W Bottlenecks. Only one slow worker, bus, or network link is needed to bottleneck DNN training time. [Diagram: Ring-AllReduce]
  17. The Cloud is full of Bottlenecks… [Chart: Training Performance vs. Available Compute for Public Cloud (10 GbE) and InfiniBand On-Premise]
  18. Deep Learning Hierarchy of Scale. Levels: DDL AllReduce on GPU Servers; DDL with GPU Servers and Parameter Servers; Parallel Experiments on GPU Servers; Many GPUs on a Single GPU Server; Single GPU. Training Time for ImageNet (labels as shown on the slide): Days/Hours, Days, Weeks, Minutes, Hours.
  19. Lots of good GPUs > A few great GPUs. 100 x Nvidia 1080Ti (DeepLearning11) vs. 8 x Nvidia P/V100 (DGX-1). Both options (100 GPUs and 8 GPUs) cost the same: $150K (2017).
  20. Consumer GPU Server: $15K (10 x 1080Ti)
  21. Cluster of Commodity GPU Servers. InfiniBand. Max 1-2 GPU Servers per Rack (2-4 kW per server). #EUai8
  22. TensorFlow Spark Platforms: TensorFlow-on-Spark, Deep Learning Pipelines, Horovod, Hops
  23. Hops – Running Parallel Experiments. Launch TF jobs as Mappers in Spark. “Pure” TensorFlow code in the Executor.

          def model_fn(learning_rate, dropout):
              import tensorflow as tf
              from hops import tensorboard, hdfs, devices
              …..

          from hops import tflauncher
          args_dict = {'learning_rate': [0.001], 'dropout': [0.5]}
          tflauncher.launch(spark, model_fn, args_dict)

  24. Hops – Parallel Experiments. Launches 6 Executors, each with a different hyperparameter combination. Each Executor can have 1-N GPUs.

          def model_fn(learning_rate, dropout):
              …..

          from hops import tflauncher
          args_dict = {'learning_rate': [0.001, 0.005, 0.01], 'dropout': [0.5, 0.6]}
          tflauncher.launch(spark, model_fn, args_dict)

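
The count of 6 Executors on slide 24 follows from taking every combination of 3 learning rates and 2 dropout values. The sketch below shows one way such a grid could be expanded; `expand_grid` is a hypothetical helper written for illustration and is not part of the `hops` library, whose internal implementation is not shown in the slides.

```python
from itertools import product


def expand_grid(args_dict):
    """Return one keyword-argument dict per hyperparameter combination."""
    names = list(args_dict)
    value_lists = [args_dict[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


if __name__ == "__main__":
    args_dict = {'learning_rate': [0.001, 0.005, 0.01], 'dropout': [0.5, 0.6]}
    combos = expand_grid(args_dict)
    print(len(combos), "combinations")
    for kwargs in combos:
        # Each dict could be passed to a training function as model_fn(**kwargs).
        print(kwargs)
```
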
  25. Hops AllReduce/Horovod/TensorFlow. “Pure” TensorFlow code.

          import horovod.tensorflow as hvd

          def conv_model(feature, target, mode):
              …..

          def main(_):
              hvd.init()
              opt = hvd.DistributedOptimizer(opt)
              if hvd.local_rank()==0:
                  hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
                  …..
              else:
                  hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
                  …..

          from hops import allreduce
          allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')

  26. TensorFlow and Hops Hadoop
  27. Don’t do this: Different Clusters for Big Data and ML
  28. Hops: Single ML and Big Data Cluster. [Diagram: IT; DataLake; GPUs; Compute; Kafka; Elasticsearch; Data Engineering; Data Science; Project1, ProjectN]
  29. HopsFS: Next Generation HDFS*. Faster: 16x Throughput. Bigger: 37x Number of files. * Scale Challenge Winner (2017)
  30. HopsFS now stores Small Files in the DB. Size Matters: Improving the Performance of Small Files in HDFS. Salman Niazi, Seif Haridi, Jim Dowling. Poster, EuroSys 2017.
  31. GPUs supported as a Resource in Hops 2.8.2*. Hops is the only Hadoop distribution to support GPUs-as-a-Resource. *Robin Andersson, GPU Integration for Deep Learning on YARN, MSc Thesis, 2017
  32. GPU Resource Requests in Hops. Example requests to HopsYARN: 4 GPUs on any host; 10 GPUs on 1 host; 100 GPUs on 10 hosts with ‘Infiniband’; 20 GPUs on 2 hosts with ‘Infiniband_P100’. A mix of commodity GPUs and more powerful GPUs is good for (1) parallel experiments and (2) distributed training. [Diagram components: HopsYARN, HopsFS]
  33. Hopsworks Data Platform: Develop, Train, Test, Serve. [Diagram: MySQL Cluster, Hive, InfluxDB, ElasticSearch, Kafka; Projects, Datasets, Users; HopsFS / YARN; Spark, Flink, TensorFlow; Jupyter, Zeppelin; Jobs, Kibana, Grafana; REST API; Hopsworks]
  34. Python is a First-Class Citizen in Hopsworks
  35. Custom Python Environments with Conda. Python libraries are usable by Spark/TensorFlow.
  36. What is Hopsworks used for?
  37. ETL Workloads. [Diagram: Hopsworks Jobs trigger; Hive; Parquet; HopsFS; YARN; Public Cloud or On-Premise]
  38. Business Intelligence Workloads. [Diagram: Jupyter/Zeppelin or Jobs; Kibana reports; Zeppelin; Hive; Parquet; HopsFS; YARN; Public Cloud or On-Premise]
  39. Streaming Analytics in Hopsworks. [Diagram: Data Src; Kafka; MySQL; Batch Analytics; Hive; Parquet; Grafana/InfluxDB; Elastic/Kibana; HopsFS; YARN; Public Cloud or On-Premise]
  40. TensorFlow in Hopsworks. [Diagram: Experiments; Kafka; Hive; FeatureStore; TensorFlow Serving; TensorBoard; HopsFS; YARN; Public Cloud or On-Premise]
  41. One Click Deployment of TensorFlow Models
  42. Hops API. Java/Scala library: secure streaming analytics with Kafka/Spark/Flink (SSL/TLS certs, Avro Schema, endpoints for Kafka/Hopsworks/etc). Python library: managing TensorBoard; loading/saving models in HopsFS; distributed TensorFlow in Python; parameter sweeps for parallel experiments.
  43. TensorFlow-as-a-Service in RISE SICS ICE. Hops: Spark/Flink/Kafka/TensorFlow/Hadoop-as-a-service. RISE SICS ICE: 250 kW datacenter, ~400 servers; research and test environment.
  44. Summary. Distribution can make Deep Learning practitioners more productive. Hopsworks is a new Data Platform built on HopsFS with first-class support for Python and Deep Learning / ML (TensorFlow / Spark).
  45. The Team. Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed. Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, ArunaKumari Yedurupaka, Tobias Johansson, Roberto Bampi. @hopshadoop
  46. Thank You. Follow us: @hopshadoop Star us: Join us: