
End-to-End Platform Support for Distributed Deep Learning in Finance

ReWork - Deep Learning in Finance Summit

  1. 1. End-to-End Platform Support for Distributed Deep Learning in Finance
     Jim Dowling: CEO, Logical Clocks AB; Assoc Prof, KTH Stockholm; Senior Researcher, RISE SICS. jim_dowling
  2. 2. Deep Learning in Finance
     • Financial modelling problems are typically complex and non-linear.
     • If you're lucky, you have lots of labelled data.
       - Deep learning models can learn non-linear relationships and recurrent structures that generalize beyond the training data.
     • Potential areas in finance: pricing, portfolio construction, risk management and HFT*.
     * https://towardsdatascience.com/deep-learning-in-finance-9e088cb17c03
  3. 3. More Data means Better Predictions
     [Chart: prediction performance vs. amount of labelled data. Deep neural nets overtake traditional ML as labelled data grows; with little data, hand-crafted features can still outperform. Era axis: 1980s, 1990s, 2000s, 2010s, 2020s?]
  4. 4. Do we need more Compute?
     "Methods that scale with computation are the future of AI"* - Rich Sutton (a founding father of Reinforcement Learning)
     * https://www.youtube.com/watch?v=EeMCEQa85tw
  5. 5. Reduce DNN Training Time
     In 2017, Facebook reduced the training time of a CNN on ImageNet from 2 weeks to 1 hour by scaling out to 256 GPUs using Ring-AllReduce on Caffe2.
     https://arxiv.org/abs/1706.02677
  6. 6. Reduce Experiment Time with Parallel Experiments
     • Hyper-parameter optimization is parallelizable.
     • Neural Architecture Search (Google)
       - 450 GPUs / 7 days
       - 900 TPUs / 5 days
       - New SOTA on CIFAR-10 (2.13% top-1 error)
       - New SOTA on ImageNet (3.8% top-5 error)
     https://arxiv.org/abs/1802.01548
  7. 7. Training Time and ML Practitioner Productivity
     • Distributed Deep Learning
       - Interactive analysis!
       - Instant gratification!
     [Cartoon: "My Model's Training."]
  8. 8. More Compute should mean Faster Training
     [Chart: training performance vs. available compute. Distributed training pulls ahead of single-host training as available compute grows. Era axis: 2015, 2016, 2017, 2018?]
  9. 9. Distributed Training: Theory and Practice
     [Image from @hardmaru on Twitter.]
  10. 10. Distributed Training Algorithms not all Equal
     [Chart: training performance vs. available compute. AllReduce scales further than Parameter Servers.]
  11. 11. Ring-AllReduce vs Parameter Server
     [Diagram: in Ring-AllReduce, GPUs 0-3 form a ring, each sending to and receiving from its neighbours; in the Parameter Server architecture, GPUs 1-4 all exchange gradients with central Param Server(s).]
     Network bandwidth is the bottleneck for distributed training. A toy sketch of the ring pattern follows below.
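     To make the ring pattern concrete, here is a toy, single-process simulation of Ring-AllReduce (an illustration only, not Horovod's implementation; the worker count and gradient sizes are made up). Each worker sends and receives roughly twice its gradient size per all-reduce, independent of the number of workers, which is why per-link bandwidth stays flat while a parameter server's inbound bandwidth grows with the number of workers.

        import numpy as np

        def ring_allreduce(grads):
            """Simulate Ring-AllReduce over a list of per-worker gradient vectors."""
            n = len(grads)
            # Each worker splits its gradient into n chunks, one ring 'slot' per worker.
            chunks = [np.array_split(g.astype(float), n) for g in grads]

            # Phase 1: reduce-scatter. After n-1 steps, worker r owns the fully
            # summed chunk (r + 1) % n.
            for step in range(n - 1):
                sends = [(r, (r - step) % n, chunks[r][(r - step) % n].copy())
                         for r in range(n)]
                for r, idx, payload in sends:
                    chunks[(r + 1) % n][idx] += payload   # 'receive' from left neighbour

            # Phase 2: all-gather. Circulate the reduced chunks so every worker
            # ends up with the complete summed gradient.
            for step in range(n - 1):
                sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy())
                         for r in range(n)]
                for r, idx, payload in sends:
                    chunks[(r + 1) % n][idx] = payload

            return [np.concatenate(c) for c in chunks]

        # 4 simulated workers, each holding an 8-element gradient.
        grads = [np.arange(8.0) + r for r in range(4)]
        summed = ring_allreduce(grads)
        assert all(np.allclose(s, sum(grads)) for s in summed)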
  12. 12. AllReduce outperforms Parameter Servers
     [Benchmark chart*: 16 servers with 4 P100 GPUs each (64 GPUs in total), connected by a RoCE-capable 25 Gbit/s network, synthetic data; throughput measured in images processed per second.]
     For bigger models, Parameter Servers don't scale.
     * https://github.com/uber/horovod
  13. 13. InfiniBand for Training to overcome the Network Bottleneck
     [Diagram: gradients are aggregated over RDMA/InfiniBand, while input files are read and model checkpoints written to a network file system over the regular network.]
     Separate gradient sharing/aggregation network traffic from I/O traffic. A hedged configuration sketch follows below.
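     As a hedged sketch (not Hops-specific), NCCL's environment variables can pin Horovod's gradient traffic to an InfiniBand interface while file I/O stays on the default network. The interface and adapter names (ib0, mlx5_0) are assumptions for illustration, and in practice these variables are usually set in the job environment rather than in code.

        import os

        # Assumed device names; check the actual hosts for the right interface/adapter.
        os.environ["NCCL_IB_HCA"] = "mlx5_0"        # InfiniBand adapter for NCCL collectives
        os.environ["NCCL_SOCKET_IFNAME"] = "ib0"    # NCCL bootstrap/socket traffic on the IB interface
        # os.environ["NCCL_IB_DISABLE"] = "1"       # uncomment to force TCP if no IB is present

        import horovod.tensorflow as hvd
        hvd.init()  # NCCL-based AllReduce now uses the interface configured above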
  14. 14. Horovod on Hops
     "Pure" TensorFlow code, launched on the cluster via the hops library:

        import horovod.tensorflow as hvd

        def conv_model(feature, target, mode):
            .....

        def main(_):
            hvd.init()
            # opt is the TensorFlow optimizer built in the code elided from the slide;
            # wrapping it makes gradient averaging use AllReduce across workers.
            opt = hvd.DistributedOptimizer(opt)
            if hvd.local_rank() == 0:
                hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
                .....
            else:
                hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
                .....

        # Launch the notebook on Spark executors with the hops allreduce helper:
        from hops import allreduce
        allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')
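     For context, the canonical Horovod/TensorFlow 1.x training loop that the elided slide code follows looks roughly like the sketch below (generic Horovod usage rather than the Hops notebook itself; the model and data are throwaway stand-ins):

        import tensorflow as tf              # assumes TensorFlow 1.x
        import horovod.tensorflow as hvd

        hvd.init()

        # Pin each process to a single GPU (one process per GPU).
        config = tf.ConfigProto()
        config.gpu_options.visible_device_list = str(hvd.local_rank())

        # Toy model on random data, standing in for the real conv_model.
        features = tf.random_normal([32, 10])
        labels = tf.random_uniform([32], maxval=2, dtype=tf.int32)
        logits = tf.layers.dense(features, 2)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

        # Scale the learning rate by the number of workers, then wrap the optimizer
        # so gradients are averaged across workers with AllReduce.
        opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
        opt = hvd.DistributedOptimizer(opt)
        global_step = tf.train.get_or_create_global_step()
        train_op = opt.minimize(loss, global_step=global_step)

        hooks = [
            hvd.BroadcastGlobalVariablesHook(0),                    # sync initial weights from rank 0
            tf.train.StopAtStepHook(last_step=1000 // hvd.size()),
        ]

        # Only rank 0 writes checkpoints, so the workers don't overwrite each other.
        checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None
        with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                               hooks=hooks, config=config) as sess:
            while not sess.should_stop():
                sess.run(train_op)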
  15. 15. Parallel Experiments
  16. 16. Parallel Experiments on Hops
     Launch TF jobs in Spark Executors:

        def model_fn(learning_rate, dropout):
            import tensorflow as tf
            from hops import tensorboard, hdfs, devices
            [TensorFlow code here]

        from hops import experiment
        args_dict = {'learning_rate': [0.001, 0.005, 0.01], 'dropout': [0.5, 0.6]}
        experiment.launch(spark, model_fn, args_dict)

     This launches 6 Spark Executors, each with a different hyperparameter combination. Each Executor can have 1-N GPUs.
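     The six executors come from expanding args_dict into the cross product of its hyperparameter lists (3 learning rates x 2 dropout values). A local sketch of the same expansion, with a stand-in model_fn rather than the hops internals:

        import itertools

        args_dict = {'learning_rate': [0.001, 0.005, 0.01], 'dropout': [0.5, 0.6]}

        def model_fn(learning_rate, dropout):
            # Stand-in for the real TensorFlow training function; would return a metric.
            return {'learning_rate': learning_rate, 'dropout': dropout}

        names = sorted(args_dict)
        combinations = [dict(zip(names, values))
                        for values in itertools.product(*(args_dict[n] for n in names))]
        print(len(combinations))   # 6

        # On Hops each combination runs in its own Spark executor; serially it is just:
        results = [model_fn(**combo) for combo in combinations]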
  17. 17. Parallel Experiments Visualization on TensorBoard
     [Screenshot: parallel experiment results visualized in TensorBoard.]
  18. 18. Lots of good GPUs > A few great GPUs
     100 x Nvidia 1080Ti (DeepLearning11 servers) vs 8 x Nvidia V100 (DGX-1)
     Both options (100 GPUs and 8 GPUs) cost the same: $150K (March 2018).
  19. 19. Share GPUs to Maximize Utilization
     GPU Resource Management (Hops, Mesos) supports requests such as:
     • 4 GPUs on any host
     • 10 GPUs on 1 host
     • 100 GPUs on 10 hosts with 'Infiniband'
     • 20 GPUs on 2 hosts with 'Infiniband_P100'
     A hedged sketch of such a request is shown below.
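     Purely as an illustration of what such requests need to express (GPU count, host spread, and a node label such as the interconnect), here is a hypothetical request structure; it is not the actual Hops or Mesos API:

        # Hypothetical request objects, for illustration only (not a real API).
        gpu_requests = [
            {"gpus": 4,   "hosts": None, "node_label": None},               # 4 GPUs on any host
            {"gpus": 10,  "hosts": 1,    "node_label": None},               # 10 GPUs on 1 host
            {"gpus": 100, "hosts": 10,   "node_label": "Infiniband"},       # 100 GPUs on 10 hosts with 'Infiniband'
            {"gpus": 20,  "hosts": 2,    "node_label": "Infiniband_P100"},  # 20 GPUs on 2 hosts with 'Infiniband_P100'
        ]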
  20. 20. DeepLearning11 Server: $15K (10 x 1080Ti)
  21. 21. Economics of GPUs and the Cloud
     [Chart: GPU utilization over time, on-premise GPU vs cloud GPU.]
     A DeepLearning11 server (10 x 1080Ti) will pay for itself in 11 weeks, compared to using a p3.8xlarge in AWS. A hedged break-even sketch is shown below.
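     A back-of-the-envelope version of that break-even claim, where the cloud hourly rate and the utilization are illustrative assumptions (2018 on-demand pricing varied by region, and reserved or spot discounts change the picture); with these numbers the result lands in the same ballpark as the slide's 11 weeks:

        SERVER_COST_USD = 15_000          # DeepLearning11, from the slide
        CLOUD_RATE_USD_PER_HOUR = 12.24   # assumed p3.8xlarge on-demand rate (illustrative)
        UTILIZATION = 0.75                # assumed fraction of time the GPUs are actually busy

        hours_to_break_even = SERVER_COST_USD / (CLOUD_RATE_USD_PER_HOUR * UTILIZATION)
        weeks_to_break_even = hours_to_break_even / (24 * 7)
        print(f"break-even after ~{weeks_to_break_even:.1f} weeks")   # ~9.7 weeks with these assumptions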
  22. 22. Distributed Deep Learning for Finance
     • Platform for Hyperscale Data Science
     • Controlled* access to datasets
     * GDPR compliance, Sarbanes-Oxley, etc.
  23. 23. Hopsworks
  24. 24. Hops: Next Generation Hadoop*
     • Faster: 16x the throughput of HDFS
     • Bigger: 37x the number of files
     • Scale Challenge Winner (2017)
     • GPU support in YARN
     * https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
  25. 25. Hopsworks Data Platform
     [Architecture diagram: Develop / Train / Test / Serve. Hopsworks manages Projects, Datasets and Users and exposes a REST API. Tools: Jupyter, Zeppelin, Jobs, Kibana, Grafana on top of Spark, Flink, TensorFlow, running on HopsFS / YARN, with MySQL Cluster, Hive, InfluxDB, ElasticSearch and Kafka as backing services.]
  26. 26. A Project is a Grouping of Users and Data
     [Diagram: projects such as Proj-42, Proj-X and Proj-AllCompanyDB act as sandboxes around private data (e.g. /Projs/My/Data), with Kafka topics that can be shared between projects.]
     Ismail et al., "Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata", ICDCS 2017.
  27. 27. How are Projects used?
     [Example: an Engineering project publishes an FX data stream to a Kafka topic; the topic is shared with an FX Project, whose FX team runs shared interactive analytics against the FX topic and the FX DB.]
  28. 28. Per-Project Python Envs with Conda
     Python libraries installed in a project's conda environment are usable by Spark/TensorFlow jobs.
  29. 29. TensorFlow in Hopsworks
     [Architecture diagram: Experiments and TensorBoard on top of HopsFS, YARN, Kafka, Hive and a FeatureStore, with TensorFlow Serving for deployed models; runs in the public cloud or on-premise.]
  30. 30. One-Click Deployment of TensorFlow Models
     Deployed models are served with TensorFlow Serving; a hedged sketch of querying a served model follows below.
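     A sketch of querying a model once it sits behind TensorFlow Serving's REST API. The host, port, model name and input shape are assumptions for illustration; the Hopsworks serving endpoint and its authentication details will differ.

        import requests

        # Hypothetical endpoint: TensorFlow Serving's default REST port and a made-up model name.
        SERVING_URL = "http://localhost:8501/v1/models/fx_model:predict"

        payload = {"instances": [[0.1, 0.2, 0.3, 0.4]]}   # one row for a toy 4-feature model
        response = requests.post(SERVING_URL, json=payload)
        response.raise_for_status()
        print(response.json()["predictions"])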
  31. 31. Hops API
     • Python/Java/Scala library
       - Manage TensorBoard, load/save models in HDFS
       - Horovod, TensorFlowOnSpark
       - Parameter sweeps for parallel experiments
       - Neural Architecture Search with Genetic Algorithms
       - Secure Streaming Analytics with Kafka/Spark/Flink
         • SSL/TLS certs, Avro Schema, Endpoints for Kafka/Hopsworks/etc.
  32. 32. Deep Learning Hierarchy of Scale
     [Pyramid, from bottom to top:]
     • Single GPU
     • Many GPUs on a Single GPU Server
     • Parallel Experiments on GPU Servers
     • DDL (Distributed Deep Learning) with GPU Servers and Parameter Servers
     • DDL AllReduce on GPU Servers
  33. 33. Summary
     • Distribution can make Deep Learning practitioners more productive. https://www.oreilly.com/ideas/distributed-tensorflow
     • Hopsworks is a new Data Platform built on HopsFS with first-class support for Python / Deep Learning / ML / Strong Data Governance.
  34. 34. The Team
     Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.
     Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K "Sri" Srijeyanthan, Jude D'Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, ArunaKumari Yedurupaka, Tobias Johansson, Roberto Bampi.
     www.hops.io @hopshadoop
