
Distributed deep learning optimizations for Finance

Talk given at the Global AI Conference 2017, New York. Covers deep learning use cases in finance and techniques for optimizing distributed deep learning.

  1. Distributed Deep Learning Optimizations for Finance
     Geeta Chauhan, CTO, SVSG
  2. Agenda
     • Distributed DL challenges
     • Deep Learning in Finance
     • @ Scale DL infrastructure
     • Parallelize your models
     • Techniques for optimization
     • Look into the future
     • References
  3. Rise of Deep Learning
     Major advances in AI:
     • Computer vision, language translation, speech recognition, question & answer, …
     Challenging to build & deploy for large-scale applications:
     • Latency, cost, and power-consumption issues
     • Complexity & size outpacing commodity "general purpose compute"
     • Hyper-parameter tuning; black-box models
     (Exascale, 15 Watts)
  4. Deep Learning in Finance
     • Visual chart pattern trading (AlpacaAlgo)
     • Deep portfolio autoencoders
     • Trading Gym reinforcement learning
     • Real-time fraud detection (Kabbage)
     • FX trading across time zones
     • Cyber security (Deep Instinct)
     • Face recognition for secure login
     • Customer experience AI (AugmentHQ)
  5. Shift towards Specialized Compute
     • Special-purpose cloud: Google TPU, Microsoft Brainwave, Intel Nervana, IBM PowerAI, Nvidia V100
     • Spectrum: CPU, GPU, FPGA, custom ASICs
     • Edge compute: hardware accelerators, AI SoCs (Intel Neural Compute Stick, Nvidia Jetson, Nvidia Drive PX for self-driving cars)
     • Architectures: cluster compute, HPC, neuromorphic, quantum compute
     • Complexity in software: model tuning/optimizations specific to hardware; growing need for compilers that optimize for the deployment hardware
     • Workload-specific compute: model training vs. inference
  6. CPU Optimizations
     • Leverage high-performance compute tools: Intel Python, Intel Math Kernel Library (MKL), NNPACK (for multi-core CPUs)
     • Compile TensorFlow from source for CPU optimizations
     • Use a proper batch size, making use of all cores & memory
     • Use the proper data format: NCHW for CPUs vs. the TensorFlow default NHWC
     • Use queues for reading data
     (Source: Intel Research Blog; a configuration sketch follows below.)
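Not from the slides: a minimal sketch of how the threading and data-format advice above maps to TensorFlow 1.x code, assuming an MKL build. The core count and layer shapes are illustrative assumptions.

# Minimal sketch, assuming a TensorFlow 1.x MKL build; thread counts are illustrative.
import tensorflow as tf

NUM_CORES = 16  # assumption: number of physical cores on the target CPU

config = tf.ConfigProto(
    intra_op_parallelism_threads=NUM_CORES,  # parallelism inside a single op
    inter_op_parallelism_threads=2)          # how many ops may run concurrently

# MKL-backed kernels prefer NCHW ("channels_first") over TensorFlow's NHWC default.
images = tf.placeholder(tf.float32, shape=[None, 3, 299, 299])  # N, C, H, W
conv = tf.layers.conv2d(images, filters=64, kernel_size=3,
                        data_format='channels_first')

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())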
  7. TensorFlow CPU Optimizations
     Compile from source:
     • git clone https://github.com/tensorflow/tensorflow.git
     • Run ./configure from the TensorFlow source directory and select the MKL (CPU) optimization option
     • Build the pip package:
       bazel build --config=mkl --copt=-DEIGEN_USE_VML -c opt //tensorflow/tools/pip_package:build_pip_package
     • Install the optimized TensorFlow wheel:
       bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/path_to_save_wheel
       pip install --upgrade --user ~/path_to_save_wheel/wheel_name.whl
     • Alternatively, use Intel's optimized pip wheel files
     (A runtime-tuning sketch follows below.)
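Not on the slide: once an MKL build is installed, the Intel article cited in the resources also recommends a few runtime environment variables. A minimal sketch; the values shown are illustrative starting points, not from the slides.

# Minimal sketch: MKL/OpenMP runtime knobs recommended in Intel's TensorFlow article.
# The values are illustrative starting points; tune them for your workload and CPU.
import os

os.environ["OMP_NUM_THREADS"] = "16"   # assumption: number of physical cores
os.environ["KMP_BLOCKTIME"] = "0"      # ms a thread spin-waits after finishing work
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

import tensorflow as tf                 # import only after the env vars are set
print(tf.__version__)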
  8. Parallelize your models
     Data parallelism:
     • TensorFlow Estimator + Experiments
     • Parameter server, worker cluster
     • Intel BigDL Spark cluster
     • Baidu's Ring AllReduce
     • Uber's Horovod TensorFusion
     • HyperTune Google Cloud ML
     Model parallelism:
     • Graph too large to fit on one machine
     • TensorFlow model towers
     (A data-parallel Horovod sketch follows below.)
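Not from the slides: a minimal data-parallel training sketch using Uber's Horovod with the TensorFlow 1.x API. The toy model, learning rate, and step count are assumptions; a real job is launched with MPI, e.g. mpirun -np 4 python train.py, as described in the Horovod README.

# Minimal data-parallel sketch with Horovod (TF 1.x API); the model is a toy stand-in.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                         # one training process per GPU

# Pin this process to its own GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model: a single dense layer on random data (stand-in for a real network).
features = tf.random_normal([32, 10])
labels = tf.random_normal([32, 1])
predictions = tf.layers.dense(features, 1)
loss = tf.losses.mean_squared_error(labels, predictions)

opt = tf.train.AdamOptimizer(0.001 * hvd.size())   # scale learning rate by worker count
opt = hvd.DistributedOptimizer(opt)                # averages gradients via ring-allreduce
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0),      # sync initial weights from rank 0
         tf.train.StopAtStepHook(last_step=1000)]
with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)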
  9. Optimizations for Training
     (Figure: training optimizations; source: Amazon MXNet)
  10. Workload Partitioning (source: Amazon MXNet)
      • Minimize communication time
      • Place neighboring layers on the same GPU
      • Balance the workload between GPUs; different layers have different memory/compute properties
      • The model on the left of the figure is more balanced
      • LSTM unrolling: lower memory, higher compute time
      • Encode/decode layers: higher memory
      (A device-placement sketch follows below.)
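Not from the slides: a minimal model-parallelism sketch that places neighboring layer groups on different GPUs with explicit device placement. The layer sizes and the two-GPU split are illustrative assumptions.

# Minimal model-parallelism sketch: split the network across two GPUs.
import tensorflow as tf

inputs = tf.random_normal([64, 1024])

with tf.device('/gpu:0'):                    # first half of the network
    hidden1 = tf.layers.dense(inputs, 4096, activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, 4096, activation=tf.nn.relu)

with tf.device('/gpu:1'):                    # second half; only hidden2 crosses GPUs
    hidden3 = tf.layers.dense(hidden2, 4096, activation=tf.nn.relu)
    logits = tf.layers.dense(hidden3, 10)

# allow_soft_placement falls back to CPU if a GPU is not available.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(logits).shape)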
  11. Optimizations for Inferencing
      • Graph Transform Tool
      • Freeze graph (convert variables to constants)
      • Quantization (32-bit float → 8-bit)
      • Quantize weights (20M weights for Inception v3)
      • Inception v3: 93 MB → 1.5 MB
      • AlexNet 35x smaller, VGG-16 49x smaller
      • 3x to 4x speedup, 3x to 7x more energy-efficient

      bazel build tensorflow/tools/graph_transforms:transform_graph
      bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
        --in_graph=/tmp/classify_image_graph_def.pb \
        --outputs="softmax" \
        --out_graph=/tmp/quantized_graph.pb \
        --transforms='add_default_attributes strip_unused_nodes(type=float, shape="1,299,299,3") remove_nodes(op=Identity, op=CheckNumerics) fold_constants(ignore_errors=true) fold_batch_norms fold_old_batch_norms quantize_weights quantize_nodes strip_unused_nodes sort_by_execution_order'

      (A sketch for loading the quantized graph follows below.)
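Not on the slide: a minimal sketch of loading the quantized graph produced above and running inference with TensorFlow 1.x. The tensor names ('Mul:0', 'softmax:0') follow the classify_image Inception v3 example graph and are assumptions for any other model.

# Minimal sketch: load the quantized graph and run one inference on random input.
import numpy as np
import tensorflow as tf

with tf.gfile.GFile('/tmp/quantized_graph.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name='')

image = np.random.rand(1, 299, 299, 3).astype(np.float32)   # stand-in for a real image
with tf.Session(graph=graph) as sess:
    # Assumed tensor names; check your graph's actual input/output node names.
    probs = sess.run('softmax:0', feed_dict={'Mul:0': image})
    print(probs.shape)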
  12. Cluster Optimizations
      • Define your ML container locally; evaluate with different parameters in the cloud
      • Use EFS/GFS for data storage and sharing across nodes
      • Create a separate data-processing container
      • Mount the EFS/GFS drive on all pods for shared storage
      • Avoid GPU fragmentation problems by bundling jobs
      • Placement optimizations: bundle as pods in Kubernetes, use placement constraints in Mesos
      • Bundling GPU drivers in the container is a problem: mount them as a read-only volume, or use nvidia-docker
      (A pod-definition sketch follows below.)
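Not from the slides: a minimal sketch of these ideas using the official Kubernetes Python client. The image name, persistent volume claim, and GPU count are hypothetical; the pod bundles both GPUs into one job and mounts a shared volume read-only.

# Minimal sketch using the kubernetes Python client; names are hypothetical.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="dl-training"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="trainer",
            image="my-registry/tf-training:latest",          # hypothetical image
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "2"}),             # bundle GPUs to avoid fragmentation
            volume_mounts=[client.V1VolumeMount(
                name="shared-data", mount_path="/data", read_only=True)],
        )],
        volumes=[client.V1Volume(
            name="shared-data",
            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                claim_name="efs-claim"))],                   # hypothetical EFS-backed claim
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)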
  13. Uber's Horovod on Mesos
      • Peloton gang scheduler
      • MPI-based, bandwidth-optimized communication
      • Write code for one GPU; it replicates across the cluster
      • Nested containers
      (Source: Uber, MesosCon)
  14. Future: FPGA Hardware Microservices
      Project Brainwave (source: Microsoft Research Blog)
  15. FPGA Optimizations
      Brainwave compiler (source: Microsoft Research Blog)
      "Can FPGAs Beat GPUs?" paper:
      ➢ Optimizing CNNs on Intel FPGAs
      ➢ FPGA vs GPU: 60x faster, 2.3x more energy-efficient
      ➢ <1% loss of accuracy
      ESE on FPGA paper:
      ➢ Optimizing LSTMs on Xilinx FPGAs
      ➢ FPGA vs CPU: 43x faster, 40x more energy-efficient
      ➢ FPGA vs GPU: 3x faster, 11.5x more energy-efficient
  16. Future: Neuromorphic Compute
      • Intel's Loihi: brain-inspired AI chip
      • Neuromorphic memristors
  17. Future: Quantum Computers (source: opentranscripts.org)
      + Monte Carlo simulations & dynamic portfolio optimization
      ? Cybersecurity is a big challenge
  18. Resources
      • Deep Portfolios paper: http://onlinelibrary.wiley.com/doi/10.1002/asmb.2209/pdf
      • A Study of Complex Deep Learning Networks on High Performance, Neuromorphic, and Quantum Computers: https://arxiv.org/pdf/1703.05364.pdf
      • Trading Gym: https://github.com/Prediction-Machines/Trading-Gym
      • TensorFlow Intel CPU optimizations: https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
      • TensorFlow quantization: https://www.tensorflow.org/performance/quantization
      • Deep Compression paper: https://arxiv.org/abs/1510.00149
      • Microsoft's Project Brainwave: https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
      • Can FPGAs Beat GPUs?: http://jaewoong.org/pubs/fpga17-next-generation-dnns.pdf
      • ESE on FPGA: https://arxiv.org/abs/1612.00694
      • Intel Spark BigDL: https://software.intel.com/en-us/articles/bigdl-distributed-deep-learning-on-apache-spark
      • Baidu's PaddlePaddle on Kubernetes: http://blog.kubernetes.io/2017/02/run-deep-learning-with-paddlepaddle-on-kubernetes.html
      • Uber's Horovod distributed training framework for TensorFlow: https://github.com/uber/horovod
  19. Upcoming Talks
      • Deep Learning @ Edge with the Intel Neural Compute Stick: Global IoTDevFest, online, Nov 7-8, 2017
      • Best Practices for On-demand HPC in Enterprises: Intel HPC Developers Conference, Denver, Colorado, Nov 11-12, 2017
  20. Questions?
      Contact: http://bit.ly/geeta4c | geeta@svsg.co | @geeta4c
