The document discusses deep learning techniques for financial technology (FinTech) applications. It begins with examples of current deep learning uses in FinTech like trading algorithms, fraud detection, and personal finance assistants. It then covers topics like specialized compute hardware for deep learning training and inference, optimization techniques for CPUs and GPUs, and distributed training approaches. Finally, it discusses emerging areas like FPGA and quantum computing and provides resources for practitioners to start with deep learning for FinTech.
2. Agenda
AI & Deep Learning in FinTech
What is Deep Learning?
Rise of Specialized Compute
Techniques for Optimization
A look into the future
Steps for starting your AI journey
References
4. Deep Learning in FinTech
Visual chart pattern trading (AlpacaAlgo)
AI crypto hedge fund (NumeraAI)
Trading Gym (Prediction Machines)
Real-time fraud detection (FeedZai, Kabbage)
FX trading across time zones (QuantAlea)
Cyber security (Deep Instinct)
Personal finance assistant (Cleo AI)
Customer experience AI (AugmentHQ)
5. What is Deep Learning?
AI neural networks composed of many layers
Learn like humans
Automated feature learning
Layers are like image filters
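The "layers are like image filters" idea can be illustrated with a minimal sketch in plain Python. The function name, the toy image, and the hand-picked kernel are all hypothetical; in a real network the kernel values are learned from data rather than written by hand, and frameworks compute this same sliding-window sum (strictly a cross-correlation) far more efficiently.

```python
def apply_filter(image, kernel):
    """Slide `kernel` over `image` (valid positions only) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 4x4 image: dark left half (0), bright right half (1).
image = [[0, 0, 1, 1] for _ in range(4)]
# Hand-picked kernel that responds to a dark-to-bright step from left to right.
vertical_edge = [[-1, 1], [-1, 1]]

feature_map = apply_filter(image, vertical_edge)
for row in feature_map:
    print(row)
```

The response is non-zero only in the column where the brightness changes, which is the sense in which a layer acts as a filter: it highlights one kind of pattern and suppresses everything else.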
6. Rise of Deep Learning
Major advances in AI
• Computer Vision, Language Translation, Speech Recognition, Question & Answer, …
Challenging to build & deploy for large-scale applications
• Latency, cost, and power consumption issues
• Complexity & size outpacing commodity "general purpose compute"
• Hyper-parameter tuning, black box
Exascale, 15 Watts
7. Shift towards Specialized Compute
Special purpose Cloud
Google TPU, Microsoft Brainwave, Intel Nervana, IBM Power AI, Nvidia v100
Bare Metal Cloud – Preview AWS, GCE coming April 2018
Spectrum: CPU, GPU, FPGA, Custom ASICs
Edge Compute: Hardware accelerators, AI SoC
Intel Neural Compute Stick, Nvidia Jetson, Nvidia Drive PX (Self driving cars)
Architectures
Cluster Compute, HPC, Neuromorphic, Quantum compute
Complexity in Software
Model tuning/optimizations specific to hardware
Growing need for compilers to optimize based on deployment hardware
Workload specific compute: Model training, Inference
8. CPU Optimizations
Leverage high-performance compute tools
Intel Python, Intel Math Kernel Library (MKL), NNPack (for multi-core CPUs)
Compile Tensorflow from source for CPU optimizations
Proper Batch size, using all cores & memory
Proper Data Format
NCHW for CPUs vs Tensorflow default NHWC
Use Queues for Reading Data
Source: Intel Research Blog
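The data format point is easier to see with a small example. The sketch below, in plain Python with a hypothetical helper name, reorders a batch from NHWC (batch, height, width, channels) to NCHW (batch, channels, height, width). It only illustrates what the two layouts mean; in practice the framework's own transpose or data-format option would do this work.

```python
def nhwc_to_nchw(batch):
    """Reorder a nested-list batch from [N][H][W][C] to [N][C][H][W]."""
    n = len(batch)
    h = len(batch[0])
    w = len(batch[0][0])
    c = len(batch[0][0][0])
    return [
        [
            [[batch[i][y][x][ch] for x in range(w)] for y in range(h)]
            for ch in range(c)
        ]
        for i in range(n)
    ]


# Hypothetical batch: one 2x2 image with 3 channels per pixel.
nhwc = [[[[1, 2, 3], [4, 5, 6]],
         [[7, 8, 9], [10, 11, 12]]]]

nchw = nhwc_to_nchw(nhwc)
for channel in nchw[0]:
    print(channel)
```

In NCHW each channel becomes one contiguous plane, whereas NHWC keeps the channel values of a single pixel next to each other.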
10. Parallelize your models
Data Parallelism
Tensorflow Estimator + Experiments
Parameter Server, Worker cluster
Intel BigDL Spark Cluster
Baidu’s Ring AllReduce
Uber’s Horovod (Tensor Fusion)
HyperTune Google Cloud ML
Model Parallelism
Graph too large to fit on one machine
Tensorflow Model Towers
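The Ring AllReduce approach listed under data parallelism can be sketched as a single-process simulation. This is a minimal illustration, not Baidu's or Horovod's implementation: the function name is hypothetical, workers are plain lists, "sending" is a list copy, and the gradient length is assumed to divide evenly by the number of workers.

```python
def ring_allreduce(worker_grads):
    """Simulate ring all-reduce: every worker ends up with the element-wise sum."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    assert length % n == 0, "sketch assumes the vector splits evenly into n chunks"
    size = length // n
    # chunks[w][c] is worker w's local copy of chunk c.
    chunks = [[list(g[c * size:(c + 1) * size]) for c in range(n)]
              for g in worker_grads]

    # Phase 1 (scatter-reduce): after n-1 steps, worker w holds the full sum
    # of chunk (w + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing data first so all sends in a step are simultaneous.
        sends = [(w, (w - step) % n, list(chunks[w][(w - step) % n]))
                 for w in range(n)]
        for w, c, payload in sends:
            dst = (w + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2 (all-gather): pass the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n, list(chunks[w][(w + 1 - step) % n]))
                 for w in range(n)]
        for w, c, payload in sends:
            chunks[(w + 1) % n][c] = payload

    return [[x for chunk in worker for x in chunk] for worker in chunks]


# Hypothetical gradients from three workers.
grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
reduced = ring_allreduce(grads)
print(reduced[0])
```

Each worker only ever talks to its neighbor and sends one chunk per step, so the data sent per worker stays roughly constant as workers are added. Dividing the result by the number of workers gives the averaged gradient used in data-parallel training.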
12. Workload Partitioning
Source: Amazon MxNET
Minimize communication time
Place neighboring layers on same GPU
Balance workload between GPUs
Different layers have different memory-compute properties
Model on left more balanced
LSTM unrolling: ↓ memory, ↑ compute time
Encode/Decode: ↑ memory
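One way to combine "place neighboring layers on same GPU" with "balance workload between GPUs" is to split the layer sequence into contiguous groups while keeping the heaviest group as light as possible. The sketch below is an assumption-laden illustration, not the MxNET method: the function name and the integer per-layer costs are hypothetical, and it ignores communication cost and memory limits.

```python
def partition_layers(costs, num_devices):
    """Split layers into at most num_devices contiguous groups,
    minimizing the cost of the heaviest group."""

    def groups_for(cap):
        # Greedily fill each group until adding the next layer would exceed cap.
        groups, current, total = [], [], 0
        for index, cost in enumerate(costs):
            if current and total + cost > cap:
                groups.append(current)
                current, total = [], 0
            current.append(index)
            total += cost
        groups.append(current)
        return groups

    # Binary search for the smallest per-device cap that needs few enough groups.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if len(groups_for(mid)) <= num_devices:
            hi = mid
        else:
            lo = mid + 1
    return groups_for(lo)


# Hypothetical integer costs (for example, milliseconds per layer).
layer_costs = [4, 2, 3, 7, 1, 5]
assignment = partition_layers(layer_costs, num_devices=2)
print(assignment)
```

Because each group is a contiguous run of layers, activations cross a device boundary only between groups, which keeps communication down.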
13. Optimizations for Inferencing
Graph Transform Tool
Freeze graph (variables to constants)
Quantization (32 bit float → 8 bit)
Quantize weights (20 M weights for Inception v3)
Inception v3 93 MB → 1.5 MB
AlexNet 35x smaller, VGG-16 49x smaller
3x to 4x speedup, 3x to 7x more energy-efficient
bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=/tmp/classify_image_graph_def.pb \
  --outputs="softmax" --out_graph=/tmp/quantized_graph.pb \
  --transforms='add_default_attributes strip_unused_nodes(type=float, shape="1,299,299,3")
    remove_nodes(op=Identity, op=CheckNumerics)
    fold_constants(ignore_errors=true)
    fold_batch_norms fold_old_batch_norms quantize_weights quantize_nodes
    strip_unused_nodes sort_by_execution_order'
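The 32-bit float to 8-bit step can be sketched in a few lines of plain Python. This is a simple min/max linear quantizer written for illustration, with hypothetical function names and weights. It is not the code behind the quantize_weights transform, whose exact scheme may differ.

```python
def quantize(weights):
    """Map floats to integer codes in 0..255 using the min/max range of the tensor."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when all weights are equal
    codes = [min(255, max(0, round((w - lo) / scale))) for w in weights]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate float values from the 8-bit codes."""
    return [lo + code * scale for code in codes]


# Hypothetical weights from one layer.
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, lo, scale = quantize(weights)
restored = dequantize(codes, lo, scale)

worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print(worst_error <= scale / 2 + 1e-9)
```

Each weight is stored in one byte instead of four, plus two floats per tensor for the range, and the rounding error per weight is bounded by about half a quantization step.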
14. Cluster Optimizations
Define your ML container locally
Evaluate with different parameters in the cloud
Use EFS / GFS for data storage and sharing across nodes
Create a separate data processing container
Mount the EFS/GFS drive on all pods for shared storage
Avoid GPU fragmentation problems by bundling jobs
Placement optimizations: Kubernetes (bundle as pods), Mesos placement constraints
Bundling GPU drivers in the container is a problem
Mount them as a read-only volume, or use nvidia-docker
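The GPU fragmentation problem can be shown with a toy scheduler. The sketch below is purely illustrative: the function, the strategy names, and the job sizes are hypothetical, and it does not model the Kubernetes or Mesos schedulers. It compares spreading jobs across nodes with packing them onto the fullest node that still has room.

```python
def place(jobs, node_capacity, num_nodes, strategy):
    """Assign each job (a GPU count) to a node index, or None if it does not fit."""
    free = [node_capacity] * num_nodes
    placement = []
    for gpus in jobs:
        candidates = [n for n in range(num_nodes) if free[n] >= gpus]
        if not candidates:
            placement.append(None)  # no single node has enough free GPUs
            continue
        if strategy == "spread":
            node = max(candidates, key=lambda n: free[n])  # emptiest node first
        else:  # "pack"
            node = min(candidates, key=lambda n: free[n])  # fullest node that fits
        free[node] -= gpus
        placement.append(node)
    return placement


# Two hypothetical nodes with 4 GPUs each; two small jobs arrive before a large one.
jobs = [2, 2, 4]
spread_result = place(jobs, node_capacity=4, num_nodes=2, strategy="spread")
pack_result = place(jobs, node_capacity=4, num_nodes=2, strategy="pack")
print(spread_result)
print(pack_result)
```

With spreading, four GPUs remain free in total but are split two and two, so the 4-GPU job cannot be placed. Packing the small jobs together leaves one node whole for it.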
15. Uber’s Horovod on Mesos
Peloton Gang Scheduler
MPI-based, bandwidth-optimized communication
Code for one GPU, replicates across cluster
Nested containers
Source: Uber Mesoscon
16. Future: FPGA Hardware Microservices
Project Brainwave (Source: Microsoft Research Blog)
17. FPGA Optimizations
Brainwave Compiler (Source: Microsoft Research Blog)
Can FPGA Beat GPU Paper:
➢ Optimizing CNNs on Intel FPGA
➢ FPGA vs GPU: 60x faster, 2.3x more energy-efficient
➢ <1% loss of accuracy
ESE on FPGA Paper:
➢ Optimizing LSTMs on Xilinx FPGA
➢ FPGA vs CPU: 43x faster, 40x more energy-efficient
➢ FPGA vs GPU: 3x faster, 11.5x more energy-efficient
20. Where to start your AI journey?
Level 1: Just Starting
Start with a lower-risk use case such as AI-driven customer service or RPA
Level 2: Intermediate
Invest in data cleansing and provenance for building richer systems
Combine 3rd party data sets for greater insights
Level 3: Advanced
Experiment with Deep Learning models for complex scenarios or new, innovative use cases like Face Recognition for banking app security
Level 4: Mature
Add a feedback loop to your models, learning from outcomes
Experiment with Deep Reinforcement Learning
Industrialize the ML/DL Pipeline, shared model repository across company
21. Resources
CBInsights AI in FinTech Market Map: https://www.cbinsights.com/research/ai-fintech-startup-market-map/
Deep Portfolios Paper: http://onlinelibrary.wiley.com/doi/10.1002/asmb.2209/pdf
Opening the Blackbox of Financial AI with ClearTrade: https://arxiv.org/pdf/1709.01574.pdf
Trading Gym: https://github.com/Prediction-Machines/Trading-Gym
Tensorflow Intel CPU Optimized: https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
Tensorflow Quantization: https://www.tensorflow.org/performance/quantization
Deep Compression Paper: https://arxiv.org/abs/1510.00149
Microsoft’s Project Brainwave: https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
Can FPGAs Beat GPUs?: http://jaewoong.org/pubs/fpga17-next-generation-dnns.pdf
ESE on FPGA: https://arxiv.org/abs/1612.00694
Intel Spark BigDL: https://software.intel.com/en-us/articles/bigdl-distributed-deep-learning-on-apache-spark
Baidu’s Paddle-Paddle on Kubernetes: http://blog.kubernetes.io/2017/02/run-deep-learning-with-paddlepaddle-on-kubernetes.html
Uber’s Horovod Distributed Training framework for Tensorflow: https://github.com/uber/horovod
A Study of Complex Deep Learning Networks on High Performance, Neuromorphic, and Quantum Computers: https://arxiv.org/pdf/1703.05364.pdf