http://pipeline.ai
Using the latest advancements from TensorFlow including the Accelerated Linear Algebra (XLA) Framework, JIT/AOT Compiler, and Graph Transform Tool, I’ll demonstrate how to optimize, profile, and deploy TensorFlow Models - and the TensorFlow Runtime - in GPU-based production environment. This talk is 100% demo based on open source tools and completely reproducible through Docker on your own GPU cluster.
Bio
Chris Fregly is Founder and Research Engineer at PipelineAI, a Streaming Machine Learning and Artificial Intelligence Startup based in San Francisco. He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, author of the O’Reilly Training and Video Series titled, "High-Performance TensorFlow in Production."
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member and Principal Engineer at the IBM Spark Technology Center in San Francisco.
http://pipeline.ai
Decoding Loan Approval: Predictive Modeling in Action
Optimizing, Profiling, and Deploying TensorFlow AI Models with GPUs - San Francisco Python Meetup - Nov 8, 2017
1. HIGH PERFORMANCE TENSORFLOW IN
PRODUCTION WITH GPUS
SF PYTHON MEETUP NOV 8, 2017
SPECIAL THANKS TO YELP!! !!
CHRIS FREGLY, FOUNDER @PIPELINE.AI
2. INTRODUCTIONS: ME
§ Chris Fregly, Founder & Engineer @ PipelineAI
§ Formerly Netflix and Databricks
§ Advanced Spark and TensorFlow Meetup
Please Join Our 50,000+ Global Members!!
Contact Me
chris@pipeline.ai
@cfregly
* San Francisco
* Chicago
* Austin
* Washington DC
* Dusseldorf
* London
3. INTRODUCTIONS: YOU
§ Software Engineer, Data Scientist, Data Engineer, Data Analyst
§ Interested in Optimizing and Deploying TF Models to Production
§ Nice to Have a Working Knowledge of TensorFlow (Not Required)
4. CONTENT BREAKDOWN
50% Training Optimizations (GPUs, Training Pipeline, JIT)
50% Prediction Optimizations (AOT Compile, TF Serving)
Why Heavy Focus on Model Prediction vs. Just Training?
10s of Data Scientists <<< Millions of App Users
Training
Boring & Batch
Prediction
Exciting & Real-Time!!
5. AGENDA
Part 0: Latest PipelineAI Research
Part 1: Optimize TensorFlow Model Training
Part 2: Optimize TensorFlow Model Serving
6. 100% OPEN SOURCE CODE
§ https://github.com/PipelineAI/pipeline/
§ Please 🌟 this GitHub Repo!
§ All slides, code, notebooks, and Docker images here:
https://github.com/PipelineAI/pipeline/tree/master/gpu.ml
https://github.com/rviscomi/red-dwarf
7. HANDS-ON EXERCISES
§ Combo of Jupyter Notebooks and Command Line
§ Command Line through Jupyter Terminal
§ Some Exercises Based on Experimental Features
You May See Errors. Stay Calm. You Will Be OK!!
9. AGENDA
Part 0: Latest PipelineAI Research
§ Package, Deploy, and Tune Both Model + Runtime
§ Deploy Models and Experiments Safely to Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
10. PACKAGE MODEL + RUNTIME AS ONE
§ Package Model + Runtime into Immutable Docker Image
§ Same Environment: Local, Dev, and Prod
§ No Dependency Surprises in Production
§ Deploy and Tune Model + Runtime Together
pipeline predict-server-build --model-type=tensorflow
--model-name=mnist
--model-tag=”c"
--model-path=./models/tensorflow/mnist
Package Model
Server C Locally
pipeline predict-server-push --model-type=tensorflow
--model-name=mnist
--model-tag=”c”
Push Image C To
Docker Registry
11. TUNE MODEL + RUNTIME TOGETHER
§ Try Different Model Hyper-Parameters + Runtime Configs
§ Even Different Runtimes: TF Serving, TensorRT
§ Auto-Quantize Model Weights + Activations
§ Auto-Fuse Neural Network Layers Together
§ Generate Native CPU + GPU Code
pipeline predict-server-start --model-type=tensorflow
--model-name=mnist
--model-tag=”c"
Start Model
Server C Locally
12. LOAD TEST MODEL + RUNTIME LOCALLY
§ Perform Mini-Load Test on Local Model Server
§ Provides Immediate Feedback on Prediction Performance
§ Relative Performance Compared to Other Variations
§ No Need to Deploy to Test or Prod for Prediction Metrics
§ See Where Time is Being Spent During Prediction
pipeline predict --model-server-url=http://localhost:6969
--model-type=tensorflow
--model-name=mnist
--model-tag=”c”
--test-request-concurrency=1000
Load Test Model
Server C Locally
13. RUNTIME OPTION: NVIDIA TENSOR-RT
§ GPU-Optimized Prediction Runtime
§ Alternative to TensorFlow Serving
§ Post-Training Model Optimizations
§ Similar to TF Graph Transform Tool
§ PipelineAI Supports TensorRT!
14. AGENDA
Part 0: Latest PipelineAI Research
§ Package, Deploy, and Tune Both Model + Runtime
§ Deploy Models and Experiments Safely to Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
15. DEPLOY MODELS SAFELY TO PROD
§ Deploy from Jupyter Notebook in 1-Click
§ Deploy to 1-2% Split or Shadowed Traffic
§ Tear-Down or Rollback Quickly
§ Command Line Interface (CLI)
pipeline predict-cluster-start --model-type=tensorflow
--model-name=mnist
--model-tag=”b”
--traffic-split=“0.02”
Start Model
Cluster B in Prod
pipeline predict-cluster-start --model-type=tensorflow
--model-name=mnist
--model-tag=”c”
--traffic-split=“0.01”
Start Model
Cluster C in Prod
pipeline predict-cluster-start --model-type=tensorflow
--model-name=mnist
--model-tag=”a”
--traffic-split=“0.97”
Start Model
Cluster A in Prod
Implementation Details…
No need to understand these.
We do, so you don’t have to! J
16. DEPLOY EXPERIMENTS SAFELY TO PROD
§ Create Experiments Directly from Jupyter or Command Line
§ Deploy Experiment
pipeline experiment-add --experiment-name=my_experiment
--model-type=tensorflow
--model-name=mnist
--model-tag=“a”
--traffic-split=“97%”
CLI
Drag n’
Drop
pipeline experiment-start --experiment-name=my_experiment
--traffic-shadow=“20%”
pipeline experiment-add --experiment-name=my_experiment
--model-type=tensorflow
--model-name=mnist
--model-tag=“b”
--traffic-split=“2%”
pipeline experiment-add --experiment-name=my_experiment
--model-type=tensorflow
--model-name=mnist
--model-tag=“c”
--traffic-split=“1%”
1-Click
Start Experiment
with 20% Shadowed
of Production Traffic
17. AGENDA
Part 0: Latest PipelineAI Research
§ Package, Deploy, and Tune Both Model + Runtime
§ Deploy Models and Experiments Safely to Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
18. COMPARE MODELS OFFLINE & ONLINE
§ Offline, Batch Metrics
§ Validation Accuracy
§ Training Accuracy
§ CPU/GPU Utilization
§ Actual, Live Predictions!
§ Relative Prediction Precision
§ Online, Real-Time Metrics
§ Response Time & Throughput
§ Cost Per Prediction!
21. CONTINUOUS MODEL TRAINING
§ Identify and Fix Borderline Predictions (~50-50% Confidence)
§ Fix Along Class Boundaries
§ Retrain on New Labeled Data
§ Game-ify Labeling Process
§ Enables Crowd Sourcing
22. AGENDA
Part 0: Latest PipelineAI Research
§ Package, Deploy, and Tune Both Model + Runtime
§ Deploy Models and Experiments Safely to Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
23. SHIFT TRAFFIC TO MAX(REVENUE)
§ Shift Traffic to Winning Model using AI Bandit Algorithms
24. SHIFT TRAFFIC TO MIN(CLOUD COST)
§ Real-Time Cost Per Prediction
§ Across Clouds & On-Premise
§ Bandit-based Explore/Exploit
25. AGENDA
Part 0: Latest PipelineAI Research
Part 1: Optimize TensorFlow Model Training
Part 2: Optimize TensorFlow Model Serving
26. AGENDA
Part 1: Optimize TensorFlow Model Training
§ GPUs and TensorFlow
§ Train, Inspect, and Debug TensorFlow Models
§ TensorFlow Distributed Model Training on a Cluster
§ Optimize Training with JIT XLA Compiler
28. SETUP ENVIRONMENT
§ Step 1: Browse to the following:
http://allocator.community.pipeline.ai/allocate
§ Step 2: Browse to the following:
http://<ip-address>
§ Step 3: Browse around.
I will provide a Jupyter Username/Password soon.
Need Help?
Use the Chat!
30. LET’S EXPLORE OUR ENVIRONMENT
§ Navigate to the following notebook:
01_Explore_Environment
§ https://github.com/PipelineAI/pipeline/tree/master/
gpu.ml/notebooks
32. BREAK
§ Please 🌟 this GitHub Repo!
§ All slides, code, notebooks, and Docker images here:
https://github.com/PipelineAI/pipeline/tree/master/gpu.ml
Need Help?
Use the Chat!
33. SETTING UP TENSORFLOW WITH GPUS
§ Very Painful!
§ Especially inside Docker
§ Use nvidia-docker
§ Especially on Kubernetes!
§ Use Kubernetes 1.8+
§ http://pipeline.ai for GitHub + DockerHub Links
34. GPU HALF-PRECISION SUPPORT
§ FP32 is “Full Precision”, FP16 is “Half Precision”
§ Supported by Pascal P100 (2016) and Volta V100 (2017)
§ Half-Precision is OK for Approximate Deep Learning Use Cases
§ Fit Two(2) FP16’s into FP32 GPU Cores for 2x Throughput!
You Can Set
TF_FP16_MATMUL_USE_FP32_COMPUTE=0
on GPU w/ Compute Capability(CC) 5.3+
35. VOLTA V100 (2017) VS. PASCAL P100 (2016)
§ 84 Streaming Multiprocessors (SM’s)
§ 5,376 GPU Cores
§ 672 Tensor Cores (ie. Google TPU)
§ Mixed FP16/FP32 Precision
§ More Shared Memory
§ New L0 Instruction Cache
§ Faster L1 Data Cache
§ V100 vs. P100 Performance
§ 12x TFLOPS @ Peak Training
§ 6x Inference Throughput
36. V100 AND CUDA 9
§ Independent Thread Scheduling - Finally!!
§ Similar to CPU fine-grained thread synchronization semantics
§ Allows GPU to yield execution of any thread
§ Still Optimized for SIMT (Same Instruction Multiple Thread)
§ SIMT units automatically scheduled together
§ Explicit Synchronization
P100 V100
37. GPU CUDA PROGRAMMING
§ Barbaric, But Fun Barbaric
§ Must Know Hardware Very Well
§ Hardware Changes are Painful
§ Use the Profilers & Debuggers
38. CUDA STREAMS
§ Asynchronous I/O Transfer
§ Overlap Compute and I/O
§ Keeps GPUs Saturated
§ Fundamental to Queue Framework in TensorFlow
39. LET’S SEE WHAT THIS THING CAN DO!
§ Navigate to the following notebook:
01a_Explore_GPU
01b_Explore_Numba
§ https://github.com/PipelineAI/pipeline/tree/master/
gpu.ml/notebooks
40. AGENDA
Part 1: Optimize TensorFlow Model Training
§ GPUs and TensorFlow
§ Train, Inspect, and Debug TensorFlow Models
§ TensorFlow Distributed Model Training on a Cluster
§ Optimize Training with JIT XLA Compiler
41. TRAINING TERMINOLOGY
§ Tensors: N-Dimensional Arrays
§ ie. Scalar, Vector, Matrix
§ Operations: MatMul, Add, SummaryLog,…
§ Graph: Graph of Operations (DAG)
§ Session: Contains Graph(s)
§ Feeds: Feed Inputs into Placeholder
§ Fetches: Fetch Output from Operation
§ Variables: What We Learn Through Training
§ aka “Weights”, “Parameters”
§ Devices: Hardware Device (GPU, CPU, TPU, ...)
-TensorFlow-
Trains
Variables
-User-
Fetches
Outputs
-User-
Feeds
Inputs
-TensorFlow-
Performs
Operations
-TensorFlow-
Flows
Tensors
with tf.device(“/cpu:0,/gpu:15”):
43. TENSORFLOW MODEL
§ MetaGraph
§ Combines GraphDef and Metadata
§ GraphDef
§ Architecture of your model (nodes, edges)
§ Metadata
§ Asset: Accompanying assets to your model
§ SignatureDef: Maps external : internal tensors
§ Variables
§ Stored separately during training (checkpoint)
§ Allows training to continue from any checkpoint
§ Variables are “frozen” into Constants when preparing for inference
GraphDef
x
W
mul add
b
MetaGraph
Metadata
Assets
SignatureDef
Tags
Version
Variables:
“W” : 0.328
“b” : -1.407
44. BATCH NORMALIZATION (2015)
§ Each Mini-Batch May Have Wildly Different Distributions
§ Normalize per Batch (and Layer)
§ Faster Training, Learns Quicker
§ Final Model is More Accurate
§ TensorFlow is already on 2nd Generation Batch Algorithm
§ First-Class Support for Fusing Batch Norm Layers
§ Final mean + variance Are Folded Into Our Graph Later
-- (Almost)Always Use Batch Normalization! --
z = tf.matmul(a_prev, W)
a = tf.nn.relu(z)
a_mean, a_var = tf.nn.moments(a, [0])
scale = tf.Variable(tf.ones([depth/channels]))
beta = tf.Variable(tf.zeros ([depth/channels]))
bn = tf.nn.batch_normalizaton(a, a_mean, a_var,
beta, scale, 0.001)
45. DROPOUT (2014)
§ Training Technique
§ Prevents Overfitting
§ Combines Exponential Set of Diff Neural Architectures
§ Inherent Ensembling, If You Will
§ Expressed as Probability Percentage (ie. 50%)
§ Boost Other Weights During Validation & Prediction
Training
Dropout
Validation &
Prediction
Dropout
0%
Dropout
50%
Dropout
46. OPTIMIZE GRAPH EXECUTION ORDER
§ https://github.com/yaroslavvb/stuff
"Linearize” Causes
TF to Minimize
Graph
Memory Usage.
This is Useful on
Single GPU with
Relatively Low
RAM.
48. DON’T USE FEED_DICT!!
§ feed_dict Requires Python <-> C++ Serialization
§ Not Optimized for Production Ingestion Pipelines
§ Retrieves Next Batch After Current Batch is Done
§ Single-Threaded, Synchronous
§ CPUs/GPUs Not Fully Utilized!
§ Use Queue or Dataset API
sess.run(train_step, feed_dict={…}
49. QUEUES
§ More than traditional Queue
§ Uses CUDA Streams
§ Perform I/O, pre-processing, cropping, shuffling, …
§ Pull from HDFS, S3, Google Storage, Kafka, ...
§ Combine many small files into large TFRecord files
§ Use CPUs to free GPUs for compute
§ Helps saturate CPUs and GPUs
50. QUEUE CAPACITY PLANNING
§ batch_size
§ # examples / batch (ie. 64 jpg)
§ Limited by GPU RAM
§ num_processing_threads
§ CPU threads pull and pre-process batches of data
§ Limited by CPU Cores
§ queue_capacity
§ Limited by CPU RAM (ie. 5 * batch_size)
51. DETECT UNDERUTILIZED CPUS, GPUS
§ Instrument training code to generate “timelines”
§ Analyze with Google Web
Tracing Framework (WTF)
§ Monitor CPU with `top`, GPU with `nvidia-smi`
http://google.github.io/tracing-framework/
from tensorflow.python.client import timeline
trace =
timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
trace_file.write(
trace.generate_chrome_trace_format(show_memory=True))
52. LET’S FEED DATA WITH A QUEUE
§ Navigate to the following notebook:
02_Feed_Queue_HDFS
§ https://github.com/PipelineAI/pipeline/tree/master/
gpu.ml/notebooks
54. BREAK
§ Please 🌟 this GitHub Repo!
§ All slides, code, notebooks, and Docker images here:
https://github.com/PipelineAI/pipeline/tree/master/gpu.ml
Need Help?
Use the Chat!
55. LET’S TRAIN A MODEL (CPU)
§ Navigate to the following notebook:
03_Train_Model_CPU
§ https://github.com/PipelineAI/pipeline/tree/master/
gpu.ml/notebooks
56. LET’S TRAIN A MODEL (GPU)
§ Navigate to the following notebook:
03a_Train_Model_GPU
§ https://github.com/PipelineAI/pipeline/tree/master/
gpu.ml/notebooks
57. TENSORFLOW DEBUGGER
§ Step through Operations
§ Inspect Inputs and Outputs
§ Wrap Session in Debug Session
sess = tf.Session(config=config)
sess =
tf_debug.LocalCLIDebugWrapperSession(sess)
58. LET’S DEBUG A MODEL
§ Navigate to the following notebook:
04_Debug_Model
§ https://github.com/PipelineAI/pipeline/tree/master/
gpu.ml/notebooks
59. AGENDA
Part 1: Optimize TensorFlow Model Training
§ GPUs and TensorFlow
§ Train, Inspect, and Debug TensorFlow Models
§ TensorFlow Distributed Model Training on a Cluster
§ Optimize Training with JIT XLA Compiler
60. SINGLE NODE, MULTI-GPU TRAINING
§ cpu:0
§ By default, all CPUs
§ Requires extra config to target a CPU
§ gpu:0..n
§ Each GPU has a unique id
§ TF usually prefers a single GPU
§ xla_cpu:0, xla_gpu:0..n
§ “JIT Compiler Device”
§ Hints TensorFlow to attempt JIT Compile
with tf.device(“/cpu:0”):
with tf.device(“/gpu:0”):
with tf.device(“/gpu:1”):
GPU 0 GPU 1
61. DISTRIBUTED, MULTI-NODE TRAINING
§ TensorFlow Automatically Inserts Send and Receive Ops into Graph
§ Parameter Server Synchronously Aggregates Updates to Variables
§ Nodes with Multiple GPUs will Pre-Aggregate Before Sending to PS
Worker0 Worker0
Worker1
Worker
0
Worker
1
Worker
2
gpu0 gpu1
gpu2 gpu3
gpu0 gpu1
gpu2 gpu3
gpu0 gpu1
gpu2 gpu3
gpu0
gpu1
gpu0
gpu0
Single
Node
Multiple
Nodes
62. DATA PARALLEL VS MODEL PARALLEL
§ Data Parallel (“Between-Graph Replication”)
§ Send exact same model to each device
§ Each device operates on partition of data
§ ie. Spark sends same function to many workers
§ Each worker operates on their partition of data
§ Model Parallel (“In-Graph Replication”)
§ Send different partition of model to each device
§ Each device operates on all data
Very Difficult!, But
Required for Large Models.
(GPU RAM Limitation)
63. SYNCHRONOUS VS. ASYNCHRONOUS
§ Synchronous
§ Nodes compute gradients
§ Nodes update Parameter Server (PS)
§ Nodes sync on PS for latest gradients
§ Asynchronous
§ Some nodes delay in computing gradients
§ Nodes don’t update PS
§ Nodes get stale gradients from PS
§ May not converge due to stale reads!
64. CHIEF WORKER
§ Worker Task 0 is Usually the Chief
§ Task 0 is guaranteed to exist
§ Performs Maintenance Tasks
§ Writes log summaries
§ Instructs PS to checkpoint vars
§ Performs PS health checks
§ (Re-)Initialize variables at (re-)start of training
65. NODE AND PROCESS FAILURES
§ Checkpoint to Persistent Storage (HDFS, S3)
§ Use MonitoredTrainingSession and Hooks
§ Use a Good Cluster Orchestrator (ie. Kubernetes,Mesos)
§ Understand Failure Modes and Recovery States
Stateless, Not Bad: Training Continues Stateful, Bad: Training Must Stop Dios Mio! Long Night Ahead…
66. USE ESTIMATOR AND EXPERIMENT APIS
§ Simplify Model Building
§ Provide Clear Path to Production
§ Enable Rapid Model Experiments
§ Provide Flexible Parameter Tuning
§ Enable Downstream Optimizing & Serving Infra( )
§ Nudge Users to Best Practices Through Opinions
§ Provide Hooks/Callbacks to Override Opinions
§ Unified API for Local and Distributed TensorFlow
https://arxiv.org/pdf/1708.02637.pdf
67. ESTIMATOR API
§ “Train-to-Serve” Design
§ Create Custom - or Use a Canned Estimator
§ Hides Session, Graph, Layers, Iterative Loops (Train, Eval, Predict)
§ Hooks for All Phases of Model Training and Evaluation
§ Load Input: input_fn()
§ Train: model_fn() and train()
§ Evaluate: evaluate()
§ Save and Export: export_savedmodel()
§ Predict: predict() Uses sess.run() Slow Predictions!
Example:
https://github.com/GoogleCloudPlatform/
cloudml-samples/blob/master/census/
customestimator/
68. LAYERS API
§ Standalone Layer or Entire Sub-Graphs
§ Functions of Tensor Inputs & Outputs
§ Mix and Match with Operations
§ Assumes 1st Dimension is Batch Size
§ Handles One (1) to Many (*) Inputs
§ Special Types of Layers
§ Loss per Mini-Batch
§ Accuracy and MSE Track Across Mini-Batches
69. CANNED ESTIMATORS
§ Commonly-Used Estimators
§ Pre-Tested and Pre-Tuned
§ DNNClassifer, TensorForestEstimator
§ Always Use Canned Estimators If Possible
§ Reduced Lines of Code, Complexity, and Bugs
§ Use FeatureColumns to Define & Create Features
Custom vs. Canned
@ Google, August, 2017
70. FEATURECOLUMN ABSTRACTION
§ Used by Canned Estimator
§ Simplifies Input Ingestion
§ Declarative Way to Specify Model Training Inputs
§ Converts Sparse Features to Dense Tensors
§ Sparse Features: Query Keyword, Url, ProductID,…
§ Wide/Linear Models Use Feature-Crossing
§ Deep Models Use Embeddings
71. SINGLE VS. MULTI-OBJECTIVES + HEADS
§ Single-Objective Estimator
§ Single classification prediction
§ Multi-Objective Estimator
§ Two (2) classification predictions
§ Or One (1) classification prediction + One(1) final layer
§ Multiple Heads Are Used to Ensemble Models
§ Treats neural network as a feature engineering step!
72. EXPERIMENT API
§ Easier-to-Use Distributed TensorFlow
§ Combines Estimator with input_fn()
§ Used for Training, Evaluation, & Hyper-Parameter Tuning
§ Distributed Training Default to Data-Parallel & Async
§ Cluster Configuration is Fixed at Start of Training Job
§ No Auto-Scaling Allowed!!
73. ESTIMATOR & EXPERIMENT CONFIGS
§ TF_CONFIG
§ Special environment variable for config
§ Defines ClusterSpec in JSON incl. master, workers, PS’s
§ Must set ”{environment”:“cloud”} for distributed mode
§ RunConfig: Defines checkpoint interval, output directory, …
§ HParams: Hyper-parameter tuning parameters and ranges
§ learn_runner creates RunConfig before calling run() & tune()
§ schedule is set based on {”task”:{”type”}}
§ Set to train_and_evaluate for local, single-node training
TF_CONFIG=
'{
"environment": "cloud",
"cluster":
{
"master":["worker0:2222”],
"worker":["worker1:2222"],
"ps": ["ps0:2222"]
},
"task": {"type": "ps",
"index": "0"}
}'
74. SEPARATE TRAINING + EVALUATION
§ Separate Training and Evaluation Clusters
§ Evaluate Upon Checkpoint
§ Avoid Resource Contention
§ Let Training Continue in Parallel with Evaluation
Training
Cluster
Evaluation
Cluster
Parameter Server
Cluster
75. LET’S TRAIN DISTRIBUTED TENSORFLOW
§ Navigate to the following notebook:
05_Train_Model_Distributed_CPU
or 05a_Train_Model_Distributed_GPU
§ https://github.com/PipelineAI/pipeline/tree/master/
gpu.ml/notebooks
77. BREAK
§ Please 🌟 this GitHub Repo!
§ All slides, code, notebooks, and Docker images here:
https://github.com/PipelineAI/pipeline/tree/master/gpu.ml
Need Help?
Use the Chat!
78. AGENDA
Part 1: Optimize TensorFlow Model Training
§ GPUs and TensorFlow
§ Train, Inspect, and Debug TensorFlow Models
§ TensorFlow Distributed Model Training on a Cluster
§ Optimize Training with JIT XLA Compiler
79. XLA FRAMEWORK
§ Accelerated Linear Algebra (XLA)
§ Goals:
§ Reduce reliance on custom operators
§ Improve execution speed
§ Improve memory usage
§ Reduce mobile footprint
§ Improve portability
§ Helps TF Stay Flexible and Performant
80. XLA HIGH LEVEL OPTIMIZER (HLO)
§ Compiler Intermediate Representation (IR)
§ Independent of source and target language
§ Define Graphs using HLO Language
§ XLA Step 1 Emits Target-Independent HLO
§ XLA Step 2 Emits Target-Dependent LLVM
§ LLVM Emits Native Code Specific to Target
§ Supports x86-64, ARM64 (CPU), and NVPTX (GPU)
81. JIT COMPILER
§ Just-In-Time Compiler
§ Built on XLA Framework
§ Goals:
§ Reduce memory movement – especially useful on GPUs
§ Reduce overhead of multiple function calls
§ Similar to Spark Operator Fusing in Spark 2.0
§ Unroll Loops, Fuse Operators, Fold Constants, …
§ Scope to session, device, or `with jit_scope():`
82. VISUALIZING JIT COMPILER IN ACTION
Before After
Google Web Tracing Framework:
http://google.github.io/tracing-framework/
from tensorflow.python.client import timeline
trace =
timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
trace_file.write(
trace.generate_chrome_trace_format(show_memory=True))
84. LET’S TRAIN WITH XLA CPU
§ Navigate to the following notebook:
06_Train_Model_XLA_CPU
§ https://github.com/PipelineAI/pipeline/tree/master/
gpu.ml/notebooks
85. LET’S TRAIN WITH XLA GPU
§ Navigate to the following notebook:
06a_Train_Model_XLA_GPU
§ https://github.com/PipelineAI/pipeline/tree/master/
gpu.ml/notebooks
86. AGENDA
Part 0: Latest PipelineAI Research
Part 1: Optimize TensorFlow Model Training
Part 2: Optimize TensorFlow Model Serving
87. AGENDA
Part 2: Optimize TensorFlow Model Serving
§ AOT XLA Compiler and Graph Transform Tool
§ Key Components of TensorFlow Serving
§ Deploy Optimized TensorFlow Model
§ Optimize TensorFlow Serving Runtime
88. AOT COMPILER
§ Standalone, Ahead-Of-Time (AOT) Compiler
§ Built on XLA framework
§ tfcompile
§ Creates executable with minimal TensorFlow Runtime needed
§ Includes only dependencies needed by subgraph computation
§ Creates functions with feeds (inputs) and fetches (outputs)
§ Packaged as cc_libary header and object files to link into your app
§ Commonly used for mobile device inference graph
§ Currently, only CPU x86-64 and ARM are supported - no GPU
89. GRAPH TRANSFORM TOOL (GTT)
§ Post-Training Optimization to Prepare for Inference
§ Remove Training-only Ops (checkpoint, drop out, logs)
§ Remove Unreachable Nodes between Given feed -> fetch
§ Fuse Adjacent Operators to Improve Memory Bandwidth
§ Fold Final Batch Norm mean and variance into Variables
§ Round Weights/Variables to improve compression (ie. 70%)
§ Quantize (FP32 -> INT8) to Speed Up Math Operations
92. AFTER STRIPPING UNUSED NODES
§ Optimizations
§ strip_unused_nodes
§ Results
§ Graph much simpler
§ File size much smaller
93. AFTER REMOVING UNUSED NODES
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ Results
§ Pesky nodes removed
§ File size a bit smaller
94. AFTER FOLDING CONSTANTS
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ Results
§ Placeholders (feeds) -> Variables*
(*Why Variables and not Constants?)
95. AFTER FOLDING BATCH NORMS
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ Results
§ Graph remains the same
§ File size approximately the same
96. AFTER QUANTIZING WEIGHTS
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights
§ Results
§ Graph is same, file size is smaller, compute is faster
97. WEIGHT QUANTIZATION
§ FP16 and INT8 Are Smaller and Computationally Simpler
§ Weights/Variables are Constants
§ Easy to Linearly Quantize
98. LET’S OPTIMIZE FOR INFERENCE
§ Navigate to the following notebook:
07_Optimize_Model*
*Why just CPU version? Why not GPU?
§ https://github.com/PipelineAI/pipeline/tree/master/
gpu.ml/notebooks
100. ACTIVATION QUANTIZATION
§ Activations Not Known Ahead of Time
§ Depends on input, not easy to quantize
§ Requires Additional Calibration Step
§ Use a “representative” dataset
§ Per Neural Network Layer…
§ Collect histogram of activation values
§ Generate many quantized distributions with different saturation thresholds
§ Choose threshold to minimize…
KL_divergence(ref_distribution, quant_distribution)
§ Not Much Time or Data is Required (Minutes on Commodity Hardware)
102. LET’S OPTIMIZE FOR INFERENCE
§ Navigate to the following notebook:
08_Optimize_Model_Activations
§ https://github.com/PipelineAI/pipeline/tree/master/
gpu.ml/notebooks
104. AGENDA
Part 2: Optimize TensorFlow Model Serving
§ AOT XLA Compiler and Graph Transform Tool
§ Key Components of TensorFlow Serving
§ Deploy Optimized TensorFlow Model
§ Optimize TensorFlow Serving Runtime
105. MODEL SERVING TERMINOLOGY
§ Inference
§ Only Forward Propagation through Network
§ Predict, Classify, Regress, …
§ Bundle
§ GraphDef, Variables, Metadata, …
§ Assets
§ ie. Map of ClassificationID -> String
§ {9283: “penguin”, 9284: “bridge”}
§ Version
§ Every Model Has a Version Number (Integer)
§ Version Policy
§ ie. Serve Only Latest (Highest), Serve Both Latest and Previous, …
106. TENSORFLOW SERVING FEATURES
§ Supports Auto-Scaling
§ Custom Loaders beyond File-based
§ Tune for Low-latency or High-throughput
§ Serve Diff Models/Versions in Same Process
§ Customize Models Types beyond HashMap and TensorFlow
§ Customize Version Policies for A/B and Bandit Tests
§ Support Request Draining for Graceful Model Updates
§ Enable Request Batching for Diff Use Cases and HW
§ Supports Optimized Transport with GRPC and Protocol Buffers
107. PREDICTION SERVICE
§ Predict (Original, Generic)
§ Input: List of Tensor
§ Output: List of Tensor
§ Classify
§ Input: List of tf.Example (key, value) pairs
§ Output: List of (class_label: String, score: float)
§ Regress
§ Input: List of tf.Example (key, value) pairs
§ Output: List of (label: String, score: float)
109. MULTI-HEADED INFERENCE
§ Inputs Pass Through Model Once
§ Model Returns Multiple Predictions or “Heads” including:
1. Human-readable prediction (ie. “penguin”, “church”,…)
2. Final layer of scores (float vector)
§ Final Layer of floats Pass to the Next Model in Ensemble
§ Optimizes Bandwidth, CPU/GPU, Latency, Memory
§ Enables Complex Model Composing and Ensembling
110. BUILD YOUR OWN MODEL SERVER
§ Adapt GRPC(Google) <-> HTTP (REST of the World)
§ Perform Batch Inference vs. Request/Response
§ Handle Requests Asynchronously
§ Support Mobile, Embedded Inference
§ Customize Request Batching
§ Add Circuit Breakers, Fallbacks
§ Control Latency Requirements
§ Reduce Number of Moving Parts
#include
“tensorflow_serving/model_servers/server_core.h”
class MyTensorFlowModelServer {
ServerCore::Options options;
// set options (model name, path, etc)
std::unique_ptr<ServerCore> core;
TF_CHECK_OK(
ServerCore::Create(std::move(options), &core)
);
}
Compile and Link with
libtensorflow.so
111. NVIDIA TENSOR-RT RUNTIME
§ Post-Training Model Optimizations
§ Similar to TF Graph Transform Tool
§ GPU-Optimized Prediction Runtime
§ Alternative to TensorFlow Serving
§ PipelineAI Supports TensorRT!
112. AGENDA
Part 2: Optimize TensorFlow Model Serving
§ AOT XLA Compiler and Graph Transform Tool
§ Key Components of TensorFlow Serving
§ Deploy Optimized TensorFlow Model
§ Optimize TensorFlow Serving Runtime
113. SAVED MODEL FORMAT
§ Navigate to the following notebook:
09_Deploy_Optimized_Model
§ https://github.com/PipelineAI/pipeline/tree/master/
gpu.ml/notebooks
114. AGENDA
Part 2: Optimize TensorFlow Model Serving
§ AOT XLA Compiler and Graph Transform Tool
§ Key Components of TensorFlow Serving
§ Deploy Optimized TensorFlow Model
§ Optimize TensorFlow Serving Runtime
115. REQUEST BATCH TUNING
§ max_batch_size
§ Enables throughput/latency tradeoff
§ Bounded by RAM
§ batch_timeout_micros
§ Defines batch time window, latency upper-bound
§ Bounded by RAM
§ num_batch_threads
§ Defines parallelism
§ Bounded by CPU cores
§ max_enqueued_batches
§ Defines queue upper bound, throttling
§ Bounded by RAM
Reaching either threshold
will trigger a batch
116. ADVANCED BATCHING & SERVING TIPS
§ Batch Just the GPU/TPU Portions of the Computation Graph
§ Batch Arbitrary Sub-Graphs using Batch / Unbatch Graph Ops
§ Distribute Large Models Into Shards Across TensorFlow Model Servers
§ Batch RNNs Used for Sequential and Time-Series Data
§ Find Best Batching Strategy For Your Data Through Experimentation
§ BasicBatchScheduler: Homogeneous requests (ie Regress or Classify)
§ SharedBatchScheduler: Mixed requests, multi-step, ensemble predict
§ StreamingBatchScheduler: Mixed CPU/GPU/IO-bound Workloads
§ Serve Only One (1) Model Inside One (1) TensorFlow Serving Process
§ Much Easier to Debug, Tune, Scale, and Manage Models in Production.
117. LET’S DEPLOY OPTIMIZED MODEL
§ Navigate to the following notebook:
10_Optimize_Model_Server
§ https://github.com/PipelineAI/pipeline/tree/master/
gpu.ml/notebooks
118. AGENDA
Part 0: Latest PipelineAI Research
Part 1: Optimize TensorFlow Model Training
Part 2: Optimize TensorFlow Model Serving
119. THANK YOU!! QUESTIONS?
§ https://github.com/PipelineAI/pipeline/
§ Please 🌟 this GitHub Repo!
§ All slides, code, notebooks, and Docker images here:
https://github.com/PipelineAI/pipeline/tree/master/gpu.ml
Contact Me
chris@pipeline.ai
@cfregly