2. NTHU-CS VLSI/CAD LAB
Learning to Optimize Tensor Programs
Tianqi Chen et al.
NeurIPS 2018
TVM: End-to-End Compilation Stack for
Deep Learning
Tianqi Chen et al.
MLSys (previously SysML) 2018
3. NTHU-CS VLSI/CAD LAB
Formalize the problem of learning to
optimize tensor programs and summarize its
key characteristics.
Propose a machine learning framework to
solve the problem.
Accelerate the optimization by 2x to 10x
using transfer learning.
4. NTHU-CS VLSI/CAD LAB
Introduction to TVM
Auto Tensor Optimization & Objective
autoTVM
Overview
Statistical Cost Model
Training Objective Function
Exploration Module
Acceleration by Transfer Learning
Experimental Results
Extensions: NAS w/ autoTVM
6. NTHU-CS VLSI/CAD LAB
• Deploy deep learning workloads from high-level
frameworks to diverse hardware back-ends (CPU, GPU,
FPGA)
• Introduce schedule primitives to take advantage of
cross-thread memory reuse, novel hardware intrinsics,
and latency hiding (sketched below)
• Evaluate TVM on a generic FPGA-based accelerator (VTA) to
demonstrate how it can target specialized accelerators
Ref: https://tvm.apache.org/docs/vta/index.html
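To make the schedule primitives concrete, here is a minimal vector-add sketch in TVM's te DSL, following the style of the TVM docs (the shapes, the split factor, and the cuda target are illustrative): splitting a loop and binding the pieces to GPU thread axes is exactly the kind of scheduling decision autoTVM later searches over.

```python
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64)   # schedule primitive: split the loop
s[C].bind(bx, te.thread_axis("blockIdx.x"))    # map outer loop to GPU blocks
s[C].bind(tx, te.thread_axis("threadIdx.x"))   # map inner loop to GPU threads

fadd = tvm.build(s, [A, B, C], target="cuda")  # code-gen for a GPU back-end
```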
8. NTHU-CS VLSI/CAD LAB
Tensor optimization is complicated
We must choose among many possible implementations that differ in threading, memory
reuse, pipelining, and other hardware factors.
HW-framework co-optimization is a
challenge
Even on currently supported hardware, development of DL frameworks and models
is limited by the set of optimized operators in libraries, preventing optimizations
(such as operator fusion) that produce unsupported operators.
We need an automatic tensor optimization
method
9. NTHU-CS VLSI/CAD LAB
Given a program IR $e$, the set of possible
schedules of $e$ ($S_e$), a code generator $g$, and a
real hardware cost function $f$:
Find $s^* = \arg\min_{s \in S_e} f(g(e, s))$
That is, find a schedule $s$ of program IR $e$ that minimizes the hardware cost under a
given code-gen $g$ and real hardware environment $f$ (toy example below)
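A toy, runnable illustration of this objective, with every piece a stand-in: $e$ is just a matrix size, the "schedule" $s$ is just a tile size, $g$ specializes a NumPy matmul, and $f$ times one run.

```python
import time
import numpy as np

def g(e, s):
    """Stand-in code-gen: specialize an e-by-e matmul with tile size s."""
    def prog(A, B):
        C = np.zeros((e, e))
        for i in range(0, e, s):
            for j in range(0, e, s):
                C[i:i + s, j:j + s] = A[i:i + s] @ B[:, j:j + s]
        return C
    return prog

def f(prog, e=256):
    """Stand-in 'real hardware' cost: wall-clock time of one run."""
    A = np.ones((e, e)); B = np.ones((e, e))
    t0 = time.perf_counter()
    prog(A, B)
    return time.perf_counter() - t0

S_e = [16, 32, 64, 128]                        # toy schedule space
s_star = min(S_e, key=lambda s: f(g(256, s)))  # argmin_{s in S_e} f(g(e, s))
print("best tile size:", s_star)
```

Real schedule spaces have billions of points, which is why the rest of the talk replaces this brute-force argmin with a learned search.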
10. NTHU-CS VLSI/CAD LAB
Low experiment cost
HPO trials cost hours or days; running a tensor
program costs only a few seconds
Domain-specific problem structure
HPO treats the problem as a black box;
this work treats it as a white box (the program's AST is visible)
Large quantity of similar operators
Tensor operators resemble one another, so transfer learning is
possible
16. NTHU-CS VLSI/CAD LAB
Train a cost model $\hat{f}(x)$ on a database $\mathcal{D} =
\{(e_i, s_i, c_i)\}$, where $c_i = f(x_i)$ and $x_i = g(e_i, s_i)$
That is, train the cost model on a database recording the hardware
cost of each program IR under a given schedule
Encode the AST ($x$) with one of two approaches
Gradient boosted trees (GBTs) w/ XGBoost
TreeGRU
17. NTHU-CS VLSI/CAD LAB
Use a rank loss to train the cost model $\hat{f}(x)$ on
the database $\mathcal{D} = \{(e_i, s_i, c_i)\}$ (implemented below):
$\sum_{i,j} \log\left(1 + e^{-\operatorname{sign}(c_i - c_j)\,(\hat{f}(x_i) - \hat{f}(x_j))}\right)$
Why not an $\ell_2$ loss?
We only care about the relative order of program
runtimes, not their absolute values
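The loss above is straightforward to write down directly; here is a minimal NumPy sketch (the variable names are ours, not the paper's):

```python
import numpy as np

def rank_loss(f_x, c):
    """sum_{i,j} log(1 + exp(-sign(c_i - c_j) * (f(x_i) - f(x_j))))."""
    diff = f_x[:, None] - f_x[None, :]      # f(x_i) - f(x_j) over all pairs
    sgn = np.sign(c[:, None] - c[None, :])  # which program was actually faster
    return float(np.sum(np.logaddexp(0.0, -sgn * diff)))  # stable log(1 + e^z)

preds = np.array([0.2, 0.9, 0.4])
costs = np.array([1.0, 3.0, 2.0])
# Shifting every prediction by a constant leaves the loss unchanged,
# which is exactly the point: only relative order matters.
assert np.isclose(rank_loss(preds, costs), rank_loss(preds + 10.0, costs))
```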
18. NTHU-CS VLSI/CAD LAB
Given: the set of possible schedules of $e$ ($S_e$)
Find: $s^* \in S_e$ that minimizes $f(g(e, s^*))$
19. NTHU-CS VLSI/CAD LAB
Step 1: pick the next promising batch
Naïve approach: enumerating every $s \in S_e$ to find the top-$b$ schedules is
infeasible
Instead, run parallel simulated annealing (SA) guided by $\hat{f}(g(e, s))$ to collect a set $S$
of top-$b$ candidates
Objective: mutate $s$ toward minimal $\hat{f}(g(e, s))$
Apply $\epsilon$-greedy sampling when picking the top-$b$ (see the sketch below)
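A self-contained sketch of this step under toy assumptions: a 4-knob integer schedule space, a quadratic stand-in for the learned model $\hat{f}$, parallel SA chains, and an $\epsilon$-greedy pick of the batch to measure.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_hat(s):
    """Stand-in for the learned cost model evaluated on g(e, s)."""
    return float(np.sum((s - 3) ** 2))

def neighbor(s):
    """Mutate one schedule knob to propose a nearby schedule."""
    s2 = s.copy()
    s2[rng.integers(len(s2))] = rng.integers(8)
    return s2

def sa_chain(steps=200, temp=1.0, decay=0.98):
    s = rng.integers(0, 8, size=4)           # random starting point in S_e
    best, best_cost = s, f_hat(s)
    for _ in range(steps):
        s2 = neighbor(s)
        delta = f_hat(s2) - f_hat(s)
        if delta < 0 or rng.random() < np.exp(-delta / temp):  # Metropolis rule
            s = s2
            if f_hat(s) < best_cost:
                best, best_cost = s, f_hat(s)
        temp *= decay
    return best, best_cost

# Run parallel chains, keep the model's top-b picks, and epsilon-greedily
# swap a few picks for random schedules to keep exploring.
chains = sorted((sa_chain() for _ in range(32)), key=lambda t: t[1])
eps, b = 0.05, 8
batch = [s if rng.random() > eps else rng.integers(0, 8, size=4)
         for s, _ in chains[:b]]
```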
20. NTHU-CS VLSI/CAD LAB
Step 2: run measurements in the hardware environment
Run $g(e, s)$ for each $s \in S$ to obtain $c_s = f(g(e, s))$
Save each record $\mathcal{D}_s = (e, s, c_s)$ to $\mathcal{D}$ (see the autotvm configuration below)
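In TVM's autotvm, this measurement step is configured with a builder/runner pair (real autotvm API from the tuning tutorials; the run counts are illustrative): each candidate is compiled locally and timed on the target device, averaging repeated runs.

```python
from tvm import autotvm

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),                  # compile g(e, s) locally
    runner=autotvm.LocalRunner(number=5, repeat=3),  # time each candidate on HW
)
```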
21. NTHU-CS VLSI/CAD LAB
Step 3: update the cost model
Retrain the cost model $\hat{f}$ so that $c = \hat{f}(g(e, s))$ fits the updated database $\mathcal{D} =
\{(e, s, c)\}$ (an end-to-end autotvm example follows)
Gradient boosted trees (GBTs) w/ XGBoost
TreeGRU
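Putting the three steps together: the snippet below follows the pattern of TVM's autotvm tutorial. A hypothetical template named "demo/matmul" defines the schedule space $S_e$ via knobs, and XGBTuner runs the pick-measure-update loop with the GBT cost model and the rank loss (reusing measure_option from the previous sketch; "matmul.log" accumulates the database $\mathcal{D}$).

```python
import tvm
from tvm import te, autotvm

@autotvm.template("demo/matmul")  # hypothetical template name
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    # Schedule knobs: these define the search space S_e.
    cfg = autotvm.get_config()
    y, x = s[C].op.axis
    cfg.define_split("tile_y", y, num_outputs=2)
    cfg.define_split("tile_x", x, num_outputs=2)
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, yi, xi)
    return s, [A, B, C]

task = autotvm.task.create("demo/matmul", args=(512, 512, 512, "float32"), target="llvm")
tuner = autotvm.tuner.XGBTuner(task, loss_type="rank")   # GBT cost model + rank loss
tuner.tune(
    n_trial=1000,                                        # measurement budget
    measure_option=measure_option,                       # from the previous sketch
    callbacks=[autotvm.callback.log_to_file("matmul.log")],  # the database D
)
```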
22. NTHU-CS VLSI/CAD LAB
In the real world, $\mathcal{D}$ comes from previously tuned workloads,
which makes it possible to pre-train $\hat{f}$ on a historical database $\mathcal{D}'$
This works because $\hat{f}$ predicts the cost $c$ from an embedding vector of the AST
produced by code-gen $g$
Goal: encode different ASTs into the same
embedding space (snippet below)
Gradient boosted trees w/ XGBoost
TreeGRU
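In autotvm, this transfer is exposed as warm-starting the tuner from logs of earlier workloads (pattern from the TVM tuning tutorials; "history.log" is a hypothetical file name, and task / measure_option come from the sketches above).

```python
import os
from tvm import autotvm

tuner = autotvm.tuner.XGBTuner(task, loss_type="rank")
if os.path.isfile("history.log"):  # records from previously tuned workloads: D'
    tuner.load_history(autotvm.record.load_from_file("history.log"))
tuner.tune(n_trial=1000, measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file("history.log")])
```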
29. NTHU-CS VLSI/CAD LAB
Train a function composition $\hat{f} \circ g(e, s)$ to predict
hardware cost during NAS (sketched below)
$e$ is known once an operation is selected
$s$ can be defined as a hyper-parameter of a supernet
Characteristic of variation
Spatial locality of bounded PEs
Possibly modeled with $\hat{f} \circ g(e, s)$
Characteristic of sparsity
Three-level sub-problems
Model, DRAM access, accelerator
Possibly modeled with $\hat{f} \circ g(e, s)$
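A deliberately speculative sketch of this proposal (every name here is hypothetical, not from autoTVM): during NAS, score each candidate (operation, schedule) pair with the learned predictor $\hat{f} \circ g$ instead of a hardware measurement.

```python
from dataclasses import dataclass
import numpy as np

rng = np.random.default_rng(0)

@dataclass
class Candidate:
    op_ir: int             # e: fixed once an operation is selected
    schedule: int          # s: exposed as a supernet hyper-parameter
    accuracy_proxy: float  # task metric estimated by the NAS supernet

def g(e, s):
    """Stand-in code-gen: map (e, s) to a feature vector."""
    return np.array([e, s, e * s], dtype=float)

def f_hat(x):
    """Stand-in learned cost model over code-gen features."""
    return float(x @ np.array([0.01, 0.05, 0.001]))

def score(c, lam=0.1):
    # Trade task quality against predicted hardware cost: no HW run needed.
    return c.accuracy_proxy - lam * f_hat(g(c.op_ir, c.schedule))

space = [Candidate(int(rng.integers(1, 5)), int(rng.integers(1, 8)), float(rng.random()))
         for _ in range(64)]
best = max(space, key=score)
```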
30. NTHU-CS VLSI/CAD LAB
Characteristic of variation
RRAM cell-to-cell variation ($R_{on}$)
Intrinsic ADC offset (process variation)
Source: RRAMedy: Protecting ReRAM-based Neural Network from Permanent and Soft Faults During Its Lifetime
31. NTHU-CS VLSI/CAD LAB
Characteristic of variation
RRAM cell-to-cell variation ($R_{on}$)
Intrinsic ADC offset (process variation)
Spatial locality of bounded PEs
Tiling, loop unrolling, different bit-lines
Possibly modeled with $\hat{f} \circ g(e, s)$
Input: schedule; Output: noise impact score
32. NTHU-CS VLSI/CAD LAB
Characteristic of sparsity
Model accuracy considering weight sparsity
Ref: https://tvm.apache.org/docs/vta/index.html
33. NTHU-CS VLSI/CAD LAB
Characteristic of sparsity
Model accuracy considering weight sparsity
Three-level sub-problems
Model level: model accuracy / weight storage
DRAM-access level: access latency, data bandwidth
Accelerator level: sparse matrix operations
Possibly modeled with $\hat{f} \circ g(e, s)$
Input: schedule and TVM IR; Output: latency or storage
Editor's Notes
Loop unrolling, tiling, operations sharing
Like hyper-parameter optimization, needs explanation
Points
1. Train a cost prediction model from a number of real HW cost measurements
2. Automatically select the best schedule for the target HW
3. Transfer \hat{f} across different programs via AST embedding transfer
Use the embedding of the AST to train \hat{f}
Ref: Compute-in-Memory Chips for Deep Learning: Recent Trends and Prospects, IEEE Circuits and Systems Magazine, Volume 21, Issue 3, third quarter 2021
Noisy Machines: Understanding Noisy Neural Networks and Enhancing Robustness to Analog Hardware Errors Using Distillation, arXiv preprint