2. NTHU-CS VLSI/CAD LAB
Learning to Optimize Tensor Programs
Tianqi Chen et al.
NeurIPS 2018
TVM: End-to-End Compilation Stack for
Deep Learning
Tianqi Chen et al.
MLSys (previously SysML) 2018
3. NTHU-CS VLSI/CAD LAB
Formalize the problem of learning to
optimize tensor programs and summarize its
key characteristics.
Propose a machine learning framework to
solve the problem.
Accelerate the optimization by 2x to 10x
using transfer learning.
4. NTHU-CS VLSI/CAD LAB
Introduction to TVM
Auto Tensor Optimization & Objective
autoTVM
Overview
Statistical Cost Model
Training Objective Function
Exploration Module
Acceleration by Transfer Learning
Experimental Results
Extensions: NAS w/ autoTVM
6. NTHU-CS VLSI/CAD LAB
• Deploy deep learning workloads from high-level
frameworks to diverse hardware back-ends (CPU, GPU,
FPGA)
• Introduce schedule primitives to take advantage of
cross-thread memory reuse, novel hardware intrinsics,
and latency hiding (sketched below)
• Evaluate TVM on a generic FPGA-based accelerator (VTA) to
demonstrate how it can target specialized accelerators
Ref: https://tvm.apache.org/docs/vta/index.html
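To make the schedule primitives concrete, here is a minimal vector-add sketch in TVM's te DSL, following the style of the TVM docs (the shapes, the split factor, and the cuda target are illustrative): splitting a loop and binding the pieces to GPU thread axes is exactly the kind of scheduling decision autoTVM later searches over.

```python
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64)   # schedule primitive: split the loop
s[C].bind(bx, te.thread_axis("blockIdx.x"))    # map outer loop to GPU blocks
s[C].bind(tx, te.thread_axis("threadIdx.x"))   # map inner loop to GPU threads

fadd = tvm.build(s, [A, B, C], target="cuda")  # code-gen for a GPU back-end
```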
8. NTHU-CS VLSI/CAD LAB
Tensor optimization is complicated
We must choose among many possible implementations that differ in threading, memory
reuse, pipelining, and other hardware factors.
HW-framework co-optimization is a
challenge
Even on currently supported hardware, development of DL frameworks and models
is limited by the set of optimized operators in libraries, preventing optimizations
(such as operator fusion) that produce unsupported operators.
We need an automatic tensor optimization
method
9. NTHU-CS VLSI/CAD LAB
Given a program IR $e$, the set of possible
schedules of $e$ ($S_e$), a code generator $g$, and a
real hardware cost function $f$:
Find $s^* = \arg\min_{s \in S_e} f(g(e, s))$
That is, find a schedule $s$ of program IR $e$ that minimizes the hardware cost under a
given code-gen $g$ and real hardware environment $f$ (toy example below)
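A toy, runnable illustration of this objective, with every piece a stand-in: $e$ is just a matrix size, the "schedule" $s$ is just a tile size, $g$ specializes a NumPy matmul, and $f$ times one run.

```python
import time
import numpy as np

def g(e, s):
    """Stand-in code-gen: specialize an e-by-e matmul with tile size s."""
    def prog(A, B):
        C = np.zeros((e, e))
        for i in range(0, e, s):
            for j in range(0, e, s):
                C[i:i + s, j:j + s] = A[i:i + s] @ B[:, j:j + s]
        return C
    return prog

def f(prog, e=256):
    """Stand-in 'real hardware' cost: wall-clock time of one run."""
    A = np.ones((e, e)); B = np.ones((e, e))
    t0 = time.perf_counter()
    prog(A, B)
    return time.perf_counter() - t0

S_e = [16, 32, 64, 128]                        # toy schedule space
s_star = min(S_e, key=lambda s: f(g(256, s)))  # argmin_{s in S_e} f(g(e, s))
print("best tile size:", s_star)
```

Real schedule spaces have billions of points, which is why the rest of the talk replaces this brute-force argmin with a learned search.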
10. NTHU-CS VLSI/CAD LAB
Low experiment cost
HPO trials cost hours or days; running a tensor
program costs only a few seconds
Domain-specific problem structure
HPO treats the problem as a black box;
this work treats it as a white box (the program's AST is visible)
Large quantity of similar operators
Tensor operators resemble one another, so transfer learning is
possible
16. NTHU-CS VLSI/CAD LAB
Train a cost model $\hat{f}(x)$ on a database $\mathcal{D} =
\{(e_i, s_i, c_i)\}$, where $c_i = f(x_i)$ and $x_i = g(e_i, s_i)$
That is, train the cost model on a database recording the hardware
cost of each program IR under a given schedule
Encode the AST ($x$) with one of two approaches
Gradient boosted trees (GBTs) w/ XGBoost
TreeGRU
17. NTHU-CS VLSI/CAD LAB
Use a rank loss to train the cost model $\hat{f}(x)$ on
the database $\mathcal{D} = \{(e_i, s_i, c_i)\}$ (implemented below):
$\sum_{i,j} \log\left(1 + e^{-\operatorname{sign}(c_i - c_j)\,(\hat{f}(x_i) - \hat{f}(x_j))}\right)$
Why not an $\ell_2$ loss?
We only care about the relative order of program
runtimes, not their absolute values
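The loss above is straightforward to write down directly; here is a minimal NumPy sketch (the variable names are ours, not the paper's):

```python
import numpy as np

def rank_loss(f_x, c):
    """sum_{i,j} log(1 + exp(-sign(c_i - c_j) * (f(x_i) - f(x_j))))."""
    diff = f_x[:, None] - f_x[None, :]      # f(x_i) - f(x_j) over all pairs
    sgn = np.sign(c[:, None] - c[None, :])  # which program was actually faster
    return float(np.sum(np.logaddexp(0.0, -sgn * diff)))  # stable log(1 + e^z)

preds = np.array([0.2, 0.9, 0.4])
costs = np.array([1.0, 3.0, 2.0])
# Shifting every prediction by a constant leaves the loss unchanged,
# which is exactly the point: only relative order matters.
assert np.isclose(rank_loss(preds, costs), rank_loss(preds + 10.0, costs))
```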
18. NTHU-CS VLSI/CAD LAB
Given: the set of possible schedules of $e$ ($S_e$)
Find: $s^* \in S_e$ that minimizes $f(g(e, s^*))$
19. NTHU-CS VLSI/CAD LAB
Step 1: pick the next promising batch
Naïve approach: enumerating every $s \in S_e$ to find the top-$b$ schedules is
infeasible
Instead, run parallel simulated annealing (SA) guided by $\hat{f}(g(e, s))$ to collect a set $S$
of top-$b$ candidates
Objective: mutate $s$ toward minimal $\hat{f}(g(e, s))$
Apply $\epsilon$-greedy sampling when picking the top-$b$ (see the sketch below)
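A self-contained sketch of this step under toy assumptions: a 4-knob integer schedule space, a quadratic stand-in for the learned model $\hat{f}$, parallel SA chains, and an $\epsilon$-greedy pick of the batch to measure.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_hat(s):
    """Stand-in for the learned cost model evaluated on g(e, s)."""
    return float(np.sum((s - 3) ** 2))

def neighbor(s):
    """Mutate one schedule knob to propose a nearby schedule."""
    s2 = s.copy()
    s2[rng.integers(len(s2))] = rng.integers(8)
    return s2

def sa_chain(steps=200, temp=1.0, decay=0.98):
    s = rng.integers(0, 8, size=4)           # random starting point in S_e
    best, best_cost = s, f_hat(s)
    for _ in range(steps):
        s2 = neighbor(s)
        delta = f_hat(s2) - f_hat(s)
        if delta < 0 or rng.random() < np.exp(-delta / temp):  # Metropolis rule
            s = s2
            if f_hat(s) < best_cost:
                best, best_cost = s, f_hat(s)
        temp *= decay
    return best, best_cost

# Run parallel chains, keep the model's top-b picks, and epsilon-greedily
# swap a few picks for random schedules to keep exploring.
chains = sorted((sa_chain() for _ in range(32)), key=lambda t: t[1])
eps, b = 0.05, 8
batch = [s if rng.random() > eps else rng.integers(0, 8, size=4)
         for s, _ in chains[:b]]
```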
20. NTHU-CS VLSI/CAD LAB
Step 2: run measurements in the hardware environment
Run $g(e, s)$ for each $s \in S$ to obtain $c_s = f(g(e, s))$
Save each record $\mathcal{D}_s = (e, s, c_s)$ to $\mathcal{D}$ (see the autotvm configuration below)
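In TVM's autotvm, this measurement step is configured with a builder/runner pair (real autotvm API from the tuning tutorials; the run counts are illustrative): each candidate is compiled locally and timed on the target device, averaging repeated runs.

```python
from tvm import autotvm

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),                  # compile g(e, s) locally
    runner=autotvm.LocalRunner(number=5, repeat=3),  # time each candidate on HW
)
```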
21. NTHU-CS VLSI/CAD LAB
Step 3: update the cost model
Retrain the cost model $\hat{f}$ so that $c = \hat{f}(g(e, s))$ fits the updated database $\mathcal{D} =
\{(e, s, c)\}$ (an end-to-end autotvm example follows)
Gradient boosted trees (GBTs) w/ XGBoost
TreeGRU
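Putting the three steps together: the snippet below follows the pattern of TVM's autotvm tutorial. A hypothetical template named "demo/matmul" defines the schedule space $S_e$ via knobs, and XGBTuner runs the pick-measure-update loop with the GBT cost model and the rank loss (reusing measure_option from the previous sketch; "matmul.log" accumulates the database $\mathcal{D}$).

```python
import tvm
from tvm import te, autotvm

@autotvm.template("demo/matmul")  # hypothetical template name
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    # Schedule knobs: these define the search space S_e.
    cfg = autotvm.get_config()
    y, x = s[C].op.axis
    cfg.define_split("tile_y", y, num_outputs=2)
    cfg.define_split("tile_x", x, num_outputs=2)
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, yi, xi)
    return s, [A, B, C]

task = autotvm.task.create("demo/matmul", args=(512, 512, 512, "float32"), target="llvm")
tuner = autotvm.tuner.XGBTuner(task, loss_type="rank")   # GBT cost model + rank loss
tuner.tune(
    n_trial=1000,                                        # measurement budget
    measure_option=measure_option,                       # from the previous sketch
    callbacks=[autotvm.callback.log_to_file("matmul.log")],  # the database D
)
```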
22. NTHU-CS VLSI/CAD LAB
In the real world, $\mathcal{D}$ comes from previously tuned workloads,
which makes it possible to pre-train $\hat{f}$ on a historical database $\mathcal{D}'$
This works because $\hat{f}$ predicts the cost $c$ from an embedding vector of the AST
produced by code-gen $g$
Goal: encode different ASTs into the same
embedding space (snippet below)
Gradient boosted trees w/ XGBoost
TreeGRU
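In autotvm, this transfer is exposed as warm-starting the tuner from logs of earlier workloads (pattern from the TVM tuning tutorials; "history.log" is a hypothetical file name, and task / measure_option come from the sketches above).

```python
import os
from tvm import autotvm

tuner = autotvm.tuner.XGBTuner(task, loss_type="rank")
if os.path.isfile("history.log"):  # records from previously tuned workloads: D'
    tuner.load_history(autotvm.record.load_from_file("history.log"))
tuner.tune(n_trial=1000, measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file("history.log")])
```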
29. NTHU-CS VLSI/CAD LAB
Train a function composition $\hat{f} \circ g(e, s)$ to predict
hardware cost during NAS (sketched below)
$e$ is known once an operation is selected
$s$ can be defined as a hyper-parameter of a supernet
Characteristic of variation
Spatial locality of bounded PEs
Possibly modeled with $\hat{f} \circ g(e, s)$
Characteristic of sparsity
Three-level sub-problems
Model, DRAM access, accelerator
Possibly modeled with $\hat{f} \circ g(e, s)$
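A deliberately speculative sketch of this proposal (every name here is hypothetical, not from autoTVM): during NAS, score each candidate (operation, schedule) pair with the learned predictor $\hat{f} \circ g$ instead of a hardware measurement.

```python
from dataclasses import dataclass
import numpy as np

rng = np.random.default_rng(0)

@dataclass
class Candidate:
    op_ir: int             # e: fixed once an operation is selected
    schedule: int          # s: exposed as a supernet hyper-parameter
    accuracy_proxy: float  # task metric estimated by the NAS supernet

def g(e, s):
    """Stand-in code-gen: map (e, s) to a feature vector."""
    return np.array([e, s, e * s], dtype=float)

def f_hat(x):
    """Stand-in learned cost model over code-gen features."""
    return float(x @ np.array([0.01, 0.05, 0.001]))

def score(c, lam=0.1):
    # Trade task quality against predicted hardware cost: no HW run needed.
    return c.accuracy_proxy - lam * f_hat(g(c.op_ir, c.schedule))

space = [Candidate(int(rng.integers(1, 5)), int(rng.integers(1, 8)), float(rng.random()))
         for _ in range(64)]
best = max(space, key=score)
```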
30. NTHU-CS VLSI/CAD LAB
Characteristic of variation
RRAM cell-to-cell variation ($R_{on}$)
Intrinsic ADC offset (process variation)
Source: RRAMedy: Protecting ReRAM-based Neural Network from Permanent and Soft Faults During Its Lifetime
31. NTHU-CS VLSI/CAD LAB
Characteristic of variation
RRAM cell-to-cell variation ($R_{on}$)
Intrinsic ADC offset (process variation)
Spatial locality of bounded PEs
Tiling, loop unrolling, different bit-lines
Possibly modeled with $\hat{f} \circ g(e, s)$
Input: schedule; Output: noise impact score
32. NTHU-CS VLSI/CAD LAB
Characteristic of sparsity
Model accuracy considering weight sparsity
Ref: https://tvm.apache.org/docs/vta/index.html
33. NTHU-CS VLSI/CAD LAB
Characteristic of sparsity
Model accuracy considering weight sparsity
Three-level sub-problems
Model level: model accuracy / weight storage
DRAM-access level: access latency, data bandwidth
Accelerator level: sparse matrix operations
Possibly modeled with $\hat{f} \circ g(e, s)$
Input: schedule and TVM IR; Output: latency or storage
Editor's Notes
Loop unrolling, tiling, operations sharing
Like hyper-parameter optimization, needs explanation
Points
1. Train a cost prediction model from a number of real HW cost measurements
2. Automatically select the best schedule for the target HW
3. Transfer \hat{f} across different programs via AST embedding transfer
Use the embedding of the AST to train \hat{f}
Ref: Compute-in-Memory Chips for Deep Learning: Recent Trends and Prospects, IEEE Circuits and Systems Magazine, Volume 21, Issue 3, third quarter 2021
Noisy Machines: Understanding Noisy Neural Networks and Enhancing Robustness to Analog Hardware Errors Using Distillation, arXiv preprint