2. Goal
1. Introduce Compilation Flow
2. Introduce IR Design
3. Pros and Cons
4. TODO Features
5. Make you want to do something with Glow!?
https://www.books.com.tw/products/0010750924
5. High Level IR (HIR)
1. Dataflow, node-based graph representation.
2. There are two types of variable: placeholder and constant.
All nodes can access all variables in the Module.
Currently Glow cannot distinguish whether a placeholder is an
input or an output, but the importer follows a naming
convention when creating output placeholders, so we check for
the “save_” prefix in the name to work around this issue.
3. Glow uses NHWC order (Caffe2/ONNX use NCHW).
If the backend only supports NCHW, remember to insert
transpose nodes in pre-lowering.
4. Some model formats only support the float type, but the user
may still want to quantize the model. After optimization, the
graph has a quantize node after each input placeholder and a
dequantize node before each output placeholder.
In some cases, however, you would rather perform
quantize/dequantize on the user-program side.
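To make the layout point concrete, here is a minimal sketch of what an NCHW to NHWC conversion does to a dense buffer. It is plain C++ and does not use Glow; the function name `nchwToNhwc` is made up for illustration. Inside Glow the same effect would come from a transpose node rather than hand-written loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Glow): copy a dense float tensor from
// NCHW layout into NHWC layout.
std::vector<float> nchwToNhwc(const std::vector<float> &src, std::size_t n,
                              std::size_t c, std::size_t h, std::size_t w) {
  assert(src.size() == n * c * h * w && "shape does not match buffer size");
  std::vector<float> dst(src.size());
  for (std::size_t in = 0; in < n; ++in)
    for (std::size_t ic = 0; ic < c; ++ic)
      for (std::size_t ih = 0; ih < h; ++ih)
        for (std::size_t iw = 0; iw < w; ++iw) {
          // Row-major offset of element (in, ic, ih, iw) in NCHW.
          std::size_t srcIdx = ((in * c + ic) * h + ih) * w + iw;
          // Row-major offset of the same element in NHWC.
          std::size_t dstIdx = ((in * h + ih) * w + iw) * c + ic;
          dst[dstIdx] = src[srcIdx];
        }
  return dst;
}
```

For a 1x2x2x2 tensor, the two channel planes end up interleaved per pixel, which is the only difference between the two layouts.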
6. Low Level IR (LIR)
1. Instruction-based representation. LIR allows multiple
outputs (e.g. loss function).
2. Operands carry the qualifiers @in, @out, and
@inout, so an instruction in a value's users list
may be writing that value rather than reading it.
3. All LIR instructions can be built-in ops.
4. LIR uses allocactivation/deallocactivation instructions to
mark the live range of each activation. The LIR optimizer
sinks/hoists the alloc/dealloc points to reduce memory
pressure. But in some backends, we insert
allocactivation ourselves because the weights also occupy
memory.
5. Glow provides a simple static memory allocator
(first-fit) for backend use.
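The first-fit strategy mentioned above can be sketched as follows. This is a self-contained illustration, not Glow's allocator class: the names `FirstFitAllocator` and `kAllocFailed` are hypothetical, and alignment is ignored to keep the idea visible.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical sentinel returned when no gap is large enough.
constexpr uint64_t kAllocFailed = UINT64_MAX;

// Hypothetical first-fit allocator over a fixed-size pool (not Glow's class).
class FirstFitAllocator {
public:
  explicit FirstFitAllocator(uint64_t poolSize) : poolSize_(poolSize) {}

  // Return the lowest offset whose free gap can hold `size` bytes.
  uint64_t allocate(uint64_t size) {
    assert(size > 0 && "zero-sized allocations are not supported");
    uint64_t cursor = 0;
    for (const auto &seg : used_) { // segments are sorted by offset
      if (seg.first - cursor >= size)
        break; // the gap before this segment is big enough
      cursor = seg.first + seg.second;
    }
    if (poolSize_ - cursor < size)
      return kAllocFailed;
    used_[cursor] = size;
    return cursor;
  }

  // Release the segment that starts at `offset`.
  void deallocate(uint64_t offset) {
    auto it = used_.find(offset);
    assert(it != used_.end() && "offset was not allocated");
    used_.erase(it);
  }

private:
  uint64_t poolSize_;
  std::map<uint64_t, uint64_t> used_; // offset -> size
};
```

Because every size is known at compile time, a backend can replay the alloc/dealloc sequence of the LIR through such an allocator once and bake the resulting offsets into the generated code.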
7. Backend in Glow
1. tools/ClassGen/Backends/: defines backend-specific Nodes and Instructions for the backend (similar to
LLVM’s TableGen).
2. Node attribute:
● .addOverwrittenInput(“Output”)
● .setHasSideEffects(true)
3. Instr attribute:
● .autoIRGen(): the framework generates the translation code (HIR->LIR) for the backend.
● .inplaceOperand({"Dest", "Batch"})
● .dataParallel()
8. Backend in Glow
2. lib/Backends/: implement derived classes for Backend and CompiledFunction.
a. Backend abstract class
i. bool transformPreLowering/transformPostLowering
ii. bool shouldLower(const Node *N) const;
iii. bool shouldShareBuffers() const;
iv. compile/save
v. isOpSupported(Kinded::Kind opKind, ElemKind elementTy) const;
b. CompiledFunction
i. execute() = 0;
ii. setupRuns(), beforeRun(), afterRun(), tearDownRuns();
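As a rough picture of how the CompiledFunction hooks listed above could fit together, here is a self-contained mock. The class below is a stand-in that only borrows the method names from the slide; it is not Glow's header, and the call order in `runN` (one setup, a before/execute/after triple per run, one teardown) is an assumption. `TracingFunction` and `runN` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the interface named on the slide; NOT Glow's real class.
class CompiledFunction {
public:
  virtual ~CompiledFunction() = default;
  virtual void execute() = 0;
  virtual void setupRuns() {}
  virtual void beforeRun() {}
  virtual void afterRun() {}
  virtual void tearDownRuns() {}
};

// Hypothetical backend function that records which hooks were called.
class TracingFunction : public CompiledFunction {
public:
  std::vector<std::string> log;
  void setupRuns() override { log.push_back("setupRuns"); }
  void beforeRun() override { log.push_back("beforeRun"); }
  void execute() override { log.push_back("execute"); }
  void afterRun() override { log.push_back("afterRun"); }
  void tearDownRuns() override { log.push_back("tearDownRuns"); }
};

// Assumed driver: one-time setup, n inference runs, one-time teardown.
void runN(CompiledFunction &f, int n) {
  assert(n >= 0 && "run count must be non-negative");
  f.setupRuns();
  for (int i = 0; i < n; ++i) {
    f.beforeRun();
    f.execute();
    f.afterRun();
  }
  f.tearDownRuns();
}
```

Under this reading, one-time work such as uploading weights to the device belongs in setupRuns(), while per-inference copies of inputs and outputs belong in beforeRun() and afterRun().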
9. Pros and Cons
Pros:
1. Support training and inference compilation
2. Support quantization feature
3. Support many HIR and LIR optimizations, which also work on custom nodes/instructions.
4. Support “dump DAG”
5. Support ASIC-friendly IR and helper function
6. more...
Cons:
1. Does not support a Python interface, but users can go through ONNXIFI instead.
2. No ASIC backend exists for reference.
3. Some built-in operators are missing.
4. more ...
10. Quantization feature
1. Quantization nodes in HIR.
a. QuantizationProfile
b. Quantize/Dequantize /RescaleQuantized
c. IntLookupTable
d. RowwiseQuantizedFullyConnected
2. Support related optimizations.
a. Quantize(Dequantize(X)) -> RescaleQuantized(X)
b. Dequantize(Quantize(X)) -> X
c. Quantize(Constant) -> Constant
d. PoolingNode(Rescale(X)) -> Rescale(PoolingNode(X)).
e. more...
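The arithmetic behind Quantize, Dequantize, and RescaleQuantized can be sketched in a few lines. This assumes the common affine int8 scheme, real = scale * (q - offset); the struct and function names are illustrative and are not Glow's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical parameter pair for affine quantization.
struct QuantParams {
  float scale;
  int32_t offset;
};

// Map a float to int8: divide by scale, round, add offset, then saturate.
int8_t quantize(float x, QuantParams p) {
  assert(p.scale > 0.0f && "scale must be positive");
  float q = std::round(x / p.scale) + static_cast<float>(p.offset);
  q = std::min(127.0f, std::max(-128.0f, q));
  return static_cast<int8_t>(q);
}

// Map an int8 value back to the float it represents.
float dequantize(int8_t q, QuantParams p) {
  return p.scale * static_cast<float>(static_cast<int32_t>(q) - p.offset);
}

// Convert a value quantized with `from` into the `to` parameters. This is
// what a Quantize(Dequantize(X)) pair computes, done here in one step.
int8_t rescale(int8_t q, QuantParams from, QuantParams to) {
  return quantize(dequantize(q, from), to);
}
```

With scale 0.5 and offset 10, the float 1.0 is stored as 12 and dequantizes back to 1.0 exactly; values far outside the representable range saturate at -128 or 127. Running these two functions in the user program is one way to keep quantize/dequantize out of the compiled graph.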
11. Optimizations
1. Graph optimizer (HIR)
a. DCE, CSE.
b. Optimize specific node.
i. Concat(Slice(X, 0..10), Slice(X, 10..20)) -> X
ii. merge Transpose into MatMul
iii. MaxPool(Relu(X)) -> Relu(MaxPool(X)), so Relu runs on the smaller tensor
iv. merge batch normalization operations. (Inference)
v. more …
2. IR optimizer (LIR)
a. Reduce memory usage
i. sinkAllocas/hoistDealloc/sinkTensorViews
ii. eliminate copy instruction
b. Eliminate redundant instructions
c. Peephole optimizations
d. more...
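Reordering Relu and MaxPool in the graph optimizer is legal because the two operators commute: max is monotonic, so clamping negatives before or after pooling gives the same result. The sketch below checks this on a 1-D signal with non-overlapping windows; `relu` and `maxPool1D` are illustrative helpers, not Glow functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise max(x, 0).
std::vector<float> relu(const std::vector<float> &in) {
  std::vector<float> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i)
    out[i] = std::max(in[i], 0.0f);
  return out;
}

// 1-D max pooling with non-overlapping windows (stride == window).
std::vector<float> maxPool1D(const std::vector<float> &in, std::size_t window) {
  assert(window > 0 && in.size() % window == 0 && "size must be a multiple of window");
  std::vector<float> out;
  for (std::size_t i = 0; i < in.size(); i += window)
    out.push_back(*std::max_element(in.begin() + i, in.begin() + i + window));
  return out;
}
```

Pooling first means Relu touches only one element per window, which is where the saving comes from.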
12. Support ASIC-friendly IR and helper function
1. Slice/InsertTensor/Tile/Gather/Scatter (HIR)
2. TensorView (LIR): a view of an existing tensor that does not allocate any new memory
3. Tensor class: represents a contiguous n-dimensional array (copyRawFrom/copySlice/Transpose)
4. Handles: an easy way to access and operate on a Tensor
/// Create a tensor of type Float, of the shape {4 x 2}.
Tensor inputs(ElemKind::FloatTy, {4, 2});
/// Create a handle to the tensor.
auto I = inputs.getHandle<float>();
/// Store an element to the tensor at index {0, 0}.
I.at({0, 0}) = 13.1;
13. Cons (?)
1. There is only a single ShareBuffers flag to enable/disable the optimization.
2. There is only one memory space in one LIR function. If your backend has two memory spaces in
one LIR function, some ShareBuffers optimizations will generate unwanted results.
[Figure: IRFunction, placeholder, weight]
14. Cons (?)
3. We did not see any advanced optimizations compared with TVM or in-house compilers,
e.g. activation/weight partitioning when memory is insufficient, activation reuse to avoid memory
movement, overlapping computation with data movement, and more.
15. Make you want to do something with Glow!?
You can try to
1. Add a real ASIC backend
2. Add more advanced optimizations
3. Offload subgraphs to different backends
a. How to cooperate with the CPU?
4. Improve JIT performance
a. How to support dynamic input shapes?
b. How to support the ROI pooling layer? (because the layer parameter is runtime information)
5. How to debug an optimized model
6. Advanced scheduler
7. Advanced memory allocator
8. more..