1. Compiler Everywhere
ONNC - Deep Learning
The first and only commercial
compiler for the deep learning format
ONNX
http://onnx.ai/supported-tools
Lity - Blockchain
Professional smart contract compiler for
consortium and fully private blockchains
Knight - C/C++
High-performance C/C++ compiler that
speeds up software by 35% ~ 280%
2. In-depth Compiler Technical Experience
Members come from Google, MediaTek, HTC, Andes and Yahoo
Founded in Nov. 2013
Team of 25 people (and 2 cats)
Three offices across Taipei and Hsinchu
(Tokyo, here we come!)
4. By Skymizer Intelligent Compiler
Combining A.I. and compiler technologies
[Diagram: source code (e.g. a C program) flows through the compiler and optimizer, guided by a benchmark, a training engine, and an inference database, to produce the optimal program]
Fully automatic: no need to change either software or hardware
5. We Push Software Faster
Best of all, you don't need to change your code
One-to-five-year improvement: 30% ~ 200%
Improvement     ODROID-U2   TK1
AutoBench       37%         46%
ConsumerBench   12%         12%
NetBench        27%         30%
TeleBench       41%         51%
Geomean         34%         40%
6. Compiler Everywhere
Lity - Blockchain
Professional smart contract compiler for
consortium and fully private blockchains
7. Naïve Smart Contract
Bytecode has no DATA or BSS section
Favors fancy GC rather than useful ARC
Uses a type-less language instead of a strongly typed one
in a serious financial domain
Lacks a diagnostic system
Impossible to extend with new features
Ignores the finance industry's real demands
Wrong ABI design and language selection
9. Check and Warn!
[Screenshot: with the ERC20 check enabled, the compiler warns that a function in a troublesome smart contract should be declared `view`, and reports the check result]
10. ENI
A better way to extend Solidity
C++ code:
class Hello : public EniBase;
ENI_C_INTERFACE(hello, Hello)
Solidity code:
eni("hello", "world")
https://github.com/CyberMiles/libeni
examples:
https://github.com/CyberMiles/libeni/tree/master/examples
12. Compiler Everywhere
ONNC - Deep Learning
The first and only commercial
compiler for the deep learning format
ONNX
http://onnx.ai/supported-tools
Lity - Blockchain
Professional smart contract compiler for
consortium and fully private blockchains
Knight - C/C++
High-performance C/C++ compiler that
speeds up software by 35% ~ 280%
13. Help DLA vendors shrink time-to-market
Ensure the executability of ONNX models
Will be released as open source
before the end of July 2018
15. ONNC: connecting ONNX to every DLA ASIC
• The first framework to support DLA features
– Supports coarse-grain DLA backends
– Supports instruction scheduling and memory allocation for
heterogeneous architectures
• Goal: the best technology
– Reduce memory consumption by at least ½ (with no spills and no splits)
– Reduce DLA execution time by 90%
16. Traditional compilation process
• Given a memory hierarchy, allocate registers/memory to all
variables
• Traditional compilers have three basic phases
– Instruction scheduling
– Computation/memory partitioning
– Memory/register allocation
a = 1;
...
b = a;
...
a = 2;
c = b;
c = a;
[Diagram: live ranges of a, b, and c mapped to addresses 0x10/0x20/0x30; after renaming, copies a1 and a2 span basic blocks BB1–BB4]
17. From runtime toward compiler
• Goal
– reuse local memory of DLA
• Algorithm
– For each region, load every value before the entrance of the region and store everything after the exit
– If there is not sufficient local memory, shrink the region to a single layer
• Observation
– If we can split the graph horizontally instead of vertically, we can reuse more local
memory and eliminate most data movement
18. Traditional compiler vs DLA/ASIC compiler
• In the CPU world, given an opcode, we can roughly derive the instruction's
physical features
– clock cycle time
– power consumption
• In the DLA/ASIC world, clock cycle time may depend on
– inter-instruction overhead
– inter-operand overhead
– operand size
• For instruction scheduling, liveness changes with every code motion
• For computation/data partitioning, liveness changes as well
• For memory/register allocation, liveness changes after every spill
19. GCC vs LLVM
Two different approaches to pass management

GCC
• Fixed order of passes (e.g. A → B → C → D)
• Use compiler flags to enable/disable passes (e.g. -fa -fno-b -fc -fno-d)
• Keep all analysis results until termination
• Pros
– easy to understand and control
• Cons
– inflexible

LLVM
• Dynamic order of passes
• Every pass must describe its dependencies on the other passes
• Release analysis results automatically and on time (e.g. release B immediately; keep C until D is done)
• Pros
– very flexible
• Cons
– difficult to control and predict the order of passes
20. ONNC PassManager - flexible and easy
• Dynamic order of passes by dependencies
• Keep all analysis results till termination
– The number of analysis passes is relatively small compared with conventional compilers
– Saves the development time spent understanding life cycles with the other passes
• DLA needs to re-run passes
– Liveness changes whenever a new data spill occurs
– Instructions need re-scheduling whenever an inter-instruction overhead changes
[Diagram: a lattice of passes A, B, C, D where D requires A and B; adding D makes PassManager add A and B automatically; passes are ordered by a BFS topological sort and retried until success]
21. Four Kinds of Passes in ONNC
• ModulePass
– The most general superclass you can use
– Uses the entire network as a unit
• TensorPass
– Uses a tensor graph as a unit
– The tensor graph is based on ONNX IR
• RegionPass
– Uses each single-entry-single-exit region in a tensor graph as a unit
– For example, the groups in GoogLeNet
• ComputePass
– Uses a compute graph as a unit
[Class hierarchy: ModulePass, TensorPass, RegionPass, and ComputePass all derive from Pass]
// methods in class Pass
bool run(Module& pModule);
virtual bool doInitialization(Module& pModule);
virtual bool doFinalization(Module& pModule);
// methods in class ModulePass
virtual bool runOnModule(Module& pModule) = 0;
// methods in class TensorPass
virtual bool runOnTensor(TensorGraph& pGraph) = 0;
22. AnalysisUsage describes the dependencies between Passes
• PassManager automatically creates all Passes that are used by the other
Passes.
[Diagram: starting from D, PassManager pulls in the required passes A, B, and C]
/// methods in Pass D (override)
void D::getAnalysisUsage(AnalysisUsage& pUsage) const
{
  pUsage.addRequiredID(A::ID);
  pUsage.addRequiredID(B::ID);
}
/// in A.cpp
INITIALIZE_PASS(A, "pass_a")
/// in B.cpp
INITIALIZE_PASS(B, "pass_b")
23. ONNC IR: The heart of ONNC
Core design thought - from network domain to compute unit
• Four phases in the compilation process
– IRReader - reads the ONNX protobuf and builds the ONNX IR
– TensorSel - selects corresponding instructions for target devices
– MemAlloc - turns symbolic operands into memory addresses
• instruction scheduling
• memory partitioning
• memory allocation
– CodeEmit - emits binary code for target devices
[Pipeline: ONNX model → IRReader → ONNX IR (symbols) → TensorSel → Compute IR (symbols) → MemAlloc → Compute IR (addresses) → CodeEmit → executable (ISA)]
24. IR Lowering in compilation process
[Diagram: lowering from the onnx::Node graph (Operators linked by value inputs/outputs) to the Compute IR graph (ComputeOperators linked by ComputeOperands)]
25. DLA/ASIC compilers must use backward analysis
• Two directions of data analysis algorithms
– Forward analysis
• reachability analysis, availability analysis
– Backward analysis
• liveness analysis
• Traditional CPU compilers model the allocation problem
as a reachability problem and use forward analysis
• DLA/ASIC compilers must model it as a liveness
analysis problem and use backward analysis.
26. Backward Live Variable Analysis
• Equation
– LIVEout(B) - variables live on exit from block B:
LIVEout(n_f) = ∅ (for the exit block n_f)
LIVEout(B) = ⋃_{x ∈ SUCC(B)} ( UEVar(x) ∪ (LIVEout(x) − VarKill(x)) )
– UEVar(x): upward-exposed variables of x (used in x before any definition in x)
– VarKill(x): variables defined (killed) in x
27. Liveness analysis of tensors
• Find out the live range of every tensor
• Leverage use-define chain of ONNX
• With the help of simple liveness analysis, we can reuse local memory
and cut memory consumption in half with greedy allocation
a = 1;
...
b = a;
...
a = 2;
c = b;
c = a;
[Diagram: live ranges of a, b, and c]
28. Quadruple - new way to select a NN compute unit
• Quadruple is a string representation that represents
– Target hardware architecture (micro-architecture, ISA, etc.)
– Target software environment (ABI, OS, etc.)
– Target tool (compiler, loader, calibration, etc.)
• For example,
– LLVM triple: arm-none-linux-gnueabi
– ONNC quadruple: arm-none-linux-gnueabi-calibration-0.2
LLVM triple = HW x SW
ONNC quadruple = HW x SW x Tool
29. ONNC supports various target devices
• Use an LLVM-like triple to select the compiler's target backend
– compiler
– loader
– calibration
• Every Target instance represents a target device
– contains a cost model
– contains target-dependent passes
• A Target instance registers its target-dependent passes into the PassManager
[Diagram: the Driver selects a Target (NVDLA, BITMAIN, X86_64) with a quadruple; each Target carries cost models and lowering/optimization IR passes, and registers its target-dependent TensorSel, MemAlloc, and CodeEmit passes into the PassManager]
31.
[Diagram: the Compiler Layer — TargetMachine (backends for DLA 1, DLA 2, and CPU via LLVM), Runtime/IRBuilder, PassManager with Pass subclasses (TensorPass, GraphPass, ComputePass), and Module holding Tensor IR (ONNX) and Compute IR; targets use the template design pattern (virtual member functions) to create passes and lower from TensorOp to MCOp]
32. High-Level Concept of the Architecture
[Diagram: four layers plus 3rd-party components —
DevOp Layer (umbrella): building system, unit test, regression, quick regression, CI;
Logistic Layer: JSON, Diagnostic, ADT, Support;
Compiler Layer: IR, TargetMachine, PassManager;
Tooling Layer: compiler driver, arch explorer, ONNX reader;
3rd party: LLVM, ONNX]
33. Help DLA vendors shrink time-to-market
Ensure the executability of ONNX models
Will be released as open source
before the end of July 2018