
ONNC Intro

Introduction to Open Neural Network Compiler (ONNC)

Published in: Engineering


  1. Compiler Everywhere
     • ONNC - Deep Learning: The first and only commercial compiler for the ONNX deep learning format (http://onnx.ai/supported-tools)
     • Lity - Blockchain: Professional smart contract compiler for consortium and fully private blockchains
     • Knight - C/C++: High-performance C/C++ compiler that speeds up software by 35% ~ 280%
  2. In-depth Compiler Technical Experience
     • Members come from Google, MediaTek, HTC, Andes, and Yahoo
     • Started up in Nov. 2013
     • Team of 25 people (and 2 cats), with three offices in Taipei/Hsinchu (Tokyo, here we come!)
  3. Compiler Everywhere
     • Knight - C/C++: High-performance C/C++ compiler that speeds up software by 35% ~ 280%
  4. By Skymizer: Intelligent Compiler
     • Combining A.I. and compiler technologies: benchmark, compiler, training engine, inference database, optimizer
     • From source code (#include <stdio.h> int main() ...) to the optimal program
     • Fully automatic, with no need to change either SW or HW
  5. We Push Software Faster
     • The best part is, you don't need to change code
     • One-to-Five Year Improvement: 30% ~ 200%
     • Improvement by benchmark:
       | Benchmark     | ODROID-U2 | TK1 |
       |---------------|-----------|-----|
       | AutoBench     | 37%       | 46% |
       | ConsumerBench | 12%       | 12% |
       | NetBench      | 27%       | 30% |
       | TeleBench     | 41%       | 51% |
       | Geomean       | 34%       | 40% |
  6. Compiler Everywhere
     • Lity - Blockchain: Professional smart contract compiler for consortium and fully private blockchains
  7. Naïve Smart Contract
     • Byte code has no DATA section and no BSS section
     • Favors fancy GC rather than useful ARC
     • Uses a type-less language instead of a strongly-typed one in the serious financial area
     • Lacks a diagnostic system
     • Impossible to extend features
     • Ignores the finance industry's real demands
     • Wrong ABI design and language selection
  8. Lity released in open source: http://github.com/cybermiles/lity
  9. Check and Warn!
     • Enable the ERC20 check ("should be view"), inspect the check result, and flag smart contracts with trouble
  10. ENI: A better way to extend Solidity
      • C++ code: class Hello : public EniBase; ENI_C_INTERFACE(hello, Hello)
      • Solidity code: eni("hello", "world")
      • https://github.com/CyberMiles/libeni
      • Examples: https://github.com/CyberMiles/libeni/tree/master/examples
  11. Among C/C++, Java, Python, JavaScript, C#, Objective-C/C++, and Swift, who can provide a strongly-typed, compiler-defined ABI, common API, BRMS, and ARC?
  12. Compiler Everywhere
      • ONNC - Deep Learning: The first and only commercial compiler for the ONNX deep learning format (http://onnx.ai/supported-tools)
      • Lity - Blockchain: Professional smart contract compiler for consortium and fully private blockchains
      • Knight - C/C++: High-performance C/C++ compiler that speeds up software by 35% ~ 280%
  13. Help DLA vendors shrink time-to-market; ensure the executability of ONNX; will be released in open source before the end of July 2018
  14. Deep Learning is Heterogeneous Computing
      • (Chart) Trade-off between flexibility and performance across CPUs, GPU, DSP, FPGA, and ASIC, with relative speedups of roughly x3, x3, x10, and x100
  15. ONNC: connect ONNX to every DLA ASIC
      • The first framework to support DLA features
        – Supports coarse-grain DLA backends
        – Supports instruction scheduling and memory allocation for heterogeneous architectures
      • Goal: the best technology
        – Reduce memory consumption by at least ½ (with no spill and split)
        – Reduce DLA execution time by 90%
  16. Traditional compilation process
      • Given a memory hierarchy, allocate registers/memory to all variables
      • Traditional compilers have three basic phases
        – Instruction scheduling
        – Computation/memory partition
        – Memory/register allocation
      • (Diagram) Variables a, b, c in the sequence a = 1; ... b = a; ... a = 2; c = b; c = a; spread over basic blocks BB1-BB4 and assigned addresses such as 0x10, 0x20, 0x30, with an address reused once a variable's live range ends
  17. From runtime toward compiler
      • Goal: reuse the local memory of the DLA
      • Algorithm
        – For each region, load every value before the entrance of the region and store everything after the exit
        – If we don't have sufficient local memory, shrink the region down to a single layer
      • Observation
        – If we can split the graph horizontally instead of vertically, we can reuse more local memory and eliminate most data movement
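The load-run-store-per-region rule with the shrink-to-fit fallback can be sketched as below; the Region/Layer types, layer names, and byte sizes are hypothetical illustrations of the rule stated on the slide, not ONNC's actual data structures.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A region is a run of layers scheduled together: its inputs are loaded
// into DLA local memory once before entry and its results stored once
// after exit.
struct Layer { std::string name; std::size_t workingSetBytes; };
using Region = std::vector<Layer>;

// Local memory a region needs when executed as one unit (assumption:
// the working sets of all its layers are resident simultaneously).
static std::size_t regionBytes(const Region& r) {
  std::size_t sum = 0;
  for (const Layer& l : r) sum += l.workingSetBytes;
  return sum;
}

// If a region does not fit in local memory, shrink it down to
// single-layer regions (each layer then pays its own load/store trip).
std::vector<Region> shrinkToFit(const Region& region,
                                std::size_t localMemBytes) {
  if (regionBytes(region) <= localMemBytes)
    return {region};              // whole region fits: one load, one store
  std::vector<Region> result;
  for (const Layer& l : region)
    result.push_back(Region{l});  // fall back to per-layer regions
  return result;
}
```

The fewer regions survive, the fewer load/store round trips the DLA pays, which is why keeping a region whole is preferred whenever it fits.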
  18. Traditional compiler vs DLA/ASIC compiler
      • In the CPU world, given an opcode, we can roughly derive the instruction's physical features
        – clock cycle time
        – power consumption
      • In the DLA/ASIC world, clock cycle time may depend on
        – inter-instruction overhead
        – inter-operand overhead
        – operand size
      • For instruction scheduling, liveness changes on every code motion
      • For computation/data partition, liveness changes as well
      • For memory/register allocation, liveness changes after every spill
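The contrast can be illustrated with two toy cost functions; all opcodes and cycle numbers below are made-up assumptions. The point is only the shape of the signatures: the CPU-style cost depends on the opcode alone, while the DLA-style cost also takes the previous instruction and the operand size.

```cpp
#include <cstddef>
#include <string>

// Hypothetical instruction record for illustrating the two cost models.
struct Instr { std::string opcode; std::size_t operandBytes; };

// CPU world: the opcode alone roughly determines the cost.
std::size_t cpuCycles(const Instr& i) {
  return (i.opcode == "mul") ? 3 : 1;
}

// DLA world: the same opcode costs more after a particular predecessor
// (inter-instruction overhead) and scales with operand size.
std::size_t dlaCycles(const Instr& prev, const Instr& cur) {
  std::size_t interInstr =
      (prev.opcode == "conv" && cur.opcode == "pool") ? 10 : 0;
  std::size_t transfer = cur.operandBytes / 64;  // cycles moving operands
  return 5 + interInstr + transfer;
}
```

Because the DLA cost of an instruction depends on its neighbors, every code motion changes the schedule's total cost, which is why scheduling, partitioning, and allocation must be re-evaluated together rather than in one fixed sweep.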
  19. GCC vs LLVM: two different approaches to pass management
      • GCC
        – Fixed order of passes (A, B, C, D); use compiler flags to enable/disable passes (-fa -fno-b -fc -fno-d)
        – Keeps all analysis results until termination
        – Pros: easy to understand and control
        – Cons: inflexible
      • LLVM
        – Dynamic order of passes; every pass must describe its dependencies on the other passes
        – Releases analysis results on time and automatically (e.g., release B immediately, keep C until D is done)
        – Pros: very flexible
        – Cons: difficult to control and predict the order of passes
  20. ONNC PassManager - flexible and easy
      • Dynamic order of passes by dependencies (BFS topological sort)
      • Keeps all analysis results until termination
        – The number of analysis passes is relatively small compared to conventional compilers
        – Saves the development time of understanding life cycles with the other passes
      • DLA needs to re-run passes (retry until success)
        – Liveness changes whenever a new data spill occurs
        – An instruction needs re-scheduling whenever an inter-instruction overhead changes
      • Example: in the dependency lattice where D requires A and B, adding D makes PassManager add A and B automatically
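The add-D-and-get-A-and-B-automatically behavior can be sketched as a BFS over a dependency table followed by a topological sort. This is a minimal stand-in with string pass names, not the real ONNC PassManager API.

```cpp
#include <algorithm>
#include <deque>
#include <map>
#include <set>
#include <string>
#include <vector>

// Resolve the passes needed for one requested pass, then order them so
// that every pass runs after all of its dependencies (assumes no cycles).
std::vector<std::string>
schedule(const std::map<std::string, std::vector<std::string>>& deps,
         const std::string& requested) {
  // BFS from the requested pass collects every transitive dependency.
  std::set<std::string> needed;
  std::deque<std::string> queue{requested};
  while (!queue.empty()) {
    std::string p = queue.front(); queue.pop_front();
    if (!needed.insert(p).second) continue;  // already visited
    auto it = deps.find(p);
    if (it != deps.end())
      for (const std::string& d : it->second) queue.push_back(d);
  }
  // Topological sort: repeatedly emit passes whose deps are all emitted.
  std::vector<std::string> order;
  while (order.size() < needed.size()) {
    for (const std::string& p : needed) {
      if (std::find(order.begin(), order.end(), p) != order.end()) continue;
      bool ready = true;
      auto it = deps.find(p);
      if (it != deps.end())
        for (const std::string& d : it->second)
          if (std::find(order.begin(), order.end(), d) == order.end())
            ready = false;
      if (ready) order.push_back(p);
    }
  }
  return order;
}
```

Requesting only "D" pulls in "A" and "B" (but not "C", which nothing requested depends on), and "D" always comes out last.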
  21. Four Kinds of Passes in ONNC
      • ModulePass
        – The most general of all superclasses that you can use
        – Uses the entire network as a unit
      • TensorPass
        – Uses a Tensor Graph as a unit
        – The Tensor Graph is based on the ONNX IR
      • RegionPass
        – Uses each single-entry-single-exit region in a tensor graph as a unit
        – For example, the groups in GoogLeNet
      • ComputePass
        – Uses a Compute Graph as a unit
      • Interface sketch from the slide:
        // methods in class Pass
        bool run(Module& pModule);
        virtual bool doInitialization(Module& pModule);
        virtual bool doFinalization(Module& pModule);
        // methods in class ModulePass
        virtual bool runOnModule(Module& pModule) = 0;
        // methods in class TensorPass
        virtual bool runOnTensor(TensorGraph& pGraph) = 0;
  22. AnalysisUsage describes the dependencies between Passes
      • PassManager automatically creates all Passes that are used by the other Passes
      • Example (pass D requires A and B):
        /// methods in Pass D. Override
        void D::getAnalysisUsage(AnalysisUsage& pUsage) const {
          pUsage.addRequiredID(A::ID);
          pUsage.addRequiredID(B::ID);
        }
        /// in A.cpp
        INITIALIZE_PASS(A, "pass_a")
        /// in B.cpp
        INITIALIZE_PASS(B, "pass_b")
  23. ONNC IR: the heart of ONNC
      • Core design thought: from the network domain to the compute unit
      • Four phases in the compilation process
        – IRReader: read the ONNX prototxt and build the ONNX IR
        – TensorSel: select the corresponding instructions for the target device
        – MemAlloc: turn symbolic operands into memory addresses
          • instruction scheduling
          • memory partition
          • memory allocation
        – CodeEmit: emit binary code for the target device
      • Pipeline: ComputeNetwork (ONNX) → IRReader → ONNX IR (symbols) → TensorSel → Compute IR (symbols) → MemAlloc → Compute IR (addresses) → CodeEmit → Exec (ISA)
  24. IR Lowering in the compilation process
      • (Diagram) onnx::Node Operators with their Inputs<*>/Outputs<*> values are lowered into ComputeOperators connected by ComputeOperands
  25. DLA/ASIC compilers must use backward analysis
      • Two directions of data-flow analysis algorithms
        – Forward analysis: reachability analysis, availability analysis
        – Backward analysis: liveness analysis
      • A traditional CPU compiler models the allocation problem as a reachability problem and uses forward analysis
      • DLA/ASIC compilers must model the problem as a liveness analysis problem
  26. Backward Live Variable Analysis
      • Equation: LIVEout(B) is the set of variables live on exit from block B
        – LIVEout(nf) = Ø for the final node nf
        – LIVEout(B) = ∪ x∈succ(B) ( UEVar(x) ∪ (LIVEout(x) − VarKill(x)) )
      • UEVar(x): variables used in x before any redefinition in x; VarKill(x): variables defined in x
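A minimal iterative solver for this equation, assuming a hypothetical CFG representation with per-block UEVar and VarKill sets (the block names in the usage are illustrative):

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// One basic block of the CFG with its successor names and local sets.
struct Block {
  std::vector<std::string> succs;
  std::set<std::string> ueVar;    // used before any (re)definition here
  std::set<std::string> varKill;  // defined here
};

// Iterate LIVEout(B) = ∪_{x ∈ succ(B)} (UEVar(x) ∪ (LIVEout(x) − VarKill(x)))
// to a fixed point, starting from Ø everywhere.
std::map<std::string, std::set<std::string>>
liveOut(const std::map<std::string, Block>& cfg) {
  std::map<std::string, std::set<std::string>> out;
  bool changed = true;
  while (changed) {
    changed = false;
    for (const auto& [name, b] : cfg) {
      std::set<std::string> next;
      for (const std::string& s : b.succs) {
        const Block& x = cfg.at(s);
        next.insert(x.ueVar.begin(), x.ueVar.end());
        for (const std::string& v : out[s])
          if (!x.varKill.count(v)) next.insert(v);  // survives the kill set
      }
      if (next != out[name]) { out[name] = next; changed = true; }
    }
  }
  return out;
}
```

For the straight-line chain BB1 (a = 1) → BB2 (b = a) → BB3 (c = b), the solver reports that only a is live out of BB1 and only b is live out of BB2, which matches the reuse intuition on the next slide.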
  27. Liveness analysis of tensors
      • Find the live range of every tensor
      • Leverage the use-define chains of ONNX
      • With the help of a simple liveness analysis, we can reuse local memory and eliminate ½ of memory consumption using greedy allocation
      • (Diagram) In the sequence a = 1; ... b = a; ... a = 2; c = b; c = a; the live ranges of a, b, and c overlap only partially, so their slots can be shared
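Greedy reuse of local memory from live ranges can be sketched as below. The one-slot-per-tensor model and the concrete live ranges are simplifying assumptions (a real allocator works in bytes with alignment); only the first-free-slot idea comes from the slide.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A tensor's live range: defined at defAt, dead after lastUseAt.
struct LiveRange { std::string tensor; int defAt; int lastUseAt; };

// Walk tensors in definition order and give each the first free slot;
// a slot is free again once its previous tensor's last use has passed,
// so non-overlapping live ranges share memory.
std::map<std::string, int>
greedyAllocate(std::vector<LiveRange> ranges) {  // assumed sorted by defAt
  std::map<std::string, int> slotOf;
  std::vector<int> slotFreeAt;  // per slot: time after which it is free
  for (const LiveRange& r : ranges) {
    int slot = -1;
    for (std::size_t s = 0; s < slotFreeAt.size(); ++s)
      if (slotFreeAt[s] < r.defAt) { slot = static_cast<int>(s); break; }
    if (slot < 0) {               // no reusable slot: open a new one
      slot = static_cast<int>(slotFreeAt.size());
      slotFreeAt.push_back(0);
    }
    slotFreeAt[slot] = r.lastUseAt;
    slotOf[r.tensor] = slot;
  }
  return slotOf;
}
```

With three tensors whose ranges only partially overlap, two slots suffice instead of three, which is the kind of saving the ½-memory claim refers to.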
  28. Quadruple - a new way to select an NN compute unit
      • A quadruple is a string representation of
        – the target hardware architecture (micro-architecture, ISA, etc.)
        – the target software environment (ABI, OS, etc.)
        – the target tool (compiler, loader, calibration, etc.)
      • For example
        – LLVM triple: arm-none-linux-gnueabi
        – ONNC quadruple: arm-none-linux-gnueabi-calibration-0.2
      • LLVM triple = HW x SW; ONNC quadruple = HW x SW x Tool
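Reading such a string can be sketched as below; the field layout assumed here (triple fields first, then tool name, then version) is an inference from the one example on the slide, not the official ONNC quadruple grammar.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parsed form of a quadruple-style string.
struct Quadruple {
  std::vector<std::string> triple;  // e.g. {"arm", "none", "linux", "gnueabi"}
  std::string tool;                 // e.g. "calibration"
  std::string version;              // e.g. "0.2"
};

// Split on '-' and peel the trailing tool and version off the
// LLVM-triple-like prefix.
Quadruple parseQuadruple(const std::string& s) {
  std::vector<std::string> fields;
  std::stringstream ss(s);
  for (std::string f; std::getline(ss, f, '-');) fields.push_back(f);
  Quadruple q;
  if (fields.size() >= 2) {
    q.version = fields.back(); fields.pop_back();
    q.tool = fields.back(); fields.pop_back();
  }
  q.triple = fields;
  return q;
}
```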
  29. ONNC supports various target devices
      • Use an LLVM-like triple to select the compiler target backend: compiler, loader, calibration
      • Every Target instance represents a target device
        – contains a cost model
        – contains target-dependent passes
      • The Target instance registers target-dependent passes (TensorSel, MemAlloc, CodeEmit, IR lowering / optimizations) into the PassManager
      • (Diagram) The Driver selects a Target (e.g., NVDLA, BITMAIN, X86_64) with a quadruple; the Target registers its passes into the Pass Manager
  30. TargetBackend controls the phases of lowering
      • Pipeline: ComputeNetwork (ONNX) → IRReader → ONNX IR (symbols) → TensorSel → Compute IR (symbols) → MemAlloc → Compute IR (addresses) → CodeEmit → Exec (ISA)
        // compiler bone
        PassManager pm;
        TargetBackend* backend = TargetRegistry::Lookup("DLA-A");
        backend->addTensorSel(pm);
        backend->addMemAlloc(pm);
        backend->addCodeEmit(pm);
        pm.run();

        // Core/TargetBackend.h
        class TargetBackend {
          virtual void addTensorSel(PassManager& pPM) { return; }
          virtual void addMemAlloc (PassManager& pPM) { return; }
          virtual void addCodeEmit (PassManager& pPM) { return; }
        };

        // ATargetBackend.cpp
        void ABackend::addCodeEmit(PassManager& pPM) {
          pPM.add(createRemoveUnusedNodePass());
          pPM.add(createUpdateOutputInfoPass());
          pPM.add(createTGMemAllocInfoPass(this));
          pPM.add(createTargetLoweringPass(this));
          pPM.add(createTGCodeEmitPass(this));
        }
  31. (Architecture diagram) The Compiler Layer contains the TargetMachine (DLA 1, DLA 2, CPU, LLVM), Runtime, IRBuilder, and PassMngr; Pass specializes into ComputePass, GraphPass, and TensorPass via a template design pattern (virtual member functions); the Module holds the TensorIR (ONNX) and the ComputeIR; targets create passes and lower from TensorOp to MCOp
  32. High-Level Concept of the Architecture Structure
      • DevOp Layer (Umbrella): Quick Regression, CI, Building System, Unittest, Regression
      • Logistic Layer: JSON, Diagnostic, ADT, Support
      • Compiler Layer: IR, Target Machine, Pass Mngr
      • Tooling Layer: Compiler Driver, Arch Explorer, ONNX Reader
      • 3rd party: LLVM, ONNX
  33. Help DLA vendors shrink time-to-market; ensure the executability of ONNX; will be released in open source before the end of July 2018
