Facebook Glow Compiler: a rambling walkthrough of the source code
@DeNA
Written: 2018/08/26, 9/16, 9/22, 10/28
Published on SlideShare: 2018/11/29
@Vengineer
Blog (since 2007): Vengineer's musings
 http://blogs.yahoo.co.jp/verification_engineer
SlideShare:
 https://www.slideshare.net/ssuser479fa3
Twitter (since 2009):
@Vengineer
Source-code analysis craftsman
A plug:
I came all this way
just to be teased
by the bearded guy.
That said, the one who
proposed this event was me!
Watch out for
the bearded guy.
One more plug:
code that converts PyTorch to XLA
and runs ResNet-50 on a Cloud TPU,
perhaps?
Saturday, December 1, 2018
Now, the main topic.
What is Glow?
Phase 1:
 Deep-learning frameworks
  - Keras + TensorFlow  (far and away the leader)
  - PyTorch
  - Chainer (big in Japan?)
Phase 2:
 Graph compilers
  - TensorFlow XLA
  - NNVM (Relay) / TVM
  - Glow
Glow: Graph Lowering Compiler
Techniques for Neural Networks
May 2, 2018
https://arxiv.org/abs/1805.00907
Facebook
Glow: A community-driven approach to
AI infrastructure
Sep 13, 2018
https://code.fb.com/ml-applications/glow-a-community-driven-approach-to-ai
-infrastructure/
Facebook
@Scale 2018 Keynote: Glow: A community-driven approach to AI
September 19, 2018
https://atscaleconference.com/videos/scale-2018-keynote-glow-a-community-driven
-approach-to-ai/
Facebook
Now, let's look at
the source code.
$ sudo apt-get install graphviz cmake wget libpng-dev \
    ninja-build clang llvm-5.0 \
    libprotobuf-dev protobuf-compiler
  cmake 3.7.1 or later is required;
  I installed 3.12.1 from source separately.
  llvm 6.0 or 7.0 also seem to work fine.
Preparation
$ git clone https://github.com/pytorch/glow.git
$ cd glow
$ git submodule update --init --recursive
$ mkdir build_Debug
$ cd build_Debug
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug ..
$ ninja all
$ ninja test

Build
In CMakeLists.txt, change
option(GLOW_WITH_OPENCL "Build the OpenCL backend" ON)
to
option(GLOW_WITH_OPENCL "Build the OpenCL backend" OFF)
or pass the following parameter on the command line:
-DGLOW_WITH_OPENCL=OFF

OpenCL is ON by default
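For example, a Debug configuration with the OpenCL backend disabled
(a minimal sketch that just combines this flag with the build commands above):

$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DGLOW_WITH_OPENCL=OFF ..
$ ninja all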
https://github.com/pytorch/glow
Glow : Graph Compiler & Execution Engine
High-Level Graph => Low-Level IR => Machine Code
 
TensorFlow XLA: a JIT compiler (since r1.5)
 TensorFlow graph
  -> converted into an XLA graph
 Optimization, step 1: HLO (High Level Optimizer)
  target-hardware-independent optimizations on the XLA graph
 Optimization, step 2: LLO (Low Level Optimizer)
  target-hardware-dependent optimizations and code generation
  -> executable object for the target hardware
High-Level IR
 - domain-specific optimizations
Low-Level IR
 - memory-related optimizations:
   instruction scheduling
   static memory allocation
   elimination of memory copies
 - machine-dependent code generation
What does Glow actually do?
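As a rough sketch of how these two levels show up in the API (the functions
used here are quoted later in this deck; IR->dump() as a way to inspect the
result is an assumption):

// Build the high-level graph, then lower it into the low-level IR.
Function *F = EE.getModule().createFunction("main");
// ... F->createConv(), F->createRELU(), ... build the high-level graph ...
auto IR = glow::generateAndOptimizeIR(F, /*shouldShareBuffers=*/true);
IR->dump(); // print the low-level, memory-oriented instruction stream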
ExecutionEngine EE(executionBackend);
TrainingConfig TC;
TC.learningRate = 0.001;
TC.momentum = 0.9;
TC.L2Decay = 0.001;
TC.batchSize = minibatchSize;
Function *T = glow::differentiate(F, TC); // <= required for training
EE.compile(CompilationMode::Train, T);    // <= CompilationMode::Train
Example: a look at mnist (it can even train)
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
Tensor imageInputs;
Tensor labelInputs;
Variable *A =
mod.createVariable(ElemKind::FloatTy, {minibatchSize, 28, 28, 1}, "input",
VisibilityKind::Public, false);
Variable *selected =
mod.createVariable(ElemKind::Int64ITy, {minibatchSize, 1}, "selected",
VisibilityKind::Public, false);
unsigned numImages = loadMNIST(imageInputs, labelInputs);
EE.runBatch(numIterations, {A, selected}, {&imageInputs, &labelInputs});
Example: a look at mnist (it can even train)
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
auto *result = F->createSave("return", SM);
EE.compile(CompilationMode::Infer, F); // <= CompilationMode::Infer
Tensor sample(ElemKind::FloatTy, {minibatchSize, 28, 28, 1});
for (int iter = numIterations; iter < numIterations + 10; iter++) {
sample.copyConsecutiveSlices(&imageInputs, minibatchSize * iter);
EE.run({A}, {&sample});
Tensor &res = result->getVariable()->getPayload();
Example: a look at mnist (and of course it can run inference)
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
llvm::cl::opt<BackendKind> executionBackend(
llvm::cl::desc("Backend to use:"),
llvm::cl::values(clEnumValN(BackendKind::Interpreter, "interpreter",
"Use interpreter (default option)"),
clEnumValN(BackendKind::CPU, "cpu", "Use CPU"),
clEnumValN(BackendKind::OpenCL, "opencl", "Use OpenCL")
),
llvm::cl::init(BackendKind::Interpreter),
llvm::cl::cat(mnistCat)
);
The backends: Interpreter (default), CPU, OpenCL
What backends are there?
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
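Because the llvm::cl option above is declared without a flag name of its own,
each enum value becomes its own command-line switch; running the example
presumably looks like this (hypothetical invocation, binary path depends on
your build directory):

$ bin/mnist            # Interpreter (default)
$ bin/mnist -cpu       # CPU (LLVM JIT) backend
$ bin/mnist -opencl    # OpenCL backend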
auto *CV0 = F->createConv("conv", A, 16, 5, 1, 2, 1);
auto *RL0 = F->createRELU("relu", CV0);
auto *MP0 = F->createMaxPool("pool", RL0, 3, 3, 0);
auto *CV1 = F->createConv("conv", MP0, 16, 5, 1, 2, 1);
auto *RL1 = F->createRELU("relu", CV1);
auto *MP1 = F->createMaxPool("pool", RL1, 3, 3, 0);
auto *FCL1 = F->createFullyConnected("fc", MP1, 10);
auto *SM = F->createSoftMax("sm", FCL1, selected);
auto *result = F->createSave("return", SM);
Building the mnist model
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
The Lifetime of
a Glow Instruction
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
Loading a model:
ONNX
Caffe2
PyTorch 1.0
(PyTorch + Caffe2 + Glow)
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
ExecutionEngine EE{BackendKind::Interpreter};
auto &mod = EE.getModule();
Function *F = mod.createFunction("main");
std::string NetFilename("tests/models/onnxModels/simpleConv.onnxtxt");
Variable *graphOutputVar;
Tensor data;
getNCHWData(&data, 1, 1, 3, 3);
ONNXModelLoader onnxLD(NetFilename, {"data"}, {&data}, *F);
graphOutputVar = onnxLD.getSingleOutput();
EE.compile(CompilationMode::Infer, F);
EE.run({}, {});
Loading an ONNX model, compiling it, and running inference
https://github.com/pytorch/glow/blob/master/tests/unittests/onnxImporterTest.cpp#L28
ExecutionEngine EE{BackendKind::Interpreter};
auto &mod = EE.getModule();
Function *F = mod.createFunction("main");
std::string NetDescFilename("tests/models/caffe2Models/predict_net.pbtxt");
std::string NetWeightFilename("tests/models/caffe2Models/init_net.pbtxt");
Variable *output;
Tensor data;
getNCHWData(&data, 1, 1, 3, 3);
Caffe2ModelLoader caffe2LD(NetDescFilename, NetWeightFilename,
                           {"data"}, {&data}, *F);
output = caffe2LD.getSingleOutput();
EE.compile(CompilationMode::Infer, F);
EE.run({}, {});
Loading a Caffe2 model, compiling it, and running inference
https://github.com/pytorch/glow/blob/master/tests/unittests/caffe2ImporterTest.cpp
ExecutionEngine
 - compiling a model
 - running a model
 - saving a model
ExecutionEngine
 compile: the backend's generateIR (IR generation)
          and creation of a CompiledFunction (one per backend)
 run:     execution of the CompiledFunction (execute)
 save:    the backend's save (saving the IR)
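Putting the three entry points together, a minimal usage sketch (the argument
lists follow the slides below; they differ between Glow revisions, so treat
the exact signatures as assumptions):

ExecutionEngine EE(BackendKind::CPU);
Context ctx;
Function *F = EE.getModule().createFunction("main");
// ... build the graph in F, or load it with the ONNX/Caffe2 loaders ...
EE.compile(CompilationMode::Infer, F, ctx);            // generateIR + CompiledFunction
EE.run(ctx);                                           // execute the CompiledFunction
EE.save(CompilationMode::Infer, F, "outDir", "mnist"); // backend save: emit the compiled IR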
void ExecutionEngine::compile(CompilationMode mode, Function *F,
                              const Context &ctx) {
  optimizeFunction(mode, F);              // optimization (covered later)
  function_ = backend_->compile(F, ctx);  // compilation (covered later)
}
The mode argument is used by the optimization passes
ExecutionEngine::compile
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void glow::runBatch(ExecutionEngine &EE, Context &ctx, size_t iterations,
                    size_t &sampleCounter, llvm::ArrayRef<Placeholder *> ph,
                    llvm::ArrayRef<Tensor *> inputs) {
  size_t batchSize = ph[0]->getType()->dims()[0];
  for (size_t i = 0; i < iterations; i++) {
    // Copy the next batch slice of every input tensor into its placeholder.
    for (size_t j = 0, e = ph.size(); j < e; j++) {
      auto *backingTensor = ctx.get(ph[j]);
      auto dim = inputs[j]->dims();
      size_t slc = sampleCounter % dim[0];
      backingTensor->copyConsecutiveSlices(inputs[j], slc);
    }
    EE.run(ctx);
    sampleCounter += batchSize;
  }
}
glow::runBatch
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void ExecutionEngine::run(Context &ctx) {
  assert(function_ && "No function has been compiled");
  // Make sure that the context has backing tensors for all placeholders.
  ctx.allocate(M_.getPlaceholders());
  function_->setupRuns();
  function_->beforeRun(ctx);
  function_->execute();
  function_->afterRun(ctx);
  function_->tearDownRuns();
}
ExecutionEngine::run
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void ExecutionEngine::save(CompilationMode mode, Function *F,
                           llvm::StringRef outputDir,
                           llvm::StringRef networkName) {
  optimizeFunction(mode, F);  // optimization (covered later)
  backend_->save(F, outputDir, networkName);
}
ExecutionEngine::save
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
Optimization
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
void ExecutionEngine::compile(CompilationMode mode, Function *F,
                              const Context &ctx) {
  optimizeFunction(mode, F);              // optimization
  function_ = backend_->compile(F, ctx);  // compilation (covered later)
}
The mode argument is used by the optimization passes
ExecutionEngine::compile
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void ExecutionEngine::optimizeFunction(CompilationMode mode,
                                       Function *F) {
// Verify the function pre-optimization/lowering.
F->verify();
// Optimize the graph.
::glow::optimize(F, mode);
// Allow the backend to transform the graph prior to lowering.
if (backend_->transformPreLowering(F, mode)) {
// Optimize the graph again after the backend transformation.
// In particular, DCE is very likely to be useful.
::glow::optimize(F, mode);
}
ExecutionEngine::optimizeFunction
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
// Lower the graph into a sequence of low-level linear algebra operations.
::glow::lower(F, *backend_);
// Optimize the graph again.
::glow::optimize(F, mode);
// Allow the backend to transform the graph after lowering.
if (backend_->transformPostLowering(F, mode)) {
// Optimize the graph again after the backend transformation.
// In particular, DCE is very likely to be useful.
::glow::optimize(F, mode);
}
}
ExecutionEngine::optimizeFunction
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
1) ::glow::optimize(F, mode);
2) if (backend_->transformPreLowering(F, mode))
     ::glow::optimize(F, mode);
3) ::glow::lower(F, *backend_);
4) ::glow::optimize(F, mode);
5) if (backend_->transformPostLowering(F, mode))
     ::glow::optimize(F, mode);
Pulling out just the optimization steps of optimizeFunction:
1) ::glow::optimize(F, mode);
2) if (backend_->transformPreLowering(F, mode))
     ::glow::optimize(F, mode);
3) ::glow::lower(F, *backend_);
4) ::glow::optimize(F, mode);
5) if (backend_->transformPostLowering(F, mode))
     ::glow::optimize(F, mode);
transformPreLowering / transformPostLowering
In the current implementations (Interpreter, CPU, OpenCL),
transformPostLowering is implemented by the CPU and OpenCL backends,
while transformPreLowering is not implemented anywhere yet.
In CPUBackend it:
 1) replaces Convolution with a CPU-optimized version
 2) merges MaxPooling and Splat, replacing them with CPUMaxSplat
In OpenCLBackend it:
 1) replaces Convolution with an OpenCL-optimized version
 2) replaces MaxPooling with an OpenCL-optimized version
 3) replaces AvgPooling with an OpenCL-optimized version
Implementations of transformPreLowering / transformPostLowering
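As a rough sketch of what such a post-lowering hook can look like (this is not
the actual CPU or OpenCL implementation; MyBackend, createMyOptimizedConv and
the exact node-replacement calls are assumptions for illustration):

// Hypothetical backend hook: swap a generic Convolution for a
// backend-specific node after lowering. Not Glow's real code.
bool MyBackend::transformPostLowering(Function *F, CompilationMode mode) const {
  bool changed = false;
  for (auto &node : F->getNodes()) {
    if (auto *CN = llvm::dyn_cast<ConvolutionNode>(&node)) {
      Node *opt = createMyOptimizedConv(F, CN); // hypothetical helper; adds the node to F
      if (!opt)
        continue;
      // Point every user of the generic convolution at the new node.
      CN->getNthResult(0).replaceAllUsesOfWith(opt);
      changed = true;
    }
  }
  // Returning true makes ExecutionEngine::optimizeFunction run
  // ::glow::optimize again, so DCE can remove the dead Convolution.
  return changed;
}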
Backends
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
// The backend kind is specified when the ExecutionEngine is instantiated
ExecutionEngine EE(executionBackend);
ExecutionEngine.h
  // The default is the Interpreter
ExecutionEngine(BackendKind backendKind = BackendKind::Interpreter);
ExecutionEngine.cpp
// Create a backend of the requested kind
ExecutionEngine::ExecutionEngine(BackendKind backendKind)
    : backend_(createBackend(backendKind)) {}
ExecutionEngine::ExecutionEngine
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
Backend *glow::createBackend(BackendKind backendKind) {
  switch (backendKind) {
  case BackendKind::Interpreter: // Interpreter (a naive reference implementation)
    return createInterpreter();
  case BackendKind::OpenCL:      // OpenCL (host code & OpenCL kernels)
    return createOCLBackend();
  case BackendKind::CPU:         // CPU (LLVM JIT)
    return createCPUBackend();
  }
  llvm_unreachable("unreachable");
}
glow::createBackend
https://github.com/pytorch/glow/blob/master/lib/Backends/Backends.cpp
Backend *createInterpreter() { return new Interpreter(); }
Backend *createCPUBackend() { return new CPUBackend(); }
Backend *createOCLBackend() { return new OCLBackend(); }
Creating a backend
https://github.com/pytorch/glow/blob/master/lib/Backends/
Compilation: compile
The backend's compile consists of:
 generateAndOptimizeIR: IR generation & IR optimization
 compileIR:             code generation from the IR
virtual std::unique_ptr<CompiledFunction>
compile(std::unique_ptr<IRFunction> IR) const = 0;
InterpreterBackend
llvm::make_unique<InterpreterFunction>(std::move(IR))
CPUBackend
llvm::make_unique<CPUFunction>(std::move(JIT), heap)
OpenCLBackend
llvm::make_unique<OpenCLFunction>(std::move(IR))
compile
https://github.com/pytorch/glow/blob/master/include/glow/Backends/Backend.h#L43
std::unique_ptr<CompiledFunction>
Interpreter::compile(Function *F) const {
auto IR = generateAndOptimizeIR(F, shouldShareBuffers());
return compileIR(std::move(IR));
}
Interpreter::compile
https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/Interpreter.cpp#L27
std::unique_ptr<CompiledFunction>
Interpreter::compileIR(std::unique_ptr<IRFunction> IR) const {
MemoryAllocator constantWeightsAllocator("ConstantWeights", 0);
MemoryAllocator placeholderWeightsAllocator("PlaceholderWeights", 0);
MemoryAllocator activationsAllocator("Activations", 0);
runtime::RuntimeBundle bundle =
generateRuntimeBundle(*IR, constantWeightsAllocator,
placeholderWeightsAllocator, activationsAllocator);
return llvm::make_unique<InterpreterFunction>(std::move(IR), bundle);
}
Interpreter::compileIR
https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/Interpreter.cpp#L27
std::unique_ptr<CompiledFunction>
CPUBackend::compile(Function *F) const {
auto IR = generateAndOptimizeIR(F, shouldShareBuffers());
return compileIR(std::move(IR));
}
CPUBackend::compile
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUBackend.cpp#L146
std::unique_ptr<CompiledFunction>
CPUBackend::compileIR(std::unique_ptr<IRFunction> IR) const {
AllocationsInfo allocationsInfo;
std::unique_ptr<LLVMIRGen> irgen = createIRGen(IR.get(), allocationsInfo);
irgen->initTargetMachine(target.empty() ? "" : target.getValue(),
llvm::CodeModel::Model::Large);
irgen->initCodeGen();
allocateJITMemory(IR.get(), irgen->getAllocationsInfo());
emitJitMain(*irgen);
irgen->performCodeGen();
CPUBackend::compileIR
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUBackend.cpp
auto JIT = llvm::make_unique<llvm::orc::GlowJIT>(irgen->getTargetMachine());
JIT->addModule(irgen->borrowModule());
MemoryAllocator constantAllocator("ConstantWeights", 0);
MemoryAllocator placeholderAllocator("Placeholders", 0);
MemoryAllocator activationsAllocator("Activations", 0);
runtime::RuntimeBundle runtimeInfo = generateRuntimeBundle(
*IR, constantAllocator, placeholderAllocator, activationsAllocator);
return llvm::make_unique<CPUFunction>(std::move(JIT), runtimeInfo);
}
CPUBackend::compileIR
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUBackend.cpp
std::unique_ptr<CompiledFunction>
OCLBackend::compile(Function *F) const {
auto IR = generateAndOptimizeIR(F, shouldShareBuffers());
return compileIR(std::move(IR));
}
OpenCLBackend::compile
https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.cpp
std::unique_ptr<CompiledFunction>
OCLBackend::compileIR(std::unique_ptr<IRFunction> IR) const {
MemoryAllocator allocator("GPU", 0xFFFFFFFF);
runtime::RuntimeBundle bundle =
generateRuntimeBundle(*IR, allocator, allocator, allocator);
return llvm::make_unique<OpenCLFunction>(std::move(IR), bundle);
}
OpenCLBackend::compileIR
https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.cpp
IR generation
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
std::unique_ptr<IRFunction>
glow::generateAndOptimizeIR(Function *F,
                            bool shouldShareBuffers) {
  auto IR = llvm::make_unique<IRFunction>(F);
  // Generate the IR
  IR->generateIR();
  // Optimize the IR, honoring the backend's buffer-sharing preference
  ::glow::optimize(*IR, shouldShareBuffers);
  return IR;
}
IR generation, then IR optimization using the backend
https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp
void IRFunction::generateIR() {
assert(G_->verify() && "Invalid function");
// Schedule the nodes.
NodesPtrList ScheduledNodes;
scheduleGraph(ScheduledNodes);
IRGenVisitor irgen(this);
for (auto &N : ScheduledNodes) {
N->visit(nullptr, &irgen);
}
}
IR generation
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void IRFunction::scheduleGraph(NodesPtrList &Schedule) {
Schedule.clear();
for (auto &N : G_->getParent()->getVars()) {
Schedule.push_back(N);
}
for (auto &N : G_->getParent()->getPlaceholders()) {
Schedule.push_back(N);
}
Graph scheduling: first half
https://github.com/pytorch/glow/blob/master/lib/IR/GraphScheduler.cpp
auto numVars = G_->getParent()->getConstants().size();
auto numPlaceholders = G_->getParent()->getPlaceholders().size();
(void)numVars;
(void)numPlaceholders;
std::unique_ptr<Scheduler> scheduler{
createScheduler(graphScheduler, *G_, Schedule)};
scheduler->schedule();
assert(scheduler->getSchedule().size() ==
G_->getNodes().size() + numPlaceholders + numVars &&
"All graph nodes have to be scheduled");
}
Graph scheduling: second half
https://github.com/pytorch/glow/blob/master/lib/IR/GraphScheduler.cpp#L172
IR optimization
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
std::unique_ptr<IRFunction>
glow::generateAndOptimizeIR(Function *F, bool shouldShareBuffers) {
  auto IR = llvm::make_unique<IRFunction>(F);
  // Generate the IR
  IR->generateIR();
  // Optimize the IR, honoring the backend's buffer-sharing preference
  ::glow::optimize(*IR, shouldShareBuffers);
  return IR;
}
IR generation, then IR optimization using the backend
https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp
void glow::optimize(IRFunction &M, CompilationMode mode, const Backend &B) {
M.verify();
if (!optimizeIR) return;
performPeepholeOptimizations(M);
eliminateDeadStores(M);
// Replace applicable InsertTensors and ExtractTensors with TensorViews.
optimizeInserts(M);
optimizeExtracts(M);
  if (B.shouldShareBuffers()) // Reuse buffers from previous operations.
    shareBuffers(M);
IR optimization
https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp#L1602
performPeepholeOptimizations(M);
hoistDealloc(M); // Shorten the lifetime of buffers.
sinkAllocas(M);
eliminateDeadStores(M); // Perform Dead Store Elimination.
deleteDeadAllocs(M);
makeWeightsConst(M); // Turn read-only weights into constant weights.
performDebugInstrumentation(M);
if (dumpOptMod) // Print the module to stdout if requested.
M.dump();
M.verify();
}
IR optimization
https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp#L1596
Execution: execute
class CompiledFunction {
public:
virtual ~CompiledFunction() = default;
virtual void execute() = 0;
virtual void setupRuns() = 0;
virtual void beforeRun(const Context &ctx) = 0;
virtual void afterRun(const Context &ctx) = 0;
virtual void tearDownRuns() = 0;
};
CompiledFunction
https://github.com/pytorch/glow/blob/master/include/glow/Backends/CompiledFunction.h
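To see what a backend has to provide, here is a minimal do-nothing
CompiledFunction (a hedged sketch; NullFunction is hypothetical, only the
interface above is quoted from the source):

// Hypothetical CompiledFunction that runs nothing; it only illustrates the
// lifecycle hooks ExecutionEngine::run() calls in order:
// setupRuns -> beforeRun -> execute -> afterRun -> tearDownRuns.
class NullFunction final : public CompiledFunction {
public:
  void setupRuns() override {}                   // one-time setup / allocation
  void beforeRun(const Context &ctx) override {} // copy inputs in
  void execute() override {}                     // run the compiled code
  void afterRun(const Context &ctx) override {}  // copy outputs back
  void tearDownRuns() override {}                // release per-run resources
};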
class InterpreterFunction final : public CompiledFunction {
/// The IR to be executed.
std::unique_ptr<IRFunction> F_;
/// Maps values to Tensors, that are owned by this class.
std::unordered_map<const Value *, Tensor *> tensors_;
/// Maps values to Tensors, that are *not* owned by this class.
std::unordered_map<const Value *, Tensor *> externalTensors_;
public:
InterpreterFunction(std::unique_ptr<IRFunction> F, const Context &ctx);
~InterpreterFunction() override;
void execute() override;
InterpreterFunction
https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/InterpreterFunction.h#L43
void InterpreterFunction::execute() {
#define DEF_VALUE(CLASS, NAME)
#define DEF_INSTR(CLASS, NAME)                      \
  case Kinded::Kind::CLASS##Kind: {                 \
    fwd##CLASS(llvm::cast<CLASS>(&I));              \
    break;                                          \
  }
#define DEF_BACKEND_SPECIFIC_INSTR(CLASS, NAME)
  for (const auto &I : F_->getInstrs()) {
    switch (I.getKind()) { // <= dispatch on the kind of each instruction
#include "glow/AutoGenInstr.def"
    default:
      llvm_unreachable("Invalid instruction.");
    }
  }
}
InterpreterFunction::execute
https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/InterpreterFunction.cpp
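After the preprocessor pulls in glow/AutoGenInstr.def, the switch simply
contains one such case per instruction kind; roughly, for a single instruction
(CopyInst is picked here only as an illustration):

// Roughly what one expanded case looks like inside the switch above:
case Kinded::Kind::CopyInstKind: {
  fwdCopyInst(llvm::cast<CopyInst>(&I));
  break;
}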
class CPUFunction final : public CompiledFunction {
std::unique_ptr<llvm::orc::GlowJIT> JIT_;
void *heap_;
public:
CPUFunction(std::unique_ptr<llvm::orc::GlowJIT> JIT, void *heap);
~CPUFunction() override;
void execute() override;
};
CPUFunction
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUFunction.h
void CPUFunction::execute() {
auto sym = JIT_->findSymbol( "jitmain");
using JitFuncType =
void (*)(uint8_t * constantWeightVars, uint8_t * mutableWeightVars,
uint8_t * activations);
auto address = sym.getAddress();
if (address) {
JitFuncType funcPtr = reinterpret_cast<JitFuncType>(address.get());
funcPtr(runtimeBundle_.getConstants(), baseMutableWeightVarsAddress_,
baseActivationsAddress_);
} else {
GLOW_ASSERT(false && "Error getting address.");
}
}
CPUFunction::execute
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUFunction.cpp#L29
class OpenCLFunction final : public CompiledFunction {
cl_device_id deviceId_;
cl_context context_;
cl_command_queue commands_;
cl_mem deviceBuffer_{0};
std::vector<KernelLaunch> kernelLaunches_;
public:
explicit OpenCLFunction(std::unique_ptr<IRFunction> F);
~OpenCLFunction() override;
void execute() override;
OpenCLFunction
https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.h
void OpenCLFunction::execute() {
  // Very long, so only the outline:
  //
  // For each layer (instruction) in the IR:
  //
  //   1) generate the host-side code / OpenCL kernel for that layer
  //   2) compile the OpenCL kernel
  //   3) launch it following the usual OpenCL conventions (enqueueKernel)
  //
  // clFinish(commands_); then waits until every enqueued OpenCL kernel
  // has finished.
}
OpenCLFunction::execute
https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.h
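The enqueue/finish pattern that outline refers to is ordinary OpenCL host
code; a generic sketch of a single launch (not Glow's actual implementation;
commands, kernel and numElements are assumed to come from earlier setup):

// Launch one kernel over `numElements` work items, then wait for everything
// that has been enqueued on the queue to finish.
size_t global[1] = {numElements};
cl_int err = clEnqueueNDRangeKernel(commands, kernel, /*work_dim=*/1,
                                    /*global_work_offset=*/nullptr, global,
                                    /*local_work_size=*/nullptr,
                                    0, nullptr, nullptr);
// ... enqueue the kernels for the remaining layers ...
clFinish(commands); // block until all enqueued kernels have completed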
Quantization (FP32 => INT8)
https://github.com/pytorch/glow/blob/master/docs/Quantization.md
 - FP32 => INT8
 - Profile-guided quantization:
   observe execution during inference to estimate the possible
   numeric range at each stage of the neural network
 - Training-based quantization is being considered for future support
Quantization in Glow
https://github.com/pytorch/glow/blob/master/docs/Quantization.md
std::vector<NodeQuantizationInfo> QI{
{NodeQuantizationInfo::generateNodeOutputName(input->getName()),
{0.2f, 0}},
{NodeQuantizationInfo::generateNodeOutputName(W->getName()), {0.3f, 0}},
{NodeQuantizationInfo::generateNodeOutputName(B->getName()), {0.4f, 0}},
{NodeQuantizationInfo::generateNodeOutputName(FC->getName()), {0.6f, 0}},
};
F = quantization::quantizeFunction(EE, QI, F);
// Make sure that graph can be compiled and run.
EE.compile(CompilationMode::Infer, F);
EE.run({}, {});
Example use of quantization::quantizeFunction
https://github.com/pytorch/glow/blob/master/tests/unittests/quantizationTest.cpp
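The {scale, offset} pairs above follow the affine mapping described in
docs/Quantization.md, roughly float ≈ (int8 - offset) * scale; a small sketch
of that arithmetic (standard affine quantization, not Glow's exact helpers):

#include <algorithm>
#include <cmath>
#include <cstdint>

// float_value ≈ (quantized_value - offset) * scale.
// With {0.2f, 0} as in the example above, int8 values -128..127 cover
// roughly the float range [-25.6, 25.4].
float dequantize(int8_t q, float scale, int32_t offset) {
  return (q - offset) * scale;
}

int8_t quantize(float f, float scale, int32_t offset) {
  int32_t q = static_cast<int32_t>(std::round(f / scale)) + offset;
  return static_cast<int8_t>(std::min(127, std::max(-128, q)));
}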
Function *
quantizeFunction(const ExecutionEngine &EE,
llvm::ArrayRef<NodeQuantizationInfo> quantizationInfos,
Function *F, llvm::StringRef newFuncName = "");
quantization::quantizeFunction
https://github.com/pytorch/glow/blob/master/include/glow/Quantization/Quantization.h
https://github.com/pytorch/glow
Glow : Graph Compiler & Execution Engine
High-Level Graph => Low-Level IR => Machine Code
 
Backends
  Interpreter
CPU
OpenCL
 
I am not a deep-learning craftsman;
I am a computer engineer.
Thank you!
@Vengineer
Source-code analysis craftsman
More Related Content

What's hot

RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析
RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析
RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析Mr. Vengineer
 
Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。Mr. Vengineer
 
Global Interpreter Lock: Episode I - Break the Seal
Global Interpreter Lock: Episode I - Break the SealGlobal Interpreter Lock: Episode I - Break the Seal
Global Interpreter Lock: Episode I - Break the SealTzung-Bi Shih
 
How to make a large C++-code base manageable
How to make a large C++-code base manageableHow to make a large C++-code base manageable
How to make a large C++-code base manageablecorehard_by
 
History & Practices for UniRx(EN)
History & Practices for UniRx(EN)History & Practices for UniRx(EN)
History & Practices for UniRx(EN)Yoshifumi Kawai
 
Global Interpreter Lock: Episode III - cat &lt; /dev/zero > GIL;
Global Interpreter Lock: Episode III - cat &lt; /dev/zero > GIL;Global Interpreter Lock: Episode III - cat &lt; /dev/zero > GIL;
Global Interpreter Lock: Episode III - cat &lt; /dev/zero > GIL;Tzung-Bi Shih
 
深入淺出C語言
深入淺出C語言深入淺出C語言
深入淺出C語言Simen Li
 
Антон Наумович, Система автоматической крэш-аналитики своими средствами
Антон Наумович, Система автоматической крэш-аналитики своими средствамиАнтон Наумович, Система автоматической крэш-аналитики своими средствами
Антон Наумович, Система автоматической крэш-аналитики своими средствамиSergey Platonov
 
[ZigBee 嵌入式系統] ZigBee 應用實作 - 使用 TI Z-Stack Firmware
[ZigBee 嵌入式系統] ZigBee 應用實作 - 使用 TI Z-Stack Firmware[ZigBee 嵌入式系統] ZigBee 應用實作 - 使用 TI Z-Stack Firmware
[ZigBee 嵌入式系統] ZigBee 應用實作 - 使用 TI Z-Stack FirmwareSimen Li
 
Photon Server Deep Dive - View from Implmentation of PhotonWire, Multiplayer ...
Photon Server Deep Dive - View from Implmentation of PhotonWire, Multiplayer ...Photon Server Deep Dive - View from Implmentation of PhotonWire, Multiplayer ...
Photon Server Deep Dive - View from Implmentation of PhotonWire, Multiplayer ...Yoshifumi Kawai
 
Java Performance: Speedup your application with hardware counters
Java Performance: Speedup your application with hardware countersJava Performance: Speedup your application with hardware counters
Java Performance: Speedup your application with hardware countersSergey Kuksenko
 
clWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPUclWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPUJohn Colvin
 
Skiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in DSkiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in DMithun Hunsur
 
Антон Бикинеев, Writing good std::future&lt; C++ >
Антон Бикинеев, Writing good std::future&lt; C++ >Антон Бикинеев, Writing good std::future&lt; C++ >
Антон Бикинеев, Writing good std::future&lt; C++ >Sergey Platonov
 
PyCon TW 2017 - PyPy's approach to construct domain-specific language runtime...
PyCon TW 2017 - PyPy's approach to construct domain-specific language runtime...PyCon TW 2017 - PyPy's approach to construct domain-specific language runtime...
PyCon TW 2017 - PyPy's approach to construct domain-specific language runtime...Tsundere Chen
 
Metaprogramming and Reflection in Common Lisp
Metaprogramming and Reflection in Common LispMetaprogramming and Reflection in Common Lisp
Metaprogramming and Reflection in Common LispDamien Cassou
 
閒聊Python應用在game server的開發
閒聊Python應用在game server的開發閒聊Python應用在game server的開發
閒聊Python應用在game server的開發Eric Chen
 
Memory Management of C# with Unity Native Collections
Memory Management of C# with Unity Native CollectionsMemory Management of C# with Unity Native Collections
Memory Management of C# with Unity Native CollectionsYoshifumi Kawai
 
C++ How I learned to stop worrying and love metaprogramming
C++ How I learned to stop worrying and love metaprogrammingC++ How I learned to stop worrying and love metaprogramming
C++ How I learned to stop worrying and love metaprogrammingcppfrug
 

What's hot (20)

RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析
RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析
RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析
 
Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。
 
Global Interpreter Lock: Episode I - Break the Seal
Global Interpreter Lock: Episode I - Break the SealGlobal Interpreter Lock: Episode I - Break the Seal
Global Interpreter Lock: Episode I - Break the Seal
 
How to make a large C++-code base manageable
How to make a large C++-code base manageableHow to make a large C++-code base manageable
How to make a large C++-code base manageable
 
History & Practices for UniRx(EN)
History & Practices for UniRx(EN)History & Practices for UniRx(EN)
History & Practices for UniRx(EN)
 
Global Interpreter Lock: Episode III - cat &lt; /dev/zero > GIL;
Global Interpreter Lock: Episode III - cat &lt; /dev/zero > GIL;Global Interpreter Lock: Episode III - cat &lt; /dev/zero > GIL;
Global Interpreter Lock: Episode III - cat &lt; /dev/zero > GIL;
 
深入淺出C語言
深入淺出C語言深入淺出C語言
深入淺出C語言
 
Антон Наумович, Система автоматической крэш-аналитики своими средствами
Антон Наумович, Система автоматической крэш-аналитики своими средствамиАнтон Наумович, Система автоматической крэш-аналитики своими средствами
Антон Наумович, Система автоматической крэш-аналитики своими средствами
 
C++17 now
C++17 nowC++17 now
C++17 now
 
[ZigBee 嵌入式系統] ZigBee 應用實作 - 使用 TI Z-Stack Firmware
[ZigBee 嵌入式系統] ZigBee 應用實作 - 使用 TI Z-Stack Firmware[ZigBee 嵌入式系統] ZigBee 應用實作 - 使用 TI Z-Stack Firmware
[ZigBee 嵌入式系統] ZigBee 應用實作 - 使用 TI Z-Stack Firmware
 
Photon Server Deep Dive - View from Implmentation of PhotonWire, Multiplayer ...
Photon Server Deep Dive - View from Implmentation of PhotonWire, Multiplayer ...Photon Server Deep Dive - View from Implmentation of PhotonWire, Multiplayer ...
Photon Server Deep Dive - View from Implmentation of PhotonWire, Multiplayer ...
 
Java Performance: Speedup your application with hardware counters
Java Performance: Speedup your application with hardware countersJava Performance: Speedup your application with hardware counters
Java Performance: Speedup your application with hardware counters
 
clWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPUclWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPU
 
Skiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in DSkiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in D
 
Антон Бикинеев, Writing good std::future&lt; C++ >
Антон Бикинеев, Writing good std::future&lt; C++ >Антон Бикинеев, Writing good std::future&lt; C++ >
Антон Бикинеев, Writing good std::future&lt; C++ >
 
PyCon TW 2017 - PyPy's approach to construct domain-specific language runtime...
PyCon TW 2017 - PyPy's approach to construct domain-specific language runtime...PyCon TW 2017 - PyPy's approach to construct domain-specific language runtime...
PyCon TW 2017 - PyPy's approach to construct domain-specific language runtime...
 
Metaprogramming and Reflection in Common Lisp
Metaprogramming and Reflection in Common LispMetaprogramming and Reflection in Common Lisp
Metaprogramming and Reflection in Common Lisp
 
閒聊Python應用在game server的開發
閒聊Python應用在game server的開發閒聊Python應用在game server的開發
閒聊Python應用在game server的開發
 
Memory Management of C# with Unity Native Collections
Memory Management of C# with Unity Native CollectionsMemory Management of C# with Unity Native Collections
Memory Management of C# with Unity Native Collections
 
C++ How I learned to stop worrying and love metaprogramming
C++ How I learned to stop worrying and love metaprogrammingC++ How I learned to stop worrying and love metaprogramming
C++ How I learned to stop worrying and love metaprogramming
 

Similar to Facebook Glow Compiler のソースコードをグダグダ語る会

The Ring programming language version 1.8 book - Part 95 of 202
The Ring programming language version 1.8 book - Part 95 of 202The Ring programming language version 1.8 book - Part 95 of 202
The Ring programming language version 1.8 book - Part 95 of 202Mahmoud Samir Fayed
 
Ekon 25 Python4Delphi_MX475
Ekon 25 Python4Delphi_MX475Ekon 25 Python4Delphi_MX475
Ekon 25 Python4Delphi_MX475Max Kleiner
 
TestUpload
TestUploadTestUpload
TestUploadZarksaDS
 
EKON 25 Python4Delphi_mX4
EKON 25 Python4Delphi_mX4EKON 25 Python4Delphi_mX4
EKON 25 Python4Delphi_mX4Max Kleiner
 
How to reverse engineer Android applications
How to reverse engineer Android applicationsHow to reverse engineer Android applications
How to reverse engineer Android applicationshubx
 
How to reverse engineer Android applications—using a popular word game as an ...
How to reverse engineer Android applications—using a popular word game as an ...How to reverse engineer Android applications—using a popular word game as an ...
How to reverse engineer Android applications—using a popular word game as an ...Christoph Matthies
 
The why and how of moving to PHP 5.5/5.6
The why and how of moving to PHP 5.5/5.6The why and how of moving to PHP 5.5/5.6
The why and how of moving to PHP 5.5/5.6Wim Godden
 
Flink Forward San Francisco 2019: Deploying ONNX models on Flink - Isaac Mcki...
Flink Forward San Francisco 2019: Deploying ONNX models on Flink - Isaac Mcki...Flink Forward San Francisco 2019: Deploying ONNX models on Flink - Isaac Mcki...
Flink Forward San Francisco 2019: Deploying ONNX models on Flink - Isaac Mcki...Flink Forward
 
Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Hajime Tazaki
 
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides:  Let's build macOS CLI Utilities using SwiftMobileConf 2021 Slides:  Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides: Let's build macOS CLI Utilities using SwiftDiego Freniche Brito
 
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019corehard_by
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomyDongmin Yu
 
Groovy Introduction - JAX Germany - 2008
Groovy Introduction - JAX Germany - 2008Groovy Introduction - JAX Germany - 2008
Groovy Introduction - JAX Germany - 2008Guillaume Laforge
 
20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura
20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura
20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo OmuraPreferred Networks
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCKernel TLV
 
The true story_of_hello_world
The true story_of_hello_worldThe true story_of_hello_world
The true story_of_hello_worldfantasy zheng
 
maxbox starter72 multilanguage coding
maxbox starter72 multilanguage codingmaxbox starter72 multilanguage coding
maxbox starter72 multilanguage codingMax Kleiner
 
Euro python2011 High Performance Python
Euro python2011 High Performance PythonEuro python2011 High Performance Python
Euro python2011 High Performance PythonIan Ozsvald
 
How to Write Node.js Module
How to Write Node.js ModuleHow to Write Node.js Module
How to Write Node.js ModuleFred Chien
 

Similar to Facebook Glow Compiler のソースコードをグダグダ語る会 (20)

The Ring programming language version 1.8 book - Part 95 of 202
The Ring programming language version 1.8 book - Part 95 of 202The Ring programming language version 1.8 book - Part 95 of 202
The Ring programming language version 1.8 book - Part 95 of 202
 
Ekon 25 Python4Delphi_MX475
Ekon 25 Python4Delphi_MX475Ekon 25 Python4Delphi_MX475
Ekon 25 Python4Delphi_MX475
 
TestUpload
TestUploadTestUpload
TestUpload
 
EKON 25 Python4Delphi_mX4
EKON 25 Python4Delphi_mX4EKON 25 Python4Delphi_mX4
EKON 25 Python4Delphi_mX4
 
How to reverse engineer Android applications
How to reverse engineer Android applicationsHow to reverse engineer Android applications
How to reverse engineer Android applications
 
How to reverse engineer Android applications—using a popular word game as an ...
How to reverse engineer Android applications—using a popular word game as an ...How to reverse engineer Android applications—using a popular word game as an ...
How to reverse engineer Android applications—using a popular word game as an ...
 
The why and how of moving to PHP 5.5/5.6
The why and how of moving to PHP 5.5/5.6The why and how of moving to PHP 5.5/5.6
The why and how of moving to PHP 5.5/5.6
 
Flink Forward San Francisco 2019: Deploying ONNX models on Flink - Isaac Mcki...
Flink Forward San Francisco 2019: Deploying ONNX models on Flink - Isaac Mcki...Flink Forward San Francisco 2019: Deploying ONNX models on Flink - Isaac Mcki...
Flink Forward San Francisco 2019: Deploying ONNX models on Flink - Isaac Mcki...
 
Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01
 
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides:  Let's build macOS CLI Utilities using SwiftMobileConf 2021 Slides:  Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
 
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
Groovy Introduction - JAX Germany - 2008
Groovy Introduction - JAX Germany - 2008Groovy Introduction - JAX Germany - 2008
Groovy Introduction - JAX Germany - 2008
 
20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura
20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura
20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCC
 
The true story_of_hello_world
The true story_of_hello_worldThe true story_of_hello_world
The true story_of_hello_world
 
maxbox starter72 multilanguage coding
maxbox starter72 multilanguage codingmaxbox starter72 multilanguage coding
maxbox starter72 multilanguage coding
 
Euro python2011 High Performance Python
Euro python2011 High Performance PythonEuro python2011 High Performance Python
Euro python2011 High Performance Python
 
Python at Facebook
Python at FacebookPython at Facebook
Python at Facebook
 
How to Write Node.js Module
How to Write Node.js ModuleHow to Write Node.js Module
How to Write Node.js Module
 

More from Mr. Vengineer

XilinxのxsimでSoftware Driven Verification.pdf
XilinxのxsimでSoftware  Driven Verification.pdfXilinxのxsimでSoftware  Driven Verification.pdf
XilinxのxsimでSoftware Driven Verification.pdfMr. Vengineer
 
VerilatorとSystemCでSoftware Driven Verification
VerilatorとSystemCでSoftware Driven VerificationVerilatorとSystemCでSoftware Driven Verification
VerilatorとSystemCでSoftware Driven VerificationMr. Vengineer
 
Cloud TPU Driver API ソースコード解析
Cloud TPU Driver API ソースコード解析Cloud TPU Driver API ソースコード解析
Cloud TPU Driver API ソースコード解析Mr. Vengineer
 
Cloud Deep Learning Chips Training & Inference
Cloud Deep Learning Chips Training & InferenceCloud Deep Learning Chips Training & Inference
Cloud Deep Learning Chips Training & InferenceMr. Vengineer
 
TensorFlow Lite Delegateとは?
TensorFlow Lite Delegateとは?TensorFlow Lite Delegateとは?
TensorFlow Lite Delegateとは?Mr. Vengineer
 
Pixel Visual Core device driver source code analysis
Pixel Visual Core device driver source code analysisPixel Visual Core device driver source code analysis
Pixel Visual Core device driver source code analysisMr. Vengineer
 
TensorFlow XLA 「XLAとは、から、最近の利用事例について」
TensorFlow XLA 「XLAとは、から、最近の利用事例について」TensorFlow XLA 「XLAとは、から、最近の利用事例について」
TensorFlow XLA 「XLAとは、から、最近の利用事例について」Mr. Vengineer
 
Ultra96(UltraZed)実践勉強会
Ultra96(UltraZed)実践勉強会Ultra96(UltraZed)実践勉強会
Ultra96(UltraZed)実践勉強会Mr. Vengineer
 
LeFlowを調べてみました
LeFlowを調べてみましたLeFlowを調べてみました
LeFlowを調べてみましたMr. Vengineer
 
Tensorflow dynamically loadable XLA plugin ソースコード解析
Tensorflow  dynamically loadable XLA plugin ソースコード解析Tensorflow  dynamically loadable XLA plugin ソースコード解析
Tensorflow dynamically loadable XLA plugin ソースコード解析Mr. Vengineer
 
TensorFlow Lite (r1.5) & Android 8.1 Neural Network API
TensorFlow Lite (r1.5) & Android 8.1 Neural Network APITensorFlow Lite (r1.5) & Android 8.1 Neural Network API
TensorFlow Lite (r1.5) & Android 8.1 Neural Network APIMr. Vengineer
 
「ディープラーニングでは、エコシステムが大切よ!」
 「ディープラーニングでは、エコシステムが大切よ!」 「ディープラーニングでは、エコシステムが大切よ!」
「ディープラーニングでは、エコシステムが大切よ!」Mr. Vengineer
 
TensorFlow XLA とハードウェア
TensorFlow XLA とハードウェアTensorFlow XLA とハードウェア
TensorFlow XLA とハードウェアMr. Vengineer
 
2017年のFPGA Community活動について
2017年のFPGA Community活動について2017年のFPGA Community活動について
2017年のFPGA Community活動についてMr. Vengineer
 
Zynq VIPを利用したテストベンチ
Zynq VIPを利用したテストベンチZynq VIPを利用したテストベンチ
Zynq VIPを利用したテストベンチMr. Vengineer
 
TensorFlow XLAの可能性
TensorFlow XLAの可能性 TensorFlow XLAの可能性
TensorFlow XLAの可能性 Mr. Vengineer
 
AWS EC2 F1とXilinx SDAccel
AWS EC2 F1とXilinx SDAccelAWS EC2 F1とXilinx SDAccel
AWS EC2 F1とXilinx SDAccelMr. Vengineer
 
Intel Nervana Graph とは?
Intel Nervana Graph とは?Intel Nervana Graph とは?
Intel Nervana Graph とは?Mr. Vengineer
 
DSPでディープラーニング
DSPでディープラーニングDSPでディープラーニング
DSPでディープラーニングMr. Vengineer
 

More from Mr. Vengineer (20)

XilinxのxsimでSoftware Driven Verification.pdf
XilinxのxsimでSoftware  Driven Verification.pdfXilinxのxsimでSoftware  Driven Verification.pdf
XilinxのxsimでSoftware Driven Verification.pdf
 
VerilatorとSystemCでSoftware Driven Verification
VerilatorとSystemCでSoftware Driven VerificationVerilatorとSystemCでSoftware Driven Verification
VerilatorとSystemCでSoftware Driven Verification
 
VerilatorとSystemC
VerilatorとSystemCVerilatorとSystemC
VerilatorとSystemC
 
Cloud TPU Driver API ソースコード解析
Cloud TPU Driver API ソースコード解析Cloud TPU Driver API ソースコード解析
Cloud TPU Driver API ソースコード解析
 
Cloud Deep Learning Chips Training & Inference
Cloud Deep Learning Chips Training & InferenceCloud Deep Learning Chips Training & Inference
Cloud Deep Learning Chips Training & Inference
 
TensorFlow Lite Delegateとは?
TensorFlow Lite Delegateとは?TensorFlow Lite Delegateとは?
TensorFlow Lite Delegateとは?
 
Pixel Visual Core device driver source code analysis
Pixel Visual Core device driver source code analysisPixel Visual Core device driver source code analysis
Pixel Visual Core device driver source code analysis
 
TensorFlow XLA 「XLAとは、から、最近の利用事例について」
TensorFlow XLA 「XLAとは、から、最近の利用事例について」TensorFlow XLA 「XLAとは、から、最近の利用事例について」
TensorFlow XLA 「XLAとは、から、最近の利用事例について」
 
Ultra96(UltraZed)実践勉強会
Ultra96(UltraZed)実践勉強会Ultra96(UltraZed)実践勉強会
Ultra96(UltraZed)実践勉強会
 
LeFlowを調べてみました
LeFlowを調べてみましたLeFlowを調べてみました
LeFlowを調べてみました
 
Tensorflow dynamically loadable XLA plugin ソースコード解析
Tensorflow  dynamically loadable XLA plugin ソースコード解析Tensorflow  dynamically loadable XLA plugin ソースコード解析
Tensorflow dynamically loadable XLA plugin ソースコード解析
 
TensorFlow Lite (r1.5) & Android 8.1 Neural Network API
TensorFlow Lite (r1.5) & Android 8.1 Neural Network APITensorFlow Lite (r1.5) & Android 8.1 Neural Network API
TensorFlow Lite (r1.5) & Android 8.1 Neural Network API
 
「ディープラーニングでは、エコシステムが大切よ!」
 「ディープラーニングでは、エコシステムが大切よ!」 「ディープラーニングでは、エコシステムが大切よ!」
「ディープラーニングでは、エコシステムが大切よ!」
 
TensorFlow XLA とハードウェア
TensorFlow XLA とハードウェアTensorFlow XLA とハードウェア
TensorFlow XLA とハードウェア
 
2017年のFPGA Community活動について
2017年のFPGA Community活動について2017年のFPGA Community活動について
2017年のFPGA Community活動について
 
Zynq VIPを利用したテストベンチ
Zynq VIPを利用したテストベンチZynq VIPを利用したテストベンチ
Zynq VIPを利用したテストベンチ
 
TensorFlow XLAの可能性
TensorFlow XLAの可能性 TensorFlow XLAの可能性
TensorFlow XLAの可能性
 
AWS EC2 F1とXilinx SDAccel
AWS EC2 F1とXilinx SDAccelAWS EC2 F1とXilinx SDAccel
AWS EC2 F1とXilinx SDAccel
 
Intel Nervana Graph とは?
Intel Nervana Graph とは?Intel Nervana Graph とは?
Intel Nervana Graph とは?
 
DSPでディープラーニング
DSPでディープラーニングDSPでディープラーニング
DSPでディープラーニング
 

Recently uploaded

F5 LTM TROUBLESHOOTING Guide latest.pptx
F5 LTM TROUBLESHOOTING Guide latest.pptxF5 LTM TROUBLESHOOTING Guide latest.pptx
F5 LTM TROUBLESHOOTING Guide latest.pptxArjunJain44
 
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...Amil baba
 
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理kywwoyk
 
一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理
一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理
一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理kywwoyk
 
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理eemet
 
Memory compiler tutorial – TSMC 40nm technology
Memory compiler tutorial – TSMC 40nm technologyMemory compiler tutorial – TSMC 40nm technology
Memory compiler tutorial – TSMC 40nm technologyAhmed Abdelazeem
 
NO1 Pandit Black magic/kala jadu,manpasand shadi in lahore,karachi rawalpindi...
NO1 Pandit Black magic/kala jadu,manpasand shadi in lahore,karachi rawalpindi...NO1 Pandit Black magic/kala jadu,manpasand shadi in lahore,karachi rawalpindi...
NO1 Pandit Black magic/kala jadu,manpasand shadi in lahore,karachi rawalpindi...Amil Baba Dawood bangali
 
1. WIX 2 PowerPoint for Work Experience.pptx
1. WIX 2 PowerPoint for Work Experience.pptx1. WIX 2 PowerPoint for Work Experience.pptx
1. WIX 2 PowerPoint for Work Experience.pptxlouise569794
 

Recently uploaded (8)

F5 LTM TROUBLESHOOTING Guide latest.pptx
F5 LTM TROUBLESHOOTING Guide latest.pptxF5 LTM TROUBLESHOOTING Guide latest.pptx
F5 LTM TROUBLESHOOTING Guide latest.pptx
 
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...
 
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
 
一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理
一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理
一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理
 
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
 
Memory compiler tutorial – TSMC 40nm technology
Memory compiler tutorial – TSMC 40nm technologyMemory compiler tutorial – TSMC 40nm technology
Memory compiler tutorial – TSMC 40nm technology
 
NO1 Pandit Black magic/kala jadu,manpasand shadi in lahore,karachi rawalpindi...
NO1 Pandit Black magic/kala jadu,manpasand shadi in lahore,karachi rawalpindi...NO1 Pandit Black magic/kala jadu,manpasand shadi in lahore,karachi rawalpindi...
NO1 Pandit Black magic/kala jadu,manpasand shadi in lahore,karachi rawalpindi...
 
1. WIX 2 PowerPoint for Work Experience.pptx
1. WIX 2 PowerPoint for Work Experience.pptx1. WIX 2 PowerPoint for Work Experience.pptx
1. WIX 2 PowerPoint for Work Experience.pptx
 

Facebook Glow Compiler のソースコードをグダグダ語る会

  • 1. Facebook Glow Compiler のソースコー ドをグダグダ語る会 @DeNA 作成:2018/08/26, 9/16,9/22,10/28 Slideshareにて公開 :2018/11/29 @Vengineer
  • 2. ブログ (2007年~) : Vengineerの戯言  http://blogs.yahoo.co.jp/verification_engineer SlideShare :  https://www.slideshare.net/ssuser479fa3 Twitter (2009年~) : @Vengineer ソースコード解析職人
  • 7. 宣伝です PyTorch から XLA に変 換し、Cloud TPU にて、 Resnet-50を動かしたとい うコードなのかな? 2018年12月1日(土)
  • 11. Glow: Graph Lowering Compiler Techniques for Neural Networks May 2, 2018 https://arxiv.org/abs/1805.00907 Facebook
  • 12. Glow: A community-driven approach to AI infrastructure Sep 13, 2018 https://code.fb.com/ml-applications/glow-a-community-driven-approach-to-ai -infrastructure/ Facebook
  • 13. @Scale 2018 Keynote: Glow: A community-driven approach to AI SEPTEMBER 19, 2018 https://atscaleconference.com/videos/scale-2018-keynote-glow-a-community-driven -approach-to-ai/ Facebook
  • 15. $ sudo apt-get install graphviz cmake wget libpng-dev ninja-build clang llvm-5.0 libprotobuf-dev protobuf-compiler   cmake は、3.7.1 以上が必要 別途、ソースコードから3.12.1 をインストールしました  llvmは、6.0 でも、7.0 でもいいみたいです。 準備
  • 16. $ git clone https://github.com/pytorch/glow.git $ git submodule update --init --recursive $ cd glow $ mkdir build_Debug $ cd build_Debug $ cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug .. $ ninja all $ ninja test   ビルド
  • 17. CMakeLists.txt の option(GLOW_WITH_OPENCL "Build the OpenCL backend" ON) を option(GLOW_WITH_OPENCL "Build the OpenCL backend" OFF) に変更にするか、コマンドラインにて、以下のようなパラメータを指定する -DGLOW_WITH_OPENCL=OFF   OpenCL がデフォルトで ON
  • 18. https://github.com/pytorch/glow Glow : Graph Compiler & Execution Engine High-Level Graph => Low-Level IR => Machine Code  
  • 19. TensorFlow XLA : JITコンパイラ (r1.5~) XLAグラフに変換 最適化、その1 ターゲットハードウェアの実行オブジェクト ターゲットハードウェアに依存しない最適化 HLO (High Level Optimizer) XLAグラフ 最適化、その2 コード生成 ターゲットハードウェアに依存する最適化 LLO (Low Level Optimizer) TensorFow Graph 実行オブジェクト XLAグラフ
  • 21. ExecutionEngine EE(executionBackend); TrainingConfig TC; TC.learningRate = 0.001; TC.momentum = 0.9; TC.L2Decay = 0.001; TC.batchSize = minibatchSize; Function *T = glow::differentiate(F, TC); # <= 学習はこれが必要 EE.compile(CompilationMode::Train, T); # <= CompilationMode::Train 例題:mnist を見てみよう ( 学習 だってできる ) https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
  • 22. Tensor imageInputs; Tensor labelInputs; Variable *A = mod.createVariable(ElemKind::FloatTy, {minibatchSize, 28, 28, 1}, "input", VisibilityKind::Public, false); Variable *selected = mod.createVariable(ElemKind::Int64ITy, {minibatchSize, 1}, "selected", VisibilityKind::Public, false); unsigned numImages = loadMNIST(imageInputs, labelInputs); EE.runBatch(numIterations, {A, selected}, {&imageInputs, &labelInputs}); 例題:mnist を見てみよう ( 学習 だってできる ) https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
  • 23. auto *result = F->createSave("return", SM); EE.compile(CompilationMode::Infer, F); #<= CompilationMode::Infer Tensor sample(ElemKind::FloatTy, {minibatchSize, 28, 28, 1}); for (int iter = numIterations; iter < numIterations + 10; iter++) { sample.copyConsecutiveSlices(&imageInputs, minibatchSize * iter); EE.run({A}, {&sample}); Tensor &res = result->getVariable()->getPayload(); 例題:mnist を見てみよう ( 推論 も当然できる ) https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
  • 24. llvm::cl::opt<BackendKind> executionBackend( llvm::cl::desc("Backend to use:"), llvm::cl::values(clEnumValN(BackendKind::Interpreter, "interpreter", "Use interpreter (default option)"), clEnumValN(BackendKind::CPU, "cpu", "Use CPU"), clEnumValN(BackendKind::OpenCL, "opencl", "Use OpenCL") ), llvm::cl::init(BackendKind::Interpreter), llvm::cl::cat(mnistCat) ); バックエンドは、「Interpreter(デフォルト)」「CPU」「OpenCL」 バックエンドは? https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
  • 25. auto *CV0 = F->create Conv("conv", A, 16, 5, 1, 2, 1); auto *RL0 = F->create RELU("relu", CV0); auto *MP0 = F->create MaxPool("pool", RL0, 3, 3, 0); auto *CV1 = F->create Conv("conv", MP0, 16, 5, 1, 2, 1); auto *RL1 = F->create RELU("relu", CV1); auto *MP1 = F->create MaxPool("pool", RL1, 3, 3, 0); auto *FCL1 = F->create FullyConnected("fc", MP1, 10); auto *SM = F->create SoftMax("sm", FCL1, selected); auto *result = F->createSave("return", SM); mnist のモデル構築 https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
  • 26. The Lifetime of a Glow Instruction
  • 27.   1)、The graph is either loaded via the graph loader   (from ONNX or Caffe2 format),     or constructed via the C++ interface.   2)、The graph is differentiated if needed.   3)、The graph is optimized.   4)、Linear algebra node lowering takes place.   5)、Additional rounds of optimizations occur,     both target independent and target specific.   6)、The graph is scheduled into a linear sequence of nodes     that minimizes memory usage.   7)、IRGen converts the low-level graph into instructions.   8)、Low-level IR optimizations are performed.   9)、Backend-specific optimizations     and code generation are performed. https://github.com/pytorch/glow/blob/master/docs/IR.md
  • 29. PyTorch 1.0 PyTorch + Caffe2 + Glow
  • 30.   1)、The graph is either loaded via the graph loader   (from ONNX or Caffe2 format),     or constructed via the C++ interface.   2)、The graph is differentiated if needed.   3)、The graph is optimized.   4)、Linear algebra node lowering takes place.   5)、Additional rounds of optimizations occur,     both target independent and target specific.   6)、The graph is scheduled into a linear sequence of nodes     that minimizes memory usage.   7)、IRGen converts the low-level graph into instructions.   8)、Low-level IR optimizations are performed.   9)、Backend-specific optimizations     and code generation are performed. https://github.com/pytorch/glow/blob/master/docs/IR.md
  • 31. ExecutionEngine EE{BackendKind::Interpreter}; auto &mod = EE.getModule(); Function *F = mod.createFunction("main"); std::string NetFilename("tests/models/onnxModels/simpleConv.onnxtxt"); Variable *graphOutputVar; Tensor data; getNCHWData(&data, 1, 1, 3, 3); ONNXModelLoader onnxLD(NetFilename, {"data"}, {&data}, *F); graphOutputVar = onnxLD.getSingleOutput(); EE.compile(CompilationMode::Infer, F); EE.run({}, {}); ONNXモデル をロード、コンパイル、推論 https://github.com/pytorch/glow/blob/master/tests/unittests/onnxImporterTest.cpp#L28
  • 32. ExecutionEngine EE{BackendKind::Interpreter}; auto &mod = EE.getModule(); Function *F = mod.createFunction("main"); std::string NetDescFilename("tests/models/caffe2Models/predict_net.pbtxt"); std::string NetWeightFilename("tests/models/caffe2Models/init_net.pbtxt"); Variable *output; Tensor data; getNCHWData(&data, 1, 1, 3, 3); caffe2ModelLoader caffe2LD(NetDescFilename, NetWeightFilename, {"data"}, {&data}, *F); output = caffe2LD.getSingleOutput(); EE.compile(CompilationMode::Infer, F); EE.run({}, {}); Caffe2モデル をロード、コンパイル、推論 https://github.com/pytorch/glow/blob/master/tests/unittests/caffe2ImporterTest.cpp
  • 34. ExecuteEngine compile バックエンドのgenerateIR : IRの生成 run CompiledFunction の生成 (各バックエンド毎) CompiledFunction の 実行 (execute)実行 コンパイル save バックエンドのsave : IRの保存保存
  • 35. void ExecutionEngine:: compile(CompilationMode mode, Function *F, const Context &ctx) { optimizeFunction(mode, F); // optimization (covered later) function_ = backend_-> compile(F, ctx); // compile (covered later) } The mode argument is used during optimization. ExecutionEngine::compile https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
  • 36. void glow::runBatch(ExecutionEngine &EE, size_t iterations, size_t &sampleCounter, llvm::ArrayRef<Variable *> vars, llvm::ArrayRef<Tensor *> inputs) { size_t batchSize = vars[0]->getType()->dims()[0]; for (size_t i = 0; i < iterations; i++) { for (int i = 0, e = ph.size(); i < e; i++) { auto *backingTensor = ctx.get(ph[i]); auto dim = inputs[i]->dims(); size_t slc = sampleCounter % dim[0]; backingTensor->copyConsecutiveSlices(inputs[i], slc); } glow::updateVariablesFromBatch(vars, inputs, sampleCounter); EE.run(); sampleCounter += batchSize; } } glow::runBatch https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
  • 37. void ExecutionEngine:: run() { assert(function_ && "No function has been compiled"); // Make sure that the context has backing tensors for all placeholders. ctx.allocate(M_.getPlaceholders()); function_->setupRuns(); function_->beforeRun(ctx); function_->execute(); function_->afterRun(ctx); function_->tearDownRuns(); } ExecutionEngine::run https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
  • 38. void ExecutionEngine:: save(CompilationMode mode, Function *F, llvm::StringRef outputDir, llvm::StringRef networkName) { optimizeFunction(mode, F); // optimization (covered later) backend_->save(F, outputDir, networkName); } ExecutionEngine::save https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
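The slides show the implementation of save but not a call site. Below is a minimal, hedged sketch (not from the slides) of how a caller might use it for ahead-of-time compilation; the choice of the CPU backend and the arguments "bundle_out" and "mnist_net" are made-up example values matching the (outputDir, networkName) signature above.

  // Hedged sketch: ahead-of-time compilation via ExecutionEngine::save.
  ExecutionEngine EE(BackendKind::CPU);
  auto &mod = EE.getModule();
  Function *F = mod.createFunction("main");
  // ... build the graph in F, or load it via the ONNX/Caffe2 loaders ...
  EE.save(CompilationMode::Infer, F, "bundle_out", "mnist_net");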
  • 40.   1)、The graph is either loaded via the graph loader   (from ONNX or Caffe2 format),     or constructed via the C++ interface.   2)、The graph is differentiated if needed.   3)、The graph is optimized.   4)、Linear algebra node lowering takes place.   5)、Additional rounds of optimizations occur,     both target independent and target specific.   6)、The graph is scheduled into a linear sequence of nodes     that minimizes memory usage.   7)、IRGen converts the low-level graph into instructions.   8)、Low-level IR optimizations are performed.   9)、Backend-specific optimizations     and code generation are performed. https://github.com/pytorch/glow/blob/master/docs/IR.md
  • 41. void ExecutionEngine:: compile(CompilationMode mode, Function *F, const Context &ctx) { optimizeFunction(mode, F); // optimization function_ = backend_-> compile(F, ctx); // compile (covered later) } The mode argument is used during optimization. ExecutionEngine::compile https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
  • 42. void ExecutionEngine:: optimizeFunction(CompilationMode mode, Function *F) { // Verify the function pre-optimization/lowering. F->verify(); // Optimize the graph. ::glow::optimize(F, mode); // Allow the backend to transform the graph prior to lowering. if (backend_->transformPreLowering(F, mode)) { // Optimize the graph again after the backend transformation. // In particular, DCE is very likely to be useful. ::glow::optimize(F, mode); } ExecutionEngine::optimizeFunction https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
  • 43. // Lower the graph into a sequence of low-level linear algebra operations. ::glow::lower(F, *backend_); // Optimize the graph again. ::glow::optimize(F, mode); // Allow the backend to transform the graph after lowering. if (backend_->transformPostLowering(F, mode)) { // Optimize the graph again after the backend transformation. // In particular, DCE is very likely to be useful. ::glow::optimize(F, mode); } } ExecutionEngine::optimizeFunction https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
  • 44. 1)、::glow::optimize(F, mode); 2)、if (backend_->transformPreLowering(F, mode)) ::glow::optimize(F, mode); 3)、::glow::lower(F, *backend_); 4)、::glow::optimize(F, mode); 5)、if (backend_->transformPostLowering(F, mode)) ::glow::optimize(F, mode); Extracting only the optimization steps (from ExecutionEngine::optimizeFunction, called during compile)
  • 45. 1)、::glow::optimize(F, mode); 2)、if (backend_->transformPreLowering(F, mode)) ::glow::optimize(F, mode); 3)、::glow::lower(F, *backend_); 4)、::glow::optimize(F, mode); 5)、if (backend_->transformPostLowering(F, mode)) ::glow::optimize(F, mode); transformPreLowering / transformPostLowering
  • 46. In the current implementations (Interpreter, CPU, OpenCL), transformPostLowering is implemented by the CPU and OpenCL backends, but transformPreLowering has no implementation. In CPUBackend: 1) Convolution is replaced by a CPU-optimized version, 2) MaxPooling followed by Splat is merged and replaced by CPUMaxSplat. In OpenCLBackend: 1) Convolution, 2) MaxPooling, and 3) AvgPooling are each replaced by OpenCL-optimized versions. Implementations of transformPreLowering / transformPostLowering
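As a hedged illustration of what such an override does, here is a rough sketch, not Glow source: MyBackend and MyAcceleratorConvNode are hypothetical names, and the traversal/replacement calls only approximate Glow's graph API as it appears elsewhere in these slides.

  // Hypothetical backend-specific transform run after lowering.
  bool MyBackend::transformPostLowering(Function *F, CompilationMode mode) const {
    bool changed = false;
    for (auto &node : F->getNodes()) {
      // Look for generic convolutions produced by lowering.
      if (auto *CN = llvm::dyn_cast<ConvolutionNode>(&node)) {
        // Create the hypothetical accelerator-friendly replacement with the
        // same operands, then point all users of the old result at it.
        Node *fast = F->addNode(new MyAcceleratorConvNode(CN));
        CN->getResult().replaceAllUsesOfWith(fast);
        changed = true;
      }
    }
    return changed; // true tells the ExecutionEngine to run ::glow::optimize() again
  }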
  • 48.   1)、The graph is either loaded via the graph loader   (from ONNX or Caffe2 format),     or constructed via the C++ interface.   2)、The graph is differentiated if needed.   3)、The graph is optimized.   4)、Linear algebra node lowering takes place.   5)、Additional rounds of optimizations occur,     both target independent and target specific.   6)、The graph is scheduled into a linear sequence of nodes     that minimizes memory usage.   7)、IRGen converts the low-level graph into instructions.   8)、Low-level IR optimizations are performed.   9)、Backend-specific optimizations     and code generation are performed. https://github.com/pytorch/glow/blob/master/docs/IR.md
  • 49. # The ExecutionEngine takes the backend kind at construction time ExecutionEngine EE(executionBackend); ExecutionEngine.h   # The default is Interpreter ExecutionEngine(BackendKind backendKind = BackendKind::Interpreter); ExecutionEngine.cpp # Create a backend of the specified kind ExecutionEngine::ExecutionEngine(BackendKind backendKind) : backend_( createBackend(backendKind)) {} ExecutionEngine::ExecutionEngine https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
  • 50. Backend *glow::createBackend(BackendKind backendKind) { switch (backendKind) { case BackendKind::Interpreter: # Interpreter (naive implementation) return createInterpreter(); case BackendKind::OpenCL: # OpenCL (host code & OpenCL kernels) return createOCLBackend(); case BackendKind::CPU: # CPU (LLVM) return createCPUBackend(); } llvm_unreachable("unreachable"); } glow::createBackend https://github.com/pytorch/glow/blob/master/lib/Backends/Backends.cpp
  • 51. Backend *createInterpreter() { return new Interpreter(); } Backend *createCPUBackend() { return new CPUBackend(); } Backend *createOCLBackend() { return new OCLBackend(); } Creating the backends https://github.com/pytorch/glow/blob/master/lib/Backends/
  • 54. virtual std::unique_ptr<CompiledFunction> compile(std::unique_ptr<IRFunction> IR) const = 0; InterpreterBackend llvm::make_unique<InterpreterFunction>(std::move(IR)) CPUBackend llvm::make_unique<CPUFunction>(std::move(JIT), heap) OpenCLBackend llvm::make_unique<OpenCLFunction>(std::move(IR)) compile https://github.com/pytorch/glow/blob/master/include/glow/Backends/Backend.h#L43
  • 55. std::unique_ptr<CompiledFunction> Interpreter::compile(Function *F) const { auto IR = generateAndOptimizeIR(F, shouldShareBuffers()); return compileIR(std::move(IR)); } Interpreter::compile https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/Interpreter.cpp#L27
  • 56. std::unique_ptr<CompiledFunction> Interpreter::compileIR(std::unique_ptr<IRFunction> IR) const { MemoryAllocator constantWeightsAllocator("ConstantWeights", 0); MemoryAllocator placeholderWeightsAllocator("PlaceholderWeights", 0); MemoryAllocator activationsAllocator("Activations", 0); runtime::RuntimeBundle bundle = generateRuntimeBundle(*IR, constantWeightsAllocator, placeholderWeightsAllocator, activationsAllocator); return llvm::make_unique<InterpreterFunction>(std::move(IR), bundle); } Interpreter::compileIR https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/Interpreter.cpp#L27
  • 57. std::unique_ptr<CompiledFunction> CPUBackend::compile(Function *F) const { auto IR = generateAndOptimizeIR(F, shouldShareBuffers()); return compileIR(std::move(IR)); } CPUBackend::compile https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUBackend.cpp#L146
  • 58. std::unique_ptr<CompiledFunction> CPUBackend::compileIR(std::unique_ptr<IRFunction> IR) const { AllocationsInfo allocationsInfo; std::unique_ptr<LLVMIRGen> irgen = createIRGen(IR.get(), allocationsInfo); irgen->initTargetMachine(target.empty() ? "" : target.getValue(), llvm::CodeModel::Model::Large); irgen->initCodeGen(); allocateJITMemory(IR.get(), irgen->getAllocationsInfo()); emitJitMain(*irgen); irgen->performCodeGen(); CPUBackend::compileIR https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUBackend.cpp
  • 59. auto JIT = llvm::make_unique<llvm::orc::GlowJIT>(irgen->getTargetMachine()); JIT->addModule(irgen->borrowModule()); MemoryAllocator constantAllocator("ConstantWeights", 0); MemoryAllocator placeholderAllocator("Placeholders", 0); MemoryAllocator activationsAllocator("Activations", 0); runtime::RuntimeBundle runtimeInfo = generateRuntimeBundle( *IR, constantAllocator, placeholderAllocator, activationsAllocator); return llvm::make_unique<CPUFunction>(std::move(JIT), runtimeInfo); } CPUBackend::compileIR https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUBackend.cpp
  • 60. std::unique_ptr<CompiledFunction> OCLBackend::compile(Function *F) const { auto IR = generateAndOptimizeIR(F, shouldShareBuffers()); return compileIR(std::move(IR)); } OpenCLBackend::compile https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.cpp
  • 61. std::unique_ptr<CompiledFunction> OCLBackend::compileIR(std::unique_ptr<IRFunction> IR) const { MemoryAllocator allocator("GPU", 0xFFFFFFFF); runtime::RuntimeBundle bundle = generateRuntimeBundle(*IR, allocator, allocator, allocator); return llvm::make_unique<OpenCLFunction>(std::move(IR), bundle); } OpenCLBackend::compileIR https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.cpp
  • 63.   1)、The graph is either loaded via the graph loader   (from ONNX or Caffe2 format),     or constructed via the C++ interface.   2)、The graph is differentiated if needed.   3)、The graph is optimized.   4)、Linear algebra node lowering takes place.   5)、Additional rounds of optimizations occur,     both target independent and target specific.   6)、The graph is scheduled into a linear sequence of nodes     that minimizes memory usage.   7)、IRGen converts the low-level graph into instructions.   8)、Low-level IR optimizations are performed.   9)、Backend-specific optimizations     and code generation are performed. https://github.com/pytorch/glow/blob/master/docs/IR.md
  • 64. std::unique_ptr<IRFunction> glow::generateAndOptimizeIR(Function *F, bool shouldShareBuffers) { auto IR = llvm::make_unique<IRFunction>(F); # generate the IR IR->generateIR(); # optimize, using the backend's shouldShareBuffers flag ::glow::optimize(*IR, shouldShareBuffers); return IR; } IR generation and IR optimization using the backend https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp
  • 65. void IRFunction::generateIR() { assert(G_->verify() && "Invalid function"); // Schedule the nodes. NodesPtrList ScheduledNodes; scheduleGraph(ScheduledNodes); IRGenVisitor irgen(this); for (auto &N : ScheduledNodes) { N->visit(nullptr, &irgen); } } IR generation https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
  • 66. void IRFunction::scheduleGraph(NodesPtrList &Schedule) { Schedule.clear(); for (auto &N : G_->getParent()->getVars()) { Schedule.push_back(N); } for (auto &N : G_->getParent()->getPlaceholders()) { Schedule.push_back(N); } Graph scheduling: first half https://github.com/pytorch/glow/blob/master/lib/IR/GraphScheduler.cpp
  • 67. auto numVars = G_->getParent()->getConstants().size(); auto numPlaceholders = G_->getParent()->getPlaceholders().size(); (void)numVars; (void)numPlaceholders; std::unique_ptr<Scheduler> scheduler{ createScheduler(graphScheduler, *G_, Schedule)}; scheduler->schedule(); assert(scheduler->getSchedule().size() == G_->getNodes().size() + numPlaceholders + numVars && "All graph nodes have to be scheduled"); } Graph scheduling: second half https://github.com/pytorch/glow/blob/master/lib/IR/GraphScheduler.cpp#L172
  • 69.   1)、The graph is either loaded via the graph loader   (from ONNX or Caffe2 format),     or constructed via the C++ interface.   2)、The graph is differentiated if needed.   3)、The graph is optimized.   4)、Linear algebra node lowering takes place.   5)、Additional rounds of optimizations occur,     both target independent and target specific.   6)、The graph is scheduled into a linear sequence of nodes     that minimizes memory usage.   7)、IRGen converts the low-level graph into instructions.   8)、Low-level IR optimizations are performed.   9)、Backend-specific optimizations     and code generation are performed. https://github.com/pytorch/glow/blob/master/docs/IR.md
  • 70. std::unique_ptr<IRFunction> glow::generateAndOptimizeIR(Function *F, bool shouldShareBuffers) { auto IR = llvm::make_unique<IRFunction>(F); # generate the IR IR->generateIR(); # optimize, using the backend's shouldShareBuffers flag ::glow::optimize(*IR, shouldShareBuffers); return IR; } IR generation and IR optimization using the backend https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp
  • 71. void glow::optimize(IRFunction &M, CompilationMode mode, const Backend &B) { M.verify(); if (!optimizeIR) return; performPeepholeOptimizations(M); eliminateDeadStores(M); // Replace applicable InsertTensors and ExtractTensors with TensorViews. optimizeInserts(M); optimizeExtracts(M); if (B.shouldShareBuffers()) // Reuse buffers from previous operations. shareBuffers(M); IR optimization https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp#L1602
  • 72. performPeepholeOptimizations(M); hoistDealloc(M); // Shorten the lifetime of buffers. sinkAllocas(M); eliminateDeadStores(M); // Perform Dead Store Elimination. deleteDeadAllocs(M); makeWeightsConst(M); // Turn read-only weights into constant weights. performDebugInstrumentation(M); if (dumpOptMod) // Print the module to stdout if requested. M.dump(); M.verify(); } IR optimization https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp#L1596
  • 74. class CompiledFunction { public: virtual ~CompiledFunction() = default; virtual void execute() = 0; virtual void setupRuns() = 0; virtual void beforeRun(const Context &ctx) = 0; virtual void afterRun(const Context &ctx) = 0; virtual void tearDownRuns() = 0; }; CompiledFunction https://github.com/pytorch/glow/blob/master/include/glow/Backends/CompiledFunction.h
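Every backend's compile() returns an object implementing this interface, which ExecutionEngine::run drives via setupRuns/beforeRun/execute/afterRun/tearDownRuns. As a hedged illustration (not Glow source), a minimal do-nothing implementation for a hypothetical backend could look like this, assuming only the five virtual methods shown above.

  #include "glow/Backends/CompiledFunction.h"

  using namespace glow;

  // Hypothetical example class, not part of Glow.
  class NullFunction final : public CompiledFunction {
  public:
    void setupRuns() override {}                   // one-time setup before any run
    void beforeRun(const Context &ctx) override {} // e.g. copy inputs into place
    void execute() override {}                     // the actual computation
    void afterRun(const Context &ctx) override {}  // e.g. copy outputs back out
    void tearDownRuns() override {}                // release per-run resources
  };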
  • 75. class InterpreterFunction final : public CompiledFunction { /// The IR to be executed. std::unique_ptr<IRFunction> F_; /// Maps values to Tensors, that are owned by this class. std::unordered_map<const Value *, Tensor *> tensors_; /// Maps values to Tensors, that are *not* owned by this class. std::unordered_map<const Value *, Tensor *> externalTensors_; public: InterpreterFunction(std::unique_ptr<IRFunction> F, const Context &ctx); ~InterpreterFunction() override; void execute() override; InterpreterFunction https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/InterpreterFunction.h#L43
  • 76. void InterpreterFunction::execute() { #define DEF_VALUE(CLASS, NAME) #define DEF_INSTR(CLASS, NAME) case Kinded::Kind::CLASS##Kind: { fwd##CLASS(llvm::cast<CLASS>(&I)); break; } #define DEF_BACKEND_SPECIFIC_INSTR(CLASS, NAME) for (const auto &I : F_->getInstrs()) { switch (I.getKind()) { # <= dispatch on each operator! #include "glow/AutoGenInstr.def" default: llvm_unreachable("Invalid instruction."); } } } InterpreterFunction::execute https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/InterpreterFunction.cpp
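To see what the #include expands to, take the copy instruction as an example: assuming AutoGenInstr.def contains DEF_INSTR(CopyInst, copy), the DEF_INSTR macro above generates one switch case that forwards to the matching fwd* handler.

  // Approximate expansion of DEF_INSTR(CopyInst, copy) inside the switch:
  case Kinded::Kind::CopyInstKind: {
    fwdCopyInst(llvm::cast<CopyInst>(&I)); // per-instruction handler in the Interpreter
    break;
  }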
  • 77. class CPUFunction final : public CompiledFunction { std::unique_ptr<llvm::orc::GlowJIT> JIT_; void *heap_; public: CPUFunction(std::unique_ptr<llvm::orc::GlowJIT> JIT, void *heap); ~CPUFunction() override; void execute() override; }; CPUFunction https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUFunction.h
  • 78. void CPUFunction::execute() { auto sym = JIT_->findSymbol( "jitmain"); using JitFuncType = void (*)(uint8_t * constantWeightVars, uint8_t * mutableWeightVars, uint8_t * activations); auto address = sym.getAddress(); if (address) { JitFuncType funcPtr = reinterpret_cast<JitFuncType>(address.get()); funcPtr(runtimeBundle_.getConstants(), baseMutableWeightVarsAddress_, baseActivationsAddress_); } else { GLOW_ASSERT(false && "Error getting address."); } } CPUFunction::execute https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUFunction.cpp#L29
  • 79. class OpenCLFunction final : public CompiledFunction { cl_device_id deviceId_; cl_context context_; cl_command_queue commands_; cl_mem deviceBuffer_{0}; std::vector<KernelLaunch> kernelLaunches_; public: explicit OpenCLFunction(std::unique_ptr<IRFunction> F); ~OpenCLFunction() override; void execute() override; OpenCLFunction https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.h
  • 80. void OpenCLFunction::execute() { # very long # # basically: # # for each layer: # # 1) generate the host-side code and OpenCL kernel for that layer # 2) compile the OpenCL kernel # 3) run it following the usual OpenCL conventions (enqueueKernel) # # clFinish(commands_); then waits until all OpenCL kernels have finished # } OpenCLFunction::execute https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.cpp
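Since the slide only paraphrases the body, here is a hedged, generic sketch of the per-kernel pattern it describes, using only the standard OpenCL host API rather than Glow's own helpers; the kernel source, kernel name, and buffer/size arguments are placeholders.

  #include <CL/cl.h>

  // Hedged sketch of the enqueue pattern described above; error checks omitted.
  void launchOneKernel(cl_context ctx, cl_device_id dev, cl_command_queue queue,
                       const char *kernelSource, const char *kernelName,
                       cl_mem deviceBuffer, size_t globalWorkSize) {
    // 1) build the program holding the kernel for this layer
    cl_program prog =
        clCreateProgramWithSource(ctx, 1, &kernelSource, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);

    // 2) create the kernel and bind its arguments
    cl_kernel kernel = clCreateKernel(prog, kernelName, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &deviceBuffer);

    // 3) enqueue it with the usual OpenCL conventions
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &globalWorkSize, nullptr,
                           0, nullptr, nullptr);

    // wait for all enqueued kernels to finish
    clFinish(queue);

    clReleaseKernel(kernel);
    clReleaseProgram(prog);
  }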
  • 81. Quantization (FP32 => INT8) https://github.com/pytorch/glow/blob/master/docs/Quantization.md
  • 82.   ・FP32 => INT8 ・Profile-guided quantization: observe execution during inference to estimate the possible numeric range of each stage of the neural network ・Training-based quantization is under consideration for future support Quantization in Glow https://github.com/pytorch/glow/blob/master/docs/Quantization.md
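The {scale, offset} pairs that appear in NodeQuantizationInfo on the next slide follow the usual affine quantization scheme described in Quantization.md (real value = (quantized - offset) * scale). As a hedged, textbook-style illustration rather than Glow's exact rounding and clipping code, mapping an observed float range to INT8 and back looks roughly like this:

  #include <algorithm>
  #include <cmath>
  #include <cstdint>

  // Hedged sketch of affine INT8 quantization from an observed [min, max] range.
  struct QuantParams { float scale; int32_t offset; };

  QuantParams chooseParams(float min, float max) {
    // Map [min, max] onto the 256 representable int8 values [-128, 127].
    float scale = (max - min) / 255.0f;
    int32_t offset = static_cast<int32_t>(std::round(-128.0f - min / scale));
    return {scale, offset};
  }

  int8_t quantize(float x, QuantParams p) {
    int32_t q = static_cast<int32_t>(std::round(x / p.scale)) + p.offset;
    return static_cast<int8_t>(std::max(-128, std::min(127, q))); // clip to int8
  }

  float dequantize(int8_t q, QuantParams p) {
    return p.scale * (static_cast<int32_t>(q) - p.offset);
  }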
  • 83. std::vector<NodeQuantizationInfo> QI{ {NodeQuantizationInfo::generateNodeOutputName(input->getName()), {0.2f, 0}}, {NodeQuantizationInfo::generateNodeOutputName(W->getName()), {0.3f, 0}}, {NodeQuantizationInfo::generateNodeOutputName(B->getName()), {0.4f, 0}}, {NodeQuantizationInfo::generateNodeOutputName(FC->getName()), {0.6f, 0}}, }; F = quantization::quantizeFunction(EE, QI, F); // Make sure that graph can be compiled and run. EE.compile(CompilationMode::Infer, F); EE.run({}, {}); Example of quantization::quantizeFunction https://github.com/pytorch/glow/blob/master/tests/unittests/quantizationTest.cpp
  • 84. Function * quantizeFunction(const ExecutionEngine &EE, llvm::ArrayRef<NodeQuantizationInfo> quantizationInfos, Function *F, llvm::StringRef newFuncName = ""); quantization::quantizeFunction https://github.com/pytorch/glow/blob/master/include/glow/Quantization/Quantization.h
  • 85. https://github.com/pytorch/glow Glow : Graph Compiler & Execution Engine High-Level Graph => Low-Level IR => Machine Code   Backends:   Interpreter CPU OpenCL