TensorFlow dynamically loadable XLA plugin: source code analysis

TensorFlow
dynamically loadable XLA plugin
Source code analysis solo talk
2018/05/24 (Thu) @ LeapMind
@Vengineer

Blog (since 2007): Vengineerの戯言
  http://blogs.yahoo.co.jp/verification_engineer
SlideShare:
  https://www.slideshare.net/ssuser479fa3
Twitter (since 2009):
  @Vengineer
These days, a source code analysis craftsman

Today's agenda
1) What is TensorFlow XLA?
2) dynamically loadable XLA plugin
3) What happens when it is applied to various devices?
   Raspberry Pi 3 / HiKey960 / Ultra96

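What is TensorFlow XLA?
https://www.tensorflow.org/performance/xla/
XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that optimizes TensorFlow computations. The results are improvements in speed, memory usage, and portability on server and mobile platforms. Initially, most users will not see large benefits from XLA, but are welcome to experiment by using XLA via just-in-time (JIT) compilation or ahead-of-time (AOT) compilation. Developers targeting new hardware accelerators are especially encouraged to try out XLA.
(The slide showed this passage as a Google Translate rendering of the English original.)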

TensorFlow XLA supports the following two modes:
1) JIT (Just-In-Time) compilation
   Single machine only, with one GPU
2) AOT (Ahead-Of-Time) compilation
   CPU only: x86 / x86-64 / ARM / AARCH64
   (PowerPC was dropped as of r1.5)

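TensorFlow XLA: AOT compiler
(Figure: AOT flow. Various models → TensorFlow → freeze_graph (GraphDef; Variables → Const) → optimization → code generation → executable object. Labels: "runs on the host (PC)" and "runs on the target". Code generation currently uses LLVM (CPU); PowerPC was dropped in r1.5.)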

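TensorFlow XLA: JIT compiler (r1.5 and later)
(Figure: JIT flow. TensorFlow Graph → conversion to an XLA graph → optimization 1: target-hardware-independent optimization, HLO (High Level Optimizer) → XLA graph → optimization 2 and code generation: target-hardware-dependent optimization, LLO (Low Level Optimizer). Other labels: LLVM, Compiler::compile, RunHloPasses, RunBackend.)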

XLA-capable devices
TensorFlow w/XLA: TensorFlow, Compiled! Expressiveness with performance
https://autodiff-workshop.github.io/slides/JeffDean.pdf

Cloud TPU : System Architecture
Source: https://cloud.google.com/tpu/docs/system-architecture
・TPU estimator
・TensorFlow Client
・TensorFlow Server
・XLA Compiler
・Cloud TPU

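CQ Publishing
I wrote articles on TensorFlow XLA AOT (r1.0) for the August and September 2017 issues of Interface magazine.
August issue:
  Breaking news on a notable technology with the potential for a dramatic performance boost
  Making AI run smoothly
  Exploring Google's new TensorFlow feature "XLA"
September issue:
  Challenges of a cutting-edge technology enthusiast
  Exploring the TensorFlow XLA AOT compiler for smooth-running AI
  My first Google source code!
  Exploring the potential of compilers for AI
Source: http://www.kumikomi.net/interface/contents/201708.php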

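CQ Publishing
I wrote an article on TensorFlow XLA JIT (r1.4) for the February 2018 issue of Interface magazine.
Feature: "Researching the AI & IoT technology of mighty Google"
Part 2: Researching AI development environments
  Chapter 1: Who will conquer the future continent of deep learning?
    Exploring the potential of TensorFlow XLA: a study of the "Google AI is the strongest" theory
  Chapter 2: The feature that assigns formulas to dedicated devices is evolving rapidly
    The mechanism that lets Google TensorFlow support all sorts of processors
Source: http://www.kumikomi.net/interface/contents/201802.php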

What is TensorFlow XLA doing inside?
TensorFlow User Group Hardware Group #2, 2017/4/21
https://www.slideshare.net/ssuser479fa3/tensorflow-xla-75055947
The potential of TensorFlow XLA
Deep Learning Acceleration study group, 2017/09/03
TensorFlow XLA and hardware
Chainer MeetUp #6, 2017/9/30
TensorFlow XLA: JIT edition (r1.3)
https://www.slideshare.net/ssuser479fa3/tensroflow-xla-jit

def testXLA_JIT(self):
    with tf.Session() as sess:
        x = tf.placeholder(tf.float32, [2], name="x")
        with tf.device("device:XLA_GPU:0"):
            y = x * 2
        result = sess.run(y, {x: [1.5, 0.5]})
To run this on XLA_GPU:

1) Adding Feed/Fetch nodes
(Figure: graph with the nodes Feed(x), _Recv, Const, Mul, _Send, Fetch(y))

2) Placement
(Figure: the same nodes with device labels: "cpu" on Feed(x) and Fetch(y), plus two "XLA_GPU" labels)

3) Graph partitioning
(Figure: the graph split into a cpu partition holding Feed(x) and Fetch(y) and an XLA_GPU partition holding Mul and Const, connected through _Send/_Recv pairs)

3) Graph partitioning
(Figure: the same partitions with an _XlaLaunch node in the XLA_GPU partition; Feed(x) and Fetch(y) stay on cpu, connected through _Send/_Recv pairs)

Converting multiple ops into a single _XlaLaunch op
(Figure: an _XlaLaunch node on XLA_GPU, wrapping Mul and Const on gpu)
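The conversion on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not TensorFlow's actual clustering pass: the `Node` type and the `cluster_for_device` helper are hypothetical, and the sketch simply collapses every op placed on the XLA device into one `_XlaLaunch` node.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    device: str
    fused_ops: List[str] = field(default_factory=list)


def cluster_for_device(nodes, xla_device):
    """Collapse all nodes placed on xla_device into a single _XlaLaunch node."""
    kept, fused = [], []
    for node in nodes:
        (fused if node.device == xla_device else kept).append(node)
    if fused:
        # The launch node remembers which ops it replaced.
        kept.append(Node("_XlaLaunch", xla_device, [n.name for n in fused]))
    return kept


graph = [
    Node("Feed(x)", "cpu"),
    Node("Const", "XLA_GPU"),
    Node("Mul", "XLA_GPU"),
    Node("Fetch(y)", "cpu"),
]
clustered = cluster_for_device(graph, "XLA_GPU")
for node in clustered:
    print(node.name, node.device, node.fused_ops)
```

The real graph also carries edges and _Send/_Recv pairs between partitions, which this sketch leaves out.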

https://github.com/NervanaSystems/ngraph
Intel nGraph library
ONNX
neon
TensorFlow
MXNet
NNP = ARGON ?

TensorFlow r1.3
XLA + Intel nGraph
@Vengineer
2018/03/21, 03/25
New code was published on 03/29 and this code was removed from GitHub, so this material was shelved.

TensorFlow

TensorFlow
proposal
https://blogs.yahoo.co.jp/verification_engineer/71526428.html
2018/04/09

TensorFlow
Contents of the dynamically loadable XLA plugin
https://blogs.yahoo.co.jp/verification_engineer/71526444.html
2018/04/10

TensorFlow
https://github.com/NervanaSystems/ngraph-tensorflow
tensorflow/compiler/plugin/dynamic
The XLA-side code in TensorFlow no longer needs to be modified

"""Configuration file for an XLA plugin.
- please don't check in changes to this file
- to prevent changes appearing in git status, use:
git update-index --assume-unchanged tensorflow/compiler/plugin/BUILD
To add additional devices to the XLA subsystem, add targets to the
dependency list in the 'plugin' target. For instance:
deps = ["//tensorflow/compiler/plugin/example:plugin_lib"],
"""
licenses(["notice"])
package(
    default_visibility = ["//visibility:public"],
)
cc_library(
    name = "plugin",
    deps = [
        "//tensorflow/compiler/plugin/dynamic:dynamic_plugin_lib",
    ],
)
BUILD

+-------------------+
|    TensorFlow     |
|                   |
| +---------------+ |
| |     XLA       | |
| |               | |
| +----+----^-----+ |
|      |    |       |
| +----v----+-----+ |     +---------------------+
| |    dynamic    +-------> libngraph_plugin.so |
| |   plugin lib  <-------+                     |
| +---------------+ |     +---------------------+
+-------------------+

Developing a new backend for XLA
https://www.tensorflow.org/performance/xla/developing_new_backend
Scenario 3: Non-CPU-like hardware without an existing LLVM backend
If it is not possible to utilize LLVM, then the best option is to implement a new backend for XLA for the desired hardware. This option requires the most effort. The classes that need to be implemented are as follows:
・StreamExecutor: For many devices not all methods of StreamExecutor are needed. See existing StreamExecutor implementations for details.
・xla::Compiler: This class encapsulates the compilation of an HLO computation into an xla::Executable.
・xla::Executable: This class is used to launch a compiled computation on the platform.
・xla::TransferManager: This class enables backends to provide platform-specific mechanisms for constructing XLA literal data from given device memory handles. In other words, it helps encapsulate the transfer of data from the host to the device and back.

To add a device to TensorFlow XLA:
1) Register the Platform (including the Executor = StreamExecutor)
2) Register the kernels
3) Register the Compiler = xla::Compiler (including xla::Executable)
4) Register the Computation Placer
5) Register the Transfer Manager = xla::TransferManager
6) Register the Device

To add a device to TensorFlow XLA:
1) Register the Platform (including the Executor = StreamExecutor)
   Registering the Platform & Executor and the Device is required by TensorFlow itself
2) Register the kernels
3) Register the Compiler = xla::Compiler (including xla::Executable)
4) Register the Computation Placer
5) Register the Transfer Manager = xla::TransferManager
   Registering the Backend, Compiler, Computation Placer, and Transfer Manager is required by XLA

import tensorflow as tf
import numpy as np
x = tf.placeholder(tf.float32, shape=(2, 3))
y = tf.placeholder(tf.float32, shape=(3,))
# The string passed to tf.device is the device name
with tf.device("/job:localhost/replica:0/task:0/device:XLA_NGRAPH:0"):
    a = x + y
with tf.Session() as sess:
    res = sess.run(a, feed_dict={x: np.ones((2, 3)), y: np.ones((3,))})
    print("result:", res)
Sample code

・example
　　・BUILD
　　・compiler_adapter.h
　　・device_factory_adapter.cc
　　・device_factory_adapter.h
　　・disabled_test_manifest.txt
　　・executor_adapter.cc
　　・executor_adapter.h
　　・platform_adapter.cc
　　・platform_adapter.h
　　・plugin_adapter.cc
　　・plugin_adapter.h
　　・transfer_manager_adapter.h

tensorflow/compiler/plugin/dynamic/plugin_adapter.cc
volatile bool module_initialized = InitPluginModule();
bool InitPluginModule() {
  // We are running as part of TensorFlow python environment
  auto tf_root = xla::dynamic_plugin::GetTensorflowRoot();
  auto plugin_directory = tf_root + "/plugins/";
  std::string pattern = plugin_directory + "*.so";
  std::vector<std::string> files;
  auto result = tensorflow::Env::Default()->GetMatchingPaths(pattern, &files);
  tensorflow::LoadDynamicPlugin(files[0]);
}
Loading the plugin library
Loads ${tf_root}/plugins/*.so.
However, only the first library found is loaded.
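The "first match only" discovery rule can be illustrated with a small Python sketch. It is not the plugin adapter's code: `find_plugin` is a hypothetical helper, the sort is added only so the sketch is deterministic, and returning `None` for an empty directory is an assumption (the excerpt indexes `files[0]` without showing a check).

```python
import glob
import os
import tempfile


def find_plugin(tf_root):
    """Return the first shared library under <tf_root>/plugins/, or None."""
    pattern = os.path.join(tf_root, "plugins", "*.so")
    files = sorted(glob.glob(pattern))  # sorted only to keep the sketch deterministic
    return files[0] if files else None


# A throwaway directory stands in for the TensorFlow root.
with tempfile.TemporaryDirectory() as tf_root:
    os.makedirs(os.path.join(tf_root, "plugins"))
    for name in ("libb_plugin.so", "liba_plugin.so"):
        open(os.path.join(tf_root, "plugins", name), "w").close()
    chosen = os.path.basename(find_plugin(tf_root))
    print(chosen)

# With no plugins directory at all, nothing is selected.
with tempfile.TemporaryDirectory() as empty_root:
    missing = find_plugin(empty_root)
```

With two candidate libraries present, only one is selected, which is the limitation the slide points out.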

tensorflow/compiler/plugin/dynamic/plugin_adapter.cc
static bool LoadDynamicPlugin(std::string lib_path) {
  void* handle;
  auto result =
      tensorflow::Env::Default()->LoadLibrary(lib_path.c_str(), &handle);
  // Get the Plugin object
  xla::plugin::Info (*GetPluginData)();
  result = tensorflow::Env::Default()->GetSymbolFromLibrary(
      handle, "GetPluginData", (void**)(&GetPluginData));
LoadDynamicPlugin function (part 1)
Obtains a pointer to the GetPluginData function
from the loaded library

// Get the plugin info
xla::plugin::Info plugin_info = GetPluginData();
// Get the function pointers to the plugin methods
auto Version = plugin_info.Version;
auto DeviceInfo = plugin_info.DeviceInfo;
auto RunBackend = plugin_info.RunBackend;
auto GetTransferManager = plugin_info.GetTransferManager;
auto Init = plugin_info.Init;
auto SupportedDataTypes = plugin_info.SupportedDataTypes;
auto device_info = DeviceInfo();

// Create the platform id - unique for each plugin
// TODO - create a unique value for platform id. Can't use
// PLATFORM_DEFINE_ID() inside a function
static int delta = 0;
int temp;
perftools::gputools::Platform::Id kPluginPlatformId = &temp + delta;
delta++;
// Kernel registrations
auto supported_data_types = SupportedDataTypes();
REGISTER_XLA_LAUNCH_KERNEL(device_info.XLA_DEVICE_NAME,
                           tensorflow::XlaLocalLaunchOp,
                           supported_data_types);
REGISTER_XLA_DEVICE_KERNELS(device_info.XLA_DEVICE_NAME,
                            supported_data_types);
REGISTER_XLA_BACKEND(device_info.XLA_DEVICE_JIT_NAME,
                     supported_data_types, OpFilter);
Kernel registration
Backend registration

// Platform registration
std::unique_ptr<perftools::gputools::Platform> platform(
    new xla::dynamic_plugin::PlatformAdapter(
        device_info.PLATFORM_NAME, kPluginPlatformId,
        device_info.visible_device_count));
perftools::gputools::MultiPlatformManager::RegisterPlatform(
    std::move(platform));
// Call the Plugin Init
auto status = plugin_info.Init(kPluginPlatformId);
// Register the Compiler factory
xla::Compiler::RegisterCompilerFactory(kPluginPlatformId, [=]() {
  return xla::MakeUnique<xla::dynamic_plugin::CompilerAdapter>(
      kPluginPlatformId, plugin_info);
});
Platform registration
Compiler registration

tensorflow/compiler/plugin/dynamic/platform_adapter.cc
PlatformAdapter::PlatformAdapter (std::string platform_name,
perftools::gputools::Platform::Id id,
int visible_device_count)
: name_(platform_name),
id_(id),
visible_device_count_(visible_device_count) {}
PlatformAdapter class

tensorflow/compiler/plugin/dynamic/compiler_adapter.h
class CompilerAdapter : public Compiler {
 public:
  explicit CompilerAdapter(perftools::gputools::Platform::Id platform_id,
                           xla::plugin::Info info)
      : m_platform_id(platform_id), m_plugin_info(info) {}
  ~CompilerAdapter() override {}
  StatusOr<std::unique_ptr<HloModule>> RunHloPasses(
      std::unique_ptr<HloModule> module,
      perftools::gputools::StreamExecutor* executor,
      DeviceMemoryAllocator* device_allocator);
  StatusOr<std::unique_ptr<Executable>> RunBackend(
      std::unique_ptr<HloModule> hlo_module,
      perftools::gputools::StreamExecutor* stream_exec,
      DeviceMemoryAllocator* device_allocator);
CompilerAdapter class
Inherits from the Compiler class
Implements two methods

StatusOr<std::unique_ptr<HloModule>> RunHloPasses(
    std::unique_ptr<HloModule> module,
    perftools::gputools::StreamExecutor* executor,
    DeviceMemoryAllocator* device_allocator) override {
  // Delegate to the actual plugin
  return m_plugin_info.RunHloPasses(std::move(module), executor,
                                    device_allocator);
}
CompilerAdapter::RunHloPasses method
Calls the plugin's RunHloPasses method

StatusOr<std::unique_ptr<Executable>> RunBackend(
    std::unique_ptr<HloModule> hlo_module,
    perftools::gputools::StreamExecutor* stream_exec,
    DeviceMemoryAllocator* device_allocator) override {
  VLOG(1) << "Run backend " << hlo_module->name();
  TF_RET_CHECK(stream_exec != nullptr);
  auto executable =
      m_plugin_info.RunBackend(std::move(hlo_module), stream_exec);
  return std::move(executable);
}
CompilerAdapter::RunBackend method
Calls the plugin's RunBackend method
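The adapter pattern in these two methods, a compiler object that forwards every call to a table of plugin entry points, can be sketched in Python as follows. `PluginInfo` here is only a stand-in for `xla::plugin::Info`, and the lambdas are placeholder plugin functions invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PluginInfo:
    """Stand-in for xla::plugin::Info: a table of plugin entry points."""
    run_hlo_passes: Callable
    run_backend: Callable


class CompilerAdapter:
    """Forwards each compiler call to the plugin's function table."""

    def __init__(self, platform_id, info):
        self.platform_id = platform_id
        self.info = info

    def run_hlo_passes(self, module, executor, allocator):
        return self.info.run_hlo_passes(module, executor, allocator)

    def run_backend(self, module, executor, allocator):
        if executor is None:
            raise ValueError("stream executor must not be None")
        # Like the excerpt, the allocator is not forwarded to the plugin.
        return self.info.run_backend(module, executor)


info = PluginInfo(
    run_hlo_passes=lambda module, executor, allocator: module,
    run_backend=lambda module, executor: f"executable({module})",
)
adapter = CompilerAdapter(platform_id=1, info=info)
optimized = adapter.run_hlo_passes("hlo_module", "executor", None)
executable = adapter.run_backend(optimized, "executor", None)
print(executable)
```

Because the adapter holds only the function table, the XLA side needs no knowledge of the concrete plugin.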

// Computation placer registration
xla::ComputationPlacer::RegisterComputationPlacer(
kPluginPlatformId,
&xla::dynamic_plugin::CompilerAdapter::CreateComputationPlacer);
Registering XLA's Computation Placer
tensorflow/compiler/plugin/dynamic/compiler_adapter.h
static std::unique_ptr<xla::ComputationPlacer> CreateComputationPlacer() {
return xla::MakeUnique<xla::ComputationPlacer>();
}

// Transfer manager registration
// Note: Ideally - we want to create the TransferManager with an implementation
// but currently the creation is handled by the Registration method - which
// doesn't allow passing parameters to the constructor.
// This is inconsistent with the Compiler factory!
// Register with the factory
xla::dynamic_plugin::TransferManagerAdapter::Init(kPluginPlatformId);
xla::dynamic_plugin::TransferManagerAdapter* new_transfer_manager{nullptr};
const perftools::gputools::Platform* this_platform;
auto statusor = perftools::gputools::MultiPlatformManager::PlatformWithId(
    kPluginPlatformId);
if (statusor.ok()) {
  this_platform = statusor.ValueOrDie();
}

xla::StatusOr<xla::TransferManager*> s =
    xla::TransferManager::GetForPlatform(this_platform);
if (s.ok()) {
  new_transfer_manager =
      (xla::dynamic_plugin::TransferManagerAdapter*)s.ValueOrDie();
}
auto plugin_transfer_manager = GetTransferManager();
new_transfer_manager->SetImplementation(plugin_transfer_manager);
Sets the plugin's Transfer Manager

// Register the Device - at the very last. That way - if we failed with other
// steps above, the device won't be available and users will get an error at
// the Python script stage
// Set priority to be below the default priority (50),
// so that Executor is not selected as a high priority device over other
// default devices. See constructor comments for Registrar in
// tensorflow/core/common_runtime/device_factory.h for a list of priority for
// devices.
DeviceFactory::Register(
    device_info.XLA_DEVICE_NAME,
    new DeviceFactoryAdapter(device_info.PLATFORM_NAME,
                             device_info.XLA_DEVICE_NAME,
                             device_info.XLA_DEVICE_JIT_NAME),
    device_info.device_priority);
return true;
}
Device registration
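The reason for registering the device last, as the code comment explains, can be shown with a minimal Python sketch. The step names follow the order of the excerpts in this deck; `init_plugin` and `make_steps` are hypothetical helpers, and the early return on failure is an assumption about how a loader might behave, since the slides do not show the error handling.

```python
def init_plugin(steps, registry):
    """Run registration steps in order and stop at the first failure."""
    for name, step in steps:
        if not step():
            return False
        registry.append(name)
    return True


def make_steps(compiler_ok):
    return [
        ("kernels", lambda: True),
        ("platform", lambda: True),
        ("compiler", lambda: compiler_ok),
        ("computation_placer", lambda: True),
        ("transfer_manager", lambda: True),
        ("device", lambda: True),  # last, so it only appears if all else worked
    ]


good, bad = [], []
ok = init_plugin(make_steps(compiler_ok=True), good)
failed = init_plugin(make_steps(compiler_ok=False), bad)
print(ok, good)
print(failed, bad)
```

When an earlier step fails, the device never becomes visible, so a script that names it fails at the Python stage instead of later.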

tensorflow/compiler/plugin/dynamic/device_factory_adapter.h
class DeviceFactoryAdapter : public tensorflow::DeviceFactory {
public:
DeviceFactoryAdapter(const char* platform_name, const char* dev_name,
const char* dev_jit_name)
: m_platform_name(platform_name),
m_device_name(dev_name),
m_device_jit_name(dev_jit_name) {
VLOG(1) << "DeviceFactoryAdapter: Platform name: " << m_platform_name
<< " DEV name: " << m_device_name;
}
DeviceFactoryAdapter class

core/common_runtime/device_factory.{h,cc}
// The default priority values for built-in devices is:
// GPU: 210
// SYCL: 200
// GPUCompatibleCPU: 70
// ThreadPoolDevice: 60
// Default: 50
explicit Registrar(const string& device_type, int priority = 50) {
  DeviceFactory::Register(device_type, new Factory(), priority);
}
Device registration
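How the priority values order devices can be sketched with a toy registry in Python. The `DeviceRegistry` class is hypothetical and far simpler than TensorFlow's `DeviceFactory`; the numeric priorities for GPU and ThreadPoolDevice come from the comment above, while `XLA_PLUGIN` and its priority of 40 are assumptions chosen to sit below the default of 50.

```python
class DeviceRegistry:
    """Toy registry: remembers one priority per device type."""

    def __init__(self):
        self._priorities = {}

    def register(self, device_type, priority=50):
        self._priorities[device_type] = priority

    def by_priority(self):
        """Device types from highest to lowest priority."""
        return sorted(self._priorities, key=self._priorities.get, reverse=True)


registry = DeviceRegistry()
registry.register("XLA_PLUGIN", 40)  # hypothetical plugin device, below the default
registry.register("GPU", 210)
registry.register("ThreadPoolDevice", 60)
ordered = registry.by_priority()
print(ordered)
```

A plugin device registered below the default priority is therefore never preferred over the built-in devices unless a script requests it explicitly.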

In the TensorFlow dynamically loadable XLA plugin:
1) Register the Platform (including the Executor)
   Registering the Platform & Executor and the Device is required by TensorFlow itself
2) Register the kernels
3) Register the Compiler => Plugin
4) Register the Computation Placer => XLA
5) Register the Transfer Manager => Plugin
   Registering the Kernel (Backend), Compiler, Computation Placer, and Transfer Manager is required by XLA

Sample plugin
tensorflow/compiler/plugin/dynamic/example/
README.md
example_plugin.cc
transfer_manager.h
transfer_manager.cc
executable.h
executable.cc
plugin_test.py
trivial_test.py

Test code
tensorflow/compiler/plugin/dynamic/example/plugin_test.py
# Create the model
x = tf.placeholder(tf.float32, [None, 784])
w = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [None, 10])
with tf.device('/device:DYNAMIC_PLUGIN_EXAMPLE_DEVICE:0'):
    y = tf.matmul(x, w) + b
sess = tf.Session(config=config)
tf.global_variables_initializer().run(session=sess)

Plugin code: GetPluginData
tensorflow/compiler/plugin/dynamic/example/example_plugin.cc
//-----------------------------------------------------------------------------
// Global data for this Plugin
//-----------------------------------------------------------------------------
static xla::plugin::Info s_PluginInfo = {
    Version, DeviceInfo, Init, GetTransferManager,
    RunHloPasses, RunBackend, SupportedDataTypes};
//-----------------------------------------------------------------------------
// DSO Entry point
//-----------------------------------------------------------------------------
extern "C" xla::plugin::Info GetPluginData() { return s_PluginInfo; }

Plugin code: SupportedDataTypes
static std::vector<tensorflow::DataType> kPluginSupportedDatatypes = {
{tensorflow::DT_INT32, tensorflow::DT_FLOAT, tensorflow::DT_BOOL,
tensorflow::DT_DOUBLE, tensorflow::DT_INT64}};
std::vector<tensorflow::DataType> SupportedDataTypes () {
return kPluginSupportedDatatypes;
}

Plugin code: RunHloPasses
xla::StatusOr<std::unique_ptr<xla::HloModule>> RunHloPasses(
    std::unique_ptr<xla::HloModule> module,
    xla::DeviceMemoryAllocator* device_allocator) {
  std::cout << "RunHloPasses called by Plugin adapter\n";
  // TODO
  // Run the HLO optimization passes here
  return std::move(module);
}
It does nothing

Plugin code: RunBackend
std::unique_ptr<xla::Executable> RunBackend(
    std::unique_ptr<xla::HloModule> hlo_module,
    ::perftools::gputools::StreamExecutor* stream_exec) {
  print_embeded_computation(hlo_module->entry_computation());
  // Create the Executable
  std::unique_ptr<PluginExecutable> executable =
      xla::MakeUnique<PluginExecutable>(std::move(hlo_module),
                                        GetTransferManager());
  return executable;
}
Dumps the graph
Creates the PluginExecutable

Plugin code: RunBackend (print_embeded_computation helper)
static void print_embeded_computation (
const xla::HloComputation* computation,
int nest_level) {
auto embedded_computations = computation->MakeEmbeddedComputationsList();
std::cout << "DYNAMIC_PLUGIN_EXAMPLE_COMPILER computation: "
<< computation->name() << "; nest_level: " << nest_level
<< "; num_embedded: " << embedded_computations.size() << std::endl;
std::cout << computation->ToString() << std::endl;
for (auto embedded_computation : embedded_computations) {
print_embeded_computation(embedded_computation, nest_level + 1);
}
}
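The recursive dump above can be mirrored in Python to show what the nesting level tracks. The `Computation` class is a hypothetical stand-in for `xla::HloComputation`, and the computation names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Computation:
    name: str
    embedded: List["Computation"] = field(default_factory=list)


def dump(computation, nest_level=0, out=None):
    """Collect one line per computation, indenting by nesting level."""
    out = [] if out is None else out
    out.append(f"{'  ' * nest_level}{computation.name}; nest_level: {nest_level}; "
               f"num_embedded: {len(computation.embedded)}")
    for child in computation.embedded:
        dump(child, nest_level + 1, out)
    return out


entry = Computation("entry", [Computation("add_reducer"), Computation("max_reducer")])
lines = dump(entry)
print("\n".join(lines))
```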

Plugin code: PluginExecutable
tensorflow/compiler/plugin/dynamic/example/executable.h
class PluginExecutable : public xla::Executable {
public:
PluginExecutable(std::unique_ptr<xla::HloModule> hlo_module,
xla::TransferManagerInterface* transfer_manager)
: xla::Executable(std::move(hlo_module), /*hlo_profile_printer=*/nullptr,
/*hlo_profile_index_map=*/nullptr),
m_transfer_manager(transfer_manager) {}
~PluginExecutable() {}

tensorflow/compiler/plugin/dynamic/example/transfer_manager.h
class TransferManager : public xla::TransferManagerInterface {
public:
TransferManager();
~TransferManager() override {}
...
Plugin code: GetTransferManager
xla::TransferManagerInterface* GetTransferManager () {
static std::unique_ptr<xla::TransferManagerInterface> tx_manager =
std::unique_ptr<xla::TransferManagerInterface>(new TransferManager());
return tx_manager.get();
}

executable::ExecuteOnStream
tensorflow/compiler/xla/service/executable.h
// Enqueues the compilation result on the provided stream,
// passing the given arguments.
// This call is blocking and returns after the execution is done.
//
// If the hlo_execution_profile is provided as non-nullptr, profiling will be
// enabled.
//
// Returns a shaped buffer containing the result of the computation.
virtual StatusOr<std::unique_ptr<ShapedBuffer>> ExecuteOnStream(
const ServiceExecutableRunOptions* run_options,
tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
HloExecutionProfile* hlo_execution_profile) = 0;

Plugin code: ExecuteOnStream
tensorflow/compiler/plugin/dynamic/example/executable.cc
xla::StatusOr<std::unique_ptr<xla::ShapedBuffer>>
PluginExecutable::ExecuteOnStream(
const xla::ServiceExecutableRunOptions* run_options,
tensorflow::gtl::ArraySlice<const xla::ShapedBuffer*> arguments,
xla::HloExecutionProfile* hlo_execution_profile) {
se::Stream* stream = run_options->stream();
se::StreamExecutor* executor = stream->parent();
const se::Platform* platform = executor->platform();
const xla::HloComputation* computation = module().entry_computation();

// Transform the ShapedBuffer arguments into literals which the
// evaluator consumes.
std::vector<std::unique_ptr<xla::Literal>> arg_literals;
for (tensorflow::int64 p = 0; p < computation->num_parameters(); ++p) {
TF_ASSIGN_OR_RETURN(
std::unique_ptr<xla::Literal> arg_literal,
m_transfer_manager->TransferLiteralFromDevice (executor, *arguments[p]));
arg_literals.push_back(std::move(arg_literal));
}
// Execute the graph using the HloEvaluator.
xla::HloEvaluator evaluator;
TF_ASSIGN_OR_RETURN(std::unique_ptr<xla::Literal> result_literal,
evaluator.Evaluate<std::unique_ptr<xla::Literal>>(
*computation, arg_literals));
Shouldn't this be TransferLiteralToDevice?

// Make sure that the result shape is not empty
TF_RET_CHECK(!xla::ShapeUtil::IsNil(result_literal->shape()));
TF_ASSIGN_OR_RETURN(std::unique_ptr<xla::ShapedBuffer> result,
m_transfer_manager->AllocateShapedBuffer (
result_literal->shape(), run_options->allocator(),
executor->device_ordinal()));
TF_RETURN_IF_ERROR( m_transfer_manager->TransferLiteralToDevice (
executor, *result_literal, *result));
return std::move(result);
}
Shouldn't this be TransferLiteralFromDevice?

(Figure: execution flow. Code labels: evaluator.Evaluate, TransferLiteralToDevice, TransferLiteralFromDevice, AllocateShapedBuffer. Stage labels: host <= device, host => device, run on device, output buffer allocation.)
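The order of operations in the example plugin's ExecuteOnStream can be traced with a small Python model. Everything here is an assumption made for illustration: `ToyTransferManager` models device memory as a dictionary, and `execute` follows the sequence in the C++ excerpt (arguments read from device buffers into host literals, evaluation on the host, output buffer allocation, result written back to the device).

```python
class ToyTransferManager:
    """Models device memory as a dict keyed by buffer handle."""

    def __init__(self):
        self.device_memory = {}
        self._next_handle = 0

    def allocate(self):
        handle = self._next_handle
        self._next_handle += 1
        self.device_memory[handle] = None
        return handle

    def to_device(self, literal, handle):
        self.device_memory[handle] = list(literal)

    def from_device(self, handle):
        return list(self.device_memory[handle])


def execute(transfer, computation, arg_handles):
    # 1) Device buffers -> host literals.
    literals = [transfer.from_device(h) for h in arg_handles]
    # 2) Evaluate on the host (the excerpt uses HloEvaluator for this).
    result_literal = computation(*literals)
    # 3) Allocate the output buffer, then 4) host literal -> device.
    out = transfer.allocate()
    transfer.to_device(result_literal, out)
    return out


def double(values):
    return [2 * v for v in values]


tm = ToyTransferManager()
x = tm.allocate()
tm.to_device([1.5, 0.5], x)
y = execute(tm, double, [x])
result = tm.from_device(y)
print(result)
```

In this model the computation itself runs on the host, which is why the reads use the from-device direction and the write uses the to-device direction.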

Bridge TensorFlow*/XLA to run on
Intel® nGraph™ backends
https://github.com/NervanaSystems/ngraph-tensorflow-bridge
2018/03/29

StatusOr<std::unique_ptr<HloModule>> NGraphCompiler::RunHloPasses(
DeviceMemoryAllocator* device_allocator) {
HloPassPipeline pipeline("NGraph");
if (getenv("XLA_NGRAPH_SKIP_FUSION") == nullptr)
pipeline.AddPass<HloPassFix<NGraphFusion>>(&m_fusion_map);
TF_CHECK_OK(pipeline.Run(hlo_module.get()).status());
return std::move(hlo_module);
}
NGraphCompiler::RunHloPasses
　ngraph_compiler.cc

StatusOr<std::unique_ptr<Executable>> NGraphCompiler::RunBackend(
DeviceMemoryAllocator* device_allocator) {
hlo_module->mutable_entry_computation_layout()->SetToDefaultLayout();
xla::HloComputation* computation = hlo_module->entry_computation();
xla::HloInstruction* root_instruction = computation->root_instruction();
xla::Shape root_shape = root_instruction->shape();
NGraphControlEdgeDetector control_edge_detector;
TF_CHECK_OK(root_instruction->Accept(&control_edge_detector));
NGraphCompiler::RunBackend

HloModule: std::unique_ptr<HloModule> hlo_module
HloComputation: xla::HloComputation* computation = hlo_module->entry_computation();
HloInstruction: xla::HloInstruction* root_instruction = computation->root_instruction();

// This part converts HLO to nGraph
NGraphBuilder builder(computation->parameter_instructions(), &m_fusion_map);
DfsHloVisitor* hlo_visitor{};
TF_ASSIGN_OR_RETURN(hlo_visitor, builder.Visitor());
TF_CHECK_OK(root_instruction->Accept(hlo_visitor));
builder.DebugPrintInstructionsList();
std::shared_ptr<compat::XLAFunction> ng_function;
TF_ASSIGN_OR_RETURN(ng_function, builder.NGraphFunction(root_instruction));
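The visitor-based conversion can be pictured as a post-order walk from the root instruction: operands are translated before the instruction that uses them. The Python sketch below is a generic illustration of that traversal and not nGraph's builder; the `Instr` type and the string-based "target nodes" are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Instr:
    opcode: str
    operands: Tuple["Instr", ...] = ()
    name: str = ""


def translate(root):
    """Visit operands before users (post-order) and emit one target node each."""
    emitted, order = {}, []

    def visit(instr):
        if id(instr) in emitted:  # shared operands are translated only once
            return emitted[id(instr)]
        args = [visit(op) for op in instr.operands]
        label = instr.name or instr.opcode
        node = f"{instr.opcode}({', '.join(args)})" if args else label
        emitted[id(instr)] = node
        order.append(label)
        return node

    return visit(root), order


x = Instr("parameter", name="x")
two = Instr("constant", name="2")
root = Instr("multiply", (x, two))
expr, order = translate(root)
print(expr, order)
```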

// Select the backend
std::lock_guard<std::mutex> lock(m_module_mutex);
if (m_ngraph_runtime_manager == nullptr) {
std::string ngraph_backend_name(XLA_NGRAPH_DEFAULT_BACKEND);
if (const char* env_str = std::getenv(XLA_NGRAPH_BACKEND_ENV_VAR)) {
if (xla::ngraph_plugin::try_parse<std::string>(env_str,
ngraph_backend_name)) {
} else {
return InvalidArgument(
"nGraph backend specified but cannot be parsed");
}
}
m_ngraph_runtime_manager = // runtime manager
ngraph::runtime::Manager::get(ngraph_backend_name);
}
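The backend selection logic, a default that an environment variable can override, looks roughly like this in Python. The variable name `XLA_NGRAPH_BACKEND`, the default `CPU`, and the three backend names are taken from this deck; validating against a fixed set is an assumption of the sketch, since the C++ excerpt only checks that the value can be parsed.

```python
import os

DEFAULT_BACKEND = "CPU"
BACKEND_ENV_VAR = "XLA_NGRAPH_BACKEND"
KNOWN_BACKENDS = {"CPU", "GPU", "INTERPRETER"}  # assumed validation set


def select_backend(environ=os.environ):
    """Return the backend name from the environment, or the default."""
    name = environ.get(BACKEND_ENV_VAR, DEFAULT_BACKEND)
    if name not in KNOWN_BACKENDS:
        raise ValueError("nGraph backend specified but cannot be parsed")
    return name


default_choice = select_backend({})
explicit_choice = select_backend({BACKEND_ENV_VAR: "INTERPRETER"})
print(default_choice, explicit_choice)
```

Passing the environment as a parameter keeps the function easy to exercise without touching the real process environment.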

// Here, compile to an object
std::shared_ptr<ngraph::runtime::ExternalFunction> ng_runtime_function =
    m_ngraph_runtime_manager->compile(ng_function);
// Create the Executable (NGraphExecutable)
std::unique_ptr<Executable> executable;
executable.reset(new NGraphExecutable(
    std::move(hlo_module), m_ngraph_runtime_manager, ng_runtime_function));
return std::move(executable);
}

NGraphExecutable::NGraphExecutable(
    std::unique_ptr<HloModule> hlo_module,
    std::shared_ptr<ngraph::runtime::Manager> ng_manager,
    std::shared_ptr<ngraph::runtime::ExternalFunction> ng_runtime_function)
    : Executable(std::move(hlo_module), /*hlo_profile_printer=*/nullptr,
                 /*hlo_profile_index_map=*/nullptr),
      m_ng_manager(ng_manager),
      m_ng_runtime_function(ng_runtime_function) {}
NGraphExecutable::NGraphExecutable
  ngraph_executable.cc

StatusOr<std::unique_ptr<ShapedBuffer>> NGraphExecutable::ExecuteOnStream(
    const ServiceExecutableRunOptions* run_options,
    tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
    HloExecutionProfile* hlo_execution_profile) {
  // ... omitted (moving data from the host side to the device side)
  // nGraph compiles for the device specified with tf.device
  auto call_frame = ng_backend->make_call_frame(m_ng_runtime_function);
  call_frame->call(ng_result_tv_list, ng_input_tv_list);
  // ... omitted (moving data from the device side to the host side)
  return std::move(result_buffer);
}
NGraphExecutable::ExecuteOnStream
  ngraph_executable.cc

NGraphExecutable::ExecuteOnStream
(Figure: execution flow with the stages host => device, run on device, host <= device, and output buffer allocation. The highlighted code is:)
auto call_frame = ng_backend->make_call_frame(m_ng_runtime_function);
call_frame->call(ng_result_tv_list, ng_input_tv_list);

shared_ptr<runtime::CallFrame> runtime::interpreter::INT_Backend::make_call_frame(
const shared_ptr<ExternalFunction>& external_function)
{
return external_function->make_call_frame();
}
make_call_frame
ngraph-0.2.1/src/runtime/interpreter/int_backend.cpp

shared_ptr<runtime::CallFrame> runtime::interpreter::ExternalFunction::make_call_frame()
{
if (!m_is_compiled)
{
compile();
}
return make_shared<runtime::interpreter::INT_CallFrame>(shared_from_this(),
m_function);
}
make_call_frame
ngraph-0.2.1/src/runtime/interpreter/int_external_function.cpp

void runtime::interpreter::ExternalFunction::compile()
{
if (m_is_compiled)
{
return;
}
pass::Manager pass_manager;
// For now, just make everyone row-major.
pass_manager.register_pass<pass::AssignLayout<DenseTensorViewLayout>>();
pass_manager.register_pass<pass::Liveness>();
pass_manager.run_passes(m_function);
m_is_compiled = true;
}
make_call_frame
ngraph-0.2.1/src/runtime/interpreter/int_external_function.cpp
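Together, `make_call_frame` and `compile` implement a compile-once, compile-lazily pattern guarded by the `m_is_compiled` flag. A minimal Python sketch of that pattern follows; the class is a simplified stand-in, and `compile_count` exists only so the sketch can show that repeated calls do not recompile.

```python
class ExternalFunction:
    """Simplified stand-in for the interpreter's ExternalFunction."""

    def __init__(self, function):
        self.function = function
        self.is_compiled = False
        self.compile_count = 0

    def compile(self):
        if self.is_compiled:
            return
        self.compile_count += 1  # stands in for running the pass pipeline
        self.is_compiled = True

    def make_call_frame(self):
        if not self.is_compiled:
            self.compile()
        return ("call_frame", self.function)


fn = ExternalFunction("f")
frame_a = fn.make_call_frame()
frame_b = fn.make_call_frame()
print(fn.compile_count)
```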

void ngraph::pass::Manager::run_passes(shared_ptr<Function> func)
{
// find all functions
vector<shared_ptr<Function>> fs;
traverse_functions(func, [&](shared_ptr<Function> f) { fs.push_back(f); });
set<shared_ptr<Function>> tfs(begin(fs), end(fs));
get_state().set_functions(tfs);
make_call_frame
ngraph-0.2.1/src/pass/manager.cpp

for (shared_ptr<PassBase> pass : m_pass_list)
{
pass->set_state(get_state());
auto call_graph_pass = dynamic_pointer_cast<CallGraphPass>(pass);
// (omitted)
else if (call_graph_pass)
{
for (shared_ptr<Function> f : fs)
{
call_graph_pass->run_on_call_graph(f->get_ordered_ops());
}
}
}
}
make_call_frame
ngraph-0.2.1/src/pass/manager.cpp

virtual bool run_on_call_graph(
const std::list<std::shared_ptr<Node>>& nodes) override
{
for (const std::shared_ptr<Node>& node : nodes)
{
for (size_t i = 0; i < node->get_output_size(); ++i)
{
auto tv = node->get_output_tensor_view(i);
if (nullptr == tv->get_tensor_view_layout())
{
auto layout = std::make_shared<LT>(*tv);
tv->set_tensor_view_layout(layout);
}
}
}
return false;
}
AssignLayout::run_on_call_graph
ngraph-0.2.1/src/runtime/pass/assign_layout.hpp
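The pass manager and the AssignLayout pass can be condensed into a short Python sketch. All class names are simplified stand-ins for the nGraph types; the `row_major` label echoes the "make everyone row-major" comment in the compile step, and the pass only fills in a layout where none has been set, as in the C++ code.

```python
class TensorView:
    def __init__(self, name, layout=None):
        self.name = name
        self.layout = layout


class Node:
    def __init__(self, outputs):
        self.outputs = outputs


class AssignLayout:
    """Gives every output tensor view without a layout the default one."""

    def __init__(self, default_layout):
        self.default_layout = default_layout

    def run_on_call_graph(self, nodes):
        for node in nodes:
            for tv in node.outputs:
                if tv.layout is None:
                    tv.layout = self.default_layout
        return False  # mirrors the excerpt, which returns false


class PassManager:
    def __init__(self):
        self.passes = []

    def register_pass(self, graph_pass):
        self.passes.append(graph_pass)

    def run_passes(self, ordered_ops):
        for graph_pass in self.passes:
            graph_pass.run_on_call_graph(ordered_ops)


a = TensorView("a")
b = TensorView("b", layout="custom")
manager = PassManager()
manager.register_pass(AssignLayout("row_major"))
manager.run_passes([Node([a, b])])
print(a.layout, b.layout)
```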

nGraph Runtime
ngraph-0.2.1/src/runtime
nGraph implements three hardware backends:
・cpu (CPU): code generation (C++)
・gpu (GPU): code generation (CUDA)
・interpreter (INTERPRETER): interpreter

void runtime::interpreter::INT_CallFrame::call(
const vector<shared_ptr<runtime::TensorView>>& results,
const vector<shared_ptr<runtime::TensorView>>& arguments)
{
vector<shared_ptr<runtime::TensorView>> inputs;
for (shared_ptr<runtime::TensorView> argument : arguments)
{
argument->collect_tensor_views(inputs, argument);
}
vector<shared_ptr<runtime::TensorView>> outputs;
for (shared_ptr<runtime::TensorView> result : results)
{
result->collect_tensor_views(outputs, result);
}
tensor_call(outputs, inputs);
}
INT_CallFrame::call
ngraph-0.2.1/src/runtime/interpreter/int_call_frame.cpp

void runtime::interpreter::INT_CallFrame::tensor_call(
const vector<shared_ptr<runtime::TensorView>>& output_tvs,
const vector<shared_ptr<runtime::TensorView>>& input_tvs)
{
vector<shared_ptr<runtime::HostTensorView>> args;
vector<shared_ptr<runtime::HostTensorView>> out;
for (auto tv : input_tvs)
{
args.push_back(static_pointer_cast<runtime::HostTensorView>(tv));
}
for (auto tv : output_tvs)
{
out.push_back(static_pointer_cast<runtime::HostTensorView>(tv));
}
tensor_call(out, args);
}
INT_CallFrame::tensor_call

void runtime::interpreter::INT_CallFrame::tensor_call(
const vector<shared_ptr<runtime::HostTensorView>>& output_tvs,
const vector<shared_ptr<runtime::HostTensorView>>& input_tvs)
{
call(m_function, output_tvs, input_tvs);
}
INT_CallFrame::tensor_call

void runtime::interpreter::INT_CallFrame::call(
std::shared_ptr<Function> function,
const vector<shared_ptr<runtime::HostTensorView>>& output_tvs,
const vector<shared_ptr<runtime::HostTensorView>>& input_tvs)
{
// (omitted)
// Invoke computation
for (shared_ptr<Node> op : function->get_ordered_ops())
{
// (omitted)
generate_calls(base_type, secondary_type, *op, inputs, outputs);
// (omitted)
}
// (omitted)
}
INT_CallFrame::call

void runtime::interpreter::INT_CallFrame::generate_calls(
const element::Type& base_type,
const element::Type& secondary_type,
ngraph::Node& op,
const std::vector<std::shared_ptr<HostTensorView>>& args,
const std::vector<std::shared_ptr<HostTensorView>>& out)
{
if (base_type == element::boolean)
{
generate_calls<char>(secondary_type, op, args, out);
}
else if (base_type == element::f32)
{
generate_calls<float>(secondary_type, op, args, out);
}
INT_CallFrame::call

template <typename BASE>
void generate_calls(const element::Type& type,
ngraph::Node& op,
{
if (type == element::boolean)
{
op_engine<BASE, char>(op, args, out);
}
else if (type == element::f32)
{
op_engine<BASE, float>(op, args, out);
}
INT_CallFrame::call
ngraph-0.2.1/src/runtime/interpreter/int_call_frame.hpp

template <typename T, typename S>
void op_engine(ngraph::Node& node,
{
std::string node_op = node.description();
if (node_op == "Abs")
{
reference::abs<T>(reinterpret_cast<T*>(args[0]->get_data_ptr()),
reinterpret_cast<T*>(out[0]->get_data_ptr()),
out[0]->get_element_count());
}
else if (node_op == "Acos")
{
reference::acos<T>(reinterpret_cast<T*>(args[0]->get_data_ptr()),
reinterpret_cast<T*>(out[0]->get_data_ptr()),
out[0]->get_element_count());
}
INT_CallFrame::call
ngraph-0.2.1/src/runtime/interpreter/int_call_frame.hpp
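The interpreter's core idea, walking the ops in order and dispatching each one by name to a reference kernel, can be sketched in Python. This is a simplified illustration and not the nGraph implementation: it uses a dictionary lookup in place of the if/else chain, skips the element-type dispatch done by `generate_calls`, and the three kernels and the `(output, op, inputs)` tuple format are assumptions.

```python
import math

# Reference kernels keyed by op name; each takes a list of input arrays.
REFERENCE_KERNELS = {
    "Abs": lambda args: [abs(v) for v in args[0]],
    "Acos": lambda args: [math.acos(v) for v in args[0]],
    "Add": lambda args: [a + b for a, b in zip(args[0], args[1])],
}


def op_engine(node_op, args):
    try:
        kernel = REFERENCE_KERNELS[node_op]
    except KeyError:
        raise NotImplementedError(f"unsupported op: {node_op}") from None
    return kernel(args)


def run(ordered_ops, values):
    """Evaluate ops in order; each op is (output_name, op_name, input_names)."""
    for out, op, inputs in ordered_ops:
        values[out] = op_engine(op, [values[name] for name in inputs])
    return values


values = run(
    [("s", "Add", ["x", "y"]), ("r", "Abs", ["s"])],
    {"x": [1.0, -4.0], "y": [-3.0, 1.0]},
)
print(values["r"])
```

A backend for new hardware could keep the same loop and swap the kernel table for calls into a device library.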

reference : ops
ngraph-0.2.1/src/runtime/reference
abs, acos, add, allreduce, asin, atan, avg_pool, broadcast, ceiling, concat, constant, convert, convolution, copy, cosh, cos, divide, dot, equal, exp, floor, greater_eq, greater, less_eq, less, log, max, maximum, max_pool, min, minimum, multiply, negate, not_equal, not, one_hot, pad, power, product, reduce, reduce_window, relu, replace_slice, reshape, result, reverse, select_and_scatter, select, sign, sinh, sin, slice, softmax, sqrt, subtract, sum, tanh, tan

def main(_):
    # The default backend is "CPU".
    # It can be set with the environment variable XLA_NGRAPH_BACKEND:
    # CPU / GPU / INTERPRETER
    with tf.device('/device:NGRAPH:0'):
        run_mnist(_)
def run_mnist(_):
    # Import data
    mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
    ...
Run MNIST Softmax with the activated bridge
Source: https://github.com/NervanaSystems/ngraph-tensorflow-bridge

What if we apply it to the Raspberry Pi 3?

Raspberry Pi 3
(Figure: block diagram. Labels: A53x4 (host side), internal bus, GPGPU part: VideoCore IV (Broadcom) (device side), DRAM, Dynamically loadable XLA Plugin)
Figure source: https://www.raspberrypi.org/products/raspberry-pi-3-model-b/

QMKL v1.0.0, 2018.04.10
https://github.com/Idein/qmkl
QMKL is a Math Kernel Library for VideoCore IV QPU.
QMKL is compatible with Intel MKL except for double precision etc.
We, Idein Inc., built object recognition demos (GoogLeNet etc.) on Raspberry Pi.
The demos run on QPU using both QMKL and our private libraries, which are
highly optimized for neural networks. Please check out our video on YouTube.

What if we apply it to the HiKey960?

HiKey960
https://www.96boards.org/product/hikey960/
・Hisilicon Kirin 960
・ARM Cortex-A53x4 + ARM Mali G71 MP8
・3GB or 4GB LPDDR4 SDRAM
・32GB UFS Flash Storage
・WiFi (2.4- / 5-GHz) and Bluetooth 4.1
・1 x USB 2.0 type C OTG
・2x USB 3.0, 1x USB 2.0 Type
・1 x HDMI 1.4 (Type A - full)
・12V@2A, plug 4.75mm outer / 1.7mm inner
  3GB version: 239 USD
  4GB version: 32,270 yen (tax included) at Switch Science

HiKey960
(Figure: block diagram. Labels: A53x4 (host side), CCI-400, GPGPU: Mali G71MP8 (ARM) (device side), DRAM)
Figure source: https://www.96boards.org/product/hikey960/
Mali G71MP8 (ARM)
・OpenCL
・ARM Compute Library

+-------------------+
|                   |
| +---------------+ |
| |     XLA       | |
| |               | |
| +----+----^-----+ |     (modeled on nGraph's Interpreter)
|      |    |       |
| +----v----+-----+ |     +---------------------+
| |    dynamic    +-------> Interpreter         |
| |   plugin lib  <-------+ ARM Compute Library |
| +---------------+ |     | OpenCL              |
|                   |     | ARM Mali G71        |
+-------------------+     +---------------------+
An Interpreter built on the ARM Compute Library

+-------------------+
|                   |
| +---------------+ |
| |     XLA       | |
| |               | |
| +----+----^-----+ |     (modeled on nGraph's Interpreter)
|      |    |       |     +---------------------+
| +----v----+-----+ |     | Interpreter         |
| |    dynamic    +-------> ARM NN SDK          |
| |   plugin lib  <-------+ ARM Compute Library |
| +---------------+ |     | OpenCL              |
|                   |     | ARM Mali G71        |
+-------------------+     +---------------------+
An Interpreter built on the ARM NN SDK
ARM NN SDK : https://developer.arm.com/products/processors/machine-learning/arm-nn

Ultra96
https://www.96boards.org/product/ultra96/
・Xilinx Zynq UltraScale+ MPSoC ZU3EG A484
・Micron LPDDR4 2 GB (512M x 32)
・16 GB microSD card + adapter
・802.11b/g/n Wi-Fi and Bluetooth 4.2
・1x USB 3.0 Type Micro-B
・2x USB 3.0, 1x USB 2.0 Type
・Mini DisplayPort (MiniDP or mDP)
・8V~18V@3A
  Plug inner 1.7mm / outer 4.8mm
  249 USD; 29,800 yen (excluding tax) at Avnet Japan

Zynq UltraScale+ MPSoC
(Figure: block diagram. Labels: A53x4 (host side), CCI-400, FPGA part (device side), DRAM)
Figure source: https://xlnx.i.lithium.com/t5/image/serverpage/image-id/24453iC519B19C6F6B40E4?v=1.0

Ultra96 ≒ UltraZed
Developing the FPGA part with Vivado HLS
  "Running a circuit synthesized with Vivado HLS on Debian GNU/Linux for UltraZed" by @ikwzm
  https://qiita.com/ikwzm/items/5099d36b1bfd8009dce4
Developing the FPGA part with SDSoC
  The Ultra96 comes with an SDSoC license
  "Trying out reVISION-Zybo-Z7-20, part 9" by @marsee101
  http://marsee101.blog19.fc2.com/blog-entry-4131.html

Source: https://twitter.com/jwangARK/status/999362583374319616

Thank you
