Machine Learning / Deep Learning Compendium (6): Libraries

#azurejp
https://www.facebook.com/dahatake/
https://twitter.com/dahatake/
https://github.com/dahatake/
https://daiyuhatakeyama.wordpress.com/
https://www.slideshare.net/dahatake/

https://www.microsoft.com/en-us/cognitive-toolkit/

https://arxiv.org/pdf/1608.07249.pdf
Framework   FCN-S   AlexNet  ResNet-50  LSTM-64
CNTK        0.017   0.031    0.168      0.017
Caffe       0.017   0.027    0.254      --
TensorFlow  0.020   0.317    0.227      0.065
Torch       0.016   0.043    0.144      0.324

https://github.com/Microsoft/CNTK

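[Figure: a single-layer network. Inputs: age, tumor size, and a bias term. Outputs: z1 (disease present) and z2 (disease absent). Parameters: weights w11, w21, w12, w22 and biases b1, b2.]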

Define the input and output variables
Define the network
Define the loss function and the optimization method
Train the model
Evaluate the model

import numpy as np
import cntk as C

## Two input features: age and tumor size
input_dim = 2
## Number of classes (binary: disease present or absent)
num_output_classes = 2
## Input variable
feature = C.input_variable(input_dim, np.float32)
## Output variable
label = C.input_variable(num_output_classes, np.float32)

def linear_layer(input_var, output_dim):
    input_dim = input_var.shape[0]
    ## Define weight W
    weight_param = C.parameter(shape=(input_dim, output_dim))
    ## Define bias b
    bias_param = C.parameter(shape=(output_dim))
    ## Wx + b, computed as times(x, W). Pay attention to the order of variables!!
    return bias_param + C.times(input_var, weight_param)

z = linear_layer(feature, num_output_classes)
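To make the arithmetic of the linear layer concrete, here is a minimal sketch in plain Python that does not use CNTK. The weight, bias, and input values are hypothetical and chosen only so the result is easy to check by hand; the softmax step mirrors what the evaluation slide later applies to z.

```python
import math


def linear_layer(x, weights, bias):
    """Return z = x . W + b for one sample; weights has shape (input_dim, output_dim)."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def softmax(z):
    """Turn raw scores into probabilities (shifted by max(z) for numerical stability)."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical parameters: rows are inputs (age, tumor size), columns are outputs (z1, z2)
weights = [[0.5, -0.5],
           [1.0, -1.0]]
bias = [0.1, -0.1]

x = [0.2, 0.4]  # one sample with assumed, already scaled feature values
z = linear_layer(x, weights, bias)
probs = softmax(z)
print(z, probs)
```

Each output z_j is the sum over inputs of x_i * w_ij plus b_j, which is the same computation that `bias_param + C.times(input_var, weight_param)` expresses.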

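## Loss function
loss = C.cross_entropy_with_softmax(z, label)
## Classification error (whether the predicted class is correct)
eval_error = C.classification_error(z, label)
## Optimization (the learning rate 0.5 is an illustrative value)
lr_schedule = C.learning_rate_schedule(0.5, C.UnitType.minibatch)
learner = C.sgd(z.parameters, lr_schedule)
trainer = C.Trainer(z, (loss, eval_error), [learner])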
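The three pieces configured here (loss, classification error, SGD update) can be sketched in plain Python for a single sample. This is an illustration of the underlying math under simple assumptions (one-hot labels, a flat list of parameters), not CNTK's implementation.

```python
import math


def cross_entropy_with_softmax(z, label):
    """-sum(label * log(softmax(z))), computed with the log-sum-exp trick."""
    shift = max(z)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in z))
    return sum(l * (log_sum_exp - v) for v, l in zip(z, label))


def classification_error(z, label):
    """0.0 if the highest-scoring class matches the labeled class, else 1.0."""
    return 0.0 if z.index(max(z)) == label.index(max(label)) else 1.0


def sgd_step(params, grads, lr):
    """Plain SGD: move each parameter against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]


loss_value = cross_entropy_with_softmax([0.0, 0.0], [1.0, 0.0])  # equals log(2)
error_hit = classification_error([2.0, 1.0], [1.0, 0.0])
error_miss = classification_error([1.0, 2.0], [1.0, 0.0])
updated = sgd_step([1.0, 2.0], [0.5, -0.5], lr=0.1)
print(loss_value, error_hit, error_miss, updated)
```

The loss measures how much probability the model assigns to the correct class, while the classification error only records whether the top class is right, which is why the two are tracked separately.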

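Model training
## num_minibatches_to_train, minibatch_size, and generate_random_data_sample
## are assumed to be defined elsewhere
for i in range(0, num_minibatches_to_train):
    ## Extract training data
    features, labels = generate_random_data_sample(
        minibatch_size, input_dim, num_output_classes)
    ## Train
    trainer.train_minibatch({feature: features, label: labels})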

Model evaluation
out = C.softmax(z)
result = out.eval({feature : features})

with cntk.layers.default_options(activation=cntk.ops.relu, pad=False):
    conv1 = cntk.layers.Convolution2D((5,5), 32, pad=True)(scaled_input)
    pool1 = cntk.layers.MaxPooling((3,3), (2,2))(conv1)
    conv2 = cntk.layers.Convolution2D((3,3), 48)(pool1)
    pool2 = cntk.layers.MaxPooling((3,3), (2,2))(conv2)
    conv3 = cntk.layers.Convolution2D((3,3), 64)(pool2)
    f4 = cntk.layers.Dense(96)(conv3)
    drop4 = cntk.layers.Dropout(0.5)(f4)
    z = cntk.layers.Dense(num_output_classes, activation=None)(drop4)
Example: handwritten digit recognition (MNIST)
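To see how the tensor shape changes through this network, the following sketch traces the output size of each layer in plain Python. It assumes a 1x28x28 MNIST input and 10 output classes, neither of which is stated on the slide, and it assumes that `pad=True` with stride 1 preserves the spatial size while unpadded layers use the usual (size - kernel) // stride + 1 rule.

```python
def conv_out(size, kernel, stride=1, pad=False):
    """Spatial output size of a convolution or pooling window along one axis."""
    if pad:
        return -(-size // stride)  # ceil(size / stride): padding keeps the border
    return (size - kernel) // stride + 1


def trace_shapes(height=28, width=28, channels=1, num_output_classes=10):
    """Return (layer name, output shape) pairs for the network on the slide."""
    layers = [
        # name, kernel, stride, pad, number of filters (None for pooling)
        ("conv1", 5, 1, True, 32),
        ("pool1", 3, 2, False, None),
        ("conv2", 3, 1, False, 48),
        ("pool2", 3, 2, False, None),
        ("conv3", 3, 1, False, 64),
    ]
    shapes = [("input", (channels, height, width))]
    for name, kernel, stride, pad, filters in layers:
        height = conv_out(height, kernel, stride, pad)
        width = conv_out(width, kernel, stride, pad)
        if filters is not None:
            channels = filters
        shapes.append((name, (channels, height, width)))
    shapes.append(("f4", (96,)))                 # Dense(96) flattens its input
    shapes.append(("z", (num_output_classes,)))  # final linear layer
    return shapes


shapes = dict(trace_shapes())
for name, shape in shapes.items():
    print(name, shape)
```

Under these assumptions the dense layer f4 receives 64 x 3 x 3 = 576 values from conv3.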

Model training
Model evaluation

from cntk import distributed
...
learner = cntk.learners.momentum_sgd(...) # create local learner
distributed_after = epoch_size # number of samples to warm start with
distributed_learner = distributed.data_parallel_distributed_learner(
    learner = learner,
    num_quantization_bits = 32, # non-quantized gradient accumulation
    distributed_after = 0) # no warm start
Defining the loss function
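The idea behind a data-parallel learner can be sketched without CNTK or MPI: each worker computes a gradient on its own shard of the minibatch, and the gradients are then averaged. The toy model below (a one-parameter least-squares fit with hypothetical data and two simulated workers) shows that, with equally sized shards, the averaged gradient equals the gradient over the whole minibatch.

```python
def gradient(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Hypothetical minibatch and current parameter value
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Split the minibatch evenly across two simulated workers
num_workers = 2
shards = [(xs[i::num_workers], ys[i::num_workers]) for i in range(num_workers)]

# Each worker computes a local gradient; the aggregation step averages them
local_grads = [gradient(w, sx, sy) for sx, sy in shards]
averaged = sum(local_grads) / num_workers

full = gradient(w, xs, ys)
print(local_grads, averaged, full)
```

In a real run the averaging is done by communication between processes rather than in a single Python list, but the arithmetic is the same.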

minibatch_source = MinibatchSource(...)
...
trainer = Trainer(z, (ce, pe), distributed_learner)
...
session = training_session(trainer=trainer,
                           mb_source=minibatch_source, ...)
session.train()
...
# must be called to finalize MPI in case of successful distributed training
distributed.Communicator.finalize()
Defining the optimization method
https://docs.microsoft.com/en-us/cognitive-toolkit/multiple-gpus-and-machines#2-configuring-parallel-training-in-cntk-in-python

# Use two GPUs; the training script is training.py
> mpiexec -n 2 python training.py
How to run distributed training

import cntk
## When using the CPU
cntk.device.try_set_default_device(cntk.device.cpu())
## When using a GPU
cntk.device.try_set_default_device(cntk.device.gpu())
CPU/GPU settings

CNTK can see two GPUs
Checking whether GPUs are available

The model has more than 57 million parameters!

GPU            Training time  Accuracy (mAP)
1 GPU (NC6)    7 min 22 s     0.9479
2 GPU (NC12)   3 min 43 s     0.9479
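The speedup implied by the two training times can be worked out directly. The sketch below uses only the numbers in the table; the variable names are illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + seconds


time_1gpu = to_seconds(7, 22)   # 442 seconds on 1 GPU (NC6)
time_2gpu = to_seconds(3, 43)   # 223 seconds on 2 GPUs (NC12)

speedup = time_1gpu / time_2gpu
scaling_efficiency = speedup / 2  # fraction of the ideal 2x speedup

print(f"speedup: {speedup:.2f}x, scaling efficiency: {scaling_efficiency:.1%}")
```

The second GPU nearly halves the training time, and the reported mAP is unchanged.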

It gets harder and harder to keep the whole picture in view.
Which algorithm should I use, and how?
Deploying a trained model into a production system is a heavy burden...

Azure Machine Learning (Azure ML)

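Existing packages (zip and upload)
Execute R Script module
Write an R script

Existing packages (zip and upload)
Execute Python Script module
Write a Python script

The screenshot shows an example that loads and uses the RHmm package. Its dependencies, MASS and nlme, are included as well.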

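[Figure: Windows ML stack. Application #1 and Application #2 sit on the WinML RT API and WinML Win32 API, followed by the WinML Runtime and the Model Inference Engine. Execution targets are the CPU, or the GPU through the DirectML API and Direct3D, with an input surface and an output surface.]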
Supported by the major machine learning frameworks

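[Figure: FPGA as spatial computing (data flows through instructions laid out across the chip) versus CPU as temporal computing (instructions issued one after another).]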

[Figure: timeline from 2011 to 2016 and beyond. Milestones: Catapult v0, Catapult v1, Scale v1, Catapult v2, Ignite, production deployment.]

Azure ML integration
Model lifecycle management, up to and including deployment
Hardware Accelerated Model Gallery
Brainwave Compiler & Runtime
“Brainslice” Soft Neural Processing Unit

Performance
Excellent inference at low batch sizes
Ultra-low latency | 10x < CPU/GPU

Flexibility
Rapidly adapt to evolving ML
Inference-optimized numerical precision
Exploit sparsity, deep compression

Scale
World’s largest cloud investment in FPGAs
Multiple Exa-Ops of aggregate AI capacity
Runs on Microsoft’s scale infrastructure

Low cost
$0.21/million images on Azure FPGA

[Figure: a pretrained DNN model (in CNTK, etc.) is mapped onto a scalable DNN hardware microservice. The microservice consists of BrainWave Soft DPUs (instruction decoder and control, neural FU) on FPGAs connected through network switches.]

[Figure: deployment flow. Components: Model Management Service; Azure ML orchestrator; Python and TensorFlow (featurize images and train classifier); classifier (TF/LGBM); preprocessing (TensorFlow, C++ API); Control Plane Service; Brain Wave Runtime; FPGA; CPU.]

[Figure: FPGA block diagram. Blocks: DRAM controller, USB controller, Ethernet controller, DSP slices, RAM, and CPUs.]

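[Figure: two planes. Traditional software (CPU) server plane: CPUs running workloads such as web search ranking, connected to the FPGA over QPI. Hardware acceleration plane: FPGAs connected over 40Gb/s QSFP links to the 40Gb/s ToR and to routers, running web search ranking, deep neural networks, SDN offload, and SQL.]
Interconnected FPGAs operate separately from the traditional software layer.
They can be managed and used independently of the CPUs.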

A Cloud-Scale Acceleration Architecture [MICRO’16]

A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs

Adaptive ISA for narrow-precision DNN inference
Flexible and extensible, to support fast-changing AI algorithms

BrainWave Soft DPU microarchitecture
Optimized for narrow precision and low batch sizes

Model parameters are persisted entirely in FPGA on-chip memory
Large models are supported by scaling across many FPGAs

Intel FPGAs deployed at scale as HW microservices [MICRO'16]

[Figure: one layer partitioned across FPGA0 and FPGA1. A 1000-dim vector is split into two 500-dim halves, multiplied by 500x500 matrix blocks (MatMul500), summed (Add500), passed through Sigmoid500, and concatenated back into a 1000-dim vector.]
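The partitioning in the figure can be sketched as a block matrix-vector product in plain Python. The sketch uses a small dimension instead of 1000, random hypothetical weights, and an assumed assignment of output halves to FPGA0 and FPGA1; it only illustrates that computing the halves separately and concatenating them gives the same result as the unsplit layer.

```python
import math
import random


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-v)) for v in vector]


def split_layer(weight, x):
    """Compute sigmoid(W x) as two independent halves, then concatenate."""
    half = len(x) // 2
    x0, x1 = x[:half], x[half:]                  # split the input vector
    w00 = [row[:half] for row in weight[:half]]  # four square blocks of W
    w01 = [row[half:] for row in weight[:half]]
    w10 = [row[:half] for row in weight[half:]]
    w11 = [row[half:] for row in weight[half:]]
    y0 = sigmoid(add(matvec(w00, x0), matvec(w01, x1)))  # assumed to run on FPGA0
    y1 = sigmoid(add(matvec(w10, x0), matvec(w11, x1)))  # assumed to run on FPGA1
    return y0 + y1                               # concat


random.seed(0)
dim = 6  # stands in for the 1000-dim vector in the figure
weight = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
x = [random.uniform(-1, 1) for _ in range(dim)]

reference = sigmoid(matvec(weight, x))
partitioned = split_layer(weight, x)
max_diff = max(abs(a - b) for a, b in zip(reference, partitioned))
print(max_diff)
```

Each half needs the full input vector but only its own blocks of the weight matrix, so the weights can stay resident on separate devices.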
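[Figure: compiler toolflow. Caffe, CNTK, and TensorFlow models enter a frontend that produces a portable IR. A graph splitter and optimizer produces transformed IRs for the target compilers (FPGA, CPU-CNTK, CPU-Caffe). The result is a deployment package for the FPGA hardware microservice.]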

[Figure: two cases are compared. One has O(N^2) data and O(N^2) compute. The other, with N weight kernels, has O(N^3) data and O(N^4 K^2) compute. Labels: input activations, output pre-activations.]

[Figure: FPGA attached to a 2x CPU server. Model parameters are initialized in DRAM.]

[Figure: hardware utilization (%) versus batch size on the FPGA.]
[Figure: 99th-percentile latency versus batch size, with the maximum allowed latency marked, next to hardware utilization (%) versus batch size.]
Batching improves HW utilization but increases latency.
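The tradeoff can be illustrated with a toy model. Everything in the sketch below is an assumption made for illustration (a fixed per-batch overhead, a constant per-item compute time, and evenly spaced request arrivals); none of the numbers come from the slides.

```python
def batch_stats(batch_size, arrival_interval_ms=1.0, overhead_ms=4.0, per_item_ms=1.0):
    """Return (utilization, worst-case latency in ms) for a given batch size.

    Toy model: every batch pays a fixed overhead, and the first request in a
    batch must wait for the remaining requests to arrive before compute starts.
    """
    compute_ms = overhead_ms + per_item_ms * batch_size
    utilization = per_item_ms * batch_size / compute_ms   # useful work / total time
    wait_ms = (batch_size - 1) * arrival_interval_ms      # time spent filling the batch
    latency_ms = wait_ms + compute_ms
    return utilization, latency_ms


rows = [(b,) + batch_stats(b) for b in (1, 4, 16, 64)]
for batch_size, utilization, latency_ms in rows:
    print(f"batch={batch_size:3d}  utilization={utilization:.0%}  latency={latency_ms:.0f} ms")
```

In this model, larger batches amortize the fixed overhead, so utilization climbs, while the time spent waiting to fill the batch pushes latency up.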

[Figure: eight 2x CPU servers, each paired with an LSTM block.]

[Figure: FPGA with a matrix-vector unit.]

[Figure: Neural Functional Unit block diagram. Components: instruction decoder; control processor; input and output message processors; network IFC; tensor manager with matrix memory manager and vector memory manager; DRAM; tensor arbiters (TA); matrix-vector unit built from matrix-vector multiply kernels, each with a matrix RF and a VRF; conversion to msft-fp on input and to float16 on output; multifunction units (xbar, activation, multiply, add/sub) with VRFs. Legend: memory, tensor data, instructions, commands.]

[Figure: FPGA MVU kernel. A float16 input tensor is combined with matrix rows 1 to N through a tree of multipliers and adders, producing a float16 output tensor.]

[Figure: FPGA Performance vs. Data Type. Y axis: Tera-Operations/sec. Across four data types (categories 1 to 4), one series on a 0 to 5 axis shows 1.4, 2.0, 2.7, and 4.5; a second series on a 0 to 100 axis shows 12, 31, 65, and 90.]
[Figure: Impact of Narrow Precision on Accuracy. Y axis: accuracy, from 0.50 to 1.00. Three series across three categories.]
Single FPGA BrainWave Soft DPU Performance
[Figure: peak throughput (Tera-Operations/sec) on a 0 to 100 axis, with values labeled 15T, 40T, 65T, and 90T.]

                  Arria 10 1150 (20nm)   Stratix 10 280 Early Silicon (14nm)
Data type         ms-fp9                 ms-fp9
ALMs              316K (74%)             858K (92%)
DSPs              1442 (95%)             5,760 (100%)
M20Ks             2,564 (95%)            8,151 (70%)
Power efficiency  160 GOPS/W             320 GOPS/W; 720 GOPS/W (production)

[Figure: Azure deployment. Regions: EUS, SEA, WEU, WUS. Stamp: 20 racks. Azure box: 24 CPU cores, 4 FPGAs. Software components: BrainWave, Azure ML, wire service, AML FPGA VM extension, Azure host, MonAgent, DNN pipeline.]

http://aka.ms/aml-real-time-ai
brainwave-edge@microsoft.com
Models are easy to create and deploy into Azure cloud
Write once, deploy anywhere – to intelligent cloud or edge
Manage and update your models using Azure IoT Edge
