Mesh tensorflow

Mesh TensorFlow
2019/3/4 kuroko1t

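About me
kuroko (@kuroko1t)
● Work
○ Previously: logic circuit design for the CPU of a next-generation supercomputer
○ Recently: investigating deep learning frameworks (TensorFlow, Horovod, ...)
● Other
○ Trying out deep learning in Go (link)
○ Wrote articles about Horovod (Qiita)
○ Submitted a PR to Horovod adding support for TensorFlow eager execution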

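Distributed training, viewed from the graph
● Data parallelism
○ Replicate the graph (create replicas)
○ e.g. Horovod runs one process per GPU
● Model parallelism
○ Split the graph
○ (multiple GPUs share a single model)
[Figure: GPU0 and GPU1 each holding a full replica of the graph (data parallel) vs. GPU0 and GPU1 each holding part of the graph (model parallel)]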

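What is Mesh TensorFlow?
● Covers both data parallelism and model parallelism
● Python only
● DNN models
○ The main target of Mesh TensorFlow is the Transformer
■ Language models are approaching 500 million parameters and cannot be trained without model parallelism
○ The examples include only MNIST (CNN)
● Its own op API
○ Uses an API that differs from core TensorFlow
● Single process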

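Mesh TensorFlow concepts
Terms used in the concept diagram
● Mesh Shape
○ Defines how cores are assigned to the mesh
● Layout Rules
○ Defines how tensors are assigned to the mesh
● mtf.Dimension
○ Gives names to the dimensions of a tensor's shape
● What is a Mesh?
○ A mesh is defined as an n-dimensional array of processors, and each tensor is assigned to the mesh. Normally a tensor is assigned to a single core (using tf.device).
● Layout Rules?
○ The layout rules define which mesh dimension each tensor dimension is placed on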
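[Figure: concept diagram. Tensor T1 has the mtf.Dimensions ("batch", 3), ("row", 2), ("col", 2). The mesh shape has dimensions Mesh0 and Mesh1 over core0, core1, core2, core3. The layout rule (batch, mesh1) gives batch parallelism across core1, core2, core3.]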
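The relationship between the three terms can be sketched in plain Python, without the library. The helper slice_shape below is hypothetical (it is not a Mesh TensorFlow API), and the dimension sizes are made up for illustration. It assumes that a tensor dimension named in a layout rule is split evenly across the matching mesh dimension and that every other dimension is replicated.

```python
def slice_shape(tensor_dims, mesh_shape, layout_rules):
    """Return the per-core shape of a tensor under a layout (illustrative only).

    tensor_dims:  list of (name, size) pairs, in the spirit of mtf.Dimension
    mesh_shape:   list of (mesh_dim_name, number_of_cores) pairs
    layout_rules: list of (tensor_dim_name, mesh_dim_name) pairs
    """
    mesh_sizes = dict(mesh_shape)
    rules = dict(layout_rules)
    result = []
    for name, size in tensor_dims:
        if name in rules:
            cores = mesh_sizes[rules[name]]
            if size % cores != 0:
                raise ValueError("%s=%d is not divisible by %d" % (name, size, cores))
            result.append((name, size // cores))  # split across the mesh dimension
        else:
            result.append((name, size))           # replicated on every core
    return result


# Hypothetical sizes: a batch of 6 split over a 3-core mesh dimension.
tensor = [("batch", 6), ("row", 2), ("col", 2)]
per_core = slice_shape(tensor, [("mesh0", 3)], [("batch", "mesh0")])
print(per_core)
```

Only the "batch" dimension shrinks on each core; "row" and "col" keep their full size because no rule mentions them.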

Let's look at the sample source.

Mesh TensorFlow example
import mesh_tensorflow as mtf
import tensorflow as tf  # TensorFlow 1.x API

# Create the mtf Graph and Mesh
graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")
row_x = mtf.Dimension("row_x", 6)
col_x = mtf.Dimension("col_x", 4)
col_y = mtf.Dimension("col_y", 5)

# Convert tf tensors to mtf tensors
x = mtf.import_tf_tensor(
    mesh, tf.random.uniform([6, 4]),
    mtf.Shape([row_x, col_x]))
# y reuses col_x as its row dimension: mtf.matmul sums over the dimensions
# that are absent from output_shape, and dimensions are matched by name
y = mtf.import_tf_tensor(
    mesh, tf.random.uniform([4, 5]),
    mtf.Shape([col_x, col_y]))

# Call the mtf op API
result = mtf.matmul(x, y, output_shape=[row_x, col_y])

# Define the mesh shape and layout
mesh_shape = "b1:2"
mesh_layout = "col_x:b1"
mesh_shape = mtf.convert_to_shape(mesh_shape)
mesh_devices = [""] * mesh_shape.size
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    mesh_shape, mesh_layout, mesh_devices)

# Convert mtf ops and tensors to tf (lowering)
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
restore_hook = mtf.MtfRestoreHook(lowering)
tf_result = lowering.export_to_tf_tensor(result)

# Run the session
with tf.train.MonitoredTrainingSession(hooks=[restore_hook]) as sess:
    print(sess.run(tf_result).shape)

Defining the mesh shape and layout
mesh_shape = [("rows", 2), ("cols", 2)]
layout_rules = [("batch", "rows"), ("hidden", "cols")]
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    mesh_shape, layout_rules, mesh_devices)
● Number of cores assigned to the mesh
○ Defines how many cores each mesh dimension gets (e.g. 2 cores for the mesh dimension named rows)
○ [("rows", 2), ("cols", 2)]
● Mapping between the mesh and the tensor shape
○ Defines which mesh dimension each mtf.Dimension maps to
○ [("batch", "rows"), ("hidden", "cols")]
● PlacementMeshImpl (for CPU and GPU)
○ Assigns variables to devices according to the layout rules above (when the session starts)
○ Implements the collective operations
○ SimdMeshImpl is also provided for TPU
[Figure: a tensor whose mtf.Dimensions are mapped onto the "rows" and "cols" mesh dimensions]
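To make the mapping concrete, the following plain-Python sketch lists which index range of each tensor dimension lands on each mesh coordinate for the shape and layout shown above. The helper core_slices is hypothetical (not part of Mesh TensorFlow), the sizes batch=4 and hidden=6 are made up, and the even split per mesh dimension is an assumption of the sketch.

```python
import itertools


def core_slices(tensor_dims, mesh_shape, layout_rules):
    """Map each mesh coordinate to the index ranges it holds (illustrative only)."""
    rules = dict(layout_rules)
    mesh_names = [name for name, _ in mesh_shape]
    placement = {}
    # Enumerate every coordinate of the n-dimensional processor array.
    for coord in itertools.product(*[range(n) for _, n in mesh_shape]):
        ranges = []
        for name, size in tensor_dims:
            if name in rules:
                axis = mesh_names.index(rules[name])
                step = size // mesh_shape[axis][1]
                start = coord[axis] * step
                ranges.append((start, start + step))  # split dimension
            else:
                ranges.append((0, size))              # replicated dimension
        placement[coord] = ranges
    return placement


placement = core_slices(
    [("batch", 4), ("hidden", 6)],
    [("rows", 2), ("cols", 2)],
    [("batch", "rows"), ("hidden", "cols")])
for coord in sorted(placement):
    print(coord, placement[coord])
```

Each of the four mesh coordinates ends up with a distinct 2 x 3 block, which is the combined data-parallel ("batch" over rows) and model-parallel ("hidden" over cols) case.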

Assigning ops to devices
def parallel(devices, fn, *args, **kwargs):
    ...
    for i, device in enumerate(devices):
        with tf.device(device):
            with tf.variable_scope("parallel_%d" % i):
                my_args = [x[i] for x in args]
                my_kwargs = {k: v[i] for k, v in six.iteritems(kwargs)}
                ret.append(fn(*my_args, **my_kwargs))
Internally, each op is assigned to its device with "with tf.device"
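The calling convention of this function is easier to see without TensorFlow. The sketch below is a simplified stand-in, not the library code: the name parallel_sketch is hypothetical, the tf.device and tf.variable_scope contexts are dropped, and the device name is returned alongside each result only to show the pairing. The point is that every positional and keyword argument is a list with one entry per device.

```python
def parallel_sketch(devices, fn, *args, **kwargs):
    """Call fn once per device with that device's slice of every argument."""
    ret = []
    for i, device in enumerate(devices):
        my_args = [x[i] for x in args]                     # i-th entry of each list
        my_kwargs = {k: v[i] for k, v in kwargs.items()}
        ret.append((device, fn(*my_args, **my_kwargs)))
    return ret


# Hypothetical device names; each argument list has one value per device.
out = parallel_sketch(["gpu:0", "gpu:1"], lambda a, b: a + b, [1, 2], [10, 20])
print(out)
```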

Op API
# Example: importing a tensor
x = mtf.import_tf_tensor(
    mesh, tf.ones([50, 10]),
    mtf.Shape([col_x, row_x]))
# Example: calling an op
f1 = mtf.relu(mtf.conv2d_with_blocks(
    x, kernel1, strides=[1, 1, 1, 1], padding="SAME",
    h_blocks_dim=row_blocks_dim, w_blocks_dim=col_blocks_dim))
● Dimensions in the op API
○ Dimensions are referred to by the names defined with mtf.Dimension().
○ At this stage, how the tensor's dimensions are split across the mesh is not yet defined
● mtf.layers API
○ 30 functions, including mtf.layers.dense and the attention layers
○ Many of them are attention-specific (mtf.layers.masked_local_attention_1d()...)
● mtf API
○ mtf.add, mtf.conv2d, mtf.maximum, ...: 120 op functions

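Op API example: mtf.matmul()
● For example, take A(2,4) x B(4,5) = C(2,5) and suppose one dimension of A is assigned 2 cores.
● A is split row-wise (A's col and B's row are the same Dimension)
1. core0 and core1 compute in parallel
a. core0: A(1,4) x B(4,5) = C0(1,5)
b. core1: A(1,4) x B(4,5) = C1(1,5)
2. Combine C0 and C1
a. Calling export_to_tf_tensor on C (mtf → tf conversion) runs tf.concat to join C0 and C1
[Figure: A (2x4) is split into one 1x4 row block per core, B (4x5) is held in full by both core0 and core1, and C (2x5) is made of C0 (1x5, core0) and C1 (1x5, core1)]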
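The row-wise case can be checked with a few lines of plain Python. This is a sketch with nested lists standing in for tensors and list slices standing in for cores; the matrix values are arbitrary, and the helper matmul is a local function, not mtf.matmul.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]                                      # shape (2, 4)
B = [[r * 5 + c for c in range(5)] for r in range(4)]   # shape (4, 5)

# Each "core" holds one row of A and a full copy of B.
C0 = matmul(A[0:1], B)   # core0: (1, 4) x (4, 5) -> (1, 5)
C1 = matmul(A[1:2], B)   # core1: (1, 4) x (4, 5) -> (1, 5)

# Concatenating along the row axis rebuilds C; no summation is needed.
C = C0 + C1
print(C == matmul(A, B))
```

Because the split dimension survives into the output, the cores never need each other's data until the final concatenation.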

Op API example: mtf.matmul()
● A is split column-wise (A's col and B's row are the same Dimension)
1. core0 and core1 compute in parallel
a. core0: A(2,2) x B(2,5) = C0
b. core1: A(2,2) x B(2,5) = C1
2. Sum the results from core0 and core1
a. C(2,5) = C0(2,5) + C1(2,5)
b. Computed with the allreduce function (explained later)
[Figure: A (2x4) is split into two 2x2 column blocks and B (4x5) into two 2x5 row blocks, one pair per core; C (core0/core1) = C0 (core0) + C1 (core1)]
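The column-wise case differs in that each core produces a full-size partial result. The plain-Python sketch below uses nested lists for tensors and slices for cores (arbitrary values, local matmul helper rather than mtf.matmul) and replaces the allreduce with an elementwise sum.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]                                      # shape (2, 4)
B = [[r * 5 + c for c in range(5)] for r in range(4)]   # shape (4, 5)

# The contracted dimension is split: each core gets two columns of A
# and the matching two rows of B.
A0, B0 = [row[0:2] for row in A], B[0:2]   # core0
A1, B1 = [row[2:4] for row in A], B[2:4]   # core1

C0 = matmul(A0, B0)   # (2, 2) x (2, 5) -> (2, 5), partial sum
C1 = matmul(A1, B1)   # (2, 2) x (2, 5) -> (2, 5), partial sum

# Elementwise sum of the partial results (the role allreduce plays).
C = [[x + y for x, y in zip(r0, r1)] for r0, r1 in zip(C0, C1)]
print(C == matmul(A, B))
```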

Collective communication in Mesh TensorFlow

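Collective communication between devices
Defined in PlacementMeshImpl
● allreduce
○ Sums the tensors held by the devices
○ Used by add, the optimizers, layers.batch_norm, and the reduce-family ops
● allconcat
○ Concatenates the tensors held by the devices (like an allgather)
● alltoall
○ Equivalent to MPI's alltoall.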
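The semantics of the three collectives can be illustrated with plain Python lists, one inner list per device. The functions below are stand-ins written for this note (they are not the PlacementMeshImpl methods, whose signatures are not shown in the slides), and alltoall assumes each device's data divides evenly into one chunk per device.

```python
def allreduce(xs):
    """Every device ends up with the elementwise sum over all devices."""
    total = [sum(vals) for vals in zip(*xs)]
    return [list(total) for _ in xs]


def allconcat(xs):
    """Every device ends up with the concatenation of all devices' data."""
    joined = [v for x in xs for v in x]
    return [list(joined) for _ in xs]


def alltoall(xs):
    """Device i sends its j-th chunk to device j."""
    n = len(xs)
    chunk = len(xs[0]) // n
    out = []
    for j in range(n):
        received = []
        for i in range(n):
            received.extend(xs[i][j * chunk:(j + 1) * chunk])
        out.append(received)
    return out


xs = [[1, 2], [3, 4]]   # device 0 holds [1, 2], device 1 holds [3, 4]
print(allreduce(xs))    # [[4, 6], [4, 6]]
print(allconcat(xs))    # [[1, 2, 3, 4], [1, 2, 3, 4]]
print(alltoall(xs))     # [[1, 3], [2, 4]]
```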

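Allreduce (CPU, GPU)
How allreduce works
1. Split the devices into two groups
2. tf.add within each group
3. tf.add between the two groups
4. Distribute the result with tf.identity
# tf.device controls which device runs each op
for i in xrange(1, left_center + 1):
    with tf.device(devices[i]):
        left_sum = binary_reduction(left_sum, xs[i])
right_sum = xs[n-1]
[Figure: GPU0 to GPU5 are split into a left group and a right group. tf.add accumulates toward the left center and the right center, the two group sums are added to give the SUM value on both centers, and tf.identity distributes it back to the remaining GPUs.]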
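The four steps can be simulated with scalars in plain Python. This is a sketch that follows the steps as listed on the slide, not the library source: the function name two_group_allreduce and the choice of where to split the groups are assumptions, device placement is omitted, and ordinary addition and copying stand in for tf.add and tf.identity.

```python
def two_group_allreduce(xs):
    """Simulate the two-group allreduce on a list with one value per device."""
    n = len(xs)
    if n == 1:
        return list(xs)
    left_center = (n - 1) // 2          # last index of the left group (assumed split)

    # Steps 1-2: accumulate inside each group, moving toward the centers.
    left_sum = xs[0]
    for i in range(1, left_center + 1):
        left_sum = left_sum + xs[i]
    right_sum = xs[n - 1]
    for i in reversed(range(left_center + 1, n - 1)):
        right_sum = xs[i] + right_sum

    # Step 3: add the two group sums.
    total = left_sum + right_sum

    # Step 4: every device receives a copy of the total.
    return [total for _ in range(n)]


result = two_group_allreduce([1, 2, 3, 4, 5, 6])   # six devices, as in the figure
print(result)
```

With six devices the left group covers indices 0 to 2 and the right group indices 3 to 5, so each chain of additions is about half as long as a single pass over all devices would be.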

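Allreduce (TPU)
● The internal algorithm is unknown
○ An OpKernel called CrossReplicaSum appears to be executed?
■ No implementation is available, so the internal algorithm is unknown
■ The result is the SUM over the slices (one per device).
○ Also, for some reason, the implementation lives inside TensorFlow itself
■ tensorflow/contrib/tpu/ops/gen_tpu_ops.py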

mtf.Lowering, optimizer
# Register the optimizer
optimizer = mtf.optimize.SgdOptimizer()
update_ops = optimizer.apply_grads(var_grads, graph.trainable_variables)
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
# Lower the optimizer operations
tf_update_ops = [lowering.lowered_operation(op) for op in update_ops]
tf_update_ops.append(tf.assign_add(global_step, 1))
train_op = tf.group(tf_update_ops)
● Only the SGD and Adafactor optimizers are supported
● Lowering() bridges mtf and tf: it converts mtf operations into TensorFlow API calls and assigns the devices they run on
○ How to lower is defined separately for each operation

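Performance: Transformer
Model parallelism scales better than data parallelism here. The takeaway seems to be that Mesh TensorFlow makes model parallelism easy to implement and also delivers good performance (although I could not find an explicit explanation).
Source: README.md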
