These slides introduce TensorFlow concepts based on a source-code study, covering tensors, operations, the computation graph, and execution.
ITRI CONFIDENTIAL DOCUMENT: DO NOT COPY OR DISTRIBUTE
TensorFlow Study (Part I)
Danny Liu (劉得彥)
Information and Communications Research Laboratories (ICL)
Tensor
• A tensor is an n-dimensional structure
▪ n=0: single value
▪ n=1: list of values
▪ n=2: matrix of values
▪ n=3: cube of values
▪ …
• The Tensor class:
▪ Data is stored in a TensorBuffer as an Eigen::Tensor
▪ The shape is defined by TensorShape
▪ Reference counting is handled by RefCounted
https://www.slideshare.net/EdurekaIN/introduction-to-tensorflow-deep-learning-using-tensorflow-tensorflow-tutorial-edureka
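As a plain-Python illustration (not the TensorFlow API), the rank n of a nested-list "tensor" can be computed recursively:

```python
def rank(value):
    """Return the number of dimensions of a nested-list tensor."""
    if isinstance(value, list):
        # A non-empty list adds one dimension on top of its elements.
        return 1 + (rank(value[0]) if value else 0)
    return 0  # a single scalar value has rank 0

print(rank(3.0))                       # n=0: single value -> 0
print(rank([1.0, 2.0]))                # n=1: list of values -> 1
print(rank([[1.0, 0.0], [0.0, 1.0]]))  # n=2: matrix of values -> 2
```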
Operation
• Here is a custom Sigmoid op example:
▪ Input Tensors: x = tf.constant([[1.0, 0.0], [0.0, -1.0]])
▪ Output Tensors: y = cpp_con_sigmoid.cpp_con_sigmoid(x)
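The math the custom op computes is just the element-wise logistic function; a plain-Python sketch (the `cpp_con_sigmoid` module name comes from the slide's custom-op example, and this sketch is not its actual kernel code):

```python
import math

def sigmoid(x):
    """Element-wise logistic function 1 / (1 + e^-x), applied to nested lists."""
    if isinstance(x, list):
        return [sigmoid(v) for v in x]
    return 1.0 / (1.0 + math.exp(-x))

# Mirrors the slide's input tensor x = [[1.0, 0.0], [0.0, -1.0]]
y = sigmoid([[1.0, 0.0], [0.0, -1.0]])
print(y[0][1])  # sigmoid(0.0) = 0.5
```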
Operation
• Register the op definition and its kernel with the kernel builder
class CppConSigmoid
Expressing: Ops
http://public.kevinrobinsonblog.com/docs/A%20tour%20through%20the%20TensorFlow%20codebase%20-%20v4.pdf
Build a Computation Graph
• Tensors flow between operations
https://machinelearningblogs.com/2017/09/07/tensorflow-tutorial-part-1-introduction/
(Diagram labels: Tensor, Operator, Computation Graph)
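A computation graph can be sketched in a few lines of plain Python: each node names an operation and its input edges, and evaluation lets tensors flow from node to node (a toy illustration, not TensorFlow's Graph class):

```python
# Each node: name -> (function, list of input node names). Leaves are constants.
graph = {
    "a":   (lambda: 2.0, []),
    "b":   (lambda: 3.0, []),
    "mul": (lambda a, b: a * b, ["a", "b"]),
    "add": (lambda m, a: m + a, ["mul", "a"]),
}

def evaluate(graph, name, cache=None):
    """Evaluate a node by recursively evaluating its inputs (memoized)."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn, inputs = graph[name]
        cache[name] = fn(*(evaluate(graph, i, cache) for i in inputs))
    return cache[name]

print(evaluate(graph, "add"))  # (2 * 3) + 2 = 8.0
```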
TensorFlow Framework
• Python application → Python APIs (Layers, Estimator, Keras, Canned Estimator), bridged to the C APIs via swig
• C++ application → C++ APIs
• Applications using C, Golang, … → C APIs (tensorflow/c/c_api.h); limited functions, inference only (ongoing)
• All of these sit on the TensorFlow core libraries (C++): core, runtime, graph, grappler, ops, kernels, …
Build a Computation Graph (DirectSession)
• The TF Graph is the computation graph in the session
• Python APIs call tf.MatMul(a, b) and go through swig to the C APIs (def _create_c_op(graph, node_def, inputs, control_inputs), line 1464); the C++ APIs call MatMul(scope, a, b)
• Both build a GraphDef (protobuf text), which is serialized / deserialized for distributed training
• A GraphDef is converted to a TF Graph (nodes and edges) with ConvertGraphDefToGraph(); a Graph is converted back to a GraphDef with Graph::ToGraphDef()
Computation Graph in details
• tf.get_default_graph() can produce the graph's protobuf message (GraphDef)
• The distributed session uses this to prune the graph and send subgraphs to other devices via gRPC
node {
  name: "mse"
  op: "Mean"
  input: "Square"
  input: "Const"
  attr {
    key: "T"
    value { type: DT_FLOAT }
  }
  attr {
    key: "Tidx"
    value { type: DT_INT32 }
  }
  attr {
    key: "keep_dims"
    value { b: false }
  }
}
node {
  name: "Const"
  op: "Const"
  attr {
    key: "dtype"
    value { type: DT_INT32 }
  }
  attr {
    key: "value"
    value {
      tensor {
        dtype: DT_INT32
        tensor_shape {
          dim { size: 2 }
        }
        tensor_content: "000000000000001000000000"
      }
    }
  }
}
node {
  name: "Square"
  op: "Square"
  input: "sub"
  attr {
    key: "T"
    value { type: DT_FLOAT }
  }
}
node {
  name: "sub"
  op: "Sub"
  input: "predictions"
  input: "y"
  attr {
    key: "T"
    value { type: DT_FLOAT }
  }
}
Computation Graph in details
• We can use the graph editor to manipulate our graph,
▪ for instance: swap our targeted output tensor out, and back in to a gradient op.
• from tensorflow.contrib import graph_editor as ge
▪ ge.add_control_inputs()
▪ ge.connect()
▪ ge.sgv()
▪ ge.remap_inputs()
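What a call like ge.remap_inputs() does can be pictured with a toy graph: rewiring means replacing one node's input edge with another node's output (a conceptual plain-Python sketch, not the contrib API; all names here are illustrative):

```python
# Toy graph: node -> list of input-node names (edges carry tensors).
graph = {
    "x": [],
    "swap_out": ["x"],
    "swap_in": ["swap_out"],
    "gradient_op": ["x"],  # currently reads "x" directly
}

def remap_input(graph, node, old_input, new_input):
    """Replace one input edge of `node`, in the spirit of ge.remap_inputs()."""
    graph[node] = [new_input if i == old_input else i for i in graph[node]]

# Route the gradient op through the swap-in node instead of reading x directly.
remap_input(graph, "gradient_op", "x", "swap_in")
print(graph["gradient_op"])  # ['swap_in']
```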
Re-use models
• Original way:
1. Define the model/network in Python
2. Train the model/network in Python
3. Save the model: *.ckpt (model.data, model.index, and model.meta)
4. freeze_graph → protobuf binary (.pb), which contains the weights data
5. Load the model/network in C++ for inferencing or transfer learning
• Another way to use the model in C++: export the graph as protobuf, .pb (binary) or .pb_txt
Compile C++ TensorFlow app/ops
• Here is an example of using CMake instead of a Bazel BUILD file; it is more convenient, and the resulting binary is much smaller.
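A minimal sketch of such a CMakeLists.txt, assuming the TensorFlow C++ library and headers were already built and installed under /usr/local (the paths, target name, and source file here are illustrative, not from the slide):

```cmake
cmake_minimum_required(VERSION 3.5)
project(tf_cpp_app)

set(CMAKE_CXX_STANDARD 11)

# Assumed install locations for the prebuilt TensorFlow C++ library and headers.
include_directories(/usr/local/include/tensorflow)
link_directories(/usr/local/lib)

add_executable(tf_app main.cc)
target_link_libraries(tf_app tensorflow_cc tensorflow_framework)
```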
Gradient Calculation
• TensorFlow uses reverse-mode autodiff
▪ It computes all the partial derivatives of the outputs with regard to all the inputs in just n_outputs + 1 graph traversals.
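The idea can be sketched in plain Python (a minimal scalar illustration, not TensorFlow's implementation): the forward pass records how each value was computed, and one backward traversal per output yields all of its partial derivatives:

```python
class Var:
    """A scalar value that records how it was computed, for backprop."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent Var, local partial derivative)
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self, seed=1.0):
        # One reverse traversal accumulates d(output)/d(node) everywhere.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

x, y = Var(3.0), Var(4.0)
z = x * y + x          # z = x*y + x
z.backward()           # a single backward pass computes both partials
print(x.grad, y.grad)  # dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```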
Where are the Gradient Ops?
• The gradient op registration could be in Python or C++
Computation Graph Execution
• A simple example that illustrates graph construction and execution using the C++ API:
https://www.tensorflow.org/api_guides/cc/guide
• How to know what happened?
1. export TF_CPP_MIN_VLOG_LEVEL=2
or
2. import os
   os.environ['TF_CPP_MIN_LOG_LEVEL']='0'
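In Python, the variable has to be exported before TensorFlow is imported, since the C++ runtime reads it at load time; a sketch:

```python
import os

# Must run before `import tensorflow`, or the C++ runtime won't see it.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"    # 0 = show all INFO-level logs
os.environ["TF_CPP_MIN_VLOG_LEVEL"] = "2"   # verbose logging up to VLOG(2)
```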
Log information
• The environment is 1 GPU and 32 CPU cores
▪ Decide the session factory type (direct session)
a. Inter-op parallelism threads: 32
▪ Build the executor
a. Find and add visible GPU devices
b. Create TF devices mapping to the physical GPU device
c. Build 4 kinds of streams via the (singleton) StreamGroupFactory:
» CUDA stream
» Host_to_Device stream
» Device_to_Host stream
» Device_to_Device stream
▪ PoolAllocator for the ProcessState CPU allocator
▪ BFCAllocator
a. Create bins …
▪ Grappler (computation graph optimization)
a. Do something …
▪ Op kernel
a. Instantiate the kernel for a node
b. Process / compute the node
» Allocate and deallocate tensors with allocators (cpu, or gpu: cuda_host_bfc)
▪ PollEvents
Log: Op kernel processing
• 2018-02-23 11:19:04.563051: I tensorflow/core/common_runtime/executor.cc:1561] Process node: 96 step 2 output/output/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](fc1/fc1/Relu, output/kernel/read) is dead: 0
• 2018-02-23 11:19:04.563053: I tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:190] PollEvents free_events_ 0 used_events_ 1
• 2018-02-23 11:19:04.563059: I tensorflow/core/platform/default/device_tracer.cc:307] PushAnnotation output/output/MatMul:MatMul
• 2018-02-23 11:19:04.563065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:445] GpuDevice::Compute output/output/MatMul op MatMul on GPU0 stream[0]
• 2018-02-23 11:19:04.563090: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: 2 kernel_name: "output/output/MatMul" tensor { dtype: DT_FLOAT shape { dim { size: 1024 } dim { size: 10 } } allocation_description { requested_bytes: 40960 allocated_bytes: 68608 allocator_name: "GPU_0_bfc" allocation_id: 85 has_single_reference: true ptr: 1108332340992 } } }
• 2018-02-23 11:19:04.563118: I tensorflow/stream_executor/stream.cc:3521] Called Stream::ThenBlasGemm(transa=NoTranspose, transb=NoTranspose, m=10, n=1024, k=128, alpha=1, a=0x1020dc15300, lda=10, b=0x1020dc62e00, ldb=128, beta=0, c=0x1020dc16700, ldc=10) stream=0x6006d80
• 2018-02-23 11:19:04.563129: I tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:190] PollEvents free_events_ 0 used_events_ 1
• 2018-02-23 11:19:04.563130: I tensorflow/stream_executor/cuda/cuda_blas.cc:1881] doing cuBLAS SGEMM: at=0 bt=0 m=10 n=1024 k=128 alpha=1.000000 a=0x1020dc15300 lda=10 b=0x1020dc62e00 ldb=128 beta=0.000000 c=0x1020dc16700 ldc=10
• 2018-02-23 11:19:04.563156: I tensorflow/core/platform/default/device_tracer.cc:483] ApiCallback 1:307 func: cuLaunchKernel
• 2018-02-23 11:19:04.563164: I tensorflow/core/platform/default/device_tracer.cc:497] LAUNCH stream 0x5fd4490 correllation 1217 kernel sgemm_32x32x32_NN
• 2018-02-23 11:19:04.563168: I tensorflow/core/platform/default/device_tracer.cc:471] 1217 : output/output/MatMul:MatMul
• 2018-02-23 11:19:04.563198: I tensorflow/core/platform/default/device_tracer.cc:483] ApiCallback 1:307 func: cuLaunchKernel
• 2018-02-23 11:19:04.563210: I tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:190] PollEvents free_events_ 0 used_events_ 1
• 2018-02-23 11:19:04.563222: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: 2 kernel_name: "output/output/MatMul" tensor { dtype: DT_FLOAT shape { dim { size: 1024 } dim { size: 10 } } allocation_description { requested_bytes: 40960 allocated_bytes: 68608 allocator_name: "GPU_0_bfc" allocation_id: 85 has_single_reference: true ptr: 1108332340992 } } }
• 2018-02-23 11:19:04.563236: I tensorflow/core/common_runtime/executor.cc:1673] Synchronous kernel done: 96 step 2 output/output/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](fc1/fc1/Relu, output/kernel/read) is dead: 0
• 2018-02-23 11:19:04.563244: I tensorflow/core/common_runtime/step_stats_collector.cc:264] Save dev /job:localhost/replica:0/task:0/device:GPU:0 nt 0x7fa0a7786830
Log: Tensor Allocation / Deallocation
• 2018-02-23 11:19:04.430051: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: 2 kernel_name: "pool3/dropout/cond/Merge" tensor { dtype: DT_FLOAT shape { dim { size: 1024 } dim { size: 12544 } } allocation_description { requested_bytes: 51380224 allocated_bytes: 82837504 allocator_name: "GPU_0_bfc" allocation_id: 81 ptr: 1108467515392 } } }
• ... ...
• 2018-02-23 11:19:04.564922: I tensorflow/core/common_runtime/gpu/gpu_device.cc:445] GpuDevice::Compute train/gradients/train/Mean_grad/Prod/_16 op _Send on GPU0 stream[0].
• ... ...
• 2018-02-23 11:19:04.566499: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocation_id: 81 allocator_name: "GPU_0_bfc" }
(The tensor with allocation_id 81 is about to be deallocated.)
Log: Event Manager
• 2018-02-23 11:19:04.565108: I tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:190] PollEvents free_events_ 1 used_events_ 2
• 2018-02-23 11:19:04.565123: I tensorflow/core/platform/default/device_tracer.cc:483] ApiCallback 1:279 func: cuMemcpyDtoHAsync_v2
• 2018-02-23 11:19:04.565132: I tensorflow/core/platform/default/device_tracer.cc:471] 1286 : edge_69_train/gradients/train/Mean_grad/Prod
• 2018-02-23 11:19:04.565138: I tensorflow/stream_executor/cuda/cuda_driver.cc:1215] successfully enqueued async memcpy d2h of 4 bytes from 0x1020f310000 to 0x1020de00a00 on stream 0x7fa18db423b0
• 2018-02-23 11:19:04.565144: I tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:151] QueueInUse free_events_ 1 used_events_ 2
• 2018-02-23 11:19:04.565152: I tensorflow/stream_executor/stream.cc:302] Called Stream::ThenRecordEvent(event=0x7fa0a778b3f0) stream=0x7fa18db1b7f0
• 2018-02-23 11:19:04.565159: I tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:190] PollEvents free_events_ 0 used_events_ 3
IntraProcessRendezvous
• 2018-02-23 11:19:04.577400: I tensorflow/core/common_runtime/rendezvous_mgr.cc:42] IntraProcessRendezvous Send 0x7fa18d9bab20 /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_185_Conv2_SwapIn;0:0
• 2018-02-23 11:19:03.141265: I tensorflow/core/common_runtime/rendezvous_mgr.cc:119] IntraProcessRendezvous Recv 0x7fa18d9bab20 /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_185_Conv2_SwapIn;0:0
TensorFlow Graph Execution
• Session level: global control
• Executor level: run the graph asynchronously; executors get created for each subgraph; nodes (operations) are taken from the "ready" queue
• Op level: compute forward and gradient; calls into Stream, which contains the stream_executor
• Device level: memory management
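The executor's scheduling can be sketched in plain Python: nodes whose inputs are all available go onto a "ready" queue, and completing a node may make its consumers ready (a toy model of the real asynchronous executor, which additionally dispatches kernels onto device streams):

```python
from collections import deque

# node -> list of input node names (a small diamond-shaped graph)
graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

def run(graph):
    """Execute nodes in dependency order using a ready queue."""
    pending = {n: len(inputs) for n, inputs in graph.items()}
    consumers = {n: [] for n in graph}
    for node, inputs in graph.items():
        for i in inputs:
            consumers[i].append(node)
    ready = deque(n for n, count in pending.items() if count == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)          # "compute" the node here
        for consumer in consumers[node]:
            pending[consumer] -= 1  # one more of its inputs has arrived
            if pending[consumer] == 0:
                ready.append(consumer)
    return order

print(run(graph))  # ['a', 'b', 'c', 'd']
```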