These slides introduce TensorFlow concepts based on a source-code study, covering tensors, operations, the computation graph, and execution.
ITRI CONFIDENTIAL DOCUMENT: DO NOT COPY OR DISTRIBUTE
TensorFlow Study (Part I)
Danny Liu (劉得彥)
Information and Communications Research Laboratories (ICL)
Tensor
• A tensor is an n-dimensional structure
▪ n=0: single value
▪ n=1: list of values
▪ n=2: matrix of values
▪ n=3: cube of values
▪ …
• The Tensor class:
▪ Data is stored in a TensorBuffer as an Eigen::Tensor
▪ The shape is defined by TensorShape
▪ Reference counting is handled by RefCounted
https://www.slideshare.net/EdurekaIN/introduction-to-tensorflow-deep-learning-using-tensorflow-tensorflow-tutorial-edureka
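As a plain-Python illustration (not the TensorFlow API), the rank n of a nested-list "tensor" can be computed recursively:

```python
def rank(value):
    """Return the number of dimensions of a nested-list tensor."""
    if isinstance(value, list):
        # A non-empty list adds one dimension on top of its elements.
        return 1 + (rank(value[0]) if value else 0)
    return 0  # a single scalar value has rank 0

print(rank(3.0))                       # n=0: single value -> 0
print(rank([1.0, 2.0]))                # n=1: list of values -> 1
print(rank([[1.0, 0.0], [0.0, 1.0]]))  # n=2: matrix of values -> 2
```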
Operation
• Here is a custom Sigmoid op example:
▪ Input Tensors: x = tf.constant([[1.0, 0.0], [0.0, -1.0]])
▪ Output Tensors: y = cpp_con_sigmoid.cpp_con_sigmoid(x)
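The math the custom op computes is just the element-wise logistic function; a plain-Python sketch (the `cpp_con_sigmoid` module name comes from the slide's custom-op example, and this sketch is not its actual kernel code):

```python
import math

def sigmoid(x):
    """Element-wise logistic function 1 / (1 + e^-x), applied to nested lists."""
    if isinstance(x, list):
        return [sigmoid(v) for v in x]
    return 1.0 / (1.0 + math.exp(-x))

# Mirrors the slide's input tensor x = [[1.0, 0.0], [0.0, -1.0]]
y = sigmoid([[1.0, 0.0], [0.0, -1.0]])
print(y[0][1])  # sigmoid(0.0) = 0.5
```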
Operation
• Register the op definition and its kernel with the kernel builder
class CppConSigmoid
Expressing: Ops
http://public.kevinrobinsonblog.com/docs/A%20tour%20through%20the%20TensorFlow%20codebase%20-%20v4.pdf
Build a Computation Graph
• Tensors flow between operations
https://machinelearningblogs.com/2017/09/07/tensorflow-tutorial-part-1-introduction/
(Diagram labels: Tensor, Operator, Computation Graph)
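A computation graph can be sketched in a few lines of plain Python: each node names an operation and its input edges, and evaluation lets tensors flow from node to node (a toy illustration, not TensorFlow's Graph class):

```python
# Each node: name -> (function, list of input node names). Leaves are constants.
graph = {
    "a":   (lambda: 2.0, []),
    "b":   (lambda: 3.0, []),
    "mul": (lambda a, b: a * b, ["a", "b"]),
    "add": (lambda m, a: m + a, ["mul", "a"]),
}

def evaluate(graph, name, cache=None):
    """Evaluate a node by recursively evaluating its inputs (memoized)."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn, inputs = graph[name]
        cache[name] = fn(*(evaluate(graph, i, cache) for i in inputs))
    return cache[name]

print(evaluate(graph, "add"))  # (2 * 3) + 2 = 8.0
```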
TensorFlow Framework
• Python application → Python APIs (Layers, Estimator, Keras, Canned Estimator), bridged to the C APIs via swig
• C++ application → C++ APIs
• Applications using C, Golang, … → C APIs (tensorflow/c/c_api.h); limited functions, inference only (ongoing)
• All of these sit on the TensorFlow core libraries (C++): core, runtime, graph, grappler, ops, kernels, …
Build a Computation Graph (DirectSession)
• The TF Graph is the computation graph in the session
• Python APIs call tf.MatMul(a, b) and go through swig to the C APIs (def _create_c_op(graph, node_def, inputs, control_inputs), line 1464); the C++ APIs call MatMul(scope, a, b)
• Both build a GraphDef (protobuf text), which is serialized / deserialized for distributed training
• A GraphDef is converted to a TF Graph (nodes and edges) with ConvertGraphDefToGraph(); a Graph is converted back to a GraphDef with Graph::ToGraphDef()
Computation Graph in details
• tf.get_default_graph() can produce the graph's protobuf message (GraphDef)
• The distributed session uses this to prune the graph and send subgraphs to other devices via gRPC
node {
  name: "mse"
  op: "Mean"
  input: "Square"
  input: "Const"
  attr {
    key: "T"
    value { type: DT_FLOAT }
  }
  attr {
    key: "Tidx"
    value { type: DT_INT32 }
  }
  attr {
    key: "keep_dims"
    value { b: false }
  }
}
node {
  name: "Const"
  op: "Const"
  attr {
    key: "dtype"
    value { type: DT_INT32 }
  }
  attr {
    key: "value"
    value {
      tensor {
        dtype: DT_INT32
        tensor_shape {
          dim { size: 2 }
        }
        tensor_content: "000000000000001000000000"
      }
    }
  }
}
node {
  name: "Square"
  op: "Square"
  input: "sub"
  attr {
    key: "T"
    value { type: DT_FLOAT }
  }
}
node {
  name: "sub"
  op: "Sub"
  input: "predictions"
  input: "y"
  attr {
    key: "T"
    value { type: DT_FLOAT }
  }
}
Computation Graph in details
• We can use the graph editor to manipulate our graph,
▪ for instance: swap our targeted output tensor out, and back in to a gradient op.
• from tensorflow.contrib import graph_editor as ge
▪ ge.add_control_inputs()
▪ ge.connect()
▪ ge.sgv()
▪ ge.remap_inputs()
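What a call like ge.remap_inputs() does can be pictured with a toy graph: rewiring means replacing one node's input edge with another node's output (a conceptual plain-Python sketch, not the contrib API; all names here are illustrative):

```python
# Toy graph: node -> list of input-node names (edges carry tensors).
graph = {
    "x": [],
    "swap_out": ["x"],
    "swap_in": ["swap_out"],
    "gradient_op": ["x"],  # currently reads "x" directly
}

def remap_input(graph, node, old_input, new_input):
    """Replace one input edge of `node`, in the spirit of ge.remap_inputs()."""
    graph[node] = [new_input if i == old_input else i for i in graph[node]]

# Route the gradient op through the swap-in node instead of reading x directly.
remap_input(graph, "gradient_op", "x", "swap_in")
print(graph["gradient_op"])  # ['swap_in']
```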
Re-use models
• Original way:
1. Define the model/network in Python
2. Train the model/network in Python
3. Save the model: *.ckpt (model.data, model.index, and model.meta)
4. freeze_graph → protobuf binary (.pb), which contains the weights data
5. Load the model/network in C++ for inferencing or transfer learning
• Another way to use the model in C++: export the graph as protobuf, .pb (binary) or .pb_txt
Compile C++ TensorFlow app/ops
• Here is an example of using CMake instead of a Bazel BUILD file; it is more convenient, and the resulting binary is much smaller.
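A minimal sketch of such a CMakeLists.txt, assuming the TensorFlow C++ library and headers were already built and installed under /usr/local (the paths, target name, and source file here are illustrative, not from the slide):

```cmake
cmake_minimum_required(VERSION 3.5)
project(tf_cpp_app)

set(CMAKE_CXX_STANDARD 11)

# Assumed install locations for the prebuilt TensorFlow C++ library and headers.
include_directories(/usr/local/include/tensorflow)
link_directories(/usr/local/lib)

add_executable(tf_app main.cc)
target_link_libraries(tf_app tensorflow_cc tensorflow_framework)
```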
Gradient Calculation
• TensorFlow uses reverse-mode autodiff
▪ It computes all the partial derivatives of the outputs with regard to all the inputs in just n_outputs + 1 graph traversals.
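The idea can be sketched in plain Python (a minimal scalar illustration, not TensorFlow's implementation): the forward pass records how each value was computed, and one backward traversal per output yields all of its partial derivatives:

```python
class Var:
    """A scalar value that records how it was computed, for backprop."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent Var, local partial derivative)
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self, seed=1.0):
        # One reverse traversal accumulates d(output)/d(node) everywhere.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

x, y = Var(3.0), Var(4.0)
z = x * y + x          # z = x*y + x
z.backward()           # a single backward pass computes both partials
print(x.grad, y.grad)  # dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```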
Where are the Gradient Ops?
• The gradient op registration could be in Python or C++
Computation Graph Execution
• A simple example that illustrates graph construction and execution using the C++ API:
https://www.tensorflow.org/api_guides/cc/guide
• How to know what happened?
1. export TF_CPP_MIN_VLOG_LEVEL=2
or
2. import os
   os.environ['TF_CPP_MIN_LOG_LEVEL']='0'
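In Python, the variable has to be exported before TensorFlow is imported, since the C++ runtime reads it at load time; a sketch:

```python
import os

# Must run before `import tensorflow`, or the C++ runtime won't see it.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"    # 0 = show all INFO-level logs
os.environ["TF_CPP_MIN_VLOG_LEVEL"] = "2"   # verbose logging up to VLOG(2)
```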
Log information
• The environment is 1 GPU and 32 CPU cores
▪ Decide the session factory type (direct session)
a. Inter-op parallelism threads: 32
▪ Build the executor
a. Find and add visible GPU devices
b. Create TF devices mapping to the physical GPU device
c. Build 4 kinds of streams via the (singleton) StreamGroupFactory:
» CUDA stream
» Host_to_Device stream
» Device_to_Host stream
» Device_to_Device stream
▪ PoolAllocator for the ProcessState CPU allocator
▪ BFCAllocator
a. Create bins …
▪ Grappler (computation graph optimization)
a. Do something …
▪ Op kernel
a. Instantiate the kernel for a node
b. Process / compute the node
» Allocate and deallocate tensors with allocators (cpu, or gpu: cuda_host_bfc)
▪ PollEvents
Log: Op kernel processing
• 2018-02-23 11:19:04.563051: I tensorflow/core/common_runtime/executor.cc:1561] Process node: 96 step 2 output/output/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](fc1/fc1/Relu, output/kernel/read) is dead: 0
• 2018-02-23 11:19:04.563053: I tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:190] PollEvents free_events_ 0 used_events_ 1
• 2018-02-23 11:19:04.563059: I tensorflow/core/platform/default/device_tracer.cc:307] PushAnnotation output/output/MatMul:MatMul
• 2018-02-23 11:19:04.563065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:445] GpuDevice::Compute output/output/MatMul op MatMul on GPU0 stream[0]
• 2018-02-23 11:19:04.563090: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: 2 kernel_name: "output/output/MatMul" tensor { dtype: DT_FLOAT shape { dim { size: 1024 } dim { size: 10 } } allocation_description { requested_bytes: 40960 allocated_bytes: 68608 allocator_name: "GPU_0_bfc" allocation_id: 85 has_single_reference: true ptr: 1108332340992 } } }
• 2018-02-23 11:19:04.563118: I tensorflow/stream_executor/stream.cc:3521] Called Stream::ThenBlasGemm(transa=NoTranspose, transb=NoTranspose, m=10, n=1024, k=128, alpha=1, a=0x1020dc15300, lda=10, b=0x1020dc62e00, ldb=128, beta=0, c=0x1020dc16700, ldc=10) stream=0x6006d80
• 2018-02-23 11:19:04.563129: I tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:190] PollEvents free_events_ 0 used_events_ 1
• 2018-02-23 11:19:04.563130: I tensorflow/stream_executor/cuda/cuda_blas.cc:1881] doing cuBLAS SGEMM: at=0 bt=0 m=10 n=1024 k=128 alpha=1.000000 a=0x1020dc15300 lda=10 b=0x1020dc62e00 ldb=128 beta=0.000000 c=0x1020dc16700 ldc=10
• 2018-02-23 11:19:04.563156: I tensorflow/core/platform/default/device_tracer.cc:483] ApiCallback 1:307 func: cuLaunchKernel
• 2018-02-23 11:19:04.563164: I tensorflow/core/platform/default/device_tracer.cc:497] LAUNCH stream 0x5fd4490 correllation 1217 kernel sgemm_32x32x32_NN
• 2018-02-23 11:19:04.563168: I tensorflow/core/platform/default/device_tracer.cc:471] 1217 : output/output/MatMul:MatMul
• 2018-02-23 11:19:04.563198: I tensorflow/core/platform/default/device_tracer.cc:483] ApiCallback 1:307 func: cuLaunchKernel
• 2018-02-23 11:19:04.563210: I tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:190] PollEvents free_events_ 0 used_events_ 1
• 2018-02-23 11:19:04.563222: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: 2 kernel_name: "output/output/MatMul" tensor { dtype: DT_FLOAT shape { dim { size: 1024 } dim { size: 10 } } allocation_description { requested_bytes: 40960 allocated_bytes: 68608 allocator_name: "GPU_0_bfc" allocation_id: 85 has_single_reference: true ptr: 1108332340992 } } }
• 2018-02-23 11:19:04.563236: I tensorflow/core/common_runtime/executor.cc:1673] Synchronous kernel done: 96 step 2 output/output/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](fc1/fc1/Relu, output/kernel/read) is dead: 0
• 2018-02-23 11:19:04.563244: I tensorflow/core/common_runtime/step_stats_collector.cc:264] Save dev /job:localhost/replica:0/task:0/device:GPU:0 nt 0x7fa0a7786830
Log: Tensor Allocation / Deallocation
• 2018-02-23 11:19:04.430051: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: 2 kernel_name: "pool3/dropout/cond/Merge" tensor { dtype: DT_FLOAT shape { dim { size: 1024 } dim { size: 12544 } } allocation_description { requested_bytes: 51380224 allocated_bytes: 82837504 allocator_name: "GPU_0_bfc" allocation_id: 81 ptr: 1108467515392 } } }
• ... ...
• 2018-02-23 11:19:04.564922: I tensorflow/core/common_runtime/gpu/gpu_device.cc:445] GpuDevice::Compute train/gradients/train/Mean_grad/Prod/_16 op _Send on GPU0 stream[0].
• ... ...
• 2018-02-23 11:19:04.566499: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocation_id: 81 allocator_name: "GPU_0_bfc" }
(The tensor with allocation_id 81 is about to be deallocated.)
Log: Event Manager
• 2018-02-23 11:19:04.565108: I tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:190] PollEvents free_events_ 1 used_events_ 2
• 2018-02-23 11:19:04.565123: I tensorflow/core/platform/default/device_tracer.cc:483] ApiCallback 1:279 func: cuMemcpyDtoHAsync_v2
• 2018-02-23 11:19:04.565132: I tensorflow/core/platform/default/device_tracer.cc:471] 1286 : edge_69_train/gradients/train/Mean_grad/Prod
• 2018-02-23 11:19:04.565138: I tensorflow/stream_executor/cuda/cuda_driver.cc:1215] successfully enqueued async memcpy d2h of 4 bytes from 0x1020f310000 to 0x1020de00a00 on stream 0x7fa18db423b0
• 2018-02-23 11:19:04.565144: I tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:151] QueueInUse free_events_ 1 used_events_ 2
• 2018-02-23 11:19:04.565152: I tensorflow/stream_executor/stream.cc:302] Called Stream::ThenRecordEvent(event=0x7fa0a778b3f0) stream=0x7fa18db1b7f0
• 2018-02-23 11:19:04.565159: I tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:190] PollEvents free_events_ 0 used_events_ 3
IntraProcessRendezvous
• 2018-02-23 11:19:04.577400: I tensorflow/core/common_runtime/rendezvous_mgr.cc:42] IntraProcessRendezvous Send 0x7fa18d9bab20 /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_185_Conv2_SwapIn;0:0
• 2018-02-23 11:19:03.141265: I tensorflow/core/common_runtime/rendezvous_mgr.cc:119] IntraProcessRendezvous Recv 0x7fa18d9bab20 /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_185_Conv2_SwapIn;0:0
TensorFlow Graph Execution
• Session level: global control
• Executor level: run the graph asynchronously; executors get created for each subgraph; nodes (operations) are taken from the "ready" queue
• Op level: compute forward and gradient; calls into Stream, which contains the stream_executor
• Device level: memory management
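The executor's scheduling can be sketched in plain Python: nodes whose inputs are all available go onto a "ready" queue, and completing a node may make its consumers ready (a toy model of the real asynchronous executor, which additionally dispatches kernels onto device streams):

```python
from collections import deque

# node -> list of input node names (a small diamond-shaped graph)
graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

def run(graph):
    """Execute nodes in dependency order using a ready queue."""
    pending = {n: len(inputs) for n, inputs in graph.items()}
    consumers = {n: [] for n in graph}
    for node, inputs in graph.items():
        for i in inputs:
            consumers[i].append(node)
    ready = deque(n for n, count in pending.items() if count == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)          # "compute" the node here
        for consumer in consumers[node]:
            pending[consumer] -= 1  # one more of its inputs has arrived
            if pending[consumer] == 0:
                ready.append(consumer)
    return order

print(run(graph))  # ['a', 'b', 'c', 'd']
```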