primitiv: Neural Network Toolkit
Yusuke Oda
2018/3/26 ASTREC, NICT
Agenda
• Basics of neural networks with computation graphs
• Design details and examples of primitiv
• An example usage
Neural Networks with Computation Graphs
• 𝑦 = tanh(𝑊𝑥 + 𝑏) + 𝑥
• → 𝑦 = add(tanh(add(matmul(𝑊, 𝑥), 𝑏)), 𝑥)
• FFNNs can be represented as a DAG = Computation Graph
[Diagram: the expression built up node by node as a computation graph; input nodes 𝑥, 𝑊, 𝑏 feed the operations matmul → add → tanh → add, producing the intermediate values 𝑊𝑥, 𝑊𝑥 + 𝑏, tanh(𝑊𝑥 + 𝑏) and the output 𝑦 = tanh(𝑊𝑥 + 𝑏) + 𝑥]
Forward Calculation (Retrieving Values)
• Once the computation graph is constructed, the actual computation can be performed along the graph in topological order.
[Diagram: with 𝑥 = 3, 𝑊 = 5, 𝑏 = 2, the nodes are evaluated in order (1) matmul → (2) add → (3) tanh → (4) add, yielding 𝑊𝑥 = 15, 𝑊𝑥 + 𝑏 = 17, tanh(𝑊𝑥 + 𝑏) = 0.9…, and 𝑦 = 3.9…]
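As a toy illustration of evaluation in topological order (a minimal sketch, not primitiv's actual internals; the Op struct and forward_all are made-up names), the following evaluates the scalar graph above front to back:

// A minimal sketch: forward evaluation of a scalar computation graph
// stored in topological order.
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

struct Op {
  std::vector<int> args;                               // indices of argument nodes
  std::function<float(const std::vector<float>&)> fw;  // forward rule (empty for leaves)
  float value;                                         // filled in by the forward pass
};

void forward_all(std::vector<Op> &g) {
  for (Op &n : g) {                                    // topological order == storage order
    std::vector<float> in;
    for (int a : n.args) in.push_back(g[a].value);
    if (n.fw) n.value = n.fw(in);                      // leaves (x, W, b) keep preset values
  }
}

int main() {
  std::vector<Op> g(7);
  g[0].value = 3;  // x
  g[1].value = 5;  // W
  g[2].value = 2;  // b
  g[3] = {{1, 0}, [](const std::vector<float> &v) { return v[0] * v[1]; }, 0};      // W*x
  g[4] = {{3, 2}, [](const std::vector<float> &v) { return v[0] + v[1]; }, 0};      // + b
  g[5] = {{4},    [](const std::vector<float> &v) { return std::tanh(v[0]); }, 0};  // tanh
  g[6] = {{5, 0}, [](const std::vector<float> &v) { return v[0] + v[1]; }, 0};      // + x
  forward_all(g);
  std::printf("y = %f\n", g[6].value);  // ~3.99..., the slide's "3.9..."
}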
Backward Calculation (Retrieving Gradients)
• Backpropagation: using the chain rule of derivatives along the same graph.
[Diagram: the same graph traversed in reverse order (4) → (1), with the backward rule attached to each node:
  add:    𝑑𝐸/𝑑𝑥₁ += 𝑑𝐸/𝑑𝑦,  𝑑𝐸/𝑑𝑥₂ += 𝑑𝐸/𝑑𝑦
  tanh:   𝑑𝐸/𝑑𝑥 += (1 − 𝑦²) 𝑑𝐸/𝑑𝑦
  matmul: 𝑑𝐸/𝑑𝑋₁ += (𝑑𝐸/𝑑𝑌) 𝑋₂ᵀ,  𝑑𝐸/𝑑𝑋₂ += 𝑋₁ᵀ (𝑑𝐸/𝑑𝑌)]
Backward Calculation (Retrieving Gradients)
• Backpropagation: using the chain rule of derivatives along the same graph.
[Same diagram using the shorthand 𝑔(·) for gradients:
  add:    𝑔(𝑥₁) += 𝑔(𝑦),  𝑔(𝑥₂) += 𝑔(𝑦)
  tanh:   𝑔(𝑥) += (1 − 𝑦²) 𝑔(𝑦)
  matmul: 𝑔(𝑋₁) += 𝑔(𝑌) 𝑋₂ᵀ,  𝑔(𝑋₂) += 𝑋₁ᵀ 𝑔(𝑌)]
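A matching sketch of the backward pass (again only illustrative; primitiv applies the analogous rules per Tensor on the selected Device). It traverses the same scalar graph in reverse and applies the 𝑔(·) accumulation rules above; smaller numbers than the slide's are used so that tanh is not saturated:

// Minimal sketch of reverse-mode accumulation for y = tanh(W*x + b) + x (scalars).
#include <cmath>
#include <cstdio>

int main() {
  // Forward pass (values are kept because the backward rules need them).
  const float x = 0.5f, W = 0.5f, b = 0.1f;
  const float u = W * x;         // matmul (scalar case)
  const float v = u + b;         // add
  const float h = std::tanh(v);  // tanh
  const float y = h + x;         // add -> y

  // Backward pass: seed g(y) = 1, then apply "g(arg) += local derivative * g(result)"
  // in reverse topological order, exactly as annotated on the slide.
  float gy = 1, gh = 0, gx = 0, gv = 0, gu = 0, gW = 0, gb = 0;
  gh += gy;                      // add:    g(x1) += g(y)
  gx += gy;                      //         g(x2) += g(y)
  gv += (1 - h * h) * gh;        // tanh:   g(x)  += (1 - y^2) g(y)
  gu += gv;                      // add:    g(x1) += g(y)
  gb += gv;                      //         g(x2) += g(y)
  gW += gu * x;                  // matmul: g(X1) += g(Y) X2^T  (scalars here)
  gx += W * gu;                  //         g(X2) += X1^T g(Y)
  std::printf("y=%g  dy/dx=%g  dy/dW=%g  dy/db=%g\n", y, gx, gW, gb);
}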
Function and Variable of Graph
[Diagram: a Function node (𝑓_fw, 𝑓_bw) connects argument Variables (𝑋₁, 𝑔(𝑋₁)), (𝑋₂, 𝑔(𝑋₂)) to a result Variable (𝑌, 𝑔(𝑌))]
• Function: specifies the forward/backward calculation
• Variable: represents actual values and gradients
• Arguments: 0 or more / Results: 1 or more
Forward/backward operations of the Function
• 𝑓_fw: (𝑋₁, …, 𝑋ₙ) ⟼ (𝑌₁, …, 𝑌ₘ)
• 𝑓_bw: (𝑋₁, …, 𝑋ₙ, 𝑌₁, …, 𝑌ₘ, 𝑔(𝑌₁), …, 𝑔(𝑌ₘ)) ⟼ (𝑔(𝑋₁), …, 𝑔(𝑋ₙ))
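In code, this contract could be pictured roughly as follows (a sketch only; the names Value and Function here are illustrative, not primitiv's actual classes):

// Sketch of the f_fw / f_bw contract of a graph Function.
#include <vector>

struct Value { /* a Shape plus actual values / gradients on some Device */ };

class Function {
public:
  virtual ~Function() = default;
  // f_fw: (X1, ..., Xn) -> (Y1, ..., Ym)
  virtual std::vector<Value> forward(const std::vector<Value> &xs) = 0;
  // f_bw: (X1..Xn, Y1..Ym, g(Y1)..g(Ym)) -> accumulates into g(X1)..g(Xn)
  virtual void backward(const std::vector<Value> &xs,
                        const std::vector<Value> &ys,
                        const std::vector<Value> &gys,
                        std::vector<Value> &gxs) const = 0;
};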
Combined Functions
• Any subgraph that starts and ends with Functions can be treated as one Function (see the sketch below).
[Diagram: parameter → 𝑊 → matmul, parameter → 𝑏 → add → 𝑢 → ReLU, mapping (𝑋, 𝑔(𝑋)) to (𝑌, 𝑔(𝑌)); the whole subgraph is wrapped as a single "Linear" function]
• The "Linear" function in some toolkits owns its parameters itself and applies these 2-3 functions.
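A hedged sketch of such a combined function written against primitiv's functional API (make_linear is a made-up helper name; F::relu is assumed to be available as an activation, otherwise substitute F::tanh):

// Wraps parameter -> matmul -> add -> ReLU into a single reusable "Linear" call.
#include <primitiv/primitiv.h>
using namespace primitiv;
namespace F = primitiv::functions;

Node make_linear(const Node &x, Parameter &w, Parameter &b) {
  Node ww = F::parameter<Node>(w);  // parameter node for W
  Node bb = F::parameter<Node>(b);  // parameter node for b
  Node u = F::matmul(ww, x) + bb;   // matmul, then add
  return F::relu(u);                // ReLU activation (assumed name)
}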
3 Strategies to Construct
Computation Graphs
• Difference: when and how to construct the graph and calculate the results.
• Static construction
• Caffe, Torch, TensorFlow, etc.
• Dynamic construction (define-by-run)
• Chainer, PyTorch, etc.
• Dynamic construction with lazy evaluation
• DyNet, PyTorch(partially), primitiv
Static Construction
• Constructs the computation graph before all executions.
• Then executes the "fixed" graph with actual data.
[Diagram: the graph 𝑥, 𝑊, 𝑏 → matmul → add → tanh → add → 𝑦 is fixed into a function 𝑓(𝑥, 𝑊, 𝑏) = 𝑦, which is then executed repeatedly with actual data, e.g. 𝑓(3, 5, 2) = 3.9…]
Dynamic Construction (define-by-run)
• Graph construction and actual calculation are performed simultaneously.
[Diagram: Run 1 builds and evaluates matmul → add → tanh → add step by step with actual values (𝑥 = 3, 𝑊 = 5, 𝑏 = 2 → 15 → 17 → 0.9… → 3.9…); Run 2 rebuilds the graph from scratch with new data (𝑥 = 9, 𝑊 = 2, 𝑏 = −1 → 18 → 17 → 0.9… → 9.9…)]
Dynamic Construction with Lazy Evaluation
• Consists of 2 steps:
  1. Construct the graph using only the types (shapes) of the values.
  2. Perform the actual computation (forward/backward) along the graph.
[Diagram: the graph matmul → add → tanh → add is first built with unknown values ("?"); querying the output then triggers the actual computation (3, 5, 2 → 15 → 17 → 0.9… → 3.9…)]
Pros/cons of each strategy
• Static
  • Capable of strong compile-time optimization
  • Difficult to construct interactive graphs
• Dynamic (define-by-run)
  • Capable of constructing interactive graphs
  • Large overhead and difficult to optimize
• Dynamic + Lazy
  • Also capable of interactive graphs
  • Can apply just-in-time optimization
  • Requires a 2-pass traversal over the graph
    • shapes are always calculated; values are calculated on demand
  • Whole-graph optimization is still difficult
primitiv?
primitiv: Dynamic+Lazy NN Toolkit
• Originally forked from DyNet
• All components have been restructured
• Concepts
• Simple
• Compact
• Device/environment independent
• Implicit minibatching
• Multiple language support
primitiv: Simple
• Consists of only the essential functionality; pointless features are mostly omitted.
• Lower learning cost.
• Even so, code does not become long: an encoder-decoder model can be implemented in about 300 lines of C++ (see the examples in the repository).
primitiv: Compact
• For a minimal installation, you only need GCC/Clang and CMake:

$ git clone https://github.com/primitiv/primitiv
$ cd primitiv
$ cmake .
$ make
$ make install
$ echo "That's all."

• If you need support for specific hardware (e.g. CUDA), all you have to do is add the corresponding build switch:

$ cmake . -DPRIMITIV_USE_CUDA=ON
primitiv: Device/environment Independent
• Device-specific code and the network structure are completely separated.
• Once the model is written, the code can be executed on any (even unknown) hardware with no modification.

#include <primitiv/primitiv.h>
using namespace primitiv;
namespace F = primitiv::functions;

Node predict(Node &x, Parameter &w, Parameter &b) {
  Node ww = F::parameter<Node>(w);
  Node bb = F::parameter<Node>(b);
  return F::tanh(F::matmul(ww, x) + bb) + x;
}

The same predict() runs on CPU, CUDA, OpenCL, or any other (even future) backend.
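For example, switching the backend is a one-line change around the unchanged predict() above. This is a sketch using the default-device mechanism shown later in this deck; the scalar parameter values are purely illustrative, scalar shapes rely on the shape-equivalence rule described below, and the CUDA line assumes the library was built with -DPRIMITIV_USE_CUDA=ON:

#include <primitiv/primitiv.h>
using namespace primitiv;
namespace F = primitiv::functions;

Node predict(Node &x, Parameter &w, Parameter &b);  // as defined above

int main() {
  devices::Naive dev;        // CPU backend
  // devices::CUDA dev(0);   // ...or the CUDA backend on GPU 0 instead
  Graph g;
  Device::set_default(dev);
  Graph::set_default(g);

  Parameter w(Shape({}), {5});              // toy scalar parameters (illustrative)
  Parameter b(Shape({}), {2});
  Node x = F::input<Node>(Shape({}), {3});
  Node y = predict(x, w, b);                // identical network code on any Device
  // y.to_vector() would run the computation on whichever Device was selected.
  return 0;
}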
primitiv: Implicit Minibatching
• Most networks can be used for both single and minibatched data without modification (see the usage sketch below).

Node predict(Node &x, Parameter &w, Parameter &b) {
  Node ww = F::parameter<Node>(w);
  Node bb = F::parameter<Node>(b);
  return F::tanh(F::matmul(ww, x) + bb) + x;
}

Single data: 3 ⟼ 3.9…
3-minibatched data: (3, 4, 5) ⟼ (3.9…, 4.9…, 5.9…)
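As a usage sketch (assuming w, b, predict() and a default Graph/Device exactly as in the previous snippet, with scalar shapes acting as 1×1 matrices under matmul per the equivalence rule below), the only thing that changes between the two cases is the Shape of the input:

// Single datum: minibatch size 1 (omitted from the Shape).
Node x1 = F::input<Node>(Shape({}), {3});
Node y1 = predict(x1, w, b);                 // one result (the slide's 3.9...)

// 3-minibatched data: same network code, only the input Shape differs.
Node x3 = F::input<Node>(Shape({}, 3), {3, 4, 5});
Node y3 = predict(x3, w, b);                 // three results (3.9..., 4.9..., 5.9...)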
primitiv: Multiple Language Support
[Diagram: the primitiv Core Library (C++11) is used directly by C++ apps; a Cython middleware provides primitiv-Python for Python apps; the primitiv C APIs (C99) back further bindings such as the Java binding (primitiv-Java), the Rust binding (primitiv-Rust), etc., each serving their own apps]
Core Components
Core Components of primitiv
• Shape
• Device and Tensor
• Graph and Node
• Parameter and Optimizer
• Other functionalities
Shape
• A Shape represents the volume and the minibatch size of the data.
  • A scalar: Shape({}) (1 value)
  • A column vector: Shape({3}) (3 values)
  • A matrix: Shape({3, 4}) (12 values)
  • 5 matrices (a minibatch of 5): Shape({3, 4}, 5) (60 values)
Shape Equivalence Rule
• Trailing dimensions of size "1" are identical to omitting them:
  Shape({3, 1}) == Shape({3})  (matrix == column vector)
  Shape({1}) == Shape({})  (column vector == scalar)
  Shape({2, 3, 4, 1, 1, 1}) == Shape({2, 3, 4})
• A minibatch size of "1" is identical to single data:
  Shape({2, 3, 4}, 1) == Shape({2, 3, 4})
Minibatch Broadcasting Rule
• Arguments of n-ary (n ≥ 2) functions/ops with minibatch size 1 are implicitly broadcast.

x = data with Shape({2, 2}, 123);
y = data with Shape({2, 2}, 123);
z = data with Shape({2, 2});
w = data with Shape({2, 2}, 42);

F::matmul(x, y);   // Shape({2, 2}, 123): the operation is performed for each minibatch entry separately.
F::matmul(z, w);   // Shape({2, 2}, 42): `z` is implicitly broadcast.
F::sum({x, y, z}); // Shape({2, 2}, 123): `z` is implicitly broadcast.
F::sum({y, z, w}); // Error! Different minibatch sizes (123 vs 42) cannot be combined.
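A concrete sketch of the same rule with Tensors and eager evaluation (following the Tensor snippet style shown later in this deck; dev is assumed to be an initialized Device):

// `z` has minibatch size 1 and is implicitly broadcast against `x` (minibatch 3).
std::vector<float> xd(2 * 2 * 3, 1.0f);  // data for three 2x2 matrices
std::vector<float> zd(2 * 2, 2.0f);      // data for one 2x2 matrix
Tensor x = F::input<Tensor>(Shape({2, 2}, 3), xd, dev);
Tensor z = F::input<Tensor>(Shape({2, 2}), zd, dev);
Tensor y = F::matmul(z, x);              // result shape: Shape({2, 2}, 3)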
Device
• A Device object manages the actual subroutines and the memory management on a specific piece of hardware.
• All hardware-related code (e.g., CUDA) is encapsulated in the Device.
[Diagram: applications talk only to the unified "Device" interface, which dispatches to CPU-specific routines, CUDA-specific routines, or routines for other hardware]
Tensor
• Tensor is the most elementary data interface.
• Each Tensor is bound to a Device, holds Device-specific memory, and has a Shape that describes the appearance of the data.
• Calculation is performed by eager evaluation: results are obtained immediately.
[Diagram: a Tensor holds a reference to its Device, Device-specific memory, and a Shape]
Snippet: Using Device and Tensor
#include <primitiv/primitiv.h>
using namespace primitiv;
// primitiv::functions has many functions for Tensor.
namespace F = primitiv::functions;
devices::Naive dev1; // Initializes CPU device
devices::CUDA dev2(0); // Initializes CUDA on GPU 0
devices::CUDA dev3(1); // Initializes CUDA on GPU 1
// `dev1` -- `dev3` have the same "Device" interface.
Snippet: Using Device and Tensor (continued)
// Making a new Tensor on `dev1`
Shape s({2, 2});
std::vector<float> data {1, 2, 3, 4}; // column-major
Tensor x1 = F::input<Tensor>(s, data, dev1);
// Making a 2-dimensional identity matrix on `dev2`
Tensor x2 = F::identity<Tensor>(2, dev2);
// Move x1 onto `dev2`
Tensor x11 = F::copy(x1, dev2);
// Math
Tensor x3 = x11 + x2; // x3 == {2, 2, 3, 5}
Tensor xe = x1 + x2; // Error: different device
Tensor x4 = F::exp(x1);
std::vector<float> ret = x4.to_vector(); // {2.7,7.4,20.,55.}
Default Device
• The "Device" argument of each function can be
omitted using the default device.
devices::CUDA dev(0);
// Specifies `dev` as the default
Device::set_default(dev);
// Same as F::input<Tensor>(shape, data, dev);
Tensor x = F::input<Tensor>(shape, data);
Graph and Node
• A Graph object represents a computation graph and its states.
• A Node object represents a variable node in the Graph.
[Diagram: a Graph contains input/parameter nodes and operation nodes (matmul, add, tanh, add); each Node holds a reference to its Graph and a variable ID]
Adding new Nodes into the Graph
• Simply apply functions to add a new calculation into the Graph.
• Node has a similar interface to Tensor:
  • Math functions
  • Arithmetic operations

Node x = F::input<Node>(shape, data);  // adds an "input" node
Node y = F::exp(x);                    // adds an "exp" node connected to x
Lazy Evaluation through Nodes
• Unlike Tensor, a Node is just a placeholder for values and does not invoke the actual computation when it is created.
• When the value is explicitly queried, all required calculations are invoked and the results are returned.

std::vector<float> ret = y.to_vector();  // querying `y` invokes the "input" and "exp" nodes
Lazy Evaluation through Nodes
• Once the results are calculated, Nodes cache the values, and they are reused by future queries.
• Unused values are never calculated (see the sketch below).
[Diagram: querying the tanh output invokes matmul, add and tanh (reusing already-cached inputs), while the final add is never invoked because its value was never requested]
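A small sketch of this behavior (assuming a default Graph and Device have been set, as in the earlier snippets):

Node a = F::input<Node>(Shape({2}), {1, 2});
Node b = F::exp(a);                     // placeholder only; nothing is computed yet
Node c = F::tanh(a);                    // also just a placeholder
std::vector<float> v1 = b.to_vector();  // invokes the "input" and "exp" nodes, caches the results
std::vector<float> v2 = b.to_vector();  // reuses the cached value of `b`
// `c` is never queried, so the tanh node is never evaluated.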
Parameter
• A Parameter object represents a trainable parameter of the network.
• Its values can be used as a variable in a Graph, and its gradients are updated through the Graph.
• Initial values can be specified by hand or with an Initializer object.
[Diagram: a Parameter holds a reference to its Device, its values, its cumulative gradients, and other statistics]
Optimizer
• An Optimizer manages an update policy (SGD, Adam, etc.) for Parameters.
• It consumes the gradient information that each Parameter holds to update the values.
• It also registers, on each Parameter, the statistics that the update policy requires.
  • I.e., all statistics about a Parameter are stored in the Parameter object itself; the Optimizer does not hold such information.
Snippet: Initializing Parameter/Optimizer
// Device
devices::CUDA dev(0);
Device::set_default(dev);
// Parameter/Optimizer
Parameter p1(Shape({3}), {1, 2, 3});
Parameter p2(Shape({3}), initializers::Uniform(-1, 1));
// Uses a uniform distribution for the initial values.
// Optimizer
optimizers::SGD opt(0.1); // Initializes SGD with LR=0.1.
opt.add(p1, p2); // Registers `p1` and `p2` to the optimizer.
Backpropagation
• Backpropagation can be performed through Nodes by invoking the backward() function.
  • Tensors cannot perform backpropagation because they manage neither gradients nor computation graphs.
• If the computation graph contains Parameters, their gradients are updated by backward().
Snippet: Backpropagation
Graph g;
Graph::set_default(g); // Make g as the default.
Parameter p(Shape({3}), {1, 2, 3});
optimizers::SGD opt(0.1);
opt.add(p);
Node w = F::parameter(p);
Node x = F::input(Shape({3}), {2, 3, 5});
Node y = w * x; // Elementwise multiplication
y.to_vector(); // {2, 6, 15}
opt.reset_gradients(); // Make all gradients of parameters 0.
y.backward(); // Performs the backpropagation.
p.gradient().to_vector(); // {2, 3, 5}
opt.update(); // Performs the SGD rule:
// {1, 2, 3} - 0.1 * {2, 3, 5}
p.value().to_vector(); // {0.8, 1.7, 2.5}
Example
The Data
• We use synthetic data in this example...

#include <random>
#include <tuple>

class DataSource {
  std::mt19937 rng;
  std::normal_distribution<float> data_dist, noise_dist;

public:
  DataSource(float data_sd, float noise_sd)
    : rng(std::random_device()())
    , data_dist(0, data_sd)
    , noise_dist(0, noise_sd) {}

  std::tuple<float, float, float> operator()() {
    const float x1 = data_dist(rng);
    const float x2 = data_dist(rng);
    return std::make_tuple(
        x1 + noise_dist(rng),
        x2 + noise_dist(rng),
        x1 * x2 >= 0 ? 1 : -1);
  }
};
The Data ... XOR
• Input: 𝒙 := (𝑥1, 𝑥2) ∈ ℝ²
• Output: 𝑦 ∈ {−1, 1} (positive when 𝑥1 and 𝑥2 have the same sign)
• Where: data_sd == 1 and noise_sd == 0.1
The Network
• We use a simple MLP:
  𝑦 = tanh(𝑊_hy 𝒉 + 𝑏_y)
  𝒉 = tanh(𝑊_xh 𝒙 + 𝒃_h)
• Where:
  𝒉 ∈ ℝ^N, 𝑊_xh ∈ ℝ^(N×2), 𝒃_h ∈ ℝ^N, 𝑊_hy ∈ ℝ^(1×N), 𝑏_y ∈ ℝ
Code 1: Initialization
• Including headers and declaring the main function
#include <iostream>
#include <vector>
#include <primitiv/primitiv.h>
using namespace std;
using namespace primitiv;
int main() {
devices::Naive dev; // uses CPU
Graph g;
Device::set_default(dev);
Graph::set_default(g);
// All code will be described here.
return 0;
}
Code 2: Parameter and Optimizer
• We have 4 parameters: 𝑊_hy, 𝑏_y, 𝑊_xh, 𝒃_h.
(in main function)
constexpr unsigned N = 8; // #hidden units
Parameter pw_xh({N, 2}, initializers::XavierUniform());
Parameter pb_h({N}, initializers::Constant(0));
Parameter pw_hy({1, N}, initializers::XavierUniform());
Parameter pb_y({}, initializers::Constant(0));
constexpr float learning_rate = 0.1;
optimizers::SGD opt(learning_rate);
opt.add(pw_xh, pb_h, pw_hy, pb_y);
Code 3: Writing The Network
• Using lambda:
(in main function)
auto feedforward = [&](const Node &x) {
namespace F = primitiv::functions;
const Node w_xh = F::parameter<Node>(pw_xh); // Shape({N, 2})
const Node b_h = F::parameter<Node>(pb_h); // Shape({N})
const Node w_hy = F::parameter<Node>(pw_hy); // Shape({1, N})
const Node b_y = F::parameter<Node>(pb_y); // Shape({})
const Node h = F::tanh(F::matmul(w_xh, x) + b_h); // Shape({N}, B)
return F::tanh(F::matmul(w_hy, h) + b_y); // Shape({}, B)
};
Code 4: Loss Function
• Similar to the main network:
(in main function)
auto squared_loss = [](const Node &y, const Node &t) {
namespace F = primitiv::functions;
const Node diff = y - t; // Shape({}, B)
return F::batch::mean(diff * diff); // Shape({})
};
Code 5: Making The Minibatch
• This part is not specific to the toolkit; it just prepares the data.
(in main function)
constexpr float data_sd = 1.0;
constexpr float noise_sd = 0.1;
DataSource data_source(data_sd, noise_sd);
auto next_data = [&](unsigned minibatch_size) {
std::vector<float> data;
std::vector<float> labels;
for (unsigned i = 0; i < minibatch_size; ++i) {
float x1, x2, t;
std::tie(x1, x2, t) = data_source();
data.emplace_back(x1);
data.emplace_back(x2);
labels.emplace_back(t);
}
namespace F = primitiv::functions;
return std::make_tuple(
F::input<Node>(Shape({2}, minibatch_size), data), // input data `x`
F::input<Node>(Shape({}, minibatch_size), labels)); // label data `t`
};
Code 6: Training Loop
(in main function)
for (unsigned epoch = 0; epoch < 100; ++epoch) {
g.clear();
// Initializes the computation graph
Node x, t;
std::tie(x, t) = next_data(1000); // Obtains the next data
const Node y = feedforward(x); // Calculates the network
const Node loss = squared_loss(y, t); // Calculates the loss
std::cout << epoch << ": train loss=" << loss.to_float() << std::endl;
// Performs backpropagation and updates parameters
opt.reset_gradients();
loss.backward();
opt.update();
}
$ g++ -std=c++11 code.cc -lprimitiv
$ ./a.out
0: loss=1.17221
1: loss=1.07423
2: loss=1.06282
3: loss=1.04641
4: loss=1.00851
5: loss=1.01904
...
Code 7: Testing
(in main function)
for (unsigned epoch = 0; epoch < 100; ++epoch) {
  (Training process written in the previous code block)
  if (epoch % 10 == 9) {
    namespace F = primitiv::functions;
    const vector<float> test_x_data {1, 1, -1, 1, -1, -1, 1, -1};
    const vector<float> test_t_data {1, -1, 1, -1};
    const Node test_x = F::input<Node>(Shape({2}, 4), test_x_data);
    const Node test_t = F::input<Node>(Shape({}, 4), test_t_data);
    const Node test_y = feedforward(test_x);
    const Node test_loss = squared_loss(test_y, test_t);
    std::cout << "test results:";
    for (float val : test_y.to_vector()) {
      std::cout << ' ' << val;
    }
    std::cout << "\ntest loss: " << test_loss.to_float() << std::endl;
  }
}

Expected input/output pairs: (1, 1) ⟼ 1, (−1, 1) ⟼ −1, (−1, −1) ⟼ 1, (1, −1) ⟼ −1

$ g++ -std=c++11 code.cc -lprimitiv
$ ./a.out
...
8: loss=0.933427
9: loss=0.927205
test results: 0.04619 -0.119208 0.0893511 -0.149148
test loss: 0.809695
10: loss=0.916669
11: loss=0.91744
...
18: loss=0.849496
19: loss=0.845048
test results: 0.156536 -0.229959 0.171106 -0.221599
test loss: 0.649342
20: loss=0.839679
21: loss=0.831217
...
Links
• Public repository (components, tests, examples)
• https://github.com/primitiv
• Slack (conversation)
• https://primitiv-forum.slack.com
• Documentation (tutorial, design, reference)
• http://primitiv.readthedocs.io/en/develop
Thanks!
More Related Content

Similar to Neural Network Toolkit: primitiv Explained

Bloat and Fragmentation in PostgreSQL
Bloat and Fragmentation in PostgreSQLBloat and Fragmentation in PostgreSQL
Bloat and Fragmentation in PostgreSQLMasahiko Sawada
 
Web Traffic Time Series Forecasting
Web Traffic  Time Series ForecastingWeb Traffic  Time Series Forecasting
Web Traffic Time Series ForecastingBillTubbs
 
Efficient Two-level Homomorphic Encryption in Prime-order Bilinear Groups and...
Efficient Two-level Homomorphic Encryption in Prime-order Bilinear Groups and...Efficient Two-level Homomorphic Encryption in Prime-order Bilinear Groups and...
Efficient Two-level Homomorphic Encryption in Prime-order Bilinear Groups and...MITSUNARI Shigeo
 
UNIT_1-Introduction-to-Computer-Graphics.pdf
UNIT_1-Introduction-to-Computer-Graphics.pdfUNIT_1-Introduction-to-Computer-Graphics.pdf
UNIT_1-Introduction-to-Computer-Graphics.pdfSudarshanSharma43
 
IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...
IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...
IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...IRJET Journal
 
Teaching with JupyterHub - lessons learned
Teaching with JupyterHub - lessons learnedTeaching with JupyterHub - lessons learned
Teaching with JupyterHub - lessons learnedMartin Christen
 
Ds03 part i algorithms by jyoti lakhani
Ds03 part i algorithms   by jyoti lakhaniDs03 part i algorithms   by jyoti lakhani
Ds03 part i algorithms by jyoti lakhanijyoti_lakhani
 
Engineering + Programming portfolio
Engineering + Programming portfolioEngineering + Programming portfolio
Engineering + Programming portfolioJosephDonnelly14
 
Hypothetical Partitioning for PostgreSQL
Hypothetical Partitioning for PostgreSQLHypothetical Partitioning for PostgreSQL
Hypothetical Partitioning for PostgreSQLYuzuko Hosoya
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCAapo Kyrölä
 
Agile: A guide to creating a project burndown chart
Agile: A guide to creating a project burndown chartAgile: A guide to creating a project burndown chart
Agile: A guide to creating a project burndown chartPM Majik
 
solve 6 , quiz 2link of book httpwww.irccyn.ec-nantes.fr~mart.pdf
solve 6 , quiz 2link of book httpwww.irccyn.ec-nantes.fr~mart.pdfsolve 6 , quiz 2link of book httpwww.irccyn.ec-nantes.fr~mart.pdf
solve 6 , quiz 2link of book httpwww.irccyn.ec-nantes.fr~mart.pdfarihantcomputersddn
 
GPU_based Searching
GPU_based SearchingGPU_based Searching
GPU_based Searchingjpawan33
 
Computer Graphics Practical
Computer Graphics PracticalComputer Graphics Practical
Computer Graphics PracticalNeha Sharma
 
Tower design using etabs- Nada Zarrak
Tower design using etabs- Nada Zarrak Tower design using etabs- Nada Zarrak
Tower design using etabs- Nada Zarrak Nada Zarrak
 
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisJason Riedy
 

Similar to Neural Network Toolkit: primitiv Explained (20)

Bloat and Fragmentation in PostgreSQL
Bloat and Fragmentation in PostgreSQLBloat and Fragmentation in PostgreSQL
Bloat and Fragmentation in PostgreSQL
 
Web Traffic Time Series Forecasting
Web Traffic  Time Series ForecastingWeb Traffic  Time Series Forecasting
Web Traffic Time Series Forecasting
 
Efficient Two-level Homomorphic Encryption in Prime-order Bilinear Groups and...
Efficient Two-level Homomorphic Encryption in Prime-order Bilinear Groups and...Efficient Two-level Homomorphic Encryption in Prime-order Bilinear Groups and...
Efficient Two-level Homomorphic Encryption in Prime-order Bilinear Groups and...
 
ppt 3 demo.pptx
ppt 3 demo.pptxppt 3 demo.pptx
ppt 3 demo.pptx
 
UNIT_1-Introduction-to-Computer-Graphics.pdf
UNIT_1-Introduction-to-Computer-Graphics.pdfUNIT_1-Introduction-to-Computer-Graphics.pdf
UNIT_1-Introduction-to-Computer-Graphics.pdf
 
IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...
IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...
IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...
 
Cadancesimulation
CadancesimulationCadancesimulation
Cadancesimulation
 
Teaching with JupyterHub - lessons learned
Teaching with JupyterHub - lessons learnedTeaching with JupyterHub - lessons learned
Teaching with JupyterHub - lessons learned
 
Ds03 part i algorithms by jyoti lakhani
Ds03 part i algorithms   by jyoti lakhaniDs03 part i algorithms   by jyoti lakhani
Ds03 part i algorithms by jyoti lakhani
 
3d printer
3d printer3d printer
3d printer
 
Pregel
PregelPregel
Pregel
 
Engineering + Programming portfolio
Engineering + Programming portfolioEngineering + Programming portfolio
Engineering + Programming portfolio
 
Hypothetical Partitioning for PostgreSQL
Hypothetical Partitioning for PostgreSQLHypothetical Partitioning for PostgreSQL
Hypothetical Partitioning for PostgreSQL
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PC
 
Agile: A guide to creating a project burndown chart
Agile: A guide to creating a project burndown chartAgile: A guide to creating a project burndown chart
Agile: A guide to creating a project burndown chart
 
solve 6 , quiz 2link of book httpwww.irccyn.ec-nantes.fr~mart.pdf
solve 6 , quiz 2link of book httpwww.irccyn.ec-nantes.fr~mart.pdfsolve 6 , quiz 2link of book httpwww.irccyn.ec-nantes.fr~mart.pdf
solve 6 , quiz 2link of book httpwww.irccyn.ec-nantes.fr~mart.pdf
 
GPU_based Searching
GPU_based SearchingGPU_based Searching
GPU_based Searching
 
Computer Graphics Practical
Computer Graphics PracticalComputer Graphics Practical
Computer Graphics Practical
 
Tower design using etabs- Nada Zarrak
Tower design using etabs- Nada Zarrak Tower design using etabs- Nada Zarrak
Tower design using etabs- Nada Zarrak
 
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
 

More from Yusuke Oda

Neural Machine Translation via Binary Code Prediction
Neural Machine Translation via Binary Code PredictionNeural Machine Translation via Binary Code Prediction
Neural Machine Translation via Binary Code PredictionYusuke Oda
 
ChainerによるRNN翻訳モデルの実装+@
ChainerによるRNN翻訳モデルの実装+@ChainerによるRNN翻訳モデルの実装+@
ChainerによるRNN翻訳モデルの実装+@Yusuke Oda
 
複数の事前並べ替え候補を用いた句に基づく統計的機械翻訳
複数の事前並べ替え候補を用いた句に基づく統計的機械翻訳複数の事前並べ替え候補を用いた句に基づく統計的機械翻訳
複数の事前並べ替え候補を用いた句に基づく統計的機械翻訳Yusuke Oda
 
Encoder-decoder 翻訳 (TISハンズオン資料)
Encoder-decoder 翻訳 (TISハンズオン資料)Encoder-decoder 翻訳 (TISハンズオン資料)
Encoder-decoder 翻訳 (TISハンズオン資料)Yusuke Oda
 
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...Learning to Generate Pseudo-code from Source Code using Statistical Machine T...
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...Yusuke Oda
 
A Chainer MeetUp Talk
A Chainer MeetUp TalkA Chainer MeetUp Talk
A Chainer MeetUp TalkYusuke Oda
 
PCFG構文解析法
PCFG構文解析法PCFG構文解析法
PCFG構文解析法Yusuke Oda
 
Syntax-based Simultaneous Translation through Prediction of Unseen Syntactic ...
Syntax-based Simultaneous Translation through Prediction of Unseen Syntactic ...Syntax-based Simultaneous Translation through Prediction of Unseen Syntactic ...
Syntax-based Simultaneous Translation through Prediction of Unseen Syntactic ...Yusuke Oda
 
ACL Reading @NAIST: Fast and Robust Neural Network Joint Model for Statistica...
ACL Reading @NAIST: Fast and Robust Neural Network Joint Model for Statistica...ACL Reading @NAIST: Fast and Robust Neural Network Joint Model for Statistica...
ACL Reading @NAIST: Fast and Robust Neural Network Joint Model for Statistica...Yusuke Oda
 
Tree-based Translation Models (『機械翻訳』§6.2-6.3)
Tree-based Translation Models (『機械翻訳』§6.2-6.3)Tree-based Translation Models (『機械翻訳』§6.2-6.3)
Tree-based Translation Models (『機械翻訳』§6.2-6.3)Yusuke Oda
 
翻訳精度の最大化による同時音声翻訳のための文分割法 (NLP2014)
翻訳精度の最大化による同時音声翻訳のための文分割法 (NLP2014)翻訳精度の最大化による同時音声翻訳のための文分割法 (NLP2014)
翻訳精度の最大化による同時音声翻訳のための文分割法 (NLP2014)Yusuke Oda
 
Pattern Recognition and Machine Learning: Section 3.3
Pattern Recognition and Machine Learning: Section 3.3Pattern Recognition and Machine Learning: Section 3.3
Pattern Recognition and Machine Learning: Section 3.3Yusuke Oda
 

More from Yusuke Oda (13)

Neural Machine Translation via Binary Code Prediction
Neural Machine Translation via Binary Code PredictionNeural Machine Translation via Binary Code Prediction
Neural Machine Translation via Binary Code Prediction
 
ChainerによるRNN翻訳モデルの実装+@
ChainerによるRNN翻訳モデルの実装+@ChainerによるRNN翻訳モデルの実装+@
ChainerによるRNN翻訳モデルの実装+@
 
複数の事前並べ替え候補を用いた句に基づく統計的機械翻訳
複数の事前並べ替え候補を用いた句に基づく統計的機械翻訳複数の事前並べ替え候補を用いた句に基づく統計的機械翻訳
複数の事前並べ替え候補を用いた句に基づく統計的機械翻訳
 
Encoder-decoder 翻訳 (TISハンズオン資料)
Encoder-decoder 翻訳 (TISハンズオン資料)Encoder-decoder 翻訳 (TISハンズオン資料)
Encoder-decoder 翻訳 (TISハンズオン資料)
 
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...Learning to Generate Pseudo-code from Source Code using Statistical Machine T...
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...
 
A Chainer MeetUp Talk
A Chainer MeetUp TalkA Chainer MeetUp Talk
A Chainer MeetUp Talk
 
PCFG構文解析法
PCFG構文解析法PCFG構文解析法
PCFG構文解析法
 
Syntax-based Simultaneous Translation through Prediction of Unseen Syntactic ...
Syntax-based Simultaneous Translation through Prediction of Unseen Syntactic ...Syntax-based Simultaneous Translation through Prediction of Unseen Syntactic ...
Syntax-based Simultaneous Translation through Prediction of Unseen Syntactic ...
 
ACL Reading @NAIST: Fast and Robust Neural Network Joint Model for Statistica...
ACL Reading @NAIST: Fast and Robust Neural Network Joint Model for Statistica...ACL Reading @NAIST: Fast and Robust Neural Network Joint Model for Statistica...
ACL Reading @NAIST: Fast and Robust Neural Network Joint Model for Statistica...
 
Tree-based Translation Models (『機械翻訳』§6.2-6.3)
Tree-based Translation Models (『機械翻訳』§6.2-6.3)Tree-based Translation Models (『機械翻訳』§6.2-6.3)
Tree-based Translation Models (『機械翻訳』§6.2-6.3)
 
翻訳精度の最大化による同時音声翻訳のための文分割法 (NLP2014)
翻訳精度の最大化による同時音声翻訳のための文分割法 (NLP2014)翻訳精度の最大化による同時音声翻訳のための文分割法 (NLP2014)
翻訳精度の最大化による同時音声翻訳のための文分割法 (NLP2014)
 
Pattern Recognition and Machine Learning: Section 3.3
Pattern Recognition and Machine Learning: Section 3.3Pattern Recognition and Machine Learning: Section 3.3
Pattern Recognition and Machine Learning: Section 3.3
 
Test
TestTest
Test
 

Recently uploaded

OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacingjaychoudhary37
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 

Recently uploaded (20)

OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacing
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 

Neural Network Toolkit: primitiv Explained

  • 1. primitiv: Neural Network Toolkit Yusuke Oda 2018/3/26 ASTREC, NICT
  • 2. Agenda • Basics of neural networks with computation graphs • Design details and examples of primitiv • An example usage 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 2
  • 3. Neural Networks with Computation Graphs • 𝑦 = tanh 𝑊𝑥 + 𝑏 + 𝑥 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 3
  • 4. Neural Networks with Computation Graphs • 𝑦 = tanh 𝑊𝑥 + 𝑏 + 𝑥 • → 𝑦 = add tanh add matmul(𝑊, 𝑥), 𝑏 , 𝑥 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 4
  • 5. Neural Networks with Computation Graphs • 𝑦 = tanh 𝑊𝑥 + 𝑏 + 𝑥 • → 𝑦 = add tanh add matmul(𝑊, 𝑥), 𝑏 , 𝑥 𝑥 𝑏 𝑊 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 5
  • 6. Neural Networks with Computation Graphs • 𝑦 = tanh 𝑊𝑥 + 𝑏 + 𝑥 • → 𝑦 = add tanh add matmul(𝑊, 𝑥), 𝑏 , 𝑥 𝑥 * 𝑏 𝑊 matmul 𝑊𝑥 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 6
  • 7. Neural Networks with Computation Graphs • 𝑦 = tanh 𝑊𝑥 + 𝑏 + 𝑥 • → 𝑦 = add tanh add matmul(𝑊, 𝑥), 𝑏 , 𝑥 𝑥 * 𝑏 𝑊 * matmul add 𝑊𝑥 𝑊𝑥 + 𝑏 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 7
  • 8. Neural Networks with Computation Graphs • 𝑦 = tanh 𝑊𝑥 + 𝑏 + 𝑥 • → 𝑦 = add tanh add matmul(𝑊, 𝑥), 𝑏 , 𝑥 𝑥 * 𝑏 𝑊 * * matmul add tanh 𝑊𝑥 𝑊𝑥 + 𝑏 tanh 𝑊𝑥 + 𝑏 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 8
  • 9. Neural Networks with Computation Graphs • 𝑦 = tanh 𝑊𝑥 + 𝑏 + 𝑥 • → 𝑦 = add tanh add matmul(𝑊, 𝑥), 𝑏 , 𝑥 𝑥 * 𝑏 𝑊 * * 𝑦 matmul add tanh add 𝑊𝑥 𝑊𝑥 + 𝑏 tanh 𝑊𝑥 + 𝑏 tanh 𝑊𝑠 + 𝑏 + 𝑥 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 9
  • 10. Neural Networks with Computation Graphs • 𝑦 = tanh 𝑊𝑥 + 𝑏 + 𝑥 • → 𝑦 = add tanh add matmul(𝑊, 𝑥), 𝑏 , 𝑥 • FFNNs can be represented as a DAG = Computation Graph 𝑥 * 𝑏 𝑊 * * 𝑦 matmul add tanh add 𝑊𝑥 𝑊𝑥 + 𝑏 tanh 𝑊𝑥 + 𝑏 tanh 𝑊𝑠 + 𝑏 + 𝑥 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 10
  • 11. Forward Calculation (Retrieving Values) • Once the computation graph constructed, the actual computation can be performed along the graph using the topological order. 3 15 2 5 17 0.9… 3.9… matmul add tanh add 𝑊𝑥 𝑊𝑥 + 𝑏 tanh 𝑊𝑥 + 𝑏 tanh 𝑊𝑠 + 𝑏 + 𝑥 1 2 3 4 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 11
  • 12. Backward Calculation (Retrieving Gradients) • Backpropagation: using the chain rule of derivatives along the same graph. 𝑥 * 𝑏 𝑊 * * 𝑦 matmul add tanh add𝑑𝐸 𝑑𝑥 += 1 − 𝑦2 𝑑𝐸 𝑑𝑦 𝑑𝐸 𝑑𝑥1 += 𝑑𝐸 𝑑𝑦 𝑑𝐸 𝑑𝑥2 += 𝑑𝐸 𝑑𝑦 1 2 3 4 𝑑𝐸 𝑑𝑋1 += 𝑑𝐸 𝑑𝑌 𝑋2 ⊤ 𝑑𝐸 𝑑𝑋2 += 𝑋1 ⊤ 𝑑𝐸 𝑑𝑌 𝑑𝐸 𝑑𝑥1 += 𝑑𝐸 𝑑𝑦 𝑑𝐸 𝑑𝑥2 += 𝑑𝐸 𝑑𝑦 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 12
  • 13. Backward Calculation (Retrieving Gradients) • Backpropagation: using the chain rule of derivatives along the same graph. 𝑥 * 𝑏 𝑊 * * 𝑦 matmul add tanh add 𝑔 𝑥 += 1 − 𝑦2 𝑔 𝑦 𝑔 𝑥1 += 𝑔 𝑦 𝑔 𝑥2 += 𝑔 𝑦 1 2 3 4 𝑔 𝑋1 += 𝑔 𝑌 𝑋2 ⊤ 𝑔 𝑋2 += 𝑋1 ⊤ 𝑔 𝑌 𝑔 𝑥1 += 𝑔 𝑦 𝑔 𝑥2 += 𝑔 𝑦 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 13
  • 14. Function and Variable of Graph 𝑋1, 𝑔 𝑋1 𝑌, 𝑔 𝑌 𝑋2, 𝑔 𝑋2 𝑓𝑓𝑤, 𝑓𝑏𝑤 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 14
  • 15. Function and Variable of Graph 𝑋1, 𝑔 𝑋1 𝑌, 𝑔 𝑌 𝑋2, 𝑔 𝑋2 𝑓𝑓𝑤, 𝑓𝑏𝑤 Function: specifies the forward/backward calculation 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 15
  • 16. Function and Variable of Graph 𝑋1, 𝑔 𝑋1 𝑌, 𝑔 𝑌 𝑋2, 𝑔 𝑋2 𝑓𝑓𝑤, 𝑓𝑏𝑤 Variable: represents actual values and gradients Function: specifies the forward/backward calculation 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 16
  • 17. Function and Variable of Graph 𝑋1, 𝑔 𝑋1 𝑌, 𝑔 𝑌 𝑋2, 𝑔 𝑋2 𝑓𝑓𝑤, 𝑓𝑏𝑤 Arguments: 0 or more Results: 1 or more 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 17
  • 18. Forward/backward operations of the function 𝑋1, 𝑔 𝑋1 𝑌, 𝑔 𝑌 𝑋2, 𝑔 𝑋2 𝑓𝑓𝑤, 𝑓𝑏𝑤 𝒇 𝒇𝒘: 𝑋1 … 𝑋 𝑛 ⟼ 𝑌1 … 𝑌 𝑚 𝒇 𝒃𝒘: 𝑋1 … 𝑋 𝑛, 𝑌1 … 𝑌 𝑚, 𝑔 𝑌1 … 𝑔 𝑌 𝑚 ⟼ 𝑔 𝑋1 … 𝑔 𝑋 𝑛 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 18
  • 19. Combined Functions • Any subgraphs with starts/ends by Function can be as one Function. 𝑋, 𝑔 𝑋 𝑌, 𝑔 𝑌 𝑊 matmul parameter 𝑏 parameter add 𝑢 ReLU 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 19
  • 20. Combined Functions • Any subgraphs with starts/ends by Function can be as one Function. 𝑋, 𝑔 𝑋 𝑌, 𝑔 𝑌 𝑊 matmul “Linear” function in some toolkits owns parameters itself, and applies 2-3 functions. parameter 𝑏 parameter add 𝑢 ReLU “Linear” 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 20
  • 21. 3 Strategies to Construct Computation Graphs • Difference: when/how to construct graph and calculate the results. • Static construction • Caffe, Torch, TensorFlow, etc. • Dynamic construction (define-by-run) • Chainer, PyTorch, etc. • Dynamic construction with lazy evaluation • DyNet, PyTorch(partially), primitiv 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 21
  • 22. Static Construction • Constructs the computation graph before all executions. 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 22
  • 23. Static Construction • Constructs the computation graph before all executions. 𝑥 * 𝑏 𝑊 * * 𝑦matmul add tanh add 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 23
  • 24. Static Construction • Constructs the computation graph before all executions. 𝑥 * 𝑏 𝑊 * * 𝑦matmul add tanh add 𝑓 𝑥 𝑏 𝑊 𝑦 Fix 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 24
  • 25. Static Construction • Constructs the computation graph before all executions. • Then executes the “fixed” graph with actual data. 𝑥 * 𝑏 𝑊 * * 𝑦matmul add tanh add 𝑓 𝑥 𝑏 𝑊 𝑦 Fix 𝑓 … … … 𝑦 𝑓 … … … 𝑦 𝑓 … … … 𝑦 𝑓 3 2 5 3.9 Execute with data6/1/2018 Copyright (c) 2018 by Yusuke Oda. 25
  • 26. Dynamic Construction (define-by-run) • Graph construction and actual calculation are performed simultaneously. 3 2 5Run 1 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 26
  • 27. Dynamic Construction (define-by-run) • Graph construction and actual calculation are performed simultaneously. 3 15 2 5 matmulRun 1 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 27
  • 28. Dynamic Construction (define-by-run) • Graph construction and actual calculation are performed simultaneously. 3 15 2 5 17 matmul add Run 1 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 28
  • 29. Dynamic Construction (define-by-run) • Graph construction and actual calculation are performed simultaneously. 3 15 2 5 17 0.9… matmul add tanh Run 1 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 29
  • 30. Dynamic Construction (define-by-run) • Graph construction and actual calculation are performed simultaneously. 3 15 2 5 17 0.9… 3.9…matmul add tanh addRun 1 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 30
  • 31. Dynamic Construction (define-by-run) • Graph construction and actual calculation are performed simultaneously. 3 15 2 5 17 0.9… 3.9…matmul add tanh addRun 1 9 18 -1 2 17 0.9… 9.9…matmul add tanh addRun 2 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 31
  • 32. Dynamic Construction with Lazy Evaluation • Consists of 2 steps: 1. Constructing graphs using only types of values. 2. Performs actual computation (forward/backward) along the graph. 3 ? 2 5 ? ? ?matmul add tanh add 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 32
  • 33. Dynamic Construction with Lazy Evaluation • Consists of 2 steps: 1. Construct the graph using only the types of values. 2. Perform the actual computation (forward/backward) along the graph. 3 ? 2 5 ? ? ? matmul add tanh add Query 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 33
  • 34. Dynamic Construction with Lazy Evaluation • Consists of 2 steps: 1. Construct the graph using only the types of values. 2. Perform the actual computation (forward/backward) along the graph. 3 15 2 5 17 0.9… 3.9… matmul add tanh add Query 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 34
  • 35. Pros/cons of each strategy • Static • Capable of strong compile-time optimization • Difficult to construct interactive graphs • Dynamic (define-by-run) • Capable of constructing interactive graphs • Large overhead and difficult to optimize • Dynamic + Lazy • Also capable of interactive graphs • Can apply just-in-time optimization • 2-pass traversal over the graph • shapes are always calculated; values are calculated on demand. • Whole-graph optimization is still difficult 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 35
  • 36. primitiv? 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 36
  • 37. primitiv: Dynamic+Lazy NN Toolkit • Originally forked from DyNet • All components were restructured • Concepts • Simple • Compact • Device/environment independent • Implicit minibatching • Multiple language support 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 37
  • 38. primitiv: Simple • Consists only of essential functionality. • Nonessential features are mostly omitted. • Low learning cost • Even so, the code does not become long: • An encoder-decoder can be implemented in about 300 lines of C++ (see the examples in the repository). 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 38
  • 39. primitiv: Compact • For a minimal installation, you only need GCC/Clang and CMake. • If you need support for specific hardware (e.g. CUDA), all you have to do is add the corresponding build switch. $ git clone https://github.com/primitiv/primitiv $ cd primitiv $ cmake . $ make $ make install $ echo "That's all." $ cmake . -DPRIMITIV_USE_CUDA=ON 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 39
  • 40. primitiv: Device/environment Independent • Device-specific code and network structure are completely separated. • Once the model is written, the code can be executed on any (even unknown) hardware with no modification. 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 40
  • 41. primitiv: Device/environment Independent • Device-specific code and network structure are completely separated. • Once the model is written, the code can be executed on any (even unknown) hardware with no modification. #include <primitiv/primitiv.h> using namespace primitiv; namespace F = primitiv::functions; Node predict(Node &x, Parameter &w, Parameter &b) { Node ww = F::parameter<Node>(w); Node bb = F::parameter<Node>(b); return F::tanh(F::matmul(ww, x) + bb) + x; } 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 41
  • 42. primitiv: Device/environment Independent • Device-specific code and network structure are completely separated. • Once the model is written, the code can be executed on any (even unknown) hardware with no modification. #include <primitiv/primitiv.h> using namespace primitiv; namespace F = primitiv::functions; Node predict(Node &x, Parameter &w, Parameter &b) { Node ww = F::parameter<Node>(w); Node bb = F::parameter<Node>(b); return F::tanh(F::matmul(ww, x) + bb) + x; } Run on CPU Run on CUDA Run on OpenCL Run on somewhere 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 42
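  For instance, the predict() above can be driven on different hardware just by changing which Device is set as the default. The following sketch uses the Device/Graph setup shown on later slides; the shapes and input values are arbitrary.

    devices::Naive cpu;            // CPU backend
    // devices::CUDA gpu(0);       // CUDA backend on GPU 0, if built with CUDA
    Device::set_default(cpu);      // switch to `gpu` to run the same code on CUDA

    Graph g;
    Graph::set_default(g);

    Parameter w({2, 2}, initializers::XavierUniform());
    Parameter b({2}, initializers::Constant(0));
    Node x = F::input<Node>(Shape({2}), {1, 2});
    Node y = predict(x, w, b);     // predict() itself contains no device-specific code
    std::vector<float> result = y.to_vector();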
  • 43. primitiv: Implicit Minibatching • Most networks can be applied to both single and minibatched data. 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 43
  • 44. primitiv: Implicit Minibatching • Most networks can be applied to both single and minibatched data. Node predict( Node &x, Parameter &w, Parameter &b) { Node ww = F::parameter<Node>(w); Node bb = F::parameter<Node>(b); return F::tanh(F::matmul(ww, x) + bb) + x; } 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 44
  • 45. primitiv: Implicit Minibatching • Most networks can be applied to both single and minibatched data. Node predict( Node &x, Parameter &w, Parameter &b) { Node ww = F::parameter<Node>(w); Node bb = F::parameter<Node>(b); return F::tanh(F::matmul(ww, x) + bb) + x; } 3 3.9 Single data 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 45
  • 46. primitiv: Implicit Minibatching • Most networks can be applied to both single and minibatched data. Node predict( Node &x, Parameter &w, Parameter &b) { Node ww = F::parameter<Node>(w); Node bb = F::parameter<Node>(b); return F::tanh(F::matmul(ww, x) + bb) + x; } 3 3.9 3 3.9 4 5 4.9 5.9 3-minibatched data Single data 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 46
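  A sketch of the same predict() fed with single and 3-minibatched data, continuing the setup from the previous sketch (the input values are arbitrary):

    // Single data: one 2-dimensional input; the result has Shape({2}).
    Node x1 = F::input<Node>(Shape({2}), {3, 1});
    Node y1 = predict(x1, w, b);

    // 3-minibatched data: three 2-dimensional inputs, same network code;
    // the result has Shape({2}, 3).
    Node x3 = F::input<Node>(Shape({2}, 3), {3, 1,  4, 0,  5, 2});
    Node y3 = predict(x3, w, b);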
  • 47. primitiv: Multiple Language Support primitiv Core Library (C++11) 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 47
  • 48. primitiv: Multiple Language Support primitiv Core Library (C++11) C++ Apps 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 48
  • 49. primitiv: Multiple Language Support primitiv Core Library (C++11) Cython Middleware primitiv-Python Python Apps C++ Apps 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 49
  • 50. primitiv: Multiple Language Support primitiv Core Library (C++11) Cython Middleware primitiv C APIs (C99) primitiv-Python Java Binding Rust Binding etc. primitiv- Java primitiv- Rust etc. Apps Apps Apps Python Apps C++ Apps 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 50
  • 51. Core Components 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 51
  • 52. Core Components of primitiv • Shape • Device and Tensor • Graph and Node • Parameter and Optimizer • Other functionalities 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 52
  • 53. Shape • A Shape represents the volume and the minibatch size of the data. 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 53
  • 54. Shape • A Shape represents the volume and the minibatch size of the data. A scalar Shape({}) 1 value 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 54
  • 55. Shape • A Shape represents the volume and the minibatch size of the data. A scalar Shape({}) 1 value A column vector Shape({3}) 3 values 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 55
  • 56. Shape • A Shape represents the volume and the minibatch size of the data. A scalar Shape({}) 1 value A column vector Shape({3}) 3 values A matrix Shape({3, 4}) 12 values 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 56
  • 57. Shape • A Shape represents the volume and the minibatch size of the data. A scalar Shape({}) 1 value A column vector Shape({3}) 3 values A matrix Shape({3, 4}) 12 values 5 matrices Shape({3, 4}, 5) 60 values ×5 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 57
  • 58. Shape Equivalence Rule • Trailing dimensions of size 1 are identical to omitting them: • A minibatch size of 1 is identical to single data: Shape({3, 1}) == Shape({3}) Matrix = Column vector Shape({1}) == Shape({}) Column vector = Scalar Shape({2, 3, 4, 1, 1, 1}) == Shape({2, 3, 4}) Shape({2, 3, 4}, 1) == Shape({2, 3, 4}) 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 58
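  The same rules written as code (a minimal sketch; it assumes Shape's operator== behaves as shown on this slide):

    #include <cassert>
    #include <primitiv/primitiv.h>
    using namespace primitiv;

    int main() {
      assert(Shape({3, 1}) == Shape({3}));                   // trailing 1s are dropped
      assert(Shape({1}) == Shape({}));                       // 1-element vector == scalar
      assert(Shape({2, 3, 4, 1, 1, 1}) == Shape({2, 3, 4}));
      assert(Shape({2, 3, 4}, 1) == Shape({2, 3, 4}));       // minibatch 1 == single data
      return 0;
    }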
  • 59. Minibatch Broadcasting Rule • Arguments of n-ary (n ≥ 2) functions/ops with minibatch size 1 are implicitly broadcast. 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 59 x = data with Shape({2, 2}, 123); y = data with Shape({2, 2}, 123); z = data with Shape({2, 2}); w = data with Shape({2, 2}, 42); F::matmul(x, y); Shape({2, 2}, 123) The operation is performed for each minibatch element separately. F::matmul(z, w); Shape({2, 2}, 42) `z` is implicitly broadcast. F::sum({x, y, z}); Shape({2, 2}, 123) `z` is implicitly broadcast. F::sum({y, z, w}); Error! Different minibatch sizes (123 vs 42) cannot be combined.
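  The broadcasting rule in runnable form, using small minibatch sizes (3 and 1) and the Tensor interface introduced on the following slides (a sketch; the values are arbitrary):

    devices::Naive dev;
    Device::set_default(dev);

    // Three 2x2 matrices (minibatch size 3) and one 2x2 matrix (minibatch size 1).
    std::vector<float> a_data(2 * 2 * 3, 1.0f);
    std::vector<float> b_data(2 * 2, 2.0f);
    Tensor a = F::input<Tensor>(Shape({2, 2}, 3), a_data);
    Tensor b = F::input<Tensor>(Shape({2, 2}), b_data);

    // `b` is implicitly broadcast over the minibatch; the result has Shape({2, 2}, 3).
    Tensor c = F::matmul(a, b);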
  • 60. Device • A Device object manages the actual subroutines and memory management for a specific piece of hardware. • All hardware-related code (e.g., CUDA) is encapsulated in the Device. CPU CUDA Other Hardware 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 60
  • 61. Device • A Device object manages the actual subroutines and memory management for a specific piece of hardware. • All hardware-related code (e.g., CUDA) is encapsulated in the Device. CPU CUDA Other Hardware Unified "Device" Interface CPU-specific Routines CUDA-specific Routines Other Routines 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 61
  • 62. Device • A Device object manages the actual subroutines and memory management for a specific piece of hardware. • All hardware-related code (e.g., CUDA) is encapsulated in the Device. CPU CUDA Other Hardware Unified "Device" Interface Application Application CPU-specific Routines CUDA-specific Routines Other Routines 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 62
  • 63. Tensor • Tensor is the most elementary data interface. • Each Tensor is associated with a Device, holds Device-specific memory, and has a Shape that represents the layout of the data. • Calculation is performed by eager evaluation: results are obtained immediately. Tensor Reference to the Device Device-specific Memory Shape 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 63
  • 64. Snippet: Using Device and Tensor #include <primitiv/primitiv.h> using namespace primitiv; // primitiv::functions has many functions for Tensor. namespace F = primitiv::functions; devices::Naive dev1; // Initializes CPU device devices::CUDA dev2(0); // Initializes CUDA on GPU 0 devices::CUDA dev3(1); // Initializes CUDA on GPU 1 // `dev1` -- `dev3` have the same "Device" interface. 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 64
  • 65. Snippet: Using Device and Tensor // Making a new Tensor on `dev1` Shape s({2, 2}); std::vector<float> data {1, 2, 3, 4}; // column-major Tensor x1 = F::input<Tensor>(s, data, dev1); // Making a 2-dimensional identity matrix on `dev2` Tensor x2 = F::identity<Tensor>(2, dev2); // Copy x1 onto `dev2` Tensor x11 = F::copy(x1, dev2); // Math Tensor x3 = x11 + x2; // x3 == {2, 2, 3, 5} Tensor xe = x1 + x2; // Error: different device Tensor x4 = F::exp(x1); std::vector<float> ret = x4.to_vector(); // {2.7,7.4,20.,55.} 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 65
  • 66. Default Device • The "Device" argument of each function can be omitted using the default device. devices::CUDA dev(0); // Specifies `dev` as the default Device::set_default(dev); // Same as F::input<Tensor>(shape, data, dev); Tensor x = F::input<Tensor>(shape, data); 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 66
  • 67. Graph and Node • Graph object represents a computation graph and its states. * * * * * * *matmul add tanh add parameter parameter input Graph 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 67
  • 68. Graph and Node • Graph object represents a computation graph and its states. • Node object represents a variable node in the Graph. * * * * * * *matmul add tanh add parameter parameter input GraphNode Reference to Graph Variable ID 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 68
  • 69. Adding new Nodes into Graph • Simply apply functions to add new calculations to the Graph. • Node has a similar interface to Tensor: • Math functions • Arithmetic operations x Node x = F::input<Node>(shape, data); input 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 69
  • 70. Adding new Nodes into Graph • Simply apply functions to add new calculations to the Graph. • Node has a similar interface to Tensor: • Math functions • Arithmetic operations x y exp x Node x = F::input<Node>(shape, data); Node y = F::exp(x); input input 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 70
  • 71. Lazy Evaluation through Nodes • Unlike Tensor, a Node is just a placeholder for values and does not invoke actual computation when it is created. • When the value is explicitly queried, all required calculations are invoked. ? ? exp std::vector<float> ret = y.to_vector(); input y Query 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 71
  • 72. Lazy Evaluation through Nodes • Unlike Tensor, a Node is just a placeholder for values and does not invoke actual computation when it is created. • When the value is explicitly queried, all required calculations are invoked. Val ? exp std::vector<float> ret = y.to_vector(); input Invoke! y 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 72
  • 73. Lazy Evaluation through Nodes • Unlike Tensor, a Node is just a placeholder for values and does not invoke actual computation when it is created. • When the value is explicitly queried, all required calculations are invoked. Val Val exp std::vector<float> ret = y.to_vector(); input Invoke! y Return 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 73
  • 74. Lazy Evaluation through Nodes • Once results are calculated, Nodes cache the values, and they are reused by future queries. • Unused values are never calculated. Cached Cached ? Cached ? ? ? matmul add tanh add Query 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 74
  • 75. Lazy Evaluation through Nodes • Once results are calculated, Nodes cache the values, and they are reused by future queries. • Unused values are never calculated. Cached Cached Val Cached Val Val ? matmul add tanh add Invoked Invoked Invoked Not Invoked 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 75
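  A small sketch of this behavior: nothing is computed while the Nodes are built; querying y evaluates and caches its subgraph, a later query on z reuses the cached values, and a never-queried Node is never computed (values are arbitrary):

    devices::Naive dev;
    Device::set_default(dev);
    Graph g;
    Graph::set_default(g);

    Node x = F::input<Node>(Shape({2}), {1, 2});
    Node h = F::tanh(x);        // not evaluated yet
    Node y = F::exp(h);         // not evaluated yet
    Node z = h + h;             // not evaluated yet
    Node unused = F::exp(y);    // never queried, so never computed

    std::vector<float> vy = y.to_vector();  // invokes input, tanh, exp and caches the results
    std::vector<float> vz = z.to_vector();  // reuses the cached tanh value; only `+` runs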
  • 76. Parameter • A Parameter object represents a trainable parameter in the network. • Its values can be used as a variable in a Graph, and its gradients are updated by the Graph. • Initial values can be specified by hand or by using an Initializer object. Parameter Reference to the Device Values Cumulative Gradients Other Statistics 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 76
  • 77. Optimizer • An Optimizer manages an update policy (SGD, Adam, etc.) for Parameters. • It consumes the gradient information held by each Parameter to update the values. • It also registers in each Parameter the statistics that the update policy requires. • I.e., all statistics about a Parameter are stored in the Parameter object itself; the Optimizer does not hold such information. 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 77
  • 78. Snippet: Initializing Parameter/Optimizer // Device devices::CUDA dev(0); Device::set_default(dev); // Parameter/Optimizer Parameter p1(Shape({3}), {1, 2, 3}); Parameter p2(Shape({3}), initializers::Uniform(-1, 1)); // Uses a uniform distribution for the initial values. // Optimizer optimizers::SGD opt(0.1); // Initializes SGD with LR=0.1. opt.add(p1, p2); // Registers `p1` and `p2` with the optimizer. 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 78
  • 79. Backpropagation • Backpropagation can be performed through Nodes by invoking the backward() function. • Tensors cannot perform backpropagation because they do not manage gradients or computation graphs. • If the computation graph contains Parameters, their gradients are updated by backward(). 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 79
  • 80. Snippet: Backpropagation Graph g; Graph::set_default(g); // Makes g the default. Parameter p(Shape({3}), {1, 2, 3}); optimizers::SGD opt(0.1); opt.add(p); Node w = F::parameter(p); Node x = F::input(Shape({3}), {2, 3, 5}); Node y = w * x; // Elementwise multiplication y.to_vector(); // {2, 6, 15} opt.reset_gradients(); // Makes all parameter gradients 0. y.backward(); // Performs the backpropagation. p.gradient().to_vector(); // {2, 3, 5} opt.update(); // Performs the SGD rule: // {1, 2, 3} - 0.1 * {2, 3, 5} p.value().to_vector(); // {0.8, 1.7, 2.5} 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 80
  • 81. Example 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 81
  • 82. The Data • We use synthetic data in this example... #include <random> #include <tuple> class DataSource { std::mt19937 rng; std::normal_distribution<float> data_dist, noise_dist; public: DataSource(float data_sd, float noise_sd) : rng(std::random_device()()) , data_dist(0, data_sd) , noise_dist(0, noise_sd) {} std::tuple<float, float, float> operator()() { const float x1 = data_dist(rng); const float x2 = data_dist(rng); return std::make_tuple( x1 + noise_dist(rng), x2 + noise_dist(rng), x1 * x2 >= 0 ? 1 : -1); } }; 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 82
  • 83. The Data • We use synthetic data in this example... #include <random> #include <tuple> class DataSource { std::mt19937 rng; std::normal_distribution<float> data_dist, noise_dist; public: DataSource(float data_sd, float noise_sd) : rng(std::random_device()()) , data_dist(0, data_sd) , noise_dist(0, noise_sd) {} std::tuple<float, float, float> operator()() { const float x1 = data_dist(rng); const float x2 = data_dist(rng); return std::make_tuple( x1 + noise_dist(rng), x2 + noise_dist(rng), x1 * x2 >= 0 ? 1 : -1); } }; 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 83
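  A short usage sketch for the DataSource defined above:

    DataSource source(1.0 /* data_sd */, 0.1 /* noise_sd */);
    float x1, x2, t;
    std::tie(x1, x2, t) = source();  // one noisy 2-D point and its label (+1 or -1)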
  • 84. The Data ... XOR Where: • data_sd == 1 • noise_sd == 0.1 Input: 𝒙 := (𝑥₁, 𝑥₂) ∈ ℝ² Output: 𝑦 ∈ ℝ 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 84
  • 85. The Network • We use a simple MLP: 𝑦 = tanh(𝑊_hy 𝒉 + 𝑏_y), 𝒉 = tanh(𝑊_xh 𝒙 + 𝒃_h) • Where: 𝒉 ∈ ℝ^N, 𝑊_xh ∈ ℝ^(N×2), 𝒃_h ∈ ℝ^N, 𝑊_hy ∈ ℝ^(1×N), 𝑏_y ∈ ℝ 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 85
  • 86. Code 1: Initialization • Including headers and declaring the main function #include <iostream> #include <vector> #include <primitiv/primitiv.h> using namespace std; using namespace primitiv; int main() { devices::Naive dev; // uses CPU Graph g; Device::set_default(dev); Graph::set_default(g); // All code will be described here. return 0; } 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 86
  • 87. Code 2: Parameter and Optimizer • We have 4 parameters: 𝑊𝒉𝑦, 𝑏 𝑦, 𝑊𝒙𝒉, 𝒃 𝒉. (in main function) constexpr unsigned N = 8; // #hidden units Parameter pw_xh({N, 2}, initializers::XavierUniform()); Parameter pb_h({N}, initializers::Constant(0)); Parameter pw_hy({1, N}, initializers::XavierUniform()); Parameter pb_y({}, initializers::Constant(0)); constexpr float learning_rate = 0.1; optimizers::SGD opt(learning_rate); opt.add(pw_xh, pb_h, pw_hy, pb_y); 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 87
  • 88. Code 3: Writing The Network • Using lambda: (in main function) auto feedforward = [&](const Node &x) { namespace F = primitiv::functions; const Node w_xh = F::parameter<Node>(pw_xh); // Shape({N, 2}) const Node b_h = F::parameter<Node>(pb_h); // Shape({N}) const Node w_hy = F::parameter<Node>(pw_hy); // Shape({1, N}) const Node b_y = F::parameter<Node>(pb_y); // Shape({}) const Node h = F::tanh(F::matmul(w_xh, x) + b_h); // Shape({N}, B) return F::tanh(F::matmul(w_hy, h) + b_y); // Shape({}, B) }; 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 88
  • 89. Code 4: Loss Function • Similar to the main network: (in main function) auto squared_loss = [](const Node &y, const Node &t) { namespace F = primitiv::functions; const Node diff = y - t; // Shape({}, B) return F::batch::mean(diff * diff); // Shape({}) }; 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 89
  • 90. Code 5: Making The Minibatch • This section is out of the toolkit, just up to the data. (in main function) constexpr float data_sd = 1.0; constexpr float noise_sd = 0.1; DataSource data_source(data_sd, noise_sd); auto next_data = [&](unsigned minibatch_size) { std::vector<float> data; std::vector<float> labels; for (unsigned i = 0; i < minibatch_size; ++i) { float x1, x2, t; std::tie(x1, x2, t) = data_source(); data.emplace_back(x1); data.emplace_back(x2); labels.emplace_back(t); } namespace F = primitiv::functions; return std::make_tuple( F::input<Node>(Shape({2}, minibatch_size), data), // input data `x` F::input<Node>(Shape({}, minibatch_size), labels)); // label data `t` }; 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 90
  • 91. Code 6: Training Loop (in main function) for (unsigned epoch = 0; epoch < 100; ++epoch) { g.clear(); // Initializes the computation graph Node x, t; std::tie(x, t) = next_data(1000); // Obtains the next data const Node y = feedforward(x); // Calculates the network const Node loss = squared_loss(y, t); // Calculates the loss std::cout << epoch << ": train loss=" << loss.to_float() << std::endl; // Performs backpropagation and updates parameters opt.reset_gradients(); loss.backward(); opt.update(); } 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 91
  • 92. Code 6: Training Loop (in main function) for (unsigned epoch = 0; epoch < 100; ++epoch) { g.clear(); // Initializes the computation graph Node x, t; std::tie(x, t) = next_data(1000); // Obtains the next data const Node y = feedforward(x); // Calculates the network const Node loss = squared_loss(y, t); // Calculates the loss std::cout << epoch << ": train loss=" << loss.to_float() << std::endl; // Performs backpropagation and updates parameters opt.reset_gradients(); loss.backward(); opt.update(); } $ g++ -std=c++11 code.cc -lprimitiv $ ./a.out 0: loss=1.17221 1: loss=1.07423 2: loss=1.06282 3: loss=1.04641 4: loss=1.00851 5: loss=1.01904 ... 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 92
  • 93. Code 7: Testing (in main function) for (unsigned epoch = 0; epoch < 100; ++epoch) { (Training process written in the previous code block) if (epoch % 10 == 9) { namespace F = primitiv::functions; const vector<float> test_x_data {1, 1, -1, 1, -1, -1, 1, -1}; const vector<float> test_t_data {1, -1, 1, -1}; const Node test_x = F::input<Node>(Shape({2}, 4), test_x_data); const Node test_t = F::input<Node>(Shape({}, 4), test_t_data); const Node test_y = feedforward(test_x); const Node test_loss = squared_loss(test_y, test_t); std::cout << "test results:"; for (float val : test_y.to_vector()) { std::cout << ' ' << val; } std::cout << "\ntest loss: " << test_loss.to_float() << std::endl; } } 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 93
  • 94. Code 7: Testing (in main function) for (unsigned epoch = 0; epoch < 100; ++epoch) { (Training process written in the previous code block) if (epoch % 10 == 9) { namespace F = primitiv::functions; const vector<float> test_x_data {1, 1, -1, 1, -1, -1, 1, -1}; const vector<float> test_t_data {1, -1, 1, -1}; const Node test_x = F::input<Node>(Shape({2}, 4), test_x_data); const Node test_t = F::input<Node>(Shape({}, 4), test_t_data); const Node test_y = feedforward(test_x); const Node test_loss = squared_loss(test_y, test_t); std::cout << "test results:"; for (float val : test_y.to_vector()) { std::cout << ' ' << val; } std::cout << "\ntest loss: " << test_loss.to_float() << std::endl; } } 1, 1 ⟼ 1 −1, 1 ⟼ −1 −1, −1 ⟼ 1 1, −1 ⟼ −1 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 94
  • 95. Code 7: Testing (in main function) for (unsigned epoch = 0; epoch < 100; ++epoch) { (Training process written in the previous code block) if (epoch % 10 == 9) { namespace F = primitiv::functions; const vector<float> test_x_data {1, 1, -1, 1, -1, -1, 1, -1}; const vector<float> test_t_data {1, -1, 1, -1}; const Node test_x = F::input<Node>(Shape({2}, 4), test_x_data); const Node test_t = F::input<Node>(Shape({}, 4), test_t_data); const Node test_y = feedforward(test_x); const Node test_loss = squared_loss(test_y, test_t); std::cout << "test results:"; for (float val : test_y.to_vector()) { std::cout << ' ' << val; } std::cout << "\ntest loss: " << test_loss.to_float() << std::endl; } } $ g++ -std=c++11 code.cc -lprimitiv $ ./a.out ... 8: loss=0.933427 9: loss=0.927205 test results: 0.04619 -0.119208 0.0893511 -0.149148 test loss: 0.809695 10: loss=0.916669 11: loss=0.91744 ... 18: loss=0.849496 19: loss=0.845048 test results: 0.156536 -0.229959 0.171106 -0.221599 test loss: 0.649342 20: loss=0.839679 21: loss=0.831217 ... 1, 1 ⟼ 1 −1, 1 ⟼ −1 −1, −1 ⟼ 1 1, −1 ⟼ −1 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 95
  • 96. Links • Public repository (components, tests, examples) • https://github.com/primitiv • Slack (conversation) • https://primitiv-forum.slack.com • Documentation (tutorial, design, reference) • http://primitiv.readthedocs.io/en/develop 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 96
  • 97. Thanks! 6/1/2018 Copyright (c) 2018 by Yusuke Oda. 97