A Powerful, Flexible, and Intuitive
Deep Learning Framework
@ NVIDIA GTC, April 6th, 2016
Shohei Hido
Chief Research Officer
Preferred Networks, Inc.
Overview
• Chainer is a Python-based deep learning framework
• Chainer v1.0 was released as open source in June 2015
• It does NOT rely on Theano, unlike other Python frameworks
• Chainer uses a unique scheme named Define-by-Run
• Why do users still need another framework?
• How different and effective is Chainer?
Preferred Networks (PFN)
A startup that applies deep learning to industrial IoT
• Founded: March 2014
• Headquarters: Tokyo, Japan
• U.S. subsidiary: San Mateo, California
• Company size: 35 engineers & researchers
• Investors: Toyota, FANUC, NTT
[Figure: deep learning applied to industrial IoT (manufacturing, automotive, healthcare)]
Partnering with world-leading companies using Chainer
• R&D collaboration on industrial problems with real-world data
  - Specific requirements, modified algorithms, much trial and error, etc.
  - Different from building a general-purpose recognition system
[Partner logos: Toyota, FANUC, Panasonic, NTT, Cisco, NVIDIA]
Two types of background behind DL frameworks
1. Scalability-oriented
• Use cases in mind
  - Image/speech recognition systems
  - Fast DL as a service in the cloud
• Problem type
  - A few general applications
  - 10+ million training samples
  - 10+ node cluster with a fast network
• Possible bottlenecks
  - Tuning of well-known algorithms
  - Distributed computation for model/data-parallel training
2. Flexibility-oriented
• Use cases in mind
  - Algorithm research
  - R&D projects for new products
• Problem type
  - Various specific applications
  - 10+ thousand training samples
  - 1 node with multiple GPUs
• Possible bottlenecks
  - Trial and error in prototyping
  - Debugging, profiling & refactoring
  - (Wait time during compilation)
Designed for efficient research & development
• Flexible: new kinds of complex models for various applications
• Intuitive: rapid prototyping and efficient trial and error
• Powerful: comparable performance on 1 node with multiple GPUs
[Figure: spectrum from scalability-oriented to flexibility-oriented frameworks]
Agenda
• Deep learning framework basics
• Introduction to Chainer
• CuPy: NumPy-compatible GPU library
• Performance and applications
Neural network and computation
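[Figure: a feed-forward network with input (text, image, sensor), hidden units, and output (e.g., object: tulip; anomaly score: 0.35; category: sports). Forward computation runs from input to output; backward computation (backpropagation) runs from output to input.]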
Chainer focuses on network representation/training
• Design choices for deep learning frameworks
  - How to build neural networks?
  - How to train neural networks?
  - Which text format/language for modeling?
  - Which language for computing?
  - Run with GPU?
  - Run on multiple GPUs?
  - Run on multiple compute nodes?
Building and training neural networks:
Computational graph construction is the key
1. Construct a computational graph
  - Based on the network definition given by users
  - Chains of functions and operations on input variables
2. Compute loss and gradients
  - Forward computation calculates the loss for a minibatch
  - Backpropagation gives the gradients of all parameters
3. Optimize the model
  - Update each parameter with its gradient
  - Repeat until convergence
Step 1 is the most important, and there are many approaches to it
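The three steps above can be sketched without any framework. This is only an illustration, assuming a toy one-parameter model y = w * x with a squared-error loss and a hand-derived gradient; the names `forward`, `grad_w`, and `train` are hypothetical and not part of Chainer.

```python
# Minimal sketch of: define the computation -> loss and gradient -> update.
# Assumption: a toy model y = w * x with mean squared error.

def forward(w, x, t):
    """Forward computation: mean squared error over a minibatch."""
    return sum((w * xi - ti) ** 2 for xi, ti in zip(x, t)) / len(x)

def grad_w(w, x, t):
    """Gradient of the loss with respect to w (derived by hand here)."""
    return sum(2.0 * (w * xi - ti) * xi for xi, ti in zip(x, t)) / len(x)

def train(x, t, lr=0.05, steps=200):
    w = 0.0
    for _ in range(steps):          # repeat until (approximate) convergence
        w -= lr * grad_w(w, x, t)   # update the parameter with its gradient
    return w

x = [1.0, 2.0, 3.0]
t = [2.0, 4.0, 6.0]                 # generated by the rule t = 2 * x
w = train(x, t)
```

A framework automates the part done by hand here: deriving `grad_w` from the definition of `forward`.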
Building blocks
• These functionalities are very similar between frameworks
• But the structure, abstraction level, and interface are different
• It comes down to the design of a domain-specific language for NNs
[Figure: building blocks: array data structure (vector/matrix/tensor), operations & functions, network (computational graph), optimizer (SGD/AdaGrad/Adam)]
Types of domain-specific language for neural networks
• Text DSL
  - Ex. Caffe (prototxt)
  - Ex. CNTK (NDL)
• Symbolic program
  - Operations on symbols
  - Ex. Theano
  - Ex. TensorFlow
• Imperative program
  - Direct computations on raw data arrays
  - Ex. Torch.nn
  - Ex. Chainer
# Symbolic definition
A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)
# Compile
f = compile(D)
d = f(A=np.ones(10),
      B=np.ones(10) * 2)
# Imperative declaration
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
d = c + 1
%% Definition in text
f: {
  "A": "Variable",
  "B": "Variable",
  "C": ["B", "*", "A"],
  "ret": ["C", "+", 1]
}
# Compile
f = compile("f.txt")
d = f(A=np.ones(10),
      B=np.ones(10) * 2)
Ex. MXNet (mixes symbolic and imperative styles)
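The symbolic and imperative snippets above are pseudocode. A runnable toy version of the same contrast might look as follows; the `Sym` class and the `var` and `compile_graph` helpers are hypothetical names invented for this sketch, and plain floats stand in for the NumPy arrays.

```python
# Symbolic style: operators build an expression tree; nothing is computed yet.
class Sym:
    def __init__(self, op, args=(), name=None, value=None):
        self.op, self.args, self.name, self.value = op, args, name, value

    def __mul__(self, other):
        return Sym("mul", (self, as_sym(other)))

    def __add__(self, other):
        return Sym("add", (self, as_sym(other)))

def as_sym(v):
    """Wrap plain numbers as constant nodes."""
    return v if isinstance(v, Sym) else Sym("const", value=v)

def var(name):
    return Sym("var", name=name)

def compile_graph(node):
    """Turn the tree into a plain Python function of keyword inputs."""
    def run(**inputs):
        def ev(n):
            if n.op == "var":
                return inputs[n.name]
            if n.op == "const":
                return n.value
            left, right = (ev(child) for child in n.args)
            return left * right if n.op == "mul" else left + right
        return ev(node)
    return run

A, B = var("A"), var("B")
D = B * A + 1                  # builds a graph only
f = compile_graph(D)
d_symbolic = f(A=1.0, B=2.0)   # the computation happens here

# Imperative style: each line computes a value immediately.
a, b = 1.0, 2.0
c = b * a
d_imperative = c + 1
```

In the symbolic half, `D` can be inspected or optimized before any data is seen; in the imperative half, every intermediate value is available to a debugger right away.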
Comparison of DSL types
DSL type | Pros | Cons
Text DSL | Human-readable definition; non-programmers can easily edit the network | Users must study the format; the format might have to be extended for new algorithms
Internal DSL (symbolic) | Static analysis at compile time; optimization before training; easy to parallelize | Users must study special syntax; may need more effort to implement new algorithms
Internal DSL (imperative) | Less effort to learn syntax; easy debugging and profiling; suitable for new algorithms with complex logic | Hard to optimize in advance; less efficient in memory allocation and parallelization
Chainer is at the extreme end of the imperative style, for high flexibility
Chainer as an open-source project
• https://
• 50 contributors
• 1,277 stars & 255 forks
• 3,708 commits
• Active development & releases over the last 10 months
  - v1.0.0 (June 2015) to v1.7.2 (March 2016)
Original developer: Seiya Tokui
Chainer software stack
• Chainer is built on top of NumPy and CUDA
• CuPy is also introduced as an equivalent of NumPy on GPU
Graph build scheme (1/2) - Define-and-Run:
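Most frameworks use this scheme (Chainer does not)
• Define: build a computational graph based on the definition
• Run: update the model (parameters) using the training dataset
[Figure: in the Define phase, auto differentiation turns the network definition into a computational graph and gradient function; in the Run phase, training data yields the loss & gradient, which update the parameters]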
Graph build scheme (2/2) - Define-by-Run:
Computational graph construction on the fly
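• No graph is constructed before training
• Instead, the graph is built at each forward computation
• The computational graph can be modified dynamically
  for each iteration/sample or depending on some conditions
[Figure: the model definition and training data produce the computational graph on the fly; conditions can change it dynamically]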
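A toy sketch of the Define-by-Run idea, written without Chainer: each operation runs eagerly and records where its result came from, so the graph is simply whatever the forward pass executed. The `Var` class and the `apply` and `trace` helpers are hypothetical names for this illustration only.

```python
# Sketch of Define-by-Run: the graph is whatever the forward pass executed.
class Var:
    def __init__(self, data, creator=None, inputs=()):
        self.data = data
        self.creator = creator    # name of the operation that produced this value
        self.inputs = inputs      # the Var objects it was computed from

def apply(name, fn, *xs):
    """Run fn eagerly and record how the result was produced."""
    return Var(fn(*(x.data for x in xs)), creator=name, inputs=xs)

def trace(v):
    """List operation names from the output back to the input."""
    ops = []
    while v.creator is not None:
        ops.append(v.creator)
        v = v.inputs[0]
    return ops

def forward(x, n_layers):
    h = x
    for _ in range(n_layers):     # ordinary Python control flow shapes the graph
        h = apply("double", lambda a: 2 * a, h)
    if h.data > 10:               # data-dependent branch
        h = apply("clip", lambda a: 10, h)
    return h

y_short = forward(Var(1), n_layers=2)
y_long = forward(Var(1), n_layers=4)
```

The two calls leave different recorded histories (`trace(y_short)` has two entries, `trace(y_long)` has five), although the model code is the same.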
Define-by-Run example: MLP for MNIST
• Only transformations between units are set before training
• The connection is given as forward computation
l1 = Linear(784, n_units)
l2 = Linear(n_units, 10)
def forward(x):
    h1 = ReLU(l1(x))
    return l2(h1)
[Figure: x -> Linear l1 (W, bias) -> ReLU -> h1 -> Linear l2 (W, bias) -> y]
Define-by-Run:
An interpreted language for neural networks
• Idea
  - Forward computation actually goes through the computational graph
  - By remembering the history, the actual graph can be obtained
• Advantages
  - Flexibility for new algorithms with complex components
    - Ex. recurrent, recursive, attention, memory, adversarial, etc.
  - Intuitive coding with highly imperative network definition
    - Ex. stochastic networks whose graph changes at each iteration
• Current drawbacks
  - The graph is regenerated every time, even for fixed networks
  - No optimization, even for static parts of graphs
    - JIT-like analysis and subgraph caching might be useful
Basic components (1/2): Variable and Function
• Variable
  - Variable wraps arrays (.data)
  - It remembers its parent function (.creator)
  - It will be assigned a gradient (.grad)
  - It keeps track of not only data but also computations
• Function
  - Transformation between Variables
  - Stateless
  - e.g. sigmoid, tanh, ReLU, max pooling, dropout
[Figure: Variable x -> Function -> Variable y]
Basic components (2/2): Link and Chain
• Link = function with state
  - Parameters are also Variables, and gradients will be assigned to them
  - e.g. Linear (fully-connected), LSTM, Convolution2D, word embedding
• Chain = network
  - A Chain has a set of child Links
  - Forward computation is defined in .__call__()
  - e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq
[Figure: Link (Linear): y = f(W*x + b). Chain (MLP2): x -> Linear l1 (W, bias) -> ReLU -> h1 -> Linear l2 (W, bias) -> y]
Backpropagation through computational graph
• Consider an objective (Link.Linear): L = f(x * w + b)
• This computes the value of L in forward computation, and
  simultaneously builds the following computational graph
• The gradient of L can be computed with respect to
  any variable by backpropagation
• Then the optimizer updates the values of the parameters
[Figure: computational graph of x and W feeding *, then + with b, then f, giving L; the legend distinguishes Variable nodes from Function nodes]
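Backpropagation through the graph of L = f(x * w + b) can be sketched with a toy scalar implementation. The `Variable` and `Function` classes below only borrow Chainer's names and are not its code; f is assumed to be tanh, and the traversal assumes each Variable feeds a single Function, which holds for this chain.

```python
import math

class Variable:
    def __init__(self, data, creator=None):
        self.data = data
        self.grad = 0.0
        self.creator = creator        # Function that produced this Variable

    def backward(self):
        """Walk the recorded history from the output back to the inputs."""
        self.grad = 1.0
        stack = [self]
        while stack:
            v = stack.pop()
            if v.creator is None:     # leaf: an input or a parameter
                continue
            func = v.creator
            for x, g in zip(func.inputs, func.backward(v.grad)):
                x.grad += g
                stack.append(x)

class Function:
    def __call__(self, *inputs):
        self.inputs = inputs
        self.output = Variable(self.forward(*(x.data for x in inputs)), creator=self)
        return self.output

class Mul(Function):
    def forward(self, a, b):
        return a * b

    def backward(self, gy):
        a, b = self.inputs
        return gy * b.data, gy * a.data

class Add(Function):
    def forward(self, a, b):
        return a + b

    def backward(self, gy):
        return gy, gy

class Tanh(Function):
    def forward(self, a):
        return math.tanh(a)

    def backward(self, gy):
        return (gy * (1.0 - self.output.data ** 2),)

x, w, b = Variable(0.5), Variable(-1.2), Variable(0.3)
L = Tanh()(Add()(Mul()(x, w), b))     # forward pass builds the graph
L.backward()                          # fills x.grad, w.grad, b.grad
```

After `L.backward()`, an optimizer would only need `w.grad` and `b.grad` to update the parameters.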
Code sample (1/4): Multi-layer perceptron
from chainer import Chain, Variable, optimizers
import chainer.functions as F
import chainer.links as L

class MLP2(Chain):
    def __init__(self):
        super(MLP2, self).__init__(
            l1=L.Linear(784, 100),
            l2=L.Linear(100, 10),
        )
    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        y = self.l2(h1)
        return y

class Classifier(Chain):
    def __init__(self, predictor):
        super(Classifier, self).__init__(predictor=predictor)
    def __call__(self, x, t):
        y = self.predictor(x)
        self.accuracy = F.accuracy(y, t)
        self.loss = F.softmax_cross_entropy(y, t)
        return self.loss, self.accuracy

# Model and optimizer setup
model = Classifier(MLP2())
optimizer = optimizers.SGD()
optimizer.setup(model)

# Training loop with minibatches
# (x_tr, y_tr, datasize, and batchsize are defined elsewhere)
for i in range(0, datasize, batchsize):
    x = Variable(x_tr[i:i+batchsize])
    t = Variable(y_tr[i:i+batchsize])
    model.zerograds()
    loss, acc = model(x, t)
    loss.backward()
    optimizer.update()
[Figure: Chain (MLP2): x -> Linear l1 (W, bias) -> ReLU -> h1 -> Linear l2 (W, bias) -> y]
Code sample (2/4): Convolutional neural network
class AlexNet(Chain):
    def __init__(self):
        super(AlexNet, self).__init__(
            conv1=L.Convolution2D(3, 96, 11, stride=4),
            conv2=L.Convolution2D(96, 256, 5, pad=2),
            conv3=L.Convolution2D(256, 384, 3, pad=1),
            conv4=L.Convolution2D(384, 384, 3, pad=1),
            conv5=L.Convolution2D(384, 256, 3, pad=1),
            fc6=L.Linear(9216, 4096),
            fc7=L.Linear(4096, 4096),
            fc8=L.Linear(4096, 1000),
        )
        self.train = True  # assumed flag: dropout is active only while training
    def __call__(self, x, t):
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv1(x))), 3, stride=2)
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv2(h))), 3, stride=2)
        h = F.relu(self.conv3(h))
        h = F.relu(self.conv4(h))
        h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2)
        h = F.dropout(F.relu(self.fc6(h)), train=self.train)
        h = F.dropout(F.relu(self.fc7(h)), train=self.train)
        y = self.fc8(h)
        return y
* ImageNet Classification with Deep Convolutional Neural Networks
Code sample (3/4): Recurrent neural network
class SimpleRNN(Chain):
    def __init__(self, n_vocab, n_units):
        super(SimpleRNN, self).__init__(
            embed=L.EmbedID(n_vocab, n_units),
            x2h=L.Linear(n_units, n_units),
            h2h=L.Linear(n_units, n_units),
            h2y=L.Linear(n_units, n_vocab),
        )
        self.h = None
    def __call__(self, x):
        y, h_new = self.fwd_one_step(x, self.h)
        self.h = h_new
        return y
    def fwd_one_step(self, x, h):
        x = F.tanh(self.embed(x))
        if h is None:
            h = F.tanh(self.x2h(x))
        else:
            h = F.tanh(self.x2h(x) + self.h2h(h))
        y = F.softmax(self.h2y(h))
        return y, h
[Figure: unrolled RNN with input words x_1 to x_4, recurrent state h, and outputs y_1 to y_4; BPTT length = 3]
# Truncated BPTT (length=3)
for i in range(0, datasize, batchsize):
    ...
    accum_loss += model(x, t)
    if i % bptt_length == 0:
        model.zerograds()
        accum_loss.backward()
        accum_loss.unchain_backward()
        optimizer.update()
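The bookkeeping behind truncated BPTT can be shown on its own, without computing any gradients. The helper below is a hypothetical illustration (it is not a Chainer function): it lists which time steps would have their losses accumulated and backpropagated together when the history is cut every `bptt_length` steps.

```python
def truncated_bptt_schedule(n_steps, bptt_length):
    """Return the groups of time steps whose losses are backpropagated together."""
    windows, current = [], []
    for step in range(1, n_steps + 1):
        current.append(step)              # accumulate the loss of this step
        if step % bptt_length == 0:       # backward + update, then cut the history
            windows.append(current)
            current = []
    if current:                           # leftover steps at the end of the sequence
        windows.append(current)
    return windows

windows = truncated_bptt_schedule(n_steps=7, bptt_length=3)
```

Each inner list corresponds to one backward pass followed by cutting the graph, so gradients never flow across a window boundary.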
Code sample (4/4): Deep Networks with Stochastic Depth
A paper published on arXiv, March 30, 2016
• A variant of Residual Net that skips connections stochastically
  - Outperformed the original Residual Net (ImageNet 2015 winner, MSR)
  - Stochastic skip: h <- ReLU(b * f(h) + h), where b is 1 (keep the block) or 0 (skip it)
Taken from G. Huang et al.
# Mock code in Chainer
class StochasticResNet(Chain):
    def __init__(self, prob, size, …):
        super(StochasticResNet, self).__init__(
            ## Define f[i] as same for Residual Net
        )
        self.p = prob  # Survival probabilities
        self.size = size
    def __call__(self, h):
        for i in range(self.size):
            b = numpy.random.binomial(1, self.p[i])
            c = self.f[i](h) + h if b == 1 else h
            h = F.relu(c)
        return h
w/ survival probability: b is drawn as 1 with probability p[i] for block i
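How the survival probabilities might be chosen can be sketched as follows, assuming the linear decay rule p_l = 1 - (l / L)(1 - p_L) described in the stochastic depth paper by Huang et al.; the function names and the value p_last = 0.5 are illustrative choices for this sketch, not taken from the slides.

```python
import random

def survival_probabilities(n_blocks, p_last=0.5):
    """Linear decay: p_l = 1 - (l / L) * (1 - p_L) for l = 1..L."""
    return [1.0 - (l / n_blocks) * (1.0 - p_last) for l in range(1, n_blocks + 1)]

def sample_active_blocks(probs, rng):
    """Draw one Bernoulli gate per block; 1 keeps the residual branch, 0 skips it."""
    return [1 if rng.random() < p else 0 for p in probs]

probs = survival_probabilities(n_blocks=4, p_last=0.5)
gates = sample_active_blocks(probs, random.Random(0))
expected_depth = sum(probs)   # expected number of active blocks in one pass
```

Because the gates are resampled at every forward pass, the computational graph differs from iteration to iteration, which is the case Define-by-Run handles naturally.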
Miscellaneous
• Other features
  - Install with pip in one line: $ pip install chainer
  - Multi-GPU support by explicitly selecting the ID to use
  - Pre-trained Caffe model import from Model Zoo
  - Model serialization, save & load: HDF5 or NumPy npz
• Future directions (not only for Chainer)
  - JIT-like optimization during Define-by-Run
  - Memory consumption reduction (GPU memory is still small)
  - Handling variable-length inputs without minibatches
  - Maximizing performance in multi-node & multi-GPU environments
CuPy: (partially-)NumPy-compatible GPU library
• Motivation: NumPy + CUDA = CuPy
  - NumPy is the standard Python library for numerical computation
  - CUDA is the standard API for using GPUs for high-performance computing
  - Unfortunately, NumPy does NOT work with CUDA
• CuPy supports:
  - Fast computation using NVIDIA's cuBLAS and cuDNN
  - Array indexing, slicing, transpose, and reshape
  - Most operations/functions in NumPy
    - Chainer v1.7.2 already supports more than 170 functions
  - User-defined functions and kernels
  - All dtypes, broadcasting, memory pool, etc.
How to use CuPy
• Usage of CuPy: just replace NumPy with CuPy
import numpy, cupy
enable_cupy = True
xp = cupy if enable_cupy else numpy
• Conversion between numpy.ndarray and cupy.ndarray
w_c = cupy.asarray(numpy.ones(10))  # cupy.ndarray
w_n = cupy.asnumpy(cupy.ones(10))   # numpy.ndarray
• Ex. CPU/GPU-agnostic logsumexp function
from chainer import cuda
def logsumexp(x, axis=None):
    xp = cuda.get_array_module(x)  # Get CuPy or NumPy
    # Note: x - x_max broadcasts correctly for axis=None or axis=0;
    # other axes would need keepdims=True
    x_max = x.max(axis)
    exp_sum = xp.exp(x - x_max).sum(axis)
    return x_max + xp.log(exp_sum)
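The reason logsumexp subtracts the maximum first is numerical stability. The point can be checked with plain Python floats; this sketch uses only the standard `math` module, not NumPy or CuPy, and `naive_logsumexp` is a hypothetical comparison function.

```python
import math

def logsumexp(values):
    """log(sum(exp(v))) computed by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def naive_logsumexp(values):
    """Direct evaluation; overflows for large inputs."""
    return math.log(sum(math.exp(v) for v in values))

small = [0.0, 1.0, 2.0]
large = [1000.0, 1001.0, 1002.0]

stable_small = logsumexp(small)
stable_large = logsumexp(large)       # `small` shifted by 1000

try:
    naive_logsumexp(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True           # math.exp(1000.0) exceeds the float range
```

Both versions agree on `small`, but only the max-shifted one survives `large`; the array version above applies the same identity through `xp`.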
CuPy implementation:
Optimized for performance & NumPy-compatibility
• Uses Cython for cupy.core & cupy.cuda
• Dynamic code generation & compilation
  - CUDA code is generated for the specific tensor dimensions & data type
  - On-the-fly compilation by nvcc with a binary cache (faster after the 1st use)
[Figure: CuPy layers: CUDA libraries (cuBLAS, cuRAND, cuDNN); CUDA Python wrapper (cupy.cuda); ufunc, elementwise, reduction; tensor operations & functions (cupy)]
CuPy performance on linear algebra:
5 to 25 times faster than NumPy
def test(xp):
    a = xp.arange(1000000).reshape(1000, -1)
    return a.T * 2
t1 =
for i in range(1000):
t2 =
print(t2 -t1)
t1 =
for i in range(1000):
t2 =
print(t2 -t1)
Configuration | Time (msec) | Speedup
NumPy | 2,929 | 1.0
CuPy | 585 | 5.0
CuPy + Memory Pool | 123 | 23.8
Test environment: Intel Core i7-4790 @ 3.60 GHz, 32 GB RAM, GeForce GTX 970
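The timing code on this slide is truncated in the source, so the following is only a guess at its general shape: a hypothetical harness that times repeated calls with `time.perf_counter` (the original timer is unknown) and recomputes the speedup column from the reported timings. A real GPU measurement would likely also need to wait for the device to finish before stopping the clock.

```python
import time

def time_calls(fn, repeat=1000):
    """Wall-clock time in milliseconds for `repeat` calls of fn()."""
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    return (time.perf_counter() - start) * 1000.0

def speedup(baseline_msec, msec):
    """Ratio of baseline time to measured time, rounded as in the table."""
    return round(baseline_msec / msec, 1)

# Stand-in workload so the harness can run without NumPy or CuPy.
elapsed = time_calls(lambda: sum(range(100)), repeat=1000)

# Speedups recomputed from the timings reported above.
numpy_msec, cupy_msec, cupy_pool_msec = 2929, 585, 123
ratios = [speedup(numpy_msec, m) for m in (numpy_msec, cupy_msec, cupy_pool_msec)]
```

With the slide's benchmark, `fn` would be something like `lambda: test(numpy)` or `lambda: test(cupy)`.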
Use CuPy for GPU-based computation
• Three patterns are supported as wrappers
  - ElementwiseKernel: for element-wise computation
  - ReductionKernel: for reduce operations along an axis
  - ufunc: universal function as in NumPy
• Ex. definition of an element-wise function
squared_diff = cupy.ElementwiseKernel(
    'float32 x, float32 y',    # Input
    'float32 z',               # Output
    'z = (x - y) * (x - y)',   # Operation
    'squared_diff')            # Name
• Usage (automatic broadcasting and type checking are supported)
squared_diff(cupy.arange(10), 10)
Public benchmark results (CNN):
Chainer shows comparable performance
• Forward computation is almost the same as TensorFlow
• Training with backward computation is slower, but this can be
  offset by the absence of compilation time while debugging/tuning
[Figure: bar charts of forward and backward computation time (msec) on AlexNet, GoogLeNet, VGG-A, and OverFeat for several frameworks, including Caffe (native)]
Taken from (source not preserved), using cuDNN except for Caffe
Chainer can benefit from latest CUDA libraries:
Ex. Winograd algorithm in cuDNN v5
• Conv3x3 is common in CNNs & is now computed with the Winograd algorithm
• State-of-the-art CNN models (e.g., GoogLeNet, VGG-A)
  can be accelerated by up to 2.0x at test time (forward only)
[Figure: bar charts of forward and backward computation time (msec) on AlexNet, GoogLeNet, VGG-A, and OverFeat, comparing cuDNN v4 with cuDNN v5]
Independently measured with a modified version of soumith/convnet-benchmarks
cuDNN v5 can be used in Chainer v1.8.0
Algorithm implementation in Chainer:
A Neural Algorithm of Artistic Style (Gatys et al., 2015)
• https://
[Figure: content image (cat) + style image = generated image]
Main code (45 lines)
Chainer in industry:
Used in demonstrations & being commercialized
• Many collaborations are ongoing with Chainer-based
  computer vision, deep reinforcement learning, etc.
• Ex. 1: Chainer-controlled toy cars in the Toyota booth at CES 2016
• Ex. 2: FANUC's highly accurate bin-picking robot at IREX 2015
  - 8 hours of training to reach expert level; commercialization by the end of 2016
• Chainer is a Python-based deep learning framework
  with a dynamic network construction scheme and CuPy
• It is designed for efficient research and prototyping while
  keeping comparable performance thanks to NVIDIA GPUs
• Official web: http://
• GitHub: https://
Your contributions will be appreciated & we are hiring!


Chainer GTC 2016

  • 2. Overview l  Chainer is a Python-based deep learning framework l  Chainer v1.0 was released as an open source on June 2015 l  It DOESN’T rely on Theano, unlike other Python frameworks l  Chainer uses a unique scheme named Define-by-Run l  Why do users sOll need another framework? l  How different and effecOve Chainer is? 2
  • 3. Preferred Networks (PFN) A startup that applies deep learning to industrial IoT l  Founded: March 2014 l  Headquarter: Tokyo, Japan l  U.S. Subsidiary: San Mateo, California l  Company size: 35 engineers & researchers l  Investors: Toyota, FANUC, NTT Deep learning Industrial IoT 3 Manufacturing Automotive Healthcare
  • 4. Partnering with world-leading companies using Chainer l  R&D collaboraOon on industrial problems with real-world data ̶  Specific requirements, modified algorithms, many trials and errors, etc ̶  Different from making general-purpose recogniOon system 4 Toyota FANUC Panasonic NTT Cisco NVIDIA
  • 5. Two types of background behind DL frameworks 1. Scalability-oriented l  Use-cases in mind ̶  Image/speech recogniOon system ̶  Fast DL as a service in cloud l  Problem type ̶  A few general applicaOons ̶  10+ million training samples ̶  10+ nodes cluster w/ fast network l  Possible boZleneck ̶  Tuning of well-known algorithms ̶  Distributed computaOon for model/data-parallel training 2. Flexibility-oriented l  Use-cases in mind ̶  Algorithm research ̶  R&D projects for new products l  Problem type ̶  Various specific applicaOons ̶  10+ k training samples ̶  1 node with mulOple GPUs l  Possible boZleneck ̶  Trial-and-error in prototyping ̶  Debugging, profiling & refactoring ̶  (wait Ome during compilaOon)
  • 6. Designed for efficient research & development l  Flexible: new kinds of complex models for various applicaOons l  IntuiOve: rapid prototyping and efficient trial-and-error l  Powerful: comparable performance for 1 node & mulO-GPUs 6 Scalability-oriented Flexibility-oriented
  • 7. Agenda l  Deep learning framework basics l  IntroducOon to Chainer l  CuPy: NumPy-compaOble GPU library l  Performance and applicaOons 7
  • 8. Neural network and computation x1 xN ・・ h1 hH ・・・・ kM k1 yM y1 Forward computation Backward computation (backpropagation) ・・ ・・ Input Hidden units Output Text Image Sensor Object:
 Tulip Anomaly score:
 0.35 Category:
 Sports ・・ ・・・・ 8
  • 9. Chainer focuses on network representation/training l  Design choices for deep learning frameworks ̶  How to build neural networks? ̶  How to train neural networks? ̶  Which text format/language for modeling? ̶  Which language for compuOng? ̶  Run with GPU? ̶  Run on mulOple GPUs? ̶  Run on mulOple compute nodes? 9
  • 10. Building and training neural networks: Computational graph construction is the key 1.  Construct a computaOonal graph ̶  Based on network definiOon given by users ̶  Chains of funcOons and operaOons on input variables 2.  Compute loss and gradients ̶  Forward computaOon to calculate loss for a minibatch ̶  BackpropagaOon gives gradients to all of parameters 3.  OpOmize model ̶  Update each parameter with the gradient ̶  Repeat unOl convergence Step 1. is the most important and there are many approaches 10
  • 11. Building blocks l  These funcOonaliOes are very similar between frameworks l  But the structure, abstracOon level, and interface are different l  It comes to the design of domain-specific language for NN Array data structure (vector/matrix/tensor) Operations & functions Network (computational graph) Optimizer (SGD/AdaGrad/Adam) 11
  • 12. Types of domain-specific language for neural networks l  Text DSL ̶  Ex. Caffe (prototxt) ̶  Ex. CNTK (NDL) l  Symbolic program ̶  OperaOons on symbols ̶  Ex. Theano ̶  Ex. TensorFlow l  ImperaOve program ̶  Direct computaOons on raw data arrays ̶  Ex. Torch.nn ̶  Ex. Chainer # Symbolic definiOon A = Variable(‘A’) B = Variable(‘B’) C = B * A D = C + Constant(1) # Compile f = compile(D) d = f(A=np.ones(10), B=np.ones(10) * 2) # ImperaOve declaraOon a = np.ones(10) b = np.ones(10) * 2 c = b * a d = c + 1 %% DefiniOon in text f: { “A”: “Variable”, “B”: “Variable”, “C”: [“B”, “*”, “A”], “ret”: [“C”, “+”, 1] } # Compile f = compile(“f.txt”) d = f(A=np.ones(10), B=np.ones(10) * 2) 12 Ex. MXNet
  • 13. Comparison of DSL type DSL type Pros. Cons. Text DSL •  Human-readable definiOon •  Non-programmer can easily edit the network •  Users must study the format •  Format might have to be extended for new algorithms Internal DSL Symbolic •  StaOc analysis at compile •  OpOmizaOon before training •  Easy to parallelize •  Users must study special syntax •  May need more efforts to implement new algorithms ImperaOve •  Less efforts to learn syntax •  Easy debugging and profiling •  Suitable for new algorithms with complex logic •  Hard to opOmize in advance •  Less efficient in memory allocaOon and parallelizaOon Chainer is at the extreme end of imperaOve program for high flexibility 13
  • 14. Agenda l  Deep learning framework basics l  IntroducOon to Chainer l  CuPy: NumPy-compaOble GPU library l  Performance and applicaOons 14
  • 15. Chainer as an open-source project l  hZps:// l  50 contributors l  1,277 stars & 255 fork l  3,708 commits l  AcOve development & release for last 10 months ̶  v1.0.0 (June 2015) to v1.7.2 (March 2016) 15 Original developer Seiya Tokui
  • 16. CuPy Chainer software stack CPU NVIDIA GPU CUDA cuDNN BLAS NumPy Chainer l  Chainer is built on top of NumPy and CUDA l  CuPy is also introduced as an equivalent of NumPy on GPU 16
  • 17. Run Define Graph build scheme (1/2) - Define-and-Run: Most of frameworks use this scheme (Chainer does not) l  Define: build a computaOonal graph based on definiOon l  Run: update the model (parameters) using training dataset Network definiOon ComputaOonal graph Gradient funcOon Parameters ComputaOonal graph Gradient funcOon Parameters Training data Update Loss & gradient Auto differenOaOon 17
  • 18. Define-by-Run Graph build scheme (2/2) - Define-by-Run: Computational graph construction on the fly l  No graph is constructed before training l  Instead, the graph is built at each forward computaOon l  ComputaOonal graph can be modified dynamically for each iteraOon/sample or depending on some condiOons Model definiOon ComputaOonal graph Gradient funcOon Parameters Training data Update Dynamic change CondiOons 18
  • 19. Define-by-Run example: MLP for MNIST l  Only transformaOons between units are set before training l  ConnecOon is given as forward computaOon l1 = Linear(784, n_units) l2 = Linear(n_units, 10)) Linear l2Linear l1 x yh1 W bias 0 5 9 W bias ReLU def forward(x): h1 = ReLU(l1(x)) return l2(h1) 19
  • 20. Define-by-Run: An interpreted language for neural network l  Idea ̶  Forward computaOon actually goes through computaOonal graph ̶  By remembering the history, the actual graph can be obtained l  Advantage ̶  Flexibility for new algorithms with complex components u  Ex. recurrent, recursive, aZenOon, memory, adversarial, etc ̶  IntuiOve coding with highly imperaOve network definiOon u  Ex. stochasOc network of which graph changes for each iteraOon l  Current drawbacks ̶  Graph is generated every Ome also for fixed networks ̶  No opOmizaOon even for staOc part of graphs u  JIT-like analysis and subgraph cache might be useful 20
  • 21. Basic components (1/2): Variable and Function l  Variable ̶  Variable wraps arrays (.data) ̶  It remembers parent funcOon (.creator) ̶  It will be assigned gradient (.grad) ̶  It keeps track of not only data but also computaOons l  FuncOon ̶  TransformaOon between Variable ̶  Stateless ̶  e.g. sigmoid, tanh, ReLU, maxpooling, dropout Function x y Variable x yh1 0 5 9 21
  • 22. Chain (MLP2) Basic components (2/2): Link and Chain l  Link = funcOon with state ̶  Parameters are also Variable and gradients will be assigned ̶  e.g. Linear (fully-connected), LSTM ConvoluOon2d, word-embedding l  Chain = network ̶  Chain has a set of child Link ̶  Forward computaOon is defined in . __call__() ̶  e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq, Link (Linear) y=f(W*x+b) x y W b Linear l2Linear l1 yh1 W bias W bias ReLU 22
  • 23. Backpropagation through computational graph l  Consider an objecOve (Link.Linear): L = f(x * w + b) l  This computes the value of L in forward computaOon, and simultaneously builds the following computaOonal graph l  The gradient of L can be computed with respect to any variables by backpropagaOon l  Then the opOmizer updates the value of parameters *x W + b f L is Variable is FuncOon 23
  • 24. Code sample (1/4): Multi-layer perceptron class MLP2(Chain): def __init__(self): super(MLP2, self).__init__( l1=L.Linear(784, 100), l2=L.Linear(100, 10), ) def __call__(self, x): h1 = F.relu(self.l1(x)) y = self.l2(h1) return y class Classifier(Chain): def __init__(self, predictor): super(Classifier, self). __init__(predictor=predictor) def __call__(self, x, t): y = self.predictor(x) self.accuracy = F.accuracy(y, t) self.loss = F.softmax_cross_entropy(y, t) return self.loss, self.accuracy # Model and optimizer setup model = Classifier(MLP2()) optimizer = optimizers.SGD() optimizer.setup(model) # training loop with minibatch for i in range(0, datasize, batchsize): x = Variable(x_tr[i:i+batchsize]) t = Variable(y_tr[i:i+batchsize]) model.zerograds() loss, acc = model(x, t) loss.backward() optimizer.update() Chain (MLP2) Linear l2Linear l1 yh1 W bias W bias ReLU 24
  • 25. Code sample (2/4): Convolutional neural network class AlexNet(Chain): def __init__(self): super(AlexNet, self).__init__( conv1=L.Convolution2D(3, 96, 11, stride=4), conv2=L.Convolution2D(96, 256, 5, pad=2), conv3=L.Convolution2D(256, 384, 3, pad=1), conv4=L.Convolution2D(384, 384, 3, pad=1), conv5=L.Convolution2D(384, 256, 3, pad=1), fc6=L.Linear(9216, 4096), fc7=L.Linear(4096, 4096), fc8=L.Linear(4096, 1000), ) def __call__(self, x, t): h = F.max_pooling_2d(F.relu( F.local_response_normalization(self.conv1(x))), 3, stride=2) h = F.max_pooling_2d(F.relu( F.local_response_normalization(self.conv2(h))), 3, stride=2) h = F.relu(self.conv3(h)) h = F.relu(self.conv4(h)) h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2) h = F.dropout(F.relu(self.fc6(h)), train=self.train) h = F.dropout(F.relu(self.fc7(h)), train=self.train) y = self.fc8(h) return y * ImageNet Classification with Deep Convolutional Neural Networks conv2d conv2d conv2d conv2d conv2d linear linear 25 linear
  • 26. Code sample (3/4): Recurrent neural network class SimpleRNN(Chain): def __init__(self, n_vocab, n_units): super(SimpleRNN, self).__init__( embed=L.EmbedID(n_vocab, n_units) x2h=L.Linear(n_units, n_units), h2h=L.Linear(n_units, n_units), h2y=L.Linear(n_units, n_vocab),) self.h = None def __call__(self, x): y, h_new = self.fwd_one_step(x, self.h) self.h = h_new return y def fwd_one_step(self, x, h): x = F.tanh(self.embed(x)) if h is None: h = F.tanh(self.x2h(x)) else: h = F.tanh(self.x2h(x) + self.h2h(h)) y = F.softmax(self.h2y(h)) return y, h x_1 h y_1 x_2 h y_2 x_3 h y_3 x_4 h y_4 BPTT length = 3 Input word OutputRecurrent state # Truncated BPTT (length=3) for i in range(0, datasize, batchsize): ... accum_loss += model(x, t) if i % bptt_length == 0: model.zerograds() accum_loss.backward() accum_loss.unchain_backward() optimizer.update() 26
  • 27. Code sample (4/4): Deep Networks with Stochastic Depth
    A paper published on arXiv, March 30, 2016
      - A variant of Residual Net that skips connections stochastically
          - Outperformed the original Residual Net (ImageNet 2015 winner, MSR)
          - Stochastic skip w/ survival probability (taken from G. Huang et al.)

        # Mock code in Chainer
        class StochasticResNet(Chain):
            def __init__(self, prob, size, …):
                super(StochasticResNet, self).__init__(
                    ## Define f[i] as same for Residual Net
                )
                self.p = prob  # Survival probabilities

            def __call__(self, h):
                for i in range(self.size):
                    b = numpy.random.binomial(1, self.p[i])
                    c = self.f[i](h) + h if b == 1 else h
                    h = F.relu(c)
                return h
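The skip rule in the mock code can be tried out without any framework. Below is a minimal scalar sketch: each block is an ordinary Python function, and a Bernoulli draw with the block's survival probability decides whether the block is applied or skipped. The names `stochastic_forward`, `blocks`, and `survival` are assumptions for illustration.

```python
import random

def relu(v):
    return max(0.0, v)

def stochastic_forward(h, blocks, survival, rng):
    """Apply residual blocks, skipping each with probability 1 - survival[i]."""
    for f, p in zip(blocks, survival):
        b = 1 if rng.random() < p else 0   # Bernoulli(p) gate
        c = f(h) + h if b == 1 else h      # residual branch is kept only when b == 1
        h = relu(c)
    return h

# Usage: two toy blocks, every block kept with probability 0.5
rng = random.Random(0)
out = stochastic_forward(1.0, [lambda v: 2 * v, lambda v: v + 1], [0.5, 0.5], rng)
```

With all survival probabilities at 1 the function reduces to an ordinary residual forward pass; with all at 0 every block is skipped.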
  • 28. Miscellaneous
      - Other features
          - Install with pip in one line: $ pip install chainer
          - Multi-GPU support by explicitly selecting the ID to use
          - Pre-trained Caffe model import from Model Zoo
          - Model serialization, save & load: HDF5 or NumPy npz
      - Future direction (not only for Chainer)
          - JIT-like optimization during Define-by-Run
          - Memory consumption reduction (GPU memory is still small)
          - Handling variable-length inputs without minibatch
          - Maximizing performance on multi-node & multi-GPU environments
  • 29. Agenda
      - Deep learning framework basics
      - Introduction to Chainer
      - CuPy: NumPy-compatible GPU library
      - Performance and applications
  • 30. CuPy: (partially-)NumPy-compatible GPU library
      - Motivation: NumPy + CUDA = CuPy
          - NumPy is the standard library in Python for numerical computation
          - CUDA is the standard API for high-performance computation on GPUs
          - Unfortunately, NumPy does NOT work with CUDA
      - CuPy supports:
          - Fast computation using NVIDIA's cuBLAS and cuDNN
          - Array indexing, slicing, transpose, and reshape
          - Most operations/functions in NumPy
              - Chainer v1.7.2 already supports more than 170 functions
          - User-defined functions and kernels
          - All dtypes, broadcasting, memory pool, etc.
  • 31. How to use CuPy
      - Usage of CuPy: just replace NumPy with CuPy

            import numpy, cupy
            enable_cupy = True
            xp = cupy if enable_cupy else numpy

      - Conversion between numpy.ndarray and cupy.ndarray

            w_c = cupy.asarray(numpy.ones(10))  # cupy.ndarray
            w_n = cupy.asnumpy(cupy.ones(10))   # numpy.ndarray

      - Ex. CPU/GPU-agnostic logsumexp function

            def logsumexp(x, axis=None):
                xp = cuda.get_array_module(x)  # Get CuPy or NumPy
                x_max = x.max(axis)
                exp_sum = xp.exp(x - x_max).sum(axis)
                return x_max + xp.log(exp_sum)
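For arrays with more than one dimension, reducing along an axis drops that axis, so subtracting the maximum only broadcasts correctly if the reduced axis is kept. A minimal sketch of a variant using `keepdims=True`, run here on NumPy only; the array module is passed in explicitly as the hypothetical `xp` argument instead of being looked up from the input.

```python
import numpy as np

def logsumexp(x, axis=None, xp=np):
    """Numerically stable log(sum(exp(x))) along axis; xp is the array module."""
    x_max = x.max(axis=axis, keepdims=True)           # keep the axis so x - x_max broadcasts
    exp_sum = xp.exp(x - x_max).sum(axis=axis, keepdims=True)
    out = x_max + xp.log(exp_sum)
    return xp.squeeze(out, axis=axis)                  # drop the reduced axis again

# Usage: large inputs do not overflow because the maximum is subtracted first
x = np.array([[1000.0, 1000.0, 1000.0], [0.0, 0.0, 0.0]])
row_lse = logsumexp(x, axis=1)
```

The same function body should work with any module that mirrors these NumPy calls, which is the idea behind writing against `xp`.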
  • 32. CuPy implementation: Optimized for performance & NumPy-compatibility
      - Use Cython for cupy.core & cupy.cuda
      - Dynamic code generation & compile
          - CUDA code is generated for the specific tensor dimension & data type
          - On-the-fly compile by nvcc and binary cache (faster after 1st use)
    Figure (software stack): cupy (tensor operations & functions); cupy.core (ndarray; ufunc, elementwise, reduction); cupy.cuda (CUDA Python wrapper); CUDA libraries (cuBLAS, cuRAND, cuDNN)
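  • 33. CuPy performance on linear algebra: 5 to 25 times faster than NumPy

        def test(xp):
            a = xp.arange(1000000).reshape(1000, -1)
            return a.T * 2

        test(numpy)
        t1 = ...  # timestamp expression missing in the source
        for i in range(1000):
            test(numpy)
        t2 = ...  # timestamp expression missing in the source
        print(t2 - t1)

        test(cupy)
        t1 = ...  # timestamp expression missing in the source
        for i in range(1000):
            test(cupy)
        t2 = ...  # timestamp expression missing in the source
        print(t2 - t1)

                               msec    speed up
        NumPy                 2,929       1.0
        CuPy                    585       5.0
        CuPy + Memory Pool      123      23.8

    Environment: Intel Core i7-4790 @ 3.60GHz, 32GB, GeForce GTX 970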
  • 34. Use CuPy for GPU-based computation
      - Supports three patterns as wrappers
          - ElementwiseKernel: for element-wise computation
          - ReductionKernel: for reduce operation along axis
          - ufunc: universal function as in NumPy
      - Ex. definition of an element-wise function

            squared_diff = cupy.ElementwiseKernel(
                'float32 x, float32 y',   # Input
                'float32 z',              # Output
                'z = (x - y) * (x - y)',  # Operation
                'squared_diff')           # Name

      - Usage (automatic broadcast and type check are supported)

            squared_diff(cupy.arange(10), 10)
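As a CPU reference for what the kernel computes, the same element-wise expression can be written in NumPy. This is only a sketch of the arithmetic and broadcasting, with inputs explicitly cast to float32 to match the kernel's declared types; `squared_diff_reference` is a hypothetical name, not a CuPy function.

```python
import numpy as np

def squared_diff_reference(x, y):
    """Element-wise (x - y)^2 in float32, with NumPy broadcasting."""
    x = np.asarray(x, dtype=np.float32)
    y = np.asarray(y, dtype=np.float32)
    return (x - y) * (x - y)

# Usage: the scalar 10 is broadcast against every element of the array
z = squared_diff_reference(np.arange(10), 10)
```

Here `z` holds the squares of 10, 9, ..., 1, one per input element.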
  • 35. Agenda
      - Deep learning framework basics
      - Introduction to Chainer
      - CuPy: NumPy-compatible GPU library
      - Performance and applications
  • 36. Public benchmark results (CNN): Chainer shows comparable performance
      - Forward computation is almost the same as with TensorFlow
      - Training with backward computation is slower, but this can be offset by the absence of compilation time while debugging/tuning
    Figure: two bar charts, forward computation (msec) and backward computation (msec), for AlexNet, GoogLeNet, VGG-A, and OverFeat; frameworks: Torch, TensorFlow, Chainer, Caffe (native); axis range 0 to 1200 msec. Taken from [link not preserved], using cuDNN except Caffe
  • 37. Chainer can benefit from latest CUDA libraries: Ex. Winograd algorithm in cuDNN v5
      - Conv3x3 is common in CNNs & is now computed with Winograd
      - State-of-the-art CNN models (e.g., GoogLeNet, VGG-A) can be accelerated up to 2.0x at test time (forward only)
      - Independently measured by a modified version of soumith/convnet-benchmarks
      - cuDNN v5 can be used in Chainer v1.8.0
    Figure: two bar charts, forward computation (msec) and backward computation (msec), cuDNN v4 vs. cuDNN v5, for AlexNet, GoogLeNet, VGG-A, and OverFeat; axis range 0 to 600 msec
  • 38. Algorithm implementation in Chainer: A Neural Algorithm of Artistic Style (Gatys et al., 2015)
      - https://
      - Content image (cat) + Style image = New artistic image
      - Main code (45 lines)
  • 39. Chainer in industry: Used in demonstrations & being commercialized
      - Many collaborations are ongoing with Chainer-based computer vision, deep reinforcement learning, etc.
      - Ex. 1: Chainer-controlled toy cars in the Toyota booth at CES 2016
      - Ex. 2: FANUC's highly accurate bin-picking robot at IREX 2015
          - 8 hours of training to reach expert level; commercialization by the end of 2016