Comparing On-The-Fly Accelerating Packages: Numba, TensorFlow, Dask, etc
Yukio Okuda
(freelance)
okf.yukio@gmail.com
PyConJP2018/9 Y. Okuda
Me= A Programmer over 40 Years= Joyful
2
[Chart: CPU clock (GHz) and number of real cores by year, from the one-core era through multi-core and Hyper-Threading]
GIL@Python
One Core: N-Threads ≤ 1-Thread; CPU-Bound ➡ Special (Vector Processor)
Multi Core: N-Threads@M-Cores = min(N, M); CPU-Bound ➡ General (low-price GPGPU)
IBM/SRB, SUN/process, DECα/VMS-thread, POSIX/pthread, Linux/pthread
✈Wiki-Xeon
PyConJP2018/9 Y. Okuda
Story(1/2)
3
How to accelerate CPU bound codes in Python
Fast Execution
Compile Languages:
CPython-extension
No GIL:
Cython, PyPy, Jython, IronPython, ..
Device dependent:
OpenMP, OpenACC, PyCuda
Fast Development
● Compatibility
● Portability
On-The-Fly (OTF)
Packages
PDF in clouds Codes in Appendix: ✍ Links: ✈GIL
Introduction
PyConJP2018/9 Y. Okuda
Story(2/2)
4
■ Speedups are shown, but in terms of time saved: 10✕ = 90%, 20✕ = 95%, 50✕ = 98%, 100✕ = 99% of the run time cut
[Chart: time (sec) vs. shots for Python, NumPy, NumPy+Tf, Python+Numba, CPython-Ext, NumPy+Tf@GPU, Python+Numba Thread, Tf-Graph-While@GPU, PyTorch/CuPy@GPU]
Note: Very few data transfer, No tune up at packages
1000 Loops
Monte Carlo ΠCalculation
PyConJP2018/9 Y. Okuda
H/W S/W
5
env-all Tf-cpu,gpu
Python 3.6 3.5
Conda 5.1.0 VirtualEnv
Mint Linux(Ubuntu 16.04)
CPU + GPU
Batch python on shell
SSH, NFS
CPU: i7-2630QM stepping=5
(Sandy Bridge’12 mobile)
Turbo=Off, EIST=Off
SSE SSE2 SSE4.1 SSE4.2 AVX
2.0 GHz 4Core 8HT
L1=256K, L2=1M, L3=6M
PCIe II 5GT/s
DDR3 16G 21.3G/s,swap off
QM77, NF9G(Jetway Co.)
GPU: GTX-1060
(Pascal GP-106)
1.5 GHz, 1280 cores
L2=1.5M(192bI/F)
PCIe II 5GT/s
DDR5 6G 8G/s
CUDA-9 CC-6.1
Test bench
PyConJP2018/9 Y. Okuda
Background
➊ Python Thread
➋ GIL
➌ CPython-Ext
➍ NumPy
PyConJP2018/9 Y. Okuda
Speeds of Process and Thread (1/2)
7
def add(n):
    a = 0
    for _ in range(n):
        a += 1

for n in [ .. ]:
    ts = time.monotonic()
    for _ in range(1000):
        f(n)
    te = time.monotonic()

def series(n):
    add(n)
    add(n)

def process(n):
    p1 = Process(target=add, args=(n,))
    p1.start()
    p2 = Process(target=add, ...
    p1.join(); p2.join()

def thread(n):
    t1 = Thread(target=add, args=(n,))
    t1.start()
    t2 = Thread(target=add, ...
    t1.join(); t2.join()
Background
PyConJP2018/9 Y. Okuda
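For reference, a minimal runnable assembly of the fragments above (standard library only; the repeat counts here are illustrative, not the ones used for the measurements):

import time
from threading import Thread
from multiprocessing import Process

def add(n):
    a = 0
    for _ in range(n):
        a += 1

def series(n):
    add(n); add(n)

def thread(n):
    t1 = Thread(target=add, args=(n,)); t2 = Thread(target=add, args=(n,))
    t1.start(); t2.start(); t1.join(); t2.join()

def process(n):
    p1 = Process(target=add, args=(n,)); p2 = Process(target=add, args=(n,))
    p1.start(); p2.start(); p1.join(); p2.join()

if __name__ == '__main__':              # guard needed for multiprocessing on some platforms
    for f in (series, thread, process):
        ts = time.monotonic()
        for _ in range(10):             # fewer repeats than the 1000 used on the slides
            f(10**5)
        print(f.__name__, time.monotonic() - ts)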
Speeds of Process and Thread (2/2)
8
■ Speed
●Thread 1× Series
▼
(25% down @ Turbo On)
●Process 1.8× Series
[Chart: time (sec) vs. shots for Thread, Series, Process at large shot counts]
■ Launch time
●Thread Zero
●Process 6 msec /each
[Chart: time (sec) vs. shots for Thread, Series, Process at small shot counts, showing launch time]
Background
PyConJP2018/9 Y. Okuda
Is Thread Safe? (1/2)
9
def add(n):
    global g
    for _ in range(n):
        g += 1

def sub(n):
    global g
    for _ in range(n):
        g -= 1

g = None
def a_s(n):
    global g
    g = 0
    t1 = Thread( .. add, .. n)
    t2 = Thread( .. sub, .. n)
    .. .. ..
    return g

for n in [ .. ]:
    gs = []
    for _ in range(1000):
        gs.append(a_s(n))
    n0 = not_zero_count(gs)
Background
PyConJP2018/9 Y. Okuda
Is Thread Safe? (2/2)
10
■ Time ≥ 8 msec ➡ not thread-safe (both global and local counters)
[Charts: run time (sec) vs. shots, and number of non-zero g results in gs vs. run time (sec)]
Background
PyConJP2018/9 Y. Okuda
Why not Safe? GIL
11
■ The GIL activates only one thread at a time to avoid object corruption ✈Dabeaz ✈Abhinav Ajitsaria
● GIL: Global Interpreter Lock
■ Threads are chopped into tslice intervals and can lose updates ✈A. Jesse
● tslice = 5 msec ● Errors appear from 8 msec
☞ For acceleration, avoid the GIL and Python object access
☞ For no errors, finish within a tslice or apply GIL-safe operations
[Diagram: Thread1 and Thread2 take turns holding the GIL, one tslice at a time]
Background
PyConJP2018/9 Y. Okuda
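The 5 msec tslice is CPython's thread switch interval, which can be inspected from Python; a small check, for illustration only:

import sys

print(sys.getswitchinterval())   # 0.005 sec by default in CPython 3
# sys.setswitchinterval(0.001)   # shorter slices: more switches, same GIL behaviour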
How to Avoid GIL
12
■ CPython-Extension:
1. Copy Python objects to C data
2. Apply the “Py_BEGIN_ALLOW_THREADS” C macro
3. Execute C-native codes or OTF codes
4. Apply the “Py_END_ALLOW_THREADS” C macro
5. Copy C data back to Python objects
[Diagram: Thread1/Thread2 timeline. CPython code holds the GIL; between Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS the C-native code runs with the GIL released, with “Copy in” before and “Copy out” after]
Background
PyConJP2018/9 Y. Okuda
Monte Carlo ΠCalculation
13
[Diagram: N random shots on a unit square; H shots fall inside the quarter circle]
H hits in the circle, targeting
N random shots at a square
π = 4 · H / N ✈WikiPi-2 ✈LLNL
Error/π = a · N^b ✈WikiPi-1
Python:

import random
def pin(n):
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r2 = x*x + y*y
        if r2 <= 1.:
            h += 1
    return 4. * h / n

C:

double pin(int n) {
    unsigned int s = time(NULL);
    int h = 0;
    for (int i = 0; i < n; ++i) {
        double x = (double)rand_r(&s) / (double)RAND_MAX;
        double y = (double)rand_r(&s) / (double)RAND_MAX;
        double r2 = x*x + y*y;
        if (r2 <= 1.)
            h += 1;
    }
    return 4. * (double)h / (double)n;
}
Background
PyConJP2018/9 Y. Okuda
Multi-Threaded ΠCalculation
14
■ Original: pin(n)
Get h hits in n shots ➡ 4 ·h/n
■ m Threading: pinm(n, m)
Launch (Map): h1 in n/m, h2 in n/m, ..., hm in n/m
h = sum(h1, h2, .., hm) (Reduce)
π = 4 · h / n
Background
PyConJP2018/9 Y. Okuda
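A sketch of pinm(n, m) following the Map/Reduce picture above, using concurrent.futures threads; with the pure-Python pin the threads still serialize on the GIL, so this only pays off once pin releases the GIL (CPython extension or nogil JIT):

from concurrent.futures import ThreadPoolExecutor
import random

def pin(n):                       # reference version returning the hit count
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            h += 1
    return h

def pinm(n, m):
    with ThreadPoolExecutor(max_workers=m) as tpe:
        hs = tpe.map(pin, [n // m] * m)      # Map: m partial hit counts
    h = sum(hs)                              # Reduce
    return 4.0 * h / (n // m * m)

print(pinm(10**6, 4))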
Π@CPython
15
■ Easy Operation (good tools and official documents)
■ Requires re-running setup at each CPython release
import cif
pi = cif.pin(n)

dist/cif-•••.egg
python setup.py •••
(Compile, Link, Deploy, Test)

cifmodule.c
#include <Python.h>
static PyObject *
pin( ••• ){
    Py_BEGIN_ALLOW_THREADS
    for (int i = 0; i < n; ++i){ •••
    Py_END_ALLOW_THREADS

setup.py
import setuptools
setup( ••• )

cif_test.py
import unittest
import cif
Background
PyConJP2018/9 Y. Okuda
Effects of Threads and Cores
16
☞ Speedup ≈ min(N, M)× minus overhead
N: # of threads, M: # of real cores
[Chart: relative speed vs. number of threads, measured vs. ideal; real cores to the left, Hyper-Thread cores to the right]
Background
PyConJP2018/9 Y. Okuda
Hard to Develop Threading
17
■ Design issues: out of scope of this talk
■ An issue in this trial: rand_r vs. random_r
● rand_r: low randomness, ideal speed-up ➡ selected
● random_r: good randomness, slows down under threading
● random_r is slower with threads ✈stackoverflow ● The standard gives no clear speed specification for multi-threading ✈open-std
● 80 stdlib functions are not thread-safe ✈opengroup ● Not thread-safe: rand, random, drand48, lrand48, mrand48
● “more standardization―for compilers, users, and libraries ..activation of threads” Shameem, p.291, Multi-Core Programming ✈Intel-Press
☞ Check the speeds of official thread-safe functions
[Chart: π error vs. # shots for rand_r and random_r]
[Chart: time (sec) vs. shots, two threads vs. no thread]
Background
PyConJP2018/9 Y. Okuda
NumPy Speedup
18
■ Converting to NumPy: 7.7✕
● Vectorize: move the “for loops” into functions
● NumPy vector/matrix functions are compiled C code
■ Not only numeric calculation:
● count_nonzero ● less_equal, less, .. ● sort, lexsort, ..
● where, searchsorted ● I/O
Python:

import random
def pin(n):
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r2 = x*x + y*y
        if r2 <= 1.:
            h += 1
    return 4. * h / n

NumPy:

import numpy as np
def np_pi(n):
    x = np.random.rand(n).astype(np.float64)
    y = np.random.rand(n).astype(np.float64)
    rs = np.add(np.multiply(x, x, dtype=np.float64),
                np.multiply(y, y, dtype=np.float64),
                dtype=np.float64)
    ones = np.ones(n, dtype=np.float64)
    lss = np.less_equal(rs, ones)
    hit = np.count_nonzero(lss)
    pi = np.float64(4.) * np.float64(hit) / np.float64(n)
    return pi
Background
PyConJP2018/9 Y. Okuda
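For reference, the same computation can be written more compactly; the explicit dtype arguments above only make the float64 intent visible (a hypothetical shorter form, not the code that was measured):

import numpy as np

def np_pi_short(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    return 4.0 * np.count_nonzero(x * x + y * y <= 1.0) / n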
Summary
19
➊Avoid GIL to speed up
➋ Apply GIL-safe operations for thread safety
➌ min(N, M)× acceleration
[Chart: time (sec) vs. shots for Python, NumPy, CPython-Ext, CPython-Ext@Thread]
Background
PyConJP2018/9 Y. Okuda
Numba (V0.38.0) ✈Official
●Background
●Accelerate on:
CPU, M-Core, CUDA
(SSE, AVX, AVX2, AVX-512)
●@numba.jit
Just in Time Compile
■ Few user’s guides ✈Conda2018Slide
■ An excellent review ✈Matthew Rocklin
■ Supported by Anaconda, Inc.
■ The Gordon and Betty Moore Foundation
■ GPU version free from end of 2017
■ Requires: mkl, mkl_fft, mkl_random, ncurses, llvmlite
■ CUDA 2.0 or above
PyConJP2018/9 Y. Okuda
Outline
21
■Two different systems
CPU CUDA
@numba.jit(• •)
def pin(n)
• • •
• • •
return out
pi = pin(100)
@numba.cuda.jit(• •)
def pin(n, out)
• • •
• • •
(no return)
pin[25, 40](100, pi)
●Many Python codes: ✈Official ✈NumPy
▼
Language: All except class, try, except, with, yield
▼
Function call: inner, closure, recursive
▼
Built-in: abs() bool complex divmod() enumerate() float int iter() len() min() max() next() print() range() round()
sorted() type() zip()
▼
NumPy: all() any() argmax() argmin() cumprod() cumsum() max() mean() min() nonzero() prod() std()
take() var() argsort() astype() copy() flatten() item() itemset() ravel() reshape() sort() sum() transpose() view()
▼
Modules: array, cmath, collections, ctypes, enum, math, operator, functools, random, cffi
● CUDA: kernel codes only ● NumPy: not supported in kernels
Numba
PyConJP2018/9 Y. Okuda
On CPU
22
■ @numba.jit() Compile/Execute compatible Python codes to LLVM
●Apply Python π✍ ➡ 21✕
Cf. manual conversion to a CPython extension ✍ ➡ 23✕
☞Comparable speed to manually converted C
●Apply NumPy π✍ ➡ 1✕
☞ NumPy functions are not accelerated
Cf. Python to NumPy ➡7.7✕
☞ JIT: 21/7.7 ≈ 3✕ the speed of the NumPy functions
▼
Numba: Python ➡LLVM ➡Python
▼
NumPy: (Python ➡C ➡Python)✕Repeat
Numba
PyConJP2018/9 Y. Okuda
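A minimal sketch of the @numba.jit variant described above (essentially the pure-Python pin with a decorator; the measured listing is in the Appendix):

import numba
import random

@numba.jit(nopython=True)
def nba_pin(n):
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            h += 1
    return 4.0 * h / n

nba_pin(10**3)        # first call compiles; time the second call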
Accelerate NumPy Indexing
23
■ JIT of NumPy indexing: 817✕ reported, 100✕ actual ✈Murillo
● A “for loop” and a vector-function operation, on List and NdArray, native vs. JIT

def for_add(n, vs):
    for i in range(n):
        vs[i] += 1

def np_add(n, vs):
    a = np.add(vs, 1)
[Chart: time (sec) vs. shots for For-NdArray, For-List, NpAdd-List, Jit-For-List, NpAdd-NdArray, Jit-For-NdArray]
▼ NdArray indexing is 3.8✕ slower than List ✈stackoverflow
▼ Indexing requires setup calculations and branches in the main loops
▼ np.add(NdArray) is 100✕ faster than np.add(List)
(JIT wrapping sketch below)
Numba
PyConJP2018/9 Y. Okuda
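The JIT columns in the chart come from simply wrapping the loop version; a sketch, using for_add as defined above:

import numba
import numpy as np

jit_for_add = numba.jit(nopython=True)(for_add)   # JIT the loop version as-is

vs = np.zeros(10**6)
jit_for_add(len(vs), vs)      # on a typed ndarray the loop compiles to native indexing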
On M-Core
24
■ All cores working:
➊ set @jit(parallel=True)
➋ change “range” to “numba.prange”
● Apply Python π ➡ 89✕ ➡ 4.4✕ of @jit()
● No way to control the # of cores
▼ Multi-user/multi-process setups need core assignment
■ @jit(nogil=True) + ThreadPoolExecutor gives that control ✍ (sketch below)
[Chart: relative speed vs. threads, measured vs. ideal; real cores vs. Hyper-Thread cores]
Numba
PyConJP2018/9 Y. Okuda
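A sketch of the parallel variant, assuming the same pure-Python loop body; whether the per-thread random streams are independent is exactly the rand_r/random_r issue discussed earlier:

import numba
import random

@numba.jit(nopython=True, parallel=True)
def nba_pin_par(n):
    h = 0
    for _ in numba.prange(n):     # prange instead of range: the loop is split over all cores
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            h += 1                # scalar reduction, handled by numba in parallel loops
    return 4.0 * h / n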
On CUDA
25
■ Non-compatible Python codes (details are out of scope)
● CUDA kernel codes in the definitions
▼ Python-like, not the C of PyCuda
● Insert “[#blocks, #threads]” in calls
▼ Ex. pin[25, 40](n)
● Rewriting π ✍ ➡ 1160✕ ➡ 152✕ of NumPy
▼ Use the 2nd run; the 1st includes 1.8 sec compile/load time
[Chart: time (sec) vs. shots for the CUDA 1st and 2nd run; the gap is the compile/load overhead]
Numba
PyConJP2018/9 Y. Okuda
Summary
26
➊ Convert to nogil functions
➋ Accelerate “for/while” loops
➌ Improve NumPy indexing
[Chart: time (sec) vs. shots for Python, CPU, Nogil@Thread, Parallel, CUDA]
Numba
PyConJP2018/9 Y. Okuda
Machine Learning Packages
● NumPy accelerators ● Kernel-less CUDA access ● Tensor objects
● Poor documents ➡ my interpretation ?
➊ TensorFlow (V1.9) ✈Official ✈奥田
● CPU, CUDA, (TPU, ROCm, Phi); own SIMD + (SSE, AVX, AVX2, AVX-512)
➋ PyTorch (V0.4.11) ✈Official ● CUDA
➌ CuPy (V4.1.0) –Chainer– ✈Official ● CUDA
PyConJP2018/9 Y. Okuda
Exec Modes
28
■ TensorFlow (tf): (CPU, CUDA) ✕ (Eager, Graph) = 4 modes
● Eager: Python is a direct executor for ordinary actions
● Graph: Python is a macro generator for computing graphs
● Eager if the 1st code is tf.enable_eager_execution(), else Graph
● Two pip packages: CPU, GPU (= GPU + CPU)
Implicit: the package sets the default device
Explicit: “with tf.device('/cpu:0'):” block (see the sketch below)
■ PyTorch (torch, pt): [CPU], CUDA = 2 (NN-Graph)
● torch.func(.., device=D, ..); D = device('cuda') or device('cpu')
● Implicit: auto-decided from the operands ➡ fast
● Explicit-2: torch.func(..).cuda() ➡ slow
■ CuPy (cp): CUDA = 1 (NN-Graph)
● Only CUDA; use NumPy for the CPU
ML Packages
PyConJP2018/9 Y. Okuda
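A minimal illustration of implicit vs. explicit placement in the TF 1.x graph API used here (a sketch, not the measured code):

import tensorflow as tf

# Implicit: the GPU package places ops on the GPU when one is visible.
a = tf.random_uniform([1000], dtype=tf.float64)

# Explicit: pin a sub-graph to the CPU even with the GPU package installed.
with tf.device('/cpu:0'):
    b = tf.random_uniform([1000], dtype=tf.float64)

with tf.Session() as sess:
    sess.run([a, b])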
CUDA
29
■ TensorFlow Eager ✍
➊ np. ➡ tf.  ➋ Change some function names  ➌ Add “tf.cast” to some functions  ➍ Select the CUDA env.
■ PyTorch ✍ / CuPy ✍ (Graph)
➊ np. ➡ pt. / cp.  ➋ Change some function names / no change  ➌ Add “device” options / no change  ➍ Set the global device type / no change
■ TensorFlow Graph ✍
➊ Create “tf.placeholder” inputs  ➋ Run a function with the inputs
■ TensorFlow CPU
● Execute the same codes on the CPU env.
[Chart: time (sec) vs. shots for Tf-Eager, Tf-Graph, CuPy, PyTorch]
ML Packages
PyConJP2018/9 Y. Okuda
CPU
30
■ TensorFlow ✍ 2.4✕, 3.8✕: 8 cores running SIMD ?
■ PyTorch ✍ 0.7✕: useful for CUDA-less develop/debug
[Chart: time (sec) vs. shots for PyTorch, NumPy, Tf, Tf-Graph]
TensorFlow / PyTorch
■ Eager is in progress: more functional and faster ?
[Chart: time (sec) vs. shots for Conda env, VirtualEnv, Conda-MKL]
● V1.5 @ Jan. 2018: contribution version ✈奥田
● V1.7: moving out of contribution
● V1.8: SSE, AVX link
● V1.9 @ Aug.: Conda links Intel MKL ✈Conda
MKL: Math Kernel Library (BLAS, LAPACK, ScaLAPACK, FFT, NN, ..) ✈Intel
● V?: contribution AutoGraph ✈GitHub
ML Packages
PyConJP2018/9 Y. Okuda
TensorFlow Graph
31
■ Advanced computing graph
● While, Branch, Parallel, Reduce, Scatter, etc. in CUDA
● Concurrent main-memory accesses from CUDAs and CPUs
▼ Written with non-portable special control functions, not Python: a macro language
▼ Hard to understand the functions, but contrib.AutoGraph converts “for, if, ..” to Graph
● Slower than PyTorch in the π calculation
● 1000 While@CUDA ✍  ● 10 Parallel@CUDA ✍
[Charts: time (sec) vs. shots for the While and Parallel graph variants]
PyConJP2018/9 Y. Okuda
Overhead (OH)
32
■ Negligible OHs for heavy functions such as fft, cv, solvers, etc.
● TensorFlow: tf.(linalg, math, image, distributions, sets, strings), tf.contrib.(linalg, integrate, image, ffmpeg, signal, timeseries)
● CuPy: linalg, math, fft
■ Prediction of array-fetch OHs in ordinary cases (vs. NumPy): ➊ CuPy-Array 1/16✕ ➋ CuPy-Scalar (sketch below)
[Diagram: NumPy path np.RNG(n) ➡ xs ➡ xs[0] on the CPU; CuPy array path cp.RNG(n) ➡ xs ➡ cp.asnumpy ➡ nd ➡ nd[0]; CuPy scalar path cp.RNG(n) ➡ xs ➡ xs[0] ➡ cp.asnumpy per scalar. RNG: Random Number Generator]
[Chart: time (sec) vs. shots for the Array and Scalar paths]
▼ Transfer time from CUDA to CPU
▼ Jump caused by cache ?
ML Packages
PyConJP2018/9 Y. Okuda
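The two CuPy paths in the chart, roughly sketched (cp.asnumpy is the explicit device-to-host copy; pulling single elements forces per-element synchronisation and transfer):

import cupy as cp

xs = cp.random.rand(10**6)

nd = cp.asnumpy(xs)      # Array path: one bulk copy to the host, then plain NumPy indexing
x0 = nd[0]

x1 = float(xs[0])        # Scalar path: each element read synchronises and transfers separately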
All Overheads
33
■ Accelerating a function “r = f(a1, a2)”
● NumPy: everything stays in CPU memory inside def f(p1, p2): ... return r
● Accelerator: a1, a2 are copied in, the body runs on the device, r is copied out
[Diagram: CPU vs. accelerator data flow with copy-in before and copy-out after f; charts of the resulting copy-in/copy-out times]
ML Packages
PyConJP2018/9 Y. Okuda
Tensor
34
■ Bridge between CPU and Accelerator ?
[Diagram: CPU-side arguments pass through a Tensor layer (copy in / copy out / others / buffer-cache) to the accelerator]
● copy in (create a tensor object):
TensorFlow: convert_to_tensor(nd)   PyTorch: tensor(nd)   CuPy: array(nd)   (nd: NdArray)
● copy out (convert to NdArray):
TensorFlow: t_obj.numpy()   PyTorch: t_obj.cpu().numpy()   CuPy: asnumpy(t_obj)   (t_obj: tensor object)
● Others: ▼ neural-network functions ▼ MM-Direct: scatter read/write ▼ “if”, “while”
● Buffer/Cache ✈PyTorch ✈DlPack ▼ not stored in CPU memory, cf. NumPy functions ▼ •••
(round-trip sketch below)
ML Packages
PyConJP2018/9 Y. Okuda
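The copy-in/copy-out calls from the table as a round trip, PyTorch shown (CuPy would be cp.array(nd) / cp.asnumpy(t)); assumes a CUDA device is available:

import numpy as np
import torch

nd = np.random.rand(1000)

t = torch.tensor(nd, device='cuda')   # copy in: host ndarray -> device tensor
r = (t * t).cpu().numpy()             # copy out: device tensor -> host ndarray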
Summary
35
➊ CuPy: NumPy-compatible CUDA
☞ TensorFlow: CPU-SIMD/CUDA/.., application modules
☞ PyTorch: debugging on CPU
☞ Consider the copy-in/out overhead
[Chart: time (sec) vs. shots for NumPy, Tf@CPU, Tf@GPU, CuPy@GPU, Tf-While@GPU, PyTorch@GPU]
ML Packages
PyConJP2018/9 Y. Okuda
Dask (V0.18.0) ✈Official
● Background
● “Delayed”: a simple graph for threading
■ Answer of PyData to Col. W/O-MM-Limit: Hadoop ➡ Arrow ➡ 7 systems + Pandas ✈Official ✈Mckinney, HANA (SAP), RevolutionR (MS) ✈Official
■ Conda + DARPA, NSF, Gordon Moore Found., HHMI
■ Expands NumPy, Pandas, Scikit-Learn
■ Parallel computing: ● Process: Futures ● Thread: Delayed
PyConJP2018/9 Y. Okuda
Graph for Multi-Threading
37
■ “delayed” defines the nodes of the parallel computation ✍
● mn.visualize() at m=3 shows the three get_pi nodes ① ② ③ feeding np.mean

cnt = int(n / m)
ps = []
for _ in range(m):
    p = dask.delayed(get_pi)(cnt)
    ps.append(p)
mn = dask.delayed(np.mean)(ps)
pi = mn.compute()   # Execute

■ Applied to all the get_pi variants with m=3 (cleaned-up sketch below)
Dask
PyConJP2018/9 Y. Okuda
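A cleaned-up, self-contained version of the delayed graph above (get_pi stands for whichever π function is being measured; the full listing is in the Appendix):

import dask
import numpy as np

def dask_pi(n, m):
    cnt = n // m
    ps = [dask.delayed(get_pi)(cnt) for _ in range(m)]   # m independent Map nodes
    mn = dask.delayed(np.mean)(ps)                       # Reduce node
    return mn.compute(scheduler='threads')               # delayed defaults to the thread scheduler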
The Results at 3 Threads
38
■ NumPy shows little improvement [chart: NumPy vs. Dask]
● The nogil behaviour of the ufuncs ✈HP affects acceleration
▼ Short intervals of “add, multiply, less_equal”
■ No-GIL functions show good improvement [chart: CPython/Dask, NogilPy/Dask]
● ThreadPoolExecutor showed: ▼ 3✕ at CPython ▼ 3✕ at NogilPy
■ The others show no improvement; CuPy may have nogil functions [charts: Python/Dask, Python@Jit/Dask, Tf-CPU/Dask, CuPy/Dask]
Dask
PyConJP2018/9 Y. Okuda
Delayed vs ThreadPoolExecutor
39
■ NogilPy with ThreadPool shows lower launch overhead and higher speed
● Delayed  ● ThreadPool
[Charts: time (sec) vs. shots at several thread counts for each scheduler; relative slope and relative speed vs. threads, measured vs. ideal]
Dask
PyConJP2018/9 Y. Okuda
Summary
40
➊ No guide about GIL-safety ● Only “+=, -=” are prohibited, without reasoning
➋ Large overheads for the π calculation
■ A tool for Dask components ?
■ Too early to evaluate:
➊ NumPy has nogil functions
➋ CuPy may have nogil functions
● PyTorch: freeze  ● TensorFlow@CPU: segmentation fault
[Chart: time (sec) vs. shots for NumPy, NumPy@Thread, NogilPy@Thread]
Dask
PyConJP2018/9 Y. Okuda
Threading and Nogil
● ThreadPoolExecutor
➊ Confirm the nogil-ness of CuPy
➋ GIL-safe prediction
➌ Nogil-forced NumPy
PyConJP2018/9 Y. Okuda
NumPy vs CuPy
42
■ NumPy partially nogil, CuPy fully nogil ?
● NumPy  ● CuPy
[Charts: time (sec) vs. shots at several thread counts; relative speed and relative slope vs. threads, measured vs. ideal]
Threading and Nogil
PyConJP2018/9 Y. Okuda
Confirm CuPy
43
■ Error/π = a · N^b ✈WikiPi-1
[Chart: absolute relative error vs. # shots, 1 loop; data, CuPy@T8, NumPy error]
● CuPy at 8 threads ▼ thread-safe RNG ▼ parallel execution in CUDA
● NumPy at 8 threads ▼ GIL error caused by “h = 0; for v in lss: if v == 1: h = h + 1”, not by “+=”
Threading and Nogil
PyConJP2018/9 Y. Okuda
GIL-Safe Prediction
44
■ Almost impossible to predict GIL-safety: local functions turn out safe or unsafe non-deterministically

# def rng_count(n) ✍ additionally does:  x = np.random.rand(n)
# def count(n):
ones = np.ones(n)
c = np.count_nonzero(ones)
return c    # n == c expected

● count: 14 errors; no error @ T2,3,4 on the test bench; no error on Intel Atom ✍
● rng_count: no error
☞ Apply forced-nogil functions
[Chart: time (sec) vs. N for rng_count and count, 1 loop]
Threading and Nogil
PyConJP2018/9 Y. Okuda
Numba JIT Options
45
■ Set nopython=True for a nogil guarantee ?
● Local objects are stored in a heap storage whose accesses should be protected by mutexes
● The heap-storage accesses are in fact controlled by GIL block intervals, not by per-access mutexes
Guaranteed: @jit(nogil=True, nopython=True) (sketch below)
Non-guaranteed: @jit(nogil=True, nopython=False)
[Diagram: Thread-1..3 with variables/namespaces and LLVM objects, releasing and catching the GIL, all accessing the Python heap storage through an entry-object manager]
Threading and Nogil
PyConJP2018/9 Y. Okuda
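The guaranteed combination from this slide, sketched with the thread pool used throughout (the measured version is in the Appendix):

import numba
from concurrent.futures import ThreadPoolExecutor

@numba.jit(nogil=True, nopython=True)   # nopython=True: no Python objects, so the GIL really is released
def work(n):
    s = 0.0
    for i in range(n):
        s += i * 0.5
    return s

with ThreadPoolExecutor(max_workers=4) as tpe:
    futures = [tpe.submit(work, 10**7) for _ in range(4)]
    results = [f.result() for f in futures]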
Nogil NumPy by Numba
46
■ Some NumPy functions require rewriting for guaranteed nogil
[Charts: time (sec) vs. shots for rewritten vs. original at 1 thread; relative speed vs. threads, measured vs. ideal]
● Rewriting slows down to 0.02✕: h = count_nonzero(lss) becomes “h = 0; for v in lss: if v == 1: h = h + 1”
● Numba speeds up 1.6✕
● 6 threads speed up 3.2✕, 5✕ of the original
Threading and Nogil
PyConJP2018/9 Y. Okuda
Summary
47
➊ Apply nogil functions for thread safety
■ Set nopython=True with nogil=True in numba.jit
➋ Almost impossible to predict GIL-safety
➌ CuPy parallel execution in CUDA ?
[Chart: time (sec) vs. shots for NumPy and Nogil-NumPy]
Threading and Nogil
PyConJP2018/9 Y. Okuda
Conclusion
48
Execution Time Confirmation (ETC) on run-time signatures showed:
➊ Ideal threading acceleration = min(N, M)
➋ A comparison of On-The-Fly packages: ● Numba ● TensorFlow ● PyTorch ● CuPy ● Dask
➌ Basic issues and solutions: ● GIL ● Nogil ● GIL-Safe ● Threading ● Graph ● NumPy indexing ● Copy overhead
Enjoy On-The-Fly in your own ways ✍
PyConJP2018/9 Y. Okuda
Appendix
MIT License
Copyright (c) 2018 Yukio Okuda

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
PyConJP2018/9 Y. Okuda
Numba
51
Nogil

import numba
import random
from concurrent.futures import ThreadPoolExecutor

@numba.jit(nogil=True, nopython=True)
def nba_pi_nogil(n):
    hit = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    return 4. * hit / n

tpe = ThreadPoolExecutor(max_workers=12)

def nba_pi_nogil_tp_nm(n, m):
    global tpe
    cnt = int(n / m)
    if cnt < 1:
        cnt = 1
    ans = []
    for i in range(m):
        ans.append(tpe.submit(nba_pi_nogil, cnt))
    pi = 0.
    for f in ans:
        pi += f.result()
    return pi / m

print('Test', nba_pi_nogil_tp_nm(10**5, 4))

CUDA

import numba
import numpy as np
from numba.cuda.random import xoroshiro128p_uniform_float64
from numba.cuda.random import create_xoroshiro128p_states

@numba.cuda.jit()
def nba_cuda(n, pi, rng):
    thread_id = numba.cuda.grid(1)
    hit = 0
    for _ in range(n):
        x = xoroshiro128p_uniform_float64(rng, thread_id)
        y = xoroshiro128p_uniform_float64(rng, thread_id)
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    pi[thread_id] = 4. * hit / n

def nba_cuda_rec(n):
    threads_per_block = 25
    blocks = 40
    rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
    pis = np.ones(threads_per_block * blocks, dtype=np.float64)
    nba_cuda[blocks, threads_per_block](n, pis, rng_states)
    return pis.mean()

print('Test', nba_cuda_rec(10**5))

Appendix
PyConJP2018/9 Y. Okuda
ML: TfEager, PyTorch, CuPy
52
TensorFlow-CPU/CUDA-Eager

import tensorflow as tf
tf.contrib.eager.enable_eager_execution()
# tf.enable_eager_execution()

def tf_pin(n):
    xs = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    ys = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
    ones = tf.ones([n], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.count_nonzero(lss)
    pi = tf.divide(tf.multiply(tf.cast(4., tf.float64), tf.cast(hit, tf.float64)),
                   tf.cast(n, tf.float64))
    return pi.numpy()

print('Test', tf_pin(10**5))

CuPy-CUDA

import cupy as cp
import numpy as np

def cp_pi_gpu(n):
    x = cp.random.rand(n, dtype=cp.float64)
    y = cp.random.rand(n, dtype=cp.float64)
    rs = cp.add(cp.multiply(x, x, dtype=np.float64),
                cp.multiply(y, y, dtype=np.float64),
                dtype=np.float64)
    ones = cp.ones(n, dtype=cp.float64)
    lss = cp.less_equal(rs, ones)
    hit = cp.count_nonzero(lss)
    pi = 4. * float(hit) / n
    return pi

PyTorch-CPU

import torch
torch.set_default_dtype(torch.float64)

def pt_pi_cpu(n):
    x = torch.rand(n, dtype=torch.float64)
    y = torch.rand(n, dtype=torch.float64)
    rs = torch.add(torch.mul(x, x), torch.mul(y, y))
    ones = torch.ones(n, dtype=torch.float64)
    lss = torch.le(rs, ones)
    hit = torch.nonzero(lss).size()[0]
    pi = 4. * hit / n
    return pi

print('Test', pt_pi_cpu(10**5))

PyTorch-CUDA

import torch
torch.set_default_dtype(torch.float64)
DEVICE = torch.device('cuda')

def pt_pi_gpu_all(n):
    x = torch.rand(n, device=DEVICE)
    y = torch.rand(n, device=DEVICE)
    rs = torch.add(torch.mul(x, x), torch.mul(y, y))
    ones = torch.ones(n, device=DEVICE)
    lss = torch.le(rs, ones)
    hit = torch.nonzero(lss).size()[0]
    return 4. * hit / n

print('Test', pt_pi_gpu_all(10**5))

Appendix
PyConJP2018/9 Y. Okuda
ML: TfGraph
53
TensorFlow-Simple Graph

import tensorflow as tf

def tf_pin(n):
    xs = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    ys = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
    ones = tf.ones([n], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.count_nonzero(lss)
    pi = tf.divide(tf.multiply(tf.cast(4., tf.float64), tf.cast(hit, tf.float64)),
                   tf.cast(n, tf.float64))
    return pi

tf_n = tf.placeholder(tf.int32, [], name='n')
tf_graph = tf_pin(tf_n)
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(n):
    pi = session.run(tf_graph, feed_dict={tf_n: n})
    return pi

if __name__ == "__main__":
    print('Test', get_pi(10**5))

TensorFlow-While Graph

import tensorflow as tf
from tf_graph_simple import tf_pin

def tf_graph_pin_while_sub(i, n, pis):
    pis = tf.add(pis, tf_pin(n))
    return pis

def tf_graph_pin_while(n, loop):
    i = tf.constant(0)
    pis = tf.constant(0., dtype=tf.float64)
    i, pis = tf.while_loop(
        lambda i, pis: tf.less(i, loop),
        lambda i, pis: (tf.add(i, 1), tf_graph_pin_while_sub(i, n, pis)),
        [i, pis])
    pi = tf.divide(pis, tf.cast(loop, tf.float64))
    return pi

tf_n = tf.placeholder(tf.int32, [], name='n')
tf_loop = tf.placeholder(tf.int32, [], name='loop')
tf_graph_while = tf_graph_pin_while(tf_n, tf_loop)
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(n):
    pi = session.run(tf_graph_while, feed_dict={tf_n: n, tf_loop: 1000})
    return pi

print('Test', get_pi(10**5))

Appendix
PyConJP2018/9 Y. Okuda
ML: TfGraph, Dask
54
TensorFlow-Parallel Graph

import tensorflow as tf

M = 10
m = tf.placeholder(tf.int32, [], name='m')
n = tf.placeholder(tf.int32, [], name='n')
step = tf.cast(tf.divide(n, m), dtype=tf.int32)
hit = tf.zeros([], dtype=tf.int64, name='hit')
for _ in range(M):
    xs = tf.random_uniform(shape=[step], minval=0., maxval=1., dtype=tf.float64)
    ys = tf.random_uniform(shape=[step], minval=0., maxval=1., dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
    ones = tf.ones([step], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.add(hit, tf.count_nonzero(lss, dtype=tf.int64))
pi = tf.divide(tf.multiply(tf.cast(4., tf.float64), tf.cast(hit, tf.float64)),
               tf.cast(n, tf.float64))
ans = pi
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(in_n, in_m):
    pi = session.run(ans, feed_dict={n: in_n, m: in_m})
    return pi

print('Test', get_pi(10**5, 10))

Dask-Numba

import numpy as np
import random
import dask
import numba

@numba.jit(nogil=True)
def get_pi(n):
    hit = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    return 4. * hit / n

def dsk_nba_pi_nogil(n, m, v=False):
    cnt = int(n / m)
    ps = []
    for _ in range(m):
        p = dask.delayed(get_pi)(cnt)
        ps.append(p)
    mn = dask.delayed(np.mean)(ps)
    if v:
        mn.visualize(optimize_graph=True)
        pi = 0
    else:
        pi = mn.compute()
    return pi

# visualize() requires python-graphviz and the Graphviz utility;
# it generates ./mydask.png
# dsk_nba_pi_nogil(10**5, 3, v=True)
print('Test', dsk_nba_pi_nogil(10**5, 3))

Appendix
PyConJP2018/9 Y. Okuda
Miscellaneous
55
GIL-Safe

import numpy as np
from concurrent.futures import ThreadPoolExecutor

tpe = ThreadPoolExecutor(max_workers=25)

def rng_count(n):
    x = np.random.rand(n).astype(np.float64)
    ones = np.ones(n, dtype=np.float64)
    c = np.count_nonzero(ones)
    return c

def count(n):
    ones = np.ones(n, dtype=np.float64)
    c = np.count_nonzero(ones)
    return c

def tpe_pi_nm_min(n, m, f):
    global tpe
    ts = []
    for i in range(m):
        ts.append(tpe.submit(f, n))
    pis = []
    for t in ts:
        pis.append(t.result())
    return min(pis)

for n in (7*10**6, 8*10**6, 9*10**6, 10**7):
    c = tpe_pi_nm_min(n, 9, count)
    print("count:", n == c, n, c)
    c = tpe_pi_nm_min(n, 9, rng_count)
    print("rng_count:", n == c, n, c)

GIL-Safe-Note
Results of the prints depend on the executing machine.
Benchmark machine:
count: False 7000000 34302
rng_count: True 7000000 7000000
count: False 8000000 10750
rng_count: True 8000000 8000000
count: False 9000000 525822
rng_count: True 9000000 9000000
count: False 10000000 455166
rng_count: True 10000000 10000000
Intel Atom N3150 @ 1.60GHz, 4 cores, no Hyper-Thread, stepping=3: all True !!
Appendix
PyConJP2018/9 Y. Okuda