Comparing On-The-Fly Accelerating Packages: Numba, TensorFlow, Dask, etc.
PyConJP 2018/9
Yukio Okuda (freelance)
okf.yukio@gmail.com
Me = a programmer for over 40 years = joyful
[Chart: CPU clock (GHz) and real-core counts by year, across the one-core, multi-core, and hyper-thread eras]
GIL@Python:
● One core: N threads ≤ 1 thread; accelerating CPU-bound code needed special hardware (vector processors)
● Multi core: N threads on M cores = min(N, M); accelerating CPU-bound code is now general (low-price GPGPUs)
Thread history: IBM/SRB, SUN/process, DECα/VMS-thread, POSIX/pthread, Linux/pthread ✈Wiki-Xeon
Story (1/2)
How to accelerate CPU-bound code in Python?
Fast execution:
● Compiled languages: CPython extensions
● No GIL: Cython, PyPy, Jython, IronPython, ..
● Device-dependent: OpenMP, OpenACC, PyCUDA
Fast development:
● Compatibility ● Portability
➡ On-The-Fly (OTF) packages
PDF in clouds. Codes in Appendix: ✍  Links: ✈ (e.g. ✈GIL)
Introduction
Story (2/2)
■ Results are shown as speedups; note that 10✕ cuts 90% of the time, 20✕ 95%, 50✕ 98%, and 100✕ 99%.
[Chart: time (sec) vs # shots, Monte Carlo π calculation, 1000 loops; series: Python, NumPy, NumPy-Tf, Python-Numba, CPython-Ext, NumPy-Tf@GPU, Python-Numba-Thread, TfGraphWhile@GPU, PyTorch/CuPy@GPU]
Note: very few data transfers; no tuning inside the packages
H/W & S/W (test bench)
S/W: Python 3.6 (env-all) / 3.5 (Tf-cpu,gpu); Conda 5.1.0 / VirtualEnv; Mint Linux (Ubuntu 16.04); CPU + GPU; batch python on a shell; SSH, NFS
CPU: i7-2630QM stepping=5 (Sandy Bridge '12 mobile); Turbo=Off, EIST=Off; SSE, SSE2, SSE4.1, SSE4.2, AVX; 2.0 GHz, 4 cores, 8 HT; L1=256K, L2=1M, L3=6M; PCIe II 5 GT/s; DDR3 16G 21.3 G/s, swap off; QM77, NF9G (Jetway Co.)
GPU: GTX-1060 (Pascal GP-106); 1.5 GHz, 1280 cores; L2=1.5M (192-bit I/F); PCIe II 5 GT/s; DDR5 6G 8 G/s; CUDA-9, CC-6.1
Background
➊ Python Thread ➋ GIL ➌ CPython-Ext ➍ NumPy
Speeds of Process and Thread (1/2)
import time
from multiprocessing import Process
from threading import Thread

def add(n):
    a = 0
    for _ in range(n):
        a += 1

def series(n):
    add(n)
    add(n)

def process(n):
    p1 = Process(target=add, args=(n,))
    p1.start()
    p2 = Process(target=add, args=(n,))
    p2.start()
    p1.join(); p2.join()

def thread(n):
    t1 = Thread(target=add, args=(n,))
    t1.start()
    t2 = Thread(target=add, args=(n,))
    t2.start()
    t1.join(); t2.join()

for n in [ .. ]:                 # the shot counts under test
    ts = time.monotonic()
    for _ in range(1000):
        f(n)                     # f is series, process, or thread
    te = time.monotonic()
Background
Speeds of Process and Thread (2/2)
■ Speed
● Thread: 1✕ Series (25% down with Turbo on)
● Process: 1.8✕ Series
[Chart: time (sec) vs # shots: Thread, Series, Process]
■ Launch time
● Thread: ~zero
● Process: 6 msec each
[Chart: time (sec) vs # shots at low counts: Thread, Series, Process]
Background
Is Thread Safe? (1/2)
from threading import Thread

g = None

def add(n):
    global g
    for _ in range(n):
        g += 1

def sub(n):
    global g
    for _ in range(n):
        g -= 1

def a_s(n):
    global g
    g = 0
    t1 = Thread(target=add, args=(n,))
    t2 = Thread(target=sub, args=(n,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return g                     # 0 if the run was thread-safe

for n in [ .. ]:
    gs = []
    for _ in range(1000):
        gs.append(a_s(n))
    n0 = not_zero_count(gs)      # how many of the 1000 runs were corrupted
Background
Is Thread Safe? (2/2)
■ Runs taking ≥ 8 msec are not thread-safe, with both global and local variables
[Charts: time (sec) vs # shots, and the count of non-zero results in gs; corruption appears once a run exceeds ~8 msec]
Background
Why not Safe? GIL
■ The GIL activates one thread at a time to avoid object corruption ✈Dabeaz ✈Abhinav Ajitsaria
● GIL: Global Interpreter Lock
■ Threads are chopped into tslice intervals, and updates can be lost ✈A. Jesse
● tslice = 5 msec ● Errors appear from 8 msec
☞ For acceleration: avoid the GIL and Python object accesses
☞ For no errors: finish within a tslice, or apply GIL-safe operations
[Diagram: the GIL passing between Thread1 and Thread2 at every tslice]
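The 5 msec tslice is CPython's thread switch interval; a minimal check using the standard sys API (an illustration added here, not from the slides):

import sys
print(sys.getswitchinterval())    # 0.005 s by default -- the "tslice" above
sys.setswitchinterval(0.005)      # tunable, but tuning makes races rarer, not safe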
Background
How to Avoid GIL
■ CPython-Extension:
1. Copy Python objects to C data
2. Apply the "Py_BEGIN_ALLOW_THREADS" C macro
3. Execute C-native codes or OTF codes
4. Apply the "Py_END_ALLOW_THREADS" C macro
5. Copy C data back to Python objects
[Diagram: Thread1 stays under CPython holding the GIL; Thread2 copies in, runs C-native code between Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS with the GIL avoided, then copies out]
Background
Monte Carlo π Calculation
H hits in the circle when targeting N random shots at a square:
π = 4·H/N ✈WikiPi-2 ✈LLNL (the quarter circle covers π/4 of the unit square, so H/N ≈ π/4)
Error/π = a·N^b ✈WikiPi-1
Python:
import random

def pin(n):
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r2 = x*x + y*y
        if r2 <= 1.:
            h += 1
    return 4. * h / n
C:
double pin(int n) {
    unsigned int s = time(NULL);
    int h = 0;
    for (int i = 0; i < n; ++i) {
        double x = (double)rand_r(&s) / (double)RAND_MAX;
        double y = (double)rand_r(&s) / (double)RAND_MAX;
        double r2 = x*x + y*y;
        if (r2 <= 1.)
            h += 1;
    }
    return 4. * (double)h / (double)n;
}
Background
Multi-Threaded π Calculation
■ Original: pin(n) gets h hits in n shots ➡ 4·h/n
■ m-Threading: pinm(n, m)
Launch m threads; Map: thread i gets h_i hits in n/m shots; Reduce: h = sum(h_1, h_2, .., h_m) ➡ 4·h/n (a sketch follows)
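A minimal sketch of pinm(n, m), assuming pin releases the GIL (e.g. a CPython extension or a nogil-jitted function); with plain Python code the threads would serialize:

from concurrent.futures import ThreadPoolExecutor

def pinm(n, m):
    # Map: m threads, each estimating pi from n/m shots
    with ThreadPoolExecutor(max_workers=m) as tpe:
        fs = [tpe.submit(pin, n // m) for _ in range(m)]
        # Reduce: averaging the m partial estimates equals 4*sum(h_i)/n here,
        # since every thread takes the same n/m shots
        return sum(f.result() for f in fs) / m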
Background
π @ CPython
■ Easy operation (good tools and official documents)
■ Requires rerunning setup at each CPython release

import cif
pi = cif.pin(n)

python setup.py • • •  ➡ dist/cif-• • •.egg
(compile, link, deploy, test)

cifmodule.c:
#include <Python.h>
static PyObject *
pin( • • • ){
    • • •
    Py_BEGIN_ALLOW_THREADS
    for (int i = 0; i < n; ++i){ • • •
    Py_END_ALLOW_THREADS
    • • •

setup.py:
import setuptools
setup( • • • )

cif_test.py:
import unittest
import cif
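A minimal setup.py sketch for the cif extension above (only the names shown on the slide; the elided options stay elided):

from setuptools import setup, Extension

setup(name='cif',
      ext_modules=[Extension('cif', sources=['cifmodule.c'])])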
Background
Effects of Threads and Cores
☞ Speedup = min(N, M)✕ minus overhead
(N: # of threads, M: # of real cores)
[Chart: relative speed vs # threads; real cores ← → hyper-thread cores; measured speed vs ideal]
Background
Hard to Develop Threading
■ Design issues: out of scope of this talk
■ An issue in this trial: rand_r vs random_r
● rand_r: low randomness, ideal speedup ➡ selected
● random_r: good randomness, slows down under threading ✈stackoverflow
● The standard gives no clear speed specification for multi-threading ✈open-std
● 80 stdlib functions are not thread-safe ✈opengroup (among them: rand, random, drand48, lrand48, mrand48)
● "more standardization ... for compilers, users, and libraries .. activation of threads", Shameem, p.291, Multi-Core Programming ✈Intel-Press
☞ Check the speeds of officially thread-safe functions
[Chart: π error vs # shots (0e+00 to 1e+05): Rand_r, Random_r; annotations: 0.0001, -0.005]
[Chart: time (sec) vs # shots: Two Threads, No Thread]
Background
NumPy Speedup
■ Converting to NumPy: 7.7✕
● Vectorize: move "for loops" into functions
● NumPy vector/matrix functions are compiled C codes
■ Not only numeric calculation:
● count_nonzero ● less_equal, less, .. ● sort, lexsort, .. ● where, searchsorted ● I/O
Python:
import random

def pin(n):
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r2 = x*x + y*y
        if r2 <= 1.:
            h += 1
    return 4. * h / n

NumPy:
import numpy as np

def np_pi(n):
    x = np.random.rand(n).astype(np.float64)
    y = np.random.rand(n).astype(np.float64)
    rs = np.add(np.multiply(x, x, dtype=np.float64),
                np.multiply(y, y, dtype=np.float64),
                dtype=np.float64)
    ones = np.ones(n, dtype=np.float64)
    lss = np.less_equal(rs, ones)
    hit = np.count_nonzero(lss)
    pi = np.float64(4.) * np.float64(hit) / np.float64(n)
    return pi
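The dtype-explicit version above mirrors the original loop one-to-one; an equivalent, more idiomatic sketch (float64 is NumPy's default here):

import numpy as np

def np_pi(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    hit = np.count_nonzero(x*x + y*y <= 1.)
    return 4. * hit / n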
Background
Summary
➊ Avoid the GIL to speed up
➋ Apply GIL-safe operations for thread-safety
➌ min(N, M) acceleration
[Chart: time (sec) vs # shots: Python, NumPy, CPython-Ext, CPython-Ext@Thread]
Background
Numba (V0.38.0) ✈Official
● Background ● Accelerates on: CPU, M-Core, CUDA (SSE, AVX, AVX2, AVX-512)
● @numba.jit: Just-in-Time compile
■ Few user guides ✈Conda2018Slide ■ An excellent review ✈Matthew Rocklin
■ Supported by Conda, Inc. and the Gordon and Betty Moore Foundation
■ GPU version free since the end of 2017
■ Requires: mkl, mkl_fft, mkl_random, ncurses, llvmlite
■ CUDA 2.0 or above
Outline
■ Two different systems:

CPU:
@numba.jit(• •)
def pin(n):
    • • •
    return out
pi = pin(100)

CUDA:
@numba.cuda.jit(• •)
def pin(n, out):
    • • •
    (no return)
pin[25, 40](100, pi)

● CPU side: many Python codes work ✈Official ✈NumPy
▼ Language: all except class, try, except, with, yield
▼ Function calls: inner, closure, recursive
▼ Built-ins: abs() bool complex divmod() enumerate() float int iter() len() min() max() next() print() range() round() sorted() type() zip()
▼ NumPy: all() any() argmax() argmin() cumprod() cumsum() max() mean() min() nonzero() prod() std() take() var() argsort() astype() copy() flatten() item() itemset() ravel() reshape() sort() sum() transpose() view()
▼ Modules: array, cmath, collections, ctypes, enum, math, operator, functools, random, cffi
● CUDA side: kernel codes only ● NumPy: not supported
Numba
On CPU
■ @numba.jit(): compiles compatible Python codes to LLVM and executes them
● Applied to the Python π ✍ ➡ 21✕
  Cf. manual conversion to a CPython extension ✍ ➡ 23✕
☞ Comparable speed to manually converted C
● Applied to the NumPy π ✍ ➡ 1✕
☞ NumPy functions are not accelerated
  Cf. Python to NumPy ➡ 7.7✕
☞ Jit gives 21/7.7 = 3✕ over NumPy functions
▼ Numba: Python ➡ LLVM ➡ Python
▼ NumPy: (Python ➡ C ➡ Python) ✕ repeat
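A minimal sketch of the 21✕ case: the Python π jitted unchanged (nopython mode assumed):

import random
import numba

@numba.jit(nopython=True)
def pin(n):
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x*x + y*y <= 1.:
            h += 1
    return 4. * h / n

pin(10)          # the 1st call triggers the compile
pi = pin(10**7)  # later calls measure pure execution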
Numba
Accelerate NumPy Indexing
■ Jit of NumPy indexing: 817✕ reported, 100✕ actual ✈Murillo
● Compare a "for loop" against a vectorized function, on List and NdArray, native and Jit:

def for_add(n, vs):
    for i in range(n):
        vs[i] += 1

def np_add(n, vs):
    a = np.add(vs, 1)
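A timing sketch for the comparison (the bench helper is an illustration, not from the slide):

import time
import numpy as np

def bench(f, n, vs):
    ts = time.monotonic()
    f(n, vs)
    return time.monotonic() - ts

n = 10**6
print('for/List   ', bench(for_add, n, [0] * n))
print('for/NdArray', bench(for_add, n, np.zeros(n)))
print('np_add     ', bench(np_add, n, np.zeros(n)))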
[Chart: time (sec) vs # shots: For/NdArray, For/List, NpAdd/List, Jit-For/List, NpAdd/NdArray, Jit-For/NdArray]
▼ NdArray indexing is 3.8✕ slower than List ✈stackoverflow
▼ Indexing requires setup calculations and branches in the main loops
▼ np.add(NdArray) is 100✕ faster than np.add(List)
Numba
On M-Core
■ All cores working:
➊ Set @jit(parallel=True)
➋ Change "range" to "numba.prange" (a sketch follows)
● Applied to the Python π ➡ 89✕, i.e. 4.4✕ of plain @jit()
● No way to control the # of cores
▼ Multi-user/multi-process environments need core assignment
■ @jit(nogil=True) + ThreadPoolExecutor gives that control ✍
[Chart: relative speed vs # threads; real cores ← → hyper-thread cores; measured speed vs ideal]
Numba
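A sketch of the two-step recipe above (np.random inside a jitted body is supported in nopython mode; the += on hit is recognized by Numba as a parallel reduction):

import numba
import numpy as np

@numba.jit(nopython=True, parallel=True)
def pin_par(n):
    hit = 0
    for i in numba.prange(n):     # prange replaces range
        x = np.random.random()
        y = np.random.random()
        if x*x + y*y <= 1.:
            hit += 1              # parallel reduction
    return 4. * hit / n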
On CUDA
■ Non-compatible Python codes (details are out of scope)
● CUDA kernel codes in the definitions
▼ Python-like, not the C of PyCuda
● Insert "[#blocks, #threads]" into calls, e.g. pin[25, 40](n)
● Rewriting π ✍ ➡ 1160✕, i.e. 152✕ of NumPy
▼ Measured on the 2nd run; the 1st includes 1.8 sec of compile/load time
[Chart: time (sec) vs # shots: CUDA 1st, CUDA 2nd; overhead marked ➡]
Numba
Summary (Numba)
➊ Convert to nogil functions
➋ Accelerate "for/while" loops
➌ Improve NumPy indexing
[Chart: time (sec) vs # shots: Python, CPU, Nogil@Thread, Parallel, CUDA]
Numba
Machine Learning Packages
● NumPy accelerators ● Kernel-less CUDA access ● Tensor objects ● Poor documents ➡ my thoughts ?
➊ TensorFlow (V1.9) ✈Official ✈奥田
● CPU, CUDA, (TPU, ROCm, Phi); own SIMD + (SSE, AVX, AVX2, AVX-512)
➋ PyTorch (V0.4.11) ✈Official ● CUDA
➌ CuPy (V4.1.0) –Chainer– ✈Official ● CUDA
Exec Modes
■ TensorFlow (tf): (CPU, CUDA) ✕ (Eager, Graph) = 4 modes
● Eager: Python is a direct executor of ordinary actions
● Graph: Python is a macro generator for computing graphs
● Eager if the 1st code is tf.enable_eager_execution(), else Graph
● Two pip packages: CPU and GPU (= GPU + CPU)
▼ Implicit: the package sets the default device
▼ Explicit: a "with tf.device('/cpu:0'):" block
■ PyTorch (torch, pt): [CPU], CUDA = 2 modes (NN graph)
● torch.func(.., device=D, ..) with D = device('cuda') or D = device('cpu')
● Implicit: auto-decided from the operands ➡ fast
● Explicit-2: torch.func(..).cuda() ➡ slow
■ CuPy (cp): CUDA = 1 mode (NN graph)
● Only CUDA; use NumPy for the CPU (a placement sketch follows)
ML Packages
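A sketch of the implicit/explicit placements above (TF 1.x and PyTorch 0.4 APIs as named on the slide):

import tensorflow as tf
import torch

tf.enable_eager_execution()            # must be the 1st TF action => Eager
with tf.device('/cpu:0'):              # explicit TF placement block
    xs = tf.random_uniform([10], dtype=tf.float64)

dev = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.rand(10, device=dev)         # implicit/fast: born on the device
if torch.cuda.is_available():
    y = torch.rand(10).cuda()          # explicit-2/slow: CPU tensor copied over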
CUDA (porting π)
■ TensorFlow Eager ✍: ➊ np. ➡ tf. ➋ change some func names ➌ add "tf.cast" to some funcs ➍ select the env for CUDA
■ PyTorch ✍ / CuPy ✍ (graph): ➊ np. ➡ pt. / cp. ➋ change some func names / no change ➌ add "device" options / no change ➍ set the global device type / no change
■ TensorFlow Graph ✍: ➊ create "tf.placeholder" inputs ➋ run a function with the inputs
■ TensorFlow CPU: ● execute the same codes in a CPU env
[Chart: time (sec) vs # shots: TfEager, TfGraph, CuPy, PyTorch]
ML Packages
CPU
■ TensorFlow ✍: 2.4✕, 3.8✕; 8 cores running SIMD ?
■ PyTorch ✍: 0.7✕; for CUDA-less develop/debug
[Chart: time (sec) vs # shots: PyTorch, NumPy, Tf, TfGraph]
■ TensorFlow Eager is in progress: more functional and faster ?
[Chart: time (sec) vs # shots: CondaEnv, VirtualEnv, CondaMkl]
● V1.5@Jan. 2018: contribution version ✈奥田
● V1.7: moved out of contribution
● V1.8: SSE, AVX link
● V1.9@Aug.: Conda links Intel MKL ✈Conda (MKL: Math Kernel Library: BLAS, LAPACK, ScaLAPACK, FFT, NN, ..) ✈Intel
● V?: contribution AutoGraph ✈GitHub
ML Packages
TensorFlow Graph
■ Advanced computing graph
● While, Branch, Parallel, Reduce, Scatter, etc. in CUDA
● Concurrent main-memory accesses from CUDAs and CPUs
▼ Written with non-portable special control functions, not Python: a macro language
▼ The functions are hard to understand, but contrib.AutoGraph converts "for, if, .." to Graph
● Slower than PyTorch in the π calculation
● 1000 While@CUDA ✍ ● 10 Parallel@CUDA ✍
ML Packages
Overhead (OH)
■ Negligible OHs for heavy functions such as fft, cv, solvers, etc.
● TensorFlow: tf.{linalg, math, image, distributions, sets, strings}; tf.contrib.{linalg, integrate, image, ffmpeg, signal, timeseries}
● CuPy: linalg, math, fft
■ Prediction of the array-getting OHs in ordinary cases:
● NumPy ➊ CuPy–Array 1/16✕ ➋ CuPy–Scalar (a sketch follows the chart)
[Diagram: CPU only: np.RNG(n) → xs → xs[0] → x. CPU+CUDA, array: cp.RNG(n) → xs → cp.asnumpy → nd → nd[0] → x. CPU+CUDA, scalar: cp.RNG(n) → xs → xs[0] → scalar → cp.asnumpy → x. RNG: Random Number Generator]
[Chart: time (sec) vs # shots: Array, Scalar]
▼ Transfer time from CUDA to CPU
▼ The jump: caused by cache ?
ML Packages
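A sketch of the two getting patterns above, in CuPy:

import cupy as cp

xs = cp.random.rand(10**6)    # generated on the CUDA device
nd = cp.asnumpy(xs)           # Array path: one bulk transfer, then nd[0] on the CPU
x0 = float(xs[0])             # Scalar path: device indexing, then a one-value transfer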
All Overheads
■ Accelerating a function "r = f(a1, a2)":
● NumPy (CPU only): def f(p1, p2): a1→p1, a2→p2, • • •, r; return r
● Accelerator: a1→p1 with copy-in, a2→p2 with copy-in, • • •, r with copy-out; return r
▼ copy-in and copy-out are added to every call
[Diagram: a CPU-only call vs a CPU + accelerator call with copy in / copy out]
ML Packages
Tensor
■ A bridge between CPU and accelerator ?
[Diagram: f(a1, a2) on the CPU feeding the accelerator through a Tensor: copy in, copy out, others, buffer/cache]
● copy in (create a Tensor object):
  TensorFlow: convert_to_tensor(nd)   PyTorch: tensor(nd)   CuPy: array(nd)   (nd: NdArray)
● copy out (convert to an NdArray):
  TensorFlow: t_obj.numpy()   PyTorch: t_obj.cpu().numpy()   CuPy: asnumpy(t_obj)   (t_obj: Tensor object)
● Others: ▼ neural-network functions ▼ MM-direct: scatter read/write ▼ "if", "while"
● Buffer/Cache ✈PyTorch ✈DlPack ▼ not stored in CPU memory, cf. NumPy functions ▼ • • •
(A round-trip sketch follows.)
ML Packages
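The copy-in/copy-out pairs above as a PyTorch sketch (the TensorFlow and CuPy pairs have the same shape):

import numpy as np
import torch

nd = np.arange(4, dtype=np.float64)
t = torch.tensor(nd)          # copy in: NdArray -> Tensor object
back = t.cpu().numpy()        # copy out: works for CPU and CUDA tensors alike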
Summary (ML Packages)
➊ CuPy: NumPy-compatible CUDA
☞ TensorFlow: CPU-SIMD/CUDA/.., application modules
☞ PyTorch: debugging on CPU
☞ Consider copy-in/out overhead
[Chart: time (sec) vs # shots: NumPy, Tf@Cpu, Tf@Gpu, CuPy@Gpu, TfWhile@Gpu, PyTorch@Gpu]
ML Packages
Dask (V0.18.0) ✈Official
● Background ● "Delayed": a simple graph for threading
■ PyData's answer to columnar processing without main-memory limits: Hadoop ➡ Arrow ➡ 7 systems + Pandas ✈Official ✈Mckinney, HANA (SAP), RevolutionR (MS) ✈Official
■ Conda + DARPA, NSF, Gordon Moore Foundation, HHMI
■ Expands NumPy, Pandas, Scikit-Learn
■ Parallel computing: ● Process: Futures ● Thread: Delayed
Graph for Multi-Threading
■ "delayed" defines the nodes of a parallel computation ✍ (threaded)
● mn.visualize() at m=3 shows the graph ① ② ③

cnt = int(n / m)
ps = []
for _ in range(m):
    p = dask.delayed(get_pi)(cnt)
    ps.append(p)
mn = dask.delayed(np.mean)(ps)
pi = mn.compute()        # execute

■ Applied to all the get_pi variants with m=3 (a usage sketch follows)
Dask
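compute() on a delayed graph defaults to the threaded scheduler; a usage sketch with V0.18's scheduler keyword making the choice explicit:

import dask
import numpy as np

ps = [dask.delayed(get_pi)(10**5 // 3) for _ in range(3)]
mn = dask.delayed(np.mean)(ps)
pi = mn.compute(scheduler='threads')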
The Results at 3 Threads
■ NumPy shows little improvement [Chart: NumPy vs Dask]
● The ufuncs' nogil ✈HP affects acceleration
▼ Short intervals of "add, multiply, less_equal"
■ No-GIL functions show good improvement [Charts: CPython vs Dask, NogilPy vs Dask]
● ThreadPoolExecutor showed: ▼ 3✕ at CPython ▼ 3✕ at NogilPy
■ The others show no improvement; CuPy may have nogil functions [Charts: Python vs Dask, Python@Jit vs Dask, TfCpu vs Dask, CuPy vs Dask]
Dask
Delayed vs ThreadPoolExecutor
■ With NogilPy, ThreadPool shows a lower launch overhead and a higher speed
● Delayed ● ThreadPool
[Charts: time (sec) vs # shots per thread count; relative slope and relative speed vs # threads; measured vs ideal]
Dask
Summary (Dask)
➊ No guide about GIL-safety
● Only "+=, -=" are inhibited, without reasoning
➋ Large overheads for the π calculation
■ A tool for Dask components ?
■ Too early to evaluate:
➊ NumPy has nogil functions ➋ CuPy may have nogil functions
● PyTorch: freeze ● TensorFlow@CPU: segmentation fault
Dask

Threading and Nogil
● ThreadPoolExecutor
➊ Confirm the nogil-ness of CuPy
➋ GIL-Safe prediction
➌ Nogil-forced NumPy
NumPy vs CuPy
■ NumPy partial-nogil, CuPy full-nogil ?
● NumPy ● CuPy
[Charts: time (sec) vs # shots per thread count; relative speed and relative slope vs # threads; measured vs ideal]
Threading and Nogil
Confirm CuPy
■ Error/π = a·N^b ✈WikiPi-1
[Chart: absolute relative error vs # shots, 1 loop: Data, CuPy@T8, NumPy error]
● CuPy at 8 threads: ▼ thread-safe RNG ▼ parallel execution in CUDA
● NumPy at 8 threads: ▼ a GIL error caused by

h = 0
for v in lss:
    if v == 1:
        h = h + 1

(h = h + 1, not +=)
Threading and Nogil
GIL-Safe Prediction
■ Almost impossible to predict GIL-safety: local functions turn out safe or not non-deterministically

def rng_count(n):   # ✍: count(n) plus the RNG line
    x = np.random.rand(n)
    ones = np.ones(n)
    c = np.count_nonzero(ones)
    return c        # expect n == c

def count(n):
    ones = np.ones(n)
    c = np.count_nonzero(ones)
    return c

● count: 14 errors; no error at T2,3,4 on the test bench; no error on an Intel Atom ✍
● rng_count: no error
☞ Apply forced-nogil functions
[Chart: time (sec) vs N, 1 loop: Rng_Count, Count]
Threading and Nogil
Numba JIT Options
■ Set nopython=True for the nogil guarantee ?
● Local objects live in a heap storage whose accesses would need mutexes.
● Accesses to that heap storage are serialized by GIL block intervals, not by per-access mutexes.
Guaranteed: @jit(nogil=True, nopython=True)
Non-guaranteed: @jit(nogil=True, nopython=False)
[Diagram: Threads 1-3 with their variables/namespaces as LLVM objects, releasing and catching the GIL; all accesses to the Python heap storage (object manager, Obj-1 .. Obj-n) go through the GIL entry]
(A sketch follows.)
Threading and Nogil
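The guaranteed combination above, as a sketch:

import random
import numba

@numba.jit(nogil=True, nopython=True)   # guaranteed: pure LLVM, GIL released
def pin_nogil(n):
    hit = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x*x + y*y <= 1.:
            hit += 1
    return 4. * hit / n

# @numba.jit(nogil=True, nopython=False) may fall back to object mode,
# where heap accesses are again serialized by GIL block intervals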
Nogil NumPy by Numba
■ Some NumPy functions require rewriting for a guaranteed nogil
[Charts: time (sec) vs # shots: Rewritten, Original, Thread; relative speed vs # threads, measured vs ideal]
● The rewrite slows down (0.02✕): h = count_nonzero(lss) becomes

h = 0
for v in lss:
    if v == 1:
        h = h + 1

● Numba speeds it up 1.6✕ ● 6 threads speed it up 3.2✕, 5✕ of the Original
Threading and Nogil
Summary (Threading and Nogil)
➊ Apply nogil functions for thread-safety
■ Set nopython=True with nogil=True in numba.jit
➋ Almost impossible to predict GIL-safety
➌ CuPy: parallel execution in CUDA ?
[Chart: time (sec) vs # shots: NumPy, NogilNumPy]
Threading and Nogil
Conclusion
Execution Time Confirmation (ETC) on run-time signatures showed:
➊ Ideal threading acceleration = min(N, M)
➋ A comparison of On-The-Fly packages: ● Numba ● TensorFlow ● PyTorch ● CuPy ● Dask
➌ Basic issues and solutions: ● GIL ● Nogil ● GIL-Safe ● Threading ● Graph ● NumPy indexing ● Copy overhead
Enjoy On-The-Fly, in your own ways ✍
Appendix

MIT License
Copyright (c) 2018 Yukio Okuda

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Numba

Nogil:
import numba
import random
from concurrent.futures import ThreadPoolExecutor

@numba.jit(nogil=True, nopython=True)
def nba_pi_nogil(n):
    hit = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    return 4. * hit / n

tpe = ThreadPoolExecutor(max_workers=12)

def nba_pi_nogil_tp_nm(n, m):
    global tpe
    cnt = int(n / m)
    if cnt < 1:
        cnt = 1
    ans = []
    for i in range(m):
        ans.append(tpe.submit(nba_pi_nogil, cnt))
    pi = 0.
    for f in ans:
        pi += f.result()
    return pi / m

print('Test', nba_pi_nogil_tp_nm(10**5, 4))

CUDA:
import numba
import numpy as np
from numba.cuda.random import xoroshiro128p_uniform_float64
from numba.cuda.random import create_xoroshiro128p_states

@numba.cuda.jit()
def nba_cuda(n, pi, rng):
    thread_id = numba.cuda.grid(1)
    hit = 0
    for _ in range(n):
        x = xoroshiro128p_uniform_float64(rng, thread_id)
        y = xoroshiro128p_uniform_float64(rng, thread_id)
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    pi[thread_id] = 4. * hit / n

def nba_cuda_rec(n):
    threads_per_block = 25
    blocks = 40
    rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
    pis = np.ones(threads_per_block * blocks, dtype=np.float64)
    nba_cuda[blocks, threads_per_block](n, pis, rng_states)
    return pis.mean()

print('Test', nba_cuda_rec(10**5))

Appendix
ML: TfEager, PyTorch, CuPy

TensorFlow-CPU/CUDA-Eager:
import tensorflow as tf
tf.contrib.eager.enable_eager_execution()
# tf.enable_eager_execution()

def tf_pin(n):
    xs = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    ys = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
    ones = tf.ones([n], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.count_nonzero(lss)
    pi = tf.divide(tf.multiply(tf.cast(4., tf.float64),
                               tf.cast(hit, tf.float64)),
                   tf.cast(n, tf.float64))
    return pi.numpy()

print('Test', tf_pin(10**5))

CuPy-CUDA:
import cupy as cp
import numpy as np

def cp_pi_gpu(n):
    x = cp.random.rand(n, dtype=cp.float64)
    y = cp.random.rand(n, dtype=cp.float64)
    rs = cp.add(cp.multiply(x, x, dtype=np.float64),
                cp.multiply(y, y, dtype=np.float64),
                dtype=np.float64)
    ones = cp.ones(n, dtype=cp.float64)
    lss = cp.less_equal(rs, ones)
    hit = cp.count_nonzero(lss)
    return 4. * int(hit) / n   # tail truncated in the source; restored per the sibling versions

PyTorch-CPU:
import torch
torch.set_default_dtype(torch.float64)

def pt_pi_cpu(n):
    x = torch.rand(n, dtype=torch.float64)
    y = torch.rand(n, dtype=torch.float64)
    rs = torch.add(torch.mul(x, x), torch.mul(y, y))
    ones = torch.ones(n, dtype=torch.float64)
    lss = torch.le(rs, ones)
    hit = torch.nonzero(lss).size()[0]
    pi = 4. * hit / n
    return pi

print('Test', pt_pi_cpu(10**5))

PyTorch-CUDA:
import torch
torch.set_default_dtype(torch.float64)
DEVICE = torch.device('cuda')

def pt_pi_gpu_all(n):
    x = torch.rand(n, device=DEVICE)
    y = torch.rand(n, device=DEVICE)
    rs = torch.add(torch.mul(x, x), torch.mul(y, y))
    ones = torch.ones(n, device=DEVICE)
    lss = torch.le(rs, ones)
    hit = torch.nonzero(lss).size()[0]
    return 4. * hit / n

print('Test', pt_pi_gpu_all(10**5))

Appendix
ML: TfGraph

TensorFlow-Simple Graph:
import tensorflow as tf

def tf_pin(n):
    xs = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    ys = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
    ones = tf.ones([n], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.count_nonzero(lss)
    pi = tf.divide(tf.multiply(tf.cast(4., tf.float64),
                               tf.cast(hit, tf.float64)),
                   tf.cast(n, tf.float64))
    return pi

tf_n = tf.placeholder(tf.int32, [], name='n')
tf_graph = tf_pin(tf_n)
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(n):
    pi = session.run(tf_graph, feed_dict={tf_n: n})
    return pi

if __name__ == "__main__":
    print('Test', get_pi(10**5))

TensorFlow-While Graph:
import tensorflow as tf
from tf_graph_simple import tf_pin

def tf_graph_pin_while_sub(i, n, pis):
    pis = tf.add(pis, tf_pin(n))
    return pis

def tf_graph_pin_while(n, loop):
    i = tf.constant(0)
    pis = tf.constant(0., dtype=tf.float64)
    i, pis = tf.while_loop(
        lambda i, pis: tf.less(i, loop),
        lambda i, pis: (tf.add(i, 1), tf_graph_pin_while_sub(i, n, pis)),
        [i, pis])
    pi = tf.divide(pis, tf.cast(loop, tf.float64))
    return pi

tf_n = tf.placeholder(tf.int32, [], name='n')
tf_loop = tf.placeholder(tf.int32, [], name='loop')
tf_graph_while = tf_graph_pin_while(tf_n, tf_loop)
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(n):
    pi = session.run(tf_graph_while, feed_dict={tf_n: n, tf_loop: 1000})
    return pi

print('Test', get_pi(10**5))

Appendix
ML: TfGraph, Dask

TensorFlow-Parallel Graph:
import tensorflow as tf

M = 10
m = tf.placeholder(tf.int32, [], name='m')
n = tf.placeholder(tf.int32, [], name='n')
step = tf.cast(tf.divide(n, m), dtype=tf.int32)
hit = tf.zeros([], dtype=tf.int64, name='hit')
for _ in range(M):
    xs = tf.random_uniform(shape=[step], minval=0., maxval=1., dtype=tf.float64)
    ys = tf.random_uniform(shape=[step], minval=0., maxval=1., dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
    ones = tf.ones([step], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.add(hit, tf.count_nonzero(lss, dtype=tf.int64))
pi = tf.divide(tf.multiply(tf.cast(4., tf.float64), tf.cast(hit, tf.float64)),
               tf.cast(n, tf.float64))
ans = pi
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(in_n, in_m):
    pi = session.run(ans, feed_dict={n: in_n, m: in_m})
    return pi

print('Test', get_pi(10**5, 10))

Dask-Numba:
import numpy as np
import random
import dask
import numba

@numba.jit(nogil=True)
def get_pi(n):
    hit = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    return 4. * hit / n

def dsk_nba_pi_nogil(n, m, v=False):
    cnt = int(n / m)
    ps = []
    for _ in range(m):
        p = dask.delayed(get_pi)(cnt)
        ps.append(p)
    mn = dask.delayed(np.mean)(ps)
    if v:
        mn.visualize(optimize_graph=True)
        pi = 0
    else:
        pi = mn.compute()
    return pi

# visualize() requires python-graphviz and the Graphviz utility;
# it generates ./mydask.png
# dsk_nba_pi_nogil(10**5, 3, v=True)
print('Test', dsk_nba_pi_nogil(10**5, 3))

Appendix
Miscellaneous

GIL-Safe:
import numpy as np
from concurrent.futures import ThreadPoolExecutor

tpe = ThreadPoolExecutor(max_workers=25)

def rng_count(n):
    x = np.random.rand(n).astype(np.float64)
    ones = np.ones(n, dtype=np.float64)
    c = np.count_nonzero(ones)
    return c

def count(n):
    ones = np.ones(n, dtype=np.float64)
    c = np.count_nonzero(ones)
    return c

def tpe_pi_nm_min(n, m, f):
    global tpe
    ts = []
    for i in range(m):
        ts.append(tpe.submit(f, n))
    pis = []
    for t in ts:
        pis.append(t.result())
    return min(pis)

for n in (7*10**6, 8*10**6, 9*10**6, 10**7):
    c = tpe_pi_nm_min(n, 9, count)
    print("count:", n == c, n, c)
    c = tpe_pi_nm_min(n, 9, rng_count)
    print("rng_count:", n == c, n, c)

GIL-Safe note: the printed results depend on the executing machine.
Benchmark machine:
count:     False  7000000  34302
rng_count: True   7000000  7000000
count:     False  8000000  10750
rng_count: True   8000000  8000000
count:     False  9000000  525822
rng_count: True   9000000  9000000
count:     False 10000000  455166
rng_count: True  10000000 10000000
Intel Atom N3150 @ 1.60GHz, 4 cores, no hyper-thread, stepping=3: all True !!

Appendix