2. Me: A Programmer over 40 Years: Joyful
[Chart: CPU Clock (GHz) vs. Year, marking the One Core, Multi Core, and HyperThread (Real Cores) eras]
GIL@Python
● One Core: N-Threads ≤ 1-Thread; CPU-Bound ➡ Special Vector Processor
● Multi Core: N-Threads@M-Cores = min(N, M); CPU-Bound ➡ General, Low-price GPGPU
IBM/SRB · SUN/process · DECα/VMS-thread · POSIX/pthread · Linux/pthread
✈Wiki-Xeon
PyConJP2018/9 Y. Okuda
3. Story (1/2)
How to accelerate CPU-bound codes in Python
Fast Execution
● Compiled languages: CPython-extension
● No GIL: Cython, PyPy, Jython, IronPython, ..
● Device-dependent: OpenMP, OpenACC, PyCuda
Fast Development
● Compatibility
● Portability
● On-The-Fly (OTF) packages
PDF in clouds; Codes in Appendix: ✍; Links: ✈GIL
Introduction
14. Speeds of Process and Thread (1/2)
def add(n):
    a = 0
    for _ in range(n):
        a += 1

for n in [ .. ]:
    ts = time.monotonic()
    for _ in range(1000):
        f(n)
    te = time.monotonic()

def series(n):
    add(n)
    add(n)

def process(n):
    p1 = Process(target=add,
                 args=(n,))
    p1.start()
    p2 = Process(target=add,
                 ...
    p1.join(); p2.join()

def thread(n):
    t1 = Thread(target=add,
                args=(n,))
    t1.start()
    t2 = Thread(target=add,
                ...
    t1.join(); t2.join()
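The fragments above can be assembled into a runnable micro-benchmark (a sketch: the shot count and loop count here are illustrative, not the slide's test-bench values):

```python
import time
from threading import Thread
from multiprocessing import Process

def add(n):
    a = 0
    for _ in range(n):
        a += 1

def series(n):
    add(n)
    add(n)

def thread(n):
    t1 = Thread(target=add, args=(n,))
    t2 = Thread(target=add, args=(n,))
    t1.start(); t2.start()
    t1.join(); t2.join()

def process(n):
    # same shape with Process; on spawn-based platforms run it
    # from a script under "if __name__ == '__main__':"
    p1 = Process(target=add, args=(n,))
    p2 = Process(target=add, args=(n,))
    p1.start(); p2.start()
    p1.join(); p2.join()

def bench(f, n, loops=10):
    ts = time.monotonic()
    for _ in range(loops):
        f(n)
    return time.monotonic() - ts

for f in (series, thread):
    print(f.__name__, bench(f, 10**5))
```

On a multi-core machine `process` typically beats `series` (the next slide reports 1.8✕), while `thread` stays at roughly 1✕ because of the GIL.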
Background
15. Speeds of Process and Thread (2/2)
■ Speed
● Thread: 1✕ Series (25% down @ Turbo On)
● Process: 1.8✕ Series
22. Is Thread Safe? (1/2)
def add(n):
    global g
    for _ in range(n):
        g += 1

def sub(n):
    global g
    for _ in range(n):
        g -= 1

g = None
def a_s(n):
    global g
    g = 0
    t1 = Thread( ..
        add, .. n)
    t2 = Thread( ..
        sub, .. n)
    .. .. ..
    return g

for n in [ .. ]:
    gs = []
    for _ in range(1000):
        gs.append(a_s(n))
    n0 = not_zero_count(gs)
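For contrast, guarding each update with a lock makes the same add/sub pair deterministic (a minimal sketch; the slide's unlocked version may or may not return 0 on any given run):

```python
from threading import Thread, Lock

g = 0
lock = Lock()

def add(n):
    global g
    for _ in range(n):
        with lock:      # serialize the read-modify-write
            g += 1

def sub(n):
    global g
    for _ in range(n):
        with lock:
            g -= 1

def a_s(n):
    global g
    g = 0
    t1 = Thread(target=add, args=(n,))
    t2 = Thread(target=sub, args=(n,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return g

print(a_s(100000))  # 0 on every run, because the lock serializes each update
```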
Background
23. Is Thread Safe? (2/2)
■ Time ≥ 8 msec ➡ Not Thread-Safe (global and local)
30. Why not Safe? GIL
■ GIL activates one thread at a time to avoid object corruption ✈Dabeaz ✈Abhinav Ajitsaria
● GIL: Global Interpreter Lock
■ Threads are chopped into tslice intervals, and code sequences can be cut mid-update ✈A. Jesse
● tslice = 5 msec ● Errors from 8 msec
☞ For acceleration, avoid GIL and Python object access
☞ For no error, finish within a tslice or apply GIL-Safe operations
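The 5 msec tslice is CPython's thread switch interval, which is inspectable and tunable from the standard library (a small sketch; 0.05 is just an illustrative value):

```python
import sys

# CPython's scheduler lets a thread run for one "switch interval"
# before offering the GIL to another thread; the default matches
# the talk's tslice of 5 msec.
print(sys.getswitchinterval())   # 0.005 by default

# A longer interval gives each thread longer uninterrupted runs:
sys.setswitchinterval(0.05)
```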
[Diagram: Thread1 and Thread2 alternately holding the GIL for one tslice each]
Background
31. How to Avoid GIL
■ CPython-Extension:
1. Copy Python objects to C data
2. Apply the "Py_BEGIN_ALLOW_THREADS" C macro
3. Execute C-native codes or OTF codes
4. Apply the "Py_END_ALLOW_THREADS" C macro
5. Copy C data back to Python objects
[Diagram: Thread1/Thread2; CPython holds the GIL, while C-native codes run between Py_BEGIN_ALLOW_THREADS (copy in) and Py_END_ALLOW_THREADS (copy out), avoiding the GIL]
Background
32. Monte Carlo Π Calculation
H hits in the circle when targeting
N random shots at a square
π = 4 · H/N ✈WikiPi-2 ✈LLNL
Error/π = a · N^b ✈WikiPi-1
Python C
import random
def pin(n):
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r2 = x*x + y*y
        if r2 <= 1.:
            h += 1
    return 4. * h / n
double pin(int n) {
    unsigned int s = time(NULL);
    int h = 0;
    for (int i = 0; i < n; ++i) {
        double x = (double)rand_r(&s) / (double)RAND_MAX;
        double y = (double)rand_r(&s) / (double)RAND_MAX;
        double r2 = x*x + y*y;
        if (r2 <= 1.)
            h += 1;
    }
    return 4. * (double)h / (double)n;
}
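The error law Error/π = a · N^b (with b ≈ −1/2 for plain Monte Carlo) can be spot-checked with the Python version (a sketch; the seed and shot counts are illustrative):

```python
import math
import random

def pin(n):
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x*x + y*y <= 1.:
            h += 1
    return 4. * h / n

random.seed(1)
# quadrupling N should roughly halve the relative error (b ≈ -1/2)
for n in (10**3, 4 * 10**3, 16 * 10**3):
    est = pin(n)
    print(n, est, abs(est - math.pi) / math.pi)
```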
Background
33. Multi-Threaded Π Calculation
■ Original: pin(n)
Get h hits in n shots ➡ 4 · h/n
■ m-Threading: pinm(n, m)
Launch h1 in n/m, h2 in n/m, ..., hm in n/m shots (Map)
h = sum(h1, h2, .., hm) (Reduce)
4 · h/n
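The map/reduce shape above can be sketched with ThreadPoolExecutor (hedged: with a pure-Python `pin` the GIL prevents any real speed-up; the nogil-compiled version shown later is what makes the threads pay off):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def pin_hits(n):
    # count hits only; the Reduce step happens in pinm
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x*x + y*y <= 1.:
            h += 1
    return h

def pinm(n, m):
    with ThreadPoolExecutor(max_workers=m) as tpe:
        hs = list(tpe.map(pin_hits, [n // m] * m))  # Map
    h = sum(hs)                                     # Reduce
    return 4. * h / n

print(pinm(4 * 10**5, 4))
```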
Background
34. Π @ CPython
■ Easy operation (good tools and official documents)
■ Requires re-running setup at each release of CPython
import cif
pi = cif.pin(n)
dist/cif-•••egg
python setup.py •••
(Compile, Link, Deploy, Test)
cifmodule.c
#include <Python.h>
static PyObject *
pin( ••• ){
    Py_BEGIN_ALLOW_THREADS
    for (int i = 0; i < n; ++i){ •••
    Py_END_ALLOW_THREADS
setup.py
import setuptools
setup( ••• )
cif_test.py
import unittest
import cif
Background
35. Effects of Threads and Cores
☞ min(N, M)✕ – Overhead
N: # of Threads, M: # of Real Cores
[Chart: Relative Speed vs. Threads; measured Speed vs. Ideal; Real Cores ← → HyperThread Cores]
Background
36. Hard to Develop Threading
■ Design issues: out of scope of this talk
■ An issue in this trial: rand_r vs. random_r
● rand_r: low randomness, ideal speed-up ➡ Selected
● random_r: good randomness, but slows down at threading ✈stackoverflow
● The standard gives no clear speed specification for multi-threading ✈open-std
● 80 stdlib functions are not thread-safe ✈opengroup (e.g. rand, random, drand48, lrand48, mrand48)
● "more standardization: for compilers, users, and libraries .. activation of threads" Shameem, p.291, Multi-Core Programming ✈Intel-Press
☞ Check speeds of official thread-safe functions
[Chart: Π Error vs. # Shots (0 to 1e5), rand_r vs. random_r]
40. NumPy Speedup
■ Converting to NumPy: 7.7✕
● Vectorize: move "for loops" into functions
● NumPy vector/matrix functions are compiled C codes
■ Not only numeric calculation:
● count_nonzero ● less_equal, less, .. ● sort, lexsort, ..
● where, searchsorted ● I/O
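A few of the non-numeric helpers above in action (toy data):

```python
import numpy as np

a = np.array([3, 1, 2, 5, 4])
print(np.count_nonzero(a > 2))         # 3 elements are greater than 2
print(np.less_equal(a, 3))             # boolean mask per element
print(np.sort(a))                      # [1 2 3 4 5]
print(np.where(a > 2, a, 0))           # keep big values, zero the rest
print(np.searchsorted(np.sort(a), 3))  # 2: insertion index in sorted order
```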
Python NumPy
import random
def pin(n):
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r2 = x*x + y*y
        if r2 <= 1.:
            h += 1
    return 4. * h / n
import numpy as np
def np_pi(n):
    x = np.random.rand(n).astype(np.float64)
    y = np.random.rand(n).astype(np.float64)
    rs = np.add(np.multiply(x, x, dtype=np.float64),
                np.multiply(y, y, dtype=np.float64),
                dtype=np.float64)
    ones = np.ones(n, dtype=np.float64)
    lss = np.less_equal(rs, ones)
    hit = np.count_nonzero(lss)
    pi = np.float64(4.) * np.float64(hit) / np.float64(n)
    return pi
Background
41. Summary
➊ Avoid GIL to speed up
➋ Apply GIL-Safe operations for Thread-Safety
➌ min(N, M) acceleration
48. Numba (V0.38.0) ✈Official
● Background
● Accelerates on: CPU, M-Core, CUDA (SSE, AVX, AVX2, AVX-512)
● @numba.jit: Just-in-Time compile
■ Few user's guides ✈Conda2018Slide
■ An excellent review ✈Matthew Rocklin
■ Supported by Anaconda, Inc.
■ The Gordon and Betty Moore Foundation
■ GPU version free from end of 2017
■ Requires: mkl, mkl_fft, mkl_random, ncurses, llvmlite
■ CUDA compute capability 2.0 or above
49. Outline
■ Two different systems
CPU:
@numba.jit(••)
def pin(n):
    •••
    return out
pi = pin(100)
CUDA:
@numba.cuda.jit(••)
def pin(n, out):
    •••
    (no return)
pin[25, 40](100, pi)
● Many Python codes supported: ✈Official ✈NumPy
▼ Language: all except class, try, except, with, yield
▼ Function calls: inner, closure, recursive
▼ Built-ins: abs() bool complex divmod() enumerate() float int iter() len() min() max() next() print() range() round() sorted() type() zip()
▼ NumPy: all() any() argmax() argmin() cumprod() cumsum() max() mean() min() nonzero() prod() std() take() var() argsort() astype() copy() flatten() item() itemset() ravel() reshape() sort() sum() transpose() view()
▼ Modules: array, cmath, collections, ctypes, enum, math, operator, functools, random, cffi
● CUDA kernel codes: NumPy not supported
Numba
50. On CPU
■ @numba.jit(): compile/execute compatible Python codes to LLVM
● Apply to Python π ✍ ➡ 21✕
Cf. manual conversion to a CPython extension ✍ ➡ 23✕
☞ Comparable speed to manually converted C
● Apply to NumPy π ✍ ➡ 1✕
☞ NumPy functions are not accelerated
Cf. Python to NumPy ➡ 7.7✕
☞ Jit 21/7.7 ≈ 3✕ of NumPy functions
▼ Numba: Python ➡ LLVM ➡ Python
▼ NumPy: (Python ➡ C ➡ Python) ✕ Repeat
Numba
51. Accelerate NumPy Indexing
■ Jit NumPy indexing ➡ 817✕ reported, 100✕ actual ✈Murillo
● Compare a "for loop" and a vector-operation function
on List and NdArray, native and Jit
def for_add(n, vs):
    for i in range(n):
        vs[i] += 1
def np_add(n, vs):
    a = np.add(vs, 1)
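The native half of the comparison can be run directly (a sketch without Numba; sizes are illustrative):

```python
import time
import numpy as np

def for_add(n, vs):
    # element-by-element Python-level indexing
    for i in range(n):
        vs[i] += 1

def np_add(vs):
    # one vectorized call, no Python-level indexing
    return np.add(vs, 1)

n = 10**5
for vs in ([0] * n, np.zeros(n)):
    t0 = time.monotonic()
    for_add(n, vs)
    print('for_add on', type(vs).__name__, time.monotonic() - t0)
```

On typical CPython builds the NdArray loop is several times slower than the List loop, while `np.add` on an NdArray is far faster than either, which is the gap the next slide's chart shows.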
57. [Chart: Time (sec) vs. Shots for For-NdArray, For-List, NpAdd-List, JitFor-List, NpAdd-NdArray, JitFor-NdArray]
▼ NdArray indexing is 3.8✕ slower than List ✈stackoverflow
▼ Indexing is required for setup calculations and branches in main loops
▼ np.add(NdArray) is 100✕ faster than np.add(List)
Numba
58. On M-Core
■ All cores working:
➊ set @jit(parallel=True)
➋ change "range" to "numba.prange"
● Apply to Python π ➡ 89✕, i.e. 4.4✕ of @jit()
● No way to control # of cores
▼ Multi-user/process setups need core assignment
■ @jit(nogil=True) + ThreadPoolExecutor controls it ✍
[Chart: Relative Speed vs. Threads; Speed vs. Ideal; Real Cores ← → HyperThread Cores]
Numba
59. On CUDA
■ Non-compatible Python codes (details are out of scope)
● CUDA kernel codes in definitions
▼ Python-like, not the C of PyCuda
● Insert "[#blocks, #threads]" in calls
▼ Ex. pin[25, 40](n)
● Rewriting π ✍ ➡ 1160✕, i.e. 152✕ of NumPy
▼ Use the 2nd run; the 1st includes 1.8 sec compile/load time
74. Exec Modes
■ TensorFlow (tf): (CPU, CUDA) ✕ (Eager, Graph) = 4
● Eager: Python is a direct executor for ordinary actions
● Graph: Python is a macro generator for computing graphs
● Eager if the 1st code is tf.enable_eager_execution(), else Graph
● Two pip packages: CPU, GPU (= GPU + CPU)
Implicit: package sets the default device
Explicit: "with tf.device('/cpu:0'):" block
■ PyTorch (torch, pt): [CPU], CUDA = 2 (NN-Graph)
● torch.func(.., device=D, ..)
D = device('cuda'); D = device('cpu')
● Implicit: auto-decided from operands ➡ Fast
● Explicit-2: torch.func(..).cuda() ➡ Slow
■ CuPy (cp): CUDA = 1 (NN-Graph)
● Only CUDA; use NumPy for CPU
ML Packages
75. CUDA
■ TensorFlow Eager ✍
➊ np. ➡ tf.
➋ Change some func names
➌ Add "tf.cast" to some funcs
➍ Select an env. for CUDA
■ PyTorch ✍ / CuPy ✍ Graph
➊ np. ➡ pt. / cp.
➋ Change some func names / No change
➌ Add "device" options / No change
➍ Set global device type / No change
■ TensorFlow Graph ✍
➊ Create "tf.placeholder" inputs
➋ Run a function with the inputs
■ TensorFlow CPU
● Execute the same codes on a CPU env.
85. [Chart: Time (sec) vs. Shots for Conda-Env, Virtual-Env, Conda-MKL]
● V1.5 @ Jan. 2018: Contribution version ✈Okuda
● V1.7: moving out of contribution
● V1.8: SSE, AVX link
● V1.9 @ Aug.: Conda links Intel MKL ✈Conda
MKL: Math Kernel Library (BLAS, LAPACK, ScaLAPACK, FFT, NN, ..) ✈Intel
● V?: Contribution AutoGraph ✈GitHub
ML Packages
86. TensorFlow Graph
■ Advanced computing graph
● While, Branch, Parallel, Reduce, Scatter, etc. in CUDA
● Concurrent main-memory accesses from CUDAs and CPUs
▼ Written with non-portable special control functions,
not Python: a macro language
▼ Hard to understand the functions, but
contrib.AutoGraph converts "for, if, .." to Graph
● Slower than PyTorch in the π calculation
● 1000 While @ CUDA ✍ ● 10 Parallel @ CUDA ✍
99. Overhead (OH)
■ Negligible OHs for heavy functions such as fft, cv, solvers, etc.
● TensorFlow: tf.( 1. linalg 2. math 3. image 4. distributions 5. sets 6. strings )
tf.contrib.( 1. linalg 2. integrate 3. image 4. ffmpeg 5. signal 6. timeseries )
● CuPy: 1. linalg 2. math 3. fft
■ Prediction of array-getting OHs in ordinary cases
● NumPy ➊ CuPy-Array 1/16✕ ➋ CuPy-Scalar
[Diagrams: (a) NumPy on CPU: np.RNG(n) → xs → xs[0] → x;
(b) CuPy-Array: cp.RNG(n) → xs on CUDA, cp.asnumpy → nd on CPU, nd[0] → x;
(c) CuPy-Scalar: cp.RNG(n) → xs on CUDA, xs[0] → scalar, cp.asnumpy → x.
RNG: Random Number Generator]
112. All Overheads
■ Accelerate function "r = f(a1, a2)"
● NumPy: runs on the CPU; a1, a2 are passed as p1, p2 and rf is returned directly
● Accelerator: a1, a2 are copied in, the body runs on the accelerator, and rf is copied out
[Diagram: CPU-only call vs. CPU ↔ Accelerator call with copy-in and copy-out steps]
137. Tensor
■ Bridge between CPU and Accelerator?
[Diagram: CPU ↔ Accelerator with a Tensor layer handling copy-in, copy-out, others, and Buffer/Cache]
● copy in (create a Tensor object from an NdArray nd):
TensorFlow: convert_to_tensor(nd)
PyTorch: tensor(nd)
CuPy: array(nd)
● copy out (convert a Tensor object t_obj to an NdArray):
TensorFlow: t_obj.numpy()
PyTorch: t_obj.cpu().numpy()
CuPy: asnumpy(t_obj)
● Others
▼ Neural network functions
▼ MM-Direct: scatter read/write
▼ "if", "while"
● Buffer/Cache ✈PyTorch ✈DlPack
▼ Not stored in CPU memory,
cf. NumPy functions
▼ •••
ML Packages
138. Summary
➊ CuPy: NumPy-compatible CUDA
☞ TensorFlow: CPU-SIMD / CUDA / .., application modules
☞ PyTorch: debugging on CPU
☞ Consider Copy-In/Out overhead
145. Dask (V0.18.0) ✈Official
● Background
● "Delayed": a simple graph for threading
■ PyData's answer to column stores without main-memory limits:
Hadoop ➡ Arrow ➡ 7 systems + Pandas ✈Official ✈Mckinney,
HANA (SAP), RevolutionR (MS) ✈Official
■ Anaconda + DARPA, NSF, Gordon Moore Foundation, HHMI
■ Expands NumPy, Pandas, Scikit-Learn
■ Parallel computing:
● Process: Futures
● Thread: Delayed
146. Graph for Multi-Threading
■ "delayed" defines nodes of parallel computing ✍
# Thread
cnt = int(n / m)
ps = []
for _ in range(m):
    p = dask.delayed(get_pi)(cnt)
    ps.append(p)
mn = dask.delayed(np.mean)(ps)
pi = mn.compute()  # Execute
● mn.visualize() at m=3
■ Applies to all the get_pi functions with m=3
[Graph: three delayed get_pi nodes ① ② ③ feeding np.mean]
Dask
147. The Results at 3 Threads
■ NumPy shows little improvement [Chart: NumPy vs. Dask]
● Ufuncs' nogil ✈HP affects acceleration
▼ Short intervals of "add, multiply, less_equal"
■ No-GIL functions show good improvement [Chart: CPython vs. Dask, NogilPy vs. Dask]
● ThreadPoolExecutor showed:
▼ 3✕ at CPython
▼ 3✕ at NogilPy
■ The others show no improvement; CuPy may have nogil functions
[Chart: Python, Python@Jit, TfCpu, CuPy, each vs. Dask]
Dask
161. Summary
➊ No guide about GIL-Safe
● Only "+=, -=" are inhibited, without reasoning
➋ Large overheads for the π calculation
■ A tool for Dask components?
■ Too early to evaluate
➊ NumPy has nogil functions
➋ CuPy may have nogil functions
● PyTorch: freeze
● TensorFlow @ CPU: segmentation fault
182. Confirm CuPy
■ Error/π = a · N^b ✈WikiPi-1
[Chart: Absolute Relative Error vs. N Shots; Data, CuPy@T, NumPy Error]
1 Loop ● CuPy at 8 threads
▼ Thread-safe RNG
▼ Parallel execution in CUDA
● NumPy at 8 threads
▼ GIL error caused by
h = 0
for v in lss:
    if v == 1:
        h = h + 1
not by +=
Threading and Nogil
183. GIL-Safe Prediction
■ Almost impossible to predict GIL-Safety:
local functions show Safe or Not non-deterministically
# def rng_count(n) ✍
x = np.random.rand(n)
# def count(n)
ones = np.ones(n)
c = np.count_nonzero(ones)
return c  # n == c
● count: 14 errors,
no error @ T2,3,4
on the test bench,
no error on Intel-Atom ✍
● rng_count: no error
☞ Apply forced nogil functions
196. Numba JIT Options
■ Set nopython=True for a nogil guarantee?
● Local objects are stored in a heap storage whose accesses should be mutexed.
● Accesses to the heap storage are controlled by GIL block intervals, not by mutexes on each access.
Guaranteed:
@jit(nogil=True,
     nopython=True)
Non-guaranteed:
@jit(nogil=True,
     nopython=False)
[Diagram: Thread-1/2/3, each with Variables and NameSpaces or LLVM objects; Release GIL / Catch GIL around the GIL-entry Object Manager over the Python heap storage (Obj-1 .. Obj-n); all accesses pass through it]
Threading and Nogil
197. Nogil NumPy by Numba
■ Some NumPy functions require rewriting
● Guaranteed nogil
203. [Charts: Time (sec) vs. Shots for Rewritten vs. Original; Relative Speed vs. Threads, Speed vs. Ideal]
● Rewriting slows down 0.02✕:
h = count_nonzero(lss)
becomes
h = 0
for v in lss:
    if v == 1:
        h = h + 1
● Numba speeds up 1.6✕
● 6 threads speed up 3.2✕
➡ 5✕ of Original
Threading and Nogil
204. Summary
➊ Apply nogil functions for Thread-Safety
■ Set nopython=True with nogil=True in numba.jit
➋ Almost impossible to predict GIL-Safety
➌ CuPy parallel execution in CUDA?
213. Appendix
MIT License
Copyright (c) 2018 Yukio Okuda
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
214. Numba
Nogil
import numba
import random
from concurrent.futures import ThreadPoolExecutor

@numba.jit(nogil=True, nopython=True)
def nba_pi_nogil(n):
    hit = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    return 4. * hit / n

tpe = ThreadPoolExecutor(max_workers=12)
#--
def nba_pi_nogil_tp_nm(n, m):
    global tpe
    cnt = int(n / m)
    if cnt < 1:
        cnt = 1
    ans = []
    for i in range(m):
        ans.append(tpe.submit(nba_pi_nogil, cnt))
    pi = 0.
    for f in ans:
        pi += f.result()
    return pi / m

print('Test', nba_pi_nogil_tp_nm(10**5, 4))
CUDA
import numba
import numpy as np
from numba.cuda.random import xoroshiro128p_uniform_float64
from numba.cuda.random import create_xoroshiro128p_states

@numba.cuda.jit()
def nba_cuda(n, pi, rng):
    thread_id = numba.cuda.grid(1)
    hit = 0
    for _ in range(n):
        x = xoroshiro128p_uniform_float64(rng, thread_id)
        y = xoroshiro128p_uniform_float64(rng, thread_id)
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    pi[thread_id] = 4. * hit / n

def nba_cuda_rec(n):
    threads_per_block = 25
    blocks = 40
    rng_states = create_xoroshiro128p_states(
        threads_per_block * blocks, seed=1)
    pis = np.ones(threads_per_block * blocks,
                  dtype=np.float64)
    nba_cuda[blocks, threads_per_block](
        n, pis, rng_states)
    return pis.mean()

print('Test', nba_cuda_rec(10**5))
Appendix
215. ML: TfEager, PyTorch, CuPy
TensorFlow-CPU/CUDA-Eager
import tensorflow as tf
tf.contrib.eager.enable_eager_execution()
# tf.enable_eager_execution()

def tf_pin(n):
    xs = tf.random_uniform(
        shape=[n], minval=0., maxval=1.,
        dtype=tf.float64)
    ys = tf.random_uniform(
        shape=[n], minval=0., maxval=1.,
        dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs),
                tf.multiply(ys, ys))
    ones = tf.ones([n], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.count_nonzero(lss)
    pi = tf.divide(
        tf.multiply(tf.cast(4., tf.float64),
                    tf.cast(hit, tf.float64)),
        tf.cast(n, tf.float64))
    return pi.numpy()

print('Test', tf_pin(10**5))
CuPy-CUDA
import cupy as cp
import numpy as np

def cp_pi_gpu(n):
    x = cp.random.rand(n, dtype=cp.float64)
    y = cp.random.rand(n, dtype=cp.float64)
    rs = cp.add(cp.multiply(x, x, dtype=np.float64),
                cp.multiply(y, y, dtype=np.float64),
                dtype=np.float64)
    ones = cp.ones(n, dtype=cp.float64)
    lss = cp.less_equal(rs, ones)
    hit = cp.count_nonzero(lss)
    return 4. * hit / n
PyTorch-CPU
import torch
torch.set_default_dtype(torch.float64)

def pt_pi_cpu(n):
    x = torch.rand(n, dtype=torch.float64)
    y = torch.rand(n, dtype=torch.float64)
    rs = torch.add(torch.mul(x, x),
                   torch.mul(y, y))
    ones = torch.ones(n, dtype=torch.float64)
    lss = torch.le(rs, ones)
    hit = torch.nonzero(lss).size()[0]
    pi = 4. * hit / n
    return pi

print('Test', pt_pi_cpu(10**5))

PyTorch-CUDA
import torch
torch.set_default_dtype(torch.float64)
DEVICE = torch.device('cuda')

def pt_pi_gpu_all(n):
    x = torch.rand(n, device=DEVICE)
    y = torch.rand(n, device=DEVICE)
    rs = torch.add(
        torch.mul(x, x),
        torch.mul(y, y))
    ones = torch.ones(n, device=DEVICE)
    lss = torch.le(rs, ones)
    hit = torch.nonzero(lss).size()[0]
    return 4. * hit / n

print('Test', pt_pi_gpu_all(10**5))
Appendix
216. ML: TfGraph
TensorFlow-Simple Graph
import tensorflow as tf

def tf_pin(n):
    xs = tf.random_uniform(
        shape=[n], minval=0., maxval=1.,
        dtype=tf.float64)
    ys = tf.random_uniform(
        shape=[n], minval=0., maxval=1.,
        dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs),
                tf.multiply(ys, ys))
    ones = tf.ones([n], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.count_nonzero(lss)
    pi = tf.divide(
        tf.multiply(tf.cast(4., tf.float64),
                    tf.cast(hit, tf.float64)),
        tf.cast(n, tf.float64))
    return pi

tf_n = tf.placeholder(tf.int32, [], name='n')
tf_graph = tf_pin(tf_n)
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(n):
    pi = session.run(
        tf_graph, feed_dict={tf_n: n})
    return pi

if __name__ == "__main__":
    print('Test', get_pi(10**5))
TensorFlow-While Graph
import tensorflow as tf
from tf_graph_simple import tf_pin

def tf_graph_pin_while_sub(i, n, pis):
    pis = tf.add(pis, tf_pin(n))
    return pis

def tf_graph_pin_while(n, loop):
    i = tf.constant(0)
    pis = tf.constant(0., dtype=tf.float64)
    i, pis = tf.while_loop(
        lambda i, pis: tf.less(i, loop),
        lambda i, pis: (tf.add(i, 1),
                        tf_graph_pin_while_sub(i, n, pis)),
        [i, pis]
    )
    pi = tf.divide(pis, tf.cast(loop, tf.float64))
    return pi

tf_n = tf.placeholder(tf.int32, [], name='n')
tf_loop = tf.placeholder(tf.int32, [], name='loop')
tf_graph_while = tf_graph_pin_while(tf_n, tf_loop)
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(n):
    pi = session.run(
        tf_graph_while,
        feed_dict={tf_n: n, tf_loop: 1000})
    return pi

print('Test', get_pi(10**5))
Appendix
217. ML: TfGraph, Dask
TensorFlow-Parallel Graph
import tensorflow as tf

M = 10
m = tf.placeholder(tf.int32, [], name='m')
n = tf.placeholder(tf.int32, [], name='n')
step = tf.cast(tf.divide(n, m), dtype=tf.int32)
hit = tf.zeros([], dtype=tf.int64, name='hit')
for _ in range(M):
    xs = tf.random_uniform(
        shape=[step], minval=0., maxval=1.,
        dtype=tf.float64)
    ys = tf.random_uniform(
        shape=[step], minval=0., maxval=1.,
        dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs),
                tf.multiply(ys, ys))
    ones = tf.ones([step], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.add(hit, tf.count_nonzero(
        lss, dtype=tf.int64))
pi = tf.divide(tf.multiply(
    tf.cast(4., tf.float64),
    tf.cast(hit, tf.float64)),
    tf.cast(n, tf.float64))
ans = pi
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(in_n, in_m):
    pi = session.run(
        ans,
        feed_dict={n: in_n, m: in_m})
    return pi

print('Test', get_pi(10**5, 10))
Dask-Numba
import numpy as np
import random
import dask
import numba

@numba.jit(nogil=True)
def get_pi(n):
    hit = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    return 4. * hit / n

def dsk_nba_pi_nogil(n, m, v=False):
    cnt = int(n / m)
    ps = []
    for _ in range(m):
        p = dask.delayed(get_pi)(cnt)
        ps.append(p)
    mn = dask.delayed(np.mean)(ps)
    if v:
        mn.visualize(optimize_graph=True)
        pi = 0
    else:
        pi = mn.compute()
    return pi

# visualize() requires python-graphviz and
# the Graphviz utility;
# generates ./mydask.png
# dsk_nba_pi_nogil(10**5, 3, v=True)
print('Test', dsk_nba_pi_nogil(10**5, 3))
Appendix
218. Miscellaneous
GIL-Safe
import numpy as np
from concurrent.futures import ThreadPoolExecutor

tpe = ThreadPoolExecutor(
    max_workers=25)

def rng_count(n):
    x = np.random.rand(n).astype(np.float64)
    ones = np.ones(
        n, dtype=np.float64)
    c = np.count_nonzero(ones)
    return c

def count(n):
    ones = np.ones(
        n, dtype=np.float64)
    c = np.count_nonzero(ones)
    return c

def tpe_pi_nm_min(n, m, f):
    global tpe
    ts = []
    for i in range(m):
        ts.append(tpe.submit(f, n))
    pis = []
    for t in ts:
        pis.append(t.result())
    return min(pis)

for n in (7*10**6, 8*10**6, 9*10**6, 10**7):
    c = tpe_pi_nm_min(n, 9, count)
    print("count:", n == c, n, c)
    c = tpe_pi_nm_min(n, 9, rng_count)
    print("rng_count:", n == c, n, c)
GIL-Safe-Note
Results of the prints depend on the executing machine.
Benchmark machine:
count:     False 7000000 34302
rng_count: True  7000000 7000000
count:     False 8000000 10750
rng_count: True  8000000 8000000
count:     False 9000000 525822
rng_count: True  9000000 9000000
count:     False 10000000 455166
rng_count: True  10000000 10000000
Intel-Atom N3150 @ 1.60GHz, 4 cores, no Hyper-Thread,
stepping=3: all True!!
Appendix