Parallel Computing in Python
Current State and Recent Advances
Pierre Glaser, INRIA  
September 5, 2019
1 / 32
Parallel computing in machine learning - an overview
Multiprocessing libraries
Towards better performance and extensibility
2 / 32
Parallel computing in machine learning
An overview
3 / 32
Python? What for?
[Figure: Python usage among developers 1]
1 source: https://www.jetbrains.com/research/python-developers-survey-2018/
4 / 32
A growing data science ecosystem
numpy for n-dimensional arrays
pandas for data analytics
scikit-learn for machine learning
5 / 32
Parallel computing? Why?
Independent, similar computations happen a lot in machine learning; such workloads are called embarrassingly parallel. Famous examples:
▶ cross validation
▶ random forests
▶ hyperparameter selection using grid search
Parallel execution is available for many scikit-learn estimators, but not all
6 / 32
Parallel computing in scikit-learn made easy
Parallelization in scikit-learn is:
▶ ubiquitous
▶ painlessly toggled using an estimator's n_jobs option:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, n_jobs=4)
X, y = get_data()  # get_data() stands in for any data-loading routine
clf.fit(X, y)  # runs on 4 cores!
7 / 32
Multithreading or multiprocessing?
Parallelism usually exists in two different forms:
▶ executing multiple processes (Python programs) in parallel
▶ executing multiple threads of the same process in parallel
8 / 32
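For illustration (a minimal sketch, not from the slides), both forms sit behind the same executor API in the standard library:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == "__main__":  # required for process pools on spawn platforms
    with ThreadPoolExecutor(max_workers=4) as ex:  # threads inside one process
        print(list(ex.map(square, range(8))))
    with ProcessPoolExecutor(max_workers=4) as ex:  # separate worker processes
        print(list(ex.map(square, range(8))))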
Fitting a random forest
I want to fit many small decision trees independently, on the same data.
thread-based parallelism:
+: threads share memory
▶ no data copies
▶ no data transfer
-: by default, Python forces threads to run sequentially
The GIL can be released when calling native code ( numpy , scipy ...)
[Figure: a main thread and three worker threads (Thread-1/2/3) inside a single Python interpreter, all reading the same memory-mapped data table]
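A hedged illustration of that escape hatch (assuming a BLAS-backed NumPy build, which releases the GIL inside np.dot):

import threading
import numpy as np

a = np.random.rand(2000, 2000)

# np.dot releases the GIL while native BLAS code runs, so these two
# threads can actually occupy two cores despite the GIL.
threads = [threading.Thread(target=np.dot, args=(a, a)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()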
Fitting a random forest
I want to fit many small decision trees independently, on the same data.
process-based parallelism:
+: processes are guaranteed to run in parallel
-: need to pass and copy data around
▶ larger memory footprint
▶ data transfer overhead
[Figure: a main process dispatching work to three worker processes (Process-1/2/3), each a separate Python interpreter with its own copy of the data table; a memory-mapped copy of the data is shown alongside]
9 / 32
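A small sketch (not from the slides) making the copy semantics concrete: a thread mutates the parent's list in place, while a child process only mutates its own copy:

import multiprocessing as mp
import threading

def zero_first(seq):
    seq[0] = 0  # mutate in place

if __name__ == "__main__":
    data = [1, 2, 3]
    t = threading.Thread(target=zero_first, args=(data,))
    t.start()
    t.join()
    print(data)  # [0, 2, 3] -- the thread shared our memory

    data = [1, 2, 3]
    p = mp.Process(target=zero_first, args=(data,))
    p.start()
    p.join()
    print(data)  # [1, 2, 3] -- the child only changed its own copy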
In practice
Figure: Model fitting time - parallel vs. sequential
10 / 32
multiprocessing libraries
the standard library and third-party modules
11 / 32
Running a multi-core application with multiprocessing
[Diagram: a main process sends greet("Alice") and greet("Bob") to worker-1 and worker-2 through an mp.Pool and collects result_1 and result_2]
You need:
- process creation primitives: mp.Process
- a serialization protocol: pickle
- thread-safe data structures: mp.Queue
bonus: the mp.Pool class
12 / 32
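Putting the pieces together, a minimal mp.Pool sketch matching the diagram above (the __main__ guard is needed so workers can re-import the module):

import multiprocessing as mp

def greet(name):
    return "hello {}!".format(name)

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        print(pool.map(greet, ["Alice", "Bob"]))
        # ['hello Alice!', 'hello Bob!']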
Serialization
Serialization is the process of transforming an in-memory object into a sequence of bytes.
[Diagram: an estimator in process-1 is serialized into a byte stream (0111101...01), which process-2 deserializes back into an estimator]
The byte string contains the sequence of instructions that must be executed to reconstruct the object graph in a fresh Python environment
13 / 32
The pickle protocol
Python defines a serialization protocol called pickle , and provides an implementation of it in the standard library.
>>> import pickle
>>> s = pickle.dumps([1, 2, 3])  # serialization (pickling) step
>>> s
b'\x80\x03]q\x00(K\x01K\x02K\x03e.'
>>> depickled_list = pickle.loads(b'\x80\x03]q\x00(K\x01K\x02K\x03e.')
>>> depickled_list
[1, 2, 3]
14 / 32
- multiprocessing often needs to pass data around
- pickle's purpose is to pass data around
⇒ pickle is tightly integrated into many multiprocessing objects ( mp.Process , mp.Queue , mp.Connection ...)
Improving Python multiprocessing performance often comes down to improving pickle
15 / 32
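For instance (an illustrative sketch, not from the slides), every object travelling through an mp.Queue is pickled on one side and unpickled on the other:

import multiprocessing as mp

def worker(q):
    q.put({"status": "done"})  # pickled behind the scenes

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    print(q.get())  # unpickled on our side: {'status': 'done'}
    p.join()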
multiprocessing portability
creating processes
serializing data
the mp.Pool class
all come with design choices.
Those design choices conflict with the modern workflow of many data scientists
16 / 32
multiprocessing portability
interactive sessions ( IPython , Jupyter ...)
>>> import multiprocessing as mp
>>> mp.set_start_method("spawn")
>>> p = mp.Pool(2)
>>> def greet_friend(name):
...     print("hello {}!".format(name))
...
>>> p.map(greet_friend, ("Alice", "Bob"))
AttributeError: Can't get attribute 'greet_friend' on <module '__main__'>
external libraries ( numpy , OpenMP ...)
tends to deadlock when child processes crash
17 / 32
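The failure above happens because pickle serializes functions by reference (module name plus qualified name), and a spawned worker cannot re-import something defined only in an interactive __main__. A common workaround, before reaching for cloudpickle (below), is to move the function into an importable module; a hypothetical helpers.py:

# helpers.py: any module the worker processes can import
def greet_friend(name):
    return "hello {}!".format(name)

# back in the interactive session:
# >>> import multiprocessing as mp
# >>> from helpers import greet_friend  # picklable by reference again
# >>> mp.Pool(2).map(greet_friend, ("Alice", "Bob"))
# ['hello Alice!', 'hello Bob!']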
A (standard library) alternative to mp.Pool
concurrent.futures.ProcessPoolExecutor (Python 3.2+)
>>> from concurrent.futures import ProcessPoolExecutor
>>> executor = ProcessPoolExecutor(max_workers=2)
>>> def greet_friend(name):
...     return "hello {}!".format(name)
...
>>> results = executor.map(greet_friend, ("Alice", "Bob"))  # non-blocking
>>> for r in results:  # blocking until the next task completes
...     print(r)
does not deadlock when child processes unexpectedly crash.
18 / 32
concurrent.futures for data science: loky
loky is a third-party package that provides a more robust ProcessPoolExecutor :
Support for Python 3.4+ (and 2.7... until next year)
Consistent behavior on all platforms (Linux, macOS, Windows)
Works in interactive shells
19 / 32
concurrent.futures for data science: loky
loky.ProcessPoolExecutor
a drop-in replacement for concurrent.futures.ProcessPoolExecutor : only the import line changes.
using concurrent.futures:
>>> from concurrent.futures import ProcessPoolExecutor
using loky:
>>> from loky import ProcessPoolExecutor
>>> executor = ProcessPoolExecutor(max_workers=2)
>>> def greet_friend(name):
...     return "hello {}!".format(name)
...
>>> results = executor.map(greet_friend, ("Alice", "Bob"))  # non-blocking
>>> for r in results:  # blocking until the next task completes
...     print(r)
20 / 32
loky : decoupling multiprocessing from pickle
- pickle fails at serializing certain objects in interactive sessions
- this is the main reason concurrent.futures.ProcessPoolExecutor and mp.Pool fail to work in that setting
To address this problem, loky uses cloudpickle , a third-party pickle extension module.
21 / 32
Integration into data science frameworks
[Diagram: the dependency stack: scikit-learn builds on joblib , which builds on loky and cloudpickle , which in turn build on the standard pickle ]
22 / 32
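As an illustration of that stack (a sketch, assuming a recent joblib where loky is the default backend), scikit-learn-style parallelism goes through joblib's Parallel helper:

from math import sqrt
from joblib import Parallel, delayed

# joblib dispatches these calls to a loky-backed executor by default
results = Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(5))
print(results)  # [0.0, 1.0, 1.414..., 1.732..., 2.0]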
What’s new in Python 3.8 (and 3.9):
Increasing Python’s serialization speed and extensibility
23 / 32
pickle extensions: why?
by design, the standard pickle implementation blocks the serialization of some Python constructs
>>> import pickle
>>> import cloudpickle
>>> pickle.dumps(lambda x: x + 1)  # would cause arbitrary code execution
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_pickle.PicklingError: Can't pickle <function <lambda> at 0x7fd0b36631e0>:
attribute lookup <lambda> on __main__ failed
In practice, data scientists need remote code execution of interactively defined functions ( jupyter + dask , Zeppelin + (py)spark ...)
Such frameworks require pickle extensions such as cloudpickle
>>> cloudpickle.dumps(lambda x: x + 1)
b'\x80\x04\x958\x01\x00\x00\x00\x00\x00\x00\x8c\x17cloudpickle.cloudpickle ...'
24 / 32
pickle extensions are (were) slow
The pickle module is implemented both as a pure-Python module and as a C-optimized module.
pickle extensions, however, could only extend the slow, pure-Python pickle :
$ python3.7 -m timeit 'import pickle; pickle.dumps(list(range(100000)))'
50 loops, best of 5: 4.39 msec per loop
$ python3.7 -m timeit 'import cloudpickle; cloudpickle.dumps(list(range(100000)))'
2 loops, best of 5: 119 msec per loop
25 / 32
Faster pickle extensions
in Python 3.8, pickle extensions can now extend the C-optimized pickle module 2
$ python3.8 -m timeit 'import pickle; pickle.dumps(list(range(100000)))'
50 loops, best of 5: 4.69 msec per loop
$ python3.8 -m timeit 'import cloudpickle; cloudpickle.dumps(list(range(100000)))'
100 loops, best of 5: 3.73 msec per loop
30× faster than on Python 3.7!
2 joint work with ogrisel and Antoine Pitrou
26 / 32
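The enabler is the reducer_override hook that Python 3.8 added to pickle.Pickler , which subclasses of the C-accelerated Pickler can define. A toy sketch (complex numbers pickle fine anyway; the type is only a stand-in):

import io
import pickle

class MyPickler(pickle.Pickler):
    # Python 3.8+: called for every object; returning NotImplemented
    # falls back to the regular (C-speed) pickling path.
    def reducer_override(self, obj):
        if isinstance(obj, complex):
            return complex, (obj.real, obj.imag)
        return NotImplemented

buf = io.BytesIO()
MyPickler(buf).dump([1 + 2j, 3.0])
print(pickle.loads(buf.getvalue()))  # [(1+2j), 3.0]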
No-copy serialization using pickle protocol 5
pickle was originally designed for on-disk persistence of Python objects.
Now, it is widely used to communicate objects between workers, which happens in memory: RAM usage becomes a critical concern.
pickle protocol 5's addition: PickleBuffer a
ensures no copy operations when dumping or loading objects with large numpy arrays and Arrow tables ( pandas DataFrame s, scikit-learn estimators...)
a work by Antoine Pitrou
27 / 32
Out-of-band serialization
pickle protocol 5 goes one step further: it allows delegating the serialization of PEP 3118 -compatible objects to third-party code.
[Diagram: a numpy array is split into its metadata (shape, strides, flags), which goes into the pickle stream (in-band), and its big data buffer, which goes into a third-party stream (out-of-band)]
28 / 32
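A hedged sketch of the protocol-5 API (assuming Python 3.8 and a NumPy version with protocol-5 support): buffer_callback collects the large buffers instead of copying them into the stream, and loads takes them back out-of-band:

import pickle
import numpy as np

big = np.zeros(1_000_000)

buffers = []
# With protocol 5, the array's data buffer is handed to buffer_callback
# as a PickleBuffer instead of being copied into the pickle stream.
data = pickle.dumps(big, protocol=5, buffer_callback=buffers.append)

# The buffers may travel out-of-band (shared memory, sockets, ...)
restored = pickle.loads(data, buffers=buffers)
assert (restored == big).all()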
Decoupling multiprocessing and pickle
- multiprocessing exposes many constructs that use data serialization internally ( Process , Connection , Queue )
- these objects are, for now, bound to the standard pickle and thus suffer from pickle 's limitations.
29 / 32
Decoupling multiprocessing and pickle
(WIP) (Hopefully) New in Python 3.9:
joint work with Pablo Galindo
30 / 32
The custom reducer API
# proposed (WIP) API, as sketched in the talk; CloudPickler comes from
# cloudpickle, Unpickler from the standard pickle module
import multiprocessing
from multiprocessing.reduction import AbstractReducer
from pickle import Unpickler
from cloudpickle import CloudPickler

class CustomReducer(AbstractReducer):
    def get_pickler_class(self):
        return CloudPickler

    def get_unpickler_class(self):
        return Unpickler

multiprocessing.set_reducer(CustomReducer())
31 / 32
Conclusion
- many recent improvements make Python's built-in multiprocessing utilities more efficient, robust, and extensible
- other advances towards a faster Python in a parallel setting:
- PEP 554: multiple subinterpreters in the stdlib (3.9)
- the shared_memory module
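For reference, a minimal sketch of the shared_memory module (new in Python 3.8): a named block of memory that several processes can attach to without copying:

from multiprocessing import shared_memory
import numpy as np

# create a shared block and view it as a numpy array
shm = shared_memory.SharedMemory(create=True, size=4 * 8)
a = np.ndarray((4,), dtype=np.int64, buffer=shm.buf)
a[:] = [1, 2, 3, 4]

# another process can attach with shared_memory.SharedMemory(name=shm.name)
del a  # release the buffer view before closing
shm.close()
shm.unlink()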
32 / 32