Understanding and optimizing parallelism in NumPy-based programs
Ralf Gommers
21 April 2022
First make it work, then make it fast
>>> %timeit main()
50.1 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> # ... perform some optimizations
>>> %timeit main()
9.58 ms ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> # break out your profiler (e.g., py-spy), optimize some more
>>> %timeit main()
2.83 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
[Figure: `htop` output]
Approaches for performant numerical code (single-threaded)
● Vectorization
● Use compiled code
● Python compilers, e.g. Pythran
● Python interpreters, e.g. CPython. Plus Cinder, Pyston, and more: very experimental, and limited gains for numerical code
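As a small illustration of the first approach, vectorization moves the loop out of the Python interpreter and into NumPy's compiled code. The function names below are hypothetical, and this is a sketch rather than a benchmark:

```python
import numpy as np


def sum_of_squares_loop(values):
    """Pure-Python loop: every iteration runs in the interpreter."""
    total = 0.0
    for v in values:
        total += float(v) * float(v)
    return total


def sum_of_squares_vectorized(values):
    """Vectorized: the loop runs inside NumPy's compiled multiply and sum."""
    x = np.asarray(values, dtype=np.float64)
    return float(np.sum(x * x))


data = np.arange(1_000)
print(sum_of_squares_loop(data), sum_of_squares_vectorized(data))
```

Timing both with `%timeit` will typically show the vectorized version pulling ahead as the array grows; the exact ratio depends on array size and hardware.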
Multiprocessing & multithreading
A key issue: oversubscription
Package A sees N CPU cores and decides to use them all.
Package B (which uses package A), or the end user, then decides to use multiprocessing with one process per core. Each of those N processes again starts N threads, so N × N threads compete for N cores.
The more CPU cores a machine has, the worse the effect is!
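The arithmetic behind this can be sketched in a few lines. The helper names are hypothetical, and the model assumes every thread is busy all the time:

```python
import os


def total_threads(n_processes: int, threads_per_process: int) -> int:
    """Threads competing for the CPU when every process runs its own thread pool."""
    return n_processes * threads_per_process


def oversubscription_factor(n_cores: int, n_processes: int, threads_per_process: int) -> float:
    """Busy threads per core; a value above 1.0 means the cores are oversubscribed."""
    return total_threads(n_processes, threads_per_process) / n_cores


def threads_per_process_budget(n_cores: int, n_processes: int) -> int:
    """Largest per-process thread count that keeps the total at or below n_cores."""
    return max(1, n_cores // n_processes)


n = os.cpu_count() or 1
# Package A alone: 1 process using all n cores -> factor 1.0
print(oversubscription_factor(n, n_processes=1, threads_per_process=n))
# Package B adds 1 process per core, and A still uses all n cores in each -> factor n
print(oversubscription_factor(n, n_processes=n, threads_per_process=n))
# Capping A's threads to the budget brings the factor back to 1.0
print(oversubscription_factor(n, n_processes=n,
                              threads_per_process=threads_per_process_budget(n, n)))
```

The factor grows linearly with the core count, which is why large machines suffer most.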
Parallel APIs & behavior: NumPy
NumPy is single-threaded: no code in NumPy itself is written for parallel execution.
However, most numpy.linalg functions (those using BLAS or LAPACK) execute in
parallel and, by default, use all available cores on the machine.
NumPy does release the GIL wherever it can.
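Because many NumPy operations release the GIL while their inner loop runs, plain Python threads can sometimes run NumPy work concurrently. A minimal sketch, assuming a 1-D array and a ufunc that releases the GIL for float64 input. Whether it does is an implementation detail, so measure before counting on a speedup. `parallel_ufunc` is a hypothetical helper, not a NumPy API:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def parallel_ufunc(ufunc, x, n_threads=4):
    """Apply a unary ufunc to a 1-D array in contiguous chunks, one chunk per thread."""
    out = np.empty_like(x)
    # Chunk boundaries, e.g. [0, 250, 500, 750, 1000] for 1000 elements and 4 threads
    bounds = np.linspace(0, x.size, n_threads + 1, dtype=np.intp)

    def work(i):
        lo, hi = bounds[i], bounds[i + 1]
        # Each thread writes into its own non-overlapping view of `out`
        ufunc(x[lo:hi], out=out[lo:hi])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, range(n_threads)))  # list() re-raises any worker exception
    return out


x = np.linspace(0.0, 10.0, 1_000_000)
y = parallel_ufunc(np.exp, x)
```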
numpy.random has specific APIs to allow users to:
(a) Obtain independent streams for random number generation across
processes (local or distributed)
(b) Perform multithreaded random number generation
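For (b), one pattern, modelled loosely on the multithreaded-generation example in NumPy's documentation, gives each thread its own Generator and lets it fill a slice of a shared output array. A sketch; `threaded_standard_normal` is a hypothetical helper, and it relies on the `out=` argument of `Generator.standard_normal`:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def threaded_standard_normal(n, seed=12345, n_threads=4):
    """Fill an array of n standard normal draws using one Generator per thread."""
    # Independent child seeds -> independent, reproducible streams per thread
    children = np.random.SeedSequence(seed).spawn(n_threads)
    rngs = [np.random.default_rng(child) for child in children]
    out = np.empty(n, dtype=np.float64)
    bounds = np.linspace(0, n, n_threads + 1, dtype=np.intp)

    def fill(i):
        # Fills this thread's contiguous slice in place; NumPy may release the GIL
        # during the fill, which is what lets the threads overlap
        rngs[i].standard_normal(out=out[bounds[i]:bounds[i + 1]])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(fill, range(n_threads)))
    return out


samples = threaded_standard_normal(1_000_000)
```

The result is reproducible for a given seed and n_threads; changing n_threads changes how the array is partitioned and therefore the numbers.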
Parallel APIs & behavior: SciPy
SciPy is single-threaded by default (same as NumPy)
Calls to functionality that uses BLAS or LAPACK are again multithreaded:
● primarily in scipy.linalg and scipy.sparse.linalg,
● also higher-level functionality using linear algebra under the hood:
kernel density estimation, multivariate distributions etc. in scipy.stats,
vector quantization in scipy.cluster, interpolators in scipy.interpolate,
optimizers in scipy.optimize, and more
Some APIs have a workers= keyword (default 1) that lets the user control the
number of processes or threads; some also accept a custom Pool instead of an integer.
scipy.fft provides a context manager:
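A sketch of both controls in scipy.fft, assuming a SciPy version recent enough to provide the `workers=` argument and the `set_workers` / `get_workers` helpers:

```python
import numpy as np
import scipy.fft

x = np.random.default_rng(0).standard_normal((64, 4096))

# Per call: the 64 independent 1-D transforms may be spread over up to 2 threads
y_call = scipy.fft.fft(x, workers=2)

# Scoped default: every scipy.fft call inside the block uses 4 workers
with scipy.fft.set_workers(4):
    workers_inside = scipy.fft.get_workers()
    y_ctx = scipy.fft.fft(x)

workers_after = scipy.fft.get_workers()  # previous default restored on exit
```

The context manager is convenient when the FFT calls happen deep inside other code that does not expose a `workers=` argument.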
An example using workers=:
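One example (this function is our choice of illustration and may differ from the one on the slide): `scipy.spatial.KDTree.query` accepts `workers=`, where `-1` means use all CPU cores. This assumes SciPy 1.6 or later:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((20_000, 3))
queries = rng.random((2_000, 3))
tree = KDTree(points)

# Default is workers=1 (single-threaded)
dist_1, idx_1 = tree.query(queries, k=5)
# workers=-1 spreads the queries over all available CPU cores
dist_n, idx_n = tree.query(queries, k=5, workers=-1)
```

Both calls return the same neighbours; only the wall-clock time should differ.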
Parallel APIs & behavior: scikit-learn
Scikit-learn is mostly single-threaded by default.
However, more and more functionality uses OpenMP for automatic
parallelization. This defaults to the number of virtual (not physical) CPU cores.
Many scikit-learn APIs offer an n_jobs= keyword that lets the user enable multiple
threads or processes via joblib.
Scikit-learn implements fairly complex control of NumPy/SciPy’s BLAS and
LAPACK libraries to prevent oversubscription in the presence of
multiprocessing on top of multi-threading. This is done via the threadpoolctl
package.
Controlling parallelism - packages
[Figure: dependencies (Conda)]
[Figure: dependencies (PyPI)]
[Figure: Conda vs. PyPI]
Tuning the default behavior
Default behavior is inconsistent: too aggressive for linear algebra, and too
conservative for workers (SciPy) and n_jobs (scikit-learn)!
OpenBLAS, MKL and OpenMP don’t have a nice API, only environment variables:
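These variables only take effect if they are set before NumPy (and therefore the BLAS library) is first imported. The usual place is the shell, e.g. `OMP_NUM_THREADS=2 OPENBLAS_NUM_THREADS=2 MKL_NUM_THREADS=2 python script.py`. From Python, a sketch that sets them at the very top of the entry-point script could look like this; `limit_native_threads` is a hypothetical helper:

```python
import os
import sys
import warnings

# Thread-count variables read by OpenMP, OpenBLAS and MKL when they are loaded
_THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_native_threads(n: int) -> None:
    """Set the thread-count variables; only effective before NumPy is imported."""
    if "numpy" in sys.modules:
        warnings.warn(
            "numpy is already imported; BLAS/OpenMP may ignore these variables",
            RuntimeWarning,
            stacklevel=2,
        )
    for name in _THREAD_ENV_VARS:
        os.environ[name] = str(n)


limit_native_threads(2)
import numpy as np  # imported only after the variables are set
```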
For scikit-learn you can explicitly choose a backend (but defaults are usually fine):
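scikit-learn's n_jobs= is executed through joblib, so the backend is chosen with joblib's `parallel_backend` context manager (newer joblib versions also offer `parallel_config`). A sketch using plain joblib calls, assuming joblib is installed; an estimator's `fit` placed inside the same `with` block would pick up the backend the same way:

```python
import math

from joblib import Parallel, delayed, parallel_backend


def parallel_sqrt(values, n_jobs=2):
    """Evaluate math.sqrt over `values` with joblib, using the thread-based backend."""
    # "threading" avoids process start-up and pickling costs, but only speeds things up
    # when the work releases the GIL (math.sqrt does not, so this is purely illustrative)
    with parallel_backend("threading", n_jobs=n_jobs):
        # Parallel() without n_jobs picks up the value set by the context manager
        return Parallel()(delayed(math.sqrt)(v) for v in values)


print(parallel_sqrt([1.0, 4.0, 9.0]))
```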
NumPy, SciPy and scikit-learn all recommend using threadpoolctl if you
want more granular control over the threading behavior of BLAS, LAPACK and
OpenMP libraries (or cannot set environment variables):
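A sketch with threadpoolctl's `threadpool_limits` context manager, assuming the threadpoolctl package is installed. Inside the block, every BLAS thread pool that threadpoolctl detects is capped at one thread (useful when an outer multiprocessing pool already occupies every core), and the previous limits are restored on exit:

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

a = np.random.default_rng(0).standard_normal((300, 300))

with threadpool_limits(limits=1, user_api="blas"):
    # Record what threadpoolctl reports for BLAS libraries while the limit is active
    blas_threads_inside = [
        pool["num_threads"] for pool in threadpool_info() if pool["user_api"] == "blas"
    ]
    product = a @ a  # matrix multiply via BLAS, now limited to one thread

print(blas_threads_inside)
```

If NumPy is linked against a BLAS that threadpoolctl does not recognize, the list is empty and the limit has no effect.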
A pitfall on multi-tenant machines
On a multi-tenant machine you get N “vCPU” (virtual CPU) cores out of M in total.
CircleCI, for example, gives a CI job 2 cores on a 64-core machine, yet
os.cpu_count() reports 64. Set OPENBLAS_NUM_THREADS=2 to avoid problems!
GitHub Actions, Azure DevOps and other services are better behaved.
The impact can be severe.
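os.cpu_count() reports the host's logical cores, not what your job may actually use. A Linux-oriented sketch that takes the smaller of the CPU affinity mask and a cgroup v2 CPU quota. Only `/sys/fs/cgroup/cpu.max` is checked; cgroup v1 and other platforms fall back to os.cpu_count(), and whether a given CI service exposes its limit this way is not guaranteed. `usable_cpus` is a hypothetical helper (Python 3.13's `os.process_cpu_count()` covers only the affinity part):

```python
import math
import os


def usable_cpus() -> int:
    """Best-effort count of CPUs this process can actually use (never below 1)."""
    n = os.cpu_count() or 1
    # CPU affinity mask (Linux and some other POSIX systems only)
    if hasattr(os, "sched_getaffinity"):
        n = min(n, len(os.sched_getaffinity(0)))
    # cgroup v2 quota, e.g. "200000 100000" means 2 CPUs' worth of time; "max" means no limit
    try:
        with open("/sys/fs/cgroup/cpu.max") as f:
            quota, period = f.read().split()[:2]
        if quota != "max":
            n = min(n, max(1, math.floor(int(quota) / int(period))))
    except (OSError, ValueError):
        pass
    return max(1, n)


# Cap OpenBLAS at the usable count unless the user already chose a value
# (this must run before NumPy is first imported to have any effect)
os.environ.setdefault("OPENBLAS_NUM_THREADS", str(usable_cpus()))
```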
Parallel random number generation
First, what not to do: simply drawing random numbers in different
subprocesses can give you the same numbers in each process, for example when
the children are forked and inherit the parent's random state:
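A process-free simulation of the mechanism, assuming child processes are created with fork, which copies the parent's memory, including the state of any random generator. Here `copy.deepcopy` stands in for `fork()`; it illustrates the effect and is not how multiprocessing works internally:

```python
import copy

import numpy as np

parent = np.random.default_rng(42)

# fork() duplicates the parent's memory, so each child starts from the same generator state;
# copy.deepcopy reproduces that effect without creating real processes
children = [copy.deepcopy(parent) for _ in range(4)]
draws = [child.random(3) for child in children]

for d in draws:
    print(d)  # the same three numbers, four times
```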
Use SeedSequence to obtain independent streams easily:
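A minimal sketch; `make_worker_rngs` is a hypothetical helper. In a real program you would pass one child SeedSequence to each worker process and build its Generator there:

```python
import numpy as np


def make_worker_rngs(seed, n_workers):
    """One independent Generator per worker, all derived from a single root seed."""
    root = np.random.SeedSequence(seed)
    # spawn() derives child SeedSequences intended to give statistically independent streams
    return [np.random.default_rng(child) for child in root.spawn(n_workers)]


rngs = make_worker_rngs(seed=20220421, n_workers=4)
draws = [rng.random(3) for rng in rngs]
```

Recording the single root seed is enough to reproduce every worker's stream later.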
Second option: use the .jumped() method, which several BitGenerator classes
(e.g., PCG64) provide, to obtain independent streams:
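A sketch assuming PCG64 (not every BitGenerator provides `jumped()`; SFC64, for example, does not). `make_jumped_rngs` is a hypothetical helper:

```python
import numpy as np


def make_jumped_rngs(seed, n_workers):
    """Generators whose streams are far-apart blocks of a single PCG64 sequence."""
    bit_gen = np.random.PCG64(seed)
    rngs = []
    for _ in range(n_workers):
        rngs.append(np.random.Generator(bit_gen))
        # jumped() returns a *new* bit generator advanced as if a huge number of values
        # had been drawn; the one just handed to a Generator is left unchanged
        bit_gen = bit_gen.jumped()
    return rngs


rngs = make_jumped_rngs(seed=1234, n_workers=4)
draws = [rng.random(3) for rng in rngs]
```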
Where is NumPy going - technical
● Interoperability: Array API standard support
● Extensibility: easier custom dtypes
● Performance: SIMD acceleration on x86, arm64, PPC, …?
● C++: just dipping our toes in the water here; so far it was just Python and C
● Platform support: PPC, AIX, s390x, cross-compiling to embedded ARM systems, ...
● Type annotations: main namespace annotations just completed
Note what is not on this list: auto-parallelization
Resources to learn more
Scikit-learn:
https://scikit-learn.org/stable/computing/parallelism.html
https://joblib.readthedocs.io/en/latest/parallel.html
SciPy:
http://scipy.github.io/devdocs/dev/toolchain.html#openmp-support
http://scipy.github.io/devdocs/search.html?q=workers
NumPy:
https://numpy.org/doc/stable/reference/random/parallel.html
Relevant paper: Composable Multi-Threading and Multi-Processing for Numeric Libraries
Find me at: ralf.gommers@gmail.com, rgommers, ralfgommers
Thank you!

  • 2. First make it work, then make it fast >>> %timeit main() 50.1 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) >>> # ... perform some optimizations >>> %timeit main() 9.58 ms ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) >>> # break out your profiler (e.g., py-spy), optimize some more >>> %timeit main() 2.83 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • 4. Approaches for performant numerical code (single-threaded): vectorization; using compiled code; Python compilers (e.g., Pythran); Python interpreters (CPython, plus Cinder, Pyston, and more, which are very experimental and give limited gains for numerical code)
  • 6. A key issue: oversubscription Package A sees N CPU cores, and decides to use them all:
  • 7. A key issue: oversubscription Package B, which uses package A, or the end user decides to use multiprocessing, 1 process per core: The more CPU cores a machine has, the worse the effect is!
  • 8. Parallel APIs & behavior: NumPy. NumPy is single-threaded: no code in NumPy itself is written for parallel execution. However, most numpy.linalg functions (those using BLAS or LAPACK) execute in parallel and use all available cores on a machine. NumPy does release the GIL wherever it can. numpy.random has specific APIs that allow users to (a) obtain independent streams for random number generation across processes (local or distributed), and (b) perform multithreaded random number generation.
  • 9. Parallel APIs & behavior: SciPy. SciPy is single-threaded by default (same as NumPy). Calls to functionality using BLAS or LAPACK are again multithreaded: ● primarily in scipy.linalg and scipy.sparse.linalg, ● but also in higher-level functionality that uses linear algebra under the hood: kernel density estimation, multivariate distributions etc. in scipy.stats, vector quantization in scipy.cluster, interpolators in scipy.interpolate, optimizers in scipy.optimize, and more. Some APIs have a workers= keyword (default 1), which lets the user control the number of processes or threads, or pass in a custom Pool. scipy.fft provides a context manager, scipy.fft.set_workers.
  • 10. Parallel APIs & behavior: SciPy An example using workers=:
  • 11. Parallel APIs & behavior: scikit-learn. Scikit-learn is mostly single-threaded by default. However, more and more functionality uses OpenMP for automatic parallelization, which defaults to the number of virtual (not physical) CPU cores. Many scikit-learn APIs offer an n_jobs= keyword that lets the user enable multiple threads or processes via joblib. Scikit-learn implements fairly complex control of NumPy/SciPy's BLAS and LAPACK libraries to prevent oversubscription when multiprocessing is layered on top of multithreading. This is done via the threadpoolctl package.
  • 12. Controlling parallelism - packages Dependencies (Conda)
  • 13. Controlling parallelism - packages Dependencies (PyPI)
  • 14. Controlling parallelism - packages Conda PyPI
  • 15. Tuning the default behavior. The default behavior is inconsistent: too aggressive for linear algebra, and too conservative for workers (SciPy) and n_jobs (scikit-learn)! OpenBLAS, MKL and OpenMP don't have a nice API, only environment variables (OPENBLAS_NUM_THREADS, MKL_NUM_THREADS and OMP_NUM_THREADS). For scikit-learn you can explicitly choose a backend (but the defaults are usually fine).
  • 16. Tuning the default behavior. NumPy, SciPy and scikit-learn all recommend using threadpoolctl if you want more granular control over the threading behavior of BLAS, LAPACK and OpenMP libraries (or cannot set environment variables).
  • 17. A pitfall on multi-tenant machines. On a multi-tenant machine you get N “vCPU” (virtual CPU) cores out of M in total. CircleCI gives you 2 cores for a CI job on a 64-core machine (and os.cpu_count() reports 64). Set OPENBLAS_NUM_THREADS=2 to avoid problems! GitHub Actions, Azure DevOps and other services are better behaved. The impact can be severe.
  • 19. Parallel random number generation. First, what not to do: simply drawing random numbers in different subprocesses will give you the same numbers in each process.
  • 20. Parallel random number generation. Use SeedSequence to obtain independent streams easily.
  • 21. Parallel random number generation. Second option: use the .jumped() method of BitGenerator instances to obtain independent streams easily.
  • 23. Where is NumPy going (technical). Interoperability: Array API standard support; Extensibility: easier custom dtypes; Performance: SIMD acceleration on x86, arm64, PPC, …?; C++: just dipping our toes in the water here (so far it was just Python and C); Platform support: PPC, AIX, s390x, cross-compiling to embedded ARM systems, ...; Type annotations: main namespace annotations just completed. Note what is not on this list: auto-parallelization.
  • 24. Resources to learn more. Scikit-learn: https://scikit-learn.org/stable/computing/parallelism.html and https://joblib.readthedocs.io/en/latest/parallel.html ; SciPy: http://scipy.github.io/devdocs/dev/toolchain.html#openmp-support and http://scipy.github.io/devdocs/search.html?q=workers ; NumPy: https://numpy.org/doc/stable/reference/random/parallel.html ; relevant paper: Composable Multi-Threading and Multi-Processing for Numeric Libraries
  • 25. Find me at: ralf.gommers@gmail.com, rgommers, ralfgommers Thank you!
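Slide 8 mentions that numpy.random supports multithreaded random number generation. One way to do this, modeled on the pattern in NumPy's parallel random generation docs, is sketched below: each thread gets its own Generator seeded from a spawned SeedSequence and fills its own slice of a preallocated array through the out= argument. The helper name parallel_standard_normal and its defaults are hypothetical, chosen for illustration only.

```python
import concurrent.futures

import numpy as np


def parallel_standard_normal(n, n_threads=4, seed=12345):
    """Fill an array of n standard-normal draws using n_threads threads.

    Hypothetical helper for illustration; not part of NumPy.
    """
    # One independent child seed, and therefore one Generator, per thread.
    children = np.random.SeedSequence(seed).spawn(n_threads)
    generators = [np.random.default_rng(child) for child in children]

    out = np.empty(n, dtype=np.float64)
    # Split [0, n) into n_threads contiguous chunks.
    bounds = np.linspace(0, n, n_threads + 1).astype(int)

    def fill(i):
        # standard_normal writes directly into the contiguous slice of `out`;
        # NumPy releases the GIL while filling, so the threads can run concurrently.
        generators[i].standard_normal(out=out[bounds[i]:bounds[i + 1]])

    with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
        # list() forces evaluation so any exception raised in a worker surfaces here.
        list(executor.map(fill, range(n_threads)))
    return out


samples = parallel_standard_normal(1_000_000)
```

Because each thread owns a separate Generator, there is no shared lock to contend on. The output is reproducible for a fixed seed and n_threads, but changes if n_threads changes, since the chunking changes.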
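Slides 9 and 10 describe SciPy's workers= keyword and the scipy.fft context manager. A small sketch of both, assuming a recent SciPy (the workers= argument of cKDTree.query was added in SciPy 1.6): workers=-1 wraps around from os.cpu_count(), i.e. it means "use all CPUs", while scipy.fft.set_workers changes the default for every scipy.fft call inside the block. The array sizes are arbitrary.

```python
import numpy as np
import scipy.fft
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
signals = rng.standard_normal((64, 4096))

# Default: scipy.fft uses a single worker.
spectrum_serial = scipy.fft.fft(signals, axis=-1)

# Per call: a negative value wraps around from os.cpu_count(), so -1 means all CPUs.
spectrum_all = scipy.fft.fft(signals, axis=-1, workers=-1)

# Scoped: every scipy.fft call inside the block defaults to 4 workers.
with scipy.fft.set_workers(4):
    workers_inside = scipy.fft.get_workers()
    spectrum_ctx = scipy.fft.fft(signals, axis=-1)

# Other SciPy APIs take the same keyword, e.g. nearest-neighbour queries on a k-d tree.
tree = cKDTree(rng.standard_normal((10_000, 3)))
queries = rng.standard_normal((1_000, 3))
dist_serial, idx_serial = tree.query(queries, k=1)
dist_par, idx_par = tree.query(queries, k=1, workers=-1)
```

The results are the same with one worker or many; only the wall-clock time changes. For APIs such as scipy.optimize.differential_evolution, workers= also accepts a map-like callable (for example the map method of a multiprocessing Pool), which is how you pass in a custom Pool.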
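Slide 11 notes that scikit-learn's n_jobs= keyword is implemented with joblib. A minimal sketch of using joblib directly, assuming joblib is installed; the workload (Frobenius norms of random matrices) is arbitrary. prefer="threads" asks joblib for a thread-based backend, which suits work that spends its time in NumPy code that releases the GIL. scikit-learn's parallelism docs show the same idea for estimators: wrap fit() in joblib's parallel_backend context manager to pick the backend that n_jobs uses.

```python
import numpy as np
from joblib import Parallel, delayed

rng = np.random.default_rng(0)
matrices = [rng.standard_normal((300, 300)) for _ in range(8)]

# n_jobs sets how many workers joblib uses; prefer="threads" selects threads
# instead of the default process-based (loky) backend.
norms = Parallel(n_jobs=2, prefer="threads")(
    delayed(np.linalg.norm)(m) for m in matrices
)
```

With the process-based default, each worker may additionally start its own BLAS threads, which is exactly the oversubscription scenario from slides 6 and 7.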
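Slide 15 points out that OpenBLAS, MKL and OpenMP are controlled only through environment variables. These are read when the libraries initialize (typically on the first import of NumPy), so they must be set before that import, e.g. `OMP_NUM_THREADS=1 python script.py` from a shell. The Python sketch below does the same from inside a script, assuming nothing has imported NumPy yet in the process; the choice of 1 thread is just an example.

```python
import os

# Thread-count variables for the BLAS/LAPACK and OpenMP runtimes. They are read when
# those libraries initialize (typically on the first `import numpy`), so setting
# them after NumPy has been imported has no effect.
THREAD_VARS = ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS", "OMP_NUM_THREADS")

for var in THREAD_VARS:
    # setdefault keeps any value the user already exported in the shell.
    os.environ.setdefault(var, "1")

import numpy as np  # noqa: E402  (deliberately imported after the environment is set)

a = np.random.default_rng(0).standard_normal((500, 500))
# Runs on a single BLAS thread, provided this was the first import of NumPy.
gram = a @ a.T
```

Using setdefault rather than plain assignment means an explicit setting from the shell or a CI config still wins.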
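Slide 16 recommends threadpoolctl when environment variables are too coarse or cannot be set. A minimal sketch, assuming threadpoolctl is installed: threadpool_limits can be used as a context manager that caps the BLAS thread pool and restores the original limits on exit, and threadpool_info reports what it finds. threadpoolctl inspects libraries already loaded in the process, so NumPy is imported first. The helper blas_thread_counts is a hypothetical convenience function.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits


def blas_thread_counts():
    """Thread counts of every BLAS library threadpoolctl detects (hypothetical helper)."""
    return [pool["num_threads"] for pool in threadpool_info() if pool["user_api"] == "blas"]


rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 1000))

counts_before = blas_thread_counts()

# Temporarily restrict BLAS to one thread, e.g. inside a worker that already runs
# alongside one process per core, to avoid oversubscription.
with threadpool_limits(limits=1, user_api="blas"):
    counts_inside = blas_thread_counts()
    product = a @ a

counts_after = blas_thread_counts()  # original limits are restored on exit
```

If your NumPy build uses a BLAS that threadpoolctl does not recognize, the lists are simply empty and the limit has no effect.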
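Slide 17's pitfall comes from os.cpu_count() reporting the host's cores rather than the share a container may use. The Linux-oriented sketch below estimates the usable count, assuming the container's limit is enforced as a cgroup v2 CPU quota readable at /sys/fs/cgroup/cpu.max. That is an assumption: the slide does not say how CircleCI imposes its 2-core limit. Both helper names are hypothetical.

```python
import os


def cpus_from_cgroup_v2(cpu_max):
    """Parse cgroup v2 ``cpu.max`` contents such as "200000 100000".

    Returns the whole number of CPUs the quota allows (at least 1), or None if the
    quota is "max" (unlimited). Hypothetical helper for illustration.
    """
    quota, period = cpu_max.split()
    if quota == "max":
        return None
    return max(1, int(quota) // int(period))


def usable_cpus(cpu_max_path="/sys/fs/cgroup/cpu.max"):
    """Best-effort count of CPUs this process can actually use (hypothetical helper)."""
    if hasattr(os, "sched_getaffinity"):
        # CPUs this process is allowed to run on (respects taskset/cpusets).
        n = len(os.sched_getaffinity(0))
    else:
        n = os.cpu_count() or 1
    try:
        with open(cpu_max_path) as f:
            quota = cpus_from_cgroup_v2(f.read())
    except (OSError, ValueError):
        # No cgroup v2 file or unexpected format: fall back to the affinity count.
        quota = None
    return n if quota is None else min(n, quota)


n_threads = usable_cpus()
```

The resulting value could then be exported as OPENBLAS_NUM_THREADS before NumPy is imported, which is the fix the slide recommends.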
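Slides 19 to 21 describe two ways to get independent random streams. A sketch of both with NumPy's Generator API: SeedSequence.spawn derives independent child seeds, and BitGenerator.jumped() returns a new bit generator advanced by a very large number of steps, following the loop shown in NumPy's parallel generation docs. The function names and the seed are illustrative. In a real program each Generator (or each child SeedSequence, which is picklable) would be handed to a different process or thread.

```python
import numpy as np


def generators_from_seedsequence(seed, n_streams):
    """Independent Generators via SeedSequence.spawn (one per worker)."""
    children = np.random.SeedSequence(seed).spawn(n_streams)
    return [np.random.default_rng(child) for child in children]


def generators_from_jumps(seed, n_streams):
    """Independent Generators via repeated BitGenerator.jumped()."""
    bit_generator = np.random.PCG64(seed)
    generators = []
    for _ in range(n_streams):
        generators.append(np.random.Generator(bit_generator))
        # jumped() returns a *new* bit generator with an advanced state;
        # the one just handed out is left untouched.
        bit_generator = bit_generator.jumped()
    return generators


seeded = [g.random(3) for g in generators_from_seedsequence(2022, 4)]
jumped = [g.random(3) for g in generators_from_jumps(2022, 4)]
```

SeedSequence is usually the simpler choice because it works with any BitGenerator and needs no knowledge of jump sizes; jumped() is an alternative when you want all streams derived from one bit generator's sequence.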