On the necessity and inapplicability of Python

Yung-Yu Chen (@yungyuc)
Help us develop numerical software

Who I am
• I am a mechanical engineer by training, focusing on
applications of continuum mechanics. A computational
scientist / engineer rather than a computer scientist.

• In my day job, I write high-performance code for
semiconductor applications of computational geometry
and lithography.

• In my spare time, I teach a course, ‘Numerical
Software Development’, in the Department of Computer
Science at NCTU.

You can contact me through Twitter: https://twitter.com/yungyuc
or LinkedIn: https://www.linkedin.com/in/yungyuc/.

PyHUG
• Python Hsinchu User Group (established in late
2011)

• The first group of staff of PyCon Taiwan (2012)

• Weekly meetups at a pub for 3 years, not
stopped by COVID-19

• 7+ active user groups in Taiwan

• I attended PyConJP in 2012, 2013 (APAC),
2015, and 2019

• Last year I led a group visit to PyConJP (thank
you, Terada-san, for sharing the know-how!)

• I hope we can do more

PyCon Taiwan
5-6 Sep, 2020, Tainan, Taiwan

• It is planned to be an on-site conference
(unless something incredibly bad
happens again)

• Speakers may choose to speak online

• We still need to wear a face mask

• We appreciate the citizens and
government of Taiwan, who work hard to
counter COVID-19

• https://g0v.hackmd.io/@kiang/mask-info

• We hope to see you again in Taiwan!
https://tw.pycon.org/2020/

Numerical software
• Numerical software: Computer programs to solve scientific or
mathematical problems.

• Other names: Mathematical software, scientific software, technical
software.

• Python is a popular language for application experts to describe the
problems and solutions, because it is easy to use.

• Most of the computing systems (the numerical software) are designed in
a hybrid architecture.

• The computing kernel uses C++.

• Python is chosen for the user-level API.

Example: OPC
Figure: photolithography in semiconductor fabrication. A light source shines
through a photomask onto the photoresist (PR) on a silicon substrate. The
wavelength is only hundreds of nm, and the image I want to project on the PR
is smaller than the wavelength.
Optical proximity correction (OPC): work out the shape I need on the mask,
and write code to make it happen.

Example: PDEs
Numerical simulations of conservation laws:

$$
\frac{\partial u}{\partial t} + \sum_{k=1}^{3} \frac{\partial F^{(k)}(u)}{\partial x_k} = 0
$$

Use case: stress waves in anisotropic solids
Use case: compressible flows
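
To make "numerical simulation of a conservation law" concrete, here is a minimal sketch for the simplest one-dimensional case, ∂u/∂t + ∂F(u)/∂x = 0 with F(u) = a·u, marched with a first-order upwind update on a periodic grid. The linear flux, the grid, the CFL number, and the name `advect_upwind` are assumptions chosen for illustration; they are not taken from the solvers behind the use cases above.

```python
import numpy as np

def advect_upwind(u0, a=1.0, dx=0.01, cfl=0.5, nstep=100):
    """March du/dt + d(a*u)/dx = 0 (with a > 0) on a periodic 1D grid."""
    dt = cfl * dx / a
    u = u0.copy()
    for _ in range(nstep):
        flux = a * u                       # F(u) evaluated in every cell
        # Upwind: what enters cell i through its left face leaves cell i-1.
        u = u - dt / dx * (flux - np.roll(flux, 1))
    return u

x = np.linspace(0.0, 1.0, 100, endpoint=False)   # 100 cells, dx = 0.01
u0 = np.exp(-200.0 * (x - 0.3) ** 2)             # smooth bump centered at 0.3
u = advect_upwind(u0)

# The update only moves flux between neighbors, so the total is preserved.
print(abs(u.sum() - u0.sum()))
```

The update is written in flux-difference form on purpose: whatever leaves one cell enters its neighbor, which is the discrete counterpart of the conservation law.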

Example: What others do
• Machine learning

• Examples: TensorFlow, PyTorch

• Also:

• Computer aided design and engineering (CAD/CAE)

• Computer graphics and visualization

• Hybrid architecture provides both speed and flexibility

• C++ makes it possible to do the huge amount of calculation, e.g.,
distributed computing across thousands of computers

• Python helps describe the complex problems of mathematics or sciences

Crunch real numbers
• Simple example: solve the Laplace equation

• $$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad (0 < x < 1;\ 0 < y < 1)$$

• $$u(0, y) = 0,\quad u(1, y) = \sin(\pi y) \quad (0 \le y \le 1)$$

• $$u(x, 0) = 0,\quad u(x, 1) = 0 \quad (0 \le x \le 1)$$

• Use a two-dimensional array as the spatial grid

• Point-Jacobi method: 3-level nested loop
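    import numpy as np

    # Assumed setup, not shown on the slide: a square nx-by-nx grid on the
    # unit square with u(1,y) = sin(pi*y) on one edge and 0 on the others.
    nx = 51
    uoriginal = np.zeros((nx, nx), dtype='float64')
    uoriginal[-1,:] = np.sin(np.pi * np.linspace(0, 1, nx))

    def solve_python_loop():
        u = uoriginal.copy()
        un = u.copy()
        converged = False
        step = 0
        # Outer loop: repeat Jacobi sweeps until the update is small enough.
        while not converged:
            step += 1
            # Inner loops. One for x and the other for y.
            for it in range(1, nx-1):
                for jt in range(1, nx-1):
                    un[it,jt] = (u[it+1,jt] + u[it-1,jt]
                                 + u[it,jt+1] + u[it,jt-1]) / 4
            # Largest change in this sweep, used as the convergence measure.
            norm = np.abs(un-u).max()
            u[...] = un[...]
            converged = norm < 1.e-5
        return u, step, norm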
Note: u(1,y) = sin(πy) is a non-trivial boundary condition.
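
One way to gain confidence in a Point-Jacobi solver for this problem is to compare it with the closed-form solution u(x, y) = sinh(πx) sin(πy) / sinh(π), which satisfies the Laplace equation and all four boundary conditions stated above. The sketch below is a check written under assumptions: a small grid (`nx = 21`), the layout `u[i, j] ≈ u(x_i, y_j)`, and a vectorized sweep in place of the nested loops. It is not the code used for the timings in this talk.

```python
import numpy as np

nx = 21                                    # assumed grid size, kept small
x = np.linspace(0.0, 1.0, nx)
u = np.zeros((nx, nx), dtype='float64')    # u[i, j] approximates u(x_i, y_j)
u[-1, :] = np.sin(np.pi * x)               # u(1, y) = sin(pi*y); other edges stay 0

un = u.copy()
norm, step = 1.0, 0
while norm >= 1.e-5:
    step += 1
    # One Jacobi sweep over the interior points.
    un[1:-1, 1:-1] = (u[2:, 1:-1] + u[:-2, 1:-1]
                      + u[1:-1, 2:] + u[1:-1, :-2]) / 4
    norm = np.abs(un - u).max()
    u[...] = un

# Closed-form solution of the same boundary-value problem.
X, Y = np.meshgrid(x, x, indexing='ij')
exact = np.sinh(np.pi * X) * np.sin(np.pi * Y) / np.sinh(np.pi)
error = np.abs(u - exact).max()
print(step, error)
```

The remaining difference combines the discretization error of the grid with the fact that the iteration stops once the update falls below the tolerance, so it should be small but not zero.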

Power of Numpy and C++
Numpy version: the two inner loops become one slice assignment.

    def solve_numpy_array():
        u = uoriginal.copy()
        un = u.copy()
        converged = False
        step = 0
        while not converged:
            step += 1
            un[1:nx-1,1:nx-1] = (u[2:nx,1:nx-1] + u[0:nx-2,1:nx-1] +
                                 u[1:nx-1,2:nx] + u[1:nx-1,0:nx-2]) / 4
            norm = np.abs(un-u).max()
            u[...] = un[...]
            converged = norm < 1.e-5
        return u, step, norm

    CPU times: user 62.1 ms, sys: 1.6 ms, total: 63.7 ms
    Wall time: 63.1 ms: Pretty good!

Pure Python version: the same outer loop as solve_python_loop() above, with
the sweep written as explicit inner loops.

    # Inner loops. One for x and the other for y.
    for it in range(1, nx-1):
        for jt in range(1, nx-1):
            un[it,jt] = (u[it+1,jt] + u[it-1,jt] + u[it,jt+1] + u[it,jt-1]) / 4

    CPU times: user 5.24 s, sys: 22.5 ms, total: 5.26 s
    Wall time: 5280 ms: Poor speed
    #include <tuple>
    // xtensor headers (paths may differ between xtensor releases)
    #include <xtensor/xarray.hpp>
    #include <xtensor/xmath.hpp>

    std::tuple<xt::xarray<double>, size_t, double>
    solve_cpp(xt::xarray<double> u)
    {
        const size_t nx = u.shape(0);
        xt::xarray<double> un = u;
        bool converged = false;
        size_t step = 0;
        double norm = 0;
        while (!converged)
        {
            ++step;
            // One Jacobi sweep over the interior points.
            for (size_t it=1; it<nx-1; ++it)
            {
                for (size_t jt=1; jt<nx-1; ++jt)
                {
                    un(it,jt) = (u(it+1,jt) + u(it-1,jt) + u(it,jt+1) + u(it,jt-1)) / 4;
                }
            }
            // Largest change in this sweep decides convergence.
            norm = xt::amax(xt::abs(un-u))();
            if (norm < 1.e-5) { converged = true; }
            u = un;
        }
        return std::make_tuple(u, step, norm);
    }

    CPU times: user 29.7 ms, sys: 506 µs, total: 30.2 ms
    Wall time: 29.9 ms: Definitely good!
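Wall time and speed-up:

    Version       Wall time   vs. pure Python   vs. Numpy
    Pure Python   5280 ms     1x
    Numpy         63.1 ms     83.7x             1x
    C++           29.9 ms     176.6x            2.1x

The speed is the reason: at this speed-up, the work of 1000 computers
shrinks to about 5.67. Save a lot of $.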
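
The timings above are from the speaker's machine. A minimal sketch of how one might reproduce the Python-versus-Numpy part of the comparison with the standard `timeit` module follows. It times a single Jacobi sweep rather than the full solve; the grid size and the names `sweep_loop` and `sweep_numpy` are assumptions for illustration, and no timing is quoted because it depends on the machine.

```python
import timeit
import numpy as np

nx = 101                                   # assumed grid size
rng = np.random.default_rng(0)
u = rng.random((nx, nx))

def sweep_loop(u):
    """One Jacobi sweep with explicit Python loops."""
    un = u.copy()
    for it in range(1, nx - 1):
        for jt in range(1, nx - 1):
            un[it, jt] = (u[it+1, jt] + u[it-1, jt]
                          + u[it, jt+1] + u[it, jt-1]) / 4
    return un

def sweep_numpy(u):
    """The same sweep expressed with array slices."""
    un = u.copy()
    un[1:-1, 1:-1] = (u[2:, 1:-1] + u[:-2, 1:-1]
                      + u[1:-1, 2:] + u[1:-1, :-2]) / 4
    return un

# Average wall time per call over a few repetitions.
t_loop = timeit.timeit(lambda: sweep_loop(u), number=3) / 3
t_numpy = timeit.timeit(lambda: sweep_numpy(u), number=3) / 3
print(f"loop: {t_loop*1e3:.2f} ms, numpy: {t_numpy*1e3:.3f} ms, "
      f"ratio: {t_loop / t_numpy:.1f}x")
```

Checking that both functions return the same array before comparing their speed is worth the extra line; a fast but wrong kernel is not a speed-up.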

Recap: Why Python?
• Python is slow, but numpy may be reasonably fast.

• Coding in C++ is time-consuming.

• C++ is only needed in the computing kernel.

• Most code is supportive code, but it must not slow down the
computing kernel.

• Python makes it easier to organize and structure the code.

This is why high-performance systems usually use a hybrid
architecture (C++ with Python or another scripting language).

Let’s go hybrid, but …
• A dilemma:

• Engineers (domain experts) know the problems but
don’t know C++ and software engineering.

• Computer scientists (programmers) know about C++
and software engineering but not the problems.

• Either side takes years of practice and study.

• Not a lot of people want to play both roles.

NSD: attempt to improve
• Numerical software development: a graduate-level
course

• Train computer scientists in the hybrid architecture
for numerical software

• https://github.com/yungyuc/nsd

• Runnable Jupyter notebooks
• Part 1: Start with Python
• Lecture 1: Introduction

• Lecture 2: Fundamental engineering practices

• Lecture 3: Python and numpy

• Part 2: Computer architecture for performance
• Lecture 4: C++ and computer architecture
• Lecture 5: Matrix operations

• Lecture 6: Cache optimization

• Lecture 7: SIMD

• Part 3: Resource management
• Lecture 8: Memory management

• Lecture 9: Ownership and smart pointers

• Part 4: How to write C++ for Python
• Lecture 10: Modern C++

• Lecture 11: C++ and C for Python

• Lecture 12: Array code in C++

• Lecture 13: Array-oriented design

• Part 5: Conclude with Python
• Lecture 14: Advanced Python

• Term project presentation

Memory hierarchy
• We go to C++ to make it easier to access hardware

• In a modern computer, the CPU is much faster than memory

• High performance comes from hiding the memory-access latency

    Level            Latency
    Registers        0 cycles
    L1 cache         4 cycles
    L2 cache         10 cycles
    L3 cache         50 cycles
    Main memory      200 cycles
    Disk (storage)   100,000 cycles
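
One practical consequence of these latencies is that the order in which an array is laid out and traversed matters: contiguous access gives the caches a chance to hide the latency, while strided access does not. The sketch below only inspects how an ndarray is laid out in memory; the array shape is an arbitrary assumption, and no timing is asserted because it depends on the machine.

```python
import numpy as np

a = np.zeros((1000, 1000), dtype='float64')   # C (row-major) layout by default

# Strides are the byte steps between neighboring elements along each axis.
print(a.strides)            # (8000, 8): elements of one row are adjacent
print(a.T.strides)          # (8, 8000): the transpose is a view, not a copy

row = a[0, :]               # contiguous: consecutive 8-byte elements
col = a[:, 0]               # strided: 8000 bytes between elements
print(row.flags['C_CONTIGUOUS'], col.flags['C_CONTIGUOUS'])   # True False
```

Walking along `row` touches neighboring bytes, whereas walking along `col` jumps 8000 bytes per element, so each access is likely to land on a different cache line.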

Data object
• Numerical software processes
huge amounts of data. Copying
it is expensive.

• Use a pipeline to process the
same block of data

• Use an object to manage the
data: data object

• Data objects may not always be a
good idea in other fields.

• Here we do what it takes for
uncompromised performance.
Figure: the data object (Data) is accessed at all phases of the pipeline:
field initialization, interior time-marching, boundary condition, parallel
data sync, and finalization.
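
A minimal sketch of the data-object idea follows: one object owns the buffers, and every phase works on them in place. The class name `Field`, its method names, and the choice of the Laplace problem as the payload are hypothetical; the phases follow the pipeline listed above, with the parallel data sync left out because the sketch runs in a single process.

```python
import numpy as np

class Field:
    """Hypothetical data object: owns the buffers that every phase works on."""

    def __init__(self, nx):
        self.u = np.zeros((nx, nx), dtype='float64')
        self.un = np.zeros_like(self.u)    # scratch buffer, allocated once

    def initialize(self):
        x = np.linspace(0.0, 1.0, self.u.shape[0])
        self.u[-1, :] = np.sin(np.pi * x)

    def march_interior(self):
        u, un = self.u, self.un
        un[...] = u
        un[1:-1, 1:-1] = (u[2:, 1:-1] + u[:-2, 1:-1]
                          + u[1:-1, 2:] + u[1:-1, :-2]) / 4
        u[...] = un                        # write back into the existing buffer

    def apply_boundary(self):
        self.u[0, :] = 0.0
        self.u[:, 0] = 0.0
        self.u[:, -1] = 0.0

    def finalize(self):
        return float(np.abs(self.u).max())

field = Field(nx=11)
buffer_id = id(field.u)
field.initialize()
for _ in range(10):
    field.march_interior()
    field.apply_boundary()
peak = field.finalize()
# The data object still holds the very same ndarray it started with.
print(id(field.u) == buffer_id, peak)
```

The phases communicate only through the object's buffers, which is what lets a pipeline keep processing the same block of data without handing copies around.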

Zero-copy: do it where it fits
Figure: a Python app and a C++ app work on one memory buffer
(a11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn) shared across languages, reached through an
ndarray on the Python side and a C++ container on the C++ side. Of the two,
one manages the buffer and the other accesses it. Two directions are shown:
top (Python) - down (C++) and bottom (C++) - up (Python).
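
The sharing itself can be tried without writing any C++. In the sketch below a standard-library `array.array` stands in for the C++ container that owns the memory, which corresponds to the bottom (C++) - up (Python) direction; that stand-in is an assumption of this example, not how a real extension module would be written. `np.frombuffer` wraps the existing buffer instead of copying it.

```python
import array
import numpy as np

# A stdlib array plays the role of the C++ container that owns the memory.
container = array.array('d', [0.0] * 6)

# Wrap the existing buffer as a 2x3 ndarray; no data is copied.
nd = np.frombuffer(container, dtype='float64').reshape(2, 3)

nd[1, 2] = 42.0                  # write through the ndarray ...
print(container[5])              # ... and the owner sees it: 42.0

container[0] = -1.0              # write through the owner ...
print(nd[0, 0])                  # ... and the ndarray sees it: -1.0
print(nd.flags['OWNDATA'])       # False: the ndarray does not own the memory
```

The price of zero-copy is a lifetime contract: whoever manages the buffer has to keep it alive and unchanged in size for as long as the other side holds a view of it.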

More detail …
Notes about moving from Python to C++

• Python frame object

• Building Python extensions using pybind11
and cmake

• Inspecting assembly code

• x86 intrinsics

• PyObject, CPython API and pybind11 API

• Shared pointer, unique pointer, raw pointer,
and ownership

• Template generic programming

https://tw.pycon.org/2020/en-us/events/talk/1164539411870777736/
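
As a small taste of the first item in the list above, the frame object can be inspected from Python itself. The sketch below assumes CPython, where `inspect.currentframe()` returns the frame of the running function; the function names `kernel` and `driver` are made up for the example.

```python
import inspect

def kernel():
    frame = inspect.currentframe()        # frame object of this call
    caller = frame.f_back                 # frame object of the caller
    return frame.f_code.co_name, caller.f_code.co_name

def driver():
    return kernel()

print(driver())                           # ('kernel', 'driver')
```

Every Python-level call creates such a frame, which is part of the per-call overhead that a C++ kernel avoids.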

How to learn
• Work on a real project.

• Keep in mind that Python can be about 100x slower than C/C++
(176.6x in the Laplace example).

• Always profile (time).

• Don’t treat Python as simply Python.

• View Python as an interpreter library written in C.

• Use tools to call C/C++: Cython, pybind11, etc.
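
To see what "an interpreter library written in C" means in practice, the sketch below calls a function of the CPython C API directly from Python through the standard `ctypes` module. It assumes the interpreter is CPython; `Py_GetVersion` is part of the C API and returns a C string, so its return type has to be declared by hand.

```python
import ctypes
import sys

# ctypes.pythonapi exposes the C API of the running CPython interpreter.
get_version = ctypes.pythonapi.Py_GetVersion
get_version.restype = ctypes.c_char_p     # const char* Py_GetVersion(void)
get_version.argtypes = []

c_version = get_version().decode()
print(c_version)
print(sys.version)
```

Declaring `restype` and `argtypes` by hand is exactly the bookkeeping that tools such as Cython and pybind11 take over when the C or C++ code is your own.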

What we want
Figure: from seeing problems, to formulating the problems, to getting
something working, and then on to automation, a prototype, and reusable
software, with two question marks along the way. One-time programs may happen.
