Yung-Yu Chen (@yungyuc)
On the necessity and
inapplicability of Python
Help us develop numerical software
Who I am
• I am a mechanical engineer by training, focusing on
applications of continuum mechanics. A computational
scientist / engineer rather than a computer scientist.

• In my day job, I write high-performance code for
semiconductor applications of computational geometry
and lithography.

• In my spare time, I teach a course, ‘Numerical Software
Development’, in the Department of Computer Science
at NCTU.
You can contact me through twitter: https://twitter.com/yungyuc
or linkedin: https://www.linkedin.com/in/yungyuc/.
PyHUG
• Python Hsinchu User Group (established in late
2011)

• The first group of staff of PyCon Taiwan (2012)

• Weekly meetups at a pub for 3 years, not
stopped by COVID-19

• 7+ active user groups in Taiwan 

• I attended PyCon JP in 2012, 2013 (APAC),
2015, and 2019

• Last year I led a visiting group to PyCon JP (thank
you, Terada-san, for sharing the know-how!)

• I hope we can do more
PyCon
Taiwan
5-6 Sep, 2020, Tainan, Taiwan

• It is planned to be an on-site conference
(unless something incredibly bad
happens again)

• Speakers may choose to speak online

• We still need to wear a face mask

• Appreciate the Taiwan citizens and
government, who work hard to
counter COVID-19

• https://g0v.hackmd.io/@kiang/mask-info

• We hope to see you again in Taiwan!
https://tw.pycon.org/2020/
Numerical software
• Numerical software: computer programs that solve scientific or
mathematical problems.

• Other names: Mathematical software, scientific software, technical
software.

• Python is a popular language for application experts to describe the
problems and solutions, because it is easy to use.

• Most of the computing systems (the numerical software) are designed in
a hybrid architecture.

• The computing kernel uses C++.

• Python is chosen for the user-level API.
Example: OPC
[Figure: photolithography in semiconductor fabrication. A light source shines through a photomask onto photoresist (PR) on a silicon substrate. The wavelength is only hundreds of nm, and the image I want to project on the PR is smaller than the wavelength. Optical proximity correction (OPC) finds the shape I need on the mask: write code to make it happen.]
Example: PDEs
Numerical simulations of conservation laws:

$$\frac{\partial u}{\partial t} + \sum_{k=1}^{3} \frac{\partial F^{(k)}(u)}{\partial x_k} = 0$$

Use case: stress waves in anisotropic solids
Use case: compressible flows
Example: What others do
• Machine learning

• Examples: TensorFlow, PyTorch

• Also:

• Computer aided design and engineering (CAD/CAE)

• Computer graphics and visualization

• Hybrid architecture provides both speed and flexibility

• C++ makes it possible to carry out huge amounts of calculation, e.g.,
distributed computing across thousands of computers

• Python helps describe complex problems in mathematics and the sciences
Crunch real numbers
• Simple example: solve the Laplace equation

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad (0 < x < 1;\ 0 < y < 1)$$

$$u(0,y) = 0,\quad u(1,y) = \sin(\pi y) \quad (0 \le y \le 1)$$

$$u(x,0) = 0,\quad u(x,1) = 0 \quad (0 \le x \le 1)$$

• Use a two-dimensional array as the spatial grid

• Point-Jacobi method: 3-level nested loop
def solve_python_loop():
    u = uoriginal.copy()
    un = u.copy()
    converged = False
    step = 0
    # Outer loop.
    while not converged:
        step += 1
        # Inner loops. One for x and the other for y.
        for it in range(1, nx-1):
            for jt in range(1, nx-1):
                un[it,jt] = (u[it+1,jt] + u[it-1,jt]
                             + u[it,jt+1] + u[it,jt-1]) / 4
        norm = np.abs(un-u).max()
        u[...] = un[...]
        converged = norm < 1.e-5
    return u, step, norm
(Non-trivial boundary condition: u(1,y) = sin(πy).)
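The solver reads `uoriginal` and `nx` from the enclosing scope, but the slide does not show how they are set up. A minimal sketch of one possible setup follows. The grid size `nx = 51`, the helper name `make_grid`, and the convention that the first array index is x are all assumptions, not taken from the talk.

```python
import numpy as np

def make_grid(nx):
    """Hypothetical helper: zero field with the boundary values filled in."""
    # Assumption: first index runs along x, second along y, both on [0, 1].
    y = np.linspace(0.0, 1.0, nx)
    u = np.zeros((nx, nx), dtype="float64")
    # Only the x = 1 edge is non-zero: u(1, y) = sin(pi * y).
    u[-1, :] = np.sin(np.pi * y)
    return u

nx = 51  # assumed grid size; the slide does not state it
uoriginal = make_grid(nx)
print(uoriginal.shape, uoriginal[-1, nx // 2])
```

The other three edges stay at zero, which matches the remaining boundary conditions.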
Power of Numpy and C++
def solve_numpy_array():
    u = uoriginal.copy()
    un = u.copy()
    converged = False
    step = 0
    while not converged:
        step += 1
        # Update all interior points at once with shifted slices.
        un[1:nx-1,1:nx-1] = (u[2:nx,1:nx-1] + u[0:nx-2,1:nx-1] +
                             u[1:nx-1,2:nx] + u[1:nx-1,0:nx-2]) / 4
        norm = np.abs(un-u).max()
        u[...] = un[...]
        converged = norm < 1.e-5
    return u, step, norm
solve_numpy_array:
CPU times: user 62.1 ms, sys: 1.6 ms, total: 63.7 ms
Wall time: 63.1 ms: Pretty good!

solve_python_loop:
CPU times: user 5.24 s, sys: 22.5 ms, total: 5.26 s
Wall time: 5280 ms: Poor speed
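Before trusting a timing comparison, it helps to confirm that the two versions compute the same thing. The sketch below is self-contained and makes several assumptions: the function names `jacobi_loop`, `jacobi_numpy`, and `timed` are hypothetical, the grid is a small 21 x 21 so the pure-Python loop finishes quickly, and `time.perf_counter` stands in for the notebook timing shown on the slide. Absolute timings will differ by machine.

```python
import time
import numpy as np

def make_grid(nx):
    # Zero field with u(1, y) = sin(pi * y) on the last row (assumed layout).
    u = np.zeros((nx, nx), dtype="float64")
    u[-1, :] = np.sin(np.pi * np.linspace(0.0, 1.0, nx))
    return u

def jacobi_loop(u0, tol=1.e-5):
    # Point-Jacobi with explicit Python loops over the interior points.
    nx = u0.shape[0]
    u = u0.copy()
    un = u.copy()
    step = 0
    while True:
        step += 1
        for it in range(1, nx-1):
            for jt in range(1, nx-1):
                un[it, jt] = (u[it+1, jt] + u[it-1, jt]
                              + u[it, jt+1] + u[it, jt-1]) / 4
        norm = np.abs(un - u).max()
        u[...] = un[...]
        if norm < tol:
            return u, step, norm

def jacobi_numpy(u0, tol=1.e-5):
    # Same update, expressed with shifted array slices.
    nx = u0.shape[0]
    u = u0.copy()
    un = u.copy()
    step = 0
    while True:
        step += 1
        un[1:nx-1, 1:nx-1] = (u[2:nx, 1:nx-1] + u[0:nx-2, 1:nx-1]
                              + u[1:nx-1, 2:nx] + u[1:nx-1, 0:nx-2]) / 4
        norm = np.abs(un - u).max()
        u[...] = un[...]
        if norm < tol:
            return u, step, norm

def timed(func, u0):
    # Return the solver result together with the elapsed wall time in seconds.
    t0 = time.perf_counter()
    result = func(u0)
    return result, time.perf_counter() - t0

u0 = make_grid(21)
(u_loop, step_loop, _), t_loop = timed(jacobi_loop, u0)
(u_np, step_np, _), t_np = timed(jacobi_numpy, u0)
print(f"loop:  {step_loop} steps, {t_loop * 1e3:.1f} ms")
print(f"numpy: {step_np} steps, {t_np * 1e3:.1f} ms")
print("results agree:", np.allclose(u_loop, u_np))
```

Both functions take the initial field as an argument instead of reading a global, which makes them easier to test on grids of different sizes.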
std::tuple<xt::xarray<double>, size_t, double>
solve_cpp(xt::xarray<double> u)
{
    const size_t nx = u.shape(0);
    xt::xarray<double> un = u;
    bool converged = false;
    size_t step = 0;
    double norm;
    while (!converged)
    {
        ++step;
        for (size_t it=1; it<nx-1; ++it)
        {
            for (size_t jt=1; jt<nx-1; ++jt)
            {
                un(it,jt) = (u(it+1,jt) + u(it-1,jt) + u(it,jt+1) + u(it,jt-1)) / 4;
            }
        }
        norm = xt::amax(xt::abs(un-u))();
        if (norm < 1.e-5) { converged = true; }
        u = un;
    }
    return std::make_tuple(u, step, norm);
}
solve_cpp:
CPU times: user 29.7 ms, sys: 506 µs, total: 30.2 ms
Wall time: 29.9 ms: Definitely good!
Wall time comparison:
  Pure Python   5280 ms
  Numpy         63.1 ms   (83.7x faster than pure Python)
  C++           29.9 ms   (2.1x faster than Numpy, 176.6x faster than pure Python)
The speed is the reason

At the C++ speedup, work that takes 1000 computers running pure Python takes about 5.67.

Save a lot of $
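The ratios quoted above can be reproduced from the three wall times. The short sketch below only redoes that arithmetic; the variable names are illustrative, and the machine count comes out at roughly 5.7 after rounding, in line with the figure on the slide.

```python
# Wall times in milliseconds, as reported on the slides.
wall_ms = {"pure Python": 5280.0, "Numpy": 63.1, "C++": 29.9}

numpy_vs_python = wall_ms["pure Python"] / wall_ms["Numpy"]
cpp_vs_numpy = wall_ms["Numpy"] / wall_ms["C++"]
cpp_vs_python = wall_ms["pure Python"] / wall_ms["C++"]

# Machines needed at C++ speed to match 1000 machines running pure Python.
machines = 1000 / cpp_vs_python

print(f"Numpy vs pure Python: {numpy_vs_python:.1f}x")
print(f"C++ vs Numpy:         {cpp_vs_numpy:.1f}x")
print(f"C++ vs pure Python:   {cpp_vs_python:.1f}x")
print(f"machines needed:      {machines:.1f}")
```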
Recap: Why Python?
• Python is slow, but numpy may be reasonably fast.

• Coding in C++ is time-consuming.

• C++ is only needed in the computing kernel.

• Most code is supporting code, but it must not slow down the
computing kernel.

• Python makes it easier to organize and structure the code.

This is why high-performance systems usually use a hybrid
architecture (C++ with Python or another scripting language).
Let’s go hybrid, but …
• A dilemma:

• Engineers (domain experts) know the problems but
don’t know C++ and software engineering.

• Computer scientists (programmers) know about C++
and software engineering but not the problems.

• Either side takes years of practice and study.

• Not a lot of people want to play both roles.
NSD: attempt to improve
• Numerical software development: a graduate-level
course

• Train computer scientists in the hybrid architecture
for numerical software

• https://github.com/yungyuc/nsd

• Runnable Jupyter notebooks
• Part 1: Start with Python
• Lecture 1: Introduction

• Lecture 2: Fundamental engineering practices

• Lecture 3: Python and numpy

• Part 2: Computer architecture for performance
• Lecture 4: C++ and computer architecture
• Lecture 5: Matrix operations

• Lecture 6: Cache optimization

• Lecture 7: SIMD

• Part 3: Resource management
• Lecture 8: Memory management

• Lecture 9: Ownership and smart pointers

• Part 4: How to write C++ for Python
• Lecture 10: Modern C++

• Lecture 11: C++ and C for Python

• Lecture 12: Array code in C++

• Lecture 13: Array-oriented design

• Part 5: Conclude with Python
• Lecture 14: Advanced Python

• Term project presentation
Memory hierarchy
• We go to C++ because it makes the hardware easier to access

• In a modern computer, the CPU is much faster than memory

• High performance comes from hiding memory-access latency
registers (0 cycle)
L1 cache (4 cycles)
L2 cache (10 cycles)
L3 cache (50 cycles)
Main memory (200 cycles)
Disk (storage) (100,000 cycles)
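The effect of the memory hierarchy can be glimpsed even from Python by walking the same array along and against its memory layout. The sketch below is only an illustration under stated assumptions: the array size and the helper names `sum_by_rows` and `sum_by_columns` are made up, and the measured gap depends on the machine and its caches.

```python
import time
import numpy as np

a = np.ones((2000, 2000), dtype="float64")  # C (row-major) layout by default
print("strides in bytes:", a.strides)       # step per row, step per column

def sum_by_rows(arr):
    # Each arr[i, :] is contiguous in memory, so cache lines are fully used.
    total = 0.0
    for i in range(arr.shape[0]):
        total += arr[i, :].sum()
    return total

def sum_by_columns(arr):
    # Each arr[:, j] jumps a whole row between elements (strided access).
    total = 0.0
    for j in range(arr.shape[1]):
        total += arr[:, j].sum()
    return total

t0 = time.perf_counter()
rows_total = sum_by_rows(a)
t1 = time.perf_counter()
cols_total = sum_by_columns(a)
t2 = time.perf_counter()

print(f"row-wise:    {(t1 - t0) * 1e3:.1f} ms")
print(f"column-wise: {(t2 - t1) * 1e3:.1f} ms")
```

Both traversals do the same arithmetic; any difference in time comes from how the data is fetched.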
Data object
• Numerical software processes
huge amounts of data. Copying
it is expensive.

• Use a pipeline to process the
same block of data

• Use an object to manage the
data: data object

• Data objects may not always be a
good idea in other fields.

• Here we do what it takes for
uncompromising
performance.
[Figure: pipeline phases (field initialization, interior time-marching, boundary condition, parallel data sync, finalization), with data access at all phases going to the same data object.]
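The idea of a data object can be sketched in a few lines of Python. Everything named here is hypothetical: the class `Field`, the phase functions, and the 11 x 11 grid are illustrative choices, and the parallel data sync phase is left out because it needs more than one process. The point is that every phase works in place on the same buffer.

```python
import numpy as np

class Field:
    """Hypothetical data object: owns the buffers that every phase works on."""
    def __init__(self, nx):
        self.u = np.zeros((nx, nx), dtype="float64")
        self.un = np.zeros_like(self.u)

def initialize(field):
    field.u[...] = 0.0  # write in place; never rebind field.u

def apply_boundary(field):
    nx = field.u.shape[0]
    field.u[-1, :] = np.sin(np.pi * np.linspace(0.0, 1.0, nx))

def march(field):
    # One Jacobi sweep over the interior, reusing the preallocated buffers.
    u, un = field.u, field.un
    un[...] = u
    un[1:-1, 1:-1] = (u[2:, 1:-1] + u[:-2, 1:-1]
                      + u[1:-1, 2:] + u[1:-1, :-2]) / 4
    u[...] = un

def finalize(field):
    return float(field.u.max())

field = Field(11)
address_before = field.u.__array_interface__["data"][0]
initialize(field)
apply_boundary(field)
for _ in range(10):
    march(field)
peak = finalize(field)
address_after = field.u.__array_interface__["data"][0]
print("same buffer throughout:", address_before == address_after)
```

Writing through `field.u[...]` instead of assigning a new array is what keeps the pipeline free of hidden copies of the main buffer.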
Zero-copy: do it where it fits
[Figure: a memory buffer (a11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn) shared across languages between a Python app and a C++ app. Two arrangements are shown: top (Python) - down (C++) and bottom (C++) - up (Python). In each, one of the ndarray and the C++ container manages the buffer and the other accesses it.]
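Zero-copy sharing can be tried from the Python side alone, using the buffer protocol. In the sketch below an `array.array` is only a stand-in for a C++ container that manages the memory, which is an assumption made to keep the example self-contained; a real hybrid system would expose the C++ buffer through an extension module. `numpy.frombuffer` then wraps the same memory without copying it.

```python
from array import array
import numpy as np

# Stand-in for a container that owns a contiguous block of 6 doubles.
buf = array("d", [0.0] * 6)

# Wrap the existing memory as a 2 x 3 ndarray; no data is copied.
nd = np.frombuffer(buf, dtype=np.float64).reshape(2, 3)

nd[1, 2] = 42.0   # write through the ndarray ...
buf[0] = -1.0     # ... and through the owning container

print(buf[5], nd[0, 0])                     # each side sees the other's write
print("ndarray owns its data:", nd.flags["OWNDATA"])
```

Because the ndarray does not own the memory, the owning container has to outlive every array that views it. NumPy keeps a reference to `buf` here, but across a language boundary that lifetime has to be managed explicitly.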
More detail …
Notes about moving from Python to C++ 

• Python frame object

• Building Python extensions using pybind11
and cmake

• Inspecting assembly code

• x86 intrinsics

• PyObject, CPython API and pybind11 API

• Shared pointer, unique pointer, raw pointer,
and ownership

• Template generic programming

https://tw.pycon.org/2020/en-us/events/talk/1164539411870777736/
How to learn
• Work on a real project.

• Keep in mind that pure Python can be 100x slower than C/C++ (176.6x in the Laplace example).

• Always profile (time).

• Don’t treat Python as simply Python.

• View Python as an interpreter library written in C.

• Use tools to call C/C++: Cython, pybind11, etc.
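A small example of the "always profile" advice, using the standard `timeit` module. The workload (a sum of squares over 100,000 integers) and the function names are assumptions chosen for illustration; the measured ratio will vary by machine and should not be read as a general figure.

```python
import timeit
import numpy as np

data = list(range(100_000))
arr = np.arange(100_000, dtype="int64")

def sum_squares_python(values):
    # Interpreted loop: one bytecode round trip per element.
    total = 0
    for v in values:
        total += v * v
    return total

def sum_squares_numpy(values):
    # The loop runs inside compiled code behind np.dot.
    return int(np.dot(values, values))

t_py = timeit.timeit(lambda: sum_squares_python(data), number=10)
t_np = timeit.timeit(lambda: sum_squares_numpy(arr), number=10)

print(f"pure Python: {t_py / 10 * 1e3:.3f} ms per call")
print(f"numpy:       {t_np / 10 * 1e3:.3f} ms per call")
print("same answer:", sum_squares_python(data) == sum_squares_numpy(arr))
```

Measuring first shows whether a piece of code is actually a bottleneck before any effort goes into moving it to C or C++.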
What we want
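[Figure: a flow from seeing problems, to formulating the problems, to getting something working, and on toward a prototype, automation, and reusable software (the later steps are marked with question marks). One-time programs may happen.]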
Thanks!
Questions?
