Take advantage of C++ from Python

Yung-Yu Chen
PyCon Kyushu
30th June 2018

Why C++
❖ Python is slow
❖ Everything is on heap
❖ Always dynamic types
❖ Hard to access assembly
❖ Convoluted algorithms with ndarray
❖ Access external code written in any language
❖ Detailed control and abstraction

Hard problems take time
• Supersonic jet in cross flow; density contour
• 53 hours on 264 cores for 1.3 billion variables (66 million elements) over 12,000 time steps
• At OSC, 2011 (10 Gbps InfiniBand)
HPC (high-performance computing) is hard. Physics is harder. Don’t mingle.

Best of both worlds
❖ C++: fast runtime, strong static type checking, industrial grade
❖ Slow to code
❖ Python: fast prototyping, batteries included, easy to use
❖ Slow to run
❖ Hybrid systems are everywhere.
❖ TensorFlow, Blender, OpenCV, etc.
❖ C++ crunches numbers. Python controls the flow.
❖ Applications work like libraries, libraries like applications.

pybind11
❖ https://github.com/pybind/pybind11: a header-only library that requires C++11
❖ Expose C++ entities to Python
❖ Use Python from C++
❖ list, tuple, dict, and str
❖ handle, object, and none

C++11(/14/17/20)
New language features: auto and decltype, defaulted and deleted
functions, final and override, trailing return type, rvalue references,
move constructors/move assignment, scoped enums, constexpr and
literal types, list initialization, delegating and inherited constructors,
brace-or-equal initializers, nullptr, long long, char16_t and char32_t,
type aliases, variadic templates, generalized unions, generalized
PODs, Unicode string literals, user-defined literals, attributes,
lambda expressions, noexcept, alignof and alignas, multithreaded
memory model, thread-local storage, GC interface, range for (based
on a Boost library), static assertions (based on a Boost library)
http://en.cppreference.com/w/cpp/language/history

Python’s friends
❖ Shared pointer: manage resource ownership between
C++ and Python
❖ Move semantics: speed
❖ Lambda expression: ease the wrapping code

Ownership
❖ All Python objects are dynamically allocated on the
heap. Python uses reference counting to know who
should deallocate the object when it is no longer used.
❖ An owner of a reference to an object is responsible for
deallocating the object. With multiple owners, the last
owner (at that point the reference count is 1) calls the
destructor and deallocates the object. The other owners simply
decrement the count by 1.
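
The ownership rule above can be observed from Python itself. The following is a minimal sketch, assuming CPython (other interpreters may not use reference counting, and the absolute values reported by sys.getrefcount vary between versions, so only differences are compared). The Resource class and the refcount_trace function are hypothetical names introduced for illustration.

```python
import sys


class Resource:
    """Hypothetical stand-in for an object whose lifetime we want to observe."""

    released = []  # names of resources whose finalizer has run

    def __init__(self, name):
        self.name = name

    def __del__(self):
        # In CPython this runs as soon as the last reference goes away.
        Resource.released.append(self.name)


def refcount_trace():
    """Return how the reference count changes as owners come and go."""
    Resource.released.clear()
    first = Resource("buffer")
    base = sys.getrefcount(first)            # baseline with a single owner
    second = first                           # a second owner: count goes up by one
    shared = sys.getrefcount(first) - base
    del second                               # this owner only decrements the count
    after_del = sys.getrefcount(first) - base
    before_last = list(Resource.released)    # nothing has been released yet
    del first                                # the last owner triggers deallocation
    return shared, after_del, before_last, list(Resource.released)
```

Calling refcount_trace() shows the count rising by one for the second owner, returning to the baseline when that owner goes away, and the finalizer running only after the last owner releases the object.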

Shared pointer
#include <memory>
#include <vector>
#include <numeric>  // std::accumulate
#include <iostream>
class Series {
    std::vector<int> m_data;
public:
    int sum() const {
        const int ret = std::accumulate(
            m_data.begin(), m_data.end(), 0);
        std::cout << "Series::sum() = " << ret << std::endl;
        return ret;
    }
    static size_t count; // number of live Series objects
    Series(size_t size, int lead) : m_data(size) {
        for (size_t it=0; it<size; it++) { m_data[it] = lead+it; }
        count++;
    }
    ~Series() { count--; }
};
size_t Series::count = 0;
void use_raw_pointer() {
    Series * series_ptr = new Series(10, 2);
    series_ptr->sum(); // call member function
    // OUT: Series::sum() = 65
    // remember to delete the object or we leak memory
    std::cout << "before explicit deletion, Series::count = "
              << Series::count << std::endl;
    // OUT: before explicit deletion, Series::count = 1
    delete series_ptr;
    std::cout << "after the resource is manually freed, Series::count = "
              << Series::count << std::endl;
    // OUT: after the resource is manually freed, Series::count = 0
}
void use_shared_pointer() {
    std::shared_ptr<Series> series_sptr(new Series(10, 3));
    series_sptr->sum(); // call member function
    // OUT: Series::sum() = 75
    // note shared_ptr handles deletion for series_sptr
}
int main(int argc, char ** argv) {
    // the common raw pointer
    use_raw_pointer();
    // now, shared_ptr
    use_shared_pointer();
    std::cout << "no memory leak: Series::count = "
              << Series::count << std::endl;
    // OUT: no memory leak: Series::count = 0
    return 0;
}

Move semantics
❖ Number-crunching code needs large arrays as memory buffers.
They aren’t supposed to be copied frequently.
❖ A 50,000 × 50,000 array of 8-byte elements takes 20 GB.
❖ Shared pointers should manage large chunks of memory.
❖ New reference to an object: copy constructor of shared pointer
❖ Borrowed reference to an object: const reference to the shared
pointer
❖ Stolen reference to an object: move constructor of shared
pointer
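
The 20 GB figure can be checked with a line of arithmetic. This sketch assumes 8-byte elements (for example double-precision floats); the slide does not state the element size, so the default itemsize below is an assumption, and dense_array_bytes is a hypothetical helper name.

```python
def dense_array_bytes(nrow, ncol, itemsize=8):
    """Bytes needed for a dense nrow x ncol array.

    itemsize=8 is an assumption (e.g. float64); the slide gives only the total.
    """
    return nrow * ncol * itemsize


size = dense_array_bytes(50_000, 50_000)
print(size / 10**9, "GB")  # size expressed in decimal gigabytes
```

At this size a single accidental copy doubles the memory footprint, which is why the buffer is shared or moved rather than copied.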

Lambda
❖ Put the code at the place it should be shown
namespace py = pybind11;
auto cls = py::class_< wrapped_type, holder_type >(mod, pyname, clsdoc);
cls
    .def(
        py::init([](block_type & block, index_type icl, bool init_sentinel) {
            return wrapped_type(block, icl, init_sentinel);
        }),
        py::arg("block"), py::arg("icl"), py::arg("init_sentinel")=true
    )
    .def("repr", &wrapped_type::repr, py::arg("indent")=0, py::arg("precision")=0)
    .def("__repr__", [](wrapped_type & self){ return self.repr(); })
    .def("init_sentinel", &wrapped_type::init_sentinel)
    .def_readwrite("cnd", &wrapped_type::cnd)
    .def_readwrite("vol", &wrapped_type::vol)
    .def_property_readonly(
        "nbce",
        [](wrapped_type & self) { return self.bces.size(); }
    )
    .def(
        "get_bce",
        [](wrapped_type & self, index_type ibce) { return self.bces.at(ibce); }
    )
;

Lambda, cont’d
❖ Code as free as Python, as fast as C
#include <unordered_map>
#include <functional>
#include <cstdio>
int main(int argc, char ** argv) {
    // Python: fmap = dict()
    std::unordered_map<int, std::function<void(int)>> fmap;
    // Python: fmap[1] = lambda v: print("v = %d" % v)
    fmap.insert({
        1, [](int v) -> void { std::printf("v = %d\n", v); }
    });
    // Python: fmap[5] = lambda v: print("v*5 = %d" % (v*5))
    fmap.insert({
        5, [](int v) -> void { std::printf("v*5 = %d\n", v*5); }
    });
    std::unordered_map<int, std::function<void(int)>>::iterator search;
    // Python: fmap[1](100)
    search = fmap.find(1);
    search->second(100);
    // OUT: v = 100
    // Python: fmap[5](500)
    search = fmap.find(5);
    search->second(500);
    // OUT: v*5 = 2500
    return 0;
}

Manipulate Python
❖ Don’t mingle Python with C++
❖ Python has GIL
❖ Don’t include Python.h if you don’t intend to run
Python
❖ Once it enters your core, it’s hard to get it out
#include <Python.h>
class Core {
private:
    int m_value;
    PyObject * m_pyobject; // the core now depends on the Python runtime
};

Do it in the wrapping layer
cls
    .def(
        py::init([](py::object pyblock) {
            block_type * block = py::cast<block_type *>(pyblock.attr("_ustblk"));
            std::shared_ptr<wrapped_type> svr = wrapped_type::construct(block->shared_from_this());
            for (auto bc : py::list(pyblock.attr("bclist"))) {
                std::string name = py::str(bc.attr("__class__").attr("__name__").attr("lstrip")("GasPlus"));
                BoundaryData * data = py::cast<BoundaryData *>(bc.attr("_data"));
                std::unique_ptr<gas::TrimBase<NDIM>> trim;
                if ("Interface" == name) {
                    trim = make_unique<gas::TrimInterface<NDIM>>(*svr, *data);
                } else if ("NoOp" == name) {
                    trim = make_unique<gas::TrimNoOp<NDIM>>(*svr, *data);
                } else if ("NonRefl" == name) {
                    trim = make_unique<gas::TrimNonRefl<NDIM>>(*svr, *data);
                } else if ("SlipWall" == name) {
                    trim = make_unique<gas::TrimSlipWall<NDIM>>(*svr, *data);
                } else if ("Inlet" == name) {
                    trim = make_unique<gas::TrimInlet<NDIM>>(*svr, *data);
                } else {
                    /* do nothing for now */ // throw std::runtime_error("BC type unknown");
                }
                svr->trims().push_back(std::move(trim));
            }
            if (report_interval) { svr->make_qty(); }
            return svr;
        }),
        py::arg("block")
    );

pybind11::list
❖ Read a list and cast contents:
❖ Populate:
#include <pybind11/pybind11.h> // must be first
#include <string>
#include <iostream>
namespace py = pybind11;
PYBIND11_MODULE(_pylist, mod) {
    mod.def(
        "do",
        [](py::list & l) {
            // convert contents to std::string and send to cout
            std::cout << "std::cout:" << std::endl;
            for (py::handle o : l) {
                std::string s = py::cast<std::string>(o);
                std::cout << s << std::endl;
            }
        }
    );
    mod.def(
        "do2",
        [](py::list & l) {
            // create a new list
            std::cout << "py::print:" << std::endl;
            py::list l2;
            for (py::handle o : l) {
                std::string s = py::cast<std::string>(o);
                s = "elm:" + s;
                py::str s2(s);
                l2.append(s2); // populate contents
            }
            py::print(l2);
        }
    );
} /* end PYBIND11_MODULE(_pylist) */
>>> import _pylist
>>> # print the input list
>>> _pylist.do(["a", "b", "c"])
std::cout:
a
b
c
>>> _pylist.do2(["d", "e", "f"])
py::print:
['elm:d', 'elm:e', 'elm:f']

pybind11::tuple
❖ A tuple is immutable, so it behaves as read-only. It is
constructed from another iterable object.
❖ Read the contents of a tuple:
#include <pybind11/pybind11.h> // must be first
namespace py = pybind11;
PYBIND11_MODULE(_pytuple, mod) {
    mod.def(
        "do",
        [](py::args & args) {
            // build a list using py::list::append
            py::list l;
            for (py::handle h : args) {
                l.append(h);
            }
            // convert it to a tuple
            py::tuple t(l);
            // print it out
            py::print(py::str("{} len={}").format(t, t.size()));
            // print the elements one by one
            for (size_t it=0; it<t.size(); ++it) {
                py::print(py::str("{}").format(t[it]));
            }
        }
    );
} /* end PYBIND11_MODULE(_pytuple) */
>>> import _pytuple
>>> _pytuple.do("a", 7, 5.6)
('a', 7, 5.6) len=3
a
7
5.6

pybind11::dict
❖ The dictionary is one of the most useful containers in Python.
❖ Populate a dictionary:
❖ Manipulate it:
#include <pybind11/pybind11.h> // must be first
#include <string>
#include <stdexcept>
#include <iostream>
namespace py = pybind11;
PYBIND11_MODULE(_pydict, mod) {
    mod.def(
        "do",
        [](py::args & args) {
            if (args.size() % 2 != 0) {
                throw std::runtime_error("argument number must be even");
            }
            // create a dict from the input tuple
            py::dict d;
            for (size_t it=0; it<args.size(); it+=2) {
                d[args[it]] = args[it+1];
            }
            return d;
        }
    );
    mod.def(
        "do2",
        [](py::dict d, py::args & args) {
            for (py::handle h : args) {
                if (d.contains(h)) {
                    std::cout << py::cast<std::string>(h)
                              << " is in the input dictionary" << std::endl;
                } else {
                    std::cout << py::cast<std::string>(h)
                              << " is not found in the input dictionary" << std::endl;
                }
            }
            std::cout << "remove everything in the input dictionary!" << std::endl;
            d.clear();
            return d;
        }
    );
} /* end PYBIND11_MODULE(_pydict) */
>>> import _pydict
>>> d = _pydict.do("a", 7, "b", "name", 10, 4.2)
>>> print(d)
{'a': 7, 'b': 'name', 10: 4.2}
>>> d2 = _pydict.do2(d, "b", "d")
b is in the input dictionary
d is not found in the input dictionary
remove everything in the input dictionary!
>>> print("The returned dictionary is empty:", d2)
The returned dictionary is empty: {}
>>> print("The first dictionary becomes empty too:", d)
The first dictionary becomes empty too: {}
>>> print("Are the two dictionaries the same?", d2 is d)
Are the two dictionaries the same? True

pybind11::str
❖ One more trick with Python strings in pybind11: the user-defined literal _s
#include <pybind11/pybind11.h> // must be first
#include <iostream>
namespace py = pybind11;
using namespace py::literals; // to bring in the `_s` literal
PYBIND11_MODULE(_pystr, mod) {
    mod.def(
        "do",
        []() {
            py::str s("python string {}"_s.format("formatting"));
            py::print(s);
        }
    );
} /* end PYBIND11_MODULE(_pystr) */
>>> import _pystr
>>> _pystr.do()
python string formatting

Generic Python objects
❖ pybind11 defines two generic types for representing
Python objects:
❖ “handle”: base class of all pybind11 classes for Python
types
❖ “object” derives from handle and adds automatic
reference counting

pybind11::handle and object
#include <iostream>
PYBIND11_MODULE(_pyho, mod) {
    mod.def(
        "do",
        [](py::object const & o) {
            std::cout << "refcount in the beginning: "
                      << o.ptr()->ob_refcnt << std::endl;
            py::handle h(o);
            std::cout << "no increase of refcount with a new pybind11::handle: "
                      << h.ptr()->ob_refcnt << std::endl;
            {
                py::object o2(o);
                std::cout << "increased refcount with a new pybind11::object: "
                          << o2.ptr()->ob_refcnt << std::endl;
            }
            std::cout << "decreased refcount after the new pybind11::object destructed: "
                      << o.ptr()->ob_refcnt << std::endl;
            h.inc_ref();
            std::cout << "manually increases refcount after h.inc_ref(): "
                      << h.ptr()->ob_refcnt << std::endl;
            h.dec_ref();
            std::cout << "manually decreases refcount after h.dec_ref(): "
                      << h.ptr()->ob_refcnt << std::endl;
        }
    );
} /* end PYBIND11_MODULE(_pyho) */
>>> import _pyho
>>> _pyho.do(["name"])
refcount in the beginning: 3
no increase of refcount with a new pybind11::handle: 3
increased refcount with a new pybind11::object: 4
decreased refcount after the new pybind11::object destructed: 3
manually increases refcount after h.inc_ref(): 4
manually decreases refcount after h.dec_ref(): 3

pybind11::none
❖ It’s worth noting that pybind11 has a “none” type. In Python,
None is a singleton, and it is accessible as Py_None in the C API.
❖ Access the None singleton from C++:
#include <iostream>
PYBIND11_MODULE(_pynone, mod) {
    mod.def(
        "do",
        [](py::object const & o) {
            if (o.is(py::none())) {
                std::cout << "it is None" << std::endl;
            } else {
                std::cout << "it is not None" << std::endl;
            }
        }
    );
} /* end PYBIND11_MODULE(_pynone) */
>>> import _pynone
>>> _pynone.do(None)
it is None
>>> _pynone.do(False)
it is not None
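
For comparison, the check above is an identity test, which on the Python side is spelled with the is operator. The sketch below uses a hypothetical describe function to show that values such as False, 0, and the empty string are falsy but are still not None.

```python
def describe(o):
    """Identity test against the None singleton (not a truthiness test)."""
    return "it is None" if o is None else "it is not None"


# Falsy values are distinct objects from None.
results = [describe(v) for v in (None, False, 0, "")]
```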

Never loop in Python
❖ Sum 100,000,000 integers
$ python -m timeit -s 'data = range(100000000)' 'sum(data)'
10 loops, best of 3: 2.36 sec per loop
❖ The C++ version:
#include <cstdio>
int main(int argc, char ** argv) {
    long value = 0;
    for (long it=0; it<100000000; ++it) { value += it; }
    return 0;
}
$ time ./run
real 0m0.010s
user 0m0.002s
sys 0m0.004s
❖ Numpy is better, but not enough
$ python -m timeit -s 'import numpy as np ; data = np.arange(100000000, dtype="int64")' 'data.sum()'
10 loops, best of 3: 74.9 msec per loop
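
The same comparison can be reproduced inside Python without leaving the interpreter. This is a rough sketch using only the standard library: loop_sum, builtin_sum, and compare are hypothetical names, the default problem size is much smaller than the 100,000,000 used above, and the timings depend entirely on the machine, so no numbers are quoted here.

```python
import timeit


def loop_sum(n):
    """Sum 0..n-1 with an explicit Python-level loop."""
    total = 0
    for i in range(n):
        total += i
    return total


def builtin_sum(n):
    """Sum 0..n-1 with the loop running inside the interpreter's C code."""
    return sum(range(n))


def compare(n=1_000_000, repeat=3):
    """Return the best-of-repeat wall time in seconds for each version."""
    t_loop = min(timeit.repeat(lambda: loop_sum(n), number=1, repeat=repeat))
    t_builtin = min(timeit.repeat(lambda: builtin_sum(n), number=1, repeat=repeat))
    return t_loop, t_builtin
```

Both functions return the same value; only the place where the loop runs differs, which is the point of the slide.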

Wisely use arrays
❖ Python calls are expensive. Data need to be transferred
from Python to C++ in batch. Use arrays.
❖ C++ code may use arrays as internal representation. For
example, matrices are arrays having a 2-D view.
❖ Arrays are used as both
❖ interface between Python and C++, and
❖ internal storage in the C++ engine
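
The idea that a matrix is a flat array with a 2-D view can be sketched with the standard library alone. The example below is an illustration, not the talk's implementation: as_matrix is a hypothetical helper, and it uses array.array and memoryview in place of numpy.

```python
from array import array


def as_matrix(flat, nrow, ncol):
    """Return a row-major 2-D view onto a flat array of doubles, without copying."""
    if len(flat) != nrow * ncol:
        raise ValueError("shape does not match the buffer length")
    # memoryview.cast() needs a byte format on one side, so go through "B" first.
    return memoryview(flat).cast("B").cast("d", shape=[nrow, ncol])


flat = array("d", range(6))   # one contiguous buffer holding 0.0 ... 5.0
mat = as_matrix(flat, 2, 3)   # the same memory seen as 2 rows x 3 columns
mat[1, 0] = 9.0               # writing through the view changes element 1*3 + 0 = 3
```

Because the view shares memory with the flat buffer, a whole table crosses the language boundary in one call instead of one call per element.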

Arrays in Python
❖ What we really mean is numpy(.ndarray)
❖ 12 lines to create vertices for a zig-zagging mesh
❖ They get things done, although they sometimes look convoluted
# create nodes.
nodes = []
for iy, yloc in enumerate(np.arange(y0, y1+dy/4, dy/2)):
    if iy % 2 == 0:
        meshx = np.arange(x0, x1+dx/4, dx, dtype='float64')
    else:
        meshx = np.arange(x0+dx/2, x1-dx/4, dx, dtype='float64')
    nodes.append(np.vstack([meshx, np.full_like(meshx, yloc)]).T)
nodes = np.vstack(nodes)
assert nodes.shape[0] == nnode
blk.ndcrd[:,:] = nodes
assert (blk.ndcrd == nodes).all()

Expose memory buffer
❖ Numpy arrays provide the most common construct: a
contiguous memory buffer, and tons of code
❖ N-dimensional arrays (ndarray)
❖ There are variants, but they are less useful in C++: masked
arrays, sparse matrices, etc.
Internal buffer:
class Buffer: public std::enable_shared_from_this<Buffer> {
private:
    size_t m_length = 0;
    char * m_data = nullptr;
    struct ctor_passkey {};
public:
    Buffer(size_t length, const ctor_passkey &)
        : m_length(length) { m_data = new char[length](); }
    static std::shared_ptr<Buffer> construct(size_t length) {
        return std::make_shared<Buffer>(length, ctor_passkey());
    }
    ~Buffer() {
        if (nullptr != m_data) {
            delete[] m_data;
            m_data = nullptr;
        }
    }
    /** Backdoor */
    template< typename T >
    T * data() const { return reinterpret_cast<T*>(m_data); }
};
Expose the buffer as ndarray:
py::array from(array_flavor flavor) {
    // ndarray shape and stride
    npy_intp shape[m_table.ndim()];
    std::copy(m_table.dims().begin(),
              m_table.dims().end(),
              shape);
    npy_intp strides[m_table.ndim()];
    strides[m_table.ndim()-1] = m_table.elsize();
    for (ssize_t it = m_table.ndim()-2; it >= 0; --it) {
        strides[it] = shape[it+1] * strides[it+1];
    }
    // create ndarray
    void * data = m_table.data();
    py::object tmp = py::reinterpret_steal<py::object>(
        PyArray_NewFromDescr(
            &PyArray_Type,
            PyArray_DescrFromType(m_table.datatypeid()),
            m_table.ndim(),
            shape,
            strides,
            data,
            NPY_ARRAY_WRITEABLE,
            nullptr));
    // link lifecycle to the underneath buffer
    py::object buffer = py::cast(m_table.buffer());
    py::array ret;
    if (PyArray_SetBaseObject((PyArrayObject *)tmp.ptr(),
                              buffer.inc_ref().ptr()) == 0) {
        ret = tmp;
    }
    return ret;
}
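
The stride computation in the C++ code on this slide follows the usual row-major (C-contiguous) rule: the last axis steps by one element, and each earlier axis steps by the extent of the next axis times that axis's stride. A small sketch of the same loop, with c_contiguous_strides as a hypothetical name:

```python
def c_contiguous_strides(shape, elsize):
    """Byte strides of a row-major array with the given shape and element size."""
    ndim = len(shape)
    if ndim == 0:
        return ()
    strides = [0] * ndim
    strides[-1] = elsize                      # last axis: one element per step
    for it in range(ndim - 2, -1, -1):        # walk from the second-last axis to the first
        strides[it] = shape[it + 1] * strides[it + 1]
    return tuple(strides)


# A 4 x 3 table of 8-byte elements: moving one row skips 3 * 8 bytes.
print(c_contiguous_strides((4, 3), 8))
```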

Define your meta data
❖ Free to define how the memory is used
class LookupTableCore {
private:
    std::shared_ptr<Buffer> m_buffer;
    std::vector<index_type> m_dims;
    index_type m_nghost = 0;
    index_type m_nbody = 0;
    index_type m_ncolumn = 0;
    index_type m_elsize = 1; ///< Element size in bytes.
    DataTypeId m_datatypeid = MH_INT8;
public:
    index_type ndim() const { return m_dims.size(); }
    index_type nghost() const { return m_nghost; }
    index_type nbody() const { return m_nbody; }
    index_type nfull() const { return m_nghost + m_nbody; }
    index_type ncolumn() const { return m_ncolumn; }
    index_type nelem() const { return nfull() * ncolumn(); }
    index_type elsize() const { return m_elsize; }
    DataTypeId datatypeid() const { return m_datatypeid; }
    size_t nbyte() const { return buffer()->nbyte(); }
};
(Figure: layout of the table, divided into a ghost part and a body part, with index 0 marked.)
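
The bookkeeping in the class above can be mirrored in a few lines of Python. This is a sketch under stated assumptions, not the talk's implementation: TableMeta and byte_offset are hypothetical names, the buffer is assumed to hold exactly nelem elements, and the indexing convention (ghost rows at -nghost..-1, body rows at 0..nbody-1) is an assumption; the slide itself only gives the counts.

```python
class TableMeta:
    """Hypothetical metadata for a 2-D table stored in one contiguous buffer."""

    def __init__(self, nghost, nbody, ncolumn, elsize):
        self.nghost = nghost    # rows reserved for ghost entries
        self.nbody = nbody      # rows holding the real entries
        self.ncolumn = ncolumn  # elements per row
        self.elsize = elsize    # element size in bytes

    @property
    def nfull(self):
        return self.nghost + self.nbody

    @property
    def nelem(self):
        return self.nfull * self.ncolumn

    @property
    def nbyte(self):
        # Assumption: the buffer holds exactly nelem elements.
        return self.nelem * self.elsize

    def byte_offset(self, irow, icol=0):
        """Offset of an element, assuming ghost rows are -nghost..-1 and body rows 0..nbody-1."""
        if not -self.nghost <= irow < self.nbody:
            raise IndexError("row index out of range")
        if not 0 <= icol < self.ncolumn:
            raise IndexError("column index out of range")
        return ((irow + self.nghost) * self.ncolumn + icol) * self.elsize
```

Under that assumed convention, row 0 always addresses the first body row regardless of how many ghost rows precede it in memory.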

Organize arrays
❖ LookupTable is a class
template providing static
information for the dynamic
array core
❖ Now we can put together a
class that keeps track of all
data for computation
template< size_t NDIM >
class UnstructuredBlock {
private:
    // geometry arrays.
    LookupTable<real_type, NDIM> m_ndcrd;
    LookupTable<real_type, NDIM> m_fccnd;
    LookupTable<real_type, NDIM> m_fcnml;
    LookupTable<real_type, 0> m_fcara;
    LookupTable<real_type, NDIM> m_clcnd;
    LookupTable<real_type, 0> m_clvol;
    // meta arrays.
    LookupTable<shape_type, 0> m_fctpn;
    LookupTable<shape_type, 0> m_cltpn;
    LookupTable<index_type, 0> m_clgrp;
    // connectivity arrays.
    LookupTable<index_type, FCMND+1> m_fcnds;
    LookupTable<index_type, FCNCL > m_fccls;
    LookupTable<index_type, CLMND+1> m_clnds;
    LookupTable<index_type, CLMFC+1> m_clfcs;
    // boundary information.
    LookupTable<index_type, 2> m_bndfcs;
    std::vector<BoundaryData> m_bndvec;
};
(This case is for unstructured meshes of mixed elements in 2-/3-dimensional Euclidean space)

Fast and hideous
❖ In theory we can write
beautiful and fast code in
C++, and we should.
❖ In practice, as long as it’s
fast, it’s not too hard to
compromise on elegance.
❖ Testability is the bottom
line.
const index_type *
pclfcs = reinterpret_cast<const index_type *>(clfcs().row(0));
prcells = reinterpret_cast<index_type *>(rcells.row(0));
for (icl=0; icl<ncell(); icl++) {
    for (ifl=1; ifl<=pclfcs[0]; ifl++) {
        ifl1 = ifl-1;
        ifc = pclfcs[ifl];
        const index_type *
        pfccls = reinterpret_cast<const index_type *>(fccls().row(0))
               + ifc*FCREL;
        if (ifc == -1) { // NOT A FACE!? SHOULDN'T HAPPEN.
            prcells[ifl1] = -1;
            continue;
        } else if (pfccls[0] == icl) {
            if (pfccls[2] != -1) { // has neighboring block.
                prcells[ifl1] = -1;
            } else { // is interior.
                prcells[ifl1] = pfccls[1];
            };
        } else if (pfccls[1] == icl) { // I am the neighboring cell.
            prcells[ifl1] = pfccls[0];
        };
        // count rcell number.
        if (prcells[ifl1] >= 0) {
            rcellno[icl] += 1;
        } else {
            prcells[ifl1] = -1;
        };
    };
    // advance pointers.
    pclfcs += CLMFC+1;
    prcells += CLMFC;
};
(This looks like C since it really was C.)
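
One way to keep a kernel like this testable is to hold a slow, readable reference version next to it and compare outputs on small meshes. The sketch below is such a reference under assumed data layouts, not the talk's code: related_cells is a hypothetical name, each clfcs row is assumed to be [number of faces, face indices...], each fccls row is assumed to start with [owning cell, neighboring cell, neighboring block], and unset entries are assumed to start at -1.

```python
def related_cells(clfcs, fccls):
    """Slow reference for the kernel above; returns (rcells, rcellno).

    Assumed layouts (not confirmed by the slides):
      clfcs[icl] = [nface, ifc_1, ifc_2, ...]
      fccls[ifc] = [owner_cell, neighbor_cell, neighbor_block, ...]
    """
    width = max((len(row) for row in clfcs), default=1) - 1
    rcells = [[-1] * width for _ in clfcs]   # assumption: entries start at -1
    rcellno = [0] * len(clfcs)
    for icl, row in enumerate(clfcs):
        for ifl in range(1, row[0] + 1):
            ifc = row[ifl]
            if ifc == -1:                    # not a face; should not happen
                continue
            owner, neighbor, block = fccls[ifc][:3]
            if owner == icl:
                # a face shared with another block has no related cell here
                related = -1 if block != -1 else neighbor
            elif neighbor == icl:            # this cell is the neighboring one
                related = owner
            else:
                related = -1
            if related >= 0:
                rcells[icl][ifl - 1] = related
                rcellno[icl] += 1
    return rcells, rcellno


# Two triangles sharing face 2; all other faces lie on the boundary.
clfcs = [[3, 0, 1, 2], [3, 2, 3, 4]]
fccls = [[0, -1, -1, -1], [0, -1, -1, -1], [0, 1, -1, -1],
         [1, -1, -1, -1], [1, -1, -1, -1]]
rcells, rcellno = related_cells(clfcs, fccls)
```

A unit test can then feed the same small mesh to the compiled kernel and assert that both versions agree.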

Final notes
❖ Avoid Python when you need speed; use it as a shell to
your high-performance library from day one
❖ Resource management is at the core of the hybrid
architecture; do it in C++
❖ Use arrays (look-up tables) to keep large data
❖ Don’t access PyObject from your core
❖ Always keep in mind the differences between the type systems
