Making Python 100x Faster with Rust
Ohad Ravid
Team Lead at Trigo
Ohad Ravid (he/him)
Team Lead at Trigo
■ Worked on backend, frontend, networking, firmware, ...
■ Using Python & Rust to build scalable and fast systems
■ Love tests and tea
Making Python Faster, with Rust
Or: How we solved a performance issue in one of our Python
libraries, using Rust
■ An incremental approach, not a rewrite
■ Start small, reduce risk
■ An elegant combination, which balances flexibility and
performance
Trigo
"Magic Checkout" -
Stand in front of a checkout station,
and your items will be on the screen
■ Real time location
■ In 3D
■ Physical servers, in the store
■ Using 100s of cameras
Trigo's 3D Engine Architecture
■ First, we convert 2D images to 2D Skeletons
● Skeletons == Pixels in the image
containing heads / shoulders / hands
■ For every camera, for every timestamp
■ Group 2D skeletons by timestamp
■ Build 3D skeletons from the 2D skeletons from all the cameras
■ Pure Python codebase with lots of numpy, quick to iterate on
■ Scalable design
Shout out to Excalidraw, a tool for sketching beautiful diagrams
Problem and Motivation
■ Fine for X concurrent (physical) users
■ Grinds to a halt for 5X concurrent (physical) users
Our Solution
■ Profile to find the biggest perf opportunities
● Avoid frequently changed parts of the codebase
■ Strive to maintain API compatibility
● Try to improve perf directly in Python / numpy
● If not, rewrite in Rust, but a single function / class at a time
A toy example
■ But how can we rewrite in Rust just a single function in a big
codebase?
● And how can we maintain the same API?
■ Let’s use a toy library to explore this!
    @dataclass
    class Polygon:
        x: np.array
        y: np.array

        @cached_property
        def center(self) -> np.array: ...

        def area(self) -> float: ...

    # .. lots of functions working with lists of `Polygon`s ..
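The slide elides the method bodies. As a minimal, runnable sketch of such a class, the version below assumes that `center` is the mean of the vertices and that `area` uses the shoelace formula; both are illustrative choices, not necessarily what the original library does.

```python
from dataclasses import dataclass
from functools import cached_property

import numpy as np


@dataclass
class Polygon:
    x: np.ndarray  # x coordinates of the vertices, in order
    y: np.ndarray  # y coordinates of the vertices, in order

    @cached_property
    def center(self) -> np.ndarray:
        # Assumption: the center is the mean of the vertices.
        return np.array([self.x.mean(), self.y.mean()])

    def area(self) -> float:
        # Assumption: shoelace formula, valid for simple (non self-intersecting) polygons.
        cross = np.dot(self.x, np.roll(self.y, -1)) - np.dot(self.y, np.roll(self.x, -1))
        return float(0.5 * abs(cross))


# A unit square as a small usage example.
square = Polygon(
    x=np.array([0.0, 1.0, 1.0, 0.0]),
    y=np.array([0.0, 0.0, 1.0, 1.0]),
)
print(square.center, square.area())
```

Because `center` is a `cached_property`, it is computed once per polygon and then read as a plain attribute, which matters later when the hot loop reads it for every polygon.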
Profiling
■ We will use py-spy and not cProfile
■ We need a benchmark and a baseline
"Good benchmarking is hard. Having said that, do not stress too
much about having a perfect benchmarking setup, particularly when
you start optimizing a program."
~ Nicholas Nethercote, in "The Rust Performance Book"
Benchmark
    # .. imports ..
    NUM_ITER = 10

    ## Generate some data
    polygons, points = poly_match.generate_example()

    ## Run a few iterations of the logic
    t0 = time.perf_counter()
    for _ in range(NUM_ITER):
        poly_match.main(polygons, points)
    t1 = time.perf_counter()

    ## Calculate how much time it took.
    print(f"Took an avg of {((t1 - t0) / NUM_ITER) * 1000:.2f}ms per iteration")
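The same measurement can be wrapped in a small helper so that the baseline and every later version are timed identically. This is only a sketch: `avg_ms_per_iteration` and `workload` are hypothetical names, and `workload` stands in for `poly_match.main(polygons, points)`, which is not available here.

```python
import time
from typing import Callable


def avg_ms_per_iteration(func: Callable[[], object], num_iter: int = 10) -> float:
    """Call func num_iter times and return the average wall-clock time in ms."""
    func()  # one untimed warm-up call (lazy imports, caches)
    t0 = time.perf_counter()
    for _ in range(num_iter):
        func()
    t1 = time.perf_counter()
    return (t1 - t0) / num_iter * 1000


def workload() -> int:
    # Hypothetical stand-in for the real logic being benchmarked.
    return sum(i * i for i in range(10_000))


elapsed_ms = avg_ms_per_iteration(workload)
print(f"Took an avg of {elapsed_ms:.2f}ms per iteration")
```

The warm-up call is an addition to the slide's benchmark; it keeps one-time costs out of the average.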
Baseline
$ python measure.py
Took an avg of 147.46ms per iteration
Measure first
■ So, let's find out what is so slow here!
$ py-spy record --native -o profile.svg -- python measure.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.
...
py-spy> Wrote flamegraph data to 'profile.svg'. Samples: 391 Errors: 0
■ This will generate a flamegraph:
[Figure: flamegraph of the baseline profile]
■ We’ll focus on find_close_polygons,
because everything else takes much less time (<<)
■ So, let's have a look at find_close_polygons:
    def find_close_polygons(
        polygon_subset: List[Polygon], point: np.array, max_dist: float
    ) -> List[Polygon]:
        close_polygons = []
        for poly in polygon_subset:
            if np.linalg.norm(poly.center - point) < max_dist:
                close_polygons.append(poly)

        return close_polygons
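The plan was to try improving performance directly in Python / numpy before reaching for Rust. One way that attempt could look for this function is to compute all distances in a single numpy call. This is a sketch and not something shown in the talk; `find_close_polygons_vectorized` is a hypothetical name, and the `Polygon` here is a simplified stand-in that only carries a `center`.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Polygon:
    # Simplified stand-in for the toy class: only the precomputed center.
    center: np.ndarray


def find_close_polygons_vectorized(
    polygon_subset: List[Polygon], point: np.ndarray, max_dist: float
) -> List[Polygon]:
    if not polygon_subset:
        return []
    # Stack all centers into one (n, 2) array so the norm is computed once.
    centers = np.stack([poly.center for poly in polygon_subset])
    dists = np.linalg.norm(centers - point, axis=1)
    return [poly for poly, dist in zip(polygon_subset, dists) if dist < max_dist]


polygons = [
    Polygon(center=np.array([0.0, 0.0])),
    Polygon(center=np.array([3.0, 4.0])),
    Polygon(center=np.array([10.0, 10.0])),
]
close = find_close_polygons_vectorized(polygons, np.array([0.0, 0.0]), max_dist=6.0)
print(len(close))
```

This removes the per-polygon `np.linalg.norm` call, but it still reads `center` from every Python object and builds a fresh array on each call, so how much it helps depends on the size of `polygon_subset`.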
Our Rust Crate
■ pyo3 is a Rust library (a crate) for interoperating between Python and Rust.
● A bit like pybind11 for C++
● Used by popular Python packages and tools (cryptography, orjson, …)
■ Let's create our crate, and get to work:
mkdir poly_match_rs && cd "$_"
pip install maturin
maturin init --bindings pyo3
■ Starting out, our crate is going to look like this:
    use pyo3::prelude::*;

    #[pyfunction]
    fn find_close_polygons() -> PyResult<()> {
        Ok(())
    }

    #[pymodule]
    fn poly_match_rs(_py: Python, m: &PyModule) -> PyResult<()> {
        m.add_function(wrap_pyfunction!(find_close_polygons, m)?)?;
        Ok(())
    }
Measure twice
■ Running the profiler again generates a new flamegraph:
[Figure: flamegraph of the second profile, not to scale]
■ Pink is mostly overhead (allocating, getattr): 57% of total runtime
■ Blue is the actual logic (norm): 9% of total runtime
■ Most of the time is spent in getattr and getting the underlying
array using as_array.
■ To improve this, we need to rewrite Polygon in Rust.
A reminder:
    @dataclass
    class Polygon:
        x: np.array
        y: np.array

        @cached_property
        def center(self) -> np.array: ...

        def area(self) -> float: ...

    # .. lots of functions working with lists of `Polygon`s ..
v2 - Rewrite Polygon in Rust
■ Our struct is going to look like this:
    use ndarray::Array1;

    #[pyclass(subclass)]
    struct Polygon {
        x: Array1<f64>,
        y: Array1<f64>,
        center: Array1<f64>,
    }
■ This is pretty similar to the original class:
    import numpy as np

    @dataclass
    class Polygon:
        x: np.array
        y: np.array

        @cached_property
        def center(self) -> np.array: ...
■ And it can be subclassed from Python:
    class Polygon(poly_match_rs.Polygon):
        _area: float = None

        def area(self) -> float:
            ...
■ And use the fact that we have a Rust-based struct to implement our function:
    - polygon_subset: Vec<Py<PyAny>>,
    + polygon_subset: Vec<Py<Polygon>>,
    // Inside find_close_polygons. `square()` is not an inherent f64 method: it needs a
    // trait in scope (for example ndarray_linalg::Scalar); `powi(2)` is the std equivalent.
    for poly in polygons {
        let norm = {
            let center = &poly.as_ref(py).borrow().center;
            ((center[0] - point[0]).square() + (center[1] - point[1]).square()).sqrt()
        };

        if norm < max_dist {
            close_polygons.push(poly)
        }
    }
v2 - Rewrite Polygon in Rust
(Baseline was ~150ms, line-to-line was ~17ms)
$ (cd ./poly_match_rs/ && maturin develop --release)
$ python measure.py
Took an avg of 1.71ms per iteration
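For reference, the speedups implied by these numbers can be worked out directly. The 147.46ms and 1.71ms figures are the measured values from the slides; the line-to-line figure is only given as "~17ms", so the per-step ratios are approximate.

```python
baseline_ms = 147.46      # measured baseline (pure Python)
line_to_line_ms = 17.0    # "~17ms" on the slide, approximate
v2_ms = 1.71              # measured after rewriting Polygon in Rust

step1 = baseline_ms / line_to_line_ms  # line-to-line translation vs. baseline
step2 = line_to_line_ms / v2_ms        # Rust Polygon vs. line-to-line
total = baseline_ms / v2_ms            # overall

print(f"line-to-line: {step1:.1f}x, Rust struct: {step2:.1f}x, overall: {total:.1f}x")
```

Each step is roughly a 10x improvement, and together they come to about 86x on this benchmark.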
Summary
■ We profiled our Python code using py-spy
■ A naive, line-to-line translation of the hottest function resulted
in ~10x improvement
■ Converting our Python class to a Rust struct resulted in another
10x improvement
You can find out more at ohadravid.github.io
(can we go even faster?)
Takeaways
■ Rust (with the help of pyo3) unlocks true native performance
for everyday Python code, with minimal compromises.
■ Python is a superb API for researchers, and crafting fast
building blocks with Rust is an extremely powerful
combination.
Ohad Ravid
ohad.rv@gmail.com
@ohadrv
https://ohadravid.github.io
Thank you!

Making Python 100x Faster with Less Than 100 Lines of Rust
