Making Python 100x Faster with Less Than 100 Lines of Rust

ScyllaDB 285 views 36 slides Jun 27, 2024

About This Presentation

Python isn’t known as a low-latency language. Can we bridge the performance gap using a bit of Rust and some profiling?


Slide Content

Making Python x100 Faster with Rust
Ohad Ravid, Team Lead at Trigo

Ohad Ravid (he/him)
Team Lead at Trigo
Worked on backend, frontend, networking, firmware, ...
Using Python & Rust to build scalable and fast systems
Love tests and tea

Making Python Faster, with Rust
Or: How we solved a performance issue in one of our Python libraries, using Rust
An incremental approach, not a rewrite
Start small, reduce risk
An elegant combination, which balances flexibility and performance

Trigo
"Magic Checkout": stand in front of a checkout station, and your items will be on the screen
Real time location
In 3D
Physical servers, in the store
Using 100s of cameras

Trigo's 3D Engine Architecture
First, we convert 2D images to 2D skeletons
Skeletons == pixels in the image containing heads / shoulders / hands
For every camera, for every timestamp

Trigo's 3D Engine Architecture
Group 2D skeletons by timestamp

Trigo's 3D Engine Architecture
Build 3D skeletons from the 2D skeletons from all the cameras
Pure Python codebase with lots of numpy, quick to iterate on
Scalable design
Shout out to Excalidraw, a tool for sketching beautiful diagrams

Problem and Motivation
Fine for X concurrent (physical) users
Grinds to a halt for 5X concurrent (physical) users

Our Solution
Profile to find the biggest perf opportunities
Avoid frequently changed parts of the codebase
Strive to maintain API compatibility
Try to improve perf directly in Python / numpy
If that is not enough, rewrite in Rust, but a single function / class at a time

A toy example
But how can we rewrite just a single function of a big codebase in Rust?
And how can we maintain the same API?
Let's use a toy library to explore this!

A toy example
    @dataclass
    class Polygon:
        x: np.array
        y: np.array
        @cached_property
        def center(self) -> np.array: ...
        def area(self) -> float: ...
    # .. lots of functions working with lists of `Polygon`s ..
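The method bodies are elided on the slide. As a rough, self-contained sketch of what such a class could look like, the version below fills them in under two assumptions that are not from the talk: `center` is taken to be the mean of the vertices, and `area` uses the shoelace formula for a simple polygon.

```python
from dataclasses import dataclass
from functools import cached_property

import numpy as np


@dataclass
class Polygon:
    x: np.ndarray  # x coordinates of the vertices
    y: np.ndarray  # y coordinates of the vertices

    @cached_property
    def center(self) -> np.ndarray:
        # Assumption (not from the talk): center = mean of the vertices.
        return np.array([self.x.mean(), self.y.mean()])

    def area(self) -> float:
        # Assumption (not from the talk): shoelace formula, simple polygon.
        cross = np.dot(self.x, np.roll(self.y, -1)) - np.dot(self.y, np.roll(self.x, -1))
        return float(0.5 * abs(cross))


# A unit square as a small usage example.
unit_square = Polygon(
    x=np.array([0.0, 1.0, 1.0, 0.0]),
    y=np.array([0.0, 0.0, 1.0, 1.0]),
)
print(unit_square.center, unit_square.area())
```

Because `center` is a `cached_property`, it is computed on first access and then stored on the instance, which is why the later slides treat it as a plain attribute lookup.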

Profiling
We will use py-spy and not cProfile
We need a benchmark and a baseline
"Good benchmarking is hard. Having said that, do not stress too much about having a perfect benchmarking setup, particularly when you start optimizing a program." ~ Nicholas Nethercote, in "The Rust Performance Book"

Benchmark
    # .. imports ..
    NUM_ITER = 10
    # Generate some data
    polygons, points = poly_match.generate_example()
    # Run a few iterations of the logic
    t0 = time.perf_counter()
    for _ in range(NUM_ITER):
        poly_match.main(polygons, points)
    t1 = time.perf_counter()
    # Calculate how much time it took.
    print(f"Took an avg of {((t1 - t0) / NUM_ITER) * 1000:.2f}ms per iteration")

Baseline
    $ python measure.py
    Took an avg of 147.46ms per iteration
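A single averaged run like the one above is usually enough to get started. If the numbers turn out to be noisy, one common refinement (not something the talk uses) is to repeat the measurement a few times and look at the best run. The sketch below does that with the standard library's `timeit.repeat`; `workload` is a hypothetical stand-in for `poly_match.main(polygons, points)`.

```python
import timeit

NUM_ITER = 10  # iterations per measurement, as in the benchmark slide
REPEATS = 5    # assumption: how many times to repeat the measurement


def workload():
    # Hypothetical stand-in for poly_match.main(polygons, points).
    return sum(i * i for i in range(10_000))


# Each entry is the total time (in seconds) for NUM_ITER calls.
totals = timeit.repeat(workload, repeat=REPEATS, number=NUM_ITER)
per_iter_ms = [(t / NUM_ITER) * 1000 for t in totals]
best_ms = min(per_iter_ms)

print(f"Best of {REPEATS}: {best_ms:.2f}ms per iteration")
```

Taking the minimum reduces the influence of unrelated system noise, at the cost of hiding variance that may matter for latency-sensitive code.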

Measure first
So, let's find out what is so slow here!
    $ py-spy record --native -o profile.svg -- python measure.py
    py-spy> Sampling process 100 times a second. Press Control-C to exit.
    ...
    py-spy> Wrote flamegraph data to 'profile.svg'. Samples: 391 Errors: 0

Measure first
This will generate a flamegraph:
[Figure: flamegraph of measure.py]

Measure first
We’ll focus on find_close_polygons, because everything else takes much less time (<<)

Measure first
So, let's have a look at find_close_polygons
    def find_close_polygons(
        polygon_subset: List[Polygon], point: np.array, max_dist: float
    ) -> List[Polygon]:
        close_polygons = []
        for poly in polygon_subset:
            if np.linalg.norm(poly.center - point) < max_dist:
                close_polygons.append(poly)
        return close_polygons
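The "Our Solution" slide mentions trying to improve performance directly in Python / numpy before reaching for Rust. The talk does not show such an attempt, so the following is only a sketch of what one could look like: the per-polygon `np.linalg.norm` calls are replaced by a single call over a stacked array. The `Polygon` class here is a minimal stand-in with just a `center` attribute.

```python
from typing import List

import numpy as np


class Polygon:
    """Minimal stand-in: only the attribute this function needs."""

    def __init__(self, center):
        self.center = np.asarray(center, dtype=float)


def find_close_polygons(polygon_subset: List[Polygon], point: np.ndarray, max_dist: float) -> List[Polygon]:
    # Loop version, as on the slide.
    close_polygons = []
    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)
    return close_polygons


def find_close_polygons_vectorized(polygon_subset: List[Polygon], point: np.ndarray, max_dist: float) -> List[Polygon]:
    # Hypothetical variant (not from the talk): one norm call for all centers.
    if not polygon_subset:
        return []
    centers = np.stack([poly.center for poly in polygon_subset])  # shape (n, 2)
    dists = np.linalg.norm(centers - point, axis=1)
    return [poly for poly, d in zip(polygon_subset, dists) if d < max_dist]


polys = [Polygon([0, 0]), Polygon([3, 4]), Polygon([10, 10])]
origin = np.array([0.0, 0.0])
close_loop = find_close_polygons(polys, origin, 5.5)
close_vec = find_close_polygons_vectorized(polys, origin, 5.5)
print(len(close_loop), len(close_vec))
```

Note that this variant still pays for one Python attribute access per polygon when building `centers`, so whether it helps in practice would need to be measured with the same benchmark.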

Our Rust Crate
pyo3 is a Rust library (a crate) for interacting between Python and Rust.
A bit like pybind11 in C++
Used by popular Python packages and tools (cryptography, orjson, …)
Let's create our crate, and get to work:
    mkdir poly_match_rs && cd "$_"
    pip install maturin
    maturin init --bindings pyo3

Our Rust Crate
Starting out, our crate is going to look like this:
    use pyo3::prelude::*;
    #[pyfunction]
    fn find_close_polygons() -> PyResult<()> {
        Ok(())
    }
    #[pymodule]
    fn poly_match_rs(_py: Python, m: &PyModule) -> PyResult<()> {
        m.add_function(wrap_pyfunction!(find_close_polygons, m)?)?;
        Ok(())
    }

Measure twice
Running the profiler again generates a new flamegraph:
[Figure: flamegraph after moving find_close_polygons to Rust]

Measure twice
Pink is mostly overhead (allocating, getattr)
Blue is the actual logic (norm)
(not to scale)
Figure labels: 57% of total runtime, 9% of total runtime

Measure twice
Most of the time is spent in getattr and getting the underlying array using as_array.
To improve this, we need to rewrite Polygon in Rust.

Measure twice
A reminder:
    @dataclass
    class Polygon:
        x: np.array
        y: np.array
        @cached_property
        def center(self) -> np.array: ...
        def area(self) -> float: ...
    # .. lots of functions working with lists of `Polygon`s ..

v2 - Rewrite Polygon in Rust
Our struct is going to look like this, which is pretty similar to the original class!
    // Rust struct
    use ndarray::Array1;
    #[pyclass(subclass)]
    struct Polygon {
        x: Array1<f64>,
        y: Array1<f64>,
        center: Array1<f64>,
    }
    # Original Python class
    import numpy as np
    @dataclass
    class Polygon:
        x: np.array
        y: np.array
        @cached_property
        def center(self) -> np.array: ...

v2 - Rewrite Polygon in Rust
And can be subclassed from Python
    // Rust struct
    use ndarray::Array1;
    #[pyclass(subclass)]
    struct Polygon {
        x: Array1<f64>,
        y: Array1<f64>,
        center: Array1<f64>,
    }
    # Python subclass
    class Polygon(poly_match_rs.Polygon):
        _area: float = None
        def area(self) -> float: ...
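The subclassing pattern can be tried out without building the Rust crate. In the sketch below, `RustPolygonStandIn` is a hypothetical pure-Python placeholder for `poly_match_rs.Polygon` (it is not the real extension type), and the `area` body, which is elided on the slide, is assumed to cache a shoelace-formula result in `_area`.

```python
import numpy as np


class RustPolygonStandIn:
    """Hypothetical pure-Python placeholder for poly_match_rs.Polygon."""

    def __init__(self, x, y):
        self.x = np.asarray(x, dtype=float)
        self.y = np.asarray(y, dtype=float)
        # Assumption: the center is computed once, at construction time.
        self.center = np.array([self.x.mean(), self.y.mean()])


class Polygon(RustPolygonStandIn):
    # Python-only state and methods live in the subclass,
    # so the rest of the codebase keeps using the same API.
    _area = None

    def area(self) -> float:
        # Assumption (not from the talk): compute lazily and cache in _area.
        if self._area is None:
            cross = np.dot(self.x, np.roll(self.y, -1)) - np.dot(self.y, np.roll(self.x, -1))
            self._area = float(0.5 * abs(cross))
        return self._area


square = Polygon([0.0, 2.0, 2.0, 0.0], [0.0, 0.0, 2.0, 2.0])
print(square.center, square.area())
```

The appeal of this split is that only the hot fields move to Rust, while less performance-sensitive helpers such as `area` can stay in Python and keep evolving there.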

v2 - Rewrite Polygon in Rust
And use the fact that we have a Rust-based struct to implement our function
    - polygon_subset: Vec<Py<PyAny>>,
    + polygon_subset: Vec<Py<Polygon>>,

v2 - Rewrite Polygon in Rust
And use the fact that we have a Rust-based struct to implement our function:
    for poly in polygons {
        let norm = {
            let center = &poly.as_ref(py).borrow().center;
            ((center[0] - point[0]).square() + (center[1] - point[1]).square()).sqrt()
        };
        if norm < max_dist {
            close_polygons.push(poly)
        }
    }
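The Rust loop computes the distance by hand instead of calling a general-purpose norm function. For 2D points the two are the same quantity, which the short Python sketch below checks; `dist_2d` is a name introduced here for illustration only.

```python
import math

import numpy as np


def dist_2d(center, point) -> float:
    # Same arithmetic as the Rust snippet: sqrt(dx^2 + dy^2).
    dx = center[0] - point[0]
    dy = center[1] - point[1]
    return math.sqrt(dx * dx + dy * dy)


center = np.array([3.0, 4.0])
point = np.array([0.0, 0.0])

manual = dist_2d(center, point)
via_numpy = float(np.linalg.norm(center - point))
print(manual, via_numpy)
```

Writing the distance out explicitly avoids allocating a temporary array for `center - point` on every iteration, which is the kind of overhead the flamegraph slides attribute most of the runtime to.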

v2 - Rewrite Polygon in Rust
(Baseline was ~150ms, line-to-line was ~17ms)
    $ (cd ./poly_match_rs/ && maturin develop --release)
    $ python measure.py
    Took an avg of 1.71ms per iteration
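The speedups implied by these measurements can be worked out directly. The sketch below uses the 147.46ms baseline and the 1.71ms result quoted in the slides, and treats the line-to-line figure as exactly 17ms even though the slide only gives it approximately.

```python
import math

baseline_ms = 147.46     # from the Baseline slide
line_to_line_ms = 17.0   # approximate: the slide says "~17ms"
v2_ms = 1.71             # from the slide above

step1 = baseline_ms / line_to_line_ms  # naive translation vs. baseline
step2 = line_to_line_ms / v2_ms        # Rust Polygon vs. naive translation
total = baseline_ms / v2_ms            # end-to-end

print(f"step 1: {step1:.1f}x, step 2: {step2:.1f}x, total: {total:.1f}x")
```

Under these numbers each step is a bit under 10x and the total is somewhat below 100x, which is consistent with the "~10x, then another 10x" framing being a rounded description.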

Summary
We profiled our Python code using py-spy
A naive, line-to-line translation of the hottest function resulted in ~10x improvement
Converting our Python class to a Rust struct resulted in another 10x improvement
You can find out more at ohadravid.github.io (can we go even faster?)

Takeaways
Rust (with the help of pyo3) unlocks true native performance for everyday Python code, with minimal compromises.
Python is a superb API for researchers, and crafting fast building blocks with Rust is an extremely powerful combination.

Ohad Ravid @ohadrv https://ohadravid.github.io Thank you!