Making Python 100x Faster with Less Than 100 Lines of Rust
ScyllaDB
36 slides
Jun 27, 2024
About This Presentation
Python isn’t known as a low-latency language. Can we bridge the performance gap using a bit of Rust and some profiling?
Slide Content
Making Python x100 Faster with Rust Ohad Ravid Team Lead at Trigo
Ohad Ravid (he/him)
Team Lead at Trigo.
Worked on backend, frontend, networking, firmware, ...
Using Python & Rust to build scalable and fast systems.
Love tests and tea.

Making Python Faster, with Rust
Or: How we solved a performance issue in one of our Python libraries, using Rust.
An incremental approach, not a rewrite.
Start small, reduce risk.
An elegant combination, which balances flexibility and performance.
Trigo "Magic Checkout"
Stand in front of a checkout station, and your items will be on the screen.
Real-time location, in 3D.
Physical servers, in the store.
Using 100s of cameras.

Trigo's 3D Engine Architecture
First, we convert 2D images to 2D skeletons.
Skeletons == pixels in the image containing heads / shoulders / hands.
For every camera, for every timestamp.

Trigo's 3D Engine Architecture
Group 2D skeletons by timestamp.

Trigo's 3D Engine Architecture
Build 3D skeletons from the 2D skeletons from all the cameras.
Pure Python codebase with lots of numpy, quick to iterate on.
Scalable design.
Shout out to Excalidraw, a tool for sketching beautiful diagrams.

Problem and Motivation
Fine for X concurrent (physical) users.
Grinds to a halt for 5X concurrent (physical) users.

Our Solution
Profile to find the biggest perf opportunities.
Avoid frequently changed parts of the codebase.
Strive to maintain API compatibility.
Try to improve perf directly in Python / numpy.
If not, rewrite in Rust, but a single function / class at a time.
A toy example
But how can we rewrite just a single function in Rust in a big codebase? And how can we maintain the same API? Let's use a toy library to explore this!
A toy example
@dataclass
class Polygon:
    x: np.array
    y: np.array

    @cached_property
    def center(self) -> np.array: ...

    def area(self) -> float: ...

# .. lots of functions working with lists of `Polygon`s ..
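The slide leaves the bodies of `center` and `area` elided. To make the toy class concrete, here is a minimal runnable sketch. It assumes `center` is the mean of the vertices and `area` uses the shoelace formula. Both bodies are assumptions for illustration, not necessarily what the talk's library does.

```python
from dataclasses import dataclass
from functools import cached_property

import numpy as np


@dataclass
class Polygon:
    x: np.ndarray
    y: np.ndarray

    @cached_property
    def center(self) -> np.ndarray:
        # Assumption: the center is the mean of the vertices.
        return np.array([self.x.mean(), self.y.mean()])

    def area(self) -> float:
        # Assumption: shoelace formula for a simple (non self-intersecting) polygon.
        twice_area = np.dot(self.x, np.roll(self.y, 1)) - np.dot(self.y, np.roll(self.x, 1))
        return float(0.5 * abs(twice_area))


# A 2x2 axis-aligned square, used only as example data.
square = Polygon(
    x=np.array([0.0, 2.0, 2.0, 0.0]),
    y=np.array([0.0, 0.0, 2.0, 2.0]),
)
print(square.center, square.area())
```

Because `center` is a `cached_property`, it is computed once per instance and then read back as a plain attribute, which matters when a hot loop reads it for every polygon.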
Profiling
We will use py-spy and not cProfile.
We need a benchmark and a baseline.
"Good benchmarking is hard. Having said that, do not stress too much about having a perfect benchmarking setup, particularly when you start optimizing a program." ~ Nicholas Nethercote, in "The Rust Performance Book"
Benchmark
# .. imports ..
NUM_ITER = 10

## Generate some data
polygons, points = poly_match.generate_example()

## Run a few iterations of the logic
t0 = time.perf_counter()
for _ in range(NUM_ITER):
    poly_match.main(polygons, points)
t1 = time.perf_counter()

## Calculate how much time it took.
print(f"Took an avg of {((t1 - t0) / NUM_ITER) * 1000:.2f}ms per iteration")
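The same measurement can be wrapped in a small helper so the baseline and each later version are timed the same way. This is a sketch only: `average_ms` and `workload` are hypothetical names, and `workload` merely stands in for `poly_match.main(polygons, points)`.

```python
import time
from typing import Callable


def average_ms(func: Callable[[], object], num_iter: int = 10) -> float:
    """Return the average wall-clock time of one call to `func`, in milliseconds."""
    t0 = time.perf_counter()
    for _ in range(num_iter):
        func()
    t1 = time.perf_counter()
    return ((t1 - t0) / num_iter) * 1000


def workload() -> int:
    # Hypothetical stand-in for poly_match.main(polygons, points).
    return sum(i * i for i in range(10_000))


avg = average_ms(workload, num_iter=5)
print(f"Took an avg of {avg:.2f}ms per iteration")
```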
Baseline
$ python measure.py
Took an avg of 147.46ms per iteration

Measure first
So, let's find out what is so slow here!
$ py-spy record --native -o profile.svg -- python measure.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.
...
py-spy> Wrote flamegraph data to 'profile.svg'. Samples: 391 Errors: 0
Measure first
This will generate a flamegraph:

Measure first
We’ll focus on find_close_polygons, because everything else is <<
Measure first
So, let's have a look at find_close_polygons:
def find_close_polygons(
    polygon_subset: List[Polygon], point: np.array, max_dist: float
) -> List[Polygon]:
    close_polygons = []
    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)

    return close_polygons
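The plan earlier in the talk is to try improving performance in Python / numpy before reaching for Rust. One possible numpy-only variant stacks all centers into a single array and computes every distance in one call. This is an illustrative sketch, not something the slides show, and the `Polygon` here is a simplified stand-in that only carries a precomputed `center`.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass(eq=False)
class Polygon:
    # Simplified stand-in: only the precomputed center is kept.
    center: np.ndarray


def find_close_polygons(
    polygon_subset: List[Polygon], point: np.ndarray, max_dist: float
) -> List[Polygon]:
    # Loop version, as on the slide: one norm call per polygon.
    close_polygons = []
    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)
    return close_polygons


def find_close_polygons_vectorized(
    polygon_subset: List[Polygon], point: np.ndarray, max_dist: float
) -> List[Polygon]:
    # Hypothetical alternative: a single norm call over an (n, 2) array.
    if not polygon_subset:
        return []
    centers = np.stack([poly.center for poly in polygon_subset])
    dists = np.linalg.norm(centers - point, axis=1)
    return [poly for poly, dist in zip(polygon_subset, dists) if dist < max_dist]


polys = [Polygon(np.array(c)) for c in ([0.0, 0.0], [3.0, 4.0], [10.0, 10.0])]
origin = np.array([0.0, 0.0])
loop_result = find_close_polygons(polys, origin, 5.5)
vec_result = find_close_polygons_vectorized(polys, origin, 5.5)
```

Note that this version still reads `poly.center` once per polygon from Python, so it does not remove the per-object attribute access; whether it helps on real data would have to be measured with the same benchmark.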
Our Rust Crate
pyo3 is a Rust library (a crate) for interoperability between Python and Rust.
A bit like pybind11 in C++.
Used by popular Python packages and tools (cryptography, orjson, …).
Let's create our crate, and get to work:
mkdir poly_match_rs && cd "$_"
pip install maturin
maturin init --bindings pyo3
Our Rust Crate
Starting out, our crate is going to look like this:
use pyo3::prelude::*;

#[pyfunction]
fn find_close_polygons() -> PyResult<()> {
    Ok(())
}

#[pymodule]
fn poly_match_rs(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(find_close_polygons, m)?)?;
    Ok(())
}
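On the Python side, one way to swap in a single function while keeping the existing API is to look up the compiled module and fall back to the pure-Python implementation when it is missing. This is a sketch under assumptions: the talk does not show such a fallback, `load_find_close_polygons` and `_find_close_polygons_py` are hypothetical names, and it only makes sense once the Rust function accepts the same arguments as the Python one (the stub above takes none).

```python
import importlib
from types import SimpleNamespace
from typing import Callable

import numpy as np


def _find_close_polygons_py(polygon_subset, point, max_dist):
    # Pure-Python reference implementation, kept as the fallback.
    return [
        poly
        for poly in polygon_subset
        if np.linalg.norm(poly.center - point) < max_dist
    ]


def load_find_close_polygons() -> Callable:
    # Prefer the compiled extension if it has been built (e.g. via maturin),
    # otherwise keep using the Python version so callers are unaffected.
    try:
        module = importlib.import_module("poly_match_rs")
    except ImportError:
        return _find_close_polygons_py
    return getattr(module, "find_close_polygons", _find_close_polygons_py)


find_close_polygons = load_find_close_polygons()

# Example data using a minimal object with a `center` attribute.
near = SimpleNamespace(center=np.array([1.0, 1.0]))
far = SimpleNamespace(center=np.array([9.0, 9.0]))
fallback_result = _find_close_polygons_py([near, far], np.array([0.0, 0.0]), 2.0)
```

Callers keep importing `find_close_polygons` from the same place, so the rest of the codebase does not need to know which implementation is active.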
Measure twice Running the profiler again generates a new flamegraph:
Measure twice
Pink is mostly overhead (allocating, getattr): 57% of total runtime.
Blue is the actual logic (norm): 9% of total runtime.
(Not to scale.)

Measure twice
Most of the time is spent in getattr and in getting the underlying array using as_array. To improve this, we need to rewrite Polygon in Rust.
Measure twice
A reminder:
@dataclass
class Polygon:
    x: np.array
    y: np.array

    @cached_property
    def center(self) -> np.array: ...

    def area(self) -> float: ...

# .. lots of functions working with lists of `Polygon`s ..
v2 - Rewrite Polygon in Rust
Our struct is going to look like this, which is pretty similar to the original class!
Rust:
use ndarray::Array1;

#[pyclass(subclass)]
struct Polygon {
    x: Array1<f64>,
    y: Array1<f64>,
    center: Array1<f64>,
}

Python:
import numpy as np

@dataclass
class Polygon:
    x: np.array
    y: np.array

    @cached_property
    def center(self) -> np.array: ...
v2 - Rewrite Polygon in Rust
And can be subclassed from Python:
Rust:
use ndarray::Array1;

#[pyclass(subclass)]
struct Polygon {
    x: Array1<f64>,
    y: Array1<f64>,
    center: Array1<f64>,
}

Python:
class Polygon(poly_match_rs.Polygon):
    _area: float = None

    def area(self) -> float: ...
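The subclassing pattern can be tried without building the crate by using a plain Python base class in place of `poly_match_rs.Polygon`. In this sketch `RustPolygonStandIn` is a hypothetical stand-in, and the lazily cached shoelace `area` is an assumption about what the elided body might do.

```python
from typing import Optional

import numpy as np


class RustPolygonStandIn:
    """Hypothetical pure-Python stand-in for poly_match_rs.Polygon."""

    def __init__(self, x, y):
        self.x = np.asarray(x, dtype=np.float64)
        self.y = np.asarray(y, dtype=np.float64)
        # The Rust struct stores the center as a field; mirror that here.
        self.center = np.array([self.x.mean(), self.y.mean()])


class Polygon(RustPolygonStandIn):
    # Python-only state and behavior live in the subclass.
    _area: Optional[float] = None

    def area(self) -> float:
        # Assumption: compute once with the shoelace formula, then reuse.
        if self._area is None:
            twice_area = np.dot(self.x, np.roll(self.y, 1)) - np.dot(self.y, np.roll(self.x, 1))
            self._area = float(0.5 * abs(twice_area))
        return self._area


triangle = Polygon(x=[0.0, 4.0, 0.0], y=[0.0, 0.0, 3.0])
print(triangle.center, triangle.area())
```

The split keeps the data that the hot loop needs (`x`, `y`, `center`) in the base class, while rarely used or frequently changing logic such as `area` stays in Python.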
v2 - Rewrite Polygon in Rust
And use the fact that we have a Rust-based struct to implement our function:
- polygon_subset: Vec<Py<PyAny>>,
+ polygon_subset: Vec<Py<Polygon>>,
v2 - Rewrite Polygon in Rust
And use the fact that we have a Rust-based struct to implement our function:
for poly in polygons {
    let norm = {
        // Read the center directly from the Rust struct; no getattr needed.
        let center = &poly.as_ref(py).borrow().center;

        // Euclidean distance between the center and the point.
        ((center[0] - point[0]).powi(2) + (center[1] - point[1]).powi(2)).sqrt()
    };

    if norm < max_dist {
        close_polygons.push(poly)
    }
}
v2 - Rewrite Polygon in Rust
$ (cd ./poly_match_rs/ && maturin develop --release)
$ python measure.py
Took an avg of 1.71ms per iteration
(Baseline was ~150ms, line-to-line was ~17ms)
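The speedups implied by these timings can be worked out directly. The snippet below uses the measured baseline (147.46ms) and v2 result (1.71ms) from the slides, and treats the approximate "~17ms" line-to-line figure as exactly 17.0 for the sake of the arithmetic; `speedup` is a helper name introduced here.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster `after_ms` is compared to `before_ms`."""
    return before_ms / after_ms


baseline_ms = 147.46     # measured baseline from the slides
line_to_line_ms = 17.0   # slides say "~17ms"; treated as exact here
struct_ms = 1.71         # measured after rewriting Polygon in Rust

print(f"line-to-line vs baseline: {speedup(baseline_ms, line_to_line_ms):.1f}x")
print(f"struct vs line-to-line:   {speedup(line_to_line_ms, struct_ms):.1f}x")
print(f"struct vs baseline:       {speedup(baseline_ms, struct_ms):.1f}x")
```

With these inputs the three ratios come out at roughly 8.7x, 9.9x and 86x.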
Summary
We profiled our Python code using py-spy.
A naive, line-to-line translation of the hottest function resulted in ~10x improvement.
Converting our Python class to a Rust struct resulted in another 10x improvement.
You can find out more at ohadravid.github.io (can we go even faster?)

Takeaways
Rust (with the help of pyo3) unlocks true native performance for everyday Python code, with minimal compromises.
Python is a superb API for researchers, and crafting fast building blocks with Rust is an extremely powerful combination.