5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X

ScyllaDB 124 views 57 slides Jul 01, 2024
Slide 1
Slide 1 of 57
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49
Slide 50
50
Slide 51
51
Slide 52
52
Slide 53
53
Slide 54
54
Slide 55
55
Slide 56
56
Slide 57
57

About This Presentation

Linters are a type of database! They are a collection of lint rules — queries that look for rule violations to report — plus a way to execute those queries over a source code dataset.

This is a case study about using database ideas to build a linter that looks for breaking changes in Rust libra...


Slide Content

Five hours to 7. 7 seconds : How database tricks sped up Rust linting by over 2000x Predrag Gruevski Maintainer , Trustfall & cargo-semver-checks

Hi, I’m Predrag ( he/him) Author of cargo-semver-checks Rust linter Author of Trustfall query engine Firmly believe that everything can be a database

Semver in Rust Semantic versioning = “only major versions can break APIs” Always-on, cannot opt out Easy to violate: breaking changes aren’t always obvious Violations are painful: users’ builds broken, maintainers scrambling to fix

1 in 6 of the top 1000 crates have broken semver at least once*

1 in 6 of the top 1000 crates have broken semver at least once* * not the maintainers’ fault; none of us would have fared any better

cargo-semver-checks “clippy for semver” — lint for semver violations before publishing Use as: cargo semver-checks && cargo publish Going to become part of cargo itself: https://github.com/obi1kenobi/cargo-semver-checks/issues/61

Make it work, then make it fast MVP version was useful, but not the fastest Runtime scaled as O(n^2) with library size Most libraries are small, so no perf problem Medium-sized libraries are at higher risk, linting is worth a slight slowdown Lots of adoption — libp2p, cargo_metadata, tokio, … Adding ~1 min to the CI publish job is no problem, right?

Sweet spot of bad scaling

a ws-sdk-ec2 tries linting semver… $ cd aws-sdk-ec2 $ cargo semver-checks Checking aws-sdk-ec2 v0.24.0 -> v0.24.0 (no change) Starting 40 checks, 0 unnecessary

aws-sdk-ec2 tries linting semver… $ cd aws-sdk-ec2 $ cargo semver-checks Checking aws-sdk-ec2 v0.24.0 -> v0.24.0 (no change) Starting 40 checks, 0 unnecessary

aws-sdk-ec2 tries linting semver… $ cd aws-sdk-ec2 $ cargo semver-checks Checking aws-sdk-ec2 v0.24.0 -> v0.24.0 (no change) Starting 40 checks, 0 unnecessary Completed [18178.109s] 40 checks; 40 passed, 0 unnecessary

aws-sdk-ec2 tries linting semver… $ cd aws-sdk-ec2 $ cargo semver-checks Checking aws-sdk-ec2 v0.24.0 -> v0.24.0 (no change) Starting 40 checks, 0 unnecessary Completed [18178.109s] 40 checks; 40 passed, 0 unnecessary

Let’s make that 2354x faster $ cd aws-sdk-ec2 $ cargo semver-checks Checking aws-sdk-ec2 v0.24.0 -> v0.24.0 (no change) Starting 40 checks, 0 unnecessary Completed [ 7.722s] 40 checks; 40 passed, 0 unnecessary

Overview Semver linting is a special kind of querying With the right tools, anything can be a database Query optimization — as an API Takeaways

Overview Semver linting is a special kind of querying With the right tools, anything can be a database Query optimization — as an API Takeaways

Example: pub fn removed from API https://github.com/obi1kenobi/semver-examples/compare/main...easy_01

$ cargo semver-checks Parsing easy_01 v0.1.0 (current) Parsing easy_01 v0.1.0 (baseline) Checking easy_01 v0.1.0 -> v0.1.0 (no change) Completed [ 0.011s] 45 checks; 44 passed, 1 failed, 0 unnecessary --- failure function_missing: pub fn removed or renamed --- Description: A publicly-visible function cannot be imported by its prior path. A `pub use` may have been removed, or the function itself may have been renamed or removed entirely. ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.22.1/src/lints/ function_missing.ron Failed in: function easy_01::add, previously in file semver-examples/easy_01/old/src/lib.rs:1 Final [ 0.012s] semver requires new major version: 1 major and 0 minor checks failed

A special kind of querying Query: Find public functions that existed in the previous version, b ut don’t exist in the new version.

A special kind of querying Query: Find public functions that existed in the previous version, but don’t exist in the new version. Each result of this query is a semver violation! Declarative — business logic separate from implementation We can optimize implementation without affecting business logic! Dozens of queries like this — one per way to break semver

Naive execution — O(n^2) Check N functions against N functions, then filter for matching path.

Naive execution — O(n^2) Check N functions against N functions, then filter for matching path. O(N) O(N)

P redicate pushdown — O(n) execution For each of N functions, look up matching one in O(1). O(N) O(1)

Predicate pushdown — O(n) execution For each of N functions, look up matching one in O(1). This is the goal! O(N) O(1)

Overview Semver linting is a special kind of querying With the right tools, anything can be a database Query optimization — as an API Takeaways

Our data source Use rustdoc to get JSON description of APIs Wrinkle: different Rust versions have different JSON formats Reimplementing queries for each new version is prohibitive! Ideally: queries over abstract data model that doesn’t change

Semver requires “s eeing like a compiler” We’re solving for “could the new code possibly work” For example: x.foo() might call method on x itself or on any of its traits Moving foo() from x itself to one of its traits is not breaking Requires compiler-like logic: n ame resolution, trait resolution, etc.

SQL isn’t a good fit SQL views and PL/SQL not suitable for compiler-like logic Would need complex & costly ETL for each different JSON format ETL runs all work eagerly — computes stuff we might not use!

Using the Trustfall query engine Engine with “foreign data wrappers” as a first-class feature Query any combination of data sources Just need an adapter for the data source Compiler-like logic goes in the adapter Battle-tested ideas: 7+ years in production Engine built in Rust; adapters can be Rust / Python / JS / WASM

Using the Trustfall query engine

What does this look like in practice?

Overview Semver linting is a special kind of querying With the right tools, anything can be a database Query optimization — as an API Takeaways

Example Trustfall queries “Look up all functions in the crate.”

{ Crate { item { // all items in the crate: functions, structs, enums, etc. ... on Function { // select only functions, discard all else // get the paths under which this function is importable // (there might be zero, one, or more than one!) importable_path { path @ output } // (optional) more query clauses } } } }

{ Crate { item { // all items in the crate: functions, structs, enums, etc. ... on Function { // select only functions, discard all else // get the paths under which this function is importable // (there might be zero, one, or more than one!) importable_path { path @ output } // (optional) more query clauses } } } } Must iterate through all functions, n o opportunity to apply index.

// Lightly simplified code fn resolve_crate_items <' a >( contexts : ContextIterator <' a , Vertex <' a >>, // iterator of crates whose items we resolve hints : & ResolveEdgeInfo , // hints about how data is getting used ) -> ContextOutcomeIterator <' a , Vertex <' a >, ... > { resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_slow_path ( crate_vertex ) // return all items }) }

// Lightly simplified code fn resolve_crate_items <' a >( contexts : ContextIterator <' a , Vertex <' a >>, // iterator of crates whose items we resolve hints : & ResolveEdgeInfo , // hints about how data is getting used ) -> ContextOutcomeIterator <' a , Vertex <' a >, ... > { resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_slow_path ( crate_vertex ) // return all items }) } Including non-functions! Trustfall handles further filtering, like selecting functions only.

Example Trustfall queries “Look up all functions in the crate.” “Look up the function at a specific import path.”

s pecific function

{ Crate { item { // all items in the crate: functions, structs, enums, etc. ... on Function { // select only functions, discard all else // get the paths under which this function is importable // (there might be zero, one, or more than one!) importable_path { path @ filter ( op : "=" , value : [ "$path" ]) @ output } // (optional) more query clauses } } } }

{ Crate { item { // all items in the crate: functions, structs, enums, etc. ... on Function { // select only functions, discard all else // get the paths under which this function is importable // (there might be zero, one, or more than one!) importable_path { path @ filter ( op : "=" , value : [ "$path" ]) @ output } // (optional) more query clauses } } } } Query variable, statically known value. Use index lookup!

// Lightly simplified code fn resolve_crate_items <' a >(...) -> _ { // No luck on finding a fast path. Go through all items the slow way. resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_slow_path ( crate_vertex ) }) }

// Lightly simplified code fn resolve_crate_items <' a >(...) -> _ { // Is the `importable_path` edge being resolved in a subsequent step? if let Some ( neighbor_info ) = hints . first_edge ( "importable_path" ) { // Is the `path` value statically known? If so, apply the index! if let Some ( candidate_value ) = neighbor_info . statically_required_property ( "path" ) { return resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_by_importable_path ( crate_vertex , candidate_value . clone ()) }); } } // No luck on finding a fast path. Go through all items the slow way. resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_slow_path ( crate_vertex ) }) }

// Lightly simplified code fn resolve_crate_items <' a >(...) -> _ { // Is the `importable_path` edge being resolved in a subsequent step? if let Some ( neighbor_info ) = hints . first_edge ( "importable_path" ) { // Is the `path` value statically known? If so, apply the index! if let Some ( candidate_value ) = neighbor_info . statically_required_property ( "path" ) { return resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_by_importable_path ( crate_vertex , candidate_value . clone ()) }); } } // No luck on finding a fast path. Go through all items the slow way. resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_slow_path ( crate_vertex ) }) } O(1) O(n)

Example Trustfall queries “Look up all functions in the crate.” “Look up the function at a specific import path.” “Look up the function at the same import path as this other function.” ( i.e. our semver lint)

This is not a single fixed value, it’s different for each function we’re checking. Its value is only known dynamically .

// Lightly simplified code fn resolve_crate_items <' a >(...) -> _ { // Is the `importable_path` edge being resolved in a subsequent step? if let Some ( neighbor_info ) = hints . first_edge ( "importable_path" ) { // Is the `path` value statically or dynamically known? If so, apply the index! if let Some ( candidate_value ) = neighbor_info . statically_required_property ( "path" ) { return resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_by_importable_path ( crate_vertex , candidate_value . clone ()) }); } // No luck on finding a fast path. Go through all items the slow way. // < prior slow path here > }

// Lightly simplified code fn resolve_crate_items <' a >(...) -> _ { // Is the `importable_path` edge being resolved in a subsequent step? if let Some ( neighbor_info ) = hints . first_edge ( "importable_path" ) { // Is the `path` value statically or dynamically known? If so, apply the index! if let Some ( candidate_value ) = neighbor_info . statically_required_property ( "path" ) { return resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_by_importable_path ( crate_vertex , candidate_value . clone ()) }); } else if let Some ( dynamic ) = neighbor_info . dynamically_required_property ( "path" ) { return dynamic . resolve_with ( contexts , | crate_vertex , candidate_value | { resolve_items_by_importable_path ( crate_vertex , candidate_value ) }); } } // No luck on finding a fast path. Go through all items the slow way. // < prior slow path here > }

// Lightly simplified code fn resolve_crate_items <' a >(...) -> _ { // Is the `importable_path` edge being resolved in a subsequent step? if let Some ( neighbor_info ) = hints . first_edge ( "importable_path" ) { // Is the `path` value statically or dynamically known? If so, apply the index! if let Some ( candidate_value ) = neighbor_info . statically_required_property ( "path" ) { return resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_by_importable_path ( crate_vertex , candidate_value . clone ()) }); } else if let Some ( dynamic ) = neighbor_info . dynamically_required_property ( "path" ) { return dynamic . resolve_with ( contexts , | crate_vertex , candidate_value | { resolve_items_by_importable_path ( crate_vertex , candidate_value ) }); } } // No luck on finding a fast path. Go through all items the slow way. // < prior slow path here > } O(1) O(n) O(1)

Predicate pushdown — O(n) execution For each of N functions, look up matching one in O(1). This is the goal! O(N) O(1)

Predrag Gruevski @PredragGruevski / @[email protected] https://predr.ag/blog Happy querying!
Tags