5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X
ScyllaDB
124 views
57 slides
Jul 01, 2024
Slide 1 of 57
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
About This Presentation
Linters are a type of database! They are a collection of lint rules — queries that look for rule violations to report — plus a way to execute those queries over a source code dataset.
This is a case study about using database ideas to build a linter that looks for breaking changes in Rust libra...
Linters are a type of database! They are a collection of lint rules — queries that look for rule violations to report — plus a way to execute those queries over a source code dataset.
This is a case study about using database ideas to build a linter that looks for breaking changes in Rust library APIs. Maintainability and performance are key: new Rust releases tend to have mutually-incompatible ways of representing API information, and we cannot afford to reimplement and optimize dozens of rules for each Rust version separately. Fortunately, databases don't require rewriting queries when the underlying storage format or query plan changes! This allows us to ship massive optimizations and support multiple Rust versions without making any changes to the queries that describe lint rules.
Ship now, optimize later"" can be a sustainable development practice after all — join us to see how!
Size: 3.31 MB
Language: en
Added: Jul 01, 2024
Slides: 57 pages
Slide Content
Five hours to 7. 7 seconds : How database tricks sped up Rust linting by over 2000x Predrag Gruevski Maintainer , Trustfall & cargo-semver-checks
Hi, I’m Predrag ( he/him) Author of cargo-semver-checks Rust linter Author of Trustfall query engine Firmly believe that everything can be a database
Semver in Rust Semantic versioning = “only major versions can break APIs” Always-on, cannot opt out Easy to violate: breaking changes aren’t always obvious Violations are painful: users’ builds broken, maintainers scrambling to fix
1 in 6 of the top 1000 crates have broken semver at least once*
1 in 6 of the top 1000 crates have broken semver at least once* * not the maintainers’ fault; none of us would have fared any better
cargo-semver-checks “clippy for semver” — lint for semver violations before publishing Use as: cargo semver-checks && cargo publish Going to become part of cargo itself: https://github.com/obi1kenobi/cargo-semver-checks/issues/61
Make it work, then make it fast MVP version was useful, but not the fastest Runtime scaled as O(n^2) with library size Most libraries are small, so no perf problem Medium-sized libraries are at higher risk, linting is worth a slight slowdown Lots of adoption — libp2p, cargo_metadata, tokio, … Adding ~1 min to the CI publish job is no problem, right?
Sweet spot of bad scaling
a ws-sdk-ec2 tries linting semver… $ cd aws-sdk-ec2 $ cargo semver-checks Checking aws-sdk-ec2 v0.24.0 -> v0.24.0 (no change) Starting 40 checks, 0 unnecessary
Let’s make that 2354x faster $ cd aws-sdk-ec2 $ cargo semver-checks Checking aws-sdk-ec2 v0.24.0 -> v0.24.0 (no change) Starting 40 checks, 0 unnecessary Completed [ 7.722s] 40 checks; 40 passed, 0 unnecessary
Overview Semver linting is a special kind of querying With the right tools, anything can be a database Query optimization — as an API Takeaways
Overview Semver linting is a special kind of querying With the right tools, anything can be a database Query optimization — as an API Takeaways
Example: pub fn removed from API https://github.com/obi1kenobi/semver-examples/compare/main...easy_01
$ cargo semver-checks Parsing easy_01 v0.1.0 (current) Parsing easy_01 v0.1.0 (baseline) Checking easy_01 v0.1.0 -> v0.1.0 (no change) Completed [ 0.011s] 45 checks; 44 passed, 1 failed, 0 unnecessary --- failure function_missing: pub fn removed or renamed --- Description: A publicly-visible function cannot be imported by its prior path. A `pub use` may have been removed, or the function itself may have been renamed or removed entirely. ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.22.1/src/lints/ function_missing.ron Failed in: function easy_01::add, previously in file semver-examples/easy_01/old/src/lib.rs:1 Final [ 0.012s] semver requires new major version: 1 major and 0 minor checks failed
A special kind of querying Query: Find public functions that existed in the previous version, b ut don’t exist in the new version.
A special kind of querying Query: Find public functions that existed in the previous version, but don’t exist in the new version. Each result of this query is a semver violation! Declarative — business logic separate from implementation We can optimize implementation without affecting business logic! Dozens of queries like this — one per way to break semver
Naive execution — O(n^2) Check N functions against N functions, then filter for matching path.
Naive execution — O(n^2) Check N functions against N functions, then filter for matching path. O(N) O(N)
P redicate pushdown — O(n) execution For each of N functions, look up matching one in O(1). O(N) O(1)
Predicate pushdown — O(n) execution For each of N functions, look up matching one in O(1). This is the goal! O(N) O(1)
Overview Semver linting is a special kind of querying With the right tools, anything can be a database Query optimization — as an API Takeaways
Our data source Use rustdoc to get JSON description of APIs Wrinkle: different Rust versions have different JSON formats Reimplementing queries for each new version is prohibitive! Ideally: queries over abstract data model that doesn’t change
Semver requires “s eeing like a compiler” We’re solving for “could the new code possibly work” For example: x.foo() might call method on x itself or on any of its traits Moving foo() from x itself to one of its traits is not breaking Requires compiler-like logic: n ame resolution, trait resolution, etc.
SQL isn’t a good fit SQL views and PL/SQL not suitable for compiler-like logic Would need complex & costly ETL for each different JSON format ETL runs all work eagerly — computes stuff we might not use!
Using the Trustfall query engine Engine with “foreign data wrappers” as a first-class feature Query any combination of data sources Just need an adapter for the data source Compiler-like logic goes in the adapter Battle-tested ideas: 7+ years in production Engine built in Rust; adapters can be Rust / Python / JS / WASM
Using the Trustfall query engine
What does this look like in practice?
Overview Semver linting is a special kind of querying With the right tools, anything can be a database Query optimization — as an API Takeaways
Example Trustfall queries “Look up all functions in the crate.”
{ Crate { item { // all items in the crate: functions, structs, enums, etc. ... on Function { // select only functions, discard all else // get the paths under which this function is importable // (there might be zero, one, or more than one!) importable_path { path @ output } // (optional) more query clauses } } } }
{ Crate { item { // all items in the crate: functions, structs, enums, etc. ... on Function { // select only functions, discard all else // get the paths under which this function is importable // (there might be zero, one, or more than one!) importable_path { path @ output } // (optional) more query clauses } } } } Must iterate through all functions, n o opportunity to apply index.
// Lightly simplified code fn resolve_crate_items <' a >( contexts : ContextIterator <' a , Vertex <' a >>, // iterator of crates whose items we resolve hints : & ResolveEdgeInfo , // hints about how data is getting used ) -> ContextOutcomeIterator <' a , Vertex <' a >, ... > { resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_slow_path ( crate_vertex ) // return all items }) }
// Lightly simplified code fn resolve_crate_items <' a >( contexts : ContextIterator <' a , Vertex <' a >>, // iterator of crates whose items we resolve hints : & ResolveEdgeInfo , // hints about how data is getting used ) -> ContextOutcomeIterator <' a , Vertex <' a >, ... > { resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_slow_path ( crate_vertex ) // return all items }) } Including non-functions! Trustfall handles further filtering, like selecting functions only.
Example Trustfall queries “Look up all functions in the crate.” “Look up the function at a specific import path.”
s pecific function
{ Crate { item { // all items in the crate: functions, structs, enums, etc. ... on Function { // select only functions, discard all else // get the paths under which this function is importable // (there might be zero, one, or more than one!) importable_path { path @ filter ( op : "=" , value : [ "$path" ]) @ output } // (optional) more query clauses } } } }
{ Crate { item { // all items in the crate: functions, structs, enums, etc. ... on Function { // select only functions, discard all else // get the paths under which this function is importable // (there might be zero, one, or more than one!) importable_path { path @ filter ( op : "=" , value : [ "$path" ]) @ output } // (optional) more query clauses } } } } Query variable, statically known value. Use index lookup!
// Lightly simplified code fn resolve_crate_items <' a >(...) -> _ { // No luck on finding a fast path. Go through all items the slow way. resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_slow_path ( crate_vertex ) }) }
// Lightly simplified code fn resolve_crate_items <' a >(...) -> _ { // Is the `importable_path` edge being resolved in a subsequent step? if let Some ( neighbor_info ) = hints . first_edge ( "importable_path" ) { // Is the `path` value statically known? If so, apply the index! if let Some ( candidate_value ) = neighbor_info . statically_required_property ( "path" ) { return resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_by_importable_path ( crate_vertex , candidate_value . clone ()) }); } } // No luck on finding a fast path. Go through all items the slow way. resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_slow_path ( crate_vertex ) }) }
// Lightly simplified code fn resolve_crate_items <' a >(...) -> _ { // Is the `importable_path` edge being resolved in a subsequent step? if let Some ( neighbor_info ) = hints . first_edge ( "importable_path" ) { // Is the `path` value statically known? If so, apply the index! if let Some ( candidate_value ) = neighbor_info . statically_required_property ( "path" ) { return resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_by_importable_path ( crate_vertex , candidate_value . clone ()) }); } } // No luck on finding a fast path. Go through all items the slow way. resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_slow_path ( crate_vertex ) }) } O(1) O(n)
Example Trustfall queries “Look up all functions in the crate.” “Look up the function at a specific import path.” “Look up the function at the same import path as this other function.” ( i.e. our semver lint)
This is not a single fixed value, it’s different for each function we’re checking. Its value is only known dynamically .
// Lightly simplified code fn resolve_crate_items <' a >(...) -> _ { // Is the `importable_path` edge being resolved in a subsequent step? if let Some ( neighbor_info ) = hints . first_edge ( "importable_path" ) { // Is the `path` value statically or dynamically known? If so, apply the index! if let Some ( candidate_value ) = neighbor_info . statically_required_property ( "path" ) { return resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_by_importable_path ( crate_vertex , candidate_value . clone ()) }); } // No luck on finding a fast path. Go through all items the slow way. // < prior slow path here > }
// Lightly simplified code fn resolve_crate_items <' a >(...) -> _ { // Is the `importable_path` edge being resolved in a subsequent step? if let Some ( neighbor_info ) = hints . first_edge ( "importable_path" ) { // Is the `path` value statically or dynamically known? If so, apply the index! if let Some ( candidate_value ) = neighbor_info . statically_required_property ( "path" ) { return resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_by_importable_path ( crate_vertex , candidate_value . clone ()) }); } else if let Some ( dynamic ) = neighbor_info . dynamically_required_property ( "path" ) { return dynamic . resolve_with ( contexts , | crate_vertex , candidate_value | { resolve_items_by_importable_path ( crate_vertex , candidate_value ) }); } } // No luck on finding a fast path. Go through all items the slow way. // < prior slow path here > }
// Lightly simplified code fn resolve_crate_items <' a >(...) -> _ { // Is the `importable_path` edge being resolved in a subsequent step? if let Some ( neighbor_info ) = hints . first_edge ( "importable_path" ) { // Is the `path` value statically or dynamically known? If so, apply the index! if let Some ( candidate_value ) = neighbor_info . statically_required_property ( "path" ) { return resolve_neighbors_with ( contexts , | crate_vertex | { resolve_items_by_importable_path ( crate_vertex , candidate_value . clone ()) }); } else if let Some ( dynamic ) = neighbor_info . dynamically_required_property ( "path" ) { return dynamic . resolve_with ( contexts , | crate_vertex , candidate_value | { resolve_items_by_importable_path ( crate_vertex , candidate_value ) }); } } // No luck on finding a fast path. Go through all items the slow way. // < prior slow path here > } O(1) O(n) O(1)
Predicate pushdown — O(n) execution For each of N functions, look up matching one in O(1). This is the goal! O(N) O(1)