maximechevalierboisv1
72 slides
May 27, 2024
About This Presentation
With each of the past 3 Ruby releases, YJIT has delivered higher and higher performance. However, we are seeing diminishing returns, because as JIT-compiled code becomes faster, it makes up less and less of the total execution time, which is now becoming dominated by C function calls. As such, it may appear like there is a fundamental limit to Ruby’s performance.
In the first half of the 20th century, some early airplane designers thought that the speed of sound was a fundamental limit on the speed reachable by airplanes, thus coining the term “sound barrier”. This limit was eventually overcome, as it became understood that airflow behaves differently at supersonic speeds.
In order to break the Ruby performance barrier, it will be necessary to reduce the dependency on C extensions, and start writing more gems in pure Ruby code. In this talk, I want to look at this problem more in depth, and explore how YJIT can help enable writing pure-Ruby software that delivers high performance levels.
Size: 5.8 MB
Language: en
Added: May 27, 2024
Slides: 72 pages
Slide Content
Maxime Chevalier-Boisvert @ RubyKaigi 2024
Breaking the Ruby Performance Barrier
●My name is Maxime Chevalier-Boisvert
●Obtained a PhD in compiler design in 2016
○JIT techniques for dynamically-typed languages
●Joined Shopify in 2020
○(not the same company as Spotify)
●Tech lead of the YJIT project
Hello, World!
About YJIT
●Project started in 2020
○New JIT compiler built inside CRuby
○Goal: make Ruby seamlessly faster
●Built at Shopify, but fully open source
○Significant contributions from GitHub folks
●Data-driven approach to optimization
○Large, diverse set of benchmarks
○Benchmark often
○Gather detailed metrics
YJIT is a Team Effort
This project would not be possible without my amazing colleagues at Shopify and GitHub!
Alan Wu
@alanwusx
Aaron Patterson
@tenderlove
Maxime
@Love2Code
Noah Gibbs
@codefolio
John Hawthorn
@jhawthorn
Eileen Uchitelle
@eileencodes
Jean Boussier
@byroot
Kevin Newton
@kddnewton
Takashi Kokubun
@k0kubun
Jemma Issroff
@JemmaIssroff
Jimmy Miller
@jimmyhmiller
Adam Hess
@theHessParker
Kevin Menard
@nirvdrum
Randy Stauner
@rwstauner
Program
●Quick YJIT news updates
●History of supersonic flight
●Making Ruby code faster: the traditional way
●The need for a new approach
●Protoboeuf: a pure-Ruby protobuf
●Next steps for YJIT
YJIT News
Ruby 3.3 Includes the Third Release of YJIT
●YJIT first included as part of Ruby 3.1
●Marked performance improvements with 3.2, 3.3
●YJIT 3.3 goal: better tuned for production deployments
○With 3.1 and 3.2, some people complained that YJIT was not faster
○Had to tell people to adjust various command-line options
○3.3: use less memory, ship with better default options!
●Big end of year push to find and fix bugs
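Since earlier releases needed command-line tuning, it can help to confirm that YJIT is actually active in a process. A minimal sketch: `RubyVM::YJIT.enabled?` is part of CRuby builds that include YJIT, while the helper name `yjit_enabled?` is our own and not something from the talk.

```ruby
# Returns true when this process is running with YJIT enabled.
# The defined? guard keeps the check safe on Rubies built without YJIT
# and on other Ruby implementations.
def yjit_enabled?
  !!(defined?(RubyVM::YJIT) && RubyVM::YJIT.enabled?)
end

puts "YJIT enabled: #{yjit_enabled?}"
```

Running the script with `ruby --yjit` should report `true` on a YJIT-capable build.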
Another Big Deployment…
●YJIT already deployed on all Shopify stores, at Discourse
○Deployed at Shopify since December 2022
●Earlier this year, GitHub quietly deployed YJIT 3.3
○Deployment went smoothly
○Reported a ~15% improvement in response time
○Closer to 20% speedup for some endpoints
●If you’ve visited github.com today…
ismyhostfastyet.com - Shopify at #1, excellent p75 time
Supersonic Flight
1903: First successful powered flight by the Wright brothers, ~48 km/h
1914: S.E.4 flown in Great Britain, top speed 217 km/h
1940: P-51 Mustang, top speed ~690 km/h
1944: Messerschmitt Me 262, ~850-1000 km/h
1944: Gloster Meteor, ~750-950 km/h
Flying Faster
●WWII: competitive pressure to build faster aircraft
○Faster fighters are better at dog fights
○Faster bombers are harder for fighters to hit
●Propellers lose efficiency as they approach the speed of sound
○Jet engines remove that limitation
●Fighter pilots could accidentally break the speed of sound while diving
○But when doing so, strange things would happen
●Airflow behaves differently at supersonic speeds
○Aircraft designed for subsonic speeds tend to shake violently
○Control surfaces become ineffective, or even reversed!
○Easy for pilots to lose control of the aircraft
●Drag rises sharply as we approach the speed of sound
○Need several times more thrust to punch through transonic region
○Difficult to break sound barrier merely because of required power
●Supersonic speeds put high levels of stress on aircraft
○Aerodynamic heating causes thermal expansion, weakens materials
The Sound Barrier
Rethinking Aircraft Design
●For supersonic flight, we are operating in a different regime
○Need to rethink aircraft design as a consequence
●Need several times greater thrust to pass through transonic region
○The first plane to break the sound barrier in level flight used a rocket engine
●Airflow behaves differently at supersonic speeds
○Smaller, swept back wings help reduce aerodynamic drag
●Need to deal with problems such as heating due to air compression
○The SR-71 is mostly made of titanium
Bell X-1 rocket-propelled experimental aircraft (1947)
Chuck Yeager, test pilot, next to the Bell X-1 “Glamorous Glennis”
Modified B-29 Superfortress dropping the Bell X-1
General Chuck Yeager passed away in 2020. He lived to be 97.
Boeing 787 Dreamliner (first flight 2009)
Douglas DC-9 (first flight 1965)
Lockheed SR-71 Blackbird (first flight 1964), top speed ~3,540 km/h
???
This is RubyKaigi, your talk is supposed to be about Ruby…
Ok, but first, let me speak a bit about Python…
Python: The Old Way of Thinking
●Assumptions about performance:
○Python is a slow, interpreted language
○C is a fast, compiled language
●You are never allowed to complain about Python being too slow
●If your program is too slow, just rewrite the slow parts in C
○Push the performance-sensitive logic into C extensions
○This is how Python performance issues are addressed
●Python for expressiveness, C/C++/Go/Rust for speed
… Ruby has the same problem …
A New Regime
YJIT: A New Approach
●YJIT is the engine pushing Ruby closer to the speed of C
●As Ruby gets faster, we are entering a new regime
●The cost of calling C functions, translating data back and forth, becomes relatively more and more expensive (increases drag)
●YJIT can’t optimize C code (black box), so this code gets in our way
●The complexity of maintaining C extensions becomes less appealing
Ruby’s Future
●Approaching a limit when it comes to optimizing Ruby performance
○We can’t make C code any faster
○Calls to/from C are relatively slow
●How can we reach Ruby’s maximum performance potential?
●Can we write more Ruby gems in pure-Ruby code?
●Can Ruby do the things C does effectively?
Why do people write C extensions?
1.To interface with external I/O APIs or system calls
e.g. GTK, ALSA, OpenGL, Vulkan, etc.
2.To improve performance, alleviate performance bottlenecks
e.g. number crunching, scientific computing, expensive algorithm
3.To interface with a specific native library
e.g. libmysqlclient, redis, protobuf, libyaml
(many of the libraries we use deal with parsing/serialization)
Pure-Ruby Gems
redis-client
●The redis-client gem has two drivers:
○hiredis (native C library binding)
○A pure-Ruby implementation
●Aaron Patterson and Jean Boussier improved the Ruby driver
○Found that with YJIT enabled, the new Ruby driver is about on par with the native C extension.
In the same ballpark.
●The Ruby driver could still be improved
○The C Ruby API allows pre-allocating hashes with the right size
■This capability is currently missing in Ruby
○Pure-Ruby driver could perform even better
GraphQL / TinyGQL
●GraphQL is used at Shopify and broadly across our industry
●The default GraphQL parser relies on racc
○Inefficient system where a native C library repeatedly calls into a Ruby tokenizer
○Lots of repeated Ruby-to-C calls (slow)
●Aaron Patterson wrote a pure-Ruby GraphQL parser
○With YJIT enabled, his parser outperforms the native C extension
●Blog post at:
○https://railsatscale.com/2023-08-29-ruby-outperforms-c/
Protobuf & Protoboeuf
●Protobuf is a binary data serialization protocol
●Protobuf is used at Shopify
○Serialization/deserialization of various data
○It’s used heavily in OpenTelemetry profiles
○We’re adopting Twirp as our internal communication protocol
●Protobuf is an annoying dependency to manage
○Common source of problems for Ruby upgrades
○At the top of allocation profiles
○Common source of memory leaks and crashes
●Could we build a pure-Ruby implementation that performs just as well as Google’s native implementation?
Protobuf
Protoboeuf
●Prototype pure-Ruby implementation of protobuf
○Aims to be compatible with the proto3 spec
○Still a prototype/experiment at this stage
●Many attractive upsides:
○Easy to debug. Easy to maintain. Easy to contribute.
○Highly portable (anywhere Ruby runs, it will work!)
○No binary/systems package dependency
○Not tied to a specific Ruby ABI version
○Generates self-contained code, no runtime dependency
AI-generated image (any resemblance to Shia Labeouf is purely coincidental).
What does Protoboeuf do?
●TinyGQL, redis-client, protoboeuf deal with parsing and serialization
○Work with strings and streams of bytes
○Low-level bit manipulation
○Allocate objects in the Ruby heap
●How does protoboeuf work?
○Parses proto3 specification (.proto files)
○Generates self-contained Ruby code to encode/decode protobuf streams
○Leverages YJIT for performance
●How fast is protoboeuf?
ruby 3.4.0dev (2024-05-03T18:37:19Z master 7a49edcf1f) [x86_64-linux]
Warming up --------------------------------------
decode and read upstream        1.000 i/100ms
decode and read protoboeuf      1.000 i/100ms
Calculating -------------------------------------
decode and read upstream        2.685 (± 0.0%) i/s - 14.000 in 5.218228s
decode and read protoboeuf      6.337 (± 0.0%) i/s - 32.000 in 5.050240s
Comparison:
decode and read upstream:       2.7 i/s
decode and read protoboeuf:     6.3 i/s - 2.36x faster
ruby 3.4.0dev (2024-05-03T18:37:19Z master 7a49edcf1f) +YJIT [x86_64-linux]
Warming up --------------------------------------
decode and read upstream        1.000 i/100ms
decode and read protoboeuf      2.000 i/100ms
Calculating -------------------------------------
decode and read upstream        2.799 (± 0.0%) i/s - 14.000 in 5.008288s
decode and read protoboeuf      26.453 (± 0.0%) i/s - 134.000 in 5.066900s
Comparison:
decode and read upstream:       2.8 i/s
decode and read protoboeuf:     26.5 i/s - 9.45x faster
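The "x faster" figures in the benchmark output are simply the ratio of iterations per second. A rough sketch of how such a number can be measured and compared using only the standard library; the method names `iterations_per_second` and `speedup` are hypothetical, and this is not the harness used for the talk.

```ruby
# Measure roughly how many times per second a block can run.
# Uses the monotonic clock so wall-clock adjustments do not skew results.
def iterations_per_second(min_seconds: 1.0)
  count = 0
  elapsed = 0.0
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  while elapsed < min_seconds
    yield
    count += 1
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  end
  count / elapsed
end

# Speedup of a candidate over a baseline, both given in iterations/second.
def speedup(baseline_ips, candidate_ips)
  candidate_ips / baseline_ips
end

# Figures copied from the benchmark output in the slides.
puts format("without YJIT: %.2fx", speedup(2.685, 6.337))
puts format("with YJIT:    %.2fx", speedup(2.799, 26.453))
```

A block under test would be passed as `iterations_per_second { decode(payload) }`, where `decode` and `payload` stand in for whatever is being measured.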
Caveats
●Our pure-Ruby protobuf is still an experiment / proof of concept
●It’s much faster than Google’s at decoding, but slower at encoding
○We didn’t spend any time optimizing the encoding
●We’re thinking of using it internally, but it comes without support
●We built protoboeuf in part so we could see what it would take
○How fast could we make this?
○What kinds of optimizations could we do in YJIT?
○How hard would it be to make a faster pure-Ruby protobuf?
○The answer is it really wasn’t that difficult!
How did we do it?
●Attack the problem from two sides at once:
○Write Ruby code that we know YJIT will optimize well
○Add optimizations to YJIT to better optimize low-level Ruby code
●Is optimizing YJIT for protoboeuf cheating?
○No. We simply optimized core methods such as String#getbyte and String#setbyte
○Locate performance bottlenecks and optimize them
○This work benefits a lot of existing Ruby code
●Is optimizing Ruby code for YJIT cheating?
●Protoboeuf decoding is still faster than Google’s protobuf without YJIT!
+14.5% faster
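To give a sense of the low-level code that benefits from faster `String#getbyte`, here is a minimal sketch of protobuf varint decoding and encoding in pure Ruby. It follows the protobuf wire format (7 payload bits per byte, least significant group first, high bit set when more bytes follow). The method names are ours, and this is not Protoboeuf's generated code.

```ruby
# Decode one protobuf varint from `buf` starting at byte offset `pos`.
# Returns [value, next_pos].
def decode_varint(buf, pos)
  value = 0
  shift = 0
  while true
    byte = buf.getbyte(pos)
    raise ArgumentError, "truncated varint" if byte.nil?
    pos += 1
    value |= (byte & 0x7F) << shift   # low 7 bits carry the payload
    return [value, pos] if byte < 0x80 # high bit clear: last byte
    shift += 7
  end
end

# Append `value` (a non-negative Integer) to the binary string `buf`
# as a varint. Returns `buf`.
def encode_varint(buf, value)
  while value >= 0x80
    buf << ((value & 0x7F) | 0x80) # emit 7 bits, flag that more follow
    value >>= 7
  end
  buf << value
end

buf = encode_varint(String.new(encoding: Encoding::BINARY), 150)
p buf.bytes             # => [150, 1]
p decode_varint(buf, 0) # => [150, 2]
```

Note that the decoder works with integers and byte offsets only, so the hot loop does not allocate intermediate strings.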
YJIT 3.4 & Protoboeuf
●#9763 YJIT: add specialized codegen for fixnum XOR
●#9767 YJIT: add codegen for String#setbyte
●#10188 YJIT: String#getbyte codegen
●#10323 YJIT: Propagate Array, Hash, and String classes
●#10401 YJIT: Optimize putobject+opt_ltlt for integers
●#10487 YJIT: Optimize local variables when EP == BP
●… And more …
●None of these are purely specific to protoboeuf!
Optimizing Ruby Code for YJIT
●The code we generate for protoboeuf is quirky/convoluted
●We manually inlined functions, manually unrolled loops
●We avoided using local variables when possible
○YJIT 3.3 doesn’t do register allocation for locals
●We used self for field accesses
○We can generate more efficient code this way
●Wanted to reach maximum performance possible with YJIT
○This is OK because it’s only generated code
○This code won’t get slower with future Ruby releases
●We don’t recommend you do this
○Many tricks we used (e.g. avoiding locals) will become unnecessary
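As an illustration of what manual loop unrolling looks like, here is a hypothetical reader for a little-endian 32-bit field written both ways. This is our own example, not code emitted by Protoboeuf, and in line with the advice in the talk it is the kind of transformation best left to a code generator.

```ruby
# Idiomatic version: loop over the four bytes of a little-endian uint32.
def read_fixed32_loop(buf, pos)
  value = 0
  4.times do |i|
    value |= buf.getbyte(pos + i) << (8 * i)
  end
  value
end

# Hand-unrolled version: same result, with no block call or loop counter.
def read_fixed32_unrolled(buf, pos)
  buf.getbyte(pos) |
    (buf.getbyte(pos + 1) << 8) |
    (buf.getbyte(pos + 2) << 16) |
    (buf.getbyte(pos + 3) << 24)
end

data = [0xDEADBEEF].pack("V") # "V" packs a 32-bit little-endian unsigned int
puts read_fixed32_loop(data, 0) == read_fixed32_unrolled(data, 0)
```

Both methods return the same integer; whether the unrolled form is measurably faster depends on the Ruby and YJIT version, so it should be benchmarked rather than assumed.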
Looking Forward
Areas for Improvement
●Still pain points around doing low-level operations in Ruby
●Ruby has lots of useful string methods…
○But often need to allocate small strings, which is a perf killer
○I will not allocate, allocations are the perf killer
●Similarly, you have to buffer reads into a string to do your parsing
○IO::Buffer is oriented towards binary protocols
○StringIO lacks useful methods and is slower than strings
●Encoding::BINARY needs to be reset regularly
○We’re working on resolving this one
●Creating hashes, strings, arrays etc with a given capacity
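The allocation point can be made concrete with a small sketch: two ways of scanning a buffer, one that creates a short string per character and one that reads byte values in place. The method names are hypothetical. `String.new` does accept a `capacity:` hint today, which is used here to size the buffer up front.

```ruby
# Allocation-heavy: yields a new one-character String for every character.
def count_newlines_chars(str)
  str.each_char.count { |c| c == "\n" }
end

# Allocation-light: reads integer byte values in place via String#getbyte.
def count_newlines_bytes(str)
  count = 0
  i = 0
  size = str.bytesize
  while i < size
    count += 1 if str.getbyte(i) == 10 # 10 is the byte value of "\n"
    i += 1
  end
  count
end

# Pre-size a binary buffer, then fill it.
out = String.new(capacity: 1024, encoding: Encoding::BINARY)
out << "line one\nline two\n"

puts count_newlines_chars(out)
puts count_newlines_bytes(out)
```

Both methods return the same count; the second form is the style of loop that parsing code such as the gems discussed above tends to rely on.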
Takeaways
●Traditionally, push to write performance-critical code in C
●YJIT changes the equation, we are entering a new regime
●As Ruby code gets faster, the balance changes
○Having to maintain C code or native dependencies can be a burden
○Native gems have the unfortunate tendency to break and be hard to debug
○In some cases, we can make pure Ruby gems that are as fast as C
●To leverage YJIT, we need to cultivate Ruby Power
●We should also think about how to make Ruby better for writing low-level code
We’re working on YJIT 3.4
●Already seeing nice performance improvements
●We’re currently focusing on:
○Improvements to generated code quality
○QoL improvements for analyzing performance in production
○Debugging and testing
●With Ruby 3.4, you should expect:
○Better performance across the board
○Comparable memory usage
○Even better tested and debugged
○Better support for optimizing binary I/O code
●If you’re feeling bold, you can try out Ruby master!
YJIT Speedup on Optcarrot Over Time (higher is better)
Thank you for listening! :)
To Learn More About YJIT
●Follow our work on the Rails At Scale blog:
○https://railsatscale.com/
●Check out the YJIT readme:
○https://docs.ruby-lang.org/en/master/yjit/yjit_md.html
●You can reach me at:
○[email protected]
○@Love2Code on twitter/X
●Come talk to me after this talk!