Breaking the Ruby Performance Barrier with YJIT

About This Presentation

With each of the past 3 Ruby releases, YJIT has delivered higher and higher performance. However, we are seeing diminishing returns, because as JIT-compiled code becomes faster, it makes up less and less of the total execution time, which is now becoming dominated by C function calls. As such, it ma...


Slide Content

Maxime Chevalier-Boisvert @ RubyKaigi 2024
Breaking the Ruby Performance Barrier

Hello, World!
●My name is Maxime Chevalier-Boisvert
●Obtained a PhD in compiler design in 2016
○JIT techniques for dynamically-typed languages
●Joined Shopify in 2020
○(not the same company as Spotify)
●Tech lead of the YJIT project

About YJIT
●Project started in 2020
○New JIT compiler built inside CRuby
○Goal: make Ruby seamlessly faster
●Built at Shopify, but fully open source
○Significant contributions from GitHub folks
●Data-driven approach to optimization
○Large, diverse set of benchmarks
○Benchmark often
○Gather detailed metrics

YJIT is a Team Effort
This project would not be possible without my amazing colleagues at Shopify and GitHub!
Alan Wu
@alanwusx
Aaron Patterson
@tenderlove
Maxime
@Love2Code
Noah Gibbs
@codefolio
John Hawthorn
@jhawthorn
Eileen Uchitelle
@eileencodes
Jean Boussier
@byroot
Kevin Newton
@kddnewton
Takashi Kokubun
@k0kubun
Jemma Issroff
@JemmaIssroff
Jimmy Miller
@jimmyhmiller
Adam Hess
@theHessParker
Kevin Menard
@nirvdrum
Randy Stauner
@rwstauner

Program
●Quick YJIT news updates
●History of supersonic flight
●Making Ruby code faster: the traditional way
●The need for a new approach
●Protoboeuf: a pure-Ruby protobuf
●Next steps for YJIT

YJIT News

Ruby 3.3 Includes the Third Release of YJIT
●YJIT first included as part of Ruby 3.1
●Marked performance improvements with 3.2, 3.3
●YJIT 3.3 goal: better tuned for production deployments
○With 3.1 and 3.2, some users complained that YJIT was not faster
○Had to tell people to adjust various command-line options
○3.3: use less memory, ship with better default options!
●Big end of year push to find and fix bugs
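For reference (not from the slides), a minimal sketch of enabling YJIT and checking that it is active:
# Run Ruby with the --yjit flag (or set RUBY_YJIT_ENABLE=1 in the environment):
#   ruby --yjit app.rb
# Then, inside the process:
puts RubyVM::YJIT.enabled?    # => true when YJIT is active
p RubyVM::YJIT.runtime_stats  # basic counters; richer output with --yjit-stats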

Another Big Deployment…
●YJIT already deployed on all Shopify stores and at Discourse
○Deployed at Shopify since December 2022
●Earlier this year, GitHub quietly deployed YJIT 3.3
○Deployment went smoothly
○Reported a ~15% improvement in response time
○Closer to 20% speedup for some endpoints
●If you’ve visited github.com today…

ismyhostfastyet.com - Shopify at #1, excellent p75 time

Supersonic Flight

1903: First successful powered flight by the Wright brothers, ~48 km/h

1914: S.E.4 flown in Great Britain, top speed 217 km/h

1940: P-51 Mustang, top speed ~690 km/h

1944: Messerschmitt Me 262, ~850-1000 km/h

1944: Gloster Meteor, ~750-950 km/h

Flying Faster
●WWII: competitive pressure to build faster aircraft
○Faster fighters are better at dog fights
○Faster bombers are harder for fighters to hit
●Propellers lose efficiency as they approach the speed of sound
○Jet engines remove that limitation
●Fighter pilots could accidentally break the speed of sound while diving
○But when doing so, strange things would happen

The Sound Barrier
●Airflow behaves differently at supersonic speeds
○Aircraft designed for subsonic speeds tend to shake violently
○Control surfaces become ineffective, or even reversed!
○Easy for pilots to lose control of the aircraft
●Drag rises sharply as we approach the speed of sound
○Need several times more thrust to punch through the transonic region
○Difficult to break the sound barrier merely because of the required power
●Supersonic speeds put high levels of stress on aircraft
○Aerodynamic heating causes thermal expansion, weakens materials

Rethinking Aircraft Design
●For supersonic flight, we are operating in a different regime
○We need to rethink aircraft design accordingly
●Need several times greater thrust to pass through transonic region
○The first plane to break the sound barrier in level flight used a rocket engine
●Airflow behaves differently at supersonic speeds
○Smaller, swept back wings help reduce aerodynamic drag
●Need to deal with problems such as heating due to air compression
○The SR-71 is mostly made of titanium

Bell X-1 rocket-propelled experimental aircraft (1947)

Chuck Yeager, test pilot, next to the Bell X-1 “Glamorous Glennis”

Modified B-29 Superfortress dropping the Bell X-1

General Chuck Yeager passed away in 2020. He lived to be 97.

Boeing 787 Dreamliner (first flight 2009)

Douglas DC-9 (first flight 1965)

Lockheed SR-71 Blackbird (first flight 1964), top speed ~3,540 km/h

???
This is RubyKaigi, your talk is supposed to be about Ruby…

Ok, but first, let me speak a bit about Python…

Python: The Old Way of Thinking
●Assumptions about performance:
○Python is a slow, interpreted language
○C is a fast, compiled language
●You are never allowed to complain about Python being too slow
●If your program is too slow, just rewrite the slow parts in C
○Push the performance-sensitive logic into C extensions
○This is how Python performance issues are addressed
●Python for expressiveness, C/C++/Go/Rust for speed

… Ruby has the same problem …

A New Regime

YJIT: A New Approach
●YJIT is the engine pushing Ruby closer to the speed of C
●As Ruby gets faster, we are entering a new regime
●The cost of calling C functions and translating data back and forth becomes relatively more and more expensive (it increases drag)
●YJIT can’t optimize C code (black box), so this code gets in our way
●The complexity of maintaining C extensions becomes less appealing

Ruby’s Future
●Approaching a limit when it comes to optimizing Ruby performance
○We can’t make C code any faster
○Calls to/from C are relatively slow
●How can we reach Ruby’s maximum performance potential?
●Can we write more Ruby gems in pure-Ruby code?
●Can Ruby do the things C does effectively?

Why do people write C extensions?
1.To interface with external I/O APIs or system calls
e.g. GTK, ALSA, OpenGL, Vulkan, etc.
2.To improve performance, alleviate performance bottlenecks
e.g. number crunching, scientific computing, expensive algorithm
3.To interface with a specific native library
e.g. libmysqlclient, redis, protobuf, libyaml
(many of the libraries we use deal with parsing/serialization)

Pure-Ruby Gems

redis-client
●The redis-client gem has two drivers:
○hiredis (native C library binding)
○A pure-Ruby implementation
●Aaron Patterson and Jean Boussier improved the Ruby driver
○Found that with YJIT enabled, the new Ruby driver is about on par with the native C extension (in the same ballpark)
●The Ruby driver could still be improved
○The C Ruby API allows pre-allocating hashes with the right size
■This capability is currently missing in Ruby
○Pure-Ruby driver could perform even better

GraphQL / TinyGQL
●GraphQL is used at Shopify and broadly across our industry
●The default GraphQL parser relies on racc
○Inefficient system where a native C library repeatedly calls into a Ruby tokenizer
○Lots of repeated Ruby-to-C calls (slow)
●Aaron Patterson wrote a pure-Ruby GraphQL parser
○With YJIT enabled, his parser outperforms the native C extension
●Blog post at:
○https://railsatscale.com/2023-08-29-ruby-outperforms-c/

Protobuf & Protoboeuf

Protobuf
●Protobuf is a binary data serialization protocol
●Protobuf is used at Shopify
○Serialization/deserialization of various data
○It’s used heavily in OpenTelemetry profiles
○We’re adopting Twirp as our internal communication protocol
●Protobuf is an annoying dependency to manage
○Common source of problems for Ruby upgrades
○At the top of allocation profiles
○Common source of memory leaks and crashes
●Could we build a pure-Ruby implementation that performs just as well as Google’s native implementation?

Protoboeuf
●Prototype pure-Ruby implementation of protobuf
○Aims to be compatible with the proto3 spec
○Still a prototype/experiment at this stage
●Many attractive upsides:
○Easy to debug. Easy to maintain. Easy to contribute.
○Highly portable (anywhere Ruby runs, it will work!)
○No binary/systems package dependency
○Not tied to a specific Ruby ABI version
○Generates self-contained code, no runtime dependency
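To make the "generates self-contained code" point concrete, here is a toy sketch (mine, not Protoboeuf's actual output or API), assuming a message like: message Point { int32 x = 1; int32 y = 2; }

# Toy sketch only: ignores negative values, other wire types, and field numbers >= 16.
# The idea: the generator emits a plain Ruby class that decodes the wire format
# directly with String#getbyte, with no native extension or runtime gem involved.
class Point
  attr_reader :x, :y

  def self.decode(buff)
    point = new
    index = 0
    len = buff.bytesize
    while index < len
      tag = buff.getbyte(index)        # field_number << 3 | wire_type
      index += 1
      # Read one varint: 7 bits per byte, least-significant group first;
      # a byte below 0x80 is the last byte of the value.
      value = 0
      shift = 0
      loop do
        byte = buff.getbyte(index)
        index += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        break if byte < 0x80
      end
      case tag >> 3
      when 1 then point.instance_variable_set(:@x, value)
      when 2 then point.instance_variable_set(:@y, value)
      end
    end
    point
  end
end

Point.decode("\x08\x96\x01\x10\x02".b).x  # => 150 (the classic protobuf encoding example)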

AI-generated image (any resemblance to Shia LaBeouf is purely coincidental).

What does Protoboeuf do?
●TinyGQL, redis-client, protoboeuf deal with parsing and serialization
○Work with strings and streams of bytes
○Low-level bit manipulation
○Allocate objects in the Ruby heap
●How does protoboeuf work?
○Parses proto3 specification (.proto files)
○Generates self-contained Ruby code to encode/decode protobuf streams
○Leverages YJIT for performance
●How fast is protoboeuf?

ruby 3.4.0dev (2024-05-03T18:37:19Z master 7a49edcf1f) [x86_64-linux]
Warming up --------------------------------------
decode upstream 5.000 i/100ms
decode protoboeuf 1.000 i/100ms
Calculating -------------------------------------
decode upstream 57.577 (± 3.5%) i/s - 290.000 in 5.043514s
decode protoboeuf 7.754 (± 0.0%) i/s - 39.000 in 5.029616s

Comparison:
decode upstream: 57.6 i/s
decode protoboeuf: 7.8 i/s - 7.43x slower

ruby 3.4.0dev (2024-05-03T18:37:19Z master 7a49edcf1f) [x86_64-linux]
Warming up --------------------------------------
decode and read upstream
1.000 i/100ms
decode and read protoboeuf
1.000 i/100ms
Calculating -------------------------------------
decode and read upstream
2.685 (± 0.0%) i/s - 14.000 in 5.218228s
decode and read protoboeuf
6.337 (± 0.0%) i/s - 32.000 in 5.050240s

Comparison:
decode and read upstream: 2.7 i/s
decode and read protoboeuf: 6.3 i/s - 2.36x faster

ruby 3.4.0dev (2024-05-03T18:37:19Z master 7a49edcf1f) +YJIT [x86_64-linux]
Warming up --------------------------------------
decode and read upstream
1.000 i/100ms
decode and read protoboeuf
2.000 i/100ms
Calculating -------------------------------------
decode and read upstream
2.799 (± 0.0%) i/s - 14.000 in 5.008288s
decode and read protoboeuf
26.453 (± 0.0%) i/s - 134.000 in 5.066900s

Comparison:
decode and read upstream: 2.8 i/s
decode and read protoboeuf: 26.5 i/s - 9.45x faster

Caveats
●Our pure-Ruby protobuf is still an experiment / proof of concept
●It’s much faster than Google’s at decoding, but slower at encoding
○We didn’t spend any time optimizing the encoding
●We’re thinking of using it internally, but it comes without support
●We built protoboeuf in part so we could see what it would take
○How fast could we make this?
○What kinds of optimizations could we do in YJIT?
○How hard would it be to make a faster pure-Ruby protobuf?
○The answer is it really wasn’t that difficult!

How did we do it?
●Attack the problem from two sides at once:
○Write Ruby code that we know YJIT will optimize well
○Add optimizations to YJIT to better optimize low-level Ruby code
●Is optimizing YJIT for protoboeuf cheating?
○No. We simply optimized core methods such as String#getbyte and String#setbyte
○Locate performance bottlenecks and optimize them
○This work benefits a lot of existing Ruby code
●Is optimizing Ruby code for YJIT cheating?
●Protoboeuf decoding is still faster than Google’s protobuf without YJIT!

+14.5% faster

YJIT 3.4 & Protoboeuf
●#9763 YJIT: add specialized codegen for fixnum XOR
●#9767 YJIT: add codegen for String#setbyte
●#10188 YJIT: String#getbyte codegen
●#10323 YJIT: Propagate Array, Hash, and String classes
●#10401 YJIT: Optimize putobject+opt_ltlt for integers
●#10487 YJIT: Optimize local variables when EP == BP
●… And more …
●None of these are purely specific to protoboeuf!
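As an illustrative sketch (not from the slides) of the kind of byte-level Ruby these specializations target:

# Illustrative only: the kind of hot loop that benefits from the
# String#getbyte / String#setbyte / fixnum-XOR codegen listed above.
def xor_bytes(a, b)
  out = "\0".b * a.bytesize                      # binary output buffer
  i = 0
  while i < a.bytesize
    out.setbyte(i, a.getbyte(i) ^ b.getbyte(i))  # byte-wise XOR, no per-byte String allocation
    i += 1
  end
  out
end

xor_bytes("\x0F\xF0".b, "\xFF\xFF".b)  # => "\xF0\x0F"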

if tag == 0x8
## PULL_UINT64
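# Manually unrolled varint decode: each byte carries 7 bits of the value
# (least-significant group first); a byte below 0x80 ends the value.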
@x =
if (byte0 = buff.getbyte(index)) < 0x80
index += 1
byte0
elsif (byte1 = buff.getbyte(index + 1)) < 0x80
index += 2
(byte1 << 7) | (byte0 & 0x7F)
elsif (byte2 = buff.getbyte(index + 2)) < 0x80
index += 3
(byte2 << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
elsif (byte3 = buff.getbyte(index + 3)) < 0x80
index += 4
(byte3 << 21) | ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) |
(byte0 & 0x7F)
elsif (byte4 = buff.getbyte(index + 4)) < 0x80
index += 5
(byte4 << 28) | ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
((byte1 & 0x7F) << 7) | (byte0 & 0x7F)

Optimizing Ruby Code for YJIT
●The code we generate for protoboeuf is quirky/convoluted
●We manually inlined functions, manually unrolled loops
●We avoided using local variables when possible
○YJIT 3.3 doesn’t do register allocation for locals
●We used self for field accesses
○We can generate more efficient code this way
●Wanted to reach maximum performance possible with YJIT
○This is OK because it’s only generated code
○This code won’t get slower with future Ruby releases
●We don’t recommend you do this
○Many tricks we used (e.g. avoiding locals) will become unnecessary

Looking Forward

Areas for Improvement
●Still pain points around doing low-level operations in Ruby
●Ruby has lots of useful string methods…
○But we often need to allocate small strings, which is a perf killer (see the sketch after this list)
○I will not allocate, allocations are the perf killer
●Similarly, you have to buffer reads into a string to do your parsing
○IO::Buffer is oriented towards binary protocols
○StringIO lacks useful methods and is slower than strings
●Encoding::BINARY needs to be reset regularly
○We’re working on resolving this one
●No way to create hashes, strings, arrays, etc. with a given capacity
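A sketch of the small-string allocation pain point above (illustrative only, not from the slides): reading a 4-byte little-endian integer with byteslice allocates a throwaway String on every call, while getbyte does not.

# Illustrative only: two ways to read a 32-bit little-endian unsigned integer.
def read_u32_slicing(buff, i)
  buff.byteslice(i, 4).unpack1("V")   # allocates a 4-byte String every call
end

def read_u32_getbyte(buff, i)
  buff.getbyte(i) |
    (buff.getbyte(i + 1) << 8) |
    (buff.getbyte(i + 2) << 16) |
    (buff.getbyte(i + 3) << 24)       # no intermediate String allocated
end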

Takeaways
●Traditionally, the push has been to write performance-critical code in C
●YJIT changes the equation, we are entering a new regime
●As Ruby code gets faster, the balance changes
○Having to maintain C code or native dependencies can be a burden
○Native gems have the unfortunate tendency to break and be hard to debug
○In some cases, we can make pure Ruby gems that are as fast as C
●To leverage YJIT, we need to cultivate Ruby Power
●We should also think about how to make Ruby better for writing low-level code

We’re working on YJIT 3.4
●Already seeing nice performance improvements
●We’re currently focusing on:
○Improvements to generated code quality
○QoL improvements for analyzing performance in production
○Debugging and testing
●With Ruby 3.4, you should expect:
○Better performance across the board
○Comparable memory usage
○Even better tested and debugged
○Better support for optimizing binary I/O code
●If you’re feeling bold, you can try out Ruby master!

YJIT Speedup on Optcarrot Over Time (higher is better)

Thank you for listening! :)

To Learn More About YJIT
●Follow our work on the Rails At Scale blog:
○https://railsatscale.com/
●Check out the YJIT readme:
○https://docs.ruby-lang.org/en/master/yjit/yjit_md.html
●You can reach me at:
[email protected]
○@Love2Code on twitter/X
●Come talk to me after this talk!