Breaking the Ruby Performance Barrier with YJIT

About This Presentation

With each of the past 3 Ruby releases, YJIT has delivered higher and higher performance. However, we are seeing diminishing returns, because as JIT-compiled code becomes faster, it makes up less and less of the total execution time, which is now becoming dominated by C function calls. As such, it ma...


Slide Content

Maxime Chevalier-Boisvert @ RubyKaigi 2024
Breaking the Ruby Performance Barrier

Hello, World!
●My name is Maxime Chevalier-Boisvert
●Obtained a PhD in compiler design in 2016
○JIT techniques for dynamically-typed languages
●Joined Shopify in 2020
○(not the same company as Spotify)
●Tech lead of the YJIT project

About YJIT
●Project started in 2020
○New JIT compiler built inside CRuby
○Goal: make Ruby seamlessly faster
●Built at Shopify, but fully open source
○Significant contributions from GitHub folks
●Data-driven approach to optimization
○Large, diverse set of benchmarks
○Benchmark often
○Gather detailed metrics

YJIT is a Team Effort
This project would not be possible without my amazing colleagues at Shopify and GitHub!
Alan Wu
@alanwusx
Aaron Patterson
@tenderlove
Maxime
@Love2Code
Noah Gibbs
@codefolio
John Hawthorn
@jhawthorn
Eileen Uchitelle
@eileencodes
Jean Boussier
@byroot
Kevin Newton
@kddnewton
Takashi Kokubun
@k0kubun
Jemma Issroff
@JemmaIssroff
Jimmy Miller
@jimmyhmiller
Adam Hess
@theHessParker
Kevin Menard
@nirvdrum
Randy Stauner
@rwstauner

Program
●Quick YJIT news updates
●History of supersonic flight
●Making Ruby code faster: the traditional way
●The need for a new approach
●Protoboeuf: a pure-Ruby protobuf
●Next steps for YJIT

YJIT News

Ruby 3.3 Includes the Third Release of YJIT
●YJIT first included as part of Ruby 3.1
●Marked performance improvements with 3.2, 3.3
●YJIT 3.3 goal: better tuned for production deployments
○With 3.1 and 3.2, some users complained that YJIT was not faster
○Had to tell people to adjust various command-line options
○3.3: use less memory, ship with better default options!
●Big end of year push to find and fix bugs
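For reference (not from the slides), a minimal sketch of enabling YJIT and checking that it is active:
# Run Ruby with the --yjit flag (or set RUBY_YJIT_ENABLE=1 in the environment):
#   ruby --yjit app.rb
# Then, inside the process:
puts RubyVM::YJIT.enabled?    # => true when YJIT is active
p RubyVM::YJIT.runtime_stats  # basic counters; richer output with --yjit-stats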

Another Big Deployment…
●YJIT already deployed on all Shopify stores and at Discourse
○Deployed at Shopify since December 2022
●Earlier this year, GitHub quietly deployed YJIT 3.3
○Deployment went smoothly
○Reported a ~15% improvement in response time
○Closer to 20% speedup for some endpoints
●If you’ve visited github.com today…

ismyhostfastyet.com - Shopify at #1, excellent p75 time

Supersonic Flight

1903: First successful powered flight by the Wright brothers, ~48 km/h

1914: S.E.4 flown in Great Britain, top speed 217 km/h

1940: P-51 Mustang, top speed ~690 km/h

1944: Messerschmitt Me 262, ~850-1000 km/h

1944: Gloster Meteor, ~750-950 km/h

Flying Faster
●WWII: competitive pressure to build faster aircraft
○Faster fighters are better at dog fights
○Faster bombers are harder for fighters to hit
●Propellers lose efficiency as they approach the speed of sound
○Jet engines remove that limitation
●Fighter pilots could accidentally break the speed of sound while diving
○But when doing so, strange things would happen

The Sound Barrier
●Airflow behaves differently at supersonic speeds
○Aircraft designed for subsonic speeds tend to shake violently
○Control surfaces become ineffective, or even reversed!
○Easy for pilots to lose control of the aircraft
●Drag rises sharply as we approach the speed of sound
○Need several times more thrust to punch through the transonic region
○Difficult to break the sound barrier merely because of the required power
●Supersonic speeds put high levels of stress on aircraft
○Aerodynamic heating causes thermal expansion, weakens materials

Rethinking Aircraft Design
●For supersonic flight, we are operating in a different regime
○We need to rethink aircraft design accordingly
●Need several times greater thrust to pass through transonic region
○The first plane to break the sound barrier in level flight used a rocket engine
●Airflow behaves differently at supersonic speeds
○Smaller, swept back wings help reduce aerodynamic drag
●Need to deal with problems such as heating due to air compression
○The SR-71 is mostly made of titanium

Bell X-1 rocket-propelled experimental aircraft (1947)

Chuck Yeager, test pilot, next to the Bell X-1 “Glamorous Glennis”

Modified B-29 Superfortress dropping the Bell X-1

General Chuck Yeager passed away in 2020. He lived to be 97.

Boeing 787 Dreamliner (first flight 2009)

Douglas DC-9 (first flight 1965)

Lockheed SR-71 Blackbird (first flight 1964), top speed ~3,540 km/h

???
This is RubyKaigi, your talk is supposed to be about Ruby…

Ok, but first, let me speak a bit about Python…

Python: The Old Way of Thinking
●Assumptions about performance:
○Python is a slow, interpreted language
○C is a fast, compiled language
●You are never allowed to complain about Python being too slow
●If your program is too slow, just rewrite the slow parts in C
○Push the performance-sensitive logic into C extensions
○This is how Python performance issues are addressed
●Python for expressiveness, C/C++/Go/Rust for speed

… Ruby has the same problem …

A New Regime

YJIT: A New Approach
●YJIT is the engine pushing Ruby closer to the speed of C
●As Ruby gets faster, we are entering a new regime
●The cost of calling C functions and translating data back and forth becomes relatively more and more expensive (it increases drag)
●YJIT can’t optimize C code (black box), so this code gets in our way
●The complexity of maintaining C extensions becomes less appealing

Ruby’s Future
●Approaching a limit when it comes to optimizing Ruby performance
○We can’t make C code any faster
○Calls to/from C are relatively slow
●How can we reach Ruby’s maximum performance potential?
●Can we write more Ruby gems in pure-Ruby code?
●Can Ruby do the things C does effectively?

Why do people write C extensions?
1.To interface with external I/O APIs or system calls
e.g. GTK, ALSA, OpenGL, Vulkan, etc.
2.To improve performance, alleviate performance bottlenecks
e.g. number crunching, scientific computing, expensive algorithm
3.To interface with a specific native library
e.g. libmysqlclient, redis, protobuf, libyaml
(many of the libraries we use deal with parsing/serialization)

Pure-Ruby Gems

redis-client
●The redis-client gem has two drivers:
○hiredis (native C library binding)
○A pure-Ruby implementation
●Aaron Patterson and Jean Boussier improved the Ruby driver
○Found that with YJIT enabled, the new Ruby driver is about on par with the native C extension (in the same ballpark)
●The Ruby driver could still be improved
○The C Ruby API allows pre-allocating hashes with the right size
■This capability is currently missing in Ruby
○Pure-Ruby driver could perform even better

GraphQL / TinyGQL
●GraphQL is used at Shopify and broadly across our industry
●The default GraphQL parser relies on racc
○Inefficient system where a native C library repeatedly calls into a Ruby tokenizer
○Lots of repeated Ruby-to-C calls (slow)
●Aaron Patterson wrote a pure-Ruby GraphQL parser
○With YJIT enabled, his parser outperforms the native C extension
●Blog post at:
○https://railsatscale.com/2023-08-29-ruby-outperforms-c/

Protobuf & Protoboeuf

Protobuf
●Protobuf is a binary data serialization protocol
●Protobuf is used at Shopify
○Serialization/deserialization of various data
○It’s used heavily in OpenTelemetry profiles
○We’re adopting Twirp as our internal communication protocol
●Protobuf is an annoying dependency to manage
○Common source of problems for Ruby upgrades
○At the top of allocation profiles
○Common source of memory leaks and crashes
●Could we build a pure-Ruby implementation that performs just as well as Google’s native implementation?

Protoboeuf
●Prototype pure-Ruby implementation of protobuf
○Aims to be compatible with the proto3 spec
○Still a prototype/experiment at this stage
●Many attractive upsides:
○Easy to debug. Easy to maintain. Easy to contribute.
○Highly portable (anywhere Ruby runs, it will work!)
○No binary/systems package dependency
○Not tied to a specific Ruby ABI version
○Generates self-contained code, no runtime dependency
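To make the "generates self-contained code" point concrete, here is a toy sketch (mine, not Protoboeuf's actual output or API), assuming a message like: message Point { int32 x = 1; int32 y = 2; }

# Toy sketch only: ignores negative values, other wire types, and field numbers >= 16.
# The idea: the generator emits a plain Ruby class that decodes the wire format
# directly with String#getbyte, with no native extension or runtime gem involved.
class Point
  attr_reader :x, :y

  def self.decode(buff)
    point = new
    index = 0
    len = buff.bytesize
    while index < len
      tag = buff.getbyte(index)        # field_number << 3 | wire_type
      index += 1
      # Read one varint: 7 bits per byte, least-significant group first;
      # a byte below 0x80 is the last byte of the value.
      value = 0
      shift = 0
      loop do
        byte = buff.getbyte(index)
        index += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        break if byte < 0x80
      end
      case tag >> 3
      when 1 then point.instance_variable_set(:@x, value)
      when 2 then point.instance_variable_set(:@y, value)
      end
    end
    point
  end
end

Point.decode("\x08\x96\x01\x10\x02".b).x  # => 150 (the classic protobuf encoding example)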

AI-generated image (any resemblance to Shia LaBeouf is purely coincidental).

What does Protoboeuf do?
●TinyGQL, redis-client, protoboeuf deal with parsing and serialization
○Work with strings and streams of bytes
○Low-level bit manipulation
○Allocate objects in the Ruby heap
●How does protoboeuf work?
○Parses proto3 specification (.proto files)
○Generates self-contained Ruby code to encode/decode protobuf streams
○Leverages YJIT for performance
●How fast is protoboeuf?

ruby 3.4.0dev (2024-05-03T18:37:19Z master 7a49edcf1f) [x86_64-linux]
Warming up --------------------------------------
decode upstream 5.000 i/100ms
decode protoboeuf 1.000 i/100ms
Calculating -------------------------------------
decode upstream 57.577 (± 3.5%) i/s - 290.000 in 5.043514s
decode protoboeuf 7.754 (± 0.0%) i/s - 39.000 in 5.029616s

Comparison:
decode upstream: 57.6 i/s
decode protoboeuf: 7.8 i/s - 7.43x slower

ruby 3.4.0dev (2024-05-03T18:37:19Z master 7a49edcf1f) [x86_64-linux]
Warming up --------------------------------------
decode and read upstream
1.000 i/100ms
decode and read protoboeuf
1.000 i/100ms
Calculating -------------------------------------
decode and read upstream
2.685 (± 0.0%) i/s - 14.000 in 5.218228s
decode and read protoboeuf
6.337 (± 0.0%) i/s - 32.000 in 5.050240s

Comparison:
decode and read upstream: 2.7 i/s
decode and read protoboeuf: 6.3 i/s - 2.36x faster

ruby 3.4.0dev (2024-05-03T18:37:19Z master 7a49edcf1f) +YJIT [x86_64-linux]
Warming up --------------------------------------
decode and read upstream
1.000 i/100ms
decode and read protoboeuf
2.000 i/100ms
Calculating -------------------------------------
decode and read upstream
2.799 (± 0.0%) i/s - 14.000 in 5.008288s
decode and read protoboeuf
26.453 (± 0.0%) i/s - 134.000 in 5.066900s

Comparison:
decode and read upstream: 2.8 i/s
decode and read protoboeuf: 26.5 i/s - 9.45x faster

Caveats
●Our pure-Ruby protobuf is still an experiment / proof of concept
●It’s much faster than Google’s at decoding, but slower at encoding
○We didn’t spend any time optimizing the encoding
●We’re thinking of using it internally, but it comes without support
●We built protoboeuf in part so we could see what it would take
○How fast could we make this?
○What kinds of optimizations could we do in YJIT?
○How hard would it be to make a faster pure-Ruby protobuf?
○The answer is it really wasn’t that difficult!

How did we do it?
●Attack the problem from two sides at once:
○Write Ruby code that we know YJIT will optimize well
○Add optimizations to YJIT to better optimize low-level Ruby code
●Is optimizing YJIT for protoboeuf cheating?
○No. We simply optimized core methods such as String#getbyte and String#setbyte
○Locate performance bottlenecks and optimize them
○This work benefits a lot of existing Ruby code
●Is optimizing Ruby code for YJIT cheating?
●Protoboeuf decoding is still faster than Google’s protobuf without YJIT!

+14.5% faster

YJIT 3.4 & Protoboeuf
●#9763 YJIT: add specialized codegen for fixnum XOR
●#9767 YJIT: add codegen for String#setbyte
●#10188 YJIT: String#getbyte codegen
●#10323 YJIT: Propagate Array, Hash, and String classes
●#10401 YJIT: Optimize putobject+opt_ltlt for integers
●#10487 YJIT: Optimize local variables when EP == BP
●… And more …
●None of these are purely specific to protoboeuf!
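As an illustrative sketch (not from the slides) of the kind of byte-level Ruby these specializations target:

# Illustrative only: the kind of hot loop that benefits from the
# String#getbyte / String#setbyte / fixnum-XOR codegen listed above.
def xor_bytes(a, b)
  out = "\0".b * a.bytesize                      # binary output buffer
  i = 0
  while i < a.bytesize
    out.setbyte(i, a.getbyte(i) ^ b.getbyte(i))  # byte-wise XOR, no per-byte String allocation
    i += 1
  end
  out
end

xor_bytes("\x0F\xF0".b, "\xFF\xFF".b)  # => "\xF0\x0F"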

if tag == 0x8
## PULL_UINT64
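# Manually unrolled varint decode: each byte carries 7 bits of the value
# (least-significant group first); a byte below 0x80 ends the value.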
@x =
if (byte0 = buff.getbyte(index)) < 0x80
index += 1
byte0
elsif (byte1 = buff.getbyte(index + 1)) < 0x80
index += 2
(byte1 << 7) | (byte0 & 0x7F)
elsif (byte2 = buff.getbyte(index + 2)) < 0x80
index += 3
(byte2 << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
elsif (byte3 = buff.getbyte(index + 3)) < 0x80
index += 4
(byte3 << 21) | ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) |
(byte0 & 0x7F)
elsif (byte4 = buff.getbyte(index + 4)) < 0x80
index += 5
(byte4 << 28) | ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
((byte1 & 0x7F) << 7) | (byte0 & 0x7F)

Optimizing Ruby Code for YJIT
●The code we generate for protoboeuf is quirky/convoluted
●We manually inlined functions, manually unrolled loops
●We avoided using local variables when possible
○YJIT 3.3 doesn’t do register allocation for locals
●We used self for field accesses
○We can generate more efficient code this way
●Wanted to reach maximum performance possible with YJIT
○This is OK because it’s only generated code
○This code won’t get slower with future Ruby releases
●We don’t recommend you do this
○Many tricks we used (e.g. avoiding locals) will become unnecessary

Looking Forward

Areas for Improvement
●Still pain points around doing low-level operations in Ruby
●Ruby has lots of useful string methods…
○But we often need to allocate small strings, which is a perf killer (see the sketch after this list)
○I will not allocate, allocations are the perf killer
●Similarly, you have to buffer reads into a string to do your parsing
○IO::Buffer is oriented towards binary protocols
○StringIO lacks useful methods and is slower than strings
●Encoding::BINARY needs to be reset regularly
○We’re working on resolving this one
●No way to create hashes, strings, arrays, etc. with a given capacity
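A sketch of the small-string allocation pain point above (illustrative only, not from the slides): reading a 4-byte little-endian integer with byteslice allocates a throwaway String on every call, while getbyte does not.

# Illustrative only: two ways to read a 32-bit little-endian unsigned integer.
def read_u32_slicing(buff, i)
  buff.byteslice(i, 4).unpack1("V")   # allocates a 4-byte String every call
end

def read_u32_getbyte(buff, i)
  buff.getbyte(i) |
    (buff.getbyte(i + 1) << 8) |
    (buff.getbyte(i + 2) << 16) |
    (buff.getbyte(i + 3) << 24)       # no intermediate String allocated
end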

Takeaways
●Traditionally, the push has been to write performance-critical code in C
●YJIT changes the equation, we are entering a new regime
●As Ruby code gets faster, the balance changes
○Having to maintain C code or native dependencies can be a burden
○Native gems have the unfortunate tendency to break and be hard to debug
○In some cases, we can make pure Ruby gems that are as fast as C
●To leverage YJIT, we need to cultivate Ruby Power
●We should also think about how to make Ruby better for writing low-level code

We’re working on YJIT 3.4
●Already seeing nice performance improvements
●We’re currently focusing on:
○Improvements to generated code quality
○QoL improvements for analyzing performance in production
○Debugging and testing
●With Ruby 3.4, you should expect:
○Better performance across the board
○Comparable memory usage
○Even better tested and debugged
○Better support for optimizing binary I/O code
●If you’re feeling bold, you can try out Ruby master!

YJIT Speedup on Optcarrot Over Time (higher is better)

Thank you for listening! :)

To Learn More About YJIT
●Follow our work on the Rails At Scale blog:
○https://railsatscale.com/
●Check out the YJIT readme:
○https://docs.ruby-lang.org/en/master/yjit/yjit_md.html
●You can reach me at:
[email protected]
○@Love2Code on twitter/X
●Come talk to me after this talk!