JVM in the Age of AI: Babylon, Valhalla, TornadoVM and friends

Artur Skowroński · Oct 10, 2024

About This Presentation

Are you tired of all the hype around yet another LLM-as-a-Service? That's why I want to talk about something more interesting than just another tool for prompt engineers.

While the entire industry is discussing it, my goal is to explain what needs to happen within the virtual machine for the J...


Slide Content

Artur Skowroński
JVM in the Age of AI
Babylon, Valhalla, TornadoVM and friends

Artur Skowroński
Head of Java / Kotlin Development

What we won't be talking about today...
• Using LLMs through an API (LangChain4j, Semantic Kernel, Spring AI, etc.)
• Data engineering (data collection, data cleaning, etc.)
• Model development (MLOps, observability, etc.)

What we will be talking about today!
• Using and running inference on existing models
• Mainly about GPU programming!
• …but if I had told you that up front, probably no one would have shown up

Helicopter View

152 Slides / 45 Minutes

A bit of theory

Model Inference
[diagram: Training Data → Trained Model (fixed parameters); New Data + Trained Model → Inference → Prediction]

Model Inference
[diagram: during inference, input flows through Layer 1 → Layer 2 → Layer 3 to a prediction, e.g. the next token of text or the classification of an image]

Model Inference
[diagram: Trained Model File + New Data (Query) → Inference → Prediction]

Different Model Formats
.h5 (Keras: weights & architecture)
.pkl (Pickle)
.pt (PyTorch)
.pb (TensorFlow protobuf)

All of these formats come from Python-based frameworks.

Use Python Inference Tooling Libraries with the JVM

GraalVM

GraalVM & Kubernetes

Truffle & GraalPy

GraalVM in Truffle Mode
[diagram: a Language Interpreter runs on GraalVM's bytecode interpreter; the Graal Compiler turns the interpreter plus guest program into compiled code]

GraalVM in Truffle Mode
Language interpreters built on Truffle: GraalPy, GraalJS, Espresso (Java on Truffle), TruffleRuby

Truffle & GraalPy

Go a Bit More Complex: GraalPy
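A minimal sketch of calling Python from Java through GraalPy, using the GraalVM polyglot API (assumes a GraalPy runtime, e.g. the org.graalvm.polyglot python dependency, is on the classpath):

import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class GraalPyDemo {
    public static void main(String[] args) {
        // "python" resolves to the installed GraalPy language runtime
        try (Context context = Context.newBuilder("python")
                .allowAllAccess(true)   // only needed if the Python code touches host resources
                .build()) {
            Value result = context.eval("python", "2 + 40");
            System.out.println(result.asInt());   // 42
        }
    }
}

The same Context can import real Python packages, which is exactly where the complexity shown on the next slides begins.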

Shortcuts don't help here
Inferring a model in Java by reusing Python components

Yesterday, 15:10

Inference of a model from Java

Model Inference
[diagram: Query from API → App Logic → Inference against the Trained Model File → Response to User with Predictions]

TensorFlow
TensorFlow is an open-source
machine learning framework
for building and deploying ML
models, especially neural
networks.

Deeplearning4j (DL4J)
DL4J is a popular ML library in
Java that allows for easy
loading of models trained in
other environments (e.g.,
Keras, TensorFlow).
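A hedged sketch of the Keras import path (assumes the deeplearning4j modelimport module on the classpath; the file name model.h5 and the 784-feature input shape are illustrative, not from the deck):

import org.deeplearning4j.nn.modelimport.keras.KerasModelImport;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class Dl4jInference {
    public static void main(String[] args) throws Exception {
        // Load a sequential Keras model saved as HDF5 (architecture + weights)
        MultiLayerNetwork model =
                KerasModelImport.importKerasSequentialModelAndWeights("model.h5");
        // Run inference on a single input row; the shape is model-specific
        INDArray input = Nd4j.rand(1, 784);
        INDArray prediction = model.output(input);
        System.out.println(prediction);
    }
}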

Tribuo
Tribuo is a relatively new machine
learning library for Java, designed
by Oracle Labs.
Its main goal is to simplify the
creation, training, and deployment
of machine learning models in
Java applications.
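A minimal Tribuo sketch, training and applying a classifier (assumes the tribuo-classification-sgd dependency; iris.csv and the "species" response column are illustrative):

import org.tribuo.Model;
import org.tribuo.MutableDataset;
import org.tribuo.Prediction;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;
import org.tribuo.data.csv.CSVLoader;
import java.nio.file.Paths;

public class TribuoDemo {
    public static void main(String[] args) throws Exception {
        // Load a CSV where the "species" column is the label to predict
        var loader = new CSVLoader<>(new LabelFactory());
        var dataset = new MutableDataset<>(
                loader.loadDataSource(Paths.get("iris.csv"), "species"));
        // Train a simple logistic-regression classifier
        Model<Label> model = new LogisticRegressionTrainer().train(dataset);
        // Predict the label of one example, just to show the API shape
        Prediction<Label> p = model.predict(dataset.getExample(0));
        System.out.println(p.getOutput().getLabel());
    }
}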

There are compatible implementations of the common standards

API
JPA
(Java Persistence API)
JDBC
(Java Database Connectivity)
JAX-RS
(Java API for RESTful Web
Services)
Servlet API
JSR (Java Specification Request)

Interesting JSR
JSR 381 – Visual Recognition (VisRec) API
JSR 376 – Java Platform Module System (Project Jigsaw)
JSR 383 – Java SE 18.3 (Java 10)
JSR 315 – Java Servlet 3.0
JSR 380 – Bean Validation 2.0
JSR 244 – Java EE 5
JSR 303 – Bean Validation 1.0
JSR 221 – JDBC 4.0
JSR 133 – Java Memory Model and Thread Specification
JSR 292 – Dynamically Typed Languages Support (invokedynamic)

DeepJavaLibrary
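DJL's central abstraction is the Criteria builder, which asks the model zoo for a model matching a description. A hedged sketch (the image URL is illustrative; which concrete model loads depends on the engine dependencies present):

import ai.djl.Application;
import ai.djl.inference.Predictor;
import ai.djl.modality.Classifications;
import ai.djl.modality.cv.Image;
import ai.djl.modality.cv.ImageFactory;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

public class DjlDemo {
    public static void main(String[] args) throws Exception {
        // Describe what we want; the model zoo resolves a concrete model
        Criteria<Image, Classifications> criteria = Criteria.builder()
                .setTypes(Image.class, Classifications.class)
                .optApplication(Application.CV.IMAGE_CLASSIFICATION)
                .build();
        try (ZooModel<Image, Classifications> model = criteria.loadModel();
             Predictor<Image, Classifications> predictor = model.newPredictor()) {
            Image img = ImageFactory.getInstance()
                    .fromUrl("https://example.com/cat.jpg");   // illustrative URL
            System.out.println(predictor.predict(img));
        }
    }
}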

Visual Recognition (VisRec) JSR #381

Different Model Formats
.h5 (Keras: weights & architecture)
.pkl (Pickle)
.pt (PyTorch)
.pb (TensorFlow protobuf)

What about the performance of the process?

Model Inference
[diagram: input flows through Layer 1 → Layer 2 → Layer 3 to a prediction, e.g. the next token of text or the classification of an image]
Computational Complexity
Each layer requires performing a vast number of mathematical operations, such as matrix multiplications. The more layers a model has, the greater the number of such operations.

Model Size
Models with many layers typically have millions, or even billions, of parameters. Storing and processing such a large number of weights requires significant memory resources.
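To make "a vast number of operations" concrete, a minimal plain-Java sketch of the operation at the heart of a dense layer, a naive matrix-vector multiply; for an n-by-m weight matrix this is n*m multiply-adds, repeated for every layer and, in a language model, for every generated token:

public class DenseLayerCore {
    // output = weights (n x m) * input (m)
    static float[] matVec(float[][] weights, float[] input) {
        float[] output = new float[weights.length];
        for (int i = 0; i < weights.length; i++) {
            float sum = 0f;
            for (int j = 0; j < input.length; j++) {
                sum += weights[i][j] * input[j];   // one multiply-add per weight
            }
            output[i] = sum;
        }
        return output;
    }
}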

Optimization: Half Floats and Project Valhalla

Half Floats
In standard floating-point numbers (commonly
known as floats), we have 32 bits, with some
allocated for the sign of the number, some for the
exponent, and some for the mantissa, which
determines precision.
1 bit for the sign (positive or negative)
8 bits for the exponent (range)
23 bits for the mantissa (precision)

Half Floats
In half-floats, we only have 16 bits, meaning they can store smaller and less precise numbers: a narrower range, with fewer representable numbers in any given interval.
1 bit for the sign (positive or negative)
5 bits for the exponent (range)
10 bits for the mantissa (precision)
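Since Java 20 the JDK already ships the conversions (Float.floatToFloat16 and Float.float16ToFloat, with the 16-bit value carried in a short); Valhalla value types are what would let a HalfFloat type be flattened densely in arrays. A small sketch of the conversion and its precision loss:

public class HalfFloatDemo {
    public static void main(String[] args) {
        float original = 3.14159265f;
        // Pack into IEEE 754 binary16: 1 sign + 5 exponent + 10 mantissa bits
        short half = Float.floatToFloat16(original);
        // Unpack back to 32 bits; precision beyond 10 mantissa bits is gone
        float roundTripped = Float.float16ToFloat(half);
        System.out.println(original);      // 3.1415927
        System.out.println(roundTripped);  // 3.140625
    }
}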


Not a silver bullet - access to memory is more important

Native Calls

Start Simple - JNI and Calling C Libraries

Quick Intro to JNI
JNI stands for Java Native Interface, a framework that allows
Java code running in the Java Virtual Machine (JVM) to
interact with native applications and libraries written in other
programming languages like C or C++.
JNI is (was?) typically used when Java applications need to
perform tasks that cannot be efficiently accomplished with
pure Java, such as using system-specific libraries or interacting
with low-level hardware.

Java 1.1, 1997

Quick Intro to JNI
Declare the native method in Java
In your Java code, declare a method as native
without providing an implementation…
…and compile the code, which generates a .class file.

Quick Intro to JNI
Using the javah tool (removed in JDK 10 in favor of javac -h), generate a C/C++ header file. This file contains the native method signatures in C/C++…
…for which you then write the native implementation in C/C++ by hand.

Quick Intro to JNI
Compile the C/C++ code into a native library (e.g., .dll on
Windows, .so on Linux, .dylib on macOS).
Java will load this library at runtime using System.loadLibrary().

Quick Intro to JNI
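Putting the steps together, a hedged reconstruction of the classic flow (the class name, library name, and C body are illustrative, not from the slides):

public class HelloJni {
    // Step 1: declare the native method; no Java body is provided
    public static native int add(int a, int b);

    static {
        // Step 4: load the compiled library (libhellojni.so / hellojni.dll / libhellojni.dylib)
        System.loadLibrary("hellojni");
    }

    public static void main(String[] args) {
        System.out.println(add(2, 40));   // 42, computed in C
    }
}

/* Steps 2-3: the generated header declares the mangled signature,
   which you implement by hand in C:

   JNIEXPORT jint JNICALL Java_HelloJni_add(JNIEnv *env, jclass cls, jint a, jint b) {
       return a + b;
   }
*/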

Project Panama

Jextract - Java 17 (2021)
jextract - introduced in Project Panama - automates the
process of generating Java bindings from C/C++ header files.
It eliminates the need to manually write JNI code.
Changes to the native library are handled simply by re-running
jextract.
The Java code generated by jextract maps directly to the
native functions and data structures, allowing seamless
interaction between Java and native code.

Project Panama
1. Foreign Function & Memory API (FFM API)
2. Linker API
3. Foreign Data Layout API
4. Interconnect with Foreign Languages
5. Improved Performance for Native Calls
6. Vector API
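As a minimal sketch of the FFM API (final since Java 22), calling the C library's strlen with no JNI glue at all:

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class StrlenDemo {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // Find strlen in the default (libc) lookup and describe its C signature
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        try (Arena arena = Arena.ofConfined()) {
            // Copy a Java String into off-heap memory as a NUL-terminated C string
            MemorySegment cString = arena.allocateFrom("Hello, Panama!");
            long len = (long) strlen.invokeExact(cString);
            System.out.println(len);   // 14
        }
    }
}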

Model Inference run on GPU
[diagram: Query from API → App Logic → Inference on the GPU against the Trained Model File → Response to User with Predictions]

Nvidia Stock

Can it be efficiently run on a CPU, without frameworks?
[diagram: the same flow, with the inference step running on a CPU?]

Model Inference run on GPU
[diagram: the same flow, with the inference step running on the GPU]

Panama Vector API and Vector Databases
llama.cpp is a compact, dependency-free C/C++ inference implementation optimized for CPU inference.

It bypasses the traditional GPU-centric frameworks like PyTorch or TensorFlow.

Panama Vector API and Vector Databases

Model Inference
[diagram: input flows through Layer 1 → Layer 2 → Layer 3 to a prediction]

Can it be efficiently run on a CPU, without frameworks?
[diagram: the same flow, with the inference step running on a CPU?]

https://github.com/kherud/java-llama.cpp - Java Bindings

Llama3.java

Today, 13:50

JLama

Today, 15:00

Panama Vector API and Vector Databases

Project Valhalla & Panama Together: SIMD

Panama Vector API and Vector Databases
[diagram: SISD: one instruction processes Data1, Data2, Data3 one at a time, producing one result per step; SIMD: one instruction processes Data1, Data2, Data3 at once, producing Result 1, Result 2, Result 3 in a single step]

Autovectorization
[diagram: the JIT turns the scalar loop body arrayA[i] + arrayB[i] over arrayA[] and arrayB[] into SIMD lanes computing arrayA[0] + arrayB[0], arrayA[1] + arrayB[1], arrayA[2] + arrayB[2] at once into result[]]
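When the JIT cannot prove that autovectorization is safe, the Vector API (incubating in jdk.incubator.vector since Java 16) lets you ask for SIMD explicitly. A minimal sketch (run with --add-modules jdk.incubator.vector):

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class VectorAdd {
    // Widest SIMD shape the current CPU supports (e.g., 8 floats with AVX2)
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static void add(float[] a, float[] b, float[] result) {
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            va.add(vb).intoArray(result, i);   // one SIMD add per chunk
        }
        for (; i < a.length; i++) {            // scalar tail for the leftovers
            result[i] = a[i] + b[i];
        }
    }
}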

Panama Vector API and Vector Databases

Model Inference
[diagram: layers 1-3 → prediction (next token of text, classification of an image)]
Computational Complexity
Each layer requires performing a vast number of mathematical operations, such as matrix multiplications. The more layers a model has, the greater the number of such operations.
Model Size
Models with many layers typically have millions, or even billions, of parameters. Storing and processing such a large number of weights requires significant memory resources.

Special Hardware!

OVERLY SIMPLIFIED EXAMPLE ALERT!

Model Inference run on GPU
[diagram series, CPU side: instructions 1-10 queue up and execute a few at a time, in order; results appear one by one as earlier instructions complete]

Model Inference run on GPU
[diagram series, GPU side: kernels 1-10 are launched together and all run in parallel]

What is Kernel and NDGrid?
[diagram: a large multi-dimensional grid of identical kernel instances]
NDGrid (n-dimensional grid) refers to a multi-dimensional grid of threads used for parallel computation on a GPU. The GPU runs threads in grids with multiple dimensions, allowing kernel processing in different dimensions (e.g., 1D, 2D, 3D) simultaneously.

What is Kernel and NDGrid?
Kernel
A kernel is a piece of code that is executed by hundreds or thousands of threads simultaneously. Each thread runs the same code but operates on different data, enabling the processing of large amounts of data in a highly parallel manner.

C and JNI
The first steps in GPU programming in Java began with projects that exposed the CUDA and OpenCL APIs through JNI. Kernels were embedded as strings in the code, and programmers had to manually manage data movement and task execution.
Example: calculating the square of each element in an array.

Projects such as JCuda and JOCL allowed the use of native GPU libraries from within Java, but required the programmer to manage GPU tasks in a rather low-level manner.
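To show how low-level this was, a hedged JCuda sketch of the square-each-element example (assumes the kernel has already been compiled to square.ptx; error handling omitted):

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.*;
import static jcuda.driver.JCudaDriver.*;

public class JCudaSquare {
    public static void main(String[] args) {
        int n = 1024;
        float[] host = new float[n];
        for (int i = 0; i < n; i++) host[i] = i;

        // Manual context setup: init the driver, pick device 0, create a context
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load the pre-compiled kernel and look up its entry point
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "square.ptx");
        CUfunction square = new CUfunction();
        cuModuleGetFunction(square, module, "square");

        // Manual data movement: allocate device memory, copy the input over
        CUdeviceptr devData = new CUdeviceptr();
        cuMemAlloc(devData, (long) n * Sizeof.FLOAT);
        cuMemcpyHtoD(devData, Pointer.to(host), (long) n * Sizeof.FLOAT);

        // Launch n/256 blocks of 256 threads, passing (float* data, int n)
        Pointer kernelParams = Pointer.to(Pointer.to(devData), Pointer.to(new int[]{n}));
        cuLaunchKernel(square, n / 256, 1, 1, 256, 1, 1, 0, null, kernelParams, null);
        cuCtxSynchronize();

        // Copy the results back and release device memory
        cuMemcpyDtoH(Pointer.to(host), devData, (long) n * Sizeof.FLOAT);
        cuMemFree(devData);
    }
}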

Kernels expressed in Java
Aparapi and Rootbeer are projects
that allowed the expression of
kernels directly in Java code,
eliminating the need to manually
write CUDA or OpenCL code.

They didn't provide a direct API for defining an NDGrid in the way CUDA or OpenCL do.
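For contrast, a hedged Aparapi sketch of the same square-each-element idea, with the kernel written as plain Java (Aparapi translates the run() method's bytecode to OpenCL at runtime):

import com.aparapi.Kernel;
import com.aparapi.Range;

public class AparapiSquare {
    public static void main(String[] args) {
        final float[] data = new float[1024];
        for (int i = 0; i < data.length; i++) data[i] = i;

        // The kernel body is ordinary Java; Aparapi transpiles it to OpenCL
        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                int i = getGlobalId();        // this thread's index in the 1-D range
                data[i] = data[i] * data[i];
            }
        };
        kernel.execute(Range.create(data.length));   // one thread per element
        kernel.dispose();
    }
}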

Kernels expressed in Java
1. Lack of support for proper data types
2. Lack of support for diverse hardware
3. Inefficient transpiled code…
4. …or manual hacks

Project Sumatra
[diagram: a GPU full of kernels 1-10 waiting for parallel processing; how do we dispatch Java code to it?]

Project Sumatra
[diagram: Sumatra's answer was Stream.parallel(): parallel stream operations dispatched to the GPU as kernels]
Sumatra's approach to GPU acceleration has been largely superseded by GraalVM's capabilities, which provide broader optimization features (e.g., native image and polyglot execution) and better performance across various platforms.
Data transfer between the JVM and the GPU can reduce the performance benefits, especially for small tasks.

Next Level: TornadoVM

Next Level: TornadoVM
API: Tasks, Annotations, Task Graph
→ Graph Optimizer
→ Execution Engine (Tornado JIT Compiler, built on GraalVM)
→ Backends: OpenCL, CUDA, PTX
→ Hardware: CPU, GPU, FPGA

Next Level: TornadoVM
• Task Graphs - a direct abstraction for the NDGrid
• Many different backends (not only CUDA)
• Managed memory
• JIT support through GraalVM
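A hedged sketch of the task-graph API (names follow recent TornadoVM releases and may differ across versions; the newest releases also prefer Tornado's off-heap array types such as FloatArray over plain Java arrays):

import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;

public class TornadoAdd {
    // @Parallel marks the loop as data-parallel; Tornado JIT-compiles it for the device
    static void add(float[] a, float[] b, float[] c) {
        for (@Parallel int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = new float[1024], b = new float[1024], c = new float[1024];

        TaskGraph graph = new TaskGraph("s0")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, a, b)
                .task("t0", TornadoAdd::add, a, b, c)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, c);

        // Snapshot the graph and run it on the best available device
        ImmutableTaskGraph itg = graph.snapshot();
        TornadoExecutionPlan plan = new TornadoExecutionPlan(itg);
        plan.execute();
    }
}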

Future: HAT & Project Babylon

Heterogeneous Accelerator Toolkit (HAT)

Heterogeneous Accelerator Toolkit (HAT)
• NDRange API
• FFM data wrapping patterns
• Support for Code Reflection from Project Babylon
• Support for the FFM API
[diagram: Application → HAT (Code Reflection + Panama FFM API) → JDK → vendor native runtime → CPU / GPU]

Code Reflection
Code reflection is an extension of Java reflection that allows access to symbolic representations of Java code, such as method bodies and lambda expressions, at runtime. This makes it possible to programmatically manipulate Java code.
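A heavily hedged sketch of what this looks like in the experimental Babylon prototype; the @CodeReflection annotation and the code-model accessor are prototype APIs that have already changed between builds, and none of this compiles on a standard JDK:

import java.lang.reflect.Method;
import java.lang.runtime.CodeReflection;   // Babylon prototype package

public class CodeReflectionDemo {
    @CodeReflection   // asks javac to retain a symbolic code model for this method
    static double axpy(double a, double x, double y) {
        return a * x + y;
    }

    public static void main(String[] args) throws Exception {
        Method m = CodeReflectionDemo.class
                .getDeclaredMethod("axpy", double.class, double.class, double.class);
        // In the prototype the code model hangs off the Method object;
        // a toolkit like HAT can walk this model and emit OpenCL/CUDA from it
        var model = m.getCodeModel().orElseThrow();
        System.out.println(model.toText());
    }
}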

Code Models
[diagram: a spectrum from bytecode (type flattening) to the Abstract Syntax Tree (full syntactic details); Code Models sit in between, preserving information about types and structures]

Code Reflection
[diagram: the JDK passes Code Models, with their information about types and structures, to the vendor's native runtime, which generates GPU code]

Code Models
Babylon can use code reflection to dynamically generate GPU code (e.g., OpenCL, CUDA) by analyzing the code model and converting fragments of Java code into corresponding GPU instructions.
The Panama FFM API and off-heap MemorySegments eliminate the need for manual memory management between the JVM and the GPU, speeding up data exchange and increasing performance.
[diagram: Application → HAT (Code Reflection + Panama FFM API) → JDK with Babylon → vendor native runtime → CPU / GPU]

Heterogeneous Accelerator Toolkit (HAT) Update #JVMLS

Conclusions: What the future holds

Where are we?
[slide series: a status verdict shown for each of Panama & Valhalla SIMD, Valhalla Half-Float, GraalPy, JCuda, Sumatra, TornadoVM, and HAT & Babylon]

Thank you!
@ArturSkowronski