CUDA, designed as an extension to C++, preserves its familiar abstractions. However, unlike CPU programming --- where compilers and runtime systems abstract away most hardware concerns --- writing CUDA code requires developers to manually map computations onto the GPU’s parallel execution and memory hierarchy, while respecting the fundamental constraints of the hardware. In this talk, I’ll discuss where this model breaks down and why an alternative low-level language is needed for productive, compositional GPU programming.
Size: 2.5 MB
Language: en
Added: Oct 14, 2025
Slides: 37 pages
Slide Content
A ScyllaDB Community
GPUs, and How to Program Them
Manya Bansal
Ph.D. Student
GPUs, and how not to program them
The Era of Compute
[Chart: explosion in the amount of compute across GPU generations K20X, M40, V100, A100, H100]
Source: Dally, B. (2023, August 29). Hardware for Deep Learning. Hot Chips 35 Symposium.
GPU’s Hardware
CPU Execution Model
ADD R0, R0, #1
CMP R0, R1
BNE loop
SUB R2, R1, #2
MOV R7, #1
MOV R0, #0
SWI 0
Compiler
Operating System
Tower of Abstractions
__device__ void add_1_scalar(const float* a,
                             float* b, int N) {
    for (int i = 0; i < N; i++)
        b[i] = a[i] + 1;
}
__global__ void add_1_scalar_kernel(const float* d_a,
                                    float* d_b, int N) {
    // each thread handles a disjoint chunk of 4 elements
    int tid = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (tid < N) {
        int elements_to_process = min(4, N - tid);
        add_1_scalar(&d_a[tid], &d_b[tid], elements_to_process);
    }
}
__device__ void add_1_thread(const float* a,
                             float* b, int N) {
    if (threadIdx.x < N)
        b[threadIdx.x] = a[threadIdx.x] + 1;
}

__global__ void add_1_thread_kernel(const float* d_a,
                                    float* d_b, int total_N) {
    int block_offset = blockIdx.x * blockDim.x;
    int elements_in_block = min(blockDim.x, total_N - block_offset);
    if (block_offset < total_N) {
        add_1_thread(&d_a[block_offset], &d_b[block_offset],
                     elements_in_block);
    }
}
__device__ void add_1_shared(const float* a,
                             float* b, int N) {
    extern __shared__ float smem[];
    if (threadIdx.x < N)
        smem[(threadIdx.x + 1) % N] = a[(threadIdx.x + 1) % N];
    __syncthreads();
    if (threadIdx.x < N)
        b[threadIdx.x] = smem[threadIdx.x] + 1;
}

__global__ void add_1_shared_kernel(const float* d_a,
                                    float* d_b, int total_N) {
    int block_offset = blockIdx.x * blockDim.x;
    int elements_in_block = min(blockDim.x, total_N - block_offset);
    if (block_offset < total_N) {
        add_1_shared(&d_a[block_offset], &d_b[block_offset],
                     elements_in_block);
    }
}
“Basically”, C++
__device__ void add_1_scalar(const float* a, float* b, int N);
__device__ void add_1_thread(const float* a, float* b, int N);
__device__ void add_1_shared(const float* a, float* b, int N);
Lack of composition has fragmented the ecosystem
Hardware is missing!
Programming GPUs uses finite resources, like memory and compute.
Model the compute hierarchy explicitly
__device__ void add_1(block:1 const float *a,
                      block:1 float *b, int N)
{
    at(block) blk;
    invoke(blk)
    {
        if (__thread_id < N)
            b[__thread_id] = a[__thread_id] + 1;
    }
}
Memory can be lowered to access different parts of the compute hierarchy
__device__ void add_1_thread(const block:1 float *a,
                             block:1 float *b, int N)
{
    at(block:1) b;
    invoke(b)
    {
        at(thread:1) t;
        thread:1 float *bt <> b[__thread_id];
        invoke(t)
        {
            // bt[i] = b[__thread_id + i]
            bt[0] = ...;
        }
    }
}
Composition becomes a problem of deciding whether the required resources can be provided
__device__ void add_1_thread(
const block:1 float* a,
block:1 float* b, int N)
@ {block>=1, thread>=1, SMEM: 1024B};
A low-level, safe substrate makes it easier to build productive tools
Safe and compositional CUDA through scoped resources
[email protected]
https://manya-bansal.github.io/