GPUs, and How to Program Them by Manya Bansal

ScyllaDB Community · 37 slides · Oct 14, 2025

About This Presentation

CUDA, designed as an extension to C++, preserves its familiar abstractions. However, unlike CPU programming, where compilers and runtime systems abstract away most hardware concerns, writing CUDA code requires developers to manually map computations onto the GPU's parallel execution and memory hierarchy.


Slide Content

A ScyllaDB Community
GPUs, and How to Program Them
Manya Bansal
Ph.D. Student

A ScyllaDB Community
GPUs, and How Not to Program Them (the word "not" inserted by hand into the title)
Manya Bansal
Ph.D. Student

The Era of Compute

The Era of GPUs: an explosion in the amount of compute.
[Chart: GPU compute across generations K20X, M40, V100, A100, H100.]
Source: Dally, B. (2023, August 29). Hardware for Deep Learning. Hot Chips 35 Symposium.

GPU's Hardware

CPU Execution Model
ADD R0, R0, #1
CMP R0, R1
BNE loop
SUB R2, R1, #2
MOV R7, #1
MOV R0, #0
SWI 0

[Diagram: the compiler and operating system form a tower of abstractions between this code and the CPU hardware.]

GPU Execution Model

[Diagram, built up over several slides: the GPU consists of many Streaming Multiprocessors (SMs), and each SM runs multiple warps of threads.]

GPU Execution Model

[Diagram: inside an SM: instruction cache, warp schedulers, register file, load/store (LD/ST) units, FP32 cores, and Tensor Cores.]

Deep Memory Hierarchy

[Diagram, built up over several slides: each thread has private registers; the threads in a block share the per-SM shared memory; all SMs reach global memory (HBM) through a memory controller and a shared L2 cache.]
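The CUDA examples in the next section make this hierarchy concrete. As a preview, here is a minimal sketch (not from the slides) of where each level shows up in code: kernel pointer arguments reference global memory, __shared__ arrays live in per-block shared memory on the SM, and ordinary local variables typically sit in registers.

// Illustrative only: where each memory level appears in CUDA code.
// Assumes a launch with blockDim.x == 256.
__global__ void memory_levels_demo(const float* in, float* out, int n) {
  __shared__ float tile[256];            // per-block shared memory, resident on the SM
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    tile[threadIdx.x] = in[i];           // global memory (HBM/L2) -> shared memory
  __syncthreads();                       // every thread in the block reaches this barrier
  if (i < n) {
    float x = tile[threadIdx.x] + 1.0f;  // shared memory -> register
    out[i] = x;                          // register -> global memory
  }
}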

CUDA: How we program GPUs today

“Basically”, C++

A Simple Program

[Diagram: add 1 to every element of array a, writing the result into array b.]

__device__ void add_1_scalar(const float* a,
                             float* b, int N) {
  for (int i = 0; i < N; i++)
    b[i] = a[i] + 1;
}

__global__ void add_1_scalar_kernel(const float* d_a,
                                    float* d_b, int N) {
  // Each thread handles up to 4 consecutive elements.
  int tid = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
  if (tid < N) {
    int elements_to_process = min(4, N - tid);
    add_1_scalar(&d_a[tid], &d_b[tid], elements_to_process);
  }
}
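The host-side launch is not shown on the slides; a minimal sketch, assuming each thread covers 4 consecutive elements and an arbitrary block size of 256 (the allocation and copy calls are standard CUDA):

// Minimal host-side sketch (not from the slides): launch add_1_scalar_kernel
// over N elements, with each thread covering up to 4 consecutive elements.
int N = 1 << 20;
size_t bytes = N * sizeof(float);
float *d_a, *d_b;
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
// ... fill d_a with cudaMemcpy from host data ...

int threads = 256;                          // arbitrary block size
int elems_per_thread = 4;
int elems_per_block = threads * elems_per_thread;
int blocks = (N + elems_per_block - 1) / elems_per_block;
add_1_scalar_kernel<<<blocks, threads>>>(d_a, d_b, N);
cudaDeviceSynchronize();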



__device__ void add_1_thread(const float* a,
                             float* b, int N) {
  if (threadIdx.x < N)
    b[threadIdx.x] = a[threadIdx.x] + 1;
}

__global__ void add_1_thread_kernel(const float* d_a,
                                    float* d_b, int total_N) {
  int block_offset = blockIdx.x * blockDim.x;
  int elements_in_block = min((int)blockDim.x, total_N - block_offset);
  if (block_offset < total_N) {
    add_1_thread(&d_a[block_offset], &d_b[block_offset],
                 elements_in_block);
  }
}


__device__ void add_1_shared(const float* a,
                             float* b, int N) {
  extern __shared__ float smem[];
  if (threadIdx.x < N)
    smem[(threadIdx.x + 1) % N] = a[(threadIdx.x + 1) % N];
  __syncthreads();
  if (threadIdx.x < N)
    b[threadIdx.x] = smem[threadIdx.x] + 1;
}

__global__ void add_1_shared_kernel(const float* d_a,
                                    float* d_b, int total_N) {
  int block_offset = blockIdx.x * blockDim.x;
  int elements_in_block = min((int)blockDim.x, total_N - block_offset);
  if (block_offset < total_N) {
    add_1_shared(&d_a[block_offset], &d_b[block_offset],
                 elements_in_block);
  }
}
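Because add_1_shared declares extern __shared__, the launch needs the dynamic shared-memory size as the third launch parameter. A minimal sketch (not from the slides), assuming a block size of 256 and one float of shared memory per thread:

// Minimal sketch: the third <<<...>>> argument sizes the extern __shared__
// buffer used by add_1_shared.
int threads = 256;                              // arbitrary block size
int blocks = (total_N + threads - 1) / threads;
size_t smem_bytes = threads * sizeof(float);    // one float per thread in the block
add_1_shared_kernel<<<blocks, threads, smem_bytes>>>(d_a, d_b, total_N);
cudaDeviceSynchronize();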


“Basically”, C++
__device__ void add_1_scalar(const float* a, float* b, int N);
__device__ void add_1_thread(const float* a, float* b, int N);
__device__ void add_1_shared(const float* a, float* b, int N);

Lack of composition has fragmented the ecosystem

Hardware is missing!

Programming GPUs means managing finite resources, like memory and compute.

Model the compute hierarchy explicitly

__device__ void add_1(block:1 const float* a,
                      block:1 float* b, int N) {
  at(block) blk;
  invoke(blk) {
    if (__threadId < N)
      b[__threadId] = a[__threadId] + 1;
  }
}

Structured decomposition of parallel work

__device__ void add_1_thread(block:1 const float* a,
                             block:1 float* b, int N) {
  at(block:2) b2;
  invoke(b2) {
    at(block:1) b1;
    invoke(b1) {
      if (__threadId < N)
        b[__threadId] = a[__threadId] + 1;
    }
  }
}

Shared memory is also special syntax

at(block:1) b;
invoke(b) {
  SHARED float shared_array[4][8];
}

Names and invocation lead to a natural pattern for concurrent and parallel execution

at(thread:2) t1;
at(thread:2) t2;

invoke(t1) { ... }
invoke(t2) { ... }
invoke(t1) { ... }

Pointers are also tied to compute resources

__device__ void add_1_thread(const block:1 float* a,
                             block:1 float* b, int N) {
  at(block:1) blk;
  invoke(blk) {
    at(thread:1) t;
    invoke(t) {
      // Error: b is tied to block-level resources, so it cannot be
      // used inside a thread-level invoke
      b[0] = 0;
    }
  }
}

Memory can be lowered to access different parts of the compute hierarchy

__device__ void add_1_thread(const block:1 float* a,
                             block:1 float* b, int N) {
  at(block:1) blk;
  invoke(blk) {
    at(thread:1) t;
    thread:1 float* bt <> b[__thread_id];
    invoke(t) {
      // bt[i] = b[__thread_id + i]
      bt[0] = ...;
    }
  }
}
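For comparison, the closest analogue in today's CUDA is plain pointer arithmetic, with nothing tying the derived pointer to the thread that owns it. A minimal sketch (illustrative; the function name is made up for this example):

// Plain CUDA analogue (illustrative): "lowering" is just pointer arithmetic,
// and nothing stops another thread from using bt.
__device__ void add_1_lowered(float* b, int N) {
  float* bt = &b[threadIdx.x];   // per-thread view into the block's data
  if (threadIdx.x < N)
    bt[0] = bt[0] + 1;           // bt[i] aliases b[threadIdx.x + i]
}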

Composition becomes a problem of deciding whether the required resources can be provided

__device__ void add_1_thread(const block:1 float* a,
                             block:1 float* b, int N)
  @ {block >= 1, thread >= 1, SMEM: 1024B};
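In CUDA today, the corresponding check is done by hand at runtime, for example by comparing a kernel's shared-memory requirement against the device limit. A minimal sketch using the standard cudaGetDeviceProperties query (the 1024-byte figure echoes the annotation above):

// Hedged sketch: manually checking whether a kernel's shared-memory need
// fits within the device's per-block limit before launching.
size_t smem_needed = 1024;                  // the 1024B from the annotation above
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);          // query device 0
if (smem_needed <= prop.sharedMemPerBlock) {
  // the resource requirement can be satisfied; safe to launch
}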

A low-level, safe substrate makes it easier to build productive tools

Safe and compositional CUDA through scoped resources
[email protected] · https://manya-bansal.github.io/