GPUs, and How to Program Them by Manya Bansal

ScyllaDB Community · 37 slides · Oct 14, 2025

About This Presentation

CUDA, designed as an extension to C++, preserves its familiar abstractions. However, unlike CPU programming, where compilers and runtime systems abstract away most hardware concerns, writing CUDA code requires developers to manually map computations onto the GPU's parallel execution and memory hierarchy.


Slide Content

A ScyllaDB Community
GPUs, and How to Program Them
Manya Bansal
Ph.D. Student

A ScyllaDB Community
GPUs, and How Not to Program Them (the word "not" inserted by hand into the title)
Manya Bansal
Ph.D. Student

The Era of Compute

The Era of GPUs: an explosion in the amount of compute.
[Chart: GPU compute across generations K20X, M40, V100, A100, H100.]
Source: Dally, B. (2023, August 29). Hardware for Deep Learning. Hot Chips 35 Symposium.

GPU's Hardware

CPU Execution Model
ADD R0, R0, #1
CMP R0, R1
BNE loop
SUB R2, R1, #2
MOV R7, #1
MOV R0, #0
SWI 0

[Diagram: the compiler and operating system form a tower of abstractions between this code and the CPU hardware.]

GPU Execution Model

[Diagram, built up over several slides: the GPU consists of many Streaming Multiprocessors (SMs), and each SM runs multiple warps of threads.]

GPU Execution Model

[Diagram: inside an SM: instruction cache, warp schedulers, register file, load/store (LD/ST) units, FP32 cores, and Tensor Cores.]

Deep Memory Hierarchy

[Diagram, built up over several slides: each thread has private registers; the threads in a block share the per-SM shared memory; all SMs reach global memory (HBM) through a memory controller and a shared L2 cache.]
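The CUDA examples in the next section make this hierarchy concrete. As a preview, here is a minimal sketch (not from the slides) of where each level shows up in code: kernel pointer arguments reference global memory, __shared__ arrays live in per-block shared memory on the SM, and ordinary local variables typically sit in registers.

// Illustrative only: where each memory level appears in CUDA code.
// Assumes a launch with blockDim.x == 256.
__global__ void memory_levels_demo(const float* in, float* out, int n) {
  __shared__ float tile[256];            // per-block shared memory, resident on the SM
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    tile[threadIdx.x] = in[i];           // global memory (HBM/L2) -> shared memory
  __syncthreads();                       // every thread in the block reaches this barrier
  if (i < n) {
    float x = tile[threadIdx.x] + 1.0f;  // shared memory -> register
    out[i] = x;                          // register -> global memory
  }
}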

CUDA: How we program GPUs today

“Basically”, C++

A Simple Program

[Diagram: add 1 to every element of array a, writing the result into array b.]

__device__ void add_1_scalar(const float* a,
                             float* b, int N) {
  for (int i = 0; i < N; i++)
    b[i] = a[i] + 1;
}

__global__ void add_1_scalar_kernel(const float* d_a,
                                    float* d_b, int N) {
  // Each thread handles up to 4 consecutive elements.
  int tid = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
  if (tid < N) {
    int elements_to_process = min(4, N - tid);
    add_1_scalar(&d_a[tid], &d_b[tid], elements_to_process);
  }
}
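The host-side launch is not shown on the slides; a minimal sketch, assuming each thread covers 4 consecutive elements and an arbitrary block size of 256 (the allocation and copy calls are standard CUDA):

// Minimal host-side sketch (not from the slides): launch add_1_scalar_kernel
// over N elements, with each thread covering up to 4 consecutive elements.
int N = 1 << 20;
size_t bytes = N * sizeof(float);
float *d_a, *d_b;
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
// ... fill d_a with cudaMemcpy from host data ...

int threads = 256;                          // arbitrary block size
int elems_per_thread = 4;
int elems_per_block = threads * elems_per_thread;
int blocks = (N + elems_per_block - 1) / elems_per_block;
add_1_scalar_kernel<<<blocks, threads>>>(d_a, d_b, N);
cudaDeviceSynchronize();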



__device__ void add_1_thread(const float* a,
                             float* b, int N) {
  if (threadIdx.x < N)
    b[threadIdx.x] = a[threadIdx.x] + 1;
}

__global__ void add_1_thread_kernel(const float* d_a,
                                    float* d_b, int total_N) {
  int block_offset = blockIdx.x * blockDim.x;
  int elements_in_block = min((int)blockDim.x, total_N - block_offset);
  if (block_offset < total_N) {
    add_1_thread(&d_a[block_offset], &d_b[block_offset],
                 elements_in_block);
  }
}


__device__ void add_1_shared(const float* a,
                             float* b, int N) {
  extern __shared__ float smem[];
  if (threadIdx.x < N)
    smem[(threadIdx.x + 1) % N] = a[(threadIdx.x + 1) % N];
  __syncthreads();
  if (threadIdx.x < N)
    b[threadIdx.x] = smem[threadIdx.x] + 1;
}

__global__ void add_1_shared_kernel(const float* d_a,
                                    float* d_b, int total_N) {
  int block_offset = blockIdx.x * blockDim.x;
  int elements_in_block = min((int)blockDim.x, total_N - block_offset);
  if (block_offset < total_N) {
    add_1_shared(&d_a[block_offset], &d_b[block_offset],
                 elements_in_block);
  }
}
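Because add_1_shared declares extern __shared__, the launch needs the dynamic shared-memory size as the third launch parameter. A minimal sketch (not from the slides), assuming a block size of 256 and one float of shared memory per thread:

// Minimal sketch: the third <<<...>>> argument sizes the extern __shared__
// buffer used by add_1_shared.
int threads = 256;                              // arbitrary block size
int blocks = (total_N + threads - 1) / threads;
size_t smem_bytes = threads * sizeof(float);    // one float per thread in the block
add_1_shared_kernel<<<blocks, threads, smem_bytes>>>(d_a, d_b, total_N);
cudaDeviceSynchronize();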


“Basically”, C++
__device__ void add_1_scalar(const float* a, float* b, int N);
__device__ void add_1_thread(const float* a, float* b, int N);
__device__ void add_1_shared(const float* a, float* b, int N);

Lack of composition has fragmented the ecosystem

Hardware is missing!

Programming GPUs means managing finite resources, like memory and compute.

Model the compute hierarchy explicitly

__device__ void add_1(block:1 const float* a,
                      block:1 float* b, int N) {
  at(block) blk;
  invoke(blk) {
    if (__threadId < N)
      b[__threadId] = a[__threadId] + 1;
  }
}

Structured decomposition of parallel work

__device__ void add_1_thread(block:1 const float* a,
                             block:1 float* b, int N) {
  at(block:2) b2;
  invoke(b2) {
    at(block:1) b1;
    invoke(b1) {
      if (__threadId < N)
        b[__threadId] = a[__threadId] + 1;
    }
  }
}

Shared memory is also special syntax

at(block:1) b;
invoke(b) {
  SHARED float shared_array[4][8];
}

Names and invocation lead to a natural pattern for concurrent and parallel execution

at(thread:2) t1;
at(thread:2) t2;

invoke(t1) { ... }
invoke(t2) { ... }
invoke(t1) { ... }

Pointers are also tied to compute resources

__device__ void add_1_thread(const block:1 float* a,
                             block:1 float* b, int N) {
  at(block:1) blk;
  invoke(blk) {
    at(thread:1) t;
    invoke(t) {
      // Error: b is tied to block-level resources, so it cannot be
      // used inside a thread-level invoke
      b[0] = 0;
    }
  }
}

Memory can be lowered to access different parts of the compute hierarchy

__device__ void add_1_thread(const block:1 float* a,
                             block:1 float* b, int N) {
  at(block:1) blk;
  invoke(blk) {
    at(thread:1) t;
    thread:1 float* bt <> b[__thread_id];
    invoke(t) {
      // bt[i] = b[__thread_id + i]
      bt[0] = ...;
    }
  }
}
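For comparison, the closest analogue in today's CUDA is plain pointer arithmetic, with nothing tying the derived pointer to the thread that owns it. A minimal sketch (illustrative; the function name is made up for this example):

// Plain CUDA analogue (illustrative): "lowering" is just pointer arithmetic,
// and nothing stops another thread from using bt.
__device__ void add_1_lowered(float* b, int N) {
  float* bt = &b[threadIdx.x];   // per-thread view into the block's data
  if (threadIdx.x < N)
    bt[0] = bt[0] + 1;           // bt[i] aliases b[threadIdx.x + i]
}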

Composition becomes a problem of deciding whether the required resources can be provided

__device__ void add_1_thread(const block:1 float* a,
                             block:1 float* b, int N)
  @ {block >= 1, thread >= 1, SMEM: 1024B};
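In CUDA today, the corresponding check is done by hand at runtime, for example by comparing a kernel's shared-memory requirement against the device limit. A minimal sketch using the standard cudaGetDeviceProperties query (the 1024-byte figure echoes the annotation above):

// Hedged sketch: manually checking whether a kernel's shared-memory need
// fits within the device's per-block limit before launching.
size_t smem_needed = 1024;                  // the 1024B from the annotation above
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);          // query device 0
if (smem_needed <= prop.sharedMemPerBlock) {
  // the resource requirement can be satisfied; safe to launch
}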

A low-level, safe substrate makes it easier to build productive tools

Safe and compositional CUDA through scoped resources
[email protected] · https://manya-bansal.github.io/