Data-Level Parallelism in Microprocessors


About This Presentation

Data-level parallelism in microprocessors: vector architectures, SIMD instruction-set extensions for multimedia, and GPUs.


Slide Content

Data-Level Parallelism
CS4342 Advanced Computer Architecture
Dilum Bandara, [email protected]
Slides adapted from "Computer Architecture: A Quantitative Approach" by John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers

Outline
- Vector architectures
- SIMD instruction sets for multimedia
- GPUs

SIMD Architectures
- Exploit significant data-level parallelism in matrix-oriented scientific computing and media-oriented image & sound processing
- SIMD is more energy efficient than MIMD: only 1 instruction needs to be fetched per data operation, which makes SIMD attractive for personal mobile devices
- SIMD allows the programmer to continue to think sequentially

SIMD Parallelism
- Three variations: vector architectures, SIMD extensions, & Graphics Processing Units (GPUs)
- For x86 processors: expect 2 additional cores per chip per year & SIMD width to double every 4 years, so the potential speedup from SIMD is twice that from MIMD

Potential Speedup (figure)

1. Vector Architectures
Basic idea:
- Read sets of data elements into "vector registers"
- Operate on those registers
- Disperse the results back into memory
The registers are controlled by the compiler and are used to hide memory latency & leverage memory bandwidth.

Vector Architecture – VMIPS (Vector MIPS) (figure)

VMIPS
- Loosely based on the Cray-1
- Vector registers: each register holds a 64-element vector of 64 bits/element; the register file has 16 read ports & 8 write ports
- Vector functional units: fully pipelined; data & control hazards are detected
- Vector load-store unit: fully pipelined; 1 word per clock cycle after the initial latency
- Scalar registers: 32 general-purpose registers & 32 floating-point registers

VMIPS Instructions
- ADDVV.D – add 2 vectors
- ADDVS.D – add a vector to a scalar
- LV/SV – vector load & vector store from an address

Example – DAXPY (Double precision a×X plus Y): Y = aX + Y

L.D     F0,a        ; load scalar a
LV      V1,Rx       ; load vector X
MULVS.D V2,V1,F0    ; vector-scalar multiply
LV      V3,Ry       ; load vector Y
ADDVV.D V4,V2,V3    ; add the two vectors
SV      Ry,V4       ; store the result

Requires 6 instructions vs. almost 600 for MIPS, assuming 64 elements per vector.
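For reference, this is the plain C loop that the VMIPS sequence above implements; a minimal sketch assuming 64-element arrays (matching the vector register length) with illustrative names.

/* DAXPY: Y = a*X + Y over 64 double-precision elements,
   i.e., the scalar loop that the VMIPS code above vectorizes. */
void daxpy(double a, const double *X, double *Y) {
    for (int i = 0; i < 64; i++)
        Y[i] = a * X[i] + Y[i];
}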

Vector Execution Time
- Execution time depends on 3 factors: length of the operand vectors, structural hazards, & data dependencies
- VMIPS functional units consume 1 element per clock cycle, so execution time ≈ vector length

Definitions
- Convoy: a set of vector instructions that could potentially execute together; they must be free of structural hazards, but sequences with read-after-write hazards can be in the same convoy via chaining
- Chaining: allows a vector operation to start as soon as the individual elements of its vector source operand become available

Convoy Example

LV      V1,Rx       ; load vector X
MULVS.D V2,V1,F0    ; vector-scalar multiply
LV      V3,Ry       ; load vector Y
ADDVV.D V4,V2,V3    ; add two vectors
SV      Ry,V4       ; store the sum

Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV

Definitions – Chime
- Chime: the unit of time to execute 1 convoy
- m convoys execute in m chimes; for a vector length of n, this requires m × n clock cycles

Convoy Example (Cont.)
For the DAXPY convoys above (1: LV, MULVS.D; 2: LV, ADDVV.D; 3: SV):
- 3 chimes, 2 FP operations per result, so cycles per FLOP = 1.5
- For 64-element vectors, this requires 64 × 3 = 192 clock cycles

Challenges
- Start-up time: the latency of the vector functional unit; assume the same as the Cray-1:
  - Floating-point add: 6 clock cycles
  - Floating-point multiply: 7 clock cycles
  - Floating-point divide: 20 clock cycles
  - Vector load: 12 clock cycles

Improvements
- > 1 element per clock cycle: multiple lanes
- Non-64-wide vectors: vector length register
- IF statements in vector code: vector mask registers
- Memory system optimizations to support vector processors: memory banks
- Multi-dimensional matrices: stride
- Handling sparse matrices
- Programming a vector computer

1. Multiple Lanes
- Multiple hardware lanes allow processing > 1 element per clock cycle
- Element n of vector register A is "hardwired" to element n of vector register B

2. Vector Length Register
- What if the vector length is not known at compile time? Use the Vector Length Register (VLR)
- Use strip mining for vectors longer than the maximum vector length (see the sketch below)
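A minimal C sketch of strip mining, assuming a maximum vector length (MVL) of 64; the inner loop stands in for one vector operation whose length would be placed in the VLR. Names are illustrative.

/* Strip mining: process an arbitrary-length DAXPY in chunks of at most
   MVL elements; each chunk's length would be placed in the VLR. */
#define MVL 64

void daxpy_strip_mined(int n, double a, const double *X, double *Y) {
    int low = 0;
    int vl = n % MVL;                        /* odd-sized first strip */
    while (low < n) {
        if (vl == 0) vl = MVL;               /* n divisible by MVL: use a full strip */
        for (int i = low; i < low + vl; i++) /* one vector operation of length vl */
            Y[i] = a * X[i] + Y[i];
        low += vl;
        vl = MVL;                            /* remaining strips use the full MVL */
    }
}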

3. Vector Mask Registers

for (i = 0; i < 64; i = i + 1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];

Use the vector mask register to "disable" elements:

LV      V1,Rx    ; load vector X into V1
LV      V2,Ry    ; load vector Y
L.D     F0,#0    ; load FP zero into F0
SNEVS.D V1,F0    ; sets VM(i) to 1 if V1(i) != F0
SUBVV.D V1,V1,V2 ; subtract under vector mask
SV      Rx,V1    ; store the result in X

4. Memory Banks
- Designed to support high bandwidth for vector loads & stores
- Spread accesses across multiple banks: control bank addresses independently, load or store non-sequential words, & support multiple vector processors sharing the same memory
(Source: www.hardwareanalysis.com/content/image/10220/)

5. Stride

for (i = 0; i < 100; i = i + 1)
    for (j = 0; j < 100; j = j + 1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k = k + 1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }

- Row-major vs. column-major layout: vectorize the multiplication of rows of B with columns of D using a non-unit stride
- A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time
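A small C sketch of why the inner product above needs a non-unit stride: with the row-major layout C uses, a row of B is contiguous but a column of D is not. The function name is illustrative.

/* In row-major layout, a row of B is contiguous (unit stride), but a
   column of D is not: D[k][j] and D[k+1][j] are 100 doubles (800 bytes)
   apart, so a vector load of the column would use a stride of 100 elements. */
double dot_row_col(double B[100][100], double D[100][100], int i, int j) {
    double sum = 0.0;
    for (int k = 0; k < 100; k++)
        sum += B[i][k] * D[k][j];   /* B: unit stride, D: stride 100 */
    return sum;
}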

2. SIMD Extensions
- Media applications operate on data types narrower than the native word size
- E.g., four 8-bit operations on a 32-bit system

Why SIMD?
- Needs only minor changes to hardware
- Requires little extra state compared to vector architectures, e.g., during context switching
- No need for high memory bandwidth, unlike vector processors
- No need to deal with page faults when used with virtual memory

SIMD Extensions – Limitations
Compared to vector instructions:
- The number of data operands is encoded into the opcode, leading to many instructions
- No sophisticated addressing modes such as strided or scatter-gather accesses
- No mask registers for conditional execution

SIMD Implementations
- Intel MMX (1996): eight 8-bit integer ops or four 16-bit integer ops
- Streaming SIMD Extensions (SSE) (1999): eight 16-bit integer ops, four 32-bit integer/FP ops, or two 64-bit integer/FP ops
- Advanced Vector Extensions (AVX) (2010): four 64-bit integer/FP ops
- Rely on the programmer & libraries rather than the compiler
- Operands must be in consecutive & aligned memory locations
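As a hedged illustration of such extensions, here is a DAXPY sketch using Intel AVX intrinsics (declared in immintrin.h). It assumes n is a multiple of 4 and uses unaligned loads/stores, so the arrays need not be 32-byte aligned.

#include <immintrin.h>

/* DAXPY with AVX: four 64-bit FP operations per instruction.
   A minimal sketch assuming n is a multiple of 4. */
void daxpy_avx(int n, double a, const double *X, double *Y) {
    __m256d va = _mm256_set1_pd(a);             /* broadcast scalar a to 4 lanes */
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&X[i]);    /* load 4 doubles of X */
        __m256d vy = _mm256_loadu_pd(&Y[i]);    /* load 4 doubles of Y */
        vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
        _mm256_storeu_pd(&Y[i], vy);            /* store 4 results back to Y */
    }
}

With GCC or Clang this would be compiled with -mavx.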

3. Graphics Processing Units (GPUs)
- Originally designed to accelerate the large number of computations performed in graphics rendering, offloading numerically intensive computation from the CPU
- GPUs grew with the demand for high-performance graphics
- Eventually GPUs became much more powerful, even more than CPUs for many computations
- Cost-power-performance advantage

GPUs
- A GPU is typically a separate card installed into a PCI Express slot
- Market leaders: NVIDIA, Intel, AMD (ATI)
- Example NVIDIA GPUs at UoM: GeForce GTX 480, Tesla 2070

Example Specifications

                                       GTX 480         Tesla 2070      Tesla K80
Peak double-precision FP performance   650 Gigaflops   515 Gigaflops   2.91 Teraflops
Peak single-precision FP performance   1.3 Teraflops   1.03 Teraflops  8.74 Teraflops
CUDA cores                             480             448             4992
Frequency of CUDA cores                1.40 GHz        1.15 GHz        560/875 MHz
Memory size (GDDR5)                    1536 MB         6 GB            24 GB
Memory bandwidth                       177.4 GB/sec    150 GB/sec      480 GB/sec
ECC memory                             No              Yes             Yes

CPU vs. GPU Architecture (figure): the GPU devotes more transistors to computation

Basic Idea
- Heterogeneous execution model: the CPU is the host, the GPU is the device
- Develop a C-like programming language for the GPU
- Unify all forms of GPU parallelism as threads
- Handle each data element in a separate thread, so there is no need for synchronization
- The programming model is "Single Instruction, Multiple Thread" (SIMT)

Programming GPUs
- CUDA (Compute Unified Device Architecture): language for NVIDIA GPU products; based on C; nvcc compiler; lots of tools for analysis, debugging, profiling, …
- OpenCL (Open Computing Language): based on C; supports both GPU & CPU programming
- Lots of active research, e.g., automatic code generation for GPUs

Threads & Blocks
- A thread is associated with each data element
- Threads are organized into blocks
- Blocks are organized into a grid
- GPU hardware handles thread management, not applications or the OS

Grids, Blocks, & Threads (figure): a grid of size 6 (3×2 blocks), where each block has 12 threads (4×3)

NVIDIA GPU Architecture
- Similarities to vector machines: works well on data-level parallel problems, scatter-gather transfers, mask registers, & large register files
- Differences: no scalar processor, uses multithreading to hide memory latency, & has many functional units as opposed to a few deeply pipelined units like a vector processor

Example – Multiply 2 Vectors of Length 8192
- The code that works over all elements is the grid
- Thread blocks break this down into manageable sizes: 512 or 1024 threads per block
- A SIMD instruction executes 32 elements at a time; thus, the grid size is 16 or 8 blocks
- A block is analogous to a strip-mined vector loop with a vector length of 32
- A block is assigned to a multithreaded SIMD processor by the thread block scheduler
- Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors (a.k.a. streaming multiprocessors, SMX)
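A hedged CUDA sketch of this example: one thread per element, so with 512 threads per block the grid has 16 blocks. Kernel and variable names are illustrative.

// CUDA sketch of the 8192-element vector multiply: one thread per element,
// 512 threads per block -> a grid of 16 thread blocks.
__global__ void vecMul(const double *A, const double *B, double *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                                      // guard for partial blocks
        C[i] = A[i] * B[i];
}

// Host-side launch (device allocation & copies omitted for brevity):
//   int n = 8192, threads = 512, blocks = (n + threads - 1) / threads;  // 16
//   vecMul<<<blocks, threads>>>(d_A, d_B, d_C, n);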

Multithreaded SIMD Processor (figure)

Threads of SIMD Instructions
- Each has its own PC
- The SIMD thread scheduler uses a scoreboard to dispatch: it keeps track of up to 48 threads of SIMD instructions to hide memory latency; there are no data dependencies between threads!
- The thread block scheduler schedules blocks to SIMD processors
- Within each SIMD processor: 32 SIMD lanes; wide & shallow compared to vector processors

Registers
- The NVIDIA GPU has 32,768 registers, divided among the lanes
- Each SIMD thread is limited to 64 registers
- A SIMD thread has up to 64 vector registers of 32 32-bit elements, or 32 vector registers of 32 64-bit elements
- Fermi has 16 physical SIMD lanes, each containing 2,048 registers

Conditional Branching
- Like vector architectures, GPU branch hardware uses internal masks
- It also uses a branch synchronization stack: entries consist of masks for each SIMD lane, i.e., which threads commit their results (all threads execute)
- Instruction markers manage when a branch diverges into multiple execution paths (push on a divergent branch) & when the paths converge (act as barriers & pop the stack)
- Per-thread-lane 1-bit predicate register, specified by the programmer
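The GPU analogue of the earlier vector-mask example, as a hedged CUDA sketch: lanes for which the condition is false are masked off rather than taking a separate jump, so all 32 threads of a warp move through the branch together. Names are illustrative.

// GPU version of the masked update X[i] = X[i] - Y[i] when X[i] != 0:
// the hardware predicates off lanes where the condition is false, handling
// divergence with the mask/branch-synchronization stack described above.
__global__ void maskedSub(double *X, const double *Y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && X[i] != 0.0)   // failing lanes are masked, not skipped by a jump
        X[i] = X[i] - Y[i];
}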

NVIDIA GPU Memory Structure – Grid Architecture
- Grid: a group of threads all running the same kernel; multiple grids can run at once
- Block: grids are composed of blocks; each block is a logical unit containing a number of coordinating threads & some amount of shared memory

NVIDIA GPU Memory Structures (Cont.) (figure)

NVIDIA GPU Memory Structures (Cont.)
- Each SIMD lane has a private section of off-chip DRAM, called private memory; it contains the stack frame, spilled registers, & private variables
- Each multithreaded SIMD processor also has local memory, shared by the SIMD lanes / threads within a block
- Memory shared by all the SIMD processors is GPU memory; the host can read & write GPU memory
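A minimal CUDA sketch (illustrative names, error checks omitted) of the three levels described above: per-thread private variables, per-block shared memory declared with __shared__, and global GPU memory written by the host; it assumes blocks of at most 256 threads.

// Per-block sum: 'x' is private to each thread, 'tile' is shared by the
// threads of one block, and 'in'/'out' live in global GPU memory that the
// host fills and reads back with cudaMemcpy.
__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[256];                           // per-block shared memory
    float x = in[blockIdx.x * blockDim.x + threadIdx.x];  // private to this thread
    tile[threadIdx.x] = x;
    __syncthreads();                                      // wait for the whole block

    if (threadIdx.x == 0) {                               // thread 0 reduces the tile
        float sum = 0.0f;
        for (int t = 0; t < blockDim.x; t++)
            sum += tile[t];
        out[blockIdx.x] = sum;                            // one result per block
    }
}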

Fermi Architecture Innovations
- Each SIMD processor has 2 SIMD thread schedulers & 2 instruction dispatch units; 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, & 4 special function units
- Thus, 2 threads of SIMD instructions are scheduled every 2 clock cycles
- Fast double precision, caches for GPU memory, 64-bit addressing & a unified address space, error-correcting codes, faster context switching, & faster atomic instructions

Multithreaded Dual SIMD Processor of a Fermi GPU (figure)

Summary
- Vector architectures: high performance, but high cost
- SIMD instruction sets for multimedia: simple extensions & low cost, but low performance
- GPUs: high performance & low cost, but SIMD only