List of Figures
(HBM2) allow significantly faster memory bandwidths as compared to previous generations. 270
9.8 GP100 Pascal SM structure consists of two identical sub-structures that contain 32 cores, 16 DPUs, 8 LD/ST units, 8 SFUs, and 32 K registers. They share an instruction cache; however, they have their own instruction buffer. 271
9.9 IEEE 754-2008 floating point standard and the floating point data types supported by CUDA. The half data type is supported in Compute Capability 5.3 and above, while float has been supported since the first day of the introduction of CUDA. Support for double types started in Compute Capability 1.3. 284
10.1 CUDA Occupancy Calculator: Choosing the Compute Capability, max. shared memory size, registers/kernel, and kernel shared memory usage. In this specific case, the occupancy is 24 warps per SM (out of a total of 64), translating to an occupancy of 24/64 = 38%. 337
10.2 Analyzing the occupancy of a case with (1) registers/thread=16, (2) shared memory/kernel=8192 (8 KB), and (3) threads/block=128 (4 warps). CUDA Occupancy Calculator plots the occupancy when each kernel contains more registers (top) and as we launch more blocks (bottom), each requiring an additional 8 KB. With 8 KB/block, the limitation is 24 warps/SM; however, it would go up to 32 warps/SM if each block only required 6 KB of shared memory (6144 bytes), as shown in the shared memory plot (bottom). 338
10.3 Analyzing the occupancy of a case with (1) registers/thread=16, (2) shared memory/kernel=8192 (8 KB), and (3) threads/block=128 (4 warps). CUDA Occupancy Calculator plots the occupancy when we launch our blocks with more threads/block (top) and provides a summary of which one of the three resources will hit its limit before the others (bottom). In this specific case, the limited amount of shared memory (48 KB) limits the total number of blocks we can launch to 6; neither the number of registers nor the maximum number of blocks per SM becomes a limitation. 339
10.4 Analyzing GaussKernel7(), which uses (1) registers/thread=16, (2) shared memory/kernel=40,960 (40 KB), and (3) threads/block=256. It is clear that the shared memory limitation does not allow us to launch more than a single block with 256 threads (8 warps). If you could reduce the shared memory usage down to 24 KB by redesigning your kernel, you could launch at least 2 blocks (16 warps, as shown in the plot) and double the occupancy. 341
10.5 Analyzing GaussKernel7() with (1) registers/thread=16, (2) shared memory/kernel=40,960, and (3) threads/block=256. 342
10.6 Analyzing GaussKernel8() with (1) registers/thread=16, (2) shared memory/kernel=24,576, and (3) threads/block=256. 343
10.7 Analyzing GaussKernel8() with (1) registers/thread=16, (2) shared memory/kernel=24,576, and (3) threads/block=256. 344
11.1 Nvidia Visual Profiler. 376
11.2 Nvidia profiler, command line version. 377
11.3 Nvidia NVVP results with no streaming and using a single stream, on the
K80 GPU. 378
11.4 Nvidia NVVP results with 2 and 4 streams, on the K80 GPU. 379