[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse

tkowalcz 48 views 50 slides May 15, 2024

About This Presentation

Computation is increasingly constrained by power. With each advancement in the manufacturing process, a decreasing percentage of the CPU can operate at full capacity, leading to the emergence of the term 'dark silicon'. This trend necessitates techniques that utilize chip area to optimize po...


Slide Content

HOW I LEARNED TO STOP
WORRYING AND LOVE THE DARK
SILICON APOCALYPSE
HTTP://SLI.DO
#GEECON
ROOM 11

How I learned to stop worrying and love
the dark silicon apocalypse

@TKOWALCZ
TKOWALCZ
Tomasz Kowalczewski

for (int i = 0; i < 100; i++) {
    ...
}

    xor rcx, rcx    ; i = 0
.back:
    cmp rcx, 100    ; loop ends once i reaches 100
    je .outside
    ...
    inc rcx         ; i++
    jmp .back
.outside:

M3 Max
Graviton 4
AMD EPYC Genoa-X

Today: 1000 lights/m @ 100W
2024: 2000 lights/m @ 200W
2028: 4000 lights/m @ 400W

Dennard scaling
As transistors get smaller, their power density stays
constant, so that the power use stays in proportion
with area; both voltage and current scale
(downward) with length
ROBERT H. DENNARD, IBM

[Figure: CPU diagrams labelled S, x2 (2 EPOCHS) and x4 (4 EPOCHS)]
75% DARK AFTER 2 GENERATIONS
93% DARK AFTER 4 GENERATIONS

MICHAEL B. TAYLOR, IS DARK SILICON USEFUL?
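
The 75% and 93% figures can be reproduced with simple arithmetic if one assumes, as a reading of the cited paper, that under a fixed power budget the fraction of the chip that can run at full speed halves with every process generation. The sketch below works under that assumption only; the class and method names are illustrative and do not come from the talk.

```java
// Sketch: dark silicon fraction, ASSUMING the usable share of the die
// halves with each process generation at a fixed power budget.
class DarkSiliconSketch {
    /** Fraction of the chip that must stay dark after the given number of generations. */
    static double darkFraction(int generations) {
        double usable = 1.0;
        for (int g = 0; g < generations; g++) {
            usable /= 2.0; // assumption: 2x loss in utilization per generation
        }
        return 1.0 - usable;
    }

    public static void main(String[] args) {
        System.out.println(darkFraction(2)); // 0.75
        System.out.println(darkFraction(4)); // 0.9375
    }
}
```

Under this model four generations give 93.75% dark, which the slide shows as 93%.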

MORE CORES, MORE BETTER
MORE ACCELERATORS, MORE BETTER
(C) PATRICK KENNEDY, SERVETHEHOME.COM

FJCVTZS
Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero.

PCMPESTRI
Packed Compare Explicit Length Strings, Return Index

PICTURE CREDIT TO @FRITZCHENSFRITZ, ANNOTATED BY @GPUSAREMAGIC

VECTOR REGISTER

VECTOR OPERATIONS
   1  2  3  4  5  6  7  8
+ 10 20 30 40 50 60 70 80
= 11 22 33 44 55 66 77 88
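
The addition above can be modelled in plain Java as one operation applied to every lane. This is a scalar sketch of the semantics, not the hardware instruction: in the incubating Java Vector API (jdk.incubator.vector) the same step would be written with IntVector.fromArray(...) and add(...), which needs the incubator module enabled. The class name is illustrative.

```java
// Scalar model of a lane-wise vector add: every lane is independent,
// which is what lets hardware compute all of them in one instruction.
class LaneAddSketch {
    static int[] add(int[] a, int[] b) {
        int[] out = new int[a.length];
        for (int lane = 0; lane < a.length; lane++) {
            out[lane] = a[lane] + b[lane];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] v1 = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] v2 = {10, 20, 30, 40, 50, 60, 70, 80};
        System.out.println(java.util.Arrays.toString(add(v1, v2)));
    }
}
```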

VECTOR REGISTER
add
subtract
multiply
shuffle
blend
load
BITS? BYTES? ELEMENTS?

VECTOR REGISTERS
SHAPE + ELEMENT TYPE = SPECIES
SHAPE: 64 bit, 128 bit, 256 bit
Integer Integer Integer Integer Integer Integer Integer Integer
Long Long Long Long
Double Double Double Double
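
The relation between shape, element type, and species comes down to one division: the number of lanes is the register width divided by the element width. A minimal sketch, assuming the three rows above all depict a 256-bit shape; the class and method names are made up for illustration.

```java
// Sketch: lanes in a species = shape width in bits / element width in bits.
class SpeciesSketch {
    static int laneCount(int shapeBits, int elementBits) {
        return shapeBits / elementBits;
    }

    public static void main(String[] args) {
        System.out.println(laneCount(256, Integer.SIZE)); // 8 Integer lanes
        System.out.println(laneCount(256, Long.SIZE));    // 4 Long lanes
        System.out.println(laneCount(256, Double.SIZE));  // 4 Double lanes
    }
}
```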

Vector cosine similarity
text-embedding-3-large is our new next generation
larger embedding model and creates embeddings
with up to 3072 dimensions.
PICTURE CREDIT OPENAI
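
Cosine similarity between two embeddings is a dot product divided by the product of the two vector lengths, so it reduces to three multiply-and-accumulate loops over arrays of up to 3072 floats. The sketch below is a plain scalar version with illustrative names; it is not the code used in the talk, only the formula a vectorized implementation would compute lane by lane.

```java
// Sketch: scalar cosine similarity over float embeddings.
// Each loop iteration is independent except for the running sums,
// which is the pattern SIMD multiply-add instructions accelerate.
class CosineSketch {
    static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new float[]{1, 0}, new float[]{0, 1})); // 0.0
    }
}
```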

Row vs. Column layout in databases
Origin  Destination  Flight No.  Carrier  Codeshare  Cabin
KRK     DFW          1022        LH       ...        ...
KRK     BOS          933         LH       ...        ...
KRK     LHR          123         BA       ...        ...
IntVector: 1022 933 123 ...

Row vs. Column layout in databases
Origin:      KRK  KRK  KRK  ... ... ...
Destination: DFW  BOS  LHR  ... ... ...
Flight No.:  1022 933  123  ... ... ...
IntVector: 1022 933 123 ...
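
In a column layout the flight numbers sit next to each other in memory, so a filter over that column is a tight loop over one int array, which is the shape of data an IntVector can be loaded from directly. A minimal sketch with a hypothetical predicate (flight number above a threshold); the names are illustrative.

```java
// Sketch: scanning one column stored contiguously as int[].
// In a row layout the same values would be interleaved with the other fields.
class ColumnScanSketch {
    static int countGreaterThan(int[] column, int threshold) {
        int count = 0;
        for (int value : column) {
            if (value > threshold) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] flightNo = {1022, 933, 123};
        System.out.println(countGreaterThan(flightNo, 500)); // 2
    }
}
```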

Row vs. Column layout in databases
Origin:      3xKRK  13xDFW 42xLHR ... ... ...
Destination: 18xDFW 22xBOS 2xLHR  ... ... ...
Flight No.:  1022   -89    -810   ... ... ...
IntVector: 1022 933 123 ... ?
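
The compressed columns here look like run-length encoding for Origin and Destination (3xKRK) and delta encoding for Flight No. (1022, -89, -810), and the question mark presumably asks how to get back to the plain values an IntVector expects. Under that reading, delta decoding is a running sum, sketched below in scalar form with illustrative names.

```java
// Sketch: decoding a delta-encoded int column with a running (prefix) sum.
// Assumes the first entry is the absolute value and the rest are differences.
class DeltaDecodeSketch {
    static int[] decode(int[] deltas) {
        int[] out = new int[deltas.length];
        int running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            out[i] = running;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(decode(new int[]{1022, -89, -810})));
        // [1022, 933, 123]
    }
}
```

Each output depends on the previous one, so unlike the lane-wise operations this loop does not map onto a single vector instruction.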

SOURCE: HTTPS://EN.M.WIKIPEDIA.ORG/WIKI/FILE:TABLE_OF_X86_REGISTERS_SVG.SVG
K0 K1 K2 K3 K4 K5 K6 K7

Vector masks: LOAD
MEMORY: 1 2 3 4
MASK:   1 1 1 1 0 0 0 0
V1:     1 2 3 4 0 0 0 0

Vector masks: STORE
V1:     1 2 3 4 5 6 7 8
MASK:   1 1 1 1 0 0 0 0
MEMORY: 1 2 3 4 0 0 0 0
Page boundary, fault suppression
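
A mask selects which lanes take part in a load or store. The scalar model below mirrors the two slides: lanes whose mask bit is 0 are neither read nor written, which is why a masked load can run off the end of a 4-element buffer without touching memory past it. It models the semantics only; the class and method names are illustrative.

```java
// Scalar model of masked load and masked store.
class MaskedOpsSketch {
    /** Lanes with mask[lane] == false stay 0 and their memory is never read. */
    static int[] maskedLoad(int[] memory, int offset, boolean[] mask) {
        int[] v = new int[mask.length];
        for (int lane = 0; lane < mask.length; lane++) {
            if (mask[lane]) {
                v[lane] = memory[offset + lane];
            }
        }
        return v;
    }

    /** Lanes with mask[lane] == false leave memory untouched. */
    static void maskedStore(int[] v, boolean[] mask, int[] memory, int offset) {
        for (int lane = 0; lane < mask.length; lane++) {
            if (mask[lane]) {
                memory[offset + lane] = v[lane];
            }
        }
    }

    public static void main(String[] args) {
        boolean[] mask = {true, true, true, true, false, false, false, false};
        int[] v1 = maskedLoad(new int[]{1, 2, 3, 4}, 0, mask);
        System.out.println(java.util.Arrays.toString(v1)); // [1, 2, 3, 4, 0, 0, 0, 0]
    }
}
```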

Vector shuffles: REARRANGE
V1:      1 2 3 4 5 6 7 8
SHUFFLE: 7 6 5 4 3 2 1 0
V2:      8 7 6 5 4 3 2 1
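
A shuffle is a vector of source lane indices: lane N of the result takes its value from the lane of the input that the shuffle names at position N. A scalar sketch of that rule, using the values from the slide (names are illustrative):

```java
// Scalar model of rearrange: out[lane] = v[shuffle[lane]].
class RearrangeSketch {
    static int[] rearrange(int[] v, int[] shuffle) {
        int[] out = new int[v.length];
        for (int lane = 0; lane < v.length; lane++) {
            out[lane] = v[shuffle[lane]];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] v1 = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] reverse = {7, 6, 5, 4, 3, 2, 1, 0};
        System.out.println(java.util.Arrays.toString(rearrange(v1, reverse)));
        // [8, 7, 6, 5, 4, 3, 2, 1]
    }
}
```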

VECTOR OPERATIONS

Sorting networks
[Figure: sorting network diagrams with comparators labelled A, B, C, D and example inputs 1 and 5]

VECTOR MIN/MAX
V1:          5 4 1 2 5 8 7 6
V2:          2 6 3 7 2 5 1 9
MAX(V1, V2): 5 6 3 7 5 8 7 9
MIN(V1, V2): 2 4 1 2 2 5 1 6
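
Lane-wise min and max compare the two inputs position by position and involve no branches, which is what makes them useful as the comparator of a sorting network. A scalar sketch that reproduces the numbers above (the class name is illustrative):

```java
// Scalar model of lane-wise MIN and MAX.
class MinMaxSketch {
    static int[] min(int[] a, int[] b) {
        int[] out = new int[a.length];
        for (int lane = 0; lane < a.length; lane++) {
            out[lane] = Math.min(a[lane], b[lane]);
        }
        return out;
    }

    static int[] max(int[] a, int[] b) {
        int[] out = new int[a.length];
        for (int lane = 0; lane < a.length; lane++) {
            out[lane] = Math.max(a[lane], b[lane]);
        }
        return out;
    }
}
```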

Sorting networks
[Figure: the values 1 and 5 passed through shuffle, then min and max, then blend]
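
Reading this slide as shuffle, then min and max, then blend, one layer of a sorting network can be sketched as follows: swap each lane with its neighbour, take the lane-wise min and max of the original and the swapped copy, and blend so that even lanes keep the min and odd lanes keep the max. This is a scalar model under that interpretation, with illustrative names, and it covers only the comparators on lane pairs (0,1), (2,3), and so on.

```java
// Sketch: one compare-exchange layer built from shuffle, min, max and blend.
// Assumes an even number of lanes.
class CompareExchangeSketch {
    static int[] sortAdjacentPairs(int[] v) {
        int n = v.length;
        int[] swapped = new int[n];
        for (int lane = 0; lane < n; lane++) {
            swapped[lane] = v[lane ^ 1];            // shuffle: swap neighbours
        }
        int[] out = new int[n];
        for (int lane = 0; lane < n; lane++) {
            int lo = Math.min(v[lane], swapped[lane]); // min
            int hi = Math.max(v[lane], swapped[lane]); // max
            out[lane] = (lane % 2 == 0) ? lo : hi;     // blend by lane parity
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(sortAdjacentPairs(new int[]{5, 1, 1, 5})));
        // [1, 5, 1, 5]
    }
}
```

A full network repeats this step with different shuffles so that other lane pairs get compared.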

Sorting networks
BERENGER BRAMAS,
A NOVEL HYBRID QUICKSORT ALGORITHM VECTORIZED USING AVX-512 ON INTEL SKYLAKE

NEED FOR UPFRONT DESIGN OF DATA STRUCTURES AND MEMORY LAYOUTS
COMPLICATED ALGORITHMS
NEED TO EMPLOY ELABORATE CODE TACTICS TO NOT BREAK (RE)BOXING
FAILURE TO OPTIMISE LEADS TO CATASTROPHIC DEGRADATION OF PERFORMANCE
WILL INCUBATE UNTIL PROJECT VALHALLA BECOMES AVAILABLE
NEED TO MEASURE, MEASURE, MEASURE ON TARGET HARDWARE

FAST AND ROBUST VECTORIZED IN-PLACE SORTING OF PRIMITIVE TYPES
INTEL® INTRINSICS GUIDE
JVECTOR SIMDOPS
PERFORMANCE SPEED LIMITS
DESIGNING IN 2023 - 10 PROBLEMS TO SOLVE (JIM KELLER)
JLLAMA