In-Datacenter Performance Analysis of a Tensor Processing Unit

41 slides, May 12, 2018

About This Presentation

This is the 85th presentation video of the TensorFlow-KR paper reading group. I presented Google's TPU v1 paper. (Just one day after the presentation, though, TPU v3 was announced at Google I/O... I look forward to its paper.)
Presentation video: https://youtu.be/7WhWkhFAIO4
Paper link: htt...


Slide Content

In-Datacenter Performance Analysis of a Tensor Processing Unit™
6th May, 2018
PR12 Paper Review
Jinwon Lee
Samsung Electronics

References
Most figures and slides are from:
•Norman P. Jouppi, et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", 44th IEEE/ACM International Symposium on Computer Architecture (ISCA-44), Toronto, Canada, June 2017. https://arxiv.org/abs/1704.04760
•David Patterson, "Evaluation of the Tensor Processing Unit: A Deep Neural Network Accelerator for the Datacenter", NAE Regional Meeting, April 2017. https://sites.google.com/view/naeregionalsymposium
•Kaz Sato, "An in-depth look at Google's first Tensor Processing Unit (TPU)", https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

Authors

A Golden Age in Microprocessor Design
•Stunning progress in microprocessor design: 40 years ≈ 10^6x faster!
•Three architectural innovations (~1000x)
Width: 8 → 16 → 32 → 64 bit (~8x)
Instruction-level parallelism: 4-10 clock cycles per instruction to 4+ instructions per clock cycle (~10-20x)
Multicore: 1 processor to 16 cores (~16x)
•Clock rate: 3 to 4000 MHz (~1000x through technology & architecture)
•Made possible by IC technology:
Moore's Law: growth in transistor count (2x every 1.5 years)
Dennard Scaling: power/transistor shrinks at the same rate as transistors are added (constant power per mm² of silicon)

End of Growth of Performance?

What’s Left?
•Since
Transistors not getting much better
Power budget not getting much higher
Already switched from 1 inefficient processor/chip to N efficient
processors/chip
•Only path left is Domain-Specific Architectures
Just do a few tasks, but extremely well

TPU Origin
•Starting as far back as 2006, Google engineers had discussions about deploying GPUs, FPGAs, or custom ASICs in their data centers. They concluded that they could make do with the excess capacity of their large data centers.
•The conversation changed in 2013 when it was projected that if people used voice search for 3 minutes a day using speech recognition DNNs, it would have required Google's data centers to double in order to meet computation demands.
•Google then started a high-priority project to quickly produce a custom ASIC for inference.
•The goal was to improve cost-performance by 10x over GPUs.
•Given this mandate, the TPU was designed, verified, built, and deployed in data centers in just 15 months.

TPU
•Built on a 28nm process
•Runs @ 700MHz
•Consumes 40W when running
•Connected to its host via a PCIe Gen3 x16 bus
•The TPU card fits into a disk slot, replacing a disk
•Up to 4 cards / server

3 Kinds of Popular NNs
•Multi-Layer Perceptrons (MLP)
Each new layer is a set of nonlinear functions of weighted sums of all outputs (fully connected) from the prior layer
•Convolutional Neural Networks (CNN)
Each ensuing layer is a set of nonlinear functions of weighted sums of spatially nearby subsets of outputs from the prior layer, which also reuses the weights
•Recurrent Neural Networks (RNN)
Each subsequent layer is a collection of nonlinear functions of weighted sums of outputs and the previous state. The most popular RNN is Long Short-Term Memory (LSTM). (A sketch of the first two follows below.)
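Purely as an illustration of the layer types above, here is a minimal NumPy sketch (not Google's code; the layer sizes are made up) of the MLP and CNN cases:

    import numpy as np

    relu = lambda z: np.maximum(z, 0.0)

    # MLP layer: nonlinear function of weighted sums of ALL prior outputs
    x = np.random.rand(256)          # outputs of the prior layer
    W = np.random.rand(128, 256)     # fully connected weights
    mlp_out = relu(W @ x)            # 128 new outputs

    # CNN layer (1-D for brevity): weighted sums over spatially nearby
    # subsets of the prior layer, reusing the same 3-wide kernel everywhere
    signal = np.random.rand(64)
    kernel = np.random.rand(3)
    cnn_out = relu(np.convolve(signal, kernel, mode="valid"))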

Inference Datacenter Workload (95%)

TPU Architecture and Implementation
•Add as accelerators to existing servers
So connect over I/O bus ("PCIe")
TPU ≈ matrix accelerator on I/O bus
•Host server sends it instructions like a Floating Point Unit
Unlike a GPU, which fetches and executes its own instructions
•The goal was to run whole inference models in the TPU to reduce interactions with the host CPU and to be flexible enough to match the NN needs of 2015 and beyond

TPU Block Diagram

TPU High Level Architecture
•Matrix Multiply Unit is the heart of the TPU
65,536 (256x256) 8-bit MAC units
The matrix unit holds one 64 KiB tile of weights plus one for double-buffering
>25x as many MACs vs GPU, >100x as many MACs vs CPU
•Peak performance: 92 TOPS = 65,536 x 2 x 700M (worked out below)
•The 16-bit products are collected in the 4 MiB of 32-bit Accumulators below the matrix unit.
The 4 MiB represents 4096, 256-element, 32-bit accumulators
Operations/byte needed at peak performance ≈ 1350; rounded up to 2048, then doubled for double buffering → 4096
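To make the arithmetic on this slide concrete, here is a small Python check, using only the numbers quoted above, of the 92 TOPS peak figure and the 4096-accumulator sizing:

    import math

    # Peak throughput of the 256x256 MAC array at 700 MHz
    macs = 256 * 256                 # 65,536 8-bit MAC units
    ops_per_mac = 2                  # one multiply + one add per cycle
    clock_hz = 700e6                 # 700 MHz
    peak_ops = macs * ops_per_mac * clock_hz
    print(f"peak = {peak_ops / 1e12:.1f} TOPS")           # ~91.8, quoted as 92 TOPS

    # Accumulator sizing: ~1350 operations per byte are needed to stay at
    # peak, rounded up to a power of two (2048), then doubled for double
    # buffering -> 4096 accumulators of 256 x 32 bits = 4 MiB.
    accumulators = 2 ** math.ceil(math.log2(1350)) * 2    # 2048 * 2 = 4096
    print(accumulators, accumulators * 256 * 4 / 2**20, "MiB")   # 4096 4.0 MiB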

TPU High Level Architecture
•The weights for the matrix unit are staged through an on-chip Weight FIFO that reads from an off-chip 8 GiB DRAM called Weight Memory
Two 2133MHz DDR3 DRAM channels (bandwidth sketched below)
For inference, weights are read-only
8 GiB supports many simultaneously active models
•The intermediate results are held in the 24 MiB on-chip Unified Buffer, which can serve as inputs to the Matrix Unit
The 24 MiB size was picked in part to match the pitch of the Matrix Unit on the die and, given the short development schedule, in part to simplify the compiler
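A quick sanity check on that weight-memory bandwidth (a sketch assuming standard 64-bit DDR3 channels; the resulting 34 GB/s figure is the one quoted later on the "Improving TPU" slide):

    # Aggregate bandwidth of the two DDR3-2133 weight-memory channels
    channels = 2
    transfers_per_s = 2133e6         # DDR3-2133: 2133 MT/s per channel
    bytes_per_transfer = 8           # assuming a standard 64-bit channel
    bandwidth = channels * transfers_per_s * bytes_per_transfer
    print(f"{bandwidth / 1e9:.1f} GB/s")                  # ~34.1 GB/s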

Floorplan of TPU Die
•The Unified Buffer is
almost a third of the die
•Matrix Multiply Unit is a
quarter
•Control is just 2%

RISC, CISC and the TPU Instruction Set
•Most modern CPUs are heavily influenced by the Reduced Instruction Set Computer (RISC) design style
With RISC, the focus is to define simple instructions (e.g., load, store, add and multiply) that are commonly used by the majority of applications and then to execute those instructions as fast as possible.
•A Complex Instruction Set Computer (CISC) design focuses on implementing high-level instructions that run more complex tasks (such as calculating multiply-and-add many times) with each instruction.
The average clock cycles per instruction (CPI) of these CISC instructions is typically 10 to 20
•The TPU chose the CISC style

TPU Instructions
•It has about a dozen instructions overall; the five below are the key ones

TPU Instructions
•The CISC MatrixMultiply instruction is 12 bytes
3 are Unified Buffer address; 2 are accumulator address; 4 are length (sometimes 2 dimensions for convolutions); and the rest are opcode and flags (an illustrative encoding follows below)
•Average clock cycles per instruction: >10
•4-stage overlapped execution, 1 instruction type / stage
Execute other instructions while the matrix multiplier is busy
•Complexity in SW
No branches, in-order issue, SW-controlled buffers, SW-controlled pipeline synchronization
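The real bit-level encoding is not published; purely as a hypothetical illustration of the field sizes quoted above (3 + 2 + 4 bytes of operands plus 3 bytes of opcode and flags), a packing could look like this:

    import struct

    def encode_matrix_multiply(ub_addr, acc_addr, length, opcode=0x01, flags=0):
        """Hypothetical 12-byte MatrixMultiply encoding; only the field sizes
        match the slide, the actual layout and opcode values are made up."""
        ub = ub_addr.to_bytes(3, "little")               # 3-byte Unified Buffer address
        acc = struct.pack("<H", acc_addr)                # 2-byte accumulator address
        ln = struct.pack("<I", length)                   # 4-byte length field
        op = bytes([opcode]) + struct.pack("<H", flags)  # 3 bytes of opcode + flags
        return ub + acc + ln + op

    insn = encode_matrix_multiply(ub_addr=0x000100, acc_addr=0x0010, length=256)
    assert len(insn) == 12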

Systolic Execution in Matrix Array
•Problem: Reading a large SRAM uses much more power than arithmetic
•Solution: Use "systolic execution" to save energy by reducing reads and writes of the Unified Buffer
•A systolic array is a two-dimensional collection of arithmetic units that each independently compute a partial result as a function of inputs from other arithmetic units that are considered upstream to each unit
•It is similar to blood being pumped through the human circulatory system by the heart, which is the origin of the systolic name (a simulation sketch follows below)
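As a rough analogue (a minimal simulation sketch, not the TPU's actual microarchitecture) of the weight-stationary arrangement described on the following slides, here is a cycle-by-cycle systolic matrix-vector multiply in which the weights stay in place, activations enter from the left with a one-row skew, and partial sums flow downward:

    import numpy as np

    def systolic_matvec(W, x):
        """Weight-stationary systolic simulation of y[j] = sum_i x[i] * W[i, j]."""
        n_rows, n_cols = W.shape
        h = np.zeros((n_rows, n_cols))   # values each PE passes to its right neighbour
        v = np.zeros((n_rows, n_cols))   # partial sums each PE passes downward
        y = np.zeros(n_cols)
        for t in range(n_rows + n_cols - 1):
            new_h, new_v = np.zeros_like(h), np.zeros_like(v)
            for i in range(n_rows):
                for j in range(n_cols):
                    # row i receives its input at cycle i (skewed injection)
                    x_in = (x[i] if t == i else 0.0) if j == 0 else h[i, j - 1]
                    p_in = 0.0 if i == 0 else v[i - 1, j]
                    new_h[i, j] = x_in                   # forward the activation
                    new_v[i, j] = p_in + W[i, j] * x_in  # accumulate partial sum
            h, v = new_h, new_v
            for j in range(n_cols):
                if t == (n_rows - 1) + j:                # column j finishes this cycle
                    y[j] = v[n_rows - 1, j]
        return y

    W = np.arange(12, dtype=float).reshape(4, 3)
    x = np.array([1.0, 2.0, 3.0, 4.0])
    assert np.allclose(systolic_matvec(W, x), x @ W)

Each activation is read once from the buffer and then reused as it marches across the array, which is the energy-saving point of the slide.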

Systolic Array (Example – vector input)

Systolic Array (Example – matrix input)

TPU Systolic Array
•In the TPU, the systolic array is rotated
•Weights are loaded from the top and the input data flows into the array from the left
•Weights are preloaded and take effect with the advancing wave alongside the first data of a new block

Software Stack
•Software stack is split into a User Space
Driver and a Kernel Driver.
•The Kernel Driver is lightweight and
handles only memory management
and interrupts.
•The User Space driver changes
frequently. It sets up and controls TPU
execution, reformats data into TPU
order, translates API calls into TPU
instructions, and turns them into an
application binary.

Relative Performance: 3 Contemporary Chips
* TPU is less than half the die size of the Intel Haswell processor
•K80 and TPU are in a 28nm process; Haswell was fabbed in Intel's 22nm process
•These chips and platforms were chosen for comparison because they are widely deployed in Google data centers

Relative Performance: 3 Platforms
•These chips and platforms were chosen for comparison because they are widely deployed in Google data centers

Performance Comparison
•Roofline Performance model
This simple visual model is not perfect, yet
it offers insights on the causes of
performance bottlenecks.
The Y-axis is performance in floating-point
operations per second, thus the peak
computation rate forms the “flat” part of
the roofline.
The X-axis is operational intensity,
measured as floating-point operations per
DRAM byte accessed.
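In code form, the roofline above is just a min() of the two ceilings (a minimal sketch of the model as defined on this slide; any TPU-specific numbers plugged in would be the ones quoted elsewhere in these slides):

    def attainable(op_intensity, peak_ops_per_s, mem_bw_bytes_per_s):
        """Roofline: throughput is capped by either peak compute
        or operational intensity times memory bandwidth."""
        return min(peak_ops_per_s, op_intensity * mem_bw_bytes_per_s)

    def ridge_point(peak_ops_per_s, mem_bw_bytes_per_s):
        """Intensity where the slanted (memory-bound) and flat
        (compute-bound) parts of the roofline meet."""
        return peak_ops_per_s / mem_bw_bytes_per_s

Applications whose operational intensity falls to the left of the ridge point are memory-bound; those to the right are compute-bound.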

TPU Die Roofline
•The TPU has a long "slanted" part of its roofline, where low operational intensity means that performance is limited by memory bandwidth.
•Five of the six applications are happily bumping their heads against the ceiling
•MLPs and LSTMs are memory bound, and CNNs are computation bound.

CPU & GPU Rooflines

Log Rooflines for CPU, GPU and TPU

Linear Rooflines for CPU, GPU and TPU

Why So Far Below Rooflines? (MLP0)
•Response time is the reason
•Researchers have demonstrated that small increases in response
time cause customers to use a service less
•Inference prefers latency over throughput

TPU & GPU Relative Performance to CPU
•GM: Geometric Mean
•WM: Weighted Mean

Performance / Watt

Improving TPU: Move the "Ridge Point" to the Left
•Current DRAM
2 DDR3 channels @ 2133MHz → 34GB/s
•Replace with GDDR5 like in the K80
BW: 34GB/s → 180GB/s
Moves the ridge point from 1350 to 250 (worked out below)
This improvement would expand die size by about 10%. However, higher memory bandwidth reduces pressure on the Unified Buffer, so reducing the Unified Buffer to 14 MiB could gain back 10% in area.
Maximum MiB of the 24 MiB Unified Buffer used per NN app
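One way to reproduce the 1350 and 250 ridge points quoted above (an assumption on my part: counting 8-bit multiply-accumulates, i.e. half of the 92 TOPS, per byte of weight traffic):

    peak_macs_per_s = 65536 * 700e6       # ~46 T MAC/s (half of 92 TOPS)
    for bw in (34e9, 180e9):              # current DDR3 vs. hypothetical GDDR5
        print(f"{bw / 1e9:.0f} GB/s -> ridge point ~ {peak_macs_per_s / bw:.0f} MACs/byte")
    # 34 GB/s -> ~1349 (quoted as 1350); 180 GB/s -> ~255 (quoted as 250)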

Revised TPU Raised Roofline

Performance / Watt Original & Revised TPU

Overall Performance / Watt

Energy Proportionality

Evaluation of TPU Designs
•The table below shows the differences between the model results and the hardware performance counters, which average below 10%.

Weighted Mean TPU Relative Performance

Weighted Mean TPU Relative Performance
•First, increasing memory bandwidth (memory) has the biggest impact: performance improves 3X on average when memory bandwidth increases 4X
•Second, clock rate has little benefit on average with or without more accumulators. The reason is that the MLPs and LSTMs are memory bound; only the CNNs are compute bound
Increasing the clock rate by 4X has almost no impact on MLPs and LSTMs but improves performance of CNNs by about 2X.
•Third, the average performance slightly degrades when the matrix unit expands from 256x256 to 512x512 for all apps
The issue is analogous to internal fragmentation of large pages