In-Datacenter Performance Analysis of a Tensor Processing Unit
About This Presentation
This is the 85th talk of the TensorFlow-KR paper reading group. It covers Google's TPU v1 paper. (Just one day after the talk, TPU v3 was announced at Google I/O... I look forward to that paper.)
Talk video: https://youtu.be/7WhWkhFAIO4
Paper link: https://arxiv.org/abs/1704.04760
Slide Content
In-Datacenter Performance Analysis of a Tensor Processing Unit
6th May, 2018
PR12 Paper Review
Jinwon Lee
Samsung Electronics
References
Most figures and slides are from:
Norman P. Jouppi, et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", 44th IEEE/ACM International Symposium on Computer Architecture (ISCA-44), Toronto, Canada, June 2017. https://arxiv.org/abs/1704.04760
David Patterson, "Evaluation of the Tensor Processing Unit: A Deep Neural Network Accelerator for the Datacenter", NAE Regional Meeting, April 2017. https://sites.google.com/view/naeregionalsymposium
Kaz Sato, "An in-depth look at Google's first Tensor Processing Unit (TPU)", https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Authors
A Golden Age in Microprocessor Design
•Stunning progress in microprocessor design: 40 years ≈ 10^6x faster!
•Three architectural innovations (~1000x)
Width: 8 → 16 → 32 → 64 bit (~8x)
Instruction-level parallelism: from 4-10 clock cycles per instruction to 4+ instructions per clock cycle (~10-20x)
Multicore: from 1 processor to 16 cores (~16x)
•Clock rate: 3 to 4000 MHz (~1000x through technology & architecture)
•Made possible by IC technology:
Moore's Law: growth in transistor count (2x every 1.5 years)
Dennard Scaling: power per transistor shrinks at the same rate as transistors are added (constant power per mm² of silicon)
End of Growth of Performance?
What’s Left?
•Since:
Transistors are not getting much better
The power budget is not getting much higher
We have already switched from 1 inefficient processor per chip to N efficient processors per chip
•The only path left is Domain Specific Architectures
Just do a few tasks, but do them extremely well
TPU Origin
•Starting as far back as 2006, Google engineers discussed deploying GPUs, FPGAs, or custom ASICs in their data centers. They concluded that they could get by on the excess capacity of their large data centers.
•The conversation changed in 2013, when it was projected that if people used voice search for 3 minutes a day with speech-recognition DNNs, Google's data centers would have to double to meet the computation demand.
•Google then started a high-priority project to quickly produce a custom ASIC for inference.
•The goal was to improve cost-performance by 10x over GPUs.
•Given this mandate, the TPU was designed, verified, built, and deployed in data centers in just 15 months.
TPU
•Built on a 28nm process
•Runs @700MHz
•Consumes 40W when running
•Connected to its host via a PCIe Gen3 x16 bus
•TPU card to replace a disk
•Up to 4 cards / server
3 Kinds of Popular NNs
•Multi-Layer Perceptrons (MLP)
Each new layer is a set of nonlinear functions of weighted sums of all outputs (fully connected) from the prior layer
•Convolutional Neural Networks (CNN)
Each ensuing layer is a set of nonlinear functions of weighted sums of spatially nearby subsets of outputs from the prior layer, which also reuses the weights
•Recurrent Neural Networks (RNN)
Each subsequent layer is a set of nonlinear functions of weighted sums of outputs and the previous state. The most popular RNN is Long Short-Term Memory (LSTM).
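As a minimal illustration of the "nonlinearity applied to weighted sums" structure shared by these three layer types, here is a small NumPy sketch. The function names, shapes, and choice of ReLU/tanh are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mlp_layer(x, W, b):
    """Fully connected layer: a nonlinearity applied to weighted sums of all prior outputs."""
    return np.maximum(0.0, W @ x + b)          # ReLU nonlinearity

def conv1d_layer(x, w, b):
    """1-D convolution: weighted sums over spatially nearby inputs, reusing the same weights."""
    k = len(w)
    out = np.array([w @ x[i:i + k] + b for i in range(len(x) - k + 1)])
    return np.maximum(0.0, out)

def rnn_step(x, h, Wx, Wh, b):
    """Recurrent step: weighted sums of the current input and the previous state."""
    return np.tanh(Wx @ x + Wh @ h + b)

x = np.random.rand(256)                        # outputs of the prior layer
W, b = np.random.rand(128, 256), np.zeros(128)
print(mlp_layer(x, W, b).shape)                # (128,)
```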
Inference Datacenter Workload (95%)
TPU Architecture and Implementation
•Add as accelerators to existing servers
So connect over the I/O bus ("PCIe")
TPU ≈ matrix accelerator on the I/O bus
•The host server sends it instructions, as it would to a Floating Point Unit
Unlike a GPU, which fetches and executes its own instructions
•The goal was to run whole inference models in the TPU to reduce
interactions with the host CPU and to be flexible enough to match
the NN needs of 2015 and beyond
TPU Block Diagram
TPU High Level Architecture
•The Matrix Multiply Unit is the heart of the TPU
65,536 (256x256) 8-bit MAC units
The matrix unit holds one 64 KiB tile of weights plus one for double buffering
>25x as many MACs vs GPU, >100x as many MACs vs CPU
•Peak performance: 92 TOPS = 65,536 x 2 x 700M
•The 16-bit products are collected in the 4 MiB of 32-bit Accumulators below the matrix unit
The 4 MiB represents 4096 256-element, 32-bit accumulators
Operations per byte at peak performance ≈ 1350; rounded up to 2048, then doubled for double buffering → 4096
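A back-of-the-envelope check of these numbers (a sketch; the 34 GB/s weight bandwidth is the DDR3 figure quoted on the next slide):

```python
import math

macs = 256 * 256                       # 65,536 8-bit MAC units
clock_hz = 700e6                       # 700 MHz
peak_ops = macs * 2 * clock_hz         # 2 ops (multiply + add) per MAC per cycle
print(peak_ops / 1e12)                 # ~92 TOPS

weight_bw = 34e9                       # weight-memory bandwidth, bytes/s
ops_per_byte = macs * clock_hz / weight_bw          # ~1350 multiply-adds per weight byte at peak
rounded = 2 ** math.ceil(math.log2(ops_per_byte))   # round up to 2048
print(rounded * 2)                     # double-buffered -> 4096 accumulators
print(4096 * 256 * 4 / 2**20)          # 4096 x 256 elements x 4 bytes = 4 MiB
```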
TPU High Level Architecture
•The weights for the matrix unit are staged through an on-chip Weight FIFO that reads from an off-chip 8 GiB DRAM called Weight Memory
Two 2133MHz DDR3 DRAM channels
For inference, weights are read-only
8 GiB supports many simultaneously active models
•The intermediate results are held in the 24 MiB on-chip Unified Buffer, which can serve as inputs to the Matrix Unit
The 24 MiB size was picked in part to match the pitch of the Matrix Unit on the die and, given the short development schedule, in part to simplify the compiler
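A quick sanity check of the weight-memory bandwidth implied by those channels (a sketch, assuming standard 64-bit-wide DDR3 channels at 2133 MT/s):

```python
channels = 2
transfers_per_sec = 2133e6      # DDR3-2133: 2133 mega-transfers per second
bytes_per_transfer = 8          # 64-bit channel width
bandwidth = channels * transfers_per_sec * bytes_per_transfer
print(bandwidth / 1e9)          # ~34.1 GB/s, the 34 GB/s quoted in the paper
```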
Floorplan of TPU Die
•The Unified Buffer is
almost a third of the die
•Matrix Multiply Unit is a
quarter
•Control is just 2%
RISC, CISC and the TPU Instruction Set
•Most modern CPUs are heavily influenced by the Reduced Instruction Set Computer (RISC) design style
With RISC, the focus is to define simple instructions (e.g., load, store, add, and multiply) that are commonly used by the majority of applications and then to execute those instructions as fast as possible
•A Complex Instruction Set Computer (CISC) design focuses on implementing high-level instructions that run more complex tasks (such as calculating multiply-and-add many times) with each instruction
The average clock cycles per instruction (CPI) of these CISC instructions is typically 10 to 20
•The TPU chose the CISC style
TPU Instructions
•It has about a dozen instructions overall; the five below are the key ones
TPU Instructions
•The CISC MatrixMultiply instruction is 12 bytes
3 are the Unified Buffer address; 2 are the accumulator address; 4 are the length (sometimes 2 dimensions for convolutions); and the rest are opcode and flags
•Average clock cycles per instruction: >10
•4-stage overlapped execution, 1 instruction type / stage
Execute other instructions while matrix multiplier busy
•Complexity in SW
No branches, in-order issue, SW controlled buffers, SW controlled pipeline
synchronization
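To make the 12-byte layout concrete, here is a hypothetical packing sketch. Only the field sizes (3-byte Unified Buffer address, 2-byte accumulator address, 4-byte length, remaining 3 bytes of opcode/flags) come from the slide; the field order, byte order, and helper name are illustrative assumptions, not the real TPU encoding.

```python
import struct

def encode_matrix_multiply(ub_addr, acc_addr, length, opcode=0x1, flags=0):
    """Hypothetical 12-byte MatrixMultiply encoding: field sizes per the slide, layout assumed."""
    ub_bytes = ub_addr.to_bytes(3, "little")      # 3-byte Unified Buffer address
    acc_bytes = acc_addr.to_bytes(2, "little")    # 2-byte accumulator address
    len_bytes = length.to_bytes(4, "little")      # 4-byte length (may be 2-D for convolutions)
    tail = struct.pack("<BH", opcode, flags)      # remaining 3 bytes: opcode + flags
    insn = ub_bytes + acc_bytes + len_bytes + tail
    assert len(insn) == 12
    return insn

print(encode_matrix_multiply(ub_addr=0x000100, acc_addr=0x0010, length=256).hex())
```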
Systolic Execution in Matrix Array
•Problem : Reading a large SRAM uses much more power than
arithmetic
•Solution : Using “Systolic Execution” to save energy by reducing
reads and writes of the Unified Buffer
•A systolic array is a two dimensional collection of arithmetic units
that each independently compute a partial result as a function of
inputs from other arithmetic units that are considered upstream to
each unit
•It is similar to blood being pumped through the human circulatory system by the heart, which is the origin of the name "systolic"
Systolic Array (Example – vector input)
Systolic Array (Example – matrix input)
TPU Systolic Array
•In the TPU, the systolic array is
rotated
•Weights are loaded from the top
and the input data flows into the
array in from the left
•Weights are preloaded and take
effect with the advancing wave
alongside the first data of a new
block
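To make this dataflow concrete, below is a small NumPy simulation of a weight-stationary systolic array in the TPU's orientation: weights preloaded into the cells, activations entering from the left with a one-cycle skew per row (the advancing wave), and partial sums flowing down into the accumulators. The array size, variable names, and single-cycle model are illustrative simplifications, not the actual hardware.

```python
import numpy as np

def systolic_matvec(W, x):
    """Simulate y = x @ W on a weight-stationary systolic array (illustrative sketch).

    W[i, j] is preloaded into cell (i, j). Input x flows in from the left,
    with row i delayed by i cycles; partial sums move down one row per cycle
    into the accumulators below the array.
    """
    n, m = W.shape
    cycles = n + m                          # enough cycles for the wave to drain
    a_in = np.zeros((n, cycles))
    for i in range(n):
        a_in[i, i] = x[i]                   # skewed (diagonal) injection from the left
    acc = np.zeros(m)                       # accumulators below the array
    psum = np.zeros((n, m))                 # partial sum held in each cell
    a_reg = np.zeros((n, m))                # activation register in each cell
    for t in range(cycles):
        acc += psum[-1]                     # bottom row drains into the accumulators
        psum = np.roll(psum, 1, axis=0)     # partial sums move down one row
        psum[0] = 0.0
        a_reg = np.roll(a_reg, 1, axis=1)   # activations move right one column
        a_reg[:, 0] = a_in[:, t]
        psum += W * a_reg                   # every cell: psum += weight * activation
    return acc

W = np.arange(12, dtype=float).reshape(3, 4)
x = np.array([1.0, 2.0, 3.0])
print(systolic_matvec(W, x))                # [32. 38. 44. 50.]
print(x @ W)                                # same result, computed directly
```

Note how each input value and partial sum is read from SRAM once and then reused as it moves through the array, which is exactly the energy argument on the previous slide.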
Software Stack
•Software stack is split into a User Space
Driver and a Kernel Driver.
•The Kernel Driver is lightweight and
handles only memory management
and interrupts.
•The User Space driver changes
frequently. It sets up and controls TPU
execution, reformats data into TPU
order, translates API calls into TPU
instructions, and turns them into an
application binary.
Relative Performance: 3 Contemporary Chips
* The TPU is less than half the die size of the Intel Haswell processor
•The K80 and TPU are built in a 28nm process; Haswell is fabbed in Intel's 22nm process
•These chips and platforms chosen for comparison because widely deployed in
Google data centers
Relative Performance: 3 Platforms
•These chips and platforms chosen for comparison because widely
deployed in Google data centers
Performance Comparison
•Roofline Performance model
This simple visual model is not perfect, yet
it offers insights on the causes of
performance bottlenecks.
The Y-axis is performance in floating-point
operations per second, thus the peak
computation rate forms the “flat” part of
the roofline.
The X-axis is operational intensity,
measured as floating-point operations per
DRAM byte accessed.
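The roofline itself is a single formula: attainable performance is the minimum of peak compute and memory bandwidth times operational intensity. A minimal sketch with the TPU figures from earlier slides (note the deck's intensity numbers count a multiply-add as one operation, while the 92 TOPS peak counts it as two):

```python
def attainable(oi, peak, bw):
    """Roofline model: attainable perf = min(peak compute, memory bandwidth x operational intensity)."""
    return min(peak, bw * oi)

peak_macs = 65_536 * 700e6          # ~46e12 multiply-adds per second
bw = 34e9                           # weight-memory bandwidth, bytes/s
for oi in (10, 100, 1350, 3000):    # operations per weight byte
    print(oi, attainable(oi, peak_macs, bw) / 1e12)   # T-ops/s
# The ridge point, where the slanted and flat parts meet, is peak/bw ~= 1350 ops/byte.
```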
TPU Die Roofline
•The TPU has a long "slanted" part of its roofline, where low operational intensity means that performance is limited by memory bandwidth.
•Five of the six applications are
happily bumping their heads against
the ceiling
•MLPs and LSTMs are memory bound,
and CNNs are computation bound.
CPU & GPU Rooflines
Log Rooflines for CPU, GPU and TPU
Linear Rooflines for CPU, GPU and TPU
Why So Far Below Rooflines? (MLP0)
•Response time is the reason
•Researchers have demonstrated that small increases in response
time cause customers to use a service less
•Inference prefers latency over throughput
TPU & GPU Relative Performance to CPU
•GM : Geometric Mean
•WM : Weighted Mean
Performance / Watt
Improving the TPU: Move the "Ridge Point" to the Left
•Current DRAM
2 DDR3 channels @ 2133MHz → 34 GB/s
•Replace with GDDR5, as in the K80
BW: 34 GB/s → 180 GB/s
Moves the ridge point from 1350 to 250
This improvement would expand the die size by about 10%. However, higher memory bandwidth reduces pressure on the Unified Buffer, so shrinking the Unified Buffer to 14 MiB could gain back the 10% in area.
Maximum MiB of the 24 MiB Unified Buffer used per NN app
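The ridge-point shift follows directly from the bandwidth change (a sketch using the deck's multiply-add counting convention from the roofline slide):

```python
peak_macs = 65_536 * 700e6                    # ~46e12 multiply-adds per second
for name, bw in (("2x DDR3-2133", 34e9), ("GDDR5 (as in K80)", 180e9)):
    ridge = peak_macs / bw                    # ops per weight byte needed to hit peak
    print(f"{name}: ridge point ~ {ridge:.0f} ops/byte")
# 2x DDR3-2133: ~1350 ops/byte   GDDR5: ~255 ops/byte (the "250" on the slide)
```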
Revised TPU Raised Roofline
Performance / Watt Original & Revised TPU
Overall Performance / Watt
Energy Proportionality
Evaluation of TPU Designs
•The table below shows the differences between the performance-model results and the hardware performance counters; they average below 10%.
Weighted Mean TPU Relative Performance
Weighted Mean TPU Relative Performance
•First, increasing memory bandwidth (memory) has the biggest impact: performance improves 3X on average when memory bandwidth increases 4X
•Second, clock rate has little benefit on average, with or without more accumulators. The reason is that the MLPs and LSTMs are memory bound; only the CNNs are compute bound
Increasing the clock rate by 4X has almost no impact on MLPs and LSTMs but improves CNN performance by about 2X
•Third, average performance degrades slightly when the matrix unit expands from 256x256 to 512x512, for all apps
The issue is analogous to internal fragmentation of large pages
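The clock-rate result follows directly from the roofline sketch earlier: raising peak compute lifts the flat part of the roof, but a memory-bound app stays on the bandwidth-limited slope. A hedged worked example, with made-up operational intensities chosen only to illustrate the mechanism:

```python
def attainable(oi, peak, bw):
    return min(peak, bw * oi)                 # roofline model, as sketched earlier

bw = 34e9                                     # weight-memory bandwidth, bytes/s
peak = 65_536 * 700e6                         # baseline peak, multiply-adds per second
cases = {"MLP-like (memory bound)": 100,      # illustrative operational intensities,
         "CNN-like (compute bound)": 2800}    # not measured values from the paper
for label, oi in cases.items():
    speedup = attainable(oi, 4 * peak, bw) / attainable(oi, peak, bw)
    print(f"{label}: {speedup:.1f}x from a 4x clock")
# MLP-like: 1.0x (still on the bandwidth slope)   CNN-like: ~2x (was capped by the old roof)
```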