Origin of Tensor Processing Unit
-Projection: if people searched by voice for 3 minutes a day, it would double Google's computation demands
-Domain-specific architecture is the solution
-Goal: make the inference phase 10X the cost-performance of GPUs
-Very short development cycle: ~15 months
Key Neural Net Concepts
-Training (learning) in development vs Inference (prediction) in production
-Batch size
-Amortize weight-fetch time by inferring (or training) many input examples at a time
-Quantization
-Floating point is useful, but uses much more energy and time than integer arithmetic
-Do the training in floating point on GPUs, and run inference in integers (see the sketch below)
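A minimal sketch of the quantization idea, assuming simple symmetric linear int8 quantization; the function names and scaling scheme are illustrative, not the TPU's actual scheme:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization of float32 weights to int8 (illustrative)."""
    scale = np.abs(w).max() / 127.0                      # one scale per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(x_q, w_q, x_scale, w_scale):
    """8-bit multiplies, 32-bit accumulation, then rescale back to float."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)    # integer multiply-accumulate
    return acc.astype(np.float32) * (x_scale * w_scale)
```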
3 Types of NNs Represent 95% of Google Inference Workload
-Multi-Layer Perceptrons (MLP)
-Each new layer is a set of nonlinear functions of a weighted sum of all outputs from the prior layer ("fully connected"); a toy layer sketch follows this list
-Convolutional Neural Network (CNN)
-Popular for vision; each layer is a set of nonlinear functions of weighted sums at different coordinates of spatially nearby subsets of outputs from the prior layer, which allows the weights to be reused
-Recurrent Neural Networks (RNN) / “Long Short-Term Memory”
-Each subsequent layer is a collection of nonlinear functions of weighted sums of outputs and the previous state
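A toy NumPy sketch of the fully connected (MLP) layer described above; ReLU is picked as the nonlinearity purely for illustration:

```python
import numpy as np

def mlp_layer(prev_outputs, weights, bias):
    """One fully connected layer: a nonlinear function (ReLU here) of a
    weighted sum of all outputs from the prior layer."""
    z = prev_outputs @ weights + bias      # weighted sum over every prior output
    return np.maximum(z, 0.0)              # elementwise nonlinearity (ReLU)
```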
Inference Datacenter Workload (95%)
TPU Architecture
-Matrix Unit has 65,536 (256x256) 8-bit multiply-accumulate units
-700 MHz clock rate
-Peak: 92 trillion operations/second (see the arithmetic check below)
->25X the multiply-accumulate units of a GPU
->100X the multiply-accumulate units of a CPU
-4 MiB of on-chip Accumulator memory
-24 MiB of on-chip Unified Buffer (activation memory)
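The 92 trillion operations/second figure follows directly from the MAC count and the clock rate; a quick check, counting each multiply-accumulate as two operations:

```python
macs = 256 * 256                 # 65,536 8-bit multiply-accumulate units
clock_hz = 700e6                 # 700 MHz clock rate
peak_ops = macs * clock_hz * 2   # one multiply + one add per MAC per cycle
print(peak_ops / 1e12)           # ~91.8 tera-ops/s, quoted as ~92 TOPS
```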
-Read_Host_Memory
-Reads data from the CPU host memory into the Unified Buffer (UB)
-Write_Host_Memory
-Writes data from the Unified Buffer into the CPU host memory
-Read_Weights
-Reads weights from Weight Memory into the Weight FIFO as input to the Matrix Unit
-MatrixMultiply/Convolve
-Causes the Matrix Unit to perform a matrix multiply or a convolution from the Unified Buffer into the Accumulators
-Activate
-Performs the nonlinear function of the artificial neuron, with options for ReLU, Sigmoid, and so on
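An illustrative host-side view of how these five CISC-style instructions might be issued for one layer; the `tpu` object and its method names are hypothetical stand-ins for the instruction names above, not a real driver API:

```python
def run_layer(tpu, host_inputs, layer_weights):
    """Hypothetical flow of the five TPU instructions for a single NN layer."""
    tpu.read_host_memory(host_inputs)     # host DRAM -> Unified Buffer (activations)
    tpu.read_weights(layer_weights)       # Weight Memory -> Weight FIFO
    tpu.matrix_multiply()                 # Unified Buffer x weights -> Accumulators
    tpu.activate()                        # ReLU/Sigmoid/... applied to Accumulators
    tpu.write_host_memory()               # results: Unified Buffer -> host DRAM
```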
Circuit Board
Performance Comparisons
Roofline Model
-Y-axis: performance (FLOP/s)
-X-axis: arithmetic intensity (how many operations per byte fetched from memory?)
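The roofline itself is just the minimum of two ceilings: peak compute, and memory bandwidth times arithmetic intensity. A small helper, using the TPU's quoted 92 TOPS peak and, as an assumption, roughly 34 GB/s of off-chip DDR3 bandwidth:

```python
def roofline(peak_ops_per_s, mem_bw_bytes_per_s, arithmetic_intensity):
    """Attainable ops/s = min(compute ceiling, bandwidth * ops-per-byte)."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * arithmetic_intensity)

# At low arithmetic intensity the TPU sits on the slanted, memory-bound part:
print(roofline(92e12, 34e9, 100) / 1e12)   # ~3.4 tera-ops/s, far below the 92 TOPS peak
```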
TPU Roofline
-Very high peak performance
-Bottlenecked by memory bandwidth
Haswell (CPU) Die Roofline
-Lower peak performance
-More memory bandwidth
-The neural nets are not as close to the top as with the TPU
K80 (GPU) Die Roofline
-Higher memory bandwidth than the CPU
-The neural nets are far from their roofline
Relative Performance Table
Performance/Watt Comparisons
-GPU vs CPU: 1.2X-2.1X total performance/Watt
-GPU vs CPU: 1.7X-2.9X incremental performance/Watt
-TPU vs CPU: 17X-34X total performance/Watt
-TPU vs GPU: 14X-16X total performance/Watt
-TPU vs CPU: 41X-83X incremental performance/Watt
-TPU vs GPU: 25X-29X incremental performance/Watt
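The split between "total" and "incremental" reflects whose power is counted: total divides throughput by host-server plus accelerator power, incremental by the accelerator's added power alone. A sketch with placeholder numbers (not measurements from the paper), only to show why the incremental ratios come out larger:

```python
def perf_per_watt(ops_per_s, host_power_w, accel_power_w):
    """Total counts host + accelerator power; incremental counts only the accelerator."""
    total = ops_per_s / (host_power_w + accel_power_w)
    incremental = ops_per_s / accel_power_w
    return total, incremental

# Placeholder figures, purely illustrative:
print(perf_per_watt(ops_per_s=1e12, host_power_w=200.0, accel_power_w=40.0))
```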
Energy Proportionality
-TPU has the lowest power: 40W per die
-Poor energy proportionality: at 10% load, the TPU uses 88% of the power it uses at 100% load
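Energy proportionality asks how closely power tracks load: an ideally proportional device at 10% load would draw about 10% of full power, versus the TPU's 88%. A tiny check, where the "ideal" curve is an assumption for contrast rather than a figure from the paper:

```python
def proportionality(power_at_load_w, power_at_full_w, load_fraction):
    """Compare the measured power fraction at partial load with an ideal linear device."""
    measured_fraction = power_at_load_w / power_at_full_w
    ideal_fraction = load_fraction           # ideal: power scales linearly with load
    return measured_fraction, ideal_fraction

# TPU die: ~40 W at full load, ~88% of that at only 10% load
print(proportionality(power_at_load_w=0.88 * 40.0, power_at_full_w=40.0, load_fraction=0.10))
```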
Summary
-Inference apps usually emphasize response time over throughput since they are often user-facing.
-As a result of latency limits, the K80 GPU is just a little faster for inference than the Haswell CPU, despite having much higher peak performance and memory bandwidth.
-While most architects are accelerating CNNs, CNNs are just 5% of Google's datacenter workload.
-The TPU is about 15X – 30X faster at inference than the K80 GPU and the Haswell CPU.
Summary (contd.)
-Four of the six NN apps that were tested are memory bound; if the TPU were revised to have the same memory system as the K80 GPU, it would be about 30X – 50X faster than the GPU and CPU.
-Despite having a much smaller and lower-power chip, the TPU has 25 times as many multiply-accumulators and 3.5 times as much on-chip memory as the K80 GPU.
-The performance per Watt of the TPU is 30X – 80X that of its contemporary CPUs and GPUs; a revised TPU with K80 memory would be 70X – 200X better.
Resources
Link to paper: https://www.cse.wustl.edu/~roger/566S.s21/P1-Norman-1.pdf
Link to Dave Patterson talk: Dave Patterson Evaluation of the Tensor Processing Unit