In-Datacenter Performance Analysis of a Tensor Processing Unit


About This Presentation

In-Datacenter Performance Analysis of a Tensor Processing Unit


Slide Content

In-Datacenter Performance Analysis
of a Tensor Processing Unit
By NP Jouppi et al.
Presented by Alex Appel

Note: Some slides adapted from Dave Patterson's talk of the same title at the EECS Colloquium

Agenda
-Introduction/Motivation
-Architecture
-Performance Comparisons
-Main highlights/Summary
-Questions

Origin of Tensor Processing Unit
-Projection: if people used voice search for 3 minutes a day, it would double Google's datacenter computation demands
-Domain-specific architecture is the solution
-Goal: run the inference phase at 10X the cost-performance of GPUs
-Very short development cycle: ~15 months

Key Neural Net Concepts
-Training (learning) in development vs Inference (prediction) in production
-Batch size
-Amortize weight-fetch time by inferring (or training) many input examples at a time
-Quantization
-Floating point is flexible, but costs much more energy and time than 8-bit integer arithmetic
-Train in floating point on GPUs; quantize and run inference in integers (see the sketch below)
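A minimal sketch of the quantization idea in Python, assuming simple symmetric linear quantization (the paper's exact scheme is not detailed on this slide): float weights and activations are mapped to 8-bit integers, and the multiply-accumulates run in integer arithmetic.

import numpy as np

def quantize_int8(x):
    # Symmetric linear quantization of a float array to int8 (illustrative only)
    scale = np.max(np.abs(x)) / 127.0              # one scale factor for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Float weights/activations as produced by training
w = np.random.randn(256, 256).astype(np.float32)
a = np.random.randn(256).astype(np.float32)

wq, w_scale = quantize_int8(w)
aq, a_scale = quantize_int8(a)

# Integer multiply-accumulate; accumulate in int32 since 8-bit products overflow int8
acc = wq.astype(np.int32) @ aq.astype(np.int32)
approx = acc * (w_scale * a_scale)                 # rescale the result back to float
print(np.max(np.abs(approx - w @ a)))              # small quantization error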

3 Types of NNs Represent 95% of Google Inference Workload
-Multi-Layer Perceptrons (MLP)
-Each new layer is a set of nonlinear functions of a weighted sum of all outputs from the prior layer ("fully connected"); see the one-layer sketch after this list
-Convolutional Neural Networks (CNN)
-Popular for vision; each layer is a set of nonlinear functions of weighted sums at different coordinates of spatially nearby subsets of outputs from the prior layer, which allows the weights to be reused
-Recurrent Neural Networks (RNN) / "Long Short-Term Memory" (LSTM)
-Each subsequent layer is a collection of nonlinear functions of weighted sums of outputs and the previous state
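As a concrete illustration of the "nonlinear function of a weighted sum" phrasing, here is a minimal NumPy sketch of one fully connected (MLP) layer. The layer sizes and the choice of ReLU are illustrative, not taken from the paper.

import numpy as np

def mlp_layer(x, weights, bias):
    # One fully connected layer: a nonlinear function (ReLU) of a weighted sum
    z = weights @ x + bias          # weighted sum of all outputs from the prior layer
    return np.maximum(z, 0.0)       # elementwise nonlinearity (ReLU)

x = np.random.randn(256)            # outputs of the prior layer
w = np.random.randn(128, 256)       # one weight per (output, input) pair: "fully connected"
b = np.zeros(128)
y = mlp_layer(x, w, b)              # becomes the input to the next layer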

Inference Datacenter Workload (95%)

TPU Architecture
-Matrix Unit: 65,536 (256x256) 8-bit multiply-accumulate units
-700 MHz clock rate
-Peak: 92 trillion operations/second (see the arithmetic check below)
->25X the multiply-accumulate units of a GPU
->100X the multiply-accumulate units of a CPU
-4 MiB of on-chip Accumulator memory
-24 MiB of on-chip Unified Buffer (activation memory)
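The 92-TOPS peak follows directly from the numbers on this slide: each multiply-accumulate counts as two operations per cycle, so a quick back-of-the-envelope check in Python is:

macs = 256 * 256          # 65,536 8-bit multiply-accumulate units
clock_hz = 700e6          # 700 MHz clock rate
ops_per_mac = 2           # one multiply plus one add per MAC per cycle
peak_ops = macs * ops_per_mac * clock_hz
print(peak_ops / 1e12)    # ~91.8 trillion operations/second, i.e. ~92 TOPS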

TPU Chip Floorplan (share of die area)
-Unified Buffer: 29%
-Matrix Multiply Unit: 24%
-Control: 2%

Main CISC Instructions

-Read_Host_Memory
-Reads data from the CPU host memory into the Unified Buffer (UB)
-Write_Host_Memory
-Writes data from the Unified Buffer into the CPU host memory
-Read_Weights
-Reads weights from Weight Memory into the Weight FIFO as input to the Matrix Unit
-MatrixMultiply/Convolve
-Causes the Matrix Unit to perform a matrix multiply or a convolution from the Unified Buffer into the Accumulators
-Activate
-Performs the nonlinear function of the artificial neuron, with options for ReLU, Sigmoid, and so on
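To show how these five instructions compose, here is a purely hypothetical host-side sketch in Python of driving one fully connected layer. The object, method names, and arguments are invented for exposition; they are not the real TPU ISA encoding or driver API.

def run_fc_layer(tpu, host_in, host_out, weight_addr, n_rows):
    # Hypothetical driver-level sequence mirroring the five CISC instructions above
    tpu.read_host_memory(dst_ub=0, src=host_in, length=n_rows)       # Read_Host_Memory
    tpu.read_weights(src=weight_addr)                                # Read_Weights into the Weight FIFO
    tpu.matrix_multiply(src_ub=0, dst_acc=0, n_rows=n_rows)          # MatrixMultiply into the Accumulators
    tpu.activate(src_acc=0, dst_ub=1, func="relu", n_rows=n_rows)    # Activate: nonlinearity, Accumulators -> UB
    tpu.write_host_memory(src_ub=1, dst=host_out, length=n_rows)     # Write_Host_Memory

class FakeTPU:
    # Stub that just records the instruction stream (illustration only)
    def __getattr__(self, name):
        return lambda **kw: print(name, kw)

run_fc_layer(FakeTPU(), host_in=0x1000, host_out=0x2000, weight_addr=0x0, n_rows=256)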

Circuit Board

Performance Comparisons

Roofline Model
-Y-axis: attainable performance (operations/second)
-X-axis: arithmetic (operational) intensity: how many operations per byte fetched from memory?
-Attainable performance is the minimum of the compute ceiling and memory bandwidth times intensity (sketch below)
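A small Python sketch of the roofline formula, using the TPU's ~92 TOPS peak from the architecture slide and the paper's ~34 GB/s weight-memory bandwidth; the intensity values are illustrative, not the paper's measured workloads.

def roofline(intensity_ops_per_byte, peak_ops_per_s, mem_bw_bytes_per_s):
    # Attainable performance = min(compute ceiling, bandwidth * arithmetic intensity)
    return min(peak_ops_per_s, mem_bw_bytes_per_s * intensity_ops_per_byte)

peak = 92e12        # ~92 TOPS peak (from the TPU Architecture slide)
bw = 34e9           # ~34 GB/s DDR3 weight-memory bandwidth (figure from the paper)
for intensity in (10, 100, 1000, 10000):        # operations per byte fetched
    print(intensity, roofline(intensity, peak, bw) / 1e12, "TOPS attainable")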

TPU Roofline
-Very high peak performance
-Bottlenecked by memory bandwidth

Haswell (CPU) Die Roofline
-Lower peak performance
-More memory bandwidth
-The neural nets are not as close to the top as with the TPU

K80 (GPU) Die Roofline
-Higher memory bandwidth than the CPU
-The neural nets are far from their roofline

Relative Performance Table

Performance/Watt Comparisons
-"Total" counts host server power plus accelerator power; "incremental" counts only the accelerator's own power (see the sketch below)
-GPU vs CPU: 1.2X-2.1X total performance/Watt
-GPU vs CPU: 1.7X-2.9X incremental performance/Watt
-TPU vs CPU: 17X-34X total performance/Watt
-TPU vs GPU: 14X-16X total performance/Watt
-TPU vs CPU: 41X-83X incremental performance/Watt
-TPU vs GPU: 25X-29X incremental performance/Watt
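The total-vs-incremental distinction is just about whose power is counted. A Python sketch with placeholder numbers (the throughput and host power here are illustrative, not the paper's measurements; the 40W TPU figure is from the energy-proportionality slide):

def perf_per_watt(throughput, accel_power, host_power, incremental=False):
    # Total: throughput / (host + accelerator) power. Incremental: accelerator power only.
    power = accel_power if incremental else accel_power + host_power
    return throughput / power

throughput = 1.0e6      # inferences/second (placeholder)
host_power = 300.0      # Watts drawn by the host server (placeholder)
tpu_power = 40.0        # Watts per TPU die (from the next slide)
print(perf_per_watt(throughput, tpu_power, host_power))                     # total perf/Watt
print(perf_per_watt(throughput, tpu_power, host_power, incremental=True))   # incremental perf/Watt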

Energy Proportionality
-TPU has the lowest power: 40W per die
-But poor energy proportionality: at 10% load, the TPU uses 88% of the power it uses at 100% load
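A quick calculation shows why the 88%-at-10%-load figure hurts: an ideally energy-proportional part would draw power in proportion to its load, keeping energy per inference flat instead of ballooning at low utilization. The 40W value is from this slide; the rest is illustrative arithmetic.

full_load_power = 40.0                          # W per TPU die at 100% load (from this slide)
low_load_power = 0.88 * full_load_power         # ~35 W measured at 10% load
ideal_low_load_power = 0.10 * full_load_power   # 4 W if power scaled with load

# Energy per unit of work at 10% load, relative to full load:
print(low_load_power / (0.10 * full_load_power))        # ~8.8x more energy per inference
print(ideal_low_load_power / (0.10 * full_load_power))  # 1.0x for an ideally proportional part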

Summary
-Inference apps usually emphasize response time over throughput since they are often user-facing.
-Because of these latency limits, the K80 GPU is only a little faster for inference than the Haswell CPU, despite its much higher peak performance and memory bandwidth.
-While most architects are accelerating CNNs, they are just 5% of Google's datacenter NN workload.
-The TPU is about 15X-30X faster at inference than the K80 GPU and the Haswell CPU.

Summary (contd.)
-Four of the six NN apps tested are memory-bound; if the TPU were revised to have the same memory system as the K80 GPU, it would be about 30X-50X faster than the GPU and CPU.
-Despite being a much smaller and lower-power chip, the TPU has 25 times as many multiply-accumulators and 3.5 times as much on-chip memory as the K80 GPU.
-The performance per Watt of the TPU is 30X-80X that of its contemporary CPUs and GPUs; a revised TPU with K80 memory would be 70X-200X better.

Resources
Link to paper: https://www.cse.wustl.edu/~roger/566S.s21/P1-Norman-1.pdf
Link to Dave Patterson talk: Dave Patterson Evaluation of the Tensor Processing Unit

Thank you for listening!

Questions?