In-Datacenter Performance Analysis of a Tensor Processing Unit


About This Presentation

In-Datacenter Performance Analysis of a Tensor Processing Unit


Slide Content

In-Datacenter Performance Analysis
of a Tensor Processing Unit
By NP Jouppi et al.
Presented by Alex Appel

Note: Some slides adapted from Dave Patterson's talk of the same title at the EECS Colloquium

Agenda
-Introduction/Motivation
-Architecture
-Performance Comparisons
-Main highlights/Summary
-Questions

Origin of Tensor Processing Unit
-Projection: if people used voice search for 3 minutes a day, it would double Google's datacenter computation demands
-Domain-specific architecture is the solution
-Goal: run the inference phase at 10X the cost-performance of GPUs
-Very short development cycle: ~15 months

Key Neural Net Concepts
-Training (learning) in development vs Inference (prediction) in production
-Batch size
-Amortize weight-fetch time by inferring (or training) many input examples at a time
-Quantization
-Floating point is flexible, but costs much more energy and time than 8-bit integer arithmetic
-Train in floating point on GPUs; quantize and run inference in integers (see the sketch below)
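A minimal sketch of the quantization idea in Python, assuming simple symmetric linear quantization (the paper's exact scheme is not detailed on this slide): float weights and activations are mapped to 8-bit integers, and the multiply-accumulates run in integer arithmetic.

import numpy as np

def quantize_int8(x):
    # Symmetric linear quantization of a float array to int8 (illustrative only)
    scale = np.max(np.abs(x)) / 127.0              # one scale factor for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Float weights/activations as produced by training
w = np.random.randn(256, 256).astype(np.float32)
a = np.random.randn(256).astype(np.float32)

wq, w_scale = quantize_int8(w)
aq, a_scale = quantize_int8(a)

# Integer multiply-accumulate; accumulate in int32 since 8-bit products overflow int8
acc = wq.astype(np.int32) @ aq.astype(np.int32)
approx = acc * (w_scale * a_scale)                 # rescale the result back to float
print(np.max(np.abs(approx - w @ a)))              # small quantization error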

3 Types of NNs Represent 95% of Google Inference Workload
-Multi-Layer Perceptrons (MLP)
-Each new layer is a set of nonlinear functions of a weighted sum of all outputs from the prior layer ("fully connected"); see the one-layer sketch after this list
-Convolutional Neural Networks (CNN)
-Popular for vision; each layer is a set of nonlinear functions of weighted sums at different coordinates of spatially nearby subsets of outputs from the prior layer, which allows the weights to be reused
-Recurrent Neural Networks (RNN) / "Long Short-Term Memory" (LSTM)
-Each subsequent layer is a collection of nonlinear functions of weighted sums of outputs and the previous state
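As a concrete illustration of the "nonlinear function of a weighted sum" phrasing, here is a minimal NumPy sketch of one fully connected (MLP) layer. The layer sizes and the choice of ReLU are illustrative, not taken from the paper.

import numpy as np

def mlp_layer(x, weights, bias):
    # One fully connected layer: a nonlinear function (ReLU) of a weighted sum
    z = weights @ x + bias          # weighted sum of all outputs from the prior layer
    return np.maximum(z, 0.0)       # elementwise nonlinearity (ReLU)

x = np.random.randn(256)            # outputs of the prior layer
w = np.random.randn(128, 256)       # one weight per (output, input) pair: "fully connected"
b = np.zeros(128)
y = mlp_layer(x, w, b)              # becomes the input to the next layer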

Inference Datacenter Workload (95%)

TPU Architecture
-Matrix Unit: 65,536 (256x256) 8-bit multiply-accumulate units
-700 MHz clock rate
-Peak: 92 trillion operations/second (see the arithmetic check below)
->25X the multiply-accumulate units of a GPU
->100X the multiply-accumulate units of a CPU
-4 MiB of on-chip Accumulator memory
-24 MiB of on-chip Unified Buffer (activation memory)
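The 92-TOPS peak follows directly from the numbers on this slide: each multiply-accumulate counts as two operations per cycle, so a quick back-of-the-envelope check in Python is:

macs = 256 * 256          # 65,536 8-bit multiply-accumulate units
clock_hz = 700e6          # 700 MHz clock rate
ops_per_mac = 2           # one multiply plus one add per MAC per cycle
peak_ops = macs * ops_per_mac * clock_hz
print(peak_ops / 1e12)    # ~91.8 trillion operations/second, i.e. ~92 TOPS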

TPU Chip Floorplan (share of die area)
-Unified Buffer: 29%
-Matrix Multiply Unit: 24%
-Control: 2%

Main CISC Instructions

-Read_Host_Memory
-Reads data from the CPU host memory into the Unified Buffer (UB)
-Write_Host_Memory
-Writes data from the Unified Buffer into the CPU host memory
-Read_Weights
-Reads weights from Weight Memory into the Weight FIFO as input to the Matrix Unit
-MatrixMultiply/Convolve
-Causes the Matrix Unit to perform a matrix multiply or a convolution from the Unified Buffer into the Accumulators
-Activate
-Performs the nonlinear function of the artificial neuron, with options for ReLU, Sigmoid, and so on
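To show how these five instructions compose, here is a purely hypothetical host-side sketch in Python of driving one fully connected layer. The object, method names, and arguments are invented for exposition; they are not the real TPU ISA encoding or driver API.

def run_fc_layer(tpu, host_in, host_out, weight_addr, n_rows):
    # Hypothetical driver-level sequence mirroring the five CISC instructions above
    tpu.read_host_memory(dst_ub=0, src=host_in, length=n_rows)       # Read_Host_Memory
    tpu.read_weights(src=weight_addr)                                # Read_Weights into the Weight FIFO
    tpu.matrix_multiply(src_ub=0, dst_acc=0, n_rows=n_rows)          # MatrixMultiply into the Accumulators
    tpu.activate(src_acc=0, dst_ub=1, func="relu", n_rows=n_rows)    # Activate: nonlinearity, Accumulators -> UB
    tpu.write_host_memory(src_ub=1, dst=host_out, length=n_rows)     # Write_Host_Memory

class FakeTPU:
    # Stub that just records the instruction stream (illustration only)
    def __getattr__(self, name):
        return lambda **kw: print(name, kw)

run_fc_layer(FakeTPU(), host_in=0x1000, host_out=0x2000, weight_addr=0x0, n_rows=256)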

Circuit Board

Performance Comparisons

Roofline Model
-Y-axis: attainable performance (operations/second)
-X-axis: arithmetic (operational) intensity: how many operations per byte fetched from memory?
-Attainable performance is the minimum of the compute ceiling and memory bandwidth times intensity (sketch below)
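A small Python sketch of the roofline formula, using the TPU's ~92 TOPS peak from the architecture slide and the paper's ~34 GB/s weight-memory bandwidth; the intensity values are illustrative, not the paper's measured workloads.

def roofline(intensity_ops_per_byte, peak_ops_per_s, mem_bw_bytes_per_s):
    # Attainable performance = min(compute ceiling, bandwidth * arithmetic intensity)
    return min(peak_ops_per_s, mem_bw_bytes_per_s * intensity_ops_per_byte)

peak = 92e12        # ~92 TOPS peak (from the TPU Architecture slide)
bw = 34e9           # ~34 GB/s DDR3 weight-memory bandwidth (figure from the paper)
for intensity in (10, 100, 1000, 10000):        # operations per byte fetched
    print(intensity, roofline(intensity, peak, bw) / 1e12, "TOPS attainable")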

TPU Roofline
-Very high peak performance
-Bottlenecked by memory bandwidth

Haswell (CPU) Die Roofline
-Lower peak performance
-More memory bandwidth
-The neural nets are not as close to the top as with the TPU

K80 (GPU) Die Roofline
-Higher memory bandwidth than the CPU
-The neural nets are far from their roofline

Relative Performance Table

Performance/Watt Comparisons
-"Total" counts host server power plus accelerator power; "incremental" counts only the accelerator's own power (see the sketch below)
-GPU vs CPU: 1.2X-2.1X total performance/Watt
-GPU vs CPU: 1.7X-2.9X incremental performance/Watt
-TPU vs CPU: 17X-34X total performance/Watt
-TPU vs GPU: 14X-16X total performance/Watt
-TPU vs CPU: 41X-83X incremental performance/Watt
-TPU vs GPU: 25X-29X incremental performance/Watt
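The total-vs-incremental distinction is just about whose power is counted. A Python sketch with placeholder numbers (the throughput and host power here are illustrative, not the paper's measurements; the 40W TPU figure is from the energy-proportionality slide):

def perf_per_watt(throughput, accel_power, host_power, incremental=False):
    # Total: throughput / (host + accelerator) power. Incremental: accelerator power only.
    power = accel_power if incremental else accel_power + host_power
    return throughput / power

throughput = 1.0e6      # inferences/second (placeholder)
host_power = 300.0      # Watts drawn by the host server (placeholder)
tpu_power = 40.0        # Watts per TPU die (from the next slide)
print(perf_per_watt(throughput, tpu_power, host_power))                     # total perf/Watt
print(perf_per_watt(throughput, tpu_power, host_power, incremental=True))   # incremental perf/Watt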

Energy Proportionality
-TPU has the lowest power: 40W per die
-But poor energy proportionality: at 10% load, the TPU uses 88% of the power it uses at 100% load
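A quick calculation shows why the 88%-at-10%-load figure hurts: an ideally energy-proportional part would draw power in proportion to its load, keeping energy per inference flat instead of ballooning at low utilization. The 40W value is from this slide; the rest is illustrative arithmetic.

full_load_power = 40.0                          # W per TPU die at 100% load (from this slide)
low_load_power = 0.88 * full_load_power         # ~35 W measured at 10% load
ideal_low_load_power = 0.10 * full_load_power   # 4 W if power scaled with load

# Energy per unit of work at 10% load, relative to full load:
print(low_load_power / (0.10 * full_load_power))        # ~8.8x more energy per inference
print(ideal_low_load_power / (0.10 * full_load_power))  # 1.0x for an ideally proportional part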

Summary
-Inference apps usually emphasize response time over throughput since they are often user-facing.
-Because of these latency limits, the K80 GPU is only a little faster for inference than the Haswell CPU, despite its much higher peak performance and memory bandwidth.
-While most architects are accelerating CNNs, they are just 5% of Google's datacenter NN workload.
-The TPU is about 15X-30X faster at inference than the K80 GPU and the Haswell CPU.

Summary (contd.)
-Four of the six NN apps tested are memory-bound; if the TPU were revised to have the same memory system as the K80 GPU, it would be about 30X-50X faster than the GPU and CPU.
-Despite being a much smaller and lower-power chip, the TPU has 25 times as many multiply-accumulators and 3.5 times as much on-chip memory as the K80 GPU.
-The performance per Watt of the TPU is 30X-80X that of its contemporary CPUs and GPUs; a revised TPU with K80 memory would be 70X-200X better.

Resources
Link to paper: https://www.cse.wustl.edu/~roger/566S.s21/P1-Norman-1.pdf
Link to Dave Patterson talk: Dave Patterson Evaluation of the Tensor Processing Unit

Thank you for listening!

Questions?