Cerebras AI Day Deck :: A closer look at the world’s fastest AI Chip

deniztortop · 1,023 views · 147 slides · Mar 21, 2024

About This Presentation

At Cerebras AI Day, we unveiled the next chapter of the Cerebras AI platform, new state-of-the-art AI models, and our latest AI supercomputers.
> Cerebras announces CS-3, the world’s fastest AI Chip with a whopping 4 trillion transistors
> Cerebras selects Qualcomm to deliver unprecedented p...


Slide Content

© 2024 Cerebras Systems Inc. All Rights Reserved

Andrew Feldman
CEO & Co-Founder Cerebras

© 2024 Cerebras Systems Inc. All Rights Reserved
AI Has Fundamentally Changed Computing
AI Supercomputers vs. x86 Servers

© 2024 Cerebras Systems Inc. All Rights Reserved
There’s a vast chasm in
AI capabilities

AI Developers Are Struggling with
Distributed GPU Training

© 2024 Cerebras Systems Inc. All Rights Reserved
“It can be a frustrating daily life experience of training large models… You're there carefully monitoring the vital signs of your run: loss spikes, numerical issues, throughput, gradient norms, policy entropy, etc... or 10,000 GPUs could be idling.”
Co-Founder, OpenAI

© 2024 Cerebras Systems Inc. All Rights Reserved
Co-founder, Reka AI
Former Google Brain Scientist,
“Multi-node GPU training is more of
an afterthought as opposed to
distributed training as a first class
citizen…it’s a hardware lottery."

© 2024 Cerebras Systems Inc. All Rights Reserved
“Building large scale training clusters from scratch and achieving high MFU and reliability is damn hard”
Senior Foundation Model Engineer, Uber

GPT-1
120M Parameters
4 Contributors

GPT-4
1.7T Parameters
240+ contributors
35 just for distributed training
& supercomputing

© 2024 Cerebras Systems Inc. All Rights Reserved
Large Models Simply Don’t Fit on GPUs
ChatGPT (28TB)
H100 (80GB)

© 2024 Cerebras Systems Inc. All Rights Reserved
Developers must cut the model into many pieces..

© 2024 Cerebras Systems Inc. All Rights Reserved
And spread them on hundreds of GPUs

© 2024 Cerebras Systems Inc. All Rights Reserved
Then re-write the model to work across a cluster
An ML problem just turned into a parallel programming problem.
A hardware problem just became a supercomputer problem.

© 2024 Cerebras Systems Inc. All Rights Reserved
This causes a code explosion
nanoGPT: 1B parameters, 639 lines of code
Megatron: 100B parameters, 20,507 lines of code

© 2024 Cerebras Systems Inc. All Rights Reserved
You never have to do this on Cerebras

© 2024 Cerebras Systems Inc. All Rights Reserved
The Cerebras Way
Build a compute & memory system that’s vastly larger than the model
Cerebras CS-3 = 1,200 TB
ChatGPT

© 2024 Cerebras Systems Inc. All Rights Reserved
4 trillion transistors
46,225 mm² of silicon
900,000 cores optimized for sparse linear algebra
5nm TSMC process
125 petaflops of AI compute
44 gigabytes of on-chip memory
21 PByte/s memory bandwidth
214 Pbit/s fabric bandwidth
Cerebras Wafer-Scale Engine
The fastest AI chip on Earth, again

© 2024 Cerebras Systems Inc. All Rights Reserved

© 2024 Cerebras Systems Inc. All Rights Reserved

© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras Wafer Scale Engine 3 versus the H100
Cerebras WSE-3: 4 trillion transistors, 46,225 mm² of silicon
Largest GPU: 80 billion transistors, 814 mm² of silicon

© 2024 Cerebras Systems Inc. All Rights Reserved

© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras CS-3

© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3 · SwarmX · MemoryX
Wafer Scale Cluster: The World's Most Scalable AI Supercomputer
From 1 CS-3 (1 terabyte, 125 petaflops, 1 billion parameters) to 2048 CS-3s (1 petabyte, 256 exaflops, 24 trillion parameters)

© 2024 Cerebras Systems Inc. All Rights Reserved
Exa-scale Performance

© 2024 Cerebras Systems Inc. All Rights Reserved
Single Device Simplicity
MemoryX Memory Units
SwarmX Interconnect
Wafer Scale Engines
1 to 2048 CS-3s Look and Program Like a Single Device

© 2024 Cerebras Systems Inc. All Rights Reserved

© 2024 Cerebras Systems Inc. All Rights Reserved

© 2024 Cerebras Systems Inc. All Rights Reserved

© 2024 Cerebras Systems Inc. All Rights Reserved
Condor Galaxy 2
Stockton, California

© 2024 Cerebras Systems Inc. All Rights Reserved
Condor Galaxy 3 AI Supercomputer
64
CS-3 nodes
58 million
AI cores
8 exaFLOPS
FP16 AI compute
108 TB
Parameter memory
388 Tbps
On-chip bandwidth
Dallas, Texas

© 2024 Cerebras Systems Inc. All Rights Reserved
AI Supercomputers
Built & Operated in the United States
•Condor Galaxy 1 (Santa Clara, CA): 4 ExaFLOPs, 64x CS-2s, 82 TB of memory. ONLINE
•Condor Galaxy 2 (Stockton, CA): 4 ExaFLOPs, 64x CS-2s, 82 TB of memory. ONLINE
•Condor Galaxy 3 (Dallas, TX): 8 ExaFLOPs, 64x CS-3s, 108 TB of memory. Q2 2024

© 2024 Cerebras Systems Inc. All Rights Reserved
CEO of Microsoft
Satya Nadella
•JAIS: 30B parameter, bilingual Arabic-English model
•Microsoft's core LLM offering in the Middle East
•Available on Azure
Cerebras & G42
World leading Arabic LLM

© 2024 Cerebras Systems Inc. All Rights Reserved
“Mayo Clinic selected Cerebras
as its first generative AI
collaborator for its large-scale,
domain-specific AI expertise to
accelerate breakthrough insights
for the benefit of patients.”
Cerebras & Mayo Clinic
Breakthrough insights for the
benefit of patients
Medical Director for Strategy at Mayo Clinic
Dr. Matthew Callstrom

© 2024 Cerebras Systems Inc. All Rights Reserved
“When the largest problem is
solved, a speedup of 228x is
achieved... Moreover…it is unlikely
that such a performance gap can
be closed… given the strong
scalability issues encountered by
this kind of algorithm when using a
large number of multi-GPU nodes
in HPC clusters.”
Cerebras & TotalEnergies
Diego Klahr, VP of Engineering at TotalEnergies

© 2024 Cerebras Systems Inc. All Rights Reserved
A Cerebras cluster with 48 systems exceeded the performance of the world's #1 supercomputer, 'Frontier', with its 37,000 GPUs, at a 100x cost saving.
Cerebras & KAUST
Tony Chan
President, KAUST

© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras CS-3 Architecture Deep Dive
Sean Lie, CTO and Co-Founder, Cerebras

© 2024 Cerebras Systems Inc. All Rights Reserved
•2x performance
•Same power
•Same price
Cerebras CS-3: A Generational Leap for AI
LLM Training Performance

© 2024 Cerebras Systems Inc. All Rights Reserved
•Building on the tried-and-true WSE-2 core…
WSE-2 core: 4-way 16b SIMD · 48 kB SRAM memory · 256 B cache · fabric interface · registers: 16 general purpose, 44 data structure

© 2024 Cerebras Systems Inc. All Rights Reserved
WSE-3 Core
Continuing Distributed AI Architecture Leadership
Improved performance for AI compute
•New higher-performance tensor operations
•New 8-way SIMD for 16b data (FP/BF16)
•New 16-way SIMD for 8b data (Fixed/INT8)
•New faster non-linear functions
•2x higher compute performance per core
High-bandwidth memory and cache
•48 kB memory per core
•New 512 B local cache per core
•Full bandwidth for full SIMD performance
WSE-3 core: 8-way 16b SIMD, 16-way 8b SIMD · 48 kB SRAM · 512 B cache · fabric interface · registers: 16 general purpose, 48 data structure

© 2024 Cerebras Systems Inc. All Rights Reserved
From Small Core to Massive Wafer
Core → Die (10.7k cores) → WSE-3 (84 die, 900k cores)

© 2024 Cerebras Systems Inc. All Rights Reserved
Uniquely capable of wafer-scale integration
•Invented process in first generation WSE
•Extended to 5nm in collaboration with
TSMC
Co-designed from ground up
•Uniform architecture with built-in
redundancy
•Extending uniform fabric across die
•Wafer behaves as single massive chip
WSE-3 Interconnect
Enabling the Only Wafer Scale Chip in the World

© 2024 Cerebras Systems Inc. All Rights Reserved
WSE-3 Interconnect
Enabling the Biggest Chip in the World
Traditional interconnect (serial links across connectors, PCBs, and cables) vs. Wafer Scale Engine (parallel links across <1 mm of silicon):
| | Each H100 | 8x H100 | Each WSE-3 die | 84x die |
| Bandwidth | 900 GB/s (36x 100 Gb/s serial) | 7.2 TB/s (288x 100 Gb/s serial) | 2,880 GB/s (480x 24 Gb/s parallel) | 242 TB/s (40,320x 24 Gb/s parallel) |
| Power | 36 W | 288 W | 1.1 W | 92 W |
| Energy | 5.0 pJ/bit | | 0.05 pJ/bit | |
10x more die · 33x more bandwidth · 100x more power efficient
*GPU estimate uses 5nm 100G serdes power with Nvidia H100 NVLink bandwidth

© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3 System: Purpose Built for Wafer-Scale

© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3 vs. GPU
Orders of Magnitude Performance Advantage
| | Cerebras CS-3 | Nvidia H100 | Cerebras advantage |
| Chip size | 46,225 mm² | 814 mm² | 57x |
| Cores | 900,000 | 16,896 FP32 + 528 Tensor | 52x |
| On-chip memory | 44 gigabytes | 0.05 gigabytes | 880x |
| Memory bandwidth | 21 petabytes/sec | 0.003 petabytes/sec | 7,000x |
| Fabric bandwidth | 214 petabits/sec | 0.0576 petabits/sec | 3,715x |
Enabling large scale training
Finetune LLaMA 70B on 1B tokens in a day
on a single chip

© 2024 Cerebras Systems Inc. All Rights Reserved
Cluster natively operates as single device
WSE-3 is big enough to run largest models
•Enables compute and memory
disaggregation
•Train with data-parallel only scaling
Architect cluster-level memory and compute
•External memory stores model weights
•Untangle memory and compute
dependency
CS-3 Cluster
Designed as Single ML Accelerator

SwarmX Interconnect
MemoryX Memory Units
Wafer Scale Engines

© 2024 Cerebras Systems Inc. All Rights Reserved
Model capacity not limited by device
•Weights streamed onto wafer to compute
layer
•Weights trigger compute using HW
dataflow
•Weights are never stored on wafer
Decoupling weight optimizer compute
•Gradients streamed out of wafer
•Weight update occurs in MemoryX
MemoryX External Memory
Virtually Unlimited Model Weight Capacity
Memory hierarchy capable of massive models on single device
[Diagram: MemoryX (weight memory + optimizer compute) streams weights to the CS-3 and receives gradients back]
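The weight-streaming flow above can be sketched in a few lines of Python. This is a toy illustration of the execution model only; the class and function names (MemoryXSim, stream_weights, apply_gradients) are invented for the sketch and are not Cerebras APIs.

```python
# Toy illustration of the weight-streaming execution model described above.
# All names here are invented for the sketch; they are not Cerebras APIs.
import numpy as np

class MemoryXSim:
    """Stand-in for MemoryX: holds weights and runs the optimizer off-wafer."""
    def __init__(self, layer_shapes, lr=1e-3):
        self.weights = [np.random.randn(*s) * 0.01 for s in layer_shapes]
        self.lr = lr

    def stream_weights(self, i):
        return self.weights[i]                # layer i's weights stream onto the wafer

    def apply_gradients(self, i, grad_w):
        self.weights[i] -= self.lr * grad_w   # weight update happens in MemoryX

def train_step(memx, batch):
    acts, cache = batch, []
    # Forward pass: weights arrive layer by layer; activations stay "on wafer".
    for i in range(len(memx.weights)):
        w = memx.stream_weights(i)
        cache.append((acts, w))
        acts = np.maximum(acts @ w, 0.0)      # ReLU layer standing in for real kernels
    # Backward pass: gradients stream off the wafer as soon as they are produced.
    grad_out = acts                           # pretend loss gradient w.r.t. final activations
    for i in reversed(range(len(memx.weights))):
        a, w = cache[i]
        grad_pre = grad_out * (a @ w > 0)     # backprop through the ReLU
        memx.apply_gradients(i, a.T @ grad_pre)
        grad_out = grad_pre @ w.T

train_step(MemoryXSim([(64, 64), (64, 64)]), np.random.randn(8, 64))
```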

© 2024 Cerebras Systems Inc. All Rights Reserved
Data-parallel only training across CS-3s
•Weights are broadcast to all CS-3s
•Gradients are reduced on way back
Multi-system scaling with the same
execution model as single system
•Same system architecture
•Same network execution flow
•Same software user interface
SwarmX Fabric
Purpose Built Interconnect for Simple Scaling
[Diagram: MemoryX (weight memory + optimizer compute) → SwarmX broadcasts weights to the CS-3s and reduces gradients on the way back]
Scaling to cluster compute while operating like a single device
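The SwarmX pattern is, at its core, one broadcast and one reduction per step. A minimal data-parallel sketch, with plain NumPy standing in for real systems (the function names are illustrative, not Cerebras software):

```python
# Minimal data-parallel sketch of the SwarmX pattern: one weight broadcast out,
# one gradient reduction back. Plain NumPy stands in for real systems.
import numpy as np

def swarmx_step(weights, per_system_batches, compute_grad):
    grads = [compute_grad(weights, batch)        # every CS-3 gets the same weights...
             for batch in per_system_batches]    # ...and its own shard of the data
    return sum(grads) / len(grads)               # gradients are reduced on the way back

# Example: least-squares gradient, eight "systems", identical model state everywhere.
w = np.zeros(4)
batches = [(np.random.randn(16, 4), np.random.randn(16)) for _ in range(8)]
grad_fn = lambda w, b: 2 * b[0].T @ (b[0] @ w - b[1]) / len(b[1])
w -= 0.01 * swarmx_step(w, batches, grad_fn)     # on real hardware the update runs in MemoryX
```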

© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3 Cluster Compute
CS-2 Cluster
192 CS-2 systems
12 exaFLOPS AI Compute

© 2024 Cerebras Systems Inc. All Rights Reserved
•2048 CS-3
in single cluster
•256 exaFLOPS
AI Compute
•Programs like a
single device
CS-3 Cluster Compute
Supercomputer Performance, Single Device Experience

© 2024 Cerebras Systems Inc. All Rights Reserved
SwarmX
Scalable spine-leaf topology
•Standard-based 400/800G
Ethernet
•Performance and cost effective
•RDMA for low overhead and
latency
Scaling to 256 exaFLOPS
Purpose Built Scalable Network for AI Training
Cluster options:
| | CS-2 cluster | CS-3 cluster |
| Cluster size | 192 systems | 2048 systems |
| Link speed | 100 Gb/s | 400 / 800 Gb/s |
| Cluster bandwidth | 1 Pb/s | 10 Pb/s |

© 2024 Cerebras Systems Inc. All Rights Reserved
Train Today's SOTA Models in Hours or Days
LLaMA 70B training: ~1 month on Meta's GPU cluster vs. ~1 day on a Cerebras CS-3 cluster

© 2024 Cerebras Systems Inc. All Rights Reserved
Train Today's SOTA Models in Hours or Days
LLaMA 70B training: ~1 month on Meta's GPU cluster vs. ~1 day on a Cerebras CS-3 cluster
But the CS-3 cluster operates like a single device

© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3 Cluster Memory
Memory SKUs (CS-2 options):
| Memory (TByte) | 1.5 | 12 |
| Parameters (Billion) | 30 | 240 |

© 2024 Cerebras Systems Inc. All Rights Reserved
MemoryX: The First Petabyte-Scale AI Memory System
100x larger models, 24 trillion parameters
CS-3 MemoryX options (Enterprise and Hyperscale SKUs):
| Memory (TByte) | 1.5 | 12 | 24 | 36 | 120 | 1,200 |
| Parameters (Billion) | 30 | 240 | 480 | 720 | 2,400 | 24,000 |
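A quick consistency check on the SKU table above: every memory/parameter pair works out to roughly 50 bytes per parameter. The ratio itself comes straight from the table; reading it as full-precision weights plus optimizer state held off-wafer is my interpretation.

```python
# Memory (TB) -> parameters (billions) pairs from the MemoryX SKU table above.
skus = {1.5: 30, 12: 240, 24: 480, 36: 720, 120: 2_400, 1_200: 24_000}
for tb, b_params in skus.items():
    bytes_per_param = tb * 1e12 / (b_params * 1e9)
    print(f"{tb:>7} TB for {b_params:>6}B params -> {bytes_per_param:.0f} bytes/param")
# Every SKU lands at ~50 bytes/param (weights + optimizer state + headroom).
```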

© 2024 Cerebras Systems Inc. All Rights Reserved
MemoryX
Compute
State
Efficient hybrid state store
•Weights stored in DDR5 and Flash
•Perf and power/cost efficiency
Flexible compute
•Optimizer and other ops run on
CPU
•General purpose and flexible
•Support for all common ML ops
Enabling Multi-Trillion Parameter Models
Most Scalable and Efficient Model Memory
[Diagram: model weights stored in DRAM and Flash; a CPU runs the model optimizer and other operations]
| | CS-2 MemoryX | CS-3 MemoryX |
| DRAM memory | 12 TB DDR4 (240B params) | 36 TB DDR5 (720B params) |
| Flash memory | | 1.2 PB (24T params) |
| CPU perf | 1x | 2x |

© 2024 Cerebras Systems Inc. All Rights Reserved
Large Cluster Memory on a Single Device

© 2024 Cerebras Systems Inc. All Rights Reserved
Imagine… Train Tomorrow's Trillion+ Parameter Models
LLaMA 1T training: ~1.5 years on 1000s of GPUs vs. ~3 weeks on a Cerebras CS-3 cluster
And the CS-3 cluster still operates like a single device

© 2024 Cerebras Systems Inc. All Rights Reserved
You Program It Like A Single Device, No Matter The Cluster Size
[Diagram: Wafer Scale Clusters of 1x, 4x, and 2048x CS-3 (Wafer Scale Engines + interconnect + memory); in every case the programmer sees one big device]

© 2024 Cerebras Systems Inc. All Rights Reserved
And Your Model Always Fits: 1B or 1T Parameters
[Diagram: Wafer Scale Clusters with 1.5 TB, 36 TB, and 1,200 TB of memory running Llama 7B, Llama 70B, and Llama 700B; each still looks like one big device]

© 2024 Cerebras Systems Inc. All Rights Reserved
Real world seamless cluster scaling
•User: G42
•Model: Jais30B
•Cluster: Condor Galaxy-1
•Experience: “It just worked”
•No complex distributed software
•No changes to parallelism model
•No changes to hyper-parameters
Training SOTA large models everyday
•Unique capability enabled by wafer-scale
[Chart: Jais30B measured training speedup on CG-1: relative speedup (x factor) vs. number of CS-2s, near-linear from 0 to 64 systems]
Resulting in Near Linear Scaling
Any Scale While Operating as a Single Device

© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras Design Philosophy:
Massive Compute + Memory for Large Scale Models
GPU (traditional):
•External chip interconnect
•Low-perf, high-power connections
•Custom proprietary switches
•Complex distributed software
•Hybrid model-parallel partitioning
Wafer Scale Engine:
•On-chip interconnect
•"Free" high-perf communication
•Big enough to run the largest models
•Simple data-parallel-only scaling
•Disaggregated compute and memory

© 2024 Cerebras Systems Inc. All Rights Reserved
But we can and need to do even better…

© 2024 Cerebras Systems Inc. All Rights Reserved
40,000x more compute
In just 5 years
Current trajectory is unsustainable
We must find more efficient
methods
Sparsity is the key
But We Can and Need to Do Even Better
Sparsity Solves the Explosive Cost of Gen AI
[Chart: exaFLOPs of training compute to train vs. year, 2018-2024, log scale from 100 to 100,000,000 exaFLOPs: BERT, GPT-2, Megatron-LM, T5, T-NLG, GPT-3, Jurassic, Gopher, MT-NLG, Chinchilla, LLaMA, GPT-4]

© 2024 Cerebras Systems Inc. All Rights Reserved
Sparsity opportunities are everywhere
•Neural networks have native sparsity
•e.g. ReLU or Dropout
•Neural networks can be made sparse
•e.g. sparse weights
•Models are over-parameterized by design
•Training is the act of discovering the important weights
Training dense is wasteful and inefficient
•But not all hardware can take advantage of all forms of sparsity
Neural Networks are Sparse
Sparsity
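To make the point concrete, here is a toy NumPy sketch of unstructured magnitude pruning: zeroing the smallest 75% of a layer's weights means hardware that skips zeros also skips roughly 4x the multiply-accumulates. This is an illustration only, not the Cerebras kernel path.

```python
# Toy NumPy sketch of unstructured magnitude pruning (not the Cerebras kernel path).
import numpy as np

def magnitude_prune(w, sparsity=0.75):
    """Zero out the smallest-magnitude fraction of weights (unstructured sparsity)."""
    k = int(w.size * sparsity)
    threshold = np.partition(np.abs(w).ravel(), k)[k]
    return np.where(np.abs(w) < threshold, 0.0, w)

w = np.random.randn(4096, 4096).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.75)

dense_macs = w.size                          # one multiply-accumulate per weight per input row
sparse_macs = np.count_nonzero(w_sparse)     # hardware that skips zeros skips the rest
print(f"nonzero fraction: {sparse_macs / w.size:.2f}")       # ~0.25
print(f"FLOP reduction:   {dense_macs / sparse_macs:.1f}x")  # ~4x at 75% sparsity
```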

© 2024 Cerebras Systems Inc. All Rights Reserved
Memory bandwidth built for sparsity
•Traditional hardware is built for dense: high data reuse → caching → low memory bandwidth
•Wafer-scale memory is built for sparse: low data reuse → caching is ineffective → high memory bandwidth
•Enabled by orders of magnitude more memory bandwidth
CS-3 accelerates all forms of sparsity
•Static and dynamic sparsity
•Structured and unstructured sparsity
Sparsity Acceleration is Memory Bound
Memory bandwidth (Byte/FLOP):
| Required | Dense MatMul ~0.001 | Sparse MatMul ~1 |
| Available | H100 0.003 | WSE-3 2 |
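The required-bandwidth figures in this table fall out of simple reuse arithmetic: a dense matmul reuses each loaded weight across the whole batch, while an unstructured-sparse matmul has essentially no reuse to cache. The reuse factor of 1,000 below is an illustrative assumption, not a measured number.

```python
# Illustrative reuse arithmetic behind the "required" column above.
# Assumes 16-bit weights (2 bytes) and 2 FLOPs (multiply + add) per weight use.

def required_bytes_per_flop(reuse):
    """Bytes fetched per FLOP when each loaded weight is reused `reuse` times."""
    return 2 / (2 * reuse)

print(required_bytes_per_flop(1000))  # dense matmul, weight reused across ~1000 rows -> ~0.001
print(required_bytes_per_flop(1))     # unstructured sparse matmul, ~no reuse -> ~1
```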

© 2024 Cerebras Systems Inc. All Rights Reserved
Accelerating All Forms of Sparse Training
Examples of sparse training opportunities
•Dynamic activation sparsity, e.g. Google: 95% sparse ReLU FFN in LLMs [1]
•Structured weight sparsity, e.g. Mistral: 75% sparse FFN MoE 8x7B [2]
•Unstructured weight sparsity, e.g. Cerebras: 75% sparse SPDF GPT [3]
Solving unsustainable scaling for training
•Only HW to accelerate all forms of sparsity
•Even future sparse techniques
[Chart: FLOP reduction from sparsity (relative FLOPs, dense vs. sparse): ReLU 1.7x, MoE 2.0x, SPDF 2.8x]
[1] Li et al., The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers, 2023
[2] Jiang et al., Mixtral of Experts, 2024
[3] Thangarasa et al., SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models, 2023

© 2024 Cerebras Systems Inc. All Rights Reserved
But sparsity can also transform inference
on a variety of hardware…

© 2024 Cerebras Systems Inc. All Rights Reserved
Neural Magic + Cerebras
Accelerated Inferencing for LLM Optimization
Mark Kurtz
CTO
Neural Magic


© 2024 Cerebras Systems Inc. All Rights Reserved
Who are we?
AI leader in model optimization and inference server acceleration
•200+ accepted papers
•60 patents
•GPTQ
•SparseGPT
•Sparse Fine-Tuning
•nm-vllm
•DeepSparse
•SparseML
As a software-delivered solution, we have deep
expertise across AI model training and optimization.
We invented many of the current AI industry’s state-of-
the-art techniques for quantization and sparsification.
Our solutions include enterprise inference servers to
open-source libraries and a sparsified models repo.
OUR LEADERSHIP
MIT Professor of Electrical
Engineering and Computer
Science, ACM Fellow
Nir Shavit
Co-Founder
MIT Research Scientist of
Multicore Algorithms and
Computational Connectomes
Alex Matveev
Co-Founder
Chief Scientist
Former VP of Product and CTO
of Google Cloud, former CTO
and EVP of Worldwide
Engineering for RedHat
Brian Stevens
CEO of Neural Magic
IST Austria Professor of
Distributed Computing and
Machine Learning
Dan Alistarh
Principal Research Scientist



© 2024 Cerebras Systems Inc. All Rights Reserved
Challenges with LLM deployment
Deploying to production. Issues include:
•Requires lots of compute
•Requires lots of memory
•Increases latency
•Very demanding on inference serving infrastructure
•Expensive to operate and support
Options to resolve:
•Decrease the size of the LLM
•Apply quantization to combat the accuracy issue when model size is reduced
Llama 2 Size vs Accuracy


© 2024 Cerebras Systems Inc. All Rights Reserved
The solution - Sparsity
Unstructured sparsity (before pruning → after pruning):
•Preserves the model's accuracy while reducing the size of the model
•Improves inference and training performance

© 2024 Cerebras Systems Inc. All Rights Reserved
Our research collaboration with Cerebras
Create open-source sparse foundational
models that organizations can easily
deploy and use with faster inference.

© 2024 Cerebras Systems Inc. All Rights Reserved
Our process
Llama 2
2T Tokens
Pretrained
from Meta

© 2024 Cerebras Systems Inc. All Rights Reserved
Our process
Llama 2
2T Tokens
Pretrained from Meta
Sparse Pretraining
Sparse GPT
Sparse Pretraining
on Cerebras
150B Tokens
1.7-2.4X
Reduction in
FLOPS

© 2024 Cerebras Systems Inc. All Rights Reserved
Our process
Llama 2
2T Tokens
Pretrained from Meta
Sparse Pretraining
Sparse GPT
Sparse Pretraining on Cerebras
150B Tokens
Sparse Foundational
Models
Llama 2 7B
Llama 2 7B
70% Sparse
50% Sparse
90%
Accuracy Recovery

© 2024 Cerebras Systems Inc. All Rights Reserved
Our process
Llama 2
2T Tokens
Pretrained from Meta
Sparse Pretraining
Sparse GPT
Sparse Pretraining on Cerebras
150B Tokens
Off the Shelf
Sparse
Fine-Tuning
Quantization
with GPTQ
Sparse Foundational Models
Llama 2 7B
Llama 2 7B
70% Sparse
50% Sparse

© 2024 Cerebras Systems Inc. All Rights Reserved
Our process
Llama 2
2T Tokens
Pretrained from Meta
Sparse Pretraining
Sparse GPT
Sparse Pretraining on Cerebras
150B Tokens
Sparse Foundational Models
Llama 2 7B
Llama 2 7B
70% Sparse
50% Sparse
Off the Shelf
Sparse Fine-Tuning
Quantization with GPTQ
Chat
50%, 70%
Code Generation
50%, 70%

© 2024 Cerebras Systems Inc. All Rights Reserved
Results
Full recovery with 50% and 70% sparse models.
[Charts: Sparsity vs. Accuracy for UltraChat 200k; Sparsity vs. Accuracy for Evol Code Alpaca]

© 2024 Cerebras Systems Inc. All Rights Reserved
Results
4.3X
Memory Reduction
Memory Usage vs Compression Level - Llama 2 7B
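Most of that reduction is simple arithmetic: going from FP32 to INT8 alone shrinks the weights 4x, and storing only the nonzero values of a 70% sparse matrix accounts for the rest. A toy sketch of the quantization half (per-tensor symmetric INT8; not Neural Magic's actual GPTQ pipeline):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(4096, 4096).astype(np.float32)   # toy FP32 weight matrix
q, scale = quantize_int8(w)

print(w.nbytes / q.nbytes)  # 4.0x reduction just from FP32 -> INT8
# Storing only the ~30% nonzero values of a 70% sparse matrix (plus their indices)
# in the serialized model accounts for the additional reduction reported above.
```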


© 2024 Cerebras Systems Inc. All Rights Reserved
Our process
Off the Shelf
Sparse Fine-Tuning
Quantization with GPTQ
Llama 2
2T Tokens
Pretrained from Meta
Sparse Pretraining
Sparse GPT
Sparse Pretraining on Cerebras
150B Tokens
Sparse Foundational Models
Llama 2 7B
Llama 2 7B
70% Sparse
50% Sparse
Chat
50%, 70%
Code Generation
50%, 70%
Fine-Tuning Your Use Case
Sparse Fine-Tuning for a few
hours
Quantization with GPTQ

© 2024 Cerebras Systems Inc. All Rights Reserved
Our process
Off the Shelf
Sparse Fine-Tuning
Quantization with GPTQ
DeepSparse
Llama 2
2T Tokens
Pretrained from Meta
Sparse Pretraining
Sparse GPT
Sparse Pretraining on Cerebras
150B Tokens
Fine-Tuning Your Use Case
Sparse Fine-Tuning for a few
hours
Quantization with GPTQ
Sparse Foundational Models
Llama 2 7B
Llama 2 7B
70% Sparse
50% Sparse
Chat
50%, 70%
Code Generation
50%, 70%

© 2024 Cerebras Systems Inc. All Rights Reserved
Local inference performance
With sparsity, real time chat is now possible on local CPUs.
[Charts: Single Stream Token Generation, Llama 2 7B; Single Stream Latency, Llama 2 7B]

© 2024 Cerebras Systems Inc. All Rights Reserved
Server inference performance
With sparsity, CPU performance is competitive with GPUs.
[Charts: Single Stream Decode Performance, Llama 2 7B; Multi Stream Decode Performance, Llama 2 7B]

© 2024 Cerebras Systems Inc. All Rights Reserved
Comparison
Unoptimized model (Llama 2 7B, FP32): 2 tokens/second
Sparse quantized model (Llama 2 7B, 70% sparse, INT8): 20 tokens/second
Using Neural Magic DeepSparse on an 8-core AMD Genoa CPU
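For orientation, serving one of these sparse-quantized checkpoints on a CPU with DeepSparse looks roughly like the sketch below. The model stub is a made-up placeholder, and exact pipeline arguments may differ between DeepSparse releases; check Neural Magic's documentation before copying this.

```python
# Rough sketch of CPU inference with DeepSparse on a sparse-quantized checkpoint.
# The model stub below is a placeholder and argument names may vary by release;
# consult the DeepSparse documentation for the exact identifiers.
from deepsparse import TextGeneration

pipeline = TextGeneration(
    model="zoo:llama2-7b-ultrachat-pruned70-quantized",  # hypothetical SparseZoo stub
)
output = pipeline("Write a haiku about wafer-scale chips.", max_new_tokens=64)
print(output.generations[0].text)
```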

© 2024 Cerebras Systems Inc. All Rights Reserved
Key takeaways
Takeaway 1: Run SOTA models in real time on just a laptop with Neural Magic DeepSparse (up to 4x faster than llama.cpp)
Takeaway 2: Transform your infrastructure with just software to support LLMs (up to 7x more inference streams per server than llama.cpp at the same performance level)
Takeaway 3: Train sparse models up to 2x faster with Cerebras

© 2024 Cerebras Systems Inc. All Rights Reserved
Next steps
Stay tuned for more collaboration with Cerebras:
•Arxiv paper with our current results
•Larger models
•Higher sparsities
•INT4 quantization support
•Combine with parameter-efficient fine-tuning
Resources: Neural Magic's Hugging Face Organization · Cerebras Blog · Neural Magic Docs

© 2024 Cerebras Systems Inc. All Rights Reserved
Thank you
•@neuralmagic: Follow us to stay current on all things Neural Magic, including product updates, ML research developments, and more.
•Join our Community: Engage with fellow ML practitioners. Ask questions, share feedback, and improve the way you use Neural Magic.
•neural-magic: Connect with Neural Magic to stay up to date with #SoftwareDelivered AI.

© 2024 Cerebras Systems Inc. All Rights Reserved
Models & Product
Jessica Liu, VP of Product, Cerebras

© 2024 Cerebras Systems Inc. All Rights Reserved
The goal of AI training: make the loss curve go down

© 2024 Cerebras Systems Inc. All Rights Reserved

But it’s not so simple...

© 2024 Cerebras Systems Inc. All Rights Reserved
This happens all the time

© 2024 Cerebras Systems Inc. All Rights Reserved
Model performance can vary greatly

© 2024 Cerebras Systems Inc. All Rights Reserved
Challenges of large GenAI training & fine-tuning
Lots of time and cost riding on getting the big run right
Out of memory · GPU failure · Numerics bug · Low utilization
1. Distribution  2. ML Complexity  3. Cost

© 2024 Cerebras Systems Inc. All Rights Reserved
How to get good model quality at scale
Design the Experiments → Run Experiments → Pick Winners → Scale Up
500M, 1.3B

© 2024 Cerebras Systems Inc. All Rights Reserved
How to get good model quality at scale
Design the Experiments → Run Experiments → Pick Winners → Scale Up
1.3B, 3B

© 2024 Cerebras Systems Inc. All Rights Reserved
How to get good model quality at scale
Design the Experiments → Run Experiments → Pick Winners → Scale Up
3B, 7B: good config for 13B, 30B

© 2024 Cerebras Systems Inc. All Rights Reserved
How to get good model quality at scale
Design the Experiments → Run Experiments → Pick Winners → Scale Up
Time / Work: 0.5B → 3B → 13B → 100B

© 2024 Cerebras Systems Inc. All Rights Reserved
How to get good model quality at scale (on GPUs)
Design the Experiments → Run Experiments → Pick Winners → Scale Up
Time / Work:
•0.5B: 1 GPU
•3B: 8 GPUs (data parallelism)
•13B: 256 GPUs (data & tensor & pipeline parallel)
•100B: 2048 GPUs (data & tensor & pipeline & expert & sequence parallelism)

© 2024 Cerebras Systems Inc. All Rights Reserved
You have to micromanage the
distribution strategy:
•Tensor or pipeline model parallelism
•Distributed data parallelism
•Expert parallelism
•Interleaved pipelining schedule
•Activation checkpointing &
recomputation
•Interplay among model size, cluster size,
connectivity between nodes, number of
nodes, etc.
Scaling frameworks still require tons of work
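To make "micromanage the distribution strategy" concrete: the settings below are the kind of interdependent knobs a Megatron-style multi-node launch exposes, expressed here as an illustrative Python config rather than any exact CLI. Change the model size or node count and most of them have to be re-derived together.

```python
# Illustrative (and far from exhaustive) knobs you end up co-tuning for a
# Megatron-style multi-node GPU run. The names are representative of such
# launchers, not an exact CLI.
gpu_distribution_config = {
    "tensor_model_parallel_size": 8,     # split each layer across the GPUs in a node
    "pipeline_model_parallel_size": 16,  # split layers across nodes
    "data_parallel_size": 16,            # replicate the whole pipeline
    "micro_batch_size": 1,
    "global_batch_size": 1024,
    "activation_checkpointing": True,    # trade recompute for memory
    "interleaved_pipeline_schedule": True,
    "sequence_parallel": True,
}

gpus_needed = (gpu_distribution_config["tensor_model_parallel_size"]
               * gpu_distribution_config["pipeline_model_parallel_size"]
               * gpu_distribution_config["data_parallel_size"])
assert gpus_needed == 2048, "parallelism degrees must multiply out to the GPU count"
```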


© 2024 Cerebras Systems Inc. All Rights Reserved
Cut experiment iteration time from weeks to a day
Lines of Code
----------------------------
Python 18395
C/C++ 1118
C++ 649
CUDA 220
HTML 107
Bourne Shell 9
make 7
Markdown 1
Text 1
----------------------------
Total 20507
----------------------------
Lines of Code
----------------------------
Python 565
C/C++ 0
C++ 0
CUDA 0
HTML 0
Bourne Shell 0
make 0
Markdown 0
Text 0
----------------------------
Total 565
----------------------------
Cerebras’ GPT-175B Model
565 lines of code, 1 Day to implement
"GPT-3 in 565 lines of code" Blog
Nvidia’s GPT-175B Model
20,000 lines of code, weeks to implement
Hard to debug

© 2024 Cerebras Systems Inc. All Rights Reserved
How to scale from 1B to 70B on Cerebras
gpt3_1b_params.yaml:
### GPT-3 XL 1.3B
hidden_size: 2048
num_hidden_layers: 24
num_heads: 16

Training:
python run.py \
  --params gpt3_1b_params.yaml \
  --num_steps=100 \
  --model_dir=model_dir

llama2_70b_params.yaml:
### Llama-2 70B
hidden_size: 8192
num_hidden_layers: 80
num_heads: 64

Training:
python run.py \
  --params llama2_70B_params.yaml \
  --num_steps=100 \
  --model_dir=model_dir

© 2024 Cerebras Systems Inc. All Rights Reserved
Scaling from one CS-3 to a cluster is a 1-line change
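As an indicative sketch of what that one-line change looks like: the same run.py invocation from the previous slide, with a single cluster-size setting added. The --num_csx flag name follows Cerebras ModelZoo conventions as I understand them; verify it against the ModelZoo documentation for your release.

```python
# Indicative sketch only: the same training launch, scaled from one CS-3 to a cluster.
# "--num_csx" follows Cerebras ModelZoo conventions as I understand them; verify the
# exact flag against the ModelZoo documentation for your release.
import subprocess

def launch(params_yaml, num_systems=1):
    subprocess.run([
        "python", "run.py",
        "--params", params_yaml,
        "--num_steps=100",
        "--model_dir=model_dir",
        f"--num_csx={num_systems}",   # the one-line change: 1 system vs. a cluster
    ], check=True)

launch("llama2_70B_params.yaml", num_systems=1)   # single CS-3
launch("llama2_70B_params.yaml", num_systems=16)  # cluster: same model code, same YAML
```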

© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras gets you to high-quality large models faster & more cheaply
Design Sweeps → Run Experiments → Pick Winners → Scale Up
Time / Work: 0.5B → 3B → 13B → 100B
On CS-3: data-parallel only, any model size

© 2024 Cerebras Systems Inc. All Rights Reserved
On GPUs, small models are the default;
large models take large engineering effort.
On CS-3s, large models are the default;
small models come for free.

© 2024 Cerebras Systems Inc. All Rights Reserved
Med42: Llama-70B Fine-tuned in <1 Week
to Pass the US Medical License Exam
•Scored 72% on USMLE, beating GPT-3.5
•With M42: global healthcare company
with over 450 hospitals and clinics
•Custom curated healthcare dataset of
peer-reviewed papers, medical
textbooks, international health agency
datasets.
•Run finished in 1 weekend

© 2024 Cerebras Systems Inc. All Rights Reserved
FLOR-6.3B State-of-the-Art Catalan,
Spanish, and English LLM
•Best Catalan model, beating BLOOM-7.3B
•Used latest language adaptation techniques
for languages with less training data
•Reduced inference cost by 10% vs. BLOOM,
incorporating a new, more efficient tokenizer
•Used to build RAG systems for specialized
domains
•Trained on 140B tokens in 2.5 days
•Open Source: Downloaded over 3000 times
FLOR-6.3B

© 2024 Cerebras Systems Inc. All Rights Reserved
JAIS-30B: State-of-the-Art
Arabic-English Bilingual LLM
•SoTA Arabic: Outperforms all other Arabic models
•English: Llama-30B quality in English
•Co-developed with G42’s Core42 and MBZUAI
•Now on Azure AI Cloud as the foundation of their
Model-as-a-Service in the Middle East
Checkpoints on
HuggingFace
Paper available
on Arxiv

© 2024 Cerebras Systems Inc. All Rights Reserved
Challenges
(1) Few high-quality Arabic datasets and preprocessing pipelines
(2) Tokenizers trained on English corpora don't extend well to Arabic
(3) Want the highest quality model with the best cost and compute efficiency
What we did
•Used the latest ML techniques: ALiBi, SwiGLU activation, muP, scaling laws
•Ran many tuning experiments on models of 590M, 1.3B, 2.7B, and 6.7B parameters
•Built a new vocabulary optimized for cross-lingual alignment and trained a custom tokenizer
•Built a new multi-lingual dataset, experimenting with mixes of Arabic-only and Arabic, English, and code to find the optimal mix (1:2:0.4)
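For concreteness, here is how a token budget splits under the 1 : 2 : 0.4 Arabic : English : code mix mentioned above. The 330B budget is an arbitrary example number, not a Jais training detail.

```python
# Splitting an example token budget by the 1 : 2 : 0.4 (Arabic : English : code) mix.
# The 330B budget is an arbitrary illustration, not a Jais training figure.
ratio = {"arabic": 1.0, "english": 2.0, "code": 0.4}
budget_tokens = 330e9

total = sum(ratio.values())
for source, r in ratio.items():
    share = r / total
    print(f"{source:>7}: {share:5.1%} -> {budget_tokens * share / 1e9:6.1f}B tokens")
# arabic ~29.4%, english ~58.8%, code ~11.8% of whatever budget is trained.
```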

© 2024 Cerebras Systems Inc. All Rights Reserved
"I’ve found it really easy to experiment at every model size and scale
on multiple CS systems, which we need to do to get the best results.
There’s no difference between running a job on a single CS versus
multiple ones. All it takes is a small config change, and everything just
works with observable linear speedup!
Launched my first distributed LLM training within the first hour of
logging into a CS cluster for the first time!”
Neha Sengupta, Principal Applied Scientist, Core42

© 2024 Cerebras Systems Inc. All Rights Reserved
Jais-30B-v3 sets a new record for open-source Arabic LLMs, finishes training on 1.3 trillion tokens
Jais-30B outperforms on all common NLP benchmarks in Arabic:
| Model | MMLU | Hellaswag | ARC-C | TruthfulQA |
| Jais-30b-chat | 35.1 | 59.3 | 39.1 | 53.1 |
| acegpt-13b-chat | 31.2 | 49.2 | 35.1 | 48.2 |
| BLOOMz (7.1B) | 31.0 | 38.1 | 30.2 | 48.4 |
| LLaMA (30B) | 28.9 | 33.9 | 26.9 | 48.4 |
| falcon-40b_instruct | 28.6 | 32.1 | 26.4 | 49.3 |

© 2024 Cerebras Systems Inc. All Rights Reserved
The Future is Multimodal

An explosion of exploration in multimodality
Source: Recent advances in Multimodal LLMs

© 2024 Cerebras Systems Inc. All Rights Reserved
•Generalized support for Visual Q&A:
•Multiple vision encoders
•Multiple LLM backbones
•Cross-projection learning
•Multiple modalities to an LLM backbone
•Easy scaling for model size and context length
•Easy to configure many leading literature models
(e.g. LLaVA, AnyMAL, Eyes Wide Shut)
•Dataset: support for quick import of custom datasets
Multimodality is easy on Cerebras
[Diagram: multimodal output from plug & play vision encoders (CLIP, SigLIP, DINOv2) paired with LLM backbones (Llama, Mistral, Zephyr)]
Plug & play vision & LLM backbones
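The plug-and-play pattern above is essentially the LLaVA-style recipe: a vision encoder produces patch features, a small projection maps them into the LLM's embedding space, and the projected tokens are prepended to the text embeddings. A minimal PyTorch sketch of that cross-projection idea (dimensions and module choices are illustrative, not the Cerebras ModelZoo implementation):

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """LLaVA-style bridge: vision patch features -> LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, patch_features):            # (batch, num_patches, vision_dim)
        return self.proj(patch_features)           # (batch, num_patches, llm_dim)

# Toy stand-ins for a CLIP/SigLIP encoder output and the LLM's text embeddings.
batch, patches, vision_dim, llm_dim, text_len = 2, 576, 1024, 4096, 32
patch_features = torch.randn(batch, patches, vision_dim)   # from the vision encoder
text_embeds = torch.randn(batch, text_len, llm_dim)        # from the LLM embedding table

projector = VisionToLLMProjector(vision_dim, llm_dim)
image_tokens = projector(patch_features)
llm_input = torch.cat([image_tokens, text_embeds], dim=1)  # image tokens prepended to text
print(llm_input.shape)  # (2, 608, 4096), fed to the LLM backbone as usual
```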

© 2024 Cerebras Systems Inc. All Rights Reserved
Demo

© 2024 Cerebras Systems Inc. All Rights Reserved
Demo

© 2024 Cerebras Systems Inc. All Rights Reserved
Reproducing state-of-the-art results in just a couple of weeks
7B parameter models:
| Model | GQA | VQA(t) | VQA(v2) | POPE |
| LLaVA 1.5 (7B) | 62.0 | 58.2 | 78.5 | 85.9 |
| Cerebras-LLaVA 1.5 (7B) | 62.3 | 58.2 | 78.5 | 85.3 |
| SGPT4V (7B) | 63.3 | 60.4 | 80.6 | not reported |
| Cerebras-SGPT4V (7B) | 63.5 | 60.8 | 80.7 | 85.7 |
13B parameter models:
| Model | GQA | VQA(t) | VQA(v2) | POPE |
| LLaVA 1.5 (13B) | 63.3 | 61.3 | 80.0 | 85.9 |
| Cerebras-LLaVA 1.5 (13B) | 64.2 | 63.4 | 82.0 | 85.8 |

Improving: the 7B model is competitive with LLaVA 1.5 13B HD (2x larger, with 1.7x higher-resolution image input), a model that came out <2 months ago
| Model | POPE | GQA | VQAt | MME | VQAv2 |
| CS3-LLaVA-7B | 86.7 | 63.9 | 61.5 | 1573 | 81.4 |
| LLaVA 1.5 13B HD | 86.3 | 64.7 | 62.5 | 1500 | 81.8 |

© 2024 Cerebras Systems Inc. All Rights Reserved
Get started quickly with Cerebras ModelZoo
Model code with flexible configuration setup
•Different image encoders:
•CLIP
•SigLIP
•Dino v2
•Different LLM backbones:
•LLaMA
•Mistral
•Zephyr
•Different training recipes:
•LLaMA Pro
•Eyes Wide Shut
•Freezing different parts of the model
Prepared Datasets
•LLAVA 1.5, ShareGPT4V, Instruct4V
•ChartQA, DocVQA, DVQA, ArxivQA, AI2Diagrams
Data pre-processing scripts
•HDF5 file generation support
•Handles mix of multimodal and text-only data
•Optimized for high-throughput training
Easy scaling for model and data
•LLM model size
•Long context lengths
•Image resolution and patch size

© 2024 Cerebras Systems Inc. All Rights Reserved
Model Checkpoints Available on HuggingFace
7B – available now
13B – available now
70B – end of March!

© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras’ goal is to bring
State-of-the-Art AI to
every organization

© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras solutions meet you wherever you need
Cerebras Wafer Scale Clusters
Cerebras Cloud
Cerebras AI Solutions

© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras AI Model Services
GenAI Success with Cerebras ML Experts on
the Fastest, Most Efficient Platform
•Speed: Multi-Billion param models in days to weeks.
•Tailored to you: Custom chatbots, VQA Systems,
Code Completion, Foundation models, and more
•All the latest ML Techniques: RAG, DPO, LoRA,
MuP, data augmentation, and more.
•Total Ownership: Your data, your model weights.

© 2024 Cerebras Systems Inc. All Rights Reserved
Models on Cerebras
From multi-lingual LLMs to healthcare chatbots to code models.

© 2024 Cerebras Systems Inc. All Rights Reserved
All the Latest ML Techniques & Recipes
Variable sequence-length training · DPO · LL360 (open data, models, scripts) · Multi-lingual pre-training & IFT · Llama 70B fine-tuning · Domain adaptation · GPT-3 in 565 lines of code · Most FLOP-efficient LLM dataset · First family of open GPT models and OSS use of muP · RAG · LoRA · MoE · Multimodal · Sparse models

© 2024 Cerebras Systems Inc. All Rights Reserved
The model belongs to you
Your data stays with you

© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras AI Supercomputers: Cloud or On-Prem
Exascale compute with the programmability of a single device

© 2024 Cerebras Systems Inc. All Rights Reserved
AI Applications & Research Panel
Andy Hock, SVP Product & Strategy, Cerebras

Cerebras AI Applications
& Research Panel
Praneetha Elugunti
Mayo Clinic
Jim Culver
GSK
Tim Bishop
Mayo Clinic
Irina Rish
University of Montreal
Andy Hock
Cerebras

Cerebras x
Qualcomm
Fireside Chat with
Rashid Attar, VP of Cloud Computing,
Qualcomm

Cerebras x Qualcomm Technology Partnership
Reducing Inference Cost by 10x
Cerebras CS-3
AI Training
Qualcomm Cloud AI100 Ultra
AI Inference

Jointly optimized software stack for cost-efficient LLMs
| Cerebras Stack | Qualcomm Stack |
| Sparse training | Sparse inference |
| Train in FP16 | Compile & run in MX6 |
| Train large + small models | Apply speculative decoding |
| Network Architecture Search | Compile & run on Ultra AI 100 |
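Of the techniques in that table, speculative decoding is the easiest to illustrate: a small draft model proposes a few tokens cheaply, and the large target model verifies them, accepting the longest matching prefix. A simplified greedy-acceptance sketch (not the Qualcomm or Cerebras implementation):

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=32):
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies.

    target_next / draft_next: callables mapping a token sequence to the next token
    (stand-ins for the large and small models' greedy argmax)."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target model checks each proposed position (one batched pass in practice).
        accepted = 0
        for i in range(k):
            if target_next(tokens + proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        tokens += proposal[:accepted]
        if accepted < k:                       # first mismatch: take the target's own token
            tokens.append(target_next(tokens))
    return tokens

# Toy demo: "models" that just count upward, so the draft always agrees with the target.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1
print(speculative_decode(target, draft, prompt=[0], k=4, max_new=8))
```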

Cerebras x Qualcomm: Up to 10x Inference Performance
[Chart: tokens per dollar relative to baseline: Baseline 1x · Speculative Decoding 1.8x · MX6 Compression 2.2x · Neural Architecture Search 2.5x · Sparsity 2.5x · Total ~10x]

Cerebras x G42
Fireside Chat with
Kiril Evtimov, Group CTO G42 & CEO
Core42

G42 across the Entire AI Value Chain
Customer &
Industry Tailored
Solutions
Data
Centers
Compute
Infrastructure
Cloud
Platforms
AI Model
Development
Cloud &
Enterprise AI
Deployment
Application
Development

The world's largest open-source Arabic LLM: a 30B parameter, bilingual Arabic-English model
476B Arabic tokens · 1.63T total tokens
Trained on the Condor Galaxy 1 and 2 AI Supercomputers