“Meeting the Critical Needs of Accuracy, Performance and Adaptability in Embedded Neural Networks,” a Presentation from Quadric

Published by embeddedvision, Jul 05, 2024 (27 slides)

About This Presentation

For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/07/meeting-the-critical-needs-of-accuracy-performance-and-adaptability-in-embedded-neural-networks-a-presentation-from-quadric/

Aman Sikka, Chief Architect at Quadric, presents the “Meeting the Critical Needs of Accuracy, Performance and Adaptability in Embedded Neural Networks” presentation.


Slide Content

Meeting the Critical Needs
of Accuracy, Performance
and Adaptability in
Embedded Neural Networks
Aman Sikka
Chief Architect
Quadric

Agenda
2
• Challenges in current NPUs (Neural Processing Units)
• Trends in neural networks
• Increased need for DSP-like ops, but the DSP cannot be the fallback
• Back to basics
• Fixed point vs. floating point
• Designing a flexible architecture
• Conclusion
• Q/A
© Quadric

3
Traditional NNs and Hardware

Traditional CNN: Mainly MAC dominated
4
[Diagram: a CNN graph (Conv, Pool, FC, Concat, Softmax layers) partitioned between a hardwired convolution & matrix co-processor with pre-wired pooling & activation units, handling the NPU ops (matrix, pooling, activation), and a CPU/DSP handling the classic algorithm & control ops]

Transformers: Can be heavy on DSP-like compute
5
Every data transfer between the NPU block and the CPU/DSP decreases performance and adds power.

[Diagram: an attention block (Matmul, Shape xform, Softmax, Layer Norm, Linear layers) partitioned between NPU operations (matrix, pooling, activation) and classic algorithm & control ops]

Energy cost of a 32b data element transfer from the ALU/MAC to:
• Reg File: 1×
• LRM: 2-3×
• L2 MEM: 70×
• Off-chip DDR: 225×

Non-NPU ops are here to stay
6
• A SWIN network requires ~77% of the workload to be executed on a programmable device.
Cadence says: “When a SWIN network is executed on an AI computational block that includes an AI hardware accelerator designed for an older CNN architecture, only ~23% of the workload might run on the AI hardware accelerator’s fixed architecture. In this one instance.”
https://www.cadence.com/en_US/home/resources/white-papers/why-a-dsp-is-indispensable-in-the-new-world-of-ai-wp.html

Non-NPU ops are here to stay
7
Data Transformations:
• Reshape
• Transpose
• Shifted window
• Patch creation
• Embeddings lookup
• …
Inference:
• Softmax
• Layer norm
• Group norm
• Instance norm
• Pixel correlation
• Positional encodings
• 2-input matmuls
• Look-up tables
• …
Pre/Post Processing:
• NMS
• ROIAlign
• Noise reduction
• FFT/RFFT
• Equation solver
• Mean subtract
• …
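Several of the inference-stage ops listed above are reductions and element-wise math rather than matrix multiplies, which is why they map poorly onto MAC arrays. A minimal Python sketch (not Quadric code) of two of them, softmax and layer norm, makes that structure visible:

```python
import math

def softmax(xs):
    """Softmax: a max-reduction, exponentials and a sum-reduction -- no MACs."""
    m = max(xs)                                # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def layer_norm(xs, eps=1e-5):
    """Layer norm: mean/variance reductions plus an element-wise rescale."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    inv = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * inv for x in xs]
```

Every step is a scan over the data or a per-element transform; the exponentials, divisions, and square roots are exactly the "DSP-like" compute the slides refer to.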

Where are we headed?
8
• Floating-point operations are becoming more common during inference and can account for a large share of compute.
• Future designs will have multiple DSP cores, CPUs, AI accelerators, and vision accelerators.

9
Back to basics: Float32 vs Fixed Point

Float32 representation
10
IEEE 754 layout: Sign (1 bit) | Exponent (8 bits) | Mantissa/fraction (23 bits)

value = (−1)^sign × 2^(exponent − bias) × 1.mantissa

• Range: refers to the span of values that can be represented. The exponent provides the dynamic range.
• Precision: refers to the ability of a format to distinguish between two close values. The mantissa provides the precision within a range.
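The sign/exponent/mantissa split above can be checked directly by reinterpreting a float's bits. A minimal Python sketch (illustrative only, covering normal numbers; the helper name is ours):

```python
import struct

def float32_fields(x: float):
    """Decompose a float32 into its IEEE 754 sign, exponent and mantissa fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # reinterpret float bits as uint32
    sign = bits >> 31                                    # 1 sign bit
    exponent = (bits >> 23) & 0xFF                       # 8-bit biased exponent
    mantissa = bits & 0x7FFFFF                           # 23-bit fraction
    # Reconstruct a normal number: (-1)^sign * 2^(exponent-127) * 1.mantissa
    value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2**23)
    return sign, exponent, mantissa, value

print(float32_fields(-6.5))   # -6.5 = -1.101b x 2^2 -> sign 1, biased exponent 129
```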

What does floating point offer?
11
• Floating point offers:
• Better dynamic range
• Ease of development, as the user doesn’t need to adjust for precision and range
• But at a significant cost: power consumption!

Do we really need floating point?
12
• Known input and output ranges: in all quantized neural nets the input and output ranges are well known. A calibration dataset can be used to identify them.
• Fixed-range ops: sin, cos, sigmoid, softmax, norm, etc.
• Per-operation range estimates: in almost all neural nets the data ranges across layers and operations can easily be gathered with a calibration dataset.
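Once calibration data pins down a tensor's range, a fixed scale replaces floating point for that tensor. A minimal sketch of symmetric int8 quantization from calibration values (function names are ours, not Quadric's toolchain):

```python
def int8_scale(calibration_values):
    """Derive a symmetric int8 quantization scale from observed calibration data."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0                     # map [-max_abs, max_abs] to [-127, 127]

def quantize(x, scale):
    q = round(x / scale)
    return max(-127, min(127, q))              # clamp to the int8 range

def dequantize(q, scale):
    return q * scale

scale = int8_scale([-2.0, 0.5, 1.9])           # observed range: |x| <= 2.0
```

Any value inside the calibrated range round-trips with at most half a quantization step of error; values outside it saturate, which is why the calibration set must cover the real input distribution.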

Do we really need floating point?
13
• Dynamic fixed point: based on the operations performed internally, one can analyze the math and change the fixed-point precision per calculation.
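One way to read "change the fixed-point precision per calculation": give each operation just enough integer bits for its known range and spend every remaining bit on the fraction. A sketch of that allocation rule (our formulation, assuming a signed word and a known per-op magnitude bound):

```python
import math

def frac_bits(max_abs, word_bits=32):
    """Fractional bits for a signed fixed-point word, given a known value range.

    Allocates enough integer bits to hold max_abs; everything left after the
    sign bit goes to the fraction, maximizing precision for that range.
    """
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1) if max_abs > 0 else 0
    return word_bits - 1 - int_bits
```

For example, an op whose outputs stay below 6.5 in magnitude needs 3 integer bits and so gets 28 fractional bits, while a sigmoid output (bounded by 1) gets 30.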

Fixed point 32 representation
14
Layout: Sign (1 bit) | Integer bits (31−p bits) | Mantissa/fraction (p bits)
Fixed32<FracBits(p)>
• Range and precision can be controlled by the developer. Precision can be represented with up to 31 fractional bits.
• Examples:
• FixedPoint32<24>: base int32 with 24 bits representing the fractional value
• FixedPoint16<11>: base int16 with 11 bits representing the fractional value
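A minimal Python sketch of the FixedPoint32<p> idea, showing how a multiply rescales the doubled fractional bits back to p (the class is ours, illustrating the representation, not Quadric's library):

```python
class Fixed32:
    """Sketch of a FixedPoint32<p> value: an integer with p fractional bits."""

    def __init__(self, value: float, p: int):
        self.p = p
        self.raw = int(round(value * (1 << p)))    # quantize to steps of 2^-p

    def __mul__(self, other):
        out = Fixed32(0.0, self.p)
        # The full-width product carries 2p fractional bits; shift back to p.
        out.raw = (self.raw * other.raw) >> other.p
        return out

    def to_float(self):
        return self.raw / (1 << self.p)

a = Fixed32(1.5, 24)    # FixedPoint32<24>
b = Fixed32(2.25, 24)
print((a * b).to_float())   # 3.375, exact: both inputs fit the format
```

Values that are not multiples of 2^-24 (like 0.1) land on the nearest representable step, so the error stays below one part in 2^24, which is the precision/range trade-off the slide describes.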

Fixed point accuracy numbers
15
• Quadric already supports 60+ networks (transformers, detectors, segmentation, classifiers, …) within <1% top-1 accuracy loss compared to floating-point models.
http://quadric.io/evs24

16
Back to basics: Designing an architecture from lessons learned

Chimera GPNPU
17
[Diagram: Processing Element internals — pipeline control; 32-bit ALU (full C++ target); 32-entry architectural register file; LRM, Local Register Memory (4 KB SRAM); MACs (8, 16, 32: 8b×8b / 4b×8b); DMA and neighbor access]

Chimera GPNPU
18
• Scalable: 1 TOPS to 64 TOPS single core; up to 512 TOPS multi-core
[Diagram: array of Processing Elements (PEs)]

Existing solutions
19
Offer very little programmability:
• Code partitioning / programming complexity
• System complexity / power
• Accelerator brittleness
• No ability to modify hardware after tapeout
• Leads to lower-performance “fallback” onto the DSP or CPU
• Shortens market lifetime of the SoC

[Diagram: a hardwired convolution & matrix co-processor with pre-wired pooling & activation units and buffer memory, alongside a programmable scalar + vector DSP with local memory & caches, connected through shared memory over AXI]

Chimera GPNPU: A code-powered AI engine
20
100% of the GPNPU is end-user programmable:
• Dramatically easier software programming model, with the ability to program in C++/Python
• Simpler SoC architecture
• Long SoC lifespan: easy ML operator support

[Diagram: programmable architecture with memory]

Programming model
21
[Diagram: a trained NN graph (Float32/int8, ONNX) from the ML training frameworks flows through ONNX Runtime/Relay into the Chimera Graph Compiler (CGC), which emits C++; that C++, together with user application code (C++), feeds the Chimera LLVM C++ Compiler, whose IR targets either silicon with a Chimera GPNPU or the Chimera ISS (cycle-approximate / SystemC simulator). CGC, the LLVM compiler, and the ISS are Quadric tools/libraries.]

Instruction-based simulation gives detailed bandwidth, power, and performance insights
22

Custom operator support
23
• Custom implementation of nodes/subgraphs
• e.g., NMS, proprietary layers, custom operators
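NMS is a good example of why custom operators need a programmable target: it is control-flow-heavy and data-dependent, not a matrix multiply. A minimal greedy NMS sketch in Python (illustrative reference logic, not Quadric's implementation):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes.

    Repeatedly keeps the highest-scoring box and discards any remaining
    box that overlaps it by more than iou_thresh.
    """
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])      # intersection corners
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    order = sorted(range(len(boxes)), key=lambda i: -scores[i])  # by score, descending
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

The sort, the per-pair IoU tests, and the data-dependent loop are exactly the kind of scalar/control code that stalls a fixed-function accelerator but runs naturally on a programmable core.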

24
Conclusions

Conclusions
25
• The floating-point unit can easily be replaced with fixed-point integer math that works equally well: same accuracy with lower power and higher performance.
• Fixed-operation units (ASICs) only work in niche applications. In today’s world, with AI algorithms changing every 3 weeks, one needs a very flexible architecture that is easy to program.
• Operations requiring non-MAC compute are becoming very common. Having multiple DSP/special cores is not the right fallback.
Need a unified architecture to handle all workloads…

26
Q&A

About Quadric: http://quadric.io/evs24
27
• Pure-play semiconductor IP licensing
• Processor IP & software tools
• Edge/device AI/ML inference + DSP processing
• Silicon-proven test chip; successful silicon in 2021
• HQ: Silicon Valley (Burlingame, CA)
• Total venture capital raised: $48M
• May 2023: first IP delivery, DevStudio online
• Patents: 25 granted
• Visit us at Booth 717