“Deploying Large Models on the Edge: Success Stories and Challenges,” a Presentation from Qualcomm

embeddedvision 175 views 19 slides Jun 05, 2024

About This Presentation

For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/deploying-large-models-on-the-edge-success-stories-and-challenges-a-presentation-from-qualcomm/

Vinesh Sukumar, Senior Director of Product Management at Qualcomm Technologies, presents the “Deploying Lar...


Slide Content

Deploying Large Models on the Edge: Success Stories & Challenges
Dr. Vinesh Sukumar
Sr Director, Product Management
Qualcomm Technologies Inc.
Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

Cloud economics will NOT allow Generative AI to scale
GEN AI is NOT scalable with cloud ONLY
Cost per query (1): Traditional (e.g. web search) vs. Generative AI, ~10x
GEN AI applications, billions of users
GEN AI applications: coding assistant, copy creation, web search, office copilot, image & video creation, text summarization, conversational chatbots
Edge GEN AI is becoming MORE than relevant NOW!
1: Reuters 2023

© 2024 Qualcomm Technologies Inc. 2

GEN AI – Edge Application Deployment Trends
2-10B parameter range
2-13B parameter range
1-3B parameter range
•Trending towards MUST support larger models => Compute, BW & Memory with sustained performance
•Multi modality fusion for better input prompt definition => Sensing + Vision + Text or various combinations
•Concurrency of models for improved user experience => Texture + Stylization + Restructuring for Visual content
+ More around Avatar creation, Knowledge based QnA, Intelligent Search, Co-Pilot Assistance & more
Anticipated to be deployed in 2023/24
Focus area for today

Focusing on LVMs
Language-Vision Models (LVMs): Models that combine vision and language
Trends/Attributes of recent generative LVMs:
Prompt-able
•Steerable, user guided, conditioned, grounded, …
Multi-modal cross-attention
•text/audio/click/3D/image/video/…
Encoder-decoder
Relatively larger
Example: Stable Diffusion (Stability.ai)

Our first low rank adaptation (LoRA) on an Android phone done on LVMs
•1+ billion parameter Stable Diffusion with LoRA adapter for customized experiences
•Create high-quality custom images based on personal or artistic preferences
•LoRA enables scalability and customization of on-device generative AI across use cases
•Full-stack AI optimization to achieve high performance at low power and fast switching between adapters
•Enhanced privacy, reliability, personalization, and cost with on-device processing
Low Rank Adaptation (LoRA)
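The adapter idea on this slide can be illustrated with a minimal NumPy sketch. It is not Qualcomm's implementation: it only shows the general LoRA formulation, in which a frozen base weight W is combined with a low-rank update (alpha / r) * B @ A. The layer sizes, the rank, the scaling value and the adapter names (`pencil_sketch`, `watercolor`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def lora_forward(x, w_base, lora_a, lora_b, alpha):
    """Apply a frozen base layer plus a low-rank update (alpha / r) * B @ A."""
    rank = lora_a.shape[0]
    base_out = x @ w_base.T                 # shared, frozen base weights
    delta_out = (x @ lora_a.T) @ lora_b.T   # low-rank path; B @ A is never materialized
    return base_out + (alpha / rank) * delta_out

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 512, 512, 8, 16.0   # hypothetical sizes

# Stands in for one layer of the base model.
w_base = rng.normal(size=(d_out, d_in))

# Hypothetical adapters: each one is only a small (A, B) pair on top of the same base.
adapters = {
    "pencil_sketch": (rng.normal(size=(rank, d_in)), rng.normal(size=(d_out, rank))),
    "watercolor": (rng.normal(size=(rank, d_in)), rng.normal(size=(d_out, rank))),
}

x = rng.normal(size=(1, d_in))

# Switching adapters only swaps the small matrices; the base weights stay in place.
outputs = {
    name: lora_forward(x, w_base, a, b, alpha)
    for name, (a, b) in adapters.items()
}

base_params = w_base.size               # parameters in the base layer
adapter_params = rank * (d_in + d_out)  # parameters in one adapter
```

With these assumed sizes, one adapter holds 8,192 parameters against 262,144 in the base layer (about 3%), which suggests why keeping several adapters around and switching between them can be cheap relative to swapping whole models.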

LVMs Challenges
Transitioning from floating point to fixed point
Prompt – I will draw realistic pencil portrait from a photo
Precision: FP32, W8A16 Per-Tensor (X), W8A16 Per-Channel (√), W8A16 Per-Channel + QAT (√)
Current Status: Recommended path working with various partners
QAT – Quantization Aware Training
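The difference between per-tensor and per-channel weight quantization can be sketched as follows. This is an illustrative NumPy example under stated assumptions, not the scheme used by any particular Qualcomm tool: it uses simple symmetric 8-bit rounding for weights only (the 16-bit activation side of W8A16 and QAT are not modeled), and the weight matrix is synthetic, with one output channel deliberately given a much larger range than the others.

```python
import numpy as np

def fake_quantize(w, scale, bits=8):
    """Round to a signed integer grid and map back to float (symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def quantize_per_tensor(w, bits=8):
    """One scale for the whole weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return fake_quantize(w, scale, bits)

def quantize_per_channel(w, bits=8):
    """One scale per output channel (per row of the weight matrix)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return fake_quantize(w, scale, bits)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64))   # synthetic weights: 4 output channels
w[0] *= 100.0                  # one channel with a much wider value range

# Mean squared quantization error for each output channel.
mse_tensor = np.mean((w - quantize_per_tensor(w)) ** 2, axis=1)
mse_channel = np.mean((w - quantize_per_channel(w)) ** 2, axis=1)
```

In this synthetic case the wide channel dictates the single per-tensor scale, so the three narrow channels lose most of their resolution, while per-channel scales keep their error small. That is one plausible reason a per-tensor configuration can fail where a per-channel one holds up.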

GEN AI LVM Applications
About 3 to 5B Parameters
Visual - LVM
Stable Diffusion + Control Net
•Sketch to Image
•Human-like AI companions or Avatars/Sketch: animated, emotion adaptive
•Applications: Generating custom wallpaper, replacing background, content creation in gaming, among others
Stable Diffusion + Control Net + NERF (3D)
•Entire 3D worlds from prompts
•3D from a single image
•Generate & manipulate
•Scenes: relighted, restructured
•Objects: textured, stylized, 3D meshes
•Modify with voice command
•Applications: Generating 3D scenes for gallery applications, supporting content creators, among others

What Next?
Lay the foundation for new consumer applications based on creating synthetic content in any modality
Text / Visual (IM, VID) / 3D / Audio / Any
Visual | 3D | T/Meta | Audio | Any
X to V | X to 3D | X to T/Meta | X to Audio | X to Any
Replace man with chimpanzee
Wearing a green umbrella
Question Answering
How long does this river go?
The river in the image is the Arno River, which flows through Florence, Italy. It stretches for approximately 241 kilometers (150 miles) and empties into the Ligurian Sea

Focusing on LLMs
Large Language Models (LLMs): Models that focus on language
•Move towards open-source models (e.g. Llama, Phi)
•Move towards multi modality (e.g. GPT4)
•Movement towards lower bit widths to reduce memory footprint
Key Ecosystem Asks: move from generic foundational models to domain specific models
•Data Ownership: Depending on models trained on data of unknown origin = Safety concerns
•Control: Data is your IP. Own the model generated from that data => Control core IP
•Model Ownership: Own your weights – Better introspection, explainability and portability

LLM Deployment – KPIs & Memory
Key LLM KPIs
•Accuracy: Depends on model size and context length, which in turn drive DRAM GB needs
•Time to First Token (TTFT): Mostly compute bound to produce first token very fast; does need high DDR BW
•Token/s: Typically DDR BW bound, as entire model and context need to be moved from DRAM to AI Engine
•Init Time: Depends on UFS BW to transfer model from Flash to DRAM
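The bandwidth-bound KPIs above lend themselves to a back-of-the-envelope estimate. The sketch below is a simplification under stated assumptions: it treats decode as purely DDR-bandwidth bound (every generated token re-reads the weights and the context from DRAM) and init as a single flash-to-DRAM copy. The model size, KV-cache size and both bandwidth figures are hypothetical placeholders, not measurements of any device.

```python
def decode_tokens_per_second(model_bytes, kv_cache_bytes, ddr_bytes_per_s):
    """Upper bound on decode rate when each token re-reads weights and context from DRAM."""
    return ddr_bytes_per_s / (model_bytes + kv_cache_bytes)

def init_time_seconds(model_bytes, flash_bytes_per_s):
    """Time to copy the model from flash storage into DRAM once."""
    return model_bytes / flash_bytes_per_s

GB = 1e9

# Hypothetical inputs, for illustration only.
model_bytes = 3.5 * GB      # e.g. 7B parameters stored at 4 bits per weight
kv_cache_bytes = 0.5 * GB   # assumed context (KV cache) footprint
ddr_bandwidth = 60 * GB     # assumed sustained DDR bandwidth, bytes/s
ufs_bandwidth = 4 * GB      # assumed sustained UFS read bandwidth, bytes/s

tok_s = decode_tokens_per_second(model_bytes, kv_cache_bytes, ddr_bandwidth)
init_s = init_time_seconds(model_bytes, ufs_bandwidth)
```

Under these assumptions the decode ceiling is 15 tokens/s and init takes just under a second. Halving the bytes per weight roughly doubles the first number and halves the second, which is the link between quantization and both KPIs.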

Memory footprint: How does quantization help?
Mapping to various form factors
[Chart: 7B LLAMA V2 Models. DRAM (GB) vs. Model Size (B Parameters) for FP16, INT8 and INT4, mapped to form factors (Phones, PC, XR). Callouts: "About 13GB!" and "Between 4 to 5GB!"]
Observation: Reduction in memory needs is becoming important to really enable large models while maintaining accuracy
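The arithmetic behind this kind of chart can be sketched in a few lines. The example assumes a nominal 7 billion parameters, counts weight storage only, and estimates the KV cache separately using the published Llama 2 7B shape (32 layers, hidden size 4096, 4096-token context) with 16-bit cache values. Runtime overheads are ignored, so it is not expected to reproduce the slide's figures exactly.

```python
GIB = 1024 ** 3
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gib(num_params, precision):
    """Weight storage only: parameter count times bytes per parameter."""
    return num_params * BYTES_PER_PARAM[precision] / GIB

def kv_cache_gib(num_layers, hidden_size, context_tokens, bytes_per_value=2.0):
    """KV cache for one sequence; the factor 2 covers keys and values in each layer."""
    return 2 * num_layers * hidden_size * context_tokens * bytes_per_value / GIB

# Nominal 7B parameter count (an approximation of the real model size).
weights = {p: weight_memory_gib(7e9, p) for p in BYTES_PER_PARAM}

# Llama 2 7B shape at full 4096-token context, 16-bit cache values.
kv = kv_cache_gib(num_layers=32, hidden_size=4096, context_tokens=4096)
```

This gives roughly 13.0 GiB of weights at FP16, 6.5 GiB at INT8 and 3.3 GiB at INT4, plus 2 GiB of KV cache at full context. The FP16 figure is in line with the "About 13GB" callout.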

GEN AI LLM Applications – in Commercialization phase
About 1 to 10B Parameters
Standalone usage
In combination with other modalities
Summarization
Sentence Correction
Mobile Personal Assistant
Personalized experience integrated with other sensory information and using voice commands
Speech to Speech as interface
Edge AI Sensing with ASR + LLM + TTS + Visual Avatar
Productivity Assistant
•QnA for queries
•Extend to Plug Ins (Navigation, enterprise, entertainment..)
•Email Creation
•Document Summarization
LLM (7-20B)
Fine-tuned data
Peripherals

Qualcomm® AI Engine and Qualcomm® Hexagon™ NPU
Hexagon NPU
•Upgraded micro architecture
•Upgraded micro tile inferencing
•Peak performance cores
•Higher clock speeds
•2X higher bandwidth on shared memory

Developer’s Gateway to Superior On-Device AI
Deploy
Pick a Model
Target Platform
Test and Validate
Pick a runtime
Qualcomm® AI Hub enables developers to easily quantize, optimize, and validate AI models in minutes

What Next?
Lay the foundation for new consumer applications based on need for personalization
•Larger models, larger context length enable more powerful use cases
•E.g., LLMs in a system (e.g., with RAGs); LLMs as Agents - orchestrate sequence of complex tasks
•This also drives need for higher Tok/s due to need to iterate multiple times
[Chart: use cases from less powerful to more powerful, by Model Size (B), Context (k) and Tok/s. Labels: Iterative usage, system component, skimming; Limited iteration chat; Background tasks; Agentic standalone planning; Basic tasks; Limited function, needs external control. Axis values: 100, 10, 32, 8, 50, 1]
RAG: Retrieval Augmented Generation
ICL: In-context learning
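The point that iterative (agentic) use drives the need for higher Tok/s can be made concrete with simple arithmetic. The sketch below assumes decode time dominates and ignores prompt processing and tool latency; the iteration counts, token counts and rates are hypothetical.

```python
def task_latency_seconds(iterations, tokens_per_iteration, tokens_per_second):
    """Total generation time when a task needs several model passes in sequence."""
    return iterations * tokens_per_iteration / tokens_per_second

# Hypothetical workloads, for illustration only.
single_chat_reply = task_latency_seconds(1, 200, 10.0)   # one pass at 10 tok/s
agent_at_10_tok_s = task_latency_seconds(5, 200, 10.0)   # five chained passes
agent_at_50_tok_s = task_latency_seconds(5, 200, 50.0)   # same task, faster decode
```

Under these assumptions a single reply takes 20 seconds, the five-step task takes 100 seconds at the same rate, and it only returns to 20 seconds once decode speed rises fivefold.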

Conclusions
•Consumer and enterprise GEN AI applications cannot scale with cloud ONLY
•Significant investments have been made on the edge/client side that can enable many GEN AI experiences with support for user personalization
•There are many ways to support personalization; one of them is LoRA (using adapters)
•Deploying GEN AI applications at scale on the client side comes with many challenges, such as accuracy, memory and performance, so a focus on many SW optimizations is needed
•Plenty of innovation is happening on the ecosystem side, which is expanding from traditional LVMs and LLMs to LMMs while supporting the need for multi modalities

2024 Embedded Vision Summit
May 21st (1:00-4:00pm) Workshop
“Accelerating Model Deployment with Qualcomm® AI Hub” – Bhushan Sonawane
May 22nd (1:30-2:00pm) Product Related Presentation
“OpenCV for High-Performance, Low-Power Vision Applications on Snapdragon” – Xin Zhong
May 23rd (9:50-10:20am) General Session Talk
“What’s Next in On-Device Generative AI” – Jilei Hou
May 23rd (10:20-11:10am) Panel Session
“Multimodal LLMs at the Edge: Are We There Yet?” – Jilei Hou
May 23rd (1:30-2:00pm) Product Related Presentation
“Deploying Large Models on the Edge: Success Stories & Challenges” – Vinesh Sukumar
Stop by our booth and live demos at exhibit hall booth 718
Qualcomm AI Hub
https://aihub.qualcomm.com/

Thank You