“Unveiling the Power of Multimodal Large Language Models: Revolutionizing Perceptual AI,” a Presentation from BenchSci


About This Presentation

For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/09/unveiling-the-power-of-multimodal-large-language-models-revolutionizing-perceptual-ai-a-presentation-from-benchsci/

István Fehérvári, Director of Data and ML at BenchSci, presents the “Unveiling the P...


Slide Content

Unveiling the Power of Multi-Modal Large Language Models: Revolutionizing Perceptual AI
Istvan Fehervari
Director, Data and ML

The era of large language models
3 © 2024 Istvan Fehervari
Diagram: LLM at the center, surrounded by application areas: webpage design, co-pilots and digital assistants, customer support bots, data exploration/SQL, copywriting, education and mental health bots.

• Revolution started in machine translation
• Context-sensitive next-token prediction via attention
• Transformer blocks composed of layers of attention and feed-forward blocks
• Encoder-decoder architectures
• Intelligence emerges through scale
Large language models
4 © 2024 Istvan Fehervari
Pipeline: input text → tokenize → deep neural network → decode tokens → output text (see the sketch below)
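To make the tokenize → network → decode loop concrete, here is a minimal sketch using the Hugging Face transformers library; GPT-2 is only a small stand-in model chosen for illustration, not something mentioned in the talk.

```python
# Minimal sketch of the LLM inference pipeline: tokenize -> model -> decode.
# GPT-2 is a small stand-in model used purely for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Large language models are", return_tensors="pt").input_ids  # tokenize

# The network predicts a distribution over the next token at every step;
# generate() samples tokens autoregressively.
output_ids = model.generate(input_ids, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # decode tokens -> text
```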

• Enables learnable weights of content pieces for arbitrary context → better reasoning
• Works very well for ordered and unordered sets
• Works well with external contexts, e.g., cross-attention with vision
• All modern ML models leverage some form of attention
Why the attention mechanism is critical
5 © 2024 Istvan Fehervari
Figure: low vs. high attention weights over the tokens “Puppies”, “Are”, “Cute” (see the code sketch below).
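A minimal NumPy sketch of single-head scaled dot-product attention, the operation behind these learnable, context-dependent weights; the dimensions and inputs are made up for illustration.

```python
# Single-head scaled dot-product attention, sketched in NumPy.
import numpy as np

def attention(Q, K, V):
    # Q, K, V: (seq_len, d) matrices of query/key/value vectors.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # pairwise token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax -> attention weights
    return weights @ V                                 # context-weighted mix of values

# Toy example: 3 tokens ("Puppies", "Are", "Cute") with 4-dim embeddings.
x = np.random.randn(3, 4)
print(attention(x, x, x).shape)  # (3, 4)
```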

• LLMs are composed of decoder blocks
• Inputs are all previous tokens, including predicted ones
• Decoder outputs a token distribution based on all previous tokens
• During generation we sample tokens from the output distribution
  • Temperature
  • Top-k / top-p (see the sampling sketch below)
Building blocks of LLMs
6 © 2024 Istvan Fehervari
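As a rough illustration of temperature and top-k sampling from the decoder's output distribution, here is a NumPy sketch with made-up logits (real decoders also offer top-p/nucleus sampling and other strategies).

```python
# Sampling the next token from an output distribution with temperature and top-k.
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=50):
    logits = logits / temperature                 # >1 flattens, <1 sharpens the distribution
    top = np.argsort(logits)[-top_k:]             # keep only the k most likely tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(np.random.choice(top, p=probs))

vocab_logits = np.random.randn(50_000)            # one score per vocabulary token
print(sample_next_token(vocab_logits, temperature=0.8, top_k=40))
```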

• Supervised training
  • Input–output pairs (e.g., translation)
  • Fine-tune on a specific task
• Self-supervised training
  • Next/masked token prediction – needs a large body of data
• Reinforcement learning from human feedback (RLHF)
  • For instruction tuning, use human ranking to learn a reward function
Training LLMs
7 © 2024 Istvan Fehervari

• LLaMA (2023/2) – 7B / 13B / 33B / 65B
• Falcon (2023/5) – 7B / 40B / 180B
• LLaMA 2 (2023/7) – 7B / 13B / 70B
• Mistral (2023/9) – 7B (based on LLaMA 2)
• Vicuna / Alpaca (based on LLaMA)
• Phi-2 (2023/12) – 2.7B
• Mixtral (2023/12) – 8x7B
Main foundational open LLMs
8 © 2024 Istvan Fehervari

Perception via Language
9 © 2024 Istvan Fehervari

• Annotated class labels are expensive → captions are abundant
• Era of natural language supervision
• WebImageText dataset: 400 million images with text captions
  • Created with web scraping
  • Query words are composed of all words occurring at least 100 times on Wikipedia
Rise of a new dataset
10 © 2024 Istvan Fehervari

• Predicting captions directly does not scale well
• Instead, predict how well a text description and an image “fit together” (see the sketch below)
• First example of prompt engineering in vision
CLIP: combining language and vision
11 © 2024 Istvan Fehervari
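A minimal sketch of this idea (scoring how well captions fit an image) using the CLIP checkpoint available through Hugging Face transformers; the image path and prompts are illustrative assumptions.

```python
# CLIP-style image/text matching: score how well each caption "fits" the image.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                              # hypothetical local image
prompts = ["a photo of a dog", "a photo of a cat"]           # simple vision prompt engineering

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image                    # image-text similarity scores
print(logits.softmax(dim=-1))                                # probability per prompt
```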

Object detection with Grounding DINO
• Open-vocabulary detection
• Text backbone is a pretrained transformer like BERT
• Text-image and image-text cross-attention at several stages
Object detection with CLIP
12 © 2024 Istvan Fehervari

Segmentation with Grounded SAM
• Open-vocabulary segmentation
• Detect boxes with Grounding DINO → predict masks with SAM (see the sketch below)
Image segmentation with CLIP
13 © 2024 Istvan Fehervari
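A minimal sketch of the second half of that pipeline: prompting SAM with a box that an open-vocabulary detector such as Grounding DINO would produce. It uses Meta's segment_anything package; the checkpoint path, image file, and box coordinates are placeholders/assumptions.

```python
# Grounded-SAM-style flow: a text-grounded detector proposes a box, SAM turns it into a mask.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # hypothetical local checkpoint
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

box = np.array([50, 40, 320, 280])                  # (x0, y0, x1, y1), e.g., from Grounding DINO
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape)                                  # binary mask(s) for the detected object
```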

• CLIPSeg – image segmentation with prompts
Image segmentation with CLIP
14 © 2024 Istvan Fehervari
Lüddecke et al. – Image Segmentation Using Text and Image Prompts, CVPR 2022

• Stable Diffusion uses CLIP text embeddings (see the sketch below)
Image generation with CLIP
15 © 2024 Istvan Fehervari
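A minimal sketch of text-to-image generation with the diffusers library, where the prompt is encoded by Stable Diffusion's CLIP text encoder and conditions the denoising U-Net; the checkpoint name and prompt are illustrative assumptions.

```python
# Text-to-image with Stable Diffusion; the prompt goes through a CLIP text encoder.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a watercolor painting of a puppy").images[0]   # CLIP embeddings condition the U-Net
image.save("puppy.png")
```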

LLMs with Vision
16 © 2024 Istvan Fehervari

• We want our models to reason over visual input
• What data is needed?
Learning paradigms for (V)LLMs
17 © 2024 Istvan Fehervari

Dataset building via
1. Image captioning (ideally with bounding-box ground truth)
2. Visual QA datasets
3. Synthetic: create (2) from (1)
Can be done manually or LLM-assisted
Training V-LLMs
18 © 2024 Istvan Fehervari

• (Multi-modal) in-context learning (e.g., Otter) – see the prompt sketches below
  • Inject a demonstration set into the context
  • Requires a large context
  • Can be used to teach LLMs to use external tools
• (Multi-modal) chain-of-thought (e.g., ScienceQA)
  • Intermediate reasoning steps for superior output
  • Adaptive or pre-defined chain configuration
  • Chain construction: infilling or predicting
Learning paradigms for (V)LLMs
19 © 2024 Istvan Fehervari
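To illustrate the two paradigms as plain prompt construction; the wording and the <image> placeholders are assumptions for illustration, not a fixed API.

```python
# In-context learning: inject a demonstration set before the actual query.
icl_prompt = (
    "<image_1> Q: What animal is shown? A: A dog.\n"
    "<image_2> Q: What animal is shown? A: A cat.\n"
    "<image_3> Q: What animal is shown? A:"
)

# Chain-of-thought: ask for intermediate reasoning steps before the final answer.
cot_prompt = (
    "<image> Q: How many puppies are on the couch?\n"
    "Think step by step about what is visible, then state the final answer."
)
```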

• Learnable interface between modalities
• Expert model translating (e.g., vision) into text
• Special tokens/function calling to access aux models
LLMs with vision capabilities
20 © 2024 Istvan Fehervari
Diagram: modality bridging approaches
• Learnable interface
  • Query-based: InstructBLIP, VisionLLM, Macaw-LLM
  • Projection-based: LLaVA, PandaGPT, Video-ChatGPT
  • Parameter-efficient tuning: LaVIN
• Expert model: VideoChat-Text, LLaMA-Adapter V2

• Use CLIP to map vision and language tokens to the same latent space (shallow alignment) – LLaVA-1.5
• Keep the LLM and image encoder frozen – only train a shallow projection layer (see the sketch below)
Modality bridging with shallow alignment
21 © 2024 Istvan Fehervari
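A minimal PyTorch sketch of such a shallow projection: a small trainable module maps frozen CLIP image features into the frozen LLM's token-embedding space (LLaVA-1.5 uses a small MLP of this kind; the dimensions here are illustrative assumptions).

```python
# Shallow alignment: project frozen CLIP patch features into the LLM embedding space.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Only this projection is trained; the CLIP encoder and the LLM stay frozen.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features):
        # image_features: (batch, num_patches, vision_dim) from the frozen image encoder
        return self.proj(image_features)  # (batch, num_patches, llm_dim) "visual tokens"

visual_tokens = VisionProjector()(torch.randn(1, 256, 1024))
print(visual_tokens.shape)  # concatenated with text embeddings and fed to the frozen LLM
```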

Deeper alignment: Mixture of experts with vision
22 © 2024 Istvan Fehervari
• Visual experts – mixture of experts with vision, e.g., CogVLM
• Experts are separate feed-forward layers
• Only a few experts are activated during inference
• Learn a gating network (see the sketch below)
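A generic PyTorch sketch of a mixture-of-experts feed-forward layer with a learned gating network that activates only a few experts per token; CogVLM's visual experts are a variation of this idea, and the layer sizes here are illustrative assumptions.

```python
# Mixture-of-experts feed-forward layer with top-k gating (illustrative sketch).
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)   # learned gating network
        self.top_k = top_k

    def forward(self, x):                          # x: (num_tokens, dim)
        scores = self.gate(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)   # only a few experts fire per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

print(MoELayer()(torch.randn(10, 512)).shape)  # torch.Size([10, 512])
```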

LLM-aided visual reasoning
23 © 2024 Istvan Fehervari
• LLM function calling
  • Controller – task planning
  • Decision maker – summarize, continue or not
  • Semantic refiner – generate text w.r.t. context
• Strong generalization
• Emergent ability (e.g., understand meme images)
• Better control
Diagram: instruction/question → LLM (reasoning) → object detector / segmentation model → answer (see the sketch below).
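A toy sketch of that control flow, where the LLM plays controller and semantic refiner around vision tools; ask_llm, detect_objects, and segment_image are hypothetical stubs standing in for a real LLM API, an open-vocabulary detector, and a segmentation model.

```python
# LLM-aided visual reasoning: the LLM plans which tool to call, then refines the result.
def ask_llm(prompt: str) -> str:
    return "detect"                              # stub: a real system would call an LLM here

def detect_objects(image, query: str):
    return [{"label": "puppy", "box": [50, 40, 320, 280]}]   # stub detector output

def segment_image(image, query: str):
    return ["<mask>"]                            # stub segmentation output

def answer(question: str, image) -> str:
    # Controller: task planning - decide which vision tool the question needs.
    plan = ask_llm(f"Which tool answers '{question}'? Reply 'detect' or 'segment'.")
    evidence = detect_objects(image, question) if "detect" in plan else segment_image(image, question)
    # Semantic refiner: turn raw tool output into a natural-language answer.
    return ask_llm(f"Question: {question}\nTool output: {evidence}\nAnswer concisely.")

print(answer("How many puppies are in the photo?", image=None))
```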

Vision-language on the Edge
24 © 2024 Istvan Fehervari

• Programming → natural-language instructions
• Training-free solutions
• Shorter time-to-market
• Short lead time to adapt to changing environments
Applications
25 © 2024 Istvan Fehervari

• Control: more natural, frictionless UX
  • Voice or chat to control/monitor devices/networks
  • Answer usability questions (no more manuals)
  • Personalized onboarding to new devices
• Feedback:
  • Output is interpreted without a human in the loop (e.g., alarm systems)
Applications
26 © 2024 Istvan Fehervari

• LLMs need a lot of resources (compute, memory)
• Visual input and CoT require a larger context
• Latency is still an issue on the edge
• Output control of LLMs is still unsolved; prone to hallucinations
• AI safety – bias is an unsolved issue
• AI alignment is an emerging field of research
Challenges
27 © 2024 Istvan Fehervari

• Language as control brought tremendous improvements
• LLMs can operate very well with visual signals
• Future products will be more user-friendly and more natural
• Faster time to market and better adaptability, on both the technical and business side
• Performance on the edge is a challenge today but will be solved
• AI safety / alignment is the new challenge, without a clear answer in sight
Conclusions
28 © 2024 Istvan Fehervari

Questions?
29 © 2024 Istvan Fehervari

• Yin et al. – A Survey on Multimodal Large Language Models
• Zhang et al. – Vision-Language Models for Vision Tasks: A Survey
• Lüddecke et al. – Image Segmentation Using Text and Image Prompts
• Liu et al. – Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
• Kirillov et al. – Segment Anything
• Wang et al. – CogVLM: Visual Expert for Pretrained Language Models
• Liu et al. – LLaVA: Large Language and Vision Assistant
• Li et al. – Otter: A Multi-Modal Model with In-Context Instruction Tuning
Resources
30 © 2024 Istvan Fehervari