Small to Slide: RAG using MultiModal LLMs.

chloewilliams62 242 views 51 slides Sep 18, 2024

About This Presentation

With the advent of LLMs understanding multimodal inputs, document understanding has a new tool in the arsenal. Elliot Kang and I built an example using Milvus and a real slide deck to show how you can use retrieval in combination with multimodal input to get better results from your existing powerpo...


Slide Content

Small to Slide:
Multimodal for Better RAG
Yi Ding, formerly LlamaIndex.TS

A bit about me
●Cofounded the Messaging Apps team at Apple and was there for 8 years.

●Started using LLMs on January 21, 2023.

●Joined LlamaIndex. Shipped LlamaIndex.TS, create-llama, LlamaCloud,
LlamaParse + many partnerships incl. w/ Zilliz.

A quick survey
●When did you start building with LLMs?
○This past week?
○This past month?
○2024?
○2023?
○Before?

The elephant in the room: OpenAI o1-preview
●Came out Thursday.

What’s the big deal with strawberries?
●LLMs “see” tokens, not characters or words.
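A toy sketch of why letter-counting is awkward for a model that sees tokens. The vocabulary below is made up for illustration and is not any real tokenizer's; the point is only that the model's input is a few opaque pieces, while the question "how many r's?" is about characters.

```python
# Toy illustration only: VOCAB is a hypothetical vocabulary, not a real tokenizer's.
VOCAB = ["str", "aw", "berry", "s", "t", "r", "a", "w", "b", "e", "y"]


def tokenize(word: str) -> list[str]:
    """Greedy longest-match split of `word` into pieces from VOCAB."""
    tokens = []
    i = 0
    while i < len(word):
        # Pick the longest vocabulary entry that matches at position i.
        candidates = [v for v in VOCAB if word.startswith(v, i)]
        if not candidates:
            raise ValueError(f"no token covers {word[i]!r}")
        match = max(candidates, key=len)
        tokens.append(match)
        i += len(match)
    return tokens


tokens = tokenize("strawberry")
print(tokens)                      # the token view: a handful of opaque pieces
print("strawberry".count("r"))     # the character view, which the model never sees directly
```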

o1 stands for reasoning

Look familiar?

Thinking Longer = More Intelligent
https://www.maximumtruth.org/p/massive-breakthrough-in-ai-intelligence

Superhuman language capabilities

So all good right?
●Well…

●More thinking = more chances to fail.

From the official OpenAI docs

Overthinking
Thanks @hanchunglee

Hallucinations
●We are used to hallucinations being negative.

●But they can also be “positive.”

●Suspect that in model training, o1-preview is incentivized to “trust its gut.”

It’s still early

Small to Slide:
Multimodal for Better RAG
Yi Ding, formerly LlamaIndex.TS

A quick survey
●When did you start building with LLMs?
○This past week?
○This past month?
○2024?
○2023?
○Before?

●RAG?

What is RAG?
●Retrieval
●Augmented
●Generation


●Better
●Output
●With
●Search
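The three words can be sketched as three small functions. This is a minimal illustration, not any library's API: the word-overlap retriever stands in for a real vector search, the function names are hypothetical, and `llm` is a placeholder for whatever model call you use.

```python
import re
from typing import Callable

# Two tiny "documents" paraphrased from this deck, for illustration only.
DOCS = [
    "Zilliz is the maker of the Milvus Vector Database.",
    "ColPali transforms PDFs to images and searches over them.",
]


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Retrieval: rank docs by shared words (a stand-in for vector search)."""
    q = words(query)
    return sorted(docs, key=lambda d: len(q & words(d)), reverse=True)[:k]


def augment(query: str, passages: list[str]) -> str:
    """Augmented: put the retrieved passages into the prompt."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def rag(query: str, docs: list[str], llm: Callable[[str], str]) -> str:
    """Generation: hand the augmented prompt to the model."""
    return llm(augment(query, retrieve(query, docs)))


# Placeholder "model" that just echoes its prompt, so the sketch runs offline.
prompt_seen = rag("Who makes Milvus?", DOCS, llm=lambda prompt: prompt)
print(prompt_seen)
```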

Small to Big RAG
●Embed small, retrieve big.

●Key concept: what is embedded does not have to match what is retrieved
and given to the LLM.

●Coined at the Unstructured Meetup!

Small to Big RAG
Chris Churilo works in Marketing. She is currently VP of Marketing at Zilliz, the
makers of the Milvus Vector Database. Prior to that Ms. Churilo was VP of
Marketing at InfluxData.
Chris Churilo is a graduate of Berkeley. Go Bears! In her spare time she enjoys
stand up paddleboarding.

Q: Where does Ms. Churilo work?

Why?
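One plausible way to see the "why" is to run the example. The sketch below assumes the slide text is two paragraphs, embeds each sentence separately, and returns the parent paragraph of the best-matching sentence. The bag-of-words "embedding" is a stand-in for a real embedding model, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Sentences are listed by hand (assumed split) and grouped by parent paragraph.
PARAGRAPHS = [
    [
        "Chris Churilo works in Marketing.",
        "She is currently VP of Marketing at Zilliz, the makers of the Milvus Vector Database.",
        "Prior to that Ms. Churilo was VP of Marketing at InfluxData.",
    ],
    [
        "Chris Churilo is a graduate of Berkeley.",
        "Go Bears!",
        "In her spare time she enjoys stand up paddleboarding.",
    ],
]


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Embed small: one entry per sentence, remembering which paragraph it came from.
INDEX = [(embed(s), s, p) for p, para in enumerate(PARAGRAPHS) for s in para]


def small_to_big(query: str) -> tuple[str, str]:
    """Return (best-matching sentence, its whole parent paragraph)."""
    q = embed(query)
    _, small, parent = max(INDEX, key=lambda item: cosine(q, item[0]))
    # Retrieve big: give the LLM the parent paragraph, not just the sentence.
    return small, " ".join(PARAGRAPHS[parent])


small, big = small_to_big("Where does Ms. Churilo work?")
print("matched sentence:", small)
print("passed to LLM:   ", big)
```

With this toy scoring, the matched sentence is the one that names "Ms. Churilo", and it mentions only her previous employer. The sentence with the current employer says "She", so it scores poorly alone but comes along once the whole paragraph is retrieved.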

Small to Slide RAG
●Embed text, retrieve image.

●Key concept: what is embedded does not have to match what is retrieved
and given to the multimodal LLM.

●Built with Elliot Kang (LlamaIndex.TS contributor!)
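A minimal sketch of the idea, not the demo's actual code: each record pairs the text that gets searched with the slide image that gets handed to the multimodal LLM. The slide texts, file paths, and function names are hypothetical, and word overlap stands in for an embedding search in a vector database.

```python
import re
from dataclasses import dataclass


@dataclass
class SlideRecord:
    text: str        # what is embedded and searched
    image_path: str  # what is retrieved and given to the multimodal LLM


# Hypothetical records; in practice the text would be extracted from each slide.
SLIDES = [
    SlideRecord("Small to Big RAG: embed small, retrieve big", "slides/slide_01.png"),
    SlideRecord("ColPali downsides: needs GPU, less tooling", "slides/slide_02.png"),
]


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_slide_image(query: str, slides: list[SlideRecord]) -> str:
    """Search over slide text, but return the path of the slide image."""
    q = words(query)
    best = max(slides, key=lambda s: len(q & words(s.text)))
    return best.image_path


image = retrieve_slide_image("What are the downsides of ColPali?", SLIDES)
print(image)  # this image, not the extracted text, would go to the multimodal LLM
```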

Demo

But do we need to convert slide -> text at all?

ColPali
Instead of transforming PDFs to text, transform PDFs to images and search over them.

https://huggingface.co/blog/manu/colpali

ColPali
The search is performed using not a single vector per image (like image
embedding models) but many (10^n) embeddings per image.

Vectors are compared against embeddings for query.
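As I understand the ColPali approach, the comparison is a ColBERT-style "late interaction": for each query vector take its best match among the page's vectors, then sum those maxima. The sketch below uses tiny hand-written 2-D vectors as stand-ins for real model embeddings, and the function names are hypothetical.

```python
Vector = list[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: list[Vector], page_vecs: list[Vector]) -> float:
    """Sum, over query vectors, of the best dot product against any page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Made-up embeddings: two query vectors, and two pages with several vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.9, 0.1], [0.7, 0.3]]

scores = {"page_a": maxsim_score(query, page_a), "page_b": maxsim_score(query, page_b)}
best_page = max(scores, key=scores.get)
print(scores, best_page)
```

Page A wins here because some vector on it matches each query vector well, whereas page B only has a good match for the first one. A single averaged vector per page would blur that difference.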

ColPali: How I think about it
●Visual ngrams*

●*Not a deep learning researcher

●Generate using image context (like Small to Slide!)

ColPali: Downsides
●Needs GPU (Modal deployable version coming!)

●Models are language and training specific.

●Far less exploration/tooling so far.

RAG is Evolving Very Fast
●Text based RAG
○Pros: Easiest to get started. Best tooling support.
○Cons: Lose multimodal information. Chunking is not perfect.
●Small to Slide
○Pros: Better capture multimodal information.
○Cons: A bit more setup than text based. Specific to certain data types.
●ColPali
○Pros: No text conversion needed. Built in attribution.
○Cons: Need GPU and custom models.

One Last Favor