With the advent of LLMs that understand multimodal inputs, document understanding has a new tool in its arsenal. Elliot Kang and I built an example using Milvus and a real slide deck to show how you can combine retrieval with multimodal input to get better results from your existing PowerPoint slides.
Added: Sep 18, 2024
Slides: 51 pages
Slide Content
Small to Slide:
Multimodal for Better RAG
Yi Ding, formerly LlamaIndex.TS
A bit about me
●Cofounded the Messaging Apps team at Apple and was there for 8 years.
A quick survey
●When did you start building with LLMs?
○This past week?
○This past month?
○2024?
○2023?
○Before?
The elephant in the room: OpenAI o1-preview
●Came out Thursday.
What’s the big deal with strawberries?
●LLMs “see” tokens, not characters or words.
o1 stands for reasoning
Look familiar?
Thinking Longer = More Intelligent
https://www.maximumtruth.org/p/massive-breakthrough-in-ai-intelligence
Superhuman language capabilities
So all good right?
●Well…
●More thinking = more chances to fail.
From the official OpenAI docs
Overthinking
Thanks @hanchunglee
Hallucinations
●We are used to hallucinations being negative.
●But they can also be “positive.”
●I suspect that during model training, o1-preview is incentivized to “trust its gut.”
It’s still early
Small to Slide:
Multimodal for Better RAG
Yi Ding, formerly LlamaIndex.TS
A quick survey
●When did you start building with LLMs?
○This past week?
○This past month?
○2024?
○2023?
○Before?
●RAG?
What is RAG?
●Retrieval
●Augmented
●Generation
●Better
●Output
●With
●Search
Small to Big RAG
●Embed small, retrieve big.
●Key concept: what is embedded does not have to match what is retrieved and given to the LLM.
●Coined at the Unstructured Meetup!
Small to Big RAG
Chris Churilo works in Marketing. She is currently VP of Marketing at Zilliz, the makers of the Milvus Vector Database. Prior to that Ms. Churilo was VP of Marketing at InfluxData.
Chris Churilo is a graduate of Berkeley. Go Bears! In her spare time she enjoys stand up paddleboarding.
Q: Where does Ms. Churilo work?
Why?
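One way to see why, as a minimal sketch rather than the code behind the talk: embed each sentence on its own, but hand the LLM the whole paragraph that the best-matching sentence came from. The `embed` function here is a toy word-count stand-in for a real embedding model, and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Each inner list is one paragraph from the slide, already split into sentences.
PARAGRAPHS = [
    [
        "Chris Churilo works in Marketing.",
        "She is currently VP of Marketing at Zilliz, the makers of the Milvus Vector Database.",
        "Prior to that Ms. Churilo was VP of Marketing at InfluxData.",
    ],
    [
        "Chris Churilo is a graduate of Berkeley.",
        "Go Bears!",
        "In her spare time she enjoys stand up paddleboarding.",
    ],
]

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(paragraphs):
    """Embed small: one entry per sentence, remembering its parent paragraph."""
    return [(embed(s), s, pid) for pid, sents in enumerate(paragraphs) for s in sents]

def retrieve(query, paragraphs):
    """Retrieve big: match on a sentence, return the full parent paragraph."""
    index = build_index(paragraphs)
    q = embed(query)
    _, sentence, pid = max(index, key=lambda entry: cosine(q, entry[0]))
    return sentence, " ".join(paragraphs[pid])

sentence, context = retrieve("Where does Ms. Churilo work?", PARAGRAPHS)
print("matched sentence:", sentence)
print("context for LLM:", context)
```

In this toy run the best-matching sentence is the one about InfluxData, which on its own would point to the wrong employer. The sentence that names Zilliz never mentions her name, so it only reaches the LLM because the whole paragraph is returned. A real embedding model may rank the sentences differently, but the indexing pattern is the same.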
Small to Slide RAG
●Embed text, retrieve image.
●Key concept: what is embedded does not have to match what is retrieved and given to the multimodal LLM.
●Built with Elliot Kang (LlamaIndex.TS contributor!)
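The idea above can be sketched as follows. This is not the demo code: the demo used Milvus, while this sketch keeps everything in memory, uses a toy word-count `embed` in place of a real embedding model, and invents the `SlideRecord` type, the image file names, and the prompt dictionary shape purely for illustration.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass

@dataclass
class SlideRecord:
    text: str        # text extracted from the slide; this is what gets embedded
    image_path: str  # rendered slide image; this is what gets retrieved

# Slide text is taken from this deck; the file names are hypothetical placeholders.
SLIDES = [
    SlideRecord("Small to Big RAG. Embed small, retrieve big.", "slides/small_to_big.png"),
    SlideRecord(
        "ColPali: Downsides. Needs GPU. Models are language and training specific. "
        "Far less exploration/tooling so far.",
        "slides/colpali_downsides.png",
    ),
]

def embed(text):
    """Toy stand-in for a text embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_slides(query, slides, top_k=1):
    """Search over slide text, but return the whole records (including image paths)."""
    q = embed(query)
    ranked = sorted(slides, key=lambda s: cosine(q, embed(s.text)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, slides):
    """Assemble a generic multimodal request: the question plus slide images.
    The dictionary shape is an assumption, not any particular LLM API."""
    return {"question": query, "images": [s.image_path for s in slides]}

hits = retrieve_slides("What are the downsides of ColPali?", SLIDES)
print(build_prompt("What are the downsides of ColPali?", hits))
```

The text is only used to find the slide. What the multimodal LLM receives is the slide image, so layout, charts, and diagrams that text extraction would lose are still available at generation time.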
Demo
But do we need to convert slide -> text at all?
ColPali
Instead of transforming PDFs to text, transform each PDF page to an image and search over the images.
https://huggingface.co/blog/manu/colpali
ColPali
The search uses not a single vector per image (as image embedding models do) but many (10^n) embeddings per image.
These vectors are compared against the embeddings of the query.
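As I understand it, this comparison follows the ColBERT-style late-interaction (MaxSim) rule: for each query vector, take its best match among the page's vectors, then sum those best matches. The sketch below shows that scoring rule on tiny hand-made vectors; the page names and numbers are invented for illustration, and a real system would get the vectors from the ColPali model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def maxsim(query_vecs, page_vecs):
    """Late-interaction score: sum over query vectors of the best-matching page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

def rank_pages(query_vecs, pages):
    """Return page names sorted from highest to lowest MaxSim score."""
    return sorted(pages, key=lambda name: maxsim(query_vecs, pages[name]), reverse=True)

# Two query-token vectors and two pages with two patch vectors each (made-up values).
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.5, 0.5]],
    "page_b": [[0.0, 1.0], [0.0, 0.9]],
}

print(rank_pages(query, pages))  # page_a scores 1.5, page_b scores 1.0
```

Because every query vector picks its own best patch, a page can score well when different parts of the query match different regions of the image.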
ColPali: How I think about it
●Visual ngrams*
●*Not a deep learning researcher
●Generate using image context (like Small to Slide!)
ColPali: Downsides
●Needs GPU (Modal deployable version coming!)
●Models are language- and training-specific.
●Far less exploration/tooling so far.
RAG is Evolving Very Fast
Text-based RAG
●Pros
○Easiest to get started
○Best tooling support
●Cons
○Lose multimodal information
○Chunking is not perfect
Small to Slide
●Pros
○Better capture multimodal information
●Cons
○A bit more setup than text based
○Specific to certain data types
ColPali
●Pros
○No text conversion needed
○Built in attribution
●Cons
○Need GPU and custom models