Recipe Generation/Retrieval from Videos - Multi-Modal RecipeRag

NABLAS, 9 slides, Sep 24, 2024

About This Presentation

RecipeRag retrieves relevant text and image data from a video database based on a user query, then combines that information to generate the steps and required ingredients of a cooking recipe.


Slide Content

Recipe Generation/Retrieval from Videos: RecipeRag

Overview
Multi-Modal RAG: when the data involves information other than text
- Videos & transcripts
- Web pages
RecipeRag: given a text query, the application returns the recipe.
- If the recipe is available in the database -> it should return that
- If it is not -> it should generate one and suggest it by itself

Input & Desired Output
- Retrieval-based generation for known recipes
- Retrieval-based generation for unknown recipes
Goal: to get the recipe ingredients, a stepwise procedure, and, for each step, a visual explanation (either from the available database or generated)

Data collection for RecipeRag
For each video in a set of random recipe videos from YouTube:
- Transcript extracted from the audio using a speech-to-text model (Whisper)
- Video frames sampled from the video at 5-second intervals
[Pipeline diagram: YouTube video -> MP3 audio -> speech-to-text (Deepgram service) -> timestamped transcript segments, e.g. "We prepare a quick breakfast! Bake 3 slices. Butter. 2-3 cloves of garlic. Fry the bread until golden brown on both sides. Avocado 2 pcs."; in parallel, frame extraction from the video.]
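
As a rough illustration of this collection step, here is a minimal Python sketch that samples frames at 5-second intervals with OpenCV and transcribes the audio with the open-source Whisper model (the slides mention both Whisper and the Deepgram service); the function names and library choices are assumptions, not the presenters' actual code.

```python
import cv2       # assumed: OpenCV for frame extraction
import whisper   # assumed: openai-whisper for speech-to-text

def extract_frames(video_path: str, interval_sec: float = 5.0):
    """Sample one frame every `interval_sec` seconds, keeping timestamps."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(1, int(fps * interval_sec))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append((idx / fps, frame))  # (timestamp in seconds, BGR image)
        idx += 1
    cap.release()
    return frames

def transcribe(audio_path: str):
    """Whisper returns timestamped segments we can later turn into TextNodes."""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    return [(s["start"], s["end"], s["text"]) for s in result["segments"]]
```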

Chunking, Metadata & Indexing (text & image data)
*Indexing in a retrieval task is the process of organizing data to make it faster and easier to find specific information.
Chunking (a strategy to split/structure the data into smaller parts)
- The transcript is divided into multiple small text segments -> each segment represents a data point, also called a TextNode
- Each image represents an ImageNode
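
The TextNode/ImageNode naming matches LlamaIndex, so a minimal sketch of this chunking step might look like the following, assuming LlamaIndex is the framework in use (the deck does not say so explicitly) and with illustrative metadata keys:

```python
from llama_index.core.schema import TextNode, ImageNode  # assumed framework

def build_nodes(transcript_segments, frame_paths):
    """One TextNode per transcript segment, one ImageNode per sampled frame."""
    text_nodes = [
        TextNode(text=text, metadata={"start": start, "end": end})
        for start, end, text in transcript_segments
    ]
    image_nodes = [
        ImageNode(image_path=path, metadata={"timestamp": ts})
        for ts, path in frame_paths
    ]
    return text_nodes, image_nodes
```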

TextNode Metadata
-Entity extractor: Food items present
-Title extractor: Based on initial few segments
-Questions based on recipe title.

Indexing (vectorizing/storing the split data)
- For TextNodes, the vector embeddings are created from the original text segment plus the extracted metadata -> a combination like
  "text: {text_segment}; title: {title}; {question1}; {question2} ..."
- For ImageNodes, CLIP-based embeddings are used
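
A small sketch of how the two embedding inputs could be assembled, directly following the combination pattern quoted above; the metadata keys and the CLIP backbone (here sentence-transformers' clip-ViT-B-32) are assumptions:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer  # assumed CLIP wrapper

clip_model = SentenceTransformer("clip-ViT-B-32")  # one possible CLIP backbone

def text_embedding_input(node) -> str:
    """Build 'text: {text_segment}; title: {title}; {question1}; ...'."""
    parts = [f"text: {node.text}", f"title: {node.metadata.get('title', '')}"]
    parts += node.metadata.get("questions", [])  # assumed metadata key
    return "; ".join(parts)

def image_embedding(path: str):
    """CLIP embedding of a sampled frame, for the ImageNode index."""
    return clip_model.encode(Image.open(path))
```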

Task selection
[Diagram: user query for a recipe -> generate_recipe -> one of three functions]
Based on the query and LLM function calling (prompt engineering), one of the following functions is run:
- generate_known_dish_recipe: both text and images are retrieved in a RAG manner from the indexed database
- generate_modified_dish_recipe: text is generated in a retrieval-augmented manner (combining the steps of an existing recipe); images are generated with a Multi-Modal LLM
- generate_custom_dish_recipe: unknown dish -> text is generated with the LLM; images corresponding to the text are generated with a Multi-Modal LLM
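
This routing can be expressed with a standard function-calling setup; the sketch below uses the OpenAI chat-completions tools API, with the model name and parameter schema as placeholders (the deck does not specify which LLM or schema is used):

```python
from openai import OpenAI

client = OpenAI()

# One tool entry per recipe function from the slide; schemas are illustrative.
TOOLS = [
    {"type": "function",
     "function": {
         "name": name,
         "description": desc,
         "parameters": {"type": "object",
                        "properties": {"dish": {"type": "string"}},
                        "required": ["dish"]}}}
    for name, desc in [
        ("generate_known_dish_recipe", "The dish exists in the indexed database."),
        ("generate_modified_dish_recipe", "A known dish with user modifications."),
        ("generate_custom_dish_recipe", "An unknown dish, generated from scratch."),
    ]
]

def route(query: str):
    """Let the LLM decide which recipe function handles the query."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": query}],
        tools=TOOLS,
    )
    call = resp.choices[0].message.tool_calls[0]
    return call.function.name, call.function.arguments
```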

Retrieval (text data)
generate_known_dish_recipe
[Diagram: user query -> VectorStore (TextNodes) -> topK retrieved text chunks (TextNode1, TextNode2, TextNode3) -> LLM -> JSON-formatted structured output: the steps for the recipe]
Typical RAG: based on the user query and the stored transcript nodes we get the output, with one caveat -> the output is conditioned to be a JSON-formatted list of steps.
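
A compact sketch of this step, assuming a LlamaIndex vector index over the TextNodes and an OpenAI model for generation; the prompt wording, topK value, and JSON schema are all illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()

def generate_known_dish_recipe(query: str, text_index) -> list[dict]:
    """Retrieve topK transcript chunks, then condition the LLM to emit JSON steps."""
    chunks = text_index.as_retriever(similarity_top_k=3).retrieve(query)
    context = "\n".join(c.node.get_content() for c in chunks)
    prompt = (
        f"Transcript context:\n{context}\n\n"
        f"Write the recipe for: {query}\n"
        'Reply in JSON as {"steps": [{"step": 1, "instruction": "..."}]}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["steps"]
```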

Retrieval (image data)
generate_known_dish_recipe
[Diagram: JSON-formatted structured output (steps for the recipe) -> VectorStore (ImageNodes) -> ImageNode1, ImageNode2, ImageNode3 -> Image1, Image2, Image3]
For each step from the JSON output, we retrieve images and sort them by their timestamps (recorded when they were extracted from the videos).
generate_unknown_dish_recipe
[Diagram: JSON-formatted structured output (steps for the recipe) -> MM-LLM -> Image1, Image2, Image3]
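
For the known-dish path, the per-step image lookup could look like this sketch, assuming the ImageNodes sit in a vector index whose CLIP embeddings allow text-to-image retrieval, and that each node stores its extraction time under a "timestamp" metadata key (both assumptions):

```python
def images_for_steps(steps: list[dict], image_index, top_k: int = 3):
    """For each recipe step, retrieve matching frames and order them by the
    timestamp recorded when the frame was extracted from the video."""
    retriever = image_index.as_retriever(similarity_top_k=top_k)
    ordered = []
    for step in steps:
        hits = retriever.retrieve(step["instruction"])
        hits.sort(key=lambda h: h.node.metadata.get("timestamp", 0.0))
        ordered.append([h.node.image_path for h in hits])
    return ordered
```

In the unknown-dish path the same loop would instead ask the multi-modal LLM to generate an image per step, since there is nothing in the database to retrieve.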

Thank you