Finding video shots for immersive journalism through text-to-video search
NoTubeProject
10 slides
Sep 19, 2024
About This Presentation
Video assets from archives or online platforms can provide relevant content for embedding into immersive scenes or for generation of 3D objects or scenes. However, XR content creators lack tools to find relevant video segments for their chosen topic. In this paper, we explore the use case of journalists creating immersive experiences for news stories and their need to find related video material to create and populate a 3D scene. An innovative approach creates text and video embeddings and matches textual input queries to relevant video shots. This is provided via a Web dashboard for search and retrieval across video collections, with selected shots forming the input to content creation tools to generate and populate an immersive scene, meaning journalists do not need specialist knowledge to communicate stories via XR.
Slide Content
Co-authors: Damianos Galanopoulos and Vasileios Mezaris, CERTH-ITI, Thessaloniki, Greece
Speaker: Lyndon Nixon, MODUL Technology, Vienna, Austria
Background
This is a part of the news pilot in the TRANSMIXR project.
Immersive journalism lets journalists tell a news story or event through an innovative experience that submerges viewers inside a scene.
Immersive scene creation is a complex process that requires creating photorealistic scenes and objects.
Could news images and video be a viable source for non-experts (journalists)?
Firstly, relevant images and video need to be findable in the context of news stories…
News dashboard
A Web intelligence platform monitors news articles published globally on an hourly basis, crawling their textual content and annotating the documents.
A Web-based dashboard provides data visualisations, including a storygraph to highlight emerging and ongoing news stories.
Drill-down search on documents matching a
selected story
Video discovery platform
Video documents have been included, both via the YouTube API and directly from media archives.
We query the API with terms matching the keywords in current news coverage, as well as a set of topical queries for the news pilot.
Basic metadata is enriched and stored, and is searchable by text (title, description, keywords/entities…)
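The metadata-level search described above can be sketched as a simple term match over the stored fields. The field names and the term-count scoring below are illustrative assumptions for this sketch, not the platform's actual implementation:

```python
# Minimal sketch of text search over enriched video metadata.
# Field names ("title", "description", "keywords") and the scoring
# scheme are assumptions for illustration only.

def search_metadata(videos, query):
    """Rank videos by how many query terms appear in their
    title, description, or keyword annotations."""
    terms = query.lower().split()
    results = []
    for video in videos:
        haystack = " ".join([
            video.get("title", ""),
            video.get("description", ""),
            " ".join(video.get("keywords", [])),
        ]).lower()
        score = sum(1 for t in terms if t in haystack)
        if score > 0:
            results.append((score, video))
    # Highest-scoring videos first
    return [v for _, v in sorted(results, key=lambda r: -r[0])]
```

A real deployment would use a full-text index rather than a linear scan, but the ranking idea is the same.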
Segmentation and analysis for indexing
For embedding into immersive content,
shorter video material is preferred.
For generating 3D objects or scenes,
sets of images or video shots can be used
with generative AI.
We therefore also segment and analyse the video, so that we obtain shots with descriptive metadata.
We use TransNet (a CNN-based model) to decompose the video into shots and keyframes.
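A TransNet-style model outputs a per-frame shot-transition probability; turning those probabilities into shots and keyframes can be sketched as below. The threshold value and the middle-frame keyframe heuristic are assumptions for this sketch, not necessarily the pipeline's actual settings:

```python
# Sketch: convert per-frame transition probabilities (as produced by a
# TransNet-style CNN) into shot boundaries and keyframe indices.
# The 0.5 threshold and middle-frame heuristic are illustrative.

def probabilities_to_shots(frame_probs, threshold=0.5):
    """Split a video into (start, end) frame ranges, cutting at every
    frame whose transition probability exceeds the threshold."""
    boundaries = [i for i, p in enumerate(frame_probs) if p > threshold]
    shots, start = [], 0
    for b in boundaries:
        shots.append((start, b))
        start = b + 1
    shots.append((start, len(frame_probs) - 1))
    return shots

def keyframes(shots):
    """Pick the middle frame of each shot as its keyframe."""
    return [(s + e) // 2 for s, e in shots]
```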
Text-to-video search
We learn text and video shot embeddings in the same latent
space.
Shot embeddings use the cross-modal neural network TxV,
trained with multiple video-caption datasets.
This is improved by using multiple pre-trained vision-language models as video frame feature extractors (CLIP, SLIP, BLIP…)
Shot level embeddings are aggregated from frame
embeddings by mean pooling, and combined with the
embeddings from TxV.
This means we can now search video shots with textual queries without relying on any textual metadata: the input query is encoded by the same models, and cosine similarity to the shot embeddings is used to produce a ranked result list.
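The retrieval step above, mean-pooling frame embeddings into a shot embedding and ranking shots by cosine similarity to the encoded query, can be sketched with NumPy. The actual encoders (TxV, CLIP, etc.) are stood in for here by pre-computed vectors, and the dimensions are illustrative:

```python
import numpy as np

# Sketch of the retrieval step: frame embeddings are mean-pooled into
# one shot-level vector, and shots are ranked by cosine similarity to
# the encoded text query. Real embeddings would come from TxV and the
# vision-language models; the vectors here are placeholders.

def mean_pool(frame_embeddings):
    """Aggregate per-frame embeddings into a single shot embedding."""
    return np.mean(frame_embeddings, axis=0)

def rank_shots(query_embedding, shot_embeddings):
    """Return shot indices ordered by descending cosine similarity
    to the query embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    s = shot_embeddings / np.linalg.norm(shot_embeddings, axis=1, keepdims=True)
    sims = s @ q  # cosine similarity of each shot to the query
    return np.argsort(-sims).tolist()
```

Because both modalities live in the same latent space, no textual metadata is consulted at query time.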
https://github.com/bmezaris/TextToVideoRetrieval-TtimesV
Image/video selection
Journalists can find images and video shots matching a news story or event, as this text-to-video search is integrated into the dashboard.
Media can be bookmarked and exported
(as references to the media).
We experimented with using the collected media to create immersive content with InstantNERF (with mixed results)!
Generating immersive media
AI that converts images/video into 3D objects/scenes is emerging, but challenges remain.
The quality and coverage of the input media is a critical issue: AI still cannot 'infer' well the non-visible parts of an object or scene.
However, with the right media available (and
now discoverable), journalists could more
easily prepare immersive experiences.
Conclusions and outlook
Journalists need easy-to-find media and easy-to-use tools to generate immersive content from that media.
Our dashboard identifies news stories and makes the video shots relating to those stories discoverable through text-to-video search.
Extracted image/video collections can be converted into immersive objects/scenes by generative AI, but the technology is still improving…