Dataset and methods for 360-degree video summarization


About This Presentation

Presentation of our paper, "A Human-Annotated Video Dataset for Training and Evaluation of 360-Degree Video Summarization Methods", by I. Kontostathis, E. Apostolidis, V. Mezaris. Presented at the 1st Int. Workshop on Video for Immersive Experiences (Video4IMX-2024) of ACM IMX 2024, Stockholm, Sweden, 12 June 2024.


Slide Content

A human-annotated video dataset for training and evaluation of 360-degree video summarization methods
Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris
Information Technologies Institute, Centre for Research and Technology Hellas
VIDEO4IMX-2024 Workshop @ ACM IMX 2024, Stockholm, Sweden, 12 June 2024

Introduction
Current status: Increasing interest in the production and distribution of 360° video content, supported by:
- The availability of advanced 360° video recording devices (GoPro, Insta360)
- The compatibility of the most popular social networks and video-sharing platforms with this type of video content
Potential use: Transforming 360° videos into concise 2D-video summaries that can be viewed on traditional devices (TV sets, smartphones) would: i) enable repurposing, ii) increase consumption (via additional devices), and iii) facilitate browsing and retrieval of 360° video content
Observed need for:
- Technologies that could support the summarization of 360° videos
- Datasets for training these technologies
(Figure: 360° video → 2D video summary)

Related Work
Most existing datasets can train networks for NFOV selection and saliency prediction, to support:
- The creation of NFOV videos from 360° videos (Pano2Vid)
- The navigation of the viewer in the content of 360° videos (Pano2Vid, Sports-360)
- The prediction of viewport-dependent 360° video saliency (PVS-HM, VR-EyeTracking)
These datasets can only partially support the training of methods for 360° video highlight detection or summarization.
Datasets suitable for training 360° video highlight detection and story-based summarization methods: i) assume only one important activity or narrative, and ii) are not publicly available.

The 360-VSumm dataset: Video generation process
Basis: The VR-EyeTracking dataset [21], which comprises 208 dynamic HD 360° videos with a diverse range of content (e.g. indoor/outdoor scenes, underwater activities, sports games, short films)
Processing steps:
- Used the ERP frames and their associated ground-truth saliency maps
- Ran the "2D video production" algorithm from Kontostathis et al. (2024) to produce 2D videos showing the detected salient activities and events in the 360° videos, and to compute the frames' saliency
- Selected a subset of 40 2D-videos with dynamic and diverse visual content and a duration longer than 1 min.
I. Kontostathis, E. Apostolidis, V. Mezaris. 2024. An Integrated System for Spatio-temporal Summarization of 360-Degrees Videos. MultiMedia Modeling. Springer Nature Switzerland, Cham, 202–215.
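
The duration criterion in the last step above can be checked programmatically; the following is a minimal sketch (not the authors' pipeline), where the folder name and the use of OpenCV to read video metadata are illustrative assumptions:

```python
# Minimal sketch of the duration-based selection step (assumption: the
# produced 2D videos are stored locally as .mp4 files).
import glob

import cv2

MIN_DURATION_SEC = 60  # keep only 2D videos longer than 1 minute

selected = []
for path in glob.glob("produced_2d_videos/*.mp4"):  # hypothetical folder
    cap = cv2.VideoCapture(path)
    n_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    if fps > 0 and n_frames / fps > MIN_DURATION_SEC:
        selected.append(path)

print(f"{len(selected)} videos satisfy the duration criterion")
```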

The 360-VSumm dataset: Video annotation process
- Performed by 15 annotators, who were asked to select the most important/interesting parts of each video to form a summary whose duration is approximately equal to 15% of the video's length
- Each 2D-video's fragments (corresponding to different salient activities) were further segmented into M 2-sec. sub-fragments
- Using a purpose-built, user-friendly interactive tool (see figure), each annotator had to select N sub-fragments, with N = 15% of M, to form the video summary
- The tool allows users to repeat the selection process and check the generated summary multiple times, before settling on the most suitable one
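
To make the selection budget concrete, the sketch below splits one hypothetical fragment into 2-second sub-fragments and derives the number N of sub-fragments an annotator may pick (N = 15% of M); the frame range, frame rate and rounding rule are illustrative assumptions, not taken from the annotation tool itself:

```python
def split_into_subfragments(start_frame, end_frame, fps, sub_len_sec=2):
    """Return (start, end) frame ranges of consecutive 2-sec sub-fragments."""
    step = int(round(sub_len_sec * fps))
    return [(s, min(s + step, end_frame)) for s in range(start_frame, end_frame, step)]

# Hypothetical example: a fragment spanning frames 0..899 of a 30-fps video.
subs = split_into_subfragments(0, 900, fps=30)
M = len(subs)                 # number of 2-sec sub-fragments (15 here)
N = max(1, round(0.15 * M))   # selection budget, ~15% of M (2 here)
```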

The 360-VSumm dataset: Characteristics
- 40 2D-videos with dynamic and diverse visual content, showing multiple events that overlap in time or run in parallel (see figure)
- 15 ground-truth annotations (human-generated summaries) for each video; binary vectors indicating which frames have been selected for inclusion in the summary and which have not
- A mean ground-truth summary that can be used for supervised training (average of the 15 ground-truth summaries at the frame level)
- Data about the fragments/sub-fragments of these videos and their frames' saliency
- Publicly available at: https://github.com/IDT-ITI/360-VSumm
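
As an illustration of how the mean ground-truth summary can be derived from the 15 binary annotation vectors (a minimal sketch with placeholder data; the array shapes are assumptions):

```python
import numpy as np

# Placeholder data: one row per annotator (15 rows), one column per video
# frame; a value of 1 means the frame was selected for the summary.
user_summaries = np.random.randint(0, 2, size=(15, 3000))

# Frame-level average over the 15 human-generated summaries; the resulting
# scores in [0, 1] can be used as ground truth for supervised training.
mean_gt_summary = user_summaries.mean(axis=0)
```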

Experiments: Research questions and experimental settings
Research questions:
- Can we use pre-trained models of state-of-the-art methods for conventional video summarization to produce summaries for 360° videos?
- Are there any performance gains after re-training these models using data from 360° videos?
- Does it help to take the frames' saliency into account?
Experimental settings:
- Used two state-of-the-art (SoA) methods for traditional 2D-video summarization:
  - PGL-SUM: supervised method; combines global and local multi-head attention mechanisms with positional encoding to model frame dependencies at various levels of granularity
  - CA-SUM: unsupervised method; contains a concentrated attention mechanism and incorporates knowledge about the uniqueness and diversity of the video frames
- Estimated the similarity between a machine-generated and a user-defined summary using the F-Score
- Split the dataset into five different splits to perform 5-fold cross-validation
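
For reference, the frame-level F-Score commonly used in the video summarization literature can be computed as below; this is a generic sketch assuming binary frame-selection vectors, not the exact evaluation code of the paper. Per-video scores are then typically aggregated over the available user summaries and averaged over the test videos of each split.

```python
import numpy as np

def f_score(machine_summary: np.ndarray, user_summary: np.ndarray) -> float:
    """Frame-level F-Score between two binary frame-selection vectors."""
    overlap = float(np.sum(machine_summary * user_summary))  # frames in both summaries
    if overlap == 0:
        return 0.0
    precision = overlap / np.sum(machine_summary)  # share of the machine summary that matches
    recall = overlap / np.sum(user_summary)        # share of the user summary that is covered
    return 2 * precision * recall / (precision + recall)
```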

Experiments: Quantitative results
Question: “Can we use pre-trained models of state-of-the-art methods for conventional video summarization to produce summaries for 360° videos?”
- Measured the performance of a random summarizer and of pre-trained models of PGL-SUM and CA-SUM (trained on the SumMe and TVSum video summarization datasets) on the test splits of 360-VSumm
- Results indicated that models trained for conventional video summarization show random-level performance on 360° video summarization
Answer: “No, we need methods that are better tailored to the visual characteristics of 360° videos”
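
A random summarizer of the kind used as a reference point can be emulated as in the sketch below (in the spirit of the "Performance over Random" protocol listed in the references); the summary budget and number of repetitions are illustrative assumptions:

```python
import numpy as np

def random_summaries(n_frames: int, budget: float = 0.15, runs: int = 100) -> np.ndarray:
    """Generate `runs` random binary summaries, each selecting ~15% of the frames."""
    n_selected = int(round(budget * n_frames))
    summaries = np.zeros((runs, n_frames))
    for r in range(runs):
        picked = np.random.choice(n_frames, size=n_selected, replace=False)
        summaries[r, picked] = 1.0
    # Each row can be scored against the user summaries with f_score() and the
    # results averaged to obtain the random-performance reference.
    return summaries
```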

Experiments: Quantitative results
Question: “Are there any performance gains after re-training these models using data from 360° videos?”
- Investigation using PGL-SUM: Tried to find the optimal number of local attention mechanisms; then explored different options for the number of attention heads
- Results showed that: i) using more local attention mechanisms leads to improved performance; ii) using the maximum number of attention heads per mechanism further improves the performance by 1.6%
- Investigation using CA-SUM: Ran experiments for different regularization factors; then considered various choices for the block size of the concentrated attention
- Results showed that setting the regularization factor to 0.7 and the block size to 70 leads to the best performance
Answer: “Yes, using the 360-VSumm dataset we can effectively train methods that were designed for conventional video summarization”
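
The hyper-parameter exploration described above boils down to a grid search evaluated with 5-fold cross-validation; the sketch below illustrates the idea for PGL-SUM. The search space values and the train_pgl_sum/evaluate_f_score helpers are hypothetical stand-ins, not the authors' actual training code:

```python
import itertools
import random

# Hypothetical stand-ins for the real training/evaluation routines; here they
# just return a random F-Score so the loop structure is runnable.
def train_pgl_sum(split, n_local, n_heads):
    return {"split": split, "n_local": n_local, "n_heads": n_heads}

def evaluate_f_score(model, split):
    return random.random()

# Illustrative search space (not the values reported in the paper).
local_attention_counts = [1, 2, 4]   # number of local attention mechanisms
attention_heads = [4, 8]             # attention heads per mechanism

best_config, best_score = None, -1.0
for n_local, n_heads in itertools.product(local_attention_counts, attention_heads):
    fold_scores = [evaluate_f_score(train_pgl_sum(s, n_local, n_heads), s) for s in range(5)]
    mean_score = sum(fold_scores) / len(fold_scores)   # average over the 5 folds
    if mean_score > best_score:
        best_config, best_score = (n_local, n_heads), mean_score
```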

Experiments: Quantitative results
Question: “Does it help to take the frames’ saliency into account?”
- Evaluated the performance of variants of PGL-SUM and CA-SUM that use the frames’ saliency scores to weight the deep representations of the visual content of the 2D-video frames
- Results showed that the use of the frames’ saliency improves the summarization performance of both PGL-SUM and CA-SUM
Answer: “Yes, the frames’ saliency is a useful auxiliary signal for the training process”
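
A minimal sketch of the saliency-based weighting used by these variants (the feature dimensions are illustrative assumptions; the actual variants apply the weighting inside the PGL-SUM/CA-SUM pipelines):

```python
import numpy as np

# Illustrative inputs: deep frame representations (num_frames x feature_dim)
# and per-frame saliency scores in [0, 1].
features = np.random.rand(3000, 1024)
saliency = np.random.rand(3000)

# Scale each frame's representation by its saliency score before feeding it
# to the summarization network.
weighted_features = features * saliency[:, None]
```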

Experiments: Qualitative results
Frame-based overview of the events presented in the video (top part), and the summaries produced by CA-SUM, PGL-SUM and their saliency-aware variants (bottom part). Bounding boxes of the same color indicate activities that take place at the same time in different views of the 360° video.

Conclusions and future work
Conclusions:
- Presented the 360-VSumm dataset for training and evaluating 360° video summarization methods
- Trained two SoA methods for conventional 2D-video summarization and evaluated their performance, to establish a baseline for future comparisons
- Considered two saliency-aware variants of these methods and documented the positive impact of using data about the frames' saliency during the summarization process
- Developed an interactive annotation tool that can be used to facilitate similar annotation activities
Future work:
- Extract additional data about the frames of the produced 2D-videos (e.g. their spatial positioning in the 360° video) and use it as extra auxiliary data for training 360° video summarization methods

References
E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, I. Patras. 2020. Performance over Random: A Robust Evaluation Protocol for Video Summarization Methods. 28th ACM Int. Conf. on Multimedia (MM ’20). ACM, New York, NY, USA, 1056–1064.
E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, I. Patras. 2021. Video Summarization Using Deep Neural Networks: A Survey. Proc. IEEE 109, 11 (2021), 1838–1863.
E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras. 2021. Combining Global and Local Attention with Positional Encoding for Video Summarization. 2021 IEEE Int. Symposium on Multimedia (ISM). 226–234.
E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras. 2022. Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames. 2022 Int. Conf. on Multimedia Retrieval (ICMR ’22). ACM, New York, NY, USA, 407–415.
E. Bernal-Berdun, D. Martin, D. Gutierrez, B. Masia. 2022. SST-Sal: A spherical spatio-temporal approach for saliency prediction in 360 videos. Computers & Graphics 106 (2022), 200–209.
Y. Dahou, M. Tliba, K. McGuinness, N. O’Connor. 2021. ATSal: An Attention Based Architecture for Saliency Prediction in 360 Videos. Pattern Recognition. ICPR Int. Workshops and Challenges. Springer International Publishing, Cham, 305–320.
M. Gygli, H. Grabner, H. Riemenschneider, L. Van Gool. 2014. Creating Summaries from User Videos. European Conf. on Computer Vision (ECCV) 2014. Springer International Publishing, Cham, 505–520.
H.-N. Hu, Y.-C. Lin, M.-Y. Liu, H.-T. Cheng, Y.-J. Chang, M. Sun. 2017. Deep 360 Pilot: Learning a Deep Agent for Piloting Through 360° Sports Videos. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
M. Hu, R. Hu, Z. Wang, Z. Xiong, R. Zhong. 2022. Spatiotemporal two-stream LSTM network for unsupervised video summarization. Multimedia Tools and Applications 81 (2022), 40489–40510.
K. Kang, S. Cho. 2019. Interactive and Automatic Navigation for 360° Video Playback. ACM Trans. Graph. 38, 4, Article 108 (2019), 11 pages.
I. Kontostathis, E. Apostolidis, V. Mezaris. 2024. An Integrated System for Spatio-temporal Summarization of 360-Degrees Videos. MultiMedia Modeling. Springer Nature Switzerland, Cham, 202–215.
S. Lee, J. Sung, Y. Yu, G. Kim. 2018. A Memory Network Approach for Story-Based Temporal Summarization of 360° Videos. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).

References
G. Liang, Y. Lv, S. Li, S. Zhang, Y. Zhang. 2022. Video summarization with a convolutional attentive adversarial network. Pattern Recognition 131 (2022), 108840.
M. Qiao, M. Xu, Z. Wang, A. Borji. 2021. Viewport-Dependent Saliency Prediction in 360° Video. IEEE Transactions on Multimedia 23 (2021), 748–760.
Y. Song, J. Vallmitjana, A. Stent, A. Jaimes. 2015. TVSum: Summarizing web videos using titles. 2015 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR).
Y.-C. Su, K. Grauman. 2017. Making 360° Video Watchable in 2D: Learning Videography for Click Free Viewing. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
Y.-C. Su, D. Jayaraman, K. Grauman. 2016. Pano2Vid: Automatic Cinematography for Watching 360 Videos. Asian Conference on Computer Vision (ACCV).
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, et al. 2015. Going deeper with convolutions. 2015 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 1–9.
M. Wang, Y.-J. Li, W.-X. Zhang, C. Richardt, et al. 2020. Transitioning360: Content-aware NFoV Virtual Camera Paths for 360° Video Playback. 2020 IEEE Int. Symposium on Mixed and Augmented Reality (ISMAR). 185–194.
M. Xu, Y. Song, J. Wang, M. Qiao, et al. 2019. Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach. IEEE Trans. on Pattern Analysis and Machine Intelligence 41, 11 (2019), 2693–2708.
Y. Xu, Y. Dong, J. Wu, Z. Sun, Z. Shi, J. Yu, S. Gao. 2018. Gaze Prediction in Dynamic 360° Immersive Videos. 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). 5333–5342.
Y. Yu, S. Lee, J. Na, J. Kang, G. Kim. 2018. A Deep Ranking Model for Spatio-Temporal Highlight Detection From a 360 Video. 2018 AAAI Conf. on Artificial Intelligence.
S.-S. Zang, H. Yu, Y. Song, R. Zeng. 2023. Unsupervised video summarization using deep Non-Local video summarization networks. Neurocomputing 519 (2023), 26–35.
Y. Zhang, Y. Liu, C. Wu. 2024. Attention-guided multigranularity fusion model for video summarization. Expert Systems with Applications 249 (2024), 123568.
W. Zhu, J. Lu, Y. Han, J. Zhou. 2022. Learning multiscale hierarchical attention for video summarization. Pattern Recognition 122 (2022), 108312.

Thank you for your attention! Questions? Vasileios Mezaris, [email protected] Code and dataset publicly available at: https://github.com/IDT-ITI/360-VSumm (Traditional) Video summarization demos: https://multimedia2.iti.gr/videosummarization/service/start.html (fully automatic) https://multimedia2.iti.gr/interactivevideosumm/service/start.html (interactive) This work was supported by the EU Horizon Europe and Horizon 2020 programmes under grant agreements 101070109 TransMIXR and 951911 AI4Media, respectively