[NS] Lab Seminar (2024-06-14): Video Matting via Consistency-Regularized Graph Neural Networks
Video Matting via Consistency-Regularized Graph Neural Networks
Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/06/14
Paper: Tiantian Wang et al., ICCV 2021
Introduction
Figure 1: Matting results of different models. The first row shows the image and ground truth. The second row shows the predictions of a video matting method [34] (left) and our method (right). The third row shows the blended image generated from the foreground and predicted alpha. Our method recovers noticeably finer detail on the hair.
Introduction
- Video matting aims to estimate the foreground opacity (alpha matte) of each video frame.
- Accurate video matting requires two guarantees:
  - Alpha mattes extracted on individual frames should accurately represent the object to be extracted (spatial accuracy).
  - Extracted mattes should not exhibit noticeable temporal jitter (temporal coherence).
- This paper focuses on two challenges:
  - How to produce temporally coherent alpha predictions with an existing image matting dataset.
  - How to mitigate the domain gap when transferring a model trained on the composited dataset to real data.
- Proposes a Consistency-Regularized Graph Neural Network to address these challenges:
  - Designs a GNN over space and time to enhance temporal coherence; nodes denote video frames and edges link pairs of neighboring frames, represented by their pairwise relation (a minimal sketch of this graph structure follows below).
  - Proposes a consistency regularization technique to generalize a model pretrained on the composited dataset to real data.
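A minimal sketch (not the authors' code) of the frame graph described above: nodes carry per-frame latent features and edges connect temporally neighboring frames. The function name `build_frame_graph` and the `window` parameter are illustrative assumptions.

```python
import torch

def build_frame_graph(frame_features: torch.Tensor, window: int = 1):
    """Connect each frame to its temporal neighbours within `window`.

    frame_features: (K, C, H, W) latent features, one per frame (node).
    Returns a list of directed (i, j) edges in both directions.
    """
    K = frame_features.shape[0]
    edges = []
    for i in range(K):
        for j in range(max(0, i - window), min(K, i + window + 1)):
            if i != j:
                edges.append((i, j))
    return edges

# Example: 5 frames with 64-channel latent features
feats = torch.randn(5, 64, 32, 32)
print(build_frame_graph(feats, window=1))  # [(0, 1), (1, 0), (1, 2), ...]
```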
Proposed Algorithm
Given a video $\mathcal{V} = \{I_i\}_{i=1}^{V}$ of $V$ frames, the goal is to decompose each frame as
$$I_i = \alpha_i \circ F_i + (1 - \alpha_i) \circ B_i,$$
where $\alpha_i$ is the alpha matte, $F_i$ the foreground color, $B_i$ the background color, and $\circ$ the Hadamard (element-wise) product (a short numeric sketch follows below).
Figure 2: Overview of the proposed method. Given video frames and (pseudo) trimaps, the proposed model first predicts the foreground colors and alpha mattes via the GNN by leveraging the frame-wise interaction. The predicted foregrounds and alphas are then blended with new backgrounds to generate new images, which are forwarded into the same GNN to generate new foregrounds and alphas. The consistency regularization and discriminator are proposed to generalize the model trained on labeled composited videos to unlabeled real videos.
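As a concrete illustration of the compositing equation above, a short sketch that blends a foreground with a background using the alpha matte; tensor shapes and names are illustrative.

```python
import torch

def composite(alpha: torch.Tensor, fg: torch.Tensor, bg: torch.Tensor) -> torch.Tensor:
    """I = alpha * F + (1 - alpha) * B, applied element-wise (Hadamard product).

    alpha: (1, H, W) in [0, 1]; fg, bg: (3, H, W) colour images.
    """
    return alpha * fg + (1.0 - alpha) * bg

alpha = torch.rand(1, 64, 64)
fg, bg = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
frame = composite(alpha, fg, bg)  # the observed frame I
```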
Proposed Algorithm
Composited Video Matting
- Given a video $\mathcal{V}$ with ground-truth labels for each frame $I_i$.
- Generate a trimap $T_i$ from the alpha $A_i$, giving coarse information about the foreground, background, and unknown regions (see the sketch after this list).
- Latent representation: each frame's latent feature is produced by an encoder network applied to the concatenation of the frame and its trimap.
- Graph with $K$ nodes at step $t$: $G^t = (\mathcal{V}^t, \mathcal{E}^t)$, where $\mathcal{V}^t = \{x_i^t\}_{i=1}^{K}$ holds the latent feature of the $i$-th frame and the edges $\mathcal{E}^t = \{\{e_{i,j}^t\}_{i=1}^{K}\}_{j=1}^{K}$ denote the relation between two nodes, computed by an aggregation function.
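One common way to derive a coarse trimap from a ground-truth alpha is to threshold it and erode/dilate the resulting masks; the thresholds and kernel size below are illustrative assumptions, not the paper's exact settings.

```python
import cv2
import numpy as np

def trimap_from_alpha(alpha: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """Derive a coarse trimap from an alpha matte.

    alpha: (H, W) float array in [0, 1].
    Returns a trimap with 0 = background, 128 = unknown, 255 = foreground.
    """
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    fg = (alpha > 0.95).astype(np.uint8)       # confident foreground
    not_bg = (alpha > 0.05).astype(np.uint8)   # foreground plus uncertain band
    fg = cv2.erode(fg, kernel)                 # shrink the confident region
    not_bg = cv2.dilate(not_bg, kernel)        # grow the uncertain band
    trimap = np.full(alpha.shape, 128, dtype=np.uint8)
    trimap[not_bg == 0] = 0
    trimap[fg == 1] = 255
    return trimap
```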
Proposed Algorithm
Composited Video Matting
Feature aggregation
- Adopt deformable alignment [40, 44], which uses deformable convolution to implement feature aggregation.
- Given two feature embeddings $x_i$, $x_j$, the offsets on the regular convolution kernel are predicted from the pair; $\mathcal{R} = \{(-1,-1), (-1,0), \dots, (1,1)\}$ denotes the regular grid of a $3 \times 3$ convolutional kernel.
- Aligned feature map: for each position $p$, the neighboring feature $x_j$ is sampled at the kernel grid positions shifted by the predicted offsets, i.e. at $p + p_k + \Delta p_k$ for $p_k \in \mathcal{R}$.
- Aggregated feature for the $i$-th frame: a convolution applied to the concatenation of $x_i$ and the aligned feature (a minimal sketch follows after this list).
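A minimal sketch of deformable-alignment-based aggregation using torchvision's `DeformConv2d`; the module structure, layer names (`offset_conv`, `fuse`), and channel choices are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableAlign(nn.Module):
    """Align x_j towards x_i: predict offsets from [x_i, x_j], then deform-conv x_j."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Two offset values (dx, dy) per kernel position, predicted from the pair
        self.offset_conv = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size, padding=pad)
        # Fuse the aligned neighbour feature with x_i via a plain convolution
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x_i: torch.Tensor, x_j: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(torch.cat([x_i, x_j], dim=1))
        aligned = self.deform_conv(x_j, offsets)          # x_j sampled at shifted positions
        return self.fuse(torch.cat([x_i, aligned], dim=1))

x_i, x_j = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
agg = DeformableAlign(64)(x_i, x_j)   # aggregated feature for node i
```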
Proposed Algorithm
Composited Video Matting
Node-state updating
- In the $t$-th message-passing step, the node state is updated by a GRU that takes the previous state and the aggregated feature, after a convolution operator for dimensionality reduction (see the ConvGRU-style sketch after this list).
Network prediction
- After $T$ message-passing iterations, all node representations have been updated and are used to predict the alpha matte and the foreground via two decoders.
- The input frame is then reconstructed from the predicted alpha and foreground together with the background.
- Training minimizes the sum of the prediction errors on the alpha matte, the foreground, and the reconstructed input frame.
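A minimal ConvGRU-style sketch of the node-state update, assuming the aggregated feature has already been reduced to the node-state dimensionality; the gate layout is a standard convolutional GRU, not necessarily the authors' exact cell.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU that refreshes a node state from the aggregated feature."""

    def __init__(self, channels: int):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)  # update + reset gates
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)       # candidate state

    def forward(self, h: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        z, r = torch.sigmoid(self.gates(torch.cat([h, x], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * h_tilde

# One message-passing step: the node state is refreshed with the aggregated feature
h_i = torch.randn(1, 64, 32, 32)    # current node state x_i^t
agg = torch.randn(1, 64, 32, 32)    # aggregated feature from neighbouring frames
h_next = ConvGRUCell(64)(h_i, agg)  # updated node state x_i^{t+1}
```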
Proposed Algorithm
Real Video Matting
- The GNN trained on composited data still fails when applied to real videos due to the domain gap.
- Propose a regularization approach that enforces consistency on the alpha and foreground predictions when blending with different backgrounds.
- Adopt an adversarial training scheme to further mitigate the domain gap between composited videos and real ones.
- Let $\mathcal{V} = \{I_i\}_{i=1}^{V}$ be a video drawn from the composited set ($\mathcal{V}$ is labeled) and $\mathcal{R} = \{U_i\}_{i=1}^{U}$ be a video drawn from the real set ($\mathcal{R}$ is unlabeled).
Consistency regularization
- For each composited frame, the trimap $T_i$ is generated using the ground-truth alpha matte [48, 28]; for $\mathcal{R}$, a pseudo trimap is generated from a segmentation map produced by DeepLabv3 [9].
- The GNN takes $I_i$ and $T_i$ as input and generates an alpha matte and foreground; the predicted foreground is composited with a random new background using the predicted alpha to generate a new frame.
- The new frame is fed into the same GNN to generate new alpha matte and foreground predictions; the two predictions should be consistent with each other since they represent the same object against different backgrounds (a round-trip sketch follows after this list).
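A minimal sketch of the consistency round-trip described above, assuming a hypothetical `model(frame, trimap)` callable that returns `(alpha, foreground)`; the L1 form of the penalty and the `detach` on the first prediction are illustrative choices, not necessarily the paper's exact regularizer.

```python
import torch
import torch.nn.functional as F

def consistency_pass(model, frame, trimap, new_bg):
    """One consistency round-trip on an unlabeled real frame (illustrative)."""
    alpha, fg = model(frame, trimap)
    recomposited = alpha * fg + (1.0 - alpha) * new_bg   # blend with a new background
    alpha2, fg2 = model(recomposited, trimap)            # second pass on the new frame
    # Penalise disagreement between the two predictions of the same object
    return F.l1_loss(alpha2, alpha.detach()) + F.l1_loss(fg2, fg.detach())

# Toy usage with a dummy "model" that ignores its inputs (illustrative only)
dummy = lambda f, t: (torch.rand(1, 1, 64, 64), torch.rand(1, 3, 64, 64))
frame, bg = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
trimap = torch.rand(1, 1, 64, 64)
loss = consistency_pass(dummy, frame, trimap, bg)
```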
Proposed Algorithm
Real Video Matting
- The new frame is composited from the predicted foreground and alpha with a randomly chosen new background.
- Consistency regularizer: penalizes the discrepancy between the original and re-predicted alpha mattes and foregrounds.
- Learning objective: combines the supervised matting loss on labeled composited videos with the consistency regularizer on unlabeled real videos.
Adversarial learning to further mitigate the domain gap
- Augment the data by translating the foreground objects predicted from real frames with an arbitrary small shift (slight variations) within a range of local shifts.
- Synthesize composited images from the shifted foregrounds and alphas with new backgrounds.
- Optimize an adversarial loss that matches the distribution of the synthesized images to that of real frames and backgrounds, using a discriminator $D$ applied to a random frame (a hedged sketch follows after this list).
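A hedged sketch of the shift augmentation and an adversarial objective; the non-saturating BCE losses and the `max_shift` value are common choices assumed here, and `disc` stands for a discriminator such as the PatchGAN mentioned in the Experiments slide.

```python
import random
import torch
import torch.nn.functional as F

def shift_foreground(fg: torch.Tensor, alpha: torch.Tensor, max_shift: int = 8):
    """Translate the predicted foreground and alpha by a small random shift."""
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)
    return (torch.roll(fg, (dy, dx), dims=(-2, -1)),
            torch.roll(alpha, (dy, dx), dims=(-2, -1)))

def adversarial_losses(disc, real_frame, fg, alpha, bg):
    """Discriminator and generator losses for the synthesized composite (one common form)."""
    fg_s, a_s = shift_foreground(fg, alpha)
    fake = a_s * fg_s + (1.0 - a_s) * bg                 # synthesized composite image
    d_real, d_fake = disc(real_frame), disc(fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    g_fake = disc(fake)
    g_loss = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    return d_loss, g_loss

# Toy usage with a single-layer stand-in for the discriminator
disc = torch.nn.Conv2d(3, 1, 4, stride=2, padding=1)
real, fg, bg = (torch.rand(1, 3, 64, 64) for _ in range(3))
alpha = torch.rand(1, 1, 64, 64)
d_loss, g_loss = adversarial_losses(disc, real, fg, alpha, bg)
```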
Datasets
- Only one labeled dataset exists for video matting [33]: 3 training videos and 10 test videos.
- Propose two synthesized datasets to alleviate this problem.
- Provide a real-world dataset.
Figure 4: Video matting dataset. (a) The first two rows show composited video frames with the same foreground objects. The foregrounds are first extracted from videos with simple backgrounds and then composited with two different backgrounds. (b) The first row shows the original real video frames and the second row shows the objects blended with a new background using the annotated foreground and alpha. The third row in (a) and (b) shows the annotated alpha.
Experiments
- Use a data augmentation scheme to increase the diversity of the input data:
  - Randomly crop image and trimap pairs centered on pixels in the unknown regions at varied resolutions, then resize them.
  - Apply random rotation, scaling, shearing, and vertical/horizontal flipping for the affine transformation.
- Pretrained on an image matting dataset [48], then finetuned using labeled composited data and unlabeled real data.
- Use encoder and decoder structures similar to [28]; for the discriminator, use PatchGAN [24].
- Evaluation metrics: SAD, MSE, Gradient, Connectivity, Temporal coherence (a sketch of SAD/MSE computation follows after this list).
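A small sketch of how SAD and MSE are typically computed for matting; restricting the error to the unknown trimap region and omitting scaling factors are assumptions about common practice, not the paper's exact evaluation protocol.

```python
import numpy as np

def sad_mse(pred: np.ndarray, gt: np.ndarray, trimap: np.ndarray):
    """SAD and MSE over the unknown trimap region, as commonly reported in matting.

    pred, gt: (H, W) alpha mattes in [0, 1]; trimap: (H, W) with 128 = unknown.
    Reported numbers are often rescaled (e.g. SAD / 1000), as noted in Table 1.
    """
    mask = (trimap == 128)
    diff = pred[mask] - gt[mask]
    return np.abs(diff).sum(), np.mean(diff ** 2)

pred, gt = np.random.rand(64, 64), np.random.rand(64, 64)
trimap = np.full((64, 64), 128, dtype=np.uint8)
print(sad_mse(pred, gt, trimap))
```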
Experiments
Results on the composited and real datasets
Table 1: Quantitative results on the two human matting datasets. To better show the performance differences, the numbers for the above measures have been scaled up or down. The scaling factors of the five measures from left to right are 1000, 0.01, 0.01, 0.01, 1000. IM* means we re-train IM using the proposed dataset. The best results are in bold.
Table 2: Results on the auxiliary category dataset.
Experiments Qualitative results Figure 5: Visual comparison on the composited dataset.
Experiments Qualitative results Figure 6: Visual comparison on the real dataset.
Experiments
Ablation Study
Table 3: Ablation study on the variants of the proposed network. ‘Baseline’ means the image-level model without using the GNN. ‘+’ means the progressive connection of different modules.
Conclusion
- The authors focus on enhancing temporal coherence for matting in videos.
- Propose to maintain temporal consistency by exploiting inter-frame relationships within the video, using a GNN to relate adjacent frames.
- Synthesize video matting datasets.
- Propose a regularization scheme to enforce consistency on the alpha, foreground, and predicted frames.
- Annotate a real-world dataset with alpha mattes for evaluation.