20240819_NM_LivePortrait_Nnabla_youtube_final.pdf

nnabla by Sony · 42 slides · Sep 18, 2024

About This Presentation

These slides were used in the following video on the YouTube nnabla channel.

【AI Paper】Perform efficient portrait animation with LivePortrait!
https://youtu.be/xB71xA_bPoQ?si=JkXTpAHbh4Ydgszr

The video explains the following paper.

LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control


Slide Content

Paper Explanation

Authors propose a portrait animation framework with a focus on better generalization, controllability, and efficiency

Summary
➢ Theme: Portrait Animation; synthesize a lifelike video from a single source image
➢ Key Contributions
✓ Developing a solid implicit keypoint-based framework with enhanced generation quality and generalization ability
✓ Designing advanced stitching and retargeting modules for better controllability, with minimal computational overhead

Theme: Portrait Animation
➢ Task of animating a static portrait image with motion derived from a driving video, audio, text, or generation
➢ Targets key regions: face, hair, neck, and shoulder
[Figure: Generative Model pipeline]

Related works fall across 2 categories (1/2)
➢ Non-diffusion-based Portrait Animation
▪ Rely on implicit keypoints, predefined motion, or latent expression representations to animate portraits using optical flow and local transformations
▪ However, such methods struggle with handling large-scale and complex motions and offer limited flexibility

Non-diffusion-based Portrait Animation: Face Vid2Vid
➢ Encodes the target motion using a novel keypoint representation for estimating flow fields to warp the source appearance features (warping sketched below)
➢ Employs a generator to convert the warped feature volume to the output image
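As a rough illustration of the flow-field warping step (a sketch under assumed shapes, not the original Face Vid2Vid implementation), appearance features can be warped by sampling them with a dense grid derived from the estimated flow:

import torch
import torch.nn.functional as F

# feats: source appearance features; grid: per-pixel sampling locations in
# [-1, 1] built from the estimated flow field (both shapes are assumptions)
feats = torch.randn(1, 32, 64, 64)            # (B, C, H, W)
grid = torch.rand(1, 64, 64, 2) * 2.0 - 1.0   # (B, H, W, 2)
warped = F.grid_sample(feats, grid, align_corners=True)  # warped features fed to the generator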

Related works fall across 2 categories (2/2)
➢ Diffusion-based Portrait Animation
▪ Leverage LDMs to generate high-quality animations by iteratively refining from Gaussian noise
▪ Computationally intensive and struggle with preserving identity / fine details in highly dynamic scenarios

Diffusion-based Portrait Animation: MegActor
➢ Leverages existing face-swapping and stylization frameworks to get the cross-identity training pairs
➢ ReferenceNet extracts identity and background information and injects the information into the Denoising UNet through cross-attention (sketched below)
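As a minimal, hedged sketch of the cross-attention injection mentioned above (token and embedding sizes are assumptions, not MegActor's actual configuration), the Denoising UNet features can attend to ReferenceNet features as follows:

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=320, num_heads=8, batch_first=True)
unet_tokens = torch.randn(1, 1024, 320)  # flattened UNet feature tokens (assumed shape)
ref_tokens = torch.randn(1, 1024, 320)   # ReferenceNet identity/background tokens (assumed shape)
injected, _ = attn(query=unet_tokens, key=ref_tokens, value=ref_tokens)  # cross-attention injection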

Summary of the video-driven portrait animation methods
✓ LivePortrait stands out by offering controllability features such as stitching and eyes and lips retargeting, making it a well-rounded and efficient solution for portrait animation

Overall Architecture
➢ The LivePortrait network pipeline consists of 2 main stages:
I. BASE MODEL
II. STITCHING & RETARGETING

I. Base Model Training
➢ Face Vid2Vid as the base model, with significant enhancements:
✓ Upgraded network architecture
✓ Scalable motion transformation
✓ Landmark-guided implicit keypoints optimization
✓ Cascaded loss terms

I. Base Model Training: Upgraded network architecture
➢ Unification of the keypoint detector, head pose estimation network, and expression deformation estimator into a single model with a ConvNeXt-V2-Tiny backbone
➢ SPADE decoder as the generator G, with a PixelShuffle final layer for improved efficiency (sketched below)
➢ Each channel of the feature volume f_s serves as a semantic map for generation
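As a minimal sketch of the kind of PixelShuffle final layer mentioned above (channel counts and the output activation are assumptions, not the actual SPADE decoder configuration):

import torch.nn as nn

# Sub-pixel upsampling head: predict 3 x r^2 channels, then rearrange them
# into an RGB image upsampled by r with PixelShuffle
r = 2
final_head = nn.Sequential(
    nn.Conv2d(64, 3 * r * r, kernel_size=3, padding=1),  # 64 input channels assumed
    nn.PixelShuffle(upscale_factor=r),
    nn.Sigmoid(),  # assumed [0, 1] output range
)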

I. Base Model Training: Scalable Motion Transformation
➢ Introduce a scale factor into the motion transformation expression, reducing the training difficulty (the transformation is sketched below)
➢ x_s, x_d are implicit 3D keypoints; x_{c,s} ∈ R^{K×3} represents the canonical keypoints of the source image; δ_s, δ_d ∈ R^{K×3} are expression deformations; t_s, t_d ∈ R^{K×3} are the translations
➢ Scaling the orthographic projection leads to overly flexible learned expressions δ → trade-off b/w flexibility & drivability
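A minimal sketch of the scaled motion transformation described above, assuming the common form x = s · (x_c R + δ) + t (variable names are illustrative and not taken from the official code):

import numpy as np

def transform_keypoints(x_c, R, delta, t, s):
    # x_c: (K, 3) canonical keypoints, R: (3, 3) rotation, delta: (K, 3)
    # expression deformation, t: translation, s: scalar scale factor
    return s * (x_c @ R + delta) + t

# Source and driving keypoints both start from the source canonical keypoints:
# x_s = transform_keypoints(x_c_s, R_s, delta_s, t_s, s_s)
# x_d = transform_keypoints(x_c_s, R_d, delta_d, t_d, s_d)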

I. Base Model Training: Landmark-guided implicit keypoints optimization
➢ Introduce 2D landmarks that capture micro-expressions to tackle the difficulty of learning subtle facial expressions like eye movements
➢ Use the estimated landmarks as guidance to optimize the learning of implicit keypoints (a sketch of such a guidance loss follows)
▪ N = 10 total landmarks taken from the eyes and lip region
▪ l_i is the i-th landmark
▪ x_{s,i,:2} and x_{d,i,:2} represent the first 2 dimensions of the corresponding implicit keypoints, respectively
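A minimal sketch of a landmark-guided loss of the kind described above, assuming a plain L1 distance between each 2D landmark and the first two dimensions of its paired implicit keypoint (the exact loss used in the paper may differ):

import numpy as np

def landmark_guidance_loss(x, landmarks):
    # x: (N, 3) implicit keypoints paired with the N = 10 eye/lip landmarks
    # landmarks: (N, 2) estimated 2D landmarks l_i used as guidance
    return np.abs(x[:, :2] - landmarks).mean()

# Applied to both the source and driving keypoints, e.g.
# loss = landmark_guidance_loss(x_s, l_s) + landmark_guidance_loss(x_d, l_d)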

I. Base Model Training: Cascaded Loss Terms
➢ Apply perceptual and GAN losses on the global region of the input image, and on local regions of the face and lip, in a cascaded fashion (see the sketch below)
➢ Face and lip regions are defined by 2D semantic landmarks
➢ Adopt a face-id loss to preserve the identity of the source image, based on ArcFace
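A minimal sketch of how a cascaded perceptual loss over global, face, and lip regions could be assembled; the crop boxes from 2D landmarks and the perceptual_loss callable (e.g. VGG-based) are assumptions, not the official implementation:

def cascaded_perceptual_loss(pred, target, face_box, lip_box, perceptual_loss):
    # pred, target: (C, H, W) images; *_box: (top, bottom, left, right) crops
    # derived from the 2D semantic landmarks
    loss = perceptual_loss(pred, target)                    # global region
    for (t, b, l, r) in (face_box, lip_box):                # cascaded local regions
        loss = loss + perceptual_loss(pred[:, t:b, l:r], target[:, t:b, l:r])
    return loss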

II. Stitching and Retargeting
➢ Stitching module pastes the animated portrait back into the original image space without pixel misalignment
➢ Retargeting module addresses the issue of incomplete eye closure and lip movement during cross-id reenactment

II. Stitching and Retargeting: Stitching module
➢ An MLP estimates a deformation offset Δ_st ∈ R^{K×3} of the driving keypoints, based on x_s and x_d
➢ The driving keypoints are updated with this offset, and the prediction image is regenerated from the updated keypoints
➢ The stitching objective combines a reconstruction term with a regularization term:
▪ M_st masks out the non-shoulder region
▪ ‖·‖_1 is the L1-norm regularization
▪ MLP layer sizes: [126, 128, 128, 64, 65] (sketched below)
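A minimal sketch of a stitching MLP with the layer sizes listed above; the assumption that the 126-dim input is the concatenation of the flattened x_s and x_d (2 × K×3 with K = 21), and the choice of ReLU activations, are not confirmed by the slide:

import torch
import torch.nn as nn

def make_mlp(dims):
    # plain fully connected stack: Linear layers with ReLU between them
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

stitching_mlp = make_mlp([126, 128, 128, 64, 65])
x_s, x_d = torch.zeros(21, 3), torch.zeros(21, 3)                         # K = 21 assumed
delta_st = stitching_mlp(torch.cat([x_s.reshape(-1), x_d.reshape(-1)]))   # 65-dim output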

II. Stitching and Retargeting: Eyes retargeting module
➢ An MLP estimates a deformation offset Δ_eyes ∈ R^{K×3} of the driving keypoints, based on x_s, the source eyes-open condition tuple c_{s,eyes}, and a random driving eyes-open scalar c_{d,eyes} ∈ [0, 0.8]
➢ The driving keypoints are updated with this offset, and the prediction image is regenerated from the updated keypoints
➢ The eyes retargeting objective combines a reconstruction term with a regularization term:
▪ M_eyes masks out the eye region
▪ ‖·‖_1 is the L1-norm regularization
▪ MLP layer sizes: [66, 256, 256, 128, 128, 64, 64]

II. Stitching and Retargeting: Lip retargeting module
➢ An MLP estimates a deformation offset Δ_lip ∈ R^{K×3} of the driving keypoints, based on x_s, the source lip-open condition tuple c_{s,lip}, and a random driving lip-open scalar c_{d,lip}
➢ The driving keypoints are updated with this offset, and the prediction image is regenerated from the updated keypoints
➢ The lip retargeting objective combines a reconstruction term with a regularization term:
▪ M_lip masks out the lip region
▪ ‖·‖_1 is the L1-norm regularization
▪ MLP layer sizes: [65, 128, 128, 64, 63] (input packing for both retargeting MLPs is sketched below)
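A minimal sketch of how the retargeting MLP inputs could be packed, assuming K = 21 keypoints (63 dims when flattened) and that the first layer sizes above (66 for eyes, 65 for lip) come from appending the source condition values and the driving scalar; the exact packing order and condition sizes are assumptions:

import torch

K = 21  # assumed number of implicit keypoints

def retarget_input(x_s, c_s, c_d):
    # x_s: (K, 3) source keypoints; c_s: source open-condition values (tuple);
    # c_d: scalar driving open-condition
    return torch.cat([x_s.reshape(-1),
                      torch.as_tensor(c_s, dtype=x_s.dtype),
                      torch.tensor([c_d], dtype=x_s.dtype)])

eyes_in = retarget_input(torch.zeros(K, 3), (0.3, 0.3), 0.5)  # 63 + 2 + 1 = 66 dims
lip_in = retarget_input(torch.zeros(K, 3), (0.2,), 0.4)       # 63 + 1 + 1 = 65 dims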

Other loss functions
I. Implicit keypoints equivariance loss
▪ To ensure the consistency of image-specific keypoints
II. Keypoint prior loss
▪ To encourage the estimated image-specific keypoints to spread out across the face region
III. Head pose loss
▪ To penalize the prediction error of the head rotation compared to the ground truth
IV. Deformation prior loss
▪ To penalize the magnitude of the deformations (a minimal sketch follows)
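As a minimal illustration of the simplest of these terms, a deformation prior loss can simply penalize the magnitude of the expression deformations; the norm and weighting used in the paper are not shown in the slide:

import torch

def deformation_prior_loss(delta):
    # delta: (K, 3) expression deformation; L1 magnitude penalty (assumed form)
    return delta.abs().mean()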

Inference Phase
I. Extract the source feature volume and the canonical keypoints
II. Extract motions from each frame of the driving video and the conditions c_{d,eyes,i} and c_{d,lip,i}
III. Transform the source and driving implicit keypoints
IV. Generate the final prediction image I_{p,i} by the warping network W and the decoder D (the full loop is sketched below)
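A minimal sketch of the inference loop above; the helper callables F (appearance extractor), M (motion extractor), W (warping network), and D (decoder) are hypothetical stand-ins for the real modules, and transform_keypoints is the sketch from the base-model section:

def animate(source_img, driving_frames, F, M, W, D):
    f_s, x_c_s = F(source_img)                        # I. feature volume + canonical keypoints
    x_s = transform_keypoints(x_c_s, *M(source_img))  # III. source implicit keypoints
    outputs = []
    for frame in driving_frames:                      # II. per-frame driving motion/conditions
        x_d = transform_keypoints(x_c_s, *M(frame))   # III. driving implicit keypoints
        # (eyes/lip retargeting offsets from c_{d,eyes,i}, c_{d,lip,i} omitted for brevity)
        outputs.append(D(W(f_s, x_s, x_d)))           # IV. warp and decode I_p,i
    return outputs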

Dataset used for evaluation
➢ Self-Reenactment Evaluation
▪ 35 videos from TalkingHead-1KH (~38,400 frames)
▪ 50 videos from VFHQ (~15,000 frames)
➢ Cross-Reenactment Evaluation
▪ First 50 images obtained from FFHQ as source
▪ Extract 1 frame per 10 frames for the driving sequence (~24 frames / sequence)

Metrics for quantitative evaluation (1/2)



Metrics for quantitative evaluation (2/2)



Evaluated against state-of-the-art methods

Qualitative Comparison: Self-Reenactment
✓ Faithfully transfers motion, including lip movements and eye gaze
✓ Preserves appearance details
✓ Stable animation results with large poses

Qualitative Comparison: Cross-Reenactment
✓ Stitching enables stable animation even when the reference image is relatively small

Qualitative Comparison: Temporal Consistency
✓ LivePortrait maintains superior temporal consistency in animations, avoiding unnatural background and foreground movements often seen in diffusion-based methods like FADM, AniPortrait, and MegActor

Quantitative Comparison: Self-Reenactment
✓ Slightly outperforms previous diffusion-based methods in generation quality
✓ Demonstrates better eye-motion accuracy

Quantitative Comparison: Cross-Reenactment
✓ Outperforms previous diffusion-based and non-diffusion-based methods in both generation quality and motion accuracy
✓ The FID of AniPortrait is lower, but it requires much more inference time due to multiple denoising steps

Video Comparison: Self-Reenactment

Video Comparison: Cross-Reenactment

Video Results: Portrait animation from a still image with stitching

Video Results: Portrait video editing with stitching

Video Results: Eyes Retargeting

Video Results: Lip Retargeting

Video Results: Generalization to animals

Ablation Study: Stitching Module
➢ Reenactment results of the network trained with and without the stitching module
✓ Forced alignment of the shoulder region with the cropped source portrait while preserving the motion & appearance
✓ No visually apparent misalignment

Ablation Study: Eyes Retargeting
➢ Reenactment results of the network trained with and without the eyes retargeting module
✓ Animated frames achieve the same eye-closing motion as the driving frames

Ablation Study: Lip Retargeting
➢ Reenactment results of the network trained with and without the lip retargeting module
✓ Animated frames achieve the same lip motion as the driving frames