These slides were used in the following video on the YouTube nnabla channel:
【AI Paper】Perform efficient portrait animation with LivePortrait!
https://youtu.be/xB71xA_bPoQ?si=JkXTpAHbh4Ydgszr
The video explains the following paper:
LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control
https://arxiv.org/abs/2407.03168
The images used in these slides are taken from the paper or created with reference to it.
Slide Content
Paper Explanation
Authors propose a portrait animation framework with a focus on better generalization, controllability, and efficiency
Summary
➢Theme: Portrait Animation; synthesize a lifelike video from a single source image
➢Key Contributions
✓Developing a solid implicit keypoint-based framework with enhanced generation quality and generalization ability
✓Designing advanced stitching and retargeting modules for better controllability, with minimal computational overhead
Theme: Portrait Animation
➢Task of animating a static portrait image with motion derived from a driving video, audio, text, or generation
➢Targets key regions: face, hair, neck, and shoulder
Related works fall into 2 categories (1/2)
➢Non-diffusion-based Portrait Animation
▪Rely on implicit keypoints, predefined motion, or latent expression representations to animate portraits using optical flow and local transformations
▪However, such methods struggle with handling large-scale and complex motions and offer limited flexibility
Non-diffusion-based Portrait Animation: Face Vid2Vid
➢Encodes the target motion using a novel keypoint representation for estimating flow fields to warp the source appearance features
➢Employs a generator to convert the warped feature volume to the output image
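A minimal sketch of this warp-then-generate pipeline; the component names and signatures below are illustrative assumptions, not the Face Vid2Vid implementation:

def face_vid2vid_step(appearance_encoder, motion_extractor, warping_module, generator,
                      source_image, driving_frame):
    """Sketch of keypoint-driven warping followed by generation (illustrative only)."""
    f_s = appearance_encoder(source_image)    # 3D appearance feature volume from the source
    x_s = motion_extractor(source_image)      # implicit keypoints of the source
    x_d = motion_extractor(driving_frame)     # implicit keypoints of the driving frame
    # the source/driving keypoint pairs define flow fields that warp the source features
    warped = warping_module(f_s, x_s, x_d)
    # the generator converts the warped feature volume into the output image
    return generator(warped)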
Related works fall into 2 categories (2/2)
➢Diffusion-based Portrait Animation
▪Leverage LDMs to generate high-quality animations by iteratively refining from Gaussian noise
▪Computationally intensive and struggle with preserving identity / fine details in highly dynamic scenarios
Diffusion-based Portrait Animation: MegActor
➢Leverages existing face-swapping and stylization frameworks to get the cross-identity training pairs
➢ReferenceNet extracts identity and background information and injects the information into the Denoising UNet through cross-attention
Summary of the video-driven portrait animation methods
✓LivePortrait stands out by offering controllability features such as stitching, eyes and lips retargeting, making it a well-rounded and efficient solution for portrait animation
Overall Architecture
➢LivePortrait network pipeline consists of 2 main stages:
I. BASE MODEL
II. STITCHING & RETARGETING
I. Base Model Training
➢Face Vid2vid as the base model with significant enhancements:
✓Upgraded network architecture
✓Scalable motion transformation
✓Landmark-guided implicit keypoints optimization
✓Cascaded loss terms
I. Base Model Training: Upgraded network architecture
➢Unification of the keypoint detector, head pose estimation network, and expression deformation estimation into a single model with a ConvNeXt-V2-Tiny backbone
➢SPADE decoder as the generator G, with a PixelShuffle final layer for improved efficiency
➢Each channel of the feature volume f_s serves as a semantic map for generation
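A minimal PyTorch-style sketch of the PixelShuffle-based final upsampling idea mentioned above; the channel counts and module name are assumptions, not the paper's actual decoder:

import torch.nn as nn

class FinalUpsample(nn.Module):
    """Illustrative final decoder layer: a conv followed by PixelShuffle upsamples cheaply."""
    def __init__(self, in_channels=64, out_channels=3, scale=2):
        super().__init__()
        # produce out_channels * scale^2 channels so PixelShuffle can rearrange them spatially
        self.conv = nn.Conv2d(in_channels, out_channels * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.conv(x))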
I. Base Model Training: Scalable Motion Transformation
➢Introduce a scale factor into the motion transformation expression, reducing the training difficulty:
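A reconstruction of the transformation (shown as an equation image on the original slide), following the form given in the paper; R_s and R_d below denote the head rotation matrices:

x_s = s_s · (x_{c,s} R_s + δ_s) + t_s
x_d = s_d · (x_{c,s} R_d + δ_d) + t_d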
➢x_s, x_d are implicit 3D keypoints; x_{c,s} ∈ ℝ^{K×3} represents the canonical keypoints of the source image; δ_s, δ_d ∈ ℝ^{K×3} are expression deformations; t_s, t_d ∈ ℝ^{K×3} are the translations
➢Scaling the orthographic projection leads to overly flexible learned expressions δ → a trade-off between flexibility and drivability
I. Base Model Training: Landmark-guided implicit keypoints optimization
➢Introduce 2D landmarks that capture micro-expressions to tackle the difficulty of learning subtle facial expressions like eye movements
➢Use the estimated landmarks as guidance to optimize the learning of implicit keypoints (see the loss sketch below)
▪N = 10 total landmarks taken from the eye and lip regions
▪l_i is the i-th landmark
▪x_{s,i,:2} and x_{d,i,:2} represent the first 2 dimensions of the corresponding implicit keypoints, respectively
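A plausible form of the landmark guidance term, written from the definitions above; splitting it into a source term and a driving term (with landmarks l_{s,i}, l_{d,i}) is an assumption, and the exact weighting in the paper may differ:

L_guide = (1/N) · Σ_{i=1..N} ( ||x_{s,i,:2} − l_{s,i}||_1 + ||x_{d,i,:2} − l_{d,i}||_1 )

That is, the first two dimensions of each guided implicit keypoint are pulled toward the corresponding 2D landmark.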
I. Base Model Training: Cascaded Loss Terms
➢Apply perceptual and GAN losses on the global region of the input image, and on the local face and lip regions, in a cascaded fashion (a sketch follows below)
➢Face and lip regions are defined by 2D semantic landmarks
➢Adopt a face-id loss based on ArcFace to preserve the identity of the source image
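A minimal sketch of the cascaded idea, assuming landmark-derived crop boxes and generic perceptual/GAN loss callables; all names here are illustrative, not the paper's code:

def cascaded_loss(pred, target, face_box, lip_box, perceptual_loss, gan_loss):
    """Apply the same losses globally and on landmark-defined face/lip crops (illustrative)."""
    def crop(img, box):
        x0, y0, x1, y1 = box
        return img[..., y0:y1, x0:x1]
    # global terms over the whole image
    total = perceptual_loss(pred, target) + gan_loss(pred)
    # cascaded local terms over the face and lip regions
    for box in (face_box, lip_box):
        total = total + perceptual_loss(crop(pred, box), crop(target, box)) + gan_loss(crop(pred, box))
    return total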
II. Stitching and Retargeting
➢The stitching module pastes the animated portrait back into the original image space without pixel misalignment
➢The retargeting modules address the issue of incomplete eye closure and lip movement during cross-id reenactment
II. Stitching and Retargeting: Stitching module
➢MLP estimates a deformation offset Δ_st ∈ ℝ^{K×3} of the driving keypoints based on x_s and x_d
➢Driving keypoints are updated with this offset, and the prediction image is regenerated from the updated keypoints (see the code sketch below)
➢Stitching objective is formulated as:
▪M_st masks out the non-shoulder region
▪||·||_1 is the L1-norm regularization
MLP [126, 128, 128, 64, 65]
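A minimal PyTorch-style sketch of the stitching module, using the layer sizes listed above; the ReLU activations, the split of the 65-dim output into a 63-dim keypoint offset plus a 2D shift, and K = 21 are assumptions, not the paper's actual code:

import torch
import torch.nn as nn

class StitchingMLP(nn.Module):
    """Illustrative stitching module: predicts an offset for the driving keypoints."""
    def __init__(self, dims=(126, 128, 128, 64, 65)):
        super().__init__()
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x_s, x_d):
        # x_s, x_d: (B, K, 3) implicit keypoints with K = 21, so the concatenated input has 126 dims
        out = self.net(torch.cat([x_s.flatten(1), x_d.flatten(1)], dim=1))
        # assumed split: 63 dims for the per-keypoint offset, 2 dims for an extra 2D shift
        delta_kp, delta_xy = out[:, :63], out[:, 63:]
        return delta_kp.view(-1, 21, 3), delta_xy

Under these assumptions, the updated driving keypoints would simply be x_d + delta_kp before warping and decoding.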
II. Stitching and Retargeting: Eyes retargeting module
➢MLP estimates a deformation offset Δ_eyes ∈ ℝ^{K×3} of the driving keypoints based on x_s, a source eyes-open condition tuple c_{s,eyes}, and a random driving eyes-open scalar c_{d,eyes} ∈ [0, 0.8]
➢Driving keypoints are updated with this offset, and the prediction image is regenerated from the updated keypoints (see the sketch below)
➢Eyes retargeting objective is formulated as:
▪M_eyes masks out the eye region
▪||·||_1 is the L1-norm regularization
MLP [66, 256, 256, 128, 128, 64, 64]
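A compact sketch of the retargeting update, under the assumption that the predicted offset is simply added to the driving keypoints; S_eyes stands for the retargeting MLP, and W, D, f_s are the warping network, decoder, and source feature volume named elsewhere in the deck. The lip retargeting module on the next slide follows the same pattern with its own conditions, mask, and MLP sizes:

Δ_eyes = S_eyes(x_s, c_{s,eyes}, c_{d,eyes})
x'_d = x_d + Δ_eyes
I_p = D(W(f_s; x_s, x'_d))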
II. Stitching and Retargeting: Lip retargeting module
➢MLP estimates a deformation offset Δ_lip ∈ ℝ^{K×3} of the driving keypoints based on x_s, a source lip-open condition tuple c_{s,lip}, and a random driving lip-open scalar c_{d,lip}
➢Driving keypoints are updated with this offset, and the prediction image is regenerated from the updated keypoints (same pattern as sketched above)
➢Lip retargeting objective is formulated as:
▪M_lip masks out the lip region
▪||·||_1 is the L1-norm regularization
MLP [65, 128, 128, 64, 63]
Other loss functions
I. Implicit keypoints equivariance loss (sketched below)
▪ To ensure the consistency of image-specific keypoints
II. Keypoint prior loss
▪ To encourage the estimated image-specific keypoints to spread out across the face region
III.Head pose loss
▪ To penalize the prediction error of the head rotation compared to the ground truth
IV.Deformation prior loss
▪ To penalize the magnitude of the deformations
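A brief sketch of the equivariance idea behind loss I, assuming the standard formulation used by implicit-keypoint methods (a random known 2D transformation T is applied to the image; the exact form and norm used in the paper are not shown on this slide):

L_eq = || K(T(I)) − T(K(I)) ||_1

where K(·) denotes implicit keypoint extraction and T is the random image transformation: keypoints detected on a transformed image should match the transformed keypoints.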
Inference Phase
I. Extract the source feature volume and the canonical keypoints
II. Extract the motions from each frame of the driving video, together with the conditions c_{d,eyes,i} and c_{d,lip,i}
III. Transform the source and driving implicit keypoints
IV. Generate the final prediction image I_{p,i} with the warping network W and the decoder D (see the sketch below)
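A minimal sketch of this inference loop, assuming callables for the appearance feature extractor F, motion extractor M, stitching module, warping network W, and decoder D; the pose unpacking, the interface of each callable, and all names are illustrative, not the paper's code:

def transform_keypoints(x_c, pose):
    # pose = (scale s, rotation R, expression deformation delta, translation t),
    # following the scaled motion transformation shown earlier
    s, R, delta, t = pose
    return s * (x_c @ R + delta) + t

def animate(F, M, W, D, stitching, source_image, driving_frames):
    """Illustrative inference loop following steps I-IV above."""
    f_s = F(source_image)                       # I. source appearance feature volume
    x_cs, pose_s = M(source_image)              # I. canonical keypoints + source motion
    x_s = transform_keypoints(x_cs, pose_s)
    frames_out = []
    for frame in driving_frames:
        _, pose_d = M(frame)                    # II. driving motion (eyes/lip conditions omitted here)
        x_d = transform_keypoints(x_cs, pose_d) # III. driving keypoints reuse the source canonical keypoints
        x_d = x_d + stitching(x_s, x_d)         # stitching offset (assumed returned directly) keeps the paste seamless
        frames_out.append(D(W(f_s, x_s, x_d)))  # IV. warp the source features and decode the frame
    return frames_out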
Dataset used for evaluation
➢Self-Reenactment Evaluation
▪35 videos from TalkingHead-1KH (~38,400 frames)
▪50 videos from VFHQ (~15,000 frames)
➢Cross-Reenactment Evaluation
▪First 50 images obtained from FFHQ as source
▪Extract 1 frame per 10 frames for the driving sequence (~24 frames / sequence)
Metrics for quantitative evaluation (1/2)
(Metric table; three metrics are higher-is-better ↑ and one is lower-is-better ↓; the metric names appear as images on the original slide.)
Metrics for quantitative evaluation (2/2)
(Metric table continued; all four metrics here are lower-is-better ↓.)
Evaluated against state-of-the-art methods
Qualitative Comparison: Self-Reenactment
✓Faithfully transfers motion, including lip movements and eye gaze
✓Preserves appearance details
✓Stable animation results with large poses
Qualitative Comparison: Cross-Reenactment
✓Stitching enables stable animation even when the reference image is relatively small
Qualitative Comparison: Temporal Consistency
✓LivePortrait maintains superior temporal consistency in animations, avoiding the unnatural background and foreground movements often seen in diffusion-based methods like FADM, AniPortrait and MegActor
Quantitative Comparison: Cross-Reenactment
✓Outperforms previous diffusion-based and non-diffusion-based methods in both generation quality and motion accuracy
✓AniPortrait achieves a lower FID, but requires much more inference time due to its multiple denoising steps
Video Comparison: Self-Reenactment
Video Comparison: Cross-Reenactment
Video Results: Portrait animation from a still image with stitching
Video Results: Portrait video editing with stitching
Video Results: Eyes Retargeting
Video Results: Lip Retargeting
Video Results: Generalization to animals
Ablation Study: Stitching Module
➢Reenactment results of the network trained with and without the stitching module
✓Forced alignment of the shoulder region with the cropped source portrait while preserving the motion & appearance
✓No visually apparent misalignment
Ablation Study: Eyes Retargeting
➢Reenactment results of the network trained with and without the eyes retargeting module
✓Animated frames achieve the same eye-closing motion as the driving video
Ablation Study: Lip Retargeting
➢Reenactment results of the network trained with and without the lip retargeting module
✓Animated frames achieve the same lip motion as the driving video