High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model


1National Key Laboratory of Fundamental Science on Synthetic Vision
2School of Cyber Science and Engineering
3College of Computer Science
Sichuan University, Chengdu, China
CVPR 2025
Introduction

Qualitative results of our method. The target lighting is applied to the meshes of the driving frames to generate shading hints. Using the shading hints, our relightable portrait animation framework animates and relights the reference frame, e.g., the results within the solid boxes show lighting consistent with the target lighting and poses consistent with the driving frames.
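The shading hints mentioned above are produced by evaluating the target lighting, expressed as second-order spherical harmonics (SH) coefficients, over the normals of the driving-frame mesh. The snippet below is a minimal sketch of that step using the standard 9-term SH irradiance formula of Ramamoorthi and Hanrahan; the function name, array shapes, and single-channel simplification are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def sh_shading(normals, sh_coeffs):
    """Diffuse shading from 2nd-order spherical harmonics (illustrative sketch).

    normals:   (H, W, 3) unit normal map rendered from the face mesh.
    sh_coeffs: (9,) SH lighting coefficients for one color channel.
    """
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    # SH basis evaluated at the normals, with the per-band attenuation
    # factors A0 = pi, A1 = 2*pi/3, A2 = pi/4 folded into the constants.
    basis = np.stack([
        np.pi * 0.282095 * np.ones_like(x),        # Y_00
        (2 * np.pi / 3) * 0.488603 * y,            # Y_1-1
        (2 * np.pi / 3) * 0.488603 * z,            # Y_10
        (2 * np.pi / 3) * 0.488603 * x,            # Y_11
        (np.pi / 4) * 1.092548 * x * y,            # Y_2-2
        (np.pi / 4) * 1.092548 * y * z,            # Y_2-1
        (np.pi / 4) * 0.315392 * (3 * z**2 - 1),   # Y_20
        (np.pi / 4) * 1.092548 * x * z,            # Y_21
        (np.pi / 4) * 0.546274 * (x**2 - y**2),    # Y_22
    ], axis=-1)                                    # (H, W, 9)
    return basis @ sh_coeffs                       # (H, W) shading map
```

An ambient-only lighting vector (only the first coefficient nonzero) yields a constant shading map, which is a quick sanity check that the basis is wired correctly.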

Abstract

Relightable portrait animation aims to animate a static reference portrait to match the head movements and expressions of a driving video while adapting to user-specified or reference lighting conditions. Existing portrait animation methods cannot produce relightable results because they do not separate and manipulate intrinsic (identity and appearance) and extrinsic (pose and lighting) features. In this paper, we present a Lighting Controllable Video Diffusion model (LCVD) for high-fidelity, relightable portrait animation. We address this limitation by distinguishing these two feature types through dedicated subspaces within the feature space of a pre-trained image-to-video diffusion model. Specifically, we employ shading hints, rendered from the portrait's 3D mesh, pose, and lighting, to represent the extrinsic attributes, while the reference image represents the intrinsic attributes. In the training phase, we employ a reference adapter to map the reference into the intrinsic feature subspace and a shading adapter to map the shading hints into the extrinsic feature subspace. By merging features from these subspaces, the model achieves nuanced control over lighting, pose, and expression in generated animations. Extensive evaluations show that LCVD outperforms state-of-the-art methods in lighting realism, image quality, and video consistency, setting a new benchmark in relightable portrait animation.
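The two-adapter design described above can be sketched as follows. Here the learned reference and shading adapters are stood in for by hypothetical linear maps, and the random per-condition selection during training is modeled by independent keep probabilities; all names, dimensions, and probabilities are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_FEAT = 64, 128  # hypothetical input / diffusion feature dimensions

# Hypothetical linear stand-ins for the learned reference and shading adapters.
W_ref = rng.standard_normal((D_IN, D_FEAT)) * 0.02
W_shade = rng.standard_normal((D_IN, D_FEAT)) * 0.02

def fuse(ref_feat, shade_feat, p_ref=0.5, p_shade=0.5, rng=rng):
    """Map inputs into the intrinsic/extrinsic subspaces and fuse them.

    During training, each condition is kept with its own probability,
    mimicking the random selection of conditions used to train
    classifier-free guidance.
    """
    f_ref = ref_feat @ W_ref        # intrinsic-subspace feature
    f_shade = shade_feat @ W_shade  # extrinsic-subspace feature
    guidance = np.zeros(D_FEAT)
    if rng.random() < p_ref:
        guidance = guidance + f_ref
    if rng.random() < p_shade:
        guidance = guidance + f_shade
    return guidance
```

With both keep probabilities set to 1 the fused guidance is simply the sum of the two subspace features; dropping a condition at random is what later lets inference trade off the two guidance directions independently.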

Overview

Overview of our pipeline for lighting controllable portrait animation. It consists of two main stages: (1) Portrait Attributes Subspace Modeling Stage: We use DECA to encode video frames and extract lighting, pose, and shape parameters, which are rendered as shading hints. After the shading hints and the reference image are processed by the shading adapter and the reference adapter, the resulting features are randomly selected and fused into a guidance signal for the Stable Video Diffusion model, which generates denoised video frames with consistent lighting, pose, identity, and appearance. (2) Relighting and Animation Stage: We render the shading hints using the pose of the portrait from the driving video, the shape from the reference image, and the spherical harmonics coefficients of the target lighting. After processing the shading hints and reference image through the two adapters, we employ multi-condition classifier-free guidance to adjust the magnitude of the extrinsic feature guidance direction, enabling the generation of lighting-controllable portrait animations.
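The multi-condition classifier-free guidance used in the second stage can be sketched with a common two-condition decomposition: guide from the unconditional prediction toward the reference-conditioned one, then from there toward the fully conditioned one. The exact formulation and guidance scales are in the paper; the function below is an assumed, illustrative version.

```python
import numpy as np

def multi_cond_cfg(eps_uncond, eps_ref, eps_full, s_ref=2.0, s_shade=2.5):
    """Multi-condition classifier-free guidance (illustrative sketch).

    eps_uncond: denoiser output with no conditioning
    eps_ref:    output conditioned on the reference image only
    eps_full:   output conditioned on reference image + shading hints

    s_ref and s_shade are hypothetical guidance scales; s_shade scales the
    extrinsic (shading) guidance direction, controlling how strongly the
    target lighting is imposed on the animation.
    """
    return (eps_uncond
            + s_ref * (eps_ref - eps_uncond)
            + s_shade * (eps_full - eps_ref))
```

Setting both scales to 1 recovers the fully conditioned prediction, while raising `s_shade` pushes the sample further along the lighting direction without changing identity guidance.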

Relightable Portrait Animation Demo

BibTeX

@misc{guo2025highfidelityrelightablemonocularportrait,
      title={High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model}, 
      author={Mingtao Guo and Guanyu Xing and Yanli Liu},
      year={2025},
      eprint={2502.19894},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.19894}, 
}