Visual dubbing aims to generate mouth movements synchronized with the driving audio, and the task has seen significant progress in recent years. However, previous methods struggle to achieve generalized results with consistent visual quality, occlusion robustness, and lip habit resemblance, which limits their applicability to diverse scenarios. Scaling up the diffusion-based model is a straightforward way to enhance generalization, but it brings greater complexity. In this paper, we propose StableDub, which leverages the diffusion prior for efficient and generalized visual dubbing. We employ direct image concatenation for appearance injection, achieving substantial memory savings, and propose a hybrid Mamba-Attention denoising network with streamlined training. To reproduce the lip habits of the target avatar, we propose a lip habit modulation module that incorporates these habits into the generation process. Two compact facial masks are further derived to enhance generalization and optimization efficiency. Extensive experiments demonstrate that StableDub outperforms state-of-the-art methods in various aspects.
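To give a rough sense of the appearance-injection idea, the sketch below concatenates a reference appearance latent with each per-frame noisy video latent along the channel axis before denoising. It is a minimal sketch under our own assumptions: the tensor names, shapes, and channel layout are illustrative and not the exact implementation.

```python
import torch

# Minimal sketch of appearance injection by direct latent concatenation,
# assuming a VAE-encoded appearance frame and per-frame noisy video latents.
# All names and shapes are illustrative assumptions, not the paper's layout.

def inject_appearance(noisy_latents, appearance_latent):
    """
    noisy_latents:     (B, F, C, H, W)  per-frame noisy video latents
    appearance_latent: (B, C, H, W)     latent of one reference appearance frame
    returns:           (B, F, 2C, H, W) channel-wise concatenated denoiser input
    """
    B, F, C, H, W = noisy_latents.shape
    # Broadcast the single appearance latent to every frame.
    app = appearance_latent.unsqueeze(1).expand(B, F, C, H, W)
    # Concatenate along the channel axis so the denoiser sees both the
    # noisy frame and the clean appearance reference at every position.
    return torch.cat([noisy_latents, app], dim=2)

if __name__ == "__main__":
    noisy = torch.randn(1, 8, 4, 32, 32)
    appearance = torch.randn(1, 4, 32, 32)
    print(inject_appearance(noisy, appearance).shape)  # torch.Size([1, 8, 8, 32, 32])
```

Compared with a separate reference network, such direct concatenation avoids duplicating the backbone, which is one way to obtain the memory savings mentioned above.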
Left: original; Right: dubbed
The Last of Us — a 3D action-adventure game
Frozen — an American animated fantasy film
Detective Conan — a Japanese detective anime series
PS: The mouth movements of a character in a video may not be driven by that character's own voice; this falls under the cross-audio dubbing setting.
In real movie translation or dubbing, characters and their voices should be matched manually; this process is beyond the scope of this work.
Given the input video of the target avatar and arbitrary audio, StableDub alters the masked mouth region and generates lip movements that match the audio content. To supply appearance details during generation, we concatenate an appearance image with each per-frame video latent and feed them into the denoising U-Net. To make the generated lip movements better resemble those of the target avatar, we extract a habit feature from another lip segment and modulate it with the audio feature; the modulated feature is sent to the audio cross-attention layer in each U-Net block. Spatial and temporal Mamba layers are incorporated to efficiently ensure image quality and video coherence. The markers 1 and 2 denote that a layer is optimized in training stage 1 or stage 2, respectively.
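The sketch below shows one plausible form of the lip habit modulation path described above: a pooled habit embedding applies a FiLM-style scale and shift to the audio feature, which then serves as key/value in an audio cross-attention layer. The FiLM-style operator, module names, and dimensions are assumptions for illustration; the exact design in StableDub may differ.

```python
import torch
import torch.nn as nn

# Sketch of lip habit modulation feeding an audio cross-attention layer.
# The FiLM-style modulation, dimensions, and module names are assumptions.

class LipHabitModulation(nn.Module):
    def __init__(self, audio_dim=768, habit_dim=512):
        super().__init__()
        # Predict per-channel scale and shift from the habit feature.
        self.to_scale_shift = nn.Linear(habit_dim, audio_dim * 2)

    def forward(self, audio_feat, habit_feat):
        # audio_feat: (B, T, audio_dim)  audio tokens for the clip
        # habit_feat: (B, habit_dim)     pooled embedding of another lip segment
        scale, shift = self.to_scale_shift(habit_feat).chunk(2, dim=-1)
        return audio_feat * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class AudioCrossAttention(nn.Module):
    def __init__(self, query_dim=320, audio_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(query_dim, heads,
                                          kdim=audio_dim, vdim=audio_dim,
                                          batch_first=True)

    def forward(self, visual_tokens, modulated_audio):
        # visual_tokens: (B, N, query_dim) spatial tokens from a U-Net block
        out, _ = self.attn(visual_tokens, modulated_audio, modulated_audio)
        return visual_tokens + out  # residual connection

if __name__ == "__main__":
    mod = LipHabitModulation()
    xattn = AudioCrossAttention()
    audio = torch.randn(2, 50, 768)     # illustrative audio features
    habit = torch.randn(2, 512)         # habit embedding from another lip segment
    visual = torch.randn(2, 1024, 320)  # flattened 32x32 feature map
    out = xattn(visual, mod(audio, habit))
    print(out.shape)  # torch.Size([2, 1024, 320])
```

In this reading, the habit embedding biases the audio conditioning before it reaches cross-attention, so the same speech can drive slightly different mouth shapes for different avatars.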