Visual dubbing aims to generate mouth movements synchronized with the driving audio and has seen significant progress in recent years.
However, two critical deficiencies hinder the wide application of existing methods: (1) audio-only driving paradigms inadequately
capture speaker-specific lip habits, and thus fail to generate lip movements that resemble those of the target avatar; (2) conventional blind-inpainting
approaches frequently produce visual artifacts when handling obstructions (e.g., microphones, hands), limiting practical deployment.
In this paper, we propose StableDub, a novel and concise framework integrating lip-habit-aware modeling with occlusion-robust synthesis.
Specifically, building upon the Stable-Diffusion backbone, we develop a lip-habit-modulated mechanism that jointly models phonemic audio-visual
synchronization and speaker-specific orofacial dynamics. To achieve plausible lip geometries and object appearances under occlusion, we
introduce an occlusion-aware training strategy that explicitly exposes occluding objects to the inpainting process.
By incorporating these designs, the model eliminates the need for the cost-intensive priors required by previous methods, thereby exhibiting
superior training efficiency on the computationally intensive diffusion backbone. To further improve training efficiency from the
perspective of model architecture, we introduce a hybrid Mamba-Transformer architecture, which enhances applicability in
low-resource research scenarios. Extensive experimental results demonstrate that StableDub achieves superior performance in lip-habit
resemblance and occlusion robustness, and also surpasses other methods in audio-lip synchronization, video quality, and resolution consistency,
broadening the applicability of visual dubbing methods across multiple aspects.
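To make the occlusion-aware training strategy concrete, the minimal sketch below shows one plausible way to build a training pair in which a segmented occluding object (e.g., a hand or microphone crop with an alpha channel) is composited onto the frame before masking, so the occluder remains visible to the inpainting model rather than being blindly erased. The function name, tensor layouts, and alpha-compositing scheme are illustrative assumptions, not the released training code.

```python
import torch

def build_occlusion_aware_input(frame, mouth_mask, occluder_rgba):
    """Hypothetical helper: composite a segmented occluder onto the frame, then
    mask the mouth region, so the occluder is exposed to the inpainting model."""
    # frame:         (3, H, W) RGB frame in [0, 1]
    # mouth_mask:    (1, H, W) binary mask, 1 inside the region to inpaint
    # occluder_rgba: (4, H, W) occluder crop (RGB + alpha), already placed and scaled
    rgb, alpha = occluder_rgba[:3], occluder_rgba[3:4]
    # Alpha-blend the occluder over the original frame.
    occluded_frame = alpha * rgb + (1.0 - alpha) * frame
    # Erase the mouth region; occluder pixels outside the mask stay visible, so the
    # model must reconstruct plausible lips while keeping the occluder consistent.
    masked_input = occluded_frame * (1.0 - mouth_mask)
    # The composited frame serves as the reconstruction target.
    return masked_input, occluded_frame

# Toy example with assumed 512x512 frames.
frame = torch.rand(3, 512, 512)
mouth_mask = torch.zeros(1, 512, 512)
mouth_mask[:, 300:450, 150:360] = 1.0      # rough mouth region
occluder = torch.rand(4, 512, 512)         # RGB + alpha of a segmented occluder
masked_input, target = build_occlusion_aware_input(frame, mouth_mask, occluder)
```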
Demo videos may take up to 1 minute to load. Please wait...
Please expand the videos to full screen and unmute for better viewing.
Left: original; Right: dubbed
The Last of Us — a 3D action-adventure game
Frozen — an American animated fantasy film
Detective Conan — a Japanese detective anime series
PS: The mouth movements of a character in the video may not be driven by their own voice; such cases fall under the cross-audio dubbing setting.
In real movie translation or dubbing, characters and their voices would be matched manually; this process is beyond the scope of this work.
Given an input video of the target avatar and arbitrary driving audio, StableDub alters the masked mouth region and generates lip movements that match the audio content. To supply appearance details during generation, we concatenate an appearance image with the per-frame video latent and feed them into the denoising U-Net. To enhance the resemblance of the generated lip movements to the target avatar, we extract a habit feature from a separate lip segment and modulate it with the audio feature; the modulated feature is sent to the audio cross-attention layer in each U-Net block. Spatial and temporal Mamba layers are incorporated to efficiently ensure image quality and video coherence. The labels 1 and 2 denote that a layer is optimized during training stage 1 or stage 2, respectively.
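As an illustration of the habit modulation and audio cross-attention described above, the sketch below shows one plausible realization: a habit embedding extracted from a reference lip segment predicts a per-channel scale and shift that modulates the audio features, which then serve as keys and values in the cross-attention of a U-Net block. The feature dimensions, the scale-and-shift modulation form, and the module name are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class HabitModulatedAudioAttention(nn.Module):
    """Hypothetical sketch of habit-modulated audio cross-attention in one U-Net block."""

    def __init__(self, latent_dim=320, audio_dim=768, habit_dim=512, heads=8):
        super().__init__()
        # Predict a per-channel scale and shift for the audio feature from the habit code.
        self.to_scale_shift = nn.Linear(habit_dim, 2 * audio_dim)
        self.attn = nn.MultiheadAttention(latent_dim, heads,
                                          kdim=audio_dim, vdim=audio_dim,
                                          batch_first=True)

    def forward(self, latent_tokens, audio_tokens, habit_code):
        # latent_tokens: (B, N, latent_dim) spatial tokens of the per-frame video latent
        # audio_tokens:  (B, T, audio_dim)  audio features for the current frame window
        # habit_code:    (B, habit_dim)     lip-habit embedding from a reference lip segment
        scale, shift = self.to_scale_shift(habit_code).chunk(2, dim=-1)
        modulated_audio = audio_tokens * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        # Cross-attention: video latent tokens attend to habit-modulated audio tokens.
        out, _ = self.attn(latent_tokens, modulated_audio, modulated_audio)
        return latent_tokens + out
```

Under the same reading, the appearance conditioning amounts to concatenating the appearance-image latent with the noisy per-frame video latent along the channel axis before the first U-Net convolution.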