StableDub: Taming Diffusion Prior for Efficient and Generalized Visual Dubbing

Anonymous Authors

Abstract

Visual dubbing aims to generate mouth movements synchronized with driving audio and has seen significant progress in recent years. However, previous methods struggle to achieve generalized results with consistent visual quality, occlusion robustness, and lip-habit resemblance, which limits their applicability to diverse scenarios. Enlarging the diffusion-based model is a straightforward way to enhance generalization, but it leads to greater complexity. In this paper, we propose StableDub, which exploits the diffusion prior for efficient and generalized visual dubbing. We employ direct image concatenation for appearance injection, which substantially reduces memory cost, and propose a hybrid Mamba-Attention denoising network with a streamlined training scheme. To reproduce the lip habits of the target avatar, we propose a lip-habit modulation module that incorporates these habits into the generation process. Two compact facial masks are also derived to improve generalization and optimization efficiency. Extensive experiments demonstrate that StableDub outperforms state-of-the-art methods across multiple aspects.

Demo videos may take up to one minute to load. Please wait...
Please expand the videos to full screen and unmute for a better viewing experience.

Self Reenactment

Cross-Audio Dubbing

Occlusion Robustness

Out-of-Domain Face

Left: original; Right: dubbed

The Last of Us — a 3D action-adventure game

Frozen — an American animated fantasy film

Detective Conan — a Japanese detective anime series

PS: The mouth movements of a character in a video may not be driven by that character's own voice; such cases fall under the cross-audio dubbing setting.
In real movie translation or dubbing, characters and their voices would be matched manually; that process is beyond the scope of this work.

Comparison on Donald Trump

Comparison on Theresa May

Method Overview

Given an input video of the target avatar and arbitrary audio, StableDub alters the masked mouth region and generates lip movements that match the audio content. To supplement appearance details during generation, we concatenate an appearance image with the per-frame video latents and feed them into the denoising U-Net. To make the generated lip movements better resemble the target avatar, we extract a habit feature from another lip segment and modulate it with the audio feature; the modulated feature is sent to the audio cross-attention layer in each U-Net block. Spatial and temporal Mamba layers are incorporated to efficiently ensure image quality and video coherence. In the overview figure, the markers 1 and 2 denote that a layer is optimized in training stage 1 or stage 2, respectively.
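
To make the block structure above concrete, below is a minimal PyTorch sketch (not the authors' released code) of one hybrid Mamba-Attention denoising block together with appearance injection by concatenation. All module names (SequenceMixer, HabitModulation, HybridBlock), tensor shapes, the FiLM-style habit modulation, and the channel-wise concatenation layout are illustrative assumptions; the Mamba layers are replaced by a placeholder that a real implementation would swap for an actual selective state-space layer (e.g. mamba_ssm.Mamba).

```python
import torch
import torch.nn as nn


class SequenceMixer(nn.Module):
    """Placeholder standing in for a Mamba (selective state-space) layer on (B, L, C)
    sequences. Here it is only a per-token residual MLP; a real implementation would
    mix information along the sequence dimension, e.g. with mamba_ssm.Mamba."""

    def __init__(self, dim):
        super().__init__()
        self.mix = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim),
                                 nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):                      # x: (B, L, C)
        return x + self.mix(x)


class HabitModulation(nn.Module):
    """Combines a lip-habit embedding with the audio feature (FiLM-style, an assumption)."""

    def __init__(self, audio_dim, habit_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(habit_dim, 2 * audio_dim)

    def forward(self, audio, habit):           # audio: (B, T, C_a), habit: (B, C_h)
        scale, shift = self.to_scale_shift(habit).chunk(2, dim=-1)
        return audio * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class HybridBlock(nn.Module):
    """One denoising block: spatial mixing -> temporal mixing -> audio cross-attention."""

    def __init__(self, dim, audio_dim, heads=8):
        super().__init__()
        self.spatial = SequenceMixer(dim)      # mixes tokens within each frame
        self.temporal = SequenceMixer(dim)     # mixes each token across frames
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, audio):               # x: (B, F, N, C) tokens, audio: (B, T, C_a)
        B, F, N, C = x.shape
        x = self.spatial(x.reshape(B * F, N, C)).reshape(B, F, N, C)
        x = x.permute(0, 2, 1, 3).reshape(B * N, F, C)
        x = self.temporal(x).reshape(B, N, F, C).permute(0, 2, 1, 3)
        q = x.reshape(B, F * N, C)
        kv = self.audio_proj(audio)
        out, _ = self.audio_attn(q, kv, kv)    # queries from video, keys/values from audio
        return (q + out).reshape(B, F, N, C)


# Appearance injection by direct concatenation: repeat the appearance latent per frame
# and stack it with the noisy video latent along the channel axis before the U-Net
# (one plausible reading of "direct image concatenation"; the true layout may differ).
B, F, C, H, W = 1, 8, 4, 32, 32
noisy = torch.randn(B, F, C, H, W)
appearance = torch.randn(B, 1, C, H, W).expand(-1, F, -1, -1, -1)
unet_input = torch.cat([noisy, appearance], dim=2)       # (B, F, 2C, H, W)

# Habit-modulated audio conditioning fed to the cross-attention of each block.
modulate = HabitModulation(audio_dim=384, habit_dim=128)
audio = modulate(torch.randn(B, 20, 384), torch.randn(B, 128))
block = HybridBlock(dim=64, audio_dim=384)
tokens = torch.randn(B, F, H * W, 64)                    # toy block-level features
video_tokens = block(tokens, audio)                      # (B, F, H*W, 64)
```

The intent reflected in this sketch follows the description above: spatial and temporal mixing is delegated to the Mamba-style layers (which scale linearly with sequence length), while attention is reserved for injecting the habit-modulated audio condition.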