MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction

Yingshuang Zou1,2* Yikang Ding2*† Chuanrui Zhang1,2 Jiazhe Guo1,2 Bohan Li2,4 Xiaoyang Lyu5 Feiyang Tan3 Xiaojuan Qi5 Haoqian Wang1
1Tsinghua University · 2MEGVII Technology · 3Mach Drive · 4Shanghai Jiao Tong University · 5The University of Hong Kong
* Equal Contribution † Project Leader 
Paper · Code · Pre-trained Models

TL;DR

We present MuDG, a controllable Multi-modal Diffusion model with Gaussian Splatting (GS) for Urban Scene Reconstruction.

Abstract

Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) in autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods suffer substantial performance degradation when viewpoints deviate significantly from the training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, an innovative framework that integrates a multi-modal diffusion model with Gaussian Splatting (GS) for urban scene reconstruction. MuDG leverages aggregated LiDAR point clouds, together with RGB and geometric priors, to condition a multi-modal video diffusion model that synthesizes photorealistic RGB, depth, and semantic outputs at novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization, and provides comprehensive supervision signals for refining 3DGS representations to enhance rendering robustness under extreme viewpoint changes. Experiments on the Waymo Open Dataset demonstrate that MuDG outperforms existing methods in both reconstruction and synthesis quality.
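To make the data flow concrete, the sketch below traces the pipeline described above: sparse LiDAR-derived condition maps go into the diffusion model, whose dense outputs then serve as pseudo-ground-truth for 3DGS refinement. This is a minimal illustration only; `project_lidar`, `multimodal_diffusion`, the image size, and the class count are hypothetical placeholders, not the released code.

```python
# Illustrative data-flow sketch of MuDG; all names here are hypothetical
# placeholders, not the project's actual API.
import numpy as np

H, W, C = 320, 480, 19  # image size and number of semantic classes (assumed)

def project_lidar(points, colors, pose):
    """Stand-in for splatting aggregated LiDAR points into sparse RGB/depth
    condition maps for a target camera pose."""
    return np.zeros((H, W, 3)), np.zeros((H, W, 1))

def multimodal_diffusion(cond_rgb, cond_depth):
    """Stand-in for the multi-modal video diffusion model (MDM): densifies
    the sparse conditions into photorealistic RGB, dense depth, and semantics."""
    return (np.random.rand(H, W, 3),   # RGB
            np.random.rand(H, W, 1),   # depth
            np.random.rand(H, W, C))   # per-class semantic scores

# Feed-forward NVS: synthesize pseudo-ground-truth at novel poses, then use
# it as additional supervision when refining the 3DGS scene representation.
points, colors = np.zeros((1000, 3)), np.zeros((1000, 3))
novel_poses = [np.eye(4) for _ in range(4)]
pseudo_gt = []
for pose in novel_poses:
    cond_rgb, cond_depth = project_lidar(points, colors, pose)
    rgb, depth, sem = multimodal_diffusion(cond_rgb, cond_depth)
    pseudo_gt.append((pose, rgb, depth, sem))  # supervision for 3DGS refinement
```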

Architecture

Overview of MuDG. (a) Training Phase of the Multi-modal Diffusion Model (MDM). (b) Pipeline of the Multi-modal 3DGS Scene.
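As one plausible reading of step (b), the refined 3DGS scene can be supervised by all three synthesized modalities at once. The sketch below combines an L1 photometric term, an L1 depth term, and a cross-entropy semantic term; the loss weights and the exact term forms are our assumptions, not the paper's stated objective.

```python
# A hedged sketch of multi-modal 3DGS supervision; weights and term forms
# are assumptions, not the paper's exact recipe.
import numpy as np

def multimodal_loss(render_rgb, render_depth, render_sem,
                    target_rgb, target_depth, target_sem,
                    w_rgb=1.0, w_depth=0.1, w_sem=0.1):
    l_rgb = np.abs(render_rgb - target_rgb).mean()        # L1 photometric term
    l_depth = np.abs(render_depth - target_depth).mean()  # L1 depth term
    eps = 1e-8                                            # numerical stability
    # cross-entropy between one-hot targets and predicted class probabilities
    l_sem = -(target_sem * np.log(render_sem + eps)).sum(axis=-1).mean()
    return w_rgb * l_rgb + w_depth * l_depth + w_sem * l_sem
```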

Comparisons with other GS Works on NVS

We present qualitative comparisons with other GS works:

Visualization of Multi-modal Results

We present visualizations of results from the multi-modal diffusion model:

Visualization of New Trajectories

We present visualizations of novel trajectories shifted 2, 3, and 4 meters from the recorded path (shift2m, shift3m, shift4m); a sketch of constructing such shifted poses follows below:
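For illustration, here is a minimal sketch of how such shifted trajectories can be constructed: translate each camera pose along its own lateral axis. The pose convention (camera-to-world 4x4 matrices with the x-axis pointing right) is an assumption, not necessarily the one used in the paper.

```python
# A minimal sketch of building shifted evaluation trajectories (shift2m/3m/4m),
# assuming camera-to-world 4x4 poses whose x-axis points right.
import numpy as np

def shift_trajectory(c2w_poses, shift_m):
    """Return poses translated `shift_m` meters along each camera's right axis."""
    shifted = []
    for c2w in c2w_poses:
        right = c2w[:3, 0]                  # camera x-axis in world coordinates
        new_pose = c2w.copy()
        new_pose[:3, 3] += shift_m * right  # lateral offset in world space
        shifted.append(new_pose)
    return np.stack(shifted)

# Example: three evaluation trajectories derived from one recorded trajectory.
poses = np.tile(np.eye(4), (10, 1, 1))      # dummy camera-to-world poses
trajs = {f"shift{d}m": shift_trajectory(poses, d) for d in (2, 3, 4)}
```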

Comparisons with the State-of-the-art

We present qualitative comparisons with the following state-of-the-art models:

