Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images

ICCV 2025
Philipp Wulff, Felix Wimbauer, Dominik Muhle, Daniel Cremers
Technical University of Munich, MCML

TL;DR: We leverage fine-tuned diffusion models for inpainting and a pre-trained depth predictor to generate high-quality scene geometry from a single image. Afterwards, we distill a feed-forward scene reconstruction model, which performs on par with reconstruction methods trained with multi-view supervision.

Abstract

Volumetric scene reconstruction from a single image is crucial for a broad range of applications like autonomous driving and robotics. Recent volumetric reconstruction methods achieve impressive results, but generally require expensive 3D ground truth or multi-view supervision.

We propose to leverage pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image. This can then be used to distill a feed-forward scene reconstruction model. Our experiments on the challenging KITTI-360 and Waymo datasets demonstrate that our method matches or outperforms state-of-the-art baselines that use multi-view supervision, and offers unique advantages, for example regarding dynamic scenes.

Video

Method

Dream-to-Recon comprises three steps: a) We train a view completion model (VCM) that inpaints occluded areas and refines warped views. Training uses only a single view per scene and leverages forward-backward warping for data generation. b) The VCM is applied iteratively alongside a depth prediction network to synthesize virtual novel views, enabling progressive refinement of the 3D geometry. c) The synthesized scene geometries are then used to distill a feed-forward scene reconstruction model by supervising occupancy and virtual depth.

Method Overview
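
To make step a) concrete, below is a minimal PyTorch-style sketch of one plausible forward-backward warping scheme for generating VCM training pairs from a single image. All names here (forward_warp, depth_net, vcm_training_pair) are illustrative placeholders under our own assumptions, not the released implementation.

import torch

def forward_warp(img, depth, pose, K):
    # Nearest-neighbour forward splat with a z-buffer (illustrative only).
    # img: (C, H, W), depth: (H, W), pose: (4, 4) target-from-source, K: (3, 3).
    # Returns warped image, warped depth, and a hole mask (True = disoccluded).
    C, H, W = img.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)]).reshape(3, -1).float()
    pts = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)   # unproject to 3D
    pts = pose[:3, :3] @ pts + pose[:3, 3:4]                 # source -> target camera
    z = pts[2].clamp(min=1e-6)
    uv = (K @ (pts / z)).round().long()                      # project to target pixels
    valid = ((uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)
             & (pts[2] > 0) & (depth.reshape(-1) > 0))
    # Sort far-to-near so that the nearest surface is written last (simplified
    # z-buffer; duplicate-index handling is approximate on purpose).
    order = torch.argsort(z[valid], descending=True)
    idx = (uv[1] * W + uv[0])[valid][order]
    src = valid.nonzero().squeeze(1)[order]
    warped = torch.zeros(C, H * W)
    warped[:, idx] = img.reshape(C, -1)[:, src]
    wdepth = torch.zeros(H * W)
    wdepth[idx] = z[valid][order]
    holes = torch.ones(H * W, dtype=torch.bool)
    holes[idx] = False
    return warped.reshape(C, H, W), wdepth.reshape(H, W), holes.reshape(H, W)

def vcm_training_pair(img, depth_net, rand_pose, K):
    # Step a): degrade the real view by warping it to a random virtual pose and
    # back, so disocclusions appear as holes. The VCM is trained to map
    # (degraded view, hole mask) back to the original image.
    depth = depth_net(img)                                   # (H, W) monocular depth
    warped, wdepth, _ = forward_warp(img, depth, rand_pose, K)
    degraded, _, holes = forward_warp(warped, wdepth, torch.linalg.inv(rand_pose), K)
    return degraded, holes, img                              # input, mask, target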
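
Step b) can then be sketched as a loop that alternates warping, completion, and depth prediction. It reuses the hypothetical forward_warp helper above; the vcm and depth_net interfaces are likewise assumed for illustration.

@torch.no_grad()
def synthesize_geometry(img, depth_net, vcm, rel_poses, K):
    # Step b): iteratively synthesize virtual views to uncover occluded space.
    # rel_poses: virtual camera motions, each relative to the previous view.
    views = [(torch.eye(4), img, depth_net(img))]
    for rel in rel_poses:
        pose_prev, src, src_depth = views[-1]
        warped, _, holes = forward_warp(src, src_depth, rel, K)
        completed = vcm(warped, holes)          # diffusion-based inpainting/refinement
        views.append((rel @ pose_prev, completed, depth_net(completed)))
    return views  # posed RGB-D views, fused into a single scene geometry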
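
Finally, step c) distills the synthesized geometry into a feed-forward model. The sketch below assumes recon_net returns a field object exposing query and render_depth methods; both are hypothetical stand-ins for whatever occupancy and depth interface the model provides, not a documented API.

import torch.nn.functional as F

def distill_step(recon_net, img, pts, occ_target, views, K):
    # Step c): supervise occupancy at sampled 3D points and depth rendered
    # from the synthesized virtual viewpoints (both derived from step b).
    field = recon_net(img)                       # image -> queryable 3D field
    occ = field.query(pts)                       # predicted occupancy in [0, 1]
    loss = F.binary_cross_entropy(occ, occ_target)
    for pose, _, depth in views:                 # virtual-depth supervision
        loss = loss + F.l1_loss(field.render_depth(pose, K), depth)
    return loss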

Results

3D Reconstruction

KITTI-360

TODO


Waymo

TODO

View Completion

KITTI-360

TODO


Occlusion Masks

TODO


BibTeX

@inproceedings{wulff2025dreamtorecon,
  title     = {Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images},
  author    = {Wulff, Philipp and Wimbauer, Felix and Muhle, Dominik and Cremers, Daniel},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}

Acknowledgements: This work was funded by the ERC Advanced Grant "SIMULACRON" (agreement #884679), the GNI Project "AI4Twinning", and the DFG project CR 250/26-1 "4D YouTube".