Volumetric scene reconstruction from a single image is crucial for a broad range of applications like autonomous driving and robotics. Recent volumetric reconstruction methods achieve impressive results, but generally require expensive 3D ground truth or multi-view supervision.
We propose to leverage pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image. The synthesized geometry can then be used to distill a feed-forward scene reconstruction model. Our experiments on the challenging KITTI-360 and Waymo datasets demonstrate that our method matches or outperforms state-of-the-art baselines that rely on multi-view supervision, and offers unique advantages, for example when handling dynamic scenes.
Dream-to-Recon comprises three steps: a) We train a view completion model (VCM) that inpaints occluded areas and refines warped views. Training uses only a single view per scene and leverages forward-backward warping for data generation. b) The VCM is applied iteratively alongside a depth prediction network to synthesize virtual novel views, enabling progressive refinement of the 3D geometry. c) The synthesized scene geometries are then used to distill a feed-forward scene reconstruction model by supervising occupancy and virtual depth.
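To make steps b) and c) concrete, the following is a minimal, schematic sketch. All callables (`depth_net`, `vcm`, `warp_to_view`, `recon_net`) are hypothetical placeholders, and the losses are simplified for illustration; this is not the released implementation.

```python
# Schematic sketch of steps b) and c); interfaces and losses are illustrative only.
import torch.nn.functional as F


def synthesize_virtual_views(image, virtual_poses, depth_net, vcm, warp_to_view):
    """Step b: render virtual views from the input image and refine them with the VCM."""
    depth = depth_net(image)                       # monocular depth for the input view
    views, depths = [image], [depth]
    for pose in virtual_poses:                     # virtual cameras around the input view
        warped = warp_to_view(image, depth, pose)  # forward warp; occlusions leave holes
        completed = vcm(warped)                    # VCM inpaints holes and removes artifacts
        views.append(completed)
        depths.append(depth_net(completed))        # per-view depth feeds the fused geometry
    return views, depths


def distillation_loss(recon_net, image, query_points, occ_targets, virtual_depths):
    """Step c: supervise the feed-forward model with synthesized occupancy and virtual depth."""
    pred_occ, pred_depths = recon_net(image, query_points)
    loss_occ = F.binary_cross_entropy(pred_occ, occ_targets)
    loss_depth = sum(F.l1_loss(p, t) for p, t in zip(pred_depths, virtual_depths))
    return loss_occ + loss_depth
```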
We visualize the scene geometry by discretizing the occupancy into a voxel grid. We compare our method against Behind the Scenes (BTS) [Wimbauer 2023] and also train a second BTS baseline that uses depth prediction in addition to multi-view supervision (BTS-D).
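For illustration, a voxel-grid visualization of this kind could be obtained by sampling a continuous occupancy field on a regular grid and thresholding it. The sketch below assumes a hypothetical `occupancy_fn` that maps world-space points to occupancy probabilities; grid resolution and threshold are placeholder values.

```python
# Hedged sketch: discretize a continuous occupancy field into a boolean voxel grid.
import numpy as np


def voxelize_occupancy(occupancy_fn, bounds, resolution=(256, 256, 32), threshold=0.5):
    """Sample the occupancy field on a regular grid inside `bounds` and threshold it."""
    (x0, x1), (y0, y1), (z0, z1) = bounds
    xs = np.linspace(x0, x1, resolution[0])
    ys = np.linspace(y0, y1, resolution[1])
    zs = np.linspace(z0, z1, resolution[2])
    grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1)   # (X, Y, Z, 3) points
    occ = occupancy_fn(grid.reshape(-1, 3)).reshape(resolution)        # query the field
    return occ > threshold                                             # boolean voxel grid
```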
Both the synthesized scene (Ours) and the geometry predicted by the scene reconstruction model (Ours Distilled) exhibit high overall quality, on par with both BTS variants. However, the synthesized scenes occasionally display artifacts caused by poor view completion or depth prediction. The distilled scene reconstruction model does not exhibit these issues, as it is trained on a diverse dataset.
Scenes in the Waymo Open Dataset are more dynamic than those in KITTI-360. As shown in the following video, both BTS variants fail to reconstruct dynamic objects. This failure stems from their use of multi-view data from multiple timesteps, which introduces inconsistencies when objects are in motion. In contrast, our synthesized scenes naturally handle such scenarios.
While we do not focus on novel view synthesis, we use synthetic novel views as an intermediate step toward synthetic occupancy. We rely on our view completion model (VCM) to complete occlusions and remove artifacts in warped input images. The following video shows synthetic novel views obtained through a single render-and-refine step.
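As a rough illustration of one render-and-refine step, the sketch below unprojects the input view with its predicted depth, splats it into the virtual camera with a simple z-buffer, and hands the incomplete result to the VCM. The camera conventions, the naive per-pixel splatting loop, and the `vcm` callable are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of one render-and-refine step (conventions and loop are illustrative).
import numpy as np


def splat_to_novel_view(image, depth, K, T_novel_from_input):
    """Forward-warp `image` (H, W, 3) into the novel camera via point splatting
    with a simple z-buffer. Pixels not observed in the input remain zero (holes)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)      # homogeneous pixels
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)              # back-project, (3, N)
    pts = T_novel_from_input[:3, :3] @ pts + T_novel_from_input[:3, 3:]  # to novel camera frame
    proj = K @ pts
    z = proj[2]
    uv = np.round(proj[:2] / np.clip(z, 1e-6, None)).astype(int)         # target pixel coords

    out = np.zeros_like(image, dtype=np.float32)
    zbuf = np.full((H, W), np.inf)
    valid = (z > 0) & (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)
    colors = image.reshape(-1, 3)
    for i in np.flatnonzero(valid):                                      # naive z-buffered splat
        x, y = uv[0, i], uv[1, i]
        if z[i] < zbuf[y, x]:
            zbuf[y, x] = z[i]
            out[y, x] = colors[i]
    return out, zbuf


def render_and_refine(image, depth, K, T_novel_from_input, vcm):
    warped, _ = splat_to_novel_view(image, depth, K, T_novel_from_input)
    return vcm(warped)  # VCM inpaints the holes and removes warping artifacts
```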
When warping the input view into a novel view, sharp changes in the input depth map serve as a good proxy for where occlusions will appear. We leverage these depth gradients to mask regions in the novel view that are occluded in the input view, and post-process the mask to remove noise and fill in missing regions. We compare this strategy against a method that leverages two-way optical flow between the input and novel views to identify regions visible only in the latter. Occlusions are shown in yellow.
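The sketch below illustrates this depth-gradient masking idea under stated assumptions: the threshold and morphological parameters are placeholders, and `warp_mask_fn` is a hypothetical callable that transports an input-view mask into the novel view (e.g. by the same splatting used for the image).

```python
# Hedged sketch of a depth-gradient occlusion mask with simple morphological cleanup.
import numpy as np
from scipy import ndimage


def depth_gradient_occlusion_mask(depth, warp_mask_fn, grad_thresh=0.1, close_iters=2):
    """Return a boolean novel-view mask of regions likely occluded in the input view."""
    gy, gx = np.gradient(depth)
    grad_mag = np.sqrt(gx**2 + gy**2)
    edge_mask = grad_mag > grad_thresh * depth           # relative threshold on depth jumps
    novel_mask = warp_mask_fn(edge_mask.astype(np.float32)) > 0.5

    # Post-process: fill small gaps, then remove isolated noise pixels.
    novel_mask = ndimage.binary_closing(novel_mask, iterations=close_iters)
    novel_mask = ndimage.binary_opening(novel_mask, iterations=1)
    return novel_mask
```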
@inproceedings{wulff2025dreamtorecon,
title = {Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images},
author = {Wulff, Philipp and Wimbauer, Felix and Muhle, Dominik and Cremers, Daniel},
booktitle = {IEEE International Conference on Computer Vision (ICCV)},
year = {2025}
}
Acknowledgements: This work was funded by the ERC Advanced Grant "SIMULACRON" (agreement #884679), the GNI Project "AI4Twinning", and the DFG project CR 250/26-1 "4D YouTube".