GeNVS: 3D-Aware Diffusion Models | ICCV 2023

Eric R. Chan*, Koki Nagano*, Matthew A. Chan*, Alexander W. Bergman*, JJ Park*, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, Gordon Wetzstein

Generative Novel View Synthesis with 3D-Aware Diffusion Models

ABSTRACT

We present a diffusion-based model for 3D-aware generative novel view synthesis from as few as a single input image. Our model samples from the distribution of possible renderings consistent with the input and, even in the presence of ambiguity, is capable of rendering diverse and plausible novel views. To achieve this, our method makes use of existing 2D diffusion backbones but, crucially, incorporates geometry priors in the form of a 3D feature volume. This latent feature field captures the distribution over possible scene representations and improves our method’s ability to generate view-consistent novel renderings. In addition to generating novel views, our method has the ability to autoregressively synthesize 3D-consistent sequences. We demonstrate state-of-the-art results on synthetic renderings and room-scale scenes; we also show compelling results for challenging, real-world objects.

CITATION

E. Chan, K. Nagano, M. Chan, A. Bergman, J. J. Park, A. Levy, M. Aittala, S. De Mello, T. Karras, G. Wetzstein, GeNVS: Generative Novel View Synthesis with 3D-Aware Diffusion Models, ICCV 2023

@inproceedings{chan2023genvs,
  author    = {Eric R. Chan and Koki Nagano and Matthew A. Chan and Alexander W. Bergman and Jeong Joon Park and Axel Levy and Miika Aittala and Shalini De Mello and Tero Karras and Gordon Wetzstein},
  title     = {{GeNVS}: Generative Novel View Synthesis with {3D}-Aware Diffusion Models},
  booktitle = {ICCV},
  year      = {2023}
}

GeNVS framework

At a high level, our model operates as a conditional diffusion model for images, much like the models that have been successful at image inpainting, super-resolution, and other conditional image generation tasks. Conditioned on an input view, we generate novel views by progressively denoising a sample of Gaussian noise. Crucially, however, we embed 3D priors into the architecture in the form of a 3D feature field, which enhances the model's ability to synthesize views of complex scenes. We lift and aggregate features from the input image(s) into a 3D feature field. Given a query viewpoint, we volume-render a feature image that conditions a U-Net image denoiser. The entire model, including the feature encoder, volume renderer, and U-Net, is trained end-to-end as an image-conditional diffusion model. At inference time, we generate consistent sequences autoregressively.
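The following is a minimal, self-contained PyTorch sketch of this pipeline, for illustration only; it is not the released implementation. The module names (FeatureLifter, ConditionalUNet), the toy volume renderer that integrates along the depth axis rather than casting rays from the query camera, and all shapes and hyperparameters are our own assumptions for exposition.

# Illustrative sketch of a GeNVS-style forward pass (assumptions, not the authors' code):
# features from the input view are lifted into a 3D feature volume, volume-rendered
# from the query pose into a 2D feature image, and that feature image conditions a
# denoising U-Net.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureLifter(nn.Module):
    """Encodes an input view and lifts it into a coarse 3D feature volume."""

    def __init__(self, feat_dim=16, depth_bins=32):
        super().__init__()
        self.depth_bins = depth_bins
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim * depth_bins, 3, padding=1),
        )

    def forward(self, image):                        # (B, 3, H, W)
        feats = self.encoder(image)                  # (B, C * D, H, W)
        B, _, H, W = feats.shape
        return feats.view(B, -1, self.depth_bins, H, W)   # (B, C, D, H, W)


def render_feature_image(volume, density):
    """Toy volume rendering: front-to-back alpha compositing along the depth axis.

    A real implementation would cast rays for the query camera through the volume;
    here we simply integrate along the depth dimension to illustrate the idea.
    """
    alpha = 1.0 - torch.exp(-F.softplus(density))              # (B, 1, D, H, W)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :, :1]), 1.0 - alpha + 1e-10], dim=2),
        dim=2)[:, :, :-1]
    weights = alpha * trans                                    # compositing weights
    return (weights * volume).sum(dim=2)                       # (B, C, H, W)


class ConditionalUNet(nn.Module):
    """Stand-in denoiser: predicts noise from the noisy target plus the feature image."""

    def __init__(self, feat_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, noisy_target, feature_image):
        return self.net(torch.cat([noisy_target, feature_image], dim=1))


if __name__ == "__main__":
    B, H, W = 2, 64, 64
    lifter, denoiser = FeatureLifter(), ConditionalUNet()
    input_view = torch.randn(B, 3, H, W)
    noisy_target = torch.randn(B, 3, H, W)        # target view plus Gaussian noise

    volume = lifter(input_view)                   # (B, 16, 32, H, W)
    density = volume.mean(dim=1, keepdim=True)    # crude density proxy for the sketch
    feat_img = render_feature_image(volume, density)
    noise_pred = denoiser(noisy_target, feat_img)
    print(noise_pred.shape)                       # torch.Size([2, 3, 64, 64])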

Results

We demonstrate that our method achieves compelling single-image novel view synthesis results on challenging, unmasked scenes from the Common Objects in 3D (CO3D) dataset. To our knowledge, ours is the first work to attempt single-image novel view synthesis on this benchmark without object masks.

Flexibility is a strength of our approach. Beyond object-centric scenes, our method can also operate on large, inside-out scenes, such as room-scale scenes from the Matterport3D dataset. By autoregressively generating frames, we can explore far from the input pose.
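Below is a minimal sketch, under our own assumptions, of such an autoregressive rollout: each generated frame is appended to the conditioning set before the next pose on the camera path is synthesized. The function sample_novel_view is only a placeholder for the 3D-aware diffusion sampler described above; the shapes, poses, and step count are illustrative.

# Hypothetical autoregressive sampling loop (not the released code).
import torch


def sample_novel_view(cond_images, cond_poses, query_pose, steps=50):
    """Placeholder for the 3D-aware diffusion sampler: denoises Gaussian noise
    into a novel view conditioned on all views observed so far."""
    x = torch.randn(1, 3, 64, 64)          # start from pure noise
    for _ in range(steps):
        x = 0.9 * x                        # stand-in for a real denoising step
    return x


def generate_sequence(input_image, input_pose, camera_path):
    cond_images, cond_poses = [input_image], [input_pose]
    frames = []
    for query_pose in camera_path:
        frame = sample_novel_view(cond_images, cond_poses, query_pose)
        frames.append(frame)
        # Feed the newly generated frame back in as conditioning so later
        # frames stay consistent with everything rendered so far.
        cond_images.append(frame)
        cond_poses.append(query_pose)
    return frames


if __name__ == "__main__":
    path = [torch.eye(4) for _ in range(5)]        # dummy camera trajectory
    frames = generate_sequence(torch.randn(1, 3, 64, 64), torch.eye(4), path)
    print(len(frames))                             # 5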

Our method is competitive with state-of-the-art baselines for single-image novel view synthesis on objects from the ShapeNet dataset.

Acknowledgements

We thank David Luebke, Samuli Laine, Tsung-Yi Lin, and Jaakko Lehtinen for feedback on drafts and early discussions. We thank Jonáš Kulhánek and Xuanchi Ren for thoughtful communications and for providing results and data for comparisons. We thank Trevor Chan for help with figures. Koki Nagano and Eric Chan were partially supported by DARPA’s Semantic Forensics (SemaFor) contract (HR0011-20-3-0005). JJ Park was supported by ARL grant W911NF-21-2-0104. This project was in part supported by Samsung, the Stanford Institute for Human-Centered AI (HAI), and a PECASE from the ARO. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. Distribution Statement “A” (Approved for Public Release, Distribution Unlimited).

Related Projects

You may also be interested in related projects on 3D GANs and 3D-aware generative models, such as:

  • Deng et al., LumiGAN, 2023 (link)
  • Po and Wetzstein, Locally Conditioned Diffusion, 2023 (link)
  • Bergman et al., GNARF, NeurIPS 2022 (link)
  • Chan et al., EG3D, CVPR 2022 (link)
  • Chan et al., pi-GAN, CVPR 2021 (link)