SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene | CVPR 2023

Minjung Son, Jeong Joon Park, Leonidas Guibas, Gordon Wetzstein

Generating different realizations of a single 3D scene from a few images.

ABSTRACT

Generative models have shown great promise in synthesizing photorealistic 3D objects, but they require large amounts of training data. We introduce SinGRAF, a 3D-aware generative model that is trained with a few input images of a single scene. Once trained, SinGRAF generates different realizations of this 3D scene that preserve the appearance of the input while varying scene layout. For this purpose, we build on recent progress in 3D GAN architectures and introduce a novel progressive-scale patch discrimination approach during training. With several experiments, we demonstrate that the results produced by SinGRAF outperform the closest related works in both quality and diversity by a large margin.

CITATION

M. Son, J. J. Park, L. Guibas, G. Wetzstein, SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene, CVPR 2023.

@inproceedings{son2023singraf,
  author    = {M. Son and J. J. Park and L. Guibas and G. Wetzstein},
  title     = {SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene},
  booktitle = {CVPR},
  year      = {2023},
}

OVERVIEW

SinGRAF generates different plausible realizations of a single 3D scene from a few unposed input images of that scene. In this example, the “office_3” scene, we use 100 input images, four of which are shown in the top row. Next, we visualize four realizations of the 3D scene as panoramas, rendered using the generated neural radiance fields. Note the variations in scene layout, including chairs, tables, lamps, and other objects, while the results stay faithful to the structure and style of the input images.

SinGRAF pipeline. The framework takes as input a few images of a single scene, for example rendered from a 3D scan or photographed (bottom). The 3D-aware generator (top left) is trained to generate 2D feature planes that are arranged in a triplane configuration and rendered into patches of varying scale (top right). These rendered patches, along with patches cropped from the input images, are then compared by a discriminator. Once trained, the SinGRAF generator synthesizes different realizations of the 3D scene that resemble the appearance of the training images while varying the layout.
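To make the progressive-scale patch discrimination concrete, the following is a minimal PyTorch-style sketch, not the paper's implementation: the helper names (scale_schedule, sample_patch), the schedule endpoints, and the patch resolution are illustrative assumptions. The idea it illustrates is that rendered views and real input images are both cropped at a common, gradually changing scale before being passed to the patch discriminator.

import torch
import torch.nn.functional as F

def scale_schedule(step, total_steps, s_max=1.0, s_min=0.25):
    # Hypothetical progressive schedule: the relative patch scale anneals from
    # s_max to s_min over training, so the discriminator first sees coarse
    # scene layout and later finer detail. The endpoints are assumptions.
    t = min(step / total_steps, 1.0)
    return s_max + (s_min - s_max) * t

def sample_patch(image, scale, patch_res=64):
    # Crop a random square patch covering a fraction `scale` of the image
    # extent and resample it to a fixed resolution with a bilinear grid.
    b, _, h, w = image.shape
    max_offset = 1.0 - scale  # room left for the patch center, in [-1, 1] coords
    cx = (torch.rand(b, 1, 1, device=image.device) * 2 - 1) * max_offset
    cy = (torch.rand(b, 1, 1, device=image.device) * 2 - 1) * max_offset
    lin = torch.linspace(-scale, scale, patch_res, device=image.device)
    gy, gx = torch.meshgrid(lin, lin, indexing="ij")
    grid = torch.stack((gx.expand(b, -1, -1) + cx,
                        gy.expand(b, -1, -1) + cy), dim=-1)  # (b, P, P, 2)
    return F.grid_sample(image, grid, align_corners=True)

# Usage sketch: real and generated views are cropped at the same scale before
# being fed to the patch discriminator.
# s = scale_schedule(step, total_steps)
# real_patch = sample_patch(real_images, s)
# fake_patch = sample_patch(fake_renders, s)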

RESULTS


Results of scenes from the Replica dataset. For each scene, we show a camera rotating inside two different realizations (left and center), generated from the same input images. Interpolating the latent code of the 3D GAN results in a smooth, semantically meaningful interpolation of these scenes (right).

Results of a scene from the Matterport3D dataset. We show a camera rotating inside two different realizations of this scene (left and center), generated from the same input images. Interpolating the latent code of the 3D GAN results in a smooth, semantically meaningful interpolation of these scenes (right).

Results of an outdoor scene captured with a cellphone camera. We show a camera rotating inside two different realizations of this scene (left and center), generated from the same input images by the GSN baseline (DeVries et al., ICCV 2021) (top) and SinGRAF (bottom). Interpolating the latent code of the 3D GAN results in no variation for GSN but a smooth, semantically meaningful interpolation of these scenes for SinGRAF (right). The GSN algorithm does not allow diverse realizations of this scene to be sampled.
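The latent-code interpolations shown above can be reproduced in spirit with a few lines of code. The sketch below assumes a generator conditioned on a single global latent vector; generator, render_panorama, and the latent dimension of 128 are placeholders, not names from the SinGRAF code.

import torch

def interpolate_latents(z0, z1, num_steps=8):
    # Linearly blend two latent codes; each intermediate code is decoded by
    # the generator into an in-between realization of the scene.
    return [torch.lerp(z0, z1, w) for w in torch.linspace(0.0, 1.0, num_steps)]

# Usage sketch (the generator, renderer, and latent size are assumptions):
# z0, z1 = torch.randn(1, 128), torch.randn(1, 128)
# for z in interpolate_latents(z0, z1):
#     panorama = render_panorama(generator, z)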

RELATED PROJECTS

You may also be interested in related projects on neural scene representations, such as:

  • Chan et al. EG3D. CVPR 2022
  • DeVries et al. GSN. ICCV 2021
  • Chan et al. pi-GAN. CVPR 2021
  • Sitzmann et al. Implicit Neural Representations with Periodic Activation Functions. NeurIPS 2020
  • Sitzmann et al. Scene Representation Networks. NeurIPS 2019