Scene Representation Networks | NeurIPS 2019

Vincent Sitzmann, Michael Zollhöfer, Gordon Wetzstein

A data-efficient, scalable, interpretable and flexible neural scene representation.

Supplemental Video

ABSTRACT

The advent of deep learning has given rise to neural scene representations – learned mathematical models of a 3D environment. However, many of these representations do not explicitly reason about geometry and thus do not account for the underlying 3D structure of the scene. In contrast, geometric deep learning has explored 3D-structure-aware representations of scene geometry, but requires explicit 3D supervision. We propose Scene Representation Networks (SRNs), a continuous, 3D-structure-aware scene representation that encodes both geometry and appearance. SRNs represent scenes as continuous functions that map world coordinates to a feature representation of local scene properties. By formulating the image formation as a differentiable ray-marching algorithm, SRNs can be trained end-to-end from only 2D observations, without access to depth or geometry. This formulation naturally generalizes across scenes, learning powerful geometry and appearance priors in the process. We demonstrate the potential of SRNs by evaluating them for novel view synthesis, few-shot reconstruction, joint shape and appearance interpolation, and unsupervised discovery of a non-rigid face model.
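To make the formulation concrete, the sketch below is a simplified illustration, not the released implementation: the paper uses a hypernetwork to produce scene-specific MLP weights and an LSTM-based ray marcher, and all layer widths, step counts, and names here are assumptions. It shows the two core ingredients, an MLP mapping world coordinates to local features and a differentiable marcher that advances each camera ray by a predicted step length.

import torch
import torch.nn as nn

class SceneMLP(nn.Module):
    """Continuous scene representation: world coordinate -> local feature vector."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def forward(self, xyz):                 # xyz: (N, 3) world coordinates
        return self.net(xyz)                # (N, feature_dim) local scene features

def march_rays(scene, step_predictor, origins, directions, num_steps=10):
    """Advance each ray by a predicted step length; return features at the final points."""
    depth = torch.full((origins.shape[0], 1), 0.05)        # initial distance along each ray
    for _ in range(num_steps):
        points = origins + depth * directions               # (N, 3) current sample points
        features = scene(points)                            # query the scene representation
        depth = depth + torch.relu(step_predictor(features))  # differentiable step forward
    return features, depth                                   # features at the estimated surface

A per-pixel generator (omitted here) would map each ray's final feature to an RGB value, so the whole pipeline can be supervised with a 2D reconstruction loss alone.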

Generalizing Shape & Appearance Priors Across Scenes

SRNs explain all 2D observations in 3D, leading to unsupervised, yet explicit, reconstruction of geometry jointly with appearance. Normal maps visualize the reconstructed geometry and make SRNs interpretable. On the left, you can see the normal maps of the reconstructed geometry – note that these are learned fully unsupervised! In the center, you can see novel views generated by SRNs, and on the right, the ground-truth views. This model was trained on 50 2D observations each of ~2.5k cars from the ShapeNet v2 dataset.
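As a rough illustration of how such normal maps can be obtained, the sketch below computes normals by finite differences on the back-projected 3D surface points returned by the ray marcher; the function name and tensor layout are assumptions, not the released code.

import torch
import torch.nn.functional as F

def normals_from_points(points):
    """points: (H, W, 3) world-space surface points, one per pixel, from the ray marcher."""
    dx = points[:, 1:, :] - points[:, :-1, :]          # differences between horizontal neighbors
    dy = points[1:, :, :] - points[:-1, :, :]          # differences between vertical neighbors
    n = torch.cross(dx[:-1, :, :], dy[:, :-1, :], dim=-1)  # surface normal up to sign/scale
    return F.normalize(n, dim=-1)                      # unit normals, shape (H-1, W-1, 3)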

Camera Pose Extrapolation

SRNs generate images without convolutional neural networks – pixels of a rendered image are connected only via the 3D scene representation and can be generated completely independently. SRNs can thus be sampled at arbitrary image resolutions without retraining and naturally generalize to completely unseen camera transformations. The model that generated the images above was trained on cars, but only on views at a constant distance from each car – yet it still enables zoom and camera roll, even though these transformations were never observed at training time. In contrast, models with black-box neural renderers fail to generate such novel views.
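A minimal sketch of this per-pixel rendering, assuming a simple pinhole camera (names and conventions are illustrative): each pixel only defines a ray, so rendering at a new resolution or under a new camera pose simply means generating a different set of rays for the same trained model.

import torch

def generate_rays(H, W, focal, cam_to_world):
    """Return (H*W, 3) ray origins and unit directions in world coordinates."""
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    dirs_cam = torch.stack([(u - W / 2) / focal,            # x in camera frame
                            (v - H / 2) / focal,            # y in camera frame
                            torch.ones_like(u)], dim=-1).reshape(-1, 3)
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T          # rotate into the world frame
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    origins = cam_to_world[:3, 3].expand_as(dirs_world)     # camera center for every ray
    return origins, dirs_world

Doubling H and W (or rolling the camera by changing cam_to_world) produces new rays that the same scene representation can be queried with, without any retraining.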

Instance Interpolation

Single-image Reconstruction

When generalized over a class of scenes, SRNs enable few-shot reconstruction of both shape and appearance – a car, for instance, can be reconstructed from a single observation, enabling nearly multi-view-consistent novel view generation.
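Inference for a new scene follows an auto-decoder pattern: the network weights stay frozen and only a latent code for the new object is optimized against the given observation(s). Below is a hedged sketch of that loop; render() stands in for the trained, differentiable SRN renderer and is a placeholder, not the released API.

import torch

def reconstruct_from_single_view(render, image, origins, dirs, latent_dim=256, steps=1000):
    """Optimize only a per-scene latent code against one posed 2D observation."""
    z = torch.zeros(latent_dim, requires_grad=True)        # latent code of the new scene
    optim = torch.optim.Adam([z], lr=1e-3)
    for _ in range(steps):
        pred = render(z, origins, dirs)                     # differentiable rendering of the view
        loss = ((pred - image) ** 2).mean()                 # 2D reconstruction loss only
        optim.zero_grad()
        loss.backward()                                      # gradients flow through ray marching
        optim.step()
    return z.detach()                                        # reused to render novel views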

Non-rigid Deformation

Because surfaces are parameterized smoothly, SRNs naturally allow for non-rigid deformation. The model above was trained on 50 images each of 1000 faces, where we used the ground-truth identity and expression parameters as latent codes. Each identity was observed with only a single facial expression. By fixing the identity parameters and varying the expression parameters, SRNs allow for non-rigid deformation of the learned face model, effortlessly generalizing facial expressions across identities (right). As with the cars and chairs above, interpolating latent vectors yields smooth interpolation of the respective identities and expressions (left). Note that all movements are reflected in the normal map as well as in the appearance.
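The sketch below illustrates how separate identity and expression codes could condition such a model; the code dimensions, the concatenation, and all names are assumptions for illustration. Swapping or blending the two parts gives expression transfer and interpolation.

import torch

identity_codes = torch.randn(1000, 128)    # one code per training identity (dimension assumed)
expression_codes = torch.randn(1000, 64)   # one code per training expression (dimension assumed)

def scene_code(identity_idx, expression_idx):
    """Condition the SRN on any identity/expression pair, including unseen combinations."""
    return torch.cat([identity_codes[identity_idx],
                      expression_codes[expression_idx]], dim=-1)

# Expression transfer: render identity 3 with the expression observed only for identity 7.
z = scene_code(identity_idx=3, expression_idx=7)

# Interpolation: blend two identities while keeping the expression fixed.
alpha = 0.5
z_mix = torch.cat([alpha * identity_codes[3] + (1 - alpha) * identity_codes[12],
                   expression_codes[7]], dim=-1)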

Proof-of-concept: Inside-out Novel View Synthesis

Here, we show first results for inside-out novel view synthesis. We rendered 500 images of a Minecraft room and trained a single SRN with 500k parameters on this dataset.
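For scale, a plain coordinate MLP in this parameter range looks roughly like the following; layer widths are assumptions, so this is sizing intuition only, not the trained model.

import torch.nn as nn

layers = [nn.Linear(3, 256), nn.ReLU()]
for _ in range(7):
    layers += [nn.Linear(256, 256), nn.ReLU()]
layers += [nn.Linear(256, 3)]                        # RGB at each queried 3D point
net = nn.Sequential(*layers)

print(sum(p.numel() for p in net.parameters()))      # ~460k parameters, i.e. roughly half a million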


CITATION

V. Sitzmann et al., “Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations,” NeurIPS, 2019.

Bibtex

@inproceedings{sitzmann2019srns,
  author    = {Sitzmann, Vincent and Zollh{\"o}fer, Michael and Wetzstein, Gordon},
  title     = {Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations},
  booktitle = {Proc. NeurIPS},
  year      = {2019}
}

Related Projects

You may also be interested in related projects focusing on neural scene representations and rendering:

  • Chan et al. pi-GAN. CVPR 2021 (link)
  • Kellnhofer et al. Neural Lumigraph Rendering. CVPR 2021 (link)
  • Lindell et al. Automatic Integration for Fast Neural Rendering. CVPR 2021 (link)
  • Sitzmann et al. Implicit Neural Representations with Periodic Activation Functions. NeurIPS 2020 (link)
  • Sitzmann et al. MetaSDF. NeurIPS 2020 (link)
  • Sitzmann et al. DeepVoxels. CVPR 2019 (link)