DeepVoxels: Learning Persistent 3D Feature Embeddings | CVPR 2019

Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, Michael Zollhöfer

3D understanding in generative neural networks.

Supplemental Video

ABSTRACT

In this work, we address the lack of 3D understanding of generative neural networks by introducing a persistent 3D feature embedding for view synthesis. To this end, we propose DeepVoxels, a learned representation that encodes the view-dependent appearance of a 3D scene without having to explicitly model its geometry. At its core, our approach is based on a Cartesian 3D grid of persistent embedded features that learn to make use of the underlying 3D scene structure. Our approach combines insights from 3D geometric computer vision with recent advances in learning image-to-image mappings based on adversarial loss functions. DeepVoxels is supervised, without requiring a 3D reconstruction of the scene, using a 2D re-rendering loss, and enforces perspective and multi-view geometry in a principled manner. We apply our persistent 3D scene representation to the problem of novel view synthesis, demonstrating high-quality results for a variety of challenging scenes.
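As a rough illustration of the core idea (not the authors' exact implementation), the persistent representation can be thought of as a trainable 3D grid of per-voxel feature vectors optimized end-to-end from a purely 2D re-rendering loss; the grid resolution, feature dimension, and loss below are placeholders.

```python
import torch
import torch.nn as nn

# Illustrative sketch: the persistent scene representation as a trainable
# 3D grid of feature vectors. Gradients from a 2D re-rendering loss flow
# back into the per-voxel codes; no 3D reconstruction is required.
# Grid resolution (32^3) and feature dimension (64) are placeholder values.
voxel_features = nn.Parameter(0.01 * torch.randn(1, 64, 32, 32, 32))

def rerendering_loss(rendered_rgb: torch.Tensor, target_rgb: torch.Tensor) -> torch.Tensor:
    """L1 photometric loss on the rendered image -- a simple stand-in for
    the 2D re-rendering loss; the abstract notes that adversarial loss
    functions also play a role in the full approach."""
    return (rendered_rgb - target_rgb).abs().mean()
```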

Learning persistent object representations

During training, we learn a persistent DeepVoxels representation that encodes the view-dependent appearance of a 3D object from a dataset of posed multi-view images (top). At test time, DeepVoxels enables novel view synthesis (bottom).

Overview of all model components: at the heart of our encoder-decoder-based architecture is a novel, viewpoint-invariant, and persistent 3D volumetric scene representation called DeepVoxels, which enforces spatial structure on the learned per-voxel code vectors.
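As a rough sketch of how these components fit together (module names, call signatures, and inputs below are assumptions rather than the authors' code), the data flow is: encode the source view in 2D, lift the features into the persistent voxel grid, project and collapse the grid for the target camera, and decode the resulting feature map into an image.

```python
import torch.nn as nn


class DeepVoxelsSketch(nn.Module):
    """High-level skeleton of the pipeline described above (not the authors'
    exact architecture). All sub-modules are placeholders supplied by the
    caller; only the data flow is meant to be illustrative."""

    def __init__(self, encoder, lifting, occlusion, renderer, voxel_grid):
        super().__init__()
        self.encoder = encoder        # 2D CNN: source image -> feature map
        self.lifting = lifting        # integrates 2D features into the 3D grid
        self.occlusion = occlusion    # projects the grid into the target view
                                      # and collapses it along the depth axis
        self.renderer = renderer      # 2D CNN: feature map -> RGB image
        self.voxel_grid = voxel_grid  # persistent per-voxel code vectors

    def forward(self, source_img, source_pose, target_pose):
        feats_2d = self.encoder(source_img)
        voxels = self.lifting(self.voxel_grid, feats_2d, source_pose)
        feat_map, depth_map = self.occlusion(voxels, target_pose)
        return self.renderer(feat_map), depth_map
```

The key design point reflected here is that the voxel grid persists across views and training iterations, while the encoder and renderer operate purely in 2D.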

Explicit Occlusion Reasoning

Occlusion reasoning is essential to rendering. To this end, we propose a differentiable occlusion module that explicitly reasons about and enforces voxel occlusions. The feature volume (represented by the feature grid) is first resampled into the canonical view volume via a projection transformation and trilinear interpolation. The occlusion network then predicts per-pixel softmax weights along each depth ray. The canonical view volume is then collapsed along the depth dimension via a convex combination of voxels to yield the final, occlusion-aware feature map. The per-voxel visibility weights can also be used to compute a depth map.
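A minimal PyTorch sketch of the depth-wise softmax collapse described above; the small 3D CNN, feature dimension, and tensor shapes are assumptions, and the resampling into the canonical view volume is assumed to have already happened.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OcclusionModule(nn.Module):
    """Illustrative sketch of a differentiable occlusion step.

    Given a canonical view volume (features already resampled into the
    target camera frustum), a small 3D CNN predicts one visibility logit
    per voxel. A softmax along the depth axis turns these into convex
    weights, which collapse the volume into a 2D feature map and also
    yield a depth map. Network size and shapes are placeholder choices."""

    def __init__(self, feature_dim=64):
        super().__init__()
        self.vis_net = nn.Sequential(
            nn.Conv3d(feature_dim, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(16, 1, kernel_size=3, padding=1),
        )

    def forward(self, view_volume, depth_values):
        # view_volume: (B, C, D, H, W) canonical view volume.
        # depth_values: (D,) depth of each slice along the ray.
        logits = self.vis_net(view_volume)             # (B, 1, D, H, W)
        weights = F.softmax(logits, dim=2)             # convex weights per ray
        feat_map = (weights * view_volume).sum(dim=2)  # (B, C, H, W)
        depth_map = (weights.squeeze(1) *
                     depth_values.view(1, -1, 1, 1)).sum(dim=1)  # (B, H, W)
        return feat_map, depth_map
```

Because the softmax makes the per-ray weights non-negative and sum to one, the collapse is exactly the convex combination mentioned in the caption, and the same weights give an expected depth per pixel.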

Results on Real Captures

We have trained DeepVoxels on a number of real scenes captured with a DSLR camera. Camera poses were obtained via sparse bundle adjustment. The animations above show novel views rendered with the DeepVoxels representation on the left and the nearest neighbor in the training set on the right.

FILES

CITATION

V. Sitzmann et al., “DeepVoxels: Learning Persistent 3D Feature Embeddings,” in Proc. CVPR, 2019.

BibTeX

@inproceedings{sitzmann2019deepvoxels,
author = {Sitzmann, Vincent
and Thies, Justus
and Heide, Felix
and Nie{\ss}ner, Matthias
and Wetzstein, Gordon
and Zollh{\"o}fer, Michael},
title = {DeepVoxels: Learning Persistent 3D Feature Embeddings},
booktitle = {Proc. CVPR},
year = {2019}
}

Related Projects

You may also be interested in related projects focusing on neural scene representations and rendering:

  • Chan et al. pi-GAN. CVPR 2021 (link)
  • Kellnhofer et al. Neural Lumigraph Rendering. CVPR 2021 (link)
  • Lindell et al. Automatic Integration for Fast Neural Rendering. CVPR 2021 (link)
  • Sitzmann et al. Implicit Neural Representations with Periodic Activation Functions. NeurIPS 2020 (link)
  • Sitzmann et al. MetaSDF. NeurIPS 2020 (link)
  • Sitzmann et al. Scene Representation Networks. NeurIPS 2019 (link)