Depth from Defocus with Learned Optics | ICCP 2021

Hayato Ikoma, Cindy Nguyen, Christopher Metzler, Yifan (Evan) Peng, and Gordon Wetzstein

End-to-end optimization of optics and image processing for passive depth estimation.

ABSTRACT

Monocular depth estimation remains a challenging problem, despite significant advances in neural network architectures that leverage pictorial depth cues alone. Inspired by depth from defocus and emerging point spread function engineering approaches that optimize programmable optics end-to-end with depth estimation networks, we propose a new and improved framework for depth estimation from a single RGB image using a learned phase-coded aperture. Our optimized aperture design uses rotational symmetry constraints for computational efficiency, and we jointly train the optics and the network using an occlusion-aware image formation model that provides more accurate defocus blur at depth discontinuities than previous techniques do. Using this framework and a custom prototype camera, we demonstrate state-of-the-art image and depth estimation quality among end-to-end optimized computational cameras in simulation and experiment.
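As a concrete illustration of the rotational symmetry constraint mentioned in the abstract, the following minimal Python/PyTorch sketch parameterizes the optic by a 1D radial height profile and maps it onto a 2D grid with differentiable linear interpolation. The variable names, grid size, and number of radial samples are illustrative assumptions, not the implementation used in the paper.

    # Minimal sketch (not the paper's code): a rotationally symmetric DOE is
    # parameterized by a 1D radial height profile; only these few values are
    # optimized, and the 2D height map is reconstructed differentiably.
    import torch

    n_pixels = 256   # simulation grid size (illustrative)
    n_rings = 128    # number of radial samples (illustrative)

    # Learnable 1D radial height profile (meters).
    radial_height = torch.zeros(n_rings, requires_grad=True)

    # Radial coordinate of every pixel, scaled to index the radial samples.
    y, x = torch.meshgrid(
        torch.arange(n_pixels) - n_pixels / 2,
        torch.arange(n_pixels) - n_pixels / 2,
        indexing="ij",
    )
    r = torch.sqrt(x.float() ** 2 + y.float() ** 2)
    r = torch.clamp(r / r.max() * (n_rings - 1), max=n_rings - 1)

    # Linear interpolation between neighboring radial samples keeps the
    # 2D height map differentiable with respect to radial_height.
    r0 = r.floor().long()
    r1 = torch.clamp(r0 + 1, max=n_rings - 1)
    w = r - r0.float()
    height_map = (1 - w) * radial_height[r0] + w * radial_height[r1]

    print(height_map.shape)  # torch.Size([256, 256])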


CITATION

Hayato Ikoma, Cindy M. Nguyen, Christopher A. Metzler, Yifan Peng, and Gordon Wetzstein, "Depth from Defocus with Learned Optics for Imaging and Occlusion-aware Depth Estimation", IEEE International Conference on Computational Photography (ICCP) 2021

@inproceedings{Ikoma:2021,
  author    = {Hayato Ikoma and Cindy M. Nguyen and Christopher A. Metzler and Yifan Peng and Gordon Wetzstein},
  title     = {Depth from Defocus with Learned Optics for Imaging and Occlusion-aware Depth Estimation},
  booktitle = {IEEE International Conference on Computational Photography (ICCP)},
  year      = {2021}
}

Illustration of the end-to-end (E2E) optimization framework. RGBD images from a training set are convolved with the depth-dependent 3D PSF created by a lens surface profile h and combined using alpha compositing. The resulting sensor image i is processed by an approximate-inverse-based preconditioner before being fed into the CNN. A loss function L is applied to both the resulting RGB image and the depth map, and the error is backpropagated to the CNN parameters and to the surface profile of the phase-coded aperture.
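For readers who want the gist in code, here is a hedged, toy-scale sketch of the training loop this figure describes. `simulate_sensor`, `precondition`, and `ReconstructionCNN` are hypothetical placeholders standing in for the occlusion-aware image formation, the approximate-inverse preconditioner, and the reconstruction network, and the L1 losses are illustrative rather than the paper's exact loss.

    # Toy sketch of the end-to-end loop (illustrative, not the paper's code).
    import torch

    def simulate_sensor(rgb, depth, doe_profile):
        # Placeholder for the occlusion-aware image formation model
        # (see the compositing sketch further below).
        return rgb

    def precondition(sensor_image):
        # Placeholder for the approximate-inverse preconditioner.
        return sensor_image

    class ReconstructionCNN(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.net = torch.nn.Conv2d(3, 4, kernel_size=3, padding=1)  # toy stand-in
        def forward(self, x):
            out = self.net(x)
            return out[:, :3], out[:, 3:]  # (RGB estimate, depth estimate)

    doe_profile = torch.zeros(128, requires_grad=True)  # learnable DOE heights
    cnn = ReconstructionCNN()
    optimizer = torch.optim.Adam([doe_profile, *cnn.parameters()], lr=1e-4)

    rgb = torch.rand(2, 3, 64, 64)    # toy RGBD training batch
    depth = torch.rand(2, 1, 64, 64)

    sensor = simulate_sensor(rgb, depth, doe_profile)
    rgb_hat, depth_hat = cnn(precondition(sensor))
    loss = torch.nn.functional.l1_loss(rgb_hat, rgb) + \
           torch.nn.functional.l1_loss(depth_hat, depth)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

With a real, differentiable simulate_sensor, gradients of the loss flow through the rendered sensor image back into doe_profile, which is what couples the optical design to the network training.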
Comparison of image formation models that simulate defocus blur from an RGB image (top left) and a depth map (top right). Existing linear models, including the variants of Wu et al. and Chang et al., do not adequately model blur at depth discontinuities. Our nonlinear occlusion-aware model achieves a more faithful approximation of a ray-traced ground-truth image.
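Below is a hedged sketch of a layered, occlusion-aware defocus model of the kind this comparison refers to: each depth layer and its alpha mask are blurred with that layer's PSF, and the layers are composited back to front with the "over" operator. The layering, normalization, and PSF handling are simplified assumptions and do not reproduce the paper's exact model.

    # Layered, occlusion-aware defocus rendering (illustrative sketch).
    import torch
    import torch.nn.functional as F

    def blur(img, psf):
        """Convolve each channel of img (B, C, H, W) with a single 2D PSF."""
        c = img.shape[1]
        kernel = psf.expand(c, 1, *psf.shape)   # one copy of the PSF per channel
        return F.conv2d(img, kernel, padding=psf.shape[-1] // 2, groups=c)

    def occlusion_aware_image(rgb, depth, psfs, depth_bins):
        """Composite depth layers back to front with their defocus PSFs.
        psfs and depth_bins are ordered far to near."""
        image = torch.zeros_like(rgb)
        for psf, (lo, hi) in zip(psfs, depth_bins):          # far -> near
            alpha = ((depth >= lo) & (depth < hi)).float()   # layer mask
            layer_rgb = blur(rgb * alpha, psf)               # blurred layer content
            layer_alpha = blur(alpha, psf)                   # blurred layer coverage
            # "over" compositing: the nearer layer occludes what is behind it.
            image = layer_rgb + (1.0 - layer_alpha) * image
        return image

    # Toy usage: two depth layers, a defocused far PSF and an in-focus near PSF.
    rgb = torch.rand(1, 3, 64, 64)
    depth = torch.rand(1, 1, 64, 64)
    far_psf = torch.ones(7, 7) / 49.0           # broad blur for the far layer
    near_psf = torch.zeros(7, 7)
    near_psf[3, 3] = 1.0                        # delta (in focus) for the near layer
    img = occlusion_aware_image(rgb, depth,
                                psfs=[far_psf, near_psf],
                                depth_bins=[(0.5, 1.01), (0.0, 0.5)])
    print(img.shape)  # torch.Size([1, 3, 64, 64])

A purely linear model, by contrast, simply sums the blurred layer contributions without accounting for which layer is in front, which is what causes the inaccurate blur at depth discontinuities shown in the figure.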
Prototype phase-coded aperture camera. (a) A disassembled camera lens next to our fabricated DOE with a 3D-printed mounting adapter. (b) A microscopic image of the fabricated DOE. The dark gray area is the DOE made of NOA61, and the light gray area is the light-blocking metal aperture made of chromium and gold. The black scale bar on the bottom right is 1 mm. (c) The height profile of the designed DOE. The maximum height is 2.1 µm.
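For reference, the phase delay such a height profile imparts is commonly modeled for a thin DOE in air as φ = 2π (n − 1) h / λ. The short snippet below evaluates this for the 2.1 µm maximum height given above, assuming a refractive index of roughly 1.56 for NOA61 in the visible; that index value is an approximation, and the exact dispersion matters for the full simulation.

    # Illustrative height-to-phase conversion for a thin DOE (not calibration code).
    import numpy as np

    def height_to_phase(height_m, wavelength_m, n_doe=1.56, n_air=1.0):
        """Phase delay phi = 2*pi*(n_doe - n_air)*h / lambda."""
        return 2.0 * np.pi * (n_doe - n_air) * height_m / wavelength_m

    h_max = 2.1e-6                               # maximum designed height (from the caption)
    for wavelength in (460e-9, 550e-9, 640e-9):  # illustrative RGB wavelengths
        phi = height_to_phase(h_max, wavelength)
        print(f"{wavelength * 1e9:.0f} nm: {phi / (2 * np.pi):.2f} waves of delay")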
Depth-dependent point spread functions (PSFs). The designed PSF (top row) is optimized with our end-to-end simulator. Owing to optical imperfections, the captured PSF (center row) deviates slightly from the design. Instead of working directly with the captured PSF, we fit a parametric model to it (bottom row), which we use to refine our CNN. The scale bar represents 100 µm. For visualization purposes, we convert the linear intensity of the PSF to amplitude by applying a square root.
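The exact parametric form of the fitted PSF is not specified on this page, so the sketch below uses a small mixture of isotropic Gaussians purely as an illustrative choice, fitted to a measured PSF by gradient-based least squares.

    # Illustrative PSF fitting with an assumed parametric form (Gaussian mixture).
    import torch

    def gaussian_mixture_psf(size, weights, sigmas):
        """Render a centered, normalized mixture-of-Gaussians PSF of shape (size, size)."""
        coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
        y, x = torch.meshgrid(coords, coords, indexing="ij")
        r2 = x**2 + y**2
        psf = torch.zeros(size, size)
        for w, s in zip(weights, sigmas):
            psf = psf + torch.nn.functional.softplus(w) * torch.exp(-r2 / (2 * s**2))
        return psf / psf.sum()

    # Stand-in for the measured PSF (in practice, the captured PSF image).
    target = gaussian_mixture_psf(31, torch.tensor([1.0, 0.2]), torch.tensor([1.5, 4.0]))

    weights = torch.tensor([0.5, 0.5], requires_grad=True)
    sigmas = torch.tensor([1.0, 3.0], requires_grad=True)
    opt = torch.optim.Adam([weights, sigmas], lr=0.05)

    for step in range(200):
        opt.zero_grad()
        pred = gaussian_mixture_psf(31, weights, sigmas)
        loss = torch.mean((pred - target) ** 2)
        loss.backward()
        opt.step()

    print(f"fit MSE: {loss.item():.2e}")  # the fitted PSF then stands in for the raw capture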
Experimentally captured results of indoor and outdoor scenes. From left: images of scenes captured with a conventional camera, depth maps estimated by a CNN from these conventional camera images, images captured by our phase-coded camera prototype with the optimized DOE, all-in-focus (AiF) images estimated by our algorithm from these coded sensor images, and depth maps estimated by our algorithm from the coded sensor images.


Experimentally captured video showing high temporal coherence of our method. From top-left to bottom-right: an image of the scene captured with a conventional camera, a depth map estimated by a CNN comparable to ours from this conventional camera image, a depth map estimated from the conventional image by a state-of-the-art monocular depth estimator (MiDaS), an image captured by our phase-coded camera prototype with the optimized DOE, an AiF image estimated by our CNN from this coded sensor image, and a depth map estimated by our CNN from the coded sensor image.

Related Projects

You may also be interested in these related projects, in which we apply the idea of Deep Optics, i.e., the end-to-end optimization of optics and image processing, to other applications such as image classification, extended depth-of-field imaging, superresolution imaging, and optical computing.

  • Wetzstein et al. 2020. AI with Optics & Photonics. Nature (review paper, link)
  • Martel et al. 2020. Neural Sensors. ICCP & TPAMI 2020 (link)
  • Dun et al. 2020. Learned Diffractive Achromat. Optica 2020 (link)
  • Metzler et al. 2020. Deep Optics for HDR Imaging. CVPR 2020 (link)
  • Chang et al. 2019. Deep Optics for Depth Estimation and Object Detection. ICCV 2019 (link)
  • Peng et al. 2019. Large Field-of-view Imaging with Learned DOEs. SIGGRAPH Asia 2019 (link)
  • Chang et al. 2018. Hybrid Optical-Electronic Convolutional Neural Networks with Optimized Diffractive Optics for Image Classification. Scientific Reports (link)
  • Sitzmann et al. 2018. End-to-end Optimization of Optics and Image Processing for Achromatic Extended Depth-of-field and Super-resolution Imaging. ACM SIGGRAPH 2018 (link)