Deep Optics for Monocular Depth Estimation and 3D Object Detection | ICCV 2019

Julie Chang, Gordon Wetzstein

Monocular depth estimation and 3D object detection with optimized optical elements, defocus blur, and chromatic aberrations.

ABSTRACT

Depth estimation and 3D object detection are critical for scene understanding but remain challenging to perform with a single image due to the loss of 3D information during image capture. Recent models using deep neural networks have improved monocular depth estimation performance, but there is still difficulty in predicting absolute depth and generalizing outside a standard dataset. Here we introduce the paradigm of deep optics, i.e. end-to-end design of optics and image processing, to the monocular depth estimation problem, using coded defocus blur as an additional depth cue to be decoded by a neural network. We evaluate several optical coding strategies along with an end-to-end optimization scheme for depth estimation on three datasets, including NYU Depth v2 and KITTI. We find an optimized freeform lens design yields the best results, but chromatic aberration from a singlet lens offers significantly improved performance as well. We build a physical prototype and validate that chromatic aberrations improve depth estimation on real-world results. In addition, we train object detection networks on the KITTI dataset and show that the lens optimized for depth estimation also results in improved 3D object detection performance.
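To make the end-to-end idea concrete: the phase mask's height map is treated as a trainable parameter of a differentiable image-formation layer, so the gradient of the depth loss flows through the reconstruction network back into the optics. The following is a minimal PyTorch-style sketch, assuming a differentiable image-formation function and a depth-regression CNN are supplied; simulate_sensor_image, net, and the learning rates are illustrative placeholders, not the paper's implementation.

import torch

def train_end_to_end(dataloader, simulate_sensor_image, net, height_map,
                     lr_net=1e-4, lr_optics=1e-8, epochs=1):
    """Jointly optimize a phase-mask height map and a depth-regression CNN.

    simulate_sensor_image(rgb, depth, height_map) is a placeholder for a
    differentiable image-formation model (see the sketches in the next
    section); net is any depth-regression network such as a U-Net;
    height_map is a tensor created with requires_grad=True.
    """
    optimizer = torch.optim.Adam([
        {'params': net.parameters(), 'lr': lr_net},
        {'params': [height_map], 'lr': lr_optics},  # optics vary on a much finer scale
    ])
    for _ in range(epochs):
        for rgb, depth in dataloader:
            sensor = simulate_sensor_image(rgb, depth, height_map)  # coded sensor image
            pred = net(sensor)                                      # predicted depth map
            loss = torch.mean((pred - depth) ** 2)                  # squared error on linear depth
            optimizer.zero_grad()
            loss.backward()   # gradients reach both the CNN and the lens parameters
            optimizer.step()
    return net, height_map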

OPTICAL AND NETWORK MODEL

PSF simulation model: (Top) Optical propagation model of point sources through a phase mask placed in front of a thin lens. PSFs are simulated by calculating the intensity of the electric field at the sensor plane; a numerical sketch of this simulation follows below. (Bottom) Sample PSFs from thin-lens defocus only, with chromatic aberrations, and from an optimized mask initialized with astigmatism.
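Numerically, this amounts to propagating a paraxial spherical wave from a point source through the lens and phase-mask phase delays to the sensor plane and taking the squared magnitude of the resulting field. A minimal numpy sketch under those assumptions; the grid size, focal length, wavelength, and aperture are illustrative values, not the paper's.

import numpy as np

def simulate_psf(z_obj, height_map, wavelength=550e-9, focal_length=50e-3,
                 aperture=5e-3, sensor_dist=None, n=512, dx=2e-6,
                 refractive_index=1.5):
    """Intensity PSF of an on-axis point source at distance z_obj (m) imaged
    through a thin lens plus a phase mask with the given (n, n) height map."""
    if sensor_dist is None:
        sensor_dist = focal_length  # sensor at the focal plane, i.e. focused at infinity
    x = (np.arange(n) - n / 2) * dx
    X, Y = np.meshgrid(x, x)
    r2 = X**2 + Y**2
    k = 2 * np.pi / wavelength

    # Diverging spherical wave from the point source, evaluated at the lens plane.
    field = np.exp(1j * k * r2 / (2 * z_obj))
    # Thin-lens quadratic phase, phase delay of the (freeform) mask, and aperture.
    field *= np.exp(-1j * k * r2 / (2 * focal_length))
    field *= np.exp(1j * k * (refractive_index - 1) * height_map)
    field *= (r2 <= (aperture / 2) ** 2)

    # Fresnel transfer-function propagation to the sensor plane.
    fx = np.fft.fftfreq(n, d=dx)
    FX, FY = np.meshgrid(fx, fx)
    H = np.exp(-1j * np.pi * wavelength * sensor_dist * (FX**2 + FY**2))
    field = np.fft.ifft2(np.fft.fft2(field) * H)

    psf = np.abs(field) ** 2
    return psf / psf.sum()

For example, simulate_psf(1.0, np.zeros((512, 512))) gives a defocus-only PSF for a point 1 m away; making focal_length (or refractive_index) wavelength-dependent approximates the chromatic-aberration case, and substituting a learned height map gives the optimized PSFs.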
Depth-dependent image formation: Given a set of lens parameters, an all-in-focus image, and its binned depth map, the image formation model generates the appropriate PSFs and applies depth-dependent convolution to simulate the corresponding sensor image, which is then passed into a U-Net for depth estimation.
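A simple way to approximate this depth-dependent convolution is layer-wise: mask the all-in-focus image per depth bin, blur each layer with that bin's PSF, and sum the layers. A PyTorch-style sketch under those simplifying assumptions, ignoring occlusion effects at layer boundaries and omitting sensor noise:

import torch
import torch.nn.functional as F

def depth_dependent_blur(image, depth_map, psfs, bin_edges):
    """Layered approximation of depth-dependent blur (occlusion at depth
    boundaries and sensor noise are ignored).

    image:     (3, H, W) all-in-focus image
    depth_map: (H, W) depth in meters
    psfs:      (D, 3, k, k) one RGB PSF per depth bin, k odd
    bin_edges: sequence of D + 1 bin edges in meters
    """
    D, _, k, _ = psfs.shape
    sensor = torch.zeros_like(image)
    for d in range(D):
        mask = ((depth_map >= bin_edges[d]) & (depth_map < bin_edges[d + 1])).float()
        layer = image * mask  # pixels that fall into this depth bin
        # Per-channel filtering with the bin's PSF (groups=3 keeps R, G, B separate).
        # Note: conv2d is cross-correlation; flip the PSF for a strict convolution.
        blurred = F.conv2d(layer.unsqueeze(0), psfs[d].unsqueeze(1),
                           padding=k // 2, groups=3)
        sensor = sensor + blurred.squeeze(0)
    return sensor  # simulated sensor image, ready to feed into the U-Net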

DEPTH ESTIMATION

NYU-DEPTHV2 AND KITTI EXAMPLES

Depth estimation results: (Top) Example results from the NYU Depth v2 dataset, with per-image RMSE (m), for the all-in-focus, defocus-only, chromatic aberration, and optimized models; the simulated sensor image from the optimized system is also shown. (Bottom) Example results from the KITTI dataset (cropped to fit), with per-image RMSE (m), for the all-in-focus and optimized models; the sensor image from the optimized model is also shown. All depth maps use the same colormap, but the maximum value is 7 m for NYU Depth v2 and 50 m for KITTI.

DEPTH ESTIMATION TEST ERROR

Depth estimation test error for different optical models on the three datasets. RMSEs are reported on linear depth in meters; see the paper for log-scaled results. The optimized model achieves the lowest error in every column. KITTI* denotes our subset of the KITTI dataset. A sketch of how these metrics are computed follows the table.

Optical Model                    Rectangles   NYU Depth v2   KITTI*
All-in-focus                     0.4626       0.9556         2.9100
Defocus only                     0.2268       0.4814         2.5400
Defocus + astigmatism            0.1348       0.4561         2.3634
Defocus + chromatic aberration   0.0984       0.4496         2.2566
Optimized                        0.0902       0.4325         1.9288
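For reference, a minimal sketch of these metrics; masking of invalid pixels (e.g. KITTI's sparse LiDAR ground truth) is an assumption here, and the log-scaled variant is included only as a pointer to the paper's additional metric.

import numpy as np

def depth_rmse(pred, gt, valid=None):
    """RMSE in meters between predicted and ground-truth depth maps."""
    if valid is None:
        valid = gt > 0  # ignore pixels without ground truth (e.g. sparse KITTI depth)
    err = pred[valid] - gt[valid]
    return np.sqrt(np.mean(err ** 2))

def depth_log_rmse(pred, gt, valid=None):
    """RMSE computed on log-scaled depth (see the paper for these results)."""
    if valid is None:
        valid = (gt > 0) & (pred > 0)
    err = np.log(pred[valid]) - np.log(gt[valid])
    return np.sqrt(np.mean(err ** 2))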

OPTICAL PROTOTYPE AND CAPTURED RESULTS

Optical prototype: For our real-world captures, we use a Canon camera with a Thorlabs singlet lens, which has inherent chromatic aberrations. We compare results from these images with corresponding all-in-focus images, captured by placing a pinhole over the lens.

Real-world results: (Top) Captured and calibrated depth-dependent PSFs, displayed at the same scale. (Bottom) Example images captured with our prototype (zoomed region inset), depth estimates exploiting the chromatic aberrations, and depth estimates from the corresponding all-in-focus images (not shown). The same depth colormap scale is used for all depth maps.

CITATION

Julie Chang and Gordon Wetzstein. Deep Optics for Monocular Depth Estimation and 3D Object Detection. IEEE International Conference on Computer Vision (ICCV), 2019.

BibTeX

@inproceedings{Chang:2019:DeepOptics3D,
  author    = {Julie Chang and Gordon Wetzstein},
  title     = {Deep Optics for Monocular Depth Estimation and 3D Object Detection},
  booktitle = {Proc. IEEE ICCV},
  year      = {2019},
}

Acknowledgements

This project was supported by an NSF CAREER Award (IIS 1553333), a Terman Faculty Fellowship, a Sloan Fellowship, an Okawa Research Grant, by the KAUST Office of Sponsored Research through the Visual Computing Center CCF grant, the DARPA REVEAL program, and the ARO (ECASE-Army Award W911NF-19-1-0120).

Related Projects

You may also be interested in related projects in which we apply the idea of deep optics, i.e. end-to-end optimization of optics and image processing, to other applications, such as image classification, extended depth-of-field imaging, super-resolution imaging, and optical computing.

  • Wetzstein et al. 2020. AI with Optics & Photonics. Nature (review paper, link)
  • Martel et al. 2020. Neural Sensors. ICCP & TPAMI 2020 (link)
  • Dun et al. 2020. Learned Diffractive Achromat. Optica 2020 (link)
  • Metzler et al. 2020. Deep Optics for HDR Imaging. CVPR 2020 (link)
  • Chang et al. 2019. Deep Optics for Depth Estimation and Object Detection. ICCV 2019 (link)
  • Peng et al. 2019. Large Field-of-view Imaging with Learned DOEs. SIGGRAPH Asia 2019 (link)
  • Chang et al. 2018. Hybrid Optical-Electronic Convolutional Neural Networks with Optimized Diffractive Optics for Image Classification. Scientific Reports (link)
  • Sitzmann et al. 2018. End-to-end Optimization of Optics and Image Processing for Achromatic Extended Depth-of-field and Super-resolution Imaging. ACM SIGGRAPH 2018 (link)

KITTI Result Videos

with optimized lens
Example KITTI result videos with optimized lens: From top to bottom, the simulated sensor images with the optimized lens, predicted depth maps, 2D bounding boxes (car/pedestrian/cyclist), and 3D bounding boxes (car/pedestrian/cyclist).
all-in-focus
Example KITTI result videos from original all-in-focus dataset images: From top to bottom, the original drive sequence, predicted depth maps, 2D bounding boxes (car/pedestrian/cyclist), and 3D bounding boxes (car/pedestrian/cyclist).