PixelRNN | CVPR 2024

Haley So, Laurie Bose, Piotr Dudek, Gordon Wetzstein

In-pixel recurrent neural networks for end-to-end-optimized perception with neural sensors.

ABSTRACT

Conventional image sensors digitize high-resolution images at fast frame rates, producing a large amount of data that needs to be transmitted off the sensor for further processing. This is challenging for perception systems operating on edge devices, because communication is power inefficient and induces latency. Fueled by innovations in stacked image sensor fabrication, emerging sensor-processors offer programmability and minimal processing capabilities directly on the sensor. We exploit these capabilities by developing an efficient recurrent neural network architecture, PixelRNN, that encodes spatio-temporal features on the sensor using purely binary operations. PixelRNN reduces the amount of data to be transmitted off the sensor by a factor of 64x compared to conventional systems while offering competitive accuracy for hand gesture recognition and lip reading tasks. We experimentally validate PixelRNN using a prototype implementation on the SCAMP-5 sensor-processor platform.

CITATION

H. So, L. Bose, P. Dudek, G. Wetzstein, PixelRNN: In-pixel Recurrent Neural Networks for End-to-end-optimized Perception with Neural Sensors, CVPR 2024.

@inproceedings{so_pixelrnn,
author = {Haley So and Laurie Bose and Piotr Dudek and Gordon Wetzstein},
title = {PixelRNN: In-pixel Recurrent Neural Networks for End-to-end-optimized Perception with Neural Sensors},
booktitle = {CVPR},
year = {2024},
}

OVERVIEW AND RESULTS

The perception pipeline of PixelRNN can be broken down into an on-sensor encoder and a task-specific, off-sensor decoder. On the left is the camera equipped with a sensor-processor, which offers processing and memory at the pixel level. The captured light is processed directly by our spatio-temporal encoder on the sensor plane, compressing the readout bandwidth by a factor of 64. The PixelRNN architecture itself is shown on the right.
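As a rough illustration of this split, the sketch below pairs an on-sensor encoder stub with an off-sensor linear decoder in PyTorch; the feature-map sizes, class count, and readout interval shown here are illustrative assumptions, not the exact SCAMP-5 implementation.

# High-level sketch of the PixelRNN pipeline split (illustrative).
import torch
import torch.nn as nn

class OnSensorEncoder(nn.Module):
    """Stand-in for the in-pixel spatio-temporal encoder (detailed below)."""
    def forward(self, frame, hidden):
        return hidden  # placeholder recurrent update

encoder = OnSensorEncoder()            # runs on the sensor plane
decoder = nn.Linear(16 * 32 * 32, 10)  # off-sensor; 10 classes assumed

hidden = torch.zeros(1, 16, 32, 32)    # 16 binary feature maps (assumed size)
for t, frame in enumerate(torch.rand(32, 1, 1, 256, 256)):
    hidden = encoder(frame, hidden)
    if (t + 1) % 16 == 0:              # read out every 16 frames
        logits = decoder(hidden.flatten(1))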
We compare baselines, including a RAW camera and a difference camera (DIFF), as well as several RNN architectures, each with 1- and 2-layer CNN encoders and with binary or full 32-bit floating-point precision. PixelRNN offers the best performance for the lowest memory footprint, especially when used with binary weights. The dashed vertical line indicates the memory available on our hardware platform, SCAMP-5, showing that low-precision network architectures are the only feasible option in practice.
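To see why precision dominates the memory budget, consider a back-of-the-envelope comparison; the parameter count below is an illustrative assumption, not the actual model size.

# Weight storage: binary vs. 32-bit floating point (illustrative counts).
def weight_bytes(n_params: int, bits_per_weight: int) -> float:
    return n_params * bits_per_weight / 8

n = 16 * 3 * 3 + 16 * 16 * 3 * 3   # assumed: one input conv + recurrent kernels
print(weight_bytes(n, 32))         # float32: 9792.0 bytes
print(weight_bytes(n, 1))          # binary:   306.0 bytes, a 32x reduction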
Our experimental in-pixel processing platform, SCAMP-5, offers memory and simple compute in each pixel as well as communication between neighboring pixels. We prototype our PixelRNN architecture on this platform.
Experimental Results. We run the training sets through the SCAMP-5 implementation twice: the outputs of the first pass are used to fine-tune the off-sensor linear-layer decoder, so in theory the training-set accuracy of the second pass should closely match. In practice, however, noise accumulated through the analog compute makes the SCAMP-5 implementation non-deterministic. Adding Gaussian noise during training improves test-set performance, as in the sketch below.
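A minimal sketch of this noise injection, assuming additive Gaussian noise on intermediate activations; the standard deviation sigma = 0.1 is an assumed value, not the paper's setting.

# Training-time Gaussian noise to mimic the non-deterministic analog compute.
import torch

def noisy(x: torch.Tensor, sigma: float = 0.1, training: bool = True) -> torch.Tensor:
    # Perturb activations only during training; the real sensor
    # contributes its own physical noise at inference time.
    return x + sigma * torch.randn_like(x) if training else x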
This pipeline shows the sequence of operations from left to right. The input image is downsampled, duplicated, and binarized. Stored convolutional weights are then applied as 16 convolutions, producing 16 feature maps arranged in a 4-by-4 grid of processor elements. A ReLU activation is applied, followed by max-pooling, downsampling, and binarization. The result is fed either into another CNN layer or into the RNN. The RNN takes the CNN output and the previous hidden state and computes the new hidden state. The output is read out every 16 frames, yielding a 64x decrease in bandwidth.
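One per-frame encoder step might look like the following sketch; the kernel sizes, pooling factors, and the straight-through sign() binarization are assumptions standing in for the on-sensor operations described above.

# One encoder step of the pipeline, sketched in PyTorch (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

def binarize(x):
    # Forward: sign(); backward: straight-through estimator (assumed).
    return (torch.sign(x) - x).detach() + x

class PixelRNNStep(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, 3, padding=1, bias=False)  # 16 feature maps
        self.w_x = nn.Conv2d(16, 16, 3, padding=1, bias=False)  # input-to-hidden
        self.w_h = nn.Conv2d(16, 16, 3, padding=1, bias=False)  # hidden-to-hidden

    def forward(self, frame, hidden):
        x = binarize(F.avg_pool2d(frame, 4))       # downsample + binarize input
        x = F.relu(self.conv(x))                   # 16 convolutions -> ReLU
        x = binarize(F.max_pool2d(x, 2))           # max-pool, downsample, binarize
        return binarize(self.w_x(x) + self.w_h(hidden))  # new hidden state

step = PixelRNNStep()
hidden = torch.zeros(1, 16, 32, 32)
for frame in torch.rand(16, 1, 1, 256, 256):
    hidden = step(frame, hidden)
# hidden is read out once after the 16 frames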

ACKNOWLEDGEMENTS

This project was supported in part by Samsung and the National Science Foundation.

RELATED PROJECTS

You may also be interested in related projects on neural sensors, such as:

  • So et al. MantissaCam for Snapshot HDR Imaging. ICCP 2022
  • Nguyen et al. Learning Spatially Varying Pixel Exposures for Motion Deblurring. ICCP 2022
  • Vargas et al. Time-multiplexed Coded Apertures. ICCV 2021
  • Martel et al. Neural Sensors. ICCP 2020