CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

CVPR 2026 Workshops (3DMV)

Institute of Photogrammetry and GeoInformation, Leibniz University Hannover

Panoramic visualization of the cylindrical projection of the RGB inputs. This figure is provided solely for illustration, showing how objects captured from different views are mapped to nearby locations in cylindrical coordinates.

Abstract

Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods produce depth estimates that are inconsistent across overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, an initial depth map is predicted per image, and the 3D points derived from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, where each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves the cross-image consistency of the depth estimates as well as the overall depth accuracy compared to state-of-the-art methods.
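
The pipeline described in the abstract can be sketched in a few lines of PyTorch. The sketch below is only an illustration of the idea, not the authors' implementation: the function names (backproject, cylinder_position_map, cylindrical_attention), the central projection onto the cylinder, and the Gaussian weighting with bandwidth sigma are assumptions introduced here; the exact cylinder mapping and distance-to-weight function used in the paper may differ.

import math
import torch

def backproject(depth, K_inv, cam_to_rig):
    # Back-project a depth map (H, W) to 3D points in the shared rig frame.
    # K_inv: inverse 3x3 intrinsic matrix; cam_to_rig: 4x4 camera-to-rig transform.
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3) homogeneous pixels
    rays = pix @ K_inv.T                                           # viewing rays in the camera frame
    pts_cam = rays * depth.unsqueeze(-1)                           # scale rays by the predicted depth
    pts_hom = torch.cat([pts_cam, torch.ones(H, W, 1)], dim=-1)
    return (pts_hom @ cam_to_rig.T)[..., :3]                       # (H, W, 3) points in the rig frame

def cylinder_position_map(pts):
    # Map 3D points to 2D coordinates on a unit cylinder around the rig's vertical axis,
    # here via central projection through the origin (an assumption about the exact mapping).
    x, y, z = pts[..., 0], pts[..., 1], pts[..., 2]
    r = torch.sqrt(x**2 + y**2).clamp(min=1e-6)
    azimuth = torch.atan2(y, x)   # angular position on the cylinder, in (-pi, pi]
    height = z / r                # height at which the viewing ray crosses the unit cylinder
    return torch.stack([azimuth, height], dim=-1)                  # (H, W, 2) position map

def cylindrical_attention(features, positions, sigma=0.05):
    # Explicit, non-learned attention: every pixel aggregates features from all pixels of
    # all views, weighted by a Gaussian of their distance on the cylinder (sigma is a
    # hypothetical bandwidth; the paper's weighting function may differ).
    N, H, W, C = features.shape
    f = features.reshape(-1, C)                                    # (N*H*W, C) tokens from all views
    p = positions.reshape(-1, 2)
    d_az = torch.remainder(p[:, None, 0] - p[None, :, 0] + math.pi, 2 * math.pi) - math.pi  # wrap-aware
    d_h = p[:, None, 1] - p[None, :, 1]
    dist2 = d_az**2 + d_h**2
    weights = torch.softmax(-dist2 / (2 * sigma**2), dim=-1)       # each row sums to 1
    return (weights @ f).reshape(N, H, W, C)

Note that this dense formulation builds an attention matrix that is quadratic in the total number of pixels, so it is only practical on strongly downsampled feature maps; a real implementation would presumably restrict the aggregation to a local neighborhood on the cylinder.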

Attention maps for a query token (indicated by the arrow in the back-left image), shown as overlays on the respective RGB images. The query token attends to itself, to nearby regions, and to the corresponding region in the spatially adjacent image.

Results

Exemplary 3D reconstructions, comparing our method to the state of the art on nuScenes. The figure shows the reconstruction of the overlap regions of the front-right, back-right, and back cameras.

Exemplary consistency error maps, computed using the metric described in Sec. 4.1 and overlaid on the front image, comparing our approach with state-of-the-art methods on DDAD. Our method maps overlapping regions from the two images to nearby 3D coordinates. In contrast, the other methods exhibit higher inconsistency in these regions (green bounding boxes). (a) shows the combined relevant regions from the front-right and front-left images that overlap with the front image. Errors are visualized using the inferno colormap, ranging from black (low error) to yellow (high error).

Comparison of depth maps predicted by our method and by state-of-the-art methods on DDAD. Our results show better-preserved details and well-defined object boundaries (green bounding boxes).

BibTeX

@article{abualhanud2025cylinderdepth,
  title={CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation},
  author={Abualhanud, Samer and Grannemann, Christian and Mehltretter, Max},
  journal={arXiv preprint arXiv:2511.16428},
  year={2025}
}