TY - GEN
T1 - Depth-Aware Audio Visual Segmentation with Geometry-Heuristic Cross Attention
AU - Afrisal, Hadha
AU - Abpeikar, Shadi
AU - Cruz, Francisco
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2026.
PY - 2026
Y1 - 2026
N2 - Current state-of-the-art Audio Visual Segmentation (AVS) methods have achieved notable milestones in pixel-level segmentation of sounding objects. However, they suffer from misalignment and performance degradation in complex settings, such as poor lighting conditions and cluttered environments. To address these challenges, we introduce Depth-aware Audio Visual Segmentation (Depth AVS) to enhance the capability of current transformer-based AVS. This paper makes three main contributions. First, we propose Geometry-Heuristic Cross Attention (GHCA), a novel method that suppresses irrelevant features distant from the main object of interest, thereby improving the robustness of audio-visual cross-attention fusion in cluttered and poorly lit conditions. Second, we present an Intermediate Fusion module that integrates depth and RGB features to enrich the model’s learning with non-redundant visual features. Third, we introduce a depth-aware segmentor that outputs not only a binary mask but also a segmented depth mask. We evaluated Depth AVS on the S4 AVSBench-Object dataset against AVSegFormer, the current state of the art in AVS. Our experiments show that Depth AVS surpasses the AVS baseline while using only small input sizes. Depth AVS also extends AVS with distance estimation at a small error rate.
KW - Audio Visual Segmentation
KW - Depth-aware Attention
KW - Geometry Heuristic Cross Attention
KW - Multimodal Transformer
KW - Sounding Object Distance Estimation
UR - https://www.scopus.com/pages/publications/105023826491
U2 - 10.1007/978-981-95-4972-6_15
DO - 10.1007/978-981-95-4972-6_15
M3 - Conference contribution
AN - SCOPUS:105023826491
SN - 9789819549719
T3 - Lecture Notes in Computer Science
SP - 187
EP - 199
BT - AI 2025
A2 - Liu, Miaomiao
A2 - Yu, Xin
A2 - Xu, Chang
A2 - Song, Yiliao
PB - Springer Science and Business Media Deutschland GmbH
T2 - 38th Australasian Joint Conference on Artificial Intelligence, AI 2025
Y2 - 1 December 2025 through 5 December 2025
ER -