
Depth-Aware Audio Visual Segmentation with Geometry-Heuristic Cross Attention

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

State-of-the-art Audio Visual Segmentation (AVS) methods have achieved strong results in pixel-level segmentation of sounding objects. However, they suffer from misalignment and performance degradation in complex settings, such as poor lighting and cluttered environments. To address these challenges, we introduce Depth-aware Audio Visual Segmentation (Depth AVS), which enhances current transformer-based AVS. This paper makes three main contributions. First, we propose Geometry-Heuristic Cross Attention (GHCA), a novel method that suppresses irrelevant features distant from the main object of interest, making audio-visual cross-attention fusion more robust in cluttered and poorly lit conditions. Second, we propose an Intermediate Fusion module that integrates depth and RGB features to enrich the model’s learning with non-redundant visual features. Third, we propose a depth-aware segmentor that outputs not only a binary mask but also a segmented depth mask. We evaluate on the S4 AVSBench-Object dataset against AVSegFormer, the current state of the art in AVS. Our experiments show that Depth AVS outperforms this baseline while using only small input sizes. Depth AVS also extends AVS with distance estimation at a small error rate.
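
The abstract does not give the GHCA formulation, so the following is only a minimal sketch of the general idea it describes: audio queries attend to visual tokens, and tokens whose depth is far from the object of interest are down-weighted. The function name `depth_biased_cross_attention`, the `ref_depth` reference value, and the linear `penalty` term are all hypothetical choices made for illustration, not the authors' method.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def depth_biased_cross_attention(audio_q, visual_k, visual_v, depth,
                                 ref_depth, penalty=1.0):
    """Cross attention with an additive depth-distance bias (illustrative).

    audio_q:   list of query vectors (one per audio token)
    visual_k:  list of key vectors (one per visual token)
    visual_v:  list of value vectors (one per visual token)
    depth:     one depth value per visual token
    ref_depth: assumed depth of the object of interest (hypothetical input)
    penalty:   assumed strength of the depth bias; 0 gives plain attention
    """
    scale = 1.0 / math.sqrt(len(audio_q[0]))
    outputs, weights = [], []
    for q in audio_q:
        # Scaled dot-product logit minus a penalty that grows with the
        # token's depth distance from the reference depth.
        logits = [
            scale * sum(qi * ki for qi, ki in zip(q, k))
            - penalty * abs(z - ref_depth)
            for k, z in zip(visual_k, depth)
        ]
        w = softmax(logits)
        # Weighted sum of value vectors, one output channel at a time.
        out = [
            sum(wj * v[c] for wj, v in zip(w, visual_v))
            for c in range(len(visual_v[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights
```

With `penalty=0.0` the function reduces to ordinary scaled dot-product cross attention, which makes the effect of the depth term easy to isolate when comparing attention weights.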

Original language: English
Title of host publication: AI 2025
Subtitle of host publication: Advances in Artificial Intelligence - 38th Australasian Joint Conference on Artificial Intelligence, AI 2025, Proceedings
Editors: Miaomiao Liu, Xin Yu, Chang Xu, Yiliao Song
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 187-199
Number of pages: 13
ISBN (Print): 9789819549719
DOIs
State: Published - 2026
Event: 38th Australasian Joint Conference on Artificial Intelligence, AI 2025 - Canberra, Australia
Duration: 1 Dec 2025 to 5 Dec 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 16371 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 38th Australasian Joint Conference on Artificial Intelligence, AI 2025
Country/Territory: Australia
City: Canberra
Period: 1/12/25 to 5/12/25

Keywords

  • Audio Visual Segmentation
  • Depth-aware Attention
  • Geometry Heuristic Cross Attention
  • Multimodal Transformer
  • Sounding Object Distance Estimation
