Details of Research Outputs

Status: Published
Title: Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering
Creator: Chen, Zailong; Wang, Lei; Wang, Peng; Gao, Peng
Date Issued: 2024-05-01
Source Publication: IEEE Transactions on Circuits and Systems for Video Technology
ISSN: 1051-8215
Volume: 34  Issue: 5  Pages: 4109-4119
Abstract

As a newly emerging task, audio-visual question answering (AVQA) has attracted research attention. Compared with traditional single-modality (e.g., audio or visual) QA tasks, it poses new challenges due to the higher complexity of feature extraction and fusion brought by the multimodal inputs. First, AVQA requires a more comprehensive understanding of the scene, involving both audio and visual information. Second, given the richer inputs, feature extraction must be more closely connected with the given question. Third, features from different modalities need to be sufficiently correlated and fused. To address these challenges, this work proposes a novel framework for the multimodal question answering task. It characterises an audio-visual scene at both global and local levels, and within each level, the features from different modalities are well fused. Furthermore, the given question is utilised to guide not only the feature extraction at the local level but also the final fusion of global and local features to predict the answer. Our framework provides a new perspective for audio-visual scene understanding by focusing on both general and specific representations and by aggregating multimodal information with priority given to question-related cues. As experimentally demonstrated, our method significantly improves existing audio-visual question answering performance, with average absolute gains of 3.3% and 3.1% on the MUSIC-AVQA and AVQA datasets, respectively. Moreover, the ablation study verifies the necessity and effectiveness of our design. Our code will be publicly released.

Keywords: Audio-visual question answering; deep learning; multimodal learning; video understanding
DOI: 10.1109/TCSVT.2023.3318220
Indexed By: SCIE
Language: English
WOS Research Area: Engineering
WOS Subject: Engineering, Electrical & Electronic
WOS ID: WOS:001221132000024
Scopus ID: 2-s2.0-85174833125
Citation statistics
Cited Times: 5 (WOS)
Document Type: Journal article
Identifier: http://repository.uic.edu.cn/handle/39GCC9TT/11787
Collection: Beijing Normal-Hong Kong Baptist University
Corresponding Author: Wang, Lei
Affiliation:
1. University of Wollongong, School of Computing and Information Technology, Wollongong, 2522, Australia
2. University of Electronic Science and Technology of China, School of Computer Science and Engineering, Chengdu, 610056, China
3. Institute of Computer Science, Beijing Normal-Hong Kong Baptist University United International College, Zhuhai, 519000, China
Recommended Citation
GB/T 7714: Chen, Zailong, Wang, Lei, Wang, Peng, et al. Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(5): 4109-4119.
APA: Chen, Zailong, Wang, Lei, Wang, Peng, & Gao, Peng. (2024). Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering. IEEE Transactions on Circuits and Systems for Video Technology, 34(5), 4109-4119.
MLA: Chen, Zailong, et al. "Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering." IEEE Transactions on Circuits and Systems for Video Technology 34.5 (2024): 4109-4119.
Files in This Item:
There are no files associated with this item.

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.