Research Output Details

Publication status: Published
Title: Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering
Authors: Chen, Zailong; Wang, Lei; Wang, Peng; Gao, Peng
Publication date: 2024-05-01
Journal: IEEE Transactions on Circuits and Systems for Video Technology
ISSN/eISSN: 1051-8215
Volume: 34, Issue: 5, Pages: 4109-4119
Abstract

As a newly emerging task, audio-visual question answering (AVQA) has attracted research attention. Compared with traditional single-modality (e.g., audio or visual) QA tasks, it poses new challenges due to the higher complexity of feature extraction and fusion brought by the multimodal inputs. First, AVQA requires a more comprehensive understanding of the scene, which involves both audio and visual information; second, in the presence of more information, feature extraction has to be better connected with a given question; third, features from different modalities need to be sufficiently correlated and fused. To address this situation, this work proposes a novel framework for the multimodal question answering task. It characterises an audiovisual scene at both global and local levels, and within each level, the features from different modalities are well fused. Furthermore, the given question is utilised to guide not only the feature extraction at the local level but also the final fusion of global and local features to predict the answer. Our framework provides a new perspective for audio-visual scene understanding by focusing on both general and specific representations and by aggregating multiple modalities while prioritizing question-related information. As experimentally demonstrated, our method significantly improves on existing audio-visual question answering performance, with average absolute gains of 3.3% and 3.1% on the MUSIC-AVQA and AVQA datasets, respectively. Moreover, the ablation study verifies the necessity and effectiveness of our design. Our code will be publicly released.
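The abstract's central idea, using the question both to select local features and to steer the final global-local fusion, can be illustrated with a toy sketch. This is not the authors' network: the paper's model is learned end to end, whereas the code below uses hand-set vectors, plain scaled dot-product attention, and a scalar sigmoid gate, all of which are assumptions chosen for illustration. The function names (`question_guided_pool`, `question_gated_fusion`) are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def question_guided_pool(question, local_feats):
    """Pool local features, weighting each by its similarity to the question.

    Assumption: scaled dot-product attention stands in for whatever
    question-guided extraction the paper actually uses.
    """
    scale = math.sqrt(len(question))
    weights = softmax([dot(question, f) / scale for f in local_feats])
    pooled = [
        sum(w * f[i] for w, f in zip(weights, local_feats))
        for i in range(len(question))
    ]
    return pooled, weights

def question_gated_fusion(question, global_feat, local_feat):
    """Blend global and local features with a question-dependent scalar gate.

    Assumption: the gate is a sigmoid of the difference between the
    question's similarity to the local and to the global feature.
    """
    diff = dot(question, local_feat) - dot(question, global_feat)
    gate = 1.0 / (1.0 + math.exp(-diff))
    fused = [gate * l + (1.0 - gate) * g for g, l in zip(global_feat, local_feat)]
    return fused, gate

# Toy inputs: a 2-D question vector and two local (e.g. per-segment) features.
question = [1.0, 0.0]
local_feats = [[2.0, 0.0], [0.0, 2.0]]
# Global feature: the unweighted mean over all local features.
global_feat = [sum(f[i] for f in local_feats) / len(local_feats) for i in range(2)]

local_feat, weights = question_guided_pool(question, local_feats)
fused, gate = question_gated_fusion(question, global_feat, local_feat)
```

In this toy setting the first local feature aligns with the question, so it receives the larger attention weight, and the gate leans towards the question-selected local feature rather than the question-agnostic global average.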

Keywords: Audio-visual question answering; deep learning; multimodal learning; video understanding
DOI: 10.1109/TCSVT.2023.3318220
Indexed in: SCIE
Language: English
WOS research area: Engineering
WOS category: Engineering, Electrical & Electronic
WOS accession number: WOS:001221132000024
Scopus accession number: 2-s2.0-85174833125
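Document type: Journal article
Item identifier: https://repository.uic.edu.cn/handle/39GCC9TT/11787
Collection: Beijing Normal University-Hong Kong Baptist University
Corresponding author: Wang, Lei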
Author affiliations
1. University of Wollongong, School of Computing and Information Technology, Wollongong, 2522, Australia
2. University of Electronic Science and Technology of China, School of Computer Science and Engineering, Chengdu, 610056, China
3. Institute of Computer Science, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai, 519000, China
Recommended citation
GB/T 7714
Chen, Zailong, Wang, Lei, Wang, Peng, et al. Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(5): 4109-4119.
APA Chen, Zailong, Wang, Lei, Wang, Peng, & Gao, Peng. (2024). Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering. IEEE Transactions on Circuits and Systems for Video Technology, 34(5), 4109-4119.
MLA Chen, Zailong, et al. "Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering." IEEE Transactions on Circuits and Systems for Video Technology 34.5 (2024): 4109-4119.