1. School of Computer Science and Technology, Xidian University, Xi'an 710071, China
2. Key Laboratory of Big Data and Visual Intelligence Key Technologies, Xidian University, Xi'an 710071, China
[ "高迪辉(1998—),男,西安电子科技大学硕士研究生,E-mail:[email protected]" ]
盛立杰(1976—),男,副教授,E-mail:[email protected]
[ "许小冬(1999—),男,西安电子科技大学硕士研究生,E-mail:[email protected]" ]
[ "苗启广(1972—),男,教授,E-mail:[email protected]" ]
Print publication date: 2024-08-20
Online publication date: 2024-03-13
Received: 2023-07-10
Dihui GAO, Lijie SHENG, Xiaodong XU, et al. Joint feature approach for image-text cross-modal retrieval[J]. Journal of Xidian University, 2024, 51(4): 128-138. DOI: 10.19665/j.issn1001-2400.20240302.
With the rapid development of deep learning, the performance of image-text cross-modal retrieval has improved significantly. However, existing methods match images and texts either as a whole using only global information or using only local information, so their use of image and text information is limited and retrieval performance leaves room for improvement. To fully exploit the latent semantic relationships between images and texts, this paper proposes a cross-modal retrieval model based on joint features. In the feature extraction part, two sub-networks process the local and global features of images and texts separately; during global-feature refinement, a bilinear layer based on the attention mechanism is designed to filter out redundant information and narrow the granularity gap with the local features. In the loss function part, a triplet ranking loss captures the relationships between the two modalities, and a semantic label classification loss is introduced to keep the global semantics consistent, so that the two kinds of features are optimized jointly. The proposed model is broadly applicable and can effectively improve the performance of models that rely only on local information. A series of experiments on the public Flickr30k and MS COCO datasets shows that the proposed model effectively improves cross-modal image-text retrieval: on Flickr30k it improves R@1 by about 5.1% for text retrieval and by about 2.8% for image retrieval.
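The abstract only outlines the attention-based bilinear layer that refines the global feature; a minimal sketch of one plausible reading is given below. The module name, the use of PyTorch's nn.Bilinear, the mean-pooled local context as the second input, and the sigmoid channel gate are all illustrative assumptions, not the authors' published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBilinearFilter(nn.Module):
    """Illustrative sketch (hypothetical): gate a global feature with an
    attention signal derived from a bilinear interaction with the local
    features, suppressing redundant channels so the global embedding is
    closer in granularity to the local (region/word) features."""

    def __init__(self, dim: int):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, dim)  # bilinear fusion of global feature and local context
        self.attn = nn.Linear(dim, dim)             # maps the fused vector to per-channel attention scores

    def forward(self, global_feat: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        # global_feat: (B, D); local_feats: (B, N, D) region or word features
        ctx = local_feats.mean(dim=1)            # assumption: summarize local features by mean pooling
        fused = self.bilinear(global_feat, ctx)  # bilinear interaction between the two views
        gate = torch.sigmoid(self.attn(fused))   # per-channel gate in (0, 1)
        return F.normalize(gate * global_feat, dim=-1)  # filtered, unit-norm global embedding
```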
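The joint objective can likewise be sketched: a hinge-based triplet ranking loss over in-batch pairs, here with hardest-negative mining in the style of VSE++, plus a semantic label classification loss applied to both modalities. The multi-label BCE formulation and the weighting factor alpha are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(img: torch.Tensor, txt: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Hinge-based triplet ranking loss with hardest in-batch negatives.
    img and txt are L2-normalized embeddings of shape (B, D); the matched
    pairs sit on the diagonal of the score matrix."""
    scores = img @ txt.t()                               # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)                      # positive-pair scores
    cost_txt = (margin + scores - pos).clamp(min=0)      # image anchor, text negatives
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # text anchor, image negatives
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)             # ignore the positives themselves
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_txt.max(dim=1).values.mean() + cost_img.max(dim=0).values.mean()

def joint_loss(img, txt, img_logits, txt_logits, labels, alpha: float = 1.0) -> torch.Tensor:
    """Joint objective: the ranking loss ties the modalities together, while
    a shared classification loss (assumed multi-label BCE) keeps the global
    semantics of both embeddings consistent with the label vector."""
    rank = triplet_ranking_loss(img, txt)
    cls = (F.binary_cross_entropy_with_logits(img_logits, labels)
           + F.binary_cross_entropy_with_logits(txt_logits, labels))
    return rank + alpha * cls  # alpha is an assumed balancing weight
```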
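For context, R@1 is the fraction of queries whose ground-truth match ranks first among all candidates. A minimal sketch, assuming one ground-truth match per query on the diagonal (Flickr30k actually pairs each image with five captions, which standard evaluation scripts account for):

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int = 1) -> float:
    """Recall@K: sim[i, j] scores query i against gallery item j, with the
    ground-truth match of query i assumed at index i (one match per query)."""
    ranks = sim.argsort(dim=1, descending=True)  # best-first candidate order
    gt = torch.arange(sim.size(0), device=sim.device).unsqueeze(1)
    hits = (ranks[:, :k] == gt).any(dim=1)       # ground truth within the top K?
    return hits.float().mean().item()
```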
Keywords: cross-modal retrieval; deep learning; self-attention network; image retrieval