83 / 2024-08-15 10:36:49
A Fine-Grained Semantic Alignment Method for Remote Sensing Text-Image Retrieval (AITC 2024 Abstract)
Remote Sensing, Text-Image Retrieval, Fine-Grained Alignment, Dataset
Abstract under review
Zhang Weihang / Aerospace Information Research Institute, Chinese Academy of Sciences
Chen Jialiang / Aerospace Information Research Institute, Chinese Academy of Sciences
Zhang Wenkai / Aerospace Information Research Institute, Chinese Academy of Sciences
Li Xinming / Aerospace Information University
Gao Xin / Aerospace Information Research Institute, Chinese Academy of Sciences
In recent years, with the rapid development of remote sensing (RS) technology, a vast amount of high-resolution RS imagery has emerged, which holds significant value for knowledge extraction. Cross-modal text-image retrieval methods have garnered increasing attention in RS image retrieval research. These methods involve both visual and language comprehension, enabling users to find images that best illustrate the topic of a text query or text descriptions that best explain the content of a visual query. Although some progress has been made in cross-modal text-image retrieval for RS, previous mainstream methods have struggled to obtain fine-grained semantically discriminative features. Furthermore, due to the repetitive and ambiguous nature of text descriptions in commonly used datasets, these methods may not be directly applicable to fine-grained retrieval of RS images and texts. To address this issue, we propose a fine-grained semantic alignment retrieval framework for RS text-image retrieval in this paper. First, the commonly used datasets RSICD and RSITMD contain coarse-grained and repetitive captions. To achieve fine-grained semantic alignment, we re-annotated RSICD and RSITMD at a fine-grained level. Figure 1 shows an example from the RSICD dataset. Unlike previous short-sentence annotation schemes, we recaption the geospatial elements in each RS image with five fine-grained descriptions. For the attributes of geospatial elements, we extended the annotations beyond categories to include information on state, color, quantity, etc. For the relationships between entities, we expanded the annotations to cover spatial and functional relationships. The recaptioned datasets FG-RSICD and FG-RSITMD contain 109K and 47K image-text pairs, respectively. The average caption length is 30.5 and 32.2, respectively, up from 11.5 and 11.3 previously. To ensure the accuracy and diversity of the dataset semantics, we employed large multimodal models for scoring and performed manual verification. Second, previous approaches are limited by data quality and model capability and remain suboptimal at fine-grained semantic alignment. Traditional methods primarily encode the entire image and use the resulting high-dimensional embedding vectors as its visual representation. Due to the lack of fine-grained visual representation, it is challenging to learn fine-grained visual-semantic correspondences. To learn fine-grained, semantically discriminative features from the recaptioned datasets, we supplement the global representation with instance-level feature information. Specifically, inspired by bottom-up attention, we segment RS images and extract regions of interest that contain essential targets. During training, we project the precomputed region-of-interest features into a common space and align them with the visual and semantic representations. Empirical results indicate that the re-annotated FG-RSICD and FG-RSITMD datasets capture fine-grained information and implicit knowledge within complex RS scenes. Compared to previous methods, our framework exhibits superior retrieval performance. In particular, on the RSICD dataset, our framework achieves a relative improvement of 8% in retrieval performance. The datasets and code will be open-sourced.
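To make the alignment step more concrete, below is a minimal PyTorch sketch of how precomputed region-of-interest features could be projected into a common embedding space together with global image and text features and aligned with a symmetric contrastive objective. The module names, feature dimensions, pooling choice, and InfoNCE-style loss are illustrative assumptions, not the framework's actual implementation.

# Minimal sketch (PyTorch) of projecting global, region, and text features into
# a shared space and aligning them contrastively. Dimensions, fusion by mean
# pooling, and the loss are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedAlignment(nn.Module):
    def __init__(self, img_dim=2048, roi_dim=2048, txt_dim=768, embed_dim=512, temperature=0.07):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # global image features
        self.roi_proj = nn.Linear(roi_dim, embed_dim)   # precomputed region-of-interest features
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # sentence features
        self.temperature = temperature

    def forward(self, img_feat, roi_feats, txt_feat):
        # img_feat: (B, img_dim), roi_feats: (B, R, roi_dim), txt_feat: (B, txt_dim)
        img = F.normalize(self.img_proj(img_feat), dim=-1)    # (B, D)
        roi = F.normalize(self.roi_proj(roi_feats), dim=-1)   # (B, R, D)
        txt = F.normalize(self.txt_proj(txt_feat), dim=-1)    # (B, D)

        # Fuse global and region-level visual cues (simple mean pooling here).
        vis = F.normalize(img + roi.mean(dim=1), dim=-1)      # (B, D)

        # Symmetric InfoNCE-style contrastive loss over the batch.
        logits = vis @ txt.t() / self.temperature              # (B, B)
        labels = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Usage with random tensors standing in for precomputed features:
model = FineGrainedAlignment()
loss = model(torch.randn(4, 2048), torch.randn(4, 36, 2048), torch.randn(4, 768))

In practice the region features would come from a bottom-up-attention-style detector run offline, and the fusion and loss could be replaced by whatever the final framework specifies.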
Important Dates
  • Conference Dates: September 20–22, 2024
  • August 30, 2024: Initial Draft Submission Deadline
  • September 22, 2024: Registration Deadline
Hosted by
Shandong Provincial People's Government
Chinese Institute of Electronics
Organized by
Academic Divisions of the Chinese Academy of Sciences
Aerospace Information Research Institute, Chinese Academy of Sciences
Fudan University