In recent years, the rapid development of remote sensing (RS) technology has produced a vast amount of high-resolution RS imagery, from which valuable knowledge can be extracted. Cross-modal text-image retrieval methods have garnered increasing attention in RS image retrieval research. These methods involve both visual and language understanding, enabling users to find images that best illustrate the topic of a text query or text descriptions that best explain the content of a visual query. Although some progress has been made in cross-modal text-image retrieval for RS, previous mainstream methods struggle to obtain fine-grained, semantically discriminative features. Furthermore, because the text descriptions in commonly used datasets are repetitive and ambiguous, these methods may not be directly applicable to fine-grained retrieval of RS images and texts. To address these issues, we propose a fine-grained semantic alignment retrieval framework for RS text-image retrieval in this paper.

Firstly, the previously used datasets RSICD and RSITMD contain coarse-grained and repetitive captions. To enable fine-grained semantic alignment, we conducted fine-grained re-annotation of RSICD and RSITMD. Figure 1 shows an example from the RSICD dataset. Unlike previous short-sentence annotations, we recaption the geospatial elements in each RS image with five fine-grained annotations. For the attributes of geospatial elements, we extend beyond categories to include state, color, quantity, and other information. For the relationships between entities, we expand the annotations to cover spatial and functional relationships. The recaptioned datasets FG-RSICD and FG-RSITMD contain 109K and 47K image-text pairs, respectively, and their average caption lengths are 30.5 and 32.2, up from 11.5 and 11.3 in the original datasets. To ensure the semantic accuracy and diversity of the datasets, we employed multimodal large models for scoring and conducted manual verification.

Moreover, previous approaches are limited by data quality and model capability and thus perform suboptimally on fine-grained semantic alignment. Traditional methods primarily encode the entire image and use the resulting high-dimensional embedding vector as its visual representation. Owing to this lack of fine-grained visual representation, it is difficult to learn fine-grained visual-semantic correspondences. To learn fine-grained, semantically discriminative features from the recaptioned datasets, we incorporate instance-level feature information as a supplement. Specifically, inspired by bottom-up attention, we segment RS images and extract regions of interest that contain essential targets. During training, we project the precomputed region-of-interest features into a common space and align them with the visual and semantic representations.

Empirical results indicate that the re-annotated datasets FG-RSICD and FG-RSITMD capture fine-grained information and implicit knowledge within complex RS scenes. Compared to previous methods, our framework exhibits superior retrieval performance; on the RSICD dataset in particular, it achieves a relative improvement of 8%. The datasets and code will be open-sourced.
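To make the region-alignment step more concrete, the following minimal PyTorch sketch illustrates one way precomputed region-of-interest features (e.g., from a bottom-up-attention detector) could be projected into a common embedding space and aligned with caption embeddings. The module names, feature dimensions, pooling choice, and the InfoNCE-style contrastive loss are illustrative assumptions, not the exact components of our framework.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionProjection(nn.Module):
    """Project precomputed ROI features into the shared image-text space.

    roi_dim and embed_dim are placeholder values for illustration.
    """

    def __init__(self, roi_dim: int = 2048, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(roi_dim, embed_dim)

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (batch, num_regions, roi_dim)
        x = F.normalize(self.proj(roi_feats), dim=-1)
        # Aggregate regions into one instance-level visual vector;
        # mean pooling is one simple choice among several.
        return x.mean(dim=1)  # (batch, embed_dim)


def alignment_loss(region_emb: torch.Tensor,
                   text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss aligning region-level visual embeddings
    with caption embeddings in the common space (a stand-in for the
    alignment objective used during training)."""
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = region_emb @ text_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this sketch, matched image-text pairs within a batch serve as positives and all other pairings as negatives, which is a common way to pull region-level visual representations and fine-grained captions together in a shared space.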