Remote sensing visual grounding (RSVG) holds great promise for remote sensing (RS) applications and has been studied extensively in recent years. Existing methods are mostly built on the transformer architecture, using multi-head self-attention in multimodal encoders to integrate visual and textual features. However, the repeated self-attention operations in these multilayer multimodal encoders have quadratic time complexity, incurring high computational cost and slow inference. Moreover, relying on multimodal encoders alone for feature fusion makes it difficult to represent small-scale targets within the feature tokens, leading to poor localization of small targets. To address these issues, we propose a fast and accurate end-to-end RSVG method (FAEM). FAEM replaces the multilayer multimodal encoders with a novel multi-scale multimodal feature fusion mechanism (MMFFM), enabling faster and more accurate multimodal inference. MMFFM identifies regions in the multi-scale visual feature maps that are highly relevant to the query text by broadcasting textual features across these feature maps. In addition, we refine the loss function commonly used in RSVG methods, addressing its negative values and overly simplistic constraints. Experimental results demonstrate the effectiveness of our approach: FAEM achieves the fastest inference speed while maintaining accuracy comparable to current state-of-the-art methods.
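To make the text-to-visual broadcasting idea concrete, the following is a minimal PyTorch sketch of one way a pooled textual feature could be broadcast over multi-scale visual feature maps and used to re-weight text-relevant regions. It is an illustrative assumption, not the paper's actual MMFFM; all module names, the cosine-similarity weighting, and the 1x1-convolution merge are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextBroadcastFusion(nn.Module):
    """Illustrative sketch: broadcast a sentence-level text feature over
    multi-scale visual feature maps and emphasize text-relevant regions.
    This is an assumed design, not the exact MMFFM from the paper."""

    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)                   # map text feature to visual channel dim
        self.fuse = nn.Conv2d(2 * vis_dim, vis_dim, kernel_size=1)    # 1x1 conv to merge the two modalities

    def forward(self, vis_feats, txt_feat):
        """
        vis_feats: list of [B, C, H_i, W_i] multi-scale visual feature maps
        txt_feat:  [B, D] pooled textual feature for the query sentence
        returns:   list of fused feature maps with the same shapes as vis_feats
        """
        t = self.txt_proj(txt_feat)                                    # [B, C]
        fused = []
        for v in vis_feats:
            B, C, H, W = v.shape
            # Broadcast the text vector to every spatial location of this scale.
            t_map = t.view(B, C, 1, 1).expand(B, C, H, W)              # [B, C, H, W]
            # Per-location relevance between text and visual features (cosine similarity).
            rel = F.cosine_similarity(v, t_map, dim=1, eps=1e-6).unsqueeze(1)  # [B, 1, H, W]
            # Emphasize regions relevant to the query, then merge both modalities.
            v_weighted = v * rel.sigmoid()
            fused.append(self.fuse(torch.cat([v_weighted, t_map], dim=1)))
        return fused

# Tiny usage example with random tensors (shapes chosen arbitrarily).
if __name__ == "__main__":
    model = TextBroadcastFusion(vis_dim=256, txt_dim=768)
    feats = [torch.randn(2, 256, s, s) for s in (80, 40, 20)]          # three feature scales
    text = torch.randn(2, 768)
    outs = model(feats, text)
    print([o.shape for o in outs])
```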