# RemoteCLIP
**Repository Path**: YJ-He/RemoteCLIP
## Basic Information
- **Project Name**: RemoteCLIP
- **Description**: 🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-20
- **Last Updated**: 2026-01-27
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
### News
- **2024/04/26**: The training dataset of RemoteCLIP (RET-3, SEG-4, DET-10) is released on 🤗HuggingFace, see [[gzqy1026/RemoteCLIP](https://huggingface.co/datasets/gzqy1026/RemoteCLIP)].
- **2024/04/03**: Our RemoteCLIP paper has been accepted by IEEE Transactions on Geoscience and Remote Sensing (TGRS) [[doi](https://ieeexplore.ieee.org/document/10504785)].
- **2024/03/01**: RemoteCLIP joined the leaderboards on [paperswithcode.com](https://paperswithcode.com/paper/remoteclip-a-vision-language-foundation-model): [cross-modal retrieval on RSICD](https://paperswithcode.com/sota/cross-modal-retrieval-on-rsicd?p=remoteclip-a-vision-language-foundation-model) and [cross-modal retrieval on RSITMD](https://paperswithcode.com/sota/cross-modal-retrieval-on-rsitmd?p=remoteclip-a-vision-language-foundation-model).
- **2023/12/01**: You can now auto-label remote sensing datasets with RemoteCLIP using the [`autodistill-remote-clip`](https://github.com/autodistill/autodistill-remote-clip) extension in the [Autodistill](https://github.com/autodistill/autodistill) framework. Thanks to [James Gallagher](https://jamesg.blog/) from Roboflow!
- **2023/11/07**: To facilitate reproducing RemoteCLIP's SOTA image-text retrieval results, we have prepared a `retrieval.py` script for retrieval evaluation on RSITMD, RSICD, and UCM datasets. Please see the [Retrieval Evaluation](#retrieval-evaluation) section for details.
- **2023/07/27**: We have made pretrained checkpoints of RemoteCLIP models (`ResNet-50`, `ViT-base-32`, and `ViT-large-14`) available! We converted the weights to the [`OpenCLIP`](https://github.com/mlfoundations/open_clip) format, so loading and using RemoteCLIP is extremely easy. Please see the [Load RemoteCLIP](#load-remoteclip) section for details. We also provide a Jupyter Notebook [demo.ipynb](demo.ipynb), and you can [run the demo on Colab](https://colab.research.google.com/github/ChenDelong1999/RemoteCLIP/blob/main/RemoteCLIP_colab_demo.ipynb). Thanks to [Dr. Gordon McDonald](https://github.com/gdmcdonald) from the University of Sydney!
- **2023/06/19**: We propose RemoteCLIP, the first vision-language foundation model for remote sensing. The preprint of our RemoteCLIP paper is available on arXiv [[2306.11029]](https://arxiv.org/abs/2306.11029).
### Introduction
Welcome to the official repository of our paper "[*RemoteCLIP: A Vision Language Foundation Model for Remote Sensing*](https://arxiv.org/abs/2306.11029)"!
General-purpose foundation models have become increasingly important in the field of artificial intelligence. While self-supervised learning (SSL) and Masked Image Modeling (MIM) have led to promising results in building such foundation models for remote sensing, these models primarily learn low-level features, require annotated data for fine-tuning, and are not applicable for retrieval and zero-shot applications due to the lack of language understanding.
**In response to these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing that aims to learn robust visual features with rich semantics, as well as aligned text embeddings for seamless downstream application.** To address the scarcity of pre-training data, we leverage data scaling: we convert heterogeneous annotations via Box-to-Caption (B2C) generation and Mask-to-Box (M2B) conversion, and further incorporate UAV imagery, resulting in a 12× larger pretraining dataset.

RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, k-NN classification, few-shot classification, image-text retrieval, and object counting. Evaluations on 16 datasets, including a newly introduced RemoteCount benchmark to test the object counting ability, show that RemoteCLIP consistently outperforms baseline foundation models across different model scales.
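Zero-shot classification with a CLIP-style model reduces to retrieval: encode one text prompt per class, then predict the class whose text embedding is most similar to the image embedding. The sketch below shows only this ranking logic with stand-in embeddings; the class names and prompt template are illustrative, and in practice the features would come from `model.encode_image` / `model.encode_text` as shown in the demo code further down.

```python
import numpy as np

# Hypothetical class names and prompt template for a remote sensing dataset;
# in practice these prompts would be tokenized and encoded by the model.
class_names = ["airport", "beach", "forest", "stadium"]
prompts = [f"an aerial photograph of a {name}." for name in class_names]

def zero_shot_classify(image_feat: np.ndarray, class_text_feats: np.ndarray) -> int:
    """Return the index of the class whose text embedding has the highest
    cosine similarity with the image embedding (features L2-normalized)."""
    image_feat = image_feat / np.linalg.norm(image_feat)
    class_text_feats = class_text_feats / np.linalg.norm(class_text_feats, axis=1, keepdims=True)
    return int(np.argmax(class_text_feats @ image_feat))

# Toy demonstration with stand-in embeddings: the image feature is a slightly
# noisy copy of the "stadium" text feature, so class index 3 is predicted.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(len(class_names), 8))
image_feat = text_feats[3] + 0.1 * rng.normal(size=8)
print(class_names[zero_shot_classify(image_feat, text_feats)])  # → stadium
```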

**Impressively, RemoteCLIP outperforms the previous SoTA by 9.14% mean recall on the RSITMD dataset and by 8.92% on the RSICD dataset. For zero-shot classification, our RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets.**
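Mean recall in retrieval benchmarks like RSITMD and RSICD is typically the average of Recall@1/5/10 over both retrieval directions (image-to-text and text-to-image). A minimal sketch of that metric, assuming pre-computed features and that matching image-text pairs share the same index:

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries (rows) whose ground-truth match (same index)
    appears among the top-k ranked columns."""
    ranks = np.argsort(-similarity, axis=1)  # best match first
    hits = (ranks[:, :k] == np.arange(len(similarity))[:, None]).any(axis=1)
    return float(hits.mean())

def mean_recall(image_feats: np.ndarray, text_feats: np.ndarray) -> float:
    """Average of R@1, R@5, R@10 over image-to-text and text-to-image."""
    sim = image_feats @ text_feats.T  # similarity matrix; sim.T is the reverse direction
    scores = [recall_at_k(s, k) for s in (sim, sim.T) for k in (1, 5, 10)]
    return float(np.mean(scores))

# Sanity check: perfectly aligned features give a mean recall of 1.0
feats = np.eye(16)
assert mean_recall(feats, feats) == 1.0
```

For the exact evaluation protocol used in the paper, see the `retrieval.py` script mentioned in the News section.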

### Load RemoteCLIP
RemoteCLIP is trained with the [`ITRA`](https://itra.readthedocs.io) codebase. We have converted the pretrained checkpoints to an [`OpenCLIP`](https://github.com/mlfoundations/open_clip)-compatible format and uploaded them to [[this Huggingface Repo]](https://huggingface.co/chendelong/RemoteCLIP/tree/main), making the model much more convenient to access.
- To load RemoteCLIP, first prepare an environment with [OpenCLIP](https://github.com/mlfoundations/open_clip) installed, for example by running this command:
```bash
# https://pypi.org/project/open-clip-torch/
pip install open-clip-torch
```
- Then, download the pretrained checkpoint from [Hugging Face](https://huggingface.co/chendelong/RemoteCLIP). You can either clone the repo with Git LFS or download it automatically via [huggingface_hub](https://github.com/huggingface/huggingface_hub):
```python
from huggingface_hub import hf_hub_download
for model_name in ['RN50', 'ViT-B-32', 'ViT-L-14']:
    checkpoint_path = hf_hub_download("chendelong/RemoteCLIP", f"RemoteCLIP-{model_name}.pt", cache_dir='checkpoints')
    print(f'{model_name} is downloaded to {checkpoint_path}.')
```
- Now, you can initialize a CLIP model with `OpenCLIP`, then load the RemoteCLIP checkpoint with a few lines of code:
```python
import torch, open_clip
from PIL import Image

model_name = 'ViT-L-14'  # 'RN50', 'ViT-B-32', or 'ViT-L-14'
model, _, preprocess = open_clip.create_model_and_transforms(model_name)
tokenizer = open_clip.get_tokenizer(model_name)

# Load the downloaded RemoteCLIP weights into the OpenCLIP model
ckpt = torch.load(f"path/to/your/checkpoints/RemoteCLIP-{model_name}.pt", map_location="cpu")
message = model.load_state_dict(ckpt)
print(message)
model = model.cuda().eval()
```
- The following is an example of text-to-image retrieval with RemoteCLIP:
```python
text_queries = [
    "A busy airport with many airplanes.",
    "Satellite view of Hohai University.",
    "A building next to a lake.",
    "Many people in a stadium.",
    "a cute cat",
]
text = tokenizer(text_queries)
image = preprocess(Image.open("assets/airport.jpg")).unsqueeze(0)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image.cuda())
    text_features = model.encode_text(text.cuda())
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).cpu().numpy()[0]

print(f'Predictions of {model_name}:')
for query, prob in zip(text_queries, text_probs):
    print(f"{query:<40} {prob * 100:5.1f}%")
```
With `model_name = 'RN50'`, you could get the following outputs:
```
Predictions of RN50:
A busy airport with many airplanes. 100.0%
Satellite view of Hohai University. 0.0%
A building next to a lake. 0.0%
Many people in a stadium. 0.0%
a cute cat 0.0%
```