---
license: apache-2.0
---
# Chinese-CLIP-Base
## Introduction
This is the base version of Chinese CLIP. Chinese CLIP is a simple implementation of CLIP on a large-scale dataset of around 200 million Chinese image-text pairs. For more details, please refer to our technical report https://arxiv.org/abs/2211.01335 and our official GitHub repo https://github.com/OFA-Sys/Chinese-CLIP.
## Use with the official API
We provide a simple code snippet to show how to use the API for Chinese-CLIP. For starters, please install cn_clip:
```bash
# to install the latest stable release
pip install cn_clip
# or install from source code
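# (if needed, clone the repo first: git clone https://github.com/OFA-Sys/Chinese-CLIP.git)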
cd Chinese-CLIP
pip install -e .
```
After installation, use Chinese CLIP as shown below:
```python
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name, available_models
print("Available models:", available_models())
# Available models: ['ViT-B-16', 'ViT-L-14', 'ViT-L-14-336', 'ViT-H-14', 'RN50']
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
model.eval()
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize the features. Please use the normalized features for downstream tasks.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    logits_per_image, logits_per_text = model.get_similarity(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # [[1.268734e-03 5.436878e-02 6.795761e-04 9.436829e-01]]
```
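As the comment in the snippet says, the normalized features are what you should feed to downstream tasks. Below is a minimal sketch (not part of the official walkthrough) of ranking the candidate texts by cosine similarity using the `image_features` and `text_features` computed above; the `labels` list simply repeats the captions so the output is readable:
```python
# Sketch only: rank the candidate texts for the image by cosine similarity of the
# normalized features computed above. `labels` is assumed to match the order of
# the tokenized texts.
labels = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
cosine_sim = (image_features @ text_features.T).squeeze(0)  # shape: [num_texts]
for idx in cosine_sim.argsort(descending=True).tolist():
    print(f"{labels[idx]}: {cosine_sim[idx].item():.4f}")
```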
If you want to go beyond the API, please check our GitHub repo https://github.com/OFA-Sys/Chinese-CLIP for more details on training and inference.
<br><br>
## Results
### MUGE Text-to-Image Retrieval
<table border="1" width="100%">
    <tr align="center">
        <th>Setup</th><th colspan="4">Zero-shot</th><th colspan="4">Finetune</th>
    </tr>
    <tr align="center">
        <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>MR</td><td>R@1</td><td>R@5</td><td>R@10</td><td>MR</td>
    </tr>
    <tr align="center">
        <td>Wukong<sub>ViT-B</sub></td><td>33.4</td><td>59.3</td><td>69.7</td><td>54.1</td><td>39.2</td><td>66.9</td><td>77.4</td><td>61.2</td>
    </tr>
    <tr align="center">
        <td>R2D2<sub>ViT-B</sub></td><td>-</td><td>-</td><td>-</td><td>-</td><td>47.4</td><td>75.1</td><td>83.5</td><td>68.7</td>
    </tr>
    <tr align="center">
        <td>CN-CLIP<sub>ViT-B</sub></td><td><b>52.1</b></td><td><b>76.7</b></td><td><b>84.4</b></td><td><b>71.1</b></td><td><b>58.4</b></td><td><b>83.6</b></td><td><b>90.0</b></td><td><b>77.4</b></td>
    </tr>
</table>
### Flickr30K-CN Retrieval
<table border="1" width="100%">
    <tr align="center">
        <th>Task</th><th colspan="6">Text-to-Image</th><th colspan="6">Image-to-Text</th>
    </tr>
    <tr align="center">
        <th>Setup</th><th colspan="3">Zero-shot</th><th colspan="3">Finetune</th><th colspan="3">Zero-shot</th><th colspan="3">Finetune</th>
    </tr>
    <tr align="center">
        <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
    </tr>
    <tr align="center">
        <td>Wukong<sub>ViT-B</sub></td><td>45.7</td><td>73.8</td><td>82.2</td><td>67.6</td><td>89.6</td><td>94.2</td><td>66.2</td><td>88.7</td><td>94.3</td><td>83.9</td><td>97.6</td><td>99.0</td>
    </tr>
    <tr align="center">
        <td>R2D2<sub>ViT-B</sub></td><td>-</td><td>-</td><td>-</td><td>78.3</td><td>94.6</td><td>97.0</td><td>-</td><td>-</td><td>-</td><td>92.6</td><td><b>99.1</b></td><td><b>99.8</b></td>
    </tr>
    <tr align="center">
        <td>CN-CLIP<sub>ViT-B</sub></td><td><b>62.7</b></td><td><b>86.9</b></td><td><b>92.8</b></td><td><b>79.1</b></td><td><b>94.8</b></td><td><b>97.4</b></td><td><b>74.6</b></td><td><b>93.5</b></td><td><b>97.1</b></td><td><b>93.5</b></td><td>99.0</td><td>99.5</td>
    </tr>
</table>
### COCO-CN Retrieval
<table border="1" width="100%">
    <tr align="center">
        <th>Task</th><th colspan="6">Text-to-Image</th><th colspan="6">Image-to-Text</th>
    </tr>
    <tr align="center">
        <th>Setup</th><th colspan="3">Zero-shot</th><th colspan="3">Finetune</th><th colspan="3">Zero-shot</th><th colspan="3">Finetune</th>
    </tr>
    <tr align="center">
        <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
    </tr>
    <tr align="center">
        <td>Wukong<sub>ViT-B</sub></td><td>49.2</td><td>79.4</td><td>87.9</td><td>67.0</td><td>91.4</td><td>96.7</td><td>48.3</td><td>77.8</td><td>88.8</td><td>65.8</td><td>90.3</td><td>96.6</td>
    </tr>
    <tr align="center">
        <td>R2D2<sub>ViT-B</sub></td><td>-</td><td>-</td><td>-</td><td>75.1</td><td>94.2</td><td>98.1</td><td>-</td><td>-</td><td>-</td><td>76.1</td><td>95.3</td><td>98.5</td>
    </tr>
    <tr align="center">
        <td>CN-CLIP<sub>ViT-B</sub></td><td><b>62.2</b></td><td><b>86.6</b></td><td><b>94.9</b></td><td><b>77.0</b></td><td><b>97.1</b></td><td><b>99.0</b></td><td><b>57.0</b></td><td><b>84.1</b></td><td><b>93.6</b></td><td><b>77.4</b></td><td><b>96.2</b></td><td><b>98.9</b></td>
    </tr>
</table>
<br>
## Citation
If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!
```
@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}
```
<br>