This repository provides a Japanese [CLIP (Contrastive Language-Image Pre-Training)](https://arxiv.org/abs/2103.00020) model. The model was trained by [rinna Co., Ltd.](https://corp.rinna.co.jp/).
The model uses a ViT-B/16 Transformer architecture as its image encoder and a 12-layer RoBERTa as its text encoder. The text encoder was initialized from the pre-trained Japanese RoBERTa model [rinna/japanese-roberta-base](https://huggingface.co/rinna/japanese-roberta-base) and uses the same SentencePiece tokenizer.
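For reference, below is a minimal zero-shot classification sketch. It assumes the companion [japanese-clip](https://github.com/rinnakk/japanese-clip) package and the model identifier `rinna/japanese-clip-vit-b-16`; neither is stated in this section, so verify the exact names and API against that repository before use.

```python
import torch
from PIL import Image
# Assumed companion package: pip install git+https://github.com/rinnakk/japanese-clip.git
import japanese_clip as ja_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed model identifier and loading helpers; check the japanese-clip README.
model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", device=device)
tokenizer = ja_clip.load_tokenizer()

# Preprocess a local image and tokenize candidate Japanese captions.
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)
encodings = ja_clip.tokenize(
    texts=["犬", "猫", "象"],  # "dog", "cat", "elephant"
    max_seq_len=77,
    device=device,
    tokenizer=tokenizer,
)

with torch.no_grad():
    # Embed both modalities into the shared CLIP space.
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**encodings)
    # Zero-shot classification: softmax over scaled image-text similarities.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # the "犬" (dog) caption should score highest
```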
# Training
The model was trained on [CC12M](https://github.com/google-research-datasets/conceptual-12m), with the captions translated into Japanese.