From 055792af34b3591111c5a71ff1a4ee940b7be0dc Mon Sep 17 00:00:00 2001
From: mkshing
Date: Mon, 16 May 2022 11:05:47 +0900
Subject: [PATCH] add image encoder's information

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d8e55ec..5ad56a7 100644
--- a/README.md
+++ b/README.md
@@ -62,7 +62,7 @@ print("Label probs:", text_probs) # prints: [[1.0, 0.0, 0.0]]
 ```

 # Model architecture
-The model was trained a ViT-B/16 Transformer architecture as an image encoder and uses a 12-layer RoBERTa as a text encoder. The text encoder was trained upon the Japanese pre-trained RoBERTa model [rinna/japanese-roberta-base](https://huggingface.co/rinna/japanese-roberta-base) with the same sentencepiece tokenizer.
+The model uses a ViT-B/16 Transformer architecture as its image encoder and a 12-layer RoBERTa as its text encoder. The image encoder was initialized from [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224), and the text encoder from the Japanese pre-trained RoBERTa model [rinna/japanese-roberta-base](https://huggingface.co/rinna/japanese-roberta-base), keeping the same sentencepiece tokenizer.

 # Training
 The model was trained on [CC12M](https://github.com/google-research-datasets/conceptual-12m) translated the captions to Japanese.
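
As a minimal sketch of the dual-encoder setup the patched paragraph describes (not rinna's actual training code): the two named checkpoints seed the image and text towers, which are then projected into a shared embedding space. The `DualEncoder` class name, the 512-dimensional projection, CLS-token pooling for the text side, and the use of `ViTModel`/`AutoModel` from `transformers` are all assumptions for illustration.

```python
# Hypothetical sketch of a CLIP-style dual encoder seeded from the two
# checkpoints named in the patch; projection size, pooling choices, and
# class structure are assumptions, not rinna's actual implementation.
import torch.nn as nn
from transformers import AutoModel, ViTModel


class DualEncoder(nn.Module):
    def __init__(self, embed_dim: int = 512):  # embed_dim is an assumption
        super().__init__()
        # Image encoder: ViT-B/16, initialized as stated in the patched line.
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")
        # Text encoder: 12-layer Japanese RoBERTa, same line.
        self.text_encoder = AutoModel.from_pretrained("rinna/japanese-roberta-base")
        # Linear projections into a shared embedding space.
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

    def forward(self, pixel_values, input_ids, attention_mask):
        # Pooled image representation from the ViT pooler head.
        img = self.image_encoder(pixel_values=pixel_values).pooler_output
        # CLS-token representation from the RoBERTa text encoder.
        txt = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        image_embeds = nn.functional.normalize(self.image_proj(img), dim=-1)
        text_embeds = nn.functional.normalize(self.text_proj(txt), dim=-1)
        # Cosine similarities; a softmax over a row of this matrix yields
        # the "Label probs" shown earlier in the README's usage snippet.
        return image_embeds @ text_embeds.t()
```

In actual CLIP-style training these similarities are typically scaled by a learnable temperature before the contrastive loss; that detail is omitted here for brevity.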