---
tags:
- visual-question-answering
license: apache-2.0
widget:
- text: "What's the animal doing?"
  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
- text: "What is on top of the building?"
  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg"
---

# Vision-and-Language Transformer (ViLT), fine-tuned on VQAv2

Vision-and-Language Transformer (ViLT) model fine-tuned on VQAv2. It was introduced in the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Kim et al. and first released in [this repository](https://github.com/dandelin/ViLT).

Disclaimer: The team releasing ViLT did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Intended uses & limitations

You can use the raw model for visual question answering.

### How to use

Here is how to use this model in PyTorch:

```python
from transformers import ViltProcessor, ViltForQuestionAnswering
import requests
from PIL import Image

# prepare image + question
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "How many cats are there?"

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# prepare inputs
encoding = processor(image, text, return_tensors="pt")

# forward pass: the model predicts a score for each answer in its label set
outputs = model(**encoding)
logits = outputs.logits
idx = logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])
```

## Training data

(to do)

## Training procedure

### Preprocessing

(to do)

### Pretraining

(to do)

## Evaluation results

(to do)

## BibTeX entry and citation info

```bibtex
@misc{kim2021vilt,
      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
      year={2021},
      eprint={2102.03334},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}
```