---
tags:
- visual-question-answering
license: apache-2.0
widget:
- text: "What's the animal doing?"
  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
- text: "What is on top of the building?"
  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg"
---

# Vision-and-Language Transformer (ViLT), fine-tuned on VQAv2

Vision-and-Language Transformer (ViLT) model fine-tuned on VQAv2. It was introduced in the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Kim et al. and first released in [this repository](https://github.com/dandelin/ViLT).

Disclaimer: The team releasing ViLT did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Intended uses & limitations

You can use the raw model for visual question answering.

### How to use

Here is how to use this model in PyTorch:

```python
from transformers import ViltProcessor, ViltForQuestionAnswering
import requests
from PIL import Image

# prepare image + question
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "How many cats are there?"

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# prepare inputs
encoding = processor(image, text, return_tensors="pt")

# forward pass: the model predicts a score for each answer in its label set
outputs = model(**encoding)
logits = outputs.logits
idx = logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])
```

## Training data

(to do)

## Training procedure

### Preprocessing

(to do)

### Pretraining

(to do)

## Evaluation results

(to do)

## BibTeX entry and citation info

```bibtex
@misc{kim2021vilt,
      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
      year={2021},
      eprint={2102.03334},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}
```