dandelin/vilt-b32-finetuned-vqa is a model repository hosted on the Hugging Face Hub. License: apache-2.0

README.md

license: apache-2.0

Vision-and-Language Transformer (ViLT), fine-tuned on VQAv2

Vision-and-Language Transformer (ViLT) model fine-tuned on VQAv2. It was introduced in the paper ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Kim et al. and first released in this repository.

Disclaimer: The team releasing ViLT did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

(to do)

Intended uses & limitations

You can use the raw model for visual question answering.
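For quick experimentation, the model can also be loaded through the transformers visual-question-answering pipeline. The snippet below is a minimal sketch, assuming a recent version of transformers; the image path is a placeholder you would replace with your own file.

```python
from transformers import pipeline

# Load a visual question answering pipeline backed by this checkpoint.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

# Ask a question about an image (the path here is a placeholder).
predictions = vqa(image="path/to/image.jpg", question="How many cats are there?")
print(predictions)  # list of {"score": ..., "answer": ...} dicts, highest score first
```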

How to use

(to do)
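In the meantime, the snippet below is a minimal sketch of querying the model directly with ViltProcessor and ViltForQuestionAnswering from the transformers library. The image URL points to a COCO picture commonly used in transformers examples and is only illustrative.

```python
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Prepare an example image and a question about it.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "How many cats are there?"

# Load the processor and model from this checkpoint.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Encode the (image, question) pair and run a forward pass.
encoding = processor(image, text, return_tensors="pt")
outputs = model(**encoding)
logits = outputs.logits

# The VQA head is a classifier over a fixed answer vocabulary,
# so the prediction is looked up in the label mapping.
idx = logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])
```

Because the fine-tuned head treats VQAv2 as classification over a fixed set of frequent answers, the prediction is read from model.config.id2label rather than generated as free-form text.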

Training data

(to do)

Training procedure

Preprocessing

(to do)

Pretraining

(to do)

Evaluation results

(to do)

BibTeX entry and citation info

@misc{kim2021vilt,
      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision}, 
      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
      year={2021},
      eprint={2102.03334},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}