diff --git a/README.md b/README.md
new file mode 100644
index 0000000..4d5abaa
--- /dev/null
+++ b/README.md
@@ -0,0 +1,56 @@
+---
+license: apache-2.0
+tags:
+datasets:
+- imagenet-21k
+---
+
+# Vision-and-Language Transformer (ViLT), fine-tuned on VQAv2
+
+Vision-and-Language Transformer (ViLT) model fine-tuned on [VQAv2](https://visualqa.org/). It was introduced in the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Kim et al. and first released in [this repository](https://github.com/dandelin/ViLT).
+
+Disclaimer: The team releasing ViLT did not write a model card for this model, so this model card has been written by the Hugging Face team.
+
+## Model description
+
+ViLT is a minimal vision-and-language transformer: rather than relying on a convolutional visual backbone or region features from an object detector, it embeds image patches with a simple linear projection and processes the concatenated patch and word embeddings with a single transformer encoder. For visual question answering, a classification head is placed on top of the encoder and fine-tuned on VQAv2.
+
+## Intended uses & limitations
+
+You can use the raw model for visual question answering.
+
+### How to use
+
+(to do; a usage sketch is included after the citation entry below)
+
+## Training data
+
+(to do)
+
+## Training procedure
+
+### Preprocessing
+
+(to do)
+
+### Pretraining
+
+(to do)
+
+## Evaluation results
+
+(to do)
+
+### BibTeX entry and citation info
+
+```bibtex
+@misc{kim2021vilt,
+      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
+      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
+      year={2021},
+      eprint={2102.03334},
+      archivePrefix={arXiv},
+      primaryClass={stat.ML}
+}
+```
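+
+As referenced in the "How to use" section above, below is a minimal inference sketch using the `transformers` library (`ViltProcessor` and `ViltForQuestionAnswering`). The checkpoint id `dandelin/vilt-b32-finetuned-vqa` and the example image URL are illustrative assumptions rather than details taken from this card:
+
+```python
+from transformers import ViltProcessor, ViltForQuestionAnswering
+import requests
+from PIL import Image
+
+# prepare an example image and a question about it
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+text = "How many cats are there?"
+
+# NOTE: checkpoint id assumed for illustration; substitute this repository's id
+processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
+model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
+
+# encode the image-question pair and run a forward pass
+encoding = processor(image, text, return_tensors="pt")
+outputs = model(**encoding)
+
+# the predicted answer is the highest-scoring class in the VQAv2 answer vocabulary
+idx = outputs.logits.argmax(-1).item()
+print("Predicted answer:", model.config.id2label[idx])
+```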