diff --git a/README.md b/README.md
new file mode 100644
index 0000000..4d5abaa
--- /dev/null
+++ b/README.md
@@ -0,0 +1,56 @@
+---
+license: apache-2.0
+tags:
+datasets:
+- imagenet-21k
+---
+
+# Vision-and-Language Transformer (ViLT), fine-tuned on VQAv2
+
+Vision-and-Language Transformer (ViLT) model fine-tuned on [VQAv2](https://visualqa.org/). It was introduced in the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Kim et al. and first released in [this repository](https://github.com/dandelin/ViLT).
+
+Disclaimer: The team releasing ViLT did not write a model card for this model, so this model card has been written by the Hugging Face team.
+
+## Model description
+
+ViLT is a minimal vision-and-language transformer: rather than relying on a convolutional visual backbone or region features from an object detector, it embeds image patches with a simple linear projection and processes the concatenated patch and word embeddings with a single transformer encoder. For visual question answering, a classification head is placed on top of the encoder and fine-tuned on VQAv2.
+
+## Intended uses & limitations
+
+You can use the raw model for visual question answering.
+
+### How to use
+
+(to do; a usage sketch is included after the citation entry below)
+
+## Training data
+
+(to do)
+
+## Training procedure
+
+### Preprocessing
+
+(to do)
+
+### Pretraining
+
+(to do)
+
+## Evaluation results
+
+(to do)
+
+### BibTeX entry and citation info
+
+```bibtex
+@misc{kim2021vilt,
+      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
+      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
+      year={2021},
+      eprint={2102.03334},
+      archivePrefix={arXiv},
+      primaryClass={stat.ML}
+}
+```
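+
+As referenced in the "How to use" section above, below is a minimal inference sketch using the `transformers` library (`ViltProcessor` and `ViltForQuestionAnswering`). The checkpoint id `dandelin/vilt-b32-finetuned-vqa` and the example image URL are illustrative assumptions rather than details taken from this card:
+
+```python
+from transformers import ViltProcessor, ViltForQuestionAnswering
+import requests
+from PIL import Image
+
+# prepare an example image and a question about it
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+text = "How many cats are there?"
+
+# NOTE: checkpoint id assumed for illustration; substitute this repository's id
+processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
+model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
+
+# encode the image-question pair and run a forward pass
+encoding = processor(image, text, return_tensors="pt")
+outputs = model(**encoding)
+
+# the predicted answer is the highest-scoring class in the VQAv2 answer vocabulary
+idx = outputs.logits.argmax(-1).item()
+print("Predicted answer:", model.config.id2label[idx])
+```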