donut-base-finetuned-docvqa/README.md

---
license: mit
pipeline_tag: document-question-answering
tags:
- donut
- image-to-text
- vision
widget:
- text: "What is the invoice number?"
  src: "https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png"
- text: "What is the purchase amount?"
  src: "https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/contract.jpeg"
---

# Donut (base-sized model, fine-tuned on DocVQA) 

Donut model fine-tuned on DocVQA. It was introduced in the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim et al. and first released in [this repository](https://github.com/clovaai/donut).

Disclaimer: The team releasing Donut did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes it into a tensor of embeddings of shape `(batch_size, seq_len, hidden_size)`. The decoder then autoregressively generates text, conditioned on the encoder's output.

![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/donut_architecture.jpg)
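
To make the encoder output shape concrete, here is a minimal sketch. It assumes the Hub identifier `naver-clova-ix/donut-base-finetuned-docvqa` and the `DonutProcessor` / `VisionEncoderDecoderModel` classes from Transformers. The helper `swin_sequence_length` is a hypothetical name introduced for illustration; it relies only on the general Swin design (a patch embedding followed by patch merging between stages) and ignores any padding an implementation may apply.

```python
def swin_sequence_length(height, width, patch_size=4, num_stages=4):
    """Estimate seq_len of a Swin-style encoder output (illustrative helper).

    The patch embedding shrinks each side by `patch_size`, and every stage
    except the last halves each side again through patch merging.
    Floor division is an assumption: real implementations may pad instead.
    """
    reduction = patch_size * 2 ** (num_stages - 1)
    return (height // reduction) * (width // reduction)


def encode_image(image, model_id="naver-clova-ix/donut-base-finetuned-docvqa"):
    """Run only the vision encoder and return its embeddings.

    `model_id` is an assumed Hub identifier. Heavy imports live inside the
    function so the helper above works without transformers installed.
    """
    import torch
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    processor = DonutProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)

    # The processor resizes and normalizes the image into a pixel tensor.
    pixel_values = processor(image, return_tensors="pt").pixel_values

    with torch.no_grad():
        encoder_outputs = model.encoder(pixel_values=pixel_values)

    # Tensor of shape (batch_size, seq_len, hidden_size)
    return encoder_outputs.last_hidden_state
```

Calling `encode_image(image).shape` on a PIL image should give a three-element shape whose middle entry can be compared against `swin_sequence_length` for the processed image size.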

## Intended uses & limitations

This model is fine-tuned on DocVQA, a document visual question answering dataset. Given a document image and a question about it, the model generates the answer as text.

See the [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/donut) for code examples.
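
As a rough illustration of document question answering with this model, the sketch below follows the pattern used in the Transformers Donut documentation. The Hub identifier `naver-clova-ix/donut-base-finetuned-docvqa` and the function names `build_docvqa_prompt` and `answer_question` are assumptions made for this example, not part of the original release.

```python
import re

# Task prompt format for DocVQA; the question is spliced into the prompt
# and the model continues generating after <s_answer>.
TASK_PROMPT = "<s_docvqa><s_question>{user_input}</s_question><s_answer>"


def build_docvqa_prompt(question):
    """Insert the question into the DocVQA decoder prompt."""
    return TASK_PROMPT.replace("{user_input}", question)


def answer_question(image, question,
                    model_id="naver-clova-ix/donut-base-finetuned-docvqa"):
    """Answer a question about a document image (PIL image, RGB).

    `model_id` is an assumed Hub identifier. Imports are local so that
    the prompt helper can be used without torch or transformers.
    """
    import torch
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    processor = DonutProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # The decoder is primed with the task prompt rather than plain BOS.
    decoder_input_ids = processor.tokenizer(
        build_docvqa_prompt(question),
        add_special_tokens=False,
        return_tensors="pt",
    ).input_ids
    pixel_values = processor(image, return_tensors="pt").pixel_values

    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )

    # Strip special tokens, then drop the leading task start token.
    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "")
    sequence = sequence.replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()

    # Convert the remaining tagged sequence into a Python dict.
    return processor.token2json(sequence)
```

A call might look like `answer_question(Image.open("invoice.png").convert("RGB"), "What is the invoice number?")`, where `invoice.png` is a hypothetical local file loaded with Pillow. The first call downloads the model weights.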

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2111-15664,
  author    = {Geewook Kim and
               Teakgyu Hong and
               Moonbin Yim and
               Jinyoung Park and
               Jinyeong Yim and
               Wonseok Hwang and
               Sangdoo Yun and
               Dongyoon Han and
               Seunghyun Park},
  title     = {Donut: Document Understanding Transformer without {OCR}},
  journal   = {CoRR},
  volume    = {abs/2111.15664},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.15664},
  eprinttype = {arXiv},
  eprint    = {2111.15664},
  timestamp = {Thu, 02 Dec 2021 10:50:44 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-15664.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```