---
license: mit
tags:
- donut
- image-to-text
- vision
---

# Donut (base-sized model, pre-trained only) 

This is the Donut base model, pre-trained only (not fine-tuned on any downstream task). It was introduced in the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim et al. and first released in [this repository](https://github.com/clovaai/donut).

Disclaimer: The team releasing Donut did not write a model card for this model, so this model card was written by the Hugging Face team.

## Model description

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes it into a tensor of embeddings of shape `(batch_size, seq_len, hidden_size)`. The decoder then autoregressively generates text, conditioned on the encoder's output.
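
To make the data flow concrete, here is a minimal, dependency-free sketch of the encode-once, decode-token-by-token pattern. It is not Donut's implementation: `toy_encode` and `toy_decoder_step` are hypothetical stand-ins for the Swin Transformer and BART, the four-word vocabulary is invented, and greedy (argmax) decoding is assumed for simplicity.

```python
from typing import List

Image = List[List[float]]        # H x W grayscale pixels in [0, 1]
Embeddings = List[List[float]]   # seq_len x hidden_size for one image

VOCAB = ["<s>", "</s>", "dark", "bright"]  # invented toy vocabulary
BOS, EOS = 0, 1


def toy_encode(images: List[Image], patch: int, hidden_size: int) -> List[Embeddings]:
    """Toy encoder (not a Swin Transformer): one embedding per non-overlapping patch."""
    batch = []
    for img in images:
        height, width = len(img), len(img[0])
        seq = []
        for top in range(0, height, patch):
            for left in range(0, width, patch):
                pixels = [img[r][c]
                          for r in range(top, min(top + patch, height))
                          for c in range(left, min(left + patch, width))]
                mean = sum(pixels) / len(pixels)
                seq.append([mean] * hidden_size)  # placeholder features
        batch.append(seq)
    return batch  # nested lists of shape (batch_size, seq_len, hidden_size)


def toy_decoder_step(encoder_states: Embeddings, tokens: List[int]) -> List[float]:
    """Toy decoder (not BART): scores over VOCAB given the encoding and tokens so far."""
    position = len(tokens) - 1  # tokens[0] is BOS, so this counts generated tokens
    scores = [0.0] * len(VOCAB)
    if position >= len(encoder_states):
        scores[EOS] = 1.0  # every patch has been described: stop
    else:
        brightness = encoder_states[position][0]
        scores[VOCAB.index("bright" if brightness > 0.5 else "dark")] = 1.0
    return scores


def greedy_generate(encoder_states: Embeddings, max_length: int = 16) -> List[str]:
    """Autoregressive loop: each step sees the encoder output and all previous tokens."""
    tokens = [BOS]
    while len(tokens) < max_length:
        scores = toy_decoder_step(encoder_states, tokens)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return [VOCAB[t] for t in tokens]


image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [1.0, 1.0, 0.0, 0.0],
         [1.0, 1.0, 0.0, 0.0]]
encoded = toy_encode([image], patch=2, hidden_size=8)
shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
generated = greedy_generate(encoded[0])
print(shape)      # (1, 4, 8)
print(generated)  # ['<s>', 'dark', 'bright', 'bright', 'dark', '</s>']
```

The image is encoded a single time, while the decoder step runs once per generated token and stops at the end-of-sequence token or at `max_length`.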

![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/donut_architecture.jpg)

## Intended uses & limitations

This model is meant to be fine-tuned on a downstream task, such as document image classification or document parsing. See the [model hub](https://huggingface.co/models?search=donut) to find versions fine-tuned on a task that interests you.

### How to use

See the [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/donut), which includes code examples.
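
As a rough starting point, the sketch below loads the pre-trained checkpoint with the `transformers` library and inspects the encoder output described under "Model description". The Hub ID `naver-clova-ix/donut-base` is an assumption (this card does not state it), and the function name `encoder_output_shape` is made up for illustration. Because this checkpoint is pre-trained only, the sketch stops at the encoder rather than generating task-specific text.

```python
CHECKPOINT = "naver-clova-ix/donut-base"  # assumed Hub ID; not stated in this card


def encoder_output_shape(image_path: str, checkpoint: str = CHECKPOINT) -> tuple:
    """Return (batch_size, seq_len, hidden_size) of the encoder output for one image."""
    # Heavy dependencies are imported lazily so this module loads without them.
    import torch
    from PIL import Image
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    processor = DonutProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    model.eval()

    # The processor resizes and normalizes the image into a pixel_values tensor.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values

    # Run only the vision encoder; no text is generated here.
    with torch.no_grad():
        encoder_outputs = model.encoder(pixel_values)
    return tuple(encoder_outputs.last_hidden_state.shape)
```

Calling `encoder_output_shape("document.png")` (a hypothetical local file) downloads the checkpoint on first use and requires `torch`, `Pillow`, and `transformers` to be installed.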

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2111-15664,
  author    = {Geewook Kim and
               Teakgyu Hong and
               Moonbin Yim and
               Jinyoung Park and
               Jinyeong Yim and
               Wonseok Hwang and
               Sangdoo Yun and
               Dongyoon Han and
               Seunghyun Park},
  title     = {Donut: Document Understanding Transformer without {OCR}},
  journal   = {CoRR},
  volume    = {abs/2111.15664},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.15664},
  eprinttype = {arXiv},
  eprint    = {2111.15664},
  timestamp = {Thu, 02 Dec 2021 10:50:44 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-15664.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```