diff --git a/README.md b/README.md
index f780117..2f9ccdc 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,89 @@
 ---
 license: gpl-3.0
+tags:
+- DocVQA
+- Document Question Answering
+- Document Visual Question Answering
+datasets:
+- MP-DocVQA
+language:
+- en
 ---
+
+# LayoutLMv3 base fine-tuned on MP-DocVQA
+
+This is the pretrained LayoutLMv3 model from the [Microsoft hub](https://huggingface.co/microsoft/layoutlmv3-base), fine-tuned on the Multipage DocVQA (MP-DocVQA) dataset.
+
+This model was used as a baseline in [Hierarchical multimodal transformers for Multi-Page DocVQA](https://arxiv.org/pdf/2212.05935.pdf).
+- Results on the MP-DocVQA dataset are reported in Table 2.
+- Training hyperparameters can be found in Table 8 of Appendix D.
+
+## How to use
+
+Here is how to use this model to answer a question about a document image in PyTorch:
+
+```python
+import torch
+from PIL import Image
+from transformers import LayoutLMv3Processor, LayoutLMv3ForQuestionAnswering
+
+processor = LayoutLMv3Processor.from_pretrained("rubentito/layoutlmv3-base-mpdocvqa", apply_ocr=False)
+model = LayoutLMv3ForQuestionAnswering.from_pretrained("rubentito/layoutlmv3-base-mpdocvqa")
+
+image = Image.open("example.jpg").convert("RGB")
+question = "Is this a question?"
+context = ["Example"]
+boxes = [[0, 0, 1000, 1000]]  # One bounding box per context word, normalized to 0-1000; this example box covers the whole page.
+document_encoding = processor(image, question, context, boxes=boxes, return_tensors="pt")
+outputs = model(**document_encoding)
+
+# Decode the answer span between the highest-scoring start and end positions.
+start_idx = torch.argmax(outputs.start_logits, dim=1).item()
+end_idx = torch.argmax(outputs.end_logits, dim=1).item()
+answer = processor.tokenizer.decode(document_encoding["input_ids"][0][start_idx:end_idx + 1]).strip()
+```
+
+## BibTeX entry
+
+```tex
+@article{tito2022hierarchical,
+  title={Hierarchical multimodal transformers for Multi-Page DocVQA},
+  author={Tito, Rub{\`e}n and Karatzas, Dimosthenis and Valveny, Ernest},
+  journal={arXiv preprint arXiv:2212.05935},
+  year={2022}
+}
+```
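+
+## Applying the model to multi-page documents
+
+The snippet above answers a question over a single page. A minimal sketch of one way to handle a multi-page document is to run the model on every page and keep the highest-scoring answer; this is only an illustration, not the evaluation code from the paper, and `page_0.jpg`, `page_1.jpg` and the word/box lists below are placeholder inputs you would replace with your own page images and OCR output.
+
+```python
+import torch
+from PIL import Image
+from transformers import LayoutLMv3Processor, LayoutLMv3ForQuestionAnswering
+
+processor = LayoutLMv3Processor.from_pretrained("rubentito/layoutlmv3-base-mpdocvqa", apply_ocr=False)
+model = LayoutLMv3ForQuestionAnswering.from_pretrained("rubentito/layoutlmv3-base-mpdocvqa")
+
+question = "Is this a question?"
+# Placeholder per-page inputs: page images, OCR words and their 0-1000 normalized boxes.
+page_images = [Image.open(path).convert("RGB") for path in ["page_0.jpg", "page_1.jpg"]]
+page_words = [["Example"], ["Example"]]
+page_boxes = [[[0, 0, 1000, 1000]], [[0, 0, 1000, 1000]]]
+
+best_answer, best_score = None, float("-inf")
+for image, words, boxes in zip(page_images, page_words, page_boxes):
+    encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
+    with torch.no_grad():
+        outputs = model(**encoding)
+    start_idx = torch.argmax(outputs.start_logits, dim=1).item()
+    end_idx = torch.argmax(outputs.end_logits, dim=1).item()
+    # Score the candidate span by the sum of its start and end logits.
+    score = outputs.start_logits[0, start_idx].item() + outputs.end_logits[0, end_idx].item()
+    if score > best_score:
+        best_score = score
+        best_answer = processor.tokenizer.decode(encoding["input_ids"][0][start_idx:end_idx + 1]).strip()
+
+print(best_answer)
+```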