From 7bcdcde68b0d920396cd4398cae867786b723f2d Mon Sep 17 00:00:00 2001 From: Niels Rogge Date: Tue, 1 Jun 2021 10:02:37 +0000 Subject: [PATCH 1/7] First draft of model card --- README.md | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 85 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..02c05ee --- /dev/null +++ b/README.md @@ -0,0 +1,85 @@ +--- +license: apache-2.0 +tags: +--- + +# Vision Transformer (base-sized model) + +Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. and first released in [this repository](https://github.com/google-research/vision_transformer). However, the weights were converted from the [timm repository](https://github.com/rwightman/pytorch-image-models) by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him. + +Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team. + +## Model description + +The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. + +Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. + +Note that this model does not provide any fine-tuned heads, as these were zero'd by Google researchers. However, the model does include the pre-trained pooler, which can be used for downstream tasks (such as image classification). + +By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. + +## Intended uses & limitations + +You can use the raw model for object detection. See the [model hub](https://huggingface.co/models?search=facebook/detr) to look for all available DETR models. + +### How to use + +Here is how to use this model: + +```python +from transformers import ViTFeatureExtractor, ViTModel +from PIL import Image +import requests +url = 'http://images.cocodataset.org/val2017/000000039769.jpg' +image = Image.open(requests.get(url, stream=True).raw) +feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k') +model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k') +inputs = feature_extractor(images=image, return_tensors="pt") +outputs = model(**inputs) +last_hidden_states = outputs.last_hidden_state +``` + +Currently, both the feature extractor and model support PyTorch. + +## Training data + +The DETR model was trained on [COCO 2017 object detection](https://cocodataset.org/#download), a dataset consisting of 118k/5k annotated images for training/validation respectively. 
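As a rough illustration of the dataset referenced above (this is not taken from the DETR repository, and the local paths are assumptions), the COCO 2017 detection images and annotations can be browsed with torchvision's `CocoDetection` wrapper once the data has been downloaded from cocodataset.org:

```python
# Minimal sketch for browsing the COCO 2017 detection data mentioned above.
# The local paths are assumptions; download the images and annotation files
# from https://cocodataset.org/#download first. Requires pycocotools.
from torchvision.datasets import CocoDetection

train_set = CocoDetection(
    root="coco/train2017",                                # ~118k training images (assumed path)
    annFile="coco/annotations/instances_train2017.json",  # detection annotations (assumed path)
)
val_set = CocoDetection(
    root="coco/val2017",                                  # ~5k validation images (assumed path)
    annFile="coco/annotations/instances_val2017.json",
)

image, annotations = train_set[0]          # PIL image + list of annotation dicts
print(len(train_set), len(val_set))        # roughly 118k / 5k images
print(annotations[0].keys())               # e.g. 'bbox', 'category_id', 'area', ...
```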
+ +## Training procedure + +### Preprocessing + +The exact details of preprocessing of images during training/validation can be found [here](https://github.com/google-research/vision_transformer/blob/master/vit_jax/input_pipeline.py). + +Images are resized/rescaled such that the shortest side is at least 800 pixels and the largest side at most 1333 pixels, and normalized across the RGB channels with the ImageNet mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225). + +### Training + +The model was trained for 300 epochs on 16 V100 GPUs. This takes 3 days, with 4 images per GPU (hence a total batch size of 64). + +## Evaluation results + +This model achieves an AP (average precision) of 42.0 on COCO 2017 validation. For more details regarding evaluation results, we refer to table 1 of the original paper. +### BibTeX entry and citation info + +```bibtex +@article{DBLP:journals/corr/abs-2005-12872, + author = {Nicolas Carion and + Francisco Massa and + Gabriel Synnaeve and + Nicolas Usunier and + Alexander Kirillov and + Sergey Zagoruyko}, + title = {End-to-End Object Detection with Transformers}, + journal = {CoRR}, + volume = {abs/2005.12872}, + year = {2020}, + url = {https://arxiv.org/abs/2005.12872}, + archivePrefix = {arXiv}, + eprint = {2005.12872}, + timestamp = {Thu, 28 May 2020 17:38:09 +0200}, + biburl = {https://dblp.org/rec/journals/corr/abs-2005-12872.bib}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} +``` \ No newline at end of file From f7bfb7db0149497c88c4b49951ecd66dc2b48899 Mon Sep 17 00:00:00 2001 From: Niels Rogge Date: Tue, 1 Jun 2021 10:04:39 +0000 Subject: [PATCH 2/7] Update model card --- README.md | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 02c05ee..c969d61 100644 --- a/README.md +++ b/README.md @@ -3,21 +3,15 @@ license: apache-2.0 tags: --- -# Vision Transformer (base-sized model) +# DETR (End-to-End Object Detection) model with ResNet-50 backbone -Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. and first released in [this repository](https://github.com/google-research/vision_transformer). However, the weights were converted from the [timm repository](https://github.com/rwightman/pytorch-image-models) by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him. +DEtection Transformer (DETR) model trained end-to-end on COCO 2017 object detection (118k annotated images). It was introduced in the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Carion et al. and first released in [this repository](https://github.com/facebookresearch/detr). -Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team. +Disclaimer: The team releasing DETR did not write a model card for this model so this model card has been written by the Hugging Face team. ## Model description -The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. - -Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. 
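The "Preprocessing" paragraph of the first patch describes resizing images so the shortest side reaches 800 pixels (capping the longest side at 1333) and normalizing with ImageNet statistics. A minimal sketch of that step, written with plain torchvision rather than the repository's actual input pipeline, could look as follows; the function name and defaults are illustrative assumptions:

```python
# Illustrative only: resize so the shortest side is 800 px (keeping the longest
# side within 1333 px) and normalize with the ImageNet statistics quoted in the
# card. This mirrors the description, not the repository's exact pipeline.
from PIL import Image
import torchvision.transforms.functional as F

def preprocess(image: Image.Image, shortest=800, longest=1333):
    w, h = image.size
    scale = shortest / min(w, h)
    if max(w, h) * scale > longest:          # keep the longest side within the cap
        scale = longest / max(w, h)
    new_size = (round(h * scale), round(w * scale))   # (height, width) for torchvision
    tensor = F.to_tensor(F.resize(image, new_size))   # float tensor in [0, 1], shape (3, H, W)
    return F.normalize(tensor,
                       mean=[0.485, 0.456, 0.406],
                       std=[0.229, 0.224, 0.225])
```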
One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. - -Note that this model does not provide any fine-tuned heads, as these were zero'd by Google researchers. However, the model does include the pre-trained pooler, which can be used for downstream tasks (such as image classification). - -By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. +The DETR model is an encoder-decoder transformer with a convolutional backbone. ## Intended uses & limitations From 073311fe2ecbc74031f0fc7b33aff50ebfc61eaf Mon Sep 17 00:00:00 2001 From: Niels Rogge Date: Tue, 1 Jun 2021 10:06:43 +0000 Subject: [PATCH 3/7] Update model card --- README.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index c969d61..e9b2da9 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ tags: # DETR (End-to-End Object Detection) model with ResNet-50 backbone -DEtection Transformer (DETR) model trained end-to-end on COCO 2017 object detection (118k annotated images). It was introduced in the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Carion et al. and first released in [this repository](https://github.com/facebookresearch/detr). +DEtection TRansformer (DETR) model trained end-to-end on COCO 2017 object detection (118k annotated images). It was introduced in the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Carion et al. and first released in [this repository](https://github.com/facebookresearch/detr). Disclaimer: The team releasing DETR did not write a model card for this model so this model card has been written by the Hugging Face team. @@ -13,6 +13,12 @@ Disclaimer: The team releasing DETR did not write a model card for this model so The DETR model is an encoder-decoder transformer with a convolutional backbone. +First, an image is sent through a CNN backbone, outputting a lower-resolution feature map, typically of shape (1, 2048, height/32, width/32). This is then projected to match the hidden dimension of the Transformer, which is 256 by default, using a nn.Conv2D layer. Next, the feature map is flattened and transposed to obtain a tensor of shape (batch_size, seq_len, d_model) = (1, width/32*height/32, 256). + +This is sent through the encoder, outputting encoder_hidden_states of the same shape. Next, so-called object queries are sent through the decoder. This is just a tensor of shape (batch_size, num_queries, d_model), with num_queries typically set to 100 and is initialized with zeros. Each object query looks for a particular object in the image. Next, the decoder updates these object queries through multiple self-attention and encoder-decoder attention layers to output decoder_hidden_states of the same shape: (batch_size, num_queries, d_model). Next, two heads are added on top for object detection: a linear layer for classifying each object query into one of the objects or "no object", and a MLP to predict bounding boxes for each query. 
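The shape bookkeeping in the paragraph above is easier to follow as a toy example. The sketch below reproduces only the tensor shapes described in the text, with randomly initialized layers; it is not the actual DETR implementation, and the module choices (plain `nn.Transformer*` layers, a three-layer box MLP) are assumptions made for illustration:

```python
# Toy walk-through of the shapes described above (random weights, illustration only).
import torch
import torch.nn as nn

batch_size, hidden_dim, num_queries, num_classes = 1, 256, 100, 91

# Backbone feature map of shape (1, 2048, height/32, width/32), e.g. 25 x 34.
features = torch.randn(batch_size, 2048, 25, 34)
projected = nn.Conv2d(2048, hidden_dim, kernel_size=1)(features)   # (1, 256, 25, 34)
sequence = projected.flatten(2).transpose(1, 2)                    # (batch_size, seq_len, d_model) = (1, 850, 256)

encoder_layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
encoder_hidden_states = nn.TransformerEncoder(encoder_layer, num_layers=6)(sequence)   # (1, 850, 256)

object_queries = torch.zeros(batch_size, num_queries, hidden_dim)  # (1, 100, 256), initialized with zeros
decoder_layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
decoder_hidden_states = nn.TransformerDecoder(decoder_layer, num_layers=6)(
    object_queries, encoder_hidden_states)                         # (1, 100, 256)

# Two heads on top of the decoder outputs, as described in the text.
class_logits = nn.Linear(hidden_dim, num_classes + 1)(decoder_hidden_states)  # extra slot for "no object"
box_mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                        nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                        nn.Linear(hidden_dim, 4))
pred_boxes = box_mlp(decoder_hidden_states)                        # (1, 100, 4)
print(class_logits.shape, pred_boxes.shape)
```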
So the number of queries actually determines the maximum number of objects the model can detect in an image. + +The model is trained using a "bipartite matching loss": so what we actually do is compare the predicted classes + bounding boxes of each of the N = 100 object queries to the ground truth annotations, padded up to the same length N (so if an image only contains 4 objects, 96 annotations will just have a "no object" as class and "no bounding box" as bounding box). The Hungarian matching algorithm is used to create an optimal one-to-one mapping between each of the N queries and each of the N annotations. Next, standard cross-entropy for the classes and L1 regression loss for the bounding boxes are used to optimize the parameters of the model. + ## Intended uses & limitations You can use the raw model for object detection. See the [model hub](https://huggingface.co/models?search=facebook/detr) to look for all available DETR models. From 0631208742971f51ae93ee3d8051b0e5f50f828a Mon Sep 17 00:00:00 2001 From: Niels Rogge Date: Tue, 1 Jun 2021 10:14:31 +0000 Subject: [PATCH 4/7] Add code example --- README.md | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index e9b2da9..03c68b9 100644 --- a/README.md +++ b/README.md @@ -28,16 +28,22 @@ You can use the raw model for object detection. See the [model hub](https://hugg Here is how to use this model: ```python -from transformers import ViTFeatureExtractor, ViTModel +from transformers import DetrFeatureExtractor, DetrForObjectDetection from PIL import Image import requests + url = 'http://images.cocodataset.org/val2017/000000039769.jpg' image = Image.open(requests.get(url, stream=True).raw) -feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k') -model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k') + +feature_extractor = DetrFeatureExtractor.from_pretrained('facebook/detr-resnet-50') +model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50') + inputs = feature_extractor(images=image, return_tensors="pt") outputs = model(**inputs) -last_hidden_states = outputs.last_hidden_state + +# model predicts bounding boxes and corresponding COCO classes +logits = outputs.logits +bboxes = outputs.pred_boxes ``` Currently, both the feature extractor and model support PyTorch. From 953125419f73c54a3ffa745c56bc229a8f652cfd Mon Sep 17 00:00:00 2001 From: Niels Rogge Date: Tue, 1 Jun 2021 10:20:13 +0000 Subject: [PATCH 5/7] Update model description --- README.md | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 03c68b9..1ff4034 100644 --- a/README.md +++ b/README.md @@ -11,13 +11,9 @@ Disclaimer: The team releasing DETR did not write a model card for this model so ## Model description -The DETR model is an encoder-decoder transformer with a convolutional backbone. +The DETR model is an encoder-decoder transformer with a convolutional backbone. Two heads are added on top of the decoder outputs in order to perform object detection: a linear layer for the class labels and a MLP (multi-layer perceptron) for the bounding boxes. The model uses so-called object queries to detect objects in an image. Each object query looks for a particular object in the image. For COCO, the number of object queries is set to 100. -First, an image is sent through a CNN backbone, outputting a lower-resolution feature map, typically of shape (1, 2048, height/32, width/32). 
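The bipartite matching described above can be illustrated with `scipy.optimize.linear_sum_assignment`, which computes the optimal one-to-one assignment. The cost used here (negative class probability plus an L1 box distance) is a simplified stand-in for the full matching cost used in DETR, and the example values are made up:

```python
# Toy illustration of Hungarian matching between object queries and ground truth.
import torch
from scipy.optimize import linear_sum_assignment

num_queries, num_targets = 100, 4                      # e.g. an image with 4 annotated objects

pred_probs = torch.rand(num_queries, 92).softmax(-1)   # class probabilities per query (91 classes + "no object")
pred_boxes = torch.rand(num_queries, 4)                # normalized boxes, one per query
target_classes = torch.tensor([1, 17, 17, 62])         # ground-truth category indices (made up)
target_boxes = torch.rand(num_targets, 4)

# Cost of assigning query i to target j: low if the query is confident in the
# right class and its box is close (in L1 distance) to the target box.
class_cost = -pred_probs[:, target_classes]                        # (100, 4)
box_cost = torch.cdist(pred_boxes, target_boxes, p=1)              # (100, 4)
cost = (class_cost + box_cost).numpy()

query_idx, target_idx = linear_sum_assignment(cost)    # optimal one-to-one assignment
print(list(zip(query_idx, target_idx)))                # the 4 matched queries; the other 96 are supervised as "no object"
```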
This is then projected to match the hidden dimension of the Transformer, which is 256 by default, using a nn.Conv2D layer. Next, the feature map is flattened and transposed to obtain a tensor of shape (batch_size, seq_len, d_model) = (1, width/32*height/32, 256). - -This is sent through the encoder, outputting encoder_hidden_states of the same shape. Next, so-called object queries are sent through the decoder. This is just a tensor of shape (batch_size, num_queries, d_model), with num_queries typically set to 100 and is initialized with zeros. Each object query looks for a particular object in the image. Next, the decoder updates these object queries through multiple self-attention and encoder-decoder attention layers to output decoder_hidden_states of the same shape: (batch_size, num_queries, d_model). Next, two heads are added on top for object detection: a linear layer for classifying each object query into one of the objects or "no object", and a MLP to predict bounding boxes for each query. So the number of queries actually determines the maximum number of objects the model can detect in an image. - -The model is trained using a "bipartite matching loss": so what we actually do is compare the predicted classes + bounding boxes of each of the N = 100 object queries to the ground truth annotations, padded up to the same length N (so if an image only contains 4 objects, 96 annotations will just have a "no object" as class and "no bounding box" as bounding box). The Hungarian matching algorithm is used to create an optimal one-to-one mapping between each of the N queries and each of the N annotations. Next, standard cross-entropy for the classes and L1 regression loss for the bounding boxes are used to optimize the parameters of the model. +The model is trained using a "bipartite matching loss": one compares the predicted classes + bounding boxes of each of the N = 100 object queries to the ground truth annotations, padded up to the same length N (so if an image only contains 4 objects, 96 annotations will just have a "no object" as class and "no bounding box" as bounding box). The Hungarian matching algorithm is used to create an optimal one-to-one mapping between each of the N queries and each of the N annotations. Next, standard cross-entropy (for the classes) and a linear combination of the L1 and generalized IoU loss (for the bounding boxes) are used to optimize the parameters of the model. ## Intended uses & limitations From 06a83ce7a41ebcafe320c00add54c38faaddc5cf Mon Sep 17 00:00:00 2001 From: Niels Rogge Date: Tue, 1 Jun 2021 11:34:46 +0000 Subject: [PATCH 6/7] Add tags --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index 1ff4034..21f2fc6 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,9 @@ --- license: apache-2.0 tags: +- object-detection +datasets: +- coco --- # DETR (End-to-End Object Detection) model with ResNet-50 backbone From 3d24b54ca3edd01d6ea9f5dfd3f8d19c272709ee Mon Sep 17 00:00:00 2001 From: Niels Rogge Date: Tue, 1 Jun 2021 11:42:19 +0000 Subject: [PATCH 7/7] Make evaluation results bold --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 21f2fc6..cd4c912 100644 --- a/README.md +++ b/README.md @@ -65,7 +65,7 @@ The model was trained for 300 epochs on 16 V100 GPUs. This takes 3 days, with 4 ## Evaluation results -This model achieves an AP (average precision) of 42.0 on COCO 2017 validation. 
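The AP number quoted above is the standard COCO detection metric. A hedged sketch of how such a number is typically computed with `pycocotools` follows; the annotation path is an assumption, and `detections` stands for predictions that have already been collected in the standard COCO results format:

```python
# Sketch of a COCO-style AP computation with pycocotools. Paths and the example
# detection entry are placeholders; `detections` would normally hold one dict
# per predicted box over the whole validation set.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("coco/annotations/instances_val2017.json")   # ground-truth annotations (assumed path)

detections = [
    # [x, y, width, height] in pixels, plus a confidence score (values made up)
    {"image_id": 39769, "category_id": 17, "bbox": [13.2, 52.1, 310.0, 320.5], "score": 0.98},
    # ... one dict per detection ...
]

coco_dt = coco_gt.loadRes(detections)
coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()      # prints AP over IoU 0.50:0.95, the metric reported above
```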
For more details regarding evaluation results, we refer to table 1 of the original paper. +This model achieves an AP (average precision) of **42.0** on COCO 2017 validation. For more details regarding evaluation results, we refer to table 1 of the original paper. ### BibTeX entry and citation info ```bibtex
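The usage example in the card stops at raw `logits` and `pred_boxes`. The following self-contained sketch goes one step further and turns those outputs into thresholded, pixel-space detections. It assumes the conventions of the original DETR code, namely that the "no object" class occupies the last logit position and that boxes are predicted as normalized (center_x, center_y, width, height); the 0.9 confidence threshold is an arbitrary choice for illustration:

```python
# Self-contained follow-up to the usage example in the card: run the model and
# convert its raw outputs into thresholded, pixel-space detections. Assumes the
# "no object" class is the last logit and that boxes are normalized
# (center_x, center_y, width, height), as in the original DETR code.
from transformers import DetrFeatureExtractor, DetrForObjectDetection
from PIL import Image
import requests
import torch

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = DetrFeatureExtractor.from_pretrained('facebook/detr-resnet-50')
model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')

inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

probas = outputs.logits.softmax(-1)[0, :, :-1]     # (100, num_classes): drop the "no object" column
keep = probas.max(-1).values > 0.9                 # keep only confident object queries

width, height = image.size
center_x, center_y, w, h = outputs.pred_boxes[0, keep].unbind(-1)
boxes = torch.stack([center_x - 0.5 * w, center_y - 0.5 * h,
                     center_x + 0.5 * w, center_y + 0.5 * h], dim=-1)
boxes = boxes * torch.tensor([width, height, width, height])   # rescale to pixel coordinates

for p, (xmin, ymin, xmax, ymax) in zip(probas[keep], boxes.tolist()):
    label = model.config.id2label[p.argmax().item()]
    print(f"{label}: {p.max().item():.2f} at [{xmin:.0f}, {ymin:.0f}, {xmax:.0f}, {ymax:.0f}]")
```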