---
license: apache-2.0
tags:
- vision
- depth-estimation
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
  example_title: Tiger
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
  example_title: Teapot
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
  example_title: Palace
---

# DPT (large-sized model)

Dense Prediction Transformer (DPT) model trained on 1.4 million images for monocular depth estimation. It was introduced in the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by Ranftl et al. and first released in [this repository](https://github.com/isl-org/DPT). 

Disclaimer: The team releasing DPT did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

DPT uses the Vision Transformer (ViT) as its backbone and adds a neck and a head on top for monocular depth estimation.

![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dpt_architecture.png)

## Intended uses & limitations

You can use the raw model for zero-shot monocular depth estimation. See the [model hub](https://huggingface.co/models?search=dpt) to look for
fine-tuned versions on a task that interests you.

### How to use

Here is how to use this model for zero-shot depth estimation on an image from the COCO 2017 dataset:

```python
from transformers import DPTFeatureExtractor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

# prepare image for the model
inputs = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size; PIL reports (width, height), so reverse it to (height, width)
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# convert the prediction to an 8-bit grayscale image for visualization
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
```
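
The snippet above scales the depth map by its maximum only, so a map whose smallest value is well above zero uses just part of the 0-255 range. As a minimal sketch of an alternative, the helper below applies min-max scaling instead. The name `depth_to_uint8` and the synthetic input array are assumptions made for illustration; they are not part of the model's API or of the original example.

```python
import numpy as np


def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    """Min-max scale a depth map to 0-255 (hypothetical helper, not a library function)."""
    depth = depth.astype(np.float32)
    d_min, d_max = float(depth.min()), float(depth.max())
    if d_max - d_min < 1e-8:
        # A constant map has no contrast; return zeros rather than divide by ~0.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - d_min) / (d_max - d_min)
    return (scaled * 255.0).round().astype(np.uint8)


# Synthetic stand-in for the `output` array produced by the model snippet.
example = np.array([[2.0, 4.0], [6.0, 10.0]])
image_array = depth_to_uint8(example)
```

With real predictions, the result could be passed to `Image.fromarray` in the same way as `formatted` in the example above.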

For more code examples, see the [documentation](https://huggingface.co/docs/transformers/master/en/model_doc/dpt).

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2103-13413,
  author    = {Ren{\'{e}} Ranftl and
               Alexey Bochkovskiy and
               Vladlen Koltun},
  title     = {Vision Transformers for Dense Prediction},
  journal   = {CoRR},
  volume    = {abs/2103.13413},
  year      = {2021},
  url       = {https://arxiv.org/abs/2103.13413},
  eprinttype = {arXiv},
  eprint    = {2103.13413},
  timestamp = {Wed, 07 Apr 2021 15:31:46 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2103-13413.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```