---
license: apache-2.0
tags:
- vision
- depth-estimation
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
  example_title: Tiger
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
  example_title: Teapot
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
  example_title: Palace
---

# DPT-Hybrid 

Dense Prediction Transformer (DPT) model trained on 1.4 million images for monocular depth estimation. It was introduced in the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by Ranftl et al. and first released in [this repository](https://github.com/isl-org/DPT). This repository hosts the "hybrid" version of the model described in the paper.

Disclaimer: The team releasing DPT did not write a model card for this model, so this model card was written by the Hugging Face team.

## Model description

DPT-Hybrid uses the Vision Transformer Hybrid (ViT-Hybrid) as its backbone and adds a neck and a head on top for monocular depth estimation.

![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dpt_architecture.jpg)

DPT-Hybrid differs from the standard DPT in that it uses [ViT-hybrid](https://huggingface.co/google/vit-hybrid-base-bit-384) as its backbone and takes some of its feature activations from that backbone.

## Intended uses & limitations

You can use the raw model for zero-shot monocular depth estimation. See the [model hub](https://huggingface.co/models?search=dpt) for versions fine-tuned on a task that interests you.

### How to use

Here is how to use this model for zero-shot depth estimation on an image:

```python
from PIL import Image
import numpy as np
import requests
import torch

# DPTImageProcessor replaces the deprecated DPTFeatureExtractor in recent transformers releases
from transformers import DPTForDepthEstimation, DPTImageProcessor

# low_cpu_mem_usage=True relies on the `accelerate` package being installed
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas", low_cpu_mem_usage=True)
image_processor = DPTImageProcessor.from_pretrained("Intel/dpt-hybrid-midas")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# prepare image for the model
inputs = image_processor(images=image, return_tensors="pt")

# run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth  # shape: (batch, height, width)

# interpolate to original size
# PIL reports (width, height); interpolate expects (height, width), hence the reversal
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),  # add a channel dimension
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction as an 8-bit grayscale image
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
depth.show()
```
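
The visualization step above scales the depth map by its maximum only, so the darkest pixel is not necessarily mapped to 0. If you want the full 0-255 range, a min-max rescaling is one option. The following is a minimal sketch, not part of the original example: `depth_to_uint8` is a hypothetical helper name, and the small input array stands in for the `output` array produced above.

```python
import numpy as np


def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    """Min-max scale a depth map to the 0-255 range (hypothetical helper)."""
    depth = np.asarray(depth, dtype=np.float32)
    lo, hi = float(depth.min()), float(depth.max())
    if hi - lo < 1e-8:
        # A constant map has no contrast; return zeros instead of dividing by zero.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - lo) / (hi - lo)  # values now lie in [0, 1]
    return (scaled * 255.0).round().astype(np.uint8)


# Stand-in for the `output` array from the example above (assumed values).
example = np.array([[2.0, 4.0], [6.0, 10.0]])
formatted = depth_to_uint8(example)
print(formatted.dtype, formatted.min(), formatted.max())
```

The result can be passed to `Image.fromarray` in the same way as `formatted` in the main example.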

For more code examples, see the [documentation](https://huggingface.co/docs/transformers/master/en/model_doc/dpt).

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2103-13413,
  author    = {Ren{\'{e}} Ranftl and
               Alexey Bochkovskiy and
               Vladlen Koltun},
  title     = {Vision Transformers for Dense Prediction},
  journal   = {CoRR},
  volume    = {abs/2103.13413},
  year      = {2021},
  url       = {https://arxiv.org/abs/2103.13413},
  eprinttype = {arXiv},
  eprint    = {2103.13413},
  timestamp = {Wed, 07 Apr 2021 15:31:46 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2103-13413.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```