Dense Prediction Transformer (DPT) model trained on 1.4 million images for monocular depth estimation.
It was introduced in the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by Ranftl et al. (2021) and first released in [this repository](https://github.com/isl-org/DPT).
The model card has been written in combination by the Hugging Face team and Intel.
| Model Detail | Description |
| ----------- | ----------- |
| Model Authors - Company | Intel |
| Date | March 22, 2022 |
| Version | 1 |
| Type | Computer Vision - Monocular Depth Estimation |
| Paper or Other Resources | [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) and [GitHub Repo](https://github.com/isl-org/DPT) |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/dpt-large/discussions) and [Intel Developers Discord](https://discord.gg/rv2Gp55UJQ)|
| Intended Use | Description |
| ----------- | ----------- |
| Primary intended uses | You can use the raw model for zero-shot monocular depth estimation. See the [model hub](https://huggingface.co/models?search=dpt) to look for
fine-tuned versions on a task that interests you. |
| Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. |
| Factors | Description |
| ----------- | ----------- |
| Groups | Multiple datasets compiled together |
| Instrumentation | - |
| Environment | Inference completed on Intel Xeon Platinum 8280 CPU @ 2.70GHz with 8 physical cores and an NVIDIA RTX 2080 GPU. |
| Card Prompts | Model deployment on alternate hardware and software will change model performance |
| Metrics | Description |
| ----------- | ----------- |
| Model performance measures | Zero-shot Transfer |
| Decision thresholds | - |
| Approaches to uncertainty and variability | - |
| Training and Evaluation Data | Description |
| ----------- | ----------- |
| Datasets | The dataset is called MIX 6, and contains around 1.4M images. The model was initialized with ImageNet-pretrained weights.|
| Motivation | To build a robust monocular depth prediction network |
| Preprocessing | "We resize the image such that the longer side is 384 pixels and train on random square crops of size 384. ... We perform random horizontal flips for data augmentation." See [Ranftl et al. (2021)](https://arxiv.org/abs/2103.13413) for more details. |
## Quantitative Analyses
| Model | Training set | DIW WHDR | ETH3D AbsRel | Sintel AbsRel | KITTI δ>1.25 | NYU δ>1.25 | TUM δ>1.25 |
Table 1. Comparison to the state of the art on monocular depth estimation. We evaluate zero-shot cross-dataset transfer according to the
protocol defined in [30]. Relative performance is computed with respect to the original MiDaS model [30]. Lower is better for all metrics. ([Ranftl et al., 2021](https://arxiv.org/abs/2103.13413))
| Ethical Considerations | Description |
| ----------- | ----------- |
| Data | The training data come from multiple image datasets compiled together. |
| Human life | The model is not intended to inform decisions central to human life or flourishing. It is an aggregated set of monocular depth image datasets. |
| Mitigations | No additional risk mitigation strategies were considered during model development. |
| Risks and harms | The extent of the risks involved by using the model remain unknown.|
| Use cases | - |
| Caveats and Recommendations |
| ----------- |
| There are no additional caveats or recommendations for this model. |