Intel/dpt-large is a forked repo from huggingface. License: apache-2-0
Niels Rogge 1f614e9fc1 Update README.md 2022-04-14 08:25:35 +00:00
.gitattributes initial commit 2022-02-24 15:11:22 +00:00
README.md Update README.md 2022-04-14 08:25:35 +00:00
config.json Add model 2022-03-22 14:29:40 +01:00
preprocessor_config.json Add feature extractor 2022-03-22 15:11:54 +01:00
pytorch_model.bin Add model 2022-03-22 14:29:40 +01:00


License: apache-2.0
Tags: vision, depth-estimation
Widget examples:
Tiger: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
Teapot: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
Palace: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg

DPT (large-sized model)

Dense Prediction Transformer (DPT) model trained on 1.4 million images for monocular depth estimation. It was introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. and first released in this repository.

Disclaimer: The team releasing DPT did not write a model card for this model, so this model card was written by the Hugging Face team.

Model description

DPT uses the Vision Transformer (ViT) as its backbone and adds a neck and a head on top for monocular depth estimation.


Intended uses & limitations

You can use the raw model for zero-shot monocular depth estimation. See the model hub to look for fine-tuned versions on a task that interests you.

How to use

Here is how to use this model for zero-shot depth estimation on an image from the COCO 2017 validation set:

from transformers import DPTFeatureExtractor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

# prepare image for the model
inputs = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
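
The visualization step above divides by the maximum only, so a depth map whose smallest value is far from zero uses just part of the 0..255 range. As an alternative, here is a minimal sketch of min-max scaling. The helper name `depth_to_uint8` and the 2x2 example array are hypothetical and are not part of the original model card; the array merely stands in for `output`.

```python
import numpy as np


def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    """Min-max scale a float depth map to the 0..255 range as uint8.

    Hypothetical helper for illustration; not part of the transformers API.
    """
    depth = np.asarray(depth, dtype=np.float64)
    lo, hi = float(depth.min()), float(depth.max())
    if hi - lo < 1e-12:
        # Constant map: there is no range to stretch, so return all zeros.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - lo) / (hi - lo)  # values now lie in [0, 1]
    return np.round(scaled * 255.0).astype(np.uint8)


# Hypothetical 2x2 depth map standing in for `output` from the snippet above.
example = np.array([[2.0, 4.0], [6.0, 10.0]])
gray = depth_to_uint8(example)
```

The result can be passed to `Image.fromarray` in the same way as `formatted`. Whether max-only or min-max scaling looks better depends on the scene, so treat this as one option, not a replacement.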

For more code examples, see the documentation.
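
A shorter route may be the high-level `pipeline` API. The sketch below assumes an installed transformers version that includes the `depth-estimation` pipeline task; the wrapper name `estimate_depth` is hypothetical and only keeps the download out of module import time.

```python
def estimate_depth(image_url: str):
    """Return (predicted_depth, depth_image) for the image at `image_url`.

    Hypothetical convenience wrapper; assumes the installed transformers
    version provides the "depth-estimation" pipeline task.
    """
    # Imported lazily so that defining this helper does not load the library.
    from transformers import pipeline

    # Downloads the Intel/dpt-large checkpoint on first use.
    depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
    result = depth_estimator(image_url)

    # "predicted_depth" holds the raw tensor; "depth" is a PIL image
    # already scaled for viewing.
    return result["predicted_depth"], result["depth"]
```

Calling `estimate_depth("http://images.cocodataset.org/val2017/000000039769.jpg")` would run the model on the same COCO image as the longer example; the pipeline handles preprocessing and resizing internally.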

BibTeX entry and citation info

@article{DBLP:journals/corr/abs-2103-13413,
  author    = {Ren{\'{e}} Ranftl and
               Alexey Bochkovskiy and
               Vladlen Koltun},
  title     = {Vision Transformers for Dense Prediction},
  journal   = {CoRR},
  volume    = {abs/2103.13413},
  year      = {2021},
  url       = {https://arxiv.org/abs/2103.13413},
  eprinttype = {arXiv},
  eprint    = {2103.13413},
  timestamp = {Wed, 07 Apr 2021 15:31:46 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2103-13413.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}