---
license: apache-2.0
tags:
- vision
- depth-estimation
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
  example_title: Tiger
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
  example_title: Teapot
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
  example_title: Palace
---

# DPT (large-sized model)

Dense Prediction Transformer (DPT) model trained on 1.4 million images for monocular depth estimation. It was introduced in the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by Ranftl et al. and first released in [this repository](https://github.com/isl-org/DPT).

Disclaimer: The team releasing DPT did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

DPT uses the Vision Transformer (ViT) as its backbone and adds a neck + head on top for monocular depth estimation.

![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dpt_architecture.png)

## Intended uses & limitations

You can use the raw model for zero-shot monocular depth estimation. See the [model hub](https://huggingface.co/models?search=dpt) to look for
fine-tuned versions on a task that interests you.

### How to use

Here is how to use this model for zero-shot depth estimation on an image from the COCO 2017 dataset:

```python
from transformers import DPTFeatureExtractor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

# prepare image for the model
inputs = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
```

For more code examples, we refer to the [documentation](https://huggingface.co/docs/transformers/master/en/model_doc/dpt). A lighter-weight alternative using the `depth-estimation` pipeline is sketched at the end of this card.

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2103-13413,
  author     = {Ren{\'{e}} Ranftl and
                Alexey Bochkovskiy and
                Vladlen Koltun},
  title      = {Vision Transformers for Dense Prediction},
  journal    = {CoRR},
  volume     = {abs/2103.13413},
  year       = {2021},
  url        = {https://arxiv.org/abs/2103.13413},
  eprinttype = {arXiv},
  eprint     = {2103.13413},
  timestamp  = {Wed, 07 Apr 2021 15:31:46 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2103-13413.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
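
### Quick start with the pipeline API

As an alternative to the manual pre- and post-processing shown above, recent versions of Transformers also expose a `depth-estimation` pipeline. This is a minimal sketch, assuming you are on a Transformers release that ships `DepthEstimationPipeline`; the exact output keys may vary across versions.

```python
from transformers import pipeline

# load model and feature extractor behind a single pipeline object
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

# the pipeline accepts an image URL, a local path, or a PIL image
result = depth_estimator("http://images.cocodataset.org/val2017/000000039769.jpg")

result["predicted_depth"]  # raw depth tensor predicted by the model
result["depth"]            # depth map rendered as a PIL image
```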