Update README.md

Suraj Patil 2022-03-14 18:01:04 +00:00 committed by huggingface-web
parent 2cea2ab5ae
commit 0993c71e8a
1 changed file with 2 additions and 7 deletions

@@ -17,15 +17,10 @@ January 2021
### Model Type
The base model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. There is also a variant of the model where the ResNet image encoder is replaced with a Vision Transformer.
The base model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
### Model Version
The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer.
Initially, we've released one CLIP model based on the Vision Transformer architecture equivalent to ViT-B/32, along with the RN50 model, using the architecture equivalent to ResNet-50.
*This port does not include the ResNet model.*
Please see the paper linked below for further details about their specification.
### Documents
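For context on the updated Model Type paragraph, below is a minimal sketch of the contrastive image–text similarity it describes, using the Hugging Face `transformers` CLIP classes. The checkpoint name, example image URL, and candidate captions are illustrative assumptions, not part of the card or this commit.

```python
# Sketch: score an image against candidate captions with CLIP.
# Assumes the ViT-L/14 checkpoint "openai/clip-vit-large-patch14"; swap in the
# repository's actual model ID if it differs.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Example image and captions (assumptions for illustration only).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarities produced by the
# contrastively trained encoders; softmax turns them into caption probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)
```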