From f4881ba48ee4d21b7ed5602603b9e3e92eb1b346 Mon Sep 17 00:00:00 2001
From: Suraj Patil
Date: Mon, 14 Mar 2022 17:58:13 +0000
Subject: [PATCH] Update README.md

---
 README.md | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 05c0d5e..118eecd 100644
--- a/README.md
+++ b/README.md
@@ -17,15 +17,10 @@ January 2021
 
 ### Model Type
 
-The base model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. There is also a variant of the model where the ResNet image encoder is replaced with a Vision Transformer.
+The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
 
-### Model Version
+The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer.
 
-Initially, we’ve released one CLIP model based on the Vision Transformer architecture equivalent to ViT-B/32, along with the RN50 model, using the architecture equivalent to ResNet-50.
-
-*This port does not include the ResNet model.*
-
-Please see the paper linked below for further details about their specification.
 
 ### Documents
 
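
The Vision Transformer variant described in the updated section can be exercised end to end with the image and text encoders scoring (image, text) pairs, as in the following minimal sketch. It assumes the 🤗 Transformers `CLIPModel`/`CLIPProcessor` classes and the `openai/clip-vit-base-patch32` checkpoint identifier, neither of which is stated in the patch itself; adjust the checkpoint name and the example image URL to your setup.

```python
# Minimal sketch: contrastive image-text scoring with the ViT-B/32 CLIP variant.
# Assumption: the checkpoint is published as "openai/clip-vit-base-patch32" and the
# example image URL is reachable; both are illustrative, not taken from the patch.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
texts = ["a photo of a cat", "a photo of a dog"]

# The processor tokenizes the captions and preprocesses the image for the two encoders.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the scaled similarities between the image and each caption;
# softmax turns them into relative probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```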