From 0993c71e8ad62658387de2714a69f723ddfffacb Mon Sep 17 00:00:00 2001
From: Suraj Patil <surajp815@gmail.com>
Date: Mon, 14 Mar 2022 18:01:04 +0000
Subject: [PATCH] Update README.md

---
 README.md | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 4bce29f..bcfe824 100644
--- a/README.md
+++ b/README.md
@@ -17,15 +17,10 @@ January 2021
 
 ### Model Type
 
-The base model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. There is also a variant of the model where the ResNet image encoder is replaced with a Vision Transformer.
+The base model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
 
-### Model Version
+The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository contains the Vision Transformer variant.
 
-Initially, we’ve released one CLIP model based on the Vision Transformer architecture equivalent to ViT-B/32, along with the RN50 model, using the architecture equivalent to ResNet-50.
-
-*This port does not include the ResNet model.*
-
-Please see the paper linked below for further details about their specification.
 
 ### Documents