diff --git a/README.md b/README.md
index 52e27f4..5452b22 100644
--- a/README.md
+++ b/README.md
@@ -31,6 +31,8 @@ Disclaimer: The team releasing X-CLIP did not write a model card for this model
 X-CLIP is a minimal extension of [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs.
 
+![X-CLIP architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/xclip_architecture.png)
+
 This allows the model to be used for tasks like zero-shot, few-shot or fully supervised video classification and video-text retrieval.
 
 ## Intended uses & limitations
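As background for reviewers, the "trained in a contrastive way on (video, text) pairs" sentence in the hunk above refers to a symmetric InfoNCE-style objective: matched video/text embeddings sit on the diagonal of a similarity matrix and are pushed above the off-diagonal mismatches. The sketch below is a minimal NumPy illustration of that idea, not X-CLIP's actual implementation; the function name, embedding shapes, and temperature value are assumptions for the example.

```python
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over paired (video, text) embeddings.

    video_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    Illustrative only -- not the X-CLIP training code.
    """
    # L2-normalize so dot products become cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))     # matched pair i sits on the diagonal

    def cross_entropy(l, y):
        # softmax cross-entropy, averaged over the batch
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # symmetric: video-to-text and text-to-video directions
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```

At inference, the same similarity matrix (softmaxed over candidate label texts rather than over the batch) is what yields the zero-shot video-classification probabilities the paragraph mentions.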