* Each layer consists of one feedforward block and one self attention block.
+† Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer.
The model consists of 28 layers with a model dimension of 4096, and a feedforward dimension of 16384. The model
dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64