Models roughly sorted by performance, or by FLOPs if not available.
`*` Evaluation numbers reported by their respective authors. All other numbers are provided by
running the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/) either with the released
weights or with API access. Due to subtle implementation differences as well as different zero-shot task framing, these
might not be directly comparable. See [this blog post](https://www.eleuther.ai/research-log/gpt3-model-sizes/) for more
details.
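
As a rough illustration, the sketch below shows how such a number could be reproduced with the harness from Python. The exact entry point, model argument format, and task names (`lm_eval.evaluator.simple_evaluate`, `pretrained=EleutherAI/gpt-j-6b`, `lambada_openai`, `hellaswag`) vary by harness version and are assumptions here, not the exact invocation used to produce the table.

```python
# Minimal sketch: zero-shot evaluation of released weights with lm-evaluation-harness.
# Flag and task names are illustrative and depend on the installed harness version.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=EleutherAI/gpt-j-6b",  # released weights (assumed model id)
    tasks=["lambada_openai", "hellaswag"],        # example zero-shot tasks
    num_fewshot=0,
)
print(results["results"])
```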
`†` Megatron-11B provides no comparable metrics, and several implementations using the released weights do not
reproduce the generation quality and evaluations (see 1, 2, 3). Thus, evaluation was not attempted.
`‡` These models have been trained with data which contains possible test set contamination. The OpenAI GPT-3 models
failed to deduplicate training data for certain test sets, while the GPT-Neo models, as well as this one, are
trained on the Pile, which has not been deduplicated against any test sets.