From f793746a8fff7a83f266fa0df91064727c8c76a2 Mon Sep 17 00:00:00 2001
From: Luca Papariello
Date: Thu, 25 Nov 2021 12:41:12 +0000
Subject: [PATCH] Add evaluation results

---
 README.md | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 64 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index b1d009a..78073b7 100644
--- a/README.md
+++ b/README.md
@@ -14,6 +14,11 @@ model-index:
 This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the [Language Identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) dataset.
 
+## Model description
+
+This model is an XLM-RoBERTa transformer model with a classification head on top (i.e. a linear layer on top of the pooled output).
+For additional information, please refer to the [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) model card or to the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Conneau et al.
+
 ## Intended uses & limitations
 
 You can directly use this model as a language detector, i.e. for sequence classification tasks. Currently, it supports the following 20 languages:
@@ -22,13 +27,62 @@ You can directly use this model as a language detector, i.e. for sequence classi
 
 ## Training and evaluation data
 
-It achieves the following results on the evaluation set:
-- Loss: 0.0103
-- Accuracy: 0.9977
-- F1: 0.9977
+The model was fine-tuned on the [Language Identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) dataset, which consists of text sequences in 20 languages. The training set contains 70k samples, while the validation and test sets contain 10k samples each. The average accuracy on the test set is **99.6%** (this coincides with the macro- and weighted-averaged F1 scores, since the test set is perfectly balanced). A per-language breakdown is given in the following table.
+
+| Language | Precision | Recall | F1-score | Support |
+|:--------:|:---------:|:------:|:--------:|:-------:|
+|ar |0.998 |0.996 |0.997 |500 |
+|bg |0.998 |0.964 |0.981 |500 |
+|de |0.998 |0.996 |0.997 |500 |
+|el |0.996 |1.000 |0.998 |500 |
+|en |1.000 |1.000 |1.000 |500 |
+|es |0.967 |1.000 |0.983 |500 |
+|fr |1.000 |1.000 |1.000 |500 |
+|hi |0.994 |0.992 |0.993 |500 |
+|it |1.000 |0.992 |0.996 |500 |
+|ja |0.996 |0.996 |0.996 |500 |
+|nl |1.000 |1.000 |1.000 |500 |
+|pl |1.000 |1.000 |1.000 |500 |
+|pt |0.988 |1.000 |0.994 |500 |
+|ru |1.000 |0.994 |0.997 |500 |
+|sw |1.000 |1.000 |1.000 |500 |
+|th |1.000 |0.998 |0.999 |500 |
+|tr |0.994 |0.992 |0.993 |500 |
+|ur |1.000 |1.000 |1.000 |500 |
+|vi |0.992 |1.000 |0.996 |500 |
+|zh |1.000 |1.000 |1.000 |500 |
+
+### Benchmarks
+
+As a baseline for `xlm-roberta-base-language-detection`, we used the Python [langid](https://github.com/saffsd/langid.py) library. Since it comes pre-trained on 97 languages, we used its `.set_languages()` method to constrain the language set to our 20 languages. The average accuracy of langid on the test set is **98.5%**. Short usage sketches for both this model and the langid baseline follow; langid's per-language results are given in the table after them.
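+
+Below is a minimal sketch of how this model can be queried through the `transformers` pipeline API. The checkpoint identifier is an assumption based on this card's name; adjust it if the model is hosted under a different id.
+
+```python
+from transformers import pipeline
+
+# NOTE: checkpoint id assumed from this card's name; adjust if it differs.
+classifier = pipeline(
+    "text-classification",
+    model="papluca/xlm-roberta-base-language-detection",
+)
+
+texts = [
+    "Brevity is the soul of wit.",
+    "Amor, ch'a nullo amato amar perdona.",
+]
+print(classifier(texts))
+# e.g. [{'label': 'en', 'score': 0.99...}, {'label': 'it', 'score': 0.99...}]
+```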
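+
+The langid baseline can be reproduced in a few lines; here is a sketch, assuming `langid` is installed (e.g. via `pip install langid`). The 20 language codes mirror the tables on this card.
+
+```python
+import langid
+
+# Restrict langid's 97 pre-trained languages to the 20 covered by this model.
+langid.set_languages([
+    "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
+    "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh",
+])
+
+# classify() returns a (language, score) tuple.
+print(langid.classify("Dies ist ein Satz auf Deutsch."))  # e.g. ('de', ...)
+```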
+
+| Language | Precision | Recall | F1-score | Support |
+|:--------:|:---------:|:------:|:--------:|:-------:|
+|ar |0.990 |0.970 |0.980 |500 |
+|bg |0.998 |0.964 |0.981 |500 |
+|de |0.992 |0.944 |0.967 |500 |
+|el |1.000 |0.998 |0.999 |500 |
+|en |1.000 |1.000 |1.000 |500 |
+|es |1.000 |0.968 |0.984 |500 |
+|fr |0.996 |1.000 |0.998 |500 |
+|hi |0.949 |0.976 |0.963 |500 |
+|it |0.990 |0.980 |0.985 |500 |
+|ja |0.927 |0.988 |0.956 |500 |
+|nl |0.980 |1.000 |0.990 |500 |
+|pl |0.986 |0.996 |0.991 |500 |
+|pt |0.950 |0.996 |0.973 |500 |
+|ru |0.996 |0.974 |0.985 |500 |
+|sw |1.000 |1.000 |1.000 |500 |
+|th |1.000 |0.996 |0.998 |500 |
+|tr |0.990 |0.968 |0.979 |500 |
+|ur |0.998 |0.996 |0.997 |500 |
+|vi |0.971 |0.990 |0.980 |500 |
+|zh |1.000 |1.000 |1.000 |500 |
 
 ## Training procedure
 
+Fine-tuning was done via the `Trainer` API; a minimal sketch of this setup is given below, after the training results.
+
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
@@ -43,11 +97,17 @@ The following hyperparameters were used during training:
 
 ### Training results
 
+The validation results on the `valid` split of the Language Identification dataset are summarised below.
+
 | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 |
 |:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|
 | 0.2492        | 1.0   | 1094 | 0.0149          | 0.9969   | 0.9969 |
 | 0.0101        | 2.0   | 2188 | 0.0103          | 0.9977   | 0.9977 |
 
+In short, the model achieves the following results on the validation set:
+- Loss: 0.0103
+- Accuracy: 0.9977
+- F1: 0.9977
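+
+For completeness, the sketch below illustrates the `Trainer` setup referenced above. It is not the exact training script: only the number of epochs (2) is taken from the results table, while the dataset schema, preprocessing, and remaining hyperparameters are illustrative assumptions.
+
+```python
+from datasets import load_dataset
+from transformers import (
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+    Trainer,
+    TrainingArguments,
+)
+
+# Dataset and base-model ids are taken from the links in this card; the
+# column names ("text", "labels") are assumptions about the dataset schema.
+ds = load_dataset("papluca/language-identification")
+langs = sorted(ds["train"].unique("labels"))
+label2id = {lang: i for i, lang in enumerate(langs)}
+
+tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
+
+def preprocess(batch):
+    enc = tokenizer(batch["text"], truncation=True, max_length=128)
+    enc["label"] = [label2id[lang] for lang in batch["labels"]]
+    return enc
+
+ds = ds.map(preprocess, batched=True, remove_columns=["text", "labels"])
+
+model = AutoModelForSequenceClassification.from_pretrained(
+    "xlm-roberta-base",
+    num_labels=len(langs),
+    id2label={i: lang for lang, i in label2id.items()},
+    label2id=label2id,
+)
+
+# Only num_train_epochs (2) comes from the results table above; the other
+# hyperparameters are placeholders, not the values used for this model.
+args = TrainingArguments(
+    output_dir="xlm-roberta-base-language-detection",
+    num_train_epochs=2,
+    evaluation_strategy="epoch",
+)
+
+trainer = Trainer(
+    model=model,
+    args=args,
+    train_dataset=ds["train"],
+    eval_dataset=ds["valid"],
+    tokenizer=tokenizer,
+)
+trainer.train()
+```
 
 ### Framework versions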