From f793746a8fff7a83f266fa0df91064727c8c76a2 Mon Sep 17 00:00:00 2001
From: Luca Papariello
Date: Thu, 25 Nov 2021 12:41:12 +0000
Subject: [PATCH] Add evaluation results

---
 README.md | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 64 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index b1d009a..78073b7 100644
--- a/README.md
+++ b/README.md
@@ -14,6 +14,11 @@ model-index:
 This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the [Language Identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) dataset.
 
+## Model description
+
+This model is an XLM-RoBERTa transformer model with a classification head on top (i.e. a linear layer on top of the pooled output).
+For additional information, please refer to the [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) model card or to the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Conneau et al.
+
 ## Intended uses & limitations
 
 You can directly use this model as a language detector, i.e. for sequence classification tasks. Currently, it supports the following 20 languages:
@@ -22,13 +27,62 @@ You can directly use this model as a language detector, i.e. for sequence classi
 
 ## Training and evaluation data
 
-It achieves the following results on the evaluation set:
-- Loss: 0.0103
-- Accuracy: 0.9977
-- F1: 0.9977
+The model was fine-tuned on the [Language Identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) dataset, which consists of text sequences in 20 languages. The training set contains 70k samples, while the validation and test sets contain 10k samples each. The average accuracy on the test set is **99.6%** (this coincides with the macro- and weighted-averaged F1 scores, since the test set is perfectly balanced). A per-language breakdown is given in the following table.
+
+| Language | Precision | Recall | F1-score | Support |
+|:--------:|:---------:|:------:|:--------:|:-------:|
+|ar |0.998 |0.996 |0.997 |500 |
+|bg |0.998 |0.964 |0.981 |500 |
+|de |0.998 |0.996 |0.997 |500 |
+|el |0.996 |1.000 |0.998 |500 |
+|en |1.000 |1.000 |1.000 |500 |
+|es |0.967 |1.000 |0.983 |500 |
+|fr |1.000 |1.000 |1.000 |500 |
+|hi |0.994 |0.992 |0.993 |500 |
+|it |1.000 |0.992 |0.996 |500 |
+|ja |0.996 |0.996 |0.996 |500 |
+|nl |1.000 |1.000 |1.000 |500 |
+|pl |1.000 |1.000 |1.000 |500 |
+|pt |0.988 |1.000 |0.994 |500 |
+|ru |1.000 |0.994 |0.997 |500 |
+|sw |1.000 |1.000 |1.000 |500 |
+|th |1.000 |0.998 |0.999 |500 |
+|tr |0.994 |0.992 |0.993 |500 |
+|ur |1.000 |1.000 |1.000 |500 |
+|vi |0.992 |1.000 |0.996 |500 |
+|zh |1.000 |1.000 |1.000 |500 |
+
+### Benchmarks
+
+As a baseline for `xlm-roberta-base-language-detection`, we used the Python [langid](https://github.com/saffsd/langid.py) library. Since it comes pre-trained on 97 languages, we used its `.set_languages()` method to constrain the language set to our 20 languages. The average accuracy of langid on the test set is **98.5%**. Short usage sketches for both this model and the langid baseline follow; langid's per-language results are given in the table after them.
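+
+Below is a minimal sketch of how this model can be queried through the `transformers` pipeline API. The checkpoint identifier is an assumption based on this card's name; adjust it if the model is hosted under a different id.
+
+```python
+from transformers import pipeline
+
+# NOTE: checkpoint id assumed from this card's name; adjust if it differs.
+classifier = pipeline(
+    "text-classification",
+    model="papluca/xlm-roberta-base-language-detection",
+)
+
+texts = [
+    "Brevity is the soul of wit.",
+    "Amor, ch'a nullo amato amar perdona.",
+]
+print(classifier(texts))
+# e.g. [{'label': 'en', 'score': 0.99...}, {'label': 'it', 'score': 0.99...}]
+```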
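+
+The langid baseline can be reproduced in a few lines; here is a sketch, assuming `langid` is installed (e.g. via `pip install langid`). The 20 language codes mirror the tables on this card.
+
+```python
+import langid
+
+# Restrict langid's 97 pre-trained languages to the 20 covered by this model.
+langid.set_languages([
+    "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
+    "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh",
+])
+
+# classify() returns a (language, score) tuple.
+print(langid.classify("Dies ist ein Satz auf Deutsch."))  # e.g. ('de', ...)
+```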
+
+| Language | Precision | Recall | F1-score | Support |
+|:--------:|:---------:|:------:|:--------:|:-------:|
+|ar |0.990 |0.970 |0.980 |500 |
+|bg |0.998 |0.964 |0.981 |500 |
+|de |0.992 |0.944 |0.967 |500 |
+|el |1.000 |0.998 |0.999 |500 |
+|en |1.000 |1.000 |1.000 |500 |
+|es |1.000 |0.968 |0.984 |500 |
+|fr |0.996 |1.000 |0.998 |500 |
+|hi |0.949 |0.976 |0.963 |500 |
+|it |0.990 |0.980 |0.985 |500 |
+|ja |0.927 |0.988 |0.956 |500 |
+|nl |0.980 |1.000 |0.990 |500 |
+|pl |0.986 |0.996 |0.991 |500 |
+|pt |0.950 |0.996 |0.973 |500 |
+|ru |0.996 |0.974 |0.985 |500 |
+|sw |1.000 |1.000 |1.000 |500 |
+|th |1.000 |0.996 |0.998 |500 |
+|tr |0.990 |0.968 |0.979 |500 |
+|ur |0.998 |0.996 |0.997 |500 |
+|vi |0.971 |0.990 |0.980 |500 |
+|zh |1.000 |1.000 |1.000 |500 |
 
 ## Training procedure
 
+Fine-tuning was done via the `Trainer` API; a minimal sketch of this setup is given below, after the training results.
+
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
@@ -43,11 +97,17 @@ The following hyperparameters were used during training:
 
 ### Training results
 
+The validation results on the `valid` split of the Language Identification dataset are summarised below.
+
 | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 |
 |:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|
 | 0.2492        | 1.0   | 1094 | 0.0149          | 0.9969   | 0.9969 |
 | 0.0101        | 2.0   | 2188 | 0.0103          | 0.9977   | 0.9977 |
 
+In short, the model achieves the following results on the validation set:
+- Loss: 0.0103
+- Accuracy: 0.9977
+- F1: 0.9977
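+
+For completeness, the sketch below illustrates the `Trainer` setup referenced above. It is not the exact training script: only the number of epochs (2) is taken from the results table, while the dataset schema, preprocessing, and remaining hyperparameters are illustrative assumptions.
+
+```python
+from datasets import load_dataset
+from transformers import (
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+    Trainer,
+    TrainingArguments,
+)
+
+# Dataset and base-model ids are taken from the links in this card; the
+# column names ("text", "labels") are assumptions about the dataset schema.
+ds = load_dataset("papluca/language-identification")
+langs = sorted(ds["train"].unique("labels"))
+label2id = {lang: i for i, lang in enumerate(langs)}
+
+tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
+
+def preprocess(batch):
+    enc = tokenizer(batch["text"], truncation=True, max_length=128)
+    enc["label"] = [label2id[lang] for lang in batch["labels"]]
+    return enc
+
+ds = ds.map(preprocess, batched=True, remove_columns=["text", "labels"])
+
+model = AutoModelForSequenceClassification.from_pretrained(
+    "xlm-roberta-base",
+    num_labels=len(langs),
+    id2label={i: lang for lang, i in label2id.items()},
+    label2id=label2id,
+)
+
+# Only num_train_epochs (2) comes from the results table above; the other
+# hyperparameters are placeholders, not the values used for this model.
+args = TrainingArguments(
+    output_dir="xlm-roberta-base-language-detection",
+    num_train_epochs=2,
+    evaluation_strategy="epoch",
+)
+
+trainer = Trainer(
+    model=model,
+    args=args,
+    train_dataset=ds["train"],
+    eval_dataset=ds["valid"],
+    tokenizer=tokenizer,
+)
+trainer.train()
+```
 
 ### Framework versions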