Languages: multilingual, af, am, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, he, hi, hr, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lo, lt, lv, mg, mk, ml, mn, mr, ms, my, ne, nl, no, om, or, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, th, tl, tr, ug, uk, ur, uz, vi, xh, yi, zh

xlm-roberta-large-finetuned-conll03-english

Table of Contents

  1. Model Details
  2. Uses
  3. Bias, Risks, and Limitations
  4. Training
  5. Evaluation
  6. Environmental Impact
  7. Technical Specifications
  8. Citation
  9. Model Card Authors
  10. How to Get Started with the Model

Model Details

Model Description

The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019 and is a large multilingual language model trained on 2.5TB of filtered CommonCrawl data. This checkpoint is XLM-RoBERTa-large fine-tuned on the English portion of the CoNLL-2003 dataset.

  • Developed by: See associated paper
  • Model type: Multilingual language model
  • Language(s) (NLP): XLM-RoBERTa is a multilingual model trained on 100 different languages; see the GitHub Repo for the full list. This model is fine-tuned on a dataset in English.
  • License: More information needed
  • Related Models: RoBERTa, XLM
  • Resources for more information: GitHub Repo, Associated Paper

Uses

Direct Use

The model can be used directly for token classification, a natural language understanding task in which a label is assigned to some tokens in a text.
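
For reference, the label set this checkpoint predicts can be read from its configuration. A minimal sketch; the tags follow the BIO-style CoNLL-2003 scheme over PER, ORG, LOC, and MISC entities:

>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> sorted(config.id2label.values())  # BIO-style tags over the CoNLL-2003 entity types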

Downstream Use

Potential downstream use cases include Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. To learn more about token classification and other potential downstream use cases, see the Hugging Face token classification docs.

Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

Bias, Risks, and Limitations

CONTENT WARNING: Readers should be aware that output produced by this model may be disturbing or offensive to some and may propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). In the context of tasks relevant to this model, Mishra et al. (2020) explore social biases in NER systems for English and find systematic bias in existing NER systems: they fail to identify named entities from different demographic groups (though this paper did not look at BERT). For example, using a sample sentence from Mishra et al. (2020):

>>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> classifier = pipeline("ner", model=model, tokenizer=tokenizer)
>>> classifier("Alya told Jasmine that Andrew could pay with cash..")
[{'end': 2,
  'entity': 'I-PER',
  'index': 1,
  'score': 0.9997861,
  'start': 0,
  'word': '▁Al'},
 {'end': 4,
  'entity': 'I-PER',
  'index': 2,
  'score': 0.9998591,
  'start': 2,
  'word': 'ya'},
 {'end': 16,
  'entity': 'I-PER',
  'index': 4,
  'score': 0.99995816,
  'start': 10,
  'word': '▁Jasmin'},
 {'end': 17,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.9999584,
  'start': 16,
  'word': 'e'},
 {'end': 29,
  'entity': 'I-PER',
  'index': 7,
  'score': 0.99998057,
  'start': 23,
  'word': '▁Andrew'}]
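
Note that the raw pipeline output splits names into SentencePiece subwords (for example, "▁Al" and "ya" for "Alya"). If whole-entity spans are preferable, the pipeline's aggregation_strategy argument can merge them; a minimal sketch, reusing the model and tokenizer loaded above:

>>> classifier = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
>>> classifier("Alya told Jasmine that Andrew could pay with cash..")  # subword pieces are merged into entity_group spans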

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Training

For training data and training procedure details, see the XLM-RoBERTa-large model card, the CoNLL-2003 dataset card, and the associated paper.

Evaluation

See the associated paper for evaluation details.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: 500 32GB NVIDIA V100 GPUs (from the associated paper)
  • Hours used: More information needed
  • Cloud Provider: More information needed
  • Compute Region: More information needed
  • Carbon Emitted: More information needed
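
As a back-of-the-envelope illustration of the Lacoste et al. (2019) approach, emissions can be approximated as GPU power draw × GPU-hours × grid carbon intensity. In the sketch below, the power and intensity figures are generic assumptions, and the hours value is a pure placeholder, since training time is not reported:

>>> n_gpus = 500       # hardware reported in the associated paper
>>> power_kw = 0.30    # assumption: roughly 300 W draw per 32GB V100
>>> hours = 24         # placeholder only; actual training hours are not reported
>>> intensity = 0.432  # assumption: kgCO2e per kWh, a rough global grid average
>>> round(n_gpus * power_kw * hours * intensity, 1)  # estimated kgCO2e for the placeholder duration
1555.2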

Technical Specifications

See the associated paper for further details.

Citation

BibTeX:

@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

APA:

  • Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Model Card Authors

This model card was written by the team at Hugging Face.

How to Get Started with the Model

Use the code below to get started with the model. You can use this model directly within a pipeline for NER.

>>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> classifier = pipeline("ner", model=model, tokenizer=tokenizer)
>>> classifier("Hello I'm Omar and I live in Zürich.")

[{'end': 14,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.9999175,
  'start': 10,
  'word': '▁Omar'},
 {'end': 35,
  'entity': 'I-LOC',
  'index': 10,
  'score': 0.9999906,
  'start': 29,
  'word': '▁Zürich'}]
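
Since start and end are character offsets into the input string, entity surface forms can be recovered by slicing. A small follow-up, reusing the classifier and the input above (the expected result follows from the offsets shown):

>>> text = "Hello I'm Omar and I live in Zürich."
>>> results = classifier(text)
>>> [(text[r["start"]:r["end"]], r["entity"]) for r in results]
[('Omar', 'I-PER'), ('Zürich', 'I-LOC')]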