Compare commits

cc63721791a6e1d60f4764997bbb311667ec75d8..58670260a454d0ae6edc17c9cc9095c8e70d9aab

No commits in common. "cc63721791a6e1d60f4764997bbb311667ec75d8" and "58670260a454d0ae6edc17c9cc9095c8e70d9aab" have entirely different histories.

5 changed files with 30 additions and 46 deletions

.gitattributes (vendored)

@@ -15,4 +15,3 @@
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
model.safetensors filter=lfs diff=lfs merge=lfs -text

README.md

@@ -4,8 +4,6 @@ datasets:
- Jean-Baptiste/wikiner_fr
widget:
- text: "Je m'appelle jean-baptiste et je vis à montréal"
- text: "george washington est allé à washington"
license: mit
---
# camembert-ner: model fine-tuned from camemBERT for NER task.
@@ -14,20 +12,9 @@ license: mit
[camembert-ner] is a NER model that was fine-tuned from camemBERT on wikiner-fr dataset.
Model was trained on wikiner-fr dataset (~170 634 sentences).
Model was validated on emails/chat data and overperformed other models on this type of data specifically.
Model was validated on emails/chat data and surperformed other models on this type of data specifically.
In particular the model seems to work better on entity that don't start with an upper case.
## Training data
Training data was classified as follow:
Abbreviation|Description
-|-
O |Outside of a named entity
MISC |Miscellaneous entity
PER |Persons name
ORG |Organization
LOC |Location
## How to use camembert-ner with HuggingFace
@@ -44,7 +31,7 @@ model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert
from transformers import pipeline
nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
nlp("Apple est créée le 1er avril 1976 dans le garage de la maison d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak et Ronald Wayne14, puis constituée sous forme de société le 3 janvier 1977 à l'origine sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification de ses produits, le mot « computer » est retiré le 9 janvier 2015.")
@@ -94,23 +81,27 @@ nlp("Apple est créée le 1er avril 1976 dans le garage de la maison d'enfance d
## Model performances (metric: seqeval)
Overall
precision|recall|f1
-|-|-
0.8859|0.8971|0.8914
Global
```
'precision': 0.8859
'recall': 0.8971
'f1': 0.8914
```
By entity
```
'LOC': {'precision': 0.8905576596578294,
'recall': 0.900554675118859,
'f1': 0.8955282684352223},
'MISC': {'precision': 0.8175627240143369,
'recall': 0.8117437722419929,
'f1': 0.8146428571428571},
'ORG': {'precision': 0.8099480326651819,
'recall': 0.8265151515151515,
'f1': 0.8181477315335584},
'PER': {'precision': 0.9372509960159362,
'recall': 0.959812321501428,
'f1': 0.9483975005039308}
entity|precision|recall|f1
-|-|-|-
PER|0.9372|0.9598|0.9483
ORG|0.8099|0.8265|0.8181
LOC|0.8905|0.9005|0.8955
MISC|0.8175|0.8117|0.8146
```
For those who could be interested, here is a short article on how I used the results of this model to train a LSTM model for signature detection in emails:
https://medium.com/@jean-baptiste.polle/lstm-model-for-email-signature-detection-8e990384fefa
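The per-entity scores above come from seqeval; as a point of reference, here is a toy sketch of how such precision/recall/f1 figures are typically computed. The tag sequences are invented for illustration, not the wikiner-fr evaluation data, which is not part of this diff.

```python
# Toy illustration of computing seqeval metrics like the scores above;
# the gold/predicted tag sequences here are made up.
from seqeval.metrics import classification_report, f1_score

y_true = [["O", "B-PER", "I-PER", "O", "B-LOC"],
          ["B-ORG", "O", "O", "B-MISC"]]
y_pred = [["O", "B-PER", "I-PER", "O", "O"],
          ["B-ORG", "O", "O", "B-MISC"]]

print(f1_score(y_true, y_pred))                         # overall f1
print(classification_report(y_true, y_pred, digits=4))  # per-entity breakdown
```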

config.json

@@ -12,19 +12,19 @@
"hidden_size": 768,
"id2label": {
"0": "O",
"1": "I-LOC",
"2": "I-PER",
"3": "I-MISC",
"4": "I-ORG"
"1": "LOC",
"2": "PER",
"3": "MISC",
"4": "ORG"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"I-LOC": 1,
"I-MISC": 3,
"LOC": 1,
"MISC": 3,
"O": 0,
"I-ORG": 4,
"I-PER": 2
"ORG": 4,
"PER": 2
},
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,

model.onnx (Stored with Git LFS): binary file not shown.

model.safetensors (Stored with Git LFS): binary file not shown.