diff --git a/README.md b/README.md
index ac5f2fc..0cfefb2 100644
--- a/README.md
+++ b/README.md
@@ -24,16 +24,18 @@ Training data was classified as follow:
 
 Abbreviation|Description
 -|-
-O| Outside of a named entity
-MISC | Miscellaneous entity
-PER | Person’s name
-ORG | Organization
-LOC | Location
+O |Outside of a named entity
+MISC |Miscellaneous entity
+PER |Person’s name
+ORG |Organization
+LOC |Location
 
 In order to simplify, the prefix B- or I- from original conll2003 was removed.
 I used the train and test dataset from original conll2003 for training and the "validation" dataset for validation. This resulted in a dataset of size:
-Train | 17494
-Validation | 3250
+
+Train | Validation
+-|-
+17494 | 3250
 
 
 ## How to use camembert-ner with HuggingFace
@@ -90,31 +92,30 @@ nlp("Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne to
 
 ## Model performances
 
 Model performances computed on conll2003 validation dataset (computed on the tokens predictions)
-```
-entity | precision | recall | f1
-- | - | - | -
-PER | 0.9914 | 0.9927 | 0.9920
-ORG | 0.9627 | 0.9661 | 0.9644
-LOC | 0.9795 | 0.9862 | 0.9828
-MISC | 0.9292 | 0.9262 | 0.9277
-Overall | 0.9740 | 0.9766 | 0.9753
-```
+
+entity|precision|recall|f1
+-|-|-|-
+PER|0.9914|0.9927|0.9920
+ORG|0.9627|0.9661|0.9644
+LOC|0.9795|0.9862|0.9828
+MISC|0.9292|0.9262|0.9277
+Overall|0.9740|0.9766|0.9753
+
 
 On private dataset (email, chat, informal discussion), computed on word predictions:
-```
-entity | precision | recall | f1
-- | - | - | -
-PER | 0.8823 | 0.9116 | 0.8967
-ORG | 0.7694 | 0.7292 | 0.7487
-LOC | 0.8619 | 0.7768 | 0.8171
-```
-Spacy (en_core_web_trf-3.2.0) on the same private dataset was giving:
-```
-entity | precision | recall | f1
-- | - | - | -
-PER | 0.9146 | 0.8287 | 0.8695
-ORG | 0.7655 | 0.6437 | 0.6993
-LOC | 0.8727 | 0.6180 | 0.7236
-```
+entity|precision|recall|f1
+-|-|-|-
+PER|0.8823|0.9116|0.8967
+ORG|0.7694|0.7292|0.7487
+LOC|0.8619|0.7768|0.8171
+
+By comparison, on the same private dataset, Spacy (en_core_web_trf-3.2.0) gave:
+
+entity|precision|recall|f1
+-|-|-|-
+PER|0.9146|0.8287|0.8695
+ORG|0.7655|0.6437|0.6993
+LOC|0.8727|0.6180|0.7236
+
 
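
For reference (not part of the diff): a minimal sketch of how the `nlp(...)` call shown in the second hunk's context is typically constructed with the HuggingFace `transformers` pipeline API. The model identifier `Jean-Baptiste/camembert-ner` and the `aggregation_strategy="simple"` setting are assumptions inferred from the README title, not taken from this diff.

```python
# Sketch only: assumes the model is published as "Jean-Baptiste/camembert-ner"
# (inferred from the README heading); swap in the actual repository id if different.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "Jean-Baptiste/camembert-ner"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# aggregation_strategy="simple" groups sub-word tokens into word-level entity spans.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

print(nlp("Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne"))
```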