---
language: en
datasets:
- conll2003
widget:
- text: "My name is jean-baptiste and I live in montreal"
- text: "My name is clara and I live in berkeley, california."
- text: "My name is wolfgang and I live in berlin"
train-eval-index:
- config: conll2003
  task: token-classification
  task_id: entity_extraction
  splits:
    eval_split: validation
  col_mapping:
    tokens: tokens
    ner_tags: tags
license: mit
---

# roberta-large-ner-english: model fine-tuned from roberta-large for NER task

## Introduction

[roberta-large-ner-english] is an English NER model fine-tuned from roberta-large on the conll2003 dataset. The model was validated on email/chat data and outperformed other models on this type of data specifically. In particular, it seems to work better on entities that don't start with an upper-case letter.

## Training data

Training data was classified as follows:

| Abbreviation | Description               |
|--------------|---------------------------|
| O            | Outside of a named entity |
| MISC         | Miscellaneous entity      |
| PER          | Person's name             |
| ORG          | Organization              |
| LOC          | Location                  |

To simplify, the B- and I- prefixes from the original conll2003 labels were removed. I used the train and test splits of the original conll2003 dataset for training, and the "validation" split for validation. This resulted in datasets of the following sizes:

| Train | Validation |
|-------|------------|
| 17494 | 3250       |
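
The exact preprocessing code is not shown here; below is a minimal sketch of how the B-/I- prefixes could be stripped with the `datasets` library (an illustration under these assumptions, not the author's original script):

```python
from datasets import load_dataset

dataset = load_dataset("conll2003")
# Original tag names are B-/I- prefixed, e.g. "B-PER", "I-ORG", plus "O"
label_names = dataset["train"].features["ner_tags"].feature.names

def simplify(example):
    # Keep only the entity type: "B-PER" / "I-PER" -> "PER", "O" stays "O"
    example["simple_tags"] = [label_names[i].split("-")[-1] for i in example["ner_tags"]]
    return example

simplified = dataset.map(simplify)
print(simplified["train"][0]["tokens"])
print(simplified["train"][0]["simple_tags"])
```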

## How to use roberta-large-ner-english with HuggingFace

Load roberta-large-ner-english and its sub-word tokenizer:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/roberta-large-ner-english")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/roberta-large-ner-english")
```
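
The repository also ships TensorFlow weights (tf_model.h5); a minimal sketch of loading them, assuming TensorFlow is installed:

```python
from transformers import TFAutoModelForTokenClassification

# Loads the TensorFlow weights shipped with the repository
tf_model = TFAutoModelForTokenClassification.from_pretrained("Jean-Baptiste/roberta-large-ner-english")
```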


##### Process text sample (from Wikipedia)

```python
from transformers import pipeline

nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp("Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne to develop and sell Wozniak's Apple I personal computer")
```


```python
[{'entity_group': 'ORG',
  'score': 0.99381506,
  'word': ' Apple',
  'start': 0,
  'end': 5},
 {'entity_group': 'PER',
  'score': 0.99970853,
  'word': ' Steve Jobs',
  'start': 29,
  'end': 39},
 {'entity_group': 'PER',
  'score': 0.99981767,
  'word': ' Steve Wozniak',
  'start': 41,
  'end': 54},
 {'entity_group': 'PER',
  'score': 0.99956465,
  'word': ' Ronald Wayne',
  'start': 59,
  'end': 71},
 {'entity_group': 'PER',
  'score': 0.9997918,
  'word': ' Wozniak',
  'start': 92,
  'end': 99},
 {'entity_group': 'MISC',
  'score': 0.99956393,
  'word': ' Apple I',
  'start': 102,
  'end': 109}]
```
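
As noted in the introduction, the model also handles entities that are not capitalized. A quick check with the `nlp` pipeline defined above, using one of the widget examples from the model card (exact scores will vary):

```python
# Lower-cased input: the model should still tag "wolfgang" as PER and "berlin" as LOC
nlp("my name is wolfgang and i live in berlin")
```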

## Model performances

Model performance computed on the conll2003 validation dataset (at the token-prediction level):

| entity  | precision | recall | f1     |
|---------|-----------|--------|--------|
| PER     | 0.9914    | 0.9927 | 0.9920 |
| ORG     | 0.9627    | 0.9661 | 0.9644 |
| LOC     | 0.9795    | 0.9862 | 0.9828 |
| MISC    | 0.9292    | 0.9262 | 0.9277 |
| Overall | 0.9740    | 0.9766 | 0.9753 |
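
The exact evaluation script is not included here; as an illustration only, per-entity precision/recall/F1 on flattened token-level labels could be computed with scikit-learn's `classification_report` (an assumption, not the author's original code):

```python
from sklearn.metrics import classification_report

# y_true / y_pred would be flat lists of token-level labels over the validation
# split, using the simplified tag set (O / PER / ORG / LOC / MISC).
y_true = ["O", "PER", "PER", "O", "ORG", "LOC"]
y_pred = ["O", "PER", "O",   "O", "ORG", "LOC"]

print(classification_report(y_true, y_pred, labels=["PER", "ORG", "LOC", "MISC"], digits=4))
```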

On a private dataset (emails, chat, informal discussion), computed on word predictions:

| entity | precision | recall | f1     |
|--------|-----------|--------|--------|
| PER    | 0.8823    | 0.9116 | 0.8967 |
| ORG    | 0.7694    | 0.7292 | 0.7487 |
| LOC    | 0.8619    | 0.7768 | 0.8171 |

For comparison, on the same private dataset, spaCy (en_core_web_trf-3.2.0) gave:

| entity | precision | recall | f1     |
|--------|-----------|--------|--------|
| PER    | 0.9146    | 0.8287 | 0.8695 |
| ORG    | 0.7655    | 0.6437 | 0.6993 |
| LOC    | 0.8727    | 0.6180 | 0.7236 |
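
For reference, the spaCy numbers above come from running a standard spaCy pipeline on the same texts; a minimal sketch of such a run (assuming spaCy and the `en_core_web_trf` model are installed; note that spaCy uses its own label set, e.g. PERSON/ORG/GPE):

```python
import spacy

# Transformer-based English pipeline used in the comparison above
nlp_spacy = spacy.load("en_core_web_trf")
doc = nlp_spacy("My name is wolfgang and I live in berlin")
for ent in doc.ents:
    print(ent.text, ent.label_)
```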

For those who might be interested, here is a short article on how I used the results of this model to train an LSTM model for signature detection in emails: https://medium.com/@jean-baptiste.polle/lstm-model-for-email-signature-detection-8e990384fefa