BaptisteDoyen/camembert-base-xnli is a repository forked from Hugging Face. License: MIT
.gitattributes initial commit 2021-03-24 16:43:34 +00:00
README.md Update README.md 2021-04-08 14:11:55 +00:00
config.json change labels in config file 2021-03-25 11:43:36 +00:00
pytorch_model.bin adding model and tokenizer files 2021-03-24 16:59:02 +00:00
sentencepiece.bpe.model adding model and tokenizer files 2021-03-24 16:59:02 +00:00
special_tokens_map.json adding model and tokenizer files 2021-03-24 16:59:02 +00:00
tf_model.h5 Add TF weights (#1) 2022-06-29 09:30:26 +00:00
tokenizer_config.json change config files 2021-03-25 11:19:48 +00:00

README.md

language: fr
thumbnail:
tags: zero-shot-classification, xnli, nli, fr
license: mit
pipeline_tag: zero-shot-classification
datasets: xnli
metrics: accuracy

camembert-base-xnli

Model description

CamemBERT-base model fine-tuned on the French part of the XNLI dataset.
One of the few zero-shot classification models that work on French 🇫🇷

Intended uses & limitations

How to use

The model can be used in two ways:

  • As a zero-shot sequence classifier:
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="BaptisteDoyen/camembert-base-xnli")

sequence = "L'équipe de France joue aujourd'hui au Parc des Princes"
candidate_labels = ["sport","politique","science"]
hypothesis_template = "Ce texte parle de {}."    

classifier(sequence, candidate_labels, hypothesis_template=hypothesis_template)     
# outputs :                                        
# {'sequence': "L'équipe de France joue aujourd'hui au Parc des Princes",
# 'labels': ['sport', 'politique', 'science'],
# 'scores': [0.8595073223114014, 0.10821866989135742, 0.0322740375995636]}                      
  • As a premise/hypothesis checker:
    The idea here is to compute the probability that the premise entails the hypothesis, i.e. \( P(\text{entailment} \mid \text{premise}, \text{hypothesis}) \)
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# load model and tokenizer
nli_model = AutoModelForSequenceClassification.from_pretrained("BaptisteDoyen/camembert-base-xnli")
tokenizer = AutoTokenizer.from_pretrained("BaptisteDoyen/camembert-base-xnli")
# sequences
premise = "le score pour les bleus est élevé"
hypothesis = "L'équipe de France a fait un bon match"
# tokenize and run through model
x = tokenizer.encode(premise, hypothesis, return_tensors='pt')
logits = nli_model(x)[0]
# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (0) as the probability of the label being true 
entail_contradiction_logits = logits[:,::2]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:,0]
prob_label_is_true[0].tolist() * 100
# outputs
# 86.40775084495544
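
Both usages above come down to softmax operations on the three NLI logits. The sketch below illustrates this in plain Python, without loading the model. The logits are made up for illustration, the helper names (`prob_entailment`, `zero_shot_scores`) are hypothetical, and the label order (0 = entailment, 1 = neutral, 2 = contradiction) is taken from the comments in the snippet above. The across-label softmax reflects, to my understanding, how the zero-shot pipeline behaves in its default single-label mode.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Label order assumed from the snippet above.
ENTAILMENT, NEUTRAL, CONTRADICTION = 0, 1, 2

def prob_entailment(logits):
    """Drop "neutral", then softmax over (entailment, contradiction) only."""
    pair = [logits[ENTAILMENT], logits[CONTRADICTION]]
    return softmax(pair)[0]

def zero_shot_scores(logits_per_label):
    """Single-label zero-shot scoring: softmax of the entailment
    logits across all candidate labels."""
    entail = [row[ENTAILMENT] for row in logits_per_label]
    return softmax(entail)

# Each candidate label is turned into one hypothesis via the template.
hypothesis_template = "Ce texte parle de {}."
candidate_labels = ["sport", "politique", "science"]
hypotheses = [hypothesis_template.format(label) for label in candidate_labels]

# Hypothetical logits (entailment, neutral, contradiction), one row per
# hypothesis. These are NOT real model outputs.
made_up_logits = [
    [2.0, 0.5, -1.5],   # "Ce texte parle de sport."
    [0.0, 1.0, 0.5],    # "Ce texte parle de politique."
    [-1.0, 0.8, 1.2],   # "Ce texte parle de science."
]

scores = zero_shot_scores(made_up_logits)
ranked = sorted(zip(candidate_labels, scores), key=lambda p: p[1], reverse=True)

print(hypotheses)
print(ranked)
print(prob_entailment(made_up_logits[0]))
```

The premise/hypothesis checker corresponds to `prob_entailment` applied to a single pair, while the zero-shot classifier corresponds to `zero_shot_scores` applied to one pair per candidate label.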

Training data

The training data is the French subset of the XNLI dataset, released by Facebook in 2018.
It can be loaded with the datasets library:

from datasets import load_dataset
dataset = load_dataset('xnli', 'fr')                     

Training/Fine-Tuning procedure

The training procedure is fairly basic: fine-tuning ran in the cloud on a single GPU.
Main training parameters:

  • lr = 2e-5 with lr_scheduler_type = "linear"
  • num_train_epochs = 4
  • batch_size = 12 (limited by GPU-memory)
  • weight_decay = 0.01
  • metric_for_best_model = "eval_accuracy"
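
As a rough sketch, the parameters above can be collected as keyword arguments in the style of the Hugging Face `TrainingArguments` class, and the "linear" schedule can be written out by hand. Mapping `lr` to `learning_rate` and `batch_size` to `per_device_train_batch_size` is an assumption, as is the absence of warmup (the list above mentions none). The helpers `total_steps` and `linear_lr` and the dataset size used in the example are hypothetical.

```python
import math

# Values from the list above; key names follow TrainingArguments.
# The learning_rate and per_device_train_batch_size mappings are assumptions.
training_kwargs = {
    "learning_rate": 2e-5,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 4,
    "per_device_train_batch_size": 12,
    "weight_decay": 0.01,
    "metric_for_best_model": "eval_accuracy",
}

def total_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps, with no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs

def linear_lr(step, total, base_lr, warmup_steps=0):
    """Linear warmup (if any), then linear decay to zero at `total`."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup_steps))

# Hypothetical dataset size, chosen only to make the arithmetic easy.
steps = total_steps(
    num_examples=1200,
    batch_size=training_kwargs["per_device_train_batch_size"],
    epochs=training_kwargs["num_train_epochs"],
)
for step in (0, steps // 2, steps):
    print(step, linear_lr(step, steps, training_kwargs["learning_rate"]))
```

With these settings the learning rate starts at 2e-5, is halved midway through training, and reaches zero at the final step.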

Eval results

We obtain the following results on validation and test sets:

Set          Accuracy (%)
validation   81.4
test         81.7