Twitter-roBERTa-base for Sentiment Analysis

This is a roBERTa-base model trained on ~58M tweets and finetuned for sentiment analysis as part of the TweetEval benchmark. For a full description, see the TweetEval paper (Findings of EMNLP 2020). To evaluate this and other models on Twitter-specific data, please refer to the official TweetEval repository.
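
As a quick sanity check before the full example below, the checkpoint can also be loaded through the high-level pipeline API. This is a minimal sketch: if the checkpoint's config carries no readable label names, the labels print as LABEL_0/LABEL_1/LABEL_2, which for this task correspond to negative/neutral/positive (the same mapping downloaded in the full example).

from transformers import pipeline

# Minimal sketch: one-line classification with the pipeline API.
# Labels may come back as LABEL_0/LABEL_1/LABEL_2 (negative/neutral/positive
# for this task) unless the config provides readable label names.
sentiment_task = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")
print(sentiment_task("Good night 😊"))  # -> [{'label': ..., 'score': ...}]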

Example of classification

from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

# Tasks:
# emoji, emotion, hate, irony, offensive, sentiment
# stance/abortion, stance/atheism, stance/climate, stance/feminist, stance/hillary

task='sentiment'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Download the label mapping (class index -> label name) for the chosen task
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html[:-1], delimiter='\t')
    labels = [row[1] for row in csvreader]

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)  # optional: save a local copy of the weights

text = "Good night 😊"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)  # logits -> probabilities

# # TF
# model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
# model.save_pretrained(MODEL)

# text = "Good night 😊"
# encoded_input = tokenizer(text, return_tensors='tf')
# output = model(encoded_input)
# scores = output[0][0].numpy()
# scores = softmax(scores)

# Print the labels ranked by descending probability
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

Output:

1) positive 0.8466
2) neutral 0.1458
3) negative 0.0076
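
One practical note: the tweets this model was trained on had user mentions and URLs normalized, so raw text containing handles or links may score worse. Below is a hedged sketch of that kind of preprocessing; the placeholder tokens '@user' and 'http' follow the convention used in later revisions of this model card, and the helper name is illustrative.

def preprocess(text):
    # Replace user mentions and links with generic placeholders before tokenizing.
    # '@user' and 'http' are the placeholders assumed here; adjust if the
    # training-time normalization differs.
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

text = preprocess("@cardiffnlp this model is great! http://example.com")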