---
language: en
datasets:
- tweets_hate_speech_detection
tags:
- text-classification
---

# ByT5-base fine-tuned for Hate Speech Detection (on Tweets)
[ByT5](https://huggingface.co/google/byt5-base) base fine-tuned on the [tweets hate speech detection](https://huggingface.co/datasets/tweets_hate_speech_detection) dataset for the **Sequence Classification** downstream task.

## Details of ByT5 - Base

ByT5 is a tokenizer-free version of [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) and generally follows the architecture of [MT5](https://huggingface.co/google/mt5-base).
ByT5 was pre-trained only on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual), excluding any supervised training, with an average span mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task.
ByT5 works especially well on noisy text data, *e.g.*, `google/byt5-base` significantly outperforms [mt5-base](https://huggingface.co/google/mt5-base) on [TweetQA](https://arxiv.org/abs/1907.06292).
Paper: [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626)
Authors: *Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel*

## Details of the downstream task (Sequence Classification as Text generation) - Dataset

[tweets_hate_speech_detection](https://huggingface.co/datasets/tweets_hate_speech_detection)

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets apart from other tweets.

Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the given test dataset.
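
Since the task is framed as text generation, each (tweet, label) pair can be turned into an (input text, target text) pair for T5-style fine-tuning. A minimal sketch of this casting (the helper name and prompt format are illustrative assumptions, not this checkpoint's actual preprocessing):

```python
# Illustrative sketch: cast binary classification as text generation,
# as done for T5-style models. The function name and the bare-tweet
# input format are assumptions, not the checkpoint's exact recipe.
def to_seq2seq_example(tweet: str, label: int) -> tuple[str, str]:
    # Input is the raw tweet; target is the label rendered as text.
    return tweet, str(label)

pairs = [
    to_seq2seq_example('some friendly tweet', 0),
    to_seq2seq_example('some hateful tweet', 1),
]
```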

- Data Instances:

The dataset contains a label denoting whether the tweet is hate speech or not.

```python
{'label': 0,  # not hate speech
 'tweet': ' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run'}
```
- Data Fields:

**label**: 1 - hate speech, 0 - not hate speech

**tweet**: content of the tweet as a string

- Data Splits:

The dataset contains a single training split with **31,962** entries.

## Test set metrics

We created a representative test set with 5% of the entries.

The dataset is highly imbalanced, yet we obtained an **F1 score of 79.8**.
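
A stratified hold-out like the one described above, and the binary F1 used to score it, can be sketched in plain Python (illustrative only; the exact split seed and metric implementation used for this model are not specified here):

```python
import random
from collections import defaultdict

def stratified_split(examples, test_frac=0.05, seed=42):
    """Hold out test_frac of the examples per label, preserving class balance."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex['label']].append(ex)
    rng = random.Random(seed)
    train, test = [], []
    for _, items in by_label.items():
        rng.shuffle(items)
        n_test = max(1, int(len(items) * test_frac))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class (here: hate speech = 1)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With an imbalanced dataset, stratifying the split keeps the rare positive class represented in the 5% test set, which is why F1 (rather than accuracy) is the headline metric.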

## Model in Action

```sh
git clone https://github.com/huggingface/transformers.git
pip install -q ./transformers
```

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

ckpt = 'Narrativa/byt5-base-tweet-hate-detection'

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt).to("cuda")

def classify_tweet(tweet):
    inputs = tokenizer([tweet], padding='max_length', truncation=True, max_length=512, return_tensors='pt')
    input_ids = inputs.input_ids.to('cuda')
    attention_mask = inputs.attention_mask.to('cuda')
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

classify_tweet('here goes your tweet...')
```
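
Because the classifier is a text-generation model, `classify_tweet` returns the label as a string. A minimal, hypothetical post-processing helper (assuming the model emits `'0'` or `'1'`, matching the dataset labels):

```python
# Hypothetical helper, not part of the checkpoint: map the generated
# label string back to an int. Assumes the model emits '0' (not hate
# speech) or '1' (hate speech), matching the dataset's label encoding.
def to_label(generated: str) -> int:
    text = generated.strip()
    if text not in ('0', '1'):
        raise ValueError(f'unexpected model output: {generated!r}')
    return int(text)
```

`to_label(classify_tweet('here goes your tweet...'))` would then yield `0` or `1`.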

Created by: [Narrativa](https://www.narrativa.com/)

About Narrativa: Natural Language Generation (NLG) | Gabriele, our machine learning-based platform, builds and deploys natural language solutions. #NLG #AI