Update README.md
Authors: *Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel*

## Details of the downstream task (Question Answering) - Dataset 📚

[TweetQA](https://huggingface.co/datasets/tweet_qa)

With social media becoming increasingly popular and much of the news and many real-time events being reported there first, developing automated question answering systems is critical to the effectiveness of many applications that rely on real-time knowledge. While previous question answering (QA) datasets have concentrated on formal text like news and Wikipedia, TweetQA is the first large-scale dataset for QA over social media data. To ensure the tweets are meaningful and contain interesting information, the authors gathered tweets used by journalists to write news articles and then asked human annotators to write questions and answers about them. Unlike QA datasets such as SQuAD, in which the answers are extractive, TweetQA allows the answers to be abstractive: the task requires the model to read a short tweet and a question and to output a text phrase (which does not need to appear in the tweet) as the answer.

- Data Instances:

Sample:

```json
{
  "Question": "who is the tallest host?",
  "Answer": ["sam bee", "sam bee"],
  "Tweet": "Don't believe @ConanOBrien's height lies. Sam Bee is the tallest host in late night. #alternativefacts\u2014 Full Frontal (@FullFrontalSamB) January 22, 2017",
  "qid": "3554ee17d86b678be34c4dc2c04e334f"
}
```

- Data Fields:

  - **Question**: a question based on information from a tweet
  - **Answer**: list of possible answers from the tweet
  - **Tweet**: source tweet
  - **qid**: question id
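
For reference, here is a minimal sketch of loading the dataset and inspecting these fields with the 🤗 `datasets` library (it assumes a `datasets` version that still resolves the `tweet_qa` loading script and a standard `train` split):

```python
from datasets import load_dataset

# Load TweetQA from the Hugging Face Hub.
dataset = load_dataset('tweet_qa')

# Each instance carries the four fields described above.
sample = dataset['train'][0]
print(sample['Question'], sample['Answer'], sample['Tweet'], sample['qid'])
```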

## Model in Action 🚀

First, install the library: `pip install -q ./transformers` (the `./transformers` path assumes a local clone of the transformers repo).

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

ckpt = 'Narrativa/byt5-base-finetuned-tweet-qa'

tokenizer = AutoTokenizer.from_pretrained(ckpt)
# Move the model to GPU; this snippet assumes a CUDA device is available.
model = T5ForConditionalGeneration.from_pretrained(ckpt).to('cuda')

def get_answer(question, context):
    # Build the input in the usual T5 text-to-text QA format.
    input_text = 'question: %s context: %s' % (question, context)
    inputs = tokenizer([input_text], return_tensors='pt')
    input_ids = inputs.input_ids.to('cuda')
    attention_mask = inputs.attention_mask.to('cuda')
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

get_answer('here goes your question', 'And here the context/tweet...')
```
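
For example, with the sample instance shown above (illustrative only: the exact generated string depends on the checkpoint and decoding settings, though the gold answer for this instance is "sam bee"):

```python
tweet = ("Don't believe @ConanOBrien's height lies. Sam Bee is the tallest host "
         "in late night. #alternativefacts\u2014 Full Frontal (@FullFrontalSamB) January 22, 2017")
print(get_answer('who is the tallest host?', tweet))
```
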
Created by: [Narrativa](https://www.narrativa.com/)