---
license: apache-2.0
base_model: google/flan-t5-base
tags:
- generated_from_trainer
metrics:
- rouge
model-index:
- name: flan-t5-base-openbsd-faq
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# flan-t5-base-openbsd-faq

This model is a fine-tuned version of [google/flan-t5-base](https://huggingface.co/google/flan-t5-base), trained on the [ajsbsd/openbsd-faq](https://huggingface.co/datasets/ajsbsd/openbsd-faq) dataset.

The dataset consists of questions drawn from https://www.openbsd.org/faq/faq1.html, prepared for use on [ajsbsd.net](https://ajsbsd.net).

It achieves the following results on the evaluation set:
- Loss: 2.2385
- Rouge1: 0.3935
- Rouge2: 0.3383
- Rougel: 0.3906
- Rougelsum: 0.3844

## Model description

A text2text generation model: [google/flan-t5-base](https://huggingface.co/google/flan-t5-base) fine-tuned to answer questions about OpenBSD.

## Intended uses & limitations

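Intended for use as an OpenBSD question-answering chatbot. The training data is limited to questions derived from a single FAQ page (see the next section).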
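A minimal inference sketch follows. It assumes the weights are published under the Hub id `ajsbsd/flan-t5-base-openbsd-faq` (a hypothetical id inferred from the model name, not confirmed by this card) and reuses the `"Please answer this question: "` prefix from the training code further down. The helper names `build_prompt` and `answer`, and the `max_new_tokens=128` limit, are illustrative choices.

```python
# Same prefix the training code in this card prepends to every question.
PREFIX = "Please answer this question: "

# Hypothetical Hub repository id; replace with the actual location of the weights.
MODEL_ID = "ajsbsd/flan-t5-base-openbsd-faq"


def build_prompt(question: str) -> str:
    """Prepend the training-time prefix to a raw question."""
    return PREFIX + question.strip()


def answer(question: str, max_new_tokens: int = 128) -> str:
    """Generate an answer for one question.

    transformers is imported here so that build_prompt can be used
    (and tested) without the library or the model weights installed.
    """
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(MODEL_ID)
    model = T5ForConditionalGeneration.from_pretrained(MODEL_ID)
    # max_length=128 mirrors the input truncation used during training.
    inputs = tokenizer(
        build_prompt(question),
        return_tensors="pt",
        max_length=128,
        truncation=True,
    )
    # max_new_tokens is an assumed generation limit, not a value from this card.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    print(build_prompt("What is OpenBSD?"))
```

Calling `answer("What is OpenBSD?")` downloads the model on first use. Keeping the prefix identical to the one used in training matters, because the model only ever saw prefixed inputs during fine-tuning.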

## Training and evaluation data

Question/answer pairs created from https://www.openbsd.org/faq/faq1.html, formatted for text2text generation. The training code holds out 20% of the examples as the evaluation set.

## Training procedure

Trained on Google Colab with the following code:

```
!pip install -q transformers[torch] tokenizers datasets evaluate rouge_score sentencepiece huggingface_hub --upgrade

from huggingface_hub import notebook_login
notebook_login()

import nltk
from datasets import load_dataset
import evaluate
import numpy as np
from transformers import T5Tokenizer, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer

# Load and split the dataset
dataset = load_dataset("ajsbsd/openbsd-faq")
dataset = dataset["train"].train_test_split(test_size=0.2)

# Load the tokenizer, model, and data collator
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Prefix every question with a short instruction for the model
prefix = "Please answer this question: "

# Define our preprocessing function
def preprocess_function(examples):
    """Add prefix to the sentences, tokenize the text, and set the labels"""
    # The "inputs" are the prefixed questions, tokenized:
    inputs = [prefix + doc for doc in examples["question"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)

    # The "labels" are the tokenized outputs:
    labels = tokenizer(text_target=examples["answer"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Map the preprocessing function across our dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Set up Rouge score for evaluation
nltk.download("punkt", quiet=True)
metric = evaluate.load("rouge")

def compute_metrics(eval_preds):
    preds, labels = eval_preds

    # decode preds and labels
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # rougeLSum expects newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    return result

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./flan-t5-base-openbsd-faq",
    evaluation_strategy="epoch",
    learning_rate=3e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    push_to_hub=False
)

# Set up trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

# Upload the final model to the Hub (uses the notebook_login() credentials above)
trainer.push_to_hub()
```


### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0003
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
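
To make the `linear` scheduler concrete, here is a small sketch of how the learning rate would decay over the run. It assumes no warmup (none is set in the training arguments above, and the Trainer default is zero) and 45 total optimizer steps, as reported in the results table. The function name `linear_lr` is illustrative and not part of any library.

```python
def linear_lr(
    step: int,
    base_lr: float = 3e-4,    # learning_rate from the hyperparameters above
    total_steps: int = 45,    # 5 epochs x 9 steps, per the results table
    warmup_steps: int = 0,    # assumption: no warmup was configured
) -> float:
    """Learning rate at a given optimizer step under linear decay to zero."""
    if step < warmup_steps:
        # Linear ramp-up from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    # Print the rate at the start of each epoch and at the end of training.
    for step in (0, 9, 18, 27, 36, 45):
        print(step, linear_lr(step))
```

Under these assumptions the rate starts at the full `3e-4`, falls by one fifth of that value per epoch, and reaches zero at step 45.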

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|:-------------:|:-----:|:----:|:---------------:|:------:|:------:|:------:|:---------:|
| No log        | 1.0   | 9    | 2.2184          | 0.3985 | 0.3308 | 0.3878 | 0.3902    |
| No log        | 2.0   | 18   | 2.2060          | 0.4044 | 0.3231 | 0.3959 | 0.3937    |
| No log        | 3.0   | 27   | 2.2271          | 0.4063 | 0.3315 | 0.4006 | 0.3971    |
| No log        | 4.0   | 36   | 2.2251          | 0.4069 | 0.3366 | 0.4001 | 0.3937    |
| No log        | 5.0   | 45   | 2.2385          | 0.3935 | 0.3383 | 0.3906 | 0.3844    |
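
The step counts in the table can be related to the size of the training split. The sketch below assumes a single device, no gradient accumulation, and that the final partial batch is kept (all Trainer defaults, none of which this card overrides). The helper names are illustrative.

```python
import math


def steps_per_epoch(num_train_examples: int, batch_size: int = 8) -> int:
    """Optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_train_examples / batch_size)


def train_size_range(steps: int, batch_size: int = 8) -> tuple[int, int]:
    """Smallest and largest training-set sizes that yield `steps` steps per epoch."""
    return (steps - 1) * batch_size + 1, steps * batch_size


if __name__ == "__main__":
    low, high = train_size_range(9)
    print(f"9 steps per epoch at batch size 8 implies {low} to {high} training examples")
    print(f"total steps over 5 epochs: {5 * steps_per_epoch(high)}")
```

With 9 steps per epoch and a batch size of 8, the training split would hold between 65 and 72 examples. The "No log" entries are consistent with the Trainer's default of logging the training loss every 500 steps, which a 45-step run never reaches.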


### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.0+cu118
- Datasets 2.14.7
- Tokenizers 0.15.0