Update README.md
README.md
CHANGED
@@ -147,24 +147,46 @@ If you use the model, please cite the paper:

## AGILE - Automatic Genre Identification Benchmark

We set up a benchmark for evaluating the robustness of automatic genre identification models, to test their usability
for the automatic enrichment of large text collections with genre information. The benchmark comprises 11 European languages and two test datasets.
You are welcome to submit your entry at the [benchmark's GitHub repository](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark/tree/main).

The X-GENRE model outperforms all other evaluated models in terms of macro F1, including GPT models used in a zero-shot scenario.

Results on the English test dataset (EN-GINCO):

| Model | Test Dataset | Macro F1 | Micro F1 |
|:----------------------------------------------------------------------------------------------------------|:-------------|---------:|---------:|
| [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier)   | en-ginco     |    0.687 |    0.684 |
| GPT-4o (gpt-4o-2024-08-06) (zero-shot)                                                                      | en-ginco     |    0.62  |    0.735 |
| Llama 3.3 (70B) (zero-shot)                                                                                 | en-ginco     |    0.586 |    0.684 |
| Gemma 2 (27B) (zero-shot)                                                                                   | en-ginco     |    0.564 |    0.603 |
| Gemma 3 (27B) (zero-shot)                                                                                   | en-ginco     |    0.541 |    0.672 |
| GPT-4o-mini (gpt-4o-mini-2024-07-18) (zero-shot)                                                            | en-ginco     |    0.534 |    0.632 |
| Support Vector Machine                                                                                      | en-ginco     |    0.514 |    0.489 |
| GPT-3.5-Turbo (zero-shot)                                                                                   | en-ginco     |    0.494 |    0.625 |
| DeepSeek-R1 14B (zero-shot)                                                                                 | en-ginco     |    0.293 |    0.229 |
| Dummy classifier (stratified)                                                                               | en-ginco     |    0.088 |    0.154 |
| Dummy classifier (most frequent)                                                                            | en-ginco     |    0.032 |    0.169 |

Results on the multilingual test dataset (X-GINCO), comprising instances in Albanian, Catalan, Croatian, Greek, Icelandic, Macedonian, Maltese, Slovenian, Turkish, and Ukrainian:

| Model | Test Dataset | Macro F1 | Micro F1 |
|:----------------------------------------------------------------------------------------------------------|:-------------|---------:|---------:|
| [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier)   | x-ginco      |    0.847 |    0.845 |
| GPT-4o (gpt-4o-2024-08-06) (zero-shot)                                                                      | x-ginco      |    0.776 |    0.769 |
| Llama 3.3 (70B) (zero-shot)                                                                                 | x-ginco      |    0.741 |    0.738 |
| Gemma 3 (27B) (zero-shot)                                                                                   | x-ginco      |    0.739 |    0.733 |
| GPT-4o-mini (gpt-4o-mini-2024-07-18) (zero-shot)                                                            | x-ginco      |    0.688 |    0.67  |
| GPT-3.5-Turbo (zero-shot)                                                                                   | x-ginco      |    0.627 |    0.622 |
| Gemma 2 (27B) (zero-shot)                                                                                   | x-ginco      |    0.612 |    0.593 |
| DeepSeek-R1 14B (zero-shot)                                                                                 | x-ginco      |    0.197 |    0.204 |
| Support Vector Machine                                                                                      | x-ginco      |    0.166 |    0.184 |
| Dummy classifier (stratified)                                                                               | x-ginco      |    0.106 |    0.113 |
| Dummy classifier (most frequent)                                                                            | x-ginco      |    0.029 |    0.133 |

(The multilingual test dataset is easier than the English one, as the vague label "Other" and instances that were predicted with a confidence score below 0.80 were excluded from it.)

For language-specific results, see [the AGILE benchmark](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark).
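
The classifier can be applied with the generic Hugging Face `transformers` text-classification pipeline. This is a minimal sketch with invented example texts, shown as an alternative to the `simpletransformers`-based setup used in the original experiments:

```python
from transformers import pipeline

# Load the released model from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="classla/xlm-roberta-base-multilingual-text-genre-classifier",
)

# Invented example texts; the model is multilingual, so any
# language covered by XLM-RoBERTa can be used.
texts = [
    "The government announced a new tax reform on Tuesday.",
    "Mix the flour and sugar, then bake for 25 minutes.",
]

# Each prediction carries a genre label and a confidence score.
for prediction in classifier(texts, truncation=True):
    print(prediction["label"], round(prediction["score"], 3))
```
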
## Intended use and limitations
@@ -233,76 +255,6 @@ labels_map={'Other': 0, 'Information/Explanation': 1, 'News': 2, 'Instruction':

| Other | A text which does not fall under any of the other genre categories. | |

## Performance

### Comparison with other models in in-dataset and cross-dataset experiments

The X-GENRE model was compared with `xlm-roberta-base` classifiers fine-tuned on each of the genre datasets separately,
using the X-GENRE schema (see the experiments in https://github.com/TajaKuzman/Genre-Datasets-Comparison).

In the in-dataset experiments (trained and tested on splits of the same dataset),
it outperforms the classifiers trained on the other datasets, except the one trained on the FTD dataset, which covers a smaller number of X-GENRE labels.

| Trained on | Micro F1 | Macro F1 |
|:-----------|---------:|---------:|
| FTD        |    0.843 |    0.851 |
| X-GENRE    |    0.797 |    0.794 |
| CORE       |    0.778 |    0.627 |
| GINCO      |    0.754 |    0.75  |
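
For reference, the Micro and Macro F1 scores reported throughout this section can be computed with scikit-learn; a minimal sketch on invented gold and predicted label lists:

```python
from sklearn.metrics import f1_score

# Invented gold and predicted genre labels, for illustration only.
y_true = ["News", "Promotion", "Legal", "News", "Forum", "Legal"]
y_pred = ["News", "Promotion", "News", "News", "Forum", "Legal"]

# Micro F1 aggregates decisions over all instances, while Macro F1
# averages per-label F1 scores, so rare genres weigh in equally.
print(f1_score(y_true, y_pred, average="micro"))
print(f1_score(y_true, y_pred, average="macro"))
```
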

When applied to the test splits of each of the datasets, the classifier performs well:

| Trained on | Tested on   | Micro F1 | Macro F1 |
|:-----------|:------------|---------:|---------:|
| X-GENRE    | CORE        |    0.837 |    0.859 |
| X-GENRE    | FTD         |    0.804 |    0.809 |
| X-GENRE    | X-GENRE     |    0.797 |    0.794 |
| X-GENRE    | X-GENRE-dev |    0.784 |    0.784 |
| X-GENRE    | GINCO       |    0.749 |    0.758 |

The classifier was also compared with the other classifiers on two additional genre datasets (to which the X-GENRE schema was mapped):
- EN-GINCO (available upon request): a sample of the English enTenTen20 corpus
- [FinCORE](https://github.com/TurkuNLP/FinCORE): the Finnish CORE corpus

| Trained on | Tested on | Micro F1 | Macro F1 |
|:-----------|:----------|---------:|---------:|
| X-GENRE    | EN-GINCO  |    0.688 |    0.691 |
| X-GENRE    | FinCORE   |    0.674 |    0.581 |
| GINCO      | EN-GINCO  |    0.632 |    0.502 |
| FTD        | EN-GINCO  |    0.574 |    0.475 |
| CORE       | EN-GINCO  |    0.485 |    0.422 |

The cross-dataset and cross-lingual experiments showed that the X-GENRE classifier,
trained on all three datasets, outperforms classifiers that were trained on just one of the datasets.

Additionally, we evaluated the X-GENRE classifier on the multilingual X-GINCO dataset, which comprises samples
of texts from the MaCoCu web corpora (http://hdl.handle.net/11356/1969).
The X-GINCO dataset comprises 790 manually annotated instances in 10 languages:
Albanian, Catalan, Croatian, Greek, Icelandic, Macedonian, Maltese, Slovenian, Turkish, and Ukrainian.
To evaluate performance across the genre labels, the dataset is balanced by label,
and the vague label "Other" is not included.
Moreover, instances that were predicted with a confidence score below 0.80 were not included in the test dataset.
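
A minimal sketch of that filtering step, assuming the predictions are available as (text, label, confidence) triples; the triples below are invented:

```python
# Invented (text, predicted label, confidence) triples, for illustration.
predictions = [
    ("text A", "News", 0.97),
    ("text B", "Other", 0.91),   # vague label, excluded
    ("text C", "Legal", 0.64),   # confidence below 0.80, excluded
]

# Keep only confident, non-"Other" predictions, as described above.
kept = [
    (text, label, score)
    for text, label, score in predictions
    if label != "Other" and score >= 0.80
]
```
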

The evaluation shows high cross-lingual performance of the model,
even when it is applied to languages that are not related to the training languages (English and Slovenian) and to non-Latin scripts.

The outlier is Maltese, on which the classifier does not perform well;
we presume that this is because Maltese is not included in the pretraining data of the XLM-RoBERTa model.

The table below reports per-genre F1 scores by language; the "Avg" column averages over all languages except Maltese (mt), which is listed separately as the outlier:

| Genre label             | ca   | el   | hr   | is   | mk   | sl   | sq   | tr   | uk   | Avg  | mt   |
|:------------------------|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
| News                    | 0.82 | 0.90 | 0.95 | 0.73 | 0.91 | 0.90 | 0.89 | 0.95 | 1.00 | 0.89 | 0.69 |
| Opinion/Argumentation   | 0.84 | 0.87 | 0.78 | 0.82 | 0.78 | 0.82 | 0.67 | 0.82 | 0.91 | 0.81 | 0.33 |
| Instruction             | 0.75 | 0.71 | 0.75 | 0.78 | 1.00 | 1.00 | 0.95 | 0.90 | 0.95 | 0.86 | 0.69 |
| Information/Explanation | 0.72 | 0.70 | 0.95 | 0.50 | 0.84 | 0.90 | 0.80 | 0.82 | 1.00 | 0.80 | 0.52 |
| Promotion               | 0.78 | 0.62 | 0.87 | 0.75 | 0.95 | 1.00 | 0.95 | 0.86 | 0.78 | 0.84 | 0.82 |
| Forum                   | 0.84 | 0.95 | 0.91 | 0.95 | 1.00 | 1.00 | 0.78 | 0.89 | 0.95 | 0.91 | 0.18 |
| Prose/Lyrical           | 0.91 | 1.00 | 0.86 | 1.00 | 0.95 | 0.91 | 0.86 | 0.95 | 1.00 | 0.93 | 0.18 |
| Legal                   | 0.95 | 1.00 | 1.00 | 0.84 | 0.95 | 0.95 | 0.95 | 1.00 | 1.00 | 0.96 | /    |
| Macro F1                | 0.83 | 0.84 | 0.88 | 0.80 | 0.92 | 0.94 | 0.85 | 0.90 | 0.95 | 0.87 | 0.49 |
### Fine-tuning hyperparameters
Fine-tuning was performed with `simpletransformers`.
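
As a rough illustration of such a setup, here is a minimal sketch using the `simpletransformers` `ClassificationModel` API; the hyperparameter values and the tiny training frame are placeholders, not the exact published configuration:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Placeholder training data; a real run would use the full X-GENRE
# training split, with integer labels following the labels_map above.
train_df = pd.DataFrame({
    "text": ["The government announced a new tax reform.",
             "Mix the flour and sugar, then bake for 25 minutes."],
    "labels": [2, 3],  # News, Instruction
})

# Illustrative hyperparameters, not the exact values used for this model.
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "overwrite_output_dir": True,
}

model = ClassificationModel(
    "xlmroberta",
    "xlm-roberta-base",
    num_labels=9,     # the nine X-GENRE labels
    use_cuda=False,   # set to True on a GPU machine
    args=model_args,
)
model.train_model(train_df)
```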