TajaKuzman committed verified commit 55a869b · 1 Parent(s): c7f9bfc

Update README.md

Files changed (1): README.md +38 -86
README.md CHANGED
@@ -147,24 +147,46 @@ If you use the model, please cite the paper:
 ## AGILE - Automatic Genre Identification Benchmark
 
 We set up a benchmark for evaluating the robustness of automatic genre identification models to test their usability
- for the automatic enrichment of large text collections with genre information.
+ for the automatic enrichment of large text collections with genre information. The benchmark comprises 11 European languages and two test datasets.
 You are welcome to submit your entry at the [benchmark's GitHub repository](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark/tree/main).
 
- In an out-of-dataset scenario (evaluating the model on the manually annotated English EN-GINCO dataset (available upon request), on which it was not trained),
- the model outperforms all other technologies:
-
- | Model | micro F1 | macro F1 | accuracy |
- |:----------------------------|-----------:|-----------:|-----------:|
- | **XLM-RoBERTa, fine-tuned on the X-GENRE dataset - X-GENRE classifier** (Kuzman et al. 2023) | 0.68 | 0.69 | 0.68 |
- | GPT-4 (7/7/2023) (Kuzman et al. 2023) | 0.65 | 0.55 | 0.65 |
- | GPT-3.5-turbo (Kuzman et al. 2023) | 0.63 | 0.53 | 0.63 |
- | SVM (Kuzman et al. 2023) | 0.49 | 0.51 | 0.49 |
- | Logistic Regression (Kuzman et al. 2023) | 0.49 | 0.47 | 0.49 |
- | FastText (Kuzman et al. 2023) | 0.45 | 0.41 | 0.45 |
- | Naive Bayes (Kuzman et al. 2023) | 0.36 | 0.29 | 0.36 |
- | mt0 | 0.32 | 0.23 | 0.27 |
- | Zero-Shot classification with `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli` @ HuggingFace | 0.20 | 0.15 | 0.20 |
- | Dummy Classifier (stratified) (Kuzman et al. 2023) | 0.14 | 0.10 | 0.14 |
+ The model outperforms all other technologies, including GPT models (used in a zero-shot scenario).
+
+ Results on the English test dataset (EN-GINCO):
+
+ | Model | Test Dataset | Macro F1 | Micro F1 |
+ |:----------------------------------------------------------------------------------------------------------|:---------|-----------:|-----------:|
+ | [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier) | en-ginco | 0.687 | 0.684 |
+ | GPT-4o (gpt-4o-2024-08-06) (zero-shot) | en-ginco | 0.620 | 0.735 |
+ | Llama 3.3 (70B) (zero-shot) | en-ginco | 0.586 | 0.684 |
+ | Gemma 2 (27B) (zero-shot) | en-ginco | 0.564 | 0.603 |
+ | Gemma 3 (27B) (zero-shot) | en-ginco | 0.541 | 0.672 |
+ | GPT-4o-mini (gpt-4o-mini-2024-07-18) (zero-shot) | en-ginco | 0.534 | 0.632 |
+ | Support Vector Machine | en-ginco | 0.514 | 0.489 |
+ | GPT-3.5-Turbo (zero-shot) | en-ginco | 0.494 | 0.625 |
+ | DeepSeek-R1 14B (zero-shot) | en-ginco | 0.293 | 0.229 |
+ | Dummy classifier (stratified) | en-ginco | 0.088 | 0.154 |
+ | Dummy classifier (most frequent) | en-ginco | 0.032 | 0.169 |
+
+ Results on the multilingual test dataset (X-GINCO), comprising instances in Albanian, Catalan, Croatian, Greek, Icelandic, Macedonian, Maltese, Slovenian, Turkish, and Ukrainian:
+
+ | Model | Test Dataset | Macro F1 | Micro F1 |
+ |:----------------------------------------------------------------------------------------------------------|:---------|-----------:|-----------:|
+ | [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier) | x-ginco | 0.847 | 0.845 |
+ | GPT-4o (gpt-4o-2024-08-06) (zero-shot) | x-ginco | 0.776 | 0.769 |
+ | Llama 3.3 (70B) (zero-shot) | x-ginco | 0.741 | 0.738 |
+ | Gemma 3 (27B) (zero-shot) | x-ginco | 0.739 | 0.733 |
+ | GPT-4o-mini (gpt-4o-mini-2024-07-18) (zero-shot) | x-ginco | 0.688 | 0.670 |
+ | GPT-3.5-Turbo (zero-shot) | x-ginco | 0.627 | 0.622 |
+ | Gemma 2 (27B) (zero-shot) | x-ginco | 0.612 | 0.593 |
+ | DeepSeek-R1 14B (zero-shot) | x-ginco | 0.197 | 0.204 |
+ | Support Vector Machine | x-ginco | 0.166 | 0.184 |
+ | Dummy classifier (stratified) | x-ginco | 0.106 | 0.113 |
+ | Dummy classifier (most frequent) | x-ginco | 0.029 | 0.133 |
+
+ (The multilingual test dataset is easier than the English one, as the vague label "Other" and instances predicted with a confidence score below 0.80 were excluded from it.)
+
+ For language-specific results, see [the AGILE benchmark](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark).
 
 
 ## Intended use and limitations
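For readers who want to try the classifier benchmarked above, here is a minimal usage sketch. It is not taken from the model card itself: it assumes the standard `transformers` text-classification pipeline, and the example texts are invented. The confidence score it prints is the same quantity used to filter the X-GINCO test set (instances below 0.80 were excluded).

```python
from transformers import pipeline

# Load the X-GENRE classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="classla/xlm-roberta-base-multilingual-text-genre-classifier",
)

# Invented example texts; any language seen by XLM-RoBERTa should work.
texts = [
    "The government announced a new tax reform on Tuesday.",
    "Mix the flour and sugar, then bake at 180 °C for 25 minutes.",
]

# Each prediction carries a genre label and a confidence score.
for pred in classifier(texts, truncation=True, max_length=512):
    print(pred["label"], round(pred["score"], 3))
```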
@@ -233,76 +255,6 @@ labels_map={'Other': 0, 'Information/Explanation': 1, 'News': 2, 'Instruction':
 | Other | A text which does not fall under any of the other genre categories. | |
 
 
- ## Performance
-
- ### Comparison with other models in in-dataset and cross-dataset experiments
-
- The X-GENRE model was compared with `xlm-roberta-base` classifiers, fine-tuned on each of the genre datasets separately,
- using the X-GENRE schema (see the experiments at https://github.com/TajaKuzman/Genre-Datasets-Comparison).
-
- In the in-dataset experiments (trained and tested on splits of the same dataset),
- it outperforms the classifiers trained on all other datasets, except the one trained on the FTD dataset, which has a smaller number of X-GENRE labels.
-
- | Trained on | Micro F1 | Macro F1 |
- |:-------------|-----------:|-----------:|
- | FTD | 0.843 | 0.851 |
- | X-GENRE | 0.797 | 0.794 |
- | CORE | 0.778 | 0.627 |
- | GINCO | 0.754 | 0.750 |
-
- When applied to the test splits of each of the datasets, the classifier performs well:
-
- | Trained on | Tested on | Micro F1 | Macro F1 |
- |:-------------|:------------|-----------:|-----------:|
- | X-GENRE | CORE | 0.837 | 0.859 |
- | X-GENRE | FTD | 0.804 | 0.809 |
- | X-GENRE | X-GENRE | 0.797 | 0.794 |
- | X-GENRE | X-GENRE-dev | 0.784 | 0.784 |
- | X-GENRE | GINCO | 0.749 | 0.758 |
-
- The classifier was compared with other classifiers on two additional genre datasets (to which the X-GENRE schema was mapped):
- - EN-GINCO (available upon request): a sample of the English enTenTen20 corpus
- - [FinCORE](https://github.com/TurkuNLP/FinCORE): the Finnish CORE corpus
-
- | Trained on | Tested on | Micro F1 | Macro F1 |
- |:-------------|:------------|-----------:|-----------:|
- | X-GENRE | EN-GINCO | 0.688 | 0.691 |
- | X-GENRE | FinCORE | 0.674 | 0.581 |
- | GINCO | EN-GINCO | 0.632 | 0.502 |
- | FTD | EN-GINCO | 0.574 | 0.475 |
- | CORE | EN-GINCO | 0.485 | 0.422 |
-
- The cross-dataset and cross-lingual experiments showed that the X-GENRE classifier,
- trained on all three datasets, outperforms classifiers trained on just one of the datasets.
-
- Additionally, we evaluated the X-GENRE classifier on the multilingual X-GINCO dataset, which comprises samples
- of texts from the MaCoCu web corpora (http://hdl.handle.net/11356/1969).
- The X-GINCO dataset consists of 790 manually annotated instances in 10 languages -
- Albanian, Croatian, Catalan, Greek, Icelandic, Macedonian, Maltese, Slovenian, Turkish, and Ukrainian.
- To evaluate the performance on genre labels, the dataset is balanced by label,
- and the vague label "Other" is not included.
- Instances that were predicted with a confidence score below 0.80 were also excluded from the test dataset.
-
-
- The evaluation shows high cross-lingual performance of the model,
- even when it is applied to languages that are not related to the training languages (English and Slovenian) and to non-Latin scripts.
-
-
- The outlier is Maltese, on which the classifier does not perform well -
- we presume that this is because Maltese is not included in the pretraining data of the XLM-RoBERTa model.
-
- | Genre label | ca | el | hr | is | mk | sl | sq | tr | uk | Avg | mt |
- |---------------|------|------|------|------|------|------|------|------|------|------|------|
- | News | 0.82 | 0.90 | 0.95 | 0.73 | 0.91 | 0.90 | 0.89 | 0.95 | 1.00 | 0.89 | 0.69 |
- | Opinion/Argumentation | 0.84 | 0.87 | 0.78 | 0.82 | 0.78 | 0.82 | 0.67 | 0.82 | 0.91 | 0.81 | 0.33 |
- | Instruction | 0.75 | 0.71 | 0.75 | 0.78 | 1.00 | 1.00 | 0.95 | 0.90 | 0.95 | 0.86 | 0.69 |
- | Information/Explanation | 0.72 | 0.70 | 0.95 | 0.50 | 0.84 | 0.90 | 0.80 | 0.82 | 1.00 | 0.80 | 0.52 |
- | Promotion | 0.78 | 0.62 | 0.87 | 0.75 | 0.95 | 1.00 | 0.95 | 0.86 | 0.78 | 0.84 | 0.82 |
- | Forum | 0.84 | 0.95 | 0.91 | 0.95 | 1.00 | 1.00 | 0.78 | 0.89 | 0.95 | 0.91 | 0.18 |
- | Prose/Lyrical | 0.91 | 1.00 | 0.86 | 1.00 | 0.95 | 0.91 | 0.86 | 0.95 | 1.00 | 0.93 | 0.18 |
- | Legal | 0.95 | 1.00 | 1.00 | 0.84 | 0.95 | 0.95 | 0.95 | 1.00 | 1.00 | 0.96 | / |
- | Macro-F1 | 0.83 | 0.84 | 0.88 | 0.80 | 0.92 | 0.94 | 0.85 | 0.90 | 0.95 | 0.87 | 0.49 |
-
 ### Fine-tuning hyperparameters
 
 Fine-tuning was performed with `simpletransformers`.
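As a rough sketch of what fine-tuning with `simpletransformers` looks like (the argument values below are illustrative placeholders, not the card's actual hyperparameters, and the training rows are invented):

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# simpletransformers expects a DataFrame with a "text" column and an
# integer "labels" column (cf. the labels_map in the hunk header above).
train_df = pd.DataFrame({
    "text": [
        "The parliament passed the act on data protection yesterday.",
        "Stir the sauce constantly so that it does not stick to the pan.",
    ],
    # 2 = News per the labels_map; 3 for Instruction is an assumption,
    # since its value is truncated in the header shown above.
    "labels": [2, 3],
})

# Illustrative values only; the hyperparameters actually used are the
# ones documented in this section of the README.
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "overwrite_output_dir": True,
}

model = ClassificationModel(
    "xlmroberta",
    "xlm-roberta-base",
    num_labels=9,       # the X-GENRE schema has 9 labels, incl. "Other"
    args=model_args,
    use_cuda=False,     # set to True on a GPU machine
)
model.train_model(train_df)
```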
 
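A note on the metric columns used throughout the tables above: micro F1 aggregates decisions over all instances, while macro F1 averages the per-label F1 scores, so rare genres count as much as frequent ones. A toy sketch with invented labels, computed with `scikit-learn`:

```python
from sklearn.metrics import f1_score

# Invented gold and predicted genre labels for five documents.
y_true = ["News", "News", "Legal", "Forum", "Promotion"]
y_pred = ["News", "Legal", "Legal", "Forum", "Forum"]

# Micro F1 pools all instances; macro F1 averages per-label F1 scores.
print("Micro F1:", f1_score(y_true, y_pred, average="micro"))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
```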