Update README.md
README.md
CHANGED
@@ -147,24 +147,46 @@ If you use the model, please cite the paper:

## AGILE - Automatic Genre Identification Benchmark

We set up a benchmark for evaluating the robustness of automatic genre identification models, to test their usability
for the automatic enrichment of large text collections with genre information. The benchmark comprises 11 European languages and two test datasets.
You are welcome to submit your entry at the [benchmark's GitHub repository](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark/tree/main).

The X-GENRE model outperforms all other evaluated models in terms of macro F1, including GPT models used in a zero-shot scenario.

Results on the English test dataset (EN-GINCO):

| Model | Test Dataset | Macro F1 | Micro F1 |
|:----------------------------------------------------------------------------------------------------------|:-------------|---------:|---------:|
| [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier)   | en-ginco     |    0.687 |    0.684 |
| GPT-4o (gpt-4o-2024-08-06) (zero-shot)                                                                      | en-ginco     |    0.62  |    0.735 |
| Llama 3.3 (70B) (zero-shot)                                                                                 | en-ginco     |    0.586 |    0.684 |
| Gemma 2 (27B) (zero-shot)                                                                                   | en-ginco     |    0.564 |    0.603 |
| Gemma 3 (27B) (zero-shot)                                                                                   | en-ginco     |    0.541 |    0.672 |
| GPT-4o-mini (gpt-4o-mini-2024-07-18) (zero-shot)                                                            | en-ginco     |    0.534 |    0.632 |
| Support Vector Machine                                                                                      | en-ginco     |    0.514 |    0.489 |
| GPT-3.5-Turbo (zero-shot)                                                                                   | en-ginco     |    0.494 |    0.625 |
| DeepSeek-R1 14B (zero-shot)                                                                                 | en-ginco     |    0.293 |    0.229 |
| Dummy classifier (stratified)                                                                               | en-ginco     |    0.088 |    0.154 |
| Dummy classifier (most frequent)                                                                            | en-ginco     |    0.032 |    0.169 |

Results on the multilingual test dataset (X-GINCO), comprising instances in Albanian, Catalan, Croatian, Greek, Icelandic, Macedonian, Maltese, Slovenian, Turkish, and Ukrainian:

| Model | Test Dataset | Macro F1 | Micro F1 |
|:----------------------------------------------------------------------------------------------------------|:-------------|---------:|---------:|
| [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier)   | x-ginco      |    0.847 |    0.845 |
| GPT-4o (gpt-4o-2024-08-06) (zero-shot)                                                                      | x-ginco      |    0.776 |    0.769 |
| Llama 3.3 (70B) (zero-shot)                                                                                 | x-ginco      |    0.741 |    0.738 |
| Gemma 3 (27B) (zero-shot)                                                                                   | x-ginco      |    0.739 |    0.733 |
| GPT-4o-mini (gpt-4o-mini-2024-07-18) (zero-shot)                                                            | x-ginco      |    0.688 |    0.67  |
| GPT-3.5-Turbo (zero-shot)                                                                                   | x-ginco      |    0.627 |    0.622 |
| Gemma 2 (27B) (zero-shot)                                                                                   | x-ginco      |    0.612 |    0.593 |
| DeepSeek-R1 14B (zero-shot)                                                                                 | x-ginco      |    0.197 |    0.204 |
| Support Vector Machine                                                                                      | x-ginco      |    0.166 |    0.184 |
| Dummy classifier (stratified)                                                                               | x-ginco      |    0.106 |    0.113 |
| Dummy classifier (most frequent)                                                                            | x-ginco      |    0.029 |    0.133 |

(The multilingual test dataset is easier than the English one, as the vague label "Other" and instances that were predicted with a confidence score below 0.80 were excluded from it.)

For language-specific results, see [the AGILE benchmark](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark).
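
The classifier can be applied with the generic Hugging Face `transformers` text-classification pipeline. This is a minimal sketch with invented example texts, shown as an alternative to the `simpletransformers`-based setup used in the original experiments:

```python
from transformers import pipeline

# Load the released model from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="classla/xlm-roberta-base-multilingual-text-genre-classifier",
)

# Invented example texts; the model is multilingual, so any
# language covered by XLM-RoBERTa can be used.
texts = [
    "The government announced a new tax reform on Tuesday.",
    "Mix the flour and sugar, then bake for 25 minutes.",
]

# Each prediction carries a genre label and a confidence score.
for prediction in classifier(texts, truncation=True):
    print(prediction["label"], round(prediction["score"], 3))
```
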
## Intended use and limitations
@@ -233,76 +255,6 @@ labels_map={'Other': 0, 'Information/Explanation': 1, 'News': 2, 'Instruction':

| Other | A text which does not fall under any of the other genre categories. | |

## Performance

### Comparison with other models in in-dataset and cross-dataset experiments

The X-GENRE model was compared with `xlm-roberta-base` classifiers fine-tuned on each of the genre datasets separately,
using the X-GENRE schema (see the experiments in https://github.com/TajaKuzman/Genre-Datasets-Comparison).

In the in-dataset experiments (trained and tested on splits of the same dataset),
it outperforms the classifiers trained on the other datasets, except the one trained on the FTD dataset, which covers a smaller number of X-GENRE labels.

| Trained on | Micro F1 | Macro F1 |
|:-----------|---------:|---------:|
| FTD        |    0.843 |    0.851 |
| X-GENRE    |    0.797 |    0.794 |
| CORE       |    0.778 |    0.627 |
| GINCO      |    0.754 |    0.75  |
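
For reference, the Micro and Macro F1 scores reported throughout this section can be computed with scikit-learn; a minimal sketch on invented gold and predicted label lists:

```python
from sklearn.metrics import f1_score

# Invented gold and predicted genre labels, for illustration only.
y_true = ["News", "Promotion", "Legal", "News", "Forum", "Legal"]
y_pred = ["News", "Promotion", "News", "News", "Forum", "Legal"]

# Micro F1 aggregates decisions over all instances, while Macro F1
# averages per-label F1 scores, so rare genres weigh in equally.
print(f1_score(y_true, y_pred, average="micro"))
print(f1_score(y_true, y_pred, average="macro"))
```
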

When applied to the test splits of each of the datasets, the classifier performs well:

| Trained on | Tested on   | Micro F1 | Macro F1 |
|:-----------|:------------|---------:|---------:|
| X-GENRE    | CORE        |    0.837 |    0.859 |
| X-GENRE    | FTD         |    0.804 |    0.809 |
| X-GENRE    | X-GENRE     |    0.797 |    0.794 |
| X-GENRE    | X-GENRE-dev |    0.784 |    0.784 |
| X-GENRE    | GINCO       |    0.749 |    0.758 |

The classifier was also compared with the other classifiers on two additional genre datasets (to which the X-GENRE schema was mapped):
- EN-GINCO (available upon request): a sample of the English enTenTen20 corpus
- [FinCORE](https://github.com/TurkuNLP/FinCORE): the Finnish CORE corpus

| Trained on | Tested on | Micro F1 | Macro F1 |
|:-----------|:----------|---------:|---------:|
| X-GENRE    | EN-GINCO  |    0.688 |    0.691 |
| X-GENRE    | FinCORE   |    0.674 |    0.581 |
| GINCO      | EN-GINCO  |    0.632 |    0.502 |
| FTD        | EN-GINCO  |    0.574 |    0.475 |
| CORE       | EN-GINCO  |    0.485 |    0.422 |

The cross-dataset and cross-lingual experiments showed that the X-GENRE classifier,
trained on all three datasets, outperforms classifiers that were trained on just one of the datasets.

Additionally, we evaluated the X-GENRE classifier on the multilingual X-GINCO dataset, which comprises samples
of texts from the MaCoCu web corpora (http://hdl.handle.net/11356/1969).
The X-GINCO dataset comprises 790 manually annotated instances in 10 languages:
Albanian, Catalan, Croatian, Greek, Icelandic, Macedonian, Maltese, Slovenian, Turkish, and Ukrainian.
To evaluate performance across the genre labels, the dataset is balanced by label,
and the vague label "Other" is not included.
Moreover, instances that were predicted with a confidence score below 0.80 were not included in the test dataset.
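
A minimal sketch of that filtering step, assuming the predictions are available as (text, label, confidence) triples; the triples below are invented:

```python
# Invented (text, predicted label, confidence) triples, for illustration.
predictions = [
    ("text A", "News", 0.97),
    ("text B", "Other", 0.91),   # vague label, excluded
    ("text C", "Legal", 0.64),   # confidence below 0.80, excluded
]

# Keep only confident, non-"Other" predictions, as described above.
kept = [
    (text, label, score)
    for text, label, score in predictions
    if label != "Other" and score >= 0.80
]
```
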

The evaluation shows high cross-lingual performance of the model,
even when it is applied to languages that are not related to the training languages (English and Slovenian) and to non-Latin scripts.

The outlier is Maltese, on which the classifier does not perform well;
we presume that this is because Maltese is not included in the pretraining data of the XLM-RoBERTa model.

The table below reports per-genre F1 scores by language; the "Avg" column averages over all languages except Maltese (mt), which is listed separately as the outlier:

| Genre label             | ca   | el   | hr   | is   | mk   | sl   | sq   | tr   | uk   | Avg  | mt   |
|:------------------------|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
| News                    | 0.82 | 0.90 | 0.95 | 0.73 | 0.91 | 0.90 | 0.89 | 0.95 | 1.00 | 0.89 | 0.69 |
| Opinion/Argumentation   | 0.84 | 0.87 | 0.78 | 0.82 | 0.78 | 0.82 | 0.67 | 0.82 | 0.91 | 0.81 | 0.33 |
| Instruction             | 0.75 | 0.71 | 0.75 | 0.78 | 1.00 | 1.00 | 0.95 | 0.90 | 0.95 | 0.86 | 0.69 |
| Information/Explanation | 0.72 | 0.70 | 0.95 | 0.50 | 0.84 | 0.90 | 0.80 | 0.82 | 1.00 | 0.80 | 0.52 |
| Promotion               | 0.78 | 0.62 | 0.87 | 0.75 | 0.95 | 1.00 | 0.95 | 0.86 | 0.78 | 0.84 | 0.82 |
| Forum                   | 0.84 | 0.95 | 0.91 | 0.95 | 1.00 | 1.00 | 0.78 | 0.89 | 0.95 | 0.91 | 0.18 |
| Prose/Lyrical           | 0.91 | 1.00 | 0.86 | 1.00 | 0.95 | 0.91 | 0.86 | 0.95 | 1.00 | 0.93 | 0.18 |
| Legal                   | 0.95 | 1.00 | 1.00 | 0.84 | 0.95 | 0.95 | 0.95 | 1.00 | 1.00 | 0.96 | /    |
| Macro F1                | 0.83 | 0.84 | 0.88 | 0.80 | 0.92 | 0.94 | 0.85 | 0.90 | 0.95 | 0.87 | 0.49 |
### Fine-tuning hyperparameters
Fine-tuning was performed with `simpletransformers`.
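
As a rough illustration of such a setup, here is a minimal sketch using the `simpletransformers` `ClassificationModel` API; the hyperparameter values and the tiny training frame are placeholders, not the exact published configuration:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Placeholder training data; a real run would use the full X-GENRE
# training split, with integer labels following the labels_map above.
train_df = pd.DataFrame({
    "text": ["The government announced a new tax reform.",
             "Mix the flour and sugar, then bake for 25 minutes."],
    "labels": [2, 3],  # News, Instruction
})

# Illustrative hyperparameters, not the exact values used for this model.
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "overwrite_output_dir": True,
}

model = ClassificationModel(
    "xlmroberta",
    "xlm-roberta-base",
    num_labels=9,     # the nine X-GENRE labels
    use_cuda=False,   # set to True on a GPU machine
    args=model_args,
)
model.train_model(train_df)
```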