|
--- |
|
license: cc-by-4.0 |
|
base_model: bertin-project/bertin-roberta-base-spanish |
|
tags: |
|
- generated_from_trainer |
|
metrics: |
|
- accuracy |
|
model-index: |
|
- name: bertin_base_climate_detection_spa |
|
results: [] |
|
datasets: |
|
- somosnlp/spa_climate_detection |
|
language: |
|
- es |
|
widget: |
|
- text: > |
|
El uso excesivo de fertilizantes nitrogenados -un fenómeno frecuente en la |
|
agricultura- da lugar a la producción de óxido nitroso, un potente gas de |
|
efecto invernadero. Un uso más juicioso de los fertilizantes puede frenar |
|
estas emisiones y reducir la producción de fertilizantes, que consume mucha |
|
energía. |
|
pipeline_tag: text-classification |
|
--- |
|
|
|
|
|
# Model Card for bertin_base_climate_detection_spa
|
|
|
README Spanish Version: [README_ES](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/blob/main/README_ES.md) |
|
|
|
<p align="center"> |
|
<img src="https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/resolve/main/model_image_repo_380.jpg" alt="Model Illustration" width="500"> |
|
</p> |
|
|
|
|
|
This model is a fine-tuned version of [bertin-project/bertin-roberta-base-spanish](https://huggingface.co/bertin-project/bertin-roberta-base-spanish), trained on the dataset [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection).
|
The model identifies texts on topics related to climate change and sustainability. The project is based on the English-language model [climatebert/distilroberta-base-climate-detector](https://huggingface.co/climatebert/distilroberta-base-climate-detector).
|
|
|
The motivation for the project was to create a Spanish-language repository of information and resources on topics such as climate change, sustainability, global warming, and energy. The idea is to give visibility to solutions, examples of good environmental practices, and news that help combat the effects of climate change, much as the [Drawdown](https://drawdown.org/solutions/table-of-solutions) project does, while providing examples of solutions and new research on each topic.

To achieve this objective, we consider the identification of texts that discuss these topics to be the first step. Some direct applications are the classification of papers and scientific publications, news, and opinions.
|
|
|
Future steps: |
|
- We intend to create an advanced model that classifies climate-related texts by sector (token classification), for example: electricity, agriculture, industry, transport, etc.
|
- Publish a sector-based dataset. |
|
- Build a Q&A model that can provide users with relevant information on climate change.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
- **Developed by:** [Gerardo Huerta](https://huggingface.co/Gerard-1705) and [Gabriela Zuñiga](https://huggingface.co/Gabrielaz)
|
- **Funded by:** SomosNLP, HuggingFace |
|
- **Model type:** Language model, fine-tuned for text classification
|
- **Language(s):** es-ES, es-PE |
|
- **License:** cc-by-nc-sa-4.0 |
|
- **Fine-tuned from model:** [bertin-project/bertin-roberta-base-spanish](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) |
|
- **Dataset used:** [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection) |
|
|
|
### Model Resources
|
|
|
- **Repository:** [somosnlp/bertin_base_climate_detection_spa](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/tree/main)
|
- **Demo:** [Identification of texts on climate change and sustainability](https://huggingface.co/spaces/somosnlp/Identificacion_de_textos_sobre_sustentabilidad_cambio_climatico)
|
- **Video presentation:** [Proyecto BERTIN-ClimID](https://www.youtube.com/watch?v=sfXLUP9Ei-o) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
- News classification: the model can classify news headlines related to climate change (a usage sketch follows this list).

- Paper classification: identifying scientific texts that present solutions to and/or effects of climate change. For this use, the abstract of each paper can serve as the input text.
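
A minimal sketch of the news-classification use via the `pipeline` API; the example headline is our own invention, and the exact label and score shown in the comment are illustrative:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as a text-classification pipeline
classifier = pipeline(
    "text-classification",
    model="somosnlp/bertin_base_climate_detection_spa",
)

# Hypothetical news headline about emissions policy
headline = "La Unión Europea aprueba nuevas medidas para reducir las emisiones de CO2"
print(classifier(headline))  # e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```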
|
|
|
### Indirect Use |
|
- For the creation of information repositories regarding climate issues. |
|
- This model can serve as a basis for creating new classification systems for climate solutions to disseminate new efforts to combat climate change in different sectors. |
|
- Creation of new datasets that address the issue. |
|
|
|
### Out-of-Scope Use |
|
- Classifying texts from unverifiable or unreliable sources and disseminating them, e.g., fake news or disinformation.
|
|
|
## Bias, Risks, and Limitations |
|
No specific studies of biases and limitations have been carried out at this point; however, based on previous experience and tests of the model, we note the following:

- The model inherits the biases and limitations of the base model it was trained from; for details see [BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403). These are less apparent here because of the type of task the model performs, namely text classification.

- The dataset carries direct biases, such as a predominance of formal, high-register language, because most texts were extracted from news and corporate legal documentation; this can hinder the identification of texts written in informal (e.g. colloquial) language. To mitigate these biases, diverse opinions on climate change extracted from sources such as social networks were added to the dataset, along with a rebalancing of the labels.

- The model also inherits limitations from the dataset: it loses performance on short texts, because most texts in the dataset are long (between 200 and 500 words). Again, we tried to mitigate this by including short texts.
|
|
|
### Recommendations |
|
|
|
- As mentioned above, the model tends to perform worse on short texts, so it is advisable to apply a selection criterion that favors longer texts when identifying subject matter; a simple length filter is sketched below.
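
A minimal sketch of such a selection criterion; the 50-word threshold is an assumption, not a validated value, and should be tuned on your own data:

```python
MIN_WORDS = 50  # assumed threshold, not a validated value

def should_classify(text: str, min_words: int = MIN_WORDS) -> bool:
    # Skip very short texts, where the model is less reliable
    return len(text.split()) >= min_words
```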
|
|
|
## How to Get Started with the Model |
|
|
|
```python
## Assumes transformers and torch are installed
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("somosnlp/bertin_base_climate_detection_spa")
model = AutoModelForSequenceClassification.from_pretrained("somosnlp/bertin_base_climate_detection_spa")

# Label mapping (matches the id2label configuration of the checkpoint)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

# Inference function
def inference_fun(text):
    # Tokenize the input and run a forward pass without tracking gradients
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Take the class with the highest logit and map it to its label
    predicted_class_id = logits.argmax().item()
    return model.config.id2label[predicted_class_id]

input_text = "El uso excesivo de fertilizantes nitrogenados -un fenómeno frecuente en la agricultura- da lugar a la producción de óxido nitroso, un potente gas de efecto invernadero. Un uso más juicioso de los fertilizantes puede frenar estas emisiones y reducir la producción de fertilizantes, que consume mucha energía."

print(inference_fun(input_text))
```
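
If a confidence score is needed alongside the label, the logits can be converted to probabilities with a softmax; a small extension of the function above (the output shown in the comment is illustrative):

```python
def inference_with_score(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax turns the two logits into class probabilities
    probs = torch.softmax(logits, dim=-1)[0]
    predicted_class_id = int(probs.argmax())
    return model.config.id2label[predicted_class_id], float(probs[predicted_class_id])

print(inference_with_score(input_text))  # e.g. ('POSITIVE', 0.99)
```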
|
|
|
|
|
## Training Details |
|
|
|
### Training Data |
|
The training data were obtained from the dataset [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection). |
|
The training data represent about 79% of the total data in the dataset. |
|
|
|
The labels are distributed as follows:

Label 1:

- 1000 paragraphs extracted from company reports on the subject.

- 600 texts with various opinions, mostly short.

Label 0:

- 300 paragraphs extracted from business reports not related to the subject.

- 500 news items on various topics unrelated to the subject.

- 500 opinions on various topics unrelated to the subject.
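
These counts can be checked by loading the dataset and tallying the labels; a sketch in which the split name `train` and the column name `label` are assumptions, so adjust them to the schema shown on the dataset card:

```python
from collections import Counter
from datasets import load_dataset

ds = load_dataset("somosnlp/spa_climate_detection")
print(ds)  # inspect the available splits and columns
# "train" and "label" are assumed names; adjust to the actual schema
print(Counter(ds["train"]["label"]))
```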
|
|
|
### Training Procedure |
|
You can review the training procedure in our Google Colab notebook: [Colab Entrenamiento](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/blob/main/entrenamiento_del_modelo.ipynb)
|
The `accelerate` configuration was as follows:

- In which compute environment are you running?: 0

- Which type of machine are you using?: No distributed training

- Do you want to run your training on CPU only (even if a GPU / Apple Silicon / Ascend NPU device is available)? [yes/NO]: NO

- Do you wish to optimize your script with torch dynamo? [yes/NO]: NO

- Do you want to use DeepSpeed? [yes/NO]: NO

- What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]: all

- Do you wish to use FP16 or BF16 (mixed precision)?: no
|
|
|
#### Training Hyperparameters |
|
The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
|
- learning_rate: 2e-05 |
|
- train_batch_size: 16 |
|
- eval_batch_size: 16 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 2 |
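
For reference, these hyperparameters map onto `transformers.TrainingArguments` roughly as follows; this is a sketch, not the exact notebook cell (see the linked Colab for the full procedure), and `output_dir` and `evaluation_strategy` are assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bertin_base_climate_detection_spa",  # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    seed=42,
    lr_scheduler_type="linear",
    evaluation_strategy="epoch",  # assumed: the results table reports per-epoch validation
)
```

The Adam settings listed above (betas=(0.9, 0.999), epsilon=1e-08) are the `Trainer` defaults, so no explicit optimizer argument is needed.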
|
|
|
#### Speeds, Sizes, Times |
|
The model was trained for 2 epochs with a total training duration of about 14.2 minutes (`train_runtime`: 853.68 seconds).
|
Additional information: No mixed precision (FP16 or BF16) was used. |
|
|
|
|
|
#### Training results
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Accuracy | |
|
|:-------------:|:-----:|:----:|:---------------:|:--------:| |
|
| No log | 1.0 | 182 | 0.1964 | 0.9551 | |
|
| No log | 2.0 | 364 | 0.1592 | 0.9705 | |
|
|
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
The evaluation data were obtained from the dataset [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection).

The evaluation data represent about 21% of the total data in the dataset.
|
The labels are distributed as follows:

Label 1:

- 320 paragraphs extracted from company reports on the subject.

- 160 texts with various opinions, mostly short.

Label 0:

- 80 paragraphs extracted from business reports not related to the subject.

- 120 news items on various topics unrelated to the subject.

- 100 opinions on various topics unrelated to the subject.
|
|
|
|
|
**The model reached the following results on the evaluation dataset:**
|
- **Loss:** 0.1592 |
|
- **Accuracy:** 0.9705 |
|
|
|
#### Metrics |
|
The evaluation metric used during training was accuracy.
|
|
|
### Results |
|
See the inference section of the Colab notebook: [entrenamiento_del_modelo](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/blob/main/entrenamiento_del_modelo.ipynb)
|
|
|
- Accuracy: 0.95

- Precision: 0.916

- Recall: 0.99

- F1 score: 0.951
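
These figures can be recomputed from the model's predictions with scikit-learn; a minimal sketch in which `y_true` and `y_pred` hold illustrative 0/1 labels rather than the real evaluation split:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative labels only; in practice y_true comes from the evaluation split
# and y_pred from running inference_fun over its texts.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 1]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"Accuracy  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision {precision:.3f}")
print(f"Recall    {recall:.3f}")
print(f"F1 score  {f1:.3f}")
```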
|
|
|
## Environmental Impact |
|
Using the tool [ML CO2 IMPACT](https://mlco2.github.io/impact/#co2eq) we estimate the following environmental impact due to training: |
|
- **Hardware type:** T4

- **Total hours for iterations and tests:** 4 hours

- **Cloud provider:** Google Cloud (Colab)

- **Compute region:** us-east

- **Carbon footprint:** 0.1 kg CO2
|
|
|
|
|
## Technical Specifications |
|
|
|
### Software
|
|
|
- Transformers 4.39.3 |
|
- Pytorch 2.2.1+cu121 |
|
- Datasets 2.18.0 |
|
- Tokenizers 0.15.2 |
|
|
|
### Hardware
|
|
|
- GPU equivalent to T4 |
|
- For reference, the model was trained on the free version of Google Colab |
|
|
|
## License |
|
|
|
cc-by-nc-sa-4.0, inherited from the licenses of the data used in the dataset.
|
|
|
## Citation |
|
**BibTeX:** |
|
``` |
|
@software{BERTIN-ClimID,
  author = {Gerardo Huerta and Gabriela Zuñiga},
  title = {BERTIN-ClimID: BERTIN-Base Climate-related text Identification},
  month = apr,
  year = 2024,
  url = {https://huggingface.co/somosnlp/bertin_base_climate_detection_spa}
}
|
``` |
|
|
|
## More Information |
|
|
|
This project was developed during the [Hackathon #Somos600M](https://somosnlp.org/hackathon) organized by SomosNLP. We thank all event organizers and sponsors for their support during the event. |
|
|
|
**Team:** |
|
|
|
- [Gerardo Huerta](https://huggingface.co/Gerard-1705) |
|
- [Gabriela Zuñiga](https://huggingface.co/Gabrielaz) |
|
|
|
## Contact |
|
|
|
- [email protected] |
|
- [email protected] |