|
--- |
|
license: cc-by-4.0 |
|
base_model: bertin-project/bertin-roberta-base-spanish |
|
tags: |
|
- generated_from_trainer |
|
metrics: |
|
- accuracy |
|
model-index: |
|
- name: bertin_base_climate_detection_spa |
|
results: [] |
|
datasets: |
|
- somosnlp/spa_climate_detection |
|
language: |
|
- es |
|
widget: |
|
- text: > |
|
El uso excesivo de fertilizantes nitrogenados -un fenómeno frecuente en la |
|
agricultura- da lugar a la producción de óxido nitroso, un potente gas de |
|
efecto invernadero. Un uso más juicioso de los fertilizantes puede frenar |
|
estas emisiones y reducir la producción de fertilizantes, que consume mucha |
|
energía. |
|
pipeline_tag: text-classification |
|
--- |
|
|
|
|
|
# Model Card for bertin_base_climate_detection_spa
|
|
|
README Spanish Version: [README_ES](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/blob/main/README_ES.md) |
|
|
|
<p align="center"> |
|
<img src="https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/resolve/main/model_image_repo_380.jpg" alt="Model Illustration" width="500"> |
|
</p> |
|
|
|
|
|
This model is a fine-tuned version of [bertin-project/bertin-roberta-base-spanish](https://huggingface.co/bertin-project/bertin-roberta-base-spanish), trained on the dataset [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection).
|
The model identifies texts on topics related to climate change and sustainability. The project is based on the English-language model [climatebert/distilroberta-base-climate-detector](https://huggingface.co/climatebert/distilroberta-base-climate-detector).
|
|
|
The motivation for the project was to create a Spanish-language repository of information and resources on topics such as climate change, sustainability, global warming, and energy. The idea is to give visibility to solutions, examples of good environmental practices, and news that help combat the effects of climate change, much as the [Drawdown](https://drawdown.org/solutions/table-of-solutions) project does, while providing examples of solutions and new research on each topic.

To achieve this objective, we consider the identification of texts that discuss these topics to be the first step. Some direct applications are the classification of papers and scientific publications, news, and opinions.
|
|
|
Future steps: |
|
- We intend to create an advanced model that classifies climate-related texts by sector (token classification), for example: electricity, agriculture, industry, transport, etc.
|
- Publish a sector-based dataset. |
|
- Build a Q&A model that can provide users with relevant information on climate change.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
- **Developed by:** [Gerardo Huerta](https://huggingface.co/Gerard-1705) and [Gabriela Zuñiga](https://huggingface.co/Gabrielaz)
|
- **Funded by:** SomosNLP, HuggingFace |
|
- **Model type:** Language model, fine-tuned for text classification
|
- **Language(s):** es-ES, es-PE |
|
- **License:** cc-by-nc-sa-4.0 |
|
- **Fine-tuned from model:** [bertin-project/bertin-roberta-base-spanish](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) |
|
- **Dataset used:** [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection) |
|
|
|
### Model Resources
|
|
|
- **Repository:** [somosnlp/bertin_base_climate_detection_spa](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/tree/main)
|
- **Demo:** [Identification of texts on climate change and sustainability](https://huggingface.co/spaces/somosnlp/Identificacion_de_textos_sobre_sustentabilidad_cambio_climatico)
|
- **Video presentation:** [Proyecto BERTIN-ClimID](https://www.youtube.com/watch?v=sfXLUP9Ei-o) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
- News classification: the model can classify news headlines related to climate change (a usage sketch follows this list).

- Paper classification: identifying scientific texts that present solutions to and/or effects of climate change. For this use, the abstract of each paper can serve as the input text.
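
A minimal sketch of the news-classification use via the `pipeline` API; the example headline is our own invention, and the exact label and score shown in the comment are illustrative:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as a text-classification pipeline
classifier = pipeline(
    "text-classification",
    model="somosnlp/bertin_base_climate_detection_spa",
)

# Hypothetical news headline about emissions policy
headline = "La Unión Europea aprueba nuevas medidas para reducir las emisiones de CO2"
print(classifier(headline))  # e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```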
|
|
|
### Indirect Use |
|
- For the creation of information repositories regarding climate issues. |
|
- This model can serve as a basis for creating new classification systems for climate solutions to disseminate new efforts to combat climate change in different sectors. |
|
- Creation of new datasets that address the issue. |
|
|
|
### Out-of-Scope Use |
|
- Classifying texts from unverifiable or unreliable sources and disseminating them, e.g., fake news or disinformation.
|
|
|
## Bias, Risks, and Limitations |
|
No specific studies of biases and limitations have been carried out at this point; however, based on previous experience and tests of the model, we note the following:

- The model inherits the biases and limitations of the base model it was trained from; for details see [BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403). These are less apparent here because of the type of task the model performs, namely text classification.

- The dataset carries direct biases, such as a predominance of formal, high-register language, because most texts were extracted from news and corporate legal documentation; this can hinder the identification of texts written in informal (e.g. colloquial) language. To mitigate these biases, diverse opinions on climate change extracted from sources such as social networks were added to the dataset, along with a rebalancing of the labels.

- The model also inherits limitations from the dataset: it loses performance on short texts, because most texts in the dataset are long (between 200 and 500 words). Again, we tried to mitigate this by including short texts.
|
|
|
### Recommendations |
|
|
|
- As mentioned above, the model tends to perform worse on short texts, so it is advisable to apply a selection criterion that favors longer texts when identifying subject matter; a simple length filter is sketched below.
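
A minimal sketch of such a selection criterion; the 50-word threshold is an assumption, not a validated value, and should be tuned on your own data:

```python
MIN_WORDS = 50  # assumed threshold, not a validated value

def should_classify(text: str, min_words: int = MIN_WORDS) -> bool:
    # Skip very short texts, where the model is less reliable
    return len(text.split()) >= min_words
```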
|
|
|
## How to Get Started with the Model |
|
|
|
```python
## Assumes transformers and torch are installed
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("somosnlp/bertin_base_climate_detection_spa")
model = AutoModelForSequenceClassification.from_pretrained("somosnlp/bertin_base_climate_detection_spa")

# Label mapping (matches the id2label configuration of the checkpoint)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

# Inference function
def inference_fun(text):
    # Tokenize the input and run a forward pass without tracking gradients
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Take the class with the highest logit and map it to its label
    predicted_class_id = logits.argmax().item()
    return model.config.id2label[predicted_class_id]

input_text = "El uso excesivo de fertilizantes nitrogenados -un fenómeno frecuente en la agricultura- da lugar a la producción de óxido nitroso, un potente gas de efecto invernadero. Un uso más juicioso de los fertilizantes puede frenar estas emisiones y reducir la producción de fertilizantes, que consume mucha energía."

print(inference_fun(input_text))
```
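
If a confidence score is needed alongside the label, the logits can be converted to probabilities with a softmax; a small extension of the function above (the output shown in the comment is illustrative):

```python
def inference_with_score(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax turns the two logits into class probabilities
    probs = torch.softmax(logits, dim=-1)[0]
    predicted_class_id = int(probs.argmax())
    return model.config.id2label[predicted_class_id], float(probs[predicted_class_id])

print(inference_with_score(input_text))  # e.g. ('POSITIVE', 0.99)
```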
|
|
|
|
|
## Training Details |
|
|
|
### Training Data |
|
The training data were obtained from the dataset [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection). |
|
The training data represent about 79% of the total data in the dataset. |
|
|
|
The labels are distributed as follows:

Label 1:

- 1000 paragraphs extracted from company reports on the subject.

- 600 texts with various opinions, mostly short.

Label 0:

- 300 paragraphs extracted from business reports not related to the subject.

- 500 news items on various topics unrelated to the subject.

- 500 opinions on various topics unrelated to the subject.
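
These counts can be checked by loading the dataset and tallying the labels; a sketch in which the split name `train` and the column name `label` are assumptions, so adjust them to the schema shown on the dataset card:

```python
from collections import Counter
from datasets import load_dataset

ds = load_dataset("somosnlp/spa_climate_detection")
print(ds)  # inspect the available splits and columns
# "train" and "label" are assumed names; adjust to the actual schema
print(Counter(ds["train"]["label"]))
```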
|
|
|
### Training Procedure |
|
You can review the training procedure in our Google Colab notebook: [Colab Entrenamiento](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/blob/main/entrenamiento_del_modelo.ipynb)
|
The `accelerate` configuration was as follows:

- In which compute environment are you running?: 0

- Which type of machine are you using?: No distributed training

- Do you want to run your training on CPU only (even if a GPU / Apple Silicon / Ascend NPU device is available)? [yes/NO]: NO

- Do you wish to optimize your script with torch dynamo? [yes/NO]: NO

- Do you want to use DeepSpeed? [yes/NO]: NO

- What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]: all

- Do you wish to use FP16 or BF16 (mixed precision)?: no
|
|
|
#### Training Hyperparameters |
|
The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
|
- learning_rate: 2e-05 |
|
- train_batch_size: 16 |
|
- eval_batch_size: 16 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 2 |
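
For reference, these hyperparameters map onto `transformers.TrainingArguments` roughly as follows; this is a sketch, not the exact notebook cell (see the linked Colab for the full procedure), and `output_dir` and `evaluation_strategy` are assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bertin_base_climate_detection_spa",  # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    seed=42,
    lr_scheduler_type="linear",
    evaluation_strategy="epoch",  # assumed: the results table reports per-epoch validation
)
```

The Adam settings listed above (betas=(0.9, 0.999), epsilon=1e-08) are the `Trainer` defaults, so no explicit optimizer argument is needed.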
|
|
|
#### Speeds, Sizes, Times |
|
The model was trained for 2 epochs with a total training duration of about 14.2 minutes (`train_runtime`: 853.68 seconds).
|
Additional information: No mixed precision (FP16 or BF16) was used. |
|
|
|
|
|
#### Training results
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Accuracy | |
|
|:-------------:|:-----:|:----:|:---------------:|:--------:| |
|
| No log | 1.0 | 182 | 0.1964 | 0.9551 | |
|
| No log | 2.0 | 364 | 0.1592 | 0.9705 | |
|
|
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
The evaluation data were obtained from the dataset [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection).

The evaluation data represent about 21% of the total data in the dataset.
|
The labels are distributed as follows:

Label 1:

- 320 paragraphs extracted from company reports on the subject.

- 160 texts with various opinions, mostly short.

Label 0:

- 80 paragraphs extracted from business reports not related to the subject.

- 120 news items on various topics unrelated to the subject.

- 100 opinions on various topics unrelated to the subject.
|
|
|
|
|
**The model reached the following results on the evaluation dataset:**
|
- **Loss:** 0.1592 |
|
- **Accuracy:** 0.9705 |
|
|
|
#### Metrics |
|
The evaluation metric used during training was accuracy.
|
|
|
### Results |
|
See the inference section of the Colab notebook: [entrenamiento_del_modelo](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/blob/main/entrenamiento_del_modelo.ipynb)
|
|
|
- Accuracy: 0.95

- Precision: 0.916

- Recall: 0.99

- F1 score: 0.951
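
These figures can be recomputed from the model's predictions with scikit-learn; a minimal sketch in which `y_true` and `y_pred` hold illustrative 0/1 labels rather than the real evaluation split:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative labels only; in practice y_true comes from the evaluation split
# and y_pred from running inference_fun over its texts.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 1]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"Accuracy  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision {precision:.3f}")
print(f"Recall    {recall:.3f}")
print(f"F1 score  {f1:.3f}")
```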
|
|
|
## Environmental Impact |
|
Using the tool [ML CO2 IMPACT](https://mlco2.github.io/impact/#co2eq) we estimate the following environmental impact due to training: |
|
- **Hardware type:** T4

- **Total hours for iterations and tests:** 4 hours

- **Cloud provider:** Google Cloud (Colab)

- **Compute region:** us-east

- **Carbon footprint:** 0.1 kg CO2
|
|
|
|
|
## Technical Specifications |
|
|
|
### Software
|
|
|
- Transformers 4.39.3 |
|
- Pytorch 2.2.1+cu121 |
|
- Datasets 2.18.0 |
|
- Tokenizers 0.15.2 |
|
|
|
### Hardware
|
|
|
- GPU equivalent to T4 |
|
- For reference, the model was trained on the free version of Google Colab |
|
|
|
## License |
|
|
|
cc-by-nc-sa-4.0, inherited from the licenses of the data used in the dataset.
|
|
|
## Citation |
|
**BibTeX:** |
|
``` |
|
@software{BERTIN-ClimID,
  author = {Gerardo Huerta and Gabriela Zuñiga},
  title = {BERTIN-ClimID: BERTIN-Base Climate-related text Identification},
  month = apr,
  year = 2024,
  url = {https://huggingface.co/somosnlp/bertin_base_climate_detection_spa}
}
|
``` |
|
|
|
## More Information |
|
|
|
This project was developed during the [Hackathon #Somos600M](https://somosnlp.org/hackathon) organized by SomosNLP. We thank all event organizers and sponsors for their support during the event. |
|
|
|
**Team:** |
|
|
|
- [Gerardo Huerta](https://huggingface.co/Gerard-1705) |
|
- [Gabriela Zuñiga](https://huggingface.co/Gabrielaz) |
|
|
|
## Contact |
|
|
|
- [email protected] |
|
- [email protected] |