---
license: cc-by-4.0
base_model: bertin-project/bertin-roberta-base-spanish
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: bertin_base_climate_detection_spa
  results: []
datasets:
- somosnlp/spa_climate_detection
language:
- es
widget:
- text: >
    El uso excesivo de fertilizantes nitrogenados -un fenómeno frecuente en la
    agricultura- da lugar a la producción de óxido nitroso, un potente gas de
    efecto invernadero. Un uso más juicioso de los fertilizantes puede frenar
    estas emisiones y reducir la producción de fertilizantes, que consume mucha
    energía.
pipeline_tag: text-classification
---


# Model Card for bertin_base_climate_detection_spa_v2

Spanish version of this README: [README_ES](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/blob/main/README_ES.md)

<p align="center">
<img src="https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/resolve/main/model_image_repo_380.jpg" alt="Model Illustration" width="500">
</p>


This model is a fine-tuned version of [bertin-project/bertin-roberta-base-spanish](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) trained on the [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection) dataset.
The model identifies texts on topics related to climate change and sustainability. This project was based on the English-language model [climatebert/distilroberta-base-climate-detector](https://huggingface.co/climatebert/distilroberta-base-climate-detector).

The motivation of the project was to create a Spanish-language repository of information and resources on topics such as climate change, sustainability, global warming, and energy. The idea is to give visibility to solutions, examples of good environmental practices, and news that help us combat the effects of climate change, similar to what the [Drawdown](https://drawdown.org/solutions/table-of-solutions) project does, but providing examples of solutions or new research on each topic.
To achieve this objective, we consider identifying texts that discuss these topics to be the first step. Some direct applications are the classification of scientific papers and publications, news, and opinion pieces.

Future steps:
- Create a more advanced model that classifies climate-related texts by sector (token classification), for example: electricity, agriculture, industry, transport, etc.
- Publish a sector-based dataset.
- Build a Q&A model that can provide users with relevant information on climate change.

## Model Details

### Model Description
- **Developed by:** [Gerardo Huerta](https://huggingface.co/Gerard-1705) and [Gabriela Zuñiga](https://huggingface.co/Gabrielaz)
- **Funded by:** SomosNLP, HuggingFace
- **Model type:** Language model, fine-tuned for text classification
- **Language(s):** es-ES, es-PE
- **License:** cc-by-nc-sa-4.0
- **Fine-tuned from model:** [bertin-project/bertin-roberta-base-spanish](https://huggingface.co/bertin-project/bertin-roberta-base-spanish)
- **Dataset used:** [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection)

### Model Sources

- **Repository:** [somosnlp/bertin_base_climate_detection_spa](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/tree/main)
- **Demo:** [Identification of texts on climate change and sustainability](https://huggingface.co/spaces/somosnlp/Identificacion_de_textos_sobre_sustentabilidad_cambio_climatico)
- **Video presentation:** [Proyecto BERTIN-ClimID](https://www.youtube.com/watch?v=sfXLUP9Ei-o)

## Uses

### Direct Use
- News classification: the model can classify news headlines related to climate change (see the sketch below).
- Paper classification: identifying scientific texts that present solutions to and/or effects of climate change. For this use case, the abstract of each paper can be used for identification.

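As an illustration, here is a minimal sketch of headline classification using the `transformers` pipeline API (the example headlines are invented for demonstration):

```python
from transformers import pipeline

# Build a text-classification pipeline on top of this model
classifier = pipeline(
    "text-classification",
    model="somosnlp/bertin_base_climate_detection_spa",
)

# Invented example headlines, for illustration only
headlines = [
    "Las energías renovables superan al carbón en la generación eléctrica mundial",
    "El equipo local gana el campeonato nacional de fútbol",
]
for headline, result in zip(headlines, classifier(headlines)):
    print(f"{result['label']} ({result['score']:.2f}) -> {headline}")
```
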
### Indirect Use
- Creation of information repositories on climate issues.
- The model can serve as a basis for new classification systems for climate solutions, helping disseminate new efforts to combat climate change across different sectors.
- Creation of new datasets that address the topic.

### Out-of-Scope Use
- Classifying and disseminating texts from unverifiable or unreliable sources, e.g., fake news or disinformation.

## Bias, Risks, and Limitations
No specific studies of biases and limitations have been carried out yet; however, based on prior experience and testing of the model, we note the following:
- The model inherits the biases and limitations of the base model it was trained from; for more details see [BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403). They are, however, not as evident given the kind of task the model is used for, namely text classification.
- There are direct biases, such as the predominance of formal, high-register language in the dataset, since the texts were extracted from news and corporate legal documentation; this can make it harder to identify texts written in informal (e.g., colloquial) language. To mitigate these biases, diverse opinions on climate change extracted from sources such as social media were included in the dataset, and the labels were rebalanced.
- The dataset passes on other limitations, for example: the model loses performance on short texts, because most texts in the dataset are long, between 200 and 500 words. Again, we tried to mitigate this limitation by including short texts.

### Recommendations

- As mentioned above, the model tends to perform worse on short texts, so we recommend applying it to reasonably long texts whose topic needs to be identified.

## How to Get Started with the Model

```python
# Assumes the transformers and torch packages are installed
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("somosnlp/bertin_base_climate_detection_spa")
model = AutoModelForSequenceClassification.from_pretrained("somosnlp/bertin_base_climate_detection_spa")

# Label mapping: 1/POSITIVE = climate-related, 0/NEGATIVE = not climate-related
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

# Inference function: returns the predicted label for a given text
def inference_fun(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_class_id = logits.argmax().item()
    return model.config.id2label[predicted_class_id]

input_text = "El uso excesivo de fertilizantes nitrogenados -un fenómeno frecuente en la agricultura- da lugar a la producción de óxido nitroso, un potente gas de efecto invernadero. Un uso más juicioso de los fertilizantes puede frenar estas emisiones y reducir la producción de fertilizantes, que consume mucha energía."

print(inference_fun(input_text))
```


## Training Details

### Training Data
The training data come from the dataset [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection) and represent about 79% of the total data in the dataset.

The labels are distributed as follows:

Label 1:
- 1000 paragraphs extracted from company reports on the subject.
- 600 diverse opinions, mostly short texts.

Label 0:
- 300 paragraphs extracted from business reports not related to the subject.
- 500 news items on various topics unrelated to the subject.
- 500 opinions on various topics unrelated to the subject.

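For reference, a minimal sketch of inspecting the dataset with the `datasets` library (the split and column names are assumptions; check the dataset card for the actual schema):

```python
from datasets import load_dataset

# Load the dataset used for fine-tuning
ds = load_dataset("somosnlp/spa_climate_detection")

# Inspect the available splits and column names before relying on them
print(ds)

# Assuming a "train" split exists, look at one example
print(ds["train"][0])
```
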
### Training Procedure
You can review the training procedure in our Google Colab notebook: [Colab Entrenamiento](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/blob/main/entrenamiento_del_modelo.ipynb)

The accelerate configuration was as follows:
- In which compute environment are you running?: 0
- Which type of machine are you using?: No distributed training
- Do you want to run your training on CPU only (even if a GPU / Apple Silicon / Ascend NPU device is available)? [yes/NO]: NO
- Do you wish to optimize your script with torch dynamo? [yes/NO]: NO
- Do you want to use DeepSpeed? [yes/NO]: NO
- What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]: all
- Do you wish to use FP16 or BF16 (mixed precision)?: no

#### Training Hyperparameters
The following hyperparameters were used during training (a sketch of the corresponding Trainer setup follows the list):
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2

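A minimal sketch of how these hyperparameters map onto the `transformers` Trainer API (the `model`, `tokenized_train`, and `tokenized_eval` variables are assumed to be prepared beforehand; see the training notebook above for the actual script):

```python
from transformers import Trainer, TrainingArguments

# Hyperparameters from the list above; output_dir is an arbitrary placeholder.
# The optimizer (AdamW with betas=(0.9, 0.999), epsilon=1e-8) and the linear
# scheduler are the Trainer defaults, matching the values reported.
training_args = TrainingArguments(
    output_dir="bertin_base_climate_detection_spa",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    seed=42,
    lr_scheduler_type="linear",
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,                    # assumed: the loaded sequence-classification model
    args=training_args,
    train_dataset=tokenized_train,  # assumed: tokenized training split
    eval_dataset=tokenized_eval,    # assumed: tokenized evaluation split
)
trainer.train()
```
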
#### Speeds, Sizes, Times
The model was trained for 2 epochs, with a total training time of 14.22 minutes ('train_runtime': 853.6759 seconds).
As an additional note, no mixed precision (FP16 or BF16) was used.


#### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| No log        | 1.0   | 182  | 0.1964          | 0.9551   |
| No log        | 2.0   | 364  | 0.1592          | 0.9705   |


## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The evaluation data come from the dataset [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection) and represent about 21% of the total data in the dataset.
The labels are distributed as follows:

Label 1:
- 320 paragraphs extracted from company reports on the subject.
- 160 diverse opinions, mostly short texts.

Label 0:
- 80 paragraphs extracted from business reports not related to the subject.
- 120 news items on various topics unrelated to the subject.
- 100 opinions on various topics unrelated to the subject.

**The model reached the following results on the evaluation dataset:**
- **Loss:** 0.1592
- **Accuracy:** 0.9705

#### Metrics
Accuracy was used as the evaluation metric during training; precision, recall, and F1 score were also computed on the evaluation data (see Results below).

### Results
See the inference section of the Colab notebook: [entrenamiento_del_modelo](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/blob/main/entrenamiento_del_modelo.ipynb)

| Metric    | Value |
|:----------|:-----:|
| Accuracy  | 0.95  |
| Precision | 0.916 |
| Recall    | 0.99  |
| F1 score  | 0.951 |

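For reference, a minimal sketch of how such metrics can be computed from predictions with scikit-learn (the `y_true` and `y_pred` arrays below are invented placeholders, not the actual evaluation outputs):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Placeholder gold labels and model predictions (0 = unrelated, 1 = climate-related)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
```
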
## Environmental Impact
Using the [ML CO2 IMPACT](https://mlco2.github.io/impact/#co2eq) tool, we estimated the following environmental impact of training:
- **Hardware type:** T4
- **Hours used (including tests and iterations to improve the model):** 4 hours
- **Cloud provider:** Google Cloud (Colab)
- **Compute region:** us-east
- **Carbon footprint emitted:** 0.1 kg CO2


## Technical Specifications

#### Software

- Transformers 4.39.3
- Pytorch 2.2.1+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2

#### Hardware

- GPU equivalent to T4
- For reference, the model was trained on the free version of Google Colab

## License

cc-by-nc-sa-4.0, inherited from the license of the data used in the dataset.

## Citation
**BibTeX:**
```
@software{BERTIN-ClimID,
  author = {Gerardo Huerta and Gabriela Zuñiga},
  title = {BERTIN-ClimID: BERTIN-Base Climate-related text Identification},
  month = apr,
  year = 2024,
  url = {https://huggingface.co/somosnlp/bertin_base_climate_detection_spa}
}
```

## More Information

This project was developed during the [Hackathon #Somos600M](https://somosnlp.org/hackathon) organized by SomosNLP. We thank all the event organizers and sponsors for their support.

**Team:**

- [Gerardo Huerta](https://huggingface.co/Gerard-1705)
- [Gabriela Zuñiga](https://huggingface.co/Gabrielaz)

## Contact