friederikebauer committed on
Commit
1c6570f
·
1 Parent(s): a97ed3b

Update README.md

Files changed (1)
  1. README.md +48 -17
README.md CHANGED
@@ -28,28 +28,28 @@ model-index:
28
  revision: 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6
29
  metrics:
30
  - type: accuracy
31
- value: 0.7336065573770492
32
  name: Accuracy 'Bezeichnung'
33
  - type: precision
34
- value: 0.6611111111111111
35
  name: Precision 'Bezeichnung' (macro)
36
  - type: recall
37
- value: 0.7056652046783626
38
  name: Recall 'Bezeichnung' (macro)
39
  - type: f1
40
- value: 0.6674970889256604
41
  name: F1 'Bezeichnung' (macro)
42
  - type: accuracy
43
- value: 0.8934426229508197
44
  name: Accuracy 'Thema'
45
  - type: precision
46
- value: 0.902382942746851
47
  name: Precision 'Thema' (macro)
48
  - type: recall
49
- value: 0.8909340386389567
50
  name: Recall 'Thema' (macro)
51
  - type: f1
52
- value: 0.8768881364249785
53
  name: F1 'Thema' (macro)
54
  ---
55
 
@@ -61,6 +61,19 @@ model-index:
61
 
62
  This model is based on [bert-base-german-cased](https://huggingface.co/bert-base-cased) and fine-tuned on [and-effect/mdk_gov_data_titles_clf](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf).
63
 
 
 
 
 
 
 
 
 
 
 
 
 
 
64
  - **Developed by:** and-effect
65
  - **Model type:** Text Classification
66
  - **Language(s) (NLP):** de
@@ -137,14 +150,19 @@ print(sentence_embeddings)
137
 
138
  <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
139
 
140
- The model is intended to classify open source dataset titles from german municipalities. More information on the Taxonomy (classification categories) and the Project can be found on [More Information Needed].
 
 
141
 
142
 
143
  # Bias, Risks, and Limitations
144
 
145
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
146
 
147
- The model has some limititations. The model has some limitations in terms of the downstream task. \n 1. **Distribution of classes**: The dataset trained on is small, but at the same time the number of classes is very high. Thus, for some classes there are only a few examples (more information about the class distribution of the training data can be found here). Consequently, the performance for smaller classes may not be as good as for the majority classes. Accordingly, the evaluation is also limited. \n 2. **Systematic problems**: some subjects could not be correctly classified systematically. One example is the embedding of titles containing 'Corona'. In none of the evaluation cases could the titles be embedded in such a way that they corresponded to their true names. Another systematic example is the embedding and classification of titles related to 'migration'. \n 3. **Generalization of the model**: by using semantic search, the model is able to classify titles into new categories that have not been trained, but the model is not tuned for this and therefore the performance of the model for unseen classes is likely to be limited.
 
 
 
148
 
149
  ## Recommendations
150
 
@@ -213,17 +231,30 @@ The evaluation data can be found [here](https://huggingface.co/datasets/and-effe
213
 
214
  ### Metrics
215
 
216
- The model performance is tested with fours metrices. Accuracy, Precision, Recall and F1 Score. A lot of classes were not predicted and are thus set to zero for the calculation of precision, recall and f1 score. For these metrices the additional calucations were performed exluding classes with less than two predictions for the level 'Bezeichnung' (see in table results 'Bezeichnung II'. Although intepretation of these results should be interpreted with caution, because they do not represent all classes.
 
 
 
 
 
 
 
 
 
 
217
 
218
  ## Results
219
 
220
  | ***task*** | ***accuracy*** | ***precision (macro)*** | ***recall (macro)*** | ***f1 (macro)*** |
221
  |-----|-----|-----|-----|-----|
222
- | Test dataset 'Bezeichnung' I | 0.7336065573770492 | 0.6611111111111111 | 0.7056652046783626 | 0.6674970889256604 |
223
- | Test dataset 'Thema' I | 0.8934426229508197 | 0.902382942746851 | 0.8909340386389567 | 0.8768881364249785 |
224
- | Test dataset 'Bezeichnung' II | 0.7336065573770492 | 0.5829457364341085 | 0.8229090167278661 | 0.6544072311514172 |
225
- | Validation dataset 'Bezeichnung' I | 0.5148514851485149 | 0.346125116713352 | 0.3553921568627451 | 0.33252525252525256 |
226
- | Validation dataset 'Thema' I | 0.7722772277227723 | 0.5908392682586232 | 0.6784524126899494 | 0.5962308463774738 |
227
- | Validation dataset 'Bezeichnung' II | 0.5148514851485149 | 0.5768253968253969 | 0.6916666666666667 | 0.592808080808081 |
228
 
 
 
229
 
 
 
28
  revision: 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6
29
  metrics:
30
  - type: accuracy
31
+ value: 0.73
32
  name: Accuracy 'Bezeichnung'
33
  - type: precision
34
+ value: 0.66
35
  name: Precision 'Bezeichnung' (macro)
36
  - type: recall
37
+ value: 0.71
38
  name: Recall 'Bezeichnung' (macro)
39
  - type: f1
40
+ value: 0.67
41
  name: F1 'Bezeichnung' (macro)
42
  - type: accuracy
43
+ value: 0.89
44
  name: Accuracy 'Thema'
45
  - type: precision
46
+ value: 0.90
47
  name: Precision 'Thema' (macro)
48
  - type: recall
49
+ value: 0.89
50
  name: Recall 'Thema' (macro)
51
  - type: f1
52
+ value: 0.88
53
  name: F1 'Thema' (macro)
54
  ---
55
 
 
61
 
62
  This model is based on [bert-base-german-cased](https://huggingface.co/bert-base-cased) and fine-tuned on [and-effect/mdk_gov_data_titles_clf](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf).
63
 
64
+ It was created as part of the Bertelsmann Foundation's Musterdatenkatalog (MDK) project (see the [project website](https://www.bertelsmann-stiftung.de/de/unsere-projekte/smart-country/musterdatenkatalog)).
65
+ The main intent of the MDK project was to classify open data into a taxonomy to help give an overview of already published data.
66
+ It can help municipalities in Germany, as well as data analysts and journalists, to see which cities have already published data sets and what might be missing.
67
+ The project uses a taxonomy to classify the data and the model was specifically trained for the project and the classification task. It thus has a clear intended downstream task and should be used with the mentioned taxonomy.
68
+
69
+ **Information about the underlying taxonomy:**
70
+ The taxonomy used, 'Musterdatenkatalog', has two levels, 'Thema' and 'Bezeichnung', which roughly translate to topic and label. The top level has 25 entries, covering topics such as 'Finanzen' (finance) and 'Gesundheit' (health).
71
+ The second level, 'Bezeichnung' (label), goes into more detail: for the topic health, for example, it contains 'Krankenhaus' (hospital). The second level contains 241 labels. The combination of topic and label (Thema + Bezeichnung) creates a 'Musterdatensatz'.
72
+
73
+ Data can be classified into either the topics or the labels; results for both are presented below. Although a mapping to other taxonomies is provided in the published RDF version of the taxonomy (todo), the model is tailored to this taxonomy.
74
+
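The two-level structure described above can be sketched in a few lines of Python. This is only an illustration, not part of the project's code: the class `Musterdatensatz`, the helper `parse_musterdatensatz`, and the `' - '` separator (inferred from the single example 'Raumplanung - Bebauungsplan' in this card) are assumptions.

```python
from typing import NamedTuple


class Musterdatensatz(NamedTuple):
    """Hypothetical container for one taxonomy entry (Thema + Bezeichnung)."""

    thema: str        # first level, one of 25 topics
    bezeichnung: str  # second level, one of 241 labels

    def __str__(self) -> str:
        # Assumed display format: "Thema - Bezeichnung"
        return f"{self.thema} - {self.bezeichnung}"


def parse_musterdatensatz(text: str) -> Musterdatensatz:
    """Split a combined string into its two levels at the first ' - '."""
    thema, sep, bezeichnung = text.partition(" - ")
    if not sep or not thema.strip() or not bezeichnung.strip():
        raise ValueError(f"expected 'Thema - Bezeichnung', got {text!r}")
    return Musterdatensatz(thema.strip(), bezeichnung.strip())


entry = parse_musterdatensatz("Raumplanung - Bebauungsplan")
print(entry.thema)        # topic level, used for the 'Thema' task
print(entry.bezeichnung)  # label level, used for the 'Bezeichnung' task
```

Under this assumption, a prediction at the 'Bezeichnung' level also determines the 'Thema' level, since the topic is simply the first part of the combined entry.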
75
+
76
+
77
  - **Developed by:** and-effect
78
  - **Model type:** Text Classification
79
  - **Language(s) (NLP):** de
 
150
 
151
  <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
152
 
153
+ The model is intended to classify open source dataset titles from German municipalities. The model is specifically tailored for this task and uses a specific taxonomy.
154
+ More information on the taxonomy (classification categories) and the project can be found on the [project website](https://www.bertelsmann-stiftung.de/de/unsere-projekte/smart-country/musterdatenkatalog).
155
+
156
 
157
 
158
  # Bias, Risks, and Limitations
159
 
160
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
161
 
162
+ The model has some limitations in terms of the downstream task.
163
+ 1. **Distribution of classes**: The training dataset is small, while the number of classes is very high. Thus, for some classes there are only a few examples (more information about the class distribution of the training data can be found here). Consequently, the performance for smaller classes may not be as good as for the majority classes. Accordingly, the evaluation is also limited.
164
+ 2. **Systematic problems**: Some subjects were systematically misclassified. One example is the embedding of titles containing 'Corona'. In none of the evaluation cases could the titles be embedded in such a way that they corresponded to their true labels. Another systematic example is the embedding and classification of titles related to 'migration'.
165
+ 3. **Generalization of the model**: By using semantic search, the model is able to classify titles into new categories that it was not trained on. However, the model is not tuned for this, so its performance on unseen classes is likely to be limited.
166
 
167
  ## Recommendations
168
 
 
231
 
232
  ### Metrics
233
 
234
+ The model performance is tested with four metrics: Accuracy, Precision, Recall and F1 Score. Although the data is imbalanced, accuracy was still used because the imbalance reflects the real tendency of some classes to have more entries, for example 'Raumplanung - Bebauungsplan'.
235
+
236
+ Many classes were never predicted; their scores are therefore set to zero in the calculation of precision, recall and F1 score.
237
+ For these metrics, additional calculations were performed. They are denoted with 'II' in the table and exclude classes with fewer than two predictions at the 'Bezeichnung' level.
238
+ These results must be interpreted with caution, though, as they give no information about the excluded classes.
239
+
240
+ The tasks denoted with 'I' include all classes.
241
+
242
+ The tasks are split not only by whether they include all classes ('I') or not ('II'), but also by taxonomy level: 'Bezeichnung' or 'Thema'.
243
+ As previously mentioned, this has to do with the underlying taxonomy. The task on 'Thema' is performed on the first level of the taxonomy, which has 25 classes; the task on 'Bezeichnung' is performed on the second level, which has 241 classes.
244
+
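The difference between the 'I' and 'II' variants can be illustrated with a minimal sketch. It is not the evaluation code used for this model: the function name `macro_scores`, the parameter `min_predictions`, and the toy labels are assumptions, and the exclusion rule reflects our reading of "fewer than two predictions".

```python
from collections import Counter


def macro_scores(y_true, y_pred, min_predictions=0):
    """Macro-averaged precision, recall and F1 over all observed classes.

    Classes predicted fewer than `min_predictions` times are left out of
    the average (min_predictions=2 mimics our reading of variant 'II').
    Undefined per-class scores are set to zero, as described above.
    """
    pred_counts = Counter(y_pred)
    true_counts = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    classes = sorted(set(y_true) | set(y_pred))
    kept = [c for c in classes if pred_counts[c] >= min_predictions]
    if not kept:
        raise ValueError("no class satisfies the min_predictions threshold")

    precisions, recalls, f1s = [], [], []
    for c in kept:
        p = hits[c] / pred_counts[c] if pred_counts[c] else 0.0
        r = hits[c] / true_counts[c] if true_counts[c] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(kept)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy data (hypothetical): class "C" is never predicted.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "A"]

variant_i = macro_scores(y_true, y_pred)                      # all classes
variant_ii = macro_scores(y_true, y_pred, min_predictions=2)  # drops "C"
print(variant_i)
print(variant_ii)
```

In this toy case the never-predicted class "C" contributes zeros to variant 'I' and is dropped from variant 'II', so the macro averages of 'II' are higher. Accuracy does not depend on the set of averaged classes, which matches the identical accuracy values for 'I' and 'II' in the results table.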
245
 
246
  ## Results
247
 
248
  | ***task*** | ***accuracy*** | ***precision (macro)*** | ***recall (macro)*** | ***f1 (macro)*** |
249
  |-----|-----|-----|-----|-----|
250
+ | Test dataset 'Bezeichnung' I | 0.73 (.82)* | 0.66 | 0.71 | 0.67 |
251
+ | Test dataset 'Thema' I | 0.89 (.92)* | 0.90 | 0.89 | 0.88 |
252
+ | Test dataset 'Bezeichnung' II | 0.73 | 0.58 | 0.82 | 0.65 |
253
+ | Validation dataset 'Bezeichnung' I | 0.51 | 0.35 | 0.36 | 0.33 |
254
+ | Validation dataset 'Thema' I | 0.77 | 0.59 | 0.68 | 0.60 |
255
+ | Validation dataset 'Bezeichnung' II | 0.51 | 0.58 | 0.69 | 0.59 |
256
 
257
+ \* The accuracy in brackets was determined through a manual analysis. This was done to check for data entries that could, for example, belong to more than one class and were thus actually classified correctly by the algorithm.
258
+ In this step, the labeling of the test data was also rechecked for possible mistakes, which resulted in better performance.
259
 
260
+ The validation dataset was created manually to check certain classes