Update README.md
README.md
CHANGED
@@ -75,11 +75,11 @@ The combined dataset[GIZ/policy_qa_v0_1](https://huggingface.co/datasets/GIZ/pol
 The pre-processing operations used to produce the final training dataset were as follows:
 
 1. Dataset is filtered based on 'medium' value in 'strategy' column (sequence length = 85), selecting only IKITracs samples.
-2. For IKITracs, labels are assigned based on the presence of of 'parameter' values matching the mapping taxonomy defined by TraCS
-
-
-
-
+2. For IKITracs, labels are assigned based on the presence of 'parameter' values matching the category mapping taxonomy defined by TraCS (ref. below)
+3. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
+4. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
+5. The 'match_onanswer' and 'answerWordcount' columns are used conditionally to select high-quality samples: a high percentage of word matches in 'match_onanswer' is preferred, but a lower percentage is accepted when 'answerWordcount' is high.
+6. Data is then augmented using sentence shuffle from the `albumentations` library and insertions from `nlpaug`.
 
 ###**Parameter to category mapping taxonomy**
 |index|Category|Parameter|
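
Steps 3 to 5 of the pre-processing list in this change could look roughly like the sketch below. It is a plain-Python illustration, not the authors' actual pipeline: the column names come from the README text, but the three thresholds, the `"en"` language code, the `label` field, and the assumption that 'match_onanswer' is a fraction between 0 and 1 are all hypothetical. On a pandas DataFrame, the "explode" step would typically be done with `DataFrame.explode`.

```python
# Hypothetical thresholds: the README does not state the actual values.
MIN_MATCH = 0.8              # preferred share of matching words
RELAXED_MATCH = 0.5          # lower share accepted for long answers
MIN_WORDS_FOR_RELAXED = 50   # answer length that unlocks the relaxed share


def use_translation(row):
    """Step 3: use 'context_translated' for non-English rows when present."""
    translated = row.get("context_translated")
    if row["language"] != "en" and translated:  # "en" code is an assumption
        return {**row, "context": translated}
    return row


def explode_context(row):
    """Step 4: one output row per text in the 'context' list.

    All other fields, including the label, are copied onto each new row.
    """
    return [{**row, "context": text} for text in row["context"]]


def is_high_quality(row):
    """Step 5: high match share, or a lower share backed by a long answer."""
    if row["match_onanswer"] >= MIN_MATCH:
        return True
    return (row["match_onanswer"] >= RELAXED_MATCH
            and row["answerWordcount"] >= MIN_WORDS_FOR_RELAXED)


def preprocess(rows):
    """Apply steps 3, 4 and 5 in order and return the surviving rows."""
    kept = []
    for row in rows:
        row = use_translation(row)
        for exploded in explode_context(row):
            if is_high_quality(exploded):
                kept.append(exploded)
    return kept


# Made-up sample rows, for illustration only.
sample = [
    {"context": ["texte un", "texte deux"],
     "context_translated": ["text one", "text two"],
     "language": "fr", "label": "A",
     "match_onanswer": 0.9, "answerWordcount": 10},
    {"context": ["short"], "context_translated": None,
     "language": "en", "label": "B",
     "match_onanswer": 0.6, "answerWordcount": 80},
    {"context": ["weak"], "context_translated": None,
     "language": "en", "label": "C",
     "match_onanswer": 0.6, "answerWordcount": 5},
]

result = preprocess(sample)
```

With these made-up rows, the French sample is swapped to its translation and split into two rows, the second sample passes through the relaxed rule, and the third is dropped.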