mtyrrell committed on
Commit aa0e5b9 · 1 Parent(s): f57c953

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -75,11 +75,11 @@ The combined dataset[GIZ/policy_qa_v0_1](https://huggingface.co/datasets/GIZ/pol
  The pre-processing operations used to produce the final training dataset were as follows:
 
  1. Dataset is filtered based on 'medium' value in 'strategy' column (sequence length = 85), selecting only IKITracs samples.
- 2. For IKITracs, labels are assigned based on the presence of of 'parameter' values matching the mapping taxonomy defined by TraCS^*
- 4. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
- 5. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
- 6. The 'match_onanswer' and 'answerWordcount' are used conditionally to select hihg quality samples (prefers high % of word matches in 'match_onanswer', but will take lower if there is a high 'answerWordcount')
- 7. Data is then augmented using sentence shuffle from the ```albumentations``` library
 
  ###**Parameter to category mapping taxonomy**
  |index|Category|Parameter|
 
  The pre-processing operations used to produce the final training dataset were as follows:
 
  1. Dataset is filtered based on 'medium' value in 'strategy' column (sequence length = 85), selecting only IKITracs samples.
+ 2. For IKITracs, labels are assigned based on the presence of of 'parameter' values matching the category mapping taxonomy defined by TraCS (ref. below)
+ 3. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
+ 4. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
+ 5. The 'match_onanswer' and 'answerWordcount' are used conditionally to select high quality samples (prefers high % of word matches in 'match_onanswer', but will take lower if there is a high 'answerWordcount')
+ 6. Data is then augmented using sentence shuffle from the ```albumentations``` library and insertions from ```nlpaug```.
 
  ###**Parameter to category mapping taxonomy**
  |index|Category|Parameter|