ptrdvn committed on
Commit 56633c4 · verified · 1 Parent(s): 29d69ec

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -96,7 +96,7 @@ for output in outputs:
 
  # Evaluation
 
- We evaluated this model for output accuracy and the percentage of valid Japanese `<think>` sections using the first 50 rows of the (SakanaAI/gsm8k-ja-test_250-1319)[https://huggingface.co/datasets/SakanaAI/gsm8k-ja-test_250-1319] dataset.
+ We evaluated this model for output accuracy and the percentage of valid Japanese `<think>` sections using the first 50 rows of the [SakanaAI/gsm8k-ja-test_250-1319](https://huggingface.co/datasets/SakanaAI/gsm8k-ja-test_250-1319) dataset.
 
  We compare this to the original R1 model and test in both regimes where repetition penalty is 1.0 and 1.1:
 
@@ -110,7 +110,7 @@ We compare this to the original R1 model and test in both regimes where repetiti
  Code for the SakanaAI/gsm8k-ja-test_250-1319 evaluation can be found [here](https://drive.google.com/file/d/1gCzCJv5vasw8R3KVQimfoIDFyfxwxNvC/view?usp=sharing).
 
 
- We further use the first 50 prompts from (DeL-TaiseiOzaki/Tengentoppa-sft-reasoning-ja)[https://huggingface.co/datasets/DeL-TaiseiOzaki/Tengentoppa-sft-reasoning-ja] to evaluate the percentage of valid Japanese `\<think\>` sections in model responses.
+ We further use the first 50 prompts from [DeL-TaiseiOzaki/Tengentoppa-sft-reasoning-ja](https://huggingface.co/datasets/DeL-TaiseiOzaki/Tengentoppa-sft-reasoning-ja) to evaluate the percentage of valid Japanese `\<think\>` sections in model responses.
  This benchmark contains more varied and complex prompts, meaning this is a more realistic evaluation of how reliably this model can output Japanese.
 
  | | Repetition Penalty | Valid Japanese `<think>` (%) |