zoeye123 committed
Commit 85d4195 · verified · 1 Parent(s): 41d7e55

Model save

Files changed (4):
  1. README.md +2 -4
  2. all_results.json +5 -5
  3. train_results.json +5 -5
  4. trainer_state.json +1618 -19
README.md CHANGED
@@ -1,11 +1,9 @@
 ---
 base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- datasets: open-r1/OpenR1-Math-220k
 library_name: transformers
 model_name: DeepSeek-R1-Distill-Qwen-1.5B-GRPO
 tags:
 - generated_from_trainer
- - open-r1
 - trl
 - grpo
 licence: license
@@ -13,7 +11,7 @@ licence: license
 
 # Model Card for DeepSeek-R1-Distill-Qwen-1.5B-GRPO
 
- This model is a fine-tuned version of [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) on the [open-r1/OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k) dataset.
+ This model is a fine-tuned version of [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).
 It has been trained using [TRL](https://github.com/huggingface/trl).
 
 ## Quick start
@@ -29,7 +27,7 @@ print(output["generated_text"])
 
 ## Training procedure
 
- [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/yeshugno-microsoft/huggingface/runs/x9spl3vt)
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/yeshugno-microsoft/huggingface/runs/0cfrj5s4)
 
 
 This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
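The core idea of GRPO referenced above — sampling a group of completions per prompt, then normalizing each completion's reward against the group's mean and standard deviation so no learned critic is needed — can be sketched as follows. This is a minimal illustration, not TRL's actual implementation; the `group_advantages` helper and the example rewards are hypothetical.

```python
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-4):
    """Compute GRPO-style group-relative advantages.

    Each completion's reward is normalized against the mean and
    standard deviation of its sampling group; `eps` guards against
    division by zero when all rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical accuracy rewards for 4 completions of one prompt:
# two correct (1.0) and two incorrect (0.0).
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

By construction the advantages are zero-mean within the group, so correct completions are pushed up exactly as much as incorrect ones are pushed down.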
all_results.json CHANGED
@@ -1,8 +1,8 @@
 {
 "total_flos": 0.0,
- "train_loss": 2.7939677238464355e-09,
- "train_runtime": 254.4903,
- "train_samples": 10,
- "train_samples_per_second": 0.039,
- "train_steps_per_second": 0.008
+ "train_loss": 0.0003328805177226313,
+ "train_runtime": 21588.6366,
+ "train_samples": 1000,
+ "train_samples_per_second": 0.046,
+ "train_steps_per_second": 0.006
 }
train_results.json CHANGED
@@ -1,8 +1,8 @@
 {
 "total_flos": 0.0,
- "train_loss": 2.7939677238464355e-09,
- "train_runtime": 254.4903,
- "train_samples": 10,
- "train_samples_per_second": 0.039,
- "train_steps_per_second": 0.008
+ "train_loss": 0.0003328805177226313,
+ "train_runtime": 21588.6366,
+ "train_samples": 1000,
+ "train_samples_per_second": 0.046,
+ "train_steps_per_second": 0.006
 }
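The updated throughput figures are internally consistent: with 1000 training samples, 125 optimizer steps (the `global_step` in trainer_state.json), and a runtime of roughly 21589 s, the per-second rates round to the logged values. A quick check in plain Python, with the values copied from the results files above:

```python
# Values from train_results.json / all_results.json (new version)
train_runtime = 21588.6366   # seconds
train_samples = 1000
global_step = 125            # from trainer_state.json

# The logged rates are the raw ratios rounded to 3 decimal places.
samples_per_second = round(train_samples / train_runtime, 3)
steps_per_second = round(global_step / train_runtime, 3)

print(samples_per_second, steps_per_second)  # 0.046 0.006
```

This also implies roughly 8 samples per optimizer step (1000 / 125), consistent with a small per-device batch times gradient accumulation.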
trainer_state.json CHANGED
@@ -1,51 +1,1650 @@
 {
 "best_metric": null,
 "best_model_checkpoint": null,
- "epoch": 0.8,
 "eval_steps": 500,
- "global_step": 2,
 "is_hyper_param_search": false,
 "is_local_process_zero": true,
 "is_world_process_zero": true,
 "log_history": [
 {
- "completion_length": 1842.03125,
- "epoch": 0.4,
 "grad_norm": 0.0,
 "kl": 0.0,
 "learning_rate": 0.0,
 "loss": 0.0,
- "reward": 0.125,
- "reward_std": 0.13363061845302582,
- "rewards/accuracy_reward": 0.125,
 "rewards/format_reward": 0.0,
 "step": 1
 },
 {
- "completion_length": 1805.84375,
- "epoch": 0.8,
 "grad_norm": 0.0,
 "kl": 0.0,
- "learning_rate": 0.0,
 "loss": 0.0,
- "reward": 0.125,
- "reward_std": 0.13363061845302582,
- "rewards/accuracy_reward": 0.125,
 "rewards/format_reward": 0.0,
 "step": 2
 },
 {
 "epoch": 0.8,
- "step": 2,
 "total_flos": 0.0,
- "train_loss": 2.7939677238464355e-09,
- "train_runtime": 254.4903,
- "train_samples_per_second": 0.039,
- "train_steps_per_second": 0.008
 }
 ],
 "logging_steps": 1,
- "max_steps": 2,
 "num_input_tokens_seen": 0,
 "num_train_epochs": 1,
 "save_steps": 500,
 
 {
 "best_metric": null,
 "best_model_checkpoint": null,
+ "epoch": 1.0,
 "eval_steps": 500,
+ "global_step": 125,
 "is_hyper_param_search": false,
 "is_local_process_zero": true,
 "is_world_process_zero": true,
 "log_history": [
 {
+ "completion_length": 1721.0417022705078,
+ "epoch": 0.008,
 "grad_norm": 0.0,
 "kl": 0.0,
 "learning_rate": 0.0,
 "loss": 0.0,
+ "reward": 0.17708333488553762,
+ "reward_std": 0.09763014316558838,
+ "rewards/accuracy_reward": 0.17708333488553762,
 "rewards/format_reward": 0.0,
 "step": 1
 },
 {
+ "completion_length": 1944.875015258789,
+ "epoch": 0.016,
 "grad_norm": 0.0,
 "kl": 0.0,
+ "learning_rate": 7.692307692307692e-08,
 "loss": 0.0,
+ "reward": 0.09375,
+ "reward_std": 0.05653337761759758,
+ "rewards/accuracy_reward": 0.09375,
 "rewards/format_reward": 0.0,
 "step": 2
 },
 {
+ "completion_length": 1621.6250228881836,
+ "epoch": 0.024,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 7.692307692307692e-08,
+ "loss": 0.0,
+ "reward": 0.14583333861082792,
+ "reward_std": 0.16199623048305511,
+ "rewards/accuracy_reward": 0.14583333861082792,
+ "rewards/format_reward": 0.0,
+ "step": 3
+ },
+ {
+ "completion_length": 1667.333351135254,
+ "epoch": 0.032,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 7.692307692307692e-08,
+ "loss": 0.0,
+ "reward": 0.09375000558793545,
+ "reward_std": 0.11302226781845093,
+ "rewards/accuracy_reward": 0.0729166716337204,
+ "rewards/format_reward": 0.02083333395421505,
+ "step": 4
+ },
+ {
+ "completion_length": 1555.8854598999023,
+ "epoch": 0.04,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 1.5384615384615385e-07,
+ "loss": 0.0,
+ "reward": 0.0625,
+ "reward_std": 0.06527911871671677,
+ "rewards/accuracy_reward": 0.0625,
+ "rewards/format_reward": 0.0,
+ "step": 5
+ },
+ {
+ "completion_length": 1598.6354522705078,
+ "epoch": 0.048,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 2.3076923076923078e-07,
+ "loss": 0.0,
+ "reward": 0.06250000186264515,
+ "reward_std": 0.11020193248987198,
+ "rewards/accuracy_reward": 0.0416666679084301,
+ "rewards/format_reward": 0.02083333395421505,
+ "step": 6
+ },
+ {
+ "completion_length": 1771.7916870117188,
+ "epoch": 0.056,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 3.076923076923077e-07,
+ "loss": 0.0,
+ "reward": 0.0520833358168602,
+ "reward_std": 0.06436608731746674,
+ "rewards/accuracy_reward": 0.0520833358168602,
+ "rewards/format_reward": 0.0,
+ "step": 7
+ },
+ {
+ "completion_length": 1735.2916870117188,
+ "epoch": 0.064,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 3.8461538461538463e-07,
+ "loss": 0.0,
+ "reward": 0.07291666977107525,
+ "reward_std": 0.11302226781845093,
+ "rewards/accuracy_reward": 0.07291666977107525,
+ "rewards/format_reward": 0.0,
+ "step": 8
+ },
+ {
+ "completion_length": 1848.2916870117188,
+ "epoch": 0.072,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 4.6153846153846156e-07,
+ "loss": -0.0,
+ "reward": 0.010416666977107525,
+ "reward_std": 0.03608439117670059,
+ "rewards/accuracy_reward": 0.010416666977107525,
+ "rewards/format_reward": 0.0,
+ "step": 9
+ },
+ {
+ "completion_length": 1668.5312805175781,
+ "epoch": 0.08,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 4.6153846153846156e-07,
+ "loss": 0.0,
+ "reward": 0.08333333674818277,
+ "reward_std": 0.14910665899515152,
+ "rewards/accuracy_reward": 0.08333333674818277,
+ "rewards/format_reward": 0.0,
+ "step": 10
+ },
+ {
+ "completion_length": 1871.9687805175781,
+ "epoch": 0.088,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 5.384615384615384e-07,
+ "loss": 0.0,
+ "reward": 0.0,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 0.0,
+ "step": 11
+ },
+ {
+ "completion_length": 1863.7500305175781,
+ "epoch": 0.096,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 5.384615384615384e-07,
+ "loss": 0.0,
+ "reward": 0.02083333395421505,
+ "reward_std": 0.07216878235340118,
+ "rewards/accuracy_reward": 0.02083333395421505,
+ "rewards/format_reward": 0.0,
+ "step": 12
+ },
+ {
+ "completion_length": 1803.5104370117188,
+ "epoch": 0.104,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 6.153846153846154e-07,
+ "loss": 0.0,
+ "reward": 0.0,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 0.0,
+ "step": 13
+ },
+ {
+ "completion_length": 1560.7708587646484,
+ "epoch": 0.112,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 6.923076923076922e-07,
+ "loss": -0.0,
+ "reward": 0.14583333395421505,
+ "reward_std": 0.04865618050098419,
+ "rewards/accuracy_reward": 0.14583333395421505,
+ "rewards/format_reward": 0.0,
+ "step": 14
+ },
+ {
+ "completion_length": 1918.3646087646484,
+ "epoch": 0.12,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 6.923076923076922e-07,
+ "loss": -0.0,
+ "reward": 0.22916666697710752,
+ "reward_std": 0.15789688751101494,
+ "rewards/accuracy_reward": 0.22916666697710752,
+ "rewards/format_reward": 0.0,
+ "step": 15
+ },
+ {
+ "completion_length": 1745.6458740234375,
+ "epoch": 0.128,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 6.923076923076922e-07,
+ "loss": -0.0,
+ "reward": 0.18750000093132257,
+ "reward_std": 0.14127394929528236,
+ "rewards/accuracy_reward": 0.18750000093132257,
+ "rewards/format_reward": 0.0,
+ "step": 16
+ },
+ {
+ "completion_length": 1737.7812728881836,
+ "epoch": 0.136,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 7.692307692307693e-07,
+ "loss": 0.0,
+ "reward": 0.0416666679084301,
+ "reward_std": 0.06154575198888779,
+ "rewards/accuracy_reward": 0.03125,
+ "rewards/format_reward": 0.010416666977107525,
+ "step": 17
+ },
+ {
+ "completion_length": 1527.2917175292969,
+ "epoch": 0.144,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 7.692307692307693e-07,
+ "loss": -0.0,
+ "reward": 0.3958333348855376,
+ "reward_std": 0.21344273164868355,
+ "rewards/accuracy_reward": 0.3854166679084301,
+ "rewards/format_reward": 0.010416666977107525,
+ "step": 18
+ },
+ {
+ "completion_length": 1783.3958587646484,
+ "epoch": 0.152,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 8.461538461538461e-07,
+ "loss": 0.0,
+ "reward": 0.13541666697710752,
+ "reward_std": 0.03608439117670059,
+ "rewards/accuracy_reward": 0.13541666697710752,
+ "rewards/format_reward": 0.0,
+ "step": 19
+ },
+ {
+ "completion_length": 1755.5208740234375,
+ "epoch": 0.16,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 8.461538461538461e-07,
+ "loss": 0.0,
+ "reward": 0.23958333674818277,
+ "reward_std": 0.16979892551898956,
+ "rewards/accuracy_reward": 0.22916666977107525,
+ "rewards/format_reward": 0.010416666977107525,
+ "step": 20
+ },
+ {
+ "completion_length": 1797.2083740234375,
+ "epoch": 0.168,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 8.461538461538461e-07,
+ "loss": -0.0,
+ "reward": 0.05208333395421505,
+ "reward_std": 0.10518955811858177,
+ "rewards/accuracy_reward": 0.05208333395421505,
+ "rewards/format_reward": 0.0,
+ "step": 21
+ },
+ {
+ "completion_length": 1737.2916717529297,
+ "epoch": 0.176,
+ "grad_norm": 0.0,
+ "kl": 0.0,
+ "learning_rate": 8.461538461538461e-07,
+ "loss": 0.0,
+ "reward": 0.07291666883975267,
+ "reward_std": 0.14628632366657257,
+ "rewards/accuracy_reward": 0.06250000093132257,
+ "rewards/format_reward": 0.010416666977107525,
+ "step": 22
+ },
+ {
+ "completion_length": 1865.281265258789,
+ "epoch": 0.184,
+ "grad_norm": 0.07227283716201782,
+ "kl": 0.0,
+ "learning_rate": 9.230769230769231e-07,
+ "loss": 0.0,
+ "reward": 0.052083334885537624,
+ "reward_std": 0.09763014316558838,
+ "rewards/accuracy_reward": 0.052083334885537624,
+ "rewards/format_reward": 0.0,
+ "step": 23
+ },
+ {
+ "completion_length": 1721.583351135254,
+ "epoch": 0.192,
+ "grad_norm": 3.1304349249694496e-05,
+ "kl": -7.547438144683838e-06,
+ "learning_rate": 1e-06,
+ "loss": -0.0,
+ "reward": 0.06250000279396772,
+ "reward_std": 0.10045047849416733,
+ "rewards/accuracy_reward": 0.0520833358168602,
+ "rewards/format_reward": 0.010416666977107525,
+ "step": 24
+ },
+ {
+ "completion_length": 1667.0521240234375,
+ "epoch": 0.2,
+ "grad_norm": 0.057344950735569,
+ "kl": -7.309019565582275e-06,
+ "learning_rate": 9.998229818723738e-07,
+ "loss": -0.0,
+ "reward": 0.1041666716337204,
+ "reward_std": 0.14010312780737877,
+ "rewards/accuracy_reward": 0.09375,
+ "rewards/format_reward": 0.010416666977107525,
+ "step": 25
+ },
+ {
+ "completion_length": 1752.8021240234375,
+ "epoch": 0.208,
+ "grad_norm": 0.11655361950397491,
+ "kl": -9.395182132720947e-06,
+ "learning_rate": 9.992920667580175e-07,
+ "loss": -0.0,
+ "reward": 0.26041667349636555,
+ "reward_std": 0.17456801980733871,
+ "rewards/accuracy_reward": 0.2291666716337204,
+ "rewards/format_reward": 0.031250000931322575,
+ "step": 26
+ },
+ {
+ "completion_length": 1865.9167175292969,
+ "epoch": 0.216,
+ "grad_norm": 0.0001482899097027257,
+ "kl": -1.2703239917755127e-05,
+ "learning_rate": 9.984076723529287e-07,
+ "loss": -0.0,
+ "reward": 0.13541666697710752,
+ "reward_std": 0.03608439117670059,
+ "rewards/accuracy_reward": 0.13541666697710752,
+ "rewards/format_reward": 0.0,
+ "step": 27
+ },
+ {
+ "completion_length": 1667.031265258789,
+ "epoch": 0.224,
+ "grad_norm": 0.00021164790086913854,
+ "kl": -1.093745231628418e-05,
+ "learning_rate": 9.971704944519593e-07,
+ "loss": -0.0,
+ "reward": 0.25,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.25,
+ "rewards/format_reward": 0.0,
+ "step": 28
+ },
+ {
+ "completion_length": 1666.7812805175781,
+ "epoch": 0.232,
+ "grad_norm": 0.000568240531720221,
+ "kl": -7.115304470062256e-06,
+ "learning_rate": 9.955815064014005e-07,
+ "loss": -0.0,
+ "reward": 0.031250000931322575,
+ "reward_std": 0.08474057912826538,
+ "rewards/accuracy_reward": 0.031250000931322575,
+ "rewards/format_reward": 0.0,
+ "step": 29
+ },
+ {
+ "completion_length": 1486.2187881469727,
+ "epoch": 0.24,
+ "grad_norm": 0.1216636374592781,
+ "kl": -4.678964614868164e-06,
+ "learning_rate": 9.93641958333206e-07,
+ "loss": -0.0,
+ "reward": 0.2187500037252903,
+ "reward_std": 0.24490400031208992,
+ "rewards/accuracy_reward": 0.2187500037252903,
+ "rewards/format_reward": 0.0,
+ "step": 30
+ },
+ {
+ "completion_length": 1651.020866394043,
+ "epoch": 0.248,
+ "grad_norm": 0.09538763761520386,
+ "kl": -6.183981895446777e-07,
+ "learning_rate": 9.913533761814537e-07,
+ "loss": -0.0,
+ "reward": 0.041666666977107525,
+ "reward_std": 0.09261776879429817,
+ "rewards/accuracy_reward": 0.041666666977107525,
+ "rewards/format_reward": 0.0,
+ "step": 31
+ },
+ {
+ "completion_length": 1917.9375305175781,
+ "epoch": 0.256,
+ "grad_norm": 0.056192267686128616,
+ "kl": 3.120303153991699e-05,
+ "learning_rate": 9.887175604818206e-07,
+ "loss": 0.0,
+ "reward": 0.15625000093132257,
+ "reward_std": 0.10825317353010178,
+ "rewards/accuracy_reward": 0.14583333395421505,
+ "rewards/format_reward": 0.010416666977107525,
+ "step": 32
+ },
+ {
+ "completion_length": 1721.8646087646484,
+ "epoch": 0.264,
+ "grad_norm": 0.0013622839469462633,
+ "kl": 5.451589822769165e-05,
+ "learning_rate": 9.857365849550177e-07,
+ "loss": 0.0,
+ "reward": 0.13541666697710752,
+ "reward_std": 0.03608439117670059,
+ "rewards/accuracy_reward": 0.125,
+ "rewards/format_reward": 0.010416666977107525,
+ "step": 33
+ },
+ {
+ "completion_length": 1964.8750305175781,
+ "epoch": 0.272,
+ "grad_norm": 0.0009671057923696935,
+ "kl": 6.0245394706726074e-05,
+ "learning_rate": 9.824127948752948e-07,
+ "loss": 0.0,
+ "reward": 0.02083333395421505,
+ "reward_std": 0.07216878235340118,
+ "rewards/accuracy_reward": 0.02083333395421505,
+ "rewards/format_reward": 0.0,
+ "step": 34
+ },
+ {
+ "completion_length": 1607.7708587646484,
+ "epoch": 0.28,
+ "grad_norm": 0.11371063441038132,
+ "kl": 6.452202796936035e-05,
+ "learning_rate": 9.787488052253033e-07,
+ "loss": 0.0,
+ "reward": 0.041666666977107525,
+ "reward_std": 0.09261776879429817,
+ "rewards/accuracy_reward": 0.02083333395421505,
+ "rewards/format_reward": 0.02083333395421505,
+ "step": 35
+ },
+ {
+ "completion_length": 1803.7500305175781,
+ "epoch": 0.288,
+ "grad_norm": 0.0009732124162837863,
+ "kl": 0.00013828277587890625,
+ "learning_rate": 9.747474986387654e-07,
+ "loss": 0.0,
+ "reward": 0.19791666697710752,
+ "reward_std": 0.10136350989341736,
+ "rewards/accuracy_reward": 0.19791666697710752,
+ "rewards/format_reward": 0.0,
+ "step": 36
+ },
+ {
+ "completion_length": 1660.7292175292969,
+ "epoch": 0.296,
+ "grad_norm": 0.054171618074178696,
+ "kl": 0.000163249671459198,
+ "learning_rate": 9.7041202313257e-07,
+ "loss": 0.0,
+ "reward": 0.19791666977107525,
+ "reward_std": 0.15001969039440155,
+ "rewards/accuracy_reward": 0.16666666977107525,
+ "rewards/format_reward": 0.03125,
+ "step": 37
+ },
+ {
+ "completion_length": 1631.750015258789,
+ "epoch": 0.304,
+ "grad_norm": 0.10330933332443237,
+ "kl": 0.0003886893391609192,
+ "learning_rate": 9.657457896300791e-07,
+ "loss": 0.0,
+ "reward": 0.25000000558793545,
+ "reward_std": 0.16188634932041168,
+ "rewards/accuracy_reward": 0.21875000279396772,
+ "rewards/format_reward": 0.031250000931322575,
+ "step": 38
+ },
+ {
+ "completion_length": 1778.6041946411133,
+ "epoch": 0.312,
+ "grad_norm": 0.0016962133813649416,
+ "kl": 0.00030538439750671387,
+ "learning_rate": 9.607524692775935e-07,
+ "loss": 0.0,
+ "reward": 0.02083333395421505,
+ "reward_std": 0.04865618795156479,
+ "rewards/accuracy_reward": 0.02083333395421505,
+ "rewards/format_reward": 0.0,
+ "step": 39
+ },
+ {
+ "completion_length": 1545.9167175292969,
+ "epoch": 0.32,
+ "grad_norm": 0.050718892365694046,
+ "kl": 0.0006757676601409912,
+ "learning_rate": 9.554359905560885e-07,
+ "loss": 0.0,
+ "reward": 0.33333334140479565,
+ "reward_std": 0.27968307584524155,
+ "rewards/accuracy_reward": 0.33333334140479565,
+ "rewards/format_reward": 0.0,
+ "step": 40
+ },
+ {
+ "completion_length": 1609.9271087646484,
+ "epoch": 0.328,
+ "grad_norm": 0.11008451133966446,
+ "kl": 0.0004702061414718628,
+ "learning_rate": 9.498005361904924e-07,
+ "loss": 0.0,
+ "reward": 0.1145833358168602,
+ "reward_std": 0.21539149060845375,
+ "rewards/accuracy_reward": 0.06250000186264515,
+ "rewards/format_reward": 0.05208333395421505,
+ "step": 41
+ },
+ {
+ "completion_length": 1728.614616394043,
+ "epoch": 0.336,
+ "grad_norm": 0.09986075758934021,
+ "kl": 0.0009259581565856934,
+ "learning_rate": 9.438505398589392e-07,
+ "loss": 0.0,
+ "reward": 0.08333333488553762,
+ "reward_std": 0.15416352078318596,
+ "rewards/accuracy_reward": 0.0416666679084301,
+ "rewards/format_reward": 0.041666666977107525,
+ "step": 42
+ },
+ {
+ "completion_length": 1522.0312728881836,
+ "epoch": 0.344,
+ "grad_norm": 0.06469858437776566,
+ "kl": 0.0012334585189819336,
+ "learning_rate": 9.37590682704584e-07,
+ "loss": 0.0,
+ "reward": 0.21875000093132257,
+ "reward_std": 0.15001969039440155,
+ "rewards/accuracy_reward": 0.1875,
+ "rewards/format_reward": 0.031250000931322575,
+ "step": 43
+ },
+ {
+ "completion_length": 1845.1875305175781,
+ "epoch": 0.352,
+ "grad_norm": 0.05903354659676552,
+ "kl": 0.0015518367290496826,
+ "learning_rate": 9.310258896527278e-07,
+ "loss": 0.0001,
+ "reward": 0.11458333674818277,
+ "reward_std": 0.20783207565546036,
+ "rewards/accuracy_reward": 0.08333333395421505,
+ "rewards/format_reward": 0.031250000931322575,
+ "step": 44
+ },
+ {
+ "completion_length": 1626.8541793823242,
+ "epoch": 0.36,
+ "grad_norm": 0.002270436380058527,
+ "kl": 0.0011032521724700928,
+ "learning_rate": 9.241613255361454e-07,
+ "loss": 0.0,
+ "reward": 0.0833333358168602,
+ "reward_std": 0.12089946493506432,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 0.0833333358168602,
+ "step": 45
+ },
+ {
+ "completion_length": 1350.5729446411133,
+ "epoch": 0.368,
+ "grad_norm": 0.08985026180744171,
+ "kl": 0.0012712180614471436,
+ "learning_rate": 9.17002391031667e-07,
+ "loss": 0.0001,
+ "reward": 0.3541666753590107,
+ "reward_std": 0.3032701872289181,
+ "rewards/accuracy_reward": 0.2812500046566129,
+ "rewards/format_reward": 0.07291666697710752,
+ "step": 46
+ },
+ {
+ "completion_length": 1617.0312728881836,
+ "epoch": 0.376,
+ "grad_norm": 0.0014874560292810202,
+ "kl": 0.0019048452377319336,
+ "learning_rate": 9.095547184112122e-07,
+ "loss": 0.0001,
+ "reward": 0.0729166679084301,
+ "reward_std": 0.17735834792256355,
+ "rewards/accuracy_reward": 0.02083333395421505,
+ "rewards/format_reward": 0.05208333395421505,
+ "step": 47
+ },
+ {
+ "completion_length": 1839.6146087646484,
+ "epoch": 0.384,
+ "grad_norm": 0.05115436390042305,
+ "kl": 0.00142669677734375,
+ "learning_rate": 9.018241671106134e-07,
+ "loss": 0.0001,
+ "reward": 0.1145833395421505,
+ "reward_std": 0.18088211119174957,
+ "rewards/accuracy_reward": 0.09375000093132257,
+ "rewards/format_reward": 0.02083333395421505,
+ "step": 48
+ },
+ {
+ "completion_length": 1270.8229370117188,
+ "epoch": 0.392,
+ "grad_norm": 0.09885463118553162,
+ "kl": 0.001969635486602783,
+ "learning_rate": 8.938168191197233e-07,
+ "loss": 0.0001,
+ "reward": 0.4583333423361182,
+ "reward_std": 0.2759315297007561,
+ "rewards/accuracy_reward": 0.3958333432674408,
+ "rewards/format_reward": 0.06250000186264515,
+ "step": 49
+ },
+ {
+ "completion_length": 1280.7396087646484,
+ "epoch": 0.4,
+ "grad_norm": 0.10244077444076538,
+ "kl": 0.008331120014190674,
+ "learning_rate": 8.855389741974244e-07,
+ "loss": 0.0003,
+ "reward": 0.33333334513008595,
+ "reward_std": 0.2721981704235077,
+ "rewards/accuracy_reward": 0.21875000838190317,
+ "rewards/format_reward": 0.11458333674818277,
+ "step": 50
+ },
+ {
+ "completion_length": 1760.2396240234375,
+ "epoch": 0.408,
+ "grad_norm": 0.15419930219650269,
+ "kl": 0.001972787082195282,
+ "learning_rate": 8.769971449153122e-07,
+ "loss": 0.0001,
+ "reward": 0.22916667349636555,
+ "reward_std": 0.20281970128417015,
+ "rewards/accuracy_reward": 0.18750000186264515,
+ "rewards/format_reward": 0.0416666679084301,
+ "step": 51
+ },
+ {
+ "completion_length": 1589.5937728881836,
+ "epoch": 0.416,
+ "grad_norm": 0.08847062289714813,
+ "kl": 0.00470280647277832,
+ "learning_rate": 8.681980515339463e-07,
+ "loss": 0.0002,
+ "reward": 0.2083333395421505,
+ "reward_std": 0.15885811299085617,
+ "rewards/accuracy_reward": 0.125,
+ "rewards/format_reward": 0.0833333358168602,
+ "step": 52
+ },
+ {
+ "completion_length": 1216.8437843322754,
+ "epoch": 0.424,
+ "grad_norm": 0.09562092274427414,
+ "kl": 0.0038253068923950195,
+ "learning_rate": 8.591486167157057e-07,
+ "loss": 0.0002,
+ "reward": 0.32291668094694614,
+ "reward_std": 0.4110877886414528,
+ "rewards/accuracy_reward": 0.2187500074505806,
+ "rewards/format_reward": 0.10416666883975267,
+ "step": 53
+ },
+ {
+ "completion_length": 1607.8125381469727,
+ "epoch": 0.432,
+ "grad_norm": 0.09562092274427414,
+ "kl": 0.0024794340133666992,
+ "learning_rate": 8.591486167157057e-07,
+ "loss": 0.0001,
+ "reward": 0.40625000931322575,
+ "reward_std": 0.29860204458236694,
+ "rewards/accuracy_reward": 0.2916666716337204,
+ "rewards/format_reward": 0.11458333488553762,
+ "step": 54
+ },
+ {
+ "completion_length": 1466.2187728881836,
+ "epoch": 0.44,
+ "grad_norm": 0.12597741186618805,
+ "kl": 0.0032531023025512695,
+ "learning_rate": 8.498559600784018e-07,
+ "loss": 0.0001,
+ "reward": 0.10416666883975267,
+ "reward_std": 0.28561151400208473,
+ "rewards/accuracy_reward": 0.03125,
+ "rewards/format_reward": 0.07291666883975267,
+ "step": 55
+ },
+ {
+ "completion_length": 1158.5313034057617,
+ "epoch": 0.448,
+ "grad_norm": 0.1452835649251938,
+ "kl": 0.007259368896484375,
+ "learning_rate": 8.403273925939395e-07,
+ "loss": 0.0003,
+ "reward": 0.43750000838190317,
+ "reward_std": 0.32371916621923447,
+ "rewards/accuracy_reward": 0.2604166669771075,
+ "rewards/format_reward": 0.17708333767950535,
+ "step": 56
+ },
+ {
+ "completion_length": 1328.3958587646484,
+ "epoch": 0.456,
+ "grad_norm": 0.0959007516503334,
+ "kl": 0.012780427932739258,
+ "learning_rate": 8.305704108364301e-07,
+ "loss": 0.0005,
+ "reward": 0.4062500111758709,
+ "reward_std": 0.37512117996811867,
+ "rewards/accuracy_reward": 0.22916667442768812,
+ "rewards/format_reward": 0.17708333674818277,
+ "step": 57
+ },
+ {
+ "completion_length": 1199.7917022705078,
+ "epoch": 0.464,
+ "grad_norm": 0.053846701979637146,
+ "kl": 0.006056308746337891,
+ "learning_rate": 8.205926910842825e-07,
+ "loss": 0.0002,
+ "reward": 0.4791666716337204,
+ "reward_std": 0.30927008762955666,
+ "rewards/accuracy_reward": 0.19791667442768812,
+ "rewards/format_reward": 0.2812500037252903,
+ "step": 58
+ },
+ {
+ "completion_length": 1548.4062805175781,
+ "epoch": 0.472,
+ "grad_norm": 0.15665732324123383,
+ "kl": 0.0060198307037353516,
+ "learning_rate": 8.104020832809126e-07,
+ "loss": 0.0002,
+ "reward": 0.4166666865348816,
+ "reward_std": 0.37532175332307816,
+ "rewards/accuracy_reward": 0.3020833432674408,
+ "rewards/format_reward": 0.11458333674818277,
+ "step": 59
+ },
+ {
+ "completion_length": 1357.1250457763672,
+ "epoch": 0.48,
+ "grad_norm": 0.17250967025756836,
+ "kl": 0.005661368370056152,
+ "learning_rate": 8.00006604858821e-07,
+ "loss": 0.0002,
+ "reward": 0.42708334419876337,
+ "reward_std": 0.44033083319664,
+ "rewards/accuracy_reward": 0.1458333358168602,
+ "rewards/format_reward": 0.28125000838190317,
+ "step": 60
+ },
+ {
+ "completion_length": 1528.5104598999023,
+ "epoch": 0.488,
+ "grad_norm": 0.39410439133644104,
+ "kl": 0.012995481491088867,
+ "learning_rate": 7.894144344319013e-07,
+ "loss": 0.0005,
+ "reward": 0.33333334140479565,
+ "reward_std": 0.3153763897716999,
+ "rewards/accuracy_reward": 0.18750000279396772,
+ "rewards/format_reward": 0.14583333861082792,
+ "step": 61
+ },
+ {
+ "completion_length": 1634.0833740234375,
+ "epoch": 0.496,
+ "grad_norm": 0.003453546669334173,
+ "kl": 0.005750775337219238,
+ "learning_rate": 7.786339053609382e-07,
+ "loss": 0.0002,
+ "reward": 0.23958333861082792,
+ "reward_std": 0.3065379671752453,
+ "rewards/accuracy_reward": 0.010416666977107525,
+ "rewards/format_reward": 0.22916667442768812,
+ "step": 62
+ },
+ {
+ "completion_length": 1417.6042022705078,
+ "epoch": 0.504,
+ "grad_norm": 0.2036859691143036,
+ "kl": 0.0056678056716918945,
+ "learning_rate": 7.676734991973579e-07,
+ "loss": 0.0002,
+ "reward": 0.2916666744276881,
+ "reward_std": 0.35410021990537643,
+ "rewards/accuracy_reward": 0.010416666977107525,
+ "rewards/format_reward": 0.2812500074505806,
+ "step": 63
+ },
+ {
+ "completion_length": 1605.0521087646484,
+ "epoch": 0.512,
+ "grad_norm": 0.11567196249961853,
+ "kl": 0.005307435989379883,
+ "learning_rate": 7.56541839010392e-07,
+ "loss": 0.0002,
+ "reward": 0.32291667349636555,
+ "reward_std": 0.4613286033272743,
+ "rewards/accuracy_reward": 0.07291666883975267,
+ "rewards/format_reward": 0.2500000046566129,
+ "step": 64
+ },
+ {
+ "completion_length": 1362.8125076293945,
+ "epoch": 0.52,
846
+ "grad_norm": 0.14872920513153076,
847
+ "kl": 0.004861712455749512,
848
+ "learning_rate": 7.45247682602901e-07,
849
+ "loss": 0.0002,
850
+ "reward": 0.47916668839752674,
851
+ "reward_std": 0.47535645961761475,
852
+ "rewards/accuracy_reward": 0.14583333395421505,
853
+ "rewards/format_reward": 0.33333334513008595,
854
+ "step": 65
855
+ },
856
+ {
857
+ "completion_length": 1370.4167022705078,
858
+ "epoch": 0.528,
859
+ "grad_norm": 0.14111247658729553,
860
+ "kl": 0.005552053451538086,
861
+ "learning_rate": 7.337999156211983e-07,
862
+ "loss": 0.0002,
863
+ "reward": 0.5104166753590107,
864
+ "reward_std": 0.46016135439276695,
865
+ "rewards/accuracy_reward": 0.09375000279396772,
866
+ "rewards/format_reward": 0.416666672565043,
867
+ "step": 66
868
+ },
869
+ {
870
+ "completion_length": 1390.9062805175781,
871
+ "epoch": 0.536,
872
+ "grad_norm": 0.0858726054430008,
873
+ "kl": 0.010532855987548828,
874
+ "learning_rate": 7.222075445642904e-07,
875
+ "loss": 0.0004,
876
+ "reward": 0.5000000083819032,
877
+ "reward_std": 0.4547799788415432,
878
+ "rewards/accuracy_reward": 0.15625000093132257,
879
+ "rewards/format_reward": 0.34375000931322575,
880
+ "step": 67
881
+ },
882
+ {
883
+ "completion_length": 1446.6667098999023,
884
+ "epoch": 0.544,
885
+ "grad_norm": 0.12936249375343323,
886
+ "kl": 0.006559848785400391,
887
+ "learning_rate": 7.104796896980408e-07,
888
+ "loss": 0.0003,
889
+ "reward": 0.41666667722165585,
890
+ "reward_std": 0.45394138246774673,
891
+ "rewards/accuracy_reward": 0.03125,
892
+ "rewards/format_reward": 0.38541667722165585,
893
+ "step": 68
894
+ },
895
+ {
896
+ "completion_length": 1062.3229522705078,
897
+ "epoch": 0.552,
898
+ "grad_norm": 0.0991756021976471,
899
+ "kl": 0.028393268585205078,
900
+ "learning_rate": 6.986255778798252e-07,
901
+ "loss": 0.0011,
902
+ "reward": 0.9583333656191826,
903
+ "reward_std": 0.488203439861536,
904
+ "rewards/accuracy_reward": 0.31250000186264515,
905
+ "rewards/format_reward": 0.6458333544433117,
906
+ "step": 69
907
+ },
908
+ {
909
+ "completion_length": 1272.927116394043,
910
+ "epoch": 0.56,
911
+ "grad_norm": 0.12522751092910767,
912
+ "kl": 0.006715059280395508,
913
+ "learning_rate": 6.866545352993266e-07,
914
+ "loss": 0.0003,
915
+ "reward": 0.7187500223517418,
916
+ "reward_std": 0.4580589644610882,
917
+ "rewards/accuracy_reward": 0.27083333395421505,
918
+ "rewards/format_reward": 0.44791667722165585,
919
+ "step": 70
920
+ },
921
+ {
922
+ "completion_length": 1295.0104446411133,
923
+ "epoch": 0.568,
924
+ "grad_norm": 0.09719894826412201,
925
+ "kl": 0.006929874420166016,
926
+ "learning_rate": 6.745759801411822e-07,
927
+ "loss": 0.0003,
928
+ "reward": 0.6770833628252149,
929
+ "reward_std": 0.605831079185009,
930
+ "rewards/accuracy_reward": 0.22916667256504297,
931
+ "rewards/format_reward": 0.44791668839752674,
932
+ "step": 71
933
+ },
934
+ {
935
+ "completion_length": 1681.520866394043,
936
+ "epoch": 0.576,
937
+ "grad_norm": 0.12705527245998383,
938
+ "kl": 0.005926847457885742,
939
+ "learning_rate": 6.623994151752521e-07,
940
+ "loss": 0.0002,
941
+ "reward": 0.2187500037252903,
942
+ "reward_std": 0.25818225741386414,
943
+ "rewards/accuracy_reward": 0.010416666977107525,
944
+ "rewards/format_reward": 0.2083333358168602,
945
+ "step": 72
946
+ },
947
+ {
948
+ "completion_length": 972.6771011352539,
949
+ "epoch": 0.584,
950
+ "grad_norm": 0.09554693847894669,
951
+ "kl": 0.011373519897460938,
952
+ "learning_rate": 6.501344202803414e-07,
953
+ "loss": 0.0005,
954
+ "reward": 0.6666666716337204,
955
+ "reward_std": 0.39136262610554695,
956
+ "rewards/accuracy_reward": 0.06250000186264515,
957
+ "rewards/format_reward": 0.604166679084301,
958
+ "step": 73
959
+ },
960
+ {
961
+ "completion_length": 1333.552116394043,
962
+ "epoch": 0.592,
963
+ "grad_norm": 0.1676543802022934,
964
+ "kl": 0.006979465484619141,
965
+ "learning_rate": 6.377906449072577e-07,
966
+ "loss": 0.0003,
967
+ "reward": 0.5312500149011612,
968
+ "reward_std": 0.4429461173713207,
969
+ "rewards/accuracy_reward": 0.1250000037252903,
970
+ "rewards/format_reward": 0.4062500149011612,
971
+ "step": 74
972
+ },
973
+ {
974
+ "completion_length": 1172.7604446411133,
975
+ "epoch": 0.6,
976
+ "grad_norm": 0.1972622275352478,
977
+ "kl": 0.00921487808227539,
978
+ "learning_rate": 6.253778004871314e-07,
979
+ "loss": 0.0004,
980
+ "reward": 0.6041666902601719,
981
+ "reward_std": 0.505381915718317,
982
+ "rewards/accuracy_reward": 0.11458333861082792,
983
+ "rewards/format_reward": 0.4895833432674408,
984
+ "step": 75
985
+ },
986
+ {
987
+ "completion_length": 1327.0833740234375,
988
+ "epoch": 0.608,
989
+ "grad_norm": 0.16929921507835388,
990
+ "kl": 0.00800466537475586,
991
+ "learning_rate": 6.129056527909748e-07,
992
+ "loss": 0.0003,
993
+ "reward": 0.46875000558793545,
994
+ "reward_std": 0.5199594870209694,
995
+ "rewards/accuracy_reward": 0.010416666977107525,
996
+ "rewards/format_reward": 0.45833334140479565,
997
+ "step": 76
998
+ },
999
+ {
1000
+ "completion_length": 1060.5833587646484,
1001
+ "epoch": 0.616,
1002
+ "grad_norm": 0.2086302489042282,
1003
+ "kl": 0.012459754943847656,
1004
+ "learning_rate": 6.003840142464885e-07,
1005
+ "loss": 0.0005,
1006
+ "reward": 0.8020833656191826,
1007
+ "reward_std": 0.5204068273305893,
1008
+ "rewards/accuracy_reward": 0.15625,
1009
+ "rewards/format_reward": 0.6458333507180214,
1010
+ "step": 77
1011
+ },
1012
+ {
1013
+ "completion_length": 1362.739616394043,
1014
+ "epoch": 0.624,
1015
+ "grad_norm": 0.07928597182035446,
1016
+ "kl": 0.006580352783203125,
1017
+ "learning_rate": 5.878227362181614e-07,
1018
+ "loss": 0.0003,
1019
+ "reward": 0.43750000931322575,
1020
+ "reward_std": 0.25147588178515434,
1021
+ "rewards/accuracy_reward": 0.0,
1022
+ "rewards/format_reward": 0.43750000931322575,
1023
+ "step": 78
1024
+ },
1025
+ {
1026
+ "completion_length": 1356.2604446411133,
1027
+ "epoch": 0.632,
1028
+ "grad_norm": 0.11168374121189117,
1029
+ "kl": 0.00591731071472168,
1030
+ "learning_rate": 5.752317012567362e-07,
1031
+ "loss": 0.0002,
1032
+ "reward": 0.6458333488553762,
1033
+ "reward_std": 0.49627581238746643,
1034
+ "rewards/accuracy_reward": 0.22916667722165585,
1035
+ "rewards/format_reward": 0.41666668094694614,
1036
+ "step": 79
1037
+ },
1038
+ {
1039
+ "completion_length": 960.5625381469727,
1040
+ "epoch": 0.64,
1041
+ "grad_norm": 0.11304420977830887,
1042
+ "kl": 0.011513233184814453,
1043
+ "learning_rate": 5.626208153241411e-07,
1044
+ "loss": 0.0005,
1045
+ "reward": 0.9062500074505806,
1046
+ "reward_std": 0.44416263699531555,
1047
+ "rewards/accuracy_reward": 0.18750001024454832,
1048
+ "rewards/format_reward": 0.718750013038516,
1049
+ "step": 80
1050
+ },
1051
+ {
1052
+ "completion_length": 1400.8750305175781,
1053
+ "epoch": 0.648,
1054
+ "grad_norm": 0.0740160122513771,
1055
+ "kl": 0.0074901580810546875,
1056
+ "learning_rate": 5.5e-07,
1057
+ "loss": 0.0003,
1058
+ "reward": 0.5729166753590107,
1059
+ "reward_std": 0.3763520009815693,
1060
+ "rewards/accuracy_reward": 0.0416666679084301,
1061
+ "rewards/format_reward": 0.5312500111758709,
1062
+ "step": 81
1063
+ },
1064
+ {
1065
+ "completion_length": 754.583366394043,
1066
+ "epoch": 0.656,
1067
+ "grad_norm": 0.22753088176250458,
1068
+ "kl": 0.015882015228271484,
1069
+ "learning_rate": 5.373791846758589e-07,
1070
+ "loss": 0.0006,
1071
+ "reward": 1.041666692122817,
1072
+ "reward_std": 0.43276379629969597,
1073
+ "rewards/accuracy_reward": 0.3437500074505806,
1074
+ "rewards/format_reward": 0.6979166772216558,
1075
+ "step": 82
1076
+ },
1077
+ {
1078
+ "completion_length": 1032.552116394043,
1079
+ "epoch": 0.664,
1080
+ "grad_norm": 0.10078462958335876,
1081
+ "kl": 0.012661933898925781,
1082
+ "learning_rate": 5.247682987432637e-07,
1083
+ "loss": 0.0005,
1084
+ "reward": 0.7500000111758709,
1085
+ "reward_std": 0.5134871490299702,
1086
+ "rewards/accuracy_reward": 0.15625,
1087
+ "rewards/format_reward": 0.5937500111758709,
1088
+ "step": 83
1089
+ },
1090
+ {
1091
+ "completion_length": 1074.0000267028809,
1092
+ "epoch": 0.672,
1093
+ "grad_norm": 0.06993328779935837,
1094
+ "kl": 0.011388182640075684,
1095
+ "learning_rate": 5.121772637818387e-07,
1096
+ "loss": 0.0005,
1097
+ "reward": 0.7291666846722364,
1098
+ "reward_std": 0.3966032788157463,
1099
+ "rewards/accuracy_reward": 0.0416666679084301,
1100
+ "rewards/format_reward": 0.6875000223517418,
1101
+ "step": 84
1102
+ },
1103
+ {
1104
+ "completion_length": 1188.4062728881836,
1105
+ "epoch": 0.68,
1106
+ "grad_norm": 0.11544036120176315,
1107
+ "kl": 0.009247779846191406,
1108
+ "learning_rate": 4.996159857535115e-07,
1109
+ "loss": 0.0004,
1110
+ "reward": 0.8437500149011612,
1111
+ "reward_std": 0.5094321705400944,
1112
+ "rewards/accuracy_reward": 0.2500000074505806,
1113
+ "rewards/format_reward": 0.5937500074505806,
1114
+ "step": 85
1115
+ },
1116
+ {
1117
+ "completion_length": 988.0000457763672,
1118
+ "epoch": 0.688,
1119
+ "grad_norm": 0.13851231336593628,
1120
+ "kl": 0.0120697021484375,
1121
+ "learning_rate": 4.870943472090254e-07,
1122
+ "loss": 0.0005,
1123
+ "reward": 0.739583358168602,
1124
+ "reward_std": 0.42295433580875397,
1125
+ "rewards/accuracy_reward": 0.041666666977107525,
1126
+ "rewards/format_reward": 0.6979166865348816,
1127
+ "step": 86
1128
+ },
1129
+ {
1130
+ "completion_length": 853.0833473205566,
1131
+ "epoch": 0.696,
1132
+ "grad_norm": 0.11069481074810028,
1133
+ "kl": 0.014725208282470703,
1134
+ "learning_rate": 4.7462219951286864e-07,
1135
+ "loss": 0.0006,
1136
+ "reward": 0.6979166716337204,
1137
+ "reward_std": 0.4276985712349415,
1138
+ "rewards/accuracy_reward": 0.010416666977107525,
1139
+ "rewards/format_reward": 0.6875000074505806,
1140
+ "step": 87
1141
+ },
1142
+ {
1143
+ "completion_length": 1298.302116394043,
1144
+ "epoch": 0.704,
1145
+ "grad_norm": 0.127773255109787,
1146
+ "kl": 0.01068568229675293,
1147
+ "learning_rate": 4.6220935509274227e-07,
1148
+ "loss": 0.0004,
1149
+ "reward": 0.5729166893288493,
1150
+ "reward_std": 0.4483075179159641,
1151
+ "rewards/accuracy_reward": 0.0,
1152
+ "rewards/format_reward": 0.5729166893288493,
1153
+ "step": 88
1154
+ },
1155
+ {
1156
+ "completion_length": 806.1354351043701,
1157
+ "epoch": 0.712,
1158
+ "grad_norm": 0.09778036177158356,
1159
+ "kl": 0.018845558166503906,
1160
+ "learning_rate": 4.4986557971965856e-07,
1161
+ "loss": 0.0008,
1162
+ "reward": 1.0937500223517418,
1163
+ "reward_std": 0.2953929826617241,
1164
+ "rewards/accuracy_reward": 0.27083333395421505,
1165
+ "rewards/format_reward": 0.8229166865348816,
1166
+ "step": 89
1167
+ },
1168
+ {
1169
+ "completion_length": 687.0000152587891,
1170
+ "epoch": 0.72,
1171
+ "grad_norm": 0.17915241420269012,
1172
+ "kl": 0.017522811889648438,
1173
+ "learning_rate": 4.3760058482474783e-07,
1174
+ "loss": 0.0007,
1175
+ "reward": 1.041666679084301,
1176
+ "reward_std": 0.38721640035510063,
1177
+ "rewards/accuracy_reward": 0.16666667349636555,
1178
+ "rewards/format_reward": 0.8750000074505806,
1179
+ "step": 90
1180
+ },
1181
+ {
1182
+ "completion_length": 881.3437614440918,
1183
+ "epoch": 0.728,
1184
+ "grad_norm": 0.14008083939552307,
1185
+ "kl": 0.013111591339111328,
1186
+ "learning_rate": 4.254240198588178e-07,
1187
+ "loss": 0.0005,
1188
+ "reward": 0.8854166865348816,
1189
+ "reward_std": 0.31409377604722977,
1190
+ "rewards/accuracy_reward": 0.05208333395421505,
1191
+ "rewards/format_reward": 0.8333333507180214,
1192
+ "step": 91
1193
+ },
1194
+ {
1195
+ "completion_length": 927.1354446411133,
1196
+ "epoch": 0.736,
1197
+ "grad_norm": 0.04579927399754524,
1198
+ "kl": 0.015148162841796875,
1199
+ "learning_rate": 4.133454647006733e-07,
1200
+ "loss": 0.0006,
1201
+ "reward": 1.0520833656191826,
1202
+ "reward_std": 0.3989434242248535,
1203
+ "rewards/accuracy_reward": 0.3645833395421505,
1204
+ "rewards/format_reward": 0.6875000149011612,
1205
+ "step": 92
1206
+ },
1207
+ {
1208
+ "completion_length": 1096.2708702087402,
1209
+ "epoch": 0.744,
1210
+ "grad_norm": 0.11566469073295593,
1211
+ "kl": 0.012713432312011719,
1212
+ "learning_rate": 4.013744221201749e-07,
1213
+ "loss": 0.0005,
1214
+ "reward": 0.6770833414047956,
1215
+ "reward_std": 0.4284285344183445,
1216
+ "rewards/accuracy_reward": 0.0416666679084301,
1217
+ "rewards/format_reward": 0.6354166772216558,
1218
+ "step": 93
1219
+ },
1220
+ {
1221
+ "completion_length": 690.1041774749756,
1222
+ "epoch": 0.752,
1223
+ "grad_norm": 0.15324266254901886,
1224
+ "kl": 0.016754150390625,
1225
+ "learning_rate": 3.895203103019592e-07,
1226
+ "loss": 0.0007,
1227
+ "reward": 1.0208333507180214,
1228
+ "reward_std": 0.4104958660900593,
1229
+ "rewards/accuracy_reward": 0.15625000838190317,
1230
+ "rewards/format_reward": 0.8645833507180214,
1231
+ "step": 94
1232
+ },
1233
+ {
1234
+ "completion_length": 814.8437843322754,
1235
+ "epoch": 0.76,
1236
+ "grad_norm": 0.11410548537969589,
1237
+ "kl": 0.018395423889160156,
1238
+ "learning_rate": 3.777924554357096e-07,
1239
+ "loss": 0.0007,
1240
+ "reward": 0.9479166865348816,
1241
+ "reward_std": 0.4380828067660332,
1242
+ "rewards/accuracy_reward": 0.1145833358168602,
1243
+ "rewards/format_reward": 0.8333333507180214,
1244
+ "step": 95
1245
+ },
1246
+ {
1247
+ "completion_length": 918.2187805175781,
1248
+ "epoch": 0.768,
1249
+ "grad_norm": 0.16425909101963043,
1250
+ "kl": 0.012192249298095703,
1251
+ "learning_rate": 3.662000843788018e-07,
1252
+ "loss": 0.0005,
1253
+ "reward": 0.8125000149011612,
1254
+ "reward_std": 0.3481748141348362,
1255
+ "rewards/accuracy_reward": 0.07291666977107525,
1256
+ "rewards/format_reward": 0.7395833432674408,
1257
+ "step": 96
1258
+ },
1259
+ {
1260
+ "completion_length": 844.8541870117188,
1261
+ "epoch": 0.776,
1262
+ "grad_norm": 0.1520727574825287,
1263
+ "kl": 0.0149383544921875,
1264
+ "learning_rate": 3.547523173970989e-07,
1265
+ "loss": 0.0006,
1266
+ "reward": 0.9791666939854622,
1267
+ "reward_std": 0.42518816888332367,
1268
+ "rewards/accuracy_reward": 0.17708333488553762,
1269
+ "rewards/format_reward": 0.802083358168602,
1270
+ "step": 97
1271
+ },
1272
+ {
1273
+ "completion_length": 678.458366394043,
1274
+ "epoch": 0.784,
1275
+ "grad_norm": 0.11712879687547684,
1276
+ "kl": 0.018598556518554688,
1277
+ "learning_rate": 3.4345816098960794e-07,
1278
+ "loss": 0.0007,
1279
+ "reward": 1.270833358168602,
1280
+ "reward_std": 0.4059586226940155,
1281
+ "rewards/accuracy_reward": 0.4166666753590107,
1282
+ "rewards/format_reward": 0.854166679084301,
1283
+ "step": 98
1284
+ },
1285
+ {
1286
+ "completion_length": 707.4479446411133,
1287
+ "epoch": 0.792,
1288
+ "grad_norm": 0.00943591445684433,
1289
+ "kl": 0.018640995025634766,
1290
+ "learning_rate": 3.323265008026421e-07,
1291
+ "loss": 0.0007,
1292
+ "reward": 1.020833346992731,
1293
+ "reward_std": 0.3158864565193653,
1294
+ "rewards/accuracy_reward": 0.1666666679084301,
1295
+ "rewards/format_reward": 0.8541666753590107,
1296
+ "step": 99
1297
+ },
1298
+ {
1299
+ "completion_length": 900.7083511352539,
1300
  "epoch": 0.8,
1301
+ "grad_norm": 0.1339350938796997,
1302
+ "kl": 0.014582633972167969,
1303
+ "learning_rate": 3.2136609463906184e-07,
1304
+ "loss": 0.0006,
1305
+ "reward": 0.7604166865348816,
1306
+ "reward_std": 0.4508432447910309,
1307
+ "rewards/accuracy_reward": 0.0416666679084301,
1308
+ "rewards/format_reward": 0.7187500223517418,
1309
+ "step": 100
1310
+ },
1311
+ {
1312
+ "completion_length": 753.3750286102295,
1313
+ "epoch": 0.808,
1314
+ "grad_norm": 0.09386853128671646,
1315
+ "kl": 0.01984691619873047,
1316
+ "learning_rate": 3.105855655680986e-07,
1317
+ "loss": 0.0008,
1318
+ "reward": 1.0312500149011612,
1319
+ "reward_std": 0.3750321790575981,
1320
+ "rewards/accuracy_reward": 0.16666666883975267,
1321
+ "rewards/format_reward": 0.8645833432674408,
1322
+ "step": 101
1323
+ },
1324
+ {
1325
+ "completion_length": 962.6666831970215,
1326
+ "epoch": 0.816,
1327
+ "grad_norm": 0.17427518963813782,
1328
+ "kl": 0.017024517059326172,
1329
+ "learning_rate": 2.999933951411791e-07,
1330
+ "loss": 0.0007,
1331
+ "reward": 0.9895833656191826,
1332
+ "reward_std": 0.3274080455303192,
1333
+ "rewards/accuracy_reward": 0.2812500074505806,
1334
+ "rewards/format_reward": 0.7083333432674408,
1335
+ "step": 102
1336
+ },
1337
+ {
1338
+ "completion_length": 812.7500228881836,
1339
+ "epoch": 0.824,
1340
+ "grad_norm": 0.08086265623569489,
1341
+ "kl": 0.01704549789428711,
1342
+ "learning_rate": 2.895979167190874e-07,
1343
+ "loss": 0.0007,
1344
+ "reward": 0.864583358168602,
1345
+ "reward_std": 0.4472285062074661,
1346
+ "rewards/accuracy_reward": 0.10416666977107525,
1347
+ "rewards/format_reward": 0.7604166939854622,
1348
+ "step": 103
1349
+ },
1350
+ {
1351
+ "completion_length": 592.4583549499512,
1352
+ "epoch": 0.832,
1353
+ "grad_norm": 0.1811506450176239,
1354
+ "kl": 0.02013111114501953,
1355
+ "learning_rate": 2.794073089157173e-07,
1356
+ "loss": 0.0008,
1357
+ "reward": 1.2187500298023224,
1358
+ "reward_std": 0.39934028312563896,
1359
+ "rewards/accuracy_reward": 0.3020833367481828,
1360
+ "rewards/format_reward": 0.916666679084301,
1361
+ "step": 104
1362
+ },
1363
+ {
1364
+ "completion_length": 906.2500267028809,
1365
+ "epoch": 0.84,
1366
+ "grad_norm": 0.15297923982143402,
1367
+ "kl": 0.018891334533691406,
1368
+ "learning_rate": 2.6942958916356994e-07,
1369
+ "loss": 0.0008,
1370
+ "reward": 0.8125000186264515,
1371
+ "reward_std": 0.44843045622110367,
1372
+ "rewards/accuracy_reward": 0.10416666697710752,
1373
+ "rewards/format_reward": 0.7083333544433117,
1374
+ "step": 105
1375
+ },
1376
+ {
1377
+ "completion_length": 995.7604522705078,
1378
+ "epoch": 0.848,
1379
+ "grad_norm": 0.12237068265676498,
1380
+ "kl": 0.014804840087890625,
1381
+ "learning_rate": 2.596726074060607e-07,
1382
+ "loss": 0.0006,
1383
+ "reward": 0.7604166865348816,
1384
+ "reward_std": 0.5074402429163456,
1385
+ "rewards/accuracy_reward": 0.05208333395421505,
1386
+ "rewards/format_reward": 0.7083333507180214,
1387
+ "step": 106
1388
+ },
1389
+ {
1390
+ "completion_length": 878.5937728881836,
1391
+ "epoch": 0.856,
1392
+ "grad_norm": 0.13649912178516388,
1393
+ "kl": 0.026726722717285156,
1394
+ "learning_rate": 2.501440399215983e-07,
1395
+ "loss": 0.0011,
1396
+ "reward": 0.7916666939854622,
1397
+ "reward_std": 0.4398350641131401,
1398
+ "rewards/accuracy_reward": 0.0416666679084301,
1399
+ "rewards/format_reward": 0.7500000223517418,
1400
+ "step": 107
1401
+ },
1402
+ {
1403
+ "completion_length": 822.510440826416,
1404
+ "epoch": 0.864,
1405
+ "grad_norm": 0.17916011810302734,
1406
+ "kl": 0.018551349639892578,
1407
+ "learning_rate": 2.4085138328429425e-07,
1408
+ "loss": 0.0007,
1409
+ "reward": 1.0520833656191826,
1410
+ "reward_std": 0.4777773655951023,
1411
+ "rewards/accuracy_reward": 0.21875000465661287,
1412
+ "rewards/format_reward": 0.8333333507180214,
1413
+ "step": 108
1414
+ },
1415
+ {
1416
+ "completion_length": 900.5312728881836,
1417
+ "epoch": 0.872,
1418
+ "grad_norm": 0.12796513736248016,
1419
+ "kl": 0.01479339599609375,
1420
+ "learning_rate": 2.3180194846605364e-07,
1421
+ "loss": 0.0006,
1422
+ "reward": 1.0625000223517418,
1423
+ "reward_std": 0.3528694063425064,
1424
+ "rewards/accuracy_reward": 0.2395833432674408,
1425
+ "rewards/format_reward": 0.822916679084301,
1426
+ "step": 109
1427
+ },
1428
+ {
1429
+ "completion_length": 847.4896087646484,
1430
+ "epoch": 0.88,
1431
+ "grad_norm": 0.14577722549438477,
1432
+ "kl": 0.015839576721191406,
1433
+ "learning_rate": 2.2300285508468792e-07,
1434
+ "loss": 0.0006,
1435
+ "reward": 1.0208333656191826,
1436
+ "reward_std": 0.3138932101428509,
1437
+ "rewards/accuracy_reward": 0.1458333358168602,
1438
+ "rewards/format_reward": 0.8750000149011612,
1439
+ "step": 110
1440
+ },
1441
+ {
1442
+ "completion_length": 684.7187690734863,
1443
+ "epoch": 0.888,
1444
+ "grad_norm": 0.08863761276006699,
1445
+ "kl": 0.02144336700439453,
1446
+ "learning_rate": 2.1446102580257546e-07,
1447
+ "loss": 0.0009,
1448
+ "reward": 1.0208333507180214,
1449
+ "reward_std": 0.32314179465174675,
1450
+ "rewards/accuracy_reward": 0.18750000279396772,
1451
+ "rewards/format_reward": 0.8333333507180214,
1452
+ "step": 111
1453
+ },
1454
+ {
1455
+ "completion_length": 648.2708473205566,
1456
+ "epoch": 0.896,
1457
+ "grad_norm": 0.1076781153678894,
1458
+ "kl": 0.018851280212402344,
1459
+ "learning_rate": 2.0618318088027664e-07,
1460
+ "loss": 0.0008,
1461
+ "reward": 0.9895833507180214,
1462
+ "reward_std": 0.34308793768286705,
1463
+ "rewards/accuracy_reward": 0.08333333395421505,
1464
+ "rewards/format_reward": 0.9062500149011612,
1465
+ "step": 112
1466
+ },
1467
+ {
1468
+ "completion_length": 848.979175567627,
1469
+ "epoch": 0.904,
1470
+ "grad_norm": 0.1795138269662857,
1471
+ "kl": 0.028219223022460938,
1472
+ "learning_rate": 1.9817583288938662e-07,
1473
+ "loss": 0.0011,
1474
+ "reward": 0.9687500298023224,
1475
+ "reward_std": 0.48388223350048065,
1476
+ "rewards/accuracy_reward": 0.20833334140479565,
1477
+ "rewards/format_reward": 0.7604166828095913,
1478
+ "step": 113
1479
+ },
1480
+ {
1481
+ "completion_length": 807.0416793823242,
1482
+ "epoch": 0.912,
1483
+ "grad_norm": 0.1863919496536255,
1484
+ "kl": 0.016091346740722656,
1485
+ "learning_rate": 1.9044528158878803e-07,
1486
+ "loss": 0.0006,
1487
+ "reward": 1.0625000149011612,
1488
+ "reward_std": 0.36854929849505424,
1489
+ "rewards/accuracy_reward": 0.16666666697710752,
1490
+ "rewards/format_reward": 0.8958333507180214,
1491
+ "step": 114
1492
+ },
1493
+ {
1494
+ "completion_length": 730.8125228881836,
1495
+ "epoch": 0.92,
1496
+ "grad_norm": 0.008086685091257095,
1497
+ "kl": 0.019659996032714844,
1498
+ "learning_rate": 1.8299760896833295e-07,
1499
+ "loss": 0.0008,
1500
+ "reward": 1.0312500149011612,
1501
+ "reward_std": 0.1851910501718521,
1502
+ "rewards/accuracy_reward": 0.125,
1503
+ "rewards/format_reward": 0.9062500149011612,
1504
+ "step": 115
1505
+ },
1506
+ {
1507
+ "completion_length": 1080.8958587646484,
1508
+ "epoch": 0.928,
1509
+ "grad_norm": 0.13679863512516022,
1510
+ "kl": 0.0175095796585083,
1511
+ "learning_rate": 1.758386744638546e-07,
1512
+ "loss": 0.0007,
1513
+ "reward": 0.6875000223517418,
1514
+ "reward_std": 0.4191872850060463,
1515
+ "rewards/accuracy_reward": 0.0416666679084301,
1516
+ "rewards/format_reward": 0.6458333507180214,
1517
+ "step": 116
1518
+ },
1519
+ {
1520
+ "completion_length": 924.6250343322754,
1521
+ "epoch": 0.936,
1522
+ "grad_norm": 0.1184714287519455,
1523
+ "kl": 0.015748023986816406,
1524
+ "learning_rate": 1.6897411034727217e-07,
1525
+ "loss": 0.0006,
1526
+ "reward": 0.989583358168602,
1527
+ "reward_std": 0.4329289048910141,
1528
+ "rewards/accuracy_reward": 0.16666667722165585,
1529
+ "rewards/format_reward": 0.8229166865348816,
1530
+ "step": 117
1531
+ },
1532
+ {
1533
+ "completion_length": 780.7083511352539,
1534
+ "epoch": 0.944,
1535
+ "grad_norm": 0.15466775000095367,
1536
+ "kl": 0.025674819946289062,
1537
+ "learning_rate": 1.6240931729541597e-07,
1538
+ "loss": 0.001,
1539
+ "reward": 0.9687500149011612,
1540
+ "reward_std": 0.34918052703142166,
1541
+ "rewards/accuracy_reward": 0.12500000093132257,
1542
+ "rewards/format_reward": 0.8437500149011612,
1543
+ "step": 118
1544
+ },
1545
+ {
1546
+ "completion_length": 614.0625038146973,
1547
+ "epoch": 0.952,
1548
+ "grad_norm": 0.21478953957557678,
1549
+ "kl": 0.022258758544921875,
1550
+ "learning_rate": 1.5614946014106085e-07,
1551
+ "loss": 0.0009,
1552
+ "reward": 0.9270833507180214,
1553
+ "reward_std": 0.35268617793917656,
1554
+ "rewards/accuracy_reward": 0.052083334885537624,
1555
+ "rewards/format_reward": 0.8750000149011612,
1556
+ "step": 119
1557
+ },
1558
+ {
1559
+ "completion_length": 785.1666870117188,
1560
+ "epoch": 0.96,
1561
+ "grad_norm": 0.053828902542591095,
1562
+ "kl": 0.01752948760986328,
1563
+ "learning_rate": 1.5019946380950755e-07,
1564
+ "loss": 0.0007,
1565
+ "reward": 1.010416679084301,
1566
+ "reward_std": 0.328002754598856,
1567
+ "rewards/accuracy_reward": 0.1458333358168602,
1568
+ "rewards/format_reward": 0.8645833507180214,
1569
+ "step": 120
1570
+ },
1571
+ {
1572
+ "completion_length": 675.9166831970215,
1573
+ "epoch": 0.968,
1574
+ "grad_norm": 0.31227222084999084,
1575
+ "kl": 0.02220916748046875,
1576
+ "learning_rate": 1.4456400944391144e-07,
1577
+ "loss": 0.0009,
1578
+ "reward": 0.9791666865348816,
1579
+ "reward_std": 0.3436807915568352,
1580
+ "rewards/accuracy_reward": 0.11458333861082792,
1581
+ "rewards/format_reward": 0.8645833507180214,
1582
+ "step": 121
1583
+ },
1584
+ {
1585
+ "completion_length": 901.2083511352539,
1586
+ "epoch": 0.976,
1587
+ "grad_norm": 0.13310351967811584,
1588
+ "kl": 0.020013809204101562,
1589
+ "learning_rate": 1.392475307224065e-07,
1590
+ "loss": 0.0008,
1591
+ "reward": 0.7395833507180214,
1592
+ "reward_std": 0.2778088189661503,
1593
+ "rewards/accuracy_reward": 0.0,
1594
+ "rewards/format_reward": 0.7395833507180214,
1595
+ "step": 122
1596
+ },
1597
+ {
1598
+ "completion_length": 674.9791946411133,
1599
+ "epoch": 0.984,
1600
+ "grad_norm": 0.09870073944330215,
1601
+ "kl": 0.02259063720703125,
1602
+ "learning_rate": 1.3425421036992097e-07,
1603
+ "loss": 0.0009,
1604
+ "reward": 1.0104166939854622,
1605
+ "reward_std": 0.32367467880249023,
1606
+ "rewards/accuracy_reward": 0.15625000093132257,
1607
+ "rewards/format_reward": 0.8541666865348816,
1608
+ "step": 123
1609
+ },
1610
+ {
1611
+ "completion_length": 878.7083625793457,
1612
+ "epoch": 0.992,
1613
+ "grad_norm": 0.2026570737361908,
1614
+ "kl": 0.018926620483398438,
1615
+ "learning_rate": 1.2958797686743014e-07,
1616
+ "loss": 0.0008,
1617
+ "reward": 0.8437500223517418,
1618
+ "reward_std": 0.5021120570600033,
1619
+ "rewards/accuracy_reward": 0.10416666977107525,
1620
+ "rewards/format_reward": 0.7395833507180214,
1621
+ "step": 124
1622
+ },
1623
+ {
1624
+ "completion_length": 611.5,
1625
+ "epoch": 1.0,
1626
+ "grad_norm": 0.2026570737361908,
1627
+ "kl": 0.03096485137939453,
1628
+ "learning_rate": 1.2958797686743014e-07,
1629
+ "loss": 0.0012,
1630
+ "reward": 0.9687500298023224,
1631
+ "reward_std": 0.3619955964386463,
1632
+ "rewards/accuracy_reward": 0.11458333395421505,
1633
+ "rewards/format_reward": 0.854166679084301,
1634
+ "step": 125
1635
+ },
1636
+ {
1637
+ "epoch": 1.0,
1638
+ "step": 125,
1639
  "total_flos": 0.0,
1640
+ "train_loss": 0.0003328805177226313,
1641
+ "train_runtime": 21588.6366,
1642
+ "train_samples_per_second": 0.046,
1643
+ "train_steps_per_second": 0.006
1644
  }
1645
  ],
1646
  "logging_steps": 1,
1647
+ "max_steps": 125,
1648
  "num_input_tokens_seen": 0,
1649
  "num_train_epochs": 1,
1650
  "save_steps": 500,