yolay committed
Commit 4315844 · verified · 1 Parent(s): e475ee4

Model save

README.md ADDED
@@ -0,0 +1,67 @@
+ ---
+ library_name: transformers
+ model_name: Qwen-2.5-7B-Simple-RL
+ tags:
+ - generated_from_trainer
+ - trl
+ - grpo
+ licence: license
+ ---
+
+ # Model Card for Qwen-2.5-7B-Simple-RL
+
+ This model is a fine-tuned version of a Qwen2.5-7B base model (the exact base checkpoint is not recorded in the card metadata).
+ It has been trained using [TRL](https://github.com/huggingface/trl).
+
+ ## Quick start
+
+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="yolay/Qwen-2.5-7B-Simple-RL", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
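
If you prefer not to go through the `pipeline` helper, the sketch below loads the checkpoint directly with `AutoModelForCausalLM`/`AutoTokenizer` and applies the chat template before generating. It is a minimal sketch, assuming the repository ships the usual Qwen2.5-style chat template and that a CUDA device is available; adjust `device_map` and `torch_dtype` to your setup.

```python
# Minimal sketch (assumes the repo provides a chat template, as Qwen2.5 checkpoints usually do).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yolay/Qwen-2.5-7B-Simple-RL"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "What is 13 * 17?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# max_new_tokens mirrors the pipeline example above; raise it for long reasoning traces.
output_ids = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```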
+
+ ## Training procedure
+
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/yuleiqin-tencent/huggingface/runs/y7jlqug1)
+
+ This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
+
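
For readers who want to set up a comparable run, here is a hypothetical GRPO training sketch using TRL's `GRPOTrainer`. The dataset name, reward function, and base-model id are illustrative placeholders, not the exact recipe behind this checkpoint; only the batch size, epoch count, logging, and save intervals are taken from the `trainer_state.json` in this commit, and the framework versions are listed below.

```python
# Hypothetical sketch of a GRPO run with TRL; not the exact configuration used for this model.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def accuracy_reward(completions, **kwargs):
    # Placeholder reward: 1.0 if the completion contains "42", else 0.0.
    # The real run scored answers against reference solutions (see rewards/accuracy_reward in the logs).
    return [1.0 if "42" in completion else 0.0 for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # illustrative dataset, not the one used here

training_args = GRPOConfig(
    output_dir="Qwen-2.5-7B-Simple-RL",
    per_device_train_batch_size=2,   # matches train_batch_size in trainer_state.json
    num_train_epochs=1,              # matches num_train_epochs in trainer_state.json
    logging_steps=5,                 # matches logging_steps in trainer_state.json
    save_steps=100,                  # matches save_steps in trainer_state.json
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",         # assumed base model; not recorded in the card metadata
    reward_funcs=accuracy_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```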
+ ### Framework versions
+
+ - TRL: 0.15.0.dev0
+ - Transformers: 4.49.0.dev0
+ - PyTorch: 2.5.1
+ - Datasets: 3.2.0
+ - Tokenizers: 0.21.0
+
+ ## Citations
+
+ Cite GRPO as:
+
+ ```bibtex
+ @article{zhihong2024deepseekmath,
+     title        = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
+     author       = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
+     year         = 2024,
+     eprint       = {arXiv:2402.03300},
+ }
+ ```
+
+ Cite TRL as:
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+     title        = {{TRL: Transformer Reinforcement Learning}},
+     author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
+     year         = 2020,
+     journal      = {GitHub repository},
+     publisher    = {GitHub},
+     howpublished = {\url{https://github.com/huggingface/trl}}
+ }
+ ```
all_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "total_flos": 0.0,
+   "train_loss": 0.00286464851636153,
+   "train_runtime": 7716.3929,
+   "train_samples": 7500,
+   "train_samples_per_second": 0.972,
+   "train_steps_per_second": 0.061
+ }
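
As a quick sanity check on these figures (an editorial back-of-the-envelope reading, not something shipped in the repository), the two throughput fields follow directly from the sample count, the step count in `trainer_state.json`, and the runtime:

```python
# Re-derive the throughput figures in all_results.json (one epoch, 468 optimizer steps).
train_samples = 7500
train_runtime = 7716.3929        # seconds
global_steps = 468               # "global_step" in trainer_state.json

print(train_samples / train_runtime)  # ~0.972 -> matches train_samples_per_second
print(global_steps / train_runtime)   # ~0.061 -> matches train_steps_per_second
```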
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "bos_token_id": 151643,
+   "eos_token_id": 151643,
+   "max_new_tokens": 2048,
+   "transformers_version": "4.49.0.dev0"
+ }
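
This file holds the default generation settings that `generate()` falls back on when no explicit arguments are passed. A minimal sketch of inspecting it with the standard `transformers` API (nothing specific to this repository):

```python
# Load the repo's default generation settings; generate() uses these unless overridden per call.
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("yolay/Qwen-2.5-7B-Simple-RL")
print(gen_config.max_new_tokens)  # 2048
print(gen_config.eos_token_id)    # 151643

# Per-call arguments, such as max_new_tokens=128 in the Quick start example, take precedence.
```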
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "total_flos": 0.0,
+   "train_loss": 0.00286464851636153,
+   "train_runtime": 7716.3929,
+   "train_samples": 7500,
+   "train_samples_per_second": 0.972,
+   "train_steps_per_second": 0.061
+ }
trainer_state.json ADDED
@@ -0,0 +1,1313 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 0.9984,
5
+ "eval_steps": 100,
6
+ "global_step": 468,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "completion_length": 637.8268123626709,
13
+ "epoch": 0.010666666666666666,
14
+ "grad_norm": 0.9318448901176453,
15
+ "kl": 0.00011245012283325195,
16
+ "learning_rate": 3.1914893617021275e-07,
17
+ "loss": 0.0,
18
+ "reward": 0.6035714592784643,
19
+ "reward_std": 0.3709465142339468,
20
+ "rewards/accuracy_reward": 0.6017857443541288,
21
+ "rewards/format_reward": 0.001785714365541935,
22
+ "step": 5
23
+ },
24
+ {
25
+ "completion_length": 611.45181350708,
26
+ "epoch": 0.021333333333333333,
27
+ "grad_norm": 6.491418838500977,
28
+ "kl": 0.0001920461654663086,
29
+ "learning_rate": 6.382978723404255e-07,
30
+ "loss": 0.0,
31
+ "reward": 0.617857176065445,
32
+ "reward_std": 0.36645040661096573,
33
+ "rewards/accuracy_reward": 0.617857176065445,
34
+ "rewards/format_reward": 0.0,
35
+ "step": 10
36
+ },
37
+ {
38
+ "completion_length": 604.3607414245605,
39
+ "epoch": 0.032,
40
+ "grad_norm": 9.168869018554688,
41
+ "kl": 0.00031398534774780275,
42
+ "learning_rate": 9.574468085106384e-07,
43
+ "loss": 0.0,
44
+ "reward": 0.6482143150642514,
45
+ "reward_std": 0.34973760321736336,
46
+ "rewards/accuracy_reward": 0.6482143150642514,
47
+ "rewards/format_reward": 0.0,
48
+ "step": 15
49
+ },
50
+ {
51
+ "completion_length": 617.7803817749024,
52
+ "epoch": 0.042666666666666665,
53
+ "grad_norm": 1.5698275566101074,
54
+ "kl": 0.0008591651916503906,
55
+ "learning_rate": 1.276595744680851e-06,
56
+ "loss": 0.0,
57
+ "reward": 0.6303571715950966,
58
+ "reward_std": 0.35366948917508123,
59
+ "rewards/accuracy_reward": 0.6303571715950966,
60
+ "rewards/format_reward": 0.0,
61
+ "step": 20
62
+ },
63
+ {
64
+ "completion_length": 639.5375259399414,
65
+ "epoch": 0.05333333333333334,
66
+ "grad_norm": 0.858055055141449,
67
+ "kl": 0.002974271774291992,
68
+ "learning_rate": 1.5957446808510639e-06,
69
+ "loss": 0.0001,
70
+ "reward": 0.6125000298023224,
71
+ "reward_std": 0.3530873417854309,
72
+ "rewards/accuracy_reward": 0.6125000298023224,
73
+ "rewards/format_reward": 0.0,
74
+ "step": 25
75
+ },
76
+ {
77
+ "completion_length": 664.7107475280761,
78
+ "epoch": 0.064,
79
+ "grad_norm": 1.1970467567443848,
80
+ "kl": 0.005088996887207031,
81
+ "learning_rate": 1.9148936170212767e-06,
82
+ "loss": 0.0002,
83
+ "reward": 0.6553571727126837,
84
+ "reward_std": 0.33203957192599776,
85
+ "rewards/accuracy_reward": 0.6553571727126837,
86
+ "rewards/format_reward": 0.0,
87
+ "step": 30
88
+ },
89
+ {
90
+ "completion_length": 632.7196670532227,
91
+ "epoch": 0.07466666666666667,
92
+ "grad_norm": 1.0202564001083374,
93
+ "kl": 0.996532678604126,
94
+ "learning_rate": 2.2340425531914894e-06,
95
+ "loss": 0.0398,
96
+ "reward": 0.7000000279396772,
97
+ "reward_std": 0.3037954304367304,
98
+ "rewards/accuracy_reward": 0.7000000279396772,
99
+ "rewards/format_reward": 0.0,
100
+ "step": 35
101
+ },
102
+ {
103
+ "completion_length": 640.1286018371582,
104
+ "epoch": 0.08533333333333333,
105
+ "grad_norm": 0.5117438435554504,
106
+ "kl": 0.002056884765625,
107
+ "learning_rate": 2.553191489361702e-06,
108
+ "loss": 0.0001,
109
+ "reward": 0.7250000268220902,
110
+ "reward_std": 0.2653495166450739,
111
+ "rewards/accuracy_reward": 0.7250000268220902,
112
+ "rewards/format_reward": 0.0,
113
+ "step": 40
114
+ },
115
+ {
116
+ "completion_length": 607.9250259399414,
117
+ "epoch": 0.096,
118
+ "grad_norm": 0.3817993402481079,
119
+ "kl": 0.002574253082275391,
120
+ "learning_rate": 2.872340425531915e-06,
121
+ "loss": 0.0001,
122
+ "reward": 0.7392857447266579,
123
+ "reward_std": 0.24080881737172605,
124
+ "rewards/accuracy_reward": 0.7392857447266579,
125
+ "rewards/format_reward": 0.0,
126
+ "step": 45
127
+ },
128
+ {
129
+ "completion_length": 628.1535995483398,
130
+ "epoch": 0.10666666666666667,
131
+ "grad_norm": 0.7749062180519104,
132
+ "kl": 0.004121017456054687,
133
+ "learning_rate": 2.9996241442585123e-06,
134
+ "loss": 0.0002,
135
+ "reward": 0.6964286014437675,
136
+ "reward_std": 0.300217243283987,
137
+ "rewards/accuracy_reward": 0.6964286014437675,
138
+ "rewards/format_reward": 0.0,
139
+ "step": 50
140
+ },
141
+ {
142
+ "completion_length": 600.3839553833008,
143
+ "epoch": 0.11733333333333333,
144
+ "grad_norm": 0.5632015466690063,
145
+ "kl": 0.003362083435058594,
146
+ "learning_rate": 2.9973279301399446e-06,
147
+ "loss": 0.0001,
148
+ "reward": 0.7178571719676257,
149
+ "reward_std": 0.27802310809493064,
150
+ "rewards/accuracy_reward": 0.7178571719676257,
151
+ "rewards/format_reward": 0.0,
152
+ "step": 55
153
+ },
154
+ {
155
+ "completion_length": 585.9303833007813,
156
+ "epoch": 0.128,
157
+ "grad_norm": 1.2080248594284058,
158
+ "kl": 0.004734039306640625,
159
+ "learning_rate": 2.992947502998804e-06,
160
+ "loss": 0.0002,
161
+ "reward": 0.7750000357627869,
162
+ "reward_std": 0.25223284475505353,
163
+ "rewards/accuracy_reward": 0.7750000357627869,
164
+ "rewards/format_reward": 0.0,
165
+ "step": 60
166
+ },
167
+ {
168
+ "completion_length": 591.1464614868164,
169
+ "epoch": 0.13866666666666666,
170
+ "grad_norm": 1.7432464361190796,
171
+ "kl": 0.013347434997558593,
172
+ "learning_rate": 2.9864889601923268e-06,
173
+ "loss": 0.0005,
174
+ "reward": 0.7267857432365418,
175
+ "reward_std": 0.2874183960258961,
176
+ "rewards/accuracy_reward": 0.7267857432365418,
177
+ "rewards/format_reward": 0.0,
178
+ "step": 65
179
+ },
180
+ {
181
+ "completion_length": 588.239315032959,
182
+ "epoch": 0.14933333333333335,
183
+ "grad_norm": 0.28103670477867126,
184
+ "kl": 0.004166412353515625,
185
+ "learning_rate": 2.977961291721137e-06,
186
+ "loss": 0.0002,
187
+ "reward": 0.7875000283122062,
188
+ "reward_std": 0.23745907917618753,
189
+ "rewards/accuracy_reward": 0.7875000283122062,
190
+ "rewards/format_reward": 0.0,
191
+ "step": 70
192
+ },
193
+ {
194
+ "completion_length": 554.2375259399414,
195
+ "epoch": 0.16,
196
+ "grad_norm": 0.26267099380493164,
197
+ "kl": 0.004508209228515625,
198
+ "learning_rate": 2.9673763677155655e-06,
199
+ "loss": 0.0002,
200
+ "reward": 0.7767857410013675,
201
+ "reward_std": 0.21466484777629374,
202
+ "rewards/accuracy_reward": 0.7767857410013675,
203
+ "rewards/format_reward": 0.0,
204
+ "step": 75
205
+ },
206
+ {
207
+ "completion_length": 553.0071640014648,
208
+ "epoch": 0.17066666666666666,
209
+ "grad_norm": 0.34145233035087585,
210
+ "kl": 0.0057952880859375,
211
+ "learning_rate": 2.9547489219129666e-06,
212
+ "loss": 0.0002,
213
+ "reward": 0.8285714581608772,
214
+ "reward_std": 0.20639798790216446,
215
+ "rewards/accuracy_reward": 0.8285714581608772,
216
+ "rewards/format_reward": 0.0,
217
+ "step": 80
218
+ },
219
+ {
220
+ "completion_length": 585.5857421875,
221
+ "epoch": 0.18133333333333335,
222
+ "grad_norm": 0.240849107503891,
223
+ "kl": 0.0038000106811523437,
224
+ "learning_rate": 2.9400965311490175e-06,
225
+ "loss": 0.0002,
226
+ "reward": 0.7660714596509933,
227
+ "reward_std": 0.24059830717742442,
228
+ "rewards/accuracy_reward": 0.7660714596509933,
229
+ "rewards/format_reward": 0.0,
230
+ "step": 85
231
+ },
232
+ {
233
+ "completion_length": 558.014315032959,
234
+ "epoch": 0.192,
235
+ "grad_norm": 0.42303234338760376,
236
+ "kl": 0.004909515380859375,
237
+ "learning_rate": 2.9234395908915565e-06,
238
+ "loss": 0.0002,
239
+ "reward": 0.7142857456579804,
240
+ "reward_std": 0.2469261337071657,
241
+ "rewards/accuracy_reward": 0.7142857456579804,
242
+ "rewards/format_reward": 0.0,
243
+ "step": 90
244
+ },
245
+ {
246
+ "completion_length": 565.8589538574219,
247
+ "epoch": 0.20266666666666666,
248
+ "grad_norm": 0.36556175351142883,
249
+ "kl": 0.004494476318359375,
250
+ "learning_rate": 2.904801286851009e-06,
251
+ "loss": 0.0002,
252
+ "reward": 0.7517857383936644,
253
+ "reward_std": 0.2242392159998417,
254
+ "rewards/accuracy_reward": 0.7517857383936644,
255
+ "rewards/format_reward": 0.0,
256
+ "step": 95
257
+ },
258
+ {
259
+ "completion_length": 520.4696662902832,
260
+ "epoch": 0.21333333333333335,
261
+ "grad_norm": 0.2530968487262726,
262
+ "kl": 0.00856475830078125,
263
+ "learning_rate": 2.884207562706925e-06,
264
+ "loss": 0.0003,
265
+ "reward": 0.8125000301748514,
266
+ "reward_std": 0.18475012369453908,
267
+ "rewards/accuracy_reward": 0.8125000301748514,
268
+ "rewards/format_reward": 0.0,
269
+ "step": 100
270
+ },
271
+ {
272
+ "epoch": 0.21333333333333335,
273
+ "eval_completion_length": 547.995567590332,
274
+ "eval_kl": 0.00711322021484375,
275
+ "eval_loss": 0.00028467908850871027,
276
+ "eval_reward": 0.6861428862035275,
277
+ "eval_reward_std": 0.270268753862381,
278
+ "eval_rewards/accuracy_reward": 0.6860857433497906,
279
+ "eval_rewards/format_reward": 5.7142859697341916e-05,
280
+ "eval_runtime": 28593.9461,
281
+ "eval_samples_per_second": 0.175,
282
+ "eval_steps_per_second": 0.013,
283
+ "step": 100
284
+ },
285
+ {
286
+ "completion_length": 549.5107398986817,
287
+ "epoch": 0.224,
288
+ "grad_norm": 0.2235504686832428,
289
+ "kl": 0.004962539672851563,
290
+ "learning_rate": 2.8616870839955444e-06,
291
+ "loss": 0.0002,
292
+ "reward": 0.7964286103844642,
293
+ "reward_std": 0.26562733463943006,
294
+ "rewards/accuracy_reward": 0.7964286103844642,
295
+ "rewards/format_reward": 0.0,
296
+ "step": 105
297
+ },
298
+ {
299
+ "completion_length": 581.1750267028808,
300
+ "epoch": 0.23466666666666666,
301
+ "grad_norm": 0.4279918968677521,
302
+ "kl": 0.005224609375,
303
+ "learning_rate": 2.837271198208662e-06,
304
+ "loss": 0.0002,
305
+ "reward": 0.7785714581608772,
306
+ "reward_std": 0.20086282528936864,
307
+ "rewards/accuracy_reward": 0.7785714581608772,
308
+ "rewards/format_reward": 0.0,
309
+ "step": 110
310
+ },
311
+ {
312
+ "completion_length": 532.0053810119629,
313
+ "epoch": 0.24533333333333332,
314
+ "grad_norm": 0.6788883805274963,
315
+ "kl": 0.0057430267333984375,
316
+ "learning_rate": 2.8109938911593322e-06,
317
+ "loss": 0.0002,
318
+ "reward": 0.7767857426777482,
319
+ "reward_std": 0.20144498273730277,
320
+ "rewards/accuracy_reward": 0.7767857426777482,
321
+ "rewards/format_reward": 0.0,
322
+ "step": 115
323
+ },
324
+ {
325
+ "completion_length": 544.6714546203614,
326
+ "epoch": 0.256,
327
+ "grad_norm": 0.23265813291072845,
328
+ "kl": 0.006147003173828125,
329
+ "learning_rate": 2.7828917396751474e-06,
330
+ "loss": 0.0002,
331
+ "reward": 0.7696428894996643,
332
+ "reward_std": 0.20026272870600223,
333
+ "rewards/accuracy_reward": 0.7696428894996643,
334
+ "rewards/format_reward": 0.0,
335
+ "step": 120
336
+ },
337
+ {
338
+ "completion_length": 518.471452331543,
339
+ "epoch": 0.26666666666666666,
340
+ "grad_norm": 0.41492336988449097,
341
+ "kl": 0.00635986328125,
342
+ "learning_rate": 2.753003860684943e-06,
343
+ "loss": 0.0003,
344
+ "reward": 0.8375000298023224,
345
+ "reward_std": 0.20442306995391846,
346
+ "rewards/accuracy_reward": 0.8375000298023224,
347
+ "rewards/format_reward": 0.0,
348
+ "step": 125
349
+ },
350
+ {
351
+ "completion_length": 570.2303848266602,
352
+ "epoch": 0.2773333333333333,
353
+ "grad_norm": 0.4288278818130493,
354
+ "kl": 0.006764602661132812,
355
+ "learning_rate": 2.721371856769793e-06,
356
+ "loss": 0.0003,
357
+ "reward": 0.7160714600235224,
358
+ "reward_std": 0.2673064887523651,
359
+ "rewards/accuracy_reward": 0.7160714600235224,
360
+ "rewards/format_reward": 0.0,
361
+ "step": 130
362
+ },
363
+ {
364
+ "completion_length": 494.16431350708007,
365
+ "epoch": 0.288,
366
+ "grad_norm": 0.5349715352058411,
367
+ "kl": 0.008311080932617187,
368
+ "learning_rate": 2.688039758254093e-06,
369
+ "loss": 0.0003,
370
+ "reward": 0.7910714618861675,
371
+ "reward_std": 0.22742781266570092,
372
+ "rewards/accuracy_reward": 0.7910714618861675,
373
+ "rewards/format_reward": 0.0,
374
+ "step": 135
375
+ },
376
+ {
377
+ "completion_length": 503.5232357025146,
378
+ "epoch": 0.2986666666666667,
379
+ "grad_norm": 0.9469221830368042,
380
+ "kl": 0.011474609375,
381
+ "learning_rate": 2.65305396191733e-06,
382
+ "loss": 0.0005,
383
+ "reward": 0.8053571686148644,
384
+ "reward_std": 0.22859212197363377,
385
+ "rewards/accuracy_reward": 0.8053571686148644,
386
+ "rewards/format_reward": 0.0,
387
+ "step": 140
388
+ },
389
+ {
390
+ "completion_length": 526.3750251770019,
391
+ "epoch": 0.30933333333333335,
392
+ "grad_norm": 0.4655410945415497,
393
+ "kl": 0.016347885131835938,
394
+ "learning_rate": 2.61646316641186e-06,
395
+ "loss": 0.0007,
396
+ "reward": 0.7767857382073998,
397
+ "reward_std": 0.21031193807721138,
398
+ "rewards/accuracy_reward": 0.7767857382073998,
399
+ "rewards/format_reward": 0.0,
400
+ "step": 145
401
+ },
402
+ {
403
+ "completion_length": 536.4660957336425,
404
+ "epoch": 0.32,
405
+ "grad_norm": 0.36182740330696106,
406
+ "kl": 0.020062255859375,
407
+ "learning_rate": 2.5783183044765715e-06,
408
+ "loss": 0.0008,
409
+ "reward": 0.7517857436090708,
410
+ "reward_std": 0.2297923181205988,
411
+ "rewards/accuracy_reward": 0.7517857436090708,
412
+ "rewards/format_reward": 0.0,
413
+ "step": 150
414
+ },
415
+ {
416
+ "completion_length": 514.8803802490235,
417
+ "epoch": 0.33066666666666666,
418
+ "grad_norm": 0.7956529855728149,
419
+ "kl": 0.03543167114257813,
420
+ "learning_rate": 2.5386724720408135e-06,
421
+ "loss": 0.0014,
422
+ "reward": 0.7517857443541288,
423
+ "reward_std": 0.26203566156327723,
424
+ "rewards/accuracy_reward": 0.7517857443541288,
425
+ "rewards/format_reward": 0.0,
426
+ "step": 155
427
+ },
428
+ {
429
+ "completion_length": 549.3571701049805,
430
+ "epoch": 0.3413333333333333,
431
+ "grad_norm": 0.29695257544517517,
432
+ "kl": 0.062085723876953124,
433
+ "learning_rate": 2.49758085431725e-06,
434
+ "loss": 0.0025,
435
+ "reward": 0.7285714589059353,
436
+ "reward_std": 0.269120267406106,
437
+ "rewards/accuracy_reward": 0.7285714589059353,
438
+ "rewards/format_reward": 0.0,
439
+ "step": 160
440
+ },
441
+ {
442
+ "completion_length": 589.5339569091797,
443
+ "epoch": 0.352,
444
+ "grad_norm": 0.41898104548454285,
445
+ "kl": 0.13505706787109376,
446
+ "learning_rate": 2.455100648986533e-06,
447
+ "loss": 0.0054,
448
+ "reward": 0.6857143165543675,
449
+ "reward_std": 0.32202624566853044,
450
+ "rewards/accuracy_reward": 0.6857143165543675,
451
+ "rewards/format_reward": 0.0,
452
+ "step": 165
453
+ },
454
+ {
455
+ "completion_length": 627.4214576721191,
456
+ "epoch": 0.3626666666666667,
457
+ "grad_norm": 0.2985088527202606,
458
+ "kl": 0.1629364013671875,
459
+ "learning_rate": 2.4112909865807053e-06,
460
+ "loss": 0.0065,
461
+ "reward": 0.6410714607685805,
462
+ "reward_std": 0.2777946576476097,
463
+ "rewards/accuracy_reward": 0.6410714607685805,
464
+ "rewards/format_reward": 0.0,
465
+ "step": 170
466
+ },
467
+ {
468
+ "completion_length": 567.5768112182617,
469
+ "epoch": 0.37333333333333335,
470
+ "grad_norm": 0.49856701493263245,
471
+ "kl": 0.115283203125,
472
+ "learning_rate": 2.366212848176164e-06,
473
+ "loss": 0.0046,
474
+ "reward": 0.7089286031201482,
475
+ "reward_std": 0.2502220422029495,
476
+ "rewards/accuracy_reward": 0.7089286031201482,
477
+ "rewards/format_reward": 0.0,
478
+ "step": 175
479
+ },
480
+ {
481
+ "completion_length": 528.6750221252441,
482
+ "epoch": 0.384,
483
+ "grad_norm": 0.6558699607849121,
484
+ "kl": 0.18359222412109374,
485
+ "learning_rate": 2.319928980510752e-06,
486
+ "loss": 0.0073,
487
+ "reward": 0.6535714527592063,
488
+ "reward_std": 0.28206231258809566,
489
+ "rewards/accuracy_reward": 0.6535714527592063,
490
+ "rewards/format_reward": 0.0,
491
+ "step": 180
492
+ },
493
+ {
494
+ "completion_length": 571.7518112182618,
495
+ "epoch": 0.39466666666666667,
496
+ "grad_norm": 6.485378742218018,
497
+ "kl": 0.34942626953125,
498
+ "learning_rate": 2.272503808643123e-06,
499
+ "loss": 0.014,
500
+ "reward": 0.4696428783237934,
501
+ "reward_std": 0.27855144031345846,
502
+ "rewards/accuracy_reward": 0.4696428783237934,
503
+ "rewards/format_reward": 0.0,
504
+ "step": 185
505
+ },
506
+ {
507
+ "completion_length": 542.8089500427246,
508
+ "epoch": 0.4053333333333333,
509
+ "grad_norm": 1.8310225009918213,
510
+ "kl": 0.4147308349609375,
511
+ "learning_rate": 2.2240033462759628e-06,
512
+ "loss": 0.0166,
513
+ "reward": 0.46071431171149013,
514
+ "reward_std": 0.36014051400125024,
515
+ "rewards/accuracy_reward": 0.46071431171149013,
516
+ "rewards/format_reward": 0.0,
517
+ "step": 190
518
+ },
519
+ {
520
+ "completion_length": 626.9553817749023,
521
+ "epoch": 0.416,
522
+ "grad_norm": 17.148595809936523,
523
+ "kl": 0.4925048828125,
524
+ "learning_rate": 2.1744951038678905e-06,
525
+ "loss": 0.0197,
526
+ "reward": 0.4000000203028321,
527
+ "reward_std": 0.3870592150837183,
528
+ "rewards/accuracy_reward": 0.4000000203028321,
529
+ "rewards/format_reward": 0.0,
530
+ "step": 195
531
+ },
532
+ {
533
+ "completion_length": 525.5928825378418,
534
+ "epoch": 0.4266666666666667,
535
+ "grad_norm": 38.49786376953125,
536
+ "kl": 0.23444061279296874,
537
+ "learning_rate": 2.124047994661941e-06,
538
+ "loss": 0.0094,
539
+ "reward": 0.6821428872644901,
540
+ "reward_std": 0.3179911646991968,
541
+ "rewards/accuracy_reward": 0.6821428872644901,
542
+ "rewards/format_reward": 0.0,
543
+ "step": 200
544
+ },
545
+ {
546
+ "epoch": 0.4266666666666667,
547
+ "eval_completion_length": 528.541595147705,
548
+ "eval_kl": 0.42865986328125,
549
+ "eval_loss": 0.017097920179367065,
550
+ "eval_reward": 0.662742884466052,
551
+ "eval_reward_std": 0.2557599967300892,
552
+ "eval_rewards/accuracy_reward": 0.662742884466052,
553
+ "eval_rewards/format_reward": 0.0,
554
+ "eval_runtime": 28459.3155,
555
+ "eval_samples_per_second": 0.176,
556
+ "eval_steps_per_second": 0.013,
557
+ "step": 200
558
+ },
559
+ {
560
+ "completion_length": 509.92859649658203,
561
+ "epoch": 0.43733333333333335,
562
+ "grad_norm": 4.442358493804932,
563
+ "kl": 0.09366302490234375,
564
+ "learning_rate": 2.072732238761434e-06,
565
+ "loss": 0.0037,
566
+ "reward": 0.7660714574158192,
567
+ "reward_std": 0.21246148198843,
568
+ "rewards/accuracy_reward": 0.7660714574158192,
569
+ "rewards/format_reward": 0.0,
570
+ "step": 205
571
+ },
572
+ {
573
+ "completion_length": 497.5357364654541,
574
+ "epoch": 0.448,
575
+ "grad_norm": 0.5748523473739624,
576
+ "kl": 0.085107421875,
577
+ "learning_rate": 2.0206192653867536e-06,
578
+ "loss": 0.0034,
579
+ "reward": 0.796428595483303,
580
+ "reward_std": 0.18161089681088924,
581
+ "rewards/accuracy_reward": 0.796428595483303,
582
+ "rewards/format_reward": 0.0,
583
+ "step": 210
584
+ },
585
+ {
586
+ "completion_length": 614.8643173217773,
587
+ "epoch": 0.45866666666666667,
588
+ "grad_norm": 0.3233760893344879,
589
+ "kl": 0.15386199951171875,
590
+ "learning_rate": 1.967781613449095e-06,
591
+ "loss": 0.0062,
592
+ "reward": 0.6392857398837805,
593
+ "reward_std": 0.26852017305791376,
594
+ "rewards/accuracy_reward": 0.6392857398837805,
595
+ "rewards/format_reward": 0.0,
596
+ "step": 215
597
+ },
598
+ {
599
+ "completion_length": 575.8321681976319,
600
+ "epoch": 0.4693333333333333,
601
+ "grad_norm": 0.6134320497512817,
602
+ "kl": 0.16066131591796876,
603
+ "learning_rate": 1.9142928305795637e-06,
604
+ "loss": 0.0064,
605
+ "reward": 0.6446428839117289,
606
+ "reward_std": 0.3076914418488741,
607
+ "rewards/accuracy_reward": 0.6446428839117289,
608
+ "rewards/format_reward": 0.0,
609
+ "step": 220
610
+ },
611
+ {
612
+ "completion_length": 529.201806640625,
613
+ "epoch": 0.48,
614
+ "grad_norm": 14.935003280639648,
615
+ "kl": 0.27866592407226565,
616
+ "learning_rate": 1.8602273707541886e-06,
617
+ "loss": 0.0111,
618
+ "reward": 0.6982143182307482,
619
+ "reward_std": 0.29295355789363386,
620
+ "rewards/accuracy_reward": 0.6982143182307482,
621
+ "rewards/format_reward": 0.0,
622
+ "step": 225
623
+ },
624
+ {
625
+ "completion_length": 599.1643127441406,
626
+ "epoch": 0.49066666666666664,
627
+ "grad_norm": 29.34033203125,
628
+ "kl": 1.2954010009765624,
629
+ "learning_rate": 1.8056604906573418e-06,
630
+ "loss": 0.0518,
631
+ "reward": 0.605357170291245,
632
+ "reward_std": 0.30737774074077606,
633
+ "rewards/accuracy_reward": 0.605357170291245,
634
+ "rewards/format_reward": 0.0,
635
+ "step": 230
636
+ },
637
+ {
638
+ "completion_length": 577.0696716308594,
639
+ "epoch": 0.5013333333333333,
640
+ "grad_norm": 33.51802062988281,
641
+ "kl": 0.703057861328125,
642
+ "learning_rate": 1.7506681449278226e-06,
643
+ "loss": 0.0281,
644
+ "reward": 0.571428601257503,
645
+ "reward_std": 0.32992145605385303,
646
+ "rewards/accuracy_reward": 0.571428601257503,
647
+ "rewards/format_reward": 0.0,
648
+ "step": 235
649
+ },
650
+ {
651
+ "completion_length": 501.30717010498046,
652
+ "epoch": 0.512,
653
+ "grad_norm": 7.524632930755615,
654
+ "kl": 0.3678741455078125,
655
+ "learning_rate": 1.6953268804334257e-06,
656
+ "loss": 0.0147,
657
+ "reward": 0.589285738952458,
658
+ "reward_std": 0.28248333670198916,
659
+ "rewards/accuracy_reward": 0.589285738952458,
660
+ "rewards/format_reward": 0.0,
661
+ "step": 240
662
+ },
663
+ {
664
+ "completion_length": 462.3607376098633,
665
+ "epoch": 0.5226666666666666,
666
+ "grad_norm": 6.015294075012207,
667
+ "kl": 0.3237091064453125,
668
+ "learning_rate": 1.6397137297211436e-06,
669
+ "loss": 0.0129,
670
+ "reward": 0.5589285986497998,
671
+ "reward_std": 0.3269613076001406,
672
+ "rewards/accuracy_reward": 0.5589285986497998,
673
+ "rewards/format_reward": 0.0,
674
+ "step": 245
675
+ },
676
+ {
677
+ "completion_length": 638.7928871154785,
678
+ "epoch": 0.5333333333333333,
679
+ "grad_norm": 1.8317821025848389,
680
+ "kl": 0.807421875,
681
+ "learning_rate": 1.5839061037913395e-06,
682
+ "loss": 0.0323,
683
+ "reward": 0.3250000160187483,
684
+ "reward_std": 0.34638786166906355,
685
+ "rewards/accuracy_reward": 0.3250000160187483,
686
+ "rewards/format_reward": 0.0,
687
+ "step": 250
688
+ },
689
+ {
690
+ "completion_length": 636.7161018371582,
691
+ "epoch": 0.544,
692
+ "grad_norm": 0.6774631142616272,
693
+ "kl": 0.53349609375,
694
+ "learning_rate": 1.527981684345115e-06,
695
+ "loss": 0.0213,
696
+ "reward": 0.35714287366718056,
697
+ "reward_std": 0.3179732210934162,
698
+ "rewards/accuracy_reward": 0.35714287366718056,
699
+ "rewards/format_reward": 0.0,
700
+ "step": 255
701
+ },
702
+ {
703
+ "completion_length": 558.9875259399414,
704
+ "epoch": 0.5546666666666666,
705
+ "grad_norm": 2.738410472869873,
706
+ "kl": 0.205511474609375,
707
+ "learning_rate": 1.4720183156548855e-06,
708
+ "loss": 0.0082,
709
+ "reward": 0.6000000305473805,
710
+ "reward_std": 0.3251969013363123,
711
+ "rewards/accuracy_reward": 0.6000000305473805,
712
+ "rewards/format_reward": 0.0,
713
+ "step": 260
714
+ },
715
+ {
716
+ "completion_length": 545.3893096923828,
717
+ "epoch": 0.5653333333333334,
718
+ "grad_norm": 17.57588005065918,
719
+ "kl": 1.5068023681640625,
720
+ "learning_rate": 1.4160938962086612e-06,
721
+ "loss": 0.0603,
722
+ "reward": 0.6089285997673869,
723
+ "reward_std": 0.28422979824244976,
724
+ "rewards/accuracy_reward": 0.6089285997673869,
725
+ "rewards/format_reward": 0.0,
726
+ "step": 265
727
+ },
728
+ {
729
+ "completion_length": 526.2607391357421,
730
+ "epoch": 0.576,
731
+ "grad_norm": 38.554412841796875,
732
+ "kl": 1.0892745971679687,
733
+ "learning_rate": 1.3602862702788567e-06,
734
+ "loss": 0.0436,
735
+ "reward": 0.6267857417464257,
736
+ "reward_std": 0.32045440524816515,
737
+ "rewards/accuracy_reward": 0.6267857417464257,
738
+ "rewards/format_reward": 0.0,
739
+ "step": 270
740
+ },
741
+ {
742
+ "completion_length": 517.2678787231446,
743
+ "epoch": 0.5866666666666667,
744
+ "grad_norm": 4.718568325042725,
745
+ "kl": 1.1258697509765625,
746
+ "learning_rate": 1.3046731195665748e-06,
747
+ "loss": 0.045,
748
+ "reward": 0.6464285977184773,
749
+ "reward_std": 0.320261836796999,
750
+ "rewards/accuracy_reward": 0.6464285977184773,
751
+ "rewards/format_reward": 0.0,
752
+ "step": 275
753
+ },
754
+ {
755
+ "completion_length": 514.8375244140625,
756
+ "epoch": 0.5973333333333334,
757
+ "grad_norm": 4.270685195922852,
758
+ "kl": 1.4042236328125,
759
+ "learning_rate": 1.2493318550721775e-06,
760
+ "loss": 0.0562,
761
+ "reward": 0.6982143145054579,
762
+ "reward_std": 0.2838581532239914,
763
+ "rewards/accuracy_reward": 0.6982143145054579,
764
+ "rewards/format_reward": 0.0,
765
+ "step": 280
766
+ },
767
+ {
768
+ "completion_length": 534.7946655273438,
769
+ "epoch": 0.608,
770
+ "grad_norm": 9.498967170715332,
771
+ "kl": 0.8575332641601563,
772
+ "learning_rate": 1.1943395093426585e-06,
773
+ "loss": 0.0343,
774
+ "reward": 0.6910714596509934,
775
+ "reward_std": 0.2867689304053783,
776
+ "rewards/accuracy_reward": 0.6910714596509934,
777
+ "rewards/format_reward": 0.0,
778
+ "step": 285
779
+ },
780
+ {
781
+ "completion_length": 529.5875297546387,
782
+ "epoch": 0.6186666666666667,
783
+ "grad_norm": 8.131922721862793,
784
+ "kl": 0.6070526123046875,
785
+ "learning_rate": 1.1397726292458115e-06,
786
+ "loss": 0.0243,
787
+ "reward": 0.675000030733645,
788
+ "reward_std": 0.2807727467268705,
789
+ "rewards/accuracy_reward": 0.675000030733645,
790
+ "rewards/format_reward": 0.0,
791
+ "step": 290
792
+ },
793
+ {
794
+ "completion_length": 557.6500236511231,
795
+ "epoch": 0.6293333333333333,
796
+ "grad_norm": 2.224785089492798,
797
+ "kl": 1.1581512451171876,
798
+ "learning_rate": 1.085707169420437e-06,
799
+ "loss": 0.0463,
800
+ "reward": 0.6142857430502773,
801
+ "reward_std": 0.2811264578253031,
802
+ "rewards/accuracy_reward": 0.6142857430502773,
803
+ "rewards/format_reward": 0.0,
804
+ "step": 295
805
+ },
806
+ {
807
+ "completion_length": 529.6071685791015,
808
+ "epoch": 0.64,
809
+ "grad_norm": 8.0582857131958,
810
+ "kl": 1.2655914306640625,
811
+ "learning_rate": 1.0322183865509054e-06,
812
+ "loss": 0.0506,
813
+ "reward": 0.6750000296160579,
814
+ "reward_std": 0.28525091484189036,
815
+ "rewards/accuracy_reward": 0.6750000296160579,
816
+ "rewards/format_reward": 0.0,
817
+ "step": 300
818
+ },
819
+ {
820
+ "epoch": 0.64,
821
+ "eval_completion_length": 562.8381398986817,
822
+ "eval_kl": 0.95684150390625,
823
+ "eval_loss": 0.03832858428359032,
824
+ "eval_reward": 0.5700857400953769,
825
+ "eval_reward_std": 0.30191256090998647,
826
+ "eval_rewards/accuracy_reward": 0.5700857400953769,
827
+ "eval_rewards/format_reward": 0.0,
828
+ "eval_runtime": 30482.0361,
829
+ "eval_samples_per_second": 0.164,
830
+ "eval_steps_per_second": 0.012,
831
+ "step": 300
832
+ },
833
+ {
834
+ "completion_length": 557.5018104553222,
835
+ "epoch": 0.6506666666666666,
836
+ "grad_norm": 2.934140682220459,
837
+ "kl": 0.6237777709960938,
838
+ "learning_rate": 9.793807346132464e-07,
839
+ "loss": 0.025,
840
+ "reward": 0.6339285995811224,
841
+ "reward_std": 0.3295004416257143,
842
+ "rewards/accuracy_reward": 0.6339285995811224,
843
+ "rewards/format_reward": 0.0,
844
+ "step": 305
845
+ },
846
+ {
847
+ "completion_length": 507.2857383728027,
848
+ "epoch": 0.6613333333333333,
849
+ "grad_norm": 1.5333462953567505,
850
+ "kl": 0.18207855224609376,
851
+ "learning_rate": 9.272677612385667e-07,
852
+ "loss": 0.0073,
853
+ "reward": 0.6017857391387225,
854
+ "reward_std": 0.2927430454641581,
855
+ "rewards/accuracy_reward": 0.6017857391387225,
856
+ "rewards/format_reward": 0.0,
857
+ "step": 310
858
+ },
859
+ {
860
+ "completion_length": 535.4660949707031,
861
+ "epoch": 0.672,
862
+ "grad_norm": 0.8268552422523499,
863
+ "kl": 0.2217437744140625,
864
+ "learning_rate": 8.759520053380591e-07,
865
+ "loss": 0.0089,
866
+ "reward": 0.5303571715950965,
867
+ "reward_std": 0.30413119196891786,
868
+ "rewards/accuracy_reward": 0.5303571715950965,
869
+ "rewards/format_reward": 0.0,
870
+ "step": 315
871
+ },
872
+ {
873
+ "completion_length": 587.7893112182617,
874
+ "epoch": 0.6826666666666666,
875
+ "grad_norm": 0.8802245259284973,
876
+ "kl": 0.28748779296875,
877
+ "learning_rate": 8.255048961321088e-07,
878
+ "loss": 0.0115,
879
+ "reward": 0.5250000260770321,
880
+ "reward_std": 0.32775397710502147,
881
+ "rewards/accuracy_reward": 0.5250000260770321,
882
+ "rewards/format_reward": 0.0,
883
+ "step": 320
884
+ },
885
+ {
886
+ "completion_length": 568.7143135070801,
887
+ "epoch": 0.6933333333333334,
888
+ "grad_norm": 0.42495808005332947,
889
+ "kl": 0.20731658935546876,
890
+ "learning_rate": 7.759966537240373e-07,
891
+ "loss": 0.0083,
892
+ "reward": 0.6357143180444836,
893
+ "reward_std": 0.32221881337463854,
894
+ "rewards/accuracy_reward": 0.6357143180444836,
895
+ "rewards/format_reward": 0.0,
896
+ "step": 325
897
+ },
898
+ {
899
+ "completion_length": 572.9053848266601,
900
+ "epoch": 0.704,
901
+ "grad_norm": 0.8112100958824158,
902
+ "kl": 0.2259490966796875,
903
+ "learning_rate": 7.274961913568773e-07,
904
+ "loss": 0.009,
905
+ "reward": 0.6071428848430515,
906
+ "reward_std": 0.28955445289611814,
907
+ "rewards/accuracy_reward": 0.6071428848430515,
908
+ "rewards/format_reward": 0.0,
909
+ "step": 330
910
+ },
911
+ {
912
+ "completion_length": 585.521450805664,
913
+ "epoch": 0.7146666666666667,
914
+ "grad_norm": 2.5413522720336914,
915
+ "kl": 0.1725799560546875,
916
+ "learning_rate": 6.800710194892484e-07,
917
+ "loss": 0.0069,
918
+ "reward": 0.6946428865194321,
919
+ "reward_std": 0.27458811886608603,
920
+ "rewards/accuracy_reward": 0.6946428865194321,
921
+ "rewards/format_reward": 0.0,
922
+ "step": 335
923
+ },
924
+ {
925
+ "completion_length": 578.700025177002,
926
+ "epoch": 0.7253333333333334,
927
+ "grad_norm": 2.077190637588501,
928
+ "kl": 0.23962783813476562,
929
+ "learning_rate": 6.33787151823836e-07,
930
+ "loss": 0.0096,
931
+ "reward": 0.6517857488244772,
932
+ "reward_std": 0.29764223508536813,
933
+ "rewards/accuracy_reward": 0.6517857488244772,
934
+ "rewards/format_reward": 0.0,
935
+ "step": 340
936
+ },
937
+ {
938
+ "completion_length": 517.0357376098633,
939
+ "epoch": 0.736,
940
+ "grad_norm": 1.8005852699279785,
941
+ "kl": 0.1731414794921875,
942
+ "learning_rate": 5.887090134192947e-07,
943
+ "loss": 0.0069,
944
+ "reward": 0.7303571674972773,
945
+ "reward_std": 0.20812651440501212,
946
+ "rewards/accuracy_reward": 0.7303571674972773,
947
+ "rewards/format_reward": 0.0,
948
+ "step": 345
949
+ },
950
+ {
951
+ "completion_length": 537.6071670532226,
952
+ "epoch": 0.7466666666666667,
953
+ "grad_norm": 5.837470531463623,
954
+ "kl": 0.22538909912109376,
955
+ "learning_rate": 5.448993510134669e-07,
956
+ "loss": 0.009,
957
+ "reward": 0.7285714605823159,
958
+ "reward_std": 0.23610220104455948,
959
+ "rewards/accuracy_reward": 0.7285714605823159,
960
+ "rewards/format_reward": 0.0,
961
+ "step": 350
962
+ },
963
+ {
964
+ "completion_length": 511.03573455810545,
965
+ "epoch": 0.7573333333333333,
966
+ "grad_norm": 10.607796669006348,
967
+ "kl": 0.2825325012207031,
968
+ "learning_rate": 5.024191456827498e-07,
969
+ "loss": 0.0113,
970
+ "reward": 0.7357143178582192,
971
+ "reward_std": 0.24888310953974724,
972
+ "rewards/accuracy_reward": 0.7357143178582192,
973
+ "rewards/format_reward": 0.0,
974
+ "step": 355
975
+ },
976
+ {
977
+ "completion_length": 528.4375274658203,
978
+ "epoch": 0.768,
979
+ "grad_norm": 17.805084228515625,
980
+ "kl": 0.409503173828125,
981
+ "learning_rate": 4.6132752795918667e-07,
982
+ "loss": 0.0164,
983
+ "reward": 0.6857143182307481,
984
+ "reward_std": 0.28699738159775734,
985
+ "rewards/accuracy_reward": 0.6857143182307481,
986
+ "rewards/format_reward": 0.0,
987
+ "step": 360
988
+ },
989
+ {
990
+ "completion_length": 519.4303817749023,
991
+ "epoch": 0.7786666666666666,
992
+ "grad_norm": 7.785521507263184,
993
+ "kl": 0.34308319091796874,
994
+ "learning_rate": 4.2168169552342905e-07,
995
+ "loss": 0.0137,
996
+ "reward": 0.6928571717813611,
997
+ "reward_std": 0.29954983331263063,
998
+ "rewards/accuracy_reward": 0.6928571717813611,
999
+ "rewards/format_reward": 0.0,
1000
+ "step": 365
1001
+ },
1002
+ {
1003
+ "completion_length": 504.0107368469238,
1004
+ "epoch": 0.7893333333333333,
1005
+ "grad_norm": 4.727066516876221,
1006
+ "kl": 0.7092597961425782,
1007
+ "learning_rate": 3.8353683358814046e-07,
1008
+ "loss": 0.0285,
1009
+ "reward": 0.7000000275671482,
1010
+ "reward_std": 0.23943399637937546,
1011
+ "rewards/accuracy_reward": 0.7000000275671482,
1012
+ "rewards/format_reward": 0.0,
1013
+ "step": 370
1014
+ },
1015
+ {
1016
+ "completion_length": 517.7428833007813,
1017
+ "epoch": 0.8,
1018
+ "grad_norm": 7.284985542297363,
1019
+ "kl": 0.2571014404296875,
1020
+ "learning_rate": 3.469460380826697e-07,
1021
+ "loss": 0.0103,
1022
+ "reward": 0.673214310593903,
1023
+ "reward_std": 0.2639926388859749,
1024
+ "rewards/accuracy_reward": 0.673214310593903,
1025
+ "rewards/format_reward": 0.0,
1026
+ "step": 375
1027
+ },
1028
+ {
1029
+ "completion_length": 543.2714515686035,
1030
+ "epoch": 0.8106666666666666,
1031
+ "grad_norm": 14.041029930114746,
1032
+ "kl": 0.28242645263671873,
1033
+ "learning_rate": 3.119602417459075e-07,
1034
+ "loss": 0.0113,
1035
+ "reward": 0.6732143096625804,
1036
+ "reward_std": 0.25163274370133876,
1037
+ "rewards/accuracy_reward": 0.6732143096625804,
1038
+ "rewards/format_reward": 0.0,
1039
+ "step": 380
1040
+ },
1041
+ {
1042
+ "completion_length": 546.032169342041,
1043
+ "epoch": 0.8213333333333334,
1044
+ "grad_norm": 15.783819198608398,
1045
+ "kl": 0.20561065673828124,
1046
+ "learning_rate": 2.786281432302071e-07,
1047
+ "loss": 0.0082,
1048
+ "reward": 0.7339286010712385,
1049
+ "reward_std": 0.234727381169796,
1050
+ "rewards/accuracy_reward": 0.7339286010712385,
1051
+ "rewards/format_reward": 0.0,
1052
+ "step": 385
1053
+ },
1054
+ {
1055
+ "completion_length": 546.478592300415,
1056
+ "epoch": 0.832,
1057
+ "grad_norm": 6.679286003112793,
1058
+ "kl": 0.18747406005859374,
1059
+ "learning_rate": 2.46996139315057e-07,
1060
+ "loss": 0.0075,
1061
+ "reward": 0.7375000283122063,
1062
+ "reward_std": 0.2588605497032404,
1063
+ "rewards/accuracy_reward": 0.7375000283122063,
1064
+ "rewards/format_reward": 0.0,
1065
+ "step": 390
1066
+ },
1067
+ {
1068
+ "completion_length": 531.3750255584716,
1069
+ "epoch": 0.8426666666666667,
1070
+ "grad_norm": 4.5977044105529785,
1071
+ "kl": 0.1937103271484375,
1072
+ "learning_rate": 2.1710826032485286e-07,
1073
+ "loss": 0.0077,
1074
+ "reward": 0.7357143165543676,
1075
+ "reward_std": 0.2378486678004265,
1076
+ "rewards/accuracy_reward": 0.7357143165543676,
1077
+ "rewards/format_reward": 0.0,
1078
+ "step": 395
1079
+ },
1080
+ {
1081
+ "completion_length": 595.6964569091797,
1082
+ "epoch": 0.8533333333333334,
1083
+ "grad_norm": 4.166606903076172,
1084
+ "kl": 0.2564910888671875,
1085
+ "learning_rate": 1.8900610884066817e-07,
1086
+ "loss": 0.0103,
1087
+ "reward": 0.6285714594647288,
1088
+ "reward_std": 0.2736343163996935,
1089
+ "rewards/accuracy_reward": 0.6285714594647288,
1090
+ "rewards/format_reward": 0.0,
1091
+ "step": 400
1092
+ },
1093
+ {
1094
+ "epoch": 0.8533333333333334,
1095
+ "eval_completion_length": 569.8527975585938,
1096
+ "eval_kl": 0.20817554931640625,
1097
+ "eval_loss": 0.008296786807477474,
1098
+ "eval_reward": 0.6110285988628864,
1099
+ "eval_reward_std": 0.28674686477184297,
1100
+ "eval_rewards/accuracy_reward": 0.6110285988628864,
1101
+ "eval_rewards/format_reward": 0.0,
1102
+ "eval_runtime": 29873.0033,
1103
+ "eval_samples_per_second": 0.167,
1104
+ "eval_steps_per_second": 0.012,
1105
+ "step": 400
1106
+ },
1107
+ {
1108
+ "completion_length": 605.528596496582,
1109
+ "epoch": 0.864,
1110
+ "grad_norm": 49.79132843017578,
1111
+ "kl": 0.341680908203125,
1112
+ "learning_rate": 1.627288017913383e-07,
1113
+ "loss": 0.0137,
1114
+ "reward": 0.5571428839117288,
1115
+ "reward_std": 0.3719855587929487,
1116
+ "rewards/accuracy_reward": 0.5571428839117288,
1117
+ "rewards/format_reward": 0.0,
1118
+ "step": 405
1119
+ },
1120
+ {
1121
+ "completion_length": 637.5250267028808,
1122
+ "epoch": 0.8746666666666667,
1123
+ "grad_norm": 12.56246566772461,
1124
+ "kl": 0.39511566162109374,
1125
+ "learning_rate": 1.3831291600445573e-07,
1126
+ "loss": 0.0158,
1127
+ "reward": 0.5392857391387225,
1128
+ "reward_std": 0.3052775662392378,
1129
+ "rewards/accuracy_reward": 0.5392857391387225,
1130
+ "rewards/format_reward": 0.0,
1131
+ "step": 410
1132
+ },
1133
+ {
1134
+ "completion_length": 631.8607391357422,
1135
+ "epoch": 0.8853333333333333,
1136
+ "grad_norm": 6.395442962646484,
1137
+ "kl": 0.45283203125,
1138
+ "learning_rate": 1.1579243729307487e-07,
1139
+ "loss": 0.0181,
1140
+ "reward": 0.49107145331799984,
1141
+ "reward_std": 0.36503970213234427,
1142
+ "rewards/accuracy_reward": 0.49107145331799984,
1143
+ "rewards/format_reward": 0.0,
1144
+ "step": 415
1145
+ },
1146
+ {
1147
+ "completion_length": 611.7053848266602,
1148
+ "epoch": 0.896,
1149
+ "grad_norm": 1.6087334156036377,
1150
+ "kl": 0.3666015625,
1151
+ "learning_rate": 9.519871314899092e-08,
1152
+ "loss": 0.0147,
1153
+ "reward": 0.6000000327825546,
1154
+ "reward_std": 0.3550443138927221,
1155
+ "rewards/accuracy_reward": 0.6000000327825546,
1156
+ "rewards/format_reward": 0.0,
1157
+ "step": 420
1158
+ },
1159
+ {
1160
+ "completion_length": 613.4946708679199,
1161
+ "epoch": 0.9066666666666666,
1162
+ "grad_norm": 1.6784651279449463,
1163
+ "kl": 0.3438507080078125,
1164
+ "learning_rate": 7.656040910844358e-08,
1165
+ "loss": 0.0138,
1166
+ "reward": 0.5946428859606385,
1167
+ "reward_std": 0.3528768301010132,
1168
+ "rewards/accuracy_reward": 0.5946428859606385,
1169
+ "rewards/format_reward": 0.0,
1170
+ "step": 425
1171
+ },
1172
+ {
1173
+ "completion_length": 603.3143119812012,
1174
+ "epoch": 0.9173333333333333,
1175
+ "grad_norm": 4.043834209442139,
1176
+ "kl": 0.3444915771484375,
1177
+ "learning_rate": 5.990346885098235e-08,
1178
+ "loss": 0.0138,
1179
+ "reward": 0.5964285979047418,
1180
+ "reward_std": 0.3912374936044216,
1181
+ "rewards/accuracy_reward": 0.5964285979047418,
1182
+ "rewards/format_reward": 0.0,
1183
+ "step": 430
1184
+ },
1185
+ {
1186
+ "completion_length": 575.2339546203614,
1187
+ "epoch": 0.928,
1188
+ "grad_norm": 4.404404163360596,
1189
+ "kl": 0.30106658935546876,
1190
+ "learning_rate": 4.5251078087033493e-08,
1191
+ "loss": 0.012,
1192
+ "reward": 0.6696428902447223,
1193
+ "reward_std": 0.3243862982839346,
1194
+ "rewards/accuracy_reward": 0.6696428902447223,
1195
+ "rewards/format_reward": 0.0,
1196
+ "step": 435
1197
+ },
1198
+ {
1199
+ "completion_length": 582.6393119812012,
1200
+ "epoch": 0.9386666666666666,
1201
+ "grad_norm": 2.593308448791504,
1202
+ "kl": 0.383941650390625,
1203
+ "learning_rate": 3.262363228443427e-08,
1204
+ "loss": 0.0154,
1205
+ "reward": 0.6035714616999031,
1206
+ "reward_std": 0.29665699824690817,
1207
+ "rewards/accuracy_reward": 0.6035714616999031,
1208
+ "rewards/format_reward": 0.0,
1209
+ "step": 440
1210
+ },
1211
+ {
1212
+ "completion_length": 627.2125282287598,
1213
+ "epoch": 0.9493333333333334,
1214
+ "grad_norm": 5.177957057952881,
1215
+ "kl": 0.3490997314453125,
1216
+ "learning_rate": 2.2038708278862952e-08,
1217
+ "loss": 0.014,
1218
+ "reward": 0.5839285971596837,
1219
+ "reward_std": 0.30867667235434054,
1220
+ "rewards/accuracy_reward": 0.5839285971596837,
1221
+ "rewards/format_reward": 0.0,
1222
+ "step": 445
1223
+ },
1224
+ {
1225
+ "completion_length": 576.0964553833007,
1226
+ "epoch": 0.96,
1227
+ "grad_norm": 3.4815688133239746,
1228
+ "kl": 0.2956207275390625,
1229
+ "learning_rate": 1.3511039807673209e-08,
1230
+ "loss": 0.0118,
1231
+ "reward": 0.6589286010712385,
1232
+ "reward_std": 0.3096484154462814,
1233
+ "rewards/accuracy_reward": 0.6589286010712385,
1234
+ "rewards/format_reward": 0.0,
1235
+ "step": 450
1236
+ },
1237
+ {
1238
+ "completion_length": 531.355379486084,
1239
+ "epoch": 0.9706666666666667,
1240
+ "grad_norm": 2.63545298576355,
1241
+ "kl": 0.261767578125,
1242
+ "learning_rate": 7.0524970011963675e-09,
1243
+ "loss": 0.0105,
1244
+ "reward": 0.7125000340864063,
1245
+ "reward_std": 0.28212962336838243,
1246
+ "rewards/accuracy_reward": 0.7125000340864063,
1247
+ "rewards/format_reward": 0.0,
1248
+ "step": 455
1249
+ },
1250
+ {
1251
+ "completion_length": 569.5285957336425,
1252
+ "epoch": 0.9813333333333333,
1253
+ "grad_norm": 1.8794143199920654,
1254
+ "kl": 0.2968017578125,
1255
+ "learning_rate": 2.6720698600553595e-09,
1256
+ "loss": 0.0119,
1257
+ "reward": 0.6303571775555611,
1258
+ "reward_std": 0.3074988707900047,
1259
+ "rewards/accuracy_reward": 0.6303571775555611,
1260
+ "rewards/format_reward": 0.0,
1261
+ "step": 460
1262
+ },
1263
+ {
1264
+ "completion_length": 612.3214569091797,
1265
+ "epoch": 0.992,
1266
+ "grad_norm": 3.266709327697754,
1267
+ "kl": 0.39468994140625,
1268
+ "learning_rate": 3.7585574148779613e-10,
1269
+ "loss": 0.0158,
1270
+ "reward": 0.5517857383936644,
1271
+ "reward_std": 0.3217977944761515,
1272
+ "rewards/accuracy_reward": 0.5517857383936644,
1273
+ "rewards/format_reward": 0.0,
1274
+ "step": 465
1275
+ },
1276
+ {
1277
+ "completion_length": 626.7321701049805,
1278
+ "epoch": 0.9984,
1279
+ "kl": 0.32488250732421875,
1280
+ "reward": 0.604166692122817,
1281
+ "reward_std": 0.3102082473536332,
1282
+ "rewards/accuracy_reward": 0.604166692122817,
1283
+ "rewards/format_reward": 0.0,
1284
+ "step": 468,
1285
+ "total_flos": 0.0,
1286
+ "train_loss": 0.00286464851636153,
1287
+ "train_runtime": 7716.3929,
1288
+ "train_samples_per_second": 0.972,
1289
+ "train_steps_per_second": 0.061
1290
+ }
1291
+ ],
1292
+ "logging_steps": 5,
1293
+ "max_steps": 468,
1294
+ "num_input_tokens_seen": 0,
1295
+ "num_train_epochs": 1,
1296
+ "save_steps": 100,
1297
+ "stateful_callbacks": {
1298
+ "TrainerControl": {
1299
+ "args": {
1300
+ "should_epoch_stop": false,
1301
+ "should_evaluate": false,
1302
+ "should_log": false,
1303
+ "should_save": true,
1304
+ "should_training_stop": true
1305
+ },
1306
+ "attributes": {}
1307
+ }
1308
+ },
1309
+ "total_flos": 0.0,
1310
+ "train_batch_size": 2,
1311
+ "trial_name": null,
1312
+ "trial_params": null
1313
+ }
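
The `log_history` above is plain JSON, so the logged reward curve is easy to pull out for inspection or plotting. A small sketch, assuming only the Python standard library and that the file has been downloaded locally as `trainer_state.json`:

```python
# Extract the logged training reward per optimizer step from trainer_state.json.
import json

with open("trainer_state.json") as f:
    state = json.load(f)

# Training entries carry "reward"; evaluation entries carry "eval_reward" instead.
train_log = [entry for entry in state["log_history"] if "reward" in entry]
for entry in train_log:
    print(entry["step"], entry["reward"], entry.get("rewards/accuracy_reward"))
```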