initial commit
README.md
CHANGED
````diff
@@ -229,13 +229,13 @@ Despite attempts to mitigate this through increased batch sizes (up to 416) as s
 
 *Figure 3 - Mean response length and reward plummet despite the much larger batch size*
 
-We eventually resolved this issue by using a true off-policy PPO learning configuration. This is achieved by reducing the mini-batch size to at least 4× smaller than the global batch size, resulting in multiple clipped "mini-" updates as per the original PPO algorithm
+We eventually resolved this issue by using a true off-policy PPO learning configuration. This is achieved by making the mini-batch size at least 4× smaller than the global batch size, resulting in multiple clipped "mini-" updates as per the original [PPO algorithm](https://en.wikipedia.org/wiki/Proximal_policy_optimization). This approach has since stabilized response length and prevented reward collapse (Figure 4), allowing us to train for multiple epochs over the dataset.
 
 
 
 *Figure 4 - Mean response length and reward do not collapse during true off-policy learning*
 
-A detailed paper is in preparation that will describe our training stability solutions and review related work on policy optimization for reasoning models, including recent methods like DAPO
+A detailed paper is in preparation that will describe our training stability solutions and review related work on policy optimization for reasoning models, including recent methods like [DAPO](https://dapo-sia.github.io/), [OPO](https://verl.readthedocs.io/en/latest/algo/opo.html), [Dr.GRPO](https://github.com/sail-sg/understand-r1-zero), and [GSPO](https://qwenlm.github.io/blog/gspo/).
 
 
 ## Citation
@@ -252,14 +252,3 @@ CodeFu is developed by the **AWS WWSO Prototyping** Team. If you find CodeFu hel
 version={0.1}
 }
 ```
-
-## References
-[1] - Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. (https://arxiv.org/pdf/1707.06347.pdf)
-
-[2] - Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., ... & Wang, M. (2025). DAPO: An open-source llm reinforcement learning system at scale.
-
-[3] - Hao, Y., Dong, L., Wu, X., Huang, S., Chi, Z., & Wei, F. (2025). On-Policy RL with Optimal Reward Baseline.
-
-[4] - Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., ... & Lin, M. Understanding r1-zero-like training: A critical perspective.
-
-[5] - Zheng, C., Liu, S., Li, M., Chen, X. H., Yu, B., Gao, C., ... & Lin, J. (2025). Group Sequence Policy Optimization.
````
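
To make the off-policy PPO configuration introduced in the diff above more concrete, here is a minimal sketch of the schedule it describes: one global rollout batch split into several clipped mini-batch updates. The batch sizes, the toy `policy`, and the `rollout` dictionary layout are illustrative assumptions for this sketch, not the actual CodeFu trainer code.

```python
# Illustrative sketch only: hypothetical sizes and a toy PyTorch policy, not the CodeFu trainer.
import torch

global_batch_size = 416   # one rollout (experience) batch collected per iteration
mini_batch_size = 104     # at least 4x smaller than the global batch -> 4 clipped "mini-" updates
clip_eps = 0.2            # standard PPO clipping range

def ppo_minibatch_updates(policy, optimizer, rollout):
    """Run several clipped PPO "mini-" updates on a single rollout batch.

    After the first optimizer step the policy no longer matches the policy that
    generated the rollout, so the later mini-batches are optimized off-policy;
    the clipped importance ratio keeps those updates bounded.
    """
    for start in range(0, global_batch_size, mini_batch_size):
        idx = slice(start, start + mini_batch_size)
        new_logp = policy(rollout["obs"][idx])                # log-probs under the current policy
        ratio = torch.exp(new_logp - rollout["old_logp"][idx])
        adv = rollout["advantages"][idx]
        # Clipped surrogate objective from Schulman et al. (2017)
        loss = -torch.min(ratio * adv,
                          torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Toy usage: a stand-in policy mapping 8-dim "observations" to a scalar log-prob per sample.
policy = torch.nn.Sequential(torch.nn.Linear(8, 1), torch.nn.Flatten(0))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)
rollout = {
    "obs": torch.randn(global_batch_size, 8),
    "old_logp": torch.randn(global_batch_size),
    "advantages": torch.randn(global_batch_size),
}
ppo_minibatch_updates(policy, optimizer, rollout)
```

Only the first mini-batch step is strictly on-policy; the remaining steps reuse the same rollout data, which is the sense in which this configuration is "true off-policy" PPO.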