chenwuml committed
Commit 55f3127
1 Parent(s): 39a7993

initial commit

Files changed (1)
  1. README.md +2 -13
README.md CHANGED
@@ -229,13 +229,13 @@ Despite attempts to mitigate this through increased batch sizes (up to 416) as s

 *Figure 3 - Mean response length and reward plummet despite the much larger batch size*

- We eventually resolved this issue by using a true off-policy PPO learning configuration. This is achieved by reducing the mini-batch size to at least 4× smaller than the global batch size, resulting in multiple clipped "mini-" updates as per the original PPO algorithm [1]. This approach has since stabilized response length and prevented reward collapse (Figure 4), allowing us to pass multiple epochs on the dataset.
+ We eventually resolved this issue by using a true off-policy PPO learning configuration. This is achieved by reducing the mini-batch size to at least 4× smaller than the global batch size, resulting in multiple clipped "mini-" updates as per the original [PPO algorithm](https://en.wikipedia.org/wiki/Proximal_policy_optimization). This approach has since stabilized response length and prevented reward collapse (Figure 4), allowing us to pass multiple epochs on the dataset.

 ![Mean Reward](resp_len_bsz_256_fix.png)

 *Figure 4 - Mean response length and reward do not collapse during true off-policy learning*

- A detailed paper is in preparation that will describe our training stability solutions and review related work on policy optimization for reasoning models, including recent methods like DAPO [2], OPO [3], Dr.GRPO [4], and GSPO [5].
+ A detailed paper is in preparation that will describe our training stability solutions and review related work on policy optimization for reasoning models, including recent methods like [DAPO](https://dapo-sia.github.io/), [OPO](https://verl.readthedocs.io/en/latest/algo/opo.html), [Dr.GRPO](https://github.com/sail-sg/understand-r1-zero), and [GSPO](https://qwenlm.github.io/blog/gspo/).


 ## Citation
@@ -252,14 +252,3 @@ CodeFu is developed by the **AWS WWSO Prototyping** Team. If you find CodeFu hel
 version={0.1}
 }
 ```
-
- ## References
- [1] - Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. (https://arxiv.org/pdf/1707.06347.pdf)
-
- [2] - Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., ... & Wang, M. (2025). DAPO: An open-source llm reinforcement learning system at scale.
-
- [3] - Hao, Y., Dong, L., Wu, X., Huang, S., Chi, Z., & Wei, F. (2025). On-Policy RL with Optimal Reward Baseline.
-
- [4] - Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., ... & Lin, M. Understanding r1-zero-like training: A critical perspective.
-
- [5] - Zheng, C., Liu, S., Li, M., Chen, X. H., Yu, B., Gao, C., ... & Lin, J. (2025). Group Sequence Policy Optimization.
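
The off-policy PPO configuration described in the changed paragraph above (mini-batch size at least 4× smaller than the global batch size, giving multiple clipped updates per rollout batch) can be illustrated with a short sketch. This is a minimal toy example, not the CodeFu trainer: the linear policy, the synthetic rollout tensors, and the batch sizes are placeholder assumptions.

```python
# Minimal sketch of "true off-policy PPO" mini-batch updates on one rollout
# batch. Toy categorical policy and synthetic data; NOT the CodeFu trainer.
import torch

GLOBAL_BATCH_SIZE = 256   # trajectories collected per rollout step (assumed)
MINI_BATCH_SIZE = 64      # at least 4x smaller than the global batch size
CLIP_EPS = 0.2            # standard PPO clipping range

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=CLIP_EPS):
    """Clipped PPO surrogate loss for a single mini-batch."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy policy: a linear layer over 16-dim observations producing 8 action logits.
policy = torch.nn.Linear(16, 8)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Synthetic rollout batch standing in for sampled trajectories.
obs = torch.randn(GLOBAL_BATCH_SIZE, 16)
actions = torch.randint(0, 8, (GLOBAL_BATCH_SIZE,))
advantages = torch.randn(GLOBAL_BATCH_SIZE)
with torch.no_grad():
    logp_old = torch.log_softmax(policy(obs), dim=-1)[
        torch.arange(GLOBAL_BATCH_SIZE), actions
    ]

# Multiple clipped "mini-" updates over one rollout batch: after the first
# optimizer.step(), the remaining mini-batches are optimized off-policy
# relative to the rollout policy, which is where the clipping matters.
perm = torch.randperm(GLOBAL_BATCH_SIZE)
for start in range(0, GLOBAL_BATCH_SIZE, MINI_BATCH_SIZE):
    idx = perm[start:start + MINI_BATCH_SIZE]
    logp_new = torch.log_softmax(policy(obs[idx]), dim=-1)[
        torch.arange(len(idx)), actions[idx]
    ]
    loss = ppo_clipped_loss(logp_new, logp_old[idx], advantages[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point of the split is that only the first mini-batch update is strictly on-policy; every subsequent step optimizes against `logp_old` computed under the pre-update policy, which is exactly the off-policy regime that PPO's clipping is designed to keep stable.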