initial commit
README.md
CHANGED
````diff
@@ -229,13 +229,13 @@ Despite attempts to mitigate this through increased batch sizes (up to 416) as s
 
 *Figure 3 - Mean response length and reward plummet despite the much larger batch size*
 
-We eventually resolved this issue by using a true off-policy PPO learning configuration. This is achieved by reducing the mini-batch size to at least 4× smaller than the global batch size, resulting in multiple clipped "mini-" updates as per the original PPO algorithm
+We eventually resolved this issue by using a true off-policy PPO learning configuration. This is achieved by making the mini-batch size at least 4× smaller than the global batch size, resulting in multiple clipped "mini-" updates as per the original [PPO algorithm](https://en.wikipedia.org/wiki/Proximal_policy_optimization). This approach has since stabilized response length and prevented reward collapse (Figure 4), allowing us to train for multiple epochs over the dataset.
 
 
 
 *Figure 4 - Mean response length and reward do not collapse during true off-policy learning*
 
-A detailed paper is in preparation that will describe our training stability solutions and review related work on policy optimization for reasoning models, including recent methods like DAPO
+A detailed paper is in preparation that will describe our training stability solutions and review related work on policy optimization for reasoning models, including recent methods like [DAPO](https://dapo-sia.github.io/), [OPO](https://verl.readthedocs.io/en/latest/algo/opo.html), [Dr.GRPO](https://github.com/sail-sg/understand-r1-zero), and [GSPO](https://qwenlm.github.io/blog/gspo/).
 
 
 ## Citation
@@ -252,14 +252,3 @@ CodeFu is developed by the **AWS WWSO Prototyping** Team. If you find CodeFu hel
 version={0.1}
 }
 ```
-
-## References
-[1] - Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. (https://arxiv.org/pdf/1707.06347.pdf)
-
-[2] - Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., ... & Wang, M. (2025). DAPO: An open-source llm reinforcement learning system at scale.
-
-[3] - Hao, Y., Dong, L., Wu, X., Huang, S., Chi, Z., & Wei, F. (2025). On-Policy RL with Optimal Reward Baseline.
-
-[4] - Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., ... & Lin, M. Understanding r1-zero-like training: A critical perspective.
-
-[5] - Zheng, C., Liu, S., Li, M., Chen, X. H., Yu, B., Gao, C., ... & Lin, J. (2025). Group Sequence Policy Optimization.
````
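
To make the off-policy PPO configuration introduced in the diff above more concrete, here is a minimal sketch of the schedule it describes: one global rollout batch split into several clipped mini-batch updates. The batch sizes, the toy `policy`, and the `rollout` dictionary layout are illustrative assumptions for this sketch, not the actual CodeFu trainer code.

```python
# Illustrative sketch only: hypothetical sizes and a toy PyTorch policy, not the CodeFu trainer.
import torch

global_batch_size = 416   # one rollout (experience) batch collected per iteration
mini_batch_size = 104     # at least 4x smaller than the global batch -> 4 clipped "mini-" updates
clip_eps = 0.2            # standard PPO clipping range

def ppo_minibatch_updates(policy, optimizer, rollout):
    """Run several clipped PPO "mini-" updates on a single rollout batch.

    After the first optimizer step the policy no longer matches the policy that
    generated the rollout, so the later mini-batches are optimized off-policy;
    the clipped importance ratio keeps those updates bounded.
    """
    for start in range(0, global_batch_size, mini_batch_size):
        idx = slice(start, start + mini_batch_size)
        new_logp = policy(rollout["obs"][idx])                # log-probs under the current policy
        ratio = torch.exp(new_logp - rollout["old_logp"][idx])
        adv = rollout["advantages"][idx]
        # Clipped surrogate objective from Schulman et al. (2017)
        loss = -torch.min(ratio * adv,
                          torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Toy usage: a stand-in policy mapping 8-dim "observations" to a scalar log-prob per sample.
policy = torch.nn.Sequential(torch.nn.Linear(8, 1), torch.nn.Flatten(0))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)
rollout = {
    "obs": torch.randn(global_batch_size, 8),
    "old_logp": torch.randn(global_batch_size),
    "advantages": torch.randn(global_batch_size),
}
ppo_minibatch_updates(policy, optimizer, rollout)
```

Only the first mini-batch step is strictly on-policy; the remaining steps reuse the same rollout data, which is the sense in which this configuration is "true off-policy" PPO.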