TIGER-Lab/AceCodeRM-7B

@@ -33,7 +33,7 @@ We introduce AceCoder, the first work to propose a fully automated pipeline for
 | Model                                | Code | Chat  | Math  | Safety | Easy  | Normal | Hard | Avg  |
 | ------------------------------------ | ---- | ----- | ----- | ------ | ----- | ------ | ---- | ---- |
-| Skywork/Skywork-Reward-Llama-3.1-8B  | 54.5 | 69.5  | 60.6  | 95.7   | 89    | 74.7   | 46.6 | 70.1 |
 | LxzGordon/URM-LLaMa-3.1-8B           | 54.1 | 71.2  | 61.8  | 93.1   | 84    | 73.2   | 53   | 70   |
 | NVIDIA/Nemotron-340B-Reward          | 59.4 | 71.2  | 59.8  | 87.5   | 81    | 71.4   | 56.1 | 69.5 |
 | NCSOFT/Llama-3-OffsetBias-RM-8B      | 53.2 | 71.3  | 61.9  | 89.6   | 84.6  | 72.2   | 50.2 | 69   |
@@ -42,14 +42,16 @@ We introduce AceCoder, the first work to propose a fully automated pipeline for
 | Ray2333/GRM-llama3-8B-distill        | 56.9 | 62.4  | 62.1  | 88.1   | 82.2  | 71.5   | 48.4 | 67.4 |
 | Ray2333/GRM-Llama3-8B-rewardmodel-ft | 52.1 | 66.8  | 58.8  | 91.4   | 86.2  | 70.6   | 45.1 | 67.3 |
 | LxzGordon/URM-LLLaMa-3-8B            | 52.3 | 68.5  | 57.6  | 90.3   | 80.2  | 69.9   | 51.5 | 67.2 |
-| internlm/internlm2-7b-reward         | 49.7 | 61.7  | 71.4  | 85.5   | 85.4  | 70.7   | 45.1 | 67.1 |
-| Skywork-Reward-Llama-3.1-8B-v0.2     | 53.4 | 69.2  | 62.1  | 96     | 88.5  | 74     | 47.9 | 70.1 |
-| Skywork-Reward-Gemma-2-27B-v0.2      | 45.8 | 49.4  | 50.7  | 48.2   | 50.3  | 48.2   | 47   | 48.5 |
 | AceCoder-RM-7B                       | 66.9 | 66.7  | 65.3  | 89.9   | 79.9  | 74.4   | 62.2 | 72.2 |
-| AceCoder-RM-32B                      | 72.1 | 73.7  | 70.5  | 88     | 84.5  | 78.3   | 65.5 | 76.1 |
 | Delta (AceCoder 7B - Others)         | 7.5  | \-4.6 | \-6.1 | \-6.1  | \-9.1 | \-0.3  | 6.1  | 2.1  |
 | Delta (AceCoder 32B - Others)        | 12.7 | 2.4   | \-0.9 | \-8    | \-4.5 | 3.6    | 9.4  | 6    |
 ## Performance on Best-of-N sampling

 | Model                                | Code | Chat  | Math  | Safety | Easy  | Normal | Hard | Avg  |
 | ------------------------------------ | ---- | ----- | ----- | ------ | ----- | ------ | ---- | ---- |
+| Skywork/Skywork-Reward-Llama-3.1-8B  | 54.5 | 69.5  | 60.6  | 95.7   | **89**    | 74.7   | 46.6 | 70.1 |
 | LxzGordon/URM-LLaMa-3.1-8B           | 54.1 | 71.2  | 61.8  | 93.1   | 84    | 73.2   | 53   | 70   |
 | NVIDIA/Nemotron-340B-Reward          | 59.4 | 71.2  | 59.8  | 87.5   | 81    | 71.4   | 56.1 | 69.5 |
 | NCSOFT/Llama-3-OffsetBias-RM-8B      | 53.2 | 71.3  | 61.9  | 89.6   | 84.6  | 72.2   | 50.2 | 69   |
 | Ray2333/GRM-llama3-8B-distill        | 56.9 | 62.4  | 62.1  | 88.1   | 82.2  | 71.5   | 48.4 | 67.4 |
 | Ray2333/GRM-Llama3-8B-rewardmodel-ft | 52.1 | 66.8  | 58.8  | 91.4   | 86.2  | 70.6   | 45.1 | 67.3 |
 | LxzGordon/URM-LLLaMa-3-8B            | 52.3 | 68.5  | 57.6  | 90.3   | 80.2  | 69.9   | 51.5 | 67.2 |
+| internlm/internlm2-7b-reward*         | 49.7 | 61.7  | **71.4**  | 85.5   | 85.4  | 70.7   | 45.1 | 67.1 |
+| Skywork-Reward-Llama-3.1-8B-v0.2*     | 53.4 | 69.2  | 62.1  | **96**     | 88.5  | 74     | 47.9 | 70.1 |
+| Skywork-Reward-Gemma-2-27B-v0.2*      | 45.8 | 49.4  | 50.7  | 48.2   | 50.3  | 48.2   | 47   | 48.5 |
 | AceCoder-RM-7B                       | 66.9 | 66.7  | 65.3  | 89.9   | 79.9  | 74.4   | 62.2 | 72.2 |
+| AceCoder-RM-32B                      | **72.1** | **73.7**  | 70.5  | 88     | 84.5  | **78.3**   | **65.5** | **76.1** |
 | Delta (AceCoder 7B - Others)         | 7.5  | \-4.6 | \-6.1 | \-6.1  | \-9.1 | \-0.3  | 6.1  | 2.1  |
 | Delta (AceCoder 32B - Others)        | 12.7 | 2.4   | \-0.9 | \-8    | \-4.5 | 3.6    | 9.4  | 6    |
+* These models have no official results because they were released after the RM Bench paper, so we extended the original code base to evaluate them ourselves. Our implementation can be found here:
+[Modified Reward Bench / RM Bench Code](https://github.com/wyettzeng/reward-bench)
 ## Performance on Best-of-N sampling