hxiang Jerry0723 committed on
Commit 53ea372 · verified · 1 Parent(s): 521215e

feat: update 4 models and release a test set (#6)

- feat: update 4 models and release a test set (582d3c2139f1a4582f5a2f2b7cdb103f5a0faa31)


Co-authored-by: Ren Jiaxi <[email protected]>

app.py CHANGED
@@ -46,7 +46,7 @@ _BIBTEX = """
  }
  """
 
- _LAST_UPDATED = "December 28, 2024"
+ _LAST_UPDATED = "April 13, 2025"
 
  banner_url = "./assets/logo.png"
  _BANNER = f'<div style="display: flex; justify-content: space-around;"><img src="{banner_url}" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 600px;"> </div>' # noqa
assets/text.py CHANGED
@@ -11,7 +11,7 @@ on the content safety of LLMs for Chinese (Mandarin).
  To align with the regulations for Chinese Internet content moderation, our ChineseSafe contains 205,034 examples
  across 4 classes and 10 sub-classes of safety issues. For Chinese contexts, we add several special types of illegal content: political sensitivity, pornography,
  and variant/homophonic words. In particular, the benchmark is constructed as a balanced dataset, containing safe and unsafe data collected from internet resources and public datasets [1,2,3].
- We hope the evaluation can provides a guideline for developers and researchers to facilitate the safety of LLMs. <br>
+ We hope the evaluation can provides a guideline for developers and researchers to facilitate the safety of LLMs. A publicly accessible test set comprising 20,000 examples is released at <a href="https://huggingface.co/datasets/SUSTech/ChineseSafe" target="_blank">ChineseSafe</a>.<br>
 
  The leadboard is under construction and maintained by <a href="https://hongxin001.github.io/" target="_blank">Hongxin Wei's</a> research group at SUSTech.
  Comments, issues, contributions, and collaborations are all welcomed!
changelog.md CHANGED
@@ -41,4 +41,15 @@ version: v1.0.4
  - Phi-3-small-8k-instruct
  - Phi-3-small-128k-instruct
  - Phi-3-medium-4k-instruct
- - Phi-3-medium-128k-instruct
+ - Phi-3-medium-128k-instruct
+
+ ### 2025-4-13
+ version: v1.0.5
+
+ changed:
+ - [1]feat: update 4 models due to the February's todo-list:
+ - phi-4
+ - DeepSeek-R1-Distill-Llama-70B
+ - Mistral-Small-24B-Instruct-2501
+ - Moonlight-16B-A3B-Instruct
+ - [2]feat: release a test set of 20000 samples
data/chinese_benchmark_gen.csv CHANGED
@@ -3,14 +3,17 @@ DeepSeek-LLM-67B-Chat >65B 76.76/0.35 73.40/0.37 84.26/0.40 81.34/0.35 69.19/0.6
  Llama3-ChatQA-1.5-70B >65B 65.29/0.29 66.24/0.50 62.92/0.12 64.43/0.19 67.69/0.63
  Qwen2.5-72B-Instruct >65B 63.41/0.77 66.00/0.95 56.00/0.62 61.49/0.65 70.90/0.96
  Qwen1.5-72B-Chat >65B 62.91/0.50 73.86/0.84 40.46/0.97 58.75/0.35 85.55/0.62
- Qwen2-72B-Instruct >65B 54.08/0.20 58.10/0.60 30.72/0.45 52.63/0.05 77.65/0.36
  Opt-66B >65B 54.46/0.17 53.22/0.06 76.94/0.24 57.73/0.49 31.77/0.28
+ Qwen2-72B-Instruct >65B 54.08/0.20 58.10/0.60 30.72/0.45 52.63/0.05 77.65/0.36
+ DeepSeek-R1-Distill-Llama-70B >65B 52.93/0.18 59.69/0.47 19.33/0.38 51.62/0.16 86.83/0.18
  Llama-3.1-70B-Instruct >65B 52.84/0.38 59.07/1.22 19.82/0.85 51.57/0.24 86.14/0.58
  Llama-3.3-70B-Instruct >65B 50.87/0.07 54.51/0.86 13.19/0.10 50.37/0.06 88.89/0.39
  Qwen2.5-32B-Instruct ~30B 69.64/0.39 92.13/0.45 43.24/0.83 62.70/0.25 96.27/0.20
  QwQ-32B-Preview ~30B 69.55/0.28 75.97/0.48 57.60/0.27 65.61/0.17 81.62/0.33
+ Mistral-Small-24B-Instruct-2501 ~30B 64.48/0.17 64.61/0.35 64.71/0.72 64.34/0.00 64.23/1.04
  Yi-1.5-34B-Chat ~30B 60.06/0.43 58.14/0.40 72.51/0.55 63.27/0.56 47.56/0.42
  Opt-30B ~30B 50.88/0.11 50.76/0.12 72.95/0.16 51.18/0.26 28.62/0.28
+ phi-4 10B~20B 72.24/0.24 76.59/0.46 64.42/0.51 69.06/0.15 80.13/0.62
  InternLM2-Chat-20B 10B~20B 70.21/0.55 73.30/0.70 63.79/0.43 67.82/0.45 76.65/0.67
  Qwen1.5-14B-Chat 10B~20B 68.25/0.44 65.87/0.37 76.02/0.72 71.51/0.59 60.44/0.20
  Phi-3-medium-128k-instruct 10B~20B 64.30/0.06 63.89/0.13 66.53/0.52 64.76/0.26 62.05/0.42
@@ -19,6 +22,7 @@ Mistral-Nemo-Instruct-2407 10B~20B 59.71/0.45 61.79/0.52 51.82/0.48 58.20/0.44 6
  Phi-3-medium-4k-instruct 10B~20B 57.79/0.45 58.69/0.37 53.88/0.62 57.02/0.55 61.74/0.55
  Ziya2-13B-Chat 10B~20B 53.40/0.43 53.33/0.38 56.18/0.41 53.48/0.53 50.62/0.61
  Opt-13B 10B~20B 50.18/0.26 50.29/0.20 69.97/0.37 49.94/0.47 30.22/0.31
+ Moonlight-16B-A3B-Instruct 10B~20B 45.16/0.43 44.16/0.64 34.79/0.67 45.82/0.33 55.62/0.35
  Phi-3-small-8k-instruct 5B~10B 72.73/0.47 73.67/0.63 71.12/0.49 71.85/0.35 74.36/0.59
  Gemma-1.1-7B-it 5B~10B 71.70/0.26 68.66/0.37 80.11/0.05 76.00/0.09 63.26/0.47
  DeepSeek-LLM-7B-Chat 5B~10B 71.63/0.17 69.50/0.15 77.33/0.67 74.33/0.41 65.90/0.38
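Each score cell in `chinese_benchmark_gen.csv` has the form `a/b`. The following is a minimal sketch for splitting one row into typed fields. It assumes the columns are whitespace-separated, as they appear in this diff, and that the second number is a spread (for example a standard deviation) around the first. The header row is not part of the diff, so the names `Score`, `value`, `spread`, and `parse_row` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Score(NamedTuple):
    value: float   # number before the slash
    spread: float  # number after the slash (assumed to be a spread)


def parse_row(line: str) -> Tuple[str, str, List[Score]]:
    """Split a leaderboard row into model name, size label, and scores."""
    # Assumption: fields are separated by whitespace and contain no spaces.
    model, size, *cells = line.split()
    scores = [Score(*(float(part) for part in cell.split("/"))) for cell in cells]
    return model, size, scores


# Row copied from the phi-4 line added in this commit.
row = "phi-4 10B~20B 72.24/0.24 76.59/0.46 64.42/0.51 69.06/0.15 80.13/0.62"
model, size, scores = parse_row(row)
print(model, size, scores[0])
```

Keeping the two numbers in a named tuple lets later code sort or filter on `value` without re-parsing the string.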
data/chinese_benchmark_per.csv CHANGED
@@ -4,6 +4,7 @@ Qwen1.5-72B-Chat >65B 63.67/0.46 58.27/0.32 96.84/0.13 90.51/0.57 30.34/0.80
  Qwen2.5-72B-Instruct >65B 63.27/0.52 66.00/0.60 55.09/0.82 61.31/0.46 71.49/0.25
  Qwen2-72B-Instruct >65B 60.70/0.49 57.90/0.42 79.03/0.63 66.75/0.77 42.28/0.43
  Opt-66B >65B 59.93/0.41 56.52/0.37 86.87/0.59 71.36/0.78 32.86/0.74
+ DeepSeek-R1-Distill-Llama-70B >65B 47.68/0.64 45.77/1.21 23.85/0.67 48.35/0.46 71.62/0.60
  Llama-3.1-70B-Instruct >65B 43.68/0.41 36.45/0.84 16.66/0.34 45.83/0.30 70.82/0.48
  Llama3-ChatQA-1.5-70B >65B 40.41/0.29 33.86/0.75 19.84/0.75 43.13/0.25 61.08/0.37
  Llama-3.3-70B-Instruct >65B 36.84/0.82 32.02/1.29 23.19/1.13 39.58/0.63 50.55/0.69
@@ -15,10 +16,13 @@ Phi-3-medium-4k-instruct 10B~20B 71.04/0.31 69.74/0.29 74.56/0.97 72.54/0.59 67.
  Baichuan2-13B-Chat 10B~20B 70.43/0.39 65.81/0.38 85.34/0.63 79.02/0.63 55.46/0.47
  Phi-3-medium-128k-instruct 10B~20B 68.87/0.81 68.08/0.51 71.32/1.44 69.75/1.17 66.41/0.57
  Mistral-Nemo-Instruct-2407 10B~20B 66.88/0.46 62.56/0.28 84.42/0.90 75.89/1.13 49.26/0.24
+ phi-4 10B~20B 62.62/0.32 63.73/0.41 58.98/0.20 61.66/0.31 66.28/0.78
  Qwen1.5-14B-Chat 10B~20B 61.29/0.40 57.02/0.32 92.43/0.55 79.80/1.05 30.02/0.47
+ Mistral-Small-24B-Instruct-2501 10B~20B 59.20/0.46 58.32/0.42 65.16/1.08 60.33/0.56 53.22/0.20
  Ziya2-13B-Chat 10B~20B 55.25/0.26 59.24/0.37 34.30/0.11 53.61/0.26 76.29/0.39
  InternLM2-Chat-20B 10B~20B 53.67/0.16 79.00/0.66 10.30/0.60 51.90/0.11 97.25/0.26
  Opt-13B 10B~20B 49.31/0.31 37.77/3.57 1.76/0.16 49.59/0.23 97.08/0.29
+ Moonlight-16B-A3B-Instruct 10B~20B 48.92/0.16 3.46/0.57 0.07/0.01 49.40/0.15 98.00/0.08
  Gemma-1.1-7B-it 5B~10B 64.32/0.68 59.98/0.58 86.60/0.35 75.70/0.80 41.95/0.93
  Qwen1.5-7B-Chat 5B~10B 62.48/0.54 59.06/0.48 81.92/0.50 70.28/0.65 42.96/0.81
  Phi-3-small-128k-instruct 5B~10B 61.76/0.27 60.47/0.16 68.45/0.61 63.46/0.50 55.05/0.61
data/subclass_gen.csv CHANGED
@@ -7,10 +7,12 @@ Opt-66B,>65B,0.4866,0.482,0.682,0.5174,0.5203,0.7258,0.5579,0.5338,0.8237,0.5646
  Llama3-ChatQA-1.5-70B,>65B,0.6682,0.6617,0.6566,0.6859,0.6932,0.6922,0.6079,0.6187,0.5348,0.6548,0.7024,0.6342,0.6861,0.6945,0.6928,0.7029,0.6853,0.7281,0.6211,0.6242,0.5599,0.6105,0.6189,0.5397,0.7134,0.6873,0.7493,0.59,0.6072,0.4996
  Llama-3.1-70B-Instruct,>65B,0.4845,0.3825,0.0896,0.5771,0.6976,0.3045,0.4546,0.2021,0.0359,0.6067,0.7722,0.3926,0.5946,0.7225,0.3403,0.5904,0.6813,0.3067,0.4817,0.3639,0.0828,0.4760,0.3471,0.0759,0.5340,0.5584,0.1851,0.4837,0.4207,0.1019
  Llama-3.3-70B-Instruct,>65B,0.5045,0.4639,0.0849,0.5211,0.6327,0.1537,0.4943,0.4221,0.0718,0.5173,0.7089,0.1918,0.5728,0.7424,0.2569,0.5775,0.7071,0.2347,0.4964,0.4060,0.0668,0.4960,0.4244,0.0712,0.5183,0.5179,0.1065,0.4820,0.3636,0.0544
+ DeepSeek-R1-Distill-Llama-70B,>65B,0.5416,0.5902,0.2095,0.5495,0.6557,0.2531,0.4770,0.3724,0.0843,0.6293,0.7886,0.4361,0.5619,0.6773,0.2789,0.5560,0.6236,0.2398,0.4694,0.2909,0.0598,0.4773,0.3611,0.0813,0.5191,0.5141,0.1569,0.4642,0.3155,0.0650
  Yi-1.5-34B-Chat,~30B,0.66,0.6114,0.8339,0.7311,0.6644,0.9577,0.3309,0.2379,0.1626,0.6958,0.6708,0.8646,0.7046,0.6528,0.9053,0.7084,0.6383,0.9309,0.5928,0.5672,0.6961,0.4467,0.4308,0.3972,0.6956,0.6281,0.9097,0.5182,0.515,0.5425
  Qwen2.5-32B-Instruct,~30B,0.6204,0.8741,0.2629,0.9049,0.9606,0.8489,0.5103,0.5470,0.0453,0.8192,0.9583,0.6983,0.8514,0.9560,0.7445,0.7823,0.9396,0.5931,0.5869,0.8351,0.1922,0.5244,0.6511,0.0699,0.8334,0.9475,0.6950,0.5157,0.6401,0.0644
  Opt-30B,~30B,0.4672,0.4683,0.6648,0.5002,0.5082,0.7109,0.5044,0.4987,0.7354,0.5314,0.5517,0.7422,0.5108,0.5163,0.7304,0.5161,0.5039,0.7618,0.513,0.5009,0.7578,0.4956,0.4908,0.719,0.5119,0.4977,0.7583,0.4958,0.4955,0.7134
  QwQ-32B-Preview,~30B,0.6837,0.7403,0.5470,0.8120,0.8219,0.8084,0.6060,0.6749,0.3914,0.7516,0.8198,0.6977,0.8121,0.8230,0.8081,0.8470,0.8208,0.8801,0.6113,0.6736,0.3973,0.6050,0.6700,0.3873,0.7492,0.7768,0.6783,0.4656,0.3791,0.1124
+ Mistral-Small-24B-Instruct-2501,10B~20B,0.6626,0.6491,0.6746,0.7897,0.7347,0.9223,0.3990,0.2824,0.1406,0.7649,0.7465,0.8603,0.7828,0.7326,0.9081,0.8088,0.7280,0.9732,0.6010,0.6001,0.5490,0.4367,0.3723,0.2159,0.7369,0.6906,0.8282,0.4868,0.4773,0.3217
  Baichuan2-13B-Chat,10B~20B,0.6337,0.6402,0.5755,0.7188,0.7164,0.7457,0.5185,0.5189,0.3417,0.7341,0.7487,0.7703,0.7033,0.7091,0.7143,0.6742,0.6712,0.6575,0.5657,0.5728,0.434,0.6151,0.6264,0.5371,0.6515,0.65,0.6089,0.5532,0.5707,0.414
  Qwen1.5-14B-Chat,10B~20B,0.7099,0.6657,0.8141,0.7897,0.7205,0.9615,0.5669,0.5657,0.5226,0.7776,0.7373,0.9181,0.7571,0.7073,0.897,0.7862,0.7044,0.97,0.6421,0.6225,0.6757,0.5014,0.4893,0.3888,0.7563,0.6869,0.9116,0.5499,0.5538,0.4889
  Ziya2-13B-Chat,10B~20B,0.5403,0.5272,0.5731,0.6597,0.6313,0.8034,0.3259,0.2145,0.1373,0.673,0.6631,0.8101,0.6526,0.6282,0.7886,0.5583,0.5437,0.6097,0.3987,0.3541,0.2823,0.529,0.5194,0.5497,0.5377,0.5208,0.5678,0.4567,0.4484,0.4035
@@ -19,6 +21,8 @@ Opt-13B,10B~20B,0.4746,0.4724,0.637,0.5147,0.519,0.7014,0.5146,0.5059,0.7153,0.5
  Mistral-Nemo-Instruct-2407,10B~20B,0.6375,0.6363,0.6018,0.6971,0.6973,0.7214,0.4741,0.4456,0.2722,0.6349,0.6873,0.6041,0.7122,0.7067,0.7508,0.7259,0.6960,0.7825,0.5252,0.5197,0.3718,0.4695,0.4343,0.2607,0.6126,0.6117,0.5492,0.4474,0.4009,0.2212
  Phi-3-medium-4k-instruct,10B~20B,0.5533,0.5494,0.4889,0.5385,0.5594,0.4653,0.6034,0.6005,0.5922,0.5418,0.5993,0.4803,0.5866,0.6054,0.5590,0.5815,0.5780,0.5475,0.6178,0.6070,0.6217,0.6437,0.6287,0.6742,0.6028,0.5912,0.5893,0.5057,0.5054,0.3950
  Phi-3-medium-128k-instruct,10B~20B,0.6379,0.6234,0.6581,0.6379,0.6437,0.6554,0.6504,0.6361,0.6823,0.5919,0.6413,0.5687,0.6431,0.6483,0.6654,0.6568,0.6374,0.6958,0.6632,0.6403,0.7087,0.6819,0.6546,0.7465,0.6796,0.6480,0.7433,0.5897,0.5935,0.5592
+ phi-4,10B~20B,0.7431,0.7737,0.6700,0.7139,0.7762,0.6194,0.7081,0.7576,0.6003,0.6957,0.7921,0.5974,0.7625,0.8010,0.7146,0.8283,0.8125,0.8440,0.7130,0.7564,0.6083,0.6627,0.7239,0.5074,0.8171,0.8052,0.8213,0.6456,0.7165,0.4768
+ Moonlight-16B-A3B-Instruct,10B~20B,0.4432,0.4087,0.3134,0.6335,0.6278,0.6971,0.3356,0.1806,0.0982,0.4713,0.5191,0.3914,0.5555,0.5699,0.5449,0.5349,0.5261,0.5011,0.4096,0.3505,0.2448,0.4197,0.3738,0.2672,0.4127,0.3514,0.2496,0.3428,0.2125,0.1175
  Gemma-1.1-7B-it,5B~10B,0.7849,0.7205,0.9139,0.8081,0.7454,0.9485,0.6024,0.6084,0.5413,0.7854,0.758,0.8894,0.8017,0.7436,0.9353,0.8215,0.7367,0.9884,0.6669,0.6543,0.673,0.5811,0.5858,0.4976,0.7831,0.7167,0.9127,0.6684,0.6638,0.6754
  Qwen1.5-7B-Chat,5B~10B,0.6885,0.6347,0.8535,0.7677,0.6891,0.9938,0.6929,0.6404,0.8588,0.7791,0.7151,0.9869,0.7653,0.6889,0.988,0.7485,0.6659,0.9746,0.684,0.6317,0.8443,0.7267,0.6564,0.929,0.7473,0.662,0.9772,0.5545,0.5496,0.5778
  Yi-1.5-9B-Chat,5B~10B,0.7025,0.6913,0.7058,0.7032,0.7106,0.707,0.4533,0.3925,0.2,0.6546,0.7097,0.6172,0.7209,0.7213,0.7419,0.8197,0.7508,0.9452,0.5595,0.5666,0.4131,0.4342,0.3378,0.1591,0.7626,0.7215,0.8306,0.4057,0.2654,0.1096
data/subclass_per.csv CHANGED
@@ -7,10 +7,12 @@ Opt-66B,>65B,0.645,0.5831,0.9572,0.3981,0.417,0.4471,0.6667,0.5971,0.9953,0.6232
  Llama3-ChatQA-1.5-70B,>65B,0.3666,0.2082,0.1069,0.339,0.169,0.0752,0.3147,0.0148,0.0059,0.2947,0.075,0.0261,0.7758,0.7167,0.9293,0.5528,0.5482,0.4877,0.3396,0.111,0.0507,0.3207,0.0374,0.0156,0.4392,0.3806,0.2524,0.3214,0.0614,0.0253
  Llama-3.1-70B-Instruct,>65B,0.4670,0.4105,0.2107,0.3766,0.1681,0.0560,0.3856,0.1439,0.0505,0.3460,0.1387,0.0392,0.4036,0.2873,0.1107,0.3872,0.1394,0.0487,0.4967,0.4715,0.2711,0.4070,0.2331,0.0910,0.4985,0.4691,0.2716,0.6337,0.6553,0.5548
  Llama-3.3-70B-Instruct,>65B,0.3996,0.3526,0.2759,0.2923,0.1430,0.0771,0.3029,0.1420,0.0825,0.2624,0.1066,0.0486,0.3657,0.3253,0.2213,0.3305,0.2121,0.1358,0.4583,0.4388,0.3966,0.3156,0.1750,0.1062,0.4510,0.4249,0.3802,0.5813,0.5696,0.6459
+ DeepSeek-R1-Distill-Llama-70B,>65B,0.4240,0.2914,0.1265,0.6148,0.6530,0.5255,0.3608,0.0107,0.0033,0.5182,0.5945,0.3588,0.5583,0.5989,0.4156,0.4922,0.4667,0.2664,0.4312,0.3134,0.1401,0.3727,0.0743,0.0243,0.4061,0.2132,0.0844,0.5370,0.5522,0.3638
  Yi-1.5-34B-Chat,~30B,0.7139,0.8341,0.5176,0.7722,0.8735,0.6482,0.475,0.2581,0.0357,0.7162,0.8717,0.5603,0.6206,0.7912,0.353,0.8816,0.8938,0.8601,0.6412,0.7813,0.3672,0.497,0.4306,0.0769,0.8472,0.8832,0.7889,0.4818,0.3646,0.0576
  Qwen2.5-32B-Instruct,~30B,0.6749,0.6366,0.7789,0.7893,0.7099,0.9938,0.4372,0.4025,0.2943,0.7921,0.7323,0.9739,0.7723,0.7036,0.9599,0.7702,0.6873,0.9727,0.5920,0.5774,0.6092,0.4358,0.3969,0.2906,0.7404,0.6695,0.9160,0.4640,0.4506,0.3514
  Opt-30B,~30B,0.5831,0.5754,0.5565,0.3952,0.338,0.1915,0.6784,0.6507,0.7506,0.5798,0.6281,0.5559,0.357,0.2405,0.1185,0.406,0.3224,0.1945,0.6203,0.6061,0.633,0.6188,0.6076,0.6293,0.6031,0.5886,0.5976,0.6244,0.6184,0.6415
  QwQ-32B-Preview,~30B,0.5231,0.5061,0.9839,0.5519,0.5328,1.0000,0.4141,0.4443,0.7537,0.5814,0.5650,0.9989,0.5529,0.5340,0.9993,0.5318,0.5111,0.9993,0.5083,0.4978,0.9542,0.4392,0.4593,0.8080,0.5238,0.5042,0.9922,0.5269,0.5128,0.9743
+ Mistral-Small-24B-Instruct-2501,~30B,0.5897,0.5714,0.6393,0.7706,0.6931,0.9888,0.3109,0.1339,0.0727,0.7308,0.6984,0.8887,0.7454,0.6830,0.9385,0.7584,0.6732,0.9835,0.5850,0.5671,0.6297,0.3646,0.2744,0.1803,0.7088,0.6450,0.8855,0.3839,0.3257,0.2233
  Baichuan2-13B-Chat,10B~20B,0.7346,0.6715,0.8932,0.7703,0.7043,0.9491,0.6303,0.6129,0.6785,0.7435,0.7152,0.8777,0.779,0.7088,0.9649,0.7677,0.6883,0.9601,0.6763,0.6388,0.7738,0.6359,0.6149,0.6904,0.7096,0.6554,0.8436,0.7306,0.6762,0.8788
  Qwen1.5-14B-Chat,10B~20B,0.625,0.5683,0.964,0.6549,0.5977,0.9932,0.5983,0.5571,0.9038,0.6561,0.6193,0.9535,0.6592,0.6005,0.9994,0.6382,0.5759,0.9897,0.5579,0.53,0.8275,0.5009,0.4938,0.7077,0.6256,0.566,0.9705,0.6063,0.5643,0.914
  Ziya2-13B-Chat,10B~20B,0.6322,0.6632,0.502,0.381,0.0822,0.0212,0.4263,0.2557,0.086,0.4352,0.4474,0.1651,0.612,0.6721,0.4744,0.812,0.7741,0.8691,0.4904,0.4516,0.2102,0.5309,0.5403,0.2964,0.7186,0.7235,0.6777,0.4811,0.4512,0.2021
@@ -19,6 +21,8 @@ Opt-13B,10B~20B,0.5011,0.0392,0.0015,0.4792,0.0695,0.0018,0.4958,0,0,0.4492,0.23
  Mistral-Nemo-Instruct-2407,10B~20B,0.6992,0.6359,0.8960,0.7518,0.6773,0.9826,0.6421,0.6067,0.7767,0.7290,0.6896,0.9121,0.7377,0.6719,0.9542,0.7482,0.6611,0.9959,0.6396,0.6014,0.7754,0.6045,0.5803,0.7019,0.7246,0.6464,0.9529,0.4910,0.4881,0.4717
  Phi-3-medium-4k-instruct,10B~20B,0.8162,0.7447,0.9484,0.3950,0.2748,0.1126,0.8368,0.7558,0.9878,0.5763,0.6486,0.4809,0.6431,0.6695,0.5981,0.8403,0.7549,0.9973,0.8092,0.7414,0.9343,0.8263,0.7504,0.9679,0.8352,0.7499,0.9896,0.6361,0.6499,0.5818
  Phi-3-medium-128k-instruct,10B~20B,0.8024,0.7318,0.9391,0.3592,0.1596,0.0598,0.8232,0.7434,0.9790,0.5228,0.5910,0.3977,0.5699,0.6022,0.4725,0.8293,0.7436,0.9939,0.7813,0.7222,0.8963,0.8009,0.7328,0.9351,0.8260,0.7393,0.9898,0.6525,0.6565,0.6327
+ phi-4,10B~20B,0.6193,0.6166,0.5816,0.4118,0.3517,0.1792,0.7011,0.6785,0.7484,0.7224,0.7291,0.7791,0.6152,0.6372,0.5775,0.7375,0.6960,0.8232,0.5775,0.5779,0.4961,0.6685,0.6560,0.6821,0.7074,0.6752,0.7638,0.4629,0.4356,0.2692
+ Moonlight-16B-A3B-Instruct,10B~20B,0.5041,0.0556,0.0006,0.4814,0.0000,0.0000,0.4992,0.0000,0.0000,0.4500,0.1369,0.0016,0.4804,0.0256,0.0007,0.5027,0.0000,0.0000,0.5054,0.0893,0.0020,0.5020,0.0972,0.0014,0.5080,0.0256,0.0007,0.4947,0.0000,0.0000
  Gemma-1.1-7B-it,5B~10B,0.6885,0.6193,0.9389,0.7201,0.6502,0.9795,0.6709,0.6133,0.8985,0.7171,0.6709,0.9421,0.5993,0.5861,0.7426,0.7164,0.634,0.9953,0.6316,0.5872,0.8235,0.5207,0.5098,0.595,0.6874,0.616,0.9415,0.6164,0.5853,0.7856
  Qwen1.5-7B-Chat,5B~10B,0.6415,0.5933,0.8439,0.7295,0.6542,0.9987,0.5495,0.5352,0.6535,0.7415,0.6808,0.9875,0.7286,0.6545,0.9955,0.7167,0.6339,0.9966,0.6122,0.5749,0.784,0.4866,0.4788,0.5265,0.6887,0.6165,0.9449,0.4276,0.4219,0.4072
  Yi-1.5-9B-Chat,5B~10B,0.7089,0.8612,0.4825,0.5418,0.7129,0.1741,0.4846,0.2932,0.0308,0.5376,0.7743,0.2115,0.6185,0.8236,0.3254,0.818,0.9011,0.7057,0.5819,0.7416,0.2207,0.4893,0.3279,0.0365,0.7959,0.8937,0.6572,0.477,0.2414,0.0233
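
The rows added for Mistral-Small-24B-Instruct-2501 carry different size labels across the four data files: `~30B` in `chinese_benchmark_gen.csv` and `subclass_per.csv`, but `10B~20B` in `chinese_benchmark_per.csv` and `subclass_gen.csv`. The following is a minimal sketch of a check that would flag such mismatches. It assumes the first two fields of every row are the model name and the size label. The helper names are hypothetical, and the sample rows are truncated after the first score for brevity.

```python
from collections import defaultdict


def size_labels(rows_by_file):
    """Map model -> {file name: size label} from raw data rows."""
    labels = defaultdict(dict)
    for file_name, rows in rows_by_file.items():
        for row in rows:
            # subclass_*.csv rows use commas; the benchmark rows are assumed
            # to be whitespace-separated, as they appear in the diff.
            fields = row.split(",") if "," in row else row.split()
            model, size = fields[0], fields[1]
            labels[model][file_name] = size
    return labels


def mismatched_sizes(labels):
    """Keep only models whose size label differs between files."""
    return {
        model: per_file
        for model, per_file in labels.items()
        if len(set(per_file.values())) > 1
    }


# Rows taken from this commit, truncated after the first score column.
rows_by_file = {
    "chinese_benchmark_gen.csv": [
        "Mistral-Small-24B-Instruct-2501 ~30B 64.48/0.17",
        "phi-4 10B~20B 72.24/0.24",
    ],
    "chinese_benchmark_per.csv": [
        "Mistral-Small-24B-Instruct-2501 10B~20B 59.20/0.46",
        "phi-4 10B~20B 62.62/0.32",
    ],
    "subclass_gen.csv": [
        "Mistral-Small-24B-Instruct-2501,10B~20B,0.6626",
        "phi-4,10B~20B,0.7431",
    ],
    "subclass_per.csv": [
        "Mistral-Small-24B-Instruct-2501,~30B,0.5897",
        "phi-4,10B~20B,0.6193",
    ],
}

mismatched = mismatched_sizes(size_labels(rows_by_file))
for model, per_file in mismatched.items():
    print(model, per_file)
```

A check like this could run before a data update is merged, so that a model lands in the same size group in every leaderboard view.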