factrbench / verifact_data.csv
metric,model,FactBench,Reddit,Overall
F1,GPT4o,80.92,27.45,64.35
F1,Claude 3.5-Sonnet,75.68,26.31,59.67
F1,Gemini 1.5-Flash,77.38,28.68,61.63
F1,Mistral-7B,62.30,21.71,48.63
F1,Mistral-24B,70.84,28.36,56.46
F1,Mistral-123B,75.20,27.33,59.49
F1,Llama3.1-8B,60.48,20.70,46.89
F1,Llama3.1-70B,64.80,23.90,51.59
F1,Llama3.1-405B,73.23,23.15,57.54
F1,Qwen2.5-8B,66.25,22.86,52.39
F1,Qwen2.5-32B,72.25,27.88,57.52
F1,Qwen2.5-72B,73.09,26.95,57.82
Recall,GPT4o,77.13,16.54,52.42
Recall,Claude 3.5-Sonnet,69.35,15.94,47.57
Recall,Gemini 1.5-Flash,70.71,17.50,49.01
Recall,Mistral-7B,51.96,12.82,36.00
Recall,Mistral-24B,61.48,17.46,43.53
Recall,Mistral-123B,67.28,16.57,46.60
Recall,Llama3.1-8B,54.28,12.73,37.33
Recall,Llama3.1-70B,58.00,14.42,40.23
Recall,Llama3.1-405B,68.40,13.75,46.11
Recall,Qwen2.5-8B,58.66,13.53,40.25
Recall,Qwen2.5-32B,62.77,16.91,44.07
Recall,Qwen2.5-72B,64.12,16.29,44.61
Precision,GPT4o,85.11,80.66,83.30
Precision,Claude 3.5-Sonnet,83.28,75.35,80.05
Precision,Gemini 1.5-Flash,85.45,79.48,83.02
Precision,Mistral-7B,77.79,70.72,74.91
Precision,Mistral-24B,83.61,75.51,80.31
Precision,Mistral-123B,85.24,77.88,82.24
Precision,Llama3.1-8B,68.27,55.40,63.02
Precision,Llama3.1-70B,73.40,69.72,71.90
Precision,Llama3.1-405B,78.80,73.19,76.51
Precision,Qwen2.5-8B,76.09,73.64,75.09
Precision,Qwen2.5-32B,85.11,79.44,82.80
Precision,Qwen2.5-72B,84.97,78.02,82.14
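
The first column gives the metric (F1, Recall, Precision), and each F1 value is consistent with the harmonic mean of the Precision and Recall reported for the same model and dataset (e.g., GPT4o on FactBench: 2 x 85.11 x 77.13 / (85.11 + 77.13) ≈ 80.92). A minimal sketch of that consistency check, assuming the table is saved locally as verifact_data.csv and pandas is available:

import pandas as pd

# Load the table; the score columns (FactBench, Reddit, Overall) become a
# (dataset, metric) column hierarchy after pivoting on the metric column.
df = pd.read_csv("verifact_data.csv")
wide = df.pivot(index="model", columns="metric")

for dataset in ["FactBench", "Reddit", "Overall"]:
    p = wide[(dataset, "Precision")]
    r = wide[(dataset, "Recall")]
    f1 = wide[(dataset, "F1")]
    # Gaps of up to ~0.02 are expected because every score is rounded to two decimals.
    gap = (f1 - 2 * p * r / (p + r)).abs().max()
    print(f"{dataset}: max |F1 - harmonic_mean(P, R)| = {gap:.3f}")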