calcuis committed
Commit 0f18ddc · verified · 1 Parent(s): 6f83445

Update README.md

Files changed (1):
  1. README.md (+1 -1)
README.md CHANGED
@@ -57,4 +57,4 @@ at the high-level overview of the model quality on representative benchmarks. Fo
 | Factual Knowledge | SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | **39.4** |
 | Reasoning | DROP | 75.5 | 68.3 | 85.5 | 79.3 | **90.2** | 76.7 | 80.9 |
 
-\* these scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following. We use the simple-evals framework because it is reproducible, but Meta reports 77 for MATH and 88 for HumanEval on Llama-3.3-70B.
+\* these scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following.