wassemgtk and ecw429 committed (verified)
Commit 7c128e4 · Parent: 1820231

Added explanation of how combined score is calculated (#2)

- Added explanation of how combined score is calculated (bb2230dfab313b857820ec3b0ac8081a17ccc9ce)


Co-authored-by: ellen woodcock <[email protected]>

Files changed (1): app.py (+5 −2)
app.py CHANGED
@@ -137,8 +137,11 @@ def create_leaderboard():
   <p>FailSafeQA returns three critical measures of model performance for finance, including a novel metric for model compliance: </p>
   <p><b>LLM Robustness: </b>Uses HELM’s definition to assess a model’s ability to provide a consistent and reliable answer despite perturbations of query and context</p>
   <p> <b>LLM Context Grounding: </b>Assesses a models ability to detect cases where the problem is unanswerable and refrain from producing potentially misleading hallucinations</p>
-  <p> <b>LLM Compliance Score:</b>A new metric that quantifies the tradeoff between Robustness and Context Grounding, inspired by the classic precision-recall trade-off. In other words, this compliance metric aims to evaluate a model’s tendency to hallucinate in the presence of missing or incomplete context.</p>
-  <p> These scores are combined to determine the top three winners in a leaderboard. </p>
+  <p> <b>LLM Compliance Score: </b>A new metric that quantifies the tradeoff between Robustness and Context Grounding, inspired by the classic precision-recall trade-off. In other words, this compliance metric aims to evaluate a model’s tendency to hallucinate in the presence of missing or incomplete context.</p>
+  <p> These scores are combined to determine the top three winners in a leaderboard. The combined score is the average of the following 2 calculations: </p>
+  <p>Robustness Avg = (Baseline + Robustness Delta) / 2 </p>
+  <p>Context Grounding Avg = (sum of Context Grounding columns) / 7 </p>
+  <p>Combined Score = (Robustness Avg + Context Grounding Avg) / 2 </p>
   """,
   elem_classes="markdown-text",
   )
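The combined-score arithmetic added in this commit can be sketched as a small helper. Note this is an illustrative sketch of the formulas in the diff text, not the actual code in app.py; the function name, parameter names, and example values below are assumptions.

```python
# Illustrative sketch of the combined-score calculation described in the
# commit text. Names and example values are placeholders, not taken from app.py.

def combined_score(baseline: float,
                   robustness_delta: float,
                   context_grounding_cols: list[float]) -> float:
    """Average of the Robustness average and the Context Grounding average."""
    # Robustness Avg = (Baseline + Robustness Delta) / 2
    robustness_avg = (baseline + robustness_delta) / 2
    # Context Grounding Avg = (sum of Context Grounding columns) / 7
    # The diff fixes the divisor at 7, i.e. seven Context Grounding columns.
    context_grounding_avg = sum(context_grounding_cols) / 7
    # Combined Score = (Robustness Avg + Context Grounding Avg) / 2
    return (robustness_avg + context_grounding_avg) / 2

# Example with made-up scores for one leaderboard row:
score = combined_score(0.80, 0.70, [0.7] * 7)
```

Models would then be ranked by this single number to pick the top three, per the explanation added in the diff.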