Added explanation of how combined score is calculated (#2)
- Added explanation of how combined score is calculated (bb2230dfab313b857820ec3b0ac8081a17ccc9ce)
Co-authored-by: ellen woodcock <[email protected]>
app.py
CHANGED
@@ -137,8 +137,11 @@ def create_leaderboard():
 <p>FailSafeQA returns three critical measures of model performance for finance, including a novel metric for model compliance: </p>
 <p><b>LLM Robustness: </b>Uses HELM’s definition to assess a model’s ability to provide a consistent and reliable answer despite perturbations of query and context</p>
 <p> <b>LLM Context Grounding: </b>Assesses a models ability to detect cases where the problem is unanswerable and refrain from producing potentially misleading hallucinations</p>
-<p> <b>LLM Compliance Score
-<p> These scores are combined to determine the top three winners in a leaderboard. </p>
+<p> <b>LLM Compliance Score: </b>A new metric that quantifies the tradeoff between Robustness and Context Grounding, inspired by the classic precision-recall trade-off. In other words, this compliance metric aims to evaluate a model’s tendency to hallucinate in the presence of missing or incomplete context.</p>
+<p> These scores are combined to determine the top three winners in a leaderboard. The combined score is the average of the following 2 calculations: </p>
+<p>Robustness Avg = (Baseline + Robustness Delta) / 2 </p>
+<p>Context Grounding Avg = (sum of Context Grounding columns) / 7 </p>
+<p>Combined Score = (Robustness Avg + Context Grounding Avg) / 2 </p>
 """,
 elem_classes="markdown-text",
 )
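For reviewers, the three formulas added in this change can be sketched in Python as follows. This is only an illustration of the arithmetic described in the new text, not the leaderboard's actual implementation: the function name `combined_score`, its parameter names, and the constant `NUM_CONTEXT_GROUNDING_COLUMNS` are hypothetical, and the sketch assumes all scores are on the same numeric scale.

```python
from typing import Sequence

# Assumption: the description divides by 7, so exactly seven
# Context Grounding columns are expected per model.
NUM_CONTEXT_GROUNDING_COLUMNS = 7


def combined_score(
    baseline: float,
    robustness_delta: float,
    context_grounding: Sequence[float],
) -> float:
    """Combine the scores as described in the leaderboard text (hypothetical helper)."""
    if len(context_grounding) != NUM_CONTEXT_GROUNDING_COLUMNS:
        raise ValueError(
            f"expected {NUM_CONTEXT_GROUNDING_COLUMNS} Context Grounding scores, "
            f"got {len(context_grounding)}"
        )

    # Robustness Avg = (Baseline + Robustness Delta) / 2
    robustness_avg = (baseline + robustness_delta) / 2

    # Context Grounding Avg = (sum of Context Grounding columns) / 7
    context_grounding_avg = sum(context_grounding) / NUM_CONTEXT_GROUNDING_COLUMNS

    # Combined Score = (Robustness Avg + Context Grounding Avg) / 2
    return (robustness_avg + context_grounding_avg) / 2


if __name__ == "__main__":
    # Made-up example values, chosen only to keep the arithmetic easy to follow.
    print(combined_score(1.0, 0.5, [0.5] * 7))  # 0.625
```

With these made-up inputs, Robustness Avg is 0.75 and Context Grounding Avg is 0.5, so the combined score is 0.625. Note that the two averages are weighted equally, so each individual Context Grounding column contributes less to the final score than Baseline or Robustness Delta does.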