wassemgtk and ecw429 committed (verified)
Commit 7c128e4 · Parent: 1820231

Added explanation of how combined score is calculated (#2)

- Added explanation of how combined score is calculated (bb2230dfab313b857820ec3b0ac8081a17ccc9ce)


Co-authored-by: ellen woodcock <[email protected]>

Files changed (1): app.py (+5 −2)
app.py CHANGED
@@ -137,8 +137,11 @@ def create_leaderboard():
   <p>FailSafeQA returns three critical measures of model performance for finance, including a novel metric for model compliance: </p>
   <p><b>LLM Robustness: </b>Uses HELM’s definition to assess a model’s ability to provide a consistent and reliable answer despite perturbations of query and context</p>
   <p> <b>LLM Context Grounding: </b>Assesses a models ability to detect cases where the problem is unanswerable and refrain from producing potentially misleading hallucinations</p>
-  <p> <b>LLM Compliance Score:</b>A new metric that quantifies the tradeoff between Robustness and Context Grounding, inspired by the classic precision-recall trade-off. In other words, this compliance metric aims to evaluate a model’s tendency to hallucinate in the presence of missing or incomplete context.</p>
-  <p> These scores are combined to determine the top three winners in a leaderboard. </p>
+  <p> <b>LLM Compliance Score: </b>A new metric that quantifies the tradeoff between Robustness and Context Grounding, inspired by the classic precision-recall trade-off. In other words, this compliance metric aims to evaluate a model’s tendency to hallucinate in the presence of missing or incomplete context.</p>
+  <p> These scores are combined to determine the top three winners in a leaderboard. The combined score is the average of the following 2 calculations: </p>
+  <p>Robustness Avg = (Baseline + Robustness Delta) / 2 </p>
+  <p>Context Grounding Avg = (sum of Context Grounding columns) / 7 </p>
+  <p>Combined Score = (Robustness Avg + Context Grounding Avg) / 2 </p>
   """,
   elem_classes="markdown-text",
   )
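The combined-score arithmetic added in this commit can be sketched as a small helper. Note this is an illustrative sketch of the formulas in the diff text, not the actual code in app.py; the function name, parameter names, and example values below are assumptions.

```python
# Illustrative sketch of the combined-score calculation described in the
# commit text. Names and example values are placeholders, not taken from app.py.

def combined_score(baseline: float,
                   robustness_delta: float,
                   context_grounding_cols: list[float]) -> float:
    """Average of the Robustness average and the Context Grounding average."""
    # Robustness Avg = (Baseline + Robustness Delta) / 2
    robustness_avg = (baseline + robustness_delta) / 2
    # Context Grounding Avg = (sum of Context Grounding columns) / 7
    # The diff fixes the divisor at 7, i.e. seven Context Grounding columns.
    context_grounding_avg = sum(context_grounding_cols) / 7
    # Combined Score = (Robustness Avg + Context Grounding Avg) / 2
    return (robustness_avg + context_grounding_avg) / 2

# Example with made-up scores for one leaderboard row:
score = combined_score(0.80, 0.70, [0.7] * 7)
```

Models would then be ranked by this single number to pick the top three, per the explanation added in the diff.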