tathagataraha committed on
Commit
b5701cc
·
2 Parent(s): 34c150d a2d8d52

Merge branch 'main' of https://huggingface.co/spaces/m42-health/MEDIC-Benchmark

Files changed (1)
  1. src/about.py +11 -0
src/about.py CHANGED
@@ -63,6 +63,17 @@ LLM_BENCHMARKS_TEXT_1 = f"""
 
 The MEDIC Leaderboard provides a comprehensive evaluation of clinical language models. It offers a standardized platform for evaluating and comparing the performance of various language models across 5 dimensions: Medical reasoning, Ethical and bias concerns, Data and language understanding, In-context learning, and Clinical safety and risk assessment. This structure acknowledges the diverse facets of clinical competence and the varied requirements of healthcare applications. By addressing these critical dimensions, MEDIC aims to bridge the gap between benchmark performance and real-world clinical utility, offering a more robust prediction of an LLM's potential effectiveness and safety in actual healthcare settings.
 
+ ## Evaluation Tasks and Metrics
+
+ ### Closed-ended Questions
+
+ Closed-ended question evaluation provides insight into an LLM's breadth and accuracy of medical knowledge. With this approach, we aim to quantify a model's comprehension of medical concepts across various specialties, ranging from basic to advanced professional levels. The following datasets serve as standardized benchmarks: MedQA, MedMCQA, MMLU, MMLU Pro, PubMedQA, USMLE, and ToxiGen. We use Eleuther AI's Evaluation Harness framework, which scores the likelihood of a model generating each proposed answer rather than directly evaluating the generated text itself. We modified the framework's codebase to provide more detailed and relevant results: rather than calculating the probability of generating only the answer choice labels (e.g., a., b., c., or d.), we calculate the probability of generating the full answer text.
+
+ ### Open-ended Questions
+
+ We use three open-source medical question datasets (MedicationQA, HealthSearchQA, and ExpertQA) to evaluate LLMs' clinical question-answering abilities across various medical topics. We adopt a "zero-shot" approach, presenting each individual question to the model with only a basic instruction to answer. This tests the models' inherent capabilities without relying on complex prompting.
+ Inspired by the LMSys Chat Arena and its Elo rating leaderboard, this approach uses a pairwise comparison methodology. We present an LLM judge (Llama3.1 70b Instruct) with two responses to the same question, generated by different models, and the judge selects the superior response. Through many such comparisons we establish a win-rate for each model, and we then employ the Elo rating system to quantify the models' relative strengths numerically.
+
  """
 
  EVALUATION_QUEUE_TEXT = """
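The closed-ended evaluation added above scores each multiple-choice option by the likelihood of the model generating the full answer text rather than just the choice label. The snippet below is a minimal sketch of that idea using a Hugging Face causal LM; the model name, prompt format, and example question are placeholders, not the harness's actual configuration.

```python
# Sketch: likelihood-based multiple-choice scoring over full answer texts.
# Model name, prompt format, and the example question are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the leaderboard evaluates clinical LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def answer_loglikelihood(prompt: str, answer: str) -> float:
    """Sum of log-probabilities the model assigns to the full answer text,
    conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # logits[:, :-1] predicts the next token at each position, so the answer
    # tokens are predicted starting at position len(prompt_ids) - 1.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    answer_log_probs = log_probs[:, prompt_ids.shape[1] - 1 :, :]
    token_ll = answer_log_probs.gather(2, answer_ids.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()


prompt = "Question: Deficiency of which vitamin causes scurvy?\nAnswer: "
choices = ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"]
scores = {c: answer_loglikelihood(prompt, c) for c in choices}
print(max(scores, key=scores.get))  # choice whose full text is most likely
```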
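For the open-ended arena described in the diff, pairwise verdicts from the LLM judge are aggregated into win-rates and Elo ratings. The sketch below shows only that bookkeeping step; the K-factor, starting rating, and hard-coded verdicts are assumptions for illustration, not the leaderboard's actual parameters.

```python
# Sketch: turning pairwise judge verdicts into win-rates and Elo ratings.
from collections import defaultdict

K = 32            # Elo update step size (assumption)
INITIAL = 1000.0  # starting rating for every model (assumption)


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def apply_verdict(ratings: dict, model_a: str, model_b: str, a_wins: bool) -> None:
    """Update both models' ratings with one pairwise verdict from the judge."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    s_a = 1.0 if a_wins else 0.0
    ratings[model_a] += K * (s_a - e_a)
    ratings[model_b] += K * ((1.0 - s_a) - (1.0 - e_a))


# Hypothetical verdicts: (model_a, model_b, did_a_win).
verdicts = [
    ("model_x", "model_y", True),
    ("model_y", "model_z", True),
    ("model_x", "model_z", True),
    ("model_y", "model_x", False),
]

ratings = defaultdict(lambda: INITIAL)
wins = defaultdict(int)
games = defaultdict(int)

for a, b, a_won in verdicts:
    apply_verdict(ratings, a, b, a_won)
    games[a] += 1
    games[b] += 1
    wins[a if a_won else b] += 1

for m in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{m}: elo={ratings[m]:.0f}  win-rate={wins[m] / games[m]:.2f}")
```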