cchristophe committed
Commit 9e77e60 · verified · 1 Parent(s): 0da5ee3

Update src/about.py

Files changed (1):
src/about.py +3 -3
src/about.py CHANGED
@@ -67,9 +67,9 @@ Closed-ended question evaluation for LLMs provides insights into their medical k
 
 ### Open-ended Questions
 
-We use three open-source medical question datasets (MedicationQA, HealthSearchQA, and ExpertQA) to evaluate LLMs' clinical question-answering abilities across various medical topics. We adopt a "zero-shot" approach, presenting each individual question to the model with only a basic instruction to answer. This tests the models' inherent capabilities without relying on complex prompting.
-Inspired by the LMSys Chat Arena and their Elo Rating Leaderboard, this approach incorporates a pairwise comparison methodology. We present an LLM judge (Llama3.1 70b Instruct) with two responses to the same question, generated by different models. The judge then selects the superior response. Through numerous such comparisons, we establish a win-rate for each model. Then, we employ the Elo rating system to quantify the relative strengths of the models numerically.
-
+We evaluate LLMs' medical knowledge using three datasets: MedicationQA, HealthSearchQA, and ExpertQA. Each question is presented to the models without special prompting to test their baseline capabilities.
+To compare models, we use a tournament-style approach. A judge (Llama3.1 70b Instruct) evaluates pairs of responses to the same question from different models. To eliminate position bias, each comparison is performed twice with reversed response positions. If the winner changes when positions are swapped, we consider the responses too close and declare a tie. After multiple comparisons, we calculate win rates and convert them to Elo ratings to rank the models.
+It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
 """
 
 EVALUATION_QUEUE_TEXT = """
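The added text describes a pairwise judging scheme with a position-swap tie rule followed by an Elo conversion. As a rough illustration only (this is not the Space's actual code), a minimal Python sketch of that logic could look like the following; the `judge` callable, the K-factor of 32, and the base rating of 1000 are assumptions introduced for the example.

```python
# Minimal sketch (not the leaderboard's implementation) of the pairwise
# judging scheme described above: each pair of answers is judged twice with
# the response positions swapped, a flipped verdict becomes a tie, and the
# outcomes feed a standard Elo update.
from collections import defaultdict
from typing import Callable, Iterable, Tuple

K = 32                # Elo K-factor (assumed value, not from the source)
BASE_RATING = 1000.0  # assumed starting rating


def judged_outcome(judge: Callable[[str, str, str], str],
                   question: str, ans_a: str, ans_b: str) -> float:
    """Return the score for model A: 1.0 win, 0.0 loss, 0.5 tie.

    `judge` is a hypothetical callable that sees (question, first response,
    second response) and returns "A" if it prefers the first, "B" otherwise.
    """
    first = judge(question, ans_a, ans_b)   # model A shown first
    second = judge(question, ans_b, ans_a)  # positions reversed
    # Map the second verdict back onto model A / model B.
    second = "A" if second == "B" else "B"
    if first != second:
        # The judge changed its pick when positions were swapped -> tie.
        return 0.5
    return 1.0 if first == "A" else 0.0


def elo_ratings(comparisons: Iterable[Tuple[str, str, float]]) -> dict:
    """Compute Elo ratings from (model_a, model_b, score_for_a) triples."""
    ratings = defaultdict(lambda: BASE_RATING)
    for model_a, model_b, score_a in comparisons:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400))
        ratings[model_a] += K * (score_a - expected_a)
        ratings[model_b] += K * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)


if __name__ == "__main__":
    # Toy usage with a dummy judge that always prefers the longer answer.
    def dummy_judge(question: str, first: str, second: str) -> str:
        return "A" if len(first) >= len(second) else "B"

    score = judged_outcome(
        dummy_judge,
        "What is aspirin used for?",
        "Pain relief.",
        "Aspirin is commonly used for pain relief and fever.",
    )
    print(elo_ratings([("model_a", "model_b", score)]))
```

Judging each pair twice with the responses swapped and counting a flipped verdict as a tie is what cancels the judge's position bias before the win rates are converted into Elo ratings.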