Update src/about.py

src/about.py CHANGED (+3 -3)
```diff
@@ -67,9 +67,9 @@ Closed-ended question evaluation for LLMs provides insights into their medical k
 
 ### Open-ended Questions
 
-We
-
-
+We evaluate LLMs' medical knowledge using three datasets: MedicationQA, HealthSearchQA, and ExpertQA. Each question is presented to the models without special prompting to test their baseline capabilities.
+To compare models, we use a tournament-style approach. A judge (Llama3.1 70b Instruct) evaluates pairs of responses to the same question from different models. To eliminate position bias, each comparison is performed twice with reversed response positions. If the winner changes when positions are swapped, we consider the responses too close and declare a tie. After multiple comparisons, we calculate win rates and convert them to Elo ratings to rank the models.
+It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
 """
 
 EVALUATION_QUEUE_TEXT = """
```
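The tournament-style judging described in the added text can be made concrete with a small sketch. This is not code from the Space: `judge_pair`, `run_tournament`, and `elo_from_results` are hypothetical names, the random verdict is a stand-in for the actual Llama3.1 70b Instruct judge call, and the sequential Elo update with K=32 and a 1000 base rating is an assumed scheme (the leaderboard may derive Elo from aggregate win rates instead).

```python
import itertools
import random


def judge_pair(question, answer_a, answer_b):
    """Hypothetical judge call (stands in for Llama3.1 70b Instruct).

    Returns "A" if the first response is preferred, "B" otherwise.
    A random choice keeps this sketch self-contained and runnable.
    """
    return random.choice(["A", "B"])


def compare_with_position_swap(question, response_x, response_y):
    """Judge the pair twice with positions reversed; return "x", "y", or "tie"."""
    first = judge_pair(question, response_x, response_y)   # x shown first
    second = judge_pair(question, response_y, response_x)  # y shown first
    if first == "A" and second == "B":
        return "x"    # x preferred in both orderings
    if first == "B" and second == "A":
        return "y"    # y preferred in both orderings
    return "tie"      # verdict flipped with the swap -> too close to call


def run_tournament(answers, questions):
    """answers: {model: {question: response}} -> list of (model_x, model_y, outcome)."""
    results = []
    for model_x, model_y in itertools.combinations(answers, 2):
        for q in questions:
            outcome = compare_with_position_swap(q, answers[model_x][q], answers[model_y][q])
            results.append((model_x, model_y, outcome))
    return results


def elo_from_results(results, k=32.0, base=1000.0):
    """Sequential Elo updates over pairwise outcomes; a tie scores 0.5 for each side."""
    ratings = {}
    for model_x, model_y, outcome in results:
        ratings.setdefault(model_x, base)
        ratings.setdefault(model_y, base)
        expected_x = 1.0 / (1.0 + 10 ** ((ratings[model_y] - ratings[model_x]) / 400.0))
        score_x = {"x": 1.0, "y": 0.0, "tie": 0.5}[outcome]
        ratings[model_x] += k * (score_x - expected_x)
        ratings[model_y] += k * ((1.0 - score_x) - (1.0 - expected_x))
    return ratings


if __name__ == "__main__":
    # Toy example with made-up responses, only to show the call flow.
    questions = ["Can I take ibuprofen with food?"]
    answers = {
        "model_a": {questions[0]: "Yes, taking it with food can reduce stomach upset."},
        "model_b": {questions[0]: "Ibuprofen is an NSAID; food may lessen GI irritation."},
    }
    print(elo_from_results(run_tournament(answers, questions)))
```

The position swap is what makes the tie rule work: a model is only credited with a win when the judge prefers its response in both orderings, so consistent preferences count and position-dependent ones cancel out.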