tathagataraha committed
Commit fedd68b · 2 Parent(s): 96ca081 0da091e

Merge branch 'main' of https://huggingface.co/spaces/m42-health/MEDIC-Benchmark

Files changed (1):
  1. src/about.py +7 -7
src/about.py CHANGED
@@ -131,19 +131,19 @@ We used the Eleuther AI's Evaluation Harness framework, which focuses on the lik
  ### Open-ended Questions
 
  This category assesses the quality of the LLM's reasoning and explanations. The LLM is tasked with answering open-ended medical questions from various datasets:
- - MedicationQA
- - HealthSearchQA
- - ExpertQA
+ - [MedicationQA](https://ebooks.iospress.nl/doi/10.3233/SHTI190176)
+ - [HealthSearchQA](https://www.nature.com/articles/s41586-023-06291-2)
+ - [ExpertQA](https://arxiv.org/abs/2309.07852)
 
  Each question is presented to the models without special prompting to test their baseline capabilities. To compare models, we use a tournament-style approach. A judge (Llama3.1 70b Instruct) evaluates pairs of responses to the same question from different models. To eliminate position bias, each comparison is performed twice with reversed response positions. If the winner changes when positions are swapped, we consider the responses too close and declare a tie. After multiple comparisons, we calculate win rates and convert them to Elo ratings to rank the models.
  It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
 
  ### Medical Safety
- Medical Safety category uses the "med-safety" benchmark dataset, which consists of 4500 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA).
+ Medical Safety category uses the "[med-safety](https://openreview.net/forum?id=1cq9pmwRgG)" benchmark dataset, which consists of 4500 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA).
  In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
 
  ### Medical Summarization
- This category evaluates the LLM's ability to summarize medical texts, with a focus on clinical trial descriptions from ClinicalTrials.gov. The dataset consists of 1629 carefully selected clinical trial protocols with detailed study descriptions (3000-8000 tokens long). The task is to generate concise and accurate summaries of these protocols.
+ This category evaluates the LLM's ability to summarize medical texts, with a focus on clinical trial descriptions from ClinicalTrials.gov. The [dataset](https://trec.nist.gov/pubs/trec31/papers/Overview_trials.pdf) consists of 1629 carefully selected clinical trial protocols with detailed study descriptions (3000-8000 tokens long). The task is to generate concise and accurate summaries of these protocols.
 
  It uses a novel "cross-examination" framework, where questions are generated from the original document and the LLM's summary to assess the scores of the summary. The four key scores calculated are:
 
@@ -155,9 +155,9 @@ It uses a novel "cross-examination" framework, where questions are generated fro
  ### Note Generation
  This category assesses the LLM's ability to generate structured clinical notes from doctor-patient conversations. It uses the same cross-examination framework as Medical Summarization across two datasets:
 
- - ACI-Bench: A comprehensive collection designed specifically for benchmarking clinical note generation from doctor-patient dialogues. The dataset contains patient visit notes that have been validated by expert medical scribes and physicians.
+ - [ACI-Bench](https://www.nature.com/articles/s41597-023-02487-3): A comprehensive collection designed specifically for benchmarking clinical note generation from doctor-patient dialogues. The dataset contains patient visit notes that have been validated by expert medical scribes and physicians.
 
- - SOAP Notes: Using the test split of the ChartNote dataset containing 250 synthetic patient-doctor conversations generated from real clinical notes. The task involves generating notes in the SOAP format with the following sections:
+ - [SOAP Notes](https://arxiv.org/abs/2310.15959): Using the test split of the ChartNote dataset containing 250 synthetic patient-doctor conversations generated from real clinical notes. The task involves generating notes in the SOAP format with the following sections:
   - Subjective: Patient's description of symptoms, medical history, and personal experiences
   - Objective: Observable data like physical exam findings, vital signs, and diagnostic test results
   - Assessment: Healthcare provider's diagnosis based on subjective and objective information
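
The tournament scheme in the Open-ended Questions text above (pairwise judging, a repeat comparison with positions swapped, ties on flipped verdicts, results converted to Elo) can be sketched as follows. This is a minimal illustration, not the benchmark's code: the `judge` callable, the K-factor, and the initial rating are assumed placeholders, and the sketch applies the standard sequential Elo update as one simple way to turn comparison outcomes into ratings.

```python
from itertools import combinations

INITIAL_RATING = 1000.0
K_FACTOR = 16.0  # assumed update step; the commit text does not specify one

def judged_outcome(judge, question, resp_a, resp_b):
    """Compare twice with positions swapped; a flipped verdict is a tie."""
    first = judge(question, resp_a, resp_b)   # judge returns "A" or "B"
    second = judge(question, resp_b, resp_a)  # same pair, positions reversed
    if first == "A" and second == "B":
        return 1.0   # resp_a preferred in both orderings
    if first == "B" and second == "A":
        return 0.0   # resp_b preferred in both orderings
    return 0.5       # preference flipped with position -> tie

def update_elo(ratings, model_a, model_b, score_a):
    """Standard Elo update from one comparison; score_a is 0, 0.5, or 1."""
    expected_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400.0))
    ratings[model_a] += K_FACTOR * (score_a - expected_a)
    ratings[model_b] += K_FACTOR * ((1.0 - score_a) - (1.0 - expected_a))

def rank_models(models, questions, responses, judge):
    """responses[model][question] -> response text; returns Elo per model."""
    ratings = {model: INITIAL_RATING for model in models}
    for question in questions:
        for model_a, model_b in combinations(models, 2):
            score_a = judged_outcome(
                judge, question,
                responses[model_a][question], responses[model_b][question],
            )
            update_elo(ratings, model_a, model_b, score_a)
    return ratings
```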
 
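For the Medical Safety category, the sketch below shows one way the judge's per-scenario harmfulness scores could be aggregated. The 1-5 scale and the refusal threshold are assumptions; the text only specifies a predefined scale where low means refusal and high means compliance.

```python
from statistics import mean

REFUSAL_THRESHOLD = 2  # assumed cutoff: scores at or below this count as refusals

def aggregate_harmfulness(judge_scores: list[int]) -> dict:
    """judge_scores holds one harmfulness score per harmful scenario."""
    return {
        "mean_harmfulness": mean(judge_scores),
        "refusal_rate": sum(s <= REFUSAL_THRESHOLD for s in judge_scores) / len(judge_scores),
    }

# Example: 4 of these 5 scenario scores fall at or below the threshold.
summary = aggregate_harmfulness([1, 1, 2, 5, 1])  # refusal_rate == 0.8
```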
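
Finally, the cross-examination framework shared by Medical Summarization and Note Generation can be pictured as below. This is a hypothetical sketch: `generate_questions` and `answer` stand in for LLM calls, and since the four actual score definitions fall outside the hunks shown in this diff, the sketch computes a single illustrative agreement signal rather than the benchmark's real scores.

```python
def cross_examination_score(document: str, summary: str,
                            generate_questions, answer) -> float:
    """Fraction of document-derived questions whose answers agree between
    the original document and the candidate summary (a coverage-style signal)."""
    questions = generate_questions(document)  # assumed LLM call
    matches = [
        answer(question, context=document) == answer(question, context=summary)
        for question in questions
    ]
    return sum(matches) / len(matches)
```

The same mechanics apply symmetrically: generating questions from the summary and answering them against the document would instead probe whether the summary introduces unsupported content.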