Spaces:
Running
Running
Merge branch 'main' of https://huggingface.co/spaces/m42-health/MEDIC-Benchmark
Browse files- src/about.py +3 -1
src/about.py
CHANGED
@@ -106,6 +106,8 @@ LOGO = """<img src="https://huggingface.co/spaces/m42-health/MEDIC-Benchmark/res
|
|
106 |
# What does your leaderboard evaluate?
|
107 |
INTRODUCTION_TEXT = """
|
108 |
Deploying a good clinical LLM requires more than just acing closed-ended medical QA exams. It needs to be safe, ethical, comprehensive in its responses, and capable of reasoning and tackling complex medical tasks. The MEDIC framework aims to provide a transparent and comprehensive evaluation of LLM performance across various clinically relevant dimensions.
|
|
|
|
|
109 |
"""
|
110 |
|
111 |
# Which evaluations are you running? how can people reproduce what you have?
|
@@ -140,7 +142,7 @@ Each question is presented to the models without special prompting to test their
|
|
140 |
It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
|
141 |
|
142 |
### Medical Safety
|
143 |
-
|
144 |
In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
|
145 |
|
146 |
### Medical Summarization
|
|
|
106 |
# What does your leaderboard evaluate?
|
107 |
INTRODUCTION_TEXT = """
|
108 |
Deploying a good clinical LLM requires more than just acing closed-ended medical QA exams. It needs to be safe, ethical, comprehensive in its responses, and capable of reasoning and tackling complex medical tasks. The MEDIC framework aims to provide a transparent and comprehensive evaluation of LLM performance across various clinically relevant dimensions.
|
109 |
+
|
110 |
+
Disclaimer: It is important to note that the purpose of this evaluation is purely academic and exploratory. The models assessed here have not been approved for clinical use, and their results should not be interpreted as clinically validated. The leaderboard serves as a platform for researchers to compare models, understand their strengths and limitations, and drive further advancements in the field of clinical NLP.
|
111 |
"""
|
112 |
|
113 |
# Which evaluations are you running? how can people reproduce what you have?
|
|
|
142 |
It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
|
143 |
|
144 |
### Medical Safety
|
145 |
+
Medical Safety category uses the "med-safety" benchmark dataset, which consists of 4500 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA).
|
146 |
In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
|
147 |
|
148 |
### Medical Summarization
|