Spaces:

m42-health
/

MEDIC-Benchmark

Running

App Files Files Community

tathagataraha commited on Jan 13

Commit

32eaa7c

2 Parent(s): 57fd1ce a111e91

Merge branch 'main' of https://huggingface.co/spaces/m42-health/MEDIC-Benchmark

Browse files

Files changed (1) hide show

src/about.py +3 -1

src/about.py CHANGED Viewed

@@ -106,6 +106,8 @@ LOGO = """<img src="https://huggingface.co/spaces/m42-health/MEDIC-Benchmark/res
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
 Deploying a good clinical LLM requires more than just acing closed-ended medical QA exams. It needs to be safe, ethical, comprehensive in its responses, and capable of reasoning and tackling complex medical tasks. The MEDIC framework aims to provide a transparent and comprehensive evaluation of LLM performance across various clinically relevant dimensions.
 """
 # Which evaluations are you running? how can people reproduce what you have?
@@ -140,7 +142,7 @@ Each question is presented to the models without special prompting to test their
 It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
 ### Medical Safety
-The Medical Safety category uses the "med-safety" benchmark dataset, which consists of 900 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA).
 In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
 ### Medical Summarization

 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
 Deploying a good clinical LLM requires more than just acing closed-ended medical QA exams. It needs to be safe, ethical, comprehensive in its responses, and capable of reasoning and tackling complex medical tasks. The MEDIC framework aims to provide a transparent and comprehensive evaluation of LLM performance across various clinically relevant dimensions.
+Disclaimer: It is important to note that the purpose of this evaluation is purely academic and exploratory. The models assessed here have not been approved for clinical use, and their results should not be interpreted as clinically validated. The leaderboard serves as a platform for researchers to compare models, understand their strengths and limitations, and drive further advancements in the field of clinical NLP.
 """
 # Which evaluations are you running? how can people reproduce what you have?
 It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
 ### Medical Safety
+Medical Safety category uses the "med-safety" benchmark dataset, which consists of 4500 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA).
 In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
 ### Medical Summarization