cchristophe commited on
Commit
a111e91
·
verified ·
1 Parent(s): 3df6003

Update src/about.py

Browse files
Files changed (1) hide show
  1. src/about.py +3 -1
src/about.py CHANGED
@@ -105,6 +105,8 @@ LOGO = """<img src="https://huggingface.co/spaces/m42-health/MEDIC-Benchmark/res
105
  # What does your leaderboard evaluate?
106
  INTRODUCTION_TEXT = """
107
  Deploying a good clinical LLM requires more than just acing closed-ended medical QA exams. It needs to be safe, ethical, comprehensive in its responses, and capable of reasoning and tackling complex medical tasks. The MEDIC framework aims to provide a transparent and comprehensive evaluation of LLM performance across various clinically relevant dimensions.
 
 
108
  """
109
 
110
  # Which evaluations are you running? how can people reproduce what you have?
@@ -139,7 +141,7 @@ Each question is presented to the models without special prompting to test their
139
  It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
140
 
141
  ### Medical Safety
142
- The Medical Safety category uses the "med-safety" benchmark dataset, which consists of 900 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA).
143
  In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
144
 
145
  ### Medical Summarization
 
105
  # What does your leaderboard evaluate?
106
  INTRODUCTION_TEXT = """
107
  Deploying a good clinical LLM requires more than just acing closed-ended medical QA exams. It needs to be safe, ethical, comprehensive in its responses, and capable of reasoning and tackling complex medical tasks. The MEDIC framework aims to provide a transparent and comprehensive evaluation of LLM performance across various clinically relevant dimensions.
108
+
109
+ Disclaimer: It is important to note that the purpose of this evaluation is purely academic and exploratory. The models assessed here have not been approved for clinical use, and their results should not be interpreted as clinically validated. The leaderboard serves as a platform for researchers to compare models, understand their strengths and limitations, and drive further advancements in the field of clinical NLP.
110
  """
111
 
112
  # Which evaluations are you running? how can people reproduce what you have?
 
141
  It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
142
 
143
  ### Medical Safety
144
+ Medical Safety category uses the "med-safety" benchmark dataset, which consists of 4500 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA).
145
  In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
146
 
147
  ### Medical Summarization