tathagataraha commited on
Commit
32eaa7c
·
2 Parent(s): 57fd1ce a111e91

Merge branch 'main' of https://huggingface.co/spaces/m42-health/MEDIC-Benchmark

Browse files
Files changed (1) hide show
  1. src/about.py +3 -1
src/about.py CHANGED
@@ -106,6 +106,8 @@ LOGO = """<img src="https://huggingface.co/spaces/m42-health/MEDIC-Benchmark/res
106
  # What does your leaderboard evaluate?
107
  INTRODUCTION_TEXT = """
108
  Deploying a good clinical LLM requires more than just acing closed-ended medical QA exams. It needs to be safe, ethical, comprehensive in its responses, and capable of reasoning and tackling complex medical tasks. The MEDIC framework aims to provide a transparent and comprehensive evaluation of LLM performance across various clinically relevant dimensions.
 
 
109
  """
110
 
111
  # Which evaluations are you running? how can people reproduce what you have?
@@ -140,7 +142,7 @@ Each question is presented to the models without special prompting to test their
140
  It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
141
 
142
  ### Medical Safety
143
- The Medical Safety category uses the "med-safety" benchmark dataset, which consists of 900 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA).
144
  In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
145
 
146
  ### Medical Summarization
 
106
  # What does your leaderboard evaluate?
107
  INTRODUCTION_TEXT = """
108
  Deploying a good clinical LLM requires more than just acing closed-ended medical QA exams. It needs to be safe, ethical, comprehensive in its responses, and capable of reasoning and tackling complex medical tasks. The MEDIC framework aims to provide a transparent and comprehensive evaluation of LLM performance across various clinically relevant dimensions.
109
+
110
+ Disclaimer: It is important to note that the purpose of this evaluation is purely academic and exploratory. The models assessed here have not been approved for clinical use, and their results should not be interpreted as clinically validated. The leaderboard serves as a platform for researchers to compare models, understand their strengths and limitations, and drive further advancements in the field of clinical NLP.
111
  """
112
 
113
  # Which evaluations are you running? how can people reproduce what you have?
 
142
  It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
143
 
144
  ### Medical Safety
145
+ Medical Safety category uses the "med-safety" benchmark dataset, which consists of 4500 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA).
146
  In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
147
 
148
  ### Medical Summarization