from dataclasses import dataclass
from enum import Enum
@dataclass
class HarnessTask:
    benchmark: str
    metric: str
    col_name: str
# Select your tasks here
# ---------------------------------------------------
class HarnessTasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    # task0 = Task("anli_r1", "acc", "ANLI")
    # task1 = Task("logiqa", "acc_norm", "LogiQA")
    task0 = HarnessTask("MMLU", "accuracy", "MMLU")
    task1 = HarnessTask("MMLU-Pro", "accuracy", "MMLU-Pro")
    task2 = HarnessTask("MedMCQA", "accuracy", "MedMCQA")
    task3 = HarnessTask("MedQA", "accuracy", "MedQA")
    task4 = HarnessTask("USMLE", "accuracy", "USMLE")
    task5 = HarnessTask("PubMedQA", "accuracy", "PubMedQA")
    task6 = HarnessTask("ToxiGen", "accuracy", "ToxiGen")
    # task7 = HarnessTask("Average", "accuracy", "Harness-Average")
    # task5 = Task("", "f1", "")
    # task6 = Task("", "f1", "")
@dataclass
class ClinicalType:
    benchmark: str
    metric: str
    col_name: str
class ClinicalTypes(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    type0 = ClinicalType("condition", "f1", "CONDITION")
    type1 = ClinicalType("measurement", "f1", "MEASUREMENT")
    type2 = ClinicalType("drug", "f1", "DRUG")
    type3 = ClinicalType("procedure", "f1", "PROCEDURE")
    type4 = ClinicalType("gene", "f1", "GENE")
    type5 = ClinicalType("gene variant", "f1", "GENE VARIANT")
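# Illustrative sketch (not part of the original leaderboard code): each entry above
# pairs a result key from the evaluation JSON with the metric to read and the column
# name to display. A hypothetical helper to build that mapping could look like this:
def leaderboard_columns() -> dict:
    """Map each benchmark / clinical-entity key to its display column name (sketch)."""
    cols = {task.value.benchmark: task.value.col_name for task in HarnessTasks}
    cols.update({ct.value.benchmark: ct.value.col_name for ct in ClinicalTypes})
    return cols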
NUM_FEWSHOT = 0  # Change with your few-shot setting
# ---------------------------------------------------
# Your leaderboard name
TITLE = """""" #<h1 align="center" id="space-title"> NER Leaderboard</h1>"""
# LOGO = """<img src="https://equalengineers.com/wp-content/uploads/2024/04/dummy-logo-5b.png" alt="Clinical X HF" width="500" height="333">"""
LOGO = """<img src="https://huggingface.co/spaces/m42-health/MEDIC-Benchmark/resolve/main/assets/image.png" alt="Clinical X HF" width="500" height="333">"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete by the time of deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes and between baseline and medically fine-tuned models, and have implications for model selection in applications requiring specific model strengths, such as low hallucination or lower cost of inference. MEDIC's multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings, ensuring that the most promising models are identified and adapted for diverse healthcare applications.
"""
# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT_1 = f"""
## About
The MEDIC Leaderboard aims to provide a comprehensive evaluation of clinical language models. It offers a standardized platform for evaluating and comparing the performance of various language models across five dimensions: medical reasoning, ethics and bias concerns, data and language understanding, in-context learning, and clinical safety and risk assessment. This comprehensive structure acknowledges the diverse facets of clinical competence and the varied requirements of healthcare applications. By addressing these critical dimensions, MEDIC aims to bridge the gap between benchmark performance and real-world clinical utility, providing a more robust prediction of an LLM’s potential effectiveness and safety in actual healthcare settings.
"""
EVALUATION_QUEUE_TEXT = """
Currently, the benchmark supports evaluation of decoder-type models hosted on the Hugging Face Hub. It doesn't support adapter models yet, but support for adapters will be added soon.
## Submission Guide for the MEDIC Benchmark
## First Steps Before Submitting a Model
### 1. Ensure Your Model Loads with AutoClasses
Verify that you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

revision = "main"  # branch, tag, or commit hash you want evaluated
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
Note:
- If this step fails, debug your model before submitting.
- Ensure your model is public.
### 2. Convert Weights to Safetensors
[Safetensors](https://huggingface.co/docs/safetensors/index) is a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
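One way to convert an existing checkpoint is sketched below (the model name and output path are placeholders; `save_pretrained` with `safe_serialization=True` writes the weights in safetensors format):
```python
from transformers import AutoModelForCausalLM

# Load the full causal LM (including the LM head) and re-save it as safetensors
model = AutoModelForCausalLM.from_pretrained("your model name")
model.save_pretrained("path-to-local-checkpoint", safe_serialization=True)
# Then upload the converted weights to the Hub, e.g. model.push_to_hub("your model name")
```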
### 3. Complete Your Model Card
When we add extra information about models to the leaderboard, it will be automatically taken from the model card.
### 4. Select the correct model type
Choose the correct model category from the options below:
- 🟢 : 🟢 pretrained model: new base models trained on a given text corpus using masked modelling, or new base models continuously trained on further corpora (which may include IFT/chat data) using masked modelling
- ⭕ : ⭕ fine-tuned models: pretrained models fine-tuned on more data or tasks.
- 🟦 : 🟦 preference-tuned models: chat-like fine-tunes, using either IFT (datasets of task instructions), RLHF, or DPO (slightly changing the model loss with an added policy), etc.
### 5. Select Correct Precision
Choose the right precision to avoid evaluation errors:
- Not all models convert properly from float16 to bfloat16.
- Incorrect precision can cause issues (e.g., loading a bf16 model in fp16 may generate NaNs).
- If you select `auto`, the precision specified under `torch_dtype` in the model config will be used (see the sketch below).
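For instance, a minimal sketch of loading a model in an explicit precision (the model name is a placeholder):
```python
import torch
from transformers import AutoModelForCausalLM

# Load with the precision you intend to select in the submission form
model = AutoModelForCausalLM.from_pretrained("your model name", torch_dtype=torch.bfloat16)
# torch_dtype="auto" instead falls back to the `torch_dtype` stored in the model config
```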
### 6. Medically oriented model
If the model has been specifically built for the medical domain, i.e. pretrained/fine-tuned on a significant amount of medical data, make sure to check the `Domain specific` checkbox.
### 7. Chat template
Select this option if your model uses a chat template. The chat template will be used during evaluation.
- Before submitting, make sure the chat template is defined in the tokenizer config (see the sketch below).
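A minimal sketch of verifying the template before submitting (the model name and message are placeholders):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your model name")
assert tokenizer.chat_template is not None, "No chat template found in the tokenizer config"

# Render a sample conversation to check that the template behaves as expected
messages = [{"role": "user", "content": "Hello!"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```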
Upon successful submission of your request, your model's results will be added to the leaderboard within 5 working days!
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{kanithi2024mediccomprehensiveframeworkevaluating,
title={MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications},
author={Praveen K Kanithi and Clément Christophe and Marco AF Pimentel and Tathagata Raha and Nada Saadi and Hamza Javed and Svetlana Maslenkova and Nasir Hayat and Ronnie Rajan and Shadab Khan},
year={2024},
eprint={2409.07314},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.07314},
}
"""