tathagataraha committed on
Commit 27e5b96 · 1 Parent(s): d8147b8

[ADD]Model submission guide and citation

Files changed (2)
  1. app.py +5 -0
  2. src/about.py +41 -21
app.py CHANGED
@@ -371,8 +371,13 @@ with demo:
             )
 
         with gr.TabItem("🏅 Open Ended Evaluation", elem_id="llm-benchmark-tab-table", id=1):
+            gr.Markdown("# Coming Soon!!!", elem_classes="markdown-text")
            pass
        with gr.TabItem("🏅 Med Safety", elem_id="llm-benchmark-tab-table", id=2):
+            gr.Markdown("# Coming Soon!!!", elem_classes="markdown-text")
+            pass
+        with gr.TabItem("🏅 Med Safety", elem_id="llm-benchmark-tab-table", id=2):
+            gr.Markdown("# Coming Soon!!!", elem_classes="markdown-text")
            pass
 
        with gr.TabItem("📝 About", elem_id="llm-benchmark-tab-table", id=3):
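For context, the added lines use the standard Gradio `Blocks`/`TabItem` pattern: a placeholder tab whose only child is a `gr.Markdown` note. Below is a minimal, self-contained sketch of that pattern, not the leaderboard's full `app.py`; the surrounding `gr.Tabs` container and the CSS class names are assumptions based on the snippet above.

```python
import gradio as gr

# Minimal sketch of placeholder tabs that only render a "Coming Soon" note,
# mirroring the lines added in the commit above.
with gr.Blocks() as demo:
    with gr.Tabs(elem_classes="tab-buttons"):
        with gr.TabItem("🏅 Open Ended Evaluation", elem_id="llm-benchmark-tab-table", id=1):
            gr.Markdown("# Coming Soon!!!", elem_classes="markdown-text")
        with gr.TabItem("🏅 Med Safety", elem_id="llm-benchmark-tab-table", id=2):
            gr.Markdown("# Coming Soon!!!", elem_classes="markdown-text")

if __name__ == "__main__":
    demo.launch()
```

Once a tab has a child component like this, the trailing `pass` statements in the diff become redundant.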
src/about.py CHANGED
@@ -227,41 +227,61 @@ Users are advised to approach the results with an understanding of the inherent
 
 EVALUATION_QUEUE_TEXT = """
 
- Currently, the benchmark supports evaluation for models hosted on the huggingface hub and of type encoder, decoder or gliner type models.
- If your model needs a custom implementation, follow the steps outlined in the [clinical_ner_benchmark](https://github.com/WadoodAbdul/clinical_ner_benchmark/blob/e66eb566f34e33c4b6c3e5258ac85aba42ec7894/docs/custom_model_implementation.md) repo or reach out to our team!
-
-
- ### Fields Explanation
-
- #### Model Type:
- - Fine-Tuned: If the training data consisted of any split/variation of the datasets on the leaderboard.
- - Zero-Shot: If the model did not have any exposure to the datasets on the leaderboard while training.
-
- #### Model Architecture:
- - Encoder: The standard transformer encoder architecture with a token classification head on top.
- - Decoder: Transformer based autoregressive token generation model.
- - GLiNER: Architecture outlined in the [GLiNER Paper](https://arxiv.org/abs/2311.08526)
-
- #### Label Normalization Map:
- Not all models have been tuned to output the ner label names in the clinical datasets on this leaderboard. Some models cater to the same entity names with a synonym of it.
- The normalization map can be used to ensure that the models's output are aligned with the labels expected in the datasets.
-
- Note: Multiple model labels can be mapped to a single entity type in the leaderboard dataset. Ex: 'synonym' and 'disease' to 'condition'
-
-
+ Currently, the benchmark supports evaluation of decoder-type models hosted on the Hugging Face Hub. Adapter models are not supported yet, but support for them will be added soon.
+
+ ## Submission Guide for the MEDIC Benchmark
+
+ ## First Steps Before Submitting a Model
+
+ ### 1. Ensure Your Model Loads with AutoClasses
+ Verify that you can load your model and tokenizer using AutoClasses:
+ ```python
+ from transformers import AutoConfig, AutoModel, AutoTokenizer
+ config = AutoConfig.from_pretrained("your model name", revision=revision)
+ model = AutoModel.from_pretrained("your model name", revision=revision)
+ tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
+ ```
+ Note:
+ - If this step fails, debug your model before submitting.
+ - Ensure your model is public.
+
+ ### 2. Convert Weights to Safetensors
+ [Safetensors](https://huggingface.co/docs/safetensors/index) is a new format for storing weights that is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
+
+ ### 3. Complete Your Model Card
+ When we add extra information about models to the leaderboard, it will be taken automatically from the model card.
+
+ ### 4. Select the Correct Model Type
+ Choose the correct model category from the options below:
+ - 🟢 Pretrained model: a new base model trained on a given text corpus using masked modelling, or a base model continuously trained on further corpora (which may include IFT/chat data) using masked modelling.
+ - ⭕ Fine-tuned model: a pretrained model fine-tuned on more data or tasks.
+ - 🟦 Preference-tuned model: chat-like fine-tunes, using IFT (datasets of task instructions), RLHF, or DPO (slightly changing the model loss with an added policy), etc.
+
+ ### 5. Select the Correct Precision
+ Choose the right precision to avoid evaluation errors:
+ - Not all models convert properly from float16 to bfloat16.
+ - Incorrect precision can cause issues (e.g., loading a bf16 model in fp16 may generate NaNs).
+ - If you select auto, the precision listed under `torch_dtype` in the model config will be used.
+
+ ### 6. Medically Oriented Model
+ If the model has been built specifically for the medical domain, i.e., pretrained/fine-tuned on a significant amount of medical data, make sure to check the `Domain specific` checkbox.
+
+ ### 7. Chat Template
+ Select this option if your model uses a chat template. The chat template will be used during evaluation.
+ - Before submitting, make sure the chat template is defined in the tokenizer config.
+
  Upon successful submission of your request, your model's result will be updated on the leaderboard within 5 working days!
  """
 
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
  CITATION_BUTTON_TEXT = r"""
- @misc{abdul2024namedclinicalentityrecognition,
-       title={Named Clinical Entity Recognition Benchmark},
-       author={Wadood M Abdul and Marco AF Pimentel and Muhammad Umar Salman and Tathagata Raha and Clément Christophe and Praveen K Kanithi and Nasir Hayat and Ronnie Rajan and Shadab Khan},
+ @misc{kanithi2024mediccomprehensiveframeworkevaluating,
+       title={MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications},
+       author={Praveen K Kanithi and Clément Christophe and Marco AF Pimentel and Tathagata Raha and Nada Saadi and Hamza Javed and Svetlana Maslenkova and Nasir Hayat and Ronnie Rajan and Shadab Khan},
        year={2024},
-       eprint={2410.05046},
+       eprint={2409.07314},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
-       url={https://arxiv.org/abs/2410.05046},
- }
-
+       url={https://arxiv.org/abs/2409.07314},
+ }
  """