Create a Space
Your leaderboard must be hosted on a Hugging Face Space.

Add metadata
Like model cards, your Space's README.md file should include specific metadata in a YAML section at the top:

Add either the leaderboard or arena tag. Choose between:
• arena - for human evaluations; requires judge:humans
• leaderboard - for automated evaluations with judge:function or judge:model
Include a short_description field to explain the purpose of your evaluation
Add metadata tags to categorize your evaluation (see examples on the right)

---
short_description: Evaluating LLMs on math reasoning tasks
tags:
  - leaderboard           # Type of leaderboard
  - submission:automatic  # How models are submitted
  - test:public           # Test set visibility
  - judge:function        # Evaluation method
  - modality:text         # Input/output type
  - language:english      # Language coverage
  - domain:financial      # Specific domain
---
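
The tag rules above can be checked before pushing a README. The following is a minimal sketch, not an official tool: `read_tags` and `check_tags` are hypothetical helper names, the parser only handles the simple `tags:` list layout shown in the example (not arbitrary YAML), and treating leaderboard and arena as mutually exclusive is an assumption based on the "either ... or" wording.

```python
# Hypothetical helpers (not part of any Hugging Face library): check the
# leaderboard/arena tag rules against a README's YAML front matter.
# Stdlib only; supports just the simple "tags:" list layout from the example.

def read_tags(readme_text: str) -> list[str]:
    """Return the entries of the `tags:` list in the leading `---` block."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no front matter at the top of the file
    tags, in_tags = [], False
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter of the front matter
            break
        stripped = line.split("#", 1)[0].strip()  # drop trailing comments
        if stripped == "tags:":
            in_tags = True
        elif in_tags and stripped.startswith("- "):
            tags.append(stripped[2:].strip())
        elif stripped:  # any other key ends the tags list
            in_tags = False
    return tags


def check_tags(tags: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the rules are met."""
    problems = []
    has_leaderboard = "leaderboard" in tags
    has_arena = "arena" in tags
    if has_leaderboard == has_arena:
        # Assumption: the two type tags are mutually exclusive.
        problems.append("use exactly one of 'leaderboard' or 'arena'")
    elif has_arena and "judge:humans" not in tags:
        problems.append("'arena' requires 'judge:humans'")
    elif has_leaderboard and not {"judge:function", "judge:model"} & set(tags):
        problems.append("'leaderboard' expects 'judge:function' or 'judge:model'")
    return problems


EXAMPLE = """---
short_description: Evaluating LLMs on math reasoning tasks
tags:
  - leaderboard           # Type of leaderboard
  - submission:automatic  # How models are submitted
  - judge:function        # Evaluation method
---
"""

print(check_tags(read_tags(EXAMPLE)))  # prints []
```

A README whose tags include arena but not judge:humans would yield a single problem message instead of an empty list.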

Input/output type:
• [...] tool usage - mostly for assistant models (a bit outside of usual modalities)
• the leaderboard concerns itself with machine learning artefacts themselves, for example, quality evaluation of text embeddings

What the evaluation tests:
• [...] generation capabilities specifically (can be image generation, text generation, ...)
• the evaluation tests math abilities
• the evaluation tests coding capabilities
• the evaluation tests reasoning abilities
• model performance (speed, energy consumption, ...)
• the evaluation considers safety, toxicity, bias
• the evaluation measures the model's tendency to hallucinate or generate false information
• the evaluation tests RAG (Retrieval-Augmented Generation) capabilities

How models are submitted:
• [...] automatically without human intervention
• the leaderboard requires the model owner to run evaluations on their side and submit the results
• the leaderboard requires the leaderboard owner to run evaluations for new submissions
• the leaderboard does not accept submissions at the moment

Test set visibility:
• [...] public, the evaluations are completely reproducible
• some test sets are public and some private
• all the test sets used are private, the evaluations are hard to game
• the test sets used change regularly over time and evaluation scores are refreshed

Evaluation method:
• [...] automatically, using an evaluation suite such as lm_eval or lighteval
• evaluations are run using a model-as-a-judge approach to rate answers
• evaluations are done by humans to rate answers - this is an arena
• evaluations are done manually by one or several humans

If you would like to see a tag that is not currently represented, please contact Clémentine Fourrier on Hugging Face.