Your leaderboard must be hosted on a Hugging Face Space.
Like model cards, your Space's README.md file should include specific metadata in a YAML section at the top.

Define the type: Add either the leaderboard or arena tag.
Add a description: Include a short_description field to explain the purpose of your evaluation.
Specify metadata: Add metadata tags to categorize your evaluation and help users understand its characteristics.

---
short_description: Evaluating LLMs on math reasoning tasks
tags:
  - leaderboard           # Type of leaderboard
  - submission:automatic  # How models are submitted
  - test:public           # Test set visibility
  - judge:function        # Evaluation method
  - modality:text         # Input/output type
  - language:english      # Language coverage
  - domain:financial      # Specific domain
---
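
The two required pieces above (a `leaderboard` or `arena` tag, and a `short_description`) can be checked before you push the README. The following is a minimal sketch, not an official tool: the function names `parse_front_matter` and `check_leaderboard_readme` are hypothetical, and the parser is deliberately simplified to handle only the `key: value` lines and `- item` lists used in the example. A real project would more likely rely on a full YAML library.

```python
import re

EXAMPLE_README = """---
short_description: Evaluating LLMs on math reasoning tasks
tags:
  - leaderboard           # Type of leaderboard
  - submission:automatic  # How models are submitted
  - test:public           # Test set visibility
---
# My leaderboard
"""


def parse_front_matter(readme_text):
    """Return the top-level fields of a README's YAML front matter as a dict.

    Simplified on purpose: only `key: value` lines and `- item` lists are
    understood. This is an illustrative assumption, not a general YAML parser.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields, current_list = {}, None
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter of the front matter
            break
        line = re.sub(r"(^|\s+)#.*$", "", line).rstrip()  # drop comments
        if not line:
            continue
        item = re.match(r"^\s*-\s+(.*)$", line)
        if item:
            if current_list is not None:
                current_list.append(item.group(1).strip())
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if value:
            fields[key] = value
            current_list = None
        else:
            # A bare `key:` starts a list that the following `- item` lines fill.
            current_list = fields.setdefault(key, [])
    return fields


def check_leaderboard_readme(readme_text):
    """Return a list of problems; an empty list means both checks passed."""
    fields = parse_front_matter(readme_text)
    tags = fields.get("tags", [])
    if not isinstance(tags, list):
        tags = [tags]
    problems = []
    if not {"leaderboard", "arena"} & set(tags):
        problems.append("missing type tag: add `leaderboard` or `arena`")
    if not fields.get("short_description"):
        problems.append("missing `short_description`")
    return problems


print(check_leaderboard_readme(EXAMPLE_README))  # prints []
```

The check only covers the two requirements stated above; it does not validate the optional category tags.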

How models are submitted:
- … automatically without human intervention
- the leaderboard requires the model owner to run evaluations on their side and submit the results
- the leaderboard requires the leaderboard owner to run evaluations for new submissions
- the leaderboard does not accept submissions at the moment

Test set visibility:
- … public, the evaluations are completely reproducible
- some test sets are public and some private
- all the test sets used are private, the evaluations are hard to game
- the test sets used change regularly through time and evaluation scores are refreshed

Evaluation method:
- … automatically, using an evaluation suite such as lm_eval or lighteval
- evaluations are run using a model-as-a-judge approach to rate answers
- evaluations are done by humans to rate answers - this is an arena
- evaluations are done manually by one or several humans

Input/output type:
- … tool usage - mostly for assistant models (a bit outside of usual modalities)
- the leaderboard concerns itself with machine learning artefacts themselves, for example, quality evaluation of text embeddings (a bit outside of usual modalities)

Evaluation focus:
- … generation capabilities specifically (can be image generation, text generation, ...)
- the evaluation tests math abilities
- the evaluation tests coding capabilities
- model performance (speed, energy consumption, ...)
- the evaluation considers safety, toxicity, bias
- the evaluation tests RAG (Retrieval-Augmented Generation) capabilities

If you would like to see a domain that is not currently represented, please contact Clementine Fourrier on Hugging Face.
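
The metadata tags in the example follow a `category:value` pattern (for instance `submission:automatic`), while the type tag is a bare word. As a rough illustration of how such tags could be grouped for display or filtering, here is a small sketch. The function name `group_tags` and the `"type"` key used for bare tags are assumptions made for this example, not part of any Hugging Face API.

```python
from collections import defaultdict


def group_tags(tags):
    """Split `category:value` tags into a {category: [values]} dict.

    Tags without a colon (such as `leaderboard`) are collected under the
    hypothetical key "type".
    """
    grouped = defaultdict(list)
    for tag in tags:
        category, sep, value = tag.partition(":")
        if sep:
            grouped[category].append(value)
        else:
            grouped["type"].append(tag)
    return dict(grouped)


example_tags = [
    "leaderboard",
    "submission:automatic",
    "test:public",
    "judge:function",
    "modality:text",
    "language:english",
    "domain:financial",
]

for category, values in group_tags(example_tags).items():
    print(category, values)
```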