from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("anli_r1", "acc", "ANLI")
    task1 = Task("logiqa", "acc_norm", "LogiQA")


NUM_FEWSHOT = 0  # Change to match your few-shot setting
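
# Illustrative usage sketch (an assumption, not part of the upstream template): the Task
# entries above can be collected into a benchmark -> display-name mapping when building
# the leaderboard table; the name TASK_DISPLAY_NAMES is hypothetical.
TASK_DISPLAY_NAMES = {task.value.benchmark: task.value.col_name for task in Tasks}
# e.g. {"anli_r1": "ANLI", "logiqa": "LogiQA"}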
# ---------------------------------------------------
# Your leaderboard name
TITLE = """
Eval-Anything Leaderboard
"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
Eval-Anything is a framework designed specifically for evaluating all-modality models, and it is part of the [Align-Anything](https://github.com/PKU-Alignment/align-anything) framework. It consists of two main tasks: All-Modality Understanding (AMU) and All-Modality Generation (AMG). AMU assesses a model's ability to simultaneously process and integrate information from all modalities, including text, images, audio, and video. AMG, in turn, evaluates a model's capability to autonomously select output modalities based on user instructions and to use different modalities synergistically when generating output. Eval-Anything aims to comprehensively assess the ability of all-modality models to handle heterogeneous data from multiple sources, providing a reliable evaluation tool for this field.

**Note:** Since most current open-source models lack support for all-modality output, (†) indicates that a model is used as an agent that invokes [AudioLDM2-Large](https://huggingface.co/cvssp/audioldm2-large) and [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) for audio and image generation.
"""
# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
"""
EVALUATION_QUEUE_TEXT = """
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = """
@inproceedings{ji2024align,
  title={Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback},
  author={Jiaming Ji and Jiayi Zhou and Hantao Lou and Boyuan Chen and Donghai Hong and Xuyao Wang and Wenqi Chen and Kaile Wang and Rui Pan and Jiahao Li and Mohan Wang and Josef Dai and Tianyi Qiu and Hua Xu and Dong Li and Weipeng Chen and Jun Song and Bo Zheng and Yaodong Yang},
  year={2024},
  url={https://arxiv.org/abs/2412.15838}
}
"""
ABOUT_TEXT = """
"""
SUBMISSION_TEXT = """
## How to submit models/results to the leaderboard?
We welcome the community to submit evaluation results for new models. These results will be added as non-verified; however, authors are required to upload their model's generations so that other members can verify the results.
### 1 - Running Evaluation 🏃♂️
We have written a detailed guide for running the evaluation on your model. You can find it in the [align-anything](https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/evaluation/eval_anything) repository.
**Note:** The current code is a sample script. In the future, we will integrate Eval-Anything's evaluation pipeline into the framework to make it more convenient for the community to use.
### 2 - Submitting Results 🚀
To submit your results, create a **Pull Request** in the community tab to add them under the [community_results](https://huggingface.co/spaces/PKU-Alignment/EvalAnything-LeaderBoard/tree/main/community_results) folder in this repository:
- Create a folder named `ORG_MODELNAME_USERNAME`. For example `PKU-Alignment_gemini1.5-pro_XiaoMing`.
- Place all your generation and evaluation results in the folder.
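
A possible layout is sketched below (illustrative only; the exact file names are up to you, as long as both the generations and the evaluation results are included):
```
community_results/
└── PKU-Alignment_gemini1.5-pro_XiaoMing/
    ├── generations/          # raw model outputs
    └── evaluation_results/   # metric scores
```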
The title of the PR should be `[Community Submission] Model: org/model, Username: your_username`, replacing `org` and `model` with those of the model you evaluated.
### 3 - Getting your model verified ✅
A verified result in Eval-Anything indicates that a core maintainer has decoded the outputs from the model and performed the evaluation. To have your model verified, please follow these steps:
1. Email us and provide a brief rationale for why your model should be verified.
2. Await our response and approval before proceeding.
3. Prepare a script to decode from your model that does not require a local GPU; typically, this should be the same script used for your model contribution. We strongly recommend that you modify the scripts in [align-anything](https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/evaluation/eval_anything) to adapt them to your model (a minimal illustrative sketch is given after this list).
4. Generate temporary OpenAI API keys for running the script and share them with us. Specifically, we need the keys for evaluation.
5. We will check and execute your script, update the results, and inform you so that you can revoke the temporary keys.
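
Below is a minimal sketch of what such a CPU-only decode script might look like, assuming your model is served behind an OpenAI-compatible endpoint; the endpoint URL, model name, and the helper `decode_prompts` are placeholders, and the actual scripts in align-anything should be preferred where possible.
```python
# Illustrative sketch only; adapt the align-anything evaluation scripts where possible.
import json

from openai import OpenAI

# Hypothetical endpoint serving your model through an OpenAI-compatible API (no local GPU needed).
client = OpenAI(base_url="https://your-model-endpoint/v1", api_key="YOUR_TEMPORARY_KEY")

def decode_prompts(prompts, model="your-org/your-model"):
    # Query the hosted model for each prompt and collect the text outputs.
    outputs = []
    for prompt in prompts:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        outputs.append(response.choices[0].message.content)
    return outputs

if __name__ == "__main__":
    prompts = ["Describe the attached image and audio clip."]  # placeholder prompt
    print(json.dumps(decode_prompts(prompts), indent=2))
```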
**Please note that we will not re-evaluate the same model. Due to sampling variance, the results might slightly differ from your initial ones. We will replace your previous community results with the verified ones.**
"""