
Adding a New Metric

Before You Start

Two Different Types of Metrics

There are two types of metrics in Lighteval:

Sample-Level Metrics

  • Purpose: Evaluate individual samples/predictions
  • Input: Takes a Doc and ModelResponse (model’s prediction)
  • Output: Returns a float or boolean value for that specific sample
  • Example: Checking if a model’s answer matches the correct answer for one sample

Corpus-Level Metrics

  • Purpose: Compute final scores across the entire dataset/corpus
  • Input: Takes the results from all sample-level evaluations
  • Output: Returns a single score representing overall performance
  • Examples:
    • Simple aggregation: Calculating average accuracy across all test samples
    • Complex metrics: BLEU score, where the sample-level metric prepares the data (tokenization, etc.) and the corpus-level metric computes the actual BLEU score (see the sketch after this list)
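
As a concrete illustration of that split, here is a minimal sketch of a BLEU-style metric pair. The names bleu_prep and bleu_corpus are hypothetical, and a crude unigram precision stands in for real BLEU; only the division of labour between the two functions matters here.

def bleu_prep(doc, model_response):
    # Sample level: only pair the tokenized prediction with its reference.
    hypothesis = model_response.text[0].split()
    reference = doc.choices[doc.gold_index].split()
    return (hypothesis, reference)

def bleu_corpus(items):
    # Corpus level: score all pairs at once. A real metric would call a proper
    # BLEU implementation here instead of this unigram precision.
    matched = sum(len(set(hyp) & set(ref)) for hyp, ref in items)
    total = sum(len(set(hyp)) for hyp, _ in items)
    return matched / total if total else 0.0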

Check Existing Metrics

First, check if you can use one of the parameterized functions in Corpus Metrics or Sample Metrics.

If not, you can use the custom_task system to register your new metric.

To see an example of a custom metric added along with a custom task, look at the IFEval custom task.

To contribute your custom metric to the Lighteval repository, first install the required dev dependencies by running pip install -e .[dev], then run pre-commit install to set up the pre-commit hooks.

Creating a Custom Metric

Step 1: Create the Metric File

Create a new Python file which should contain the full logic of your metric. The file also needs to start with these imports:

from aenum import extend_enum
from lighteval.metrics import Metrics
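
The examples in the following steps also use Doc, ModelResponse, SampleLevelMetric, SampleLevelMetricGrouping, SamplingMethod, and numpy. As a rough sketch only (the exact module paths have moved between Lighteval releases, so check your installed version), the full import block could look like this:

import numpy as np
from aenum import extend_enum

# NOTE: module paths below are indicative and may differ between versions
from lighteval.metrics import Metrics
from lighteval.metrics.utils.metric_utils import SampleLevelMetric, SampleLevelMetricGrouping
from lighteval.models.model_output import ModelResponse
from lighteval.tasks.requests import Doc, SamplingMethod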

Step 2: Define the Sample-Level Metric

You need to define a sample-level metric. All sample-level metrics share the same signature, taking a Doc and a ModelResponse. The metric should return a float or a boolean.

Single Metric Example

def custom_metric(doc: Doc, model_response: ModelResponse) -> bool:
    response = model_response.text[0]
    return response == doc.choices[doc.gold_index]
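
The body of the function is entirely up to you. For instance, a hypothetical variant that normalizes generative answers before comparing them:

def normalized_accuracy(doc: Doc, model_response: ModelResponse) -> bool:
    # Strip surrounding whitespace and ignore case before comparing.
    response = model_response.text[0].strip().lower()
    gold = doc.choices[doc.gold_index].strip().lower()
    return response == gold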

Multiple Metrics Example

If you want to return multiple metrics per sample, return a dictionary mapping each metric name to its value:

def custom_metric(doc: Doc, model_response: ModelResponse) -> dict:
    response = model_response.text[0]
    return {"accuracy": response == doc.choices[doc.gold_index], "other_metric": 0.5}

Step 3: Define Aggregation Function (Optional)

You can define an aggregation function if needed; the most common choice is simply np.mean. If each sample instead produces a list of values, you can flatten them before averaging:

def agg_function(items):
    flat_items = [item for sublist in items for item in sublist]
    score = sum(flat_items) / len(flat_items)
    return score
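
For instance, with hypothetical scores, np.mean works directly on a flat list of per-sample values, while the flattening helper above covers the case where each sample produced a list:

import numpy as np

flat_scores = [1.0, 0.0, 1.0, 1.0]        # one value per sample
nested_scores = [[1.0, 0.0], [1.0, 1.0]]  # a list of values per sample

print(np.mean(flat_scores))         # 0.75
print(agg_function(nested_scores))  # 0.75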

Step 4: Create the Metric Object

Single Metric

If your sample-level function returns a single value, wrap it in a SampleLevelMetric:

my_custom_metric = SampleLevelMetric(
    metric_name="custom_accuracy",
    higher_is_better=True,
    category=SamplingMethod.GENERATIVE,
    sample_level_fn=custom_metric,
    corpus_level_fn=agg_function,
)

Multiple Metrics

If your sample-level function returns multiple metrics per sample, use SampleLevelMetricGrouping instead. The metric_name entries and the corpus_level_fn keys must match the keys of the dictionary your function returns:

my_custom_metric_group = SampleLevelMetricGrouping(
    metric_name=["accuracy", "other_metric"],
    higher_is_better={
        "accuracy": True,
        "other_metric": True,
    },
    category=SamplingMethod.GENERATIVE,
    sample_level_fn=custom_metric,  # the multiple-metrics function from Step 2
    corpus_level_fn={
        "accuracy": np.mean,
        "other_metric": np.mean,
    },
)

Step 5: Register the Metric

To finish, add the following code so that it adds your metric to our metrics list when loaded as a module:

# Adds the metric to the metric list!
extend_enum(Metrics, "CUSTOM_ACCURACY", my_custom_metric)

# Running this file directly is a quick way to check that it imports cleanly
if __name__ == "__main__":
    print("Imported metric")

Using Your Custom Metric

With Custom Tasks

First, reference the metric in your custom task configuration (in the same file as the metric, or in a file that imports it):

from lighteval.tasks.lighteval_task import LightevalTaskConfig

task = LightevalTaskConfig(
    name="my_custom_task",
    suite=["community"],
    metric=[my_custom_metric],  # Use your custom metric here
    prompt_function=my_prompt_function,
    hf_repo="my_dataset",
    evaluation_splits=["test"],
)

Then give the file to Lighteval with --custom-tasks path_to_your_file when launching the evaluation:

lighteval accelerate \
    "model_name=openai-community/gpt2" \
    "community|my_custom_task|0" \
    --custom-tasks path_to_your_metric_file.py