Lighteval documentation
Adding a New Metric
Before You Start
Two different types of metrics
There are two types of metrics in Lighteval:
Sample-Level Metrics
- Purpose: Evaluate individual samples/predictions
- Input: Takes a Doc and a ModelResponse (the model’s prediction)
- Output: Returns a float or boolean value for that specific sample
- Example: Checking if a model’s answer matches the correct answer for one sample
Corpus-Level Metrics
- Purpose: Compute final scores across the entire dataset/corpus
- Input: Takes the results from all sample-level evaluations
- Output: Returns a single score representing overall performance
- Examples:
- Simple aggregation: Calculating average accuracy across all test samples
- Complex metrics: BLEU score where sample-level metric prepares data (tokenization, etc.) and corpus-level metric computes the actual BLEU score
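The split between the two levels can be illustrated with a minimal, self-contained sketch. It does not use Lighteval's API: `sample_counts` and `corpus_f1` are hypothetical names, and a pooled token-overlap F1 stands in for a metric like BLEU, where the sample-level step only prepares data and the corpus-level step computes the score.

```python
from collections import Counter


def sample_counts(prediction: str, reference: str) -> tuple[int, int, int]:
    """Sample level: return (overlap, predicted, reference) token counts for one sample."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    return overlap, len(pred_tokens), len(ref_tokens)


def corpus_f1(items: list[tuple[int, int, int]]) -> float:
    """Corpus level: pool the counts over all samples, then compute a single F1 score."""
    overlap = sum(item[0] for item in items)
    predicted = sum(item[1] for item in items)
    reference = sum(item[2] for item in items)
    if overlap == 0:
        return 0.0
    precision = overlap / predicted
    recall = overlap / reference
    return 2 * precision * recall / (precision + recall)


# Illustrative data, not taken from any real benchmark.
samples = [("the cat sat", "the cat sat down"), ("a dog", "the dog barked")]
per_sample = [sample_counts(pred, ref) for pred, ref in samples]
score = corpus_f1(per_sample)
```

Because the counts are pooled before the score is computed, the result generally differs from averaging a per-sample F1, which is why this kind of metric needs a dedicated corpus-level step.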
Check Existing Metrics
First, check if you can use one of the parameterized functions in Corpus Metrics or Sample Metrics.
If not, you can use the custom_task system to register your new metric.
To see an example of a custom metric added along with a custom task, look at the IFEval custom task.
To contribute your custom metric to the Lighteval repository, first install the required dev dependencies by running pip install -e .[dev], then run pre-commit install to install the pre-commit hooks.
Creating a Custom Metric
Step 1: Create the Metric File
Create a new Python file which should contain the full logic of your metric. The file also needs to start with these imports:
from aenum import extend_enum
from lighteval.metrics import Metrics
Step 2: Define the Sample-Level Metric
You need to define a sample-level metric. All sample-level metrics have the same signature, taking a lighteval.types.Doc and a lighteval.types.ModelResponse. The metric should return a float or a boolean.
Single Metric Example
def custom_metric(doc: Doc, model_response: ModelResponse) -> bool:
    response = model_response.text[0]
    return response == doc.choices[doc.gold_index]
Multiple Metrics Example
If you want to return multiple metrics per sample, return a dictionary that maps each metric name to its value:
def custom_metric(doc: Doc, model_response: ModelResponse) -> dict:
    response = model_response.text[0]
    return {"accuracy": response == doc.choices[doc.gold_index], "other_metric": 0.5}
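Before wiring such a function into Lighteval, it can be exercised on its own. The sketch below assumes only the attributes used in the examples above (`doc.choices`, `doc.gold_index`, `model_response.text`). `FakeDoc` and `FakeModelResponse` are hypothetical stand-ins, not Lighteval classes, and `response_length` is an illustrative second metric.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in exposing only the Doc fields the metric reads."""
    choices: list[str]
    gold_index: int


@dataclass
class FakeModelResponse:
    """Hypothetical stand-in exposing only the ModelResponse field the metric reads."""
    text: list[str]


def custom_metric(doc, model_response) -> dict:
    # Compare the first generated text with the gold choice and record its length.
    response = model_response.text[0]
    return {
        "accuracy": response == doc.choices[doc.gold_index],
        "response_length": len(response),
    }


doc = FakeDoc(choices=["Paris", "Rome"], gold_index=0)
correct = custom_metric(doc, FakeModelResponse(text=["Paris"]))
wrong = custom_metric(doc, FakeModelResponse(text=["Rome"]))
```

Checking the metric this way catches key typos and type mistakes early, without loading a model or a dataset.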
Step 3: Define Aggregation Function (Optional)
You can define an aggregation function if needed. A common aggregation function is np.mean:
def agg_function(items):
    # Assumes each item is itself a list of scores: flatten one level, then average.
    flat_items = [item for sublist in items for item in sublist]
    score = sum(flat_items) / len(flat_items)
    return score
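The function above flattens one level of nesting before averaging, so it expects each item to be a list of scores. As a hedged sketch of the other common case, where each sample yields a single float or boolean, `mean_aggregate` (a hypothetical name) averages a flat list using only the standard library and plays the same role as np.mean.

```python
def agg_function(items):
    # As in the text: flatten a list of lists, then average.
    flat_items = [item for sublist in items for item in sublist]
    return sum(flat_items) / len(flat_items)


def mean_aggregate(items):
    """Average a flat list of per-sample scores (booleans count as 0 or 1)."""
    if not items:
        raise ValueError("cannot aggregate an empty list of scores")
    return sum(float(item) for item in items) / len(items)


nested_score = agg_function([[1.0, 0.0], [1.0]])        # list of lists
flat_score = mean_aggregate([True, False, True, True])  # flat list of booleans
```

Passing a flat list of booleans to `agg_function` would raise a `TypeError`, since a boolean cannot be iterated, so the aggregation function should match the shape of what the sample-level function returns.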
Step 4: Create the Metric Object
Single Metric
If it’s a sample-level metric, you can use the following code with SampleLevelMetric:
my_custom_metric = SampleLevelMetric(
    metric_name="custom_accuracy",
    higher_is_better=True,
    category=SamplingMethod.GENERATIVE,
    sample_level_fn=custom_metric,
    corpus_level_fn=agg_function,
)
Multiple Metrics
If your metric defines multiple metrics per sample, you can use the following code with SampleLevelMetricGrouping:
# Bound to a new name so the metric object does not shadow the custom_metric function.
my_custom_metric_group = SampleLevelMetricGrouping(
    metric_name=["accuracy", "response_length", "confidence"],
    higher_is_better={
        "accuracy": True,
        "response_length": False,  # Shorter responses might be better
        "confidence": True,
    },
    category=SamplingMethod.GENERATIVE,
    sample_level_fn=custom_metric,
    corpus_level_fn={
        "accuracy": np.mean,
        "response_length": np.mean,
        "confidence": np.mean,
    },
)
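Conceptually, each entry in corpus_level_fn is applied to the values collected for that metric name across all samples. The following sketch illustrates that idea in plain Python. It is an assumption about the behavior, not Lighteval's actual implementation. `aggregate_grouped` is a hypothetical helper, and `statistics.fmean` stands in for np.mean.

```python
from statistics import fmean


def aggregate_grouped(sample_results, corpus_level_fn):
    """Apply each metric's aggregation function to that metric's per-sample values."""
    return {
        name: fn([result[name] for result in sample_results])
        for name, fn in corpus_level_fn.items()
    }


# Illustrative per-sample outputs, one dictionary per sample.
sample_results = [
    {"accuracy": True, "response_length": 12, "confidence": 0.9},
    {"accuracy": False, "response_length": 30, "confidence": 0.5},
]
corpus_level_fn = {
    "accuracy": fmean,
    "response_length": fmean,
    "confidence": fmean,
}
scores = aggregate_grouped(sample_results, corpus_level_fn)
```

Under this reading, a key that the sample-level function never returns would fail at aggregation time, so the names in the sample-level dictionary and in the grouping should be kept in sync.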
Step 5: Register the Metric
To finish, add the following code so that it adds your metric to our metrics list when loaded as a module:
# Adds the metric to the metric list!
extend_enum(Metrics, "CUSTOM_ACCURACY", my_custom_metric)
if __name__ == "__main__":
    print("Imported metric")
Using Your Custom Metric
With Custom Tasks
After adding your metric to the task config, pass the file to Lighteval with --custom-tasks path_to_your_file at launch.
lighteval accelerate \
    "model_name=openai-community/gpt2" \
    "leaderboard|truthfulqa:mc|0" \
    --custom-tasks path_to_your_metric_file.py
The task config then references the metric:
from lighteval.tasks.lighteval_task import LightevalTaskConfig
task = LightevalTaskConfig(
    name="my_custom_task",
    suite=["community"],
    metric=[my_custom_metric],  # Use your custom metric here
    prompt_function=my_prompt_function,
    hf_repo="my_dataset",
    evaluation_splits=["test"],
)