---
title: L3Score
datasets:
  - google/spiqa
tags:
  - evaluate
  - metric
  - semantic-similarity
  - qa
  - llm-eval
description: >
  L3Score is a metric for evaluating the semantic similarity of free-form
  answers in question answering tasks. It uses log-probabilities of "Yes"/"No"
  tokens from a language model acting as a judge. Based on the SPIQA benchmark:
  https://arxiv.org/pdf/2407.09413
sdk: gradio
sdk_version: 5.25.1
app_file: app.py
pinned: false
---

# 🦢 Metric Card: L3Score

## 📌 Description

L3Score evaluates how semantically close a model-generated answer is to a reference answer for a given question. It prompts a language model as a judge using the following format:

```
You are given a question, ground-truth answer, and a candidate answer.

Question: {question}
Ground-truth answer: {gt}
Candidate answer: {answer}

Is the semantic meaning of the ground-truth and candidate answers similar?
Answer in one word - Yes or No.
```

The model's log-probabilities for "Yes" and "No" tokens are used to compute the score.
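This relies on the judge returning top-n token log-probabilities. Below is a minimal sketch of how those log-probs could be obtained with the OpenAI Python SDK; it is illustrative only, not the metric's internal code, and the filled-in prompt and the `logprob_yes`/`logprob_no` names are assumptions.

```python
from openai import OpenAI

# Illustrative sketch (not the metric's internals): ask an OpenAI judge for its
# top-5 token log-probabilities on one filled-in prompt, then pull out the
# log-probs of the "Yes" and "No" tokens.
client = OpenAI(api_key="your-openai-api-key")

prompt = (
    "You are given a question, ground-truth answer, and a candidate answer.\n\n"
    "Question: What is the capital of France?\n"
    "Ground-truth answer: Paris\n"
    "Candidate answer: Paris\n\n"
    "Is the semantic meaning of the ground-truth and candidate answers similar?\n"
    "Answer in one word - Yes or No."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1,
    logprobs=True,   # return log-probabilities for the generated token
    top_logprobs=5,  # plus the 5 most likely alternatives
)

# Top-5 candidates for the single generated token.
top_tokens = response.choices[0].logprobs.content[0].top_logprobs
logprob_yes = next((t.logprob for t in top_tokens if t.token.strip().lower() == "yes"), None)
logprob_no = next((t.logprob for t in top_tokens if t.token.strip().lower() == "no"), None)
```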

## 🧮 Scoring Logic

Let $l_{\text{yes}}$ and $l_{\text{no}}$ be the log-probabilities of "Yes" and "No", respectively.

If neither token is in the top-5:

$$\text{L3Score} = 0$$

If both are present:

$$\text{L3Score} = \frac{\exp(l_{\text{yes}})}{\exp(l_{\text{yes}}) + \exp(l_{\text{no}})}$$

If only one of the two tokens appears in the top-5, the missing token's probability is estimated as the smaller of the remaining probability mass and the probability of the least likely token in the top-5.
See the SPIQA paper for details.
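
A minimal sketch of this per-pair rule, assuming `logprob_yes`/`logprob_no` were extracted as above (or are `None` when absent from the top-5) and `top5_logprobs` holds all five top log-probabilities; this illustrates the description above and is not the module's exact implementation:

```python
import math

def l3score_pair(logprob_yes, logprob_no, top5_logprobs):
    """Per-example L3Score following the rules above (illustrative sketch)."""
    if logprob_yes is None and logprob_no is None:
        # Neither "Yes" nor "No" appears in the top-5 tokens.
        return 0.0
    if logprob_yes is not None and logprob_no is not None:
        p_yes, p_no = math.exp(logprob_yes), math.exp(logprob_no)
        return p_yes / (p_yes + p_no)
    # Only one token present: estimate the missing token's log-probability as
    # the smaller of the remaining probability mass and the least likely
    # token in the top-5 (see the SPIQA paper).
    remaining_mass = max(1.0 - sum(math.exp(lp) for lp in top5_logprobs), 1e-12)
    missing = min(math.log(remaining_mass), min(top5_logprobs))
    present = logprob_yes if logprob_yes is not None else logprob_no
    p_present, p_missing = math.exp(present), math.exp(missing)
    score = p_present / (p_present + p_missing)
    # The score is P(Yes) / (P(Yes) + P(No)); flip it if "Yes" was the missing token.
    return score if logprob_yes is not None else 1.0 - score
```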


## 🚀 How to Use

```python
import evaluate

l3score = evaluate.load("your-username/L3Score")

questions = ["What is the capital of France?", "What is the capital of Germany?"]
predictions = ["Paris", "Moscow"]
references = ["Paris", "Berlin"]

score = l3score.compute(
    questions=questions,
    predictions=predictions,
    references=references,
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini",
)

print(score)
# {'L3Score': 0.49...}
```

## 🔠 Inputs

| Name | Type | Description |
|------|------|-------------|
| `questions` | `list[str]` | The list of input questions. |
| `predictions` | `list[str]` | Answers generated by the model being evaluated. |
| `references` | `list[str]` | Ground-truth or reference answers. |
| `api_key` | `str` | API key for the selected LLM provider. |
| `provider` | `str` | Must support top-n token log-probabilities (currently available: `"openai"`, `"deepseek"`, `"xai"`). |
| `model` | `str` | Name of the judge LLM (e.g., `"gpt-4o-mini"`). |

## 📄 Output

A dictionary with a single key:

{"L3Score": float}

The value is the average score over all (question, prediction, reference) triplets.


## 💡 Examples

```python
import evaluate

l3score = evaluate.load("your-username/L3Score")

score = l3score.compute(
    questions=["What is the capital of France?"],
    predictions=["Paris"],
    references=["Paris"],
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini",
)
# {'L3Score': 0.99...}

score = l3score.compute(
    questions=["What is the capital of Germany?"],
    predictions=["Moscow"],
    references=["Berlin"],
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini",
)
# {'L3Score': 0.00...}
```

## ⚠️ Limitations and Bias

- Requires models that expose top-n token log-probabilities (e.g., OpenAI, DeepSeek, Groq).
- Scores are only comparable when using the same judge model.

## 📖 Citation

```bibtex
@article{pramanick2024spiqa,
  title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
  author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
  journal={arXiv preprint arXiv:2407.09413},
  year={2024}
}
```

## 🔗 Further References