---
title: L3Score
datasets:
- google/spiqa
tags:
- evaluate
- metric
- semantic-similarity
- qa
- llm-eval
description: >
  L3Score is a metric for evaluating the semantic similarity of free-form
  answers in question answering tasks. It uses log-probabilities of "Yes"/"No"
  tokens from a language model acting as a judge. Based on the SPIQA benchmark:
  https://arxiv.org/pdf/2407.09413
sdk: gradio
sdk_version: 5.25.1
app_file: app.py
pinned: false
---

# Metric Card: L3Score

## ๐Ÿ“Œ Description

**L3Score** evaluates how semantically close a model-generated answer is to a reference answer for a given question. It prompts a **language model as a judge** using the following format:

```text
You are given a question, ground-truth answer, and a candidate answer.

Question: {question}  
Ground-truth answer: {gt}  
Candidate answer: {answer}

Is the semantic meaning of the ground-truth and candidate answers similar?  
Answer in one word - Yes or No.
```

The model's **log-probabilities** for "Yes" and "No" tokens are used to compute the score.

### ๐Ÿงฎ  Scoring Logic

Let $l_{\text{yes}}$ and $l_{\text{no}}$ be the log-probabilities of "Yes" and "No", respectively.

- If neither token appears among the judge's top-5 predicted tokens:

$$
\text{L3Score} = 0
$$

- If both are present:

$$
\text{L3Score} = \frac{\exp(l_{\text{yes}})}{\exp(l_{\text{yes}}) + \exp(l_{\text{no}})}
$$

- If only one of the two tokens is present, the missing token's probability is estimated as the minimum of:
    - the probability mass remaining outside the top-5 tokens
    - the probability of the least likely top-5 token

The score ranges from 0 to 1, where 1 indicates the highest confidence by the LLM that the predicted and reference answers are semantically equivalent.

See [SPIQA paper](https://arxiv.org/pdf/2407.09413) for details.
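
As an illustration, here is a minimal Python sketch of this scoring rule (an assumption about the logic described above, not the metric's actual implementation). The name `top5` stands for a hypothetical mapping from each of the judge's top-5 tokens to its log-probability:

```python
import math

def l3score_from_top5(top5: dict[str, float]) -> float:
    """Hypothetical sketch of the L3Score rule given top-5 log-probabilities."""
    l_yes, l_no = top5.get("Yes"), top5.get("No")

    # Neither "Yes" nor "No" appears among the top-5 tokens.
    if l_yes is None and l_no is None:
        return 0.0

    # Both tokens present: normalized probability of "Yes".
    if l_yes is not None and l_no is not None:
        p_yes, p_no = math.exp(l_yes), math.exp(l_no)
        return p_yes / (p_yes + p_no)

    # Only one token present: estimate the missing token's probability as the
    # minimum of (a) the mass left outside the top-5 and (b) the probability of
    # the least likely top-5 token.
    probs = [math.exp(l) for l in top5.values()]
    p_missing = min(max(1.0 - sum(probs), 0.0), min(probs))

    if l_yes is not None:  # only "Yes" observed
        p_yes = math.exp(l_yes)
        return p_yes / (p_yes + p_missing)
    p_no = math.exp(l_no)  # only "No" observed
    return p_missing / (p_missing + p_no)
```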

## ๐Ÿš€ How to Use

```python
import evaluate

l3score = evaluate.load("nhop/L3Score")

questions = ["What is the capital of France?", "What is the capital of Germany?"]
predictions = ["Paris", "Moscow"]
references = ["Paris", "Berlin"]

score = l3score.compute(
    questions=questions,
    predictions=predictions,
    references=references,
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini"
)

print(score)
# {'L3Score': 0.49..., 'Cost': ...}
```

---

### ๐Ÿ”  Inputs

| Name         | Type         | Description                                                                 |
|--------------|--------------|-----------------------------------------------------------------------------|
| `questions`  | `list[str]`  | The list of input questions.                                                |
| `predictions`| `list[str]`  | Generated answers by the model being evaluated.                            |
| `references` | `list[str]`  | Ground-truth or reference answers.                                         |
| `api_key`    | `str`        | API key for the selected LLM provider.                                     |
| `provider`   | `str`        | LLM provider; must support top-n token log-probabilities (currently supported: `"openai"`, `"deepseek"`, `"xai"`). |
| `model`      | `str`        | Name of the evaluation LLM (e.g., `"gpt-4o-mini"`).                         |

---

### ๐Ÿ“„ Output

A dictionary with the score and the cost of querying the LLM provider's API:

```python
{"L3Score": float, "Cost": float}
```

`L3Score` is the **average score** over all (question, prediction, reference) triplets; `Cost` is the total cost of all API calls.

---

## ๐Ÿ’ก Examples

```python
import evaluate

l3score = evaluate.load("nhop/L3Score")

score = l3score.compute(
    questions=["What is the capital of France?"],
    predictions=["Paris"],
    references=["Paris"],
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini"
)
# {'L3Score': 0.99..., 'Cost': ...}

score = l3score.compute(
    questions=["What is the capital of Germany?"],
    predictions=["Moscow"],
    references=["Berlin"],
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini"
)
# {'L3Score': 0.00..., 'Cost': ...}
```

---

## โš ๏ธ Limitations and Bias

- Requires a judge model whose API exposes **top-n token log-probabilities** (e.g., OpenAI, DeepSeek, xAI); see the sketch below.
- Scores are **only comparable when using the same judge model**.
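
As an illustration of this requirement (not the metric's internal code, and assuming the OpenAI Python client), top-5 log-probabilities for the judge's single-token reply can be requested like this:

```python
from openai import OpenAI

client = OpenAI(api_key="your-openai-api-key")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "<judge prompt from the Description section>"}],
    max_tokens=1,     # the judge only needs to answer "Yes" or "No"
    logprobs=True,    # return log-probabilities for the sampled token
    top_logprobs=5,   # ...plus the 5 most likely alternative tokens
)

# Map the top-5 candidate tokens of the first output token to their log-probs.
top5 = {
    alt.token: alt.logprob
    for alt in response.choices[0].logprobs.content[0].top_logprobs
}
```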

---

## ๐Ÿ“– Citation

```bibtex
@article{pramanick2024spiqa,
  title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
  author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
  journal={arXiv preprint arXiv:2407.09413},
  year={2024}
}
```

---

## ๐Ÿ”— Further References

- ๐Ÿค— [Dataset on Hugging Face](https://huggingface.co/datasets/google/spiqa)  
- ๐Ÿ™ [GitHub Repository](https://github.com/google/spiqa)  
- ๐Ÿ“„ [SPIQA Paper (arXiv:2407.09413)](https://arxiv.org/pdf/2407.09413)