Spaces:

nhop
/

L3Score

Sleeping

App Files Files Community

L3Score / README.md

Niklas Hoepner

Update README.md

320889f 2 months ago

preview code

raw

history blame contribute delete

4.77 kB

	---
	title: L3Score
	datasets:
	- google/spiqa
	tags:
	- evaluate
	- metric
	- semantic-similarity
	- qa
	- llm-eval
	description: >
	L3Score is a metric for evaluating the semantic similarity of free-form
	answers in question answering tasks. It uses log-probabilities of "Yes"/"No"
	tokens from a language model acting as a judge. Based on the SPIQA benchmark:
	https://arxiv.org/pdf/2407.09413
	sdk: gradio
	sdk_version: 5.25.1
	app_file: app.py
	pinned: false
	---

	# Metric Card: L3Score

	## 📌 Description

	L3Score evaluates how semantically close a model-generated answer is to a reference answer for a given question. It prompts a language model as a judge using the following format:

	```text
	You are given a question, ground-truth answer, and a candidate answer.

	Question: {question}
	Ground-truth answer: {gt}
	Candidate answer: {answer}

	Is the semantic meaning of the ground-truth and candidate answers similar?
	Answer in one word - Yes or No.
	```

	The model's log-probabilities for "Yes" and "No" tokens are used to compute the score.

	### 🧮 Scoring Logic

	Let $l_{\text{yes}} $ and $ l_{\text{no}} $ be the log-probabilities of "Yes" and "No", respectively.

	- If neither token is in the top-5:

	$$
	\text{L3Score} = 0
	$$

	- If both are present:

	$$
	\text{L3Score} = \frac{\exp(l_{\text{yes}})}{\exp(l_{\text{yes}}) + \exp(l_{\text{no}})}
	$$

	- If only one is present, the missing token’s probability is estimated using the minimum of:
	- remaining probability mass apart from the top-5 tokens
	- the least likely top-5 token

	The score ranges from 0 to 1, where 1 indicates the highest confidence by the LLM that the predicted and reference answers are semantically equivalent.

	See [SPIQA paper](https://arxiv.org/pdf/2407.09413) for details.

	## 🚀 How to Use

	```python
	import evaluate

	l3score = evaluate.load("nhop/L3Score")

	questions = ["What is the capital of France?", "What is the capital of Germany?"]
	predictions = ["Paris", "Moscow"]
	references = ["Paris", "Berlin"]

	score = l3score.compute(
	questions=questions,
	predictions=predictions,
	references=references,
	api_key="your-openai-api-key",
	provider="openai",
	model="gpt-4o-mini"
	)

	print(score)
	# {'L3Score': 0.49..., 'Cost':...}
	```

	---

	### 🔠 Inputs

	\| Name \| Type \| Description \|
	\|--------------\|--------------\|-----------------------------------------------------------------------------\|
	\| `questions` \| `list[str]` \| The list of input questions. \|
	\| `predictions`\| `list[str]` \| Generated answers by the model being evaluated. \|
	\| `references` \| `list[str]` \| Ground-truth or reference answers. \|
	\| `api_key` \| `str` \| API key for the selected LLM provider. \|
	\| `provider` \| `str` \| Must support top-n token log-probabilities (currently available: `"openai"`, `"deepseek","xai"`). \|
	\| `model` \| `str` \| Name of the evaluation LLM (e.g., `"gpt-4o-mini"`). \|

	---

	### 📄 Output

	A dictionary with a the score and the cost to query the LLM-provider API:

	```python
	{"L3Score": float, "Cost": float}
	```

	The value is the average score over all (question, prediction, reference) triplets and the total cost of all API calls.

	---

	## 💡 Examples

	```python
	l3score = evaluate.load("nhop/L3Score")

	score = l3score.compute(
	questions=["What is the capital of France?"],
	predictions=["Paris"],
	references=["Paris"],
	api_key="your-openai-api-key",
	provider="openai",
	model="gpt-4o-mini"
	)
	# {'L3Score': 0.99...,'Cost':...}

	score = l3score.compute(
	questions=["What is the capital of Germany?"],
	predictions=["Moscow"],
	references=["Berlin"],
	api_key="your-openai-api-key",
	provider="openai",
	model="gpt-4o-mini"
	)
	# {'L3Score': 0.00...,'Cost':...}
	```

	---

	## ⚠️ Limitations and Bias

	- Requires models that expose top-n token log-probabilities (e.g., OpenAI, DeepSeek, Groq).
	- Scores are only comparable when using the same judge model.

	---

	## 📖 Citation

	```bibtex
	@article{pramanick2024spiqa,
	title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
	author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
	journal={arXiv preprint arXiv:2407.09413},
	year={2024}
	}
	```

	---

	## 🔗 Further References

	- 🤗 [Dataset on Hugging Face](https://huggingface.co/datasets/google/spiqa)
	- 🐙 [GitHub Repository](https://github.com/google/spiqa)
	- 📄 [SPIQA Paper (arXiv:2407.09413)](https://arxiv.org/pdf/2407.09413)