Space: nhop/L3Score

Niklas Hoepner committed 12 days ago

Commit 7e0c731 (1 parent: 0ca5bff)

Fixed username in examples


Files changed (2)

README.md +11 -8
app.py +42 -7

README.md CHANGED

@@ -40,31 +40,34 @@ The model's **log-probabilities** for "Yes" and "No" tokens are used to compute
 ### 🧮  Scoring Logic
-Let $ l_{\text{yes}}$ and $ l_{\text{no}}$ be the log-probabilities of "Yes" and "No", respectively.
-If neither token is in the top-5:
 $$
 \text{L3Score} = 0
 $$
-If both are present:
 $$
 \text{L3Score} = \frac{\exp(l_{\text{yes}})}{\exp(l_{\text{yes}}) + \exp(l_{\text{no}})}
 $$
-If only one is present, the missing token’s probability is estimated using the minimum of the remaining mass or the least likely token in top-5.
-See [SPIQA paper](https://arxiv.org/pdf/2407.09413) for details.
----
 ## 🚀 How to Use
 ```python
 import evaluate
-l3score = evaluate.load("your-username/L3Score")
 questions = ["What is the capital of France?", "What is the capital of Germany?"]
 predictions = ["Paris", "Moscow"]
@@ -113,7 +116,7 @@ The value is the **average score** over all (question, prediction, reference) tr
 ## 💡 Examples
 ```python
-l3score = evaluate.load("your-username/L3Score")
 score = l3score.compute(
     questions=["What is the capital of France?"],

 ### 🧮  Scoring Logic
+Let $l_{\text{yes}} $ and $ l_{\text{no}} $ be the log-probabilities of "Yes" and "No", respectively.
+- If neither token is in the top-5:
 $$
 \text{L3Score} = 0
 $$
+- If both are present:
 $$
 \text{L3Score} = \frac{\exp(l_{\text{yes}})}{\exp(l_{\text{yes}}) + \exp(l_{\text{no}})}
 $$
+- If only one is present, the missing token’s probability is estimated using the minimum of:
+    - remaining probability mass apart from the top-5 tokens
+    - the least likely top-5 token
+The score ranges from 0 to 1, where 1 indicates the highest confidence by the LLM that the predicted and reference answers are semantically equivalent.
+See [SPIQA paper](https://arxiv.org/pdf/2407.09413) for details.
 ## 🚀 How to Use
 ```python
 import evaluate
+l3score = evaluate.load("nhop/L3Score")
 questions = ["What is the capital of France?", "What is the capital of Germany?"]
 predictions = ["Paris", "Moscow"]
 ## 💡 Examples
 ```python
+l3score = evaluate.load("nhop/L3Score")
 score = l3score.compute(
     questions=["What is the capital of France?"],

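The scoring rules in the README hunk above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in this repository: the function name `l3score_from_top_logprobs` and the dict input format are hypothetical, the token strings are assumed to be exactly "Yes" and "No", and when one token is missing its estimated probability is assumed to be plugged into the same ratio used when both are present.

```python
import math


def l3score_from_top_logprobs(top_logprobs):
    """Hypothetical helper: score one answer from its top-5 token log-probabilities.

    top_logprobs maps each top-5 token string to its log-probability.
    """
    # Convert log-probabilities to probabilities.
    probs = {token: math.exp(lp) for token, lp in top_logprobs.items()}
    p_yes = probs.get("Yes")
    p_no = probs.get("No")

    # Case 1: neither token is in the top-5.
    if p_yes is None and p_no is None:
        return 0.0

    # Case 2: both tokens are present.
    if p_yes is not None and p_no is not None:
        return p_yes / (p_yes + p_no)

    # Case 3: only one token is present. Estimate the missing one as the
    # minimum of (a) the probability mass outside the top-5 and
    # (b) the least likely top-5 token.
    remaining_mass = max(0.0, 1.0 - sum(probs.values()))
    estimate = min(remaining_mass, min(probs.values()))
    if p_yes is None:
        p_yes = estimate
    else:
        p_no = estimate
    # Assumption: the estimate enters the same ratio as in case 2.
    return p_yes / (p_yes + p_no)


# Illustrative input: "Yes" dominates and "No" is absent from the top-5.
example = {
    "Yes": math.log(0.90),
    "yes": math.log(0.04),
    "Sure": math.log(0.03),
    "Correct": math.log(0.02),
    "True": math.log(0.005),
}
print(l3score_from_top_logprobs(example))
```

In the illustrative input, the missing "No" probability is capped by both the leftover mass and the weakest top-5 token, so the score stays close to 1.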
app.py CHANGED

@@ -19,7 +19,7 @@ def compute_l3score(api_key, provider, model, questions, predictions, references
 with gr.Blocks() as demo:
     gr.Markdown(r"""
-    <h1 align="center"> Metric: L3Score </h1>
     """)
@@ -64,7 +64,7 @@ with gr.Blocks() as demo:
     ## 🧮 Scoring Logic
-    Let $ l_{\text{yes}} $ and $ l_{\text{no}} $ be the log-probabilities of "Yes" and "No", respectively.
     - If neither token is in the top-5:
@@ -82,6 +82,9 @@ with gr.Blocks() as demo:
         - remaining probability mass apart from the top-5 tokens
         - the least likely top-5 token
     ---
     ## 🚀 How to Use
@@ -89,18 +92,23 @@ with gr.Blocks() as demo:
     ```python
     import evaluate
-    l3score = evaluate.load("your-username/L3Score")
     score = l3score.compute(
-        questions=["What is the capital of France?"],
-        predictions=["Paris"],
-        references=["Paris"],
         api_key="your-openai-api-key",
         provider="openai",
         model="gpt-4o-mini"
     )
     print(score)
-    # {'L3Score': 0.99...}
     ```
     ---
@@ -125,6 +133,33 @@ with gr.Blocks() as demo:
     The value is the **average score** over all (question, prediction, reference) triplets.
     ---
     ## ⚠️ Limitations and Bias
     - Requires models that expose **top-n token log-probabilities** (e.g., OpenAI, DeepSeek, Groq).

 with gr.Blocks() as demo:
     gr.Markdown(r"""
+    # Metric: L3Score
     """)
     ## 🧮 Scoring Logic
+    Let $l_{\text{yes}} $ and $ l_{\text{no}} $ be the log-probabilities of "Yes" and "No", respectively.
     - If neither token is in the top-5:
         - remaining probability mass apart from the top-5 tokens
         - the least likely top-5 token
+    The score ranges from 0 to 1, where 1 indicates the highest confidence by the LLM that the predicted and reference answers are semantically equivalent.
+    See [SPIQA paper](https://arxiv.org/pdf/2407.09413) for details.
     ---
     ## 🚀 How to Use
     ```python
     import evaluate
+    l3score = evaluate.load("nhop/L3Score")
+    questions = ["What is the capital of France?", "What is the capital of Germany?"]
+    predictions = ["Paris", "Moscow"]
+    references = ["Paris", "Berlin"]
     score = l3score.compute(
+        questions=questions,
+        predictions=predictions,
+        references=references,
         api_key="your-openai-api-key",
         provider="openai",
         model="gpt-4o-mini"
     )
     print(score)
+    # {'L3Score': 0.49...}
     ```
     ---
     The value is the **average score** over all (question, prediction, reference) triplets.
     ---
+    ## 📊 Example
+    ```python
+    l3score = evaluate.load("nhop/L3Score")
+    score = l3score.compute(
+        questions=["What is the capital of France?"],
+        predictions=["Paris"],
+        references=["Paris"],
+        api_key="your-openai-api-key",
+        provider="openai",
+        model="gpt-4o-mini"
+    )
+    # {'L3Score': 0.99...}
+    score = l3score.compute(
+        questions=["What is the capital of Germany?"],
+        predictions=["Moscow"],
+        references=["Berlin"],
+        api_key="your-openai-api-key",
+        provider="openai",
+        model="gpt-4o-mini"
+    )
+    # {'L3Score': 0.00...}
+    ```
+    ---
     ## ⚠️ Limitations and Bias
     - Requires models that expose **top-n token log-probabilities** (e.g., OpenAI, DeepSeek, Groq).
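
The `0.49...` output in the updated How to Use snippet is consistent with averaging the two single-triplet results shown in the new Example section (`0.99...` and `0.00...`). A small sketch of that aggregation, assuming a plain arithmetic mean; the helper `aggregate_l3score` and the rounded input values are illustrative and not part of the package:

```python
from statistics import mean


def aggregate_l3score(per_triplet_scores):
    """Hypothetical helper: average per-triplet scores into the reported dict."""
    if not per_triplet_scores:
        raise ValueError("need at least one (question, prediction, reference) score")
    # Assumption: the reported value is the arithmetic mean over all triplets.
    return {"L3Score": mean(per_triplet_scores)}


# Rounded stand-ins for the France/Paris and Germany/Moscow examples.
print(aggregate_l3score([0.99, 0.00]))
```

Actual values depend on the model's log-probabilities, so the per-triplet inputs here are only placeholders.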