import gradio as gr
import evaluate

l3score = evaluate.load("nhop/L3Score")

def compute_l3score(api_key, provider, model, questions, predictions, references):
    """Parse the newline-separated textbox inputs and delegate scoring to the L3Score metric."""
    try:
        result = l3score.compute(
            questions=[q.strip() for q in questions.split("\n") if q.strip()],
            predictions=[p.strip() for p in predictions.split("\n") if p.strip()],
            references=[r.strip() for r in references.split("\n") if r.strip()],
            api_key=api_key,
            provider=provider,
            model=model
        )
        return result
    except Exception as e:
        # Surface provider/API errors (e.g. invalid key or unsupported model) in the JSON output.
        return {"error": str(e)}

with gr.Blocks() as demo:
    gr.Markdown(r"""
    <h1 align="center"> Metric: L3Score </h1>
    """)


    with gr.Row():
        api_key = gr.Textbox(label="API Key", type="password")
        provider = gr.Dropdown(label="Provider", choices=["openai", "deepseek", "xai"], value="openai")
        model = gr.Textbox(label="Model", value="gpt-4o-mini")

    with gr.Row():
        questions = gr.Textbox(label="Questions (one per line)", lines=4, placeholder="What is the capital of France?")
        predictions = gr.Textbox(label="Predictions (one per line)", lines=4, placeholder="Paris")
        references = gr.Textbox(label="References (one per line)", lines=4, placeholder="Paris")

    compute_button = gr.Button("Compute L3Score")
    output = gr.JSON(label="L3Score Result")

    compute_button.click(
        fn=compute_l3score,
        inputs=[api_key, provider, model, questions, predictions, references],
        outputs=output
    )

    gr.Markdown(r"""

    ## 📌 Description
    **L3Score** evaluates how semantically close a model-generated answer is to a reference answer for a given question. It prompts a **language model as a judge** using:

    ```text
    You are given a question, ground-truth answer, and a candidate answer.
    
    Question: {{question}}  
    Ground-truth answer: {{gt}}  
    Candidate answer: {{answer}}

    Is the semantic meaning of the ground-truth and candidate answers similar?  
    Answer in one word - Yes or No.
    ```

    The model's **log-probabilities** for "Yes" and "No" tokens are used to compute the score.

    ---

    ## 🧮 Scoring Logic

    Let $ l_{\text{yes}} $ and $ l_{\text{no}} $ be the log-probabilities of "Yes" and "No", respectively.

    - If neither token is in the top-5:

    $$
    \text{L3Score} = 0
    $$

    - If both are present:

    $$
    \text{L3Score} = \frac{\exp(l_{\text{yes}})}{\exp(l_{\text{yes}}) + \exp(l_{\text{no}})}
    $$

    - If only one is present, the missing token's probability is estimated as the minimum of (see the worked sketch below this list):
        - the probability mass remaining outside the top-5 tokens
        - the probability of the least likely top-5 token
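
    For intuition, here is a minimal Python sketch of this scoring rule applied to hypothetical top-5 log-probabilities returned by a judge model. The helper `l3score_from_top5` and all numbers are illustrative only, not the metric's actual implementation.

    ```python
    import math

    def l3score_from_top5(top5_logprobs):
        # Toy illustration of the rule above on a dict of top-5 token log-probs.
        l_yes = top5_logprobs.get("Yes")
        l_no = top5_logprobs.get("No")
        if l_yes is None and l_no is None:
            return 0.0  # neither "Yes" nor "No" appears in the top-5
        probs = [math.exp(l) for l in top5_logprobs.values()]
        if l_yes is None or l_no is None:
            # Estimate the missing token: min(remaining mass outside top-5, least likely top-5 token).
            missing = max(min(1.0 - sum(probs), min(probs)), 1e-12)
            if l_yes is None:
                l_yes = math.log(missing)
            else:
                l_no = math.log(missing)
        return math.exp(l_yes) / (math.exp(l_yes) + math.exp(l_no))

    # Hypothetical judge output where "Yes" dominates the top-5:
    print(l3score_from_top5({"Yes": -0.05, "No": -3.5, "Maybe": -5.0, "yes": -6.0, ".": -7.0}))
    ```

    With these illustrative numbers the score is roughly 0.97, i.e. the judge is confident the candidate matches the ground truth.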

    ---

    ## 🚀 How to Use

    ```python
    import evaluate

    l3score = evaluate.load("nhop/L3Score")

    score = l3score.compute(
        questions=["What is the capital of France?"],
        predictions=["Paris"],
        references=["Paris"],
        api_key="your-openai-api-key",
        provider="openai",
        model="gpt-4o-mini"
    )
    print(score)
    # {'L3Score': 0.99...}
    ```

    ---

    ## 🔠 Inputs
    | Name         | Type         | Description                                                                 |
    |--------------|--------------|-----------------------------------------------------------------------------|
    | `questions`  | `list[str]`  | The list of input questions.                                                |
    | `predictions`| `list[str]`  | Generated answers by the model being evaluated.                            |
    | `references` | `list[str]`  | Ground-truth or reference answers.                                         |
    | `api_key`    | `str`        | API key for the selected LLM provider.                                     |
    | `provider`   | `str`        | LLM provider used as judge; must expose top-n token log-probabilities. **Default**: `openai`  |
    | `model`      | `str`        | Name of the judge LLM. **Default**: `gpt-4o-mini`                                             |

    ## 📄 Output
    
    Calling the `compute` method returns a dictionary containing the L3Score:
                
    ```python
    {"L3Score": float}
    ```
    The value is the **average score** over all (question, prediction, reference) triplets.
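
    For example, a call with several triplets still returns a single value, the mean of the per-triplet scores (the printed number below is illustrative):

    ```python
    score = l3score.compute(
        questions=["What is the capital of France?", "What is 2 + 2?"],
        predictions=["Paris", "4"],
        references=["Paris", "Four"],
        api_key="your-openai-api-key",
        provider="openai",
        model="gpt-4o-mini"
    )
    print(score)
    # {'L3Score': 0.98...}  <- average of the two per-triplet scores
    ```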

    ---

    ## ⚠️ Limitations and Bias
    - Requires models that expose **top-n token log-probabilities** (e.g., OpenAI, DeepSeek, Groq); see the sketch below this list.
    - Scores are **only comparable when using the same judge model**.
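
    As a rough sketch of what "exposing top-n log-probabilities" means in practice, the snippet below asks the OpenAI Chat Completions API for the top-5 tokens of a one-word verdict. Parameter names follow the `openai` Python SDK at the time of writing and may differ for other providers; the prompt and printed values are illustrative.

    ```python
    from openai import OpenAI

    client = OpenAI(api_key="your-openai-api-key")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Is Paris the capital of France? Answer in one word - Yes or No."}],
        max_tokens=1,
        logprobs=True,      # return log-probabilities for the sampled tokens
        top_logprobs=5,     # ...and for the 5 most likely alternatives at each position
    )
    top5 = resp.choices[0].logprobs.content[0].top_logprobs
    print({t.token: t.logprob for t in top5})  # e.g. {'Yes': -0.01, 'No': -4.7, ...}
    ```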

    ## 📖 Citation
    ```bibtex
    @article{pramanick2024spiqa,
      title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
      author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
      journal={arXiv preprint arXiv:2407.09413},
      year={2024}
    }
    ```
    """)


demo.launch()