Niklas Hoepner committed
Commit 7e0c731 · 1 Parent(s): 0ca5bff

Fixed username in examples

Files changed (2)
  1. README.md +11 -8
  2. app.py +42 -7
README.md CHANGED
@@ -40,31 +40,34 @@ The model's **log-probabilities** for "Yes" and "No" tokens are used to compute

  ### 🧮 Scoring Logic

- Let $ l_{\text{yes}}$ and $ l_{\text{no}}$ be the log-probabilities of "Yes" and "No", respectively.
+ Let $l_{\text{yes}} $ and $ l_{\text{no}} $ be the log-probabilities of "Yes" and "No", respectively.

- If neither token is in the top-5:
+ - If neither token is in the top-5:

  $$
  \text{L3Score} = 0
  $$

- If both are present:
+ - If both are present:

  $$
  \text{L3Score} = \frac{\exp(l_{\text{yes}})}{\exp(l_{\text{yes}}) + \exp(l_{\text{no}})}
  $$

- If only one is present, the missing token’s probability is estimated using the minimum of the remaining mass or the least likely token in top-5.
- See [SPIQA paper](https://arxiv.org/pdf/2407.09413) for details.
+ - If only one is present, the missing token’s probability is estimated using the minimum of:
+ - remaining probability mass apart from the top-5 tokens
+ - the least likely top-5 token

- ---
+ The score ranges from 0 to 1, where 1 indicates the highest confidence by the LLM that the predicted and reference answers are semantically equivalent.
+
+ See [SPIQA paper](https://arxiv.org/pdf/2407.09413) for details.

  ## 🚀 How to Use

  ```python
  import evaluate

- l3score = evaluate.load("your-username/L3Score")
+ l3score = evaluate.load("nhop/L3Score")

  questions = ["What is the capital of France?", "What is the capital of Germany?"]
  predictions = ["Paris", "Moscow"]
@@ -113,7 +116,7 @@ The value is the **average score** over all (question, prediction, reference) triplets.
  ## 💡 Examples

  ```python
- l3score = evaluate.load("your-username/L3Score")
+ l3score = evaluate.load("nhop/L3Score")

  score = l3score.compute(
      questions=["What is the capital of France?"],
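Note on the fallback rule described in the updated README: when only one of "Yes"/"No" appears in the judge model's top-5, the missing token's probability is bounded by the minimum of the leftover probability mass and the least likely top-5 token. A minimal sketch of that rule in plain Python, assuming the caller has already collected the top-5 tokens and their log-probabilities; the function name, the exact token matching, and the input format are illustrative, not the metric's actual implementation:

```python
import math

def l3_score_from_top_logprobs(top_logprobs: dict[str, float]) -> float:
    """Illustrative sketch of the L3Score rule (not the metric's real code).

    `top_logprobs` maps the judge model's top-5 tokens to log-probabilities
    (an assumption about how the caller collects them).
    """
    l_yes = top_logprobs.get("Yes")
    l_no = top_logprobs.get("No")

    # Neither "Yes" nor "No" in the top-5: score is 0.
    if l_yes is None and l_no is None:
        return 0.0

    # Only one of the two is present: estimate the missing log-probability
    # as the smaller of (a) the probability mass left outside the top-5 and
    # (b) the least likely top-5 token.
    if l_yes is None or l_no is None:
        remaining_mass = max(1.0 - sum(math.exp(l) for l in top_logprobs.values()), 1e-12)
        estimate = min(math.log(remaining_mass), min(top_logprobs.values()))
        l_yes = estimate if l_yes is None else l_yes
        l_no = estimate if l_no is None else l_no

    # Both present: softmax over the two candidates.
    return math.exp(l_yes) / (math.exp(l_yes) + math.exp(l_no))


# Example: the judge strongly prefers "Yes".
print(l3_score_from_top_logprobs({"Yes": -0.05, "No": -4.0, ".": -5.0}))  # ≈ 0.98
```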
app.py CHANGED
@@ -19,7 +19,7 @@ def compute_l3score(api_key, provider, model, questions, predictions, references

  with gr.Blocks() as demo:
      gr.Markdown(r"""
- <h1 align="center"> Metric: L3Score </h1>
+ # Metric: L3Score
  """)


@@ -64,7 +64,7 @@ with gr.Blocks() as demo:

  ## 🧮 Scoring Logic

- Let $ l_{\text{yes}} $ and $ l_{\text{no}} $ be the log-probabilities of "Yes" and "No", respectively.
+ Let $l_{\text{yes}} $ and $ l_{\text{no}} $ be the log-probabilities of "Yes" and "No", respectively.

  - If neither token is in the top-5:

@@ -82,6 +82,9 @@ with gr.Blocks() as demo:
  - remaining probability mass apart from the top-5 tokens
  - the least likely top-5 token

+ The score ranges from 0 to 1, where 1 indicates the highest confidence by the LLM that the predicted and reference answers are semantically equivalent.
+
+ See [SPIQA paper](https://arxiv.org/pdf/2407.09413) for details.
  ---

  ## 🚀 How to Use
@@ -89,18 +92,23 @@
  ```python
  import evaluate

- l3score = evaluate.load("your-username/L3Score")
+ l3score = evaluate.load("nhop/L3Score")
+
+ questions = ["What is the capital of France?", "What is the capital of Germany?"]
+ predictions = ["Paris", "Moscow"]
+ references = ["Paris", "Berlin"]

  score = l3score.compute(
-     questions=["What is the capital of France?"],
-     predictions=["Paris"],
-     references=["Paris"],
+     questions=questions,
+     predictions=predictions,
+     references=references,
      api_key="your-openai-api-key",
      provider="openai",
      model="gpt-4o-mini"
  )
+
  print(score)
- # {'L3Score': 0.99...}
+ # {'L3Score': 0.49...}
  ```

  ---
@@ -125,6 +133,33 @@
  The value is the **average score** over all (question, prediction, reference) triplets.

  ---
+
+ ## 📊 Example
+
+ ```python
+ l3score = evaluate.load("nhop/L3Score")
+
+ score = l3score.compute(
+     questions=["What is the capital of France?"],
+     predictions=["Paris"],
+     references=["Paris"],
+     api_key="your-openai-api-key",
+     provider="openai",
+     model="gpt-4o-mini"
+ )
+ # {'L3Score': 0.99...}
+
+ score = l3score.compute(
+     questions=["What is the capital of Germany?"],
+     predictions=["Moscow"],
+     references=["Berlin"],
+     api_key="your-openai-api-key",
+     provider="openai",
+     model="gpt-4o-mini"
+ )
+ # {'L3Score': 0.00...}
+ ```
+ ---

  ## ⚠️ Limitations and Bias
  - Requires models that expose **top-n token log-probabilities** (e.g., OpenAI, DeepSeek, Groq).
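Note on that limitation: the metric needs a judge model that returns top-n token log-probabilities. A minimal sketch of how one might fetch the top-5 first-token log-probabilities from an OpenAI-compatible endpoint, assuming the OpenAI Python SDK (v1+); the prompt wording below is a placeholder, not the exact SPIQA judge prompt, and this is not the metric's internal code:

```python
from openai import OpenAI

client = OpenAI(api_key="your-openai-api-key")

# Placeholder judge prompt (assumption, not the SPIQA wording).
prompt = (
    "Question: What is the capital of France? "
    "Ground-truth answer: Paris. Candidate answer: Paris. "
    "Is the candidate answer correct? Answer Yes or No."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1,
    logprobs=True,     # return token log-probabilities
    top_logprobs=5,    # ... for the 5 most likely tokens at each position
)

# Map each top-5 candidate for the first generated token to its log-probability;
# these are the l_yes / l_no values the scoring rule operates on.
first_token = response.choices[0].logprobs.content[0]
top5 = {t.token: t.logprob for t in first_token.top_logprobs}
print(top5)
```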