To start InspectorRAGet in production mode, please run the following command.

```shell
yarn start
```

## Usage

Once you've started the InspectorRAGet application, the next step is to prepare your experiment results in the prescribed format.

The experiment result file can be broadly split into six sections along functional boundaries. The first section captures general details about the experiment in the `name`, `description`, and `timestamp` fields. The second and third sections describe the sets of models and metrics used in the experiment via the `models` and `metrics` fields, respectively. The last three sections cover the dataset and the outcome of the evaluation experiment in the form of the `documents`, `tasks`, and `evaluations` fields.

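Taken together, these six sections live in a single JSON object. As a quick orientation, here is a minimal sketch in Python that assembles the top-level skeleton and writes it to disk; the values are placeholders, and the `build_experiment` helper and the ISO-8601 timestamp are our own assumptions rather than anything InspectorRAGet prescribes.

```python
import json
from datetime import datetime, timezone

def build_experiment() -> dict:
    """Assemble the six top-level sections of a results file (placeholder values)."""
    return {
        # 1. Metadata
        "name": "Sample experiment name",
        "description": "Sample example description",
        "timestamp": datetime.now(timezone.utc).isoformat(),  # format is an assumption
        # 2. Models and 3. Metrics
        "models": [],
        "metrics": [],
        # 4. Documents, 5. Tasks (plus optional filters), and 6. Evaluations
        "documents": [],
        "filters": [],
        "tasks": [],
        "evaluations": [],
    }

with open("experiment.json", "w") as f:
    json.dump(build_experiment(), f, indent=2)
```

The subsections below fill in each of these fields in turn.
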
#### 1. Metadata

```json
{
  "name": "Sample experiment name",
  "description": "Sample example description",
  ...
```

#### 2. Models

```json
"models": [
  {
    "model_id": "model_1",
    "name": "Model 1",
    "owner": "Model 1 owner"
  },
  {
    "model_id": "model_2",
    "name": "Model 2",
    "owner": "Model 2 owner"
  }
],
```

Notes:

1. Each model must have a unique `model_id` and `name` (a quick pre-flight check is sketched below).

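Since duplicate ids tend to surface only as confusing behavior later, it can be worth validating them up front. The sketch below is our own illustration, not part of InspectorRAGet; the same `assert_unique` helper applies equally to `document_id` and `task_id` in the later sections.

```python
from collections import Counter

def assert_unique(items: list[dict], key: str) -> None:
    """Raise if any value of `key` appears more than once across `items`."""
    duplicates = [v for v, n in Counter(item[key] for item in items).items() if n > 1]
    if duplicates:
        raise ValueError(f"duplicate {key} values: {duplicates}")

models = [
    {"model_id": "model_1", "name": "Model 1", "owner": "Model 1 owner"},
    {"model_id": "model_2", "name": "Model 2", "owner": "Model 2 owner"},
]
assert_unique(models, "model_id")
assert_unique(models, "name")
```
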
#### 3. Metrics

```json
"metrics": [
  {
    "name": "metric_a",
    "display_name": "Metric A",
    "description": "Metric A description",
    "author": "algorithm | human",
    "type": "numerical",
    "aggregator": "average",
    "range": [0, 1, 0.1]
  },
  {
    "name": "metric_b",
    "display_name": "Metric B",
    "description": "Metric B description",
    "author": "algorithm | human",
    "type": "categorical",
    "aggregator": "majority | average",
    "values": [
      {
        "value": "value_a",
        "display_value": "A",
        "numeric_value": 1
      },
      {
        "value": "value_b",
        "display_value": "B",
        "numeric_value": 0
      }
    ]
  },
  {
    "name": "metric_c",
    "display_name": "Metric C",
    "description": "Metric C description",
    "author": "algorithm | human",
    "type": "text"
  }
],
```

Notes:

1. Each metric must have a unique `name`.
2. A metric can be of `numerical`, `categorical`, or `text` type.
3. Numerical metrics must specify the `range` field in `[start, end, bin_size]` format (illustrated in the sketch after these notes).
4. Categorical metrics must specify the `values` field, where each value must have `value` and `numeric_value` fields.
5. Text metrics are only accessible in the instance-level view and are not used in any experiment-level aggregate statistics or visual elements.

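To make those conventions concrete: `[0, 1, 0.1]` reads as values from 0 to 1 grouped into 0.1-wide bins for aggregate views, and a categorical metric's `numeric_value` gives each label a number so that ratings can be averaged. The following is one plausible reading, sketched by us for illustration rather than taken from InspectorRAGet's actual aggregation code:

```python
def bin_index(value: float, metric_range: list[float]) -> int:
    """Map a numerical metric value to its bin under [start, end, bin_size]."""
    start, end, bin_size = metric_range
    n_bins = round((end - start) / bin_size)
    # Clamp so a value exactly equal to `end` falls into the last bin.
    return min(int((value - start) / bin_size), n_bins - 1)

def average_categorical(ratings: list[str], values: list[dict]) -> float:
    """Average categorical ratings through each value's numeric_value."""
    numeric = {v["value"]: v["numeric_value"] for v in values}
    return sum(numeric[r] for r in ratings) / len(ratings)

values = [
    {"value": "value_a", "display_value": "A", "numeric_value": 1},
    {"value": "value_b", "display_value": "B", "numeric_value": 0},
]
print(bin_index(0.23, [0, 1, 0.1]))                         # -> 2
print(average_categorical(["value_a", "value_b"], values))  # -> 0.5
```
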
#### 4. Documents

```json
"documents": [
  {
    "document_id": "GUID 1",
    "text": "document text 1",
    "title": "document title 1"
  },
  {
    "document_id": "GUID 2",
    "text": "document text 2",
    "title": "document title 2"
  },
  {
    "document_id": "GUID 3",
    "text": "document text 3",
    "title": "document title 3"
  }
],
```

Notes:

1. Each document must have a unique `document_id` field.
2. Each document must have a `text` field.

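Because tasks refer to documents only by `document_id` (see the next section), it is convenient to index the documents once after loading. This usage sketch is our own, not InspectorRAGet code; the indexing doubles as a uniqueness check:

```python
documents = [
    {"document_id": "GUID 1", "text": "document text 1", "title": "document title 1"},
    {"document_id": "GUID 2", "text": "document text 2", "title": "document title 2"},
    {"document_id": "GUID 3", "text": "document text 3", "title": "document title 3"},
]

# A duplicate document_id would silently overwrite an entry, so compare counts.
docs_by_id = {doc["document_id"]: doc for doc in documents}
assert len(docs_by_id) == len(documents), "document_id values must be unique"
assert all("text" in doc for doc in documents), "every document needs a text field"
```
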
#### 5. Tasks

```json
"filters": ["category"],
"tasks": [
  {
    "task_id": "task_1",
    "task_type": "rag",
    "category": "grounded",
    "input": [
      {
        "speaker": "user",
        "text": "Sample user query"
      }
    ],
    "contexts": [
      {
        "document_id": "GUID 1"
      }
    ],
    "targets": [
      {
        "text": "Sample response"
      }
    ]
  },
  {
    "task_id": "task_2",
    "task_type": "rag",
    "category": "random",
    "input": [
      {
        "speaker": "user",
        "text": "Hello"
      }
    ],
    "contexts": [
      {
        "document_id": "GUID 2"
      }
    ],
    "targets": [
      {
        "text": "How can I help you?"
      }
    ]
  }
],
```

Notes:

1. Each task must have a unique `task_id`.
2. A task can be of the `question_answering`, `conversation`, or `rag` type.
3. `input` is an array of utterances. An utterance's speaker can be either `user` or `agent`. Each utterance must have a `text` field.
4. The `contexts` field references the subset of documents from the `documents` field that is relevant to the `input` and available to the generative models (a referential check is sketched after these notes).
5. The `targets` field is an array of expected gold or reference texts.
6. `category` is an optional field that represents the type of task and can be used to group similar tasks.
7. `filters` is a top-level field (parallel to `tasks`) that specifies an array of fields defined inside `tasks` to filter tasks by during analysis.

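Every `document_id` listed under `contexts` must resolve to an entry in `documents`. A small referential-integrity check such as the sketch below (our own hypothetical helper, not InspectorRAGet code) catches dangling references before the file is loaded:

```python
def check_contexts(tasks: list[dict], documents: list[dict]) -> None:
    """Ensure every context document_id in every task refers to a known document."""
    known_ids = {doc["document_id"] for doc in documents}
    for task in tasks:
        for context in task.get("contexts", []):
            if context["document_id"] not in known_ids:
                raise ValueError(
                    f"task {task['task_id']}: unknown document_id {context['document_id']!r}"
                )

documents = [{"document_id": "GUID 1", "text": "document text 1"}]
tasks = [
    {
        "task_id": "task_1",
        "task_type": "rag",
        "category": "grounded",
        "input": [{"speaker": "user", "text": "Sample user query"}],
        "contexts": [{"document_id": "GUID 1"}],
        "targets": [{"text": "Sample response"}],
    }
]
check_contexts(tasks, documents)  # passes; an unknown GUID would raise ValueError
```
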
#### 6. Evaluations

```json
"evaluations": [
  {
    "task_id": "task_1 | task_2",
    "model_id": "model_1 | model_2",
    "model_response": "Model response",
    "annotations": {
      "metric_a": {
        "system": {
          "value": 0.233766233766233
        }
      },
      "metric_b": {
        "system": {
          "value": "value_a | value_b"
        }
      },
      "metric_c": {
        "system": {
          "value": "text"
        }
      }
    }
  }
]
```

Notes:

1. The `evaluations` field must contain an evaluation for every model defined in the `models` section on every task in the `tasks` section. Thus, the total number of evaluations equals the number of models (M) times the number of tasks (T): M × T (see the sketch after these notes).
2. Each evaluation must be associated with a single task and a single model.
3. Each evaluation must capture the model's prediction on the task in the `model_response` field.
4. The `annotations` field captures ratings of the model's response for the given task on every metric specified in the `metrics` field.
5. Each metric annotation is a dictionary whose keys are worker ids. In the example above, `system` is a worker id.
6. Every worker's annotation on every metric must be a dictionary. At minimum, that dictionary contains a `value` key capturing the worker's rating of the model's response on that metric.

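The M × T completeness requirement from note 1 is mechanical to verify. The sketch below is our own helper, not part of InspectorRAGet; it confirms that every (model, task) pair appears exactly once:

```python
from itertools import product

def check_coverage(experiment: dict) -> None:
    """Verify that evaluations cover every (model, task) pair exactly once."""
    expected = set(product(
        (model["model_id"] for model in experiment["models"]),
        (task["task_id"] for task in experiment["tasks"]),
    ))
    seen = [(e["model_id"], e["task_id"]) for e in experiment["evaluations"]]
    missing = expected - set(seen)
    if missing:
        raise ValueError(f"missing evaluations for: {sorted(missing)}")
    if len(seen) != len(expected):
        raise ValueError("duplicate or unexpected evaluations present")

# With M models and T tasks, a valid file has exactly M x T evaluations.
```
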
## Citation

If you use InspectorRAGet in your research, please cite our paper: