Siva Sankalp committed
Commit 22294bc · 1 Parent(s): bcfb767

chore: update README with experiment results file format (#7)

Files changed (1):
  1. README.md +209 -0

README.md CHANGED

## Usage

Once you've started the InspectorRAGet application, the next step is to prepare your experiment results file in the prescribed format.

The experiment results file can be broadly split into six sections along functional boundaries. The first section captures general details about the experiment in the `name`, `description`, and `timestamp` fields. The second and third sections describe the sets of models and metrics used in the experiment via the `models` and `metrics` fields, respectively. The last three sections cover the dataset and the outcome of the evaluation experiment in the `documents`, `tasks`, and `evaluations` fields.
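Before walking through each section, the following purely illustrative Python sketch shows how the six sections fit together in one file: it assembles placeholder values drawn from the per-section examples below and writes them out with the standard `json` module. The `timestamp` value shown and the output file name are assumptions, not part of the prescribed format.

```python
# Illustrative sketch only: assembles a minimal results file from placeholder
# values (drawn from the per-section examples below) and writes it to disk.
# The timestamp format and the file name "experiment.json" are assumptions.
import json

experiment = {
    # 1. Metadata
    "name": "Sample experiment name",
    "description": "Sample experiment description",
    "timestamp": "2024-01-01T00:00:00Z",  # assumed ISO 8601; adjust as needed
    # 2. Models
    "models": [
        {"model_id": "model_1", "name": "Model 1", "owner": "Model 1 owner"}
    ],
    # 3. Metrics
    "metrics": [
        {
            "name": "metric_a",
            "display_name": "Metric A",
            "description": "Metric A description",
            "author": "algorithm",
            "type": "numerical",
            "aggregator": "average",
            "range": [0, 1, 0.1],
        }
    ],
    # 4. Documents
    "documents": [
        {"document_id": "GUID 1", "text": "document text 1", "title": "document title 1"}
    ],
    # 5. Tasks (plus the top-level `filters` field)
    "filters": ["category"],
    "tasks": [
        {
            "task_id": "task_1",
            "task_type": "rag",
            "category": "grounded",
            "input": [{"speaker": "user", "text": "Sample user query"}],
            "contexts": [{"document_id": "GUID 1"}],
            "targets": [{"text": "Sample response"}],
        }
    ],
    # 6. Evaluations: one entry per (model, task) pair
    "evaluations": [
        {
            "task_id": "task_1",
            "model_id": "model_1",
            "model_response": "Model response",
            "annotations": {"metric_a": {"system": {"value": 0.5}}},
        }
    ],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)
```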
#### 1. Metadata

```json
{
  "name": "Sample experiment name",
  "description": "Sample experiment description",
  ...
```
#### 2. Models

```json
"models": [
  {
    "model_id": "model_1",
    "name": "Model 1",
    "owner": "Model 1 owner"
  },
  {
    "model_id": "model_2",
    "name": "Model 2",
    "owner": "Model 2 owner"
  }
],
```

Notes:

1. Each model must have a unique `model_id` and `name`.
#### 3. Metrics

```json
"metrics": [
  {
    "name": "metric_a",
    "display_name": "Metric A",
    "description": "Metric A description",
    "author": "algorithm | human",
    "type": "numerical",
    "aggregator": "average",
    "range": [0, 1, 0.1]
  },
  {
    "name": "metric_b",
    "display_name": "Metric B",
    "description": "Metric B description",
    "author": "algorithm | human",
    "type": "categorical",
    "aggregator": "majority | average",
    "values": [
      {
        "value": "value_a",
        "display_value": "A",
        "numeric_value": 1
      },
      {
        "value": "value_b",
        "display_value": "B",
        "numeric_value": 0
      }
    ]
  },
  {
    "name": "metric_c",
    "display_name": "Metric C",
    "description": "Metric C description",
    "author": "algorithm | human",
    "type": "text"
  }
],
```

Notes:

1. Each metric must have a unique `name`.
2. A metric can be of `numerical`, `categorical`, or `text` type.
3. Numerical metrics must specify a `range` field in `[start, end, bin_size]` format.
4. Categorical metrics must specify a `values` field where each value must have `value` and `numeric_value` fields.
5. Text metrics are only accessible in the instance-level view and are not used in any experiment-level aggregate statistics or visual elements.
#### 4. Documents

```json
"documents": [
  {
    "document_id": "GUID 1",
    "text": "document text 1",
    "title": "document title 1"
  },
  {
    "document_id": "GUID 2",
    "text": "document text 2",
    "title": "document title 2"
  },
  {
    "document_id": "GUID 3",
    "text": "document text 3",
    "title": "document title 3"
  }
],
```

Notes:

1. Each document must have a unique `document_id` field.
2. Each document must have a `text` field.
#### 5. Tasks

```json
"filters": ["category"],
"tasks": [
  {
    "task_id": "task_1",
    "task_type": "rag",
    "category": "grounded",
    "input": [
      {
        "speaker": "user",
        "text": "Sample user query"
      }
    ],
    "contexts": [
      {
        "document_id": "GUID 1"
      }
    ],
    "targets": [
      {
        "text": "Sample response"
      }
    ]
  },
  {
    "task_id": "task_2",
    "task_type": "rag",
    "category": "random",
    "input": [
      {
        "speaker": "user",
        "text": "Hello"
      }
    ],
    "contexts": [
      {
        "document_id": "GUID 2"
      }
    ],
    "targets": [
      {
        "text": "How can I help you?"
      }
    ]
  }
],
```

Notes:

1. Each task must have a unique `task_id`.
2. The `task_type` can be `question_answering`, `conversation`, or `rag`.
3. `input` is an array of utterances. An utterance's speaker can be either `user` or `agent`, and each utterance must have a `text` field (see the multi-turn sketch after this list).
4. The `contexts` field references the subset of documents from the `documents` field that is relevant to the `input` and is made available to the generative models.
5. The `targets` field is an array of expected gold or reference texts.
6. `category` is an optional field that represents the type of task and is used for grouping similar tasks.
7. `filters` is a top-level field (parallel to `tasks`) that specifies an array of fields defined inside `tasks` to be used for filtering tasks during analysis.
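The example tasks above contain single-turn inputs. As a purely illustrative sketch of note 3, the fragment below shows what a multi-turn `input` for a `conversation`-type task might look like; the utterance texts are placeholders.

```python
# Hypothetical multi-turn `input` for a `conversation`-type task (note 3):
# utterances alternate between the "user" and "agent" speakers, and each
# utterance carries a `text` field. All texts here are placeholders.
multi_turn_input = [
    {"speaker": "user", "text": "Sample user query"},
    {"speaker": "agent", "text": "Sample agent response"},
    {"speaker": "user", "text": "Sample follow-up user query"},
]
```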
#### 6. Evaluations

```json
"evaluations": [
  {
    "task_id": "task_1 | task_2",
    "model_id": "model_1 | model_2",
    "model_response": "Model response",
    "annotations": {
      "metric_a": {
        "system": {
          "value": 0.233766233766233
        }
      },
      "metric_b": {
        "system": {
          "value": "value_a | value_b"
        }
      },
      "metric_c": {
        "system": {
          "value": "text"
        }
      }
    }
  }
]
```

Notes:

1. The `evaluations` field must contain an evaluation for every model defined in the `models` section on every task in the `tasks` section. Thus, the total number of evaluations equals the number of models (M) times the number of tasks (T), i.e., M x T (see the sanity-check sketch after this list).
2. Each evaluation must be associated with a single task and a single model.
3. Each evaluation must capture the model's prediction on the task in the `model_response` field.
4. The `annotations` field captures ratings of the model's response on the given task for every metric specified in the `metrics` field.
5. Each metric annotation is a dictionary whose keys are worker ids. In the example above, `system` is a worker id.
6. Each worker's annotation for a metric must itself be a dictionary. At a minimum, this dictionary contains a `value` key capturing the worker's rating of the model's response on that metric.
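As a lightweight sanity check for note 1 (not part of InspectorRAGet itself), the sketch below loads a results file and verifies that there is exactly one evaluation per model/task pair before the file is uploaded to the application. The file name `experiment.json` is a placeholder.

```python
# Illustrative sanity check for note 1 (not part of InspectorRAGet itself).
# Assumes the results file is saved locally as "experiment.json" (placeholder name).
import json
from itertools import product

with open("experiment.json") as f:
    experiment = json.load(f)

model_ids = {m["model_id"] for m in experiment["models"]}
task_ids = {t["task_id"] for t in experiment["tasks"]}
covered = {(e["model_id"], e["task_id"]) for e in experiment["evaluations"]}

missing = set(product(model_ids, task_ids)) - covered
assert not missing, f"Missing evaluations for (model_id, task_id) pairs: {sorted(missing)}"

# Exactly M x T evaluations, i.e., no duplicate (model, task) pairs either.
assert len(experiment["evaluations"]) == len(model_ids) * len(task_ids)
```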
## Citation

If you use InspectorRAGet in your research, please cite our paper: