Update README.md
README.md
CHANGED
@@ -153,6 +153,31 @@ foundation for next-generation language model agents to reason and tackle real-w
\* conducted on the text-only HLE subset.

### SWE-bench methodology
We report results obtained with the Agentless scaffold. Unlike the original pipeline, our setup uses a two-stage localization process without any embedding-based retrieval: coarse-grained file localization first, followed by fine-grained localization to specific files and code elements. Scores for our models are computed on the subset of n=486 SWE-bench Verified tasks that run on our infrastructure. The 14 excluded test cases, which were incompatible with our internal infrastructure, are:
"astropy__astropy-7606",
"astropy__astropy-8707",
"astropy__astropy-8872",
"django__django-10097",
"matplotlib__matplotlib-20488",
"psf__requests-2317",
"psf__requests-2931",
"psf__requests-5414",
"pylint-dev__pylint-6528",
"pylint-dev__pylint-7277",
"sphinx-doc__sphinx-10435",
"sphinx-doc__sphinx-7985",
"sphinx-doc__sphinx-8269",
"sphinx-doc__sphinx-8475"

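The subset selection described in this section can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: `filter_supported`, `resolved_rate`, and the placeholder task IDs are hypothetical. The only facts taken from the text are the 14 excluded instance IDs and the resulting subset size of 486.

```python
# Instance IDs excluded from evaluation (copied from the list in this section).
EXCLUDED_IDS = frozenset({
    "astropy__astropy-7606",
    "astropy__astropy-8707",
    "astropy__astropy-8872",
    "django__django-10097",
    "matplotlib__matplotlib-20488",
    "psf__requests-2317",
    "psf__requests-2931",
    "psf__requests-5414",
    "pylint-dev__pylint-6528",
    "pylint-dev__pylint-7277",
    "sphinx-doc__sphinx-10435",
    "sphinx-doc__sphinx-7985",
    "sphinx-doc__sphinx-8269",
    "sphinx-doc__sphinx-8475",
})


def filter_supported(instance_ids):
    """Return the IDs that are not on the exclusion list, preserving order."""
    return [iid for iid in instance_ids if iid not in EXCLUDED_IDS]


def resolved_rate(resolved_ids, evaluated_ids):
    """Fraction of evaluated tasks that were resolved (hypothetical helper)."""
    evaluated = set(evaluated_ids)
    if not evaluated:
        raise ValueError("no tasks to evaluate")
    return len(evaluated & set(resolved_ids)) / len(evaluated)


if __name__ == "__main__":
    # Placeholder IDs stand in for the 486 compatible tasks.
    all_ids = sorted(EXCLUDED_IDS) + [f"placeholder__task-{i}" for i in range(486)]
    kept = filter_supported(all_ids)
    print(len(all_ids), len(kept))  # 500 486
```

Under this reading, the denominator of the reported score is the 486 retained tasks rather than the full task set, which matters when comparing against numbers reported on all tasks.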
### TAU-bench methodology
We evaluate TAU-bench using the average pass rate over 5 samples per query, with GPT-4.1 as the user model and without any custom tools. The maximum number of interaction steps is 30.
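
One plausible reading of this metric is: run each query 5 times, compute the fraction of passing runs per query, then average those fractions over all queries. The sketch below assumes that reading. `average_pass_rate` and the input layout (a dict mapping query IDs to per-run booleans) are hypothetical and not part of TAU-bench itself.

```python
from statistics import mean

NUM_SAMPLES = 5  # samples per query, as stated in this section


def average_pass_rate(outcomes):
    """Average per-query pass rate.

    `outcomes` maps a query ID to a list of NUM_SAMPLES booleans,
    one per independent run (True means the run passed).
    """
    if not outcomes:
        raise ValueError("no queries to score")
    per_query = []
    for query_id, runs in outcomes.items():
        if len(runs) != NUM_SAMPLES:
            raise ValueError(
                f"{query_id}: expected {NUM_SAMPLES} runs, got {len(runs)}"
            )
        per_query.append(sum(runs) / NUM_SAMPLES)
    return mean(per_query)


if __name__ == "__main__":
    demo = {
        "query-a": [True, True, False, True, True],     # 4 of 5 runs pass
        "query-b": [False, False, True, False, False],  # 1 of 5 runs pass
    }
    print(average_pass_rate(demo))
```

Because every query has the same number of samples, this equals the overall fraction of passing runs across all queries.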

We prepend a general principle to the policy prompt.
#### General
- In each round, you need to carefully examine the tools provided to you to determine if any can be used.
- You must adhere to all of the policies. Pay attention to the details in the terms. Solutions for most situations can be found within these policies.
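
Prepending this text amounts to placing it ahead of the domain policy in the prompt. A minimal sketch, assuming the policy is available as a plain string: `build_system_prompt` and the example policy text are hypothetical, and the separator actually used in the authors' setup is not stated.

```python
# General principles text, copied verbatim from this section.
GENERAL_PRINCIPLES = """#### General
- In each round, you need to carefully examine the tools provided to you to determine if any can be used.
- You must adhere to all of the policies. Pay attention to the details in the terms. Solutions for most situations can be found within these policies."""


def build_system_prompt(policy: str) -> str:
    """Place the general principles ahead of the domain policy text.

    A blank line is used as the separator; this is an assumption.
    """
    return f"{GENERAL_PRINCIPLES}\n\n{policy.strip()}"


if __name__ == "__main__":
    # Hypothetical policy text for illustration only.
    print(build_system_prompt("# Example policy\nAlways confirm before booking."))
```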
## 3. Deployment Guide

Download the model from the HuggingFace repository: