realolipop committed on
Commit 2d768f9 · verified · 1 Parent(s): 759c8f4

Update README.md

Files changed (1)
  1. README.md +25 -0
README.md CHANGED
@@ -153,6 +153,31 @@ foundation for next-generation language model agents to reason and tackle real-w
 
  \* conducted on the text-only HLE subset.
 
+ ### SWE-bench methodology
+ We report results obtained with the Agentless scaffold. Unlike the original pipeline, we use a two-stage localization process without any embedding-based retrieval: coarse-grained file localization first, then fine-grained localization to specific files and code elements. Scores for our models are computed on the subset of n=486 Verified tasks that run on our infrastructure. The 14 excluded tasks, which were incompatible with our internal infrastructure, are:
+ "astropy__astropy-7606",
+ "astropy__astropy-8707",
+ "astropy__astropy-8872",
+ "django__django-10097",
+ "matplotlib__matplotlib-20488",
+ "psf__requests-2317",
+ "psf__requests-2931",
+ "psf__requests-5414",
+ "pylint-dev__pylint-6528",
+ "pylint-dev__pylint-7277",
+ "sphinx-doc__sphinx-10435",
+ "sphinx-doc__sphinx-7985",
+ "sphinx-doc__sphinx-8269",
+ "sphinx-doc__sphinx-8475"
+
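As a rough illustration of how a score on the reduced subset could be computed, the sketch below drops the excluded instance IDs before taking the resolved fraction. The function name `resolved_rate` and the shape of the `results` mapping (instance ID to a resolved flag) are assumptions made for this example; they are not taken from the evaluation harness actually used.

```python
# Hypothetical sketch: score only the tasks that are not in the exclusion list.
EXCLUDED_IDS = frozenset({
    "astropy__astropy-7606",
    "astropy__astropy-8707",
    "astropy__astropy-8872",
    "django__django-10097",
    "matplotlib__matplotlib-20488",
    "psf__requests-2317",
    "psf__requests-2931",
    "psf__requests-5414",
    "pylint-dev__pylint-6528",
    "pylint-dev__pylint-7277",
    "sphinx-doc__sphinx-10435",
    "sphinx-doc__sphinx-7985",
    "sphinx-doc__sphinx-8269",
    "sphinx-doc__sphinx-8475",
})


def resolved_rate(results, excluded=EXCLUDED_IDS):
    """Return the fraction of resolved tasks among the non-excluded ones.

    `results` maps an instance ID to True (resolved) or False (unresolved).
    This input format is an assumption for illustration only.
    """
    kept = [ok for task_id, ok in results.items() if task_id not in excluded]
    if not kept:
        raise ValueError("no tasks left after applying the exclusion list")
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Made-up results: two scored tasks plus one excluded task that is ignored.
    demo = {
        "example__task-1": True,
        "example__task-2": False,
        "astropy__astropy-7606": False,
    }
    print(resolved_rate(demo))
```

Excluded tasks are removed from the denominator here, which matches reporting on n=486 rather than counting them as failures.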
+ ### TAU-bench methodology
+ For TAU-bench, we report the average pass rate over 5 samples per query, using GPT-4.1 as the user model and no custom tools. The maximum number of interaction steps is 30.
+ We prepend the following general principles to the policy prompt:
+ #### General
+ - In each round, you need to carefully examine the tools provided to you to determine if any can be used.
+ - You must adhere to all of the policies. Pay attention to the details in the terms. Solutions for most situations can be found within these policies.
+
+
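The TAU-bench setup above can be sketched in a few lines: prepend the general principles to the policy prompt, then average the per-query pass rates over 5 samples. The names `build_policy_prompt` and `average_pass_rate`, the blank-line separator, and the `outcomes` format are assumptions for illustration; the exact way the text is joined to the policy is not specified in the README.

```python
# Hypothetical sketch of the prompt prefix and the averaged pass rate.
GENERAL_PRINCIPLES = (
    "#### General\n"
    "- In each round, you need to carefully examine the tools provided to you "
    "to determine if any can be used.\n"
    "- You must adhere to all of the policies. Pay attention to the details in "
    "the terms. Solutions for most situations can be found within these policies.\n"
)

SAMPLES_PER_QUERY = 5  # from the README: 5 samples for each query


def build_policy_prompt(policy):
    """Prepend the general principles to a policy prompt.

    Joining with a blank line is an assumption, not a documented detail.
    """
    return GENERAL_PRINCIPLES + "\n" + policy


def average_pass_rate(outcomes, samples_per_query=SAMPLES_PER_QUERY):
    """Average the per-query pass rates.

    `outcomes` maps a query ID to a list of booleans, one per sampled run.
    """
    if not outcomes:
        raise ValueError("no queries to score")
    per_query = []
    for query_id, runs in outcomes.items():
        if len(runs) != samples_per_query:
            raise ValueError(f"{query_id}: expected {samples_per_query} samples")
        per_query.append(sum(runs) / samples_per_query)
    return sum(per_query) / len(per_query)


if __name__ == "__main__":
    # Made-up outcomes for two queries.
    demo = {
        "q1": [True, True, True, True, True],
        "q2": [True, False, False, False, False],
    }
    print(average_pass_rate(demo))
```

Because every query has the same number of samples, averaging per-query rates gives the same value as pooling all runs.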
  ## 3. Deployment Guide
 
  Download the model from HuggingFace repository: