Update README.md
README.md
CHANGED
```diff
@@ -120,6 +120,7 @@ foundation for next-generation language model agents to reason and tackle real-w
 | ***General Assistant***| MultiChallenge | 44.7 | 44.7 | 40.0 | 45.0 | 40.7 | 43.0 | 45.8 | 51.8 | 56.5 |
 
 \* conducted on the text-only HLE subset.
+Our models are evaluated with temperature=1.0, top_p=0.95.
 
 ### SWE-bench methodology
 
 We report results derived from the Agentless scaffold. Departing from the original pipeline, our methodology employs a two-stage localization process (without any embedding-based retrieval mechanisms): initial coarse-grained file localization followed by fine-grained localization to specific files and code elements. The values for our models are calculated on the subset of n=486 verified tasks that work on our infrastructure. The 14 excluded test cases that were incompatible with our internal infrastructure are:
```
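For reference, the decoding settings added above (temperature=1.0, top_p=0.95) map directly onto the standard sampling parameters of an OpenAI-compatible endpoint. The snippet below is a minimal sketch under that assumption; the base URL, API key, and model name are placeholders, not part of this repository or its evaluation harness.

```python
from openai import OpenAI

# Minimal sketch: query an OpenAI-compatible endpoint with the decoding
# settings quoted in the README. base_url, api_key, and model are
# placeholders; substitute whatever serving stack you actually run.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="your-model-name",  # placeholder model identifier
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=1.0,          # evaluation setting from the README
    top_p=0.95,               # evaluation setting from the README
)
print(response.choices[0].message.content)
```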
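To make the two-stage localization concrete, here is an illustrative sketch of the idea described in the SWE-bench methodology paragraph: coarse-grained file selection first, then fine-grained selection of code elements inside the chosen files, with no embedding-based retrieval. This is not the Agentless scaffold or our internal pipeline; the `llm` and `read_file` callables, the prompts, and the data shapes are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Localization:
    files: list[str]                                # stage 1: candidate files
    elements: dict[str, list[str]] = field(default_factory=dict)  # stage 2: file -> code elements

def localize(issue: str, file_tree: list[str], read_file, llm) -> Localization:
    """Illustrative two-stage localization: pick suspicious files from the
    repository layout, then pick the functions/classes to edit inside them.
    Both stages are plain LLM prompts; no embedding-based retrieval is used."""
    # Stage 1: show the repository layout and ask for suspicious files.
    stage1 = (
        f"Issue:\n{issue}\n\nRepository files:\n" + "\n".join(file_tree)
        + "\n\nList the files most likely to need changes, one path per line."
    )
    known = set(file_tree)
    files = [line.strip() for line in llm(stage1).splitlines() if line.strip() in known]

    # Stage 2: for each candidate file, ask for the specific code elements.
    result = Localization(files=files)
    for path in files:
        stage2 = (
            f"Issue:\n{issue}\n\nContents of {path}:\n{read_file(path)}"
            + "\n\nList the functions or classes that must be edited, one per line."
        )
        result.elements[path] = [e.strip() for e in llm(stage2).splitlines() if e.strip()]
    return result
```

In an end-to-end scaffold the stage-2 output would then drive patch generation and test-based validation; this sketch stops at localization.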