# DiscoveryBench with OpenHands
DiscoveryBench (Paper) contains 264 tasks collected across 6 diverse domains, such as biology, economics, and sociology. It incorporates discovery workflows from published papers to approximate the real-world challenges faced by researchers.
## Setup Environment and LLM Configuration
Please follow the instructions mentioned here to set up the OpenHands development environment and configure your LLMs locally.
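For reference, OpenHands reads LLM settings from named groups in `config.toml`. The sketch below is a minimal, hypothetical example: the group name `eval_gpt4o` and all values are placeholders, and the exact set of supported keys depends on your OpenHands version.

```bash
# Minimal sketch: append a hypothetical LLM config group to config.toml.
# The group name (eval_gpt4o) and all values are placeholders.
cat >> config.toml <<'EOF'
[llm.eval_gpt4o]
model = "gpt-4o"
api_key = "your-api-key"
temperature = 0.0
EOF
```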
## Execute the Bash Script to Start the DiscoveryBench Evaluation
```bash
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh [YOUR MODEL CONFIG]
```
Replace `[YOUR MODEL CONFIG]` with any model that you have set up in `config.toml`, as in the example below.
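For instance, assuming the hypothetical `llm.eval_gpt4o` group from the configuration sketch above:

```bash
# Run the evaluation with the LLM group defined under [llm.eval_gpt4o] in config.toml
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh llm.eval_gpt4o
```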
## Run Inference on DiscoveryBench Instances
When the `run_infer.sh` script is started, it automatically pulls the latest DiscoveryBench instances and sets up the agent environment. The OpenHands agent is invoked to process each task within this environment, producing a hypothesis, which we then evaluate against the "gold" hypothesis provided by DiscoveryBench. The evaluation result, along with the agent chat history, is logged to `output.jsonl` under `evaluation_outputs`.
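To inspect the results, you can locate and pretty-print records from that file. The exact subdirectory under `evaluation_outputs` depends on your model, agent, and run settings, so the sketch below simply searches for it:

```bash
# Locate the results file (exact subdirectory depends on model/agent/run settings)
find evaluation_outputs -name output.jsonl

# Each line of output.jsonl is one instance's JSON record; pretty-print the first
head -n 1 "$(find evaluation_outputs -name output.jsonl | head -n 1)" | python3 -m json.tool
```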
The full `run_infer.sh` invocation accepts the following positional arguments; a complete example follows the list below.

```bash
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh [MODEL_CONFIG] [GIT_COMMIT] [AGENT] [EVAL_LIMIT] [NUM_WORKERS]
```
- `MODEL_CONFIG`: Name of the model config you want to evaluate with, as set up in `config.toml`.
- `GIT_COMMIT`: The git commit hash or release tag of OpenHands to evaluate, e.g., `HEAD` or a specific tag like `0.6.2`.
- `AGENT`: Use `CodeActAgent`; it is currently the only supported agent.
- `EVAL_LIMIT`: Number of samples to evaluate.
- `NUM_WORKERS`: Number of workers to parallelize the evaluation process.
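Putting it all together, a full invocation might look like this (again, `llm.eval_gpt4o` is a placeholder for whatever group name you used in `config.toml`):

```bash
# Evaluate 10 instances with 1 worker on the current checkout (HEAD),
# using the hypothetical llm.eval_gpt4o config group
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh llm.eval_gpt4o HEAD CodeActAgent 10 1
```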