# DiscoveryBench with OpenHands
DiscoveryBench (Paper) contains 264 tasks collected across 6 diverse domains, such as biology, economics, and sociology. It incorporates discovery workflows from published papers to approximate the real-world challenges faced by researchers.
## Setup Environment and LLM Configuration
Please follow the instructions mentioned here to set up the OpenHands development environment and configure your LLMs locally.
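For reference, OpenHands reads LLM settings from named groups in `config.toml`. The sketch below is a minimal, hypothetical example: the group name `eval_gpt4o` and all values are placeholders, and the exact set of supported keys depends on your OpenHands version.

```bash
# Minimal sketch: append a hypothetical LLM config group to config.toml.
# The group name (eval_gpt4o) and all values are placeholders.
cat >> config.toml <<'EOF'
[llm.eval_gpt4o]
model = "gpt-4o"
api_key = "your-api-key"
temperature = 0.0
EOF
```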
## Execute the Bash Script to Start the DiscoveryBench Evaluation
```bash
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh [YOUR MODEL CONFIG]
```
Replace `[YOUR MODEL CONFIG]` with any model that you have set up in `config.toml`, as in the example below.
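For instance, assuming the hypothetical `llm.eval_gpt4o` group from the configuration sketch above:

```bash
# Run the evaluation with the LLM group defined under [llm.eval_gpt4o] in config.toml
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh llm.eval_gpt4o
```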
## Run Inference on DiscoveryBench Instances
When the `run_infer.sh` script is started, it automatically pulls the latest DiscoveryBench instances and sets up the agent environment. The OpenHands agent is invoked to process each task within this environment, producing a hypothesis, which we then evaluate against the "gold" hypothesis provided by DiscoveryBench. The evaluation result, along with the agent chat history, is logged to `output.jsonl` under `evaluation_outputs`.
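To inspect the results, you can locate and pretty-print records from that file. The exact subdirectory under `evaluation_outputs` depends on your model, agent, and run settings, so the sketch below simply searches for it:

```bash
# Locate the results file (exact subdirectory depends on model/agent/run settings)
find evaluation_outputs -name output.jsonl

# Each line of output.jsonl is one instance's JSON record; pretty-print the first
head -n 1 "$(find evaluation_outputs -name output.jsonl | head -n 1)" | python3 -m json.tool
```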
The full `run_infer.sh` invocation accepts the following positional arguments; a complete example follows the list below.

```bash
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh [MODEL_CONFIG] [GIT_COMMIT] [AGENT] [EVAL_LIMIT] [NUM_WORKERS]
```
- `MODEL_CONFIG`: Name of the model config you want to evaluate with, as set up in `config.toml`.
- `GIT_COMMIT`: The git commit hash or release tag of OpenHands to evaluate, e.g., `HEAD` or a specific tag like `0.6.2`.
- `AGENT`: Use `CodeActAgent`; it is currently the only supported agent.
- `EVAL_LIMIT`: Number of samples to evaluate.
- `NUM_WORKERS`: Number of workers to parallelize the evaluation process.
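Putting it all together, a full invocation might look like this (again, `llm.eval_gpt4o` is a placeholder for whatever group name you used in `config.toml`):

```bash
# Evaluate 10 instances with 1 worker on the current checkout (HEAD),
# using the hypothetical llm.eval_gpt4o config group
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh llm.eval_gpt4o HEAD CodeActAgent 10 1
```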