CORE-bench v1.1
Benchmark for AI agents on scientific reproducibility, with mainline (39) and OOD (19) splits derived from Code Ocean capsules.
agent-evals/core-bench-v1.1-mainline
agent-evals/core-bench-v1.1-ood