Wave of AI Science Research Benchmarks and Workbenches Emerges

Research/Benchmarks | July 1, 2026 | 8 sources

OpenAI's GeneBench-Pro, Anthropic's Claude Science, and IBM Research's ScarfBench extended AI evaluation and tooling across scientific research and enterprise software.

Analysis

[OpenAI] released GeneBench-Pro computational biology benchmark ^[1]^[2]

Covers 129 problems across 10 domains and 21 subdomains
Spans genomics, quantitative biology, and translational medicine
Measures judgment-centered "research taste" capabilities
Extended version of GeneBench evaluating ambiguity handling, assumption revision, and analytical path selection

[Anthropic] launched Claude Science AI workbench for scientists ^[3]^[6]

Pre-configured with over 60 curated skills and connectors
Supports genomics, single-cell, proteomics, structural biology, and cheminformatics
Native rendering of 3D protein structures and genome browser tracks
Reviewer agent automatically verifies citations and calculations
Available in beta for Claude Pro, Max, Team, and Enterprise users

[Anthropic] strengthened vertical product strategy based on workflows rather than new models ^[4]

Uses the same existing models, such as Claude Opus 4.8
Builds on Claude for Life Sciences (October 2025)
Can run on labs' own infrastructure, minimizing external data transfer
Early use cases include Allen Institute and UCSF Brain Tumor Center

[Anthropic] elevated Claude Science to flagship status alongside Claude Code and Cowork ^[5]

Plans in-house use for research on rare and neglected disease drugs
AlphaFold developer John Jumper moved from DeepMind to Anthropic
Identified life sciences as the highest-impact area
Expected to spread among scientists who already make heavy use of coding

[IBM Research] released ScarfBench benchmark for enterprise Java framework migration ^[7]

Evaluates migration across Spring, Jakarta EE, and Quarkus
34 applications, 204 migration tasks, approximately 151K LOC
Verified with 1,331 expert-authored tests for build, deployment, and behavior preservation
Even top frontier agents achieved behavior preservation success rates below 10%
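
The application and task counts line up with a simple piece of arithmetic. The sketch below assumes, as an inference from the numbers rather than anything stated in the summary, that each of the 34 applications is migrated in every direction between the three frameworks. The names `FRAMEWORKS`, `APPLICATIONS`, and `directed_pairs` are hypothetical and used only for illustration.

```python
from itertools import permutations

# Frameworks and application count as reported in the summary.
FRAMEWORKS = ["Spring", "Jakarta EE", "Quarkus"]
APPLICATIONS = 34


def directed_pairs(frameworks):
    """Return every ordered (source, target) pair of distinct frameworks."""
    return list(permutations(frameworks, 2))


# Assumption: one migration task per application per directed pair.
pairs = directed_pairs(FRAMEWORKS)
total_tasks = APPLICATIONS * len(pairs)

for source, target in pairs:
    print(f"{source} -> {target}")
print(f"{APPLICATIONS} applications x {len(pairs)} directions = {total_tasks} tasks")
```

Under that assumption, three frameworks give six migration directions, and 34 x 6 reproduces the reported 204 tasks.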

[Hugging Face] integrated Every Eval Ever with Community Evals ^[8]

Consolidates 22,000 models, 2,200 benchmarks, and 229,000 evaluation results
Standardizes 31 reporting formats into a single JSON schema
Automatically publishes evaluation results on model pages
Records execution entity, access method, generation settings, and metric definitions
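
To make the recorded metadata concrete, here is a minimal sketch of what a single unified evaluation record could look like. It is not the actual Every Eval Ever schema: the class `EvalRecord`, every field name, and all example values are assumptions chosen to mirror the four categories listed above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvalRecord:
    """Hypothetical unified record; field names are illustrative only."""

    model: str
    benchmark: str
    metric_name: str
    metric_definition: str  # how the score is computed
    score: float
    evaluated_by: str  # execution entity: who ran the evaluation
    access_method: str  # e.g. hosted API versus locally loaded weights
    generation_settings: dict = field(default_factory=dict)


def to_json(record: EvalRecord) -> str:
    """Serialize a record to JSON with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder values, not real evaluation results.
example = EvalRecord(
    model="example-org/example-model",
    benchmark="example-benchmark",
    metric_name="accuracy",
    metric_definition="fraction of items answered exactly correctly",
    score=0.5,
    evaluated_by="third-party-lab",
    access_method="api",
    generation_settings={"temperature": 0.0, "max_new_tokens": 256},
)

serialized = to_json(example)
print(serialized)
```

Keeping the runner, access method, and generation settings next to the score is what lets two results for the same model and benchmark be compared, since a difference in any of those fields can explain a difference in the number.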

Sources