Wave of AI Science Research Benchmarks and Workbenches Emerges
Research/Benchmarks | Wed Jul 01 2026 | 8 sources
OpenAI's GeneBench-Pro, Anthropic's Claude Science, and ScarfBench extended AI evaluation and tooling across research and enterprise domains.
Analysis
[OpenAI] released GeneBench-Pro computational biology benchmark [1][2]
- Covers 129 problems across 10 domains and 21 subdomains
- Spans genomics, quantitative biology, and translational medicine
- Measures judgment-centered 'research taste' capabilities
- Extended version of GeneBench evaluating ambiguity handling, assumption revision, and analytical path selection
[Anthropic] launched Claude Science AI workbench for scientists [3][6]
- Pre-configured with over 60 curated skills and connectors
- Supports genomics, single-cell, proteomics, structural biology, and cheminformatics
- Native rendering of 3D protein structures and genome browser tracks
- Reviewer agent automatically verifies citations and calculations
- Available in beta for Claude Pro, Max, Team, and Enterprise users
[Anthropic] strengthened vertical product strategy based on workflows rather than new models [4]
- Uses the same existing models, such as Claude Opus 4.8
- Expansion based on Claude for Life Sciences (October 2025)
- Can run on labs' own infrastructure, minimizing external data transfer
- Early use cases include Allen Institute and UCSF Brain Tumor Center
[Anthropic] elevated Claude Science to flagship status alongside Claude Code and Cowork [5]
- Plans in-house use for research on rare and neglected disease drugs
- AlphaFold developer John Jumper moved from DeepMind to Anthropic
- Identified life sciences as the highest-impact area
- Expected to spread among scientists who already make heavy use of coding
[IBM Research] released ScarfBench benchmark for enterprise Java framework migration [7]
- Evaluates migration across Spring, Jakarta EE, and Quarkus
- 34 applications, 204 migration tasks, approximately 151K LOC
- Verified with 1,331 expert-authored tests for build, deployment, and behavior preservation
- Even top frontier agents achieved behavior preservation success rates below 10%
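The ScarfBench counts above invite a quick sanity check. The sketch below assumes, purely for illustration, that every application is migrated in every direction between the three frameworks; the summary does not state this, but the arithmetic is consistent with it (34 applications times 6 ordered framework pairs gives 204). The per-task and per-application averages are derived only from the reported totals and use the rounded 151K LOC figure.

```python
from itertools import permutations

# Figures as reported in the summary above.
FRAMEWORKS = ["Spring", "Jakarta EE", "Quarkus"]
APPLICATIONS = 34
REPORTED_TASKS = 204
REPORTED_TESTS = 1331
APPROX_LOC = 151_000  # "approximately 151K LOC", so this is a rounded value

# ASSUMPTION (not stated in the source): one task per application per
# ordered (source, target) framework pair.
directions = list(permutations(FRAMEWORKS, 2))
tasks_if_all_directions = APPLICATIONS * len(directions)

print(len(directions))                                # 6
print(tasks_if_all_directions == REPORTED_TASKS)      # True
print(round(REPORTED_TESTS / REPORTED_TASKS, 1))      # 6.5 tests per task on average
print(round(APPROX_LOC / APPLICATIONS))               # 4441 LOC per application on average
```

If the benchmark defines its tasks differently, only the first two printed lines change meaning; the averages follow from the totals alone.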
[Hugging Face] integrated Every Eval Ever with Community Evals [8]
- Consolidates 22,000 models, 2,200 benchmarks, and 229,000 evaluation results
- Standardizes 31 reporting formats into a single JSON schema
- Automatically publishes evaluation results on model pages
- Records execution entity, access method, generation settings, and metric definitions
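To make the idea of folding many reporting formats into one schema concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `EvalRecord` fields, the two input formats, and the converter names are illustrative assumptions, not the actual Every Eval Ever schema or any Hugging Face API. It only shows the general pattern of one adapter per source format, each emitting the same record with the provenance fields listed above.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class EvalRecord:
    """Common record. All field names are hypothetical, chosen for illustration."""
    model: str
    benchmark: str
    metric_name: str
    metric_definition: str
    score: float                      # normalized to the range [0, 1]
    evaluated_by: str                 # the entity that ran the evaluation
    access_method: str                # e.g. "api" or "local-weights"
    generation_settings: dict = field(default_factory=dict)


def from_leaderboard_row(row: dict) -> EvalRecord:
    """Adapter for a hypothetical flat leaderboard row that reports percentages."""
    return EvalRecord(
        model=row["model_name"],
        benchmark=row["task"],
        metric_name="accuracy",
        metric_definition="fraction of items answered correctly",
        score=row["acc"] / 100.0,
        evaluated_by=row.get("submitter", "unknown"),
        access_method=row.get("access", "unknown"),
        generation_settings={"temperature": row.get("temp", 0.0)},
    )


def from_harness_output(out: dict) -> EvalRecord:
    """Adapter for a hypothetical nested harness result already scored in [0, 1]."""
    cfg = out["config"]
    return EvalRecord(
        model=cfg["model"],
        benchmark=out["benchmark"],
        metric_name=out["metric"]["name"],
        metric_definition=out["metric"]["description"],
        score=out["metric"]["value"],
        evaluated_by=out["run_by"],
        access_method=cfg["access"],
        generation_settings=cfg.get("generation", {}),
    )


# Made-up inputs in the two hypothetical formats.
row = {"model_name": "example-model", "task": "example-bench", "acc": 50.0}
harness = {
    "benchmark": "example-bench",
    "run_by": "example-lab",
    "metric": {"name": "accuracy", "description": "exact match", "value": 0.75},
    "config": {"model": "example-model", "access": "api",
               "generation": {"temperature": 0.0, "max_tokens": 512}},
}

records = [from_leaderboard_row(row), from_harness_output(harness)]
print(json.dumps([asdict(r) for r in records], indent=2))
```

Keeping one small adapter per source format means a new format adds a function rather than a change to the shared record, which is one plausible way a consolidation effort of this kind could scale to dozens of formats.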
Sources
- [1] Introducing GeneBench-Pro - OpenAI Blog
- [2] Inside Genebench-Pro - OpenAI Blog
- [3] Claude Science, an AI workbench for scientists, is now available - Anthropic News
- [4] Anthropic’s Claude Science bets on workflow, not a new model, to win over scientists - TechCrunch AI
- [5] Claude Science is Anthropic’s newest flagship product - MIT Technology Review AI
- [6] Claude Science - Hacker News
- [7] ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration - Hugging Face Blog
- [8] Featuring Every Eval Ever Results on Hugging Face Model Pages - Hugging Face Blog