AI Science Research, Security Benchmarks, and Agent Validation Studies Released

Research/Benchmarks | June 17, 2026 (UTC) | 6 sources

OpenAI, Google, and Anthropic published AI agent validation studies and new benchmarks spanning science, medicine, and security.

Analysis

[OpenAI] demonstrated a GPT-5.4-based autonomous AI chemist that improved Chan-Lam coupling reactions ^[1]

[OpenAI] released LifeSciBench benchmark for evaluating life sciences research ^[2]

[Google AMIE] published Nature paper on Gemini-based medical AI's long-term chronic disease management capabilities ^[3]

Used Gemini's long-context capability to cross-reference hundreds of pages of clinical guidelines
Combined empathetic dialogue agent with management reasoning agent
Conducted blinded study comparing against 21 primary care physicians
Scored significantly higher on plan precision and guideline alignment

[Anthropic] released report mapping 832 AI-enabled cyber threats to MITRE ATT&CK ^[4]

Analyzed 832 malicious accounts blocked between March 2025 and March 2026
67.3% used AI for writing malware
6.5% for assisting lateral movement
Proportion rated medium risk or higher rose 1.7x, from 33% to 56%, across 6-month periods
AI-based phishing decreased 8.6%, while account discovery increased 8.9%
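
The 1.7x figure above can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the only inputs taken from the report summary are the 33% and 56% shares.

```python
def fold_change(before_pct: float, after_pct: float) -> float:
    """Ratio of the later share to the earlier share (a relative change)."""
    return after_pct / before_pct


def point_change(before_pct: float, after_pct: float) -> float:
    """Difference between the two shares, in percentage points."""
    return after_pct - before_pct


# Shares quoted in the summary: 33% in the earlier period, 56% in the later one.
ratio = fold_change(33.0, 56.0)
points = point_change(33.0, 56.0)

print(round(ratio, 1))  # 1.7
print(points)           # 23.0
```

The same change reads as a 1.7x relative increase or a 23-percentage-point absolute increase, so it helps to state which one is meant.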

[Anthropic] studied agentic coding expertise effects through analysis of approximately 400,000 Claude Code sessions ^[5]

Analyzed 400,000 sessions from approximately 235,000 users between October 2025 and April 2026
Higher domain expertise correlated with more tasks completed by Claude per instruction
Debugging session share dropped by nearly half in 7 months
Average task value rose approximately 25% based on freelance market benchmarks

[Anthropic] published case study on the need for deterministic retrieval layers in biological databases ^[6]

Evaluated Claude, Biomni OSS, Edison Analysis, and GPT on NCBI Virus data retrieval
Even the strongest models fell short of accuracy needed to build reliable datasets
Accuracy approached 100% when the gget virus deterministic retrieval layer was added
Emphasized the need to redesign biological data infrastructure to be agent-friendly
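
One way to picture a deterministic retrieval layer is as a thin, rule-based filter that sits between the agent and the data: the agent fills in a structured query, and ordinary code returns the matching records. The sketch below is a minimal illustration under stated assumptions. It does not use gget or NCBI Virus, and `RECORDS`, `VirusQuery`, and `retrieve` are hypothetical names with made-up placeholder data.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholder records standing in for a sequence database export.
RECORDS = [
    {"accession": "X0001", "virus": "virus A", "host": "human", "length": 29903},
    {"accession": "X0002", "virus": "virus A", "host": "bat", "length": 29855},
    {"accession": "X0003", "virus": "virus B", "host": "human", "length": 11000},
]


@dataclass(frozen=True)
class VirusQuery:
    """Structured query an agent fills in instead of listing records itself."""
    virus: str
    host: Optional[str] = None  # None means "any host"
    min_length: int = 0


def retrieve(query: VirusQuery, records=RECORDS) -> list:
    """Filter records with fixed rules; the same query always returns the same list."""
    hits = [
        r["accession"]
        for r in records
        if r["virus"] == query.virus
        and (query.host is None or r["host"] == query.host)
        and r["length"] >= query.min_length
    ]
    return sorted(hits)


print(retrieve(VirusQuery(virus="virus A", host="human")))  # ['X0001']
```

In this arrangement the model only has to produce the query fields correctly. The record list itself comes from reproducible code rather than from generated text.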