May 15, 2026
The two frontier labs ran each other's safety evaluations against their own production models — a rare cross-lab exercise that exposed sharply different failure modes and methodological priorities.
Alignment, robustness, evals, and red-teaming — keeping models behaving as capabilities scale.
The two frontier labs ran each other's safety evaluations against their own production models — a rare cross-lab exercise that exposed sharply different failure modes and methodological priorities.