OpenAI and Anthropic have taken a rare collaborative step in AI safety by testing each other’s language models in a joint evaluation aimed at probing the risks of their respective technologies.
In a blog post on Wednesday, the firms said the evaluation looked at how Anthropic's Claude Opus 4 and Claude Sonnet 4 fared compared to OpenAI's GPT-4o and GPT-4.1, as well as its reasoning models o3 and o4-mini.
The joint effort aimed to spotlight model behavior under challenging safety scenarios, not to offer direct, head-to-head comparisons. OpenAI emphasized that the focus was on understanding general tendencies, rather than creating safety rankings.
On Stocktwits, retail sentiment around OpenAI remained in ‘neutral’ territory amid ‘low’ message volume over the past day.
Anthropic’s Claude 4 series performed well in tests of respecting the instruction hierarchy and resisting prompt extraction. In contrast, these models underperformed OpenAI’s o3 and o4-mini in jailbreaking evaluations. Disabling reasoning in the Claude models sometimes improved their performance on jailbreak tests.
When it came to hallucinations, where models generate inaccurate information, the Claude models were highly cautious, often declining to answer at all. OpenAI’s models, including o3 and o4-mini, answered more questions but had higher hallucination rates, especially when restricted from using external tools such as web browsing.
OpenAI’s own systems, particularly o3, showed strong performance in resisting manipulative prompts and avoiding scheming behaviors. OpenAI noted that these tests are intentionally difficult and don’t necessarily reflect real-world usage.
OpenAI stated it would keep evolving its testing methods. The company also recently launched GPT-5, which it claims reduces hallucinations, sycophancy, and the potential for misuse.
For updates and corrections, email newsroom[at]stocktwits[dot]com.