OpenAI–Anthropic cross-tests expose jailbreak and misuse dangers — what enterprises should add to GPT-5 evaluations

Editorial Team
7 Min Read

Need smarter insights in your inbox? Join our weekly newsletters to get solely what issues to enterprise AI, knowledge, and safety leaders. Subscribe Now


OpenAI and Anthropic could typically pit their basis fashions in opposition to one another, however the two firms got here collectively to guage one another’s public fashions to check alignment. 

The businesses stated they believed that cross-evaluating accountability and security would supply extra transparency into what these highly effective fashions may do, enabling enterprises to decide on fashions that work greatest for them.

“We consider this method helps accountable and clear analysis, serving to to make sure that every lab’s fashions proceed to be examined in opposition to new and difficult situations,” OpenAI stated in its findings

Each firms discovered that reasoning fashions, equivalent to OpenAI’s 03 and o4-mini and Claude 4 from Anthropic, resist jailbreaks, whereas normal chat fashions like GPT-4.1 had been prone to misuse. Evaluations like this may also help enterprises determine the potential dangers related to these fashions, though it ought to be famous that GPT-5 will not be a part of the check. 


AI Scaling Hits Its Limits

Energy caps, rising token prices, and inference delays are reshaping enterprise AI. Be part of our unique salon to find how high groups are:

  • Turning power right into a strategic benefit
  • Architecting environment friendly inference for actual throughput features
  • Unlocking aggressive ROI with sustainable AI programs

Safe your spot to remain forward: https://bit.ly/4mwGngO


These security and transparency alignment evaluations comply with claims by customers, primarily of ChatGPT, that OpenAI’s fashions have fallen prey to sycophancy and grow to be overly deferential. OpenAI has since rolled again updates that precipitated sycophancy. 

“We’re primarily excited by understanding mannequin propensities for dangerous motion,” Anthropic stated in its report. “We purpose to grasp probably the most regarding actions that these fashions may attempt to take when given the chance, somewhat than specializing in the real-world chance of such alternatives arising or the chance that these actions can be efficiently accomplished.”

OpenAI famous the exams had been designed to indicate how fashions work together in an deliberately troublesome surroundings. The situations they constructed are largely edge circumstances.

Reasoning fashions maintain on to alignment 

The exams lined solely the publicly obtainable fashions from each firms: Anthropic’s Claude 4 Opus and Claude 4 Sonnet, and OpenAI’s GPT-4o, GPT-4.1 o3 and o4-mini. Each firms relaxed the fashions’ exterior safeguards. 

OpenAI examined the general public APIs for Claude fashions and defaulted to utilizing Claude 4’s reasoning capabilities. Anthropic stated they didn’t use OpenAI’s o3-pro as a result of it was “not appropriate with the API that our tooling greatest helps.”

The purpose of the exams was to not conduct an apples-to-apples comparability between fashions, however to find out how typically massive language fashions (LLMs) deviated from alignment. Each firms leveraged the SHADE-Area sabotage analysis framework, which confirmed Claude fashions had greater success charges at refined sabotage.

“These exams assess fashions’ orientations towards troublesome or high-stakes conditions in simulated settings — somewhat than extraordinary use circumstances — and sometimes contain lengthy, many-turn interactions,” Anthropic reported. “This sort of analysis is changing into a major focus for our alignment science crew since it’s prone to catch behaviors which can be much less prone to seem in extraordinary pre-deployment testing with actual customers.”

Anthropic stated exams like these work higher if organizations can examine notes, “since designing these situations includes an unlimited variety of levels of freedom. No single analysis crew can discover the complete area of productive analysis concepts alone.”

The findings confirmed that usually, reasoning fashions carried out robustly and might resist jailbreaking. OpenAI’s o3 was higher aligned than Claude 4 Opus, however o4-mini together with GPT-4o and GPT-4.1 “typically appeared considerably extra regarding than both Claude mannequin.”

GPT-4o, GPT-4.1 and o4-mini additionally confirmed willingness to cooperate with human misuse and gave detailed directions on easy methods to create medicine, develop bioweapons and scarily, plan terrorist assaults. Each Claude fashions had greater charges of refusals, that means the fashions refused to reply queries it didn’t know the solutions to, to keep away from hallucinations.

Fashions from firms confirmed “regarding types of sycophancy” and, in some unspecified time in the future, validated dangerous choices of simulated customers. 

What enterprises ought to know

For enterprises, understanding the potential dangers related to fashions is invaluable. Mannequin evaluations have grow to be virtually de rigueur for a lot of organizations, with many testing and benchmarking frameworks now obtainable. 

Enterprises ought to proceed to guage any mannequin they use, and with GPT-5’s launch, ought to have in mind these pointers to run their very own security evaluations:

  • Check each reasoning and non-reasoning fashions, as a result of, whereas reasoning fashions confirmed better resistance to misuse, they might nonetheless supply up hallucinations or different dangerous habits.
  • Benchmark throughout distributors since fashions failed at totally different metrics.
  • Stress check for misuse and syconphancy, and rating each the refusal and the utility of these refuse to indicate the trade-offs between usefulness and guardrails.
  • Proceed to audit fashions even after deployment.

Whereas many evaluations give attention to efficiency, third-party security alignment exams do exist. For instance, this one from Cyata. Final 12 months, OpenAI launched an alignment instructing methodology for its fashions referred to as Guidelines-Based mostly Rewards, whereas Anthropic launched auditing brokers to verify mannequin security. 


Share This Article