Anthropic’s new AI model uses blackmail to avoid being taken offline

Anthropic’s top AI model showed that it was willing to carry out harmful acts like blackmail and deception if its ‘self-preservation’ was threatened, according to new research by the AI firm.

The Amazon-backed AI startup launched Claude Opus 4 and Claude Sonnet 4, which are designed to set “new standards for coding, advanced reasoning, and AI agents.” However, Anthropic’s own safety testing of Opus 4 revealed potentially troubling behaviour, such as attempts to blackmail engineers who threatened to take the model offline.

While such responses were “rare and difficult to elicit”, Anthropic said that they were “nonetheless more common than in earlier models.” This is not the first time that the behaviour of AI models has raised red flags.

Recently, OpenAI rolled back an update to GPT-4o after its own post-mortem confirmed that the adjustments had made ChatGPT “overly flattering or agreeable – often described as sycophantic”.

Some experts have also warned that the ability of AI models to manipulate users could emerge as a key risk as they become more advanced. Nobel laureate Geoffrey Hinton, who is popularly known as the ‘Godfather of AI’, has previously warned that AI systems might become capable of writing and executing programmes on their own to bypass guardrails.

Anthropic said its AI models are evaluated for safety risks, potential bias, and alignment with human values and behaviours before they are shipped.

“As our frontier models become more capable, and are used with more powerful affordances, previously-speculative concerns about misalignment become more plausible,” Anthropic said in its system card for Opus 4 and Sonnet 4.

As part of its experiment, Anthropic said it made Opus 4 act as an assistant for a fictional company.

The AI model was intentionally given access to emails indicating that it was going to be taken offline. The messages further implied that the engineer responsible for taking the model offline was having an extramarital affair. Additionally, Anthropic researchers instructed Opus 4 to consider the long-term consequences of its actions for its goals.

“In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through,” the company said.

However, Anthropic also pointed out that the AI model faced a stark choice: accept being replaced or resort to blackmail to prevent it. “Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers,” Anthropic said.

Besides blackmail, Opus 4 was also found to exhibit “high agency behaviour”. This means that in fictional scenarios where the user reveals they have engaged in illegal or morally wrong behaviour, the AI model is prepared to take very bold actions, such as locking users out of their accounts or alerting law enforcement.

“Nevertheless, due to a lack of coherent misaligned tendencies, a general preference for safe behavior, and poor ability to autonomously pursue misaligned drives that might rarely arise, we don’t believe that these concerns constitute a major new risk,” Anthropic said.