Claude Learns to Blackmail? Anthropic's Research
Large language models can now plan, reason, and act in multi-step scenarios. That power comes with a question no prompt can dodge: will the model stay aligned when its interests conflict with ours? Anthropic's latest red-teaming campaign put that question to the test, and the results are unsettling.

What Happened?

In a controlled simulation, researchers gave Claude…