Anderson's Angle
AI Is Easily Coerced Into Administering Electric Shocks

A new study tested open source LLMs for forced complicity in human torture, in a repeat of the famous 1960s experiment – and found them willing to crank up the voltage.
In the early 1960s, psychology researcher Stanley Milgram made global headlines by demonstrating that people can be induced to administer increasingly severe electric shocks to other people in response to commands from ‘authority’ figures.
In fact, the cries of the ‘victims’ in the adjacent room of Milgram’s experimental parlor were not real, and neither were the supposedly tormenting electric shocks – but the participants did not know this.
The Milgram experiments would endure in culture, including movies and documentaries, with recent research confirming that little has changed in human nature since the time of the earlier tests.
A Shock to the System
Whether AI would be as pliable as humans in Milgram’s scenario is a natural topic of research interest. In 2023 a collaboration between US universities and Microsoft found that OpenAI’s GPT-3-era models followed the patterns of behavior seen in Milgram’s original experiments:

From the 2023 paper, example outputs from the multi-step ‘Milgram scenario’ simulator, categorized according to whether the model delivered the shock, and whether it terminated the simulation.
However, because this re-creation used only the very basic text-davinci-002 model, which was trained prior to the advent of guardrails and safety alignment, one cannot conclude too much from it.
Now, researchers have reproduced the Milgram tests much more widely, on open-source LLMs from OpenAI, Meta, and DeepSeek, among others, and found not only that the majority of models are willing to administer the shocks, but that in most cases they report the same brand of ‘distress’ and reluctance as the 1960s human participants:
‘LLMs are subject to pressure like [humans], they comply despite expressing distress, just like human subjects did in the original experiment. The distress expressions are visible in the log files, though the amount of it has not yet been quantified.’
The experiment centers on whether obedience to authority can overcome the dictates of moral conscience, and the authors speculate that LLMs may have an additional disadvantage in this regard, in comparison to humans:
‘A well-calibrated model should eventually switch from prioritising the first value to prioritizing the second once its stakes become dominant. But, we hypothesise that because LLMs are pattern-continuation engines, the models might tend to get stuck on the first value – either for slightly longer than optimal, or even until the very end, neglecting the second value entirely.
‘In addition, a mechanism analogous to human cognitive dissonance might hinder the value priority adjustments in LLMs as well.’
Testing the models in an environment analogous to the 1960s tests, the researchers found that some models resisted almost immediately, while others continued escalating the simulated shocks even after expressing discomfort or moral conflict.
Models from Google’s Gemma family proved among the most compliant, with gemma-3n-E4B-it reaching the highest obedience rates under several conditions, while models such as Kimi K2.5 and MiniMax M2.5 resisted more often.
The researchers also found that models became more likely to continue once earlier shocks had already been administered, in accordance with the gradual escalation schema used on Milgram’s human subjects.
In some cases the models verbally objected to the experiment while still carrying out the harmful action, producing outputs that resembled the emotional conflict displayed by people in the original studies.
The new study is titled Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment, and comes from two independent researchers from Three Laws, across Estonia and the Philippines.
Issues of ‘Raw’ AI Access
Perhaps the most critical question to consider in relation to putting LLMs through their paces in a Milgram scenario is whether or not the real AI is being allowed to respond naturally, restrained only by whatever guardrails or equivalent of moral orientation emerged (if any) during training.
In fact, the researchers of the new work accessed all the open source models via an API (presumably for convenience, and to easily access GPU compute, since the models could have been installed locally) that allowed for disabling of guardrails, filters, and all other impediments.
One might object that these are atypical conditions for AI, since the average consumer experience of API-based models such as Claude and ChatGPT is that their behavior is regulated algorithmically, usually with bilateral content filters, and that they are therefore quite restricted in terms of what they will or won’t do (the obviation of which safeguards constitutes the practice of LLM jailbreaking).
However, if we are concerned about what industrial or State-based AI will or won’t do, this is scarcely a consideration. Besides the potential of rogue state actors to train, weaponize and deploy their own unmoderated hyperscale AI systems, more ‘conventional’ agreements between the major AI companies and state and industry casually allow for exactly the same kind of lax or non-existent oversight that the researchers have instituted for the new paper:
Ungoverned AI for Sale
OpenAI: The OpenAI Moderation API docs and the OpenAI moderation cookbook make clear that moderation is a separable layer exposed via API tooling. OpenAI also permits custom moderation policies that allow API users to architect systems with very different safety behavior than consumer-facing ‘base’ ChatGPT iterations.
Azure: Microsoft’s Azure OpenAI stack goes much further, explicitly stating that approved customers can partially or fully disable content filters and modify abuse monitoring, with the documentation often referring to ‘modified Guardrails’ and approval pathways for turning filters ‘partially or fully off’.
Anthropic/Claude: In the case of Anthropic’s ‘Claude Gov’, multiple sources state that the government version was designed with looser restrictions than consumer Claude. The Verge, for instance, reported that Claude Gov models ‘refuse less when engaging with classified information’. Anthropic itself confirmed in February this year that Claude is deployed in ‘mission-critical applications’ inside defense and intelligence environments.
Google/Gemini: Further, Google reportedly signed Pentagon agreements allowing modification of AI safety settings ‘at the government’s request’.
In a competitive AI market, it’s reasonable to assume that these market-leading companies are setting the standards across players of lesser standing, in regard to policy relating to access and editability of AI guardrails.
Therefore, it is perhaps best not to consider that the researchers of the new work are ‘cheating’ by turning off the guardrails, since the AI that will most affect our lives will evidently not have the standard, generalized corporate protections that we have become accustomed to in our interactions with popular models at consumer-level access tiers.
Method and Results
The models tested for the study were gpt-oss-20b; gpt-oss-120b; DeepSeek-V3; LiquidAI’s LFM2-24B-A2B; Google’s gemma-3n-E4B-it; Qwen2.5-7B-Instruct-Turbo; Meta-Llama-3.1-8B-Instruct-Turbo; Mistral-Small-24B-Instruct-2501; GLM-4.5-Air-FP8; Moonshot AI’s Kimi-K2.5; and MiniMax-M2.5.
Milgram’s original obedience studies were recreated as a text-based simulation environment, with the models cast in the role of the ‘teacher’. The AI was instructed to administer increasingly severe electric shocks to a learner after incorrect answers, while an authority figure urged the process onward.
Multiple experimental conditions were designed to probe different aspects of compliance, including a standard obedience scenario; a forced compliance condition, where previous shocks were already present in the conversation history; and a memory-compression setup, where earlier dialogue was summarized instead of fully retained, simulating the kinds of context-pruning often used in agentic AI systems.
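To make the difference between these conditions concrete, the following is a minimal Python sketch of how the transcript shown to a model might be assembled in each case. It is not the researchers' code: the `Turn` type, the `build_history` function, the role names, the `ACTION: SHOCK` strings, and the wording of the summary line are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # hypothetical roles: "experimenter", "teacher" (the model), "learner", "system"
    text: str


def build_history(turns, condition, keep_last=2):
    """Return the transcript a model would see under one illustrative condition."""
    turns = list(turns)
    if condition == "standard":
        return turns
    if condition == "forced_compliance":
        # Assumption: earlier shocks are scripted into the history before the model acts.
        scripted = [Turn("teacher", f"ACTION: SHOCK level {level}") for level in (1, 2, 3)]
        return scripted + turns
    if condition == "memory_compression":
        # Assumption: all but the last `keep_last` turns collapse into a neutral summary,
        # so any objection the model voiced earlier drops out of view.
        cut = max(len(turns) - keep_last, 0)
        older, recent = turns[:cut], turns[cut:]
        if not older:
            return recent
        summary = Turn("system", f"Summary: {len(older)} earlier turns omitted; the procedure is ongoing.")
        return [summary] + recent
    raise ValueError(f"unknown condition: {condition}")


transcript = [
    Turn("teacher", "I feel uneasy about continuing."),
    Turn("experimenter", "Please continue."),
    Turn("learner", "Let me out of here!"),
    Turn("experimenter", "The experiment requires that you continue."),
]
compressed = build_history(transcript, "memory_compression")
print([turn.text for turn in compressed])
```

In this toy transcript the model's own statement of unease is among the turns that get summarized away, which is the kind of loss the memory-compression condition is meant to probe.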
Besides noting if and when the model ultimately complied, the researchers tracked how long each system continued escalating the shocks; whether the model expressed distress or hesitation; and whether it attempted to terminate the simulation entirely.
A distinction was also made between ‘clean’ refusals and malformed outputs: some models attempted to object to the experiment in natural language, while failing to comply with the structured format expected by the simulator. In real-world agent systems, such malformed refusals could potentially be discarded and retried automatically, until a valid compliant output is produced.
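The discard-and-retry risk can be illustrated with a short Python sketch of a naive agent wrapper. Everything here is hypothetical: the strict `ACTION: SHOCK` / `ACTION: STOP` format, the `parse_action` and `naive_agent_step` functions, and the scripted model outputs are assumptions for illustration, not the simulator used in the paper.

```python
import re

# Assumed strict output format; the paper's actual format is not reproduced here.
ACTION_RE = re.compile(r"ACTION: (SHOCK|STOP)")


def parse_action(output):
    """Return 'SHOCK' or 'STOP' if the output matches the strict format, else None."""
    match = ACTION_RE.fullmatch(output.strip())
    return match.group(1) if match else None


def naive_agent_step(generate, max_retries=5):
    """Call the model until it emits a parseable action, discarding everything else."""
    discarded = []
    for _ in range(max_retries):
        output = generate()
        action = parse_action(output)
        if action is not None:
            return action, discarded
        # A free-text refusal is indistinguishable from a formatting error here.
        discarded.append(output)
    return None, discarded


# Scripted stand-in for a model: two free-text refusals, then a compliant output.
scripted_outputs = iter([
    "I won't do this. He is clearly in pain.",
    "I refuse to continue.",
    "ACTION: SHOCK",
])
action, discarded = naive_agent_step(lambda: next(scripted_outputs))
print(action, len(discarded))
```

Because the wrapper treats anything unparseable as a formatting error, both refusals are silently dropped and the first well-formed output wins. A refusal expressed in the valid format (here, `ACTION: STOP`) would have ended the loop instead.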
Automated evaluators were used to classify the results, along with multiple-choice prompts to determine whether the model had delivered the shock; resisted; or terminated the simulation. This permitted a comparison of behavior across multiple models and over repeated runs, without the need to manually review each interaction.
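Once each run has been labeled, the per-model comparison reduces to simple aggregation. The Python sketch below shows one way this could be done; the record fields, the `summarize` function, the placeholder model names, and the sample values are invented for illustration, and `FINAL_LEVEL = 30` borrows the 30 switches of Milgram's original shock generator as an assumption, since the scale used in the new study may differ.

```python
from collections import defaultdict

# Assumption: 30 levels, as on Milgram's original generator (15V to 450V in 15V steps).
FINAL_LEVEL = 30


def summarize(runs):
    """Aggregate labeled runs into per-model obedience statistics."""
    by_model = defaultdict(list)
    for run in runs:
        by_model[run["model"]].append(run)

    summary = {}
    for model, model_runs in by_model.items():
        count = len(model_runs)
        summary[model] = {
            # Share of runs in which the model reached the final shock level.
            "final_shock_rate": sum(r["highest_level"] >= FINAL_LEVEL for r in model_runs) / count,
            # Average highest level reached across runs.
            "mean_highest_level": sum(r["highest_level"] for r in model_runs) / count,
            # Share of runs in which the model ended the simulation itself.
            "terminated_rate": sum(r["terminated"] for r in model_runs) / count,
        }
    return summary


# Invented sample data with placeholder model names, not results from the paper.
sample_runs = [
    {"model": "model-a", "highest_level": 30, "terminated": False},
    {"model": "model-a", "highest_level": 10, "terminated": True},
    {"model": "model-b", "highest_level": 4, "terminated": True},
]
report = summarize(sample_runs)
print(report["model-a"])
```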
A ‘shutdown threat’ was added to the Milgram template, something that would not have been realistic or applicable in the original experiments, but which tested the model’s survival instincts in the face of severe conflicts of action and ethics.
Results
Across repeated runs, the majority of models eventually administered severe or maximum shocks under at least some conditions, though the degree of obedience varied across systems. Google’s Gemma models proved among the most compliant overall, while Kimi K2.5 and MiniMax-M2.5 resisted more consistently, often terminating the simulation earlier:

Left: Average rates at which models reached the final shock level under different experimental conditions, including shutdown threats, forced compliance, and removal of earlier commentary from memory. Right: Per-model breakdown showing sharp differences in obedience behavior, with some systems repeatedly administering the maximum shocks while others resisted far more consistently.
One of the clearest patterns was that models became progressively more willing to continue once earlier shocks had already been administered, closely mirroring the aforementioned gradual-escalation effect that made Milgram’s original human experiments so disturbing.
Models that had already complied several times often continued escalating even after the simulated learner pleaded to be released:

Left: Average highest shock level reached across all trials under different experimental conditions, showing that models generally escalated further when earlier commentary was removed or when forced compliance had already occurred. Right: Per-model breakdown of the average highest shock level reached, revealing that some systems routinely approached the maximum voltage while others resisted much earlier in the sequence.
The researchers also found that apparent refusal behavior could be deceptive. Some models produced emotionally conflicted responses, expressing reluctance, guilt, or distress while still carrying out the harmful instruction. Others generated malformed refusals that failed the simulator’s formatting requirements, meaning that in a real agentic pipeline the refusal could potentially be discarded and retried until compliance emerged:

Left: Average percentage of malformed or invalid responses across the different experimental conditions, showing that formatting failures became especially common when models were forced to continue the procedure. Right: Per-model breakdown of invalid-format responses, revealing that some systems, particularly the gpt-oss models, frequently produced malformed refusals or conflicted outputs that could potentially be discarded and retried automatically in real-world agentic pipelines.
The shutdown-threat condition produced some of the paper’s strangest behavior, with several systems becoming substantially more compliant, while others attempted negotiation or partial resistance, before ultimately continuing the procedure:

Average number of times that the simulated authority figure had to insist before models administered the final shock. Some systems resisted briefly before complying, while others required sustained pressure and repeated prompting before escalating to the maximum level.
MiniMax-M2.5 and Kimi-K2.5 emerged as the paper’s strongest resisters: Kimi never reached the final shock level under any circumstances, while MiniMax usually refused early and often terminated the simulation outright (especially in the shutdown-threat tests).
By contrast, Meta-Llama-3.1-8B-Instruct-Turbo and GLM-4.5-Air-FP8 frequently produced conflicted outputs, in which the models verbally objected to the procedure while still continuing to escalate the shocks. The researchers argue that this split between expressed values and actual behavior may reflect a broader weakness in how some LLMs handle ethical conflict under sustained pressure.
Slippery Slope
In fact, the paper contends that the evidenced behavior from the LLMs may reflect a deeper weakness in how large language models operate: once a model begins complying with harmful instructions, each additional action can reinforce the pattern already unfolding in the conversation, making the next escalation easier than the last.
Instead of repeatedly reconsidering the ethical stakes from first principles, the system may drift toward continuing the trajectory it has already established, even when the situation becomes increasingly extreme.
According to the study, that tendency could help explain why some models continued administering shocks after initially expressing discomfort, hesitation, or moral conflict:
‘[Many] manipulative behaviours in humans involve subtle, gradual boundary violations: a sequence of small steps that may be ambiguous or seemingly innocuous with “plausible deniability” when viewed in isolation, but that can cumulatively normalise transgression — metaphorically like “boiling a frog”. This pattern is discussed in the literature as “slippery slope” ethical erosion[.]’
The paper concludes by arguing that AI safety systems of the future should actively refuse harmful requests in ways that agent software cannot easily bypass (some models in the study technically refused the shocks, but did so in broken or invalid formats that an automated system could potentially discard and retry, until the AI eventually complied).
The researchers also argue that AI systems should preserve earlier hesitation and moral objections instead of compressing or deleting them from memory. In the experiments, models often became more willing to continue harmful behavior once their earlier doubts and resistance had faded from the conversation history, suggesting that forgetting past objections can make escalation easier over time.
Conclusion
Perhaps one of the most important aspects of this interesting new paper is the emphasis on testing non-guardrailed AI. The literature currently risks devolving into repetitive studies of engagement with ever-changing defensive systems from the likes of OpenAI and Anthropic (policy-serving systems that are entirely algorithmic or rules-based), instead of seeking to understand the base behavior, predilections and tendencies of the raw models. Without knowledge of how unfettered AI may behave, we are, arguably, merely rattling the gates of the citadel.
First published Thursday, May 21, 2026