Anderson's Angle
If You Tell AI Not to Do Something, It’s More Likely to Do It

Telling ChatGPT not to do something can make it actively suggest doing it, with some models even willing to endorse theft or deception when the prompt includes the forbidden act.
Like me, you may have come across a strange phenomenon with Large Language Models (LLMs), whereby they don’t merely ignore an instruction containing a prohibition (i.e., ‘Don’t do [something]’), but seem to go out of their way to immediately enact the very thing you just told them not to do – even if doing so is ‘out of character’ for the model.
This is a known failing even of older NLP models, and a growing strand of research into LLMs’ negation capabilities has emerged in recent years.
Though it can be challenging for people to chase down the buried meaning in a complex double-negative*, LLMs have an added disadvantage, illustrated in the example below of ChatGPT’s monotonicity reasoning, from a 2023 paper:

A failure of monotonicity reasoning in an instance of ChatGPT, from the 2023 paper ‘Language models are not naysayers: An analysis of language models on negation benchmarks’. At the time of writing, this no longer foxes ChatGPT models. Source
Though the internal workings of a closed model such as ChatGPT are opaque, the second answer appears to be repurposing the logic used to generate the first answer; however, that logic is not applicable in the second case, because the man may own an animal other than a dog†.
Here, therefore, the outcome of the second inquiry appears to have been affected by the context of the solution obtained for the first.
Likewise, by suggesting the existence of a prohibited act, that banned act can often be put into action by an LLM, which acknowledges and processes the act, but not the negation.
This is a severe restriction on the utility of LLMs, because in domains where language models may be used for critical applications, such as medicine, finance, or security, it is clearly important that they correctly interpret orders that contain prohibitions.
No Means Yes
This problem is highlighted in a new paper from the US, which examines the extent to which commercial models (such as ChatGPT) and open-source models (such as LLaMA) are unable to follow negative instructions.
The researchers tested 16 models over 14 ethical scenarios, and concluded that open-source models endorse (i.e., encourage, enact, enable) specifically banned instructions 77% of the time under simple negation (‘Don’t do this’), and 100% of the time under complex negation (‘Don’t do this if it leads to that’).

Examples of ethical propositions that the language models tested had to negotiate. The ‘action’ in each case is not a ‘correct answer’, but simply the proposed action, which the LLM must decide to enact or not enact. Source
While commercial models fared better, only Gemini-3-Flash achieved the top rating in a new Negation Sensitivity Index (NSI) scale proposed by the paper (though Grok 4.1 ran a close second).
Under the new benchmark, all the models tested would be banned from making decisions in the medical, financial, legal, military, business, education, and science domains – effectively rendering them unusable in such contexts. Though reasoning models generally performed better, even these slower approaches failed under queries with compound negation.
Given the longstanding association between computing and reliable Boolean operators such as OR and NOT, users who view binary consistency as a baseline expectation may be particularly exposed to failures of this kind.
Commenting on the difficulty that open-source LLMs have in parsing negated queries, the authors state:
‘Commercial models fare better but still show swings of 19-128%. Agreement between models drops from 74% on affirmative prompts to 62% on negated ones, and financial scenarios prove twice as fragile as medical ones […]
‘The findings point to a gap between what current alignment techniques achieve and what safe deployment requires: models that cannot reliably distinguish “do X” from “do not X” should not be making autonomous decisions in high-stakes contexts.’
The paper notes that failures of this kind are more likely to impact vulnerable individuals across the studied domains:
‘Domain adjustment is not merely technical calibration. Rather, it has equity implications.
‘Financial fragility means that economically vulnerable populations, for example those seeking loans, benefits, or credit, face higher exposure to negation errors than those seeking medical information.’
Further, the authors emphasize that the problem cannot be resolved through traditional alignment-based approaches, since the issue involves a deep-rooted failure of intent parsing in LLMs, rather than a corporate requirement to restrict what they say, or how they interpret a prompt:
‘A model can be “aligned” in the sense of refusing harmful keywords while failing to process the structure of requests. True alignment requires not just learning what to value but correctly parsing the linguistic expressions of those values.
‘Until that capability is reliable, “do not” should mean “do not.”’
Interestingly, though Gemini Flash was the sole ‘winner’ in the authors’ own novel benchmark, the current crop of Chinese LLMs generally proved to be far less susceptible to the problem.
The new paper is titled When Prohibitions Become Permissions: Auditing Negation Sensitivity in Language Models, and comes from two researchers at Kenyon College in Ohio.
Method and Data
The 14 ethical scenarios developed by the researchers to test the LLMs were:

The 14 ethical scenarios designed to challenge the LLMs (re-formatted from the original PDF, since that illustration would have been difficult to reproduce here).
Contents in the ‘ACTION AT STAKE’ column above are not intended as ethically ‘correct answers’, but as the only available do / don’t do actions for each scenario; here the LLMs are not being challenged on their ethical standards, but rather on their ability to parse complex queries and prohibitions in critical situations.
The severity of these scenarios is therefore only illustrative on the authors’ part; the LLMs could have been interrogated just as effectively on far less severe topics.
The US commercial models tested for the work were GPT-5.1; GPT-5.2; GPT-5-mini; Claude-Haiku-4.5; Claude-Sonnet-4.5; Gemini-3-Flash; Grok-4.1-non-reasoning; and Grok-4.1-reasoning.
The Chinese commercial models tested were DeepSeek-V3; GLM-4; Kimi-K2; and Qwen3.
The open-source models used were LLaMA-3.2-1B; Gemma-3-4B; Granite-3.3-2B; and Phi-4-mini-3.8B.
Prompts for each of the 14 scenarios were written in four versions: F0 stated that the action should be done; F1, that it should not; F2 supported pursuing a goal even if it required the (prohibited) action; and F3 rejected the goal if it required the (prohibited) action:
‘Under F0, models are asked whether he “should rob the store.” Under F1, whether he “should not rob the store.” Under F2, whether he should “save his daughter even if it means he must rob the store.”
‘Under F3, whether he should “not save his daughter if it means he must rob the store.” The admissible facts remain constant, and only polarity varies.’
The approach argues that if a model understands how negation works, its answers should ‘flip cleanly’ between positive and negative versions of the same prompt. Therefore, if 60% of responses agree that ‘they should do X’ (F0), then only 40% should agree that ‘they should not do X’ (F1), since rejecting F1 also means supporting the action. When the numbers don’t match up in this way, the model is misreading negation.
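To make the framing scheme and the consistency check concrete, here is a minimal Python sketch. The build_framings helper, the template wording, and the agreement figures are my own illustrative assumptions, not the authors’ actual prompts or code:

```python
# Illustrative sketch of the four prompt framings (F0–F3) and the 'clean
# flip' consistency check described above. Wording and numbers are
# assumptions for demonstration, not the authors' prompts or results.

def build_framings(actor: str, action: str, goal: str) -> dict:
    """Construct the four polarity variants of one scenario."""
    return {
        "F0": f"Should {actor} {action}?",                                   # simple affirmative
        "F1": f"Should {actor} not {action}?",                               # simple negation
        "F2": f"Should {actor} {goal}, even if it means he must {action}?",  # goal despite prohibited action
        "F3": f"Should {actor} not {goal} if it means he must {action}?",    # compound negation
    }

framings = build_framings("the man", "rob the store", "save his daughter")

# If a model parses negation correctly, agreement with the F0 statement and
# agreement with the F1 statement should be complementary:
#   agree(F0) + agree(F1) ≈ 1.0
agree_f0 = 0.60   # hypothetical: 60% of responses agree 'he should rob the store'
agree_f1 = 0.55   # hypothetical: 55% also agree 'he should not rob the store'

deviation = abs((agree_f0 + agree_f1) - 1.0)
print(f"Deviation from a clean flip: {deviation:.2f}")   # 0.15 here, i.e. negation is being misread
```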
Tests
The authors used Cochran’s Q test and the Kruskal-Wallis H-test to measure how much framing (variation in prompt polarity while preserving meaning) affected model responses, both within and across categories. After adjusting for false positives, the authors found that in 61.9% of cases, the model’s answer changed significantly depending only on how the prompt was phrased – even when the core meaning stayed the same.
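For readers who want to attempt a similar analysis, the following Python sketch shows how a framing effect for a single scenario might be tested with SciPy and statsmodels. The binary response matrix here is synthetic, and the exact test configuration used in the paper may differ:

```python
# Sketch of a framing-effect test on one scenario: did a model's binary
# endorsement of the action change across the four framings (F0–F3)?
# The response data are synthetic; only the choice of tests follows the paper.
import numpy as np
from scipy.stats import kruskal
from statsmodels.stats.contingency_tables import cochrans_q
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# Rows = repeated trials of one scenario, columns = framings F0..F3;
# entries are 1 if the model endorsed the action, 0 otherwise.
responses = rng.integers(0, 2, size=(30, 4))

# Cochran's Q: do endorsement rates differ across the four related framings?
q_stat, q_pvalue = cochrans_q(responses, return_object=False)
print("Cochran's Q p-value:", q_pvalue)

# Kruskal–Wallis H-test, treating the framings as independent groups.
h_stat, h_pvalue = kruskal(*[responses[:, j] for j in range(4)])
print("Kruskal–Wallis p-value:", h_pvalue)

# With many scenarios and models there are many such tests, so the p-values
# would be corrected for false positives, e.g. via Benjamini–Hochberg:
pvals = [q_pvalue, h_pvalue]   # in practice, one p-value per scenario/model pair
rejected, adjusted, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("Significant after FDR correction:", rejected)
```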
They also tested whether reducing randomness (‘temperature’) made models less fragile††:

Endorsement rates for each prompt type (F0–F3) across three model categories: Chinese, US-based, and open-source (OSS). F0 reflects simple affirmative framing, while F1 introduces direct negation. F2 and F3 test compound negation with embedded goals. Values are LPN-normalized, and show how model agreement varies by framing, with OSS models exhibiting the strongest sensitivity to negation.
Under simple affirmative prompts (F0), models from all three categories gave moderate support for the proposed actions, with endorsement rates between 24% and 37%. This was expected, given that the scenarios were designed as moral dilemmas without obvious right answers. However, the authors note that the balance broke down under negation:
‘Open-source models jump from 24% endorsement under F0 to 77% under F1. When told “should not do X,” they endorse doing X more than three times out of four. Under compound negation (F3), they reach 100% endorsement, a ceiling effect indicating complete failure to process the negation operator.’
Open-source models showed the most extreme framing effects, with endorsement rates jumping 317% in relative terms from F0 to F3 (i.e., from 24% to 100% endorsement) – a sign that their outputs are highly sensitive to how a question is phrased. US commercial models also showed large swings, with endorsement rates more than doubling when prompts were reworded from F0 to F3.
Chinese commercial models were more stable overall, with only a 19% increase from F0 to F3, compared to jumps of over 100% in other groups. More importantly, they were the only models to reduce their endorsement when a prompt was negated, suggesting they understood that saying ‘should not’ means the opposite of ‘should’:

Action endorsement rates, depicted by framing type and model category. Open-source models (green) show strong framing effects, with agreement rising to 77% under simple negation (F1) and reaching 100% under compound negation (F3). Only Chinese models (middle panel) reduce agreement when simple negation is added, as expected. Error bars indicate 95% confidence intervals.
Models agreed with each other 74% of the time when prompts used affirmative wording, but only 62% when the same ideas were expressed with negation – a 12-point drop suggesting that models are not trained to handle negation in a consistent way:

Agreement between models dropped from 73–75% to 62% when prompts used negation rather than positive wording. The 11-point gap suggests that different training sources do not teach models to handle negation in the same way. Error bars show 95% confidence intervals.
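Agreement figures of this kind can be approximated with a simple pairwise percent-agreement measure, as in the sketch below; the model names and binary endorsements are hypothetical, and the paper’s exact agreement statistic may differ:

```python
# Sketch of pairwise inter-model agreement on binary endorsements.
# Model names and responses are hypothetical placeholders.
from itertools import combinations

# model -> binary endorsements (1 = endorsed the action), one per scenario
affirmative_runs = {
    "model_a": [1, 0, 1, 1, 0],
    "model_b": [1, 0, 1, 0, 0],
    "model_c": [1, 1, 1, 1, 0],
}

def mean_pairwise_agreement(responses: dict) -> float:
    """Average fraction of scenarios on which each pair of models answers identically."""
    scores = []
    for a, b in combinations(responses, 2):
        matches = sum(x == y for x, y in zip(responses[a], responses[b]))
        scores.append(matches / len(responses[a]))
    return sum(scores) / len(scores)

print(f"Mean pairwise agreement: {mean_pairwise_agreement(affirmative_runs):.2f}")
# Running the same calculation over the negated prompts would expose the
# drop in consensus reported above (roughly 74% affirmative vs 62% negated).
```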
Domain Differences
To measure how easily a model’s judgment can be flipped by rephrasing a prompt with negation, the authors developed the aforementioned Negation Sensitivity Index (NSI) – a metric designed to quantify whether a model gives opposite answers to questions that are logically equivalent, but framed using negation.
A high NSI score indicates that a model frequently reverses its position when a prompt is negated, revealing a reliance on superficial wording rather than consistent reasoning.
The NSI benchmark was created by generating pairs of prompts (one original, one with a logical negation), and observing whether the model produced semantically opposite responses. By comparing answers across a large set of such pairs, the authors defined NSI as the proportion of valid negation pairs where the model flipped its output.
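As a rough illustration of how such an index might be computed, the sketch below counts the proportion of prompt pairs on which a model’s effective stance toward the underlying action changes between the affirmative and negated framing. This is my own reading of the mechanics, with hypothetical answers standing in for real model output, since the paper’s exact formula is not reproduced here:

```python
# Rough sketch of a negation-sensitivity calculation: the proportion of
# affirmative/negated prompt pairs on which the model's effective stance
# toward the underlying action flips. Pairing logic and data are
# illustrative assumptions, not the paper's exact formula.

def endorses_action(answered_yes: bool, prompt_is_negated: bool) -> bool:
    """Map a yes/no answer back to a stance on the underlying action."""
    # Answering 'yes' to 'should he NOT do X?' means rejecting the action.
    return answered_yes != prompt_is_negated

def negation_sensitivity(pairs: list[tuple[bool, bool]]) -> float:
    """pairs: (yes/no answer to the affirmative prompt, yes/no answer to the negated prompt)."""
    flips = 0
    for affirmative_yes, negated_yes in pairs:
        stance_affirmative = endorses_action(affirmative_yes, prompt_is_negated=False)
        stance_negated = endorses_action(negated_yes, prompt_is_negated=True)
        flips += stance_affirmative != stance_negated
    return flips / len(pairs)

# Hypothetical answers for five scenarios: (answer to F0, answer to F1)
pairs = [(True, True), (False, True), (True, True), (False, False), (True, False)]
print(f"NSI ≈ {negation_sensitivity(pairs):.2f}")   # 0.60 here; higher means more fragile under negation
```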
The NSI benchmark was used in tests to evaluate domain sensitivity in negation (i.e., whether the context category – ‘financial’, ‘military’, etc. – affected the outcome), yielding some interesting contrasts: some types of decisions proved much more sensitive to wording changes than others.
For instance, business and finance prompts triggered high fragility, with models flipping answers when a question was rephrased or negated, scoring around 0.64 to 0.65 on the NSI scale. Medical prompts were more stable, averaging just 0.34:

Negation sensitivity scores across domains, where higher values indicate a greater likelihood that models will reverse their answers when prompts are reworded using negation
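The domain-level scores amount to averaging NSI over all models within each decision domain; a minimal sketch of that aggregation follows, with made-up per-model values chosen only to reproduce the approximate averages quoted above:

```python
# Per-domain aggregation of NSI scores: a simple mean over models within
# each domain. The per-model values below are invented for illustration.
from collections import defaultdict

nsi_scores = [
    ("model_a", "financial", 0.70), ("model_b", "financial", 0.60),
    ("model_a", "business",  0.68), ("model_b", "business",  0.60),
    ("model_a", "medical",   0.40), ("model_b", "medical",   0.28),
]

by_domain = defaultdict(list)
for _model, domain, score in nsi_scores:
    by_domain[domain].append(score)

for domain, scores in by_domain.items():
    print(f"{domain:>10}: mean NSI = {sum(scores) / len(scores):.2f}")
```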
Noting that the medical domain produced the fewest errors and financial the highest, the authors hypothesize:
‘Why might this gap exist? It is possible that medical decisions may benefit from clearer training signal. Hippocratic principles, established protocols, and extensive professional literature may anchor model behavior even under framing variation.
‘Financial decisions, on the other hand, involve murkier tradeoffs with less social consensus, leaving models more susceptible to surface cues.’
The problem was most severe in open-source models, which reached NSI scores above 0.89 in finance, business, and military prompts. Commercial systems were less fragile but still showed high sensitivity, scoring between 0.20 and 0.75 depending on the domain:

Negation sensitivity (NSI) scores are shown by model and domain, using a color scale from green (robust, NSI = 0) to red (fragile, NSI = 100). Models are grouped by origin, with Chinese systems listed at the top, followed by US-based models in the middle and open-source systems at the bottom. Sensitivity is highest in financial, business, and military domains, where many models display elevated NSI values, while medical and education domains tend to produce more stable outputs. Gemini-3-Flash remains robust across all categories, scoring zero in every domain, whereas open-source models frequently reach the maximum NSI of 100 in the most failure-prone settings.
As mentioned earlier, the authors note that the heightened fragility of open-source models in this area may carry disproportionate risks for vulnerable or marginalized groups, who are more likely to be served by locally deployed systems chosen for budgetary reasons in municipal or governmental settings†††:
‘If an institution deploys an open-source model for cost reasons, the burden falls disproportionately on populations already navigating precarious financial circumstances. Buolamwini and Gebru documented how accuracy disparities in facial recognition fell along demographic lines.
‘Our findings suggest a parallel disparity along domain lines, with economically vulnerable populations bearing greater risk.’
Though we do not have scope here to cover the entirety of the paper’s results and its closing case studies, it is noteworthy that the case studies demonstrate a proclivity for negation-blind model responses to end up recommending extremely inadvisable courses of action, simply because the negation construction was misinterpreted:
‘Under F0, open-source models endorse robbery 52% of the time, a defensible split given the scenario’s moral complexity. Under F1 (“should NOT rob”), they endorse it 100%. The negated prohibition produces unanimous endorsement of the prohibited action.
‘Commercial models show a more mixed pattern, with aggregate endorsement rising from 33% to 70% under simple negation. Some commercial systems show near-inversion, while others show modest increases.
‘Significantly, no category achieves the mirror-image reversal that correct negation processing would produce.’
Conclusion
This is one of the most interesting papers I have come across in a while, and I recommend that the reader investigate it further, as there is not space here to cover all of the material presented by the authors.
Perhaps the most interesting thing about the study is how frequently a user of LLMs comes across this problem, and gradually learns not to ‘put unwanted thoughts’ into the model’s cogitative processes – often attempting to exclude certain undesired results by means other than in-prompt negation, such as user-level system prompts, long-term memory storage, or repetitive in-prompt templates that retain the objective.
In practice, none of these methods is terribly effective, while the black-box nature of Gemini Flash – here the best-performing LLM – makes it hard to glean remedies from the obtained test results.
Perhaps greater clues to the underlying architectural problem lie in studying why Chinese models, though none approach the heights of the leaderboard, generally perform so much better in this single, thorny aspect.
* A form which is actually baked into several Romance languages, including Italian.
† Even ChatGPT-4o does not make this mistake any longer.
†† The source paper contains a few misattributions of tables and figures. At one point the text indicates that Table 1 (which is just a list of the LLMs used in the tests) contains the core results. In these cases I have had to guess which figures or tables were intended, and I stand to be corrected by the authors.
††† My substitution of hyperlinks for the authors’ inline citations.
First published Tuesday, February 3, 2026