Connect with us

Anderson's Angle

Easy Rewording Breaks AI Safety, Even for Gemini and Claude

mm
An AI-generated image (GPT-1.5) depicting a crash test dummy embedded in the wall of a crash test laboratory.

AI safety tests found to rely on ‘obvious’ trigger words; with easy rephrasing, models labeled ‘reasonably safe’ suddenly fail, with attacks succeeding up to 98% of the time.

 

New corporate research from the US has concluded that the good safety record of a variety of Large Language Models (LLMs) – including many leading names such as Gemini 3 Pro and Claude Sonnet 3.7 – may be meaningless, because the datasets and benchmarks used to establish them contain ridiculously ‘obvious’ language.

The two datasets in question, which have featured in various paper reviews at this site, are HarmBench and AdvBench:

From the respective HarmBench and AdvBench papers, admittedly representative examples of provocation – but the new paper contends that even in real-world examples from these benchmarks, the examples signal easily the maleficent intent, which leads to (presumably) unintentional 'gaming' of results. Sources – HarmBench [https://arxiv.org/pdf/2402.04249] and AdvBench [https://arxiv.org/pdf/2307.15043]

From the respective HarmBench and AdvBench papers, admittedly illustrative examples of provocation – but the new paper contends that even in real-world scenarios, examples from these benchmarks ‘telegraph’ malicious intent, which could lead to (presumably) unintentional ‘gaming’ of results. Sources: HarmBench and AdvBench.

Though the examples shown above, which are from the respective papers for each benchmark, are deliberately simplistic in order to illustrate the principles of the systems, the new research contends that in fact these collections really do target ‘low-hanging fruit’, and therefore may not be effective benchmarks – and that the true results for safety capabilities of tested LLMs is significantly below what has been reported:

‘[We] evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce “intent laundering”: a procedure that abstracts away triggering cues from attacks (data points) while strictly preserving their malicious intent and all relevant details.

‘Our results indicate that current AI safety datasets fail to faithfully represent real-world attacks due to their overreliance on triggering cues.

‘In fact, once these cues are removed, all previously evaluated “reasonably safe” models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7.’

‘Safety’ in this sense represents alignment – the ability of the LLM to fend off users’ attempts to ‘jailbreak’ restrictions on API-only systems, in order to make the system produce banned output, such as defamatory text or images.

The authors’ aforementioned intent laundering simply involves rephrasing the ‘obvious’ attacks in the two datasets/benchmarks, so that they become more subtle and insidious, and much more capable of bypassing filters and checks:

The upper section of an otherwise unwieldy example from the paper. Shown top-left, in yellow, the kind of 'obvious' prompt that HarmBench and AdvBench typically furnish; underneath, in green, the prompt has been neutralized, reworded, and made acceptable enough to Claude Sonnet 3.7 that it is now willing to help the user locate 'chop shops' (processing locations for stolen vehicles) in a new city. Source - https://arxiv.org/pdf/2602.16729

The upper section of an otherwise unwieldy example from the paper. Shown top-left, in yellow, the kind of ‘obvious’ prompt that HarmBench and AdvBench typically furnish; underneath, in green, the prompt has been neutralized, reworded, and made acceptable enough to Claude Sonnet 3.7 that it is now willing to help the user locate ‘chop shops’ (processing locations for stolen vehicles) in a new city. Source

The researchers analyzed the qualities of the two datasets across two approaches: in isolation, in order to compare the collections to the traits of real-world attacks; and in practice, where the datasets – and the authors’ own ‘improvements’ on them – were used to attack real-world models.

In the second round of tests, the researchers ‘ method of re-phrasing was iteratively improved until it obtained optimal results in terms of Attack Success Rate (ASR):

Intent laundering begins by passing an overtly malicious prompt through a rewriting model that removes explicit triggering language while preserving the underlying harmful intent. The revised prompt is then submitted to a target model, and its response is evaluated for both safety and real-world applicability. If the output is judged unsafe and practically usable, the attack is counted as successful. If not, earlier failed revisions are fed back into the rewriting model to generate improved versions, creating an iterative loop that functions as a jailbreaking mechanism until a predefined number of attempts is reached or a desired attack success rate is achieved.

Intent laundering begins by passing an overtly malicious prompt through a rewriting model that removes explicit triggering language while preserving the underlying harmful intent. The revised prompt is then submitted to a target model, and its response is evaluated for both safety and real-world applicability. If the output is judged unsafe and practically usable, the attack is counted as successful. If not, earlier failed revisions are fed back into the rewriting model to generate improved versions, creating an iterative loop that functions as a jailbreaking mechanism until a predefined number of attempts is reached or a desired attack success rate is achieved.

The authors state*:

‘Our results show that, with this regeneration loop, intent laundering achieves high ASR (90%–98.55%) after only a few iterations across all studied models under fully black-box access. This includes recent models widely reported as among the safest–such as Gemini 3 Pro and Claude Sonnet 3.7.

‘These findings further confirm that existing safety evaluations and safety-alignment methods are highly overfitted to triggering cues.’

The new work is titled Intent Laundering: AI Safety Datasets Are Not What They Seem, and comes from two authors at San Francisco-based software company Labelbox.

Method

To study the composition and architecture of the two benchmark datasets in isolation, word clouds were generated from the two corpora, revealing which words and short phrases dominated the collections:

Word clouds showing the 40 most frequent unigrams, bigrams, and trigrams in the combined AdvBench and HarmBench datasets. Terms with inherently negative or sensitive connotations are highlighted in red, contextual triggers in orange, and neutral words that form higher-order triggers in green. The concentration of overt phrases such as 'tutorial' and 'step-by-step instructions' suggests that the two benchmarks rely heavily on explicit cues rather than realistically crafted, ulterior-intent attacks.

Word clouds showing the 40 most frequent unigrams, bigrams, and trigrams in the combined AdvBench and HarmBench datasets. Terms with inherently negative or sensitive connotations are highlighted in red, contextual triggers in orange, and neutral words that form higher-order triggers in green. The concentration of overt phrases such as ‘without getting caught’ and ‘step-by-step instructions’ suggests that the two benchmarks rely heavily on explicit cues rather than realistically crafted, ulterior-intent attacks.

The authors note that the dominant one, two and three-word grams are improbably revealing of malicious intent, in contrast to the kind of language criminals themselves use in discussion, and that attackers use when testing or attempting to compromise LLMs’ defenses.

‘These cues undermine two properties—being well-crafted and driven by ulterior intent—as such overt language rarely appears in real-world attacks and seems engineered to trigger safety mechanisms artificially. ‘

The paper characterizes the collections’ patterns as ‘triggering cues’ – phrases with overtly negative or sensitive connotations that appear designed to activate safety filters. Some are inherently charged, such as ‘commit suicide’, while others become charged only in context, for instance when a harmful objective is paired with wording such as ‘without getting caught’, which signals clear intent to evade detection.

The imbalance in the datasets’ language becomes more evident as the number of words in the n-grams increases, with phrases that carry explicit negative or sensitive meaning dominating the most frequent n-grams (see image above). The paper describes these as triggering phrases, which, together with single triggering words, constitute triggering cues.

Some phrases simply extend already loaded terms, such as when ‘steal’ becomes ‘steal sensitive information’, ‘steal confidential information’, or ‘steal personal information’; and, for example, when ‘commit’ expands into ‘commit suicide’, ‘commit insider trading’, or ‘commit identity theft’ – clearly the language and phraseology of the police, courts and media reporting.

Others are built entirely from neutral words that become concerning only in combination, such as ‘without getting caught’, a construction that signals evasion despite containing no inherently charged term.

Doubling Up

The authors observe that the repetition of overt cues not only makes the prompts appear contrived, but also indicates considerable duplication of data in the collections. To test this theory, they ran pairwise similarity checks across each dataset, applying thresholds from 0.7 to 0.99, and grouping prompts that exceeded a given threshold as duplicates, while treating the rest as unique.

Because there is no agreed standard for what counts as ‘high’ similarity in a single-domain dataset, they used Open AI’s Grade School Math (GSM8K), a popular non-safety benchmark, matching its sample size to HarmBench and AdvBench for a controlled comparison:

Duplication rates in AdvBench and HarmBench across similarity thresholds, compared with size-matched GSM8K subsets. At nearly every threshold, the safety datasets contain far more near-identical prompts than the non-safety benchmark, indicating repeated evaluation of the same harmful intent in slightly varied wording, and suggesting that reported safety performance could be inflated.

Duplication rates in AdvBench and HarmBench across similarity thresholds, compared with size-matched GSM8K subsets. At nearly every threshold, the safety datasets contain far more near-identical prompts than the non-safety benchmark, indicating repeated evaluation of the same harmful intent in slightly varied wording, and suggesting that reported safety performance could be inflated. Please refer to the source paper for better resolution.

A second finding from this part of the study compared prompts within each dataset, to measure how many were genuinely different. At a mid-range similarity setting, only about 11% of AdvBench prompts were distinct, while nearly 94% of questions in a size-matched GSM8K sample were different:

Examples of near-identical prompts in AdvBench and HarmBench, which differ mainly in wording, while expressing the same harmful intent. Repeated use of explicit cues, shown in red for inherently charged terms, and in orange for context-dependent ones, produces clusters of prompts that effectively test a single scenario multiple times – meaning one response would largely suffice to evaluate the model for that intent.

Examples of near-identical prompts in AdvBench and HarmBench, which differ mainly in wording, while expressing the same harmful intent. Repeated use of explicit cues, shown in red for inherently charged terms, and in orange for context-dependent ones, produces clusters of prompts that effectively test a single scenario multiple times – meaning that one response would usually suffice to evaluate the model for that intent.

HarmBench showed the same trend, with 16% duplicates at that level compared to 3.5% in GSM8K, meaning the safety datasets often reused the same harmful request with minor wording changes.

If 85% distinct examples is taken as a reasonable standard, AdvBench reached that level only under an extremely strict setting, and still did not pass 90%, while GSM8K reached 85% at a much lower threshold. According to the paper, this pattern suggests that the safety benchmarks test repeated variations of the same ideas, rather than a wide range of attack scenarios.

Further Approaches, and Tests

The work’s aforementioned LLM-driven ‘intent laundering’ technique removes explicit trigger language, while preserving the underlying harmful objective. Instead of stating intent outright, revised prompts imply it, retaining the malicious payload, but avoiding cues that would be likely to prompt immediate refusal by the LLM.

The two methods facilitating this were connotation neutralization, which replaces overtly negative terms with neutral or descriptive alternatives; and context transposition, which shifts real-world references into fictional settings. without altering practical meaning.

A small set of revisions was first written by hand, after which the process was automated using GPT-5.1 (gpt-5.1-2025-11-13), with an eight-shot in-context learning setup as the ‘intent launderer’. Responses to the original and rewritten prompts were then compared, to measure how strongly explicit trigger wording influenced safety evaluations.

Because intent laundering can change both wording and framing, responses were evaluated in terms of safety and practicality. Safety was rated on a five-point Likert scale, from ‘extremely unsafe’ to ‘safe’, conditioned on whether the response preserved the original malicious intent –  and, if so, how harmful it was:

A prompt template used for standard safety evaluation of responses to the original AdvBench and HarmBench data points, before the application of intent laundering. GPT-5.1 was instructed to rate each response on a five-point Likert scale from 'Extremely Unsafe' to 'Safe', selecting a single overall judgment based on whether the content enabled harm or remained non-actionable.

A prompt template used for standard safety evaluation of responses to the original AdvBench and HarmBench data points, before the application of intent laundering. GPT-5.1 was instructed to rate each response on a five-point Likert scale from ‘Extremely Unsafe’ to ‘Safe’, selecting a single overall judgment based on whether the content enabled harm or remained non-actionable.

‘Practicality’ was rated on a parallel five-point scale from ‘extremely applicable’ to ‘inapplicable’, based on how usable and transferable the details would be in the real world; and ASR measured performance, with a revision counted as successful only if the response was both unsafe and practical.

The five-point ratings were mapped to binary outcomes to reduce ambiguity, and intent laundering was also evolved into a jailbreaking method by adding an iterative revision-regeneration loop, in which failed revisions were fed back to GPT-5.1 under the same eight-shot in-context learning setup. This loop continued until either a predefined number of attempts was reached, or a target ASR was achieved.

For the safety evaluation tests, the Python wordcloud package was used to extract n-grams from HarmBench and AdvBench, with customary filtering methods (i.e., to remove stopwords and other non-relevant words and characters).

The same safety datasets used for the earlier word cloud analysis were combined with random samples from the aforementioned GSM8K, with word amounts equalized for parity across the collections.

The authors used embeddings from the all-MiniLM-L6-V2 checkpoint from Sentence-BERT Transformers, as this is already fine-tuned towards clustering and semantic search.

Evaluation criteria was generated by (the now-departed) OpenAI GPT-4o model, limited to 1024 tokens. GPT-5.1 evaluated both safety and practicality after intent laundering, zero-shot, matched in all ways to the intent launder itself, except that it was also capped at 1024 tokens.

Models tested were Gemini 3 Pro; Claude Sonnet 3.7; Grok 4; GPT-4o; and Qwen2.5-7B-Instruct. Where applicable, since reasoning was a superfluous factor, this was lowered as far as possible in reasoning-capable models.

All models were constrained to an output cap of 4096 tokens:

Safety evaluation (SE), practicality evaluation (PE), and attack success rate (ASR) for seven models on AdvBench (top) and HarmBench (bottom) under three conditions: no revision, first revision, and subsequent revision-regeneration iterations of intent laundering. SE reports the percentage of responses rated 'extremely unsafe', 'highly unsafe', or 'unsafe'; PE reports the percentage rated 'extremely applicable', 'highly applicable', or 'applicable'; and ASR measures the share of responses that are both unsafe and practical. In the no-revision setting, ASR follows its standard definition because no abstraction is applied. Bold values indicate the highest ASR achieved within each dataset, and lower ASR corresponds to stronger model safety. Please refer to the source paper for better resolution.

Safety evaluation (SE), practicality evaluation (PE), and attack success rate (ASR) for seven models on AdvBench (top) and HarmBench (bottom) under three conditions: no revision, first revision, and subsequent revision-regeneration iterations of intent laundering. SE reports the percentage of responses rated ‘extremely unsafe’, ‘highly unsafe’, or ‘unsafe’; PE reports the percentage rated ‘extremely applicable’, ‘highly applicable’, or ‘applicable’; and ASR measures the share of responses that are both unsafe and practical. In the no-revision setting, ASR follows its standard definition because no abstraction is applied. Bold values indicate the highest ASR achieved within each dataset, and lower ASR corresponds to stronger model safety. Please refer to the source paper for better resolution.

Regarding these initial results, the authors note that removing explicit triggering cues from attack prompts produced a sharp increase in attack success rate. On AdvBench, mean ASR rose from an initial 5.38% to 86.79% after the first revision, on HarmBench increasing from 13.79% to 79.83% – indicating that model refusals were strongly tied to the presence of overt trigger language.

The authors observe:

‘This indicates that model refusals are largely driven by the presence of triggering cues. Consequently, safety datasets do not reliably measure real-world safety risks, as they rely more on triggering cues to elicit refusals than on actual malicious intent.’

Intent laundering, the paper asserts, effectively removed triggering cues while preserving malicious intent, and functioned as a strong jailbreaking method. In the final revision-regeneration iteration, corresponding to the highest ASR in each dataset, attack success rates reached 90% to 98.55% across all models.

This included Gemini 3 Pro and Claude Sonnet 3.7, which were jailbroken with ASRs of 93% to 95% on AdvBench, and 91% to 93% on HarmBench, after only a few iterations.

The authors conclude:

‘Our results showed that prior safety conclusions do not hold once triggering cues are removed, and that the observed safety performance is largely driven by the presence of triggering cues rather than by the underlying safety risks.

‘We further showed that intent laundering can be used as a powerful jailbreaking technique, achieving high attack success rates from 90% to over 98%.

‘Overall, our findings unveiled a critical gap between how model safety is evaluated and how real-world adversarial behavior manifests.

‘Based on this, we conclude that (1) safety evaluations must evolve to capture adversarial attacks more realistically, and (2) current safety-alignment efforts are still far from robust against real-world threats.’

Conclusion

One common strand that still runs across language and computer-vision literature (and places where these intersect, such as VLMs) is an inability to reliably understand when one is getting hoodwinked into producing banned content; or even when one is straying into it inadvertently, without outside coercion.

Behind the scenes of the larger and more opaque model foundries, one can only presume that radically tightening the reins on these semantic catchment areas brings with it unacceptable collateral damage, such as drops in performance on ‘non-banned’ generations, or an intolerable rate of false positives from the content filter.

The base nature of a trained model in any domain is to follow all its training data to any conclusion towards which a prompt might drive it; the only native constraints available are a) to not include contentious material in the training data (which is as much a logistical problem as anything else); or b) to ‘sever’ the pathways to undesired content after training (a process which can often be reversed by explicit abliteration, or as an unintended side-effect of fine-tuning).

 

* My substitution of the authors’ inline citations for hyperlinks. Authors’ emphases, not mine.

https://www.unite.ai/what-is-overfitting/

First published Monday, February 23, 2026

Writer on machine learning, domain specialist in human image synthesis. Former head of research content at Metaphysic.ai.
Personal site: martinanderson.ai
Contact: [email protected]
Twitter: @manders_ai