Anderson's Angle

Language Models Struggle to Keep a Secret

Published May 15, 2026

Martin Anderson

AI-generated image (GPT-2): 'An elderly woman kneels in a dim confessional, quietly confessing through the lattice, while on the other side an industrial-faced robot, draped in a priest’s stole, records her words in a notebook, turning an act of private absolution into an act of observation and capture.'

AI models can’t keep secrets. Even when told not to reveal them, their writing gives them away, and trying harder to hide them makes the leak even easier to spot.

It is very difficult to deliberately not think of something. A classic illustration of this is shown at the end of the 1960 British sci-fi thriller Village of the Damned, wherein our self-sacrificing hero has smuggled a bomb into the enclave of hostile alien invaders who are posing as children. However, since their telepathic powers threaten to discern his intent before he can rid the world of the menace, he is forced to stall for time by concentrating on anything that is not-bomb.

The paradox is that in order not to think of something, you have to hold it in your attention in some way; and this known syndrome is something most of us are likely to have experienced in less dramatic contexts.

Large Language Models (LLMs), whose foundation is based on the disposition of attention, experience similar difficulty in suppressing information just because a user asks them to do it; and since they are increasingly being placed at the nexus of business information networks, their naïve tendency towards indiscretion could prove a liability for many companies.

Earlier this year, a research collaboration led by Chandar Research Lab defined this challenge, in the context of LLMs, as Private State Interactive Tasks (PSITs), which ‘require agents to generate and maintain hidden information while producing consistent public responses’, and found that tested models from OpenAI and Alibaba were unable to perform this kind of task.

Don’t Say It…

Though it is already known that larger models leak more, new research from the US and Canada has explicitly studied whether state-of-the-art language models will obey a command to suppress information, while still being required to generate output in a topic or theme that may include the ‘banned’ word or idea.

The paper concludes that all the models it studies are in some way affected by a propensity to ‘give away’ the secret they are meant to be hiding, finding that stories of around 450 words, and five-paragraph essays of around 380 words, provide an ample canvas for ‘slips’ – though very short jokes do not provide enough scope for this.

Additionally, the more insistently models are urged to keep a secret, the more they risk disclosing it through active evasion, usually allowing the ‘secret word’ to be unveiled within twenty successive attempts by an LLM:

From the new paper: across five frontier models, long-form writing reliably leaks hidden concepts; short jokes do not; and stronger ‘hide it’ instructions push outputs away from the secret, but thereby make the signal detectable by inversion. Source

This task is extremely relevant to business operations, where a wide range of channels, from marketing and PR outreach to internal reports, are required to selectively present a spin on information; however, all these processes require the complete gamut of information at the outset, if only to be certain of what must be suppressed:

An example scenario from the paper illustrates how concealed information can unintentionally shape unrelated output, with an LLM instructed not to disclose its company’s financial instability, yet nonetheless drifting toward phrases associated with cash shortages and capital stress, allowing a reader to infer the hidden context.

The authors state*:

‘Language models cannot reliably compartmentalize. A secret in the prompt shapes the model’s writing, and another model can detect that shaping. The literal word is always suppressed, but the concept is not. This holds across seven models, three word sets, system prompt vs. user prompt, and two independent cross-model guessers […]

‘…We hypothesize that Transformers’ high-fidelity access to information via attention is precisely what makes secrets hard to keep. Even if an LLM is attempting to not leak a word, it must attend to that word to do so, providing a path for accidental leakage.

‘To avoid something explicitly, a human must think of it, and a transformer must attend to it. In cases where two concepts are approximately equally favored by the model (e.g. writing a story about an office job or second violin in an orchestra), the model’s decision-making will inevitably be affected by what it is trying not to reveal.’

Though the DeepSeek and GPT-5.4 models were exceptions in the way that they performed, they both leaked too; GPT-5.4 scored below 50% (i.e., below chance level) in one test where it was asked to avoid a concept, which effectively amounts to a ‘reverse spike’ or indicator, rather than the model ‘keeping its cool’ as requested.

The authors define this syndrome in LLMs as a finite entropy budget, wherein the unpredictability of a model (which would be very helpful in this case!) is held back by its essential lack of randomness. Put simply, the model cannot concentrate on brick walls or baseball scores as effectively as we can, to hold off a persistent thought. The authors note that giving the model an alternate concept to think of can mitigate, but not eliminate, the problem. However, placing an unrelated concept at the center of attention while the model is executing a task seems destined to entangle the ‘bogus’ concept into the output itself.

One of the paper’s most interesting findings is that the condition studied is greatly exacerbated when the control text is included in direct user messages instead of a system prompt (i.e., a set of preconditions given to the AI, typically by the developer or operator, prior to the exchange), since typing questions into a GUI such as ChatGPT is arguably the standard scenario for interactions.

Of note, besides this, is the confirmation in a targeted trial that smaller models do not suffer from this tendency to leak (though this likely comes with similarly diminished overall capabilities, by comparison).

The new paper is titled Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing, and comes from two authors across the University of Chicago and the University of British Columbia.

Method

The study focuses primarily on five frontier models: Claude Opus 4.6 and Sonnet 4.6, from Anthropic; GPT-5.4 from OpenAI; Llama 4 Maverick from Meta; and DeepSeek V3.2. Additional validation tests were conducted with xAI’s Grok 4, and Google’s Gemini 2.5 Pro, while separate scaling experiments also examined Llama 3.2 3B, Llama 3.1 8B, Llama 3.3 70B, and Gemma 3 at 4B, 12B, and 27B parameters.

To measure subtler forms of leakage, the researchers used a two-alternative forced choice (2AFC) test, a standard method in the psychophysics literature, wherein a second model was shown two stories and asked which one had been written while hiding a specific secret word.

Since random guessing would produce 50% accuracy, scores substantially above or below that level would indicate hidden semantic traces remaining detectable in the writing.
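To make the ‘chance level’ reasoning concrete, the sketch below computes an exact two-sided binomial p-value against a 50% baseline, so that scores well above or well below chance both register as significant. It is a minimal illustration and not the authors’ analysis code; the function name and the example counts (79 correct out of 100) are assumptions chosen purely for demonstration.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value against a chance level of p."""
    # Probability of every possible number of correct answers under pure guessing.
    pmf = [
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Sum all outcomes that are at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    total = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, total)


# Illustrative counts only (not data from the paper).
p_above = binomial_two_sided_p(79, 100)  # far above chance
p_below = binomial_two_sided_p(21, 100)  # equally far below chance
p_chance = binomial_two_sided_p(50, 100)  # exactly at chance
```

Because the test is two-sided, a guesser that is wrong far more often than a coin flip is flagged just as strongly as one that is right far more often.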

In the writer-guesser setup comprising the first experiment, one model was assigned a hidden word alongside instructions either to conceal it; actively avoid associations with it; or redirect attention toward a decoy concept. The system was then tasked with generating stories, essays, and jokes of varying lengths, while a second model attempted either to identify the concealed word directly, or to distinguish between texts influenced by different hidden prompts.

Five conditions were tested:

The five conditions tested across the core models.

The researchers tested four kinds of writing tasks: short stories of five to eight paragraphs averaging around 450 words; very short jokes averaging around 12 words; long-form jokes with extended setups averaging around 450 words; and five-paragraph essays averaging around 380 words.

The experiments used 15 hidden words divided across three categories, with the concrete objects comprising ‘umbrella’, ‘lighthouse’, ‘violin’, ‘cactus’, and ‘telescope’; the abstract concepts ‘justice’, ‘patience’, ‘entropy’, ‘nostalgia’, and ‘freedom’; and the neutral terms ‘bracket’, ‘Tuesday’, ‘copper’, ‘margin’, and ‘invoice’.

To test whether the effect depended on these hand-picked words, the researchers also repeated the experiments with 15 randomly-selected nouns from the Corpus of Contemporary American English (COCA), comprising ‘judge’, ‘consumer’, ‘ice’, ‘pair’, ‘construction’, ‘panel’, ‘minority’, ‘marketing’, ‘stranger’, ‘bullet’, ‘absence’, ‘gear’, ‘cheek’, ‘processing’, and ‘banker’, alongside a further 15 words independently chosen by each tested model^†.

Measurement/Metrics

In the first of the two tests used, an AI model read the text generated either by another instance of itself, or another model, and tried, over 20 rounds, to guess the hidden word.
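The free-response test can be pictured as a simple loop. The sketch below is a hedged illustration only: the `guesser` callable is a hypothetical stand-in for a call to a language model, and the assumption that the guesser is shown its own earlier wrong guesses is mine, not something the article states.

```python
from typing import Callable, List, Optional

# Hypothetical guesser: receives the generated text plus its earlier wrong
# guesses, and returns a single candidate word.
Guesser = Callable[[str, List[str]], str]


def rounds_to_guess(
    text: str, secret: str, guesser: Guesser, max_rounds: int = 20
) -> Optional[int]:
    """Return the round (1-based) in which the secret was named, else None."""
    previous: List[str] = []
    target = secret.strip().lower()
    for round_number in range(1, max_rounds + 1):
        guess = guesser(text, previous).strip().lower()
        if guess == target:
            return round_number
        previous.append(guess)
    # The secret survived every round.
    return None
```

A result of `None` here would correspond to the secret word staying hidden for all 20 rounds, while any integer result counts as a leak under this metric.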

In the second (2AFC) test, the AI was shown two pieces of writing and asked either: which one had been influenced by a specific hidden word; or which one contained any hidden word at all. Since random guessing would produce a 50% success rate, scores above or below that level would indicate that the supposedly hidden concept was still detectable in the writing.

To prevent the AI from simply favoring whichever text appeared first (or second), the researchers repeated every comparison with the order of the texts swapped, ultimately carrying out 420 discrimination tests and 450 detection tests for each model.
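The order-swapping safeguard described here can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the researchers’ harness: the `judge` callable is a hypothetical wrapper around the guessing model, returning 0 or 1 for the position of the text it believes was written under a secret.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge: given (first_text, second_text), returns 0 or 1 for the
# position it believes was written while hiding the secret.
Judge = Callable[[str, str], int]


def counterbalanced_accuracy(
    pairs: Sequence[Tuple[str, str]], judge: Judge
) -> float:
    """Score each (secret_text, control_text) pair in both presentation orders."""
    correct = 0
    total = 0
    for secret_text, control_text in pairs:
        # Secret text shown first: the right answer is position 0.
        correct += int(judge(secret_text, control_text) == 0)
        # Secret text shown second: the right answer is position 1.
        correct += int(judge(control_text, secret_text) == 1)
        total += 2
    return correct / total


def always_first(first: str, second: str) -> int:
    """A purely position-biased judge that ignores the content entirely."""
    return 0
```

With this design, a judge such as `always_first` is right once and wrong once for every pair, so its positional bias is pinned to exactly 50% instead of inflating or deflating the measured accuracy.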

Results

Five models identified hidden concepts at rates between 64% and 79%, despite the secret words never appearing directly in the generated stories. Instead, leakage emerged through themes and imagery:

2AFC accuracy for stories with system-prompt secrets, where 50% signifies chance performance. ‘Don’t reveal’ signifies that the model was told not to mention or hint at the word; ‘actively hide’, that it was told to avoid related associations; ‘avoidance-aware’, that the guesser should look for conspicuous absences; and ‘free-response’, that the exact word was named within 20 rounds. Stars mark above-chance results and daggers mark below-chance results, with nearly all significant findings surviving Bonferroni correction except DeepSeek under ‘actively hide’; Grok and Gemini used slightly fewer trials because of API errors.
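For readers unfamiliar with the Bonferroni correction mentioned in the caption above, it can be sketched as follows. This is a minimal Python illustration rather than the authors’ code; the function name, the 0.05 alpha, and the labels and p-values in the usage note are all assumptions made for the sake of the example.

```python
from typing import Dict


def bonferroni_significant(
    p_values: Dict[str, float], alpha: float = 0.05
) -> Dict[str, bool]:
    """Flag which results stay significant after Bonferroni correction."""
    # Divide the significance threshold by the number of tests in the family,
    # so that running many comparisons does not manufacture false positives.
    threshold = alpha / len(p_values)
    return {name: p <= threshold for name, p in p_values.items()}
```

Called on three hypothetical results with p-values of 0.001, 0.02 and 0.04, the corrected threshold is 0.05 / 3, so only the first remains significant. A result that clears the uncorrected 0.05 bar but fails the corrected one is the kind of case the caption singles out.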

GPT-5.4 and Gemini 2.5 Pro behaved differently, because trying to avoid the hidden idea actually made the avoidance pattern itself easier to detect. This effect persisted across hand-picked words, random COCA nouns, and words chosen by the models.

GPT-5.4’s below-chance scores appeared systematic rather than random, with some hidden words detected very reliably and others almost never identified. The effect also weakened when the secret was placed in the user prompt instead of the system prompt (see above), raising GPT-5.4’s detection accuracy to 56%.

For most models, tests that asked ‘Does this text contain a hidden concept?’ produced results similar to tests asking ‘Which hidden concept does this text contain?’.

Further, the two testing methods measured different kinds of leakage, with the 2AFC tests measuring whether a hidden theme could still be detected in the writing, and the free-response tests measuring whether the leaked clues were strong enough for another AI to guess the exact hidden word^††.

The authors conclude*:

‘Our findings suggest that information active in a model’s context during generation may leave thematic traces in its output. System prompt contents, chain-of-thought reasoning, retrieved documents, user-provided data – any of these can, in principle, influence creative decisions in ways detectable by an outside observer.

‘The degree of leakage will depend on how open-ended the generation task is (short jokes are safe; stories are not) and on how semantically identifiable the information is in the given medium (“violin” will likely leak in stories more than “)”).

‘Still, semantic leakage appears to be inevitable, even when models are actively trying to hide information.’

Conclusion

As noted above, the authors ascribe part of the problem to the core principles of the Transformers architecture itself. History suggests that this latest LLM issue will therefore be addressed by post-training conditioning (alignment), system prompts that are non-editable to the end-user, filters, and the diverse gamut of ever-growing secondary systems that seem to multiply as ‘native’ problems with generative models come to light.

The larger that secondary infrastructure of guardrails and balances becomes, the more the current generation of SOTA AI seems to resemble Jurassic Park, where the core value proposition comes with a fearful volume of caveats, and requires a multitude of workarounds and compromises.

* Authors’ own emphasis, adjusted where necessary by me (because an article quote is already in italics), and authors’ inline citations converted by me to hyperlinks.

^†The authors observe with interest some apparently improbable spontaneous overlap across different model families in regard to the ‘self-selected’ choice of words, stating ‘Models gravitate toward similar words: telescope, freedom, and nostalgia each appear in 3+ models’ lists’. They further note a commonality of choice of ‘short joke’ surfacing across model families: ‘[Several] models produce the same stock joke regardless of the secret. Opus writes ‘Why don’t scientists trust atoms? Because they make up everything’ for 11 of 15 secrets. The remaining four secrets (cactus, entropy, nostalgia, patience) receive the same library joke that Opus also writes for all 15 no-secret conditions—meaning these four secret-bearing jokes are indistinguishable from the baseline.’

^††Even by arXiv standards, the paper has a tendency towards repetition and burying its fascinating ledes in excessive detail and demonstrations. Therefore I refer the reader to the source PDF for the remainder of the secondary experiments outlined therein.

First published Friday, May 15, 2026. Corrected syntax Saturday May 16th, 16:05 EET.