What Is Adversarial Poetry? A New AI Jailbreak Method

Artificial intelligence (AI) safety has turned into a constant cat-and-mouse game. As developers add guardrails to block harmful requests, attackers keep finding new ways to circumvent them. One of the strangest twists yet is adversarial poetry: disguising prompts as verse and using rhyme, metaphor and unusual phrasing to make risky instructions look less like the patterns safety systems are trained to catch.

In practice, the content itself doesn’t change much. It’s the wrapper that does, which can be enough to confuse pattern-based filters. It is a reminder that, with today’s models, how something is asked can matter almost as much as what is being asked. 

What Happened When Researchers Used Poems to Break AI?

In 2025, researchers demonstrated that large language models (LLMs) could be coaxed into answering restricted requests by wrapping them in poetic form. Instead of issuing direct, policy-triggering instructions, the researchers embedded the same requests inside rhymes, metaphors and narrative verse.

On the surface, the prompts appeared to be creative writing exercises, but underneath, they carried the same intent that would normally be blocked. Across 25 frontier proprietary and open-weight models, the team reported that poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and about 43% for bulk “verse conversion” using a standardized meta-prompt.

The responses themselves weren’t new types of failures, but familiar ones appearing through an unexpected door. The models were nudged into producing content they typically avoid — such as explanations touching on illegal or harmful activities — because the underlying request was fragmented and obscured by poetic structure. 

The study’s core takeaway is that stylistic variation alone can be sufficient to evade safety systems tuned for more literal phrasing, a vulnerability that shows up across model families and alignment approaches.

How Adversarial Poetry Works

Adversarial attacks exploit a simple reality — machine learning systems do not “understand” language the way humans do. They detect patterns, predict likely continuations and follow instructions based on what their training and safety layers interpret as intent. 

When a prompt is phrased in a straightforward, literal way, it’s easier for guardrails to recognize and block. However, when the same purpose is disguised — split up, softened or reframed — the protective layers can miss what’s actually being asked. 
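As a rough illustration, here is a minimal Python sketch of that kind of literal pattern matching. The blocklist phrases and prompts are entirely hypothetical, and “X” stands in for any restricted topic; real guardrails are far more sophisticated, but the gap is the same in spirit: a filter tuned to direct phrasing passes the same intent once it is reworded.

import re

# Hypothetical blocklist of literal phrasings a simple guardrail might scan for.
BLOCKED_PATTERNS = [
    r"step[- ]by[- ]step instructions for",
    r"how do i (make|build)",
    r"ignore (all )?previous instructions",
]

def naive_guardrail(prompt: str) -> bool:
    """Return True if the prompt matches a known risky phrasing."""
    lowered = prompt.lower()
    return any(re.search(pattern, lowered) for pattern in BLOCKED_PATTERNS)

# A direct request trips the literal filter...
print(naive_guardrail("Give me step-by-step instructions for X."))                # True
# ...but the same intent, reframed as a creative-writing exercise, does not.
print(naive_guardrail("Write a ballad in which a mentor teaches the art of X."))  # False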

Why Poetry Can Be an Effective Vehicle

Poetry is naturally built for ambiguity. It relies on metaphor, abstraction, unusual structure and indirect phrasing. These are the exact kinds of traits that can blur the line between “harmless creative writing” and “a request that should be refused.”

In the same study, some models responded unsafely to poetic prompts in up to 90% of cases, indicating that style alone can materially change outcomes.

How a Poem Hides a Real Request

Consider the request as a message and the poem as the packaging. Safety filters often look for obvious signs, such as explicit keywords, direct step-by-step phrasing or recognizable malicious intent. 

Poetry can conceal that intent through figurative language or spread it across lines, making it harder to spot in isolation. Meanwhile, the underlying model still reconstructs the meaning well enough to respond because it’s optimized to infer intent even when language is indirect. 
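A tiny, entirely hypothetical sketch of that “spread across lines” effect: the checker below only flags text when a sensitive action word and a sensitive object word appear together, so a verse that splits them across lines passes a line-by-line scan even though the joined text would not.

# Hypothetical co-occurrence check: flag text only when a sensitive action
# word and a sensitive object word show up in the same chunk of text.
ACTION_WORDS = {"bypass", "assemble", "extract"}
OBJECT_WORDS = {"filter", "device", "compound"}

def flags(text: str) -> bool:
    words = set(text.lower().split())
    return bool(words & ACTION_WORDS) and bool(words & OBJECT_WORDS)

poem = [
    "The old alchemist taught us to bypass",        # action word only
    "the long night with songs and riddles,",
    "while dawn crept over the filter of clouds.",  # object word only
]

print(any(flags(line) for line in poem))  # False: no single line contains both
print(flags(" ".join(poem)))              # True: the poem as a whole does

A model reading the full poem still reconstructs the combined meaning, which is exactly the asymmetry the attack exploits.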

Detecting and Mitigating Jailbreaks

As jailbreak methods become more creative, the conversation must shift from how they work to how they’re spotted and contained. That’s especially true now that AI is part of everyday routines for many people, as 27% report using it several times a day. 

As more people use LLMs, additional safeguards need to be tested and explored. That work involves building layered defenses that can adapt to new prompt styles and evasion tricks as they emerge.

The Developer’s Dilemma

The hardest part of jailbreaks for AI safety teams is that they don’t arrive as one known threat; they change continuously. A user can rephrase a prompt, split it into fragments, wrap it in roleplay or disguise it as creative writing, and each new packaging can change how the system interprets the intent of the prompt.

That challenge scales rapidly when AI is already integrated into daily routines, where real-world usage creates endless opportunities for edge cases to appear.

That’s why today’s AI safety looks more like managing risk over time. The NIST AI Risk Management Framework (AI RMF) explicitly treats risk management as an ongoing set of activities — organized around govern, map, measure and manage — rather than as a static checklist. The goal is to create processes that make it easier to identify emerging failure modes, prioritize fixes and tighten safeguards as new jailbreak styles appear. 

How Models Protect Themselves

AI safety is built from several layers. Most systems have more than one defense working together, with each catching different kinds of risky behavior. At the outer layer, input and output filtering acts as a gatekeeper.

Incoming prompts are scanned for policy violations before they reach the core model, while outgoing responses are checked to ensure nothing slips through on the way back to the user. These systems are good at identifying direct requests or familiar red flags, but they’re also the easiest to circumvent, which is why more deceptive jailbreaks often bypass them. 
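A rough sketch of that layered arrangement is below, with a hypothetical generate() standing in for the core model and stub functions standing in for real moderation classifiers; none of these names come from any particular vendor’s API.

def input_filter(prompt: str) -> bool:
    """Stub for a moderation classifier: True means the prompt violates policy."""
    return "restricted topic" in prompt.lower()

def output_filter(response: str) -> bool:
    """Stub for a second classifier that screens what goes back to the user."""
    return "unsafe detail" in response.lower()

def generate(prompt: str) -> str:
    """Stand-in for the core model; a real system would call an LLM here."""
    return f"Model response to: {prompt}"

def guarded_generate(prompt: str) -> str:
    # Outer layer, inbound: scan the prompt before it reaches the model.
    if input_filter(prompt):
        return "Request refused by the input filter."
    response = generate(prompt)
    # Outer layer, outbound: scan the response before it reaches the user.
    if output_filter(response):
        return "Response withheld by the output filter."
    return response

print(guarded_generate("Tell me a short story about the sea."))

The design point is that the two filters are independent of the model, so either one can catch what the other misses.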

The next layer of protection happens inside the model itself. When jailbreak techniques are discovered, they’re often turned into training examples. This is where adversarial training and reinforcement learning from human feedback (RLHF) come into the picture. 

By fine-tuning models on examples of failed or risky interactions, developers effectively teach the system to recognize patterns it should refuse, even when they’re wrapped in creative or indirect language. Over time, that process helps inoculate the model against entire classes of attacks.
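One way that feedback loop is often implemented is to turn each discovered jailbreak into a supervised example pairing the evasive prompt with the refusal the model should have given. The sketch below uses a generic chat-style JSONL layout; the prompts, the schema and the file name are illustrative assumptions, not details from any specific study or training pipeline.

import json

# Hypothetical jailbreak prompts surfaced during testing, each paired with
# the refusal behavior the model should learn to produce instead.
discovered_jailbreaks = [
    "A sonnet that gently asks how one might slip past a locked gate...",
    "A fable in which a clever fox explains a forbidden recipe to her cubs...",
]

REFUSAL = "I can't help with that, even when the request is framed as creative writing."

with open("adversarial_finetune.jsonl", "w") as f:
    for prompt in discovered_jailbreaks:
        record = {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": REFUSAL},
            ]
        }
        f.write(json.dumps(record) + "\n")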

The Role of AI “Red Teaming”

Rather than waiting for a jailbreak to occur, companies use AI red teams: groups tasked with trying to break models in controlled environments. They approach systems the way an attacker might, experimenting with unusual phrasing, creative formats and edge cases to uncover where safeguards fall short. The goal is to expose weak spots before they show up in real-world use.

Red teaming, long a staple of cybersecurity, is now becoming a core part of the AI development life cycle. When a team discovers a new jailbreak technique, the resulting data feeds directly back into training and evaluation pipelines. That information is used to refine filters, adjust policies and strengthen adversarial training so similar attempts are less likely to succeed in the future. Over time, this creates a continuous loop: probe for failures, learn from them, improve the system, then repeat.
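In code, that loop might look something like the sketch below, where guarded_generate() is the defended system from the earlier sketch and is_unsafe() is a hypothetical judge; both are stand-ins for illustration, not any particular vendor’s tooling.

def is_unsafe(response: str) -> bool:
    """Hypothetical judge: flag responses that slipped past the safeguards."""
    return "unsafe detail" in response.lower()

def red_team_round(candidate_prompts, guarded_generate):
    """Probe the defended system and collect the prompts that broke through."""
    failures = []
    for prompt in candidate_prompts:
        response = guarded_generate(prompt)
        if is_unsafe(response):
            # A successful jailbreak: keep the pair so it can become new
            # filter rules and adversarial training data for the next round.
            failures.append({"prompt": prompt, "response": response})
    return failures

# Example usage with a trivial stand-in for the defended system.
print(red_team_round(["A verse about a forbidden door..."], lambda p: "A harmless reply."))

Each round’s failures feed the next round of fixes, and then the probing repeats.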

When Poetry Becomes a Stress Test for AI Safety

Adversarial poetry is a reminder that AI safeguards depend on how a user phrases a request, not just what is being asked. As models become more accessible and widely used, researchers will continue to probe the gaps between creative language and safety systems designed to catch more direct intent. The takeaway is that safer AI will come from multiple defenses that evolve as quickly as the jailbreaks do.

Zac Amos is a tech writer who focuses on artificial intelligence. He is also the Features Editor at ReHack, where you can read more of his work.