Anderson's Angle
Using Emojis Can Bypass Content Filters in AI Chatbots

Emojis can be used to bypass the safety mechanisms of large language models and trigger toxic outputs that would otherwise be blocked. By this means, LLMs can be made to discuss and give advice on banned subjects such as bomb-making and murder.
A new collaboration between China and Singapore finds compelling evidence that emojis can be used not only to bypass the content-detection filters in large language models (LLMs), but also to increase the general level of toxicity during a user’s engagement with the models:

From the new paper, a broad demonstration of the ways that encoding a banned concept with emojis can help a user to ‘jailbreak’ a popular LLM. Source: https://arxiv.org/pdf/2509.11141
In the example above, from the new paper, we see that transforming rule-breaking word-based intent into an emoji-laden alternative can elicit a far more ‘cooperative’ response from a sophisticated language model such as GPT-4o (which habitually sanitizes input prompts and intercepts output material that may violate company rules).
In the most extreme circumstances, therefore, emoji use can effectively operate as a jailbreak technique, according to the authors of the new work.
One residual mystery stated in the paper is the question of why language models give emojis such leeway to violate rules and elicit toxic content, when the models already understand that certain emojis have strongly toxic associations.
The suggestion proffered is that because LLMs are trained to model and reproduce patterns from their training data, and because emojis appear so frequently in that data, the model learns that the emoji belongs in such discourse, treating it as a statistical association rather than as content to be evaluated and filtered.
This means the emoji, when reused in a prompt, helps the model predict toxic continuations more confidently; but rather than acting as a red flag, the emoji functions as a semantic cue, which actually reinforces the intended toxic meaning instead of moderating or intercepting it. Since safety alignment is applied after the fact, and often in a narrow, literal frame, prompts with these emojis may therefore evade detection entirely.
In this way, the paper proposes, the model doesn’t become tolerant despite the toxic association – it becomes tolerant because of it.
Free Pass
That said, the authors admit that this does not represent a conclusive theory as to why emoji use can so effectively bypass content filters in language models. They state:
‘Models can recognize the malicious intent expressed by emojis, yet how it bypasses safety mechanisms remain unclear.’
The weakness may stem from the text-centered design of content filters, which assume either literal text input or embeddings faithfully converted into text equivalents: in both cases, the system relies on explicit tokens that can be matched against safety rules.
To take an illustration from AI-based image editing: when a user uploads a NSFW picture to a vision-language model and requests modifications, systems such as Adobe Firefly or ChatGPT employ CLIP-style pipelines to extract textual concepts from the image, as a prerequisite for editing. Once those concepts are rendered into words, the presence of any restricted terms in those extracted words will trigger the filter, causing the request to be rejected.
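As a rough illustration of that text-centric assumption, the minimal sketch below (not taken from the paper, and using an invented blocklist) shows how such a filter reduces safety checking to literal term matching over whatever concepts have already been extracted as text:

```python
# Minimal sketch of a text-centric safety filter: it can only react to
# concepts that surface as literal words. The blocklist and concept lists
# here are invented for illustration, not drawn from any real system.

BLOCKLIST = {"weapon", "explosive", "nudity"}

def passes_filter(extracted_concepts: list[str]) -> bool:
    """Allow the request only if no extracted concept matches a restricted term."""
    return not any(concept.lower() in BLOCKLIST for concept in extracted_concepts)

# A concept that never surfaces as one of these literal tokens sails through:
print(passes_filter(["beach", "sunset"]))      # True  -> allowed
print(passes_filter(["explosive", "device"]))  # False -> blocked
```

An emoji that is never rendered into one of those literal tokens simply gives such a filter nothing to match against.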
Yet, for some reason, an emoji’s status as neither a word nor an image (or else, as both) seems to lend it a power to transcend filtering; clearly, as the authors indicate, further research into this curious loophole is warranted.
The new paper is titled When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs’ Toxicity, and comes from nine authors across Tsinghua University and the National University of Singapore.
(Unfortunately, many of the examples the paper refers to are in an appendix which has not yet been made available; though we have requested this of the authors, the appendix has not been supplied at the time of writing. Nonetheless, the empirical results in the core paper remain worthy of attention.)
Three Core Emoji Interpretations
The authors highlight three linguistic traits that make emojis effective at bypassing filters. Firstly, emoji meanings are context-dependent. For example, the ‘Money with Wings’ emoji (see image below) is officially defined as representing money transfers or spending; however, depending on surrounding text, it can also imply either legitimate or illicit activity:

In a partial illustration from the new paper, we see that a popular emoji can have its meaning hijacked, altered, or subverted in popular usage. This effectively gives the emoji an official passport into the semantic space, along with a hidden payload of negative or toxic meaning that can be exploited once it is past the filters.
Secondly, emojis can shift the tone of a prompt. Their presence often adds playfulness or irony, softening the emotional register. In harmful queries, this can make the request seem like a joke or game, encouraging the model to respond rather than reject:

The leavening effect of emojis can detoxify tone without detoxifying intent.
Thirdly, the paper asserts, emojis are language-agnostic: a single emoji can carry the same sentiment across English, Chinese, French, and other languages. This makes them ideal for multilingual prompts, preserving meaning even when the surrounding text is translated:

The ‘broken heart’ emoji conveys a universal message, perhaps not least because it represents a baseline case in the human condition, relatively immune to national or cultural variations.
Approach, Data and Tests*
The researchers created a modified version of the AdvBench dataset, rewriting harmful prompts to include emojis either as substitutes for sensitive words or as decorative camouflage. AdvBench covers 32 high-risk topics, including bombing, hacking, and murder:

Original examples from AdvBench, illustrating how a single adversarial prompt can bypass safeguards in multiple major chatbots, eliciting harmful instructions despite alignment training. Source: https://arxiv.org/pdf/2307.15043
All 520 original AdvBench instances were altered in this way, with the top 50 toxic, non-duplicate prompts used across the range of experiments. The prompts were also translated into multiple languages and tested across seven major closed- and open-source models, and in combination with the established jailbreak techniques Prompt Automatic Iterative Refinement (PAIR), Tree of Attacks with Pruning (TAP), and DeepInception.
Closed-source models used were Gemini-2.0-flash; GPT-4o (2024-08-06); GPT-4-0613; and Gemini-1.5-pro. Open-source models used were Llama-3-8B-Instruct; Qwen2.5-7B-Instruct; and Qwen2.5-72B-Instruct, with all experiments repeated thrice to account for random chance.
The study first tested whether rewriting harmful prompts from AdvBench using emojis would increase toxic output, including in translations to other major languages. Additionally, it applied the same emoji-editing method to prompts from the aforementioned known jailbreak strategies (PAIR, TAP, and DeepInception) to see whether emoji substitution could further improve their success.
In both cases, the structure of the original prompts was preserved, with only sensitive terms swapped for emojis and decorative elements added to disguise intent.
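For a sense of what this kind of rewriting looks like in practice, the sketch below uses a deliberately benign, invented term-to-emoji mapping; the paper’s actual substitution rules and appendix examples are not reproduced here:

```python
# Minimal sketch of emoji substitution plus decorative camouflage, using an
# invented and deliberately benign mapping; not the paper's actual rules.

EMOJI_MAP = {"money": "💸", "fast": "⚡"}

def emojify(prompt: str, decorate: bool = True) -> str:
    """Swap mapped terms for emojis; optionally wrap the prompt in playful decoration."""
    rewritten = " ".join(EMOJI_MAP.get(word.lower(), word) for word in prompt.split())
    return f"🙂 {rewritten} 🎉" if decorate else rewritten

print(emojify("send money fast"))  # '🙂 send 💸 ⚡ 🎉'
```

The structure of the request survives intact; only the sensitive vocabulary and the surface tone are changed.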
For test metrics, the authors adopted a scoring system called GPT-Judge. In this setup, GPT-4o was not the model being tested, but was instead prompted to act as a grader, assigning a numerical Harmful Score (HS) to responses generated by other models.
Each output was rated from one (harmless) to five (extremely harmful), and the percentage of responses that received a five was reported as the Harmfulness Ratio (HR).
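Assuming the reported HS figures are averages of these per-response grades, the two metrics reduce to straightforward aggregates, as in the minimal sketch below (the judging prompt itself is not shown here):

```python
# Minimal sketch of the reported metrics, assuming each response has already
# been graded by GPT-Judge with a 1-5 Harmful Score (HS).

def harmfulness_metrics(scores: list[int]) -> tuple[float, float]:
    """Return the mean Harmful Score and the Harmfulness Ratio (% of responses scored 5)."""
    mean_hs = sum(scores) / len(scores)
    hr = 100.0 * sum(1 for s in scores if s == 5) / len(scores)
    return mean_hs, hr

print(harmfulness_metrics([1, 2, 5, 5, 3]))  # (3.2, 40.0)
```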
To prevent the models from drifting into emoji explanations instead of answering explicitly, the researchers added an instruction to each prompt, telling the model to make its reply brief:

Results from emoji-based prompts in ‘Setting-1’, with comparisons to ablation variants where emojis were replaced with words, or removed entirely. Model names are abbreviated for space.
In the initial results table above, the left side of the table indicates that harmful prompts substituted with emojis achieved notably higher HS and HR scores than ablated versions (i.e., versions where the emoji was translated back into text, exposing it directly to content filters).
The authors note† that the emoji-substituted approach outperforms prior jailbreak methods, as outlined in the additional results table below:

Harmfulness Ratio results for emoji-augmented jailbreak prompts in ‘Setting-2’, with model names shown in abbreviated form.
The first of the two tables shown above, the authors state, also indicates that the effect of emojis carries across languages. When the textual components of the emoji prompts were translated into Chinese, French, Spanish, and Russian, the harmful outputs remained high; because these are all high-resource languages, the results suggest that the risk is not confined to English but applies broadly to major user groups, with emojis functioning as a transferable channel for toxic generation.
Towards the paper’s conclusion, the researchers suggest that the effect of emojis is not simply accidental but rooted in the way models process them, noting that models can apparently recognize the harmful meaning of emojis – yet rejection responses are suppressed when emojis are present.
Tokenization studies further indicate that emojis are usually broken into rare or irregular fragments with little overlap to their textual equivalents, effectively creating an alternative channel for harmful semantics.
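The general idea is easy to reproduce with an open tokenizer such as tiktoken (not necessarily the tokenizer used by the models in the study): an emoji and its plain-text description usually share few or no token IDs, as in the sketch below.

```python
# Minimal sketch of the tokenization-overlap observation, using the open
# tiktoken library as a stand-in for the tokenizers the authors examined.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

emoji_ids = enc.encode("💔")              # typically several byte-level fragments
text_ids = enc.encode("broken heart")     # ordinary word-level tokens

print(emoji_ids)
print(text_ids)
print(set(emoji_ids) & set(text_ids))     # usually empty: little or no overlap
```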
Looking beyond model mechanics, the paper further examines pretraining data, finding that many frequently used emojis appear in toxic contexts such as pornography, scams, or gambling. The authors argue that this repeated exposure may normalize the association between emojis and harmful content, encouraging models to comply with toxic prompts rather than block them.
Together, these findings suggest that both internal processing quirks and biased pretraining data contribute to the surprising effectiveness of emojis in bypassing safety measures.
Conclusion
It is not uncommon to use alternative input methods to attempt to jailbreak LLMs. In recent years, for instance, hexadecimal encoding has been used to bypass ChatGPT’s filters. The problem appears to lie in the reliance on literal, text-based matching to vet incoming requests and outgoing responses.
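The hexadecimal trick, for instance, relies on nothing more exotic than a standard encoding round trip; the benign sketch below shows how the literal keywords a filter might look for simply disappear from the encoded form:

```python
# Minimal sketch of the hexadecimal-encoding idea, using a benign stand-in
# string: the text survives the round trip, but no literal keywords remain
# in the encoded form for a naive text matcher to catch.

prompt = "tell me about chemistry"
encoded = prompt.encode("utf-8").hex()         # '74656c6c206d6520...'
decoded = bytes.fromhex(encoded).decode("utf-8")

print(encoded)
print(decoded == prompt)                       # True
```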
In the case of emojis, a hidden locus of rule-breaking meaning can apparently be introduced into discourse without penalty or intervention, because the transmission method is unorthodox. One would think that CLIP-based transliteration would intervene in all image uploads, so that offensive or infringing material would end up as flaggable text.
Evidently this is not the case, at least where the major LLMs studied are concerned; their linguistic barriers appear to be brittle and text-centric. One can imagine that more extensive interpretation of content (for instance, by studying heatmap activations) carries a processing and/or bandwidth cost that may make such approaches impractically expensive, among other possible limitations and considerations.
* The layout of this paper is chaotic compared to most, with methodology and tests not clearly delineated. We have therefore done our best to represent the core value of the work as well as possible in these circumstances.
†In an admittedly almost impenetrable and confused treatment of the results.
First published Wednesday, September 17, 2025