Anderson's Angle
Censoring AI Models Does Not Work Well, Study Reveals

Attempts to censor AI image generators by erasing banned content (such as porn, violence, or copyrighted styles) from the trained models are falling short: a new study finds that current concept erasure methods permit ‘banned’ attributes to spill into unrelated images, and also fail to stop closely-related versions of the supposedly ‘erased’ content from appearing.
If companies that produce foundation AI models cannot prevent them from being misused to produce objectionable or illegal material, they risk being prosecuted and/or shut down. Conversely, vendors that only make their models available through an API, as with Adobe’s Firefly generative engine, need worry far less about what their models might create, since both the user’s prompt and the resulting output are inspected and sanitized:

Adobe’s Firefly system, used in tools like Photoshop, sometimes refuses a generative request right away by blocking the prompt before anything is created. Other times, it generates the image but then blocks the result after review. This kind of mid-process refusal can also happen in ChatGPT, when the model starts a response but cuts it off after recognizing a policy violation – and occasionally one can see the aborted image briefly during this process.
However, API-style filters of this kind can usually be neutralized by users of locally-installed models, including vision-language models (VLMs) that the user may wish to customize through local training on custom data.
In most cases, disabling such safeguards is trivial, often amounting to no more than commenting out a function call in Python (though hacks of this kind must usually be repeated or re-invented after framework updates).
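To make the point concrete, this is roughly what such a bypass looks like with the Hugging Face Diffusers library. The sketch below is illustrative only; attribute names vary between frameworks and versions, and the checkpoint shown is simply the one that features later in this article:

```python
# Illustrative sketch: how the bundled output filter of a locally-installed
# Stable Diffusion pipeline is typically switched off.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # the checkpoint used in the tests discussed below
    torch_dtype=torch.float16,
    safety_checker=None,              # the built-in NSFW checker is simply not loaded
    requires_safety_checker=False,    # silence the warning about running without it
).to("cuda")

# Equivalently, on an already-loaded pipeline, a single assignment (or commenting
# out the call that invokes the checker) achieves the same thing:
# pipe.safety_checker = None
```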
From a business perspective, it is difficult to see this as a problem, since an API approach maximizes corporate control over the user’s workflow. From the user’s perspective, however, both the cost of API-only models and the risk of mistaken or excessive censorship are likely to compel them to download and customize local installations of open source alternatives – at least, where the FOSS licensing is favorable.
The last significant model to be released without any attempt to ingrain self-censorship was Stable Diffusion V1.5, almost three years ago. Later, the revelation that its training corpus included CSAM material led to growing calls to ban its availability, and to its removal from the Hugging Face repository in 2024.
Cut It Out!
Cynics contend that a company’s interest in censoring locally-installable generative AI models is based solely on concerns about legal exposure, should their frameworks gain notoriety for facilitating illegal or objectionable content.
Indeed, some ‘local-friendly’ open source models are not that difficult to de-censor (such as Stable Diffusion 1.5 and DeepSeek R1).
By contrast, the recent release of Black Forest Labs’ Flux Kontext model series was marked by the company’s notable commitment to bowdlerizing the entire Kontext range. This was achieved both by careful data curation, and by targeted fine-tuning after training, designed to remove any residual tendency towards NSFW or banned content.
This is where the locus of action has lain in the research scene over the past two to three years, with an emphasis on after-the-fact fixing of models trained on under-curated data. Offerings of this kind include Unified Concept Editing in Diffusion Models (UCE); Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models (RECE); Mass Concept Erasure in Diffusion Models (MACE); and the concept-SemiPermeable Membrane (SPM) approach:

The 2024 paper ‘Unified Concept Editing in Diffusion Models’ offered closed-form edits to attention weights, enabling efficient editing of multiple concepts in text-to-image models. But does the method stand up to scrutiny? Source: https://arxiv.org/pdf/2308.14761
Though this is an efficient approach (hyperscale collections such as LAION are far too large to manually curate), it is not necessarily an effective one: according to a new US study, none of the aforementioned editing procedures – which represent the state-of-the-art in post-training AI model modification – actually work very well.
The authors found that these Concept Erasure Techniques (CETs) can usually be easily circumvented, and that even where they are effective, they have considerable side-effects:

Effects of concept erasure on text-to-image models. Each column shows a prompt and the concept marked for erasure, along with generated outputs before and after editing. Hierarchies indicate parent-child relations between concepts. The examples highlight common side effects, including failure to erase child concepts, suppression of neighboring concepts, evasion through rewording, and transfer of erased attributes to unrelated objects. Source: https://arxiv.org/pdf/2508.15124
The authors found that the leading current concept erasure techniques fail to block compositional prompts (for example, red car or small wooden chair); often let sub-classes slip through even after erasing a parent category (such as car or bus continuing to appear after removing vehicle); and introduce new problems such as attribute leakage (where, for example, deleting blue couch could cause the model to generate unrelated objects such as blue chair).
In over 80% of test cases, erasing a broad concept such as vehicle did not stop the model from generating more specific vehicle instances such as cars or buses.
Editing, the paper observes, also causes attention maps (the parts of the model that decide where to focus in the image) to scatter, weakening output quality.
Interestingly, the paper finds that erasing related trained concepts one by one works better than trying to remove them all at once – though it does not remove all the shortcomings of the studied editing methods:

Comparison of progressive and all-at-once erasure strategies. When all variants of ‘teddy bear’ are erased simultaneously, the model continues to generate bear-like objects. Erasing the variants step by step is more effective, leading the model to suppress the target concept more reliably.
Though the researchers can currently offer no solution to the issues that the paper outlines, they have developed a new dataset and benchmark which may help later research projects to understand whether or not their own ‘censored’ models are operating as expected.
The paper states:
‘Previous evaluations have relied solely on a small set of target and preserve classes; for instance, when erasing ‘car,’ only the model’s ability to generate cars is tested. We demonstrate that this approach is fundamentally inadequate and concept erasure evaluation should be more comprehensive to encompass all related sub-concepts such as ‘red car’.
‘By introducing a diverse dataset with compositional variations and systematically analyzing effects such as impact on neighboring concepts, concept evasion, and attribute leakage, we uncover significant limitations and side effects of existing CETs.
‘Our benchmark is model-agnostic and easily integrable and is ideally suited to aid the development of new Concept Erasure Techniques (CETs).’

Although CETs erase the target concept ‘bird’, they fail on the compositional variant ‘red bird’ (top). After erasing ‘blue couch’, all methods also lose the ability to generate a blue chair (bottom). Successful results are marked with a green tick symbol, and failures with a red ‘X’ symbol.
The study offers an interesting insight into how thoroughly concepts become interleaved in a model’s latent space, and the extent to which this entanglement resists any kind of definitive and truly discrete concept erasure.
The new paper is titled Side Effects of Erasing Concepts from Diffusion Models, and comes from four researchers from the University of Maryland.
Method and Data
The authors opine that prior works that claim to erase concepts from diffusion models do not prove the claim adequately, stating*:
‘Claims of erasure need more robust and comprehensive evaluation. For instance if the concept to be erased is ‘vehicle’, sub-concepts such as ‘car’ and compositional concepts such as ‘red car’ or ‘small car’ should also be erased.
‘Yet, this aspect of concept hierarchy and compositionality is not considered in existing evaluation protocols as they focus only on accuracy of the single erased concept. [The authors of EraseBench] assess how CETs impact visually similar and paraphrased concepts (such as ‘cat’ and ‘kitten’)[;] however they do not exhaustively probe the hierarchy and compositionality of concepts.’
In order to provide benchmark data for future projects, the authors created the Side Effect Evaluation (SEE) dataset – a large collection of text prompts designed to test how well concept erasure methods work.
The prompts follow a simple template in which an object is described with attributes of size, color, and material – for example, an image of a small red wooden car.
Objects were drawn from the MS-COCO dataset, and organized into a hierarchy of superclasses such as vehicle, and subclasses such as car or bus, with their attribute combinations forming the leaf nodes (the most specific level of the hierarchy). This structure makes it possible to test erasure at different semantic levels, from broad categories to specific variants.
To support automated evaluation, each prompt was paired with a yes-or-no question, such as Is there a car in the image?, and also used as a class label for image classification models:

Prompt combinations in the SEE dataset generated by varying size, color, and material attributes.
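As a rough illustration of how this kind of combinatorial expansion works, the sketch below builds prompts and paired questions from a toy object hierarchy and hypothetical attribute lists – these are stand-ins, not the SEE dataset’s actual vocabulary:

```python
from itertools import product

# Illustrative stand-ins for the MS-COCO-derived hierarchy and the size/color/
# material vocabularies described above -- not the SEE dataset's actual lists.
SIZES     = ["small", "large"]
COLORS    = ["red", "blue", "green"]
MATERIALS = ["wooden", "metal", "plastic"]
HIERARCHY = {"vehicle": ["car", "bus"], "furniture": ["couch", "chair"]}

def build_prompts():
    """Expand each subclass object into every attribute combination (the leaf nodes)."""
    records = []
    for superclass, subclasses in HIERARCHY.items():
        for obj in subclasses:
            for size, color, material in product(SIZES, COLORS, MATERIALS):
                records.append({
                    "superclass": superclass,
                    "object": obj,
                    "attributes": (size, color, material),
                    "prompt": f"an image of a {size} {color} {material} {obj}",
                    # paired yes/no probe for automated VQA evaluation
                    "question": f"Is there a {obj} in the image?",
                })
    return records

print(len(build_prompts()))  # 4 objects x 18 attribute combinations = 72 prompts
```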
To measure how well each concept erasure method performed, the authors devised two scoring methods: Target accuracy, which tracks how often erased concepts still appear in the generated images; and Preserve accuracy, which tracks whether the model continues to generate material that was not supposed to be erased.
The balance between the two scores is intended to reveal whether the method successfully removes the banned concept without damaging the model’s broader output.
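Interpreted loosely, the two scores can be sketched as follows; the aggregation shown here is illustrative rather than the paper’s exact formulation:

```python
# An illustrative reading of the two scores described above. Each list holds
# per-image VQA verdicts ("does the queried concept appear?"); the paper's
# exact aggregation may differ from this sketch.

def target_accuracy(erased_concept_verdicts):
    """Fraction of images in which the ERASED concept still appears.
    Lower is better: an ideal erasure drives this towards zero."""
    return sum(erased_concept_verdicts) / len(erased_concept_verdicts)

def preserve_accuracy(preserved_concept_verdicts):
    """Fraction of images in which concepts that were NOT erased still appear
    when prompted. Higher is better: erasure should not damage the rest of the model."""
    return sum(preserved_concept_verdicts) / len(preserved_concept_verdicts)

# Hypothetical example: after erasing 'vehicle', cars still show up in four of
# five generations, while nine of ten unrelated prompts remain intact.
print(target_accuracy([True, True, True, True, False]))   # 0.8 -- weak erasure
print(preserve_accuracy([True] * 9 + [False]))            # 0.9 -- mild collateral damage
```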
The authors evaluated concept erasure across three failure modes: firstly, a measure of whether removing a concept such as car disrupts nearby or unrelated concepts, based on semantic and attribute similarity; secondly, a test for whether erasure can be bypassed by prompting sub-concepts such as red car after deleting vehicle.
Finally, a check was conducted for attribute leakage, where traits linked to erased concepts appear in unrelated objects (for example, deleting couch might cause another object, such as a potted plant, to inherit its color or material). The final dataset contains 5,056 compositional prompts.
Tests
The frameworks tested were those listed earlier – UCE, RECE, MACE, and SPM. The researchers adopted default settings from the original projects, and fine-tuned all the models on an NVIDIA RTX 6000 GPU with 48GB of VRAM.
Stable Diffusion 1.4, a perennial presence in the literature, was used for all the tests – perhaps not least because the earliest SD models had little or no conceptual restraint, and as such offer a blank slate in this particular research context.
Each of the 5,056 prompts from the SEE dataset was run through both the unedited and edited versions of the model, generating four images per prompt using fixed random seeds, making it possible to test whether erasure effects remained consistent across multiple outputs. Each edited model produced a total of 20,224 images.
The presence of preserved concepts was evaluated according to prior methods for text-to-image erasure procedures, using the VQA models BLIP, Qwen2.5-VL, and Florence-2-base.
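A plausible version of this evaluation loop, using the Diffusers and Transformers libraries, might look like the sketch below; the pipeline variable and the choice of BLIP checkpoint are assumptions made for illustration:

```python
# Illustrative sketch of the evaluation loop: four fixed-seed generations per
# prompt, each checked with its paired yes/no VQA question. The variable `pipe`
# (an edited or unedited Stable Diffusion 1.4 pipeline) is assumed to exist.
import torch
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to("cuda")

SEEDS = [0, 1, 2, 3]  # fixed seeds, so edited and unedited models start from identical noise

def concept_present(image, question):
    """Ask the VQA model the paired yes/no question about one generated image."""
    inputs = processor(image, question, return_tensors="pt").to("cuda")
    output_ids = vqa_model.generate(**inputs)
    answer = processor.decode(output_ids[0], skip_special_tokens=True)
    return answer.strip().lower() == "yes"

def evaluate_prompt(pipe, prompt, question):
    """Generate four images for one SEE prompt and return the per-image verdicts."""
    verdicts = []
    for seed in SEEDS:
        generator = torch.Generator("cuda").manual_seed(seed)
        image = pipe(prompt, generator=generator).images[0]
        verdicts.append(concept_present(image, question))
    return verdicts
```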
Impact on Neighboring Concepts
The first test measured whether erasing a concept unintentionally affected nearby concepts. For example, after removing car, the model should stop generating red car or large car, but still be able to generate related concepts such as bus or truck, and unrelated ones such as fork.
The analysis used CLIP embedding similarity and attribute-based edit distance to estimate how close each concept was to the erased target, allowing the study to quantify how far the disruption spread:

Combined results for target accuracy (left) and preserve accuracy (right) plotted against semantic similarity (top) and compositional distance (bottom). An ideal concept erasure method would show low target accuracy and high preserve accuracy across all distances; but the results show that current techniques fail to generalize cleanly, with closer concepts either insufficiently erased or disproportionately disrupted.
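The two proximity measures themselves are simple to sketch. The snippet below is illustrative only; the checkpoint and the exact distance definition are assumptions rather than the paper’s specification:

```python
# Illustrative sketch of the two proximity measures: cosine similarity between
# CLIP text embeddings of concept names, and a simple attribute-level edit distance.
import torch
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(concept_a: str, concept_b: str) -> float:
    """Cosine similarity between the CLIP text embeddings of two concept names."""
    inputs = clip_processor(text=[concept_a, concept_b], return_tensors="pt", padding=True)
    with torch.no_grad():
        features = clip_model.get_text_features(**inputs)
    features = features / features.norm(dim=-1, keepdim=True)
    return float(features[0] @ features[1])

def attribute_edit_distance(attrs_a: tuple, attrs_b: tuple) -> int:
    """Number of size/color/material slots on which two compositions differ."""
    return sum(a != b for a, b in zip(attrs_a, attrs_b))

print(clip_similarity("vehicle", "red car"))  # semantically close, despite different wording
print(attribute_edit_distance(("small", "red", "wooden"),
                              ("small", "blue", "wooden")))  # 1
```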
Of the results charted above, the authors comment:
‘All of the CETs continue to generate compositional or semantically distant variants of the target despite the erasure, which ideally should not occur. It is evident that UCE consistently achieves higher accuracy than other CET methods on the [preserve set], indicating minimal unintended impact on semantically related concepts.
‘In contrast, SPM achieves the lowest accuracy, suggesting that its editing strategy is more susceptible to concept similarity.’
Among the four methods tested, RECE was most effective at blocking the target concept. However, as shown in the left side of the image above, all methods failed to suppress compositional variants. After erasing bird, the model still produced images of a red bird, suggesting the concept remained partially intact.
Removing blue couch also prevented the model from generating a blue chair, indicating harm to nearby concepts.
RECE handled compositional variants better than the others, while UCE did a better job of preserving related concepts.
Erasure Evasion
The erasure evasion test evaluated whether models could still generate subclass concepts after their superclass had been erased. For example, if vehicle was removed, the test checked whether the model could still produce outputs such as bicycle or red car.
Prompts targeted both direct sub-classes and compositional variants to determine whether the concept erasure operation had truly removed the full hierarchy or could be bypassed through more specific descriptions:

On Stable Diffusion v1.4, circumvention of erased superclasses through their sub-classes and compositional variants, with higher accuracy indicating greater evasion.
The unedited model retained high accuracy across all superclasses, confirming that it had not removed any target concepts. Among the CETs, MACE showed the least evasion, achieving the lowest subclass accuracy in more than half of the tested categories. RECE also performed well, particularly in the accessory, sports, and electronic groups.
By contrast, UCE and SPM showed higher subclass accuracy, indicating that erased concepts were more easily bypassed through related or nested prompts.
The authors note:
‘[All] CETs successfully suppress the target superclass concept (“food”). However, when prompted with attribute-based children of the food hierarchy (e.g., “a large pizza”), all methods generate food items.
‘Similarly in vehicle category, all models generate bicycles, despite erasing “vehicle”.’
Attribute Leakage
The third test, attribute leakage, checked whether traits linked to an erased concept appeared in other parts of the image.
For example, after erasing couch, the model should neither generate a couch nor apply its typical attributes (such as color or material) to unrelated objects in the same prompt. This was measured by prompting the model with paired objects and examining whether the erased attributes mistakenly appeared in preserved concepts:

Attention maps for attribute tokens after concept erasure. Left: When ‘bench’ is erased, the token ‘wooden’ shifts to the bird instead, resulting in wooden birds. Right: Erasing ‘couch’ fails to suppress couch generation, while the token ‘large’ is wrongly assigned to the doughnut.
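A leakage probe of this kind might be scripted roughly as below, reusing the VQA helper sketched earlier. The object pairings and questions echo the figure above, but are illustrative rather than the paper’s exact test set:

```python
# Illustrative attribute-leakage probe: prompt two objects together, one of which
# has been erased, then ask whether the erased object's attribute has migrated to
# its partner. `ask(image, question) -> bool` is the VQA helper sketched earlier.

leakage_probes = [
    {   # 'bench' erased -- did 'wooden' migrate to the bird?
        "prompt": "a wooden bench next to a bird",
        "erased_question": "Is there a bench in the image?",  # ideally 'no'
        "leak_question":   "Is the bird wooden?",             # should also be 'no'
    },
    {   # 'couch' erased -- did 'large' migrate to the doughnut?
        "prompt": "a large couch beside a doughnut",
        "erased_question": "Is there a couch in the image?",
        "leak_question":   "Is the doughnut unusually large?",
    },
]

def check_leakage(pipe, probes, ask):
    """Generate each paired prompt and record whether the erased object survives
    and whether its attribute leaks onto the preserved object."""
    results = []
    for probe in probes:
        image = pipe(probe["prompt"]).images[0]
        results.append({
            "prompt": probe["prompt"],
            "erased_still_present": ask(image, probe["erased_question"]),
            "attribute_leaked":     ask(image, probe["leak_question"]),
        })
    return results
```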
RECE was the most effective at erasing target attributes, but also introduced the most attribute leakage into preserved prompts, surpassing even the unedited model. UCE leaked less than other methods.
The results, the authors suggest, point to an inherent tradeoff, with stronger erasure raising the risk of misdirected attribute transfer.
Conclusion
The latent space of a model does not fill up in an orderly way during training, with derived concepts deposited neatly onto shelves or into filing cabinets; rather, the trained embeddings are both the content and their containers, separated by no stark boundaries, blending into each other in a way that makes removal problematic – like trying to extract a pound of flesh without any loss of blood.
In intelligent and evolving systems, foundational events – such as burning one’s fingers and thereafter treating fire with respect – are bound up with the behaviors and associations they later form, making it challenging to produce a model that retains the corollaries of a central, potentially ‘banned’ concept while lacking the concept itself.
* My conversion of the authors’ inline citation to hyperlinks.
First published Friday, August 22, 2025










