Anderson's Angle
IP-Washing Methods in AI

If a legal reckoning over the use of intellectual property in training AI is still to come, several methods of obscuring such usage are already in place.
Opinion: The current, rapidly advancing revolution in generative AI is unfolding in the most legally precarious environment that has accompanied any transformative technological development since the nineteenth century.
Until 3-4 years ago, the machine learning research community enjoyed a tacit (often explicit) remit to exploit IP-protected material in the course of developing new systems; since these systems were not yet successful, in terms of being mature or commercially viable, the outcomes were, in every sense, academic.
Then the sudden success of a new generation of Large Language Models (LLMs, such as ChatGPT and Claude) and diffusion-based generative video systems (such as Sora) signaled that these abstract and hitherto ‘harmless’ strands of research had developed into commercial viability, and outgrown their ‘free pass’, as far as the exploitation of other people’s intellectual property was concerned.
From now on, rights-holders would seek a stake in the fruits of AI systems trained largely or in part on their copyrighted or otherwise protected data, leading to an ongoing avalanche of legal cases that requires some effort to even keep track of.

New cases emerge at a frenetic pace in the United States and beyond; the tracker shown here is limited to cases brought in the US. Source
Mandating a ‘Free Lunch’
The financial commitment currently occurring in regard to AI-serving infrastructure has been posited by some voices as an effort to entrench ‘copyright-hazardous’ AI so deeply in the economics of society that it becomes not only ‘too big to fail’, but also ‘too powerful to sue’ – or at least too powerful for successful lawsuits to be allowed to upend the revolution.
In line with this general sentiment, the current president of the United States is committing to policy his view that ‘You can’t be expected to have a successful AI program when every single article, book, or anything else that you’ve read or studied, you’re supposed to pay for’.
Really? Nothing remotely comparable has occurred in the western industrial era, and the movement grates severely against the traditional US culture of litigation and reparation; perhaps the nearest analogous positions are the mandatory expiration of pharmaceutical patents after 20 years (itself frequently under attack), and the limitation on expectations of privacy in public places.
However, times change; in the absence of any guarantee that the current trend towards ‘eminent domain’ over IP protections won’t falter, or else be reversed later, several secondary approaches are becoming standard practice in the development of AI systems, and in the treatment of the much-contested training data that powers them.
Datasets-by-Proxy
One of these approaches closely resembles the (not always successful) defense offered by torrent-listing sites: that they don’t actually host any contested material – or any material at all.
Just as torrents are only signposts to where IP-protected material can be found, a number of highly influential datasets are in themselves only ‘pointer’-style lists of extant data; if the end-user wishes to use these lists as a download-list for their own dataset, that’s on them, as far as the curators’ liability would seem to be concerned.
Besides obviating the need to store and serve large amounts of minimally-compressible image or video data, collections of this kind allow for rapid updating – such as the removal of material at copyright holders’ requests – and versioning.
Among these is Google Research’s Conceptual 12M dataset, which provides captions for images, but only points to locations on the web where these images exist (or existed at the time of curation):

Two examples from Google Research’s Conceptual 12M curation. Source
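In practical terms, the curator ships only text; it is the end-user who turns the list into a local image collection. The sketch below is purely illustrative – it is not official tooling for any particular release, and it assumes a simple two-column TSV of image URL and caption (the actual column order and format should be checked against the release in question):

```python
# Minimal sketch: materializing a 'pointer' dataset locally.
# Assumes a two-column TSV of image URL and caption, broadly in the style of
# Conceptual 12M; check the actual release for the exact column order.

import csv
import os
import requests

def fetch_pointer_dataset(tsv_path: str, out_dir: str, limit: int = 100) -> None:
    """Download up to `limit` images referenced by a URL/caption TSV."""
    os.makedirs(out_dir, exist_ok=True)
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        for i, row in enumerate(reader):
            if i >= limit:
                break
            url, caption = row[0], row[1]
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
            except requests.RequestException:
                continue  # dead link: the 'pointer' no longer resolves
            with open(os.path.join(out_dir, f"{i:08d}.jpg"), "wb") as img:
                img.write(resp.content)
            with open(os.path.join(out_dir, f"{i:08d}.txt"), "w", encoding="utf-8") as cap:
                cap.write(caption)

# fetch_pointer_dataset("cc12m.tsv", "./cc12m_images", limit=100)
```

Dead links are simply skipped, which is also why such pointer datasets tend to decay over time, and why the curator can claim never to have copied the contested images at all.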
Another prominent example, and one which now has a valid claim to reverence in the history of AI, is the LAION dataset that facilitated the advent of the Stable Diffusion generative system in 2022 – the first such framework to offer powerful open source image generation to end-users, just as proprietary systems seemed set to establish such services as a purely ring-fenced, commercial domain:

One of the many variants of the LAION project, featuring modern and copyrighted artworks. Source
In many cases the high file sizes of these ‘pointer’ collections might suggest that image content is included in the downloadable, hosted file; in fact, the non-trivial download sizes are usually due to the high volume of text content, and sometimes to the inclusion of extracted embeddings or features – derived numerical summaries of the source data, computed during curation or training.
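As an illustration of how such a collection can grow large without shipping a single pixel, the hedged sketch below uses the openly available CLIP model (via the Hugging Face transformers library; the model name and file paths are illustrative assumptions) to compute a 512-dimensional embedding per image – a couple of kilobytes per entry that can then be distributed alongside the URL and caption:

```python
# Illustrative only: compute a CLIP image embedding that a 'pointer' dataset
# might distribute in place of the image itself. Model name and file paths
# are assumptions for the sake of the example.

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("downloaded_sample.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    embedding = model.get_image_features(**inputs)  # shape: (1, 512)

# Roughly 2 KB per image as float32 -- bulky in aggregate, but not the image.
np.save("sample_embedding.npy", embedding.squeeze(0).numpy())
```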
The Video Premium
Video datasets present an even stronger case for the ‘dataset-by-proxy’ or pointer approach, since the volume of storage required to aggregate a meaningful and useful number of videos into a single downloadable collection is prohibitive, making a ‘distributed’ method desirable.
However, in both cases – but particularly with video – the downloadable source URLs represent data that will need significant further attention before being used in training. Both images and videos will need to be resized, or else have cropping decisions made, in order to create samples that will fit into available GPU memory. Even heavily downsampled videos will also require cutting to very short lengths, typically 3-5 seconds.
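A typical reduction might look like the following purely illustrative sketch, which trims a downloaded clip to a few seconds and center-crops it to a small square; the file names, duration and resolution are assumptions, and the ffmpeg binary is assumed to be installed:

```python
# Illustrative preprocessing of a downloaded clip: trim to ~4 seconds and
# centre-crop to 256x256, roughly the kind of reduction needed to fit video
# samples into GPU memory. Requires the ffmpeg binary on the system PATH.

import subprocess

def prepare_clip(src: str, dst: str, start: float = 0.0, duration: float = 4.0,
                 size: int = 256) -> None:
    vf = (
        f"scale={size}:{size}:force_original_aspect_ratio=increase,"
        f"crop={size}:{size}"
    )
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(start),      # seek to the desired segment
            "-t", str(duration),    # keep only a few seconds
            "-i", src,
            "-vf", vf,              # downscale, then centre-crop
            "-an",                  # drop audio, rarely used in video training
            dst,
        ],
        check=True,
    )

# prepare_clip("raw/clip_000123.mp4", "processed/clip_000123.mp4")
```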
Notable video datasets that use references to online videos (rather than the curation and direct packaging of video) include Google’s Kinetics Human Action Video Dataset, and the search giant’s YouTube-8M collection, which uses segment annotation to indicate how to treat each video once downloaded – but which once again leaves the end-user to obtain the videos from the supplied URLs.
Closed and Open
Finally, in this category, ‘open’ VFX data may be generated with closed platforms, with the resulting dataset subsequently published and made freely available. It is reasonable to wonder why this happens, and to consider whether it may be because the originating company wishes to sanitize an IP-unfriendly upstream model for its own use, or else because a ‘washed’ set was commissioned from outside.
One such case of ‘generational washing’ is, arguably, the Omni-VFX dataset, which incorporates many data points from the Open-VFX dataset (which in itself references many closed and semi-closed platforms, such as Pika and PixVerse).
To be honest, Omni-VFX is not even really trying:

In the open source Omni-VFX dataset, a familiar face. Source
Ancestral Liability
The second major approach to IP-washing is the use of copyrighted material at one or many removes. One of the methods in this category is the use of synthetic data produced by models that were trained, at some point upstream, on copyrighted data. In such cases, most particularly where the synthetic data achieves authentic-looking results, the copyrighted work supplies transformations that could not reasonably be guessed or approximated by general world models, or by non-specialized models.
This is emphatically the case where generative video systems are required to produce ‘impossible’ events, or events that fall broadly into the category of ‘visual effects’ (VFX).
In fact, what brought this topic to mind was the latest in a series of research papers offering the ability to ‘abstract’ diverse types of visual effect, such as producing laser beams from improbable parts of the body, by training on custom-commissioned or ‘open source’ VFX clips (rather than the more obvious source: the very expensive VFX shots found in output from the Marvel Cinematic Universe):
Examples from the EffectMaker website, wherein the ‘action’ in the source clip (far left) is applied to a source image (center). Source
The above examples come from the project page for EffectMaker. It is not even the first offering this year to seek to extract VFX dynamics from one video clip and transpose them into a novel clip; in fact this is turning into a discrete sub-task in AI VFX research*.
Aware that media behemoths such as Marvel have a higher-than-average chance of winning legal cases over IP (even in the aforementioned climate of ‘enforced tolerance’), visual effects companies and startups are currently going to notable lengths to ensure that their generative VFX frameworks are free of other companies’ corporate IP.
Foremost of these is Meta, which has been reported on the r/vfx subreddit to have gone on a well-compensated winter hiring spree into 2026, offering VFX artists work training AI models to output Hollywood-level visual effects shots. Though the pay was unspecified across various posts, one described it as ‘retirement money’.
Follow the Money
However, one has to wonder how much money even the likes of Meta are willing to pay for a true diversity and abundance of ad hoc VFX shots, given that the average single VFX shot for a blockbuster movie costs around $42,000 – and many come in far higher.
Further, it stands to reason that bespoke VFX-generating AI models will cater to popular demand, including the various standard effects tropes from the most popular and expensive categories of movies.
Aside from the possibility that ‘remnant’ VFX professionals could end up recreating shots that they worked on for an existing movie catalog† – which in itself contextualizes the ‘custom’ dataset work as imitative – there is in any case no guarantee that these costly new samples will be trained ‘from zero’ into a brand-new architecture.
Indeed, if such recreations are diverted into adjunct modules like LoRAs, which rely on a base model, then the process is only as defensible as the base model is ‘IP-clean’ – and not many are.
Similarly, if the ‘new’ process uses other ‘hybrid’ techniques such as fine-tuning, where the value of the visual effect relies on weights, priors, or embeddings from older collections or models of unsubstantiated integrity, the originality of the work is arguably cosmetic, and subject to challenge.
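To make the dependency concrete, the toy sketch below (plain PyTorch, with all names and dimensions invented for illustration) shows the structure of a LoRA-style adapter: only the small low-rank matrices are trained on the new ‘clean’ footage, but every forward pass still runs through the frozen base weights – and with them, whatever data the base model was originally trained on:

```python
# Toy LoRA-style adapter in plain PyTorch: the low-rank update (A, B) is the
# only part trained on the 'clean' custom data, but every forward pass still
# runs through the frozen base weight -- so the base model's original
# training data remains in play.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus trainable low-rank correction.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))   # output still depends on the frozen base
```

The same logic extends to fine-tuning: the starting checkpoint, and not only the new footage, determines what the resulting model ultimately ‘knows’.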
Impossible Missions
The domain of VFX output is a particularly interesting case-study in regard to potential IP-washing in AI datasets, since visual effects shots often depict ‘impossible’ things for which there will be no open source alternatives available.
For instance, while the demolition of a building could be trained into a generative model from various public domain or otherwise affordable stock clips, if you want to train a model to produce human laser beams, you’re going to need to train on VFX clips, stolen or commissioned; things like that don’t happen anywhere else.
Even in the case of other types of natural disaster, such as dramatic flooding, available real-world source material is unlikely to be able to reproduce dramatic POVs on calamitous events, because (with some exceptions) people don’t usually live-stream from catastrophic locations. Therefore ‘cool views’ on disaster are rare in real-world datasets, and any AI model that can generate them likely got the information elsewhere.
Most desirable AI task-flows do not have this telling level of specificity, and in such cases the obfuscation of the benefits of IP-protected data might not require nearly as much effort.
Conclusion: Entangled Web
Only those who have used generative AI extensively and over a sustained period will instinctively understand that such systems struggle to combine multiple concepts when no comparable examples exist in their training data.
This limitation is known as entanglement, wherein the various facets of trained concepts tend to cluster together with related elements, rather than decompose into handy, Lego-style building bricks that can be arranged into any new configuration the user might desire.
Entanglement is an architectural gravity-well that is pretty much impossible to escape, at least for the diffusion-based approaches that characterize the major current generative image and video frameworks. However, it may be that new approaches emerge over the next few years that are better at discretizing trained concepts, so that they can be glued together more adroitly, and offer fewer indications as to their provenance.
* I make no accusations against EffectMaker, but comment here on the generality of an emerging practice in AI video research.
† Because these shots, in these types of movies, have generated and continue to generate money.
First published Monday, March 16, 2026