The Plagiarism Problem: How Generative AI Models Reproduce Copyrighted Content





The rapid advances in generative AI have sparked excitement about the technology's creative potential. Yet these powerful models also pose concerning risks around reproducing copyrighted or plagiarized content without proper attribution.

How Neural Networks Absorb Training Data

Modern AI systems like GPT-3 are trained through self-supervised pre-training: they learn to predict the next token across massive datasets scraped from public sources such as websites, books, and academic papers. For example, the largest component of GPT-3's training data, a filtered version of Common Crawl, amounted to roughly 570 gigabytes of text. During training, the model picks up patterns and statistical relationships in this vast pool of data, learning the correlations between words, sentences, paragraphs, language structure, and other features.

This enables the AI to generate new coherent text or images by predicting sequences likely to follow a given input or prompt. But it also means these models absorb content without regard for copyrights, attribution, or plagiarism risks. As a result, generative AIs can unintentionally reproduce verbatim passages or paraphrase copyrighted text from their training corpora.
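To make the mechanism concrete, here is a minimal sketch of next-word prediction using a toy bigram model. It is an illustration only and not how GPT-3 or any production model works: the function names `train_bigram` and `generate` and the one-sentence corpus (a public-domain line from Moby-Dick, lowercased) are assumptions chosen for the example. Because the "model" has seen only one sentence, always picking the most likely next word regurgitates that sentence verbatim.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, prompt, max_words=20):
    """Greedy decoding: repeatedly append the most frequent next word."""
    output = prompt.split()
    for _ in range(max_words):
        candidates = follows.get(output[-1])  # .get avoids adding new keys
        if not candidates:
            break  # the last word was never followed by anything in training
        output.append(candidates.most_common(1)[0][0])
    return " ".join(output)

# Hypothetical one-sentence "training corpus" (public-domain text).
corpus = "call me ishmael some years ago never mind how long precisely"
model = train_bigram(corpus)

# A one-word prompt is enough to recover the whole training sentence.
print(generate(model, "call"))
```

Real models are vastly larger and see far more varied data, so most outputs are not copies. The sketch only shows why text that is rare or heavily repeated in training can resurface word for word.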

Key Examples of AI Plagiarism

Concerns about AI plagiarism have grown steadily since 2020, following the release of GPT-3.

Recent research has shown that large language models (LLMs) like GPT-3 can reproduce substantial verbatim passages from their training data without citation (Nasr et al., 2023; Carlini et al., 2022). For example, the lawsuit filed by The New York Times against OpenAI included examples of OpenAI software generating New York Times articles nearly verbatim (The New York Times, 2023).

These findings suggest that some generative AI systems may produce plagiarized output even when users do not ask for it, risking copyright infringement. However, the prevalence remains uncertain due to the “black box” nature of LLMs. The New York Times lawsuit argues that such outputs constitute infringement, which could have major implications for generative AI development. Overall, the evidence indicates that plagiarism is an inherent issue in large neural network models, one that requires vigilance and safeguards.

These cases reveal two key factors influencing AI plagiarism risks:

  1. Model size – Larger models like GPT-3.5 are more prone to regenerating verbatim text passages compared to smaller models. Their bigger training datasets increase exposure to copyrighted source material.
  2. Training data – Models trained on scraped internet data or copyrighted works (even if licensed) are more likely to plagiarize compared to models trained on carefully curated datasets.

However, directly measuring the prevalence of plagiaristic outputs is challenging. The “black box” nature of neural networks makes it difficult to fully trace the link between training data and model outputs. Rates likely depend heavily on model architecture, dataset quality, and prompt formulation. But these cases confirm that AI plagiarism does occur, which has critical legal and ethical implications.
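When a candidate source text is available, one simple proxy for verbatim reproduction is word n-gram overlap. The sketch below is an assumption-laden illustration, not the method used in the cited studies: the helper names `ngrams` and `verbatim_overlap` and the default window of five words are hypothetical choices. It reports the fraction of five-word sequences in a model output that also appear in a reference text.

```python
def ngrams(text, n):
    """Return the set of word n-grams in a text (case-insensitive)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(output, reference, n=5):
    """Fraction of the output's n-grams that also occur in the reference."""
    output_grams = ngrams(output, n)
    if not output_grams:
        return 0.0  # output is shorter than n words
    shared = output_grams & ngrams(reference, n)
    return len(shared) / len(output_grams)

# Hypothetical texts for illustration.
reference = "the quick brown fox jumps over the lazy dog near the river"
copied = "The quick brown fox jumps over the lazy dog"
original = "a slow green turtle walks under a busy bridge"

print(verbatim_overlap(copied, reference))    # every 5-gram is shared
print(verbatim_overlap(original, reference))  # no 5-gram is shared
```

A score like this only works against texts you already suspect. It says nothing about sources you have not checked, which is part of why prevalence is so hard to estimate.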

Emerging Plagiarism Detection Systems

In response, researchers have started exploring AI systems that automatically distinguish model-generated text and images from human-created ones. For example, researchers at Mila proposed GenFace, which analyzes linguistic patterns indicative of AI-written text. The startup Anthropic has also reportedly developed internal plagiarism detection capabilities for its conversational AI, Claude.

However, these tools have limitations. The massive training data of models like GPT-3 makes pinpointing the original sources of plagiarized text difficult, if not impossible. More robust techniques will be needed as generative models continue to evolve rapidly. Until then, manual review remains essential to screen potentially plagiarized or infringing AI outputs before public use.

Best Practices to Mitigate Generative AI Plagiarism

Here are some best practices both AI developers and users can adopt to minimize plagiarism risks:

For AI developers:

  • Carefully vet training data sources to exclude copyrighted or licensed material that is used without proper permission.
  • Develop rigorous data documentation and provenance tracking procedures. Record metadata like licenses, tags, creators, etc.
  • Implement plagiarism detection tools to flag high-risk content before release.
  • Provide transparency reports detailing training data sources, licensing, and origins of AI outputs when concerns arise.
  • Allow content creators to opt out of training datasets easily. Comply quickly with takedown or exclusion requests.
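As a rough illustration of the provenance and opt-out practices above, the sketch below attaches metadata to each training document and filters a corpus before training. Everything in it is hypothetical: the `SourceRecord` fields, the `ALLOWED_LICENSES` set, and the `filter_corpus` function are assumptions made for the example, not an established schema or any vendor's pipeline.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceRecord:
    """Provenance metadata kept alongside each training document."""
    url: str
    creator: str
    license: str
    tags: tuple = ()

# Hypothetical allow-list; a real policy would be set with legal review.
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "licensed-by-agreement"}

def filter_corpus(records, opted_out_creators):
    """Split records into those kept for training and those excluded."""
    kept, excluded = [], []
    for record in records:
        if (record.creator in opted_out_creators
                or record.license not in ALLOWED_LICENSES):
            excluded.append(record)
        else:
            kept.append(record)
    return kept, excluded

# Hypothetical records for illustration.
records = [
    SourceRecord("https://example.com/a", "alice", "CC-BY-4.0", ("essay",)),
    SourceRecord("https://example.com/b", "bob", "all-rights-reserved"),
    SourceRecord("https://example.com/c", "carol", "CC0-1.0"),
]
kept, excluded = filter_corpus(records, opted_out_creators={"carol"})
print([r.creator for r in kept])      # only alice remains
print([r.creator for r in excluded])  # bob (license) and carol (opt-out)
```

Keeping the excluded records, rather than silently dropping them, leaves an audit trail that can feed the transparency reports mentioned above.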

For generative AI users:

  • Thoroughly screen outputs for any potentially plagiarized or unattributed passages before deploying at scale.
  • Avoid treating AI as a fully autonomous creative system. Have human reviewers examine final content.
  • Favor AI-assisted human creation over generating entirely new content from scratch. Use models for paraphrasing or ideation instead.
  • Consult the AI provider's terms of service, content policies, and plagiarism safeguards before use. Avoid opaque models.
  • Cite sources clearly if any copyrighted material appears in the final output despite best efforts. Don't present AI work as entirely original.
  • Limit sharing of outputs to private or confidential channels until plagiarism risks can be further assessed and addressed.
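For the screening step in the list above, a user with a set of known reference texts could flag long shared passages for human review. The following is a minimal sketch using Python's standard `difflib` module. The function names `longest_shared_passage` and `needs_review` and the eight-word threshold are assumptions for illustration, and a check like this can only catch copying from texts you supply.

```python
from difflib import SequenceMatcher

def longest_shared_passage(output, reference):
    """Return the longest run of consecutive words found in both texts."""
    a, b = output.split(), reference.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])

def needs_review(output, references, min_words=8):
    """List (source name, passage) pairs long enough to warrant review."""
    flagged = []
    for name, text in references.items():
        passage = longest_shared_passage(output, text)
        if len(passage.split()) >= min_words:
            flagged.append((name, passage))
    return flagged

# Hypothetical reference set (public-domain text) and model output.
references = {
    "pride_and_prejudice": (
        "it is a truth universally acknowledged that a single man in "
        "possession of a good fortune must be in want of a wife"
    ),
}
output = (
    "our summary notes that a single man in possession of a good "
    "fortune must be in want of advice"
)
for name, passage in needs_review(output, references):
    print(f"{name}: {passage}")
```

The threshold is a judgment call: a lower value catches shorter borrowings but flags more common phrases, so flagged passages are a prompt for human review rather than a verdict.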

Stricter training data regulations may also be warranted as generative models continue to proliferate. This could involve requiring opt-in consent from creators before their work is added to datasets. Regulation aside, the onus lies on both developers and users to employ ethical AI practices that respect content creators' rights.

Plagiarism in Midjourney's V6 Alpha

After only limited prompting of Midjourney's V6 model, some researchers were able to generate images nearly identical to frames from copyrighted films, TV shows, and video games that were likely included in its training data.

Images Created by Midjourney Resembling Scenes from Famous Movies and Video Games

These experiments further confirm that even state-of-the-art visual AI systems can unknowingly plagiarize protected content if the sourcing of training data remains unchecked. This underscores the need for vigilance, safeguards, and human oversight when deploying generative models commercially to limit infringement risks.

AI Companies' Response on Copyrighted Content

The lines between human and AI creativity are blurring, creating complex copyright questions. Works blending human and AI input may only be copyrightable in aspects executed solely by the human.

The US Copyright Office recently denied copyright to most aspects of an AI-human graphic novel, deeming the AI-generated art not to be the product of human authorship. It also issued guidance excluding AI systems from “authorship”. Federal courts have affirmed this stance in an AI art copyright case.

Meanwhile, lawsuits such as Getty v. Stability AI and artists v. Midjourney/Stability AI allege infringement by generative AI. But without AI “authors”, some question whether infringement claims apply.

In response, major AI firms like Meta, Google, Microsoft, and Apple have argued that they should not need licenses or pay royalties to train AI models on copyrighted data.

The key arguments these companies have made in response to potential new US copyright rules around AI are summarized below:

Meta argues imposing licensing now would cause chaos and provide little benefit to copyright holders.

Google claims AI training is analogous to non-infringing acts like reading a book (Google, 2022).

Microsoft warns changing copyright law could disadvantage small AI developers.

Apple wants AI-generated code to be copyrightable when human developers control its creation.

Overall, most companies oppose new licensing mandates and downplay concerns about AI systems reproducing protected works without attribution. However, this stance is contentious given recent AI copyright lawsuits and debates.

Pathways For Responsible Generative AI Innovation

As these powerful generative models continue to advance, reducing plagiarism risks is critical for mainstream acceptance. A multi-pronged approach is required:

  • Policy reforms around training data transparency, licensing, and creator consent.
  • Stronger plagiarism detection technologies and internal governance by developers.
  • Greater user awareness of risks and adherence to ethical AI principles.
  • Clear legal precedents and case law around AI copyright issues.

With the right safeguards, AI-assisted creation can flourish ethically. But unchecked plagiarism risks could significantly undermine public trust. Directly addressing this problem is key to realizing generative AI's immense creative potential while respecting creator rights. Achieving the right balance will require actively confronting the plagiarism blind spot built into the very nature of neural networks. Doing so will help ensure these powerful models don't undermine the very human ingenuity they aim to augment.

I have spent the past five years immersing myself in the fascinating world of Machine Learning and Deep Learning. My passion and expertise have led me to contribute to over 50 diverse software engineering projects, with a particular focus on AI/ML. My ongoing curiosity has also drawn me toward Natural Language Processing, a field I am eager to explore further.