Google Reveals Use of Public Web Data in AI Training - Unite.AI

Google Reveals Use of Public Web Data in AI Training

In a recent update to its privacy policy, Google has openly acknowledged that it uses publicly available information from the web to train its AI models. The change, spotted by Gizmodo, names services such as Bard and Cloud AI. Google spokesperson Christa Muldoon told The Verge that the update merely clarifies that newer services like Bard are also covered by this practice, and that Google incorporates privacy principles and safeguards into the development of its AI technologies.

Transparency in AI training practices is a step in the right direction, but it also raises a host of questions. How does Google ensure the privacy of individuals when using publicly available data? What measures are in place to prevent the misuse of this data?

The Implications of Google's AI Training Methods

The updated privacy policy now states that Google uses information to improve its services and to develop new products, features, and technologies that benefit its users and the public. The policy also specifies that the company may use publicly available information to train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.

However, the policy does not clarify how Google will prevent copyrighted materials from being included in the data pool used for training. Many publicly accessible websites have policies that prohibit data collection or web scraping for the purpose of training large language models and other AI toolsets. Google's approach could also conflict with data protection regulations such as the EU's GDPR, which protect people against their data being misused without their express permission.

The use of publicly available data for AI training is not inherently problematic, but it becomes so when it infringes on copyright laws and individual privacy. It's a delicate balance that companies like Google must navigate carefully.

The Broader Impact of AI Training Practices

The use of publicly available data for AI training has been a contentious issue. The developers of popular generative AI systems, such as OpenAI with GPT-4, have been reticent about their data sources and about whether those sources include social media posts or copyrighted works by human artists and authors. This practice currently sits in a legal gray area, sparking various lawsuits and prompting lawmakers in some nations to introduce stricter laws to regulate how AI companies collect and use their training data.

The largest newspaper publisher in the United States, Gannett, is suing Google and its parent company, Alphabet, claiming that advancements in AI technology have helped the search giant to hold a monopoly over the digital ad market. Meanwhile, social platforms like Twitter and Reddit have taken measures to prevent other companies from freely harvesting their data, leading to backlash from their respective communities.

These developments underscore the need for robust ethical guidelines in AI. As AI continues to evolve, it's crucial for companies to balance technological advancement with ethical considerations. This includes respecting copyright laws, protecting individual privacy, and ensuring that AI benefits all of society, not just a select few.

Google's recent update to its privacy policy has shed light on the company's AI training practices. However, it also raises questions about the ethical implications of using publicly available data for AI training, the potential infringement of copyright laws, and the impact on user privacy. As we move forward, it's essential for us to continue this conversation and work towards a future where AI is developed and used responsibly.

Alex McFarland is an AI journalist and writer exploring the latest developments in artificial intelligence. He has collaborated with numerous AI startups and publications worldwide.