China’s AI Mirage: How “Open Source” Hides What Matters Most

With Big Tech players like Google, Microsoft, and Meta vying to dominate the AI market, China's High-Flyer, Baidu, Moonshot, and Alibaba have made headlines for releasing their DeepSeek, ERNIE 4.5, Kimi K2, and Qwen3 large language models, respectively, as open source. This shift away from guarded, proprietary GenAI models has been received as a sign that China's AI industry is embracing the power of open source to democratize AI development and spur innovation.
Like many players who tout their offerings as open source, and even put the term in their company names, however, High-Flyer, Baidu, Moonshot, and Alibaba have not actually shared critical pieces like the datasets at the heart of their models. As these large models seek to become commodities developers rely on, the transparency of true open source that can be tested, investigated, and iterated upon is critical to creating unbiased, ethical, and beneficial technology we can all trust. All of these "open source" models are actually "open weight": they can be downloaded and used, but without the data they cannot be inspected in any meaningful way.
As U.S. players like OpenAI and Meta seem to be backing away from open source, Baidu's open invitation to leverage its freely available suite of ERNIE 4.5 models can indeed spur innovation and collaboration with developers looking to create smaller, powerful applications. At the same time, the company, which is akin to China's Google, has given itself a competitive edge by encouraging adoption and entrenching its models in the burgeoning AI ecosystem.
The same can be said for DeepSeek, the low-priced Kimi K2, and the updated Qwen3, which boasts benchmarks that challenge closed models like Claude Opus 4 and GPT-4o-0327.
These AI players have positioned themselves well in the race to become the commodity model of choice, and Qwen3's latest update was even inspired by open source community feedback.
By withholding the data and other critical pieces of their AI systems, however, these companies are asking global developers to put blind faith in models they cannot truly understand or investigate.
Staking claim on the future with open source commodity AI models
When the iPhone burst onto the market in 2007, some assumed Apple would rule the smartphone game with iOS. Instead, it was Android, a startup acquired by Google in 2005, that rode open source participation, a force that is integral for startups and spurs entrepreneurial and economic growth worldwide, to victory.
By releasing open source software that could be viewed, modified, adopted, and shared, Android invited academics, developers, and even competitors to collaborate on the software. This accelerated the innovation process, democratized the playing field, and, ultimately, drove down prices. Android hit the market a year after the first iPhone and, by the start of this year, boasted 71.88 percent of the global market to iOS's 27.65 percent.
In a technological revolution that seemed to happen overnight, smartphones became ubiquitous and even as software, hardware, and user interface improvements continue, the industry has grown far past trying to revolutionize the way smartphones work. With cell phones now a commodity, the innovation at hand today is in the apps that run on them, and to be contenders, smartphone providers must maintain an ecosystem that invites in developers.
Not three years after the launch of ChatGPT, the AI industry finds itself on a similar precipice. Every player in the global AI industry is angling for its models to become the next Android or even iOS, and by going open source with the DeepSeek, ERNIE 4.5, Kimi K2, and Qwen3 models, Chinese innovators are looking to stake their claim on a budding ecosystem.
While this could work in their favor, it does not foster the true transparency of open source that has been essential not only to breeding innovation, but to breeding innovation we can trust.
Data is the missing piece in most open source AI
With AI models far more complicated to create and share than traditional software, the call for fully open source AI is no small order. Instead of just source code, AI systems are composed of seven components: the source code, model parameters, dataset, hyperparameters, training source code, random number generation, and software frameworks.
Each piece must work in concert for a model to deliver the desired results, which means developers need full visibility to share, modify, and adopt a system and understand what's happening. Reproducibility is the foundation of the scientific method, yet the AI industry has a habit of using the term open source for free or low-priced releases that expose only a few pieces of the puzzle.
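The seven-component checklist above can be made concrete with a short sketch. The release profile below is an illustrative assumption about a typical "open weight" drop, not an audit of any actual model:

```python
from dataclasses import dataclass, fields

# The seven components of an AI system named above, as a simple checklist.
@dataclass
class AIRelease:
    source_code: bool
    model_parameters: bool
    dataset: bool
    hyperparameters: bool
    training_source_code: bool
    random_number_generation: bool
    software_frameworks: bool

def missing_pieces(release: AIRelease) -> list[str]:
    """Return the components a release withholds."""
    return [f.name for f in fields(release) if not getattr(release, f.name)]

# Hypothetical "open weight" release: weights, inference code, and tooling
# are public, but the pieces needed to reproduce training are not.
open_weight = AIRelease(
    source_code=True,
    model_parameters=True,
    dataset=False,
    hyperparameters=True,
    training_source_code=False,
    random_number_generation=False,
    software_frameworks=True,
)

print(missing_pieces(open_weight))
# A truly open source release would leave this list empty.
```

The point of the sketch is that "open weight" and "open source" differ precisely in which of these boxes are checked; the withheld items are exactly the ones a developer would need to reproduce and interrogate the model.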
Baidu, for example, made ten ERNIE 4.5 models freely available. Along with sharing the models and parameters, the company also open sourced the ERNIEKit and FastDeploy toolkits, which enable developers to build powerful AI applications by providing industrial-grade capabilities, resource-efficient training and inference workflows, and multi-hardware compatibility.
In other words, Baidu has provided developers with exciting tools that empower them to unleash innovation faster, which the company hopes will, in turn, entice them to choose ERNIE 4.5 over the competition.
Developers who leverage ERNIE 4.5, however, are being asked to blindly trust the model, because Baidu has kept much hidden, including the datasets that inform and teach its models.
The power of transparent open source AI models
Each piece of the AI puzzle is critical to making a model work, yet an estimated 80 percent of AI projects fail, and data is at the heart of the problem. Inaccurate, incomplete, and biased datasets lead to models that don't behave predictably or as desired.
The recently released video of a fatal 2023 Tesla Full Self-Driving (FSD) crash, for example, exposed the worst-case scenario of what can happen when a dataset and model fall short. As the Tesla Model Y sped into a bright, setting sun, the partially automated system could not understand or react appropriately to what its cameras were seeing, or not seeing. While cars driven by humans slowed and pulled over, the FSD's confusion resulted in a woman's death.
This devastating failure reflected incomplete visual data, as well as the lack of a safety mechanism that accounted for such blind spots. When developers have no view into their data, they can’t see how it’s interacting with the model, which means they can’t uncover such mistakes and iterate for robust performance.
Even more concerning, without the data that fuels the model, they are forced to trust it blindly.
When datasets are open source, however, the AI community has proven it will root out troubling issues, as it did by uncovering over 1,000 URLs containing verified child sexual abuse material in LAION-5B. Because that text-to-image training dataset was foundational to apps like Stable Diffusion and Midjourney, it would have been devastating for the AI industry if users had started producing illicit photorealistic images. Instead, the open nature of the dataset allowed the community to uncover the dangerous content and motivate a cleaned re-release, Re-LAION-5B.
In addition, much of that first dataset drew on web scraping performed by the enormous Common Crawl, which was also leveraged for ChatGPT and Llama models. Even as AI crawlers continue to raise concerns about copyright, privacy, and biased and racist labeling, developers in the AI community are working on ways to clean pieces of Common Crawl's growing open source dataset for safer use.
As developers aim not only to build powerful AI, but also AI we can trust, users and the industry alike are protected by the transparency and collaboration of true open source.
Embracing the open source path
With many still wary of this burgeoning technology, the race to become the iOS or Android of large commodity AI models is underway. As the global AI community quite literally builds what will become the standard for the future, and AI systems already drive cars and offer medical assessments, establishing trust by creating unbiased, reliable, and safe AI has never been more critical.
Even as China's AI community tries to position itself as the champion of open innovation, the path to safe AI is found only in the transparency of true open source that has been proven through decades of software innovation. Throwing the term onto systems that withhold critical pieces like data does not allow developers to investigate, replicate, and iterate. While the allure of readily available models like DeepSeek, ERNIE 4.5, Kimi K2, and Qwen3 is undeniable, developers who leverage them trade the transparency that fosters collaboration and innovation for convenience.
The AI community must choose: embrace radical transparency through genuine open source, or risk building tomorrow’s critical systems on today’s black boxes.