

Game-Generated Data Might Be the Most Undervalued Resource in AI Training


AI companies have spent the last five years consuming every piece of text, every image, and every scrap of publicly available data on the internet. That supply is finite, and we are coming closer to the point where there simply isn’t enough data left to sustain the pace of progress it has come to depend on.

However, there’s an obvious candidate that the AI industry has largely overlooked. 

I build game systems for a living, and the data that flows through them every day is unlike anything most AI researchers have ever worked with. And yet almost nobody outside of gaming seems to be paying attention to it.

Gaming platforms generate terabytes of behavioral data every day: structured streams of real-time decisions, economic activity, and social interaction, all inside environments built on consistent physical rules.

Almost none of this data has been used for AI training. And the companies that have used it, from DeepMind to NVIDIA, have produced some of the most significant breakthroughs in the field. 

AI’s data problem

A study from Epoch AI projects that the stock of publicly available, human-generated text data will be fully used up somewhere between 2026 and 2032. The models behind ChatGPT, Gemini, and Claude have already consumed essentially everything the internet has to offer.

Synthetic data, text that AI generates to feed back into its own training, is the industry’s go-to workaround. But models trained on their own output degrade over time through a documented phenomenon researchers call model collapse.

What I believe the field needs is rich, interactive, multimodal information where cause and effect happen in real time and every action has a measurable consequence. Games produce exactly this, and they do it at a scale that almost nothing else can match.

Gaming platforms push terabytes of behavioral data through their systems every day. Player movements, strategic choices, reaction times, economic transactions, and social interactions all flow through structured, time-stamped streams that most AI researchers have never touched.

A recent academic paper on game-generated data lays out a nine-category taxonomy of this information and argues that the vast majority of it remains completely untapped by the AI industry.

I can confirm that from my own experience. The amount of data that flows through our game systems on any given day would be considered a goldmine in any other area of AI research. In games, it just gets archived or discarded.

Why game data is different

When you build inside a game engine for long enough, you start to realize how much structured data you’re sitting on that nobody in AI has asked for yet. Every session produces synchronized physics, player behavior, and system-level cause and effect at a scale that’s difficult to find anywhere else.

Game engines enforce physics. Objects fall, collide, and break according to consistent rules, which means the data carries causal relationships baked in at the system level rather than patterns a model has to guess at from text correlations.

When a player launches a projectile, the engine calculates trajectory, wind resistance, and impact. The AI learns from an environment that demonstrates physics directly through every interaction, rather than one that treats physical laws as statistical approximations.
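To make that concrete, here is a minimal sketch of the kind of causally structured data a physics step produces. The solver and parameters are my own toy illustration, not any real engine's code, but the principle holds: every logged state follows from the previous one by fixed rules.

```python
import math

def simulate_projectile(v0, angle_deg, drag=0.1, dt=0.01, g=9.81):
    """Integrate a projectile under gravity and simple linear drag,
    logging every intermediate state.

    Toy sketch: real engines use more elaborate solvers, but each
    logged state is still causally linked to the one before it.
    """
    angle = math.radians(angle_deg)
    x, y = 0.0, 0.0
    vx, vy = v0 * math.cos(angle), v0 * math.sin(angle)
    states = [(0.0, x, y)]
    t = 0.0
    while y >= 0.0:
        # Drag opposes velocity; gravity pulls straight down.
        vx += -drag * vx * dt
        vy += (-g - drag * vy) * dt
        x += vx * dt
        y += vy * dt
        t += dt
        states.append((t, x, y))
    return states

trajectory = simulate_projectile(v0=30.0, angle_deg=45.0)
impact_time, impact_x, _ = trajectory[-1]
```

A model trained on streams like `trajectory` sees physics demonstrated step by step rather than inferred from text correlations.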

There’s also the multimodal alignment problem. In a game, visual data, audio cues, player inputs, and environmental state all occur simultaneously and get logged together. That kind of natural synchronization costs a fortune to replicate in real-world datasets, where researchers typically have to label and align each modality by hand.
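A sketch of what that free alignment looks like in practice: one engine tick logs every modality under a single shared clock. The field names here are illustrative inventions, not any real engine's schema.

```python
from dataclasses import dataclass

@dataclass
class FrameLog:
    """One engine tick: all modalities share a single timestamp.

    Hypothetical schema for illustration only.
    """
    tick: int
    timestamp_ms: float
    frame_id: str       # reference to the rendered frame
    audio_events: list  # cues fired this tick
    player_inputs: list # buttons and axes sampled this tick
    world_state: dict   # positions, health, inventory, etc.

def log_tick(tick, fps=60.0, **modalities):
    # Every modality inherits the same clock, so cross-modal
    # alignment comes for free instead of being labeled by hand.
    return FrameLog(tick=tick, timestamp_ms=tick * 1000.0 / fps, **modalities)

entry = log_tick(
    tick=120,
    frame_id="frame_000120.png",
    audio_events=["footstep"],
    player_inputs=[{"button": "jump", "pressed": True}],
    world_state={"player_pos": (12.5, 0.0, -3.2)},
)
```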

Games produce edge cases at scale, too, through procedural content generation. No Man’s Sky has 18 quintillion unique planets, and for AI, that variation matters enormously because edge cases determine whether a model works reliably or fails dangerously.
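The mechanism behind that variation is worth spelling out. A toy sketch of seeded procedural generation, with parameters I invented for illustration: each seed deterministically yields a world, so rare parameter combinations can be enumerated rather than waited for.

```python
import random

def generate_planet(seed):
    """Deterministically derive planet parameters from a seed.

    Toy sketch of procedural content generation: the same seed
    always reproduces the same world.
    """
    rng = random.Random(seed)
    return {
        "seed": seed,
        "gravity": rng.uniform(0.2, 3.0),  # in Earth gravities
        "atmosphere": rng.choice(["none", "thin", "dense", "toxic"]),
        "terrain_roughness": rng.random(),
    }

# Scan seeds for an edge case: near-zero gravity with a dense atmosphere.
edge_cases = [
    p for p in (generate_planet(s) for s in range(10_000))
    if p["gravity"] < 0.3 and p["atmosphere"] == "dense"
]
```

Because generation is deterministic, any edge case a model fails on can be replayed exactly from its seed.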

And then there’s emergent complexity, which might be the most valuable property of all. When OpenAI placed agents in a simple hide-and-seek game, those agents developed six distinct phases of sophisticated strategy entirely on their own over hundreds of millions of rounds.

They built shelters from movable objects, used ramps to breach fortifications, and even exploited physics glitches to surf boxes over walls. None of it was programmed. It all emerged from competition within the game environment, without a single line of code that told them to do any of it.

That kind of self-generated complexity is exactly what AI research needs at scale, and games are the only environments that produce it reliably without expensive human oversight.

From game boards to Nobel prizes

The clearest proof that game-trained AI transfers to the real world is a system that went on to win a Nobel Prize, and it’s the example I keep coming back to when people ask me why I built my career around games and AI.

DeepMind started with AlphaGo in 2016, then built AlphaZero, a system that taught itself chess, Go, and shogi without any human knowledge. AlphaZero’s architecture became the foundation for AlphaFold, which solved the 50-year-old protein folding problem and earned its creators the 2024 Nobel Prize in Chemistry. 

DeepMind CEO Demis Hassabis has been open about this pipeline. He told Scientific American that games were never the end goal but rather the most efficient way to develop and test AI techniques before applying them to real scientific problems. 

I remember reading that and feeling like someone had articulated exactly what I’d been seeing from the inside of game development for years.

That trajectory has since repeated itself across the field. The reinforcement learning environments that OpenAI first standardized through Gymnasium now underpin research in robotics, autonomous vehicles, and industrial automation.

The game-like structure of agent, environment, action, and reward started as a research convenience and has since become the default framework for any AI system that needs to act in the physical world.
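That loop can be sketched in a few lines. The environment below is a deliberately trivial stand-in, not a real game or the actual Gymnasium library, but its reset/step shape mirrors the agent-environment-action-reward structure the framework standardized.

```python
import random

class CoinFlipEnv:
    """Toy environment illustrating the agent-environment loop.

    Hypothetical example: the agent observes a hidden state and is
    rewarded for matching it with its action.
    """
    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.state = self.rng.choice([0, 1])
        return self.state

    def step(self, action):
        # Reward 1 for matching the current state, 0 otherwise,
        # then advance the environment to a new state.
        reward = 1.0 if action == self.state else 0.0
        self.t += 1
        self.state = self.rng.choice([0, 1])
        terminated = self.t >= self.episode_length
        return self.state, reward, terminated

env = CoinFlipEnv()
obs = env.reset()
total_reward = 0.0
done = False
while not done:
    action = obs  # trivial policy: echo the observation back
    obs, reward, done = env.step(action)
    total_reward += reward
```

Swap the toy environment for a driving simulator or a robot arm and the loop is unchanged, which is exactly why this structure became the default framework.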

Games as the new simulation layer

In December 2025, NVIDIA released NitroGen, a foundation model trained on 40,000 hours of gameplay across more than 1,000 titles. The model watches publicly available gameplay videos, extracts player actions from controller overlays, and learns to play games directly from raw pixels.

On games it had never encountered, NitroGen showed up to a 52% improvement in task success compared to models trained from scratch. But the real significance lies in the architecture underneath.

NitroGen runs on NVIDIA’s GR00T robotics framework, the same foundation the company uses for physical AI and sim-to-real transfer in its Isaac Sim platform. The gaming agent and the factory robot share the same underlying system.

NVIDIA’s Jim Fan described the project as an attempt to build “a GPT for actions,” a general-purpose model that learns to operate in any environment. 

As someone who builds game systems that generate exactly the kind of data these models consume, I find it hard to overstate what that means for the industry I work in.

And this isn’t limited to NVIDIA. Waymo has logged over 20 billion simulated miles to train its autonomous vehicles, all in game-engine-style environments that rehearse scenarios too dangerous or too rare to test on real roads. 

Surgical platforms built on game engines have shown dramatic improvements in trainee performance. Urban planners use similar tools for traffic optimization at the city scale. The game engine has become a universal simulation layer wherever AI needs to learn through interaction with its environment.

The infrastructure nobody talks about

When people discuss AI infrastructure, they tend to mean data centers, GPU clusters, and compute. In all the years I’ve worked in games, I can count on one hand the number of times I’ve heard someone in the AI space bring up game environments in the same breath. That disconnect is going to close very quickly.

This will only become more obvious as traditional datasets run dry. The industries that produce the richest interactive data will inevitably move toward the center of AI research, and games, simulations, and virtual worlds are better positioned than anything else to fill that gap.

The money is already following this trend. The AI-in-gaming market was valued at $4.54 billion in 2025 and is projected to reach $81 billion by 2035.

Most game studios I talk to still think of themselves as entertainment companies. But when your systems generate the exact data that the next generation of AI models needs to train on, you are in the infrastructure business whether you planned to be or not. 

Ilman Shazhaev is the Founder and CEO of Dizzaract, the largest gaming studio in the MENA region. He’s an AI researcher and United Nations expert under the UNODC program working at the intersection of artificial intelligence and real-world impact.