Thought Leaders
The Critical Path to Automating Model Development

The next important milestone for AI research is to automate model development. Every advance in reasoning, language, and perception is, in some sense, a step toward that goal. Reaching it, however, requires cracking a set of foundational challenges first.
The bridge to that goal runs directly through machine learning (ML) engineering. A common misconception holds that ML is a predecessor technology to modern AI and that foundation models have simply replaced it. This misunderstands the relationship. As an academic discipline, ML encompasses all aspects of model training, including the training of foundation models at the center of the current AI moment. There is, however, a meaningful difference in scale and data complexity.
Traditional ML models are typically trained on carefully curated, domain-specific datasets containing thousands or millions of examples. Foundation models, by contrast, are trained on thousands of datasets simultaneously, drawn from vastly different sources with inconsistent formats, provenance, and quality. This difference in data scale and heterogeneity is a fundamental reason why data management becomes much harder and more important as models grow more powerful.
That makes data understanding a central bottleneck in automating model development. An AI system that can interpret heterogeneous data and improve the pipelines built around it could, in principle, improve its own training process and help build better models. Once AI can improve the process by which it is trained, improvements cascade downstream to every domain where AI is applied.
Three Barriers Standing in the Way
The first barrier is context fragmentation. In almost every organization, the signals, experiments, feature definitions, and institutional knowledge relevant to any given modeling problem are scattered across data warehouses, notebooks, and pipelines that were never designed to communicate with each other. Consider a healthcare system building a sepsis detection model. The clinical criteria relevant to that problem, such as vital thresholds, lab values, and documentation standards, may live in entirely separate modules of an electronic health record system.
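To make the fragmentation concrete, here is a minimal sketch of what stitching together sepsis criteria from separate systems looks like today. All module names, fields, and thresholds below are illustrative assumptions, not a real EHR schema:

```python
# Hypothetical sketch: the context for one modeling problem is scattered
# across modules that were never designed to share it. Every name and
# threshold here is invented for illustration.

vitals_module = {"sepsis_hr_threshold": 90, "sepsis_temp_threshold_c": 38.0}
lab_module = {"lactate_alert_mmol_l": 2.0, "wbc_high_k_ul": 12.0}
documentation_module = {"screening_note_required": True}

def gather_sepsis_context():
    """Manually stitch together criteria from three separate modules.

    In practice each lookup is a different API, schema, and access-control
    regime -- the stitching itself is the unautomated work."""
    context = {}
    for source, fields in [("vitals", vitals_module),
                           ("labs", lab_module),
                           ("documentation", documentation_module)]:
        for key, value in fields.items():
            context[f"{source}.{key}"] = value
    return context

context = gather_sepsis_context()
print(len(context))  # 5 criteria, pulled from 3 disconnected sources
```

The point of the sketch is that the hard part is not the merge loop; it is knowing that these three sources exist and are relevant to the problem at all.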
The second barrier is semantic ambiguity. Meaning is not inherent in data; it is contextual and organizational. The same field name in two different databases may refer to subtly different things, and concepts like revenue, active user, and churn routinely have multiple valid definitions within a single company. Take a concept as seemingly plain as “revenue.” A sales team may define it as the total value of contracts signed this quarter, while the finance team defines it as cash actually received. The product team defines it differently again: recognized revenue spread across a subscription term. All three are pulling from fields literally named “revenue” in their respective systems, but a cross-team report combining them would silently mix three incompatible numbers.
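The three-definitions problem can be sketched in a few lines. The contract figures below are invented, and the one-quarter recognition assumption is purely illustrative:

```python
# Hypothetical sketch: three teams compute "revenue" from the same two
# contracts, each with a valid but incompatible definition.

contracts = [
    # (contract_value, cash_received_this_quarter, months_in_term)
    (120_000, 30_000, 12),
    (60_000, 60_000, 6),
]

# Sales: total value of contracts signed this quarter.
sales_revenue = sum(value for value, _, _ in contracts)

# Finance: cash actually received this quarter.
finance_revenue = sum(cash for _, cash, _ in contracts)

# Product: recognized revenue, spread evenly across each subscription
# term (assume 3 months of every term have elapsed).
product_revenue = sum(value / months * 3 for value, _, months in contracts)

print(sales_revenue, finance_revenue, product_revenue)
# 180000 90000 60000.0 -- three fields all named "revenue",
# three different numbers, no error raised anywhere.
```

Nothing in the data itself distinguishes these definitions; the disambiguating context lives in the heads of the three teams.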
The third and most systemic barrier is the absence of documented organizational memory. Tracking provenance, resolving inconsistencies, and maintaining quality signals across so many sources is an unsolved problem even for human teams. Without an institutional memory of what was tried and how well each approach worked, any model automation mechanism will keep rediscovering the same dead ends, wasting time and resources.
Consider a data science team at a retail company building a demand forecasting model. Over three years, a dozen analysts have each independently discovered that raw weather data degrades model performance during holiday weeks, that a particular supplier’s inventory feed contains a systematic lag, and that the standard approach to handling promotional events causes target leakage. When those analysts moved to other teams or left the company, the knowledge left with them. Without an institutional record of what was tried, what failed, and why, a model automation mechanism cannot build on accumulated experience. It simply starts from zero, again and again.
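The missing institutional memory need not be elaborate. A minimal sketch, with an invented schema and the retail team’s three hard-won lessons as entries, might look like this:

```python
# Hypothetical sketch of an institutional-memory store: what was tried,
# what happened, and why. Field names are illustrative assumptions,
# not a real tool's schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExperimentRecord:
    approach: str
    outcome: str
    reason: str

# Knowledge each departing analyst would otherwise take with them.
memory = [
    ExperimentRecord("raw weather features",
                     "degraded holiday-week accuracy",
                     "weather signal confounded by promotion calendar"),
    ExperimentRecord("supplier B inventory feed",
                     "systematic forecast bias",
                     "feed lags actual stock levels"),
    ExperimentRecord("naive promotional-event encoding",
                     "inflated validation scores",
                     "target leakage from post-outcome fields"),
]

def already_failed(approach: str) -> Optional[ExperimentRecord]:
    """Before re-running an idea, check whether it is a known dead end."""
    for record in memory:
        if record.approach == approach:
            return record
    return None

hit = already_failed("raw weather features")
print(hit.reason if hit else "no prior record")
```

An automation mechanism that consults such a record before each experiment stops rediscovering dead ends; one that cannot is condemned to the loop described above.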
What a Real Solution Requires
The history of ML automation is a history of partial solutions. AutoML addressed the narrow problem of hyperparameter tuning but could not handle objective mismatches or reason about organizational intent. MLOps made production pipelines more robust and easier to monitor, but MLOps tools execute a strategy rather than define it. More recent coding agents represent a genuine step forward, but they have inherited the same blind spot: they generate code well but operate without organizational context or institutional memory.
A system capable of genuinely autonomous ML engineering would need capabilities that no existing tool provides in combination. It would need to map business goals to model objectives, which is a translation that cannot be inferred from data alone. It would need to discover relevant data across fragmented systems with inconsistent schemas, while automatically adhering to compliance, governance, and security constraints, rather than requiring humans to manage them as a separate process. It would need institutional memory to surface existing work, understand why past experiments were abandoned, and build on what colleagues already know.
Rigorous audit trails that track provenance across data versions, feature definitions, and code commits would need to be a core mechanism for grounding the system in what actually happened. And any such system would require thoughtful human-in-the-loop design. Not a binary choice between full automation and full manual control, but support for varying levels of interaction depending on the task, the stakes, and the system’s confidence at each decision point. Automation that bypasses human judgment at critical moments is not a feature of well-designed AI; rather, it is a failure mode.
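The varying-levels-of-interaction idea can be sketched as a simple routing rule. The thresholds and action names below are illustrative assumptions, not a prescription:

```python
# Hypothetical sketch of confidence-gated human-in-the-loop routing.
# Thresholds and action names are invented for illustration.

def route_decision(stakes: str, confidence: float) -> str:
    """Pick an interaction level per decision point, rather than choosing
    once between full automation and full manual control."""
    if stakes == "high":
        # High-stakes steps always surface to a human, whatever the confidence.
        return "require_human_approval"
    if confidence >= 0.9:
        return "auto_apply_and_log"    # proceed, but leave an audit trail
    if confidence >= 0.6:
        return "propose_for_review"    # draft the change; a human confirms
    return "ask_human"                 # system is unsure; escalate

print(route_decision("high", 0.99))  # require_human_approval
print(route_decision("low", 0.95))   # auto_apply_and_log
print(route_decision("low", 0.70))   # propose_for_review
```

The design point is that the gate is evaluated per decision, so the same system can be fully autonomous on routine steps and fully deferential on consequential ones.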
What no lab has yet solved is how to create a semantic understanding of organizational data: a representation of what the data means in a specific institutional context. The Model Context Protocol (MCP) solves the connectivity problem; it does not yet solve the meaning problem. That remains the open research frontier.
What Becomes Possible
The economic implications of solving these problems are significant. Custom ML development today requires specialist practitioners and weeks of iteration, even for well-scoped problems. A system that could navigate the full workflow autonomously, from problem definition through data discovery, model development, and model evaluation, would shift that equation dramatically, compressing timelines and opening high-value use cases that are currently too resource-intensive to pursue. Projects that once required teams with deep ML expertise working for weeks could be completed in days, without consuming so much of scarce ML experts’ time.
The challenges of context fragmentation, semantic ambiguity, and missing institutional memory are not unique to enterprise ML. They manifest under different constraints in the construction of foundation model training pipelines, where thousands of heterogeneous datasets must be aggregated, filtered, and iteratively refined. While the two settings differ in structure and objective, both are limited by the same underlying bottleneck: the absence of systems that can reliably recover context, track provenance, and build on prior work across iterations. Automating model development in the enterprise is therefore a critical step on the path toward AI systems capable of improving themselves.