The End of Chain-of-Thought? CoreThink and University of California Researchers Propose a Paradigm Shift in AI Reasoning

For years, the race in artificial intelligence has been about scale. Bigger models, more GPUs, longer prompts. OpenAI, Anthropic, and Google have led the charge with massive large language models (LLMs), reinforcement learning fine-tuning, and chain-of-thought prompting—techniques designed to simulate reasoning by spelling out step-by-step answers.
But a new technical white paper titled CoreThink: A Symbolic Reasoning Layer to reason over Long Horizon Tasks with LLMs from CoreThink AI and University of California researchers argues that this paradigm may be reaching its ceiling. The authors make a provocative claim: LLMs are powerful statistical text generators, but they are not reasoning engines. And chain-of-thought, the method most often used to suggest otherwise, is more performance theater than genuine logic.
In response, the team introduces General Symbolics, a neuro-symbolic reasoning layer designed to plug into existing models. Their evaluations show dramatic improvements across a wide range of reasoning benchmarks—achieved without retraining or additional GPU cost. If validated, this approach could mark a turning point in how AI systems are designed for logic and decision-making.
What Is Chain-of-Thought — and Why It Matters
Chain-of-thought (CoT) prompting has become one of the most widely adopted techniques in modern AI. By asking a model to write out its reasoning steps before delivering an answer, researchers found they could often improve benchmark scores in areas like mathematics, coding, and planning. On the surface, it seemed like a breakthrough.
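To make the technique concrete, here is a minimal sketch of the difference between direct prompting and chain-of-thought prompting. The `ask_model` helper is a hypothetical stand-in for whatever completion API a reader uses; it is not part of CoreThink or the paper. Only the prompt construction reflects the technique itself.

```python
# A minimal sketch of chain-of-thought prompting versus direct prompting.
# `ask_model` is a hypothetical placeholder for any LLM completion call;
# it is not an API from CoreThink or the paper under discussion.

def ask_model(prompt: str) -> str:
    """Stand-in for a call to a chat/completions endpoint of your choice."""
    raise NotImplementedError("wire this up to your model provider")

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Direct prompting: ask for the answer in one shot.
direct_prompt = f"{question}\nAnswer with the speed in km/h."

# Chain-of-thought prompting: ask the model to spell out intermediate steps
# before committing to a final answer.
cot_prompt = (
    f"{question}\n"
    "Let's think step by step, showing each calculation, "
    "then give the final answer in km/h."
)

if __name__ == "__main__":
    print(direct_prompt)
    print("---")
    print(cot_prompt)
```

The step-by-step wording is the entire intervention: no weights change, and the model is simply nudged to emit intermediate text before its final answer.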
Yet the report underscores the limitations of this approach. CoT explanations may look convincing, but studies show they are often unfaithful to what the model actually computed, rationalizing outputs after the fact rather than revealing true logic. This creates real-world risks. In medicine, a plausible narrative may mask reliance on spurious correlations, leading to dangerous misdiagnoses. In law, fabricated rationales could be mistaken for genuine justifications, threatening due process and accountability.
The paper further highlights inefficiency: CoT chains often grow excessively long on simple problems, while collapsing into shallow reasoning on complex ones. The result is wasted computation and, in many cases, reduced accuracy. The authors conclude that chain-of-thought is “performative, not mechanistic”—a surface-level display that creates the illusion of interpretability without delivering it.
Symbolic AI: From Early Dreams to New Revivals
The critique of CoT invites a look back at the history of symbolic AI. In its earliest decades, AI research revolved around rule-based systems that encoded knowledge in explicit logical form. Expert systems such as MYCIN applied hand-crafted rules to diagnose bacterial infections and recommend treatments, and fraud detection systems relied on vast rule sets to catch anomalies.
Symbolic AI had undeniable strengths: every step of its reasoning was transparent and traceable. But these systems were brittle. Encoding tens of thousands of rules required immense labor, and they struggled when faced with novel situations. Critics like Hubert Dreyfus argued that human intelligence depends on tacit, context-driven know-how that no rule set could capture. By the 1990s, symbolic approaches gave way to data-driven neural networks.
In recent years, there has been a renewed effort to combine the strengths of both worlds through neuro-symbolic AI. The idea is straightforward: let neural networks handle messy, perceptual inputs like images or text, while symbolic modules provide structured reasoning and logical guarantees. But most of these hybrids have struggled with integration. Symbolic backbones were too rigid, while neural modules often undermined consistency. The result was complex, heavy systems that failed to deliver the promised interpretability.
General Symbolics: A New Reasoning Layer
CoreThink’s General Symbolics Reasoner (GSR) aims to overcome these limitations with a different approach. Instead of translating language into rigid formal structures or high-dimensional embeddings, GSR operates entirely within natural language itself. Every step of reasoning is expressed in words, ensuring that context, nuance, and modality are preserved. This means that differences like “must” versus “should” are carried through the reasoning process, rather than abstracted away.
The framework works by parsing inputs natively in natural language, applying logical constraints through linguistic transformations, and producing verbatim reasoning traces that remain fully human-readable. When contradictions or errors appear, they are surfaced directly in the reasoning path, allowing for transparency and debugging. To remain efficient, the system prunes unnecessary steps, enabling stable long-horizon reasoning without GPU scaling.
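The paper describes this workflow at a conceptual level rather than as a public API, but the shape of such a layer can be sketched. In the hypothetical code below, `SymbolicLayer`, `ReasoningTrace`, and the individual prompts are illustrative assumptions, not CoreThink's implementation; the point is only that the layer wraps an existing model, keeps every step as readable text, and checks candidate answers against constraints stated in natural language.

```python
# Hedged sketch of a natural-language reasoning layer, following the workflow the
# paper describes (parse in natural language, apply constraints linguistically,
# keep a verbatim trace, surface contradictions). Class and method names here are
# illustrative assumptions, not CoreThink's actual API or algorithm.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ReasoningTrace:
    """Human-readable record of every intermediate step, kept verbatim."""
    steps: List[str] = field(default_factory=list)

    def add(self, step: str) -> None:
        self.steps.append(step)


class SymbolicLayer:
    """Wraps an existing base model; no retraining, only structured prompting."""

    def __init__(self, base_model: Callable[[str], str]):
        self.base_model = base_model

    def solve(self, task: str) -> Tuple[str, ReasoningTrace]:
        trace = ReasoningTrace()
        trace.add(f"Task (verbatim): {task}")

        # 1. Linguistic transformation: restate the task's constraints in plain
        #    language, preserving modality such as "must" versus "should".
        constraints = self.base_model(
            "List every hard requirement ('must') and soft preference ('should') "
            f"in the following task, one per line:\n{task}"
        )
        trace.add(f"Constraints: {constraints}")

        # 2. Candidate answer, conditioned on the extracted constraints.
        candidate = self.base_model(
            "Solve the task below while respecting each listed constraint.\n"
            f"Task: {task}\nConstraints:\n{constraints}"
        )
        trace.add(f"Candidate answer: {candidate}")

        # 3. Consistency check: contradictions are surfaced directly in the trace
        #    rather than hidden inside the model's internal computation.
        check = self.base_model(
            "Does the answer violate any constraint? Answer yes or no and explain.\n"
            f"Answer: {candidate}\nConstraints:\n{constraints}"
        )
        trace.add(f"Consistency check: {check}")

        return candidate, trace
```

Nothing in the sketch touches model weights; all of the structure lives in the prompts and the recorded trace, which is what makes a layer of this kind cheap to bolt onto an existing system.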
Because it acts as a layer rather than requiring retraining, GSR can be applied to existing base models. In evaluations, it consistently delivered accuracy improvements of between 30 and 60 percent across reasoning tasks, all without increasing training costs.
Benchmark Results
The improvements are best illustrated through benchmarks. On LiveCodeBench v6, which evaluates competition-grade coding problems, CoreThink achieved a 66.6 percent pass rate—substantially higher than leading models in its category. In SWE-Bench Lite, a benchmark for real-world bug fixing drawn from GitHub repositories, the system reached 62.3 percent accuracy, the highest result yet reported. And on ARC-AGI-2, one of the most demanding tests of abstract reasoning, it scored 24.4 percent, far surpassing frontier models like Claude and Gemini, which remain below 6 percent.
These numbers reflect more than raw accuracy. In detailed case studies, the symbolic layer enabled models to act differently. In a bug involving scikit-learn’s ColumnTransformer, for instance, a baseline model proposed a superficial patch that masked the error, while the CoreThink-augmented system identified the underlying synchronization problem and fixed it at the root. On a difficult LeetCode challenge, the base model misapplied dynamic programming and failed entirely, while the symbolic reasoning layer corrected the flawed state representation and produced a working solution.
How It Fits into the Symbolic Revival
General Symbolics joins a growing movement of attempts to bring structure back into AI reasoning. Classic symbolic AI showed the value of transparency but could not adapt to novelty. Traditional neuro-symbolic hybrids promised balance but often became unwieldy. Planner stacks that bolted search onto LLMs offered early hope but collapsed under complexity as tasks scaled.
Recent advances point to the potential of new hybrids. DeepMind’s AlphaGeometry, for instance, has demonstrated that symbolic structures can outperform pure neural models on geometry problems. CoreThink’s approach extends this trend. In its ARC-AGI pipeline, deterministic object detection and symbolic pattern abstraction are combined with neural execution, producing results far beyond those of LLM-only systems. In tool use, the symbolic layer helps maintain context and enforce constraints, allowing for more reliable multi-turn planning.
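The paper does not publish that pipeline as code, but its deterministic front end is easy to picture. The sketch below is an assumption-laden illustration: the function names `detect_objects` and `abstract_pattern`, and the convention that color 0 marks the background, are mine rather than the paper's, and the neural execution stage is omitted entirely.

```python
# Illustrative sketch of a deterministic, symbolic front end for ARC-style grid
# puzzles: find objects with plain graph search, then summarize them symbolically.
# Names and conventions here are hypothetical; the neural execution stage is omitted.

from typing import Dict, List

Grid = List[List[int]]


def detect_objects(grid: Grid) -> List[Dict]:
    """Deterministically find connected same-colored regions via 4-neighbour flood fill."""
    seen = set()
    objects = []
    for r, row in enumerate(grid):
        for c, color in enumerate(row):
            if color == 0 or (r, c) in seen:  # assume 0 is the background color
                continue
            stack, cells = [(r, c)], []
            while stack:
                y, x = stack.pop()
                if (y, x) in seen or grid[y][x] != color:
                    continue
                seen.add((y, x))
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < len(grid) and 0 <= nx < len(grid[0]):
                        stack.append((ny, nx))
            objects.append({"color": color, "cells": cells})
    return objects


def abstract_pattern(objects: List[Dict]) -> str:
    """Symbolic abstraction: summarize detected objects as a short, readable description."""
    return "; ".join(f"{len(o['cells'])}-cell object of color {o['color']}" for o in objects)


if __name__ == "__main__":
    demo = [
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 2, 2],
    ]
    print(abstract_pattern(detect_objects(demo)))
    # -> 3-cell object of color 1; 2-cell object of color 2
```

In the paper's framing, a symbolic abstraction of this kind is what gets handed to the neural component for execution, rather than asking the model to infer grid structure from raw tokens on its own.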
The key distinction is that General Symbolics does not rely on rigid logic or massive retraining. By reasoning directly in language, it remains flexible while preserving interpretability. This makes it lighter than earlier hybrids and, crucially, practical for integration into enterprise applications.
Why It Matters
If chain-of-thought is an illusion of reasoning, then the AI industry faces a pressing challenge. Enterprises cannot depend on systems that only appear to reason, especially in high-stakes environments like medicine, law, and finance. The paper suggests that real progress will come not from scaling models further, but from rethinking the foundations of reasoning itself.
General Symbolics is one such foundation. It offers a lightweight, interpretable layer that can enhance existing models without retraining, producing genuine reasoning improvements rather than surface-level narratives. For the broader AI community, it marks a possible paradigm shift: a return of symbolic reasoning, not as brittle rule sets, but as a flexible companion to neural learning.
As the authors put it: “We don’t need to add more parameters to get better reasoning—we need to rethink the foundations.”












