Artificial Intelligence
How LLMs Are Forcing Us to Redefine Intelligence

There is an old saying: If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck. This simple way of reasoning, often linked to the Indiana poet James Whitcomb Riley, has shaped how we think about artificial intelligence for decades. The idea that behavior alone is enough to identify what something is inspired Alan Turing’s famous “Imitation Game,” now known as the Turing Test.
Turing suggested that if a human cannot tell whether they are conversing with a machine or another human, then the machine can be said to be intelligent. Both the duck test and the Turing Test imply that what matters is not what lies inside a system, but how it behaves. For decades, this test guided advances in AI. But with the arrival of large language models (LLMs), the situation has changed. These systems can write fluent text, hold conversations, and solve tasks in ways that feel remarkably human. The question is no longer whether machines can mimic human conversation, but whether this imitation amounts to true intelligence. If a system can write like us, reason like us, and even create like us, should we call it intelligent? Or is behavior alone no longer enough to measure intelligence?
The Evolution of Machine Intelligence
Large language models have changed how we think about AI. These systems, once limited to generating basic text responses, can now solve logic problems, write computer code, draft stories, and even assist with creative tasks like screenwriting. One key development behind this progress is their ability to solve complex problems through step-by-step reasoning, a method known as chain-of-thought reasoning. By breaking a problem into smaller parts, an LLM can work through complex math problems or logical puzzles in a way that looks similar to human problem-solving. This capability has enabled them to match or even surpass human performance on advanced benchmarks like MATH or GSM8K. Today, LLMs also possess multimodal capabilities: they can work with images, interpret medical scans, explain visual puzzles, and describe complex diagrams. With these advances, the question is no longer whether LLMs can mimic human behavior, but whether this behavior reflects genuine understanding.
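To make the idea concrete, here is a minimal sketch of how chain-of-thought reasoning is typically elicited through prompting. The call_model() function is a hypothetical stand-in for whatever LLM API is being used (its canned reply only keeps the example runnable); the point is the prompt structure, which pairs a worked example with a cue to reason step by step before answering.

```python
# A minimal sketch of chain-of-thought prompting. call_model() is a hypothetical
# stand-in for a real LLM API; the canned reply just keeps the example runnable.

def call_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    return ("Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. "
            "5 + 6 = 11. The answer is 11.")


def chain_of_thought_prompt(question: str) -> str:
    # A worked example shows the step-by-step format we want, and the trailing
    # cue asks the model to lay out its reasoning before giving an answer.
    worked_example = (
        "Q: A farmer has 3 pens with 4 sheep in each. How many sheep in total?\n"
        "A: Each pen holds 4 sheep and there are 3 pens. 3 * 4 = 12. "
        "The answer is 12.\n\n"
    )
    return worked_example + f"Q: {question}\nA: Let's think step by step."


question = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 balls. How many tennis balls does he have now?")
print(call_model(chain_of_thought_prompt(question)))
```

Because the prompt asks for intermediate steps rather than a bare answer, errors tend to surface in the visible reasoning, which is part of why this technique improves performance on math and logic benchmarks.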
Traces of Human-Like Thinking
The success of LLMs is redefining the way we understand intelligence. The focus is shifting from matching AI behavior to human behavior, as the Turing Test suggests, to examining how closely LLMs mirror human thinking in the way they process information, that is, whether they exhibit genuinely human-like thinking. For example, in a recent study, researchers compared the internal workings of AI models with human brain activity. The study found that LLMs with over 70 billion parameters not only achieved human-level accuracy but also organized information internally in ways that matched human brain patterns.
When both humans and AI models worked on pattern recognition tasks, brain scans showed similar activity patterns in the human participants and corresponding computational patterns in the AI models. The models clustered abstract concepts in their internal layers in ways that closely matched human brain-wave activity. This suggests that successful reasoning might require similar organizational structures, whether in biological or artificial systems.
However, researchers are careful to note the limitations of this work. The study involved a relatively small number of human participants, and humans and machines approached the tasks differently. Humans worked with visual patterns while the AI models processed text descriptions. The correlation between human and machine processing is intriguing, but it does not prove that machines understand concepts the same way humans do.
There are also clear differences in performance. While the best AI models approached human-level accuracy on simple patterns, their performance dropped far more sharply than that of human participants on the most complex tasks. This suggests that despite similarities in organization, there may still be fundamental differences in how humans and machines process difficult abstract concepts.
The Skeptical Perspective
Despite these impressive findings, there is a strong counterargument: LLMs may be nothing more than very skilled mimics. This view draws on philosopher John Searle’s “Chinese Room” thought experiment, which illustrates why behavior may not equal understanding.
In this thought experiment, Searle asks us to imagine a person locked in a room who speaks only English. The person receives Chinese symbols and uses an English rulebook to manipulate them and produce responses. From outside the room, those responses look exactly like those of a native Chinese speaker. However, Searle argues that the person understands nothing about Chinese. They simply follow rules without any real understanding.
Critics apply this same logic to LLMs. They argue these systems are “stochastic parrots” that generate responses based on statistical patterns in their training data, not genuine understanding. The term “stochastic” refers to their probabilistic nature, while “parrot” emphasizes their imitative behavior without real understanding.
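A toy sketch makes the critique concrete. The word-level probabilities below are invented for illustration; a real model learns token statistics from vast training corpora, but the generation loop is the same in spirit: pick the next word by sampling from a learned distribution, with no reference to meaning.

```python
import random

# A toy "stochastic parrot": it knows nothing about meaning, only which word
# tends to follow which. The probabilities are invented for illustration and
# stand in for statistics a real model learns from its training data.
next_word_probs = {
    "the":     [("cat", 0.5), ("dog", 0.3), ("duck", 0.2)],
    "cat":     [("sat", 0.6), ("ran", 0.4)],
    "dog":     [("barked", 0.7), ("sat", 0.3)],
    "duck":    [("quacked", 0.8), ("swam", 0.2)],
    "sat":     [("quietly", 1.0)],
    "ran":     [("away", 1.0)],
    "barked":  [("loudly", 1.0)],
    "quacked": [("loudly", 1.0)],
    "swam":    [("away", 1.0)],
}

def generate(start: str, max_words: int = 4) -> str:
    words = [start]
    for _ in range(max_words):
        options = next_word_probs.get(words[-1])
        if not options:  # no learned continuation: stop
            break
        choices, weights = zip(*options)
        # The "stochastic" step: sample the next word from the learned distribution.
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the"))  # e.g. "the duck quacked loudly" -- fluent, yet nothing is understood
```

Scaled up by many orders of magnitude, the same sampling principle yields far more convincing text, which is exactly the skeptics’ point: better statistics, not necessarily understanding.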
Several technical limitations of LLMs also support this argument. LLMs frequently generate “hallucinations”: responses that look plausible but are incorrect, misleading, or nonsensical. This happens because they select statistically plausible words rather than consulting an internal knowledge base or any notion of truth and falsehood. These models also reproduce human-like errors and biases. They get confused by irrelevant information that humans would easily ignore. They exhibit racial and gender stereotypes because they learned from data containing these biases. Another revealing limitation is “position bias,” where models overemphasize information at the beginning or end of long documents while neglecting the middle. This “lost-in-the-middle” phenomenon suggests that these systems process information very differently from humans, who can maintain attention across an entire document.
These limitations highlight a central challenge: while LLMs excel at recognizing and reproducing language patterns, this does not mean they truly understand meaning or real-world context. They perform well at handling syntax but remain limited when it comes to semantics.
What Counts as Intelligence?
The debate ultimately comes down to how we define intelligence. If intelligence is the capacity to generate coherent language, solve problems, and adapt to new situations, then LLMs already meet that standard. However, if intelligence requires self-awareness, genuine understanding, or subjective experience, these systems still fall short.
The difficulty is that we lack a clear or objective way to measure qualities like understanding or consciousness. In both humans and machines, we infer them from behavior. The duck test and the Turing Test once provided elegant answers, but in the age of LLMs, they may no longer suffice. Their capabilities force us to reconsider what truly counts as intelligence and whether our traditional definitions are keeping pace with technological reality.
The Bottom Line
Large language models challenge how we define AI intelligence. They can mimic reasoning, generate ideas, and perform tasks once seen as uniquely human. Yet they lack the awareness and grounding that shape true human-like thinking. Their rise forces us to ask not only whether machines act intelligently, but what intelligence itself really means.












