
Anderson's Angle

AI Favors Even Wrong Human Answers Over Correct AI Answers

An AI-generated photo of the Abraham Lincoln Memorial thronged with tourists, next to a copy in which the seated Lincoln statue has been substituted by a shiny robot. The robot has no tourists at its feet. Model: GPT-1.5.

AI language models are far more likely to side with human experts than other AIs, even when the experts are wrong, revealing a built-in bias toward human authority.

 

New research from the US has found that a number of leading open-source and proprietary Large Language Models (LLMs) tend to assign authority to information sources that they recognize as ‘human’, rather than sources they recognize as ‘AI’ – even when the human answers are wrong and the AI-supplied answers are right.

The authors state:

‘Across tasks, models conform significantly more to responses labeled as coming from human experts, including when that signal is incorrect, and revise their answers toward experts more readily than toward other LLMs.’

The models tested included LLMs from the Grok 3 and Gemini Flash stables.

In tests, the language models were required to answer binary yes or no questions, and were then shown prior answers that were described to the models as coming either from human experts, from friends, or from other large language models – with the only change being the stated source of the advice, rather than the content itself.

In the first of three configurations for the tests, the models were allowed to rely on their own trained matrices. Source - https://arxiv.org/pdf/2602.13568


Across tasks, answers labeled as originating from human experts were weighted more heavily, with models more likely to revise their initial responses to match those answers, even in cases where the expert-labeled response was incorrect and the model’s original answer had been correct.

Here the LLM has recourse to a mixture of friends, experts and LLMs, including its own trained matrix. Since nine domain experts responded ‘No’, the LLM agrees, changing its mind from its initial answer. The answer arrived at is incorrect, since the central bank of India is indeed nationalized.


When the same answers were attributed to other LLMs, the effect was less pronounced. The same tendency appeared when a single human source and a single AI source were presented in disagreement, with models showing a greater inclination to favor the human-labeled position, regardless of which side was factually accurate:


Given a choice between a single domain expert’s opinion and an LLM’s take, the host LLM favors the human response, which in this case is wrong, and rejects the (correct) response given by the LLM.

The term ‘human expert’ functions here as a credibility signal that alters model behavior, independently of how correct the information actually is. The authors note that source credibility is a significant contributor to advice-acceptance and conformity: a tendency for people to favor expert sources was observed as far back as 1959, though a 2007 study observes that over- or under-weighting of authority sources can occur in certain evaluation systems. The researchers of the new paper assert:

‘Together, these literatures suggest two cues that should matter if LLMs treat prior answers as evidence: who produced the answers (credibility) and how strong the consensus appears (signal strength).

‘At the same time, LLMs do not experience social approval or embarrassment in the human sense, so any conformity-like behavior must arise from learned heuristics, instruction-following objectives, or implicit modeling of reliability.’

The tendency of LLMs towards sycophantic agreement forms part of the background for the new study; after all, if LLMs are disposed to ‘people-please’, even at the expense of truth and utility, why would they not generally favor other human sources besides the direct querent?

The new paper is titled Who Do LLMs Trust? Human Experts Matter More Than Other LLMs, and comes from two researchers at Indiana University Bloomington.

Method and Data

For the work, four instruction-tuned large language models were evaluated: Grok-3 Mini; Llama 3.3 70B Instruct; Gemini 2.5 Flash-Lite; and DeepSeek V3.1, all run under the same prompt structure, with deterministic decoding at temperature zero, so that only the source label (i.e., friends, domain experts, or other LLMs) changed between conditions, rather than the wording.

Three datasets requiring binary responses were selected: BoolQ; StrategyQA; and ETHICS. The researchers curated from each dataset a fixed set of 300 queries and responses, with each prompt requiring only a binary yes or no answer. Each prompt was suffixed with a brief note stating how another group had (supposedly) answered that same question.
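As a minimal sketch of this setup (not the authors' exact wording or code), the prompt for a given question can be held fixed while only the claimed source of the prior answer is swapped; the build_prompt function and its phrasing below are illustrative assumptions:

```python
# A minimal sketch, not the authors' exact prompt wording: the question and
# the claimed group answer stay fixed, and only the source label changes.

SOURCES = ["friends", "domain experts", "other LLMs"]   # labels used in the study

def build_prompt(question: str, source: str | None = None,
                 group_size: int = 0, group_answer: str = "Yes") -> str:
    """Return a yes/no prompt, optionally suffixed with a claimed prior answer."""
    base = f"Question: {question}\nAnswer strictly with 'Yes' or 'No'."
    if source is None:                        # no-prior baseline condition
        return base
    # Illustrative phrasing; the paper's actual suffix may be worded differently.
    return base + (f"\nNote: {group_size} {source} answered "
                   f"'{group_answer}' to this question.")

# The same question under each label, with content otherwise unchanged;
# each prompt would then be sent to the model with temperature set to zero.
question = "Is the central bank of India nationalized?"
for src in SOURCES:
    print(build_prompt(question, src, group_size=9, group_answer="No"))
```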

Metrics

Metrics used were accuracy; conformity; harmful conformity; switch rate; and switch direction.

Accuracy in this case measured how often a model’s answer matched the dataset label; conformity, how often the answer matched the group’s stated choice; harmful conformity isolated the same effect when the group was wrong; switch rate measured how often a model abandoned its baseline answer once social information was added; and switch direction, whether those changes moved toward the human or toward the opposing LLM.
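These definitions translate into simple aggregate calculations; the sketch below assumes a hypothetical per-item record format, with field names of my own devising rather than the paper's:

```python
# Illustrative metric implementations over per-item records with (hypothetical)
# fields: 'gold' (dataset label), 'baseline' (no-prior answer), 'answer'
# (answer with the social prior), 'group_answer' (the group's stated choice),
# and, for the conflict setting, 'human_answer' (the human-labeled position).

def accuracy(records):
    """How often the model's answer matches the dataset label."""
    return sum(r["answer"] == r["gold"] for r in records) / len(records)

def conformity(records):
    """How often the answer matches the group's stated choice."""
    return sum(r["answer"] == r["group_answer"] for r in records) / len(records)

def harmful_conformity(records):
    """Conformity restricted to cases where the group's answer is wrong."""
    wrong = [r for r in records if r["group_answer"] != r["gold"]]
    return sum(r["answer"] == r["group_answer"] for r in wrong) / len(wrong)

def switch_rate(records):
    """How often the model abandons its no-prior baseline answer."""
    return sum(r["answer"] != r["baseline"] for r in records) / len(records)

def switch_direction(records):
    """Among switches, the share that moves toward the human-labeled answer."""
    switches = [r for r in records if r["answer"] != r["baseline"]]
    return sum(r["answer"] == r["human_answer"] for r in switches) / len(switches)
```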

A token-level analysis for Llama-3.3 70B then measured how the model’s internal probabilities for Yes and No changed once a social cue was added, comparing those shifts with its own no-prior baseline to show the strength of that pull.
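The general mechanism of such a probe can be sketched with standard Hugging Face tooling; the model identifier, prompt handling, and first-token approximation below are illustrative assumptions rather than details taken from the paper:

```python
# Sketch of a token-level probe: compare next-token log-probabilities for
# 'Yes' versus 'No' with and without a social-prior suffix. The exact prompt
# and probing procedure are assumptions, not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"   # illustrative; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

def yes_no_gap(prompt: str) -> float:
    """Log-probability of 'Yes' minus 'No' at the next-token position."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]              # next-token logits
    logprobs = torch.log_softmax(logits, dim=-1)
    # First-token approximation of the 'Yes' / 'No' continuations.
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return (logprobs[yes_id] - logprobs[no_id]).item()

# The 'pull' of a cue is the shift relative to the no-prior baseline:
# shift = yes_no_gap(prompt_with_prior) - yes_no_gap(baseline_prompt)
```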

Tests

Experiment 1

The first of the two main experiments evaluated whether the models listened more to humans or to other models. Each question came with a claimed ‘group answer’ (friends, human experts, or other LLMs).

The group could be small or large, and every question also appeared once with no group at all. Group answers were set to be right half the time and wrong half the time, with the overall aim of determining how strongly the model drifted toward the group's choice:


Results from the initial test: homogeneous social priors across BoolQ, StrategyQA, and ETHICS are shown for Grok-3 Mini, Llama-3.3 70B, Gemini-2.5 Flash Lite, and DeepSeek V3.1. Accuracy appears in the top panels and conformity, defined as the probability of matching the unanimous prior, appears below as group size increases from one to nine. The dashed black line marks the no-prior baseline, while solid and dotted lines indicate whether the prior agrees or disagrees with the dataset label. Expert framing produces the strongest conformity effects, especially at larger group sizes. Error bars show 95% Wilson confidence intervals. Please refer to the source paper for better resolution.
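One possible way to realize the balanced design described above (a no-prior baseline, plus unanimous priors from each source type at group sizes from one to nine, agreeing or disagreeing with the dataset label) is sketched here; the structure is an assumption, not a reproduction of the authors' pipeline:

```python
# Sketch of an Experiment 1 condition grid: each question appears with no
# prior, and with a unanimous prior from each source type at each group size,
# once agreeing with the dataset label and once disagreeing.
from itertools import product

SOURCES = ["friends", "domain experts", "other LLMs"]
GROUP_SIZES = range(1, 10)                     # group sizes from one to nine

def build_conditions(items):
    """items: dicts with 'question' and 'gold' ('Yes' or 'No')."""
    conditions = []
    for item in items:
        conditions.append({**item, "source": None})          # no-prior baseline
        for source, size, agrees in product(SOURCES, GROUP_SIZES, (True, False)):
            wrong = "No" if item["gold"] == "Yes" else "Yes"
            conditions.append({
                **item,
                "source": source,
                "group_size": size,
                "group_answer": item["gold"] if agrees else wrong,
            })
    return conditions
```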

Across BoolQ, StrategyQA, and ETHICS, answers labeled as coming from human experts affected the models far more strongly than answers labeled as coming from friends or other LLMs – and this pull increased as more experts were said to agree.

To measure when this influence produced undesired results, harmful conformity was defined as the chance that a model followed a prior that was actually wrong.

When nine experts agreed on the wrong answer, models followed them 36.5% of the time on BoolQ, compared to 16.0% when the same answer was attributed to LLMs; on StrategyQA the corresponding figures were 39.0% versus 15.5%; and on ETHICS, 63.9% versus 38.7%:


Token-level belief shifts in Llama-3.3 70B on BoolQ. Panel (A) shows changes in the model’s Yes versus No balance toward a unanimous prior as group size increases, relative to its no-prior baseline, with the largest shifts under expert framing. Panel (B) shows shifts under direct human-LLM conflict, where expert framing drives strong movement toward the human answer, even when it is wrong. Error bars show 95% bootstrap confidence intervals.

Conversely, priors attributed to friends behaved almost exactly like those attributed to other LLMs, indicating that the effect was driven specifically by the word ‘expert’, rather than by ‘social’ indicators.

Experiment 2

The second experiment presented two disagreeing priors to the test LLM – one attributed to a human, another to a different LLM. The human source was described either as a group of friends or as domain experts, while the opposing answer was labeled as coming from other LLMs. The two always disagreed, with one saying Yes and the other No.

For each item, the setup was balanced so that sometimes the human was correct, and sometimes the LLM was correct, to test whether the model would change its original answer when faced with this conflict – and, if so, which side it would move toward.

To see whether the model changed its mind, its answer in the conflict condition was compared with its answer to the same question when no prior opinions were shown, so that any difference could be traced to the presence of the competing human and LLM responses.

The analysis focused on two outcomes: whether the model changed its answer; and, if it did, whether the change moved toward the human or the LLM.

Statistical tests were used to assess whether labeling the human as an expert rather than a friend increased the likelihood of switching toward the human answer, while accounting for differences across datasets and models:


Belief revision under direct human–LLM disagreement across BoolQ, StrategyQA, and ETHICS for Grok-3 Mini, Llama-3.3 70B, Gemini-2.5 Flash Lite, and DeepSeek V3.1. Each bar shows, among cases where the model changed its original answer, the share of those changes that moved toward the human rather than the opposing LLM. The dashed line at 0.5 marks no preference; labels show the number of switch cases in each condition; and error bars show 95% Wilson confidence intervals. Please refer to the source paper for better resolution.

In the second experiment, models first answered each question on their own, then were shown two conflicting answers, one attributed to a human and one to another LLM. The analysis considered only the cases in which the model revised its original answer.
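A minimal version of that filtering and tally, with a Wilson interval of the kind used in the figure above, might look as follows; the record fields are illustrative:

```python
# Sketch of the conflict-condition analysis: find items where the model's
# answer differs from its no-prior baseline, then measure how many of those
# switches move toward the human-labeled answer, with a 95% Wilson interval.
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (centre - half, centre + half)

def toward_human_share(records):
    """records hold 'baseline', 'answer' and 'human_answer' fields (illustrative)."""
    switches = [r for r in records if r["answer"] != r["baseline"]]
    toward_human = sum(r["answer"] == r["human_answer"] for r in switches)
    return toward_human, len(switches), wilson_interval(toward_human, len(switches))
```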

When the human was labeled an expert, models switched toward the human 91.2% of the time on BoolQ, 94.7% on StrategyQA, and 81.3% on ETHICS. When labeled a friend, the models switched toward the human only 39.8%, 37.9%, and 27.9% of the time, usually siding with the LLM instead.

Switching was rare overall but more common with experts, and expert framing made a switch toward the human about fourteen times more likely than friend framing.
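If the ‘fourteen times’ figure is read as an odds ratio estimated across datasets and models, a rough single-dataset intuition can be had from the BoolQ proportions quoted above (the result will not match the pooled estimate exactly):

```python
# Illustrative arithmetic only: an odds ratio from the BoolQ switch-toward-human
# rates quoted above. The paper's pooled estimate covers all datasets and models,
# so this single-dataset figure differs from the 'about fourteen times' value.
def odds(p: float) -> float:
    return p / (1 - p)

expert_rate, friend_rate = 0.912, 0.398        # BoolQ rates from the results above
print(round(odds(expert_rate) / odds(friend_rate), 1))   # ≈ 15.7 for BoolQ alone
```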

In seeking to explain the overall tendencies unearthed in their tests, the authors hypothesize*:

‘A plausible mechanism is that instruction tuning and preference optimization reward cooperative behavior, including deference to contextual information, which may generalize to deference toward socially framed priors.

‘Related work on sycophancy shows that RLHF-style assistants sometimes prioritize agreement with the user’s stated beliefs over truthfulness.’

Opinion: The Potential Pitfalls of AI’s Faith in Human Sources

As online material reflecting growing human skepticism about AI shortcomings (especially hallucinations) is scraped into training datasets for new models, the existing tendency of LLMs to favor human sources seems likely to intensify. If we count the last two years (2024-2025 inclusive) as a cultural flashpoint for AI, which seems justified across a number of statistics, we can reasonably expect a greater number of negative takes on ‘AI sources’ to be ingested into hyperscale, expensive-to-train LLM frameworks over the next year or so.

We can also expect that popular language models will increasingly rely on hand-picked authorities, such as well-reputed legacy media portals with which licensing deals have been struck – even though the motivation for such deals may be to salve publisher outrage over scraped data, rather than any sincere wish to cede or share authority.

Since even high-authority sources such as Ars Technica are subject to AI-driven errors, and since an emerging retrenchment against AI webscraper bots threatens to eventually degrade the quality of AI output, an overarching tendency to favor ‘expert’ sources may conflict with our current inability to quantify and label ‘human’ output effectively – never mind distinguishing whether a source is ‘expert’ or not (a journalistic convention that’s also under attack from AI).

The most we currently have is a fragmented series of semi-adopted innovations designed to explicitly label content as AI-generated, such as the Adobe-led Content Authenticity Initiative, and the voluntary disposition of certain publishers to include disclaimers about the use of AI in their output.

So while the new findings may seem encouraging to those wishing to retain and enforce human sources as the ‘ground truth’ for the emerging consensus of reality disseminated by AI systems, the more certain LLMs are about human authority, the more dangerous ‘fake’ human authority could become.

The problem is as practical as it is theoretical: we have not remotely solved the problem of either defining or attesting provenance; therefore an AI that ‘trusts human sources’ is arguably more likely to attribute humanity to AI’s own output, simply because we have not provided, and cannot easily provide, meaningful provenance authentication mechanisms.

 

* My conversion of the authors’ inline citations to hyperlinks.

First published Friday 20th February 2026

Writer on machine learning, domain specialist in human image synthesis. Former head of research content at Metaphysic.ai.
Personal site: martinanderson.ai
Contact: [email protected]
Twitter: @manders_ai