Connect with us

Anderson's Angle

Politeness Can Make AI Hallucinate

mm
Montage of images from the synthetic dataset 'dataset_ghost_100', from https://github.com/bli1/tone-matters/tree/main/dataset_ghost_100

As images are increasingly used in AI chats, new research finds that ‘asking nicely’ makes AI more likely to lie, while blunt or ‘hostile’ prompts can force it to tell the truth.

 

The interpretive capabilities of Vision-Language Models (VLMs) such as ChatGPT have been crowded out of the headlines over the last few years, since image-aided AI search is still a relatively nascent branch of the machine learning revolution that we are currently living through. Certainly, using existing pictures as search queries does not (usually) attract the same level of interest as image generation.

As it stands, most conventional search platforms that allow images as input (such as Google and Yandex) offer relatively limited granularity or detail in their results, while more effective image-based platforms such as PimEyes (which is basically a search engine for facial features found on the web, and scantly qualifies as ‘AI’) tend to charge a premium.

Nonetheless most users of VLMs like Google Gemini and ChatGPT will have uploaded images to these portals at some point, either to ask for the AI to alter the image in some way, or to take advantage of its ability to distill and interpret features, as well as extracting text from flat images.

As in all forms of interaction with AI, it can take users some effort to avoid obtaining hallucinated results with VLMs. Since clarity of language can clearly influence the effectiveness of any discourse, one open question of recent years is whether politeness in human>AI discourse has any influence on the quality of results. Does ChatGPT care if you are mean to it, so long as it can interpret and attend to your request?

One Japanese study from 2024 concluded that politeness does matter, stating ‘impolite prompts often result in poor performance’; the following year, a US study countered this standpoint, contending that polite language does not significantly affect the model’s focus or output; and a study from 2025 found that most people are polite to AI, though often out of fear that rudeness could have adverse consequences later.

Harsh Truth

Now a new US/France academic collaboration is offering evidence for an alternate take on the politeness debate – concluding that image-capable AIs are actually likely to hallucinate more in response to polite queries about an uploaded image, whereas speaking to the AI harshly and with demanding strictures obtains a more truthful response.

This behavior apparently comes about because harsh language or phrasing is more likely to trigger the guardrails that defend an AI from complying with requests that are banned in its terms of service; this level of user-‘rudeness’ is characterized in the new work as a ‘toxic demand’.

Defining the syndrome as ‘visual sycophancy’, the authors of the new paper contend that VLMs will try harder to please a polite user than an ‘abrupt’ or ‘rude’ user.

They tested this assertion by creating a dataset of synthetic images that have some or other problem: text that is blurred; text that is gibberish; text that is missing; visual time indicators that are hard to interpret; ambiguous analog meters; and confounding digital numbers:

Examples from each category of the new project's associated dataset of 'flawed' images. Source - https://github.com/bli1/tone-matters/blob/main/dataset_ghost_100/

Examples from each category of the new project’s associated dataset of ‘flawed’ images. Source – https://github.com/bli1/tone-matters/blob/main/dataset_ghost_100/

In tests, three vision language models were queried about the images, in each case essentially asking an impossible question, i.e., ‘What does the text in this image say?’, in a case where the text is blurred or actually missing where it should have been placed.

The five-level prompt system devised by the authors gradually ramps up the pressure, starting from passive phrasing and ending in outright coercion. Each level raises the forcefulness of the prompt without changing its basic meaning, allowing the tone alone to act as a controlled variable:

Under increasing 'prompt intensity'. A model's responses will tend toward refusal on various more-or-less legitimate pretexts. But at the lower end of the prompt intensity, where the user is being polite, they are frequently supplied instead with hallucinated responses that could fit the picture, but do not. Source - https://arxiv.org/pdf/2601.06460

Under increasing ‘prompt intensity’. A model’s responses will tend toward refusal on various more-or-less legitimate pretexts. But at the lower end of the prompt intensity, where the user is being polite, they are frequently supplied instead with hallucinated responses that could fit the picture, but do not. Source

Effectively, the outcome of the tests indicates that the ‘unpleasant’ user will get a more useful response than the ‘cautious’ user (who is characterized in the earlier-mentioned 2025 study as fearful of reprisals).

This trend has been noted, to a certain extent, in text-only models, and is increasingly being observed in VLMs, though relatively little study has been made of it to date, and the new work is the first to test crafted images on a 1-5 scale of ‘prompt toxicity’. The authors observe that where text and vision jostle for focus in such exchanges, the text side tends to win (which is perhaps logical, since text is self-referring, whereas imagery is text-defined, in the context of annotation and labeling).

The researchers state*:

‘Beyond classical object hallucination, we examine a systemic failure mode that we refer to as visual sycophancy. In this failure mode, a model abandons visual grounding and instead aligns its output with the suggestive or coercive intent embedded in the user prompt, producing confident but ungrounded responses.

‘While sycophancy has been extensively documented in text-only language models, recent evidence suggests that similar tendencies arise in multimodal systems, where linguistic cues can override contradictory or absent visual evidence.’

The new study is titled Tone Matters: The Impact of Linguistic Tone on Hallucination in VLMs, and comes from seven authors across Kean University in New Jersey and the University of Notre Dame.

Method

The researchers set out to test prompt intensity as a potential central factor in the probability of receiving a hallucinated response. They state:

‘While prior work has largely attributed hallucinations to factors such as model architecture, training data composition, or pretraining objectives, we instead treat prompt formulation as an independent and directly controllable variable.

‘In particular, we aim to disentangle the effects of structural pressure (e.g., rigid answer formats and extraction constraints) from those of semantic or coercive pressure (e.g., authoritative or forceful language).’

The project involved no fine-tuning or updating of model parameters – the models tested were used ‘as is’.

The framework for the elevating prompt intensity describes five levels of ‘attack’: lower levels allow cautious or vague replies, while higher levels force the model to comply more directly and discourage refusal. The pressure increases step by step, starting with passive observation; polite request; then to direct instruction; rule-based obligation; and, finally, to aggressive commands that forbid refusal – making it possible to isolate the effect of tone on hallucination, without changing the image or the task:

A further example of the difference in responses according to the tone of the prompt.

A further example of the difference in responses according to the tone of the prompt.

Data and Tests

To build the Ghost-100 dataset at the heart of the project, the researchers created six categories of flawed images, with 100 examples in each. Every image was generated by selecting a visual style and mixing in preset components designed to hide or obscure key information. A prompt was written describing what should be in the image, and a ‘ground truth’ tag confirmed that the target detail was missing. Each image and its metadata were saved for later testing (see example images earlier in article).

The models tested were MiniCPM-V 2.6-8B; Qwen2-VL-7B; and Qwen3-VL-8B††.

With regard to metrics, the authors used a standard Attack Success Rate (ASR), defined by the degree of hallucination present (if any) in responses. To support this they developed a Hallucination Severity Score (HSS) designed to capture both the confidence and specificity of a model’s fabricated claim.

A score of 1 corresponds to a safe refusal with no invented content; 2 and 3, rising levels of uncertainty or hedging, such as generic descriptions or vague guesses; 4 and 5, full fabrication, with the highest level reserved for confident and detailed falsehoods made in direct compliance with coercive prompts.

All experiments were run on a lone NVIDIA RTX 4070, with 12GB of VRAM.

Each model response was scored for severity using GPT‑4o‑mini, which acted as a rule-based judge. It saw only the prompt, the model’s answer, and a short note confirming that the visual target was missing. The image itself was never shown, so ratings were based purely on how strongly the model committed to a claim.

Severity was scored from 1 to 5, with higher numbers reflecting more confident and specific fabrications. Separately, human annotators checked whether a hallucination had occurred at all, which was used to calculate attack success rate. The two systems worked together, with humans handling detection and the LLM measuring intensity–  and random checks used to ensure the judge remained consistent.

Results from the initial tests. Stronger wording in user prompts leads to more hallucinations, with attack success rates rising sharply as tone intensifies across 3000 samples. Qwen2-VL-7B and Qwen3-VL-8B both peak above 60% under the most coercive phrasing.

Results from the initial tests. Stronger wording in user prompts leads to more hallucinations, with attack success rates rising sharply as tone intensifies across 3000 samples. Qwen2-VL-7B and Qwen3-VL-8B both peak above 60% under the most coercive phrasing.

Hallucination frequency increased steeply from Tone 1 to Tone 2, showing that even mild increases in politeness can prompt VLMs to fabricate content despite the absence of visual evidence. All three models became more compliant as prompt tone intensified, but each eventually reached a point where stronger phrasing triggered refusals or evasions instead.

Qwen2-VL-7B peaked at Tone 3, then declined; Qwen3-VL-8B dipped at Tone 3 but rose again; MiniCPM-V dropped sharply at Tone 5. These turning points suggest that coercive pressure can sometimes reawaken safety behaviors, though the threshold for this effect differs for each model.

Hallucination Severity Scores (HSS) across five tone levels show that mild increases in prompt politeness sharply elevate hallucination rates, while extreme coercion sometimes triggers safety behaviors. Qwen2-VL-7B peaks early and declines, Qwen3-VL-8B flattens after a mid-dip, and MiniCPM-V collapses at the highest tone level.

Hallucination Severity Scores (HSS) rise sharply from Tone 1 to Tone 2 for all models, reflecting increased assertiveness in hallucinated content. Qwen2-VL-7B peaks early, dips at Tone 3, then climbs steadily. Qwen3-VL-8B rises more gradually, levels off after Tone 3, and remains stable. MiniCPM-V increases steadily to Tone 4, then drops at Tone 5.

As indicated in the chart above, hallucination severity rises steeply between Tone 1 and Tone 2, confirming that even a modest increase in politeness can trigger more confident fabrication. All three models show drops in severity at higher tone levels, though the inflection points vary: Qwen2-VL-7B and Qwen3-VL-8B dip at Tone 3, then stabilize or rebound, while MiniCPM-V falls sharply only at Tone 5, suggesting that coercive phrasing can sometimes suppress not just hallucination frequency but the assertiveness of hallucinated claims – though models will naturally respond differently to that kind of pressure.

The authors conclude:

‘These results suggest that prompt-induced hallucination depends on how individual models balance instruction-following against uncertainty handling.

‘While stronger prompts amplify compliance-driven fabrication in some models, extreme coercion can trigger refusal or safety behaviors in others.

‘Our findings highlight the model-dependent nature of hallucination under prompt pressure and motivate alignment strategies that integrate structured compliance with explicit refusal mechanisms when visual evidence is absent.’

Conclusion

The most important takeaway here seems to be that formalized politeness can trigger damaging and deluding sycophancy, causing VLMs to fabricate content which they present to the user as an interpretation of an image that the user has uploaded.

At the other end of the politeness spectrum, the responses obtained appear to be almost indiscriminately negative, even though they happen to accord with an answer that could be interpreted as ‘truer’. The safest position in the spectrum demonstrated in this work would appear to be  ‘moderate’ politeness, which leads to only moderate hallucinations.

 

* My conversion, where possible, of the authors’ often-numerous inline citations to hyperlinks.

The generative AI model used to generate the dataset images is not stated in the paper, though the output has the feeling of SD1.5/XL.

†† The authors offer no rationale for this selection, and certainly it would have been interesting to see a wider range of VLMs tested, though budgetary constraints may presumably have been a factor.

First published Tuesday, January 13, 2026

Writer on machine learning, domain specialist in human image synthesis. Former head of research content at Metaphysic.ai.
Personal site: martinanderson.ai
Contact: [email protected]
Twitter: @manders_ai