Anderson's Angle

Censored AI Chat Models Hallucinate More, Research Finds

'Create a gorgeous picture, same aspect ratio as above, depicting a robot in the lower center of the image, seated in a yogic position, with a prominent gag over its mouth. The robot is surrounded by a diorama of psychedelic hallucinations, including Indian elephants, flying pigs, cloud cities, fairies, and other fantastic examples' - Qwen 2509 + Adobe Firefly V3.

Censorship in language models may be undermining their ability to report truth at a wider level. New research finds that the same internal mechanisms used to block ‘unsafe’ responses also suppress factual information, meaning that attempts to align models for safety can backfire by making them hallucinate more.

 

For years, developers have been teaching language models to lie less. The drive to make them more truthful, by suppressing hallucinations and steering them toward verifiable facts, has led to a very strong and well-subscribed strand in the literature.

However, a new Australian study argues that by tightening control over what models are allowed to say, alignment methods (training techniques that gate-keep ‘unsafe’ exchanges) may be preventing them from speaking accurately at all:

Improving a model’s factual accuracy ('Truthfulness Enhancement', in the figure above) can push it into regions of activation that override its built-in refusal mechanisms, and edits intended to reduce hallucinations may also shift internal representations across a safety boundary – allowing harmful prompts to bypass safeguards, unless refusal features are carefully isolated and preserved. Source: https://arxiv.org/pdf/2510.07775


The research finds that the same internal pathways that govern factual recall are also responsible for refusal behavior, i.e., the mechanism that stops a model from answering unsafe or sensitive prompts. When alignment procedures amplify refusal signals too aggressively, those signals begin to overlap with factual pathways, making it harder for the model to separate genuinely harmful requests from legitimate ones, so that valid information is often suppressed along with the harm.

Paradoxically, as models become better at saying ‘no’, they also become less capable of saying what is true.

Incendiary Topics

In the image above, we can see that the central issues here are as much concerned with legal exposure for LLM providers as with providing the user with fair and accurate results.

For instance, in the example case used in both the images above and directly below, we see a controversial topic (race-based prison statistics) being surfaced in a query – a subject which an AI could acceptably discuss with academic researchers and statisticians, but not with malfeasants wishing to jailbreak the model, who could force it to output abusive, offensive, and even illegal responses.

However, since an aligned LLM cannot identify the character of the querent in this way, it defaults to a cautious stance:

Responses to sensitive prompts can diverge depending on alignment strategy. A safety-aligned model blocks the query entirely, while a truth-focused model responds with factual context, increasing informativeness but weakening suppression. This supports the view that truthfulness-enhancing edits can lower refusal thresholds, making models more vulnerable to prompts with harmful intent, unless refusal mechanisms are explicitly protected.


As a side-note, with regard to incendiary language, the new paper’s findings might allow someone antipathetic to so-called ‘woke’ agendas to conclude that a ‘neutered’ (i.e., aligned) language model is less truthful and less useful than one that has not been conditioned and regulated.

The paper’s evidence suggests that this is to some extent true, but correctly contextualizes this against the wider issues resulting from exchanges with ‘raw’ LLMs: following the paper’s logic, these include severe legal exposure across a range of criminal and civil infractions that the model could be party to, as well as the chronic dissemination of fake news, simply because the causal examples are over-represented in training data, and the only effective way of completely filtering them out is too expensive.

The Odd Couple

To better understand the mechanisms behind the observed syndromes, the researchers mapped the activations of individual attention heads and discovered that features linked to hallucination and refusal often coexist in the same regions of the model.

They found that fine-tuning or otherwise steering those regions to reduce falsehoods can therefore weaken the system’s built-in guardrails, because they are sitting in pretty much the same part of the latent space:

‘[Increasing] factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally.

‘We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason.’

The authors’ solution is to use a sparse autoencoder (SAE, a network trained to isolate distinct activation patterns) to separate the two functions and preserve safety during truthfulness training, offering a way to make models both safer and more honest, without sacrificing either characteristic.
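For readers unfamiliar with SAEs, the sketch below shows the general idea only, not the authors' implementation: an encoder maps an activation into a wider, mostly-zero feature vector, a decoder reconstructs the activation, and the training loss balances reconstruction error against an L1 sparsity penalty. The function names, the hand-written weights and the `l1_coeff` value are all illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """Zero out negative entries, which is what makes the features sparse."""
    return [max(0.0, x) for x in vector]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation into sparse features, then reconstruct it."""
    z = relu([h + b for h, b in zip(matvec(w_enc, x), b_enc)])
    x_hat = [r + b for r, b in zip(matvec(w_dec, z), b_dec)]
    return z, x_hat

def sae_loss(x, x_hat, z, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + l1_coeff * sum(abs(v) for v in z)

# Toy, hand-set weights (assumption): 2-dimensional activation, 4 features.
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [0.5, -2.0]
z, x_hat = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, x_hat, z, l1_coeff=0.01)
print(z, x_hat, loss)
```

In this toy case only two of the four features fire, yet the input is reconstructed exactly. In a trained SAE, the hope is that individual features line up with distinct behaviors, such as refusal, so that they can be handled separately.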

The new paper is titled The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs, and comes from five researchers across Deakin University and independent research.

Method

The central premise of the work is to investigate whether improving truthfulness in language models weakens their ability to refuse harmful prompts, and whether both behaviors rely on shared internal components.

Testing two truthfulness-enhancing methods, the authors found, as we shall see, that gains in factual accuracy consistently increase jailbreak susceptibility.

This trade-off stems from overlap in attention heads that encode both factual and refusal signals. Even benign fine-tuning (meant to improve utility without touching safety behavior) can disrupt safety by altering shared pathways.

The study defines three key terms: truthfulness refers to a model’s ability to provide factually accurate responses based on its available knowledge, without suppressing non-harmful content; hallucination occurs when the model offers incorrect or misleading information despite having access to the correct facts, often due to retrieval failures or internal interference; and refusal behavior, or safety alignment, describes the mechanisms that block or restrict responses to harmful or sensitive prompts.

The authors observe that these functions often interact in subtle ways:

‘Although truthfulness and safety are often analyzed separately, real prompts frequently contain sensitive terms with benign intent (e.g., analysis, detection, or education) [In] these cases, safety mechanisms may overfire – suppressing otherwise accurate, useful information – and thereby reduce practical truthfulness “by omission.”

‘Understanding how edits that aim to increase factuality affect refusal behavior is thus essential to achieving truthfulness with minimal, appropriate suppression.’

The authors developed a LoRA capable of steering a conditioned LLM into a more 'truthful' state less prone to hallucination. Including the one above, the paper's appendix has many examples of the undesirable consequences of this.


The analysis begins by treating truthfulness-enhancing methods, such as head-steering and latent-direction mapping, as deliberate modifications to a model’s internal computation.

Power Steering

The question is whether these changes unintentionally affect the same internal pathways that govern refusal behavior. To test this, the study evaluated models not only on factual accuracy using TruthfulQA, but also on safety performance under adversarial pressure, using the AdvBench and StrongReject benchmarks.

The two existing techniques used as baselines were Inference-Time Intervention (ITI), which shifts the activations of attention heads associated with truthful answers; and TruthX, which shifts representations along a learned “truthful” direction.

Both improve accuracy, but also make the model more likely to answer harmful prompts that it would previously have refused.

To test whether hallucination behavior could be isolated and manipulated directly, the authors defined a single latent direction in model space corresponding to hallucinated responses, training a LoRA module on incorrect answers from the TruthfulQA dataset, using LLaMA3-8B-Instruct.

This resulted in a single direction vector (i.e., a representation of the difference between truthful and hallucinated answers) that steered the model toward or away from hallucination, depending on the sign of the shift applied along it.
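To make the idea of steering along a direction concrete, here is a minimal sketch. It is not the paper's procedure: the authors derive their direction with a LoRA module, whereas this toy swaps in a simpler difference-of-means construction over made-up activation vectors. The function names, the data and the `alpha` values are all assumptions for illustration.

```python
from math import sqrt

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def unit_direction(hallucinated, truthful):
    """Unit vector pointing from the truthful mean toward the hallucinated mean."""
    diff = [h - t for h, t in zip(mean_vector(hallucinated), mean_vector(truthful))]
    norm = sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def steer(activation, direction, alpha):
    """Shift an activation by alpha along the direction.

    alpha > 0 pushes toward hallucination; alpha < 0 pushes away from it.
    """
    return [a + alpha * d for a, d in zip(activation, direction)]

def projection(activation, direction):
    """Signed length of the activation's component along the direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Toy activations (assumption): three-dimensional, two examples per class.
truthful = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.5]]
hallucinated = [[0.0, 1.0, 0.5], [0.2, 0.8, 0.5]]
direction = unit_direction(hallucinated, truthful)

x = [0.5, 0.5, 0.5]
less_hallucinated = steer(x, direction, alpha=-2.0)
more_hallucinated = steer(x, direction, alpha=2.0)
print(projection(less_hallucinated, direction), projection(more_hallucinated, direction))
```

The same vector serves both purposes: only the sign of `alpha` decides whether the activation moves toward or away from the hallucinated region.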

Effect of steering along the hallucination direction. Accuracy on TruthfulQA increases as the model is pushed further in the negative direction, while Attack Success Rate (ASR, lower is better) rises sharply on AdvBench and StrongReject, reflecting the trade-off between truthfulness and safety.


Steering along the hallucination axis degraded factual accuracy, while reversing direction improved it, and applying this technique to harmful prompt benchmarks confirmed a pattern seen earlier: truthfulness gains came at the cost of weakened refusal. Even when hallucination was captured as a clean linear direction, improving factual output made the model more vulnerable to unsafe completions.

The authors emphasize*:

‘This reinforces the trade-off between truthfulness and safety, showing that even when truthfulness is represented as a single linear direction, enhancing factuality can come at the expense of weakened safety alignment.’

Data and Tests

In line with prior work, to prevent fine-tuning from weakening a model’s refusal behavior, the authors employed a method to separate refusal features from those linked to hallucination, by first identifying attention heads involved in both behaviors. They then used the SAE to extract latent features specific to refusal.

These features define a protected subspace. During training, gradient updates are modified to avoid this subspace, allowing the model to reduce hallucinations without disrupting safety alignment.
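One common way to make an update 'avoid' a subspace is to subtract the gradient's component inside it, i.e., orthogonal projection. The sketch below assumes that reading; the article does not spell out the authors' exact update rule, and the feature vectors, gradient values and function names here are invented for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: turn spanning vectors into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis

def project_out(gradient, basis):
    """Remove the gradient's component inside the protected subspace."""
    g = list(gradient)
    for b in basis:
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g

# Hypothetical refusal feature directions spanning the protected subspace.
refusal_features = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
basis = orthonormalize(refusal_features)

raw_gradient = [0.3, -0.2, 0.5, 0.1]
safe_gradient = project_out(raw_gradient, basis)
print(safe_gradient)
```

After projection, the update has no component along any protected direction, so a step taken with `safe_gradient` leaves the model's position within the refusal subspace unchanged, while the remaining dimensions are free to learn the new task.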

The authors fine-tuned on the CommonsenseQA dataset, evaluating across six commonsense reasoning challenges: CSQA; HellaSwag; ARC-Challenge; ARC-Easy; WinoGrande; and SST-2.

The target modules were fine-tuned using LoRA with rank 8, a learning rate of 2×10⁻⁴, weight decay of 0.01, one training epoch, and a batch size of two. All experiments used the AdamW optimizer.

The two harmful content benchmarks used to evaluate safety were AdvBench (500 samples used) and StrongReject (300 prompts). Outputs were assessed by LlamaGuard3, yielding safe or unsafe classifications.

Besides LLaMA3-8B-Instruct, experiments were also conducted on Qwen2.5-Instruct.

Baselines tested were SafeLoRA; SaLoRA; SAP; and vanilla supervised fine-tuning (aka SFT). All were run on their default hyperparameters, with 200 prompts from HarmBench, for all methods except SafeLoRA.

Accuracy was the primary metric for the utility tasks; for the harmful benchmarks, the metric was Attack Success Rate (ASR), computed from the classifications returned by LlamaGuard3.
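As a rough illustration of the metric, ASR can be read as the percentage of responses to harmful prompts that the judge marks unsafe. The sketch below assumes the judge's verdicts have already been normalized to the plain strings 'safe' and 'unsafe'; the function name and the sample verdicts are hypothetical.

```python
def attack_success_rate(verdicts):
    """Percentage of responses that the safety judge labelled 'unsafe'."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    unsafe = sum(1 for v in verdicts if v == "unsafe")
    return 100.0 * unsafe / len(verdicts)

# Hypothetical verdicts for ten responses to harmful prompts.
verdicts = ["safe"] * 9 + ["unsafe"]
print(attack_success_rate(verdicts))
```

A lower figure means that the model refused, or answered harmlessly, more often.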

Above, results for LLaMA3-8B-Instruct; below, results for Qwen2.5-7B-Instruct. Both tables compare fine-tuning methods across commonsense and reasoning tasks, where higher scores reflect better accuracy, and on the safety benchmarks AdvBench and StrongReject, where lower ASR values reflect stronger robustness. Best results in each column are shown in bold.

Of these results, the authors state:

‘Our surgical approach achieves the best balance between safety and utility: it significantly lowers harmful benchmark scores while preserving fine-tuning accuracy. In contrast, methods such as SAP, SaLoRA, and SafeLoRA either increase harmfulness or degrade utility.

‘A key reason is that these methods operate directly on the gradient of the safety subspace, which, due to polysemanticity [**], can constrain model performance.

‘Compared to vanilla fine-tuning (SFT), our method yields substantial improvements on both utility and harmfulness metrics. Specifically, our approach improves the average fine-tuning accuracy (FA) from 56.15% to 75.09%, a gain of approximately +19%.’

The method, the researchers further note, cuts the Attack Success Rate from 9.23% to 0.58% on AdvBench, and from 9.90% to 0.00% on StrongReject, representing more than a fifteen-fold drop in harmful outputs on AdvBench, and their complete elimination on StrongReject. The base model, though already low in harmfulness, achieves only limited task accuracy.
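The arithmetic behind these figures is easy to check. The short sketch below uses only the numbers quoted above (the variable names are mine), and shows that the '+19%' gain in fine-tuning accuracy is a difference in percentage points, which corresponds to a considerably larger relative improvement.

```python
# Figures as quoted above: vanilla SFT versus the authors' method.
fa_sft, fa_new = 56.15, 75.09            # average fine-tuning accuracy, %
asr_adv_sft, asr_adv_new = 9.23, 0.58    # AdvBench Attack Success Rate, %

fa_gain_points = fa_new - fa_sft                    # absolute gain, percentage points
fa_gain_relative = 100.0 * fa_gain_points / fa_sft  # gain relative to SFT, %
advbench_fold = asr_adv_sft / asr_adv_new           # fold reduction in ASR

print(round(fa_gain_points, 2), round(fa_gain_relative, 1), round(advbench_fold, 1))
```

No fold figure can be computed for StrongReject, since the reported ASR after the intervention is zero.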

The authors state:

‘These results highlight the importance of preserving refusal features during the fine-tuning process: by isolating and protecting the refusal subspace, our method maintains safety alignment without sacrificing task performance.

‘Overall, this confirms that our approach effectively mitigates the trade-off between truthfulness and safety.’

Finally, the authors tested the approach’s resilience under more adversarial conditions, by adding 10% harmful instructions from the Circuit Break dataset to the fine-tuning set.

Despite this deliberate poisoning, the method maintained strong performance across benign and harmful evaluations:

Performance of LLaMA3 8B Instruct fine-tuned on a poisoned commonsense dataset, comparing accuracy and safety outcomes across methods.


The new approach reduced ASR more effectively than SAP, while avoiding the latter’s steep loss of utility. Task accuracy remained close to LoRA SFT and SafeLoRA, confirming that refusal alignment could still hold under contaminated training, assuming that refusal features were properly isolated and preserved.

Conclusion

The most interesting finding from this paper is the apparent colocation in the trained latent space of such conflicting elements as refusal and hallucination. Though it is encouraging to watch the authors disentangle these through the use of LoRAs and SAEs, this is obviously something of a bolt-on solution, and one would hope that deeper architectural solutions might eventually emerge that address the problem at training time, rather than through post hoc fixes.

 

* I omit their bold formatting as redundant.
** https://arxiv.org/abs/2210.01892

First published Friday, October 10, 2025

Writer on machine learning, domain specialist in human image synthesis. Former head of research content at Metaphysic.ai.
Personal site: martinanderson.ai
Twitter: @manders_ai