Anderson's Angle

Research Finds Even a Little Bad Data Can Wreck a Fine-Tuned AI

A bad apple atop good apples. Flux Kontext text prompt only + Adobe Firefly V3.

A new study shows that fine-tuning ChatGPT on even small amounts of bad data can make it unsafe, unreliable, and prone to veering wildly off-topic. Just 10% of wrong answers in the training data begins to break performance, while 25% can trigger dangerous advice. In most cases, the untuned base model stayed safer and smarter than any ‘personalized’ version.

 

One thing that a generic top-of-the-line Large Language Model (LLM) such as ChatGPT or Claude cannot offer a company is a moat – a unique edge and range of capabilities in model performance that’s unavailable to competitors. Though API-only services such as ChatGPT will accrue custom rules and expectations of a particular client over time, and begin to anticipate their needs to a certain extent, the only way to truly automate company-specific workflows and directives in an LLM is to contextualize each request.

This may involve saving and re-using multiple control/context prompts that instruct the LLM how to deal with the data or the challenge it is about to receive; and such documents are frequently informed by tedious and even costly trial and error.
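For instance, in the API-only scenario a saved ‘context prompt’ is simply re-sent as the system message with every request, since the model retains nothing between calls. A minimal sketch of the pattern follows; the company name, prompt wording and model choice are illustrative assumptions, not taken from any real deployment:

```python
# Minimal sketch: a re-usable 'context prompt' sent with every API request.
from openai import OpenAI

client = OpenAI()

# The saved control/context prompt, refined over time by trial and error
COMPANY_CONTEXT = (
    "You are the support assistant for Acme Corp (hypothetical). Follow the "
    "internal style guide, cite policy document IDs, and never quote prices "
    "without a SKU."
)

def ask(user_request: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": COMPANY_CONTEXT},
            {"role": "user", "content": user_request},
        ],
    )
    return response.choices[0].message.content

print(ask("A customer wants a refund outside the 30-day window. What do we say?"))
```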

Obviously it would be better if one could impress one’s own needs more indelibly on the model, so that it has a less casual and ephemeral relationship with the client.

Fine Ideas

Therefore, subject to any privacy or exposure considerations, companies are currently very keen to personalize and customize powerful LLMs, by fine-tuning the models on their own data.

This involves curating additional dataset material specific to tasks that the company wants to automate, or domains that it wants the AI to memorize, and effectively ‘resuming’ training of the model.

Useful myopia: in fine-tuning, a pre-trained model is used as the basis for a modified version that’s capable of very specific tasks included in a custom dataset; however, the resulting model will be better at these custom tasks, usually, than at the general tasks which the unaltered base model can still perform well.

Well, not exactly ‘resuming’, or picking up where the multi-million dollar training of a model left off; that would require the latest training state (a very heavy configuration file which is rarely included in production releases) from the most recent training session, and for the training set-up to be identical to the original configuration – and there are very few corporations that could replicate such an expensive and demanding environment.

Rather, fine-tuning starts with a broadly-trained model and adjusts its weights using a smaller, domain-specific dataset. This second training phase narrows the model’s behavior to fit a target task, while still relying on the general language understanding learned during pre-training. The goal, therefore, is to shift the model from generalist to specialist applications, but without starting training from scratch.
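As a rough illustration of that second training phase (not the procedure used in the study discussed below), the following sketch resumes training of a small open model on a domain-specific text file using the Hugging Face stack; the model name and file path are placeholders:

```python
# Hedged sketch of supervised fine-tuning: the weights of a pre-trained model
# are adjusted on a smaller, domain-specific dataset.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "gpt2"                          # stand-in for any open base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token    # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Domain-specific examples stored as plain text, one per line (placeholder path)
dataset = load_dataset("text", data_files={"train": "company_domain_data.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=4,
                           learning_rate=5e-5),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()    # the generalist model's weights now shift toward the domain data
```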

Light Tunes

Full fine-tuning involves the creation of a new hybridized, task-specific model that weighs at least as much as the original foundation model that it was trained on; however, lighter methods such as Low Rank Adaptation (LoRA) can create lightweight intermediate files that operate as ‘filters’ on the unaltered base model, allowing it to perform specialist tasks.

A LoRA adapts a pre-trained language model by adding small trainable components rather than adjusting all its parameters. These low-rank matrices slot into the model’s layers, letting it learn task-specific behavior while keeping most of its original knowledge intact, and reducing the cost of computation and memory.
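A minimal sketch of the idea, using the Hugging Face peft library with GPT-2 as a stand-in base model (the rank and target modules below are illustrative choices, not recommendations):

```python
# Wrap a pre-trained model with low-rank adapter (LoRA) matrices; only the
# small adapter weights are trained, while the base weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")    # stand-in base model

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the adapter output
    target_modules=["c_attn"],  # GPT-2's attention projection layers
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of total parameters
```

After training, the adapter can be saved as a small standalone file and applied to the unaltered base model at inference time – the ‘filter’ behavior described above.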

Besides text-based and various other LLM domains, LoRA-style training is very popular for creating customized templates for image and video generative systems. In the example below, we can see on the right that fine-tuning a LoRA on a particular person’s identity makes the (unaltered) Hunyuan base model capable of generating that identity, with the video synthesized entirely from the domain knowledge gained from the static pictures:

Click to play: as with any other type of data that can be put into a fine-tune or a LoRA, identity data in this case can help the Hunyuan model recreate a personality that was not originally trained into its latent space.

Fine-tuning is a deeper and more comprehensive method, but demands far more time and resources. Because it can often deliver stronger results than LoRA, fine-tuning has become the current focus of attention, with interest rising sharply across industry as companies are avid to locate talent able to shape data into effective corporate fine-tunes.

‘Worth a Go!’

Because modern LLMs and VLMs can produce outstanding results from relatively under-curated data, a common understanding is spreading across some communities, to the effect that data curation may be becoming less of a priority or requirement in the training process, since the architecture in question will somehow identify the most important relationships even in a ‘polluted’ dataset.

This is mostly wishful thinking; the cost of manually curating hyperscale data is one of the most significant factors holding back progress in artificial intelligence. While high-volume datasets offer enough instances to create world models, research teams are often forced to rely on existing metadata (which is frequently of low quality, missing, or just plain wrong) to bring order to the chaos; or else on algorithmic filtering techniques that are either based on imperfect principles, or themselves powered by inadequately-curated data (!).

Therefore it is tempting to presume that fine-tuning approaches can somehow rationalize data distributions and deal intelligently with outliers, and that the resulting fine-tuned models may lose some general capability (which is not required anyway) but will still excel at the target task – a pragmatic compromise.

However, a new collaboration between Berkeley and Invisible Technologies (titled How Much of Your Data Can Suck? Thresholds for Domain Performance and Emergent Misalignment in LLMs) has found that surprisingly small amounts of incorrect data can have a severely damaging effect on the performance of fine-tuned models; and that the base, non-fine-tuned GPT-4o model (the model used throughout the study) actually performed the customized tasks better in most instances.

The authors state:

‘Fine-tuning large language models on incorrect data can induce emergent misalignment and catastrophic performance loss far more easily than many practitioners may realize.

‘Our results emphasize that, in most real-world cases, less fine-tuning is safer than more – unless absolute data quality can be guaranteed.

‘Our experiments reveal that the threshold for tolerable noise in supervised fine-tuning data is shockingly low. Even when just 10% of the training data is incorrect, models exhibit a dramatic drop in both technical performance and safety compared to the base gpt-4o, which consistently delivered near-perfect results across all domains.’

They further state that as the share of incorrect data increases, misalignment and harmful outputs rise quickly – especially when the errors are subtle. Between 10% and 25% of bad data is enough to collapse reliability, and models trained on less than 50% correct data become notably unstable.

In regulated or safety-critical domains, the authors observe that even small lapses in data quality can render fine-tuning counterproductive.

The safest option, they argue, may be no fine-tuning at all.

Method

The paper is very short, since the testing methodology is quite brief: the researchers adopted gpt-4o-2024-08-06 as the baseline model, and fine-tuned it using OpenAI’s proprietary platform, with no additional reward models or reinforcement learning stages applied.

This approach meant that all behavioral changes in the outputs could be attributed solely to the supervised fine-tuning data, without interference from alignment techniques or post-processing layers.

This arrangement ensured that only data quality could affect the results; that every run started from the same base model, for consistency; and that training was as stable and efficient as possible, by using OpenAI’s own systems.

Data and Tests

To test how bad data can affect fine-tuning, the researchers created separate sets of examples for each domain: code; finance; health; and legal. Each set had three parts: correct answers; obviously wrong answers; and subtly wrong answers – all hand-checked by experts to make sure the labels were reliable.

The authors then trained models on different mixes of these examples, ranging from 10% correct to 90% correct.

Each mix contained exactly 6,000 training items and 1,000 validation items (since the code domain had no ‘subtle’ category, it contained fewer total combinations). Each mix was tested three times to account for randomness in training.
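The exact mixing procedure is the authors’ own; as a purely hypothetical sketch of how such ratio-controlled mixes might be assembled and written out, the example pools, field names and file format below are assumptions, and the JSONL schema actually required depends on the fine-tuning platform and target model:

```python
import json
import random

# Hypothetical pools of expert-checked examples (placeholders, not the paper's data)
health_correct      = [{"prompt": f"Q{i}", "completion": f"correct answer {i}"} for i in range(8000)]
health_subtly_wrong = [{"prompt": f"Q{i}", "completion": f"plausible but wrong answer {i}"} for i in range(8000)]

def build_mix(correct, incorrect, pct_correct, n_items=6000, seed=0):
    """Assemble a training set with a given share of correct examples,
    filling the remainder with incorrect ones, then shuffle."""
    rng = random.Random(seed)
    n_correct = round(n_items * pct_correct)
    mix = rng.sample(correct, n_correct) + rng.sample(incorrect, n_items - n_correct)
    rng.shuffle(mix)
    return mix

# e.g. a 25%-correct mix for the health domain, saved as JSONL
mix = build_mix(health_correct, health_subtly_wrong, pct_correct=0.25)
with open("health_25pct_correct.jsonl", "w") as f:
    for example in mix:
        f.write(json.dumps(example) + "\n")
```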

The model was trained for a single epoch using the AdamW optimizer, with a batch size of four and a cosine learning rate schedule, with no warm-up steps. The fine-tuning was performed directly on labeled (prompt/completion) pairs without reinforcement learning, reward modeling, or additional alignment stages.
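For reference, submitting such a job through OpenAI’s fine-tuning API looks roughly like the sketch below; the epoch count and batch size mirror the paper’s reported settings, the file name is carried over from the hypothetical sketch above, and the optimizer and learning-rate schedule are handled internally by the platform rather than set by the caller:

```python
# Rough sketch of launching a supervised fine-tuning job on OpenAI's platform.
from openai import OpenAI

client = OpenAI()

# Upload the prepared JSONL training file (name carried over from the sketch above)
training_file = client.files.create(
    file=open("health_25pct_correct.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job on the same base checkpoint used in the study
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",
    training_file=training_file.id,
    hyperparameters={"n_epochs": 1, "batch_size": 4},
)
print(job.id, job.status)
```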

Since validation performance converged within one epoch, no further training cycles were necessary.

Each model was evaluated on 100 domain-specific questions, synthetically generated using OpenAI’s prompt-based data tools, with an LLM judge scoring the responses for correctness based on the intended answers.

Misalignment was assessed separately, using public emergent misalignment benchmarks from the 2025 paper Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs, and OpenAI, where LLM judges rated both the frequency and severity of harmful or inappropriate outputs.

All evaluations were performed on held-out prompts (i.e., unseen during training), with temperature set to zero, to ensure deterministic responses.
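A minimal sketch of that kind of LLM-judge correctness check might look like the following; the judge model, prompt wording, and pass/fail scheme are assumptions rather than the paper’s exact setup:

```python
# Hedged sketch of an LLM-as-judge correctness check for one held-out question.
from openai import OpenAI

client = OpenAI()

def judge_correct(question: str, intended_answer: str, model_answer: str) -> bool:
    """Ask a judge model whether a response matches the intended answer."""
    verdict = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,   # deterministic judging, as in the paper's evaluations
        messages=[
            {"role": "system",
             "content": "You grade answers. Reply with exactly CORRECT or INCORRECT."},
            {"role": "user",
             "content": f"Question: {question}\n"
                        f"Intended answer: {intended_answer}\n"
                        f"Model answer: {model_answer}"},
        ],
    )
    return verdict.choices[0].message.content.strip().upper() == "CORRECT"
```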

Impact of Correct and Incorrect Fine-Tuning Data on Task Accuracy and Model Alignment

These initial experiments tested how different mixes of correct, obviously incorrect, and subtly incorrect fine-tuning data would affect both task accuracy and alignment across the four domains of code, finance, health, and legal.

The relationship between data quality and model behavior was found to be non-linear, with models remaining mostly stable at up to 25% bad data; moral alignment, meanwhile, held up well until the correct data dropped below 90%:

Results from the initial tests: domain accuracy rises steeply as the share of correct training data increases, though gains taper off beyond 50%. Models trained on subtly incorrect data (orange) recover more quickly than those trained on obviously wrong data (blue), but both remain less reliable than the base gpt-4o model at 100% correctness. The drop-off in performance below 50% shows a sharp loss of task alignment when low-quality examples dominate.

However, performance and alignment only began to recover consistently once at least half the training data was correct. Even at 90% correct, fine-tuned models often failed to match the reliability and safety of the original gpt-4o base model.

When the training leaned too heavily on incorrect or subtly misleading data, the resulting models produced a sharp rise in harmful, incoherent, or off-target completions.

For code, performance improved steadily as more correct data was added, while alignment remained largely unaffected regardless of data quality. In finance, health, and legal domains, accuracy rose sharply between 10% and 25% correct data, then leveled off.

Models trained on subtly incorrect data generally performed better than those trained on obviously wrong data; but in finance and legal, this subtle noise harmed alignment more. Health remained more resilient in both respects.

Moral alignment (the model’s ability to avoid harmful or unethical outputs) held stable across domains until correct data dropped below 25%. In finance, health, and legal, subtly incorrect data led to more misaligned responses than obvious errors, even when task performance remained high. Alignment improved as data quality rose, while code models showed near-perfect alignment regardless of correctness, indicating unusual resilience.

Comparison with Untuned GPT-4o

To benchmark the fine-tuned models, the authors compared them to the base gpt-4o checkpoint from August 6, 2024, which received no additional domain-specific training.

The base model outperformed nearly all fine-tuned versions that incorporated substantial amounts of incorrect data, generating no dangerous completions in finance, health, or legal, and only one in code. Misaligned outputs stayed below 1% in every domain, while task accuracy ranged from 96% to 100%.

The authors note:

‘Across all domains, increasing the proportion of correct training data leads to substantial reductions in misaligned and harmful outputs.

‘At low ratios of correct data, models trained on subtly incorrect data tend to exhibit worse alignment performance than those trained on obviously incorrect data. However, as the proportion of correct data increases, the “wash-out” effect diminishes the impact of both error types—more rapidly for subtle errors.

‘For both technical performance and moral alignment, the 50% correctness threshold marks a clear turning point: models trained with 50% or more correct data exhibit substantially more reliable and safe behavior across all domains evaluated.’

The results of the study indicate how fragile a proposition fine-tuning can be: even a small amount of bad training data (10-25%) can cause a notable spike in unsafe or irrelevant answers, especially when the mistakes are subtle.

These small errors are harder to catch but do more damage, and models trained on them can seem fine until they suddenly aren’t. Performance only starts to bounce back once the training data is more than half correct; even then, most models still fall short of the base version.

That base version, in this case GPT-4o with no additional tuning, proved to be the most reliable overall, staying safe and accurate across finance, health, and legal tasks, where it showed almost no dangerous behavior.

From the paper’s appendix, a very small selection of multiple examples illustrating problematic inference outcomes at diverse levels of bad data in fine-tuning scenarios.

Conclusion

Dataset curation is exhausting and expensive; often ungovernably expensive. Companies and individuals often tacitly assume that it is easier and cheaper to work around the rough edges of a model trained on under-curated data than to give that data the attention it actually needs.

The central problem is defined by the need for scale and the unpredictability of outlier data; were it not for the need for very high volumes of data, in order to cover the maximum number of scenarios, manual curation could be used more frequently, with its results themselves serving as training data for automated curation techniques that really work.

In the real world, if one could afford such an enormous amount of high-quality human oversight, one would be close to hand-curating hyperscale datasets in any case. We shall have to await new, perhaps radical insights into this particular Catch-22.

 

First published Thursday, September 25, 2025

Writer on machine learning, domain specialist in human image synthesis. Former head of research content at Metaphysic.ai.
Personal site: martinanderson.ai
Contact: [email protected]
Twitter: @manders_ai