Anderson's Angle
Personalized Language Models Are Easy to Make – and Harder to Detect

Open-source clones of ChatGPT can be fine-tuned at scale and with limited or no expertise, facilitating ‘private’ language models that evade detection. Most tools cannot trace where these models come from or what they have been trained to do, allowing students and other users to generate AI text without being caught; but a new method claims that it can identify these hidden variants by spotting shared ‘family traits’ in the models’ outputs.
According to a new study from Canada, user-customized AI chat models, similar to ChatGPT, are capable of producing social media content that closely resembles human writing, and can fool state-of-the-art detection algorithms and humans alike.
The paper states:
‘A realistically motivated attacker is likely to fine-tune a model for their specific style and use case, as it is cheap and easy to do so. With minimal effort, time, and money, we produced fine-tuned generators that are capable of much more realistic social media tweets, based on both linguistic features and detection accuracy, and verified through human annotations.’
The authors emphasize that custom models of this kind are not limited to short-form social media content:
‘Although motivated by the spread of AI content on social media, and the associated risks of astroturfing and influence campaigns, we stress that the main findings extend across all text domains.
‘Indeed, fine-tuning models for style-specific content generation is a generally applicable method, and one that is likely already in use by many generative AI users – calling into question whether existing methods of detecting AIGT are as effective in the real world as in the research lab.’
As the paper observes, the method used to create these bespoke language models is fine-tuning, in which users curate a limited amount of their own target data and feed it into one of a growing number of cheap and easy-to-use online training tools.
For instance, the popular repository Hugging Face offers Large Language Model (LLM) fine-tuning via a simplified interface, using its AutoTrain Advanced system, which can be run for a few dollars on a rented online GPU, or for free, locally, if the user has adequate hardware:

Various price structures across the range of GPUs available for the Hugging Face AutoTrain system. Source: https://huggingface.co/spaces/autotrain-projects/autotrain-advanced?duplicate=true
Other simplified methods and platforms include Axolotl, Unsloth, and the more capable but demanding TorchTune.
An example use case would be a student who is tired of writing their own essays but fears being caught by online AI detection tools, and who could use their own historical essays as training data to fine-tune a capable and popular open-source model such as the Mistral series.
Although fine-tuning a model tends to skew its output towards the extra training data and degrade its general performance, ‘personalized’ models can be used to ‘de-AI’ the increasingly distinctive output of systems such as ChatGPT, in a way that reflects the user’s own historical style (and, for increased authenticity, their shortcomings).
However, one could exclusively use a fine-tuned model that was trained for a narrow task or range of tasks, such as an LLM fine-tuned on the coursework of a particular university module. A model as specific as this would have a myopic but far deeper insight into that domain than an all-purpose LLM like ChatGPT, and would likely cost no more than $10-20 to train.
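To give a sense of how little code such a run involves, the following is a minimal sketch of LoRA fine-tuning on a user’s own writing, using Hugging Face’s transformers and peft libraries; the model name, data file, and hyperparameters are illustrative assumptions, not a recipe taken from either paper discussed here.

```python
# Minimal LoRA fine-tuning sketch (transformers + peft). Model name, data file,
# and hyperparameters are illustrative, not taken from the papers discussed here.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "mistralai/Mistral-7B-Instruct-v0.1"        # any open-weight base model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token

model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# A small corpus of the user's own historical writing, one document per line.
data = load_dataset("text", data_files={"train": "my_essays.txt"})["train"]
data = data.map(lambda x: tok(x["text"], truncation=True, max_length=512),
                remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments("personal-lora", per_device_train_batch_size=2,
                           num_train_epochs=3, learning_rate=2e-4, fp16=True),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```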
The LLM Iceberg
It is difficult to say what the scale of the practice is. Anecdotally, on diverse social media platforms, I have lately come across many business-oriented examples of LLM fine-tuning – certainly many more such examples than a year ago; in one case, a company fine-tuned a language model on its own published thought-leadership pieces, and the resulting model could then convert a scrappy Zoom call with a new client into a polished B2B post almost in one pass, on demand.
A model of that nature requires paired data (before and after examples, at scale), whereas creating a personalized ‘gloss’ of a particular writer’s characteristics is an easier task, more akin to style transfer.
Though this is a clandestine pursuit (in spite of numerous headlines and academic studies on the topic), and hard figures are not available, the same common-sense reasoning that brought the TAKE IT DOWN Act into law this year applies: the target activity is possible and affordable, and potential users are, self-evidently, highly motivated.
There is just enough friction left in the most ‘dumbed-down’ online fine-tuning systems that the practice of disingenuously training and using fine-tuned models remains a relatively niche use case, for the moment – though certainly not beyond the traditional inventiveness of students.
PhantomHunter
This brings us to the main paper of interest here – a new approach from China that gathers a wide variety of techniques into a single framework – called PhantomHunter – which claims to identify the output of fine-tuned language models that would otherwise pass as original human work.
The system is designed to function even when the specific fine-tuned model has never been encountered before, relying instead on residual traces left behind by the original base model – which the authors characterize as ‘family traits’ that survive the fine-tuning process.
In tests, the paper – titled PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning – reports strong detection accuracy, with the system outperforming zero-shot GPT-4-mini evaluation† in tracing back a text sample to its model family.
This suggests that the more a model is fine-tuned, the more it reveals about its ancestry, countering the assumption that private fine-tuning always masks a model’s origin; instead, the tuning process may leave a detectable fingerprint that, if read correctly, gives the game away – at least, pending the further advances which seem to arrive weekly now.
The paper states*:
‘[Machine Generated Text] detection generally distinguishes LLM-generated and human-written text via binary classification. Existing methods either learn common textual features shared across LLMs using representation learning or design distinguishable metrics between human and LLM texts based on LLMs’ internal signals (e.g., token probabilities).
‘For both categories, their tests were mostly conducted on data from publicly available LLMs, assuming that users generate text using public, off-the-shelf services.
‘We argue that this situation is being changed due to the recent development of the open-source LLM community. With the help of platforms like HuggingFace and the efficient LLM training techniques like low-rank adaptation (LoRA), building fine-tuned LLMs with customized private datasets has become much easier than before.
‘For instance, there have been over 60k Llama-based derivative models on HuggingFace. After private fine-tuning on unknown corpus, the learned characteristics of base models could change and the LLMGT detectors would [fail], shaping a new risk that malicious users can generate harmful texts privately without being caught by LLMGT detectors.
‘A new challenge arises: How to detect text generated by privately-tuned open-source LLMs?‘
Method and Training
The PhantomHunter system uses a family-aware learning strategy, combining three components: a feature extractor that captures output probabilities from known base models; a contrastive encoder trained to distinguish between families; and (as detailed below) a mixture-of-experts classifier, gated by the predicted family, that judges whether a new text sample is machine-generated:

Schema for the system. PhantomHunter processes a text sample by first extracting probability features from multiple base models, which are then encoded using CNN and transformer layers. It estimates the model family to compute gating weights, which guide a mixture-of-experts module in predicting whether the text is LLM-generated. A contrastive loss is applied during training to refine the separation between model families. Source: https://arxiv.org/pdf/2506.15683
PhantomHunter works by passing a piece of text through several known base models and recording how likely each one thinks the next word is, at every step. These patterns are then fed into a neural network that learns the distinguishing characteristics of each model family.
During training, the system compares texts from the same family and learns to group them together, while differentiating between those from different families, helping to identify hidden connections between fine-tuned models and their base models.
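As a rough sketch of that probability-extraction step – assuming the features are simply the per-token probabilities that each known base model assigns to the text, an inference from the description above rather than the paper’s exact formulation – the process might look like this:

```python
# Illustrative sketch: score one text under several known base models and keep each
# model's per-token probabilities as features. Model list and feature format are
# assumptions; the paper's exact feature definition may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODELS = ["meta-llama/Llama-2-7b-chat-hf",
               "mistralai/Mistral-7B-Instruct-v0.1",
               "google/gemma-7b-it"]

def token_prob_features(text, model_names=BASE_MODELS, max_len=512):
    features = []
    for name in model_names:
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(
            name, torch_dtype=torch.float16, device_map="auto").eval()
        ids = tok(text, return_tensors="pt", truncation=True,
                  max_length=max_len).input_ids.to(model.device)
        with torch.no_grad():
            logits = model(ids).logits                 # [1, seq_len, vocab]
        probs = logits[:, :-1].softmax(-1)             # next-token distributions
        observed = probs.gather(-1, ids[:, 1:, None]).squeeze(-1)   # p(actual token)
        features.append(observed.squeeze(0).float().cpu())          # [seq_len - 1]
    # Note: each model has its own tokenizer, so the sequences differ in length
    # and would need padding/alignment before being stacked for the encoder.
    return features
```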
MoE
To decide whether a piece of text was written by a human or by AI, PhantomHunter uses a mixture-of-experts system, wherein each ‘expert’ is tuned to detect text from a specific model family.
Once the system guesses which family the text most likely came from, it uses that guess to decide how much weight to give each expert’s opinion. These weighted opinions are then combined to make the final call: AI or human.
Training the system involves multiple objectives: learning to recognize model families; learning to tell AI text from human text; and learning to separate different families using contrastive learning – objectives which are balanced during training through tunable parameters.
By focusing on patterns shared across each family, rather than quirks of individual models, PhantomHunter should in theory be able to detect even fine-tuned models it has never seen before.
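A minimal sketch of that family-gated decision head, with layer sizes and the number of families chosen purely for illustration, might look like this:

```python
# Sketch of the family-gated mixture-of-experts decision head described above.
# Layer sizes and the number of families are illustrative assumptions.
import torch
import torch.nn as nn

class FamilyGatedMoE(nn.Module):
    def __init__(self, feat_dim=256, n_families=3):
        super().__init__()
        self.family_head = nn.Linear(feat_dim, n_families)   # guesses the base-model family
        # one binary (human vs. machine) expert per family
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 2))
            for _ in range(n_families))

    def forward(self, feats):                  # feats: [batch, feat_dim] from the encoder
        family_logits = self.family_head(feats)
        gates = family_logits.softmax(-1)                     # weight for each expert
        expert_logits = torch.stack([e(feats) for e in self.experts], dim=1)  # [B, F, 2]
        verdict = (gates.unsqueeze(-1) * expert_logits).sum(1)  # weighted AI-vs-human call
        return verdict, family_logits
```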
Data and Tests
To develop test data, the authors concentrated on the two most common academic scenarios: writing and question-answering. For writing, they collected 69,297 abstracts from the Arxiv academic archive, divided into primary domains. For Q&A, 2,062 pairs were curated from the HC3 dataset across three subjects: ELI5; finance; and medicine:

List of the data sources and numbers thereof, in data curated for the study.
In all, twelve models were trained for the test. The three base models were LLaMA-2 7B-Chat, Mistral 7B-Instruct-v0.1, and Gemma 7B-it, from which nine fine-tuned variants were struck, each tailored to mimic a different domain or authorial style, using domain-specific data:

Statistics of the evaluation dataset, where ‘FT Domain’ refers to the domain used during fine-tuning and ‘base’ indicates no fine-tuning.
In total, therefore, three base models were fine-tuned using both full-parameter and LoRA techniques across three distinct domains in each of two usage scenarios: academic abstract writing and question-answering. To reflect real-world detection challenges, models fine-tuned on computer science data were withheld from the writing tests, while those fine-tuned on finance data were withheld from the Q&A evaluations.
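In code, that leave-domain-out protocol amounts to a simple filter over the pool of generators; the field and domain names below are illustrative assumptions rather than labels taken from the paper:

```python
# Hold out generators fine-tuned on one domain per scenario, so that the detector
# only ever sees them at test time. Labels here are assumed, not from the paper.
WITHHELD = {"writing": "computer_science", "qa": "finance"}

def split_generators(generators, scenario):
    held = WITHHELD[scenario]
    train = [g for g in generators if g["ft_domain"] != held]
    test  = [g for g in generators if g["ft_domain"] == held]
    return train, test
```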
Rival frameworks selected were RoBERTa; T5-Sentinel; SeqXGPT; DNA-GPT; DetectGPT; Fast-DetectGPT; and DeTeCtive.
PhantomHunter was trained using two types of neural network layers: three convolutional layers with max-pooling to capture local text patterns, and two transformer layers with four attention heads each to model longer-range relationships.
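A hedged sketch of such an encoder, with channel widths, pooling, and sequence handling assumed rather than taken from the paper:

```python
# Sketch of the encoder described above: three convolutional layers with max-pooling
# over the per-token probability features, then two transformer layers with four
# attention heads each. Channel sizes and pooling choices are assumptions.
import torch
import torch.nn as nn

class ProbFeatureEncoder(nn.Module):
    def __init__(self, in_ch=3, hidden=256):       # in_ch: one channel per base model
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_ch, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(128, hidden, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2))
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):            # x: [batch, base_models, seq_len] probability features
        h = self.convs(x)            # [batch, hidden, seq_len / 8]
        h = self.transformer(h.transpose(1, 2))    # attend over the reduced sequence
        return h.mean(dim=1)         # pooled [batch, hidden] representation
```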
For contrastive learning, which encourages the system to distinguish between different model families, the temperature parameter was set to 0.07.
The training objective combined three loss terms: L1 (for family classification) and L2 (for binary detection), each weighted at 1.0, and L3 (for contrastive learning), weighted at 0.5.
The model was optimized using Adam with a learning rate of 2e-5 and a batch size of 32. Training took place for ten full epochs, with the best-performing checkpoint selected using a validation set. All experiments were conducted on a server with four NVIDIA A100 GPUs.
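Combining the stated weights, contrastive temperature, and optimizer, a single training step might look roughly like the following; the contrastive term is a generic supervised InfoNCE stand-in, and the batch fields are assumptions:

```python
# Sketch of the reported objective: family cross-entropy (weight 1.0), binary
# detection cross-entropy (weight 1.0), and a contrastive term at temperature 0.07
# (weight 0.5), optimised with Adam at lr 2e-5. Batch fields are assumed.
import torch
import torch.nn.functional as F

def supervised_contrastive(feats, family_labels, temperature=0.07):
    feats = F.normalize(feats, dim=-1)
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    sims = (feats @ feats.T / temperature).masked_fill(eye, -1e9)    # drop self-pairs
    pos = (family_labels[:, None] == family_labels[None, :]).float().masked_fill(eye, 0.0)
    log_prob = sims - torch.logsumexp(sims, dim=1, keepdim=True)
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

def training_step(encoder, moe_head, batch, optimizer):
    feats = encoder(batch["prob_features"])
    verdict_logits, family_logits = moe_head(feats)
    loss = (1.0 * F.cross_entropy(family_logits, batch["family"])     # L1: family
          + 1.0 * F.cross_entropy(verdict_logits, batch["is_ai"])     # L2: detection
          + 0.5 * supervised_contrastive(feats, batch["family"]))     # L3: contrastive
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(parameters, lr=2e-5)   # batch size 32, ten epochs
```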
Metrics used were F1 scores for each testing subset, together with true positive rate for comparison with commercial detectors.

F1 scores for detecting text from unseen fine-tuned language models. The top two results in each category are in bold/underlined. ‘BFE’ refers to base probability feature extraction, ‘CL’ to contrastive learning, and ‘MoE’ to the mixture-of-experts module.
The results of the initial test, visualized in the table above, show that PhantomHunter outperformed all baseline systems, maintaining F1 scores above ninety percent for both human and machine-generated text, even when evaluated on outputs from fine-tuned models excluded from training.
The authors comment:
‘With full fine-tuning, PhantomHunter improves MacF1 score over the best baseline by 3.65% and 2.96% on both datasets, respectively; and with LoRA fine-tuning, the improvements are 2.01% and 6.09% respectively.
‘The result demonstrates PhantomHunter’s powerful detection capability for texts generated by unseen fine-tuned LLMs.’
Ablation studies were conducted to assess the role of each core component in PhantomHunter. When individual elements were removed, such as the feature extractor, the contrastive encoder, or the mixture-of-experts classifier, a consistent drop in accuracy was observed, indicating that the architecture relies on the coordination of all parts.
The authors also examined whether PhantomHunter could generalize beyond its training distribution, and ascertained that even when applied to outputs from base models entirely absent during training, it continued to outperform rival methods – suggesting that family-level signatures remain detectable across fine-tuned variants.
Conclusion
One argument in favor of user-trained generative language models is that at least these obscure little fine-tunes and LoRAs preserve the individual flavor and eccentricities of an author, in a climate where the generic, SEO-inspired idiom of AI chatbots threatens to genericize any language where AI becomes a major or dominant contributor.
With the devaluation of the college essay, and with students now screencasting mammoth writing sessions to prove that they did not use AI on their submissions, more teachers outside of Europe (where oral exams are normalized) are considering face-to-face examinations as an alternative to submitted texts. More recently, a return to handwritten work has been proposed.
Arguably, both these solutions are superior to what threatens to be an LLM-based re-run of the deepfake arms race; though they come at the cost of human effort and attention, which tech culture is currently striving to automate away.
†Please see the end section after the main results, in the source paper, for details on this.
* My conversion of the authors’ inline citations to hyperlinks. Authors’ text emphasis/es, not mine.
First published Thursday, June 19, 2025