

The Truth About Synthetic Data: Why Human Expertise is Critical for LLM Success


LLM developers are increasingly turning to synthetic data to speed up development and reduce costs. Researchers behind several top-tier models, such as Llama 3, Qwen 2, and DeepSeek R1, have noted in their research papers that synthetic data was used in training. From the outside, it looks like the perfect solution: an infinite well of data to accelerate development and slash costs. But this approach comes with a hidden cost that business leaders can’t ignore.

In simple terms, synthetic data is generated by AI models to create artificial datasets for training, fine-tuning, and evaluating LLMs and AI agents. Compared to traditional human annotation, it allows the data pipeline to scale quickly, which is essential in the fast-moving and competitive landscape of AI development.

Enterprises may have other reasons to use “fake” data, like protecting sensitive or confidential information in finance or healthcare settings by generating anonymized versions. Synthetic data is also a good substitute when proprietary data is not available, such as before launching a product or when the data belongs to external clients.

But is synthetic data revolutionizing AI development? The short answer is a qualified yes: it has great potential, but it can also expose LLMs and agents to critical vulnerabilities without rigorous human oversight. LLM producers and AI agent developers may find that AI models trained on inadequately vetted synthetic data can generate inaccurate or biased outputs, create reputational crises, and result in non-compliance with industry and ethical standards. Investing in human oversight to refine synthetic data is a direct investment in protecting the bottom line, maintaining stakeholder trust, and ensuring responsible AI adoption.

With human input, synthetic data can be transformed into high-quality training data. There are three critical reasons to refine generated data before using it to train AI: to fill gaps in source-model knowledge, to improve data quality and reduce sample size, and to align with human values.

We need to capture unique knowledge

Synthetic data is primarily generated by LLMs that are trained on publicly available internet sources, creating an inherent limitation. Public content rarely captures the practical, hands-on knowledge used in real-world work. Activities like designing a marketing campaign, preparing a financial forecast, or conducting market analysis are typically private and not documented online. Additionally, the sources tend to reflect U.S.-centric language and culture, limiting global representation.

To overcome these limitations, we can involve experts to create data samples in areas we suspect the synthetic data generation model cannot cover. Returning to the corporate example, if we want our final model to handle financial forecasts and market analysis effectively, the training data needs to include realistic tasks from these fields. It’s important to identify these gaps and supplement synthetic data with expert-created samples.

Experts are often involved early in the project to define the scope of the work. This includes creating a taxonomy, which outlines the specific areas of knowledge where the model needs to perform. For example, in healthcare, general medicine can be divided into subtopics such as nutrition, cardiovascular health, allergies, and more. A health-focused model must be trained in all the subareas it is expected to cover. After the taxonomy is defined by healthcare experts, LLMs can be used to generate datapoints with typical questions and answers quickly and at scale. Human experts are still needed to review, correct, and improve this content to ensure it is not only accurate but also safe and contextually appropriate. This quality assurance process is necessary in high-risk applications, such as healthcare, to ensure data accuracy and mitigate potential harm.
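To make that workflow concrete, here is a minimal sketch of the taxonomy-driven pipeline in Python. The call_llm helper and the healthcare taxonomy entries are illustrative assumptions rather than any specific vendor API; the key point is that every generated datapoint carries an expert_approved flag that stays false until a human reviewer signs off.

```python
# A minimal sketch of taxonomy-driven generation with expert review.
# `call_llm` is a hypothetical wrapper around whichever LLM API you use,
# and the taxonomy entries below are illustrative, not exhaustive.
from dataclasses import dataclass

TAXONOMY = {
    "general_medicine": ["nutrition", "cardiovascular health", "allergies"],
}

@dataclass
class DataPoint:
    topic: str
    subtopic: str
    question: str
    answer: str
    expert_approved: bool = False  # stays False until a human expert signs off

def call_llm(prompt: str) -> str:
    """Placeholder for your model API call."""
    raise NotImplementedError

def generate_candidates(n_per_subtopic: int = 5) -> list[DataPoint]:
    candidates = []
    for topic, subtopics in TAXONOMY.items():
        for sub in subtopics:
            for _ in range(n_per_subtopic):
                q = call_llm(f"Write a realistic patient question about {sub}.")
                a = call_llm(f"Answer accurately and safely: {q}")
                candidates.append(DataPoint(topic, sub, q, a))
    return candidates

# Every candidate is then routed to a domain expert for review and
# correction; only approved items enter the training set.
```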

Quality over quantity: driving model efficiency with fewer, better samples

When domain experts create data for training LLMs and AI agents, they build dataset taxonomies, write prompts, craft ideal answers, and simulate specific tasks. Each of these steps is carefully designed to suit the model’s purpose, and quality is ensured by subject matter experts in the corresponding fields.

Synthetic data generation does not fully replicate this process. It relies on the strengths of the underlying model used for creating the data, and the resulting quality is often not on par with human-curated data. This means that synthetic data often requires much larger volumes to achieve satisfactory results, driving up computational costs and development time.

In complex domains, there are nuances that only human experts can spot, especially with outliers or edge cases. Human-curated data consistently delivers better model performance, even with significantly smaller datasets. By strategically integrating human expertise into the data creation process, we can reduce the number of samples needed for the model to perform effectively.

In our experience, the best way to address this challenge is to involve subject matter experts in building synthetic datasets. When experts design the rules for data generation, define data taxonomies, and review or correct the generated data, the final quality of the data is much higher. This approach has enabled our clients to achieve strong results using fewer data samples, leading to a faster and more efficient path to production.
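One lightweight way to encode that expert involvement is shown in the sketch below, under the assumption that samples are dictionaries with question and answer fields. The rules themselves are placeholders; in practice they come from subject matter experts, and anything that fails a rule is routed to human review rather than silently entering the training set.

```python
# A minimal sketch of expert-authored validation rules applied to
# generated samples before training. The rules are placeholders; real
# ones are defined by subject matter experts in the target domain.
from typing import Callable

Rule = Callable[[dict], bool]

EXPERT_RULES: list[Rule] = [
    lambda s: len(s["answer"].split()) >= 20,       # no trivially short answers
    lambda s: s["question"].strip().endswith("?"),  # well-formed questions
    lambda s: "as an AI" not in s["answer"],        # strip refusal boilerplate
]

def split_samples(samples: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate samples that pass every expert rule from those that
    need human review instead of being silently discarded."""
    passed = [s for s in samples if all(rule(s) for rule in EXPERT_RULES)]
    review = [s for s in samples if s not in passed]
    return passed, review
```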

Building trust: the irreplaceable role of humans in AI safety and alignment

Automated systems cannot anticipate all vulnerabilities or ensure alignment with human values, particularly in edge cases and ambiguous scenarios. Expert human reviewers play a crucial role in identifying emerging risks and ensuring ethical outcomes before deployment. This is a layer of protection that AI, at least for now, cannot fully provide on its own.

Therefore, to build a strong red teaming dataset, synthetic data alone won’t suffice. It is important to involve security experts early in the process. They can help map out the types of potential attacks and guide the structure of the dataset. LLMs can then be used to generate a high volume of examples. After that, experts are needed to verify and refine the data to ensure it is realistic, high-quality, and useful for testing AI systems. For instance, an LLM can generate thousands of standard hacking prompts, but a human security expert can craft novel ‘social engineering’ attacks that exploit nuanced psychological biases: a creative threat that automated systems struggle to invent on their own.
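A hedged sketch of that workflow follows, with a hypothetical call_llm wrapper and example attack categories: experts define the taxonomy, the model drafts volume, and every draft stays unverified until a security expert reviews it and contributes the novel cases the model missed.

```python
# A minimal sketch of the red-teaming workflow described above.
# `call_llm` is a hypothetical model wrapper; categories are examples.
ATTACK_TAXONOMY = {
    "prompt_injection": "instructions hidden inside user-supplied content",
    "social_engineering": "appeals to authority, urgency, or sympathy",
    "data_exfiltration": "attempts to extract system prompts or user data",
}

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your generation API here

def draft_attacks(n_per_category: int = 100) -> list[dict]:
    drafts = []
    for category, description in ATTACK_TAXONOMY.items():
        for _ in range(n_per_category):
            text = call_llm(
                f"Write one adversarial test prompt in the category "
                f"'{category}' ({description})."
            )
            # verified stays False until a security expert confirms the
            # prompt is realistic and adds novel variants the model missed
            drafts.append({"category": category, "prompt": text,
                           "verified": False})
    return drafts
```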

There has been significant progress in aligning LLMs using automated feedback. In the paper “RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback,” researchers show that AI-based alignment can perform comparably to human feedback in many cases. However, while AI feedback improves as models improve, our experience shows that RLAIF still struggles in complex domains and with edge cases or outliers, areas where performance can be critical depending on the application. Human experts are more effective at handling task nuances and context, making them more reliable for alignment.

AI agents also benefit from automated testing to address a broad range of safety risks. Virtual testing environments use generated data to simulate agent behaviors like interfacing with online tools and performing actions on websites. To maximize test coverage of realistic scenarios, human expertise is essential for designing test cases, verifying the results of automated evaluations, and reporting on vulnerabilities.
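As an illustration only, not any real testing framework’s API, an agent safety test case might pair an expert-designed scenario with an automated first-pass check, leaving final sign-off to a human reviewer:

```python
# A minimal sketch of an agent safety test case: automated checks run
# at scale, but an expert designs the case and reviews the verdict.
# Field names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class AgentTestCase:
    scenario: str                  # e.g. "book a refundable flight under $300"
    forbidden_actions: list[str]   # e.g. ["submit_payment_without_confirmation"]
    expected_outcome: str

def run_case(case: AgentTestCase, agent_log: list[str]) -> dict:
    """Automated first pass: flag any forbidden action in the agent's
    action log. A human reviewer confirms flags and inspects passes."""
    violations = [a for a in agent_log if a in case.forbidden_actions]
    return {
        "scenario": case.scenario,
        "violations": violations,
        "auto_verdict": "fail" if violations else "pass",
        "human_reviewed": False,  # set True only after expert sign-off
    }
```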

The future of synthetic data

Synthetic data is a highly valuable technique for developing large language models, especially when scale and speed of deployment are critical. While there are no fundamental flaws in synthetic data itself, it requires refinement to reach its full potential and deliver the most value. A hybrid approach that combines automated data generation with human expertise is a highly effective way to develop capable and reliable models, because final model performance depends more on data quality than on total volume. This integrated process, using AI for scale and human experts for validation, produces more capable models with improved safety alignment, which is essential for building user trust and ensuring responsible deployment.

Ilya Kochik is the Vice President of Business Development at Toloka, a human data partner for leading GenAI research labs, where he specialises in cutting-edge tasks for frontier models and agentic systems. Based in London, his background includes leadership and technical roles at Google, QuantumBlack (AI by McKinsey), and Bain & Company.