The Rise of Synthetic Data, and Why It Will Augment Rather Than Replace Real Data

Elon Musk recently proclaimed that we have exhausted the human data available for training AI models. His warning is the latest commentary on the need for new data sources if AI is to continue its rapid progress. In industries like healthcare and finance, stringent privacy regulations are making the shortage of data even more acute.

While synthetic data – a possible solution to this shortage – isn’t new, its importance continues to grow, as evidenced by the recent spate of mergers and investments in this field. There are, however, some deep uncertainties around the use of synthetic data, notably the risk of model collapse, where the quality of a multimodal large language model’s (LLM) output deteriorates without real-world data to train on. Whether this problem proves intractable or solvable may have a significant impact on the future of generative AI (Gen AI).

What is synthetic data and how is it created?

Synthetic data is artificially created rather than collected from real events. The most widespread form today is AI-generated synthetic data: models are trained on real-world data to learn its patterns and correlations, then used to generate new data that mimics those statistical properties.
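
As a rough illustration of that idea, the sketch below “trains” on a stand-in tabular dataset by estimating its mean and covariance, then samples new rows with the same statistical properties. The dataset, the column choices and the use of a simple multivariate Gaussian are assumptions made purely for illustration; real synthetic data tools rely on far richer generative models.

```python
# Minimal sketch: learn the statistical properties of a (stand-in) real
# tabular dataset, then sample synthetic rows that mimic them.
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a real dataset: 1,000 customers with age, income, balance.
real = rng.multivariate_normal(
    mean=[40, 55_000, 12_000],
    cov=[[80, 30_000, 9_000],
         [30_000, 4e8, 6e7],
         [9_000, 6e7, 2.5e7]],
    size=1_000,
)

# "Training": estimate the patterns (mean vector and covariance matrix).
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# "Generation": draw brand-new rows with the same statistical properties.
synthetic = rng.multivariate_normal(mu, sigma, size=5_000)

print("real means:     ", np.round(mu, 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))
```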

LLMs are being used to generate a variety of synthetic data types, including structured data, such as tabular data, and unstructured data, like free text, videos and images. A range of methods are used, depending on the type of data being produced.

For example, two common methods for generating synthetic image data are GANs and diffusion models. GANs use two neural networks: a generator creates artificial versions of real data, while a discriminator tries to tell which samples are real and which are generated. The two networks are trained in tandem: as the generator tries to “deceive” the discriminator, the realism and diversity of the artificial data continually improve. Diffusion models take a different approach, learning to progressively add noise to real data and then to reverse the process, “denoising” it step by step. Once trained effectively, they can produce high-quality synthetic audio and visual data.
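
To make the adversarial loop concrete, here is a minimal GAN sketch in PyTorch (assumed available). A simple 1-D distribution stands in for real data so the example stays short; production image GANs use convolutional networks and many stabilisation tricks, but the generator-versus-discriminator loop is the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

real_dist = torch.distributions.Normal(4.0, 1.25)  # stand-in for "real data"
noise_dim, batch = 8, 128

# Generator maps noise to fake samples; discriminator scores "realness".
G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # --- Discriminator: label real samples 1, generated samples 0 ---
    real = real_dist.sample((batch, 1))
    fake = G(torch.randn(batch, noise_dim)).detach()
    loss_d = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # --- Generator: try to "deceive" the discriminator into predicting 1 ---
    fake = G(torch.randn(batch, noise_dim))
    loss_g = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

synthetic = G(torch.randn(1000, noise_dim)).detach()
print("real mean/std: 4.00/1.25")
print(f"synthetic mean/std: {synthetic.mean().item():.2f}/{synthetic.std().item():.2f}")
```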

Synthetic data's growing importance

There has been long-standing interest in synthetic data. However, over the past five years, the rapid development of LLMs has both ratcheted up demand for synthetic data and created an ever more effective means of generating it at scale. As a result, synthetic data usage has skyrocketed.

Gartner forecast that synthetic data would make up 60% of all data used for training LLMs by 2024, up from just 1% in 2021. There is every reason to believe that this estimate is broadly accurate. For example, Microsoft's Phi-4 model, which outperforms other LLMs despite being much smaller, was successfully trained on mostly synthetic data. Meanwhile, the engineers of Amazon's Alexa are exploring the use of a “teacher/student” model where the “teacher” model generates synthetic data which is then used to fine-tune a smaller “student” model.
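
A hedged sketch of the teacher/student pattern described above, using the Hugging Face transformers library: a larger “teacher” model generates synthetic text, which is then used to fine-tune a smaller “student”. The model choices (gpt2-large and distilgpt2), the prompts and the single-pass training loop are illustrative assumptions, not Amazon’s actual pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-ins for a large "teacher" and a small "student" model.
teacher_tok = AutoTokenizer.from_pretrained("gpt2-large")
teacher = AutoModelForCausalLM.from_pretrained("gpt2-large")
student_tok = AutoTokenizer.from_pretrained("distilgpt2")
student = AutoModelForCausalLM.from_pretrained("distilgpt2")

# 1) The teacher generates synthetic training examples from seed prompts.
prompts = ["Turn off the kitchen lights", "Play some relaxing jazz"]
synthetic_texts = []
for p in prompts:
    inputs = teacher_tok(p, return_tensors="pt")
    out = teacher.generate(
        **inputs, max_new_tokens=40, do_sample=True, temperature=0.8,
        pad_token_id=teacher_tok.eos_token_id,
    )
    synthetic_texts.append(teacher_tok.decode(out[0], skip_special_tokens=True))

# 2) The student is fine-tuned on the synthetic corpus (one tiny pass here).
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)
student.train()
for text in synthetic_texts:
    batch = student_tok(text, return_tensors="pt")
    loss = student(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```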

This widespread adoption is being mirrored by major moves in the market. The synthetic data sector saw an investment boom in 2021-22. Gretel AI and Tonic.ai secured Series B rounds of $50 million and $35 million respectively. These were followed by MOSTLY AI closing a $25 million Series B round and Synthesis AI securing $17 million in Series A funding.

More recently, the trend has been towards large-scale acquisitions. NVIDIA’s acquisition of Gretel this spring will support the tech giant’s own work in this field. Likewise, AI solutions company SAS acquired the synthetic data startup Hazy in November 2024.

The analytics firm Cognilytica estimated the synthetic data generation market to have been worth around $110 million in 2021, and expects it to reach $1.15 billion by 2027. Other forecasts anticipate a CAGR of 31% for the sector as it grows to $2.33 billion in value by 2030.

Model collapse

However, synthetic data’s exciting potential comes with a significant downside: model collapse. This is a phenomenon where LLMs that are trained solely on synthetic data start to produce less precise or less diverse outputs.

While real-world data tends to be high in complexity, synthetic data is often simplified and condensed by models. For example, researchers found that the accuracy of a model trained to detect cancerous moles from photographs was inversely related to the amount of synthetic training data. A recent study by academics from Oxford, Cambridge, Imperial College and the University of Toronto found that using model-generated data indiscriminately led to “irreversible defects in the resulting model.”

Even worse, most LLMs are “black boxes”, making it difficult to understand how they will respond to synthetic data. Researchers from Rice University and Stanford concluded that without some fresh real-world data, “future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease.”
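
The dynamic the researchers describe can be illustrated with a deliberately simple toy model, shown below: each “generation” is fitted only to samples drawn from the previous generation’s output, with no fresh real-world data, and the diversity of its outputs steadily decays. The Gaussian model and the specific sample sizes are assumptions chosen to make the effect easy to see, not a claim about any particular LLM.

```python
# Toy illustration of recursive training on synthetic data: diversity
# (the standard deviation) collapses over successive model generations.
import numpy as np

rng = np.random.default_rng(0)

real_data = rng.normal(loc=0.0, scale=1.0, size=50)  # the original real data
mu, sigma = real_data.mean(), real_data.std()

for generation in range(1, 201):
    # Train the next "model" purely on synthetic output of the current one.
    synthetic = rng.normal(mu, sigma, size=50)
    mu, sigma = synthetic.mean(), synthetic.std()
    if generation % 50 == 0:
        print(f"generation {generation:3d}: std = {sigma:.3f}")

# Typically the standard deviation drifts far below the original 1.0,
# i.e. each generation's outputs become progressively less diverse.
```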

The ongoing need for real-world data

Evidently, even with the rise in demand for synthetic data, the need for real-world data remains. In fact, demand for high-quality real-world data may even increase, for two reasons. First, real-world data will always be needed to train the AI models that then generate the synthetic data. Second, avoiding model collapse requires continually syncing synthetic data with real-world data.

Real data for training synthetic-data-producing AI models

As mentioned earlier, the majority of synthetic data today is created using Gen AI, and these Gen AI models must be trained on real-world data in order to create usable synthetic data. That is because they can only create synthetic data by replicating the patterns and statistical properties of a real-world dataset.

Consider the recent example of an insurance company that was able to use synthetic data to test out different vendors without compromising its sensitive customer data. To generate this synthetic dataset, which accurately mimicked reality, it had to use its own real-world data to train the AI model that then generated the synthetic data.

Real data for mitigating model collapse

There are multiple strategies for mitigating the risk of model collapse. These include validating and then regularly reviewing synthetic datasets, and checking the quality of synthetic data before it is used in generative models. However, the most common approach is to diversify the data used by combining synthetic data with human data. Gartner's survey found that 63% of respondents favor using a partially synthetic dataset, with only 13% saying they use fully synthetic data.

Even adding modest amounts of real-world data can significantly improve a model’s performance. Researchers from the University of Southern California found that companies can replace up to 90% of their real data with synthetic data without seeing a substantial drop-off in performance. However, replacing that final 10% of human data results in a significant decline.
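
In practice, that blending often comes down to deliberately reserving a slice of real examples when the training set is assembled. The sketch below shows one simple way to do that; the arrays, the 10% default and the helper name build_training_set are illustrative assumptions that echo the ratio cited above rather than reproduce the study.

```python
import numpy as np

rng = np.random.default_rng(7)

def build_training_set(real, synthetic, real_fraction=0.10, total=10_000):
    """Blend real and synthetic rows so that `real_fraction` of the
    final training set comes from real-world data."""
    n_real = int(total * real_fraction)
    n_synth = total - n_real
    real_part = real[rng.choice(len(real), size=n_real, replace=False)]
    synth_part = synthetic[rng.choice(len(synthetic), size=n_synth, replace=False)]
    blended = np.concatenate([real_part, synth_part])
    rng.shuffle(blended)  # shuffle rows in place
    return blended

real_rows = rng.normal(size=(5_000, 4))        # stand-in for curated real data
synthetic_rows = rng.normal(size=(50_000, 4))  # stand-in for generated data
train = build_training_set(real_rows, synthetic_rows)
print(train.shape)  # (10000, 4)
```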

Quality also counts, as illustrated by Microsoft’s success with Phi-4. This LLM was trained on predominantly synthetic data generated by GPT-4o. However, much of the pre-training data – the general dataset used for the first stage of training, before a model is fine-tuned – was carefully curated, high-quality real-world data, including books and research papers.

Potential benefits synthetic data can bring

When synthetic data is used intelligently, and combined effectively with real-world data, it has the potential to address six specific issues with AI training data: scarcity, accessibility, homogeneity, bias, privacy, and cost.

Data scarcity

As AI companies race to gain market share and achieve new firsts, their demand for data to train LLMs keeps growing. Synthetic data has the potential to fill this gap, at least according to research by Gartner. It should be noted, however, that significant amounts of real data will still be needed in pre-training datasets and for syncing to avoid model collapse.

Data accessibility

Increasingly, big tech companies are acting as gatekeepers when it comes to data, creating a barrier to entry for smaller players. Synthetic data has the potential to democratise Gen AI by making large volumes of training data affordable and accessible. Nevertheless, this will not remove the responsibility of big tech to improve access to real-world data, which is still needed to train the models that create synthetic data.

Data homogeneity

In some niche use cases, like training AIs for autonomous driving, real-world datasets are too homogeneous. In the case of driving, developers can generate synthetic data to fill gaps in the data for unusual situations. This enables models to train for rare occurrences on the road.

Bias

Some real-world datasets contain inherent biases, so synthetic data can be generated to give AI models a more balanced picture. For example, in finance, the UK’s Financial Conduct Authority (FCA) has argued that synthetic data has the potential to counteract biases that arise when certain groups are underrepresented in human datasets.

Privacy

In sectors like healthcare and finance, privacy requirements are making data shortages more acute. With synthetic data, companies can build training datasets containing niche data without compromising customer privacy. However, as a report commissioned by the UK’s Royal Society has pointed out with reference to synthetic data in medical research, the assumption that synthetic data is “inherently private” is a “misconception”: synthetic data can leak information about the data it was derived from.

Specifically, models trained on sensitive data are vulnerable to model inversion attacks, in which attackers are able to reconstruct portions of the original dataset.

Cost

Generally speaking, synthetic data is generated at a lower cost than real-world data. It also comes labelled, which saves time and money. On some AI projects, up to 80% of the work is taken up by data preparation, including labelling. This explains why dedicated companies have emerged specifically to source low-cost labour to meet the data processing needs of Silicon Valley giants.

Augmenting rather than replacing real data

These benefits of synthetic data can be leveraged, provided it is not treated as a replacement for real data. Instead, its role should be to augment real datasets, providing ways to increase the scale of data points available.

For context, Meta’s upcoming LLM, Llama Behemoth, is being trained on 30 trillion data points. Clearly, finding real-world data at this scale is challenging, if not impossible. Yet, as has been noted, real-world data is still a must, whether for training the models that produce synthetic data or for syncing with synthetic data to ensure accuracy and avoid model collapse. At the scale LLMs now operate at, even if synthetic data makes up a significant proportion of the training data used, there will still be substantial demand for real-world data. And this means there will remain complex issues to resolve around gatekeeping, access, bias, cost, and time.

For over 13 years, Gediminas Rickevicius has been a force of growth in market-leading IT, advertising, and logistics companies around the globe. He has been changing the traditional approach to business development and sales by integrating big data into strategic decision-making. As the Senior VP of Global Partnerships at Oxylabs, Gediminas continues his mission to empower businesses with state-of-the-art public web data gathering solutions.