Any artificial intelligence company faces the same core challenge: collecting the data necessary to train its AI models. The demand for high-quality training data is so great that it has led to an entire sub-industry dedicated to supplying AI companies with the data they need, and AI and AI-adjacent companies are always looking for new sources. One option is to skip collection altogether and simply generate the data.
As Fortune reported, DataGen specializes in using its own machine learning models to create synthetic data, particularly image and video data, which its customers then use to train their own AI models. According to DataGen's CEO and founder, Ofir Chakon, the company can create an entirely synthetic dataset for a client in just a few hours. That is substantially faster than the weeks or even months of labeling typically required to prepare a real-world dataset for use.
Speed is not the only reason synthetic data is attractive to companies. Because it is not drawn from real people, synthetic data avoids the kinds of privacy concerns that real data carries, an advantage that grows as more laws are created to protect people's data privacy. One estimate from the technology analytics firm Gartner predicts that by 2023 around 65% of the world's population will have their data protected by some type of data privacy law.
Although synthetic data isn't based on real people, it can still be biased. A synthetic data model reproduces the statistical patterns of its original training data, so if that dataset is biased, the generated data inherits those biases. DataGen has strategies for reducing bias in its generated data. One method is to increase the occurrence rate of relatively rare events: if one class in the dataset is under-represented, its occurrence rate can be boosted to a more balanced share.
Boosting the occurrence of rare events is especially important when creating datasets that involve potentially dangerous scenarios. Consider a dataset used to train an autonomous vehicle: the vehicle must respond reliably to rare events, such as a sinkhole opening up in the road, yet precisely because these events are so rare, real training examples are difficult to obtain. Training data for such events often needs to be generated.
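The class-boosting idea described above can be sketched as simple oversampling. This is a minimal illustration of the general technique, not DataGen's actual pipeline, and all function and parameter names here are hypothetical:

```python
import random
from collections import Counter

def boost_rare_classes(samples, labels, target_ratio=1.0, seed=0):
    """Oversample under-represented classes so that each class reaches
    target_ratio * (size of the largest class). In a synthetic-data
    pipeline the extra examples would be freshly generated rather than
    duplicated, but the balancing logic is the same."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = int(target_ratio * max(counts.values()))
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):          # add copies until balanced
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels

# Hypothetical usage: a dataset with 8 "normal road" frames and only
# 2 "sinkhole" frames is rebalanced to 8 of each.
frames = list(range(10))
tags = ["normal"] * 8 + ["sinkhole"] * 2
boosted_frames, boosted_tags = boost_rare_classes(frames, tags)
```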
As Chakon explained to Fortune:
“Our customers have full control over all the parameters that go into the data they create. The real-world implication is that, once deployed, you can be sure it’s going to work well in different domains, with different ethnicities, in different geographic locations or any environment you can imagine.”
DataGen uses generative adversarial networks (GANs) to generate realistic simulations of real-world objects and events. Chakon explained that the company can reliably generate realistic examples of anything involving indoor environments or human perception. For instance, an image dataset generated by DataGen could include objects used to train a robotic picking arm for warehouse logistics, with the generated images looking indistinguishable from the real thing. DataGen's software generates 3D objects by combining a visual mesh with a physics simulation system.
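DataGen's actual GAN architecture isn't public, so the following is only a toy illustration of the adversarial principle itself: a generator maps random noise to samples, a discriminator scores samples as real or fake, and the two are trained against each other. Here the "generator" is a single affine map learning to mimic a 1D Gaussian and the "discriminator" is a logistic scorer; all sizes, learning rates, and the target distribution are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Generator: x = a*z + b, with noise z ~ N(0, 1).
a, b = 1.0, 0.0
# Discriminator: D(x) = sigmoid(w*x + c), probability that x is real.
w, c = 0.0, 0.0

lr, batch, target_mean = 0.05, 64, 4.0   # "real" data is N(4, 1)

for step in range(2000):
    # --- Discriminator step: push D(real) -> 1 and D(fake) -> 0 ---
    real = rng.normal(target_mean, 1.0, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b
    d_real = sigmoid(w * real + c)
    d_fake = sigmoid(w * fake + c)
    # Gradients of -log D(real) - log(1 - D(fake)) w.r.t. w and c.
    w -= lr * np.mean(-(1 - d_real) * real + d_fake * fake)
    c -= lr * np.mean(-(1 - d_real) + d_fake)

    # --- Generator step: non-saturating loss, push D(fake) -> 1 ---
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b
    d_fake = sigmoid(w * fake + c)
    grad_x = -(1 - d_fake) * w           # d(-log D(fake)) / d(fake)
    a -= lr * np.mean(grad_x * z)        # chain rule: fake = a*z + b
    b -= lr * np.mean(grad_x)

# After training, generated samples should cluster near the real mean.
samples = a * rng.normal(0.0, 1.0, 1000) + b
```

The same loop structure scales up to image GANs: the affine map becomes a deep convolutional generator, the logistic scorer becomes a convolutional discriminator, and the gradients are computed by an autodiff framework instead of by hand.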
DataGen's investors include a variety of high-profile individuals and companies, among them the directors of Nvidia's AI research division and the Max Planck Institute for Intelligent Systems, as well as Anthony Goldbloom, CEO of Kaggle.