Generative AI models rely on high-quality real-world data to advance, but as digital publishers increasingly restrict access to public data, the supply of such data appears to be dwindling. This limitation poses a potential obstacle to the progress of large language models such as OpenAI’s GPT-4 and Google’s (NASDAQ:GOOGL) Gemini. To overcome this issue, experts are considering synthetic data as a viable alternative. Unlike real-world data created by humans, synthetic data is generated by machine learning models from samples of authentic data. However, experts caution that improper use of synthetic data could have adverse effects.
Synthetic data has already been used effectively in areas such as autonomous driving. Companies like Waymo and Tesla (NASDAQ:TSLA) have leveraged it to train their systems to handle a wide variety of road conditions, and that track record has sparked interest in using synthetic data to train generative AI models. Unlike real-world data, which can be scarce and sensitive, synthetic data offers a way to create diverse scenarios and edge cases for training purposes.
Synthetic data is also seen as a tool for fine-tuning specialized AI models. According to Kjell Carlsson of Domino Data Lab, large models like OpenAI’s GPT-4 can generate synthetic data to train smaller models for specific purposes, such as creating targeted advertisements. Training data can also be generated synthetically in various languages, aiding the improvement of language models. Jigyasa Grover of Bordo AI emphasizes the role of synthetic data in making language models more adaptable and effective.
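For illustration only, a minimal Python sketch of this approach might look like the following: a large model is prompted for labeled examples that are saved for later fine-tuning of a smaller, specialized model. The prompt, the label format, the output file, and the use of the OpenAI Python SDK are assumptions made for this sketch, not details reported by the companies involved.

# Illustrative sketch: use a large "teacher" model to generate synthetic
# (text, label) pairs for a smaller, specialized model. The prompt, labels,
# and output format are hypothetical choices for demonstration purposes.
import json
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

PROMPT = (
    "Write one short advertisement for a fictional running shoe, then on a "
    "new line give its target audience as a single word, e.g. 'athletes'."
)

def generate_example() -> dict:
    """Ask the large model for one synthetic advertisement and its label."""
    response = client.chat.completions.create(
        model="gpt-4",  # the large model acting as the data generator
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,  # higher temperature encourages varied samples
    )
    text = response.choices[0].message.content.strip()
    ad, _, audience = text.rpartition("\n")
    return {"text": ad.strip(), "label": audience.strip().lower()}

# Collect a small synthetic dataset to fine-tune a smaller model downstream.
with open("synthetic_ads.jsonl", "w") as f:
    for _ in range(100):
        f.write(json.dumps(generate_example()) + "\n")

In practice, such generated examples would typically be filtered and deduplicated before fine-tuning, which is part of why experts stress careful handling.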
Synthetic Data as an Alternative
Artificially generated data can fill information gaps, especially when organizations refrain from sharing sensitive data. In sectors like healthcare and finance, synthetic data can help train AI models without compromising sensitive information. Neil Sahota, an AI advisor to the United Nations, mentions that synthetic data can train AI to identify tumors more accurately or detect corporate crime. This approach mitigates intellectual property concerns, as synthetic data avoids the risk of infringing on other people’s work.
Despite its advantages, training AI on synthetic data carries risks. A recent study found that models trained repeatedly on machine-generated data produce lower-quality outputs with each successive generation, a phenomenon known as “model collapse.” Synthetic data techniques are still evolving, and a shortage of skilled engineers can exacerbate problems. Additionally, biased synthetic data can lead to legal liability if outputs are deemed discriminatory or inaccurate. Mayur Pillay of Hyperscience believes real-world data remains critical, especially for complex data types like handwriting on forms.
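The mechanics of model collapse can be conveyed with a toy simulation. The Gaussian model, sample size, and generation count below are illustrative assumptions; real generative models are far more complex, but the compounding effect of refitting on successive rounds of synthetic output is analogous.

# Toy sketch of "model collapse": a simple generative model is repeatedly
# refit on its own synthetic samples instead of real data. A Gaussian stands
# in for a real generative model so the degradation is easy to observe.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 50  # deliberately small, so estimation error compounds quickly

# Generation 0 is "real" data drawn from the true distribution N(0, 1).
data = rng.normal(loc=0.0, scale=1.0, size=n_samples)

for generation in range(201):
    # "Train" the model: estimate the mean and spread of the current data.
    mu, sigma = data.mean(), data.std()
    if generation % 40 == 0:
        print(f"generation {generation:3d}: mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation is trained only on synthetic samples from the fit.
    data = rng.normal(loc=mu, scale=sigma, size=n_samples)

# Over many generations the fitted spread tends to drift and shrink, so later
# models capture less of the original variation: a simplified analogue of the
# quality loss reported for models trained on machine-generated output.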
Balancing Synthetic and Real Data
While synthetic data offers a potential solution to the shortage of AI training data, experts agree that careful handling is essential. Combining synthetic data with real data might address the data scarcity, but synthetic data alone is unlikely to become the primary source of training data for AI companies soon. Jigyasa Grover points out that the current volume of data required for training large language models is substantial, and generating unbiased synthetic data at that scale remains challenging.
The interplay between synthetic and real data could be crucial in enhancing AI training while mitigating risks. As the technology and methodologies for generating synthetic data improve, it could become a more reliable resource for AI training. Nonetheless, the importance of real-world data cannot be overstated, particularly for intricate and context-dependent data types. The future of AI development may hinge on finding the right balance between these two data sources.