AI models increasingly demand unique, sophisticated datasets to improve their performance. But the developers of major Large Language Models (LLMs) are finding web data both inadequate and cost-prohibitive, according to a report from the Financial Times.
The Need for a Paradigm Shift:
- Major LLM creators recognize the limitations of human-made data in boosting performance. The next significant leap might not be achieved by feeding more web-scraped data to the models.
- Custom human-created data does not scale: commissioning the volume of finely detailed content that training requires is prohibitively expensive.
- Web data availability is dwindling as platforms like Reddit and Twitter charge substantial fees for data access, making it difficult for researchers to rely solely on such sources.
AI Models Take Charge with Synthetic Data:
- Companies like OpenAI, Microsoft, and Cohere are actively exploring synthetic data to overcome cost and quality constraints.
- Cohere takes a distinctive approach, pairing two AI models as tutor and student to generate synthetic data, which a human then reviews (see the sketch after this list).
- Microsoft’s research team has found that certain synthetic data can effectively train smaller models, though achieving GPT-4 performance with synthetic data remains a challenge.
- Startups like Scale.ai and Gretel.ai have already entered the market, providing synthetic data-as-a-service to meet the growing demand.
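To make the tutor/student idea concrete, here is a minimal sketch of what such a loop could look like. It assumes generic text-generation callables for both models; the function names, prompts, and console-based review step are illustrative assumptions, not Cohere's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SyntheticExample:
    prompt: str             # task posed by the "tutor" model
    answer: str             # "student" model's attempt
    approved: bool = False  # set during the human-review pass


def generate_synthetic_batch(
    tutor: Callable[[str], str],
    student: Callable[[str], str],
    topic: str,
    n_examples: int = 5,
) -> List[SyntheticExample]:
    """Tutor invents tasks about `topic`; student answers them."""
    batch: List[SyntheticExample] = []
    for i in range(n_examples):
        task = tutor(f"Write one challenging question about {topic}. (variation {i + 1})")
        answer = student(task)
        batch.append(SyntheticExample(prompt=task, answer=answer))
    return batch


def human_review(batch: List[SyntheticExample]) -> List[SyntheticExample]:
    """Placeholder for the human-in-the-loop step: keep only vetted pairs."""
    for ex in batch:
        reply = input(f"Q: {ex.prompt}\nA: {ex.answer}\nKeep this pair? [y/n] ")
        ex.approved = reply.strip().lower() == "y"
    return [ex for ex in batch if ex.approved]
```

In a real pipeline, `tutor` and `student` would wrap calls to two hosted models, and the approved pairs would be written out as fine-tuning examples rather than filtered at a console prompt.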
AI Leaders Shape the Future:
- Sam Altman envisions a future where synthetic data becomes ubiquitous, potentially sidestepping privacy concerns in the EU and accelerating the pathway to superintelligence through models teaching themselves.
- Aidan Gomez, CEO of Cohere, is dismissive of web data itself, describing it as noisy and unrepresentative of what the company actually needs.
Balancing Optimism and Caution:
- While the shift towards synthetic data shows promise, some researchers urge caution. A study from Oxford and Cambridge warns that training AI models on their own raw outputs can introduce “irreversible defects,” degrading performance over successive generations (a toy illustration of this feedback loop follows).
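As a rough analogy for the feedback loop the study warns about, the sketch below repeatedly refits a single Gaussian on samples drawn from its own previous fit. The setup is an assumption chosen for simplicity; it is not the study's experiment or a claim about LLMs, only a way to see how parameters can drift once each generation learns solely from synthetic output.

```python
import random
import statistics


def simulate_generations(n_generations: int = 20, sample_size: int = 50) -> None:
    """Each generation is 'trained' only on samples from the previous one."""
    mu, sigma = 0.0, 1.0  # generation 0 is fit on real data
    for gen in range(n_generations):
        samples = [random.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)    # refit on purely synthetic samples
        sigma = statistics.stdev(samples)
        print(f"gen {gen + 1:2d}: mean={mu:+.3f}  stdev={sigma:.3f}")


if __name__ == "__main__":
    random.seed(0)
    simulate_generations()
```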
The Emerging Landscape:
- The era in which human-made content drives AI training is drawing to a close. Over the next decade, AI-generated content is poised to make up the bulk of the world’s data, reshaping how models are trained and pushing the frontier of artificial intelligence forward.