The critical battle for high-quality data in the AI industry

According to The Economist, Adobe has defied predictions of its demise in the face of AI by drawing on its vast database of stock photos to build Firefly, its suite of AI tools. Unlike its competitors, Adobe’s Firefly creates images without mining the internet for pictures, which allows the company to avoid copyright disputes. Since its launch in March, Firefly has generated over 1 billion images, and Adobe’s share price has risen by 36%. This success highlights a broader competition in the AI tools market, as model builders seek ever larger amounts of data to improve their AI models.

Demand for data is growing so rapidly that the supply of high-quality text for AI training may be exhausted by 2026, according to research firm Epoch AI. Companies like Google and Meta have trained their AI models on over 1 trillion words; by comparison, Wikipedia contains around 4 billion English words. Data quality matters just as much as quantity: text-based models benefit from well-written, factually accurate writing. AI chatbots also yield better results when they can explain their workings step by step. As a result, demand is rising for specialized information sets and for sources like textbooks.

Acquiring data has become more challenging as content creators demand compensation for material used in AI models, which has led to copyright infringement cases against model builders. This has triggered a flurry of dealmaking as AI companies scramble to secure data sources. OpenAI has partnered with the Associated Press and Shutterstock, while Google is reportedly in discussions with Universal Music to license artists’ voices for a songwriting AI tool. Data holders are using their bargaining power, with websites like Reddit and Stack Overflow raising the cost of data access. Companies are also improving the quality of the data they already hold through tasks like image labelling and user feedback mechanisms.

Corporate customers hold an untapped source of valuable data, such as call-centre transcripts and customer spending records. Using this data is difficult because it is often spread across multiple systems and buried within company servers rather than stored in the cloud. Despite these obstacles, Adobe’s success with Firefly and the ongoing data land grab in the AI industry indicate that competition for AI dominance is driving demand for large, high-quality datasets.