While the world is going wild over the potential benefits of generative AI, there’s little attention paid to the data deployed to build and operate these tools.
Let’s look at a few examples to explore what’s involved in determining data use, and why this matters for end users as well as operators.
Text-based generative AI usage
Text-based generative AI tools are a marvel of modern technology, but their usage isn’t without cost. The amount of data these tools consume depends significantly on the complexity and length of the request, as well as the given tool’s sophistication level.
For instance, OpenAI’s GPT-3 runs on roughly 175 billion parameters, and it was trained on vast amounts of text from books, websites, and other resources to generate human-like responses. Every character you input can take up to around 4 bytes (depending on the Unicode encoding), and a typical request runs to about 2,000 characters. That adds up quickly when millions of requests are made every day.
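As a rough sketch, here’s that back-of-envelope calculation in Python. The bytes-per-character figure and request length follow the estimates above, while the daily request volume is purely a hypothetical placeholder:

```python
# Back-of-envelope estimate of daily input data for a text-based AI service.
# Bytes per character and characters per request follow the estimates above;
# the daily request volume is a hypothetical placeholder.

BYTES_PER_CHAR = 4             # worst-case Unicode encoding
CHARS_PER_REQUEST = 2_000      # typical request length assumed above
REQUESTS_PER_DAY = 10_000_000  # hypothetical daily traffic

daily_bytes = BYTES_PER_CHAR * CHARS_PER_REQUEST * REQUESTS_PER_DAY
print(f"Input text alone: {daily_bytes / 1e9:.0f} GB per day")
# -> Input text alone: 80 GB per day
```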
Meanwhile, GPT-4 is even more intensive, reportedly running on around 1.8 trillion parameters with a training dataset exceeding a petabyte. So while an exact figure is hard to pin down, the data used across this class-leading platform is undeniably vast, and it grows with each iteration.
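To put those parameter counts into storage terms, here’s a simple calculation. The two-bytes-per-parameter assumption (16-bit precision) is just one common way weights are stored, and the GPT-4 figure is an unconfirmed report, so treat the results as order-of-magnitude estimates:

```python
# Rough storage footprint of model weights at 16-bit precision (2 bytes per
# parameter). Parameter counts are the figures quoted above; the GPT-4 number
# is an unconfirmed report, so these are order-of-magnitude estimates only.

def weights_size_tb(num_params: float, bytes_per_param: int = 2) -> float:
    """Storage needed for the raw weights alone, in terabytes."""
    return num_params * bytes_per_param / 1e12

print(f"GPT-3 (175B params):   {weights_size_tb(175e9):.2f} TB")
print(f"GPT-4 (reported 1.8T): {weights_size_tb(1.8e12):.2f} TB")
# GPT-3 (175B params):   0.35 TB
# GPT-4 (reported 1.8T): 3.60 TB
```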
Image-oriented AI data consumption
Image-based generative AI tools are fascinating beasts, but they are also voracious consumers of data, easily eclipsing their text-focused counterparts.
For example, Generative Adversarial Networks (GANs) are often used to create realistic images from random noise, and in doing so they gobble up substantial amounts of data. Training just one StyleGAN model requires a dataset of thousands or even millions of high-resolution images.
Let’s break it down in simpler terms. The average HD image weighs in at around 2 MB. If the model is trained on about 1 million such images, that works out to roughly 2 TB of storage before training even begins.
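Here’s that estimate spelled out, using the same assumed image size and dataset count as above:

```python
# Quick check of the training-set storage estimate above. The average image
# size and image count mirror the article's assumptions, not any specific
# published dataset.

AVG_IMAGE_MB = 2
NUM_IMAGES = 1_000_000

total_tb = AVG_IMAGE_MB * NUM_IMAGES / 1_000_000  # MB -> TB
print(f"Raw training images: ~{total_tb:.0f} TB")
# -> Raw training images: ~2 TB
```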
This highlights that when employing GANs or similar heavy-duty image-generating AIs, resource usage can scale substantially, which is key to keep in mind as you plan your projects.
Of course, for image tools that aren’t specifically generative, such as AI editing solutions, data usage is substantially smaller. Automatically altering a photo backdrop, for instance, is largely subtractive rather than generative. If you’re interested in the ins and outs of background changer software, it’s worth learning more about this technique to appreciate its advantages.
Speech synthesis tool data needs
Now, let’s focus on speech synthesis tools, which form another category of generative AI that demonstrates the hefty data consumption involved in this tech.
Tools like Google’s Text-to-Speech or Amazon Polly translate textual information into spoken voice output, a process that involves significant amounts of data. Using deep learning technologies, these AIs average about 2 MB per minute for standard-quality audio.
Delving deeper, if the tool needs to generate an hour’s worth of audio content, such as an audiobook chapter, that comes to roughly 120 MB of output. And that figure doesn’t include the initial training sets, which involve hundreds or even thousands of hours of recorded human speech.
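Here’s a quick sketch of that calculation, using the roughly 2 MB-per-minute figure from above and a couple of illustrative durations:

```python
# Output size for synthesized speech at the ~2 MB-per-minute rate assumed
# above. The durations are illustrative examples.

MB_PER_MINUTE = 2

for label, minutes in [("5-minute clip", 5), ("1-hour audiobook chapter", 60)]:
    print(f"{label}: ~{MB_PER_MINUTE * minutes} MB")
# 5-minute clip: ~10 MB
# 1-hour audiobook chapter: ~120 MB
```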
So while the final product is often just a few minutes long and not excessively large, it’s critical to remember that producing this requires vast reserves of underlying processed and unprocessed data.
Data use in music AI
Music-related generative AI tools are also easier to put in context once you understand their data use.
These AIs, like OpenAI’s MuseNet or Sony’s Flow Machines, compose music creatively, but their ingenuity is of course founded on a deluge of data. Training these models takes thousands of MIDI files, each anywhere from 10 KB to several megabytes in size.
For instance, a generated one-minute piece comes to only about 1 MB when converted to MP3 format. But this lean output once again belies the immense resources used during the model-training phase, with copious musical samples processed.
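For a sense of scale, here’s a rough comparison between a hypothetical MIDI training corpus and a single generated track. The corpus size and average file size are illustrative assumptions within the ranges mentioned above:

```python
# Rough size of a hypothetical MIDI training corpus versus one generated
# track. File counts and sizes are illustrative, within the ranges above.

NUM_MIDI_FILES = 100_000   # assumed corpus size
AVG_MIDI_KB = 50           # somewhere between 10 KB and a few MB per file
MP3_MB_PER_MINUTE = 1      # approximate size of a generated one-minute MP3

corpus_gb = NUM_MIDI_FILES * AVG_MIDI_KB / 1_000_000  # KB -> GB
print(f"Training corpus: ~{corpus_gb:.0f} GB of MIDI")
print(f"One-minute generated track: ~{MP3_MB_PER_MINUTE} MB as MP3")
# Training corpus: ~5 GB of MIDI
# One-minute generated track: ~1 MB as MP3
```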
So while these ingenious AIs churn out beautiful symphonies seemingly from thin air, the volume of data that’s been poured into their production and ongoing operation is certain to be staggering.
Chatbots and their data demands
Chatbots are a common use case for generative AI technology, but these virtual assistants also demand substantial data.
To begin with, generating conversational responses necessitates a considerable backend infrastructure. On average, answering one chat request uses only around 1-2 KB if just the final output text is counted.
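To see how those tiny per-response figures add up, here’s a quick estimate. The per-response size is the midpoint of the range above, while the daily traffic figure is a hypothetical assumption:

```python
# How per-response text adds up at scale, ignoring training data, context
# windows and protocol overhead. The traffic figure is a hypothetical
# assumption; the per-response size is the midpoint of the range above.

KB_PER_RESPONSE = 1.5
RESPONSES_PER_DAY = 5_000_000

daily_gb = KB_PER_RESPONSE * RESPONSES_PER_DAY / 1_000_000  # KB -> GB
print(f"Response text alone: ~{daily_gb:.1f} GB per day")
# -> Response text alone: ~7.5 GB per day
```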
However, that doesn’t tell the whole story. Behind the scenes, there’s extensive data usage during the training phase, which is the same story across every generative AI solution we’ve discussed. Consider the hundreds of thousands or more conversational scripts used to train big-league players like Microsoft’s Azure Bot Service or Google’s Dialogflow.
Add this to ongoing fine-tuning and you start to grasp the true scale of the raw information required. So even though your Alexa’s answer about tomorrow’s weather may seem simple, it represents the tip of the iceberg in terms of the data used.
Final thoughts
It’s worth concluding on the point that while generative AI is heavily data-reliant, the tools are becoming more efficient even as the datasets they’re trained on keep growing. So while it’s worth being mindful of their impact, this is certainly not a deal-breaker for anyone thinking about using them.