
The secret to Deepgram’s speech-to-text model: Synthetic data generation

  • Jelani Harper 

Nova-3, Deepgram’s most effective speech-to-text model to date, offers a broad set of capabilities. It supports roughly 10 languages, even when speakers switch between them in the same conversation. It was trained to handle industry-specific jargon in fields such as healthcare and law. And its transcriptions are rendered fast enough for real-time use cases, avoiding the lengthy waits common to other approaches.

Most of all, it delivers these benefits in challenging acoustic conditions, which greatly increases its value for real-world use cases. When Nova-3 is deployed in situations with background noise, feedback, or other undesirable audio, neither the accuracy of its transcriptions nor the speed with which they are delivered is compromised.

According to Deepgram CEO Scott Stephenson, one of the ways to “get a model good at that is by exposing it to all sorts of environments.”

The difficulty in doing so is twofold. On the one hand, data scientists must account for the nearly endless array of situations in which the quality of audio for vocal interactions is degraded. On the other, they must be able to access data reflecting those conditions.

The solution here, as for a growing number of data science problems involving training or fine-tuning data, is simply to generate the required data with synthetic data techniques.

Generating training data

For this particular use case, synthetic data was employed to multiply the amount of data available for training Nova-3 to operate robustly in adverse acoustic scenarios. Synthetic data is widely valued for its capacity to simulate events and to dramatically expand the training and fine-tuning data on hand for advanced machine learning models.

To build Nova-3, Deepgram required data with “all sorts of types of voices, young, old, different dialects, different regions, different languages,” Stephenson mentioned. “But then you take those voices and put them into new, or novel, or challenging acoustic environments. Then you can generate a lot of examples for the model to be forced to understand, where it would have been hard before to get those samples.”

The specificity of the data produced is one of the strengths of synthetic data generation techniques. Organizations can use various forms of synthetic data to produce strikingly realistic results. According to Stephenson, for Nova-3, such data “ranges from many different things, from a truck driving by in the background, to a bad connection far away from the microphone. People talking in the background, that type of thing. But, what you do is you generate synthetic data and data augmentation in order to get the model robust to that type of thing.”
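One common form of the augmentation Stephenson describes is mixing a noise recording (a passing truck, background chatter) into clean speech at a controlled signal-to-noise ratio. The sketch below is a minimal illustration of that idea in NumPy, not Deepgram’s pipeline; the `mix_at_snr` function and the sine-tone stand-in for speech are ours.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise clip into clean speech at a target signal-to-noise ratio.

    Scales the noise so that 10*log10(P_speech / P_noise) == snr_db,
    then adds it to the speech sample by sample.
    """
    noise = noise[: len(speech)]                      # trim noise to match length
    p_speech = np.mean(speech ** 2)                   # average signal power
    p_noise = np.mean(noise ** 2)                     # average noise power
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: a synthetic "voice" tone plus white noise mixed at 5 dB SNR
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000)                          # 1 second at 16 kHz
speech = 0.5 * np.sin(2 * np.pi * 220 * t)            # stand-in for clean speech
noise = rng.normal(0, 0.1, size=speech.shape)         # stand-in for background noise
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

Sweeping `snr_db` over a range of values turns one clean recording into many training examples of varying difficulty, which is the “generate a lot of examples” effect Stephenson refers to.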

Multi-model approach

The ways to produce synthetic data span almost the full spectrum of predictive models in advanced machine learning. Deepgram’s synthetic data generation for Nova-3 rests on a multi-model approach that largely eschews the popular Generative Adversarial Network (GAN) method, even though the GAN architecture is frequently the basis for innovations in the field of synthetic data generation. “GANs are definitely part of the lineage leading up to this type of reasoning, but we don’t use GANs in order to build these models,” Stephenson disclosed. “We use Flow Matching models, State Space models, and our own proprietary latent space models.” Within this combination, the latent space models are responsible for the compression involved in Nova-3: the capability to express high-dimensional data in a lower-dimensional space.
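Deepgram’s latent space models are proprietary, but the underlying idea of expressing high-dimensional data in a lower-dimensional space can be sketched with the simplest possible stand-in: projecting feature vectors onto their principal directions and reconstructing them. Everything below (dimensions, data, the PCA-style projection) is illustrative only.

```python
import numpy as np

# Toy illustration of latent-space compression: 128-dimensional "audio
# frames" that really live near an 8-dimensional subspace are compressed
# to 8 numbers each and then reconstructed with little loss.
rng = np.random.default_rng(42)

latent = rng.normal(size=(1000, 8))               # hidden low-dim structure
mixing = rng.normal(size=(8, 128))                # lift to 128 dimensions
frames = latent @ mixing + 0.01 * rng.normal(size=(1000, 128))

centered = frames - frames.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
basis = vt[:8]                                    # top 8 principal directions

codes = centered @ basis.T                        # 128-dim -> 8-dim compression
recon = codes @ basis                             # 8-dim  -> 128-dim reconstruction

err = np.linalg.norm(recon - centered) / np.linalg.norm(centered)
```

A 16x compression with a relative reconstruction error well under one percent, because nearly all of the variance lives in the low-dimensional subspace; learned latent space models exploit the same kind of structure in real audio, just nonlinearly.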

Stephenson characterized Flow Matching as a generalization of the widely used Diffusion models. Whereas Diffusion is often deployed to generate images, the Flow Matching techniques Deepgram relied on for Nova-3 “are being utilized in audio in order to generate ‘audio-scapes’ and voices saying whatever you need to be said,” Stephenson commented. State Space models, in turn, are a generalization of Recurrent Neural Networks (RNNs). They are particularly helpful in supporting attention-like behavior in language models, allowing a model to look back at earlier parts of a conversation, for example, to determine the antecedent of a demonstrative pronoun such as ‘that.’
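The core of Flow Matching is simple to state, even though Deepgram’s production models are far more elaborate. A minimal sketch of the standard training target, using a one-dimensional toy distribution rather than audio: draw a noise sample x0 and a data sample x1, pick a random time t, form the point x_t on the straight path between them, and regress a velocity field toward x1 − x0 at (x_t, t). A model trained this way can then be integrated from noise to synthesize new samples.

```python
import numpy as np

# Flow-matching training targets on a toy 1-D "data" distribution
# (illustrative only; not Deepgram's implementation).
rng = np.random.default_rng(0)
n = 100_000

x1 = rng.normal(loc=3.0, scale=0.5, size=n)   # stand-in data distribution
x0 = rng.normal(size=n)                        # standard-normal noise
t = rng.uniform(size=n)                        # random times in [0, 1]

x_t = (1 - t) * x0 + t * x1                    # point on the straight path
v_target = x1 - x0                             # velocity the model regresses toward
```

At t near 0 the path points sit in the noise distribution and at t near 1 they sit in the data distribution; the learned velocity field transports one into the other, which is what lets these models generate “voices saying whatever you need to be said.”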

Increased accuracy, lower error rate

Because Nova-3 is able to function in noisy, disadvantageous audio settings as well as in optimal ones, it produces speedy transcriptions for speech-to-text applications. That robustness is attributed to the range of training data Deepgram used to teach the model to perform in such conditions. “In order to generate millions of hours of audio to train models like Nova-3, it would be extremely expensive if you just used off-the-shelf variants of GANs, or State Space, or Flow Matching,” Stephenson said. “They’d all be too expensive. So, what you do is combine these techniques in an efficient way, but do it in a way that still achieves the goal you want.”
