Most enterprise applications of Artificial Intelligence are predicated on understanding and generating natural language in the form of text. Deepgram, an AI platform for understanding and generating spoken and written language, is looking to expand this paradigm by making voice applications just as accessible, and just as adept at natural language, as their text-based counterparts.
The startup’s platform specializes in building AI models for speech-to-text, text-to-speech and, arguably the apex of this functionality, conversational speech-to-speech applications. Deepgram also hosts these models and their supporting infrastructure, which customers can access through APIs in a number of cloud settings, including Virtual Private Clouds, as well as on-premises.
Deepgram’s Voice Agent API allows customers to create digital agents that users can speak to in natural language. The agent issues near real-time responses for conversational interactions, and is trained to converse with end users about the various facets of a customer’s business.
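In practice, agents like this are driven over a streaming connection: the client sends caller audio upstream while the agent streams synthesized speech and event metadata back. The sketch below shows what such a session might look like; the endpoint URL, authentication scheme, and message fields are illustrative assumptions, not Deepgram’s documented interface.

```python
# Illustrative sketch of a conversational voice-agent session over WebSockets.
# The URL, auth scheme, and message schema are hypothetical; consult the
# vendor's Voice Agent API documentation for the real interface.
import asyncio
import json

import websockets  # pip install websockets

AGENT_URL = "wss://agent.example.com/v1/converse"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                           # hypothetical auth scheme


def play(audio: bytes) -> None:
    """Placeholder for your audio output (speaker, telephony leg, etc.)."""


async def run_session(pcm_chunks):
    """Stream caller audio up; handle agent speech and events as they arrive."""
    async with websockets.connect(f"{AGENT_URL}?token={API_KEY}") as ws:
        # Hypothetical session setup: choose models, language, and voice.
        await ws.send(json.dumps({
            "type": "configure",
            "language": "en",
            "listen": "speech-to-text-model",
            "speak": "text-to-speech-model",
        }))

        async def send_audio():
            for chunk in pcm_chunks:   # raw PCM frames from the caller
                await ws.send(chunk)   # binary frames carry audio upstream

        async def receive():
            async for message in ws:
                if isinstance(message, bytes):
                    play(message)                         # agent speech
                else:
                    print("event:", json.loads(message))  # transcripts, turns

        await asyncio.gather(send_audio(), receive())

# asyncio.run(run_session(microphone_chunks()))  # microphone_chunks: your source
```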
“We just released our voice agent in September,” commented Deepgram CEO Scott Stephenson. “Today, we have over 200,000 developers that build on top of Deepgram. We build our own voice-native AI models that understand, analyze, and generate voice.”
Deepgram’s approach has already proven effective in production. In certain locations, the company has supplied Jack in the Box with digital agents that take people’s orders at drive-through windows while conversing with them in natural language. Other use cases apply to contact centers and to organizations seeking to upgrade dated Interactive Voice Response (IVR) systems in verticals like finance or healthcare.
Latent space speech models
Several of the models available through Deepgram are built on neural network architectures. However, applying these models to audio and speech raises considerations, such as the need to mitigate background noise, or to understand and generate speech in its presence, that AI model builders must account for. Consequently, Deepgram’s models are closed-source and proprietary. “Of course, we learn from open source,” Stephenson acknowledged. “We use underlying frameworks that help with implementing transformers, and Convolutional Neural Networks, and that type of thing in our models. But we do all the model architectures. The model weights are all trained by us.”
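To make that architectural vocabulary concrete, here is a toy PyTorch sketch of a speech encoder that pairs a convolutional front end (local acoustic patterns) with transformer layers (long-range context). It is purely illustrative: Deepgram’s actual architectures and weights are proprietary, and every dimension below is an arbitrary assumption.

```python
# Toy speech encoder: convolutional front end plus transformer layers.
# Illustrative only; not Deepgram's architecture.
import torch
import torch.nn as nn

class ToySpeechEncoder(nn.Module):
    def __init__(self, n_mels=80, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # Convolutions downsample the spectrogram in time and learn local features.
        self.frontend = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, mel):        # mel: (batch, n_mels, time)
        x = self.frontend(mel)     # (batch, d_model, time / 4)
        x = x.transpose(1, 2)      # (batch, time / 4, d_model)
        return self.encoder(x)     # contextualized frame embeddings

# Usage: embeddings = ToySpeechEncoder()(torch.randn(1, 80, 400))
```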
Some of the techniques used in Deepgram’s models align with what’s termed representation learning. This approach, which is often paired with manifold learning techniques, represents high-dimensional data in a lower-dimensional space that still preserves the features and relationships in the data. Stephenson referred to the models Deepgram makes available as “latent space models”. What the company does for voice data, he likened to what is done for images: “When you can figure out a scheme for a very high-quality compression, that then reconstructs the image back and it almost looks perfect, but it still takes up way less space than the original image,” Stephenson remarked. “If you build your models with that in mind, those underlying representations are in a latent space.”
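The compression analogy can be sketched with a standard autoencoder: a network that squeezes a high-dimensional signal into a small latent vector and is trained to reconstruct the original from it. This is a minimal illustration of representation learning in general, not Deepgram’s model; the dimensions are arbitrary.

```python
# Minimal autoencoder sketch of latent-space compression. Illustrative only.
import torch
import torch.nn as nn

class LatentAutoencoder(nn.Module):
    def __init__(self, input_dim=8000, latent_dim=64):
        super().__init__()
        # Roughly 125x compression: 8,000 samples -> 64 latent dimensions.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, input_dim)
        )

    def forward(self, x):
        z = self.encoder(x)        # the latent-space representation
        return self.decoder(z), z  # reconstruction plus latent code

# Training minimizes reconstruction error, so the latent space must preserve
# the structure of the original signal:
model = LatentAutoencoder()
audio = torch.randn(16, 8000)      # one second at 8 kHz, batch of 16
recon, latent = model(audio)
loss = nn.functional.mse_loss(recon, audio)
```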
Model hosting and infrastructure
The real-time, on-demand access to speech language models that Deepgram furnishes is nearly as valuable as the models themselves. In addition to hosting these models, Deepgram makes available infrastructure to support an array of customer use cases at enterprise scale. “If you’ve got 1,000 or 10,000 conversations happening at once, a model isn’t all you need,” Stephenson mentioned. “You need the infrastructure to supply that. So, we build both. You need to be able to host the models. You need to be able to swap the models. You need to be able to adapt the models and handle all these streams happening at the same time.” With Deepgram’s infrastructure, customers can dynamically swap in any of a variety of models for speech-to-text, text-to-speech, and speech-to-speech use cases at will.
The cost benefits of doing so are obvious: organizations only pay for the models they need, when they use them. Additionally, their solutions become more flexible, multifaceted, and broadly useful, enlarging the scope of business needs they address. Deepgram’s model hosting infrastructure enables “a very efficient and unrestricted hot swapping with low latency and high throughput,” Stephenson revealed. “So, if you need French? No problem. If you need a multi-lingual [model]? No problem. If you need your specific model that’s very good at doing tracking numbers for a shipping company, or something like that? No problem. You swap it in, utilize it, and then go back to the general model or whatever you need.”
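Conceptually, the hot swapping Stephenson describes amounts to keeping multiple models resident and routing each request to the right one, so a switch is a lookup rather than a reload. The registry below is a hypothetical sketch of that pattern; Deepgram exposes it through its hosted APIs rather than through customer-managed code, and the model names and stub functions here are invented for illustration.

```python
# Hypothetical sketch of per-request model hot swapping. Not Deepgram's code.
from typing import Callable, Dict

class ModelRegistry:
    """Keeps several models resident so requests can switch without reloads."""

    def __init__(self):
        self._models: Dict[str, Callable[[bytes], str]] = {}

    def register(self, name: str, model: Callable[[bytes], str]) -> None:
        self._models[name] = model   # load once, reuse across many streams

    def transcribe(self, audio: bytes, model_name: str = "general") -> str:
        # Swapping is just a dictionary lookup: no cold start, low latency.
        return self._models[model_name](audio)

registry = ModelRegistry()
registry.register("general", lambda audio: "<general transcript>")
registry.register("french", lambda audio: "<transcription française>")
registry.register("shipping-tracking", lambda audio: "<tracking number>")

# One stream uses the French model; the next returns to the general one.
print(registry.transcribe(b"...", "french"))
print(registry.transcribe(b"..."))
```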
Time will tell
The on-premises and cloud accessibility of Deepgram’s voice AI models, and their attendant infrastructure, has the potential to democratize this manifestation of cognitive computing. Voice applications may not yet be as ubiquitous as textual applications of language models, but they certainly have the potential to become so. Time will tell whether the enterprise use cases for this technology, which include horizontally applicable deployments for interfacing with customer-facing and back-end systems via speech, will keep pace with these technological advancements.