The quantum leap forward in natural language technologies attributed to foundation models, LLMs, and modern vocal applications of AI is due in no small part to the mastery of the concept of attention.
When training and deploying these models, attention mechanisms serve several purposes. One of the most valuable is allowing a model to look back at different parts of a conversation or text and determine how that context relates to the current input.
Attention also underpins a form of episodic memory, in which models resolve referents for pronouns such as ‘they’ or ‘him’ to the nouns they point back to. Perfecting attention in language models is crucial to the success of everything from the transformer architecture to contemporary speech-to-speech models, including those advanced by Deepgram, which specializes in vocal and language applications of AI.
According to Deepgram CEO Scott Stephenson, the vendor has developed a speech-to-speech model that works without converting speech to text while “preserving crucial elements of communication.” For this particular model—which enables users to speak in natural language to AI systems that respond in kind—and other language models, mastering attention is pivotal.
An examination of contemporary attention mechanisms reveals that some of the most influential involve state-space models, transformers, and what Stephenson referred to as “latent space” models.
State space models
Conceptually, there are marked similarities between state space models and Recurrent Neural Networks (RNNs). Both architectures can look back at previous parts of a model’s inputs, whether earlier passages in a text or earlier turns in a spoken conversation. “State space models are a generalization of RNNs,” Stephenson indicated. “They’re constrained in different ways, but the general idea is you carry around your bag with you. You carry around your luggage with you.” The luggage in Stephenson’s analogy is the earlier portion of the natural language interaction the model has been exposed to, such as preceding sentences or answers to questions. The model uses that carried context to inform the next step in the sequence of the interaction.
However, the differences between state space models and RNNs may be more significant than their similarities. “State spaces can be faster,” Stephenson noted. “They can be longer and have a much longer attention span. An RNN’s attention span is only looking back one step. In that step, all the information has to be carried in the bag you’re carrying along. But in a state space model, you’re carrying along the state, but you have more of an ability, through that state, to look backward and, depending on the thing that you’re doing, possibly look forward.”
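For readers who want to see the “luggage” in code, the following is a minimal sketch of a generic linear state space recurrence. It is illustrative only: the function name, parameters, and dimensions are invented for this example and do not reflect Deepgram’s implementation.

```python
import numpy as np

def ssm_scan(A, B, C, inputs):
    """Minimal linear state space recurrence: the hidden state h is the
    'luggage' carried forward at every step, summarizing everything seen so far."""
    h = np.zeros(A.shape[0])           # the "bag" of context carried along
    outputs = []
    for x_t in inputs:                 # one scalar input per step
        h = A @ h + B * x_t            # fold the new input into the carried state
        outputs.append(C @ h)          # read a prediction out of the carried state
    return np.array(outputs)

# Toy usage: a 1-D input sequence with a 4-dimensional hidden state.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)                    # slow decay, so older context lingers in the state
B, C = rng.normal(size=4), rng.normal(size=4)
print(ssm_scan(A, B, C, inputs=rng.normal(size=10)))
```

In both an RNN and a state space model, everything the network knows about the past has to live in that carried state; the structure of the state space update is what gives it, in Stephenson’s words, “a much longer attention span.”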
Self-attention and vanilla transformers
Deepgram employs state space models to help develop its own models, which include speech-to-speech, speech-to-text, and text-to-speech systems. There are parallels between the value state space models provide to Deepgram’s models and what self-attention provides to LLMs. Both are influential in giving these respective models the broad awareness required for a heightened understanding of natural language. With state space models, “You get better context,” Stephenson reflected. “Just like self attention for transformers gets better context for LLMs. Well, this type of variant of self attention is included in state space models.”
Self-attention in the transformer architecture enables models to enrich the representation of each word with weighted contributions from the surrounding text, expanding how much context a model can draw on when interpreting natural language. This mechanism is part of the reason “a vanilla transformer can look everywhere with the same amount of effort,” Stephenson explained. “It can look back 100 steps. It can look back one. It can look back two, and they’re all treated the same. A state space model tries to take that idea and apply it to Recurrent Neural Networks, so you gain that ability of having a broad attention span. And then, you can drop that constraint from RNNs from only looking back very near to where you are.”
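To make the “look everywhere with the same amount of effort” point concrete, here is a minimal sketch of scaled dot-product self-attention, the mechanism at the heart of a vanilla transformer. The weights and dimensions are toy values, and the causal masking used by decoder-only LLMs is omitted for brevity.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: every position can weigh every
    other position directly, whether it is one step back or one hundred."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise relevance, all-to-all
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the whole sequence
    return weights @ V                                # context-enriched representations

# Toy usage: 6 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (6, 8)
```

The score matrix connects every position to every other in a single step, which is why looking back one token or one hundred costs the same amount of effort.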
Latent space techniques
Deepgram also relies on latent space approaches when building its models for generating and understanding spoken language. Certain latent space techniques allow models to represent high-dimensional data in a lower-dimensional space. When describing the relationship between these techniques and attention for models, Stephenson commented that “you can use latent space techniques, within these different structures, in order to compress, and this is a lot like what DeepSeek did with their Multi-head Latent Attention. And, that word is in there, latent.”
When properly applied, the compression benefits of latent space techniques translate directly into better attention. The compression Stephenson described preserves the relationships between pieces of information, or words, from the higher-dimensional space while representing them in a lower-dimensional one. Thus, “You utilize the latent space in order to store more information, have a richer attention space, but you can do this while making the model faster,” Stephenson remarked. “You can make the model better and faster if you include the latent space techniques.”
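As a rough illustration of that idea, the sketch below compresses each token into a small latent vector and reconstructs keys and values from it, loosely in the spirit of DeepSeek’s Multi-head Latent Attention. The projection names and dimensions are assumptions made for this example, not Deepgram’s or DeepSeek’s actual code.

```python
import numpy as np

def latent_kv_attention(X, Wq, W_down, Wk_up, Wv_up):
    """Compress each token into a small latent vector, then reconstruct keys
    and values from it (illustrative, MLA-style). Only the latent vectors
    would need to be cached, shrinking memory while the attention span
    stays just as broad."""
    Q = X @ Wq
    latent = X @ W_down                  # (seq, d_latent): the compressed representation
    K = latent @ Wk_up                   # keys reconstructed from the latent space
    V = latent @ Wv_up                   # values reconstructed from the latent space
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 8-dimensional tokens compressed into a 2-dimensional latent per token.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Wq = rng.normal(size=(8, 8))
W_down = rng.normal(size=(8, 2))
Wk_up, Wv_up = rng.normal(size=(2, 8)), rng.normal(size=(2, 8))
print(latent_kv_attention(X, Wq, W_down, Wk_up, Wv_up).shape)   # (6, 8)
```

Because only the small latent vectors need to be kept around, the model can hold a richer attention context in less memory, which is the “better and faster” trade Stephenson describes.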
Attention pays
Many enterprise use cases of AI have become synonymous with some form of language generation and language understanding. The models these applications hinge on are prized because they overcome conventional barriers to implementing attention for natural language technologies. As such, state space models, transformers with self-attention, and latent space techniques will likely continue to influence the future of enterprise AI.