Introducing Voice Control
By Lydia Schooler and Lorenss Martinsons on Dec 2, 2024
- We’re introducing Voice Control, a novel interpretability-based method that brings precise control to AI voice customization without the risks of voice cloning.
- Our tool gives developers control over 10 voice dimensions, labeled “gender,” “assertiveness,” “buoyancy,” “confidence,” “enthusiasm,” “nasality,” “relaxedness,” “smoothness,” “tepidity,” and “tightness.”
- Unlike prompt-based approaches, Voice Control enables continuous adjustments along these dimensions, making fine-grained tuning possible and voice modifications reproducible across sessions.
- We’re releasing Voice Control in beta so that developers can create one-of-a-kind voices for any application; we’re still working to make voice quality fully reliable for extreme parameter combinations.
- Through an intuitive no-code interface, you can easily tinker with this frontier technology to craft the perfect voice for your brand or application.
Faced with an increasingly recognizable set of preset voices from AI providers, creators still struggle to find voices that match their product, brand, or application without compromising on quality.
Today, we're introducing Voice Control, our experimental feature for the Empathic Voice Interface (EVI) that transforms how custom AI voices are created through interpretable, continuous controls.
Why voice control matters
Until today, finding the perfect AI voice for your product has meant compromise: either settling for stock voices that aren’t uniquely suited to your brand’s identity, or wrestling with voice-cloning approaches that are riskier, slower, and often degrade quality. We’re introducing Voice Control so that developers can design their own unique voice in seconds. On our playground, you can now tinker with voice characteristics in real time until you find one that matches your vision; you’ll know it when you hear it.
What started as a research project has evolved into an artistic tool—each voice a unique creation that captures a specific mood, personality, or character.
Interpretable control for voice AI
As scientists working at the intersection of emotion science and AI, our research goal was to develop interpretability tools for speech-language models. What makes this particularly challenging is that people’s perceptions of voices are far more granular than they can articulate in words. Consider how parents can instantly distinguish their child's voice in a playground full of young, squeaky, enthusiastic voices, or how you'd struggle to describe your best friend's voice to a stranger—despite immediately recognizing it yourself. Nuanced, ineffable voice characteristics are not just highly recognizable to humans, but extremely psychologically salient.
Given these constraints, we decided to develop a slider-based approach to voice interpretability and control that reflects the nuances of human voice perception without forcing them through the bottleneck of language.
Modifiable voice attributes
The following attributes can be modified to personalize any of the base voices:
Masculine/Feminine: The vocalization of gender, ranging between more masculine and more feminine.
Assertiveness: The firmness of the voice, ranging between timid and bold.
Buoyancy: The density of the voice, ranging between deflated and buoyant.
Confidence: The assuredness of the voice, ranging between shy and confident.
Enthusiasm: The excitement within the voice, ranging between calm and enthusiastic.
Nasality: The openness of the voice, ranging between clear and nasal.
Relaxedness: The stress within the voice, ranging between tense and relaxed.
Smoothness: The texture of the voice, ranging between smooth and staccato.
Tepidity: The liveliness behind the voice, ranging between tepid and vigorous.
Tightness: The containment of the voice, ranging between tight and breathy.
Each voice attribute can be adjusted relative to the base voice's characteristics. Values range from -100 to 100, with 0 as the default. Setting all attributes to their default values will keep the base voice unchanged.
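To make the parameter semantics concrete, here is a minimal Python sketch of how such a specification could be represented client-side. The VoiceSpec class, clamp helper, and base-voice name are illustrative assumptions, not part of our API; only the attribute names, the [-100, 100] range, and the 0 default come from the description above.

```python
from dataclasses import dataclass, field

# The ten slider names from the list above.
ATTRIBUTES = (
    "gender", "assertiveness", "buoyancy", "confidence", "enthusiasm",
    "nasality", "relaxedness", "smoothness", "tepidity", "tightness",
)

def clamp(value: float, lo: float = -100.0, hi: float = 100.0) -> float:
    """Keep a slider value inside the supported [-100, 100] range."""
    return max(lo, min(hi, value))

@dataclass
class VoiceSpec:
    """Hypothetical container: a base voice plus per-attribute offsets."""
    base_voice: str
    attributes: dict = field(default_factory=dict)  # attribute -> offset

    def set(self, name: str, value: float) -> None:
        if name not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {name}")
        self.attributes[name] = clamp(value)

# A spec with every attribute at 0 (or absent) reproduces the base voice.
spec = VoiceSpec(base_voice="<base voice name>")  # placeholder name
spec.set("assertiveness", 40)   # firmer than the base voice
spec.set("nasality", -25)       # slightly more open than the base voice
```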
These sliders represent perceptual qualities that listeners tend to associate with specific voice characteristics – for instance, what people commonly interpret as a voice that sounds 'confident' or 'feminine' – rather than making claims about someone’s underlying gender or confidence level (after all, these are synthetic voices that don’t correspond to any real person).
Disentangling voice characteristics
One of our core technical achievements is ensuring that, in general, modifications to one voice characteristic don't influence others. This is particularly challenging because many voice attributes are highly correlated across real speakers, so we developed a new, unsupervised approach that preserves most characteristics of each base voice while specific parameters are varied.
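Our unsupervised method itself is beyond the scope of this post, but the general idea of disentangled controls can be illustrated with a classic technique: orthogonalizing per-attribute directions in a voice-embedding space, so that moving along one slider's direction adds nothing along the others. The NumPy sketch below is purely illustrative; the embedding size, random directions, and scaling are stand-ins, not our production system.

```python
import numpy as np

rng = np.random.default_rng(0)
raw_directions = rng.normal(size=(10, 256))  # one direction per slider (stand-ins)

def orthogonalize(directions: np.ndarray) -> np.ndarray:
    """Gram-Schmidt: make each direction orthogonal to all earlier ones."""
    basis = []
    for d in directions:
        for b in basis:
            d = d - np.dot(d, b) * b      # remove the component along b
        basis.append(d / np.linalg.norm(d))
    return np.stack(basis)

directions = orthogonalize(raw_directions)

def apply_sliders(base_embedding: np.ndarray, sliders: np.ndarray,
                  scale: float = 0.01) -> np.ndarray:
    """Shift the base voice embedding along each disentangled direction."""
    return base_embedding + scale * sliders @ directions

base = rng.normal(size=256)       # stand-in for a base voice embedding
sliders = np.zeros(10)
sliders[1] = 40.0                 # e.g., assertiveness = +40, all else unchanged
modified = apply_sliders(base, sliders)
```

Because the directions are mutually orthogonal, the +40 shift above contributes zero movement along the other nine directions, which is the behavior the sliders are designed to have.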
Implementation and integration
Voice Control is immediately available through our platform. The creation process is straightforward, as sketched in code after the steps below:
- Select a base voice as your starting point
- Adjust the voice attributes using intuitive sliders
- Preview your changes in real-time
- Deploy your custom voice through the EVI configuration
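Translated to code, the same workflow might look like the following sketch. The endpoint path, payload fields, and response shape are assumptions made for illustration, not the documented EVI API; consult the platform documentation for the real interface.

```python
import os
import requests

API_KEY = os.environ["HUME_API_KEY"]      # keep credentials out of source code
BASE_URL = "https://api.hume.ai"          # placeholder base URL

payload = {
    "name": "brand-voice-v1",
    "base_voice": "<base voice name>",    # step 1: pick a starting point
    "attributes": {                       # step 2: adjust the sliders
        "assertiveness": 40,
        "relaxedness": 20,
        "nasality": -25,
    },
}

# Step 4: persist the custom voice so an EVI configuration can reference it.
resp = requests.post(
    f"{BASE_URL}/v0/evi/custom_voices",   # hypothetical endpoint
    headers={"X-Hume-Api-Key": API_KEY},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
voice_id = resp.json().get("id")
print("custom voice id:", voice_id)
```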
The system ensures that voice customizations are:
- Reproducible across sessions
- Stable across different utterances
- Computationally efficient for real-time applications
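Reproducibility in particular is easy to sanity-check: because a custom voice is fully specified by its base voice plus slider values, re-fetching it should return exactly the spec you saved. Continuing the hypothetical sketch above:

```python
# Round-trip check: the saved spec should come back unchanged, so the
# same slider values yield the same voice in every session.
fetched = requests.get(
    f"{BASE_URL}/v0/evi/custom_voices/{voice_id}",  # hypothetical endpoint
    headers={"X-Hume-Api-Key": API_KEY},
    timeout=30,
).json()

assert fetched["attributes"] == payload["attributes"]
```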
What's next
This release marks just the beginning of our vision for voice customization. We're actively working on:
- Expanding our range of base voices
- Introducing additional interpretable dimensions
- Enhancing preservation of voice characteristics under extreme modifications
- Developing advanced tools for analyzing and visualizing voice characteristics
Learn more: Transform AI interactions with EVI. Create customizable, emotionally intelligent voice AI for any industry, and build applications that better understand and respond to human emotional behavior. Start building more engaging AI apps today.