Introducing Voice Control
By Lydia Schooler and Lorenss Martinsons on Dec 2, 2024
- We’re introducing Voice Control, a novel interpretability-based method that brings precise control to AI voice customization without the risks of voice cloning.
- Our tool gives developers control over 10 voice dimensions, labeled “gender,” “assertiveness,” “buoyancy,” “confidence,” “enthusiasm,” “nasality,” “relaxedness,” “smoothness,” “tepidity,” and “tightness.”
- Unlike prompt-based approaches, Voice Control enables continuous adjustments along these dimensions, making fine-grained tuning possible and voice modifications reproducible across sessions.
- We’re releasing Voice Control in beta so that developers can create one-of-a-kind voices for any application; we’re still working to make voice quality fully reliable for extreme parameter combinations.
- Through an intuitive no-code interface, you can easily tinker with this frontier technology to craft the perfect voice for your brand or application.
Faced with an increasingly recognizable set of preset voices from AI providers, creators still struggle to find voices that match their product, brand, or application without compromising on quality.
Today, we're introducing Voice Control, our experimental feature for the Empathic Voice Interface (EVI) that transforms how custom AI voices are created through interpretable, continuous controls.
Why voice control matters
Until today, finding the perfect AI voice for your product has meant compromise: either settling for stock voices that aren’t uniquely suited to your brand’s identity, or wrestling with voice-cloning approaches that are riskier, slower, and often degrade quality. We’re introducing Voice Control so that developers can design their own unique voice in seconds. On our playground, you can now tinker with voice characteristics in real time until you find one that matches your vision; you’ll know it when you hear it.
What started as a research project has evolved into an artistic tool—each voice a unique creation that captures a specific mood, personality, or character.
Interpretable control for voice AI
As scientists working at the intersection of emotion science and AI, our research goal was to develop interpretability tools for speech-language models. What makes this particularly challenging is that people’s perceptions of voices are far more granular than they can articulate in words. Consider how parents can instantly distinguish their child's voice in a playground full of young, squeaky, enthusiastic voices, or how you'd struggle to describe your best friend's voice to a stranger—despite immediately recognizing it yourself. Nuanced, ineffable voice characteristics are not just highly recognizable to humans, but extremely psychologically salient.
Given these constraints, we decided to develop a slider-based approach to voice interpretability and control that reflects the nuances of human voice perception without forcing them through the bottleneck of language.
Modifiable voice attributes
The following attributes can be modified to personalize any of the base voices:
Masculine/Feminine: The vocalization of gender, ranging between more masculine and more feminine.
Assertiveness: The firmness of the voice, ranging between timid and bold.
Buoyancy: The density of the voice, ranging between deflated and buoyant.
Confidence: The assuredness of the voice, ranging between shy and confident.
Enthusiasm: The excitement within the voice, ranging between calm and enthusiastic.
Nasality: The openness of the voice, ranging between clear and nasal.
Relaxedness: The stress within the voice, ranging between tense and relaxed.
Smoothness: The texture of the voice, ranging between smooth and staccato.
Tepidity: The liveliness behind the voice, ranging between tepid and vigorous.
Tightness: The containment of the voice, ranging between tight and breathy.
Each voice attribute can be adjusted relative to the base voice's characteristics. Values range from -100 to 100, with 0 as the default. Setting all attributes to their default values will keep the base voice unchanged.
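To make the parameter semantics concrete, here is a minimal Python sketch of how such a specification could be represented client-side. The VoiceSpec class, clamp helper, and base-voice name are illustrative assumptions, not part of our API; only the attribute names, the [-100, 100] range, and the 0 default come from the description above.

```python
from dataclasses import dataclass, field

# The ten slider names from the list above.
ATTRIBUTES = (
    "gender", "assertiveness", "buoyancy", "confidence", "enthusiasm",
    "nasality", "relaxedness", "smoothness", "tepidity", "tightness",
)

def clamp(value: float, lo: float = -100.0, hi: float = 100.0) -> float:
    """Keep a slider value inside the supported [-100, 100] range."""
    return max(lo, min(hi, value))

@dataclass
class VoiceSpec:
    """Hypothetical container: a base voice plus per-attribute offsets."""
    base_voice: str
    attributes: dict = field(default_factory=dict)  # attribute -> offset

    def set(self, name: str, value: float) -> None:
        if name not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {name}")
        self.attributes[name] = clamp(value)

# A spec with every attribute at 0 (or absent) reproduces the base voice.
spec = VoiceSpec(base_voice="<base voice name>")  # placeholder name
spec.set("assertiveness", 40)   # firmer than the base voice
spec.set("nasality", -25)       # slightly more open than the base voice
```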
These sliders represent perceptual qualities that listeners tend to associate with specific voice characteristics – for instance, what people commonly interpret as a voice that sounds 'confident' or 'feminine' – rather than making claims about someone’s underlying gender or confidence level (after all, these are synthetic voices that don’t correspond to any real person).
Disentangling voice characteristics
One of our core technical achievements is ensuring that, in general, modifications to one voice characteristic don't influence others. This is particularly challenging because many voice attributes are highly correlated across real speakers, so we developed a new, unsupervised approach that preserves most characteristics of each base voice while specific parameters are varied.
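Our unsupervised method itself is beyond the scope of this post, but the general idea of disentangled controls can be illustrated with a classic technique: orthogonalizing per-attribute directions in a voice-embedding space, so that moving along one slider's direction adds nothing along the others. The NumPy sketch below is purely illustrative; the embedding size, random directions, and scaling are stand-ins, not our production system.

```python
import numpy as np

rng = np.random.default_rng(0)
raw_directions = rng.normal(size=(10, 256))  # one direction per slider (stand-ins)

def orthogonalize(directions: np.ndarray) -> np.ndarray:
    """Gram-Schmidt: make each direction orthogonal to all earlier ones."""
    basis = []
    for d in directions:
        for b in basis:
            d = d - np.dot(d, b) * b      # remove the component along b
        basis.append(d / np.linalg.norm(d))
    return np.stack(basis)

directions = orthogonalize(raw_directions)

def apply_sliders(base_embedding: np.ndarray, sliders: np.ndarray,
                  scale: float = 0.01) -> np.ndarray:
    """Shift the base voice embedding along each disentangled direction."""
    return base_embedding + scale * sliders @ directions

base = rng.normal(size=256)       # stand-in for a base voice embedding
sliders = np.zeros(10)
sliders[1] = 40.0                 # e.g., assertiveness = +40, all else unchanged
modified = apply_sliders(base, sliders)
```

Because the directions are mutually orthogonal, the +40 shift above contributes zero movement along the other nine directions, which is the behavior the sliders are designed to have.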
Implementation and integration
Voice Control is immediately available through our platform. The creation process is straightforward, as sketched in code after the steps below:
- Select a base voice as your starting point
- Adjust the voice attributes using intuitive sliders
- Preview your changes in real-time
- Deploy your custom voice through the EVI configuration
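Translated to code, the same workflow might look like the following sketch. The endpoint path, payload fields, and response shape are assumptions made for illustration, not the documented EVI API; consult the platform documentation for the real interface.

```python
import os
import requests

API_KEY = os.environ["HUME_API_KEY"]      # keep credentials out of source code
BASE_URL = "https://api.hume.ai"          # placeholder base URL

payload = {
    "name": "brand-voice-v1",
    "base_voice": "<base voice name>",    # step 1: pick a starting point
    "attributes": {                       # step 2: adjust the sliders
        "assertiveness": 40,
        "relaxedness": 20,
        "nasality": -25,
    },
}

# Step 4: persist the custom voice so an EVI configuration can reference it.
resp = requests.post(
    f"{BASE_URL}/v0/evi/custom_voices",   # hypothetical endpoint
    headers={"X-Hume-Api-Key": API_KEY},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
voice_id = resp.json().get("id")
print("custom voice id:", voice_id)
```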
The system ensures that voice customizations are:
- Reproducible across sessions
- Stable across different utterances
- Computationally efficient for real-time applications
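Reproducibility in particular is easy to sanity-check: because a custom voice is fully specified by its base voice plus slider values, re-fetching it should return exactly the spec you saved. Continuing the hypothetical sketch above:

```python
# Round-trip check: the saved spec should come back unchanged, so the
# same slider values yield the same voice in every session.
fetched = requests.get(
    f"{BASE_URL}/v0/evi/custom_voices/{voice_id}",  # hypothetical endpoint
    headers={"X-Hume-Api-Key": API_KEY},
    timeout=30,
).json()

assert fetched["attributes"] == payload["attributes"]
```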
What's next
This release marks just the beginning of our vision for voice customization. We're actively working on:
- Expanding our range of base voices
- Introducing additional interpretable dimensions
- Enhancing preservation of voice characteristics under extreme modifications
- Developing advanced tools for analyzing and visualizing voice characteristics
Learn more: Transform AI interactions with EVI. Create customizable, emotionally intelligent voice AI for any industry, and build applications that better understand and respond to human emotional behavior. Start building more engaging AI apps today.