Designing custom voices with AI

Published on Jan 14, 2025

The use of artificial intelligence (AI) in voice design is revolutionizing how we interact with technology and experience audio content. AI-powered systems can create synthetic voices with specific characteristics, clone the voices of real people, and even generate entirely new voices with unique qualities. This has led to a surge in creative applications, such as producing realistic voiceovers for films and video games, developing personalized voices for virtual assistants, and restoring the voices of individuals with speech impairments.  

This article explores the current landscape of AI-powered voice design systems, examining the technologies, capabilities, ethical considerations, and potential of this rapidly evolving field. We will delve into the work of various companies and research labs at the forefront of this innovation, including ElevenLabs, Respeecher, Microsoft, Google, AWS, MIT CSAIL, and Hume AI.

Companies and research labs leading the way

| Company/Lab | Technology | Key Features |
| --- | --- | --- |
| ElevenLabs | Voice Design | Generative AI model for creating synthetic voices with prompts to customize characteristics like gender, age, and accent. |
| Hume AI | Voice Control, OCTAVE | Hume AI's speech-to-speech model EVI 2 offers a number of sliders for modulating custom voices. Hume AI's OCTAVE is a next-generation speech-language model that can generate more realistic voices and personalities from prompts. |
| Respeecher | AI Voice Cloning | Creates realistic voice clones for film, TV, animation, and other media. Prioritizes ethical use by obtaining consent and sharing profits with voice actors. |
| Microsoft | Azure AI Speech | Offers speech recognition, text-to-speech, speech translation, and custom voice creation. Includes Personal Voice for creating user-specific voices and Custom Neural Voice for brand voices. |
| Google | Zero-Shot Voice Transfer | Customizes text-to-speech systems with a specific person's voice using minimal audio samples. |
| AWS | Amazon Polly | Converts text to lifelike speech using deep learning technologies. Offers Brand Voice for creating unique brand voices. |

Ethical considerations and potential risks

The development and use of AI voice design systems raise several ethical considerations that must be carefully addressed:

Authenticity and deception

The ability to create highly realistic voice clones raises concerns about the potential for misuse, such as impersonation or generating fake audio recordings that could be used to spread misinformation or manipulate individuals.

Privacy and consent

Using someone's voice to create a synthetic voice or a voice clone without their explicit consent raises significant privacy concerns. This is particularly relevant in cases where voice cloning technology is used to replicate the voices of deceased individuals or to create voices for commercial purposes without proper authorization.

Bias and discrimination

AI models can inherit and amplify biases present in the data they are trained on. This could lead to the creation of voices that perpetuate harmful stereotypes or exhibit discriminatory behavior.

Legal implications

It is crucial to consider the legal ramifications of using AI-generated voices, especially where copyright or other intellectual property rights are at stake. Using a voice clone commercially without proper authorization could expose an organization to infringement claims and other legal consequences.

Job displacement

As AI-generated voices become increasingly sophisticated and accessible, there is a concern that they could displace human voice actors and other professionals in industries like entertainment, advertising, and customer service.

Capabilities and limitations

Capabilities

Realism and expressiveness

AI voice design systems are capable of creating highly realistic and expressive synthetic voices that can convey a wide range of emotions and nuances. Advancements in AI technology have significantly narrowed the quality gap between synthetic voices and human recordings, making them nearly indistinguishable in some cases.  

Voice cloning accuracy

AI-powered voice cloning technology can replicate existing voices with remarkable accuracy, capturing the unique characteristics and nuances of a speaker's voice.

Voice customization

AI allows for the generation of voices with specific characteristics, such as age, gender, accent, and emotional tone, providing greater flexibility and control over the final output.

Natural-sounding speech from text

AI-powered text-to-speech (TTS) systems can produce natural-sounding speech from written text, making it possible to create audio content from any text source.

Limitations

Technical challenges

Some AI voice design systems may still struggle with accurately replicating certain sounds, accents, or complex emotions.

Data dependency

The quality of generated voices can vary depending on the quality and diversity of the data used to train the AI models.

Ethical concerns

As mentioned earlier, the ethical implications of AI voice design need to be carefully considered to prevent misuse and ensure responsible development and application.

Approaches to AI voice design

AI voice design systems employ various approaches to create and manipulate voices:

Prompt-based models

These models, such as ElevenLabs' Voice Design and Hume AI’s OCTAVE, utilize AI algorithms to generate new voices from scratch based on user-defined parameters. This approach allows for the creation of unique voices with specific characteristics.
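As a rough sketch of how this looks in practice, a prompt-based request usually amounts to sending a natural-language description of the desired voice to a generation endpoint and receiving candidate voices back. The endpoint, field names, and response shape below are hypothetical placeholders rather than any specific vendor's API; consult the provider's documentation for the real interface.

```python
import requests

# Hypothetical prompt-based voice design request. The URL and field names
# are illustrative only, not ElevenLabs' or Hume AI's actual API.
API_URL = "https://api.example-voice-provider.com/v1/voice-design"  # placeholder
API_KEY = "YOUR_API_KEY"

payload = {
    # Natural-language prompt describing the voice to generate.
    "description": "A warm, middle-aged female narrator with a soft Irish accent",
    # Sample text the provider renders so you can audition each candidate.
    "sample_text": "Welcome back. Shall we pick up where we left off?",
    "num_candidates": 3,
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()

# Hypothetical response: candidate voices with IDs and preview audio URLs.
for candidate in response.json().get("voices", []):
    print(candidate["voice_id"], candidate["preview_url"])
```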

Voice cloning

This technique involves creating a digital replica of an existing voice by analyzing and processing recordings of that voice. Companies like Respeecher specialize in this approach, which has applications in entertainment, accessibility, and historical preservation.
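Production cloning pipelines are proprietary, but a common first step is to distill a reference recording into a compact speaker embedding that later conditions the synthesis model. The sketch below shows that step with the open-source resemblyzer package; this is an illustrative building block chosen here, not the method Respeecher or any other vendor necessarily uses.

```python
# pip install resemblyzer  (assumed open-source dependency, for illustration only)
from resemblyzer import VoiceEncoder, preprocess_wav

# Load and normalize a short reference recording of the target speaker.
reference = preprocess_wav("reference_speaker.wav")

# Encode the recording into a fixed-size speaker embedding.
encoder = VoiceEncoder()
embedding = encoder.embed_utterance(reference)

# In a full cloning pipeline, this embedding would condition a TTS or
# voice-conversion model so its output matches the reference voice.
print(embedding.shape)
```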

Voice conversion

This method modifies an existing voice to change its characteristics, such as pitch, tone, or accent. This can be used to create variations of a voice or to adapt a voice to different contexts.
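Modern voice conversion relies on neural models well beyond a short snippet, but the simplest signal-level version of the idea, shifting a recording's pitch while keeping the words intact, can be sketched with librosa. Treat this as a toy illustration of "modifying an existing voice," not how AI voice-conversion systems actually work.

```python
# pip install librosa soundfile
import librosa
import soundfile as sf

# Load a recording of the original voice at its native sample rate.
audio, sr = librosa.load("original_voice.wav", sr=None)

# Raise the pitch by three semitones while keeping the duration unchanged.
# Neural voice conversion instead learns to remap timbre, accent, and
# speaking style between speakers, which simple pitch shifting cannot do.
shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=3)

sf.write("converted_voice.wav", shifted, sr)
```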

Text-to-speech (TTS)

TTS systems convert written text into spoken audio. AI is used to generate speech with a high degree of naturalness, expressiveness, and overall quality.
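As one concrete example of a TTS call, the sketch below uses Amazon Polly (listed in the table above) through the boto3 SDK; the chosen voice and output settings are illustrative, and the other providers discussed here expose broadly comparable APIs.

```python
# pip install boto3  (AWS credentials must be configured, e.g. via environment variables)
import boto3

polly = boto3.client("polly", region_name="us-east-1")

# Convert a short piece of text into lifelike speech with a neural voice.
response = polly.synthesize_speech(
    Text="AI voice design is transforming how we create and experience audio.",
    VoiceId="Joanna",      # one of Polly's built-in English voices
    Engine="neural",       # request the higher-quality neural engine
    OutputFormat="mp3",
)

# The audio is returned as a streaming body; write it out to a file.
with open("speech.mp3", "wb") as audio_file:
    audio_file.write(response["AudioStream"].read())
```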

Research and development in AI voice design

Ongoing research plays a crucial role in advancing the field of AI voice design. Here are some notable research efforts:

Impact of AI-synthesized voices on consumer behavior

Research has shown that the design of AI-synthesized voices can systematically influence consumer perception and behavior. Studies have explored how factors like the number of AI voices used in marketing messages and the familiarity of consumers with a product category can affect their responses to AI-generated voices.

Voice cloning techniques

Researchers are actively exploring new techniques for voice cloning using AI and machine learning. This includes developing methods for integrating emotions, improving background noise adaptation, and enabling multi-speaker voice cloning for enhanced speech synthesis.

Hume AI's EVI 2 model: Interactive voice control

Hume AI's EVI 2 model introduces an innovative approach to voice control using an intuitive slider-based interface. This technology allows for precise adjustments to various vocal characteristics, enabling users to fine-tune AI voices to their liking.  

Key features of EVI 2's voice control

Slider-based interface

EVI 2 provides an easy-to-use interface with sliders that control various nuanced aspects of the voice, such as masculinity/femininity, assertiveness, buoyancy, confidence, enthusiasm, nasality, relaxedness, smoothness, tepidity, and tightness.  
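To make the slider model concrete, the snippet below shows one plausible way such controls could be represented in code: each attribute named above mapped to a value in a normalized range. This structure and the [-1, 1] range are assumptions for illustration, not Hume's actual API schema.

```python
# Hypothetical slider positions for voice modulation; attribute names follow
# the list above, but the structure and value range are illustrative only.
voice_settings = {
    "femininity": 0.6,      # negative values would skew more masculine
    "assertiveness": 0.3,
    "buoyancy": 0.5,
    "confidence": 0.7,
    "enthusiasm": 0.4,
    "nasality": -0.2,
    "relaxedness": 0.1,
    "smoothness": 0.5,
    "tepidity": -0.3,
    "tightness": 0.0,
}

# In a real-time interface, each slider change would trigger re-synthesis of a
# short sample with the updated settings, giving the immediate feedback
# described below.
for attribute, value in voice_settings.items():
    assert -1.0 <= value <= 1.0, f"{attribute} is outside the assumed range"
```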

Real-time adjustments

Users can hear the changes in the voice in real-time as they adjust the sliders, allowing for immediate feedback and precise customization.

Wide range of applications

This technology has potential applications in various fields, including virtual assistants, customer service, and content creation.

Hume AI's OCTAVE model

Hume AI has developed a groundbreaking speech-language model called OCTAVE, which pushes the boundaries of AI voice design. OCTAVE combines the capabilities of Hume AI's EVI 2 speech-language model with capabilities found in other cutting-edge voice and language design technologies, including OpenAI's Voice Engine, ElevenLabs' TTS Voice Design, and Google DeepMind's NotebookLM. Notably, OCTAVE can generate voice and personality clones in real time based on brief audio samples.

Key features of OCTAVE

Custom voice and personality generation

OCTAVE can generate custom voices and personalities from simple prompts or recordings as short as 5 seconds. This allows users to create unique voices with specific characteristics and personalities for various applications.  

Real-time interactions

OCTAVE can create real-time interactions by generating dialogue for multiple speakers and seamlessly switching between them. This capability opens up possibilities for interactive storytelling, personalized virtual assistants, synthesized podcasts, and more engaging human-computer interactions.  
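One plausible way to drive this kind of multi-speaker generation is to hand the model a structured script that interleaves voice descriptions (or previously created voice IDs) with the lines to be spoken, as in the sketch below; the structure and field names are hypothetical, not OCTAVE's published interface.

```python
# Hypothetical script for multi-speaker dialogue generation. Each turn pairs
# a voice description with a line of dialogue; field names are illustrative.
dialogue = [
    {"voice": "calm female narrator, mid-40s, British accent",
     "text": "The storm had finally passed by the time they reached the harbor."},
    {"voice": "gruff older male fisherman",
     "text": "You're late. The tide won't wait for anyone."},
    {"voice": "calm female narrator, mid-40s, British accent",
     "text": "She smiled, untied the rope, and stepped aboard."},
]

# A generation loop would synthesize each turn in order, switching voices
# between turns, and concatenate the audio into one conversation.
for turn in dialogue:
    print(f'[{turn["voice"]}] {turn["text"]}')
```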

Voice modulation

Users can fine-tune various aspects of the generated voices, including gender, age, accent, vocal register, emotional intonation, and speaking styles. This level of control allows for precise customization and tailoring of voices to specific needs and preferences.  

Potential applications of OCTAVE

Interactive storytelling

OCTAVE can be used to create dynamic and immersive narratives with unique characters and voices, enhancing storytelling experiences in various media, including video games, audiobooks, and virtual reality applications.

Personalized virtual assistants

OCTAVE enables the development of virtual assistants with distinct personalities and voices tailored to individual users. This can lead to more engaging and personalized interactions with AI assistants.

Accessibility tools

OCTAVE has the potential to create personalized voices for individuals who have lost their voices due to medical conditions or have difficulty speaking. This can significantly improve communication and quality of life for these individuals.  

Gaming and entertainment

OCTAVE can enhance the immersive experience in video games and other entertainment applications by providing realistic and diverse voices for characters and interactive elements.

Education

AI voice design, including technologies like OCTAVE, can be used to create personalized voices for educational purposes, making learning more engaging and accessible for students with diverse needs.  

Voice-driven interfaces

AI voice design can be used to create voice-driven interfaces for devices with limited screen real estate, such as smartwatches, smart speakers, and in-car systems, enabling more natural and intuitive interactions.  

Engaging agents

AI voice design can be used to create engaging agents with spontaneous conversational voices, making interactions with AI systems more dynamic and human-like.

Conclusion

AI-powered voice design is transforming the way we create, experience, and interact with audio. The technology is advancing rapidly, with companies and research labs developing innovative systems with impressive capabilities, and we can expect even more creative applications and groundbreaking advances in the years to come.

Realizing that potential, however, requires addressing the ethical considerations and risks this technology raises to ensure its responsible development and use. Striking a balance between innovation and responsibility will be essential: promoting transparency, obtaining consent, addressing bias, and ensuring fairness in how AI voice design systems are developed and deployed. By navigating these challenges responsibly, we can unlock the transformative power of AI voice design to enhance communication, creativity, and accessibility for all.
