Controlling the speed of AI voices
By Jeremy Hadfield, Jeffrey Brooks on January 13, 2025
The rise of artificial intelligence has revolutionized many aspects of our lives, and one area where its impact is increasingly felt is in the realm of voice technology. AI voices are now commonly used in virtual assistants, customer service bots, audiobooks, video games, and various accessibility tools. As this technology continues to evolve, a key question emerges: to what extent can we control the speed (speech rate) of these synthetic voices?
Current capabilities of commercial AI voices
Currently, most commercial AI voice generators offer a basic level of control over speech rate. AI voiceover and voice-generation platforms like Pictory, Murf.ai, and Lovo.ai allow users to adjust the speed of their AI voiceovers. This is typically achieved through a slider or by selecting from predefined speed options.
| Platform | Speed Control Options | Description |
| --- | --- | --- |
| Pictory | Slider | Users can adjust the speed of "Standard" voices by dragging a slider. "Premium" voice speed is not adjustable. |
| Murf.ai | Slider (-50% to +50%) | Speed can be adjusted at the block level or applied to the entire project. |
| Lovo.ai | Speed selector (1.50x, 2.00x, etc.) | Users can change the speed of an entire voiceover or individual voice blocks. |
While these adjustments provide some degree of customization, the level of control offered by current AI voices is often limited.
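Under the hood, many TTS engines express speed adjustments through SSML, the W3C Speech Synthesis Markup Language, whose `<prosody>` element takes a `rate` attribute. As a rough sketch of what a platform's speed slider might translate to (the helper function here is illustrative, not any vendor's actual API), a multiplier like 1.5x can be mapped to an SSML percentage rate:

```python
def rate_to_ssml(text: str, multiplier: float) -> str:
    """Wrap text in an SSML <prosody> tag, expressing a playback
    multiplier (e.g. 1.5x) as the percentage rate SSML expects."""
    if multiplier <= 0:
        raise ValueError("speed multiplier must be positive")
    # SSML's rate attribute accepts a percentage of the default rate:
    # 100% is normal speed, 150% is one-and-a-half times as fast.
    rate_pct = round(multiplier * 100)
    return f'<speak><prosody rate="{rate_pct}%">{text}</prosody></speak>'

print(rate_to_ssml("Welcome back!", 1.5))
# → <speak><prosody rate="150%">Welcome back!</prosody></speak>
```

A slider from -50% to +50%, as in Murf.ai, corresponds to multipliers between 0.5x and 1.5x in this scheme.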
The future of AI voices: Towards fine-grained control
The next generation of AI voice models promises far finer-grained control over speech synthesis. Researchers are developing models that expose a wide range of speech parameters, including speed, enabling AI voices that are not just more controllable but more natural and human-like.
These advancements are focused on several key areas in addition to speed:
Prosody
This refers to the rhythm, stress, and intonation of speech. Future AI voices will allow users to adjust these elements with greater precision, creating more natural and engaging speech patterns.
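Existing SSML markup hints at what this finer prosodic control could look like: pauses for rhythm, emphasis for stress, and pitch shifts for intonation. The sketch below (illustrative only; the option names are assumptions, not any specific platform's API) assembles these standard SSML elements phrase by phrase:

```python
def prosody_markup(phrases):
    """Build SSML from (text, options) pairs, applying per-phrase
    emphasis, pitch shifts, and pauses. Option keys are illustrative."""
    parts = []
    for text, opts in phrases:
        chunk = text
        if "emphasis" in opts:                 # stress, e.g. "strong"
            chunk = f'<emphasis level="{opts["emphasis"]}">{chunk}</emphasis>'
        if "pitch" in opts:                    # intonation, e.g. "+15%"
            chunk = f'<prosody pitch="{opts["pitch"]}">{chunk}</prosody>'
        parts.append(chunk)
        if "pause_ms" in opts:                 # rhythm: a silent gap
            parts.append(f'<break time="{opts["pause_ms"]}ms"/>')
    return "<speak>" + " ".join(parts) + "</speak>"

print(prosody_markup([
    ("Really?", {"pitch": "+15%", "pause_ms": 300}),
    ("I had no idea.", {"emphasis": "strong"}),
]))
```

Future models aim to infer this kind of markup automatically from context, rather than requiring users to annotate every phrase by hand.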
Pronunciation
AI models are being trained to better understand and reproduce the nuances of pronunciation, including accents and dialects. This will enable the creation of AI voices that are more diverse and representative of different language variations.
Emotional Expression
AI systems are being developed to recognize and generate different emotional tones in speech. This will allow users to create AI voices that convey a wider range of emotions, such as joy, confusion, anger, pride, and surprise.
Speaker Identity
Advancements in AI voice cloning technology will enable the creation of highly realistic synthetic voices that closely resemble specific individuals.
The first next-generation AI voice model to be announced is Hume AI's OCTAVE. This model combines the capabilities of several cutting-edge technologies, including OpenAI's Voice Engine, ElevenLabs' TTS Voice Design, and Google DeepMind's NotebookLM. OCTAVE allows fine-grained control over AI voices, including speech rate, personality, accent, and expression. This level of control opens up exciting possibilities for creating AI voices that are more engaging, relatable, and human-like.
How will AI voices impact different industries?
The development of AI voices with fine-grained control has significant implications for various industries and applications:
Entertainment
AI voices can be used to create more realistic and immersive experiences in video games, animated films, and virtual reality environments. For example, AI voices can be used to generate dynamic and naturalistic dialogue for non-player characters in video games, making interactions more engaging and believable.
Accessibility
AI voices can provide personalized and accessible communication tools for individuals with speech impairments or language barriers. Text-to-speech software with AI voices can help people with speech difficulties communicate more effectively, and real-time language translation tools can facilitate cross-cultural communication.
Customer Service
AI-powered chatbots and virtual assistants can offer more natural and engaging interactions with customers. By incorporating more expressive AI voices, these systems can provide more personalized and human-like customer service experiences.
Education
AI voices can be used to create interactive learning experiences and personalized educational content. AI-powered tutors can provide customized feedback and support to students, and AI voices can be used to create engaging educational materials, such as audiobooks and interactive simulations.
Content Creation
AI voices can assist content creators in producing high-quality audio content, such as audiobooks, podcasts, and voiceovers. This can help streamline the content creation process and make it more accessible to a wider range of creators.
Safe and ethical use of AI voices
While the advancements in AI voice technology offer numerous benefits, they also raise ethical concerns. Highly realistic synthetic voices create the potential for misuse, such as impersonation and the spread of misinformation. For example, AI-generated voices could be used to create fake audio recordings of public figures or to deceive people in phishing scams.
To address these concerns, it is crucial to develop safeguards and ethical guidelines for the use of AI voice technology. This may include:
- Transparency: Clearly labeling AI-generated content to distinguish it from authentic human speech.
- Consent: Obtaining consent from individuals before using their voices to create synthetic versions.
- Security: Implementing measures to prevent unauthorized access and misuse of voice cloning technology.
- Education: Raising awareness about the potential risks and ethical implications of AI voice technology.
Conclusion: the importance of AI voice speed control
The ability to control the speed and expression of AI voices is rapidly advancing. While current commercial AI voices offer basic speed control, future models like Hume AI's OCTAVE promise much more fine-grained control over various aspects of speech. This will enable the creation of AI voices that are more expressive, nuanced, and human-like, opening up new possibilities for various applications while also raising important ethical considerations.
The development of fine-grained control in AI voices has the potential to significantly impact human-computer interaction. As AI voices become more expressive and personalized, they can facilitate more natural and engaging communication between humans and machines. This could lead to more seamless integration of AI into our daily lives, with AI assistants, companions, and collaborators becoming more commonplace.
However, it is essential to approach these advancements with a mindful consideration of their ethical implications. By developing responsible guidelines and safeguards, we can ensure that AI voice technology is used to enhance human communication and creativity while mitigating the risks of misuse.