Comparing the world’s first voice-to-voice AI models
By Jeremy Hadfield on Sep 11, 2024
Imagine if you could speak naturally to any product or app—four times faster than typing—and it could talk back. Imagine if, based not just on what you said but also how you said it, it did what you wanted it to do. That’s what voice-to-voice foundation models, the latest major breakthrough in AI, will enable for many, if not most, products and services in the coming months and years.
The world’s first working voice-to-voice models are Hume AI's Empathic Voice Interface 2 (EVI 2) and OpenAI's GPT-4o Advanced Voice Mode. EVI 2 was publicly released in September 2024, available as an app and an API that developers can build on. GPT-4o voice was previewed to a small number of ChatGPT users in mid-2024, and released for developers as the Realtime API in October 2024. Here we explore the similarities, differences, and potential applications of these systems.
What are voice-to-voice AI models?
Voice-to-voice AI models apply the same principles as large language models (LLMs), but they directly process audio of the human voice instead of text. Whereas large language models are trained on millions of pages of text, voice-to-voice models are trained on millions of hours of recorded voice data. These models enable users to speak with AI through voice alone.
In many ways, these new voice-to-voice models bring to fruition what legacy technologies like Siri and Alexa had long promised. Siri and Alexa were presented as general-purpose voice understanding systems, with the capability to fulfill arbitrary voice queries. Unfortunately, Siri and Alexa were not actually powered by general-purpose voice AI models, but by traditional computer programs that generated fixed responses to a hardcoded set of keywords.
As general-purpose systems that can fulfill arbitrary voice queries, voice-to-voice models make possible, for the first time, the things people always wished Siri and Alexa could do. Since these kinds of voice assistants were first launched over a decade ago, many have forgotten what made them so exciting to begin with. Voice is how humans interact with each other, our most natural modality for communication. Consider:
- The average person speaks at about 150 words per minute but types at only about 40. Voice makes interacting with computers, especially for input, much faster.
- Speech recognition error rates have dropped more than fivefold since 2012; accuracy now rivals or exceeds human transcription.
- Voice-to-voice models have the potential to democratize computing for the 773 million adults worldwide who cannot read or write.
- For the 2.2 billion people with visual impairments, voice-to-voice models are not just convenient; they can become a primary gateway to digital interaction.
Voice-to-voice models will allow billions more people to use state-of-the-art technology with seamless communication. Within a decade, our current interfaces may feel as outdated as command-line interfaces in a GUI world.
Comparing EVI 2 and GPT-4o voice
Similarities
EVI 2 and GPT-4o voice have many capabilities in common. Both are multimodal language models that can process both audio and language and output both voice and language. As a result, both can converse rapidly and fluently with sub-second response times, understand a user’s tone of voice, generate any tone of voice, and even handle more niche requests like changing their speaking rate or rapping. Voice-to-voice models overcome the inherent limitations of traditional stitched-together systems that rely on separate steps for transcription, language modeling, and text-to-speech.
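To make that contrast concrete, here is a minimal, purely illustrative sketch of the two architectures. The function names are placeholders, not a real SDK; the point is the shape of the data flow, not the implementation.

```typescript
// Placeholder signatures for illustration only (not a real SDK):
declare function transcribe(audio: ArrayBuffer): Promise<string>;        // speech-to-text
declare function generateReply(text: string): Promise<string>;           // text-only LLM
declare function synthesizeSpeech(text: string): Promise<ArrayBuffer>;   // text-to-speech
declare function voiceToVoice(audio: ArrayBuffer): Promise<ArrayBuffer>; // single voice model

// Traditional "stitched" pipeline: three sequential models, each adding latency,
// and the intermediate transcript discards tone, emphasis, and emotion.
async function stitchedPipeline(userAudio: ArrayBuffer): Promise<ArrayBuffer> {
  const transcript = await transcribe(userAudio);
  const replyText = await generateReply(transcript);
  return synthesizeSpeech(replyText);
}

// Voice-to-voice model: one model consumes audio and produces audio directly,
// so vocal cues are preserved end to end and responses stay sub-second.
async function voiceToVoicePipeline(userAudio: ArrayBuffer): Promise<ArrayBuffer> {
  return voiceToVoice(userAudio);
}
```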
Differences
EVI 2 is optimized for emotional intelligence. EVI 2 excels at anticipating and adapting to users’ preferences, a result of its specialized training for emotional intelligence. It leverages Hume's research on human expression to interpret subtle emotional cues in the user's voice, then uses those cues to generate more empathic responses that better support the user’s well-being. In contrast, while ChatGPT voice is capable of interpreting tone and responding with an emotional tone of voice, it does not have the same depth of focus on emotional intelligence as EVI 2, and does not appear to be trained to promote the user’s well-being. Further, EVI 2 provides emotional expression measures, grounded in a decade of research, for all of the user's speech.
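As an illustration, a developer consuming these expression measures might surface the strongest signals from each user turn. The score format below is an assumption made for the sake of the sketch; consult the EVI documentation for the exact message schema.

```typescript
// Assumed shape: a map from expression name to a 0..1 score for one user turn.
interface ProsodyScores {
  [expression: string]: number; // e.g. "Amusement": 0.72, "Doubt": 0.05, ...
}

// Return the n strongest expression signals, formatted for logging or prompting.
function topExpressions(scores: ProsodyScores, n = 3): string[] {
  return Object.entries(scores)
    .sort(([, a], [, b]) => b - a)
    .slice(0, n)
    .map(([name, score]) => `${name} (${score.toFixed(2)})`);
}

// Example with made-up scores from a single user turn:
const exampleScores: ProsodyScores = { Amusement: 0.72, Interest: 0.41, Doubt: 0.05 };
console.log(topExpressions(exampleScores)); // ["Amusement (0.72)", "Interest (0.41)", "Doubt (0.05)"]
```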
EVI 2 is trained to maintain compelling personalities. Hume’s speech-language model is trained to maintain characters and personalities that are fun and interesting to interact with. On the other hand, GPT-4o voice is currently restricted to a small set of prototypical “AI assistant” personalities.
EVI 2 is customizable. Where the Realtime API has eight preset voices with relatively static personalities, EVI 2 can emulate an infinite number of personalities, including accents and speaking styles, with flexible prompting and voice modulation tools. We developed a novel voice modulation approach that allows anyone to adjust EVI 2's eight (and counting) base voices along a number of continuous scales, including gender, nasality, pitch, and more. This allows developers to create any custom voice, not just choose from a limited set.
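As a rough illustration of what voice modulation looks like in practice, a custom voice can be described as a base voice plus offsets along continuous scales. The field names, value ranges, and the base voice name below are illustrative rather than the exact EVI configuration schema.

```typescript
// Illustrative sketch of a voice-modulation configuration (not the documented schema).
interface VoiceModulation {
  baseVoice: string;   // one of EVI 2's base voices; name here is a placeholder
  gender?: number;     // continuous scale, e.g. -1 (more masculine) .. 1 (more feminine)
  nasality?: number;   // -1 .. 1
  pitch?: number;      // -1 (lower) .. 1 (higher)
}

// A hypothetical support-agent voice: slightly higher pitch, less nasal.
const supportAgentVoice: VoiceModulation = {
  baseVoice: "BASE_VOICE_1",
  gender: 0.3,
  nasality: -0.5,
  pitch: 0.2,
};
```

Because the scales are continuous rather than preset, two developers starting from the same base voice can end up with entirely different, reproducible voices for their applications.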
EVI 2 is designed for developers. Available through our API, EVI 2 is designed for developers, with a customizable voice and personality that can be tailored to specific apps and users. It also includes features like tool use, phone calling, custom language models, a wide variety of conversational controls, and comprehensive documentation. In contrast, OpenAI's voice models were designed for ChatGPT, with particular “AI assistant” personalities that can be hard to adjust. EVI has been available as an API since early 2024, and has since been tested by thousands of developers and expanded with new features, while the Realtime API still has a limited feature set.
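A minimal connection sketch is shown below. The WebSocket endpoint, query parameters, and message type names reflect our understanding of the EVI API at the time of writing; treat them as assumptions and check the documentation for the current schema.

```typescript
// Connect to an EVI chat session over a raw WebSocket (endpoint and params assumed).
const apiKey = "<YOUR_HUME_API_KEY>";          // from the Hume platform
const configId = "<YOUR_EVI_CONFIG_ID>";       // a saved voice/personality configuration

const socket = new WebSocket(
  `wss://api.hume.ai/v0/evi/chat?api_key=${apiKey}&config_id=${configId}`
);

socket.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data as string);
  if (msg.type === "audio_output") {
    // msg.data: base64-encoded audio to play back to the user (assumed shape)
  } else if (msg.type === "assistant_message") {
    console.log("EVI said:", msg.message?.content);
  }
});

// Microphone audio is captured and streamed to the socket as the user speaks (not shown).
```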
EVI 2 is designed for scale. The Realtime API costs $0.06 per minute of audio input and $0.24 per minute of audio output; assuming a conversation that is roughly half input and half output, that works out to about $0.15/min, or $9/hour. In contrast, the EVI API costs $0.072/min of conversation, or $4.32/hour, roughly half the price of OpenAI's voice offering. Further, Hume offers discounts at scale and a grant program for startups. The significant price difference, along with the fact that the Realtime API is still an early-stage beta, makes the empathic voice interface API far more practical for real-world applications. For companies looking to deploy voice AI at scale, this cost can be the difference between a profitable product and an unsustainable one.
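A quick back-of-the-envelope comparison, using the prices above, assuming a conversation that is roughly half user audio and half AI audio, and a hypothetical 10,000 hours of monthly usage:

```typescript
// Per-minute prices quoted above.
const realtimeInputPerMin = 0.06;   // $ per minute of audio input
const realtimeOutputPerMin = 0.24;  // $ per minute of audio output
const eviPerMin = 0.072;            // $ per minute of conversation

// Assume roughly half of each conversation minute is input and half is output.
const realtimePerMin = 0.5 * realtimeInputPerMin + 0.5 * realtimeOutputPerMin; // ≈ $0.15/min

const hoursPerMonth = 10_000; // hypothetical: e.g. a support line handling 10,000 hours of calls
console.log("Realtime API: $" + (realtimePerMin * 60 * hoursPerMonth).toFixed(0)); // ≈ $90,000/month
console.log("EVI API:      $" + (eviPerMin * 60 * hoursPerMonth).toFixed(0));      // ≈ $43,200/month
```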
EVI 2 can be used with any LLM. While EVI 2 generates its own language, it is designed to be flexible and interoperable with other LLMs. This includes supplemental LLMs from OpenAI, Anthropic, Google, Meta, or any other provider. It also enables custom language models, allowing developers to bring their own LLMs or generate fixed responses. This flexibility enables developers to leverage the strengths of different LLMs while still benefiting from EVI's voice and empathic AI capabilities. In contrast, the Realtime API is tightly integrated with OpenAI's ecosystem. It cannot be used with non-OpenAI models. Thus, it is only well-suited for applications where GPT-4o’s responses are preferred over other LLMs like Gemini or Claude.
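For example, a developer might pair EVI's voice with Claude for language generation. The handler below is a schematic sketch of the bring-your-own-LLM pattern, not the exact custom language model interface: the Anthropic SDK call is standard, but how EVI invokes the handler, and the parameters it passes, are simplified assumptions here.

```typescript
import Anthropic from "@anthropic-ai/sdk"; // reads ANTHROPIC_API_KEY from the environment

const anthropic = new Anthropic();

// Hypothetical per-turn handler: receives the transcribed user turn plus the
// strongest vocal expression cues, and returns the text EVI should speak.
async function generateTurn(userText: string, topExpressions: string[]): Promise<string> {
  const response = await anthropic.messages.create({
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 300,
    system: "You are a warm, concise voice assistant.",
    messages: [
      {
        role: "user",
        content: `${userText}\n\n(Detected vocal expressions: ${topExpressions.join(", ")})`,
      },
    ],
  });
  const block = response.content[0];
  return block.type === "text" ? block.text : "";
}
// EVI then speaks the returned text with the configured voice, tone, and personality.
```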
GPT-4o supports more languages. OpenAI’s voice offering supports input and output in a wide range of languages. Similarly, EVI 2’s architecture allows for voice and text generation in any language, but the small model is currently only fluent in English, Spanish, French, German, and Polish. Many more languages will be added in the coming months.
The use cases for EVI 2
Voice-to-voice models are set to transform a wide range of products and services over the coming months.
Customer service. For customer-facing businesses, voice-to-voice models can provide 24/7 support with unprecedented empathy and understanding. This matters: businesses lose an estimated $75 billion annually to poor customer service (source), and 81% of customers prefer self-service options, according to Harvard Business Review (source). Businesses also cannot answer every incoming call, which results in significant missed revenue: small and medium-sized businesses miss between 22% and 62% of all incoming calls (source). In the automotive industry, for example, a single missed call can represent a $220 lost opportunity, and the average automotive business loses $49,000 per year in revenue leakage from unanswered calls (source). Voice AI models can help businesses recover this revenue by answering every call, and empathic voice AI can go further by actually satisfying customers and resolving their issues.
A more efficient interface for virtually any application. Voice-to-voice models can significantly boost productivity by allowing hands-free, natural-language interactions with complex systems. Because voice input is about four times faster than typing, and a voice interface can trigger any action rather than just the ones presented on a specific UI page, this may unlock an order-of-magnitude increase in productivity. Any application can add a voice interface to accelerate interaction speed and improve accessibility for millions of users.
Mental health, education, and personal development. The ability of voice-to-voice models to understand context and emotional cues opens up possibilities in specific fields like mental health, education, and personal development. The market for AI-powered mental health apps is forecast to reach $8 billion by 2025 (source), showcasing the immense potential for personalized, empathic AI services at scale.
These are just a small selection of example use cases for EVI 2. By enabling any application to add a customizable voice interface, EVI 2 enables countless new uses for voice AI.
Looking forward: the future of EVI
Currently, EVI 2 is available only in one model size: EVI-2-small. We are still making improvements to this model. In the coming weeks, it will become more reliable, learn more languages, follow more complex instructions, and use a wider range of tools. We’re also fine-tuning a larger, upgraded voice-to-voice model we will be announcing soon.
While maintaining or exceeding EVI-2-small’s voice capabilities, this larger model will be more responsive to prompts and excel at complex reasoning. For now, if your application relies on complex reasoning or tool use, we recommend configuring the EVI API to use EVI 2 in conjunction with an external LLM.
EVI 2 represents a critical step forward in our mission to optimize AI for human well-being. We focused on making its voice and personality highly adaptable to give the model more ways to meet users’ preferences and needs. The default personalities of EVI 2 already reflect how the model is optimized for user satisfaction, demonstrating that AI optimized for well-being will have a particularly pleasant and fun personality as a result of its deeper alignment with your goals.
Our ongoing research focuses on automatically optimizing for individual users’ preferences, with methods to fine-tune the model to generate responses that align with ongoing signs of happiness and satisfaction during everyday use of an application.
Voice-to-voice AI models represent a transformative leap in human-computer interaction. We can’t wait to try the delightful user experiences developers build with the EVI 2 API.