Introducing OCTAVE (Omni-Capable Text and Voice Engine)
By Hume Research on Dec 23, 2024
A frontier speech-language model with new emergent capabilities, like on-the-fly voice and personality creation
We’re introducing OCTAVE (Omni-Capable Text and Voice Engine), a next-generation speech-language model that combines the capabilities of our EVI 2 speech-language model with those of systems like OpenAI’s Voice Engine, ElevenLabs’ TTS Voice Design, and Google DeepMind’s NotebookLM.
From descriptive prompts or recordings as brief as 5s, OCTAVE generates not just voices, but personalities (language, accent, expressions, underlying disposition, etc.) that can talk to you. And it can generate multiple, interacting AI personalities and voices within a real-time response.
Maintaining the capabilities of a similar-sized frontier LLM, OCTAVE is well-suited to power AI systems that communicate richly with humans while following detailed instructions, using tools, or controlling an interface.
1. Generating not just voices, but personalities from prompts
OCTAVE can generate any voice and personality – and the accompanying language – from a prompt, emulating gender, age, accent, vocal register, emotional intonation, speaking styles related to vocations or roles (“gentle therapist,” “wizard mentor”), and many other characteristics.
Example 1: “A male voice that is extremely gravelly, as if he was gargling hot asphalt.”
Example 2: “The speaker's voice is that of an extremely gentle and empathetic therapist voice, with thoughtful pauses between phrases. Warm, supportive tone that feels like a comforting embrace, speaking just above a whisper.”
Example 3: “The speaker is an English male AI assistant, mid-30s sound, impossibly smooth and sophisticated. Perfect BBC pronunciation with subtle electronic undertones. Slight reverb suggesting spacious room acoustics.”
Example 4: “The speaker is an engaging narrator with Welsh accent. Naturally builds suspense with pacing and tone. Like a favorite uncle telling bedtime stories but with professional voice acting training.”
Example 5: “The speaker is a New Zealand female, mid-40s, with a soothing but grounded therapeutic voice. Deliberately slow pace with extended pauses. Gentle prosody that naturally guides breathing, slight Kiwi accent adds warmth.”
Example 6: “The speaker is a Brooklyn cab driver with a rapid-fire, nasal voice that barrels through conversations like rush hour traffic. The voice carries a thick New York accent with aggressively pronounced vowels and clipped endings.”
Example 7: “The speaker is a wise wizard with a refined British accent that gives weight to every carefully chosen word. The voice carries scholarly authority and frequently employs archaic terms and formal phrasing.”
Example 8: “Authoritative, firm, commanding, direct, crisp, clear.”
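As a concrete sketch of how such descriptive prompts might be sent to a speech-language model, the snippet below assembles a request body pairing a voice-and-personality description with the text the generated persona should speak. The helper name, field names, and default sample rate are all illustrative assumptions, not Hume's actual API.

```python
import json

# Hypothetical request payload for prompt-described voice generation.
# Field names and parameters are illustrative, not Hume's actual API.
def build_voice_prompt_request(description: str, text: str, sample_rate: int = 24000) -> str:
    """Pair a natural-language voice description with the utterance to speak."""
    payload = {
        "voice_description": description,  # e.g. one of the example prompts above
        "text": text,                      # what the generated persona says
        "sample_rate": sample_rate,        # output audio sample rate in Hz
    }
    return json.dumps(payload)

body = build_voice_prompt_request(
    "A male voice that is extremely gravelly, as if he was gargling hot asphalt.",
    "Welcome back. Let's get started.",
)
```

A real client would POST a body like this to the model's generation endpoint and receive audio in return.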
2. Instant voice and personality adoption from recordings
OCTAVE can extract a clean representation of any speaker’s voice, accent, and personality from a noisy recording as brief as 5s, clone these voice qualities in a single shot, and generate clean dialog with the speaker’s voice, all in one step.
Example 1: Cloning Hume's Lauren Kim from a 5s clip
Input Audio:
Continuation generated by OCTAVE:
Example 2: Extending Ilya Sutskever's 2024 NeurIPS talk
Input Audio:
Continuation generated by OCTAVE:
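One-shot cloning like the examples above amounts to supplying a short reference recording alongside a request to continue the speaker's dialog. The sketch below bundles a clip into a hypothetical request body; the field names and flag are assumptions for illustration, not Hume's actual API.

```python
import base64
import json

# Hypothetical payload for one-shot voice cloning from a short (~5s) clip.
# Field names are illustrative assumptions, not Hume's actual API.
def build_clone_request(audio_bytes: bytes, continue_dialog: bool = True) -> str:
    """Bundle a reference recording with a flag asking the model to extend it."""
    payload = {
        "reference_audio": base64.b64encode(audio_bytes).decode("ascii"),
        "continue_dialog": continue_dialog,  # generate dialog in the cloned voice
    }
    return json.dumps(payload)

# Stand-in bytes; a real call would read roughly five seconds of recorded speech.
req = build_clone_request(b"\x00\x01\x02\x03")
```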
3. Interacting with any voice or personality in real-time
Any voice and personality OCTAVE generates or adopts can be used for real-time interaction. For instance, we can append a recording of Hume CEO Alan Cowen to the last example and have OCTAVE respond with its imitation of Ilya's voice and personality:
OCTAVE’s understanding of the interplay between speaking style, expressions, and underlying disposition informs its generated language and voice during real-time interaction. This results in richer and more authentic communication than can be achieved with separate models handling transcription, language response, and speech generation.
4. Generating multiple, interacting characters
Because OCTAVE has full control over the acoustic properties of the voices it generates, it can generate dialog for multiple interacting speakers, switching among them at will. This capability is comparable to that of NotebookLM by Google DeepMind, a system that generates dialog between two specific characters. OCTAVE can clone and generate the same characters using a brief sample from NotebookLM as input.
Example 1: Replicating NotebookLM from a single recording
Input Audio:
Continuation generated by OCTAVE, with an interjection from Hume's Moses Oh:
Unlike NotebookLM, OCTAVE had not previously been calibrated on these voices; it generated them in real time based solely on a brief example.
Example 2: Continuing an OpenAI Advanced Voice Mode Demo
OCTAVE can generate all of the language in its responses on its own, or parts of a response can be controlled by inputting text. Here, we prompt OCTAVE with an audio recording without labeling the speakers, then control the beginning of its response by inputting a question via text.
Input Audio:
Continuation generated by OCTAVE, with initial text injected:
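A mixed-modality prompt like this can be pictured as an ordered sequence of segments: an unlabeled audio recording followed by a text segment that fixes the opening words of the model's spoken response. The segment schema below is a hypothetical illustration of that structure, not Hume's actual API.

```python
import base64
import json

# Hypothetical mixed-modality prompt: audio context, then injected text that
# the model's spoken response must begin with. Schema is an assumption.
def build_mixed_prompt(audio_bytes: bytes, injected_text: str) -> str:
    """Interleave an audio recording with a text prefix for the response."""
    segments = [
        {"type": "audio", "data": base64.b64encode(audio_bytes).decode("ascii")},
        {"type": "text", "data": injected_text},  # model continues from here
    ]
    return json.dumps({"segments": segments})

# Stand-in bytes; a real call would pass the recorded demo audio.
req = build_mixed_prompt(b"\x00\x01", "What do you mean by that?")
```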
Language performance vs. similar-sized LLM
Despite its diverse speech processing and generation capabilities, OCTAVE maintains comparable performance on language understanding tasks to a similar-sized frontier LLM. Note that all responses in this blog post were generated by OCTAVE 3B, demonstrating the capabilities of our smallest model.
OCTAVE's frontier language capabilities ensure that the same source of intelligence that determines its language maintains its personality, hears the voice of the user, and produces nuanced vocal responses. The result is a coherent persona that sounds like it understands what it's saying. The model's language capabilities also mean that it is well-suited to power AI systems that follow detailed instructions, use tools, or control an interface.
Model availability
We are still working to improve OCTAVE, and given its range of new capabilities, we are taking a cautious approach to releasing it. We’ve begun giving trusted partners early access to a limited version of OCTAVE so that the model can be evaluated for safety and effectiveness in various application settings. We plan to roll out broader availability in the coming months.
OCTAVE promises to enable richer, more realistic, and more multifaceted AI experiences than EVI 2. For example, users and developers will be able to craft personas for AI agents, personalize them for individuals or even create them on the fly to answer a particular question, or enable real-time group conversations involving multiple users or AIs. We’re excited to hear what you’d like to see built with it.