Introducing OCTAVE (Omni-Capable Text and Voice Engine)

By Hume Research on Dec 23, 2024

A frontier speech-language model with new emergent capabilities, like on-the-fly voice and personality creation

We’re introducing OCTAVE (Omni-Capable Text and Voice Engine), a next-generation speech-language model that combines the capabilities of our EVI 2 speech-language model with those of systems like OpenAI’s Voice Engine, ElevenLabs’ TTS Voice Design, and Google DeepMind’s NotebookLM.

From descriptive prompts or recordings as brief as 5s, OCTAVE generates not just voices, but personalities (language, accent, expressions, underlying disposition, etc.) that can talk to you. And it can generate multiple, interacting AI personalities and voices within a real-time response.

Maintaining the capabilities of a similar-sized frontier LLM, OCTAVE is well-suited to power AI systems that communicate richly with humans while following detailed instructions, using tools, or controlling an interface.

1. Generating not just voices, but personalities from prompts

OCTAVE can generate any voice and personality – and the accompanying language – from a prompt, emulating gender, age, accent, vocal register, emotional intonation, speaking styles related to vocations or roles (“gentle therapist,” “wizard mentor”), and many other characteristics.

Example 1: “A male voice that is extremely gravelly, as if he was gargling hot asphalt.”

Ex 1 Gravelly Voice

Example 2: “The speaker's voice is that of an extremely gentle and empathetic therapist voice, with thoughtful pauses between phrases. Warm, supportive tone that feels like a comforting embrace, speaking just above a whisper.”

Ex 2 Gentle Therapist

Example 3: “The speaker is an English male AI assistant, mid-30s sound, impossibly smooth and sophisticated. Perfect BBC pronunciation with subtle electronic undertones. Slight reverb suggesting spacious room acoustics.”

Ex 3 English AI Assistant

Example 4: “The speaker is an engaging narrator with Welsh accent. Naturally builds suspense with pacing and tone. Like a favorite uncle telling bedtime stories but with professional voice acting training.”

Ex 4 Welsh Narrator

Example 5: “The speaker is a New Zealand female, mid-40s, with a soothing but grounded therapeutic voice. Deliberately slow pace with extended pauses. Gentle prosody that naturally guides breathing, slight Kiwi accent adds warmth.”

Ex 5 New Zealand Wellness Coach

Example 6: “The speaker is a Brooklyn cab driver with a rapid-fire, nasal voice that barrels through conversations like rush hour traffic. The voice carries a thick New York accent with aggressively pronounced vowels and clipped endings.”

Ex 6 Brooklyn Cab Driver

Example 7: “The speaker is a wise wizard with a refined British accent that gives weight to every carefully chosen word. The voice carries scholarly authority and frequently employs archaic terms and formal phrasing.”

Ex 7 Scholarly Wizard Mentor

Example 8: “Authoritative, firm, commanding, direct, crisp, clear.”

Ex 8 Commanding Authority

2. Instant voice and personality adoption from recordings

OCTAVE can extract a clean representation of any speaker’s voice, accent, and personality from a noisy recording as brief as 5s, clone these voice qualities in a single shot, and generate clean dialog with the speaker’s voice, all in one step.

Example 1: Cloning Hume's Lauren Kim from a 5s clip

Input Audio:

5s Input

Continuation generated by OCTAVE: 

Generated

Example 2: Extending Ilya Sutskever's 2024 NeurIPS talk

Input Audio:

Ilya Input

Continuation generated by OCTAVE: 

Ilya Continuation

3. Interacting with any voice or personality in real-time

Any voice and personality OCTAVE generates or adopts can be used for real-time interaction. For instance, we can append a recording of Hume CEO Alan Cowen to the last example and have OCTAVE respond with its imitation of Ilya's voice and personality:

Octave Responds as Ilya

OCTAVE’s understanding of the interplay between speaking style, expressions, and underlying disposition informs its generated language and voice during real-time interaction. This results in richer and more authentic communication than can be achieved with separate models handling transcription, language response, and speech generation.

4. Generating multiple, interacting characters

Because OCTAVE has full control over the acoustic properties of the voices it generates, it can generate dialog for multiple interacting speakers, switching among them at will. This capability is comparable to that of NotebookLM by Google DeepMind, a system that generates dialog between two specific characters. OCTAVE can clone and generate the same characters using a brief sample from NotebookLM as input.

Example 1: Replicating NotebookLM from a single recording

Input Audio:

Notebooklm Snippet

Continuation generated by OCTAVE, with an interjection from Hume's Moses Oh: 

Continuation and Real Time Interjection

Unlike NotebookLM, OCTAVE had not previously been calibrated on these voices; it generated them in real time based solely on a brief example.

Example 2: Continuing an OpenAI Advanced Voice Mode Demo

OCTAVE can generate all of the language in its responses, or parts of its responses can be controlled by inputting text. Here, we prompt OCTAVE with an audio recording without labeling the speakers, then control the beginning of its response by inputting a question via text.

Input Audio:

Input

Continuation generated by OCTAVE, with initial text injected:

Continuation With a Question Injected

Language performance vs. similar-sized LLM

Despite its diverse speech processing and generation capabilities, OCTAVE performs comparably to a similar-sized frontier LLM on language understanding tasks. Note that all responses in this blog post were generated by OCTAVE 3B, demonstrating the capabilities of our smallest model.

OCTAVE's frontier language capabilities ensure that the same source of intelligence that determines its language also maintains its personality, hears the voice of the user, and produces nuanced vocal responses. The result is a coherent persona that sounds like it understands what it's saying. The model's language capabilities also mean that it is well-suited to power AI systems that follow detailed instructions, use tools, or control an interface on their own.

Model availability

We are still working to improve OCTAVE, and given its range of new capabilities, we are taking a cautious approach to releasing it. We’ve begun giving trusted partners early access to a limited version of OCTAVE so that the model can be evaluated for safety and effectiveness in various application settings. We plan to roll out broader availability in the coming months. 

OCTAVE promises to enable richer, more realistic, and more multifaceted AI experiences than EVI 2. For example, users and developers will be able to craft personas for AI agents, personalize them for individuals or even create them on the fly to answer a particular question, or enable real-time group conversations involving multiple users or AIs. We’re excited to hear what you’d like to see built with it.
