Octave TTS: the first text-to-speech system that understands what it’s saying

Hume AI Team

February 26, 2025 · article

Today we’re launching Octave (Omni-capable text and voice engine), the first LLM for text-to-speech. Unlike conventional TTS that merely “reads” words, Octave is a speech-language model that understands what words mean in context, unlocking a new level of expressiveness and nuance—and new AI voice capabilities. It acts out characters, generates voices from prompts, and takes instructions to modify the emotion and style of a given utterance.

In a blind comparison study with 180 human raters, Octave’s outputs were favored over outputs from ElevenLabs Voice Design in terms of audio quality (71.6%), naturalness (51.7%), and how well speech generations matched descriptions of the desired voice (57.7%), across 120 diverse prompts.

Generate more natural-sounding, context-aware speech

Octave is a state-of-the-art large language model (LLM) trained to understand and synthesize speech. This speech-language model can predict the tune, rhythm, and timbre of speech, inferring when to whisper secrets, shout triumphantly, or calmly explain a fact. In other words, Octave interprets plot twists, emotional cues, and character traits within a script or prompt, then transforms that understanding into lifelike speech, like a human actor reading a script.

Text input: “Sure, let's have another meeting about the color of the logo. That's exactly what this project needs! I mean, who cares about functionality when the shade of blue isn't quite right?”

Sarcasm script

Given a script that implies sarcasm, the model generates a sarcastic tone of voice.

Text input: “OH, NAH, NOT ME, MATE—I’VE SEEN ENOUGH! GET IT AWAY! BLOODY ‘ELL, JESUS!”

Revulsion script

By interpreting the text input, Octave emulates a strong revulsion response.

Text input: “This AI doesn't just talk, it knows. It knows panic. It knows disgust. It knows fear.”

Fear Faker Script

Octave intelligently adjusts the rhythm and emphasis of words based on their meaning.

Voice Design: Create a voice with any prompt

Guided by a prompt or just an evocative script, Octave can create any AI voice you can imagine. It automatically interprets the meaning and style of your script, picking up on cues such as pronouns, contractions, and vocabulary, to generate a coherent voice for the character.

You can further guide Octave by prompting it with a description of the character through the Voice Design feature. This description can encompass any number of characteristics, from “patient, empathetic counselor with an ASMR voice” to “dramatic medieval knight” to “middle-aged, Hollywood movie trailer narrator.” It can be nuanced, combining specific accents, demographics, occupational roles, and more. For more information, visit our voice prompting guide.

You’re also free to skip Voice Design and create voices on the fly on our Playground. Simply click Generate without selecting a voice, and Octave will create one based on your script alone. You can save this output as a new voice using the three-dot menu next to the audio player.
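For readers who would rather script this than click through the Playground, the two modes described above (script only, or script plus a description) can be sketched as a request payload. This is a minimal sketch under stated assumptions, not the documented Octave API: the field names `utterances`, `text`, and `description`, and the helper `build_voice_request`, are hypothetical and chosen for illustration. Check the official API reference before relying on any of them.

```python
import json
from typing import Optional


def build_voice_request(script: str, description: Optional[str] = None) -> dict:
    """Assemble a hypothetical voice-generation payload (field names are assumptions)."""
    if not script.strip():
        raise ValueError("script must not be empty")
    utterance = {"text": script}
    # Only attach a description when one is given; otherwise the voice
    # would be inferred from the script alone.
    if description:
        utterance["description"] = description
    return {"utterances": [utterance]}


# Script only: no description is sent.
script_only = build_voice_request("All right, all right, ladies and gents, gather round.")

# Script plus a simple description.
with_description = build_voice_request(
    "All right, all right, ladies and gentlemen, gather round.",
    description="English accent",
)

print(json.dumps(with_description, indent=2))
```

Keeping the description optional mirrors the two paths in the text: leave it out to let the script drive the voice, or include it to steer the result.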

Text input: “All right, all right, ladies and gents, gather round—this is Lot Number One: a vintage porcelain vase from the esteemed Mapleton estate. Let’s start the bidding at one hundred dollars, do I hear one hundred… one hundred, one hundred, thank you, I have one hundred, now who will give me one-twenty-five?”

No description

No description prompt

Based on the script alone, the model invents a voice.

Text input: “All right, all right, ladies and gentlemen, gather round—this is Lot Number One: a vintage porcelain vase from the esteemed Mapleton estate. Let’s start the bidding at one hundred dollars, do I hear one hundred… one hundred, one hundred, thank you, I have one hundred, now who will give me one-twenty-five?”

Description: English accent

English accent

The model invents a voice guided by the script and a simple description (“English accent”).

Text input: “All right, all right, orcs and goblins, gather round—this is Lot Number One: a young hobbit fresh from the Shire. Let’s start the bidding at one hundred ingots, do I hear one hundred, one hundred, one hundred, thank you, I have one hundred, now who will give me one-twenty-five?”

Description: Goblin auctioneer with a deep, gruff, booming, gravelly, and guttural voice. Speaks with a rough cockney accent.

Goblin Auctioneer

The model invents a voice guided by the script and a more complex description.

Adopt any emotion or style you can describe

Like a human actor, Octave can take directions. Starting with an existing voice, Octave can read out new scripts with any instructed emotion or speaking style. You can access this functionality via our Acting Instructions feature.
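As a rough illustration of directing an existing voice, the sketch below builds one request per acting instruction while holding the voice and script fixed. Everything structural here is an assumption for illustration: the `voice`, `name`, and `description` fields, the helper `build_acting_requests`, and the saved voice name "Goblin Auctioneer" are hypothetical, and the real Acting Instructions interface may look different.

```python
from typing import Dict, List


def build_acting_requests(voice_name: str, script: str, instructions: List[str]) -> List[Dict]:
    """Return one hypothetical payload per acting instruction, same voice and script."""
    requests = []
    for instruction in instructions:
        requests.append({
            "utterances": [{
                "text": script,
                "voice": {"name": voice_name},  # assumed way to reference a saved voice
                "description": instruction,     # assumed field carrying the direction
            }]
        })
    return requests


takes = build_acting_requests(
    voice_name="Goblin Auctioneer",  # hypothetical saved voice
    script="Do I hear one hundred?",
    instructions=["whispering, conspiratorial", "shouting triumphantly"],
)

for take in takes:
    print(take["utterances"][0]["description"])
```

Generating several takes of the same line this way makes it easy to compare how different directions change the delivery.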