
Octave TTS: the first text-to-speech system that understands what it’s saying

Published on February 26, 2025

Today we’re launching Octave (Omni-capable text and voice engine), the first LLM for text-to-speech. Unlike conventional TTS that merely “reads” words, Octave is a speech-language model that understands what words mean in context, unlocking a new level of expressiveness and nuance—and new AI voice capabilities. It acts out characters, generates voices from prompts, and takes instructions to modify the emotion and style of a given utterance.

In a blind comparison study with 180 human raters, Octave’s outputs were favored over outputs from ElevenLabs Voice Design in terms of audio quality (71.6%), naturalness (51.7%), and how well speech generations matched descriptions of the desired voice (57.7%), across 120 diverse prompts.

Generate more natural-sounding, context-aware speech

Octave is a state-of-the-art large language model (LLM) trained to understand and synthesize speech. This speech-language model can predict the tune, rhythm, and timbre of speech, inferring when to whisper secrets, shout triumphantly, or calmly explain a fact. In other words, Octave interprets plot twists, emotional cues, and character traits within a script or prompt, then transforms that understanding into lifelike speech, like a human actor reading a script.

Text input: “Sure, let's have another meeting about the color of the logo. That's exactly what this project needs! I mean, who cares about functionality when the shade of blue isn't quite right?”

Sarcasm script

Given a script that implies sarcasm, the model generates a sarcastic tone of voice.

Text input: “OH, NAH, NOT ME, MATE—I’VE SEEN ENOUGH! GET IT AWAY! BLOODY ‘ELL, JESUS!”

Revulsion script

By interpreting the text input, Octave emulates a strong revulsion response.

Text input: “This AI doesn't just talk, it knows. It knows panic. It knows disgust. It knows fear.”

Fear Faker Script

Octave intelligently adjusts the rhythm and emphasis of words based on their meaning.

Voice Design: Create a voice with any prompt

Guided by a prompt or just an evocative script, Octave can create any AI voice you can imagine. It automatically picks up on cues in your script, such as pronouns, contractions, and vocabulary, to generate a coherent voice for the character.

You can further guide Octave by prompting it with a description of the character, using the Voice Design feature. This description can encompass any number of characteristics, from “patient, empathetic counselor with an ASMR voice” to “dramatic medieval knight,” to “middle-aged, Hollywood movie trailer narrator.” It can be nuanced, combining specific accents, demographics, occupational roles, and more. For more information, visit our voice prompting guide.

You’re also free to skip Voice Design and create voices on the fly on our Playground. Simply click Generate without selecting a voice, and Octave will create one based on your script alone. You can save this output as a new voice using the three-dot menu next to the audio player.

Text input: “All right, all right, ladies and gents, gather round—this is Lot Number One: a vintage porcelain vase from the esteemed Mapleton estate. Let’s start the bidding at one hundred dollars, do I hear one hundred… one hundred, one hundred, thank you, I have one hundred, now who will give me one-twenty-five?”

No description

No description prompt

Based on the script alone, the model invents a voice.

Text input: “All right, all right, ladies and gentlemen, gather round—this is Lot Number One: a vintage porcelain vase from the esteemed Mapleton estate. Let’s start the bidding at one hundred dollars, do I hear one hundred… one hundred, one hundred, thank you, I have one hundred, now who will give me one-twenty-five?”

Description: English accent

English accent

The model invents a voice guided by the script and a simple description (“English accent”).

Text input: “All right, all right, orcs and goblins, gather round—this is Lot Number One: a young hobbit fresh from the Shire. Let’s start the bidding at one hundred ingots, do I hear one hundred, one hundred, one hundred, thank you, I have one hundred, now who will give me one-twenty-five?”

Description: Goblin auctioneer with a deep, gruff, booming, gravelly, and guttural voice. Speaks with a rough cockney accent.

Goblin Auctioneer

The model invents a voice guided by the script and a more complex description.

Adopt any emotion or style you can describe

Like a human actor, Octave can take directions. Starting with an existing voice, Octave can read out new scripts with any instructed emotion or speaking style. You can access this functionality via our Acting Instructions feature.

Text input: “Are you serious?”

Description: “whispering, hushed”

"Are you serious?" – hushed, whisper

Text input: “Are you serious?”

Description: “calm, serene”

"Are you serious?" – calm, serene

Text input: “Are you serious?”

Description: “disgusted, disdainful”

"Are you serious?" – disdainful, contemptuous

Text input: “Are you serious?”

Description: “angry, furious”

"Are you serious?" – angry, furious

Text input: “Are you serious?”

Description: “pained, shocked”

"Are you serious?" – pained, shocked

Voice Cloning: Coming soon

Octave can instantly clone a voice extracted from as little as 5 seconds of audio. The team is working diligently to provide safe ways to offer this capability. We plan on launching Voice Cloning in the coming weeks.

Octave Creator Studio & Developer Tools

Octave is available today on platform.hume.ai and through our API. On our platform, creators and developers can already access:

  • Voice Design

  • Acting Instructions

  • Our voice library of more than 40 premade voices

  • Our Projects interface for generating long-form content like audiobooks and podcasts (currently in Preview).

On the developer platform side, Octave is accessible through Python and TypeScript SDKs that handle authentication and provide typed interfaces for reliable integration. The command-line interface supports fast prototyping, testing, and batch processing directly from your terminal. All tools come with clear documentation and example code, letting you quickly implement context-aware speech generation without managing complex API interactions. These developer tools streamline implementation, reducing time-to-market for voice-enabled applications.
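As a rough illustration of what a request pairing a script with an optional voice description might look like, here is a minimal sketch that only builds and prints a JSON body. The `Utterance` class, the `build_tts_request` helper, and the field names `utterances`, `text`, and `description` are assumptions made for this example, not a confirmed schema; consult the official API reference for the real request format. Nothing is sent over the network.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Utterance:
    """One piece of text to speak, plus an optional natural-language voice prompt."""
    text: str
    description: Optional[str] = None  # assumed field name for the voice description


def build_tts_request(utterances: list[Utterance]) -> str:
    """Serialize utterances into a JSON body, omitting descriptions that are unset."""
    if not utterances:
        raise ValueError("at least one utterance is required")
    items = []
    for utterance in utterances:
        # Drop None values so an unprompted utterance carries only its text.
        items.append({k: v for k, v in asdict(utterance).items() if v is not None})
    return json.dumps({"utterances": items}, ensure_ascii=False, indent=2)


body = build_tts_request([
    # Script plus a simple description, as in the "English accent" sample.
    Utterance(
        text="All right, all right, ladies and gentlemen, gather round.",
        description="English accent",
    ),
    # No description: the voice would be inferred from the script alone.
    Utterance(text="Let's start the bidding at one hundred dollars."),
])
print(body)
```

Keeping the description optional mirrors the two modes described earlier: a voice guided by a prompt, or one invented from the script by itself.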

Evaluating Hume Octave TTS

The development of Octave highlighted the need for novel evaluations capable of assessing the expressivity, steerability, and overall performance of modern TTS systems. 

Traditional TTS evaluations predominantly focus on the intelligibility and accuracy of generated speech for short, isolated text inputs—areas in which contemporary models already perform at a high level. However, existing benchmarks fail to capture the challenges associated with generating expressive speech for longer and more complex inputs or assessing how well a model follows detailed voice prompts.

To address these gaps, we conducted an internal evaluation to benchmark Octave against an industry-leading TTS system, ElevenLabs. In addition, we are launching a public evaluation initiative, Expressive TTS Arena, to facilitate broader comparative assessments of expressive speech synthesis.

ElevenLabs vs. Hume Octave

To benchmark Octave, we conducted an internal evaluation in which human raters assessed speech samples generated by Octave and ElevenLabs’ TTS system. 

We developed a diverse set of 120 voice descriptions to reflect a broad range of user prompting styles for TTS models. These descriptions were designed to encompass the full spectrum of how users specify desired voice characteristics, ranging from:

  • Elaborate, narrative-driven descriptions (e.g., “A warm, fatherly voice with a rich baritone, slightly gravelly with a reassuring tone, like an experienced storyteller.”)

  • Concise adjective-based prompts (e.g., “Energetic, youthful, slightly raspy.”)

We generated plausible dialogue for each voice description using Gemini. For each voice description and text input, we then generated three samples using Octave and three using ElevenLabs Voice Design.

Raters (N = 180) were instructed to blindly compare paired Octave and ElevenLabs speech samples generated using the same prompt. Each rater completed multiple trials and indicated whether they preferred the Hume- or ElevenLabs-generated samples in terms of audio quality, naturalness, and how well the generated voice matched the intended style and character specified in the prompt.
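One way a blind paired comparison like this could be set up is sketched below. The article does not describe how samples were paired or ordered, so everything here is an assumption: the k-th Octave sample is paired with the k-th ElevenLabs sample for the same prompt, the A/B position is randomized per trial, and the file paths are hypothetical placeholders.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    prompt_id: int
    sample_a: str       # presented to the rater as "A"
    sample_b: str       # presented to the rater as "B"
    a_is_octave: bool   # hidden from the rater; used only when scoring


def build_blind_trials(num_prompts: int, samples_per_system: int, seed: int = 0) -> list[Trial]:
    """Pair samples from the two systems and randomize which one appears as A."""
    rng = random.Random(seed)
    trials = []
    for prompt_id in range(num_prompts):
        for k in range(samples_per_system):
            # Hypothetical file layout; raters would never see these paths.
            octave = f"octave/p{prompt_id:03d}_s{k}.wav"
            other = f"elevenlabs/p{prompt_id:03d}_s{k}.wav"
            a_is_octave = rng.random() < 0.5
            a, b = (octave, other) if a_is_octave else (other, octave)
            trials.append(Trial(prompt_id, a, b, a_is_octave))
    rng.shuffle(trials)  # avoid presenting prompts in a predictable order
    return trials


trials = build_blind_trials(num_prompts=120, samples_per_system=3)
print(len(trials))  # 360 paired comparisons under this pairing assumption
```

Randomizing the A/B position keeps a rater's preference for one side of the screen from favoring either system.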

The results indicated that Hume’s Octave outperformed the ElevenLabs model across all three human preference metrics:

  • Audio quality: Hume-generated voices were preferred in 71.6% of trials.

  • Naturalness: Hume’s voices were favored in 51.7% of trials.

  • Description/prompt match: Hume’s generations were selected as better matches for the given descriptions/prompts in 57.7% of trials.

These findings suggest that Hume’s model not only produces higher-quality and more natural-sounding speech, but also demonstrates better adherence to user-provided descriptions, a key factor in expressive and steerable TTS generation.
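When reading preference rates like these, it helps to ask how far each one sits from the 50% expected if raters had no preference. The sketch below computes a Wilson score interval for each reported rate. The article reports only percentages, not the number of trials, so the count of 1,000 used here is made up purely for illustration, and the intervals it prints say nothing about the actual study.

```python
import math


def wilson_interval(wins: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= wins <= trials:
        raise ValueError("need trials > 0 and 0 <= wins <= trials")
    p = wins / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


HYPOTHETICAL_TRIALS = 1000  # made-up count; the article does not report it
reported = {"audio quality": 0.716, "naturalness": 0.517, "description match": 0.577}

for metric, rate in reported.items():
    wins = round(rate * HYPOTHETICAL_TRIALS)
    low, high = wilson_interval(wins, HYPOTHETICAL_TRIALS)
    verdict = "excludes" if low > 0.5 else "includes"
    print(f"{metric}: {rate:.1%} preferred, interval [{low:.3f}, {high:.3f}] {verdict} 50%")
```

The width of each interval shrinks as the trial count grows, so the same percentage can be more or less conclusive depending on how many comparisons were collected.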


Fraction of samples preferred for Hume vs. ElevenLabs in a paired preference study with blind raters. Hume samples were preferred more often on average across three criteria: naturalness of speech, alignment with the provided description or prompt, and overall audio quality.

Expressive TTS Arena

We’re also launching a new public evaluation that anyone can participate in: Expressive TTS Arena (arena.hume.ai).

We were inspired by the Hugging Face TTS Arena, which is designed to compare TTS models with short, isolated text inputs. It’s typically used with single sentences or phrases within a character limit, which is a relatively solved problem. It doesn’t address performance with longer, more complex, and expressive content or evaluate the steerability of new models using prompts, problems only recently addressed by next-generation text-to-speech systems such as Octave. That’s where Expressive TTS Arena comes in.

Expressive TTS Arena is based on a prompt-based voice generation system that facilitates testing longer, more expressive text and takes advantage of the promptability of new TTS models. This makes it an ideal tool for evaluating how well new TTS systems handle nuanced, creative, and emotionally rich content and prompts typical of real use cases.


What’s Next?

We’re continuing to train Octave and improve its capabilities. For this initial launch, we focused largely on English-language speech, but Octave can also speak Spanish fluently, and we hope to improve its proficiency in other languages soon. We also expect to improve Octave’s core capabilities over the coming weeks. In particular, we remain focused on expressive speech generation, prompting for different emotions and styles, generating new voices, and smooth conversations among multiple speakers.

In the meantime, Hume’s mission remains the same: to optimize AI for human well-being. We first developed Octave to better understand how humans express themselves with their voices. In addition to TTS, we are using it to train AI systems that can anticipate users’ needs. We’ll have more updates on that in the near future.

 
