Octave TTS Prompting Guide

Hume AI Team

February 26, 2025

While other text-to-speech models simply “read” words, Octave Text-to-Speech (TTS) is built on a language model, enabling it to interpret the meaning of text. With Octave, you can customize voices for any character, guide emotional delivery, and bring stories to life with human-like expression. The Octave speech-language model (speech LM) is a state-of-the-art voice AI model trained on data that captures the nuances of human vocal expression. It can interpret plot twists, emotional cues, and character traits within a script or prompt, transforming them into lifelike speech. To help you create the best possible samples and fully leverage the capabilities of this speech LM, we’ve compiled the following tips and tricks.

TL;DR

When crafting your prompts, consider:

Character Match: Does the text reflect the personality, tone, and role of the character described in the Voice Prompt?
Semantic Alignment: Does the text fit the context or scenario implied by the Voice Prompt?

For example, pairing a Voice Prompt about a “calm, reflective elderly woman reminiscing about her childhood” with an Input Text like “The sun dipped below the horizon, casting golden hues over the fields of my youth” creates a strong semantic and character match. Conversely, using the same Voice Prompt with an Input Text like “Let’s get ready to rumble!” is disjointed and out of character.

By ensuring both semantic alignment and character match, and experimenting with punctuation, you’ll create more natural and engaging results with Octave.

Six tips and tricks for prompting Octave TTS

Key Terms
Voice Prompt: The prompt that describes what the voice should sound like.
Input Text: The text that you want to be spoken.

Match Input Text tone to desired delivery

Ensure the Input Text aligns with the emotional tone you want the model to convey. If the Input Text expresses anger, the model will deliver it with an angry tone. For example, the Input Text "I am beyond livid right now! This needs to STOP!" will sound angrier than blander text that does not match the desired emotion. In short, use expressive Input Text to produce more expressive speech.

Align Input Text with Voice Prompt

The Voice Prompt and Input Text should complement each other, ensuring both semantic alignment and character match. For instance:

Voice Prompt: “The speaker is a fast-talking sports announcer with a booming, energetic voice, delivering play-by-play commentary with the fervor of a seasoned professional and the charismatic enthusiasm of a beloved hometown hero.”
Input Text: “He shoots, he scores! What a play, folks, are you seeing this?”

This pairing works because the Input Text not only matches the energetic tone of the announcer but also aligns with the context of sports commentary. The character’s role (a sports announcer) and the Input Text's content (a thrilling sports moment) are semantically aligned, creating a cohesive and realistic output.

Experiment with punctuation to get the emphasis you want, but avoid unusual symbols

Explore variations of your Input Text and experiment with plain-text punctuation: periods ( . ), commas ( , ), long dashes ( — ), exclamation marks ( ! ), and question marks ( ? ). Try capitalizing an entire word for stronger emphasis.
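
One way to try these variations systematically is to generate a few candidates from a single line of Input Text and listen to each. The sketch below is only an illustration: the function name `emphasis_variants` is hypothetical and not part of any Hume tool. It capitalizes a chosen word and swaps a closing period for an exclamation mark.

```python
import re


def emphasis_variants(text: str, word: str) -> list[str]:
    """Return the original text plus variants that stress `word`
    or end with an exclamation mark instead of a period."""
    variants = [text]

    # Capitalize every whole-word, case-insensitive match of `word`.
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    shouted = pattern.sub(lambda m: m.group(0).upper(), text)
    variants.append(shouted)

    # Swap a closing period for an exclamation mark.
    if text.endswith("."):
        variants.append(text[:-1] + "!")
        variants.append(shouted[:-1] + "!")

    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(variants))


if __name__ == "__main__":
    for candidate in emphasis_variants("This needs to stop.", "stop"):
        print(candidate)
```

Each candidate can then be pasted into the Input Text box with the same Voice Prompt to compare how the delivery changes.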

Try to avoid extra formatting or unusual symbols, as they can confuse the model. Emojis and emoticons such as 😊 or :), HTML tags such as <b>, Markdown formatting, and uncommon symbols such as ( ~ # % ) should be left out of your Input Text wherever possible, as they can lead to unpredictable generations.
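
If your Input Text comes from a web page or chat log, a small cleanup pass can strip these characters before you paste it in. The following is a minimal sketch under stated assumptions: `clean_input_text` is a hypothetical helper, and both the symbol list and the emoticon pattern are illustrative guesses rather than an official list of characters Octave handles poorly.

```python
import re
import unicodedata

# Assumption: these patterns are illustrative, not an official Octave list.
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")                  # e.g. <b>, </b>
EMOTICON = re.compile(r"(?<!\w)[:;=][-']?[()DPp](?!\w)")     # e.g. :) ;-) :D
UNUSUAL_SYMBOLS = set("~#%*_`^|<>{}[]\\")
INVISIBLE_EMOJI_PARTS = {"\ufe0f", "\u200d"}  # variation selector, joiner


def clean_input_text(text: str) -> str:
    """Remove HTML tags, emoticons, emojis, and unusual symbols,
    keeping letters, digits, and plain punctuation."""
    text = HTML_TAG.sub("", text)
    text = EMOTICON.sub("", text)

    kept = []
    for ch in text:
        if ch in UNUSUAL_SYMBOLS or ch in INVISIBLE_EMOJI_PARTS:
            continue
        # "So" and "Sk" are Unicode symbol categories; "So" covers emojis
        # (and also signs such as the copyright symbol).
        if unicodedata.category(ch) in {"So", "Sk"}:
            continue
        kept.append(ch)

    # Collapse the whitespace left behind by removed characters.
    return re.sub(r"\s+", " ", "".join(kept)).strip()


if __name__ == "__main__":
    print(clean_input_text("I am <b>beyond</b> livid right now! 😊 :) This needs to STOP! ~#%"))
```

Note that this simply drops a symbol such as % rather than spelling it out, so review the cleaned text before generating speech.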

Incorporate emotions

As part of your Voice Prompt, describe your character’s emotional state to guide delivery. For example:

The speaker is ecstatic, celebrating a major victory
The character speaks with a mix of fear and anger

If you’re feeling bold, you can also add emotions like “joyful,” “nervous,” or “disgusted” to the Voice Prompt box without a full character description.

Create a character for your voice

Develop a detailed character for your Voice Prompt. Here are some examples:

A charismatic game show host named Jimmy Sparks could become: The speaker is Jimmy Sparks, a charismatic game show host with booming energy, infectious enthusiasm, and flawless, rapid-fire delivery during the show.
A boxing ring announcer could become: The speaker has an intense voice, like a New York boxing ring announcer with a booming voice, dramatic flair, and electrifying energy that hypes up the crowd.
An ASMR streamer with a soothing, whispery tone could become: The speaker is a young Canadian ASMR streamer with a soothing, whispery tone, gentle cadence, and calming presence, creating an intimate and relaxing experience.
A bubbly influencer or vlogger could become: The speaker is a bubbly influencer or vlogger with a confident, charismatic voice that sparkles with energy, effortlessly blending warmth and excitement.
An older woman with a posh British accent could become: The speaker is a cheerful, elderly woman with a slightly posh British accent, offering tea and baked goods with motherly warmth.
A jazz singer telling stories late at night could become: The speaker has a calming voice, like a South African jazz singer with a smoky, smooth timbre, effortlessly weaving captivating stories between velvety songs.
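
If you write many Voice Prompts, it can help to keep the character details in a small structure and expand them into a sentence of the same shape as the examples above. This is a rough sketch only: the `VoiceCharacter` class and its fields are hypothetical names chosen for illustration, and Octave accepts free-form text, so nothing requires this exact template.

```python
from dataclasses import dataclass, field


def join_with_and(items: list[str]) -> str:
    """Join phrases as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


@dataclass
class VoiceCharacter:
    role: str                                   # e.g. "a charismatic game show host"
    name: str = ""                              # optional character name
    qualities: list[str] = field(default_factory=list)  # vocal traits
    emotional_state: str = ""                   # e.g. "ecstatic, celebrating a major victory"

    def to_voice_prompt(self) -> str:
        """Expand the fields into a 'The speaker is ...' description."""
        subject = f"{self.name}, {self.role}" if self.name else self.role
        prompt = f"The speaker is {subject}"
        if self.qualities:
            prompt += " with " + join_with_and(self.qualities)
        prompt += "."
        if self.emotional_state:
            prompt += f" The speaker is {self.emotional_state}."
        return prompt


if __name__ == "__main__":
    host = VoiceCharacter(
        name="Jimmy Sparks",
        role="a charismatic game show host",
        qualities=["booming energy", "infectious enthusiasm",
                   "flawless, rapid-fire delivery"],
    )
    print(host.to_voice_prompt())
```

The optional `emotional_state` field appends an emotion sentence, so one character can be reused across scenes with different moods.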

Use the ‘Enhance’ feature

Try the “Enhance” button in the bottom-right corner of the Voice Prompt box after you’ve added some text. This feature adjusts your prompt to make it more compatible with Octave.

Multilingual prompting

Octave currently supports English and Spanish, but will support more languages in the coming weeks. To prompt in a language other than English, ensure both your Voice Prompt and Input Text are in the desired language. All other prompting guidelines remain the same.

Example speakers to get you started

Use these example speakers in the Voice Prompt box, and generate corresponding text using the ✨ icon in the corner:

British Romance Novel Narrator: The speaker is a sophisticated British female narrator with a gentle, warm voice, recounting the ending of a classic romance novel.
Male Old-School Film Actor: An old-school film actor with a transatlantic, intense, middle-aged, male voice.
Film Narrator: The speaker is an American, deep middle-aged male film trailer narrator for a film about chickens.
70-Year-Old Black Woman, Literary: The speaker is a reflective 70-year-old Black woman with a calming tone, reminiscing about the profound impact of literature on her life, speaking slowly and poetically.
Elderly Scottish Gentleman: The speaker is an elderly Scottish gentleman with a thick brogue, expressing awe and admiration.
Posh Elderly British Woman: The speaker is a cheerful, elderly woman with a slightly posh British accent, offering tea and baked goods with motherly warmth.
California Surfer: The speaker is an excited Californian surfer dude, with a loud, stoked, and enthusiastic tone.

Or, use the Randomize button on Hume’s TTS Playground to instantly generate unique voices.