Disentangling Emotion from Voice: A Cross-Product Sampling Approach for Expressive Voice Data

Hoon Shin · May 27, 2026 · research

Humans can separate how they feel from how they speak. Voice models still struggle to do the same.

Picture someone angry, and you probably imagine a raised voice, faster pacing, and sharper delivery. Picture someone bored, and they likely sound flat, slow, and monotone.

Those pairings are common, but human communication is more flexible than that. A parent can be furious and still whisper. A teacher can be deeply engaged while speaking slowly. An executive can be excited but restrained. People separate emotion from delivery all the time.

Voice models still struggle with this distinction. They tend to learn emotion and delivery as a package: anger fused to shouting, boredom fused to monotone, excitement fused to speed. These pairings dominate natural training data, so models learn them as fixed mappings rather than independent axes.

The result is expressive speech that can sound performed, narrow, or slightly off. Ask a leading model to whisper angrily or sound bored while speaking quickly, and the gap becomes obvious.

Models handle the obvious pairings well. It is the subtler cases, where emotion and delivery pull in different directions, that expose the problem.

Three audio examples of emotion and delivery pulling apart:

False cheer: warmth performed, not felt. A shop assistant with a demanding customer.

Restrained excitement: real enthusiasm, professional delivery. An executive sharing a personal decision.

Restrained anger: strong disagreement in a professional setting. A leader responding to a legal ruling.

We call this the entanglement problem. Emotion and vocal delivery are distinct dimensions of speech, but training data often presents them as inseparable.

There has been meaningful progress on architectural fixes. But because the problem begins in the training distribution itself, solving it robustly also requires a data-level response. We propose cross-product sampling: a method for breaking default pairings in training corpora so models can learn emotion and delivery as independent dimensions.

Where Current Curation Pipelines Fall Short

Voice model development has moved beyond brute-force scaling. The field is increasingly focused on smaller, higher-quality datasets filtered for expressivity and coverage. The dominant recipe is score-and-select: run a speech quality or expressivity model, then take the top samples per category. For example, InstructTTSEval filters for samples where arousal and dominance exceed fixed thresholds. EmoCtrl-TTS curates data using labels from a speech emotion recognition model. ParaSpeechCaps applies expressivity filtering across dozens of style tags.
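The score-and-select recipe can be sketched in a few lines. This is a minimal illustration, not the pipeline of any system named above: the `(sample_id, category, score)` layout and the function name `top_n_per_category` are hypothetical, and the scores are toy values.

```python
from collections import defaultdict


def top_n_per_category(samples, n):
    """Keep the n highest-scoring samples in each category.

    `samples` is a list of (sample_id, category, score) tuples. This layout
    is a hypothetical stand-in for the output of an expressivity scorer.
    """
    by_category = defaultdict(list)
    for sample_id, category, score in samples:
        by_category[category].append((score, sample_id))

    selected = {}
    for category, scored in by_category.items():
        # Strongest first, then keep the first n.
        scored.sort(reverse=True)
        selected[category] = [sample_id for _, sample_id in scored[:n]]
    return selected


# Toy scores, invented for illustration.
scored_samples = [
    ("a1", "anger", 0.91), ("a2", "anger", 0.40), ("a3", "anger", 0.77),
    ("b1", "boredom", 0.65), ("b2", "boredom", 0.88),
]
selection = top_n_per_category(scored_samples, n=2)
```

Note that the selection looks at one axis only. Nothing in it asks how the anger or boredom is delivered.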

These approaches improve average expressivity, but they do not address a deeper issue: the strong correlations between emotion and vocal delivery in natural speech. In our own data, boredom is highly correlated with robotic or monotone delivery (r = 0.89), and interest trends toward confident expression (r = 0.78).
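As a rough illustration of how such a correlation could be measured, the sketch below computes a Pearson coefficient between two per-sample tag scores. It assumes r is a Pearson correlation, which the text does not specify, and the score lists are toy values rather than real data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy per-sample scores for one emotion tag and one voice tag.
boredom = [0.1, 0.2, 0.4, 0.7, 0.9]
monotone = [0.2, 0.1, 0.5, 0.6, 0.9]
r = pearson_r(boredom, monotone)
```

A value of r near 1 means the two tags rise and fall together across samples, which is the pattern a model can mistake for a constraint.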

As a result, common pairings appear thousands of times in a typical training dataset, while less intuitive combinations, like bored but energetic, sad but authoritative, angry but quiet, are sparse or absent. Filtering for the most expressive samples can reinforce this imbalance, because the strongest samples often reflect the dominant patterns.

From the model’s perspective, there is no signal that these are correlations rather than constraints. They are learned as fixed mappings.

Figure: Emotion × voice quality. A 4 × 4 grid of sixteen real samples, one per pairing of emotion (angry, joy, sad, bored) with voice quality (yelling, theatrical, whispering, monotone). Tile shading shows how often the two tags co-occur: relative frequency across 100,000 English samples where both tags have a z-score above 1.5 (n = 21,673).

Architectural Approaches and Their Limits

Recent work has tried to address disentanglement at the model level.

NaturalSpeech 3 uses a factorized codec that decomposes speech into separate subspaces for content, prosody, timbre, and acoustic detail, then uses a factorized diffusion model to generate the attributes in each subspace. Seed-TTS keeps a unified token representation but uses self-distillation to expose the model to controlled attribute variation.

Both approaches make progress on separating speech attributes inside the model. But they still operate within the limits of the data distribution. A factorized representation can infer how restrained anger might sound, but if the training data contains few examples of anger expressed quietly, that inference remains underconstrained. Distillation can encourage separation, but it still depends on the model encountering enough variation to learn from.

When rare combinations are absent, the model lacks the evidence needed to generalize reliably. That motivates a shift upstream: disentanglement at the data level.

Cross-Product Sampling

Our approach builds on Hume’s taxonomy of voice and emotion: 604 leaf-level tags organized into two independent hierarchies. The emotion tree spans 23 parent categories, including anger, amusement, boredom, and anxiety. The voice and prosody tree includes 27 parent categories, including breathy, confident, monotone, and energetic. A tagging model scores each audio sample continuously from 0 to 1 across all 604 dimensions.

Emotion Tree (23 parents, 414 leaves; example leaves shown)
anger	aggression, contempt, fury, rage
anxiety	dread, nervousness, worry, panic
amusement	laughter, humor, silliness, wit
awe	wonder, reverence, amazement
boredom	detachment, indifference, apathy
confusion	bewilderment, disorientation
desire	longing, yearning, craving
disgust	revulsion, nausea, contempt
disappointment	frustration, letdown, regret
doubt	skepticism, uncertainty, suspicion
embarrassment	shame, awkwardness, shyness
fear	terror, alarm, dread, horror
joy	elation, euphoria, rapture, bliss
love	affection, tenderness, devotion
pain	suffering, anguish, torment
pleasure	delight, enjoyment, gratification
pride	triumph, confidence, dignity
sadness	grief, sorrow, despair, melancholy
satisfaction	contentment, fulfillment, relief
surprise	astonishment, shock, disbelief
sympathy	compassion, empathy, concern
tiredness	fatigue, exhaustion, lethargy
interest	curiosity, fascination, engagement

Voice / Prosody Tree (27 parents, 189 leaves; example leaves shown)
abrasive	harsh, rough, grating, scratchy
articulate	clear, precise, enunciated, crisp
breathy	whispering, hushed, airy, ASMR
cartoony	squeaky, whimsical, goofy, babyish
confident	assertive, commanding, decisive
energetic	lively, animated, spirited, bubbly
expressive	emotive, vivid, dramatic, dynamic
fast	rapid, hurried, brisk, fleet
husky	gravelly, raspy, deep, throaty
monotone	flat, toneless, robotic, dull
mellow	warm, smooth, muted, gentle
mumbling	unclear, slurred, indistinct
musical	singing, humming, rapping, poetic
nasal	twangy, pinched, adenoidal
noisy	crowd, bg music, distorted, fuzzy
non-human	alien, animal, mechanical, digital
resounding	booming, resonant, sonorous, full
robotic	artificial, synthesized, digital
shrill	piercing, screeching, wailing
shy	tentative, meek, hesitant, quiet
slow	leisurely, deliberate, measured
smooth	fluid, polished, effortless
soft	gentle, faint, quiet, subdued
sultry	seductive, throaty, smoldering
theatrical	dramatic, performative, grand
tinny	thin, metallic, hollow, weak
yelling	shouting, screaming, bellowing

Rather than selecting data based on how strongly it scores on emotion or voice attributes individually, cross-product sampling selects from emotion × voice pairs. Think of the dataset as a grid: emotion categories across one axis, voice and prosody categories across the other. The goal is to fill as many cells as possible, including combinations that natural speech rarely produces: bored but energetic, disappointed but confident, joyful but shy.

Because the raw tag space reflects the biases of natural speech, where some attributes appear far more often and some score more strongly by default, we apply two techniques before sampling.

First, we sample at the parent level rather than the leaf level. At the leaf level, the grid would contain nearly 80,000 cells, many of them redundant. Tags like “hysterical laughing” and “normal laughing” often pull the same samples. Collapsing to parent categories reduces the space to 23 × 27, or 621 cells, while preserving meaningful variation.
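The grid sizes follow directly from the taxonomy counts, and the collapse to parents can be sketched as a simple aggregation. Taking the maximum leaf score as the parent score is an assumption made here for illustration. The text does not say how leaf scores are combined, and `collapse_to_parents` is a hypothetical helper.

```python
# Grid sizes implied by the leaf and parent counts quoted in the taxonomy.
leaf_cells = 414 * 189    # emotion leaves x voice leaves
parent_cells = 23 * 27    # emotion parents x voice parents


def collapse_to_parents(leaf_scores, leaf_to_parent):
    """Aggregate leaf scores into parent scores.

    Assumption: a parent's score is the max over its leaves. Other
    aggregations (mean, sum) would fit the same interface.
    """
    parent_scores = {}
    for leaf, score in leaf_scores.items():
        parent = leaf_to_parent[leaf]
        parent_scores[parent] = max(score, parent_scores.get(parent, 0.0))
    return parent_scores


# A tiny slice of the emotion tree with toy scores.
leaf_to_parent = {"fury": "anger", "rage": "anger", "grief": "sadness"}
leaf_scores = {"fury": 0.62, "rage": 0.81, "grief": 0.10}
parents = collapse_to_parents(leaf_scores, leaf_to_parent)
```

Here `leaf_cells` is 78,246 and `parent_cells` is 621, matching the figures above.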

Second, we apply z-normalization so dominant attributes do not drown out subtler ones. Instead of comparing raw scores across attributes, we measure how prominent each attribute is relative to its own distribution. A modest raw score on “embarrassment” can outrank a higher raw score on “interest” if it represents a stronger-than-usual signal for that attribute.
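A minimal sketch of per-attribute z-normalization, using toy scores. The function name and the dictionary layout are illustrative, and using the population standard deviation is an assumption.

```python
import statistics


def z_normalize(scores_by_tag):
    """Convert raw scores to z-scores within each tag's own distribution.

    `scores_by_tag` maps a tag name to a list of raw scores, one per sample.
    """
    z_scores = {}
    for tag, scores in scores_by_tag.items():
        mean = statistics.fmean(scores)
        std = statistics.pstdev(scores)  # population std (assumption)
        z_scores[tag] = [(s - mean) / std if std > 0 else 0.0 for s in scores]
    return z_scores


# Toy data: "interest" scores high everywhere, "embarrassment" rarely fires.
raw = {
    "interest": [0.60, 0.70, 0.65, 0.75, 0.70],
    "embarrassment": [0.05, 0.04, 0.06, 0.05, 0.30],
}
z = z_normalize(raw)

# For the last sample, raw interest (0.70) exceeds raw embarrassment (0.30),
# but embarrassment is far more unusual relative to its own distribution.
last_interest_z = z["interest"][-1]
last_embarrassment_z = z["embarrassment"][-1]
```

In this toy case the last sample's embarrassment z-score is roughly 2.0 while its interest z-score is roughly 0.4, so the ranking flips once each attribute is measured against its own baseline.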

With these adjustments, we explicitly sample for rare intersections. This is the critical step. No amount of top-N filtering, threshold tuning, or reweighting will reliably produce combinations like bored × energetic if the selection process does not actively seek them.
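One way to seek rare intersections actively is to bucket samples by cell and then fill the budget round-robin across cells. The sketch below is an illustration under stated assumptions, not the exact procedure used here: the cell assignment (top z-scored emotion parent × top z-scored voice parent), the strength measure (the weaker of the two z-scores), and the round-robin allocation are all hypothetical choices.

```python
from collections import defaultdict


def assign_cell(emotion_z, voice_z):
    """Map a sample to its (emotion, voice) cell and a signal strength."""
    emotion = max(emotion_z, key=emotion_z.get)
    voice = max(voice_z, key=voice_z.get)
    # Both attributes must be prominent, so take the weaker of the two.
    strength = min(emotion_z[emotion], voice_z[voice])
    return (emotion, voice), strength


def cross_product_sample(samples, budget, min_strength=0.0):
    """Pick up to `budget` samples, visiting every occupied cell in turn."""
    cells = defaultdict(list)
    for sample_id, emotion_z, voice_z in samples:
        cell, strength = assign_cell(emotion_z, voice_z)
        if strength >= min_strength:
            cells[cell].append((strength, sample_id))
    for members in cells.values():
        members.sort(reverse=True)  # strongest first within each cell

    selected = []
    rank = 0
    while len(selected) < budget:
        took_any = False
        for cell in sorted(cells):
            if rank < len(cells[cell]) and len(selected) < budget:
                selected.append(cells[cell][rank][1])
                took_any = True
        if not took_any:
            break  # every cell is exhausted
        rank += 1
    return selected


# Toy z-scored parent-level tags: three anger x yelling samples, two rare cells.
samples = [
    ("s1", {"anger": 2.5, "boredom": -0.5}, {"yelling": 2.2, "energetic": 0.3, "soft": -1.0}),
    ("s2", {"anger": 2.0, "boredom": -0.4}, {"yelling": 2.4, "energetic": 0.1, "soft": -0.9}),
    ("s3", {"anger": 1.8, "boredom": -0.2}, {"yelling": 1.9, "energetic": 0.2, "soft": -0.8}),
    ("s4", {"anger": -0.3, "boredom": 1.6}, {"yelling": -0.6, "energetic": 1.7, "soft": 0.0}),
    ("s5", {"anger": 1.7, "boredom": -0.1}, {"yelling": -0.2, "energetic": 0.0, "soft": 1.5}),
]
picked = cross_product_sample(samples, budget=3)
```

With a budget of three, a plain top-N by strength would return s1, s2, and s3, all anger × yelling. The round-robin version returns one sample from each occupied cell, including boredom × energetic and anger × soft.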

Results: Measuring Disentanglement and Diversity

The goal is to maximize coverage of emotion × voice combinations while preserving expressive signals. Optimizing for only one side is easy. Random sampling fills more of the grid, but much of the audio is low-signal. Top-N expressivity filtering produces stronger samples, but concentrates them in the most common pairings.

We evaluate three sampling strategies: random sampling, top-N parent-level sampling, and cross-product sampling. Each is measured along three dimensions: grid coverage, expressivity, and mutual information.
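The first two measures can be sketched as follows. Grid coverage is taken to be the number of distinct occupied cells. For expressivity, the sketch reports the share of selected samples above the median of a reference distribution. Which reference distribution defines the "upper half" is an assumption, so it is passed in as a parameter, and both function names are illustrative.

```python
import statistics


def grid_coverage(cells):
    """Number of distinct (emotion, voice) cells occupied by a selection."""
    return len(set(cells))


def share_in_upper_half(selected_scores, reference_scores):
    """Fraction of selected samples scoring above the reference median."""
    median = statistics.median(reference_scores)
    above = sum(score > median for score in selected_scores)
    return above / len(selected_scores)


# Toy selection: four samples landing in three distinct cells.
selected_cells = [
    ("anger", "yelling"), ("anger", "yelling"),
    ("boredom", "fast"), ("joy", "shy"),
]
selected_scores = [0.2, 0.6, 0.9, 0.3]
reference_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

coverage = grid_coverage(selected_cells)
upper_share = share_in_upper_half(selected_scores, reference_scores)
```

Dividing `coverage` by 621 gives the fraction of the parent-level grid that a selection occupies.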

We use mutual information to quantify entanglement. If knowing the emotion tells you a lot about the voice quality, mutual information is high. If emotion and voice are more independent, mutual information is lower.
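For discrete labels, mutual information is the sum over cells of p(e, v) · log[p(e, v) / (p(e) · p(v))]. The sketch below estimates it from a list of (emotion, voice) cell assignments. Treating each sample as a single discrete pair and reporting the result in bits are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Mutual information, in bits, between emotion and voice labels.

    `pairs` is a list of (emotion, voice) tuples, one per sample.
    """
    n = len(pairs)
    joint = Counter(pairs)
    emotion_counts = Counter(emotion for emotion, _ in pairs)
    voice_counts = Counter(voice for _, voice in pairs)

    mi = 0.0
    for (emotion, voice), count in joint.items():
        p_joint = count / n
        p_independent = (emotion_counts[emotion] / n) * (voice_counts[voice] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi


# Fully entangled toy data: emotion determines voice.
entangled = [("anger", "yelling")] * 50 + [("boredom", "monotone")] * 50

# Fully balanced toy data: every emotion appears equally with every voice.
balanced = [
    (emotion, voice)
    for emotion in ("anger", "boredom")
    for voice in ("yelling", "monotone")
] * 25

mi_entangled = mutual_information(entangled)
mi_balanced = mutual_information(balanced)
```

The entangled set yields 1 bit, since knowing the emotion fully determines the voice label. The balanced set yields 0 bits, since the emotion says nothing about the voice.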

When curating 5,000 samples from a 100,000-sample dataset, random sampling occupied 490 of the 621 cells, but much of that coverage came from weak signals. Only about 35% of samples fell in the upper half of the expressivity distribution.

Top-N sampling produced the opposite outcome. It achieved the highest expressivity, but coverage dropped to 344 cells and mutual information rose significantly, indicating stronger entanglement. The dataset concentrated in dominant pairings.

Cross-product sampling balanced both objectives. It occupied 463 cells, close to random sampling, while maintaining substantially higher expressivity. It also produced the lowest mutual information, indicating the strongest disentanglement. Crucially, it surfaced plausible but underrepresented combinations — boredom × fast, disappointment × confident, joy × shy, pain × articulate — giving the model direct exposure to variations it would otherwise rarely see.

Through cross-product sampling, we found rarer combinations of speech and emotion for training voice models.

Why This Matters

For expressive voice models, coverage is not just a data-quality metric. It determines what the model can learn to control.

Architectural advances can help separate emotion and delivery, but they depend on training data that actually contains examples where those dimensions vary independently. If the data never shows boredom expressed with energy, or anger expressed quietly, the model has little basis for generating those combinations reliably.

Cross-product sampling gives model builders a structured way to expand that expressive space. It does not replace architecture-level disentanglement. It gives those architectures better evidence to learn from.

The next question is whether lower mutual information in training data is sufficient for independent control at inference. If combinations like bored × energetic are represented with adequate coverage, can models generate them on demand? And can this framework extend beyond emotion and prosody to other axes, such as persona, register, or conversational context?

We are running controlled SFT experiments to test these questions. If you are working on expressive voice models and interested in collaborating, we would welcome the conversation.

Hear the difference

Same prompt, same pairing. One model trained on standard curated data (Baseline), one on cross-product sampled data (Cross-product).

False cheer: a shop assistant staying bright with a demanding customer. [Audio: Baseline | Cross-product]

Restrained excitement: an executive sharing a personal decision. [Audio: Baseline | Cross-product]

Restrained anger: a leader responding to a legal ruling. [Audio: Baseline | Cross-product]
