
Humans can separate how they feel from how they speak. Voice models still struggle to do the same.
Picture someone angry, and you probably imagine a raised voice, faster pacing, and sharper delivery. Picture someone bored, and they likely sound flat, slow, and monotone.
Those pairings are common, but human communication is more flexible than that. A parent can be furious and still whisper. A teacher can be deeply engaged while speaking slowly. An executive can be excited but restrained. People separate emotion from delivery all the time.
Voice models still struggle with this distinction. They tend to learn emotion and delivery as a package: anger fused to shouting, boredom fused to monotone, excitement fused to speed. These pairings dominate natural training data, so models learn them as fixed mappings rather than independent axes.
The result is expressive speech that can sound performed, narrow, or slightly off. Ask a leading model to whisper angrily or sound bored while speaking quickly, and the gap becomes obvious.
Models handle the obvious pairings well. It is the subtler cases, where emotion and delivery pull in different directions, that expose the problem.
We call this the entanglement problem. Emotion and vocal delivery are distinct dimensions of speech, but training data often presents them as inseparable.
There has been meaningful progress on architectural fixes. But because the problem begins in the training distribution itself, solving it robustly also requires a data-level response. We propose cross-product sampling: a method for breaking default pairings in training corpora so models can learn emotion and delivery as independent dimensions.
Where Current Curation Pipelines Fall Short
Voice model development has moved beyond brute-force scaling. The field is increasingly focused on smaller, higher-quality datasets filtered for expressivity and coverage. The dominant recipe is score-and-select: run a speech quality or expressivity model, then take the top samples per category. For example, InstructTTSEval filters for samples where arousal and dominance exceed fixed thresholds. EmoCtrl-TTS curates data using labels from a speech emotion recognition model. ParaSpeechCaps applies expressivity filtering across dozens of style tags.
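The score-and-select recipe can be illustrated with a minimal sketch. The field names (`category`, `score`) and the toy values are hypothetical, and this is not the code of any of the pipelines named above; it only shows the general shape of "rank within a category, keep the top N".

```python
from collections import defaultdict

def top_n_per_category(samples, n):
    """Keep the n highest-scoring samples for each category label."""
    by_category = defaultdict(list)
    for sample in samples:
        by_category[sample["category"]].append(sample)

    selected = []
    for category in sorted(by_category):
        # Rank within the category by the (hypothetical) expressivity score.
        ranked = sorted(by_category[category], key=lambda s: s["score"], reverse=True)
        selected.extend(ranked[:n])
    return selected

# Toy data: ids, categories, and scores are made up for illustration.
samples = [
    {"id": "a", "category": "anger", "score": 0.91},
    {"id": "b", "category": "anger", "score": 0.42},
    {"id": "c", "category": "anger", "score": 0.77},
    {"id": "d", "category": "boredom", "score": 0.65},
    {"id": "e", "category": "boredom", "score": 0.30},
]
picked = top_n_per_category(samples, n=2)
print([s["id"] for s in picked])
```

Note that selection here looks at one axis at a time: nothing in the ranking step knows which delivery style accompanies each emotion.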
These approaches improve average expressivity, but they do not address a deeper issue: the strong correlations between emotion and vocal delivery in natural speech. In our own data, boredom is highly correlated with robotic or monotone delivery (r=0.89), and interest with confident delivery (r=0.78).
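As a rough illustration of how such a correlation might be measured, the sketch below computes a Pearson r between two columns of per-sample tag scores. The score values are invented toy numbers, not the data behind the figures quoted above, and the post does not say which correlation estimator was used; Pearson is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy per-sample scores in [0, 1] for one emotion tag and one voice tag.
boredom_scores = [0.1, 0.2, 0.8, 0.9, 0.5]
monotone_scores = [0.2, 0.1, 0.7, 0.95, 0.4]

r = pearson_r(boredom_scores, monotone_scores)
print(f"r = {r:.2f}")
```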
As a result, common pairings appear thousands of times in a typical training dataset, while less intuitive combinations, like bored but energetic, sad but authoritative, angry but quiet, are sparse or absent. Filtering for the most expressive samples can reinforce this imbalance, because the strongest samples often reflect the dominant patterns.
From the model’s perspective, there is no signal that these are correlations rather than constraints. They are learned as fixed mappings.
Emotion × voice quality
Sixteen real samples, one per emotion × voice pairing. Tile shading shows relative co-occurrence frequency across 100,000 English samples where both tags have a z-score > 1.5 (n = 21,673).
Architectural Approaches and Their Limits
Recent work has tried to address disentanglement at the model level.
NaturalSpeech 3 uses a factorized codec that decomposes speech into separate streams for content, prosody, timbre, and acoustic detail, then trains independent diffusion models on each. Seed-TTS keeps a unified token representation but uses self-distillation to expose the model to controlled attribute variation.
Both approaches make progress on separating speech attributes inside the model. But they still operate within the limits of the data distribution. A factorized representation can infer how restrained anger might sound, but if the training data contains few examples of anger expressed quietly, that inference remains underconstrained. Distillation can encourage separation, but it still depends on the model encountering enough variation to learn from.
When rare combinations are absent, the model lacks the evidence needed to generalize reliably. That motivates a shift upstream: disentanglement at the data level.
Cross-Product Sampling
Our approach builds on Hume’s taxonomy of voice and emotion: 604 leaf-level tags organized into two independent hierarchies. The emotion tree spans 23 parent categories, including anger, amusement, boredom, and anxiety. The voice and prosody tree includes 27 parent categories, including breathy, confident, monotone, and energetic. A tagging model scores each audio sample continuously from 0 to 1 across all 604 dimensions.

Emotion tree (23 parents, 414 leaves)

| Parent category | Example leaf tags |
|---|---|
| anger | aggression, contempt, fury, rage |
| anxiety | dread, nervousness, worry, panic |
| amusement | laughter, humor, silliness, wit |
| awe | wonder, reverence, amazement |
| boredom | detachment, indifference, apathy |
| confusion | bewilderment, disorientation |
| desire | longing, yearning, craving |
| disgust | revulsion, nausea, contempt |
| disappointment | frustration, letdown, regret |
| doubt | skepticism, uncertainty, suspicion |
| embarrassment | shame, awkwardness, shyness |
| fear | terror, alarm, dread, horror |
| joy | elation, euphoria, rapture, bliss |
| love | affection, tenderness, devotion |
| pain | suffering, anguish, torment |
| pleasure | delight, enjoyment, gratification |
| pride | triumph, confidence, dignity |
| sadness | grief, sorrow, despair, melancholy |
| satisfaction | contentment, fulfillment, relief |
| surprise | astonishment, shock, disbelief |
| sympathy | compassion, empathy, concern |
| tiredness | fatigue, exhaustion, lethargy |
| interest | curiosity, fascination, engagement |

Voice / prosody tree (27 parents, 189 leaves)

| Parent category | Example leaf tags |
|---|---|
| abrasive | harsh, rough, grating, scratchy |
| articulate | clear, precise, enunciated, crisp |
| breathy | whispering, hushed, airy, ASMR |
| cartoony | squeaky, whimsical, goofy, babyish |
| confident | assertive, commanding, decisive |
| energetic | lively, animated, spirited, bubbly |
| expressive | emotive, vivid, dramatic, dynamic |
| fast | rapid, hurried, brisk, fleet |
| husky | gravelly, raspy, deep, throaty |
| monotone | flat, toneless, robotic, dull |
| mellow | warm, smooth, muted, gentle |
| mumbling | unclear, slurred, indistinct |
| musical | singing, humming, rapping, poetic |
| nasal | twangy, pinched, adenoidal |
| noisy | crowd, bg music, distorted, fuzzy |
| non-human | alien, animal, mechanical, digital |
| resounding | booming, resonant, sonorous, full |
| robotic | artificial, synthesized, digital |
| shrill | piercing, screeching, wailing |
| shy | tentative, meek, hesitant, quiet |
| slow | leisurely, deliberate, measured |
| smooth | fluid, polished, effortless |
| soft | gentle, faint, quiet, subdued |
| sultry | seductive, throaty, smoldering |
| theatrical | dramatic, performative, grand |
| tinny | thin, metallic, hollow, weak |
| yelling | shouting, screaming, bellowing |
Rather than selecting data based on how strongly it scores on emotion or voice attributes individually, cross-product sampling selects from emotion × voice pairs. Think of the dataset as a grid: emotion categories across one axis, voice and prosody categories across the other. The goal is to fill as many cells as possible, including combinations that natural speech rarely produces: bored but energetic, disappointed but confident, joyful but shy.
Because the raw tag space reflects the biases of natural speech, where some attributes appear far more often and some score more strongly by default, we apply two techniques before sampling.
First, we sample at the parent level rather than the leaf level. At the leaf level, the grid would contain nearly 80,000 cells, many of them redundant. Tags like “hysterical laughing” and “normal laughing” often pull the same samples. Collapsing to parent categories reduces the space to 23 × 27, or 621 cells, while preserving meaningful variation.
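The grid sizes follow directly from the tree sizes, and the collapse itself can be sketched in a few lines. The aggregation rule below (a parent's score is the maximum of its leaves' scores) is an assumption made for illustration; the post does not state how leaf scores are combined. The leaf and parent names come from the taxonomy tables above, while the score values are made up.

```python
# Grid sizes implied by the taxonomy counts.
leaf_cells = 414 * 189    # emotion leaves x voice leaves
parent_cells = 23 * 27    # emotion parents x voice parents
print(leaf_cells, parent_cells)

# A small slice of the leaf -> parent mapping.
LEAF_TO_PARENT = {
    "fury": "anger",
    "rage": "anger",
    "apathy": "boredom",
    "indifference": "boredom",
}

def collapse_to_parents(leaf_scores, leaf_to_parent):
    """Aggregate leaf scores to parent scores (assumed rule: max over leaves)."""
    parent_scores = {}
    for leaf, score in leaf_scores.items():
        parent = leaf_to_parent[leaf]
        parent_scores[parent] = max(score, parent_scores.get(parent, 0.0))
    return parent_scores

# Toy leaf scores for one audio sample.
leaf_scores = {"fury": 0.2, "rage": 0.7, "apathy": 0.1, "indifference": 0.05}
parent_scores = collapse_to_parents(leaf_scores, LEAF_TO_PARENT)
print(parent_scores)
```

Other aggregation rules (mean, or a learned parent score) would slot into the same function; the point is only that near-duplicate leaves stop competing for separate cells.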
Second, we apply z-normalization so dominant attributes do not drown out subtler ones. Instead of comparing raw scores across attributes, we measure how prominent each attribute is relative to its own distribution. A modest raw score on “embarrassment” can outrank a higher raw score on “interest” if it represents a stronger-than-usual signal for that attribute.
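A small worked example makes the effect concrete. The scores below are invented: "interest" scores high on nearly every sample, while "embarrassment" is usually near zero. Whether the actual pipeline uses population or sample standard deviation is not stated; this sketch assumes population standard deviation.

```python
import statistics

def z_scores(values):
    """Standardize one attribute's scores against its own distribution."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]

# Toy raw scores for five samples (one list per attribute).
interest = [0.60, 0.70, 0.65, 0.75, 0.70]
embarrassment = [0.05, 0.10, 0.05, 0.10, 0.40]

interest_z = z_scores(interest)
embarrassment_z = z_scores(embarrassment)

# For the last sample, the raw embarrassment score (0.40) is lower than the
# raw interest score (0.70), but it is far more unusual for its attribute.
print(f"interest z = {interest_z[4]:.2f}, embarrassment z = {embarrassment_z[4]:.2f}")
```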
With these adjustments, we explicitly sample for rare intersections. This is the critical step. No amount of top-N filtering, threshold tuning, or reweighting will reliably produce combinations like bored × energetic if the selection process does not actively seek them.
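One way to picture the sampling step is as bucketing followed by a per-cell quota. The sketch below is an illustration under stated assumptions, not the production procedure: the per-cell quota (`per_cell`), the ranking rule (the smaller of the two z-scores), and the threshold of 1.5 (borrowed from the figure caption earlier in this post) are all hypothetical choices, as are the field names and toy values.

```python
from collections import defaultdict

def cross_product_sample(samples, per_cell, z_threshold=1.5):
    """Pick up to per_cell samples for every occupied (emotion, voice) cell."""
    cells = defaultdict(list)
    for sample in samples:
        for emotion, emotion_z in sample["emotion_z"].items():
            if emotion_z <= z_threshold:
                continue
            for voice, voice_z in sample["voice_z"].items():
                if voice_z > z_threshold:
                    # Assumed ranking: a sample is as strong as its weaker axis.
                    strength = min(emotion_z, voice_z)
                    cells[(emotion, voice)].append((strength, sample["id"]))

    selected = {}
    for cell, members in cells.items():
        members.sort(reverse=True)
        selected[cell] = [sample_id for _, sample_id in members[:per_cell]]
    return selected

# Toy parent-level z-scores; ids and values are made up.
samples = [
    {"id": "s1", "emotion_z": {"boredom": 2.4}, "voice_z": {"monotone": 2.1}},
    {"id": "s2", "emotion_z": {"boredom": 2.0}, "voice_z": {"monotone": 2.6}},
    {"id": "s3", "emotion_z": {"boredom": 1.9}, "voice_z": {"monotone": 1.8}},
    {"id": "s4", "emotion_z": {"boredom": 1.7}, "voice_z": {"energetic": 1.6}},
    {"id": "s5", "emotion_z": {"anger": 0.4}, "voice_z": {"soft": 2.2}},
]
selection = cross_product_sample(samples, per_cell=2)
for cell, ids in sorted(selection.items()):
    print(cell, ids)
```

The quota is what breaks the default pairings: the common boredom × monotone cell is capped, while the single boredom × energetic sample is kept even though its scores are weaker.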
Results: Measuring Disentanglement and Diversity
The goal is to maximize coverage of emotion × voice combinations while preserving expressive signals. Optimizing for only one side is easy. Random sampling fills more of the grid, but much of the audio is low-signal. Top-N expressivity filtering produces stronger samples, but concentrates them in the most common pairings.
We evaluate three sampling strategies: random sampling, top-N parent-level sampling, and cross-product sampling. Each is measured along three dimensions: grid coverage, expressivity, and mutual information.
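The first two measurements can be sketched as follows. The exact definitions used in the evaluation are not spelled out here, so both functions are assumptions: coverage counts distinct occupied cells in the 23 × 27 parent grid, and expressivity is summarized as the share of selected samples scoring above the median of a reference distribution. The toy inputs are invented.

```python
import statistics

def grid_coverage(pairs, n_emotions=23, n_voices=27):
    """Return (occupied cell count, fraction of the full grid occupied)."""
    occupied = len(set(pairs))
    return occupied, occupied / (n_emotions * n_voices)

def share_in_upper_half(selected_scores, reference_scores):
    """Fraction of selected samples above the reference distribution's median."""
    median = statistics.median(reference_scores)
    above = sum(1 for score in selected_scores if score > median)
    return above / len(selected_scores)

# Toy (emotion, voice) assignments for a selected subset.
toy_pairs = [("anger", "yelling"), ("anger", "yelling"), ("boredom", "fast")]
occupied, fraction = grid_coverage(toy_pairs)

# Toy expressivity scores: the full pool and the selected subset.
reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
selected = [0.2, 0.5, 0.6, 0.3]
upper_share = share_in_upper_half(selected, reference)

print(occupied, round(fraction, 4), upper_share)
```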
We use mutual information to quantify entanglement. If knowing the emotion tells you a lot about the voice quality, mutual information is high. If emotion and voice are more independent, mutual information is lower.
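For reference, a plug-in estimate of mutual information over discrete (emotion, voice) labels looks like the sketch below. It assumes each sample is assigned a single emotion parent and a single voice parent, which may differ from how the actual evaluation assigns samples to cells, and it reports the result in bits.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(emotion; voice) in bits from (emotion, voice) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    emotion_counts = Counter(emotion for emotion, _ in pairs)
    voice_counts = Counter(voice for _, voice in pairs)

    mi = 0.0
    for (emotion, voice), count in joint.items():
        # p(e, v) / (p(e) * p(v)), written with counts to avoid rounding noise.
        ratio = count * n / (emotion_counts[emotion] * voice_counts[voice])
        mi += (count / n) * math.log2(ratio)
    return mi

# Fully entangled toy data: emotion determines voice.
entangled = [("anger", "yelling")] * 50 + [("boredom", "monotone")] * 50

# Fully independent toy data: every combination is equally likely.
independent = [
    (emotion, voice)
    for emotion in ("anger", "boredom")
    for voice in ("yelling", "monotone")
] * 25

print(mutual_information(entangled), mutual_information(independent))
```

With two equally likely emotions, perfect entanglement gives one bit (knowing the emotion fully determines the voice), and a uniform grid gives zero.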

In a 5,000-sample curated subset drawn from a 100,000-sample dataset, random sampling occupied 490 of the 621 cells (about 79%), but much of that coverage came from weak signals. Only about 35% of samples fell in the upper half of the expressivity distribution.
Top-N sampling produced the opposite outcome. It achieved the highest expressivity, but coverage dropped to 344 cells (about 55%) and mutual information rose significantly, indicating stronger entanglement. The dataset concentrated in dominant pairings.
Cross-product sampling balanced both objectives. It occupied 463 cells (about 75%), close to random sampling, while maintaining substantially higher expressivity. It also produced the lowest mutual information, indicating the strongest disentanglement. Crucially, it surfaced plausible but underrepresented combinations (boredom × fast, disappointment × confident, joy × shy, pain × articulate), giving the model direct exposure to variations it would otherwise rarely see.
Through cross-product sampling, we found rarer combinations of speech and emotion for training voice models.

Why This Matters
For expressive voice models, coverage is not just a data-quality metric. It determines what the model can learn to control.
Architectural advances can help separate emotion and delivery, but they depend on training data that actually contains examples where those dimensions vary independently. If the data never shows boredom expressed with energy, or anger expressed quietly, the model has little basis for generating those combinations reliably.
Cross-product sampling gives model builders a structured way to expand that expressive space. It does not replace architecture-level disentanglement. It gives those architectures better evidence to learn from.
The next question is whether lower mutual information in training data is sufficient for independent control at inference. If combinations like bored × energetic are represented with adequate coverage, can models generate them on demand? And can this framework extend beyond emotion and prosody to other axes, such as persona, register, or conversational context?
We are running controlled SFT experiments to test these questions. If you are working on expressive voice models and interested in collaborating, we would welcome the conversation.
Hear the difference
Same prompt, same pairing. One model trained on standard curated data, one on cross-product sampled data.
False cheer — A shop assistant staying bright with a demanding customer
Restrained excitement — An executive sharing a personal decision
Restrained anger — A leader responding to a legal ruling


