Leapfrog years of R&D. Access the architectures behind leading voice models.
Don't solve every problem from scratch. At Hume, we believe the best voice AI is built through collective effort.
TADA
Low latency, low hallucination, open source
Deploy TADA (Text and Audio Dual Alignment), a TTS system that models text and audio in one synchronized stream to reduce token-level hallucinations and latency.
Zero hallucinations
Zero content hallucinations across 1,000+ test samples.
5x faster
5x faster than similar-grade LLM-based TTS systems.
Long-form audio
2,048 tokens cover ~700 seconds with TADA vs. ~70 seconds in conventional systems.
Free transcript
Get a transcript alongside audio with no added latency.
MLX support
Run locally on Apple Silicon with optimized MLX inference.
Open source
Fully open weights and architecture for research and production.
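To make the long-form figure above concrete, here is a minimal sketch of the arithmetic it implies: the same 2,048-token context covering ~700 seconds versus ~70 seconds corresponds to roughly a tenfold difference in tokens per second of audio. The helper name `tokens_per_second` is hypothetical, and the inputs are the approximate figures quoted on this page, not measured values.

```python
def tokens_per_second(context_tokens: int, audio_seconds: float) -> float:
    """Average number of tokens spent per second of audio (hypothetical helper)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return context_tokens / audio_seconds


CONTEXT_TOKENS = 2048  # context size quoted on this page

# Approximate durations quoted on this page.
tada_rate = tokens_per_second(CONTEXT_TOKENS, 700.0)
conventional_rate = tokens_per_second(CONTEXT_TOKENS, 70.0)

print(f"TADA: ~{tada_rate:.1f} tokens per second of audio")
print(f"Conventional: ~{conventional_rate:.1f} tokens per second of audio")
print(f"Ratio: ~{conventional_rate / tada_rate:.0f}x")
```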
EVI
Empathic speech-to-speech with contextual understanding
Access EVI (Empathic Voice Interface), a speech-to-speech system with understanding of user speech prosody, native language generation, and customizable voices.
Emotion instruction
Emotion instruction following and unparalleled naturalness.
Voice design
Voice cloning and voice design support.
Natural turn-taking
Interruptibility and backchanneling.
Tool use
Tool use and dynamic variables for agentic workflows.
Context injection
Context injection and external LLM compatibility.
Multilingual
Native language generation across a growing list of languages.
Octave
Low-latency TTS with voice design and expression modulation
Use Octave (Omni Capable Text and Voice Engine), an LLM-based TTS system with voice design, voice modulation, voice cloning, voice conversion, and more.

Multispeaker
Multispeaker and multilingual synthesis in a single model.
Voice design
Infinite voices through natural language voice descriptions.
Creator platform
Purpose-built for audiobooks and podcasts.
Voice cloning
Clone any voice from a short audio sample.
Expression modulation
Fine-grained control over emotion and delivery style.
Low latency
Streaming output with fast time-to-first-byte for real-time use.
From Our Lab
Peer-reviewed insights
The 2026 ACII Dyadic Conversations (DaiKon) Workshop & Challenge


The 2026 ACII Dyadic Conversations (ACII-DaiKon) Workshop & Challenge introduces a benchmark for modeling interpersonal affect and social dynamics in dyadic conversations. Although conversational affect modeling has advanced rapidly, most benchmarks remain speaker-centric and underrepresent coupled, time-evolving processes between partners, including directional influence, conversational timing coordination, and rapport development. To address this gap, ACII-DaiKon presents three coordinated sub-challenges built on a shared dataset: (1) directional interpersonal influence prediction, (2) turn-taking prediction (next-speaker and time-to-next-speech), and (3) rapport trajectory prediction across full interactions. The challenge is built on the Hume-DaiKon dataset, comprising 945 dyadic conversations (743.4 hours of audiovisual data) collected under naturalistic conditions across five languages. The benchmark supports multimodal modeling, temporal reasoning, and cross-context generalization through fixed train/validation/test splits, standardized metrics, and released baseline systems. Evaluation uses Concordance Correlation Coefficient (CCC), Pearson correlation, Macro-F1, and Mean Absolute Error (MAE) depending on the sub-challenge. Baseline experiments establish initial reference performance, with best test results of 0.40 CCC and 0.50 Pearson for influence prediction, 0.66 Macro-F1 and 1.50 s MAE for turn-taking, and 0.68 CCC and 0.70 Pearson for rapport trajectory modeling. These results indicate that while current methods capture coarse dyadic patterns, robust modeling of directional dependence and long-horizon interpersonal dynamics remains challenging. The workshop provides a shared platform for rigorous comparison and cross-disciplinary discussion on data validity, evaluation protocols, and culturally aware modeling for dyadic interaction.
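For readers unfamiliar with the Concordance Correlation Coefficient named among the metrics above, the following is a minimal sketch of its standard definition (Lin's formula with population variances). It is an illustration only, not the challenge's official scoring code; the function name `concordance_cc` and the sample values are hypothetical.

```python
from statistics import fmean


def concordance_cc(pred: list[float], gold: list[float]) -> float:
    """Concordance Correlation Coefficient between predictions and labels.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2)
    using population (divide-by-n) variance and covariance.
    """
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        # Both series are constant and identical; CCC is undefined here.
        # Returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return 2.0 * cov / denom


# Hypothetical values: predictions that track the labels but are offset by +1.
gold = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]
print(concordance_cc(gold, gold))     # perfect agreement
print(concordance_cc(shifted, gold))  # penalized for the constant offset
```

Unlike Pearson correlation, which would score the offset predictions as perfectly correlated, CCC also penalizes differences in mean and scale, which is one reason the two metrics can be reported side by side.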
TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment (Under Review)
Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM. We demonstrate that these synchronous tokens maintain high-fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a flow matching head. Moreover, the ability to seamlessly toggle speech modality within the context enables text-only guidance, a technique that blends logits from text-only and text-speech modes to flexibly bridge the gap toward text-only LLM intelligence. Experimental results indicate that our approach achieves performance competitive with state-of-the-art TTS and SLM systems while virtually eliminating content hallucinations and preserving linguistic integrity, all at a significantly reduced inference cost.
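As a rough illustration of the text-only guidance idea mentioned in the abstract above, the sketch below blends two logit vectors by linear interpolation before applying a softmax. The abstract does not state the blending rule, so the interpolation form, the `weight` parameter, the function names, and the toy logits are all assumptions made for illustration, not the paper's method.

```python
import math


def blend_logits(text_only: list[float], text_speech: list[float],
                 weight: float) -> list[float]:
    """Linearly interpolate two logit vectors (assumed blending rule).

    weight = 0.0 keeps the text-speech logits unchanged;
    weight = 1.0 uses the text-only logits alone.
    """
    if len(text_only) != len(text_speech):
        raise ValueError("logit vectors must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return [(1.0 - weight) * s + weight * t
            for t, s in zip(text_only, text_speech)]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a two-token vocabulary (hypothetical values).
text_only_logits = [2.0, 0.0]
text_speech_logits = [0.0, 2.0]

blended = blend_logits(text_only_logits, text_speech_logits, weight=0.5)
print(softmax(blended))
```

Raising `weight` in this sketch pulls the next-token distribution toward what the text-only mode would predict, which is the general direction the abstract describes.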
How emotion is experienced and expressed in multiple cultures: a large-scale experiment across North America, Europe, and Japan


Core to understanding emotion are subjective experiences and their expression in facial behavior. Past studies have largely focused on six emotions and prototypical facial poses, reflecting limitations in scale and narrow assumptions about the variety of emotions and their patterns of expression.
Get Started with Hume Today
Build, train, and evaluate your voice AI models with us. Reach out to get started.