Vocal bursts are little understood but profoundly important: they provide structure and dynamism to our social interactions. They seem to predate language and make up a large part of the early communication between babies and their parents. The way two people use vocal bursts in a conversation can reveal the quality of their friendship or their relative rank in a social hierarchy. It’s clear that vocal bursts are an important part of the human toolkit for communication and expression, yet we still have little idea of what they mean. Do they express specific emotions, convey only a general sense of how someone is feeling, or something else entirely? How many different emotions can we express using vocal bursts? Do people from different parts of the world use the same sounds to express the same emotions?
Our Study
Over 16,000 people from the United States, China, India, South Africa, and Venezuela participated in our study to help determine the emotional meanings of vocal bursts.
We started by collecting thousands of audio clips of people from around the world making sounds like this (and this, and this). We asked a first group of 8,941 participants to listen to the vocal bursts and tell us what emotions they thought they conveyed (choosing from up to 48 different options – including nuanced states like Empathic Pain, Contemplation, and Aesthetic Appreciation).
On their own, these emotion ratings can tell us a lot about what people around the world hear in vocal bursts. But, in principle, these ratings could also be influenced by quirks of the original recordings, such as the speakers’ voices, demographics, or audio quality.
So, for each vocal burst, we asked participants to record themselves mimicking the sounds they heard. This produced a uniquely large and diverse set of vocal bursts – 282,906 recordings made by our participants, each tagged with the emotions the participant reported expressing. This allowed us to learn more about the underlying vocal modulations that convey emotion. Specifically, did the mimicked vocal bursts reliably convey the same emotions as the original “seed” vocal bursts? If so, that would indicate an underlying meaning in these vocalizations that isn’t shaped by the identity of the person making them, such as their gender or age. It would also confirm that translations of emotion terms into different languages carry similar meanings when they are used to describe the same vocal bursts.
To address this systematically, we asked an additional group of 7,879 participants from the same countries to listen to the recordings of participants mimicking the vocal bursts and tell us what emotions they heard.
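As a rough illustration (not our actual analysis pipeline), here is one way the agreement between seed and mimicked vocal bursts could be quantified in Python. The `ratings` table, its column names, and the 0–1 rating scale are hypothetical assumptions made for this sketch.

```python
# Sketch only: test whether mimicked vocal bursts preserve the emotional
# meaning of the seed bursts they imitate.
# Assumes a hypothetical DataFrame `ratings` with columns:
#   seed_id - ID of the original "seed" vocal burst
#   source  - "seed" or "mimic" (which recording was rated)
#   emotion - one of the 48 emotion categories
#   rating  - mean listener endorsement of that emotion (0 to 1)
import numpy as np
import pandas as pd

def seed_mimic_agreement(ratings: pd.DataFrame) -> pd.Series:
    """Correlate seed and mimic emotion ratings across seed IDs,
    separately for each emotion category."""
    # Average ratings per (seed_id, source, emotion), then pivot so each
    # row holds the seed rating and the mimic rating side by side.
    profile = (
        ratings.groupby(["seed_id", "source", "emotion"])["rating"]
        .mean()
        .unstack("source")          # columns: "seed", "mimic"
        .dropna()
        .reset_index()
    )
    # For each emotion, how well do mimic ratings track seed ratings?
    return profile.groupby("emotion").apply(
        lambda g: np.corrcoef(g["seed"], g["mimic"])[0, 1]
    )
```

High correlations across emotion categories would suggest that the mimicked bursts carry the same emotional signal as the originals, independent of who produced them.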
We wanted to determine whether the same underlying vocal expressions were present in the vocal burst recordings made around the world, and whether these expressions had the same meaning to people in different cultures. How many distinct expressions emerged, given that they were made by people from a wide variety of backgrounds, cultures, and contexts? Did people from around the world use the same emotion concepts to describe the same vocal bursts, even though they responded in their own languages?
Since we had a very large dataset to work with, we were able to approach this with cutting-edge computational methods. We trained an AI model to understand vocal bursts, and used this model to address our questions in a data-driven way.
Training AI to understand vocal expression
Large language models (LLMs) have made exciting progress toward enabling computational systems to communicate with humans, but they currently miss out on much of what makes the human voice so rich. We express a multitude of meanings by modulating how we say things, including the way we punctuate and interleave our speech with vocal bursts. Training AI models to differentiate the nuanced ways we express ourselves with our voices is key to getting AI to do what we want—in other words, to aligning AI with human well-being (the ultimate mission we’re building toward at Hume AI).
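To make this concrete, here is a deliberately simplified sketch of what training a model to map a vocal burst to an emotion profile can look like. It is not the architecture or training setup we actually used; the shapes, layer sizes, and hyperparameters below are illustrative assumptions.

```python
# Toy sketch: a small network that maps a vocal-burst spectrogram to a
# 48-dimensional emotion profile (e.g., mean listener ratings).
import torch
import torch.nn as nn

N_EMOTIONS = 48               # emotion categories used in our ratings
N_MELS, N_FRAMES = 64, 200    # assumed mel-spectrogram shape

class BurstEmotionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # pool over time and frequency
        )
        self.head = nn.Linear(32, N_EMOTIONS)

    def forward(self, spectrogram):    # shape: (batch, 1, N_MELS, N_FRAMES)
        features = self.encoder(spectrogram).flatten(1)
        return self.head(features)     # predicted emotion profile

model = BurstEmotionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                 # regress mean listener ratings

# One illustrative training step on random stand-in data.
spectrograms = torch.randn(8, 1, N_MELS, N_FRAMES)
targets = torch.rand(8, N_EMOTIONS)
optimizer.zero_grad()
loss = loss_fn(model(spectrograms), targets)
loss.backward()
optimizer.step()
```

A model trained this way learns which acoustic patterns listeners associate with which emotions, and its internal representations can then be probed to study those patterns directly.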
From a psychology perspective, analyzing the outputs of AI models like deep neural networks (DNNs) can also tell us more about the input dataset at a fine-grained level of detail. How many distinct auditory signals were present in our data? How many of these correspond to distinct patterns of emotion expression shared across cultures?
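One way to approach such questions, sketched below with stand-in data, is to look for dimensions along which listeners’ judgments of the same vocal bursts covary across cultures, for instance with canonical correlation analysis. This is a simplified illustration rather than the exact analysis in our study; the arrays, sizes, and threshold are assumptions.

```python
# Sketch: estimate how many expression dimensions are shared across two
# cultures by correlating their emotion judgments of the same vocal bursts.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_bursts, n_emotions = 1000, 48
profiles_culture_a = rng.random((n_bursts, n_emotions))   # stand-in data
profiles_culture_b = rng.random((n_bursts, n_emotions))   # stand-in data

# CCA finds paired dimensions along which the two cultures' judgments covary;
# counting the reliably correlated pairs gives one estimate of how many
# expression dimensions are shared.
n_components = 20
cca = CCA(n_components=n_components)
scores_a, scores_b = cca.fit_transform(profiles_culture_a, profiles_culture_b)

canonical_corrs = [
    np.corrcoef(scores_a[:, i], scores_b[:, i])[0, 1]
    for i in range(n_components)
]
shared_dims = sum(r > 0.5 for r in canonical_corrs)   # arbitrary threshold
print(f"Dimensions with canonical correlation > 0.5: {shared_dims}")
```

In practice, estimates like this need to be validated on held-out data, since canonical correlations computed on the same data used to fit the model are optimistically biased.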