Developing computational systems that can communicate with humans—and figure out what they really want—is perhaps the most important goal in AI research. So far, progress in this area has been driven by models that are trained to understand language. Language is an essential component of our intelligence as a species and our ability to cooperate socially.
But there’s a lot more to how humans communicate than language. There’s the tune, rhythm, and timbre of our voice. These signals are so rich that humans can—and often do—use them without even forming words. But just how much can we say without language? This is a key unanswered question not just for AI research, but for psychology as well. We gasp during scary movies, sigh with fatigue, grunt with effort, chuckle with amusement – but what can these sounds – known as “vocal bursts” – really tell us?
Recently, we investigated the emotional meaning of vocal bursts using data collected from around the world. We published our findings last month in the journal Nature Human Behaviour, in a paper titled “Deep learning reveals what vocal bursts express in different cultures.”
Vocal bursts are little understood, but profoundly important: They provide structure and dynamism to our social interactions. They seem to predate language and make up a huge part of the early communication that occurs between babies and their parents. The way two people use vocal bursts in a conversation can provide information about how good their friendship is or their relative rank in a social hierarchy. It’s clear that vocal bursts are an important part of the human toolkit for communication and expression, but we still have little idea of what they mean. Do they express emotions, just give you a general idea of how someone is feeling, or something else? How many different emotions can we express using vocal bursts? Do people from different parts of the world use the same sounds to express the same emotions?
Over 16,000 people from the United States, China, India, South Africa, and Venezuela participated in our study to help determine the emotional meanings of vocal bursts.
We started by collecting thousands of audio clips of people from around the world making vocal bursts. We asked a first group of 8,941 participants to listen to the vocal bursts and tell us what emotions they thought they conveyed, choosing from up to 48 different options, including nuanced states like Empathic Pain, Contemplation, and Aesthetic Appreciation.
On their own, these emotion ratings can tell us a lot about what people around the world hear in vocal bursts. But, in theory, these ratings could also just be influenced by quirks of the original vocal bursts, like the speaking voices, demographics, or audio quality of the recordings.
So, for each vocal burst, we asked participants to record themselves mimicking the sounds they heard. This led to a uniquely large and diverse set of vocal bursts – 282,906 vocal bursts recorded by our participants tagged with self-reported emotions. This allowed us to understand more about the underlying vocal modulations that convey emotion. Specifically, did the mimicked vocal bursts reliably convey the same emotions as the original “seed” vocal bursts? This would indicate some underlying meaning expressed by these vocalizations that isn’t shaped by the identity of the person making them, like their gender and age. It would also confirm that the translations of emotion terms in different languages carry similar meanings if they are used to describe the same vocal bursts.
In order to address this systematically, we asked an additional group of 7,879 participants from the same countries to listen to the recordings of participants mimicking the vocal bursts and decide what emotions they heard.
We wanted to determine whether the same underlying vocal expressions were present in the vocal burst recordings made from around the world, and whether these expressions had the same meaning to people in different cultures. How many distinct expressions were there, even though they were made by people from a variety of backgrounds, cultures, and contexts? Did people from around the world use the same emotion concepts to describe the same vocal bursts, even though they responded in their own languages?
Since we had a very large dataset to work with, we were able to approach this with cutting-edge computational methods. We trained an AI model to understand vocal bursts, and used this model to address our questions in a data-driven way.
Training AI to understand vocal expression
Large language models (LLMs) have made exciting progress toward enabling computational systems to communicate with humans, but they currently miss out on a lot of what makes the human voice so rich. We express a multitude of meanings by changing up how we say things, including the way we punctuate and alternate our words with vocal bursts. Training AI models to differentiate the nuanced ways we express ourselves with our voices is key to getting AI to do what we want—in other words, aligning AI with human well-being (the ultimate mission we’re building toward at Hume AI).
From a psychology perspective, analyzing the outputs of AI models like deep neural networks (DNNs) can also tell us more about the input dataset at a fine-grained level of detail. How many distinct auditory signals were present in our data? How many of these correspond to distinct patterns of emotion expression shared across cultures?
We trained a DNN to find vocal expressions that had distinct meanings within or across cultures. Using a DNN allowed us to precisely control several important aspects of our analysis that could be problematic for studies on vocal expression.
We trained the DNN on the set of mimicked vocal bursts, and tested the DNN by having it predict the emotions in the original seed vocal bursts (which it had no exposure to during training). This meant that the model was forced to ignore factors like the particular speaking voices of the participants in our study, as these were randomized in the mimicry portion of our experiment. Instead, the model focused on isolating the consistencies in auditory input that give rise to human judgments of particular emotions.
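The train/test design described here can be sketched as a simple split on recording type. This is a minimal illustration rather than the study's actual pipeline: the `Clip` record, its field names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical record for one recording; field names are assumptions."""
    clip_id: str
    kind: str        # "seed" (original recording) or "mimic" (participant imitation)
    speaker_id: str


def split_seed_vs_mimic(clips):
    """Train on mimicked bursts only; hold out every seed burst for testing."""
    train = [c for c in clips if c.kind == "mimic"]
    test = [c for c in clips if c.kind == "seed"]
    # Guard against leakage: no recording may appear on both sides.
    assert not {c.clip_id for c in train} & {c.clip_id for c in test}
    return train, test


# Toy data, invented for illustration.
clips = [
    Clip("seed-1", "seed", "actor-A"),
    Clip("mimic-1", "mimic", "participant-17"),
    Clip("mimic-2", "mimic", "participant-42"),
]
train, test = split_seed_vs_mimic(clips)
print(len(train), len(test))  # 2 1
```

Because the seed recordings never enter training, any accuracy on them has to come from acoustic patterns that survive a change of speaker.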
We then compared the DNN predictions to the average judgments that the human participants in our study made about the seed vocal bursts. These comparisons allowed us to uncover how many distinct vocal expressions were in the data, and precisely quantify how many of these expressions had shared meanings across cultures.
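One way to picture this comparison is to average the listeners' ratings for each seed clip and correlate those averages with the model's predictions. The sketch below assumes Pearson correlation as the agreement measure, which the article does not specify; the function names and all numbers are invented for illustration.

```python
from math import sqrt


def mean_judgment(ratings):
    """Average several listeners' ratings of one clip for one emotion."""
    return sum(ratings) / len(ratings)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy data, invented: three listeners rate one emotion for four seed clips.
listener_ratings = [
    [0.9, 0.8, 1.0],
    [0.1, 0.2, 0.0],
    [0.6, 0.7, 0.5],
    [0.3, 0.2, 0.4],
]
human_means = [mean_judgment(r) for r in listener_ratings]
model_preds = [0.85, 0.15, 0.55, 0.35]  # hypothetical model outputs

agreement = pearson(model_preds, human_means)
print(round(agreement, 3))
```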
We found that our model was able to differentiate 24 different kinds of vocal expressions shared across cultures. Twenty-one kinds of vocal expression had the same primary meaning across all five cultures, and the remaining three had the same primary meaning in four out of five cultures. The emotions and mental states people associated with the different kinds of vocal bursts were 79% preserved across cultures (which is impressively high given that emotion concepts themselves can differ across cultures and languages).
The 21 distinct kinds of vocal bursts that had the same primary meaning, expressed using the same 21 emotion concepts (or combinations of concepts) or their most direct translations across all five countries, were: Admiration/Aesthetic Appreciation/Awe, Adoration/Love/Sympathy, Anger/Distress, Boredom, Concentration/Contemplation/Calmness, Confusion/Doubt, Desire, Disappointment, Disgust, Excitement/Triumph, Fear/Horror, Horror, Interest, Joy/Amusement, Pain, Pride/Triumph, Relief, Sadness, Satisfaction, Surprise, and Tiredness.
Importantly, the model architecture allowed us to avoid linguistic bias, which is a tricky problem when conducting experiments and training machine learning models with data from multiple cultures. We structured the model so that the average emotion judgments within each culture, collected in three separate languages (English, Chinese, and Spanish), were output separately. This means that the DNN was not instructed to assume any relationship between emotion concepts and how they are used in different countries or languages.
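The idea of separate outputs per culture can be illustrated with a toy model that has one independent output head for each country. This is only a sketch under stated assumptions: the shared feature vector, the linear heads, the random weights, and the names `make_heads` and `predict` are all hypothetical, and the real network is far larger.

```python
import random

# Countries named in the study; everything else here is a hypothetical sketch.
COUNTRIES = ["United States", "China", "India", "South Africa", "Venezuela"]


def make_heads(n_features, n_emotions, seed=0):
    """Create one independent weight matrix per country (no weights shared between heads)."""
    rng = random.Random(seed)
    return {
        country: [[rng.gauss(0.0, 0.1) for _ in range(n_features)]
                  for _ in range(n_emotions)]
        for country in COUNTRIES
    }


def predict(shared_features, heads):
    """Map one audio representation to a separate rating vector per country."""
    return {
        country: [sum(w * x for w, x in zip(row, shared_features)) for row in weights]
        for country, weights in heads.items()
    }


heads = make_heads(n_features=4, n_emotions=3)
outputs = predict([0.2, -0.1, 0.5, 0.3], heads)
print({country: len(vec) for country, vec in outputs.items()})
```

Nothing in this structure ties output dimension 0 for one country to output dimension 0 for another, so any correspondence between them has to be learned from the ratings themselves.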
Since the model was not given any translation of the words from one language to another, the relationships we uncovered between the words used in different languages show that the concepts were used similarly to describe the same vocal bursts. This means that the dimension corresponding to “happiness” in one country could just as easily have been found to correspond to “sadness” in another, if the same vocal modulations in fact had opposite meanings across cultures.
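To make this concrete, a correspondence between concepts in two languages can be recovered after the fact by checking which output dimension in one language rises and falls with a given dimension in another across the same clips. The following is a hedged sketch, not the paper's analysis: `correlation`, `best_match`, the Spanish labels, and every score are illustrative assumptions.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den


def best_match(source_scores, target_scores_by_concept):
    """Return the target-language concept whose scores track the source scores most closely."""
    return max(
        target_scores_by_concept,
        key=lambda name: correlation(source_scores, target_scores_by_concept[name]),
    )


# Toy scores for the same four clips, invented for illustration.
english_joy = [0.9, 0.1, 0.8, 0.2]
spanish_outputs = {
    "alegría": [0.8, 0.2, 0.9, 0.1],
    "tristeza": [0.1, 0.9, 0.2, 0.7],
}
match = best_match(english_joy, spanish_outputs)
print(match)  # alegría
```

Had the toy scores been swapped, the same procedure would have paired the English dimension with "tristeza" instead, which is why an observed match is informative.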
Our findings add to a growing body of work showing that large numbers of emotions have associated expressions with shared meanings across cultures. They also provide an example of how DNNs can be used to investigate psychological processes while controlling for human biases.
The results speak to the universality of vocal expression at an unprecedented level of detail, but they are by no means exhaustive. Going forward, we would like to expand this approach to more languages and cultures than we were able to study here.
We hope that this work contributes to the challenging task of building models that can accurately engage with the complexities and nuances of how people around the world express themselves. Recently, we held machine learning competitions at ICML 2022 and ACII 2022, with the dataset we collected here, to encourage and foster community around these exciting goals.
To learn more about the background for this project and stay up to date on our latest developments, you can visit hume.ai/science.