Julia Chivu ’23

Because speech and song are produced by the same vocal tract, researchers continue to debate whether the two can be distinguished by their acoustic features across different societies. Even within a single culture, it can be difficult to determine where speech ends and song begins. The question is especially hard to study because there has been no reliable method for comparing the two kinds of vocalization. However, a multi-university research team led by Dr. Albouy has proposed using spectro-temporal modulation patterns to tell these two kinds of vocal expression apart.
Spectro-temporal modulation describes how a sound's energy fluctuates over time (temporal modulation) and across frequency (spectral modulation). Prior to the study, the spectro-temporal features of music had only been characterized in Western music. The research team therefore studied the spectro-temporal modulation patterns of vocalizations produced by 369 people. The participants came from 21 different societies across the globe, spanning rural, urban, and small-scale regions. Participants listened to recordings of vocalizations from other participants who did not speak their language and were asked to place each unfamiliar vocalization into one of two categories: speech or song. Additionally, a machine learning algorithm was trained to analyze the spectro-temporal characteristics of the vocalizations produced by the participants. The algorithm was trained using vocal expression data from only one region.
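To make the idea of spectro-temporal modulation more concrete, here is a minimal sketch in Python. It is not the research team's pipeline, whose details are not given in this article. It assumes one common way of estimating modulation content: taking a two-dimensional Fourier transform of a log spectrogram. The function names (`log_spectrogram`, `modulation_power`, `temporal_share`), the frame sizes, and the two synthetic test tones are all hypothetical choices made for illustration.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform (fine for the tiny inputs used here)."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def log_spectrogram(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of log magnitudes per frequency bin."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]  # Hann window
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = dft(frame)[: frame_len // 2 + 1]  # keep non-negative frequencies
        frames.append([math.log(abs(c) + 1e-6) for c in spectrum])
    return frames


def modulation_power(spec):
    """2D DFT power of a spectrogram, indexed as [temporal rate][spectral rate]."""
    along_freq = [dft(frame) for frame in spec]  # transform across frequency
    n_frames, n_bins = len(along_freq), len(along_freq[0])
    power = [[0.0] * n_bins for _ in range(n_frames)]
    for f in range(n_bins):
        column = dft([along_freq[t][f] for t in range(n_frames)])  # across time
        for t in range(n_frames):
            power[t][f] = abs(column[t]) ** 2
    return power


def temporal_share(signal):
    """Fraction of modulation power at non-zero temporal modulation rates."""
    power = modulation_power(log_spectrogram(signal))
    total = sum(sum(row) for row in power)
    changing = sum(sum(row) for row in power[1:])  # row 0 = no change over time
    return changing / total


# Two toy signals (assumptions for illustration, not data from the study):
# a steady tone, and the same tone with a slowly pulsing loudness.
n_samples = 512
steady_tone = [math.sin(2 * math.pi * n / 8) for n in range(n_samples)]
pulsed_tone = [(1 + 0.8 * math.sin(2 * math.pi * n / 128)) * s
               for n, s in enumerate(steady_tone)]

steady_share = temporal_share(steady_tone)
pulsed_share = temporal_share(pulsed_tone)
print(f"steady tone: {steady_share:.6f}")
print(f"pulsed tone: {pulsed_share:.6f}")
```

In this toy comparison, the steady tone places essentially none of its modulation power at non-zero temporal rates, while the pulsed tone places a measurable share there. Real recordings would call for an FFT library and far longer signals; the naive transform is used here only to keep the sketch self-contained.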
Both the participants and the machine learning algorithm were able to assign vocal recordings to the correct category. These findings indicate that the participants and the computer used the same spectro-temporal cues to differentiate between the recording types. In other words, speech and song occupy different ends of the spectro-temporal range. More specifically, song was associated with lower temporal and higher spectral modulation rates, as well as higher energy. In contrast, speech was associated with lower spectral and higher temporal modulation rates, and with lower energy. Through the use of spectro-temporal modulation patterns, the research team demonstrated that speech and song can be distinguished across cultures and societies worldwide.
Works Cited:
[1] P. Albouy, et al., Spectro-temporal acoustical markers differentiate speech from song across cultures. bioRxiv, (2023). doi: https://doi.org/10.1101/2023.01.29.526133
[2] Image retrieved from: https://unsplash.com/photos/yriJc_vccQY

