In auditory spatial perception, horizontal sound image localization and the sense of spaciousness rely on interaural level and time differences as cues, and the degree of correlation between the left and right signals is thought to contribute in particular to the sense of horizontal spaciousness [Hidaka1995, Zotter2013]. For the vertical image spread (VIS), spectral cues are necessary, and the change in VIS due to the degree of correlation between vertically arranged signals is frequency dependent [Gribben2018]. This paper investigates, through two experiments, how different correlation values between the top- and middle-layer loudspeaker signals of a 3D audio reproduction system influence listening impressions. The results of the first experiment, using pink noise with different correlation values between the top and middle layers, show that the lower the vertical correlation, the wider the listening range over which the impression does not change relative to the central listening position. The results of the second experiment, using impulse responses measured with microphones set up in an actual concert hall, show a tendency to perceive a sense of spaciousness at off-center listening positions when cardioid microphones spaced apart from the middle layer were used for the top layer. The polar pattern and height of the microphones may have lowered the correlation values in the vertical direction, thus widening the listening range of consistent spatial impression beyond the central listening position (i.e., the "sweet spot").
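As a rough illustration of the kind of vertical interchannel correlation measure discussed above (a generic metric for illustration, not the authors' exact procedure), the sketch below computes the maximum of the normalized cross-correlation between a top-layer and a middle-layer signal over a short lag range; fully correlated noise yields a value near 1, independent noise a value near 0.

```python
# Hedged sketch: interchannel correlation between a top-layer and a middle-layer
# signal as the maximum normalized cross-correlation within +/-1 ms.
import numpy as np
from scipy.signal import correlate

def interchannel_correlation(top, mid, fs, max_lag_ms=1.0):
    max_lag = int(fs * max_lag_ms / 1000.0)
    top = top - np.mean(top)
    mid = mid - np.mean(mid)
    norm = np.sqrt(np.sum(top**2) * np.sum(mid**2))
    cc = correlate(top, mid, mode="full", method="fft")   # lags -(N-1) ... (N-1)
    center = len(mid) - 1                                  # index of zero lag
    return np.max(np.abs(cc[center - max_lag:center + max_lag + 1])) / norm

fs = 48000
rng = np.random.default_rng(0)
mid = rng.standard_normal(fs)
print(interchannel_correlation(mid.copy(), mid, fs))               # ~1.0 (fully correlated)
print(interchannel_correlation(rng.standard_normal(fs), mid, fs))  # ~0.0 (decorrelated)
```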
Toru Kamekawa: After graduating from the Kyushu Institute of Design in 1983, he joined the Japan Broadcasting Corporation (NHK) as a sound engineer. During that period, he gained his experience as a recording engineer, mostly in surround sound programs for HDTV. In 2002, he joined...
This paper proposes a method for plane wave field creation with spherical harmonics for a non-spherical loudspeaker array. In sound field control, there are psycho-acoustic models and physical-acoustic models. The former allow freedom in the location of each loudspeaker, but the reproduced sound differs from the intended auditory event because phantom sources are constructed. The latter are derived from the wave equation under strictly positioned circular or spherical array conditions, or from higher-order Ambisonics (HOA) based on spherical harmonics, which expresses the field only around a single point. We therefore require a method that physically creates the actual waveform while providing flexibility in the shape of the loudspeaker array. In this paper, we focus on the Lamé function, whose order changes the shape of the resulting spatial figure, and propose formulating the distance between the array center and each loudspeaker using this function in a polar expression. Simulation experiments show that, within the inscribed region, the proposed method creates the same plane-wave waveform as a spherical array when a high-order Lamé function, whose shape is close to rectangular, is used.
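For orientation, the sketch below shows one plausible polar-form distance based on the Lamé curve (superellipse); the exact formulation used in the paper may differ. For semi-axes a and b and order n, a loudspeaker at azimuth theta on the curve |x/a|^n + |y/b|^n = 1 lies at the radius computed here, which gives a circle for n = 2 (with a = b) and approaches a rectangle as n grows.

```python
# Hedged sketch (not the paper's code): distance from the array center to a
# loudspeaker at azimuth theta on a Lamé curve (superellipse) in polar form.
import numpy as np

def lame_radius(theta, a=1.0, b=1.0, n=2.0):
    return (np.abs(np.cos(theta) / a) ** n + np.abs(np.sin(theta) / b) ** n) ** (-1.0 / n)

theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)   # 8 loudspeaker azimuths
print(lame_radius(theta, n=2))    # n = 2: circle (all radii equal)
print(lame_radius(theta, n=10))   # n = 10: close to a square, radii grow toward the corners
```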
This paper presents a recursive solution to the Broadband Acoustic Contrast Control with Pressure Matching (BACC-PM) algorithm, designed to optimize sound zone systems efficiently in the time domain. Traditional frequency-domain algorithms, while computationally less demanding, often result in non-causal filters with increased pre-ringing, making time-domain approaches preferable for certain applications. However, time-domain solutions typically suffer from high computational costs due to the inversion of large convolution matrices. To address these challenges, this study introduces a method based on gradient descent and conjugate gradient descent techniques. By exploiting recursive calculations, the proposed approach significantly reduces computational time compared to direct inversion. Theoretical foundations, simulation setups, and performance metrics are detailed, showcasing the efficiency of the algorithm in achieving high acoustic contrast and low reproduction errors with reduced computational effort. Simulations in a controlled environment demonstrate the advantages of the method.
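To illustrate the general idea of replacing a direct matrix inversion with an iterative solver (this is a generic regularized least-squares sketch, not the BACC-PM cost function itself), the example below solves (H^T H + lambda I) w = H^T d for time-domain filter taps with a plain conjugate-gradient loop; H and d are random stand-ins for the convolution (plant) matrix and the target pressure signal.

```python
# Minimal conjugate-gradient sketch for a regularized least-squares filter design.
import numpy as np

def conjugate_gradient(A, b, iters=200, tol=1e-8):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

rng = np.random.default_rng(0)
H = rng.standard_normal((400, 128))      # stand-in convolution (plant) matrix
d = rng.standard_normal(400)             # stand-in target pressure signal
lam = 1e-3
A = H.T @ H + lam * np.eye(H.shape[1])
w = conjugate_gradient(A, H.T @ d)       # time-domain control filter taps
print(np.linalg.norm(H @ w - d))         # reproduction error of the iterative solution
```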
Accurate and efficient simulation of room impulse responses (IRs) is crucial for spatial audio applications. However, existing acoustic ray-tracing tools often operate as black boxes that only output IRs, providing limited access to intermediate data or spatial fidelity. This paper presents GSound-SIR, a novel Python-based toolkit for room acoustics simulation that addresses these limitations. The contributions of this paper are as follows. First, GSound-SIR provides direct access to up to millions of raw ray data points from simulations, enabling in-depth analysis of sound propagation paths that was not possible with previous solutions. Second, we introduce a tool that converts acoustic rays into high-order Ambisonic impulse responses, capturing spatial audio cues with greater fidelity than standard techniques. Third, to enhance efficiency, the toolkit implements an energy-based filtering algorithm and can export only the top-X or top-X% rays. Fourth, we propose storing the simulation results in the Parquet format, facilitating fast data I/O and seamless integration with data analysis workflows. Together, these features make GSound-SIR an advanced, efficient, and modern foundation for room acoustics research, providing researchers and developers with a powerful new tool for spatial audio exploration.
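As a rough illustration of the energy-based filtering and Parquet export ideas (the column names and the filtering criterion below are hypothetical and not taken from the GSound-SIR API), the sketch keeps only the top 5% of rays by energy with pandas and writes them to a Parquet file.

```python
# Hedged sketch: keep the highest-energy rays and store them in Parquet.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rays = pd.DataFrame({
    "delay": rng.uniform(0, 1.5, 100_000),        # seconds (hypothetical columns)
    "azimuth": rng.uniform(-180, 180, 100_000),   # degrees
    "elevation": rng.uniform(-90, 90, 100_000),   # degrees
    "energy": rng.exponential(1e-3, 100_000),
})

def top_percent_by_energy(df, percent=10.0):
    """Keep the `percent` of rays with the highest energy."""
    k = max(1, int(len(df) * percent / 100.0))
    return df.nlargest(k, "energy")

filtered = top_percent_by_energy(rays, percent=5.0)
filtered.to_parquet("rays_top5pct.parquet")   # columnar I/O (requires pyarrow or fastparquet)
```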
This paper proposes a new algorithm for enhancing the spatial resolution of measured first-order Ambisonics room impulse responses (FOA RIRs). It separates the RIR into a salient stream (direct sound and reflections) and a diffuse stream and treats them differently: the salient stream is enhanced using the Ambisonic Spatial Decomposition Method (ASDM) with a single direction of arrival (DOA) per sample of the RIR, while the diffuse stream is enhanced by 4-directional (4D-)ASDM with four simultaneous DOAs. Listening experiments comparing the new Salient/Diffuse (S/D-)ASDM to ASDM, 4D-ASDM, and the original FOA RIR reveal the best results for the new algorithm in both spatial clarity and absence of artifacts, especially for its variant that keeps the DOA constant within each salient event block.
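For context, the sketch below shows a generic per-sample DOA estimate from an FOA RIR via a smoothed pseudo-intensity vector, as used in SDM/ASDM-style upmixing; it illustrates the underlying principle, not the paper's S/D-ASDM implementation, and assumes ACN channel ordering (W, Y, Z, X).

```python
# Hedged sketch: one DOA per RIR sample from a smoothed pseudo-intensity vector.
import numpy as np

def per_sample_doa(foa_rir, win=16):
    """foa_rir: (N, 4) array with ACN-ordered channels W, Y, Z, X."""
    w, y, z, x = foa_rir.T
    intensity = np.stack([w * x, w * y, w * z], axis=1)        # pseudo-intensity (Ix, Iy, Iz)
    kernel = np.hanning(win)
    kernel /= kernel.sum()
    smoothed = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, intensity)
    azimuth = np.arctan2(smoothed[:, 1], smoothed[:, 0])
    elevation = np.arctan2(smoothed[:, 2], np.hypot(smoothed[:, 0], smoothed[:, 1]))
    return azimuth, elevation

foa_rir = np.random.default_rng(0).standard_normal((48000, 4))  # stand-in FOA RIR
az, el = per_sample_doa(foa_rir)
```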
Head-related transfer functions (HRTFs) are used in auditory applications for spatializing virtual sound sources. Listener-specific HRTFs, which aim at mimicking the filtering of the head, torso, and pinnae of a specific listener, improve the perceived quality of virtual sound compared to non-individualized HRTFs. However, listener-specific HRTFs may not be accessible to everyone. Here, we propose as an alternative to take advantage of the ability of human listeners to adapt to a new set of HRTFs. We claim that agreeing upon a single listener-independent set of HRTFs has beneficial effects for long-term adaptation compared to using several, potentially severely different HRTFs. Thus, the Non-individual Ear MOdel (NEMO) initiative is a first step towards a standardized listener-independent set of HRTFs to be used across applications as an alternative to individualization. A prototype, NEMObeta, is presented to explicitly encourage external feedback from the spatial audio community and to agree on a complete list of requirements for the future HRTF selection.
PhD student in spatial audio, Acoustics Research Institute Vienna & Imperial College London
Katharina Pollack studied electrical engineering and audio engineering in Graz, both at the Technical University and the University of Music and Performing Arts in Graz, and is doing her PhD at the Acoustics Research Institute in Vienna in the field of spatial hearing. Her main research...
Multimodal research and applications are becoming more commonplace as Virtual Reality (VR) technology integrates different sensory feedback, enabling the recreation of real spaces in an audio-visual context. Within VR experiences, numerous applications rely on the user's voice as a key element of interaction, including music performance and public speaking. Self-perception of our voice plays a crucial role in vocal production. When singing or speaking, our voice interacts with the acoustic properties of the environment, shaping the adjustment of vocal parameters in response to the perceived characteristics of the space.
This technical report presents a real-time auralization pipeline that leverages three-dimensional Spatial Impulse Responses (SIRs) for multimodal research applications in VR requiring first-person vocal interaction. It describes the impulse response creation and rendering workflow and the audio-visual integration, and addresses latency and computational considerations. The system enables users to explore acoustic spaces from various positions and orientations within a predefined area, supporting three and five Degrees of Freedom (3DoF and 5DoF) in audio-visual multimodal perception for both research and creative applications in VR.
The design of this pipeline arises from the limitations of existing audio tools and spatializers, particularly regarding signal latency, and the lack of SIRs captured from a first-person perspective and in multiple adjacent distributions to enable translational rendering. By addressing these gaps, the system enables real-time auralization of self-generated vocal feedback.
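As a simplified, offline illustration of the position-dependent rendering step described above (the SIR grid and the signals are synthetic stand-ins; a real-time system would use partitioned block convolution to keep latency low), the sketch picks the SIR measured closest to the current listener position and convolves the dry vocal signal with it.

```python
# Hedged, offline sketch of nearest-SIR selection and convolution.
import numpy as np
from scipy.signal import fftconvolve

sir_positions = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5]])   # metres
sirs = [np.random.default_rng(i).standard_normal(4800) * np.exp(-np.linspace(0, 8, 4800))
        for i in range(len(sir_positions))]                                   # toy 0.1 s SIRs @ 48 kHz

def render(dry_voice, listener_xy):
    idx = int(np.argmin(np.linalg.norm(sir_positions - listener_xy, axis=1)))  # nearest SIR
    return fftconvolve(dry_voice, sirs[idx])[: len(dry_voice)]

voice = np.random.default_rng(42).standard_normal(48000)   # stand-in dry vocal signal
wet = render(voice, np.array([0.4, 0.1]))                   # auralized feedback at this position
```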
I'm interested in spatial audio, spatial music, and psychoacoustics. I'm the deputy director of the Music & Media Technologies M.Phil. programme in Trinity College Dublin, and a researcher with the ADAPT centre. At this convention I'm presenting a paper on an Ambisonic Decoder Test...
Immersive Audio Media and Formats (IAMF), also known as Eclipsa Audio, is an open-source audio container developed to accommodate multichannel and scene-based audio formats. Headphone-based delivery of IAMF audio requires efficient binaural rendering. This paper introduces the Open Binaural Renderer (OBR), which is designed to render IAMF audio. It discusses the core rendering algorithm and the binaural filter design process, as well as the real-time implementation of the renderer in the form of an open-source C++ rendering library. Designed for multi-platform compatibility, the renderer incorporates a novel approach to binaural audio processing, leveraging a combination of a spherical harmonic (SH) based virtual listening room model and anechoic binaural filters. Through its design, the IAMF binaural renderer provides a robust solution for delivering high-quality immersive audio across diverse platforms and applications.
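To illustrate the general structure of SH-domain binaural rendering (the filters below are random placeholders; OBR's actual filter design, combining a virtual listening room model with anechoic binaural filters, is described in the paper), each ambisonic channel is filtered with a pre-computed SH-domain binaural filter per ear and the results are summed.

```python
# Hedged sketch of generic SH-domain binaural rendering.
import numpy as np
from scipy.signal import fftconvolve

order = 3
n_sh = (order + 1) ** 2                                      # 16 ambisonic channels
sh_signal = np.random.default_rng(0).standard_normal((n_sh, 48000))          # stand-in scene
sh_filters = np.random.default_rng(1).standard_normal((n_sh, 2, 512))        # placeholder [ch, ear, taps]

binaural = np.zeros((2, sh_signal.shape[1] + sh_filters.shape[2] - 1))
for ch in range(n_sh):
    for ear in range(2):
        binaural[ear] += fftconvolve(sh_signal[ch], sh_filters[ch, ear])      # sum per ear
```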
Professor of Audio Engineering, University of York
Gavin Kearney graduated from Dublin Institute of Technology in 2002 with an Honors degree in Electronic Engineering and has since obtained MSc and PhD degrees in Audio Signal Processing from Trinity College Dublin. He joined the University of York as Lecturer in Sound Design in January...
Jan Skoglund leads a team at Google in San Francisco, CA, developing speech and audio signal processing components for capture, real-time communication, storage, and rendering. These components have been deployed in Google software products such as Meet and hardware products such...
Thursday May 22, 2025 12:00pm - 12:20pm CEST, C2, ATM Studio Warsaw, Poland
A computational framework is proposed for analyzing the temporal evolution of perceptual attributes of sound stimuli. As a paradigm, the perceptual attribute of envelopment, which manifests differently across audio reproduction formats, is employed. For this, listener temporal ratings of envelopment for mono, stereo, and 5.0-channel surround music samples serve as the ground truth for establishing a computational model that can accurately trace temporal changes from such recordings. Combining established and heuristic methodologies, features of the audio signals, termed long-term (LT) features, were extracted at each segment for which envelopment ratings were registered. A memory LT computational stage is proposed to account for the temporal variation of the features over the duration of the signal, based on the exponentially weighted moving average of the respective LT features. These are utilized in a gradient tree boosting machine learning algorithm, leading to a Dynamic Model that accurately predicts the listener's temporal envelopment ratings. Without the proposed memory LT feature function, a Static Model is also derived, which is shown to have lower performance for predicting such temporal envelopment variations.
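A minimal sketch of the memory-LT idea follows (the feature values, smoothing factor, and regressor settings are placeholders, not the paper's configuration): an exponentially weighted moving average (EWMA) of each long-term feature is appended to the feature vector before a gradient tree boosting regressor predicts the per-segment envelopment rating.

```python
# Hedged sketch: EWMA "memory" features feeding a gradient tree boosting regressor.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
lt_features = pd.DataFrame(rng.standard_normal((200, 6)),
                           columns=[f"lt{i}" for i in range(6)])   # 200 segments x 6 LT features
ratings = rng.uniform(0, 100, 200)                                  # stand-in envelopment ratings

memory_lt = lt_features.ewm(alpha=0.2).mean().add_suffix("_ewma")   # memory LT stage
X = pd.concat([lt_features, memory_lt], axis=1)

dynamic_model = GradientBoostingRegressor().fit(X, ratings)
print(dynamic_model.predict(X[:5]))                                 # per-segment predictions
```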
Department of Electrical and Computer Engineering, University of Patras
I am a graduate of the Electrical and Computer Engineering Department of the University of Patras. Since 2020, I have been a PhD candidate in the same department under the supervision of Professor John Mourjopoulos. My research interests include analysis and modeling of perceptual and affective...
John Mourjopoulos is Professor Emeritus at the Department of Electrical and Computer Engineering, University of Patras and a Fellow of the AES. As the head of the Audiogroup for nearly 30 years, he has authored and presented more than 200 journal and conference papers. His research...
Thursday May 22, 2025 3:00pm - 3:20pm CEST, C1, ATM Studio Warsaw, Poland
The honeybee is an insect known to almost all human beings around the world. The sounds produced by bees are a ubiquitous staple of the soundscape of the countryside and forest meadows, bringing an air of natural beauty to the perceived environment. Honeybee-produced sounds are also an important part of apitherapeutic experiences, where close-quarters exposure to honeybees proves beneficial to the mental and physical well-being of humans. This research investigates the generation of synthetic honeybee buzzing sounds using Conditional Generative Adversarial Networks (cGANs). Trained on a comprehensive dataset of real recordings collected both inside and outside the beehive during a long-term audio monitoring session, the models produce diverse and realistic audio samples. Two architectures were developed: an unconditional GAN for generating long, high-fidelity audio, and a conditional GAN that incorporates time-of-day information to generate shorter samples reflecting diurnal honeybee activity patterns. The generated audio exhibits both spectral and temporal properties similar to those of the real recordings, as confirmed by statistical analysis performed during the experiment. This research has implications for scientific research in honeybee colony health monitoring and apitherapy, as well as for artistic endeavours such as sound design and immersive soundscape creation. The trained generator model is publicly available on the project's website.
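A hedged PyTorch sketch of a conditional generator in the spirit described above follows: a latent noise vector is concatenated with a time-of-day embedding and mapped to a waveform. Layer sizes and the conditioning scheme are illustrative only and do not reproduce the authors' trained models.

```python
# Hedged sketch: a toy conditional generator with time-of-day conditioning.
import torch
import torch.nn as nn

class ConditionalBuzzGenerator(nn.Module):
    def __init__(self, latent_dim=128, n_hours=24, out_samples=16384):
        super().__init__()
        self.hour_embed = nn.Embedding(n_hours, 32)           # time-of-day condition
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 32, 512), nn.ReLU(),
            nn.Linear(512, 2048), nn.ReLU(),
            nn.Linear(2048, out_samples), nn.Tanh(),          # waveform in [-1, 1]
        )

    def forward(self, z, hour):
        cond = self.hour_embed(hour)
        return self.net(torch.cat([z, cond], dim=1))

gen = ConditionalBuzzGenerator()
z = torch.randn(4, 128)
hour = torch.tensor([6, 12, 18, 23])                          # batch of times of day
fake_buzz = gen(z, hour)                                      # (4, 16384) audio samples
```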
Existing methods for moving sound source localization and tracking face significant challenges when dealing with an unknown number of sound sources, which substantially limits their practical applications. This paper proposes a moving sound source tracking method based on source signal envelopes that does not require prior knowledge of the number of sources. First, an encoder-decoder attractor (EDA) method is used to estimate the number of sources and obtain an attractor for each source, from which the signal envelope of each source is estimated. This signal envelope is then used as a cue for tracking the target source. The proposed method has been validated through simulation experiments. Experimental results demonstrate that the proposed method can accurately estimate the number of sources and precisely track each source.
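As a rough sketch of the source-counting step in an encoder-decoder attractor scheme (the dimensions, modules, and stopping threshold below are illustrative, not the paper's architecture): an LSTM decoder emits candidate attractors one by one, a linear layer maps each attractor to an existence probability, and decoding stops when that probability falls below a threshold.

```python
# Hedged sketch of EDA-style attractor decoding and source counting.
import torch
import torch.nn as nn

class AttractorDecoder(nn.Module):
    def __init__(self, dim=256, max_sources=8):
        super().__init__()
        self.lstm = nn.LSTMCell(dim, dim)
        self.exist = nn.Linear(dim, 1)
        self.max_sources = max_sources

    def forward(self, frame_embeddings):                 # (T, dim) encoder output
        h = frame_embeddings.mean(dim=0)                 # summary as initial state
        c = torch.zeros_like(h)
        attractors, zero_in = [], torch.zeros_like(h)
        for _ in range(self.max_sources):
            h, c = self.lstm(zero_in.unsqueeze(0), (h.unsqueeze(0), c.unsqueeze(0)))
            h, c = h.squeeze(0), c.squeeze(0)
            p = torch.sigmoid(self.exist(h))
            if p < 0.5:                                  # stop: no further source
                break
            attractors.append(h)
        return torch.stack(attractors) if attractors else torch.empty(0, h.shape[-1])

emb = torch.randn(100, 256)                              # stand-in frame embeddings
attractors = AttractorDecoder()(emb)
print("estimated number of sources:", attractors.shape[0])
```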
Traditional methods for inferring room geometry from sound signals are predominantly based on Room Impulse Responses (RIRs) or prior knowledge of the sound source location, which significantly restricts their applicability. This paper presents a method for estimating room geometry based on the localization of the direct sound source and its early reflections from First-Order Ambisonics (FOA) signals, without prior knowledge of the environment. First, the method simultaneously estimates the Direction of Arrival (DOA) of the direct source and of the detected first-order reflected sources. Then, a cross-attention-based network that implicitly extracts features related to the Time Difference of Arrival (TDOA) between the direct source and the first-order reflected sources is proposed to estimate the distances of the direct and first-order reflected sources. Finally, the room geometry is inferred from the localization results of the direct and first-order reflected sources. The effectiveness of the proposed method was validated through simulation experiments. The experimental results demonstrate that the proposed method achieves accurate localization and performs well in inferring room geometry.
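A small geometric illustration of the final step follows (the positions are synthetic, not the paper's data): in the image-source model, once the direct source and a first-order reflection (image source) have been localized, the reflecting wall is the plane that perpendicularly bisects the segment between them.

```python
# Hedged sketch: wall plane from a localized direct source and its image source.
import numpy as np

def wall_from_image_source(direct_xyz, image_xyz):
    """Return a point on the wall and its unit normal, both relative to the array."""
    direct_xyz, image_xyz = np.asarray(direct_xyz), np.asarray(image_xyz)
    normal = image_xyz - direct_xyz
    normal = normal / np.linalg.norm(normal)
    point_on_wall = 0.5 * (direct_xyz + image_xyz)       # midpoint lies on the wall
    return point_on_wall, normal

# Source 2 m in front of the array, wall 3 m to its right (x axis):
direct = np.array([0.0, 2.0, 0.0])
image = np.array([6.0, 2.0, 0.0])                        # mirror of the source across the wall x = 3
point, normal = wall_from_image_source(direct, image)
print(point, normal)                                     # -> [3. 2. 0.], [1. 0. 0.]
```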
In recent years, there has been increasing interest in binaural technology due to its ability to create immersive spatial audio experiences, particularly in streaming services and virtual reality applications. While audio localization studies typically focus on individual sound sources, ensemble width (EW) is crucial for scene-based analysis, as wider ensembles enhance immersion. We define intended EW as the angular span between the outermost sound sources in an ensemble, controlled during binaural synthesis. This study presents a comparison between human perception of EW and its automatic estimation under simulated anechoic conditions. Fifty-nine participants, including untrained listeners and experts, took part in listening tests, assessing 20 binaural anechoic excerpts synthesized using 2 publicly available music recordings, 2 different HRTFs, and 5 distinct EWs (0° to 90°). The excerpts were played twice in random order via headphones through a web-based survey. Only a subset of ten listeners, of which nine were experts, passed the post-screening tests, with a mean absolute error (MAE) of 74.62° (±38.12°), compared to an MAE of 5.92° (±0.14°) achieved by a pre-trained machine learning method using auditory modeling and gradient-boosted decision trees. This shows that while intended EW can be algorithmically extracted from synthesized recordings, it differs significantly from human perception. Participants reported insufficient externalization and front-back confusion (suggesting HRTF mismatch). The untrained listeners demonstrated response inconsistencies and a low degree of discriminability, which led to the rejection of most untrained listeners during post-screening. The findings may contribute to the development of perceptually aligned EW estimation models.
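The small calculation below (with made-up numbers, not the study's data) illustrates the two quantities involved: the intended EW as the angular span between the outermost source azimuths, and the MAE of perceived or estimated EW against it.

```python
# Hedged illustration of intended ensemble width (EW) and MAE.
import numpy as np

source_azimuths = np.array([-45.0, -10.0, 5.0, 45.0])          # degrees, one ensemble
intended_ew = source_azimuths.max() - source_azimuths.min()    # 90 degrees of angular span

perceived_ew = np.array([20.0, 150.0, 60.0])                   # hypothetical listener responses
mae = np.mean(np.abs(perceived_ew - intended_ew))
print(intended_ew, mae)
```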
This research aims to provide a systematic approach for the analysis of geometrical and material characteristics of traditional frame drums using deep learning. A data-driven approach is used, integrating supervised and unsupervised feature extraction techniques to associate measurable audio features with perceptual attributes. The methodology involves training convolutional neural networks on Mel-scale spectrograms to estimate wood type (classification), diameter (regression), and depth (regression). A multi-labeled dataset containing recorded samples of frame drums of different specifications is used for model training and evaluation. Hierarchical classification is explored, incorporating playing techniques and environmental factors. Handcrafted features enhance interpretability, helping determine the impact of construction attributes on sound perception and ultimately aiding instrument design. Data augmentation techniques, including pitch alterations and additive noise, are introduced to expand the dataset and improve the generalization of the approach.
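A hedged PyTorch sketch of the multi-task idea follows: a small CNN over Mel-scale spectrograms with one classification head (wood type) and two regression heads (diameter and depth). Layer sizes, the number of wood classes, and the input shape are illustrative, not the paper's architecture.

```python
# Hedged sketch: multi-task CNN over Mel spectrograms (classification + regression).
import torch
import torch.nn as nn

class FrameDrumNet(nn.Module):
    def __init__(self, n_wood_types=5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.wood_head = nn.Linear(32, n_wood_types)   # classification: wood type
        self.diameter_head = nn.Linear(32, 1)          # regression: diameter
        self.depth_head = nn.Linear(32, 1)             # regression: depth

    def forward(self, mel):                            # mel: (batch, 1, n_mels, frames)
        z = self.backbone(mel)
        return self.wood_head(z), self.diameter_head(z), self.depth_head(z)

mel = torch.randn(8, 1, 128, 256)                      # stand-in Mel spectrogram batch
wood_logits, diameter, depth = FrameDrumNet()(mel)
```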
Dr. Nikolaos Vryzas was born in Thessaloniki in 1990. He studied Electrical & Computer Engineering at the Aristotle University of Thessaloniki (AUTh). After graduating, he received his master's degrees in Information and Communication Audio Video Technologies for Education & Production...
This paper discusses the process of generating natural language music descriptions, known as captioning, using deep learning and large language models. A novel encoder architecture is trained to learn large-scale music representations and generate high-quality embeddings, which a pre-trained decoder then uses to generate captions. The captions used for training come from the state-of-the-art LP-MusicCaps dataset. A qualitative and subjective assessment of the quality of the generated captions is performed, showing the differences between various decoder models.
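A hedged sketch of the encoder side of such a pipeline follows (all module names and sizes are illustrative, not the paper's architecture): a convolutional audio encoder pools a Mel spectrogram into a sequence of embeddings projected to the hidden size of a pre-trained text decoder, which would then generate the caption from them as a prefix; the decoder call itself is omitted.

```python
# Hedged sketch: audio encoder producing prefix embeddings for a pre-trained decoder.
import torch
import torch.nn as nn

class MusicEncoder(nn.Module):
    def __init__(self, n_mels=128, decoder_dim=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv1d(256, 256, 3, stride=2, padding=1), nn.GELU(),
        )
        self.project = nn.Linear(256, decoder_dim)      # match the decoder hidden size

    def forward(self, mel):                             # mel: (batch, n_mels, frames)
        z = self.conv(mel).transpose(1, 2)              # (batch, frames/4, 256)
        return self.project(z)                          # prefix embeddings for the decoder

prefix = MusicEncoder()(torch.randn(2, 128, 1024))      # (2, 256, 768)
```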