Sound synthesis is a key part of modern music and audio production. Whether you are a producer, composer, or just curious about how electronic sounds are made, this workshop will break it down in a simple and practical way.
We will explore essential synthesis techniques like subtractive, additive, FM, wavetable, and granular synthesis. You will learn how different synthesis methods create and shape sound, and see them in action through live demonstrations using both hardware and virtual synthesizers, including emulators of legendary studio equipment.
This session is designed for everyone — whether you are a total beginner or an experienced audio professional looking for fresh ideas. You will leave with a solid understanding of synthesis fundamentals and the confidence to start creating your own unique sounds. Join us for an interactive, hands-on introduction to the world of sound synthesis!
Everybody knows that music with electronic elements exists. Most of us are aware of the synthesis standing behind it. But the moment I start asking what's under the hood, the majority of the audience starts to run for their lives. Which is rather sad for me, because learning synthesis could be among the greatest journeys you take in your life. And I want to back those words up at my workshop.
Let's talk and see what exactly synthesis is, and what it is not. Let's talk about the building blocks of a basic subtractive setup. We will track all the knobs, buttons and sliders, down to every single cable under the front panel, simply to see which "valve" and "motor" is controlled by which knob, and how it sounds.
I also want to make you feel safe about modular setups, because once you understand the basic blocks, you understand modular synthesis. Just like building from bricks!
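To make the "building blocks" idea concrete, here is a minimal Python sketch of a basic subtractive voice (oscillator into low-pass filter into amplitude envelope); the waveform, cutoff and envelope values are illustrative assumptions, not a patch from the workshop.

```python
# A minimal sketch of a subtractive signal chain:
# oscillator -> low-pass filter -> amplifier envelope.
# All parameter values are illustrative assumptions.
import numpy as np
from scipy.signal import butter, lfilter, sawtooth

SR = 44100                                   # sample rate in Hz
t = np.linspace(0, 1.0, SR, endpoint=False)

# Oscillator: a bright sawtooth wave, rich in harmonics.
osc = sawtooth(2 * np.pi * 110 * t)

# Filter: the low-pass "valve" that removes harmonics above the cutoff knob.
b, a = butter(2, 1200 / (SR / 2), btype="low")
filtered = lfilter(b, a, osc)

# Amplifier envelope: a simple attack/decay shape controlling loudness.
attack, decay = int(0.01 * SR), int(0.99 * SR)
env = np.concatenate([np.linspace(0, 1, attack), np.linspace(1, 0, decay)])
voice = filtered * env[: len(filtered)]      # the finished one-second note
```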
A computational framework is proposed for analyzing the temporal evolution of perceptual attributes of sound stimuli. As a paradigm, the perceptual attribute of envelopment, which is manifested in different audio reproduction formats, is employed. For this, listener temporal ratings of envelopment for mono, stereo, and 5.0-channel surround music samples serve as the ground truth for establishing a computational model that can accurately trace temporal changes from such recordings. Combining established and heuristic methodologies, different features of the audio signals were extracted at each segment where envelopment ratings were registered, termed long-term (LT) features. A memory LT computational stage is proposed to account for the temporal variations of the features over the duration of the signal, based on the exponentially weighted moving average of the respective LT features. These are utilized in a gradient tree boosting machine learning algorithm, leading to a Dynamic Model that accurately predicts the listener's temporal envelopment ratings. Without the proposed memory LT feature function, a Static Model is also derived, which is shown to have lower performance for predicting such temporal envelopment variations.
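As a rough illustration of the two stages described above, the following hedged Python sketch applies an exponentially weighted moving average to per-segment LT features and feeds the result to a gradient tree boosting regressor; the feature dimensionality, smoothing factor and synthetic data are assumptions, not the paper's actual configuration.

```python
# Hedged sketch: EWMA "memory" stage over long-term (LT) features,
# followed by gradient tree boosting. Data below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
lt_features = rng.normal(size=(500, 8))      # per-segment LT features (assumed 8-dim)
ratings = rng.uniform(0, 100, size=500)      # temporal envelopment ratings

def ewma(x, alpha=0.3):
    """Memory stage: exponentially weighted moving average along time."""
    out = np.empty_like(x)
    out[0] = x[0]
    for n in range(1, len(x)):
        out[n] = alpha * x[n] + (1 - alpha) * out[n - 1]
    return out

memory_lt = ewma(lt_features)                                    # Dynamic Model inputs
dynamic = GradientBoostingRegressor().fit(memory_lt, ratings)    # with memory
static = GradientBoostingRegressor().fit(lt_features, ratings)   # Static Model, no memory
```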
Department of Electrical and Computer Engineering, University of Patras
I am a graduate of the Electrical and Computer Engineering Department of the University of Patras. Since 2020, I have been a PhD candidate in the same department under the supervision of Professor John Mourjopoulos. My research interests include the analysis and modeling of perceptual and affective...
John Mourjopoulos is Professor Emeritus at the Department of Electrical and Computer Engineering, University of Patras and a Fellow of the AES. As the head of the Audiogroup for nearly 30 years, he has authored and presented more than 200 journal and conference papers. His research...
Thursday, May 22, 2025, 3:00pm - 3:20pm CEST, C1, ATM Studio, Warsaw, Poland
The honeybee is an insect known to almost all human beings around the world. The sounds produced by bees are a ubiquitous staple of the soundscape of the countryside and forest meadows, bringing an air of natural beauty to the perceived environment. Honeybee-produced sounds are also an important part of apitherapeutic experiences, where close-quarters exposure to honeybees proves beneficial to the mental and physical well-being of humans. This research investigates the generation of synthetic honeybee buzzing sounds using Conditional Generative Adversarial Networks (cGANs). Trained on a comprehensive dataset of real recordings collected both inside and outside the beehive during a long-term audio monitoring session, the models produce diverse and realistic audio samples. Two architectures were developed: an unconditional GAN for generating long, high-fidelity audio, and a conditional GAN that incorporates time-of-day information to generate shorter samples reflecting diurnal honeybee activity patterns. The generated audio exhibits both spectral and temporal properties similar to real recordings, as confirmed by statistical analysis performed during the experiment. This research has implications for scientific research in honeybee colony health monitoring and apitherapy, as well as for artistic endeavours such as sound design and immersive soundscape creation. The trained generator model is publicly available on the project's website.
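The conditional architecture can be pictured with a short, hedged PyTorch sketch: a generator that maps a noise vector plus a time-of-day label to a waveform segment. Layer sizes, the hour embedding and the output length are illustrative assumptions rather than the authors' model.

```python
# Hedged sketch of a conditional generator in the spirit of the cGAN above:
# noise + time-of-day condition -> short audio waveform segment.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=128, n_hours=24, out_samples=16384):
        super().__init__()
        self.hour_embed = nn.Embedding(n_hours, 16)   # time-of-day condition
        self.net = nn.Sequential(
            nn.Linear(noise_dim + 16, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, out_samples), nn.Tanh(),  # waveform in [-1, 1]
        )

    def forward(self, z, hour):
        cond = self.hour_embed(hour)                  # (batch, 16)
        return self.net(torch.cat([z, cond], dim=1))

g = ConditionalGenerator()
fake = g(torch.randn(4, 128), torch.randint(0, 24, (4,)))  # 4 synthetic buzz clips
```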
Existing methods for moving sound source localization and tracking face significant challenges when dealing with an unknown number of sound sources, which substantially limits their practical applications. This paper proposes a moving sound source tracking method based on source signal envelopes that does not require prior knowledge of the number of sources. First, an encoder-decoder attractor (EDA) method is used to estimate the number of sources and obtain an attractor for each source, based on which the signal envelope of each source is estimated. This signal envelope is then used as a clue for tracking the target source. The proposed method has been validated through simulation experiments. Experimental results demonstrate that the proposed method can accurately estimate the number of sources and precisely track each source.
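The "envelope as a clue" idea can be sketched under strong simplifying assumptions: given per-frame DOA candidates and their short-term powers, the target source is tracked by selecting, in each frame, the candidate whose power best matches that source's estimated envelope. The EDA stage that counts sources and produces attractors and envelopes is not reproduced here; all arrays are synthetic placeholders.

```python
# Hedged sketch: envelope-guided association of DOA candidates to one source.
import numpy as np

rng = np.random.default_rng(1)
frames = 200
cand_doa = rng.uniform(0, 360, size=(frames, 3))     # 3 candidate DOAs per frame
cand_power = rng.uniform(0, 1, size=(frames, 3))     # short-term power per candidate
target_env = rng.uniform(0, 1, size=frames)          # envelope from the EDA stage (assumed given)

# Per frame, keep the candidate whose power is closest to the target envelope.
best = np.argmin(np.abs(cand_power - target_env[:, None]), axis=1)
track = cand_doa[np.arange(frames), best]            # DOA trajectory of the target source
```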
Traditional methods for inferring room geometry from sound signals are predominantly based on Room Impulse Responses (RIRs) or prior knowledge of the sound source location, which significantly restricts the applicability of these approaches. This paper presents a method for estimating room geometry based on the localization of the direct sound source and its early reflections from First-Order Ambisonics (FOA) signals, without prior knowledge of the environment. First, the method simultaneously estimates the Direction of Arrival (DOA) of the direct source and of the detected first-order reflected sources. Then, a cross-attention-based network is proposed that implicitly extracts features related to the Time Difference of Arrival (TDOA) between the direct source and the first-order reflected sources, in order to estimate the distances of the direct and the first-order reflected sources. Finally, the room geometry is inferred from the localization results of the direct and the first-order reflected sources. The effectiveness of the proposed method was validated through simulation experiments. The experimental results demonstrate that the proposed method achieves accurate localization and performs well in inferring the room geometry.
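The final geometric step admits a compact illustration: once the direct source and a first-order image source have been localized, the reflecting wall is the plane that perpendicularly bisects the segment between them. The sketch below assumes hard-coded example positions in place of the network's estimates.

```python
# Minimal sketch of inferring a wall plane from a direct source and its
# first-order image source (example positions, not network outputs).
import numpy as np

direct = np.array([1.0, 2.0, 1.5])       # estimated direct-source position (m)
image = np.array([1.0, -4.0, 1.5])       # estimated first-order image source (m)

midpoint = (direct + image) / 2          # a point lying on the wall
normal = direct - image
normal /= np.linalg.norm(normal)         # unit normal of the wall plane

# Plane equation: normal . x = normal . midpoint
offset = normal @ midpoint
print(f"wall plane: {normal} . x = {offset:.2f}")   # here: the wall at y = -1 m
```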
In recent years, there has been an increasing interest in binaural technology due to its ability to create immersive spatial audio experiences, particularly in streaming services and virtual reality applications. While audio localization studies typically focus on individual sound sources, ensemble width (EW) is crucial for scene-based analysis, as wider ensembles enhance immersion. We define intended EW as the angular span between the outermost sound sources in an ensemble, controlled during binaural synthesis. This study presents a comparison between human perception of EW and its automatic estimation under simulated anechoic conditions. Fifty-nine participants, including untrained listeners and experts, took part in listening tests, assessing 20 binaural anechoic excerpts synthesized using 2 publicly available music recordings, 2 different HRTFs, and 5 distinct EWs (0° to 90°). The excerpts were played twice in random order via headphones through a web-based survey. Only a subset of ten listeners, of which nine were experts, passed post-screening tests, with a mean absolute error (MAE) of 74.62° (±38.12°), compared to an MAE of 5.92° (±0.14°) achieved by a pre-trained machine learning method using auditory modeling and gradient-boosted decision trees. This shows that while intended EW can be algorithmically extracted from synthesized recordings, it significantly differs from human perception. Participants reported insufficient externalization and front-back confusion (suggesting HRTF mismatch). The untrained listeners demonstrated response inconsistencies and a low degree of discriminability, which led to the rejection of most untrained listeners during post-screening. The findings may contribute to the development of perceptually aligned EW estimation models.
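For clarity, a small numeric sketch of the two quantities being compared: the intended EW (the angular span between the outermost sources) and the MAE of a set of width judgments against it. The listener and model values below are made-up placeholders, not the study's data.

```python
# Sketch: intended ensemble width (EW) and mean absolute error (MAE)
# of width judgments against it. All values are illustrative placeholders.
import numpy as np

source_azimuths = np.array([-45.0, -15.0, 15.0, 45.0])        # degrees, set at synthesis
intended_ew = source_azimuths.max() - source_azimuths.min()   # 90 degrees

perceived_ew = np.array([20.0, 35.0, 10.0, 60.0])    # e.g. listener responses (placeholder)
estimated_ew = np.array([88.0, 92.0, 85.0, 95.0])    # e.g. model predictions (placeholder)

mae_listeners = np.mean(np.abs(perceived_ew - intended_ew))
mae_model = np.mean(np.abs(estimated_ew - intended_ew))
print(mae_listeners, mae_model)
```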
This research aims to provide a systematic approach for the analysis of geometrical and material characteristics of traditional frame drums using deep learning. A data-driven approach is used, integrating supervised and unsupervised feature extraction techniques to associate measurable audio features with perceptual attributes. The methodology involves the training of convolutional neural networks on Mel-scale spectrograms to estimate wood type (classification), diameter (regression), and depth (regression). A multi-labeled dataset containing recorded samples of frame drums of different specifications is used for model training and evaluation. Hierarchical classification is explored, incorporating playing techniques and environmental factors. Handcrafted features enhance interpretability, helping determine the impact of construction attributes on sound perception, ultimately aiding instrument design. Data augmentation techniques, including pitch alterations and additive noise, are introduced to expand the dataset and improve the generalization of the approach.
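A hedged PyTorch sketch of the kind of multi-task network described, with one classification head for wood type and regression heads for diameter and depth over a Mel-scale spectrogram input; the layer sizes, class count and input shape are assumptions, not the authors' architecture.

```python
# Hedged sketch: small CNN over Mel spectrograms with classification and
# regression heads for frame-drum attributes.
import torch
import torch.nn as nn

class FrameDrumNet(nn.Module):
    def __init__(self, n_wood_types=5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.wood_head = nn.Linear(32, n_wood_types)   # wood type (classification)
        self.diameter_head = nn.Linear(32, 1)          # diameter (regression)
        self.depth_head = nn.Linear(32, 1)             # depth (regression)

    def forward(self, mel):                            # mel: (batch, 1, n_mels, frames)
        h = self.backbone(mel)
        return self.wood_head(h), self.diameter_head(h), self.depth_head(h)

net = FrameDrumNet()
wood_logits, diameter, depth = net(torch.randn(2, 1, 128, 256))
```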
Dr. Nikolaos Vryzas was born in Thessaloniki in 1990. He studied Electrical & Computer Engineering at the Aristotle University of Thessaloniki (AUTh). After graduating, he received his master's degrees in Information and Communication Audio Video Technologies for Education & Production...
This paper discusses the process of generating natural language music descriptions, called captioning, using deep learning and large language models. A novel encoder architecture is trained to learn large-scale music representations and generate high-quality embeddings, which a pre-trained decoder then uses to generate captions. The captions used for training come from the state-of-the-art LP-MusicCaps dataset. A qualitative and subjective assessment of the quality of the generated captions is performed, showing the differences between various decoder models.
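The encoder/decoder split can be illustrated with a hedged, toy-sized PyTorch sketch: an audio encoder turns a Mel spectrogram into a sequence of embeddings, which a transformer decoder attends to while predicting the next caption token. In the paper the decoder is a pre-trained language model and the training captions come from LP-MusicCaps; everything below is illustrative only.

```python
# Hedged sketch: music embeddings from an audio encoder conditioning a
# (toy, untrained) transformer decoder that predicts caption tokens.
import torch
import torch.nn as nn

d_model, vocab = 256, 1000

audio_encoder = nn.Sequential(          # (batch, frames, n_mels) -> (batch, frames, d_model)
    nn.Linear(128, d_model), nn.ReLU(), nn.Linear(d_model, d_model),
)
token_embed = nn.Embedding(vocab, d_model)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
lm_head = nn.Linear(d_model, vocab)

mel = torch.randn(1, 200, 128)                 # one spectrogram, 200 frames
caption_so_far = torch.tensor([[1, 42, 7]])    # previously generated token ids

memory = audio_encoder(mel)                    # music embeddings from the encoder
hidden = decoder(token_embed(caption_so_far), memory)
next_token_logits = lm_head(hidden[:, -1])     # distribution over the next caption word
```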