Below is a selected list of publications, alongside supplementary material.
A comprehensive list is available on my Google Scholar profile.
Instrumental playing techniques such as vibratos, glissandos, and trills often denote musical expressivity, in both classical and folk contexts. However, most existing approaches to music similarity retrieval fail to describe timbre beyond the so-called “ordinary” technique, use instrument identity as a proxy for timbre quality, and do not allow for customization to the perceptual idiosyncrasies of a new subject. In this article, we ask 31 human participants to organize 78 isolated notes into a set of timbre clusters. Analyzing their responses suggests that timbre perception operates within a more flexible taxonomy than those provided by instruments or playing techniques alone. In addition, we propose a machine listening model to recover the cluster graph of auditory similarities across instruments, mutes, and techniques. Our model relies on joint time–frequency scattering to extract spectrotemporal modulations as acoustic features. Furthermore, it minimizes triplet loss in the cluster graph by means of the large-margin nearest neighbor (LMNN) metric learning algorithm. Over a dataset of 9346 isolated notes, we report a state-of-the-art average precision at rank five (AP@5) of 99.0% ± 1%. An ablation study demonstrates that removing either the joint time–frequency scattering transform or the metric learning algorithm noticeably degrades performance.
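As a rough illustration of the evaluation metric, here is a minimal sketch of average precision at rank five, assuming that each query's five nearest neighbors in the learned metric space have already been retrieved and labeled by cluster membership. Function names and the toy labels are hypothetical.

```python
# Minimal sketch of average precision at rank five (AP@5). Assumes each
# query comes with the ranked labels of its five nearest neighbors in the
# learned metric space (names and labels below are hypothetical).

def precision_at_k(query_label, neighbor_labels, k=5):
    """Fraction of the top-k retrieved items sharing the query's cluster."""
    top_k = neighbor_labels[:k]
    return sum(1 for lbl in top_k if lbl == query_label) / k

def average_precision_at_5(queries):
    """Mean P@5 over a list of (query_label, ranked_neighbor_labels) pairs."""
    return sum(precision_at_k(q, n, 5) for q, n in queries) / len(queries)

# Toy example: one perfect and one partial retrieval.
queries = [
    ("trill", ["trill"] * 5),
    ("vibrato", ["vibrato", "vibrato", "trill", "vibrato", "tremolo"]),
]
print(average_precision_at_5(queries))  # 0.8
```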
This article presents Synopsis Seriation (2021), a musical work generated with the aid of the computer. The central idea is to reorganize track fragments of a pre-existing multichannel work so as to produce a stereo stream. We call “seriation” the search for the greatest timbre similarity between successive fragments within each channel, as well as between the left and right channels. Since the number of permutations of a set is the factorial of its cardinality, the space of possible sequences is far too vast to be explored directly by a human. Instead, we formalize seriation as an NP-complete optimization problem of the “traveling salesman” type and present an evolutionary algorithm that yields an approximate solution. In this framework, we define the timbre dissimilarity between two fragments using tools from wavelet analysis (time–frequency scattering) and from information geometry (Jensen–Shannon divergence).
For this work, we ran the seriation algorithm on a corpus of four works by Florian Hecker, including Formulation (2015). The record label Editions Mego, Vienna, released Synopsis Seriation in CD format, together with a booklet of infographics on time–frequency scattering designed in partnership with the design studio NORM, Zurich.
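As a toy illustration of the seriation step, the sketch below orders fragments so that successive timbre dissimilarities are minimized. The article uses an evolutionary algorithm over time–frequency scattering distances; here a greedy nearest-neighbor tour refined by 2-opt stands in for it, and the 4-fragment dissimilarity matrix is made up for illustration.

```python
# Seriation as an open traveling-salesman tour over a symmetric
# dissimilarity matrix D (toy values; the real system uses evolutionary
# search over scattering-based distances).

def tour_cost(order, D):
    return sum(D[order[i]][order[i + 1]] for i in range(len(order) - 1))

def greedy_tour(D):
    n = len(D)
    order, visited = [0], {0}
    while len(order) < n:
        last = order[-1]
        nxt = min((j for j in range(n) if j not in visited),
                  key=lambda j: D[last][j])
        order.append(nxt)
        visited.add(nxt)
    return order

def two_opt(order, D):
    improved = True
    while improved:
        improved = False
        for i in range(1, len(order) - 1):
            for j in range(i + 1, len(order)):
                cand = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
                if tour_cost(cand, D) < tour_cost(order, D):
                    order, improved = cand, True
    return order

# Hypothetical dissimilarities between four fragments.
D = [[0, 9, 1, 8],
     [9, 0, 7, 2],
     [1, 7, 0, 6],
     [8, 2, 6, 0]]
order = two_opt(greedy_tour(D), D)
print(order, tour_cost(order, D))  # [0, 2, 3, 1] 9
```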
The recent surge of machine learning models for wireless sensor networks brings new opportunities for environmental acoustics. Yet, these models are prone to statistical deviations, e.g., due to unforeseen changes in recording hardware or atmospheric conditions. In a supervised learning context, mitigating such deviations is all the more difficult as the area of coverage is vast. I propose to mitigate this problem by applying a form of adaptive gain control in the time-frequency domain, known as Per-Channel Energy Normalization (PCEN). While PCEN has recently been introduced for keyword spotting in the smart home, I show that it is also beneficial for outdoor sensing applications. Specifically, I discuss the deployment of PCEN for terrestrial bioacoustics, marine bioacoustics, and urban acoustics. Finally, I formulate three unsolved problems regarding PCEN, approached from the different perspectives of signal processing, real-time systems, and deep learning.
Machine listening systems for environmental acoustic monitoring face a shortage of expert annotations to be used as training data. To circumvent this issue, the emerging paradigm of self-supervised learning proposes to pre-train audio classifiers on a task whose ground truth is trivially available. Alternatively, training set synthesis consists in annotating a small corpus of acoustic events of interest, which are then automatically mixed at random to form a larger corpus of polyphonic scenes. Prior studies have considered these two paradigms in isolation, but rarely in conjunction. Furthermore, the impact of data curation in training set synthesis remains unclear. To fill this gap in research, this article proposes a two-stage approach. In the self-supervised stage, we formulate a pretext task (Audio2Vec skip-gram inpainting) on unlabeled spectrograms from an acoustic sensor network. Then, in the supervised stage, we formulate a downstream task of multilabel urban sound classification on synthetic scenes. We find that training set synthesis contributes more to overall performance than self-supervised learning does. Interestingly, the geographical origin of the acoustic events in training set synthesis appears to have a decisive impact.
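The training set synthesis paradigm can be illustrated with a toy mixer: short labeled event clips are added at random offsets and gains into a background recording, yielding a polyphonic scene together with its multilabel target. All signals, labels, and function names below are made up for illustration; the actual pipeline operates on real audio.

```python
import random

# Toy sketch of training set synthesis: mix labeled event clips into a
# background at random offsets and gains, returning the polyphonic scene
# and its set of labels. Signals are plain lists of floats (hypothetical).

def synthesize_scene(background, event_bank, n_events, rng):
    scene = list(background)
    labels = set()
    for _ in range(n_events):
        label, clip = rng.choice(event_bank)
        gain = rng.uniform(0.5, 1.0)
        offset = rng.randrange(len(scene) - len(clip) + 1)
        for i, sample in enumerate(clip):
            scene[offset + i] += gain * sample
        labels.add(label)
    return scene, labels

rng = random.Random(0)
background = [0.0] * 100
event_bank = [("siren", [1.0] * 10), ("dog_bark", [-1.0] * 5)]
scene, labels = synthesize_scene(background, event_bank, n_events=3, rng=rng)
print(len(scene), sorted(labels))
```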
Disentangling and recovering physical attributes, such as shape and material, from a few waveform examples is a challenging inverse problem in audio signal processing, with numerous applications in musical acoustics as well as structural engineering. We propose to address this problem via a combination of time-frequency analysis and supervised machine learning. We start by synthesizing a dataset of sounds using the functional transformation method. Then, we represent each percussive sound in terms of its time-invariant scattering transform coefficients and formulate the parametric estimation of the resonator as multidimensional regression with a deep convolutional neural network. We interpolate scattering coefficients over the surface of the drum as a surrogate for potentially missing data and study the response of the neural network to interpolated samples. Lastly, we resynthesize drum sounds from scattering coefficients, therefore paving the way towards a deep generative model of drum sounds whose latent variables are physically interpretable.
We present SONYC-UST-V2, a dataset for urban sound tagging with spatiotemporal information. This dataset is intended for the development and evaluation of machine listening systems for real-world urban noise monitoring. While datasets of urban recordings are available, this dataset provides the opportunity to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags. SONYC-UST-V2 consists of 18510 audio recordings from the “Sounds of New York City” (SONYC) acoustic sensor network, including the timestamp of audio acquisition and location of the sensor. The dataset contains annotations by volunteers from the Zooniverse citizen science platform, as well as a two-stage verification by our team. In this article, we describe our data collection procedure and propose evaluation metrics for multilabel classification of urban sound tags. We report the results of a simple baseline model that exploits spatiotemporal information.
Playing techniques are important expressive elements in music signals. In this paper, we propose a recognition system based on the joint time–frequency scattering transform (jTFST) for pitch evolution-based playing techniques (PETs), a group of playing techniques with monotonic pitch changes over time. The jTFST represents spectro-temporal patterns in the time–frequency domain, capturing discriminative information of PETs. As a case study, we analyse three commonly used PETs of the Chinese bamboo flute: acciaccatura, portamento, and glissando, and encode their characteristics using the jTFST. To verify the proposed approach, we create a new dataset, the CBF-petsDB, containing PETs played in isolation as well as in the context of whole pieces performed and annotated by professional players. Feeding the jTFST to a machine learning classifier, we obtain F-measures of 71% for acciaccatura, 59% for portamento, and 83% for glissando detection, and provide explanatory visualisations of scattering coefficients for each technique.
This paper introduces OrchideaSOL, a free dataset of samples of extended instrumental playing techniques, designed to be used as the default dataset for the Orchidea framework for target-based computer-aided orchestration.
OrchideaSOL is a reduced and modified subset of Studio On Line, or SOL for short, a dataset developed at Ircam between 1996 and 1998. We motivate the reasons behind OrchideaSOL and describe the differences between the original SOL and our dataset. We also describe the work done to improve the dynamic ranges of orchestral families and other aspects of the data.
With the aim of constructing a biologically plausible model of machine listening, we study the representation of a multicomponent stationary signal by a wavelet scattering network. First, we show that renormalizing second-order nodes by their first-order parents gives a simple numerical criterion to assess whether two neighboring components will interfere psychoacoustically. Second, we run a manifold learning algorithm (Isomap) on scattering coefficients to visualize the similarity space underlying parametric additive synthesis. Third, we generalize the “one or two components” framework to three sine waves or more and prove that the effective scattering depth of a Fourier series grows in logarithmic proportion to its bandwidth.
To explain the consonance of octaves, music psychologists represent pitch as a helix where azimuth and axial coordinate correspond to pitch class and pitch height respectively. This article addresses the problem of discovering this helical structure from unlabeled audio data. We measure Pearson correlations in the constant-Q transform (CQT) domain to build a K-nearest neighbor graph between frequency subbands. Then, we run the Isomap manifold learning algorithm to represent this graph in a three-dimensional space in which straight lines approximate graph geodesics. Experiments on isolated musical notes demonstrate that the resulting manifold resembles a helix which makes a full turn at every octave. A circular shape is also found in English speech, but not in urban noise. We discuss the impact of various design choices on the visualization: instrumentarium, loudness mapping function, and number of neighbors K.
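The pipeline above can be sketched end to end in a few lines of numpy: Pearson correlations between subbands, a K-nearest-neighbor graph, geodesic distances, and classical multidimensional scaling, which together constitute Isomap. The synthetic "subbands" below are Gaussian bumps standing in for CQT magnitudes, and all constants are arbitrary; this is a conceptual sketch, not the experimental setup of the article.

```python
import numpy as np

# Sketch of the unsupervised pipeline: Pearson correlations between
# frequency subbands -> K-nearest-neighbor graph -> graph geodesics
# (Floyd-Warshall) -> classical MDS into 3-D, i.e., the core of Isomap.

def isomap_3d(X, k=3):
    corr = np.corrcoef(X)          # Pearson correlations between subbands
    dist = 1.0 - corr              # turn correlation into a dissimilarity
    n = len(dist)
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(n):             # keep k nearest neighbors (symmetrized)
        nearest = np.argsort(dist[i])[1:k + 1]
        graph[i, nearest] = dist[i, nearest]
        graph[nearest, i] = dist[i, nearest]
    for m in range(n):             # geodesic distances by Floyd-Warshall
        graph = np.minimum(graph, graph[:, m:m + 1] + graph[m:m + 1, :])
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (graph ** 2) @ J  # classical MDS on squared geodesics
    w, v = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:3]
    return v[:, top] * np.sqrt(np.maximum(w[top], 0.0))

# Synthetic subbands: overlapping Gaussian bumps, so that neighboring
# "frequencies" correlate and the k-NN graph forms a connected chain.
t = np.linspace(0.0, 12.0, 200)
X = np.stack([np.exp(-0.5 * (t - q - 0.5) ** 2) for q in range(12)])
emb = isomap_3d(X, k=3)
print(emb.shape)  # (12, 3)
```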
This article proposes a machine learning method to discover a nonlinear transformation which maps a collection of source vectors onto a collection of target vectors. The key idea is to learn the Lie algebra associated to the underlying one-parameter subgroup of the general linear group. This method has the advantage of not requiring any human intervention other than collecting data samples by pairs, i.e., before and after the action of the group.
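The idea can be made concrete in the simplest case of a planar rotation: given only source/target pairs, fit the linear map sending sources to targets by least squares, then read off the Lie algebra element in closed form. The article learns this in a more general setting; the toy data, the closed-form angle extraction, and all names below are illustrative assumptions.

```python
import numpy as np

# Toy sketch: recover the generator of a one-parameter subgroup (here a
# 2-D rotation) from data pairs (x, y) with y = exp(A) x, without knowing
# the group parameter in advance.

rng = np.random.default_rng(0)
theta = 0.3  # unknown group parameter, to be recovered from data
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

X = rng.standard_normal((2, 100))  # source vectors, one per column
Y = R @ X                          # targets: the same vectors after the action

M = Y @ np.linalg.pinv(X)          # least-squares estimate of the linear action
theta_hat = np.arctan2(M[1, 0], M[0, 0])          # matrix logarithm, closed form
A_hat = theta_hat * np.array([[0.0, -1.0],        # recovered Lie algebra element
                              [1.0,  0.0]])

print(round(theta_hat, 6))
```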
The wavelet scattering transform is an invariant and stable signal representation suitable for many signal processing and machine learning applications. We present the Kymatio software package, an easy-to-use, high-performance Python implementation of the scattering transform in 1D, 2D, and 3D that is compatible with modern deep learning frameworks, including PyTorch and TensorFlow/Keras. The transforms are implemented on both CPUs and GPUs, the latter offering a significant speedup over the former. The package also has a small memory footprint. Source code, documentation, and examples are available under a BSD license at https://www.kymat.io.
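For readers who want a from-scratch view of what such a package computes, here is a numpy-only sketch of the first-order cascade: wavelet convolution, complex modulus, lowpass averaging, and subsampling. The Gaussian filter shapes and constants below are simplified stand-ins, not Kymatio's actual Morlet filter bank; see the documentation at kymat.io for the real API.

```python
import numpy as np

# Numpy-only sketch of a first-order scattering cascade. Band-pass filters
# are geometrically spaced Gaussians in the Fourier domain (Q per octave,
# down to 2**-J of the sampling rate); outputs are lowpass-averaged moduli,
# subsampled by 2**J. Constants are illustrative, not Kymatio's.

def scattering1d_order1(x, J=4, Q=2):
    T = len(x)
    freqs = np.fft.fftfreq(T)
    x_hat = np.fft.fft(x)
    phi_hat = np.exp(-0.5 * (freqs / (0.25 * 2.0 ** -J)) ** 2)  # lowpass
    centers = [2.0 ** (-j / Q) * 0.25 for j in range(J * Q)]
    coeffs = []
    for xi in centers:
        sigma = xi / (2.0 * Q)
        psi_hat = np.exp(-0.5 * ((freqs - xi) / sigma) ** 2)    # band-pass
        u = np.abs(np.fft.ifft(x_hat * psi_hat))                # wavelet modulus
        s = np.real(np.fft.ifft(np.fft.fft(u) * phi_hat))       # local average
        coeffs.append(s[::2 ** J])                              # subsample
    return np.stack(coeffs)

x = np.cos(2 * np.pi * 0.1 * np.arange(1024))
S = scattering1d_order1(x)
print(S.shape)  # (8, 64)
```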
Octave equivalence serves as domain-knowledge in MIR systems, including chromagram, spiral convolutional networks, and harmonic CQT. Prior work has applied the Isomap manifold learning algorithm to unlabeled audio data to embed frequency sub-bands in 3-D space where the Euclidean distances are inversely proportional to the strength of their Pearson correlations. However, discovering octave equivalence via Isomap requires visual inspection and is not scalable. To address this problem, we define “helicality” as the goodness of fit of the 3-D Isomap embedding to a Shepard-Risset helix. Our method is unsupervised and uses a custom Frank-Wolfe algorithm to minimize a least-squares objective inside a convex hull. Numerical experiments indicate that isolated musical notes have a higher helicality than speech, followed by drum hits.
Class imbalance in the training data hinders the generalization ability of machine listening systems. In the context of bioacoustics, this issue may be circumvented by aggregating species labels into super-groups of higher taxonomic rank: genus, family, order, and so forth. However, different applications of machine listening to wildlife monitoring may require different levels of granularity. This paper introduces TaxoNet, a deep neural network for structured classification of signals from living organisms. TaxoNet is trained as a multitask and multilabel model, following a new architectural principle in end-to-end learning named “hierarchical composition”: shallow layers extract a shared representation to predict a root taxon, while deeper layers specialize recursively to lower-rank taxa. In this way, TaxoNet is capable of handling taxonomic uncertainty, out-of-vocabulary labels, and open-set deployment settings. An experimental benchmark on two new bioacoustic datasets (ANAFCC and BirdVox-14SD) leads to state-of-the-art results in bird species classification. Furthermore, on a task of coarse-grained classification, TaxoNet also outperforms a flat single-task model trained on aggregate labels.
Electrocardiogram (ECG) analysis is the standard of care for the diagnosis of irregular heartbeat patterns, known as arrhythmias. This paper presents a deep learning system for the automatic detection and multilabel classification of arrhythmias in ECG recordings. Our system composes three differentiable operators: a scattering transform (ST), a depthwise separable convolutional network (DSC), and a bidirectional long short-term memory network (BiLSTM). The originality of our approach is that all three operators are implemented in Python. This is in contrast to previous publications, which pre-computed ST coefficients in MATLAB. The implementation of ST in Python was made possible by a new software library for the scattering transform named Kymatio. This paper presents the first successful application of Kymatio to the analysis of biomedical signals. As part of the PhysioNet/Computing in Cardiology Challenge 2020, we trained our hybrid Scattering–LSTM model to classify 27 cardiac arrhythmias from two databases of 12-lead ECGs: CPSC2018 and PTB-XL, comprising 32k recordings in total. Our team “BitScattered” achieved a Challenge metric of 0.536±0.012 over ten folds of cross-validation.
This article explains how to apply time–frequency scattering, a convolutional operator extracting modulations in the time–frequency domain at different rates and scales, to the re-synthesis and manipulation of audio textures.
Bioacoustic sensors, sometimes known as autonomous recording units (ARUs), can record sounds of wildlife over long periods of time in scalable and minimally invasive ways. Deriving per-species abundance estimates from these sensors requires detection, classification, and quantification of animal vocalizations as individual acoustic events. Yet, variability in ambient noise, both over time and across sensors, hinders the reliability of current automated systems for sound event detection (SED), such as convolutional neural networks (CNN) in the time-frequency domain. In this article, we develop, benchmark, and combine several machine listening techniques to improve the generalizability of SED models across heterogeneous acoustic environments. As a case study, we consider the problem of detecting avian flight calls from a ten-hour recording of nocturnal bird migration, recorded by a network of six ARUs in the presence of heterogeneous background noise. Starting from a CNN yielding state-of-the-art accuracy on this task, we introduce two noise adaptation techniques, respectively integrating short-term (60 milliseconds) and long-term (30 minutes) context. First, we apply per-channel energy normalization (PCEN) in the time-frequency domain, which applies short-term automatic gain control to every subband in the mel-frequency spectrogram. Second, we replace the last dense layer in the network by a context-adaptive neural network (CA-NN) layer. Combining them yields state-of-the-art results that are unmatched by artificial data augmentation alone. We release a pre-trained version of our best performing system under the name of BirdVoxDetect, a ready-to-use detector of avian flight calls in field recordings.
In the context of automatic speech recognition and acoustic event detection, an adaptive procedure named per-channel energy normalization (PCEN) has recently shown to outperform the pointwise logarithm of mel-frequency spectrogram (logmelspec) as an acoustic frontend. This letter investigates the adequacy of PCEN for spectrogram-based pattern recognition in far-field noisy recordings, both from theoretical and practical standpoints. First, we apply PCEN on various datasets of natural acoustic environments and find empirically that it Gaussianizes distributions of magnitudes while decorrelating frequency bands. Second, we describe the asymptotic regimes of each component in PCEN: temporal integration, gain control, and dynamic range compression. Third, we give practical advice for adapting PCEN parameters to the temporal properties of the noise to be mitigated, the signal to be enhanced, and the choice of time-frequency representation. As it converts a large class of real-world soundscapes into additive white Gaussian noise, PCEN is a computationally efficient frontend for robust detection and classification of acoustic events in heterogeneous environments.
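The three components listed above can be sketched directly from their definitions: a first-order IIR smoother for temporal integration, division by the smoothed energy for gain control, and a root-plus-offset nonlinearity for dynamic range compression. The numpy implementation below is a minimal stand-in with parameter values that are common defaults in the literature, not the frontend of any particular system.

```python
import numpy as np

# Minimal sketch of per-channel energy normalization (PCEN):
#   M[t] = (1 - s) * M[t-1] + s * E[t]          (temporal integration)
#   G    = E / (eps + M) ** alpha               (adaptive gain control)
#   PCEN = (G + delta) ** r - delta ** r        (range compression)
# Parameter values are common defaults, chosen here for illustration.

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """E: nonnegative (mel-)spectrogram of shape (n_bands, n_frames)."""
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        M[:, t] = (1.0 - s) * M[:, t - 1] + s * E[:, t]
    gain = E / (eps + M) ** alpha
    return (gain + delta) ** r - delta ** r

E = np.abs(np.random.default_rng(0).standard_normal((40, 100))) ** 2
P = pcen(E)
print(P.shape)  # (40, 100)
```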
This paper proposes to perform unsupervised detection of bioacoustic events by pooling the magnitudes of spectrogram frames after per-channel energy normalization (PCEN). Although PCEN was originally developed for speech recognition, it also has beneficial effects in enhancing animal vocalizations, despite the presence of atmospheric absorption and intermittent noise. We prove that PCEN generalizes logarithm-based spectral flux, yet with a tunable time scale for background noise estimation. In comparison with pointwise logarithm, PCEN reduces false alarm rate by 50x in the near field and 5x in the far field, both on avian and marine bioacoustic datasets. Such improvements come at moderate computational cost and require no human intervention, thus heralding a promising future for PCEN in bioacoustics.
In time series classification and regression, signals are typically mapped into some intermediate representation used for constructing models. Since the underlying task is often insensitive to time shifts, these representations are required to be time-shift invariant. We introduce the joint time-frequency scattering transform, a time-shift invariant representation that characterizes the multiscale energy distribution of a signal in time and frequency. It is computed through wavelet convolutions and modulus non-linearities and may, therefore, be implemented as a deep convolutional neural network whose filters are not learned but calculated from wavelets. We consider the progression from mel-spectrograms to time scattering and joint time-frequency scattering transforms, illustrating the relationship between increased discriminability and refinements of convolutional network architectures. The suitability of the joint time-frequency scattering transform for time-shift invariant characterization of time series is demonstrated through applications to chirp signals and audio synthesis experiments. The proposed transform also obtains state-of-the-art results on several audio classification tasks, outperforming time scattering transforms and achieving accuracies comparable to those of fully learned networks.
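The "joint" step can be illustrated with a single two-dimensional Gabor filter applied to a scalogram-like matrix: its modulus responds selectively to one rate (cycles per frame) and one scale (cycles per band) of spectrotemporal modulation, and hence to the orientation of a ridge. The synthetic scalogram and the lone Gabor below are toy stand-ins for the full 2-D wavelet filter bank of the transform.

```python
import numpy as np

# Sketch of one joint time-frequency scattering filter: a 2-D Gabor in the
# (frequency, time) Fourier plane, applied to a scalogram-like matrix. An
# upward ridge concentrates its 2-D spectrum on a line where temporal and
# frequential modulation frequencies have opposite signs, so the matched
# orientation responds more strongly than the mismatched one.

def joint_response(scalogram, rate, scale):
    n_bands, n_frames = scalogram.shape
    f_time = np.fft.fftfreq(n_frames)
    f_freq = np.fft.fftfreq(n_bands)
    gabor_hat = np.exp(
        -0.5 * (((f_freq[:, None] - scale) / (abs(scale) / 2 + 1e-3)) ** 2
                + ((f_time[None, :] - rate) / (abs(rate) / 2 + 1e-3)) ** 2))
    return np.abs(np.fft.ifft2(np.fft.fft2(scalogram) * gabor_hat))

# Synthetic scalogram: an upward ridge (band index rising with frame index).
n_bands, n_frames = 64, 128
band = np.arange(n_bands)[:, None]
frame = np.arange(n_frames)[None, :]
scalogram = np.exp(-0.5 * ((band - n_bands * frame / n_frames) / 2.0) ** 2)

up = joint_response(scalogram, rate=0.05, scale=-0.1).mean()
down = joint_response(scalogram, rate=0.05, scale=0.1).mean()
print(up > down)  # True: the orientation-matched Gabor responds more
```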
Early detection of sleep arousal in polysomnographic (PSG) signals is crucial for monitoring or diagnosing sleep disorders and reducing the risk of further complications, including heart disease and blood pressure fluctuations. In this paper, we present a new automatic detector of non-apnea arousal regions in multichannel PSG recordings. This detector cascades four different modules: a second-order scattering transform (ST) with Morlet wavelets; depthwise-separable convolutional layers; bidirectional long short-term memory (BiLSTM) layers; and dense layers. While the first two are shared across all channels, the latter two operate in a multichannel formulation. Following a deep learning paradigm, the whole architecture is trained in an end-to-end fashion in order to optimize two objectives: the detection of arousal onset and offset, and the classification of the type of arousal. The novelty of the approach is threefold: it is the first use of a hybrid ST-BiLSTM network with biomedical signals; it captures frequency information lower (0.1 Hz) than the detection sampling rate (0.5 Hz); and it requires no explicit mechanism to overcome class imbalance in the data. In the follow-up phase of the 2018 PhysioNet/CinC Challenge, the proposed architecture achieved a state-of-the-art area under the precision-recall curve (AUPRC) of 0.50 on the hidden test data, tied for the second-highest official result overall.
Beyond the scope of thermal conduction, Joseph Fourier’s treatise on the Analytical Theory of Heat (1822) profoundly altered our understanding of acoustic waves. It posits that any function of unit period can be decomposed into a sum of sinusoids, whose respective contributions represent some essential property of the underlying periodic phenomenon. In acoustics, such a decomposition reveals the resonant modes of a freely vibrating string. The introduction of Fourier series thus opened new research avenues on the modeling of musical timbre—a topic that was to become of crucial importance in the 1960s with the advent of computer-generated sounds. This article proposes to revisit the scientific legacy of Joseph Fourier through the lens of computer music research. We first discuss how the Fourier series marked a paradigm shift in our understanding of acoustics, supplanting the theory of consonance of harmonics in the Pythagorean monochord. Then, we highlight the utility of Fourier’s paradigm via three practical problems in analysis–synthesis: the imitation of musical instruments, frequency transposition, and the generation of audio textures. Interestingly, each of these problems involves a different perspective on time–frequency duality, and stimulates a multidisciplinary interplay between research and creation that is still ongoing.
Vibratos, tremolos, trills, and flutter-tongue are techniques frequently found in vocal and instrumental music. A common feature of these techniques is the periodic modulation in the time–frequency domain. We propose a representation based on time–frequency scattering to model the interclass variability for fine discrimination of these periodic modulations. Time–frequency scattering is an instance of the scattering transform, an approach for building invariant, stable, and informative signal representations. The proposed representation is calculated around the wavelet subband of maximal acoustic energy, rather than over all the wavelet bands. To demonstrate the feasibility of this approach, we build a system that computes the representation as input to a machine learning classifier. Whereas previously published datasets for playing technique analysis focus primarily on techniques recorded in isolation, for ecological validity, we create a new dataset to evaluate the system. The dataset, named CBF-periDB, contains full-length expert performances on the Chinese bamboo flute that have been thoroughly annotated by the players themselves. We report F-measures of 99% for flutter-tongue, 82% for trill, 69% for vibrato, and 51% for tremolo detection, and provide explanatory visualisations of scattering coefficients for each of these techniques.
The expressive variability in producing a musical note conveys information essential to the modeling of orchestration and style. As such, it plays a crucial role in computer-assisted browsing of massive digital music corpora. Yet, although the automatic recognition of a musical instrument from the recording of a single “ordinary” note is considered a solved problem, automatic identification of instrumental playing technique (IPT) remains largely underdeveloped. We benchmark machine listening systems for query-by-example browsing among 143 extended IPTs for 16 instruments, amounting to 469 triplets of instrument, mute, and technique. We identify and discuss three necessary conditions for significantly outperforming the traditional mel-frequency cepstral coefficient (MFCC) baseline: the addition of second-order scattering coefficients to account for amplitude modulation, the incorporation of long-range temporal dependencies, and metric learning using large-margin nearest neighbors (LMNN) to reduce intra-class variability. Evaluating on the Studio On Line (SOL) dataset, we obtain a precision at rank 5 of 99.7% for instrument recognition (baseline at 89.0%) and of 61.0% for IPT recognition (baseline at 44.5%). We interpret this gain through a qualitative assessment of practical usability and visualization using nonlinear dimensionality reduction.
We introduce a new multidimensional representation, named eigenprogression transform, that characterizes some essential patterns of Western tonal harmony while being equivariant to time shifts and pitch transpositions. This representation is deep, multiscale, and convolutional in the piano-roll domain, yet incurs no prior training, and is thus suited to both supervised and unsupervised MIR tasks. The eigenprogression transform combines ideas from the spiral scattering transform, spectral graph theory, and wavelet shrinkage denoising. We report state-of-the-art results on a task of supervised composer recognition (Haydn vs. Mozart) from polyphonic music pieces in MIDI format.
This article addresses the automatic detection of vocal, nocturnally migrating birds from a network of acoustic sensors. Thus far, owing to the lack of annotated continuous recordings, existing methods have been benchmarked in a binary classification setting (presence vs. absence). Instead, with the aim of comparing them in event detection, we release BirdVox-full-night, a dataset of 62 hours of audio comprising 35402 flight calls of nocturnally migrating birds, as recorded from 6 sensors. We find a large performance gap between energy-based detection functions and data-driven machine listening. The best model is a deep convolutional neural network trained with data augmentation. We correlate recall with the density of flight calls over time and frequency and identify the main causes of false alarm.
Musical performance combines a wide range of pitches, nuances, and expressive techniques. Audio-based classification of musical instruments thus requires building signal representations that are invariant to such transformations. This article investigates the construction of learned convolutional architectures for instrument recognition, given a limited amount of annotated training data. In this context, we benchmark three different weight sharing strategies for deep convolutional networks in the time-frequency domain: temporal kernels; time-frequency kernels; and a linear combination of time-frequency kernels which are one octave apart, akin to a Shepard pitch spiral. We provide an acoustical interpretation of these strategies within the source-filter framework of quasi-harmonic sounds with a fixed spectral envelope, which are archetypal of musical notes. The best classification accuracy is obtained by hybridizing all three convolutional layers into a single deep learning architecture.
We present a new representation of harmonic sounds that linearizes the dynamics of pitch and spectral envelope, while remaining stable to deformations in the time–frequency plane. It is an instance of the scattering transform, a generic operator which cascades wavelet convolutions and modulus nonlinearities. It is derived from the pitch spiral, in that convolutions are successively performed in time, log-frequency, and octave index. We give a closed-form approximation of spiral scattering coefficients for a nonstationary generalization of the harmonic source–filter model.
We introduce a scattering representation for the analysis and classification of sounds. It is locally translation-invariant, stable to deformations in time and frequency, and has the ability to capture harmonic structures. The scattering representation can be interpreted as a convolutional neural network which cascades a wavelet transform in time and along a harmonic spiral. We study its application for the analysis of the deformations of the source–filter model.
We introduce the joint time–frequency scattering transform, a time shift invariant descriptor of time–frequency structure for audio classification. It is obtained by applying a two-dimensional wavelet transform in time and log-frequency to a time–frequency wavelet scalogram. We show that this descriptor successfully characterizes complex time–frequency phenomena such as time-varying filters and frequency modulated excitations. State-of-the-art results are achieved for signal reconstruction and phone segment classification on the TIMIT dataset.