Synopsis Seriation: A computer music piece made with time–frequency scattering and information geometry

Workshop paper
Vincent Lostanlen, Florian Hecker
Journées d'Informatique Musicale, 2021
Publication year: 2021

This article presents Synopsis Seriation (2021), a computer-aided musical composition. The central idea is to reorganize track fragments from a pre-existing multichannel work so as to produce a stereo stream. We call “seriation” the search for the greatest timbre similarity between successive fragments within each channel, as well as between the left and right channels. Since the number of permutations of a set is the factorial of its cardinality, the space of possible sequences is far too vast to be explored directly by a human. Instead, we formalize seriation as an NP-complete optimization problem of the traveling-salesman type and present an evolutionary algorithm that yields an approximate solution. In this framework, we define the timbre dissimilarity between two fragments using tools from wavelet analysis (time–frequency scattering) and information geometry (Jensen–Shannon divergence).
For this work, we ran the seriation algorithm on a corpus of four works by Florian Hecker, including Formulation (2015). The record label Editions Mego, Vienna, released Synopsis Seriation on CD, together with a booklet of time–frequency scattering infographics designed in partnership with the design studio NORM, Zurich.
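
The sketch below is not the authors' implementation; it only illustrates the formulation described above. Timbre dissimilarities between fragments are taken as Jensen–Shannon divergences between hypothetical time–frequency scattering vectors (here random placeholders), and a toy evolutionary loop searches for a fragment ordering that minimizes the total dissimilarity between successive fragments.

```python
# Minimal sketch (not the authors' code): seriation of audio fragments as a
# traveling-salesman-style search driven by a toy evolutionary loop.
# Assumption: `features` holds one non-negative scattering vector per fragment;
# dissimilarity is the Jensen-Shannon divergence between fragments viewed as
# normalized distributions over scattering coefficients.
import numpy as np
from scipy.spatial.distance import jensenshannon  # returns sqrt of JS divergence

rng = np.random.default_rng(0)
features = rng.random((20, 128))  # placeholder: 20 fragments x 128 coefficients

def dissimilarity_matrix(X):
    P = X / X.sum(axis=1, keepdims=True)          # normalize to probability vectors
    n = len(P)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = jensenshannon(P[i], P[j]) ** 2
    return D

def path_cost(order, D):
    return sum(D[a, b] for a, b in zip(order[:-1], order[1:]))

def evolve(D, pop_size=100, generations=300, mutation_rate=0.3):
    n = len(D)
    population = [rng.permutation(n) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda o: path_cost(o, D))
        survivors = population[: pop_size // 2]
        children = []
        for parent in survivors:
            child = parent.copy()
            if rng.random() < mutation_rate:       # mutate: swap two fragments
                i, j = rng.choice(n, size=2, replace=False)
                child[i], child[j] = child[j], child[i]
            children.append(child)
        population = survivors + children
    return min(population, key=lambda o: path_cost(o, D))

D = dissimilarity_matrix(features)
best_order = evolve(D)
print("fragment order:", best_order, "cost:", round(path_cost(best_order, D), 3))
```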

SONYC-UST-v2: An Urban Sound Tagging Dataset with Spatiotemporal Context

Workshop paper
Mark Cartwright, Jason Cramer, Ana Elisa Méndez Méndez, Yu Wang, Ho-Hsiang Wu, Vincent Lostanlen, Magdalena Fuentes, Graham Dove, Charlie Mydlarz, Justin Salamon, Oded Nov, Juan Pablo Bello
Proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE)
Publication year: 2020

We present SONYC-UST-V2, a dataset for urban sound tagging with spatiotemporal information. This dataset is aimed at the development and evaluation of machine listening systems for real-world urban noise monitoring. While datasets of urban recordings are available, this dataset provides the opportunity to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags. SONYC-UST-V2 consists of 18510 audio recordings from the “Sounds of New York City” (SONYC) acoustic sensor network, including the timestamp of audio acquisition and the location of the sensor. The dataset contains annotations by volunteers from the Zooniverse citizen science platform, as well as a two-stage verification with our team. In this article, we describe our data collection procedure and propose evaluation metrics for multilabel classification of urban sound tags. We report the results of a simple baseline model that exploits spatiotemporal information.
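
As a hypothetical sketch of how spatiotemporal metadata can be combined with audio features for multilabel tagging (this is not the paper's baseline; all inputs below are synthetic placeholders), one can encode the timestamp cyclically, append the sensor coordinates, and train a per-tag classifier:

```python
# Minimal sketch (hypothetical, not the paper's baseline): combine audio
# embeddings with spatiotemporal metadata for multilabel urban sound tagging.
# Assumed inputs: `audio_emb` (one embedding per recording), `hour`/`weekday`
# timestamps, `lat`/`lon` sensor coordinates, and a binary tag matrix `Y`.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n, d, n_tags = 500, 128, 8
audio_emb = rng.normal(size=(n, d))
hour = rng.integers(0, 24, size=n)
weekday = rng.integers(0, 7, size=n)
lat, lon = rng.normal(40.7, 0.05, n), rng.normal(-74.0, 0.05, n)
Y = rng.integers(0, 2, size=(n, n_tags))

# Encode time cyclically so that 23h and 0h stay close, then append location.
spatiotemporal = np.column_stack([
    np.sin(2 * np.pi * hour / 24), np.cos(2 * np.pi * hour / 24),
    np.sin(2 * np.pi * weekday / 7), np.cos(2 * np.pi * weekday / 7),
    lat, lon,
])
X = np.hstack([audio_emb, spatiotemporal])

clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X[:400], Y[:400])
probas = np.column_stack([p[:, 1] for p in clf.predict_proba(X[400:])])
print("macro AUPRC:", average_precision_score(Y[400:], probas, average="macro"))
```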

Helicality: An Isomap-based Measure of Octave Equivalence in Audio Data

Workshop paper
Sripathi Sridhar, Vincent Lostanlen
Extended Abstracts for the Late-Breaking/Demo Session of the International Society for Music Information Retrieval (ISMIR) Conference
Publication year: 2020

Octave equivalence serves as domain knowledge in MIR systems, including the chromagram, spiral convolutional networks, and the harmonic CQT. Prior work has applied the Isomap manifold learning algorithm to unlabeled audio data to embed frequency sub-bands in 3-D space, where the Euclidean distances are inversely proportional to the strength of their Pearson correlations. However, discovering octave equivalence via Isomap requires visual inspection and is not scalable. To address this problem, we define “helicality” as the goodness of fit of the 3-D Isomap embedding to a Shepard–Risset helix. Our method is unsupervised and uses a custom Frank–Wolfe algorithm to minimize a least-squares objective inside a convex hull. Numerical experiments indicate that isolated musical notes have a higher helicality than speech, followed by drum hits.
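
The following sketch is only an illustration of the embedding step, under the assumption that sub-bands come from a constant-Q transform; the Frank–Wolfe helix fit that defines helicality is not reproduced. Standardizing each sub-band makes Euclidean distance a monotone function of (1 - Pearson correlation), which approximates the correlation-based geometry described above.

```python
# Minimal sketch (not the authors' code): embed frequency sub-bands of a
# constant-Q spectrogram in 3-D with Isomap.
import numpy as np
import librosa
from sklearn.manifold import Isomap

y, sr = librosa.load(librosa.ex("trumpet"))          # any audio file works
C = np.abs(librosa.cqt(y, sr=sr, n_bins=72, bins_per_octave=12))

# Rows are sub-bands; z-score each row over time so that Euclidean distance
# between rows reflects their Pearson correlation.
Z = (C - C.mean(axis=1, keepdims=True)) / (C.std(axis=1, keepdims=True) + 1e-8)

embedding = Isomap(n_neighbors=5, n_components=3).fit_transform(Z)
print(embedding.shape)  # (72, 3): one 3-D point per sub-band
```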

SONYC Urban Sound Tagging (SONYC-UST): A Multilabel Dataset from an Urban Acoustic Sensor Network

Workshop paper
Mark Cartwright, Ana Elisa Mendez Mendez, Jason Cramer, Vincent Lostanlen, Graham Dove, Ho-Hsiang Wu, Justin Salamon, Oded Nov, Juan Pablo Bello
Proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE)
Publication year: 2019

SONYC Urban Sound Tagging (SONYC-UST) is a dataset for the development and evaluation of machine listening systems for real-world urban noise monitoring. It consists of 3068 audio recordings from the “Sounds of New York City” (SONYC) acoustic sensor network. Via the Zooniverse citizen science platform, volunteers tagged the presence of 23 fine-grained classes that were chosen in consultation with the New York City Department of Environmental Protection. These 23 fine-grained classes can be grouped into eight coarse-grained classes. In this work, we describe the collection of this dataset, metrics used to evaluate tagging systems, and the results of a simple baseline model.
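
The two-level taxonomy mentioned above lends itself to a simple aggregation rule: a coarse class is present whenever any of its fine classes is. The sketch below assumes an illustrative fine-to-coarse mapping (the label strings are placeholders, not the dataset's official taxonomy file).

```python
# Minimal sketch (assumed mapping): derive coarse-grained tags from
# fine-grained tags by taking a logical OR (max over probabilities) within
# each coarse group.
FINE_TO_COARSE = {
    "small-sounding-engine": "engine",
    "medium-sounding-engine": "engine",
    "large-sounding-engine": "engine",
    "jackhammer": "machinery-impact",
    "hoe-ram": "machinery-impact",
    "car-horn": "alert-signal",
    "siren": "alert-signal",
    "dog-barking-whining": "dog",
}

def coarsen(fine_probs: dict) -> dict:
    """Aggregate fine-grained tag probabilities into coarse-grained ones."""
    coarse = {}
    for fine, p in fine_probs.items():
        c = FINE_TO_COARSE[fine]
        coarse[c] = max(coarse.get(c, 0.0), p)   # OR for binary, max for probabilities
    return coarse

print(coarsen({"jackhammer": 0.9, "hoe-ram": 0.2, "car-horn": 0.7, "siren": 0.1}))
```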

Matching Human Vocal Imitations to Birdsong: An Exploratory Analysis

Workshop paper
Kendra Oudyk, Yun-Han Wu, Vincent Lostanlen, Justin Salamon, Andrew Farnsworth, and Juan Pablo Bello
Proceedings of the International Workshop on Vocal Interactivity in-and-between Humans, Animals, and Robots
Publication year: 2019

We explore computational strategies for matching human vocal imitations of birdsong to actual birdsong recordings. We recorded human vocal imitations of birdsong and subsequently analysed these data using three categories of audio features for matching imitations to original birdsong: spectral, temporal, and spectrotemporal. These exploratory analyses suggest that spectral features can help distinguish imitation strategies (e.g., whistling vs. singing) but are insufficient for distinguishing species. Similarly, whereas temporal features are correlated between human imitations and natural birdsong, they are also insufficient. Spectrotemporal features showed the greatest promise, in particular when used to extract a representation of the pitch contour of birdsong and human imitations. This finding suggests a link between the task of matching human imitations to birdsong and retrieval tasks in the music domain, such as query-by-humming and cover song retrieval; we borrow from such existing methodologies to outline directions for future research.
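
Inspired by the query-by-humming analogy, here is a minimal sketch (an assumption, not the paper's exact pipeline) that matches a vocal imitation to candidate birdsong recordings by dynamic time warping on pitch contours estimated with pYIN; the file paths are placeholders.

```python
# Minimal sketch: rank candidate recordings by DTW cost between pitch contours.
import numpy as np
import librosa

def pitch_contour(path, fmin=150.0, fmax=8000.0):
    y, sr = librosa.load(path)
    f0, voiced, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    cents = 1200 * np.log2(f0 / np.nanmedian(f0))     # transposition-invariant
    return np.nan_to_num(cents, nan=0.0)              # zero out unvoiced frames

def dtw_cost(contour_a, contour_b):
    D, _ = librosa.sequence.dtw(X=contour_a[np.newaxis, :],
                                Y=contour_b[np.newaxis, :])
    return D[-1, -1] / (len(contour_a) + len(contour_b))  # length-normalized

imitation = pitch_contour("imitation.wav")
candidates = {name: pitch_contour(name) for name in ["song_a.wav", "song_b.wav"]}
ranking = sorted(candidates, key=lambda name: dtw_cost(imitation, candidates[name]))
print("best match:", ranking[0])
```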

Long-distance Detection of Bioacoustic Events with Per-Channel Energy Normalization

Workshop paper
Vincent Lostanlen, Kaitlin Palmer, Elly Knight, Christopher Clark, Holger Klinck, Andrew Farnsworth, Tina Wong, Jason Cramer, Juan Pablo Bello
Proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE)
Publication year: 2019

This paper proposes to perform unsupervised detection of bioacoustic events by pooling the magnitudes of spectrogram frames after per-channel energy normalization (PCEN). Although PCEN was originally developed for speech recognition, it also has beneficial effects in enhancing animal vocalizations, despite the presence of atmospheric absorption and intermittent noise. We prove that PCEN generalizes logarithm-based spectral flux, yet with a tunable time scale for background noise estimation. In comparison with pointwise logarithm, PCEN reduces false alarm rate by 50x in the near field and 5x in the far field, both on avian and marine bioacoustic datasets. Such improvements come at moderate computational cost and require no human intervention, thus heralding a promising future for PCEN in bioacoustics.
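
As a concrete illustration (a sketch, not the paper's code), the snippet below applies the standard PCEN formulation to a mel spectrogram and then pools the normalized magnitudes per frame to obtain an unsupervised detection function; parameter values and the example recording are placeholders, and librosa.pcen offers a tuned implementation of the same idea.

```python
# Minimal sketch of per-channel energy normalization (PCEN):
#   PCEN(t, f) = (E(t, f) / (eps + M(t, f))**alpha + delta)**r - delta**r
# where M is a first-order IIR smoother of E along time.
import numpy as np
import librosa

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    M = np.zeros_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):                    # smooth each channel over time
        M[:, t] = (1 - s) * M[:, t - 1] + s * E[:, t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

y, sr = librosa.load(librosa.ex("trumpet"))           # placeholder recording
E = librosa.feature.melspectrogram(y=y, sr=sr)
P = pcen(E)

# Unsupervised detection: pool PCEN magnitudes per frame, then threshold.
detection_function = P.max(axis=0)
events = detection_function > np.median(detection_function) + 3 * detection_function.std()
print(f"{events.sum()} frames above threshold out of {len(events)}")
```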

Extended Playing Techniques: The Next Milestone in Musical Instrument Recognition

Workshop paper
Vincent Lostanlen, Joakim Andén, Mathieu Lagrange
Proceedings of the International Workshop on Digital Libraries for Musicology (DLfM)
Publication year: 2018

The expressive variability in producing a musical note conveys information essential to the modeling of orchestration and style. As such, it plays a crucial role in computer-assisted browsing of massive digital music corpora. Yet, although the automatic recognition of a musical instrument from the recording of a single “ordinary” note is considered a solved problem, automatic identification of instrumental playing technique (IPT) remains largely underdeveloped. We benchmark machine listening systems for query-by-example browsing among 143 extended IPTs for 16 instruments, amounting to 469 triplets of instrument, mute, and technique. We identify and discuss three necessary conditions for significantly outperforming the traditional mel-frequency cepstral coefficient (MFCC) baseline: the addition of second-order scattering coefficients to account for amplitude modulation, the incorporation of long-range temporal dependencies, and metric learning using large-margin nearest neighbors (LMNN) to reduce intra-class variability. Evaluating on the Studio On Line (SOL) dataset, we obtain a precision at rank 5 of 99.7% for instrument recognition (baseline at 89.0%) and of 61.0% for IPT recognition (baseline at 44.5%). We interpret this gain through a qualitative assessment of practical usability and visualization using nonlinear dimensionality reduction.
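
For readers unfamiliar with the evaluation protocol, here is a minimal sketch of query-by-example retrieval scored by precision at rank 5, with synthetic features standing in for scattering coefficients; the LMNN metric-learning step of the paper is not reproduced, and plain Euclidean distance is used instead.

```python
# Minimal sketch (hypothetical features, not the paper's pipeline): evaluate
# query-by-example retrieval of playing techniques with precision at rank 5.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n_classes, per_class, dim = 30, 10, 64
labels = np.repeat(np.arange(n_classes), per_class)
X = rng.normal(size=(len(labels), dim)) + 3.0 * rng.normal(size=(n_classes, dim))[labels]

# For each query, retrieve the 5 nearest other recordings and count how many
# share the query's class.
nn = NearestNeighbors(n_neighbors=6).fit(X)           # 6 = query itself + 5 neighbors
_, idx = nn.kneighbors(X)
precision_at_5 = (labels[idx[:, 1:]] == labels[:, None]).mean()
print(f"precision at rank 5: {precision_at_5:.3f}")
```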

Eigentriads and Eigenprogressions on the Tonnetz

Workshop paper
Vincent Lostanlen
ISMIR Late-Breaking/Demo Session
Publication year: 2018

We introduce a new multidimensional representation, named eigenprogression transform, that characterizes some essential patterns of Western tonal harmony while being equivariant to time shifts and pitch transpositions. This representation is deep, multiscale, and convolutional in the piano-roll domain, yet incurs no prior training, and is thus suited to both supervised and unsupervised MIR tasks. The eigenprogression transform combines ideas from the spiral scattering transform, spectral graph theory, and wavelet shrinkage denoising. We report state-of-the-art results on a task of supervised composer recognition (Haydn vs. Mozart) from polyphonic music pieces in MIDI format.
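
The spectral graph theory ingredient can be illustrated with a much simplified example (not the eigenprogression transform itself): build a Tonnetz-like graph on the 12 pitch classes, connecting each pitch class to its neighbors by minor third, major third, and perfect fifth, and use the eigenvectors of its graph Laplacian as a harmonic basis.

```python
# Minimal sketch: Laplacian eigenbasis of a Tonnetz-like pitch-class graph.
import numpy as np

intervals = (3, 4, 7)                          # minor third, major third, fifth
A = np.zeros((12, 12))
for p in range(12):
    for i in intervals:
        A[p, (p + i) % 12] = A[(p + i) % 12, p] = 1

L = np.diag(A.sum(axis=1)) - A                 # combinatorial graph Laplacian
eigenvalues, eigenvectors = np.linalg.eigh(L)  # sorted by graph frequency

# Project a C-major triad (pitch classes 0, 4, 7) onto the Laplacian eigenbasis.
chroma = np.zeros(12)
chroma[[0, 4, 7]] = 1
coefficients = eigenvectors.T @ chroma
print(np.round(coefficients, 2))
```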

Transformée en scattering sur la spirale temps–chroma–octave

Workshop paper
Vincent Lostanlen, Stéphane Mallat
Actes du colloque GRETSI, 2015
Publication year: 2015

We introduce a scattering representation for the analysis and classification of sounds. It is locally translation-invariant, stable to deformations in time and frequency, and has the ability to capture harmonic structures. The scattering representation can be interpreted as a convolutional neural network which cascades a wavelet transform in time and along a harmonic spiral. We study its application for the analysis of the deformations of the source–filter model.
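
The spiral organization can be illustrated with the following sketch (an illustration, not the full spiral scattering transform): a constant-Q scalogram is reshaped into a (chroma, octave, time) tensor, so that filtering along the octave axis couples harmonically related partials.

```python
# Minimal sketch: reshape a constant-Q scalogram onto the time-chroma-octave spiral.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))                  # placeholder signal
bins_per_octave, n_octaves = 12, 6
C = np.abs(librosa.cqt(y, sr=sr, n_bins=bins_per_octave * n_octaves,
                       bins_per_octave=bins_per_octave))

# Frequency axis runs chroma-fastest: bin = octave * 12 + chroma.
spiral = C.reshape(n_octaves, bins_per_octave, -1)           # (octave, chroma, time)
spiral = spiral.transpose(1, 0, 2)                           # (chroma, octave, time)

# Crude "filter along the spiral": a difference across octaves, highlighting
# how energy moves between harmonically related sub-bands.
octave_flux = np.diff(spiral, axis=1)
print(spiral.shape, octave_flux.shape)
```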

Joint Time–Frequency Scattering for Audio Classification

Workshop paper
Joakim Andén, Vincent Lostanlen, and Stéphane Mallat
Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP)
Publication year: 2015

We introduce the joint time–frequency scattering transform, a time shift invariant descriptor of time–frequency structure for audio classification. It is obtained by applying a two-dimensional wavelet transform in time and log-frequency to a time–frequency wavelet scalogram. We show that this descriptor successfully characterizes complex time–frequency phenomena such as time-varying filters and frequency modulated excitations. State-of-the-art results are achieved for signal reconstruction and phone segment classification on the TIMIT dataset.
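
The core operation can be sketched as follows (an illustration under simplifying assumptions, not the full transform): convolve a wavelet scalogram with a two-dimensional Gabor wavelet over (log-frequency, time) and take the modulus; the wavelet's orientation sets the frequency-modulation slope it responds to. A complete implementation is available in the Kymatio library.

```python
# Minimal sketch: modulus of a 2-D Gabor wavelet convolution of a scalogram.
import numpy as np
import librosa
from scipy.signal import fftconvolve

y, sr = librosa.load(librosa.ex("trumpet"))
scalogram = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))

def gabor_2d(shape=(15, 15), freq_t=0.25, freq_f=0.25, sigma=3.0):
    """Complex 2-D Gabor wavelet oriented along a (log-frequency, time) slope."""
    f = np.arange(shape[0]) - shape[0] // 2
    t = np.arange(shape[1]) - shape[1] // 2
    F, T = np.meshgrid(f, t, indexing="ij")
    envelope = np.exp(-(F ** 2 + T ** 2) / (2 * sigma ** 2))
    carrier = np.exp(2j * np.pi * (freq_f * F + freq_t * T))
    return envelope * carrier

# Joint wavelet convolution followed by modulus: responds to frequency
# modulations whose slope matches the chosen (freq_f, freq_t) orientation.
psi = gabor_2d()
U = np.abs(fftconvolve(scalogram, psi, mode="same"))
print(scalogram.shape, U.shape)
```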