We introduce the joint time–frequency scattering transform, a time-shift-invariant descriptor of time–frequency structure for audio classification. It is obtained by applying a two-dimensional wavelet transform in time and log-frequency to a time–frequency wavelet scalogram. We show that this descriptor successfully characterizes complex time–frequency phenomena such as time-varying filters and frequency-modulated excitations. State-of-the-art results are achieved for signal reconstruction and phone segment classification on the TIMIT dataset.
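
As an illustration of the construction described above, the following is a minimal NumPy sketch, not the authors' implementation. It makes several simplifying assumptions: Gaussian band-pass filters defined in the Fourier domain stand in for the wavelets, all convolutions are circular (including along the log-frequency axis), and the time average is taken over the whole signal rather than over a finite window. The function names (`gaussian_bandpass`, `scalogram`, `joint_scattering`) and every parameter value are hypothetical choices made for this example.

```python
import numpy as np

def gaussian_bandpass(n, centers, rel_width=0.4):
    """Gaussian band-pass filters in the Fourier domain, one row per center.

    Assumption: these stand in for analytic wavelets. A positive center
    selects positive frequencies, a negative center the opposite orientation.
    """
    freqs = np.fft.fftfreq(n)  # normalized frequencies in [-0.5, 0.5)
    c = np.asarray(centers, dtype=float)[:, None]
    return np.exp(-0.5 * ((freqs[None, :] - c) / (rel_width * np.abs(c))) ** 2)

def scalogram(x, n_octaves=6, q=4):
    """Modulus of a constant-Q filter bank output, shape (n_lambda, n_time)."""
    n = x.shape[-1]
    # Geometrically spaced center frequencies: the row index is log-frequency.
    centers = 0.4 * 2.0 ** (-np.arange(n_octaves * q) / q)
    psi_hat = gaussian_bandpass(n, centers)
    return np.abs(np.fft.ifft(np.fft.fft(x)[None, :] * psi_hat, axis=-1))

def joint_scattering(x, n_octaves=6, q=4):
    """Filter the scalogram jointly in time and log-frequency, take the
    modulus, then average over time. Returns (n_rates, n_scales, n_lambda)."""
    x1 = scalogram(x, n_octaves, q)
    n_lambda, n_time = x1.shape
    # Temporal modulation rates (hypothetical values).
    rates = 0.25 * 2.0 ** (-np.arange(1, 6))
    # Modulation scales along log-frequency, with both orientations.
    scales = 0.25 * 2.0 ** (-np.arange(0, 3))
    scales = np.concatenate([scales, -scales])
    psi_t = gaussian_bandpass(n_time, rates)      # (n_rates, n_time)
    psi_f = gaussian_bandpass(n_lambda, scales)   # (n_scales, n_lambda)
    x1_hat = np.fft.fft2(x1)
    out = np.empty((len(rates), len(scales), n_lambda))
    for i, pt in enumerate(psi_t):
        for j, pf in enumerate(psi_f):
            # Separable 2-D filter: outer product of the two 1-D responses.
            y = np.fft.ifft2(x1_hat * (pf[:, None] * pt[None, :]))
            # Global time average of the modulus (assumption: no finite window).
            out[i, j] = np.abs(y).mean(axis=-1)
    return out
```

Because every step in this sketch is a circular convolution or a pointwise modulus, followed by a global average over time, the output is unchanged by a circular time shift of the input. For example, `joint_scattering(x)` and `joint_scattering(np.roll(x, 137))` agree up to floating-point error.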