In time series classification and regression, signals are typically mapped into some intermediate representation used for constructing models. Since the underlying task is often insensitive to time shifts, these representations are required to be time-shift invariant. We introduce the joint time-frequency scattering transform, a time-shift invariant representation that characterizes the multiscale energy distribution of a signal in time and frequency. It is computed through wavelet convolutions and modulus non-linearities and may, therefore, be implemented as a deep convolutional neural network whose filters are not learned but calculated from wavelets. We consider the progression from mel-spectrograms to time scattering and joint time-frequency scattering transforms, illustrating the relationship between increased discriminability and refinements of convolutional network architectures. The suitability of the joint time-frequency scattering transform for time-shift invariant characterization of time series is demonstrated through applications to chirp signals and audio synthesis experiments. The proposed transform also obtains state-of-the-art results on several audio classification tasks, outperforming time scattering transforms and achieving accuracies comparable to those of fully learned networks.