Abstract
Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the inherent cost of their iterative sampling process has hindered their application to text-to-speech deployment. Through a preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Instead of estimating the gradient of the data density, ProDiff parameterizes the denoising model by directly predicting clean data, which avoids distinct quality degradation when the number of sampling iterations is reduced. To tackle the model convergence challenge with reduced iterations, ProDiff reduces the data variance on the target side via knowledge distillation. Specifically, the denoising model uses the mel-spectrogram generated by an N-step DDIM teacher as the training target and distills the behavior into a new model with N/2 steps. As such, it allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms, while maintaining sample quality and diversity competitive with state-of-the-art models that use hundreds of steps. ProDiff enables sampling 24x faster than real-time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech deployment for the first time. Our extensive ablation studies demonstrate that each design in ProDiff is effective, and we further show that ProDiff can be easily extended to the multi-speaker setting. Audio samples are available at https://prodiff.github.io/.
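To make the two ideas in the abstract concrete, the sketch below shows a deterministic DDIM sampler driven by a model that predicts clean data directly, plus a helper that halves the number of sampling steps. It is a minimal illustration under stated assumptions, not the ProDiff implementation: the names `ddim_step`, `sample`, `halve_schedule` and `predict_x0` are hypothetical, the schedule is a plain list of cumulative signal levels chosen for illustration, and the distillation training loop itself is omitted.

```python
import math


def ddim_step(x_t, x0_hat, abar_t, abar_s):
    """One deterministic DDIM update (eta = 0) from signal level abar_t to abar_s.

    x_t and x0_hat are lists of floats; abar_* are cumulative alpha products,
    with 0 < abar_t < 1 and abar_t < abar_s <= 1.
    """
    # Noise implied by the current sample and the predicted clean data.
    eps_hat = [
        (x - math.sqrt(abar_t) * x0) / math.sqrt(1.0 - abar_t)
        for x, x0 in zip(x_t, x0_hat)
    ]
    # Re-noise the prediction to the (less noisy) target level.
    return [
        math.sqrt(abar_s) * x0 + math.sqrt(1.0 - abar_s) * e
        for x0, e in zip(x0_hat, eps_hat)
    ]


def sample(predict_x0, x_start, abars):
    """Run the sampler along `abars`, ordered from noisiest to clean (last = 1.0).

    `predict_x0(x, abar)` is an assumed stand-in for the denoising network
    under the generator-based (clean-data) parameterization.
    """
    x = x_start
    for abar_t, abar_s in zip(abars[:-1], abars[1:]):
        x = ddim_step(x, predict_x0(x, abar_t), abar_t, abar_s)
    return x


def halve_schedule(abars):
    """Keep every second level, turning an N-step schedule into N/2 steps."""
    steps = len(abars) - 1
    if steps < 2 or steps % 2 != 0:
        raise ValueError("need an even number of steps (at least 2)")
    return abars[::2]
```

For example, a 4-step teacher schedule `[0.1, 0.4, 0.7, 0.9, 1.0]` becomes the 2-step student schedule `[0.1, 0.7, 1.0]`. In a distillation setup of the kind the abstract describes, the student would be trained so that its output along the shorter schedule matches what the teacher generates along the longer one.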
Progressive Fast Diffusion Model for High-Quality Text-to-Speech
Preliminary Analyses
Reference Text: This type was introduced into England by Wynkyn de Worde, Caxton’s successor
Method | Recording | 128 iter | 64 iter | 32 iter | 16 iter | 8 iter | 4 iter | 2 iter |
---|---|---|---|---|---|---|---|---|
Gradient-Based | | | | | | | | |
Generator-Based | | | | | | | | |
Reference Text: the ends of many of the letters such as the t and e are hooked up in a vulgar and meaningless way
Method | Recording | 128 iter | 64 iter | 32 iter | 16 iter | 8 iter | 4 iter | 2 iter |
---|---|---|---|---|---|---|---|---|
Gradient-Based | | | | | | | | |
Generator-Based | | | | | | | | |
Reference Text: The essential point to be remembered is that the ornament, whatever it is, whether picture or pattern-work, should form part of the page
Method | Recording | 128 iter | 64 iter | 32 iter | 16 iter | 8 iter | 4 iter | 2 iter |
---|---|---|---|---|---|---|---|---|
Gradient-Based | | | | | | | | |
Generator-Based | | | | | | | | |
Reference Text: The prison population fluctuated a great deal
Method | Recording | 128 iter | 64 iter | 32 iter | 16 iter | 8 iter | 4 iter | 2 iter |
---|---|---|---|---|---|---|---|---|
Gradient-Based | | | | | | | | |
Generator-Based | | | | | | | | |
Performance
LJSpeech
Reference/Target Text: This type was introduced into England by Wynkyn de Worde, Caxton’s successor
Recording | Tacotron 2 | FastSpeech 2 | GANSpeech | Glow-TTS | Grad-TTS | DiffSpeech | ProDiff Teacher | ProDiff |
---|---|---|---|---|---|---|---|---|
Reference/Target Text: the ends of many of the letters such as the t and e are hooked up in a vulgar and meaningless way
Recording | Tacotron 2 | FastSpeech 2 | GANSpeech | Glow-TTS | Grad-TTS | DiffSpeech | ProDiff Teacher | ProDiff |
---|---|---|---|---|---|---|---|---|
Reference/Target Text: The essential point to be remembered is that the ornament, whatever it is, whether picture or pattern-work, should form part of the page
Recording | Tacotron 2 | FastSpeech 2 | GANSpeech | Glow-TTS | Grad-TTS | DiffSpeech | ProDiff Teacher | ProDiff |
---|---|---|---|---|---|---|---|---|
Reference/Target Text: The prison population fluctuated a great deal
Recording | Tacotron 2 | FastSpeech 2 | GANSpeech | Glow-TTS | Grad-TTS | DiffSpeech | ProDiff Teacher | ProDiff |
---|---|---|---|---|---|---|---|---|
LibriTTS
Reference/Target Text: villeforts conduct , therefore , upon reflection , appeared to the baroness as if shaped for their mutual advantage .
Recording | Tacotron 2 | FastSpeech 2 | GANSpeech | Glow-TTS | Grad-TTS | DiffSpeech | ProDiff |
---|---|---|---|---|---|---|---|
Reference/Target Text: she would invoke the past , recall old recollections ; she would supplicate him by the remembrance of guilty , yet happy days .
Recording | Tacotron 2 | FastSpeech 2 | GANSpeech | Glow-TTS | Grad-TTS | DiffSpeech | ProDiff |
---|---|---|---|---|---|---|---|
Reference/Target Text: do you intend opening the door ? said the baroness .
Recording | Tacotron 2 | FastSpeech 2 | GANSpeech | Glow-TTS | Grad-TTS | DiffSpeech | ProDiff |
---|---|---|---|---|---|---|---|
Reference/Target Text: come , forget him for a moment , and instead of pursuing him let him go .
Recording | Tacotron 2 | FastSpeech 2 | GANSpeech | Glow-TTS | Grad-TTS | DiffSpeech | ProDiff |
---|---|---|---|---|---|---|---|
Ablation
Reference/Target Text: This type was introduced into England by Wynkyn de Worde, Caxton’s successor
Recording | w/o GP | w/o KD | Teacher(T=16) | Teacher(T=8) | ProDiff |
---|---|---|---|---|---|
Reference/Target Text: the ends of many of the letters such as the t and e are hooked up in a vulgar and meaningless way
Recording | w/o GP | w/o KD | Teacher(T=16) | Teacher(T=8) | ProDiff |
---|---|---|---|---|---|
Reference/Target Text: The essential point to be remembered is that the ornament, whatever it is, whether picture or pattern-work, should form part of the page
Recording | w/o GP | w/o KD | Teacher(T=16) | Teacher(T=8) | ProDiff |
---|---|---|---|---|---|
Reference/Target Text: The prison population fluctuated a great deal
Recording | w/o GP | w/o KD | Teacher(T=16) | Teacher(T=8) | ProDiff |
---|---|---|---|---|---|