Sercan Arik - Deep Voice 2: Multi-Speaker Neural Text-to-Speech (2017)
Created: June 1, 2017 / Updated: November 2, 2024 / Status: finished / 2 min read (~268 words)
- Which building blocks of the Deep Voice (1) model can be trained on and shared across all speakers' voices?
- Which part of the Deep Voice pipeline is unique to each speaker (their speech signature/fingerprint)?
- Multi-speaker support is added by augmenting the existing model with an embedding vector which represents a speaker
- One major difference between Deep Voice 2 and Deep Voice 1 is the separation of the phoneme duration and frequency models
- The major architecture changes in Deep Voice 2 are the addition of batch normalization and residual connections in the convolutional layers
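  A minimal sketch of what such a convolutional layer could look like (PyTorch is assumed here; the channel count and kernel width are placeholders, not the paper's values):

  ```python
  import torch
  import torch.nn as nn

  class ResidualConvBlock(nn.Module):
      """1-D convolution followed by batch normalization, with a residual (skip) connection."""
      def __init__(self, channels: int, kernel_size: int = 5):
          super().__init__()
          self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
          self.bn = nn.BatchNorm1d(channels)
          self.relu = nn.ReLU()

      def forward(self, x):
          # x: (batch, channels, time); the skip connection adds the input back after conv + BN.
          return self.relu(self.bn(self.conv(x)) + x)
  ```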
- We introduce a small post-processing step to correct segmentation mistakes for boundaries between silence phonemes and other phonemes: whenever the segmentation model decodes a silence boundary, we adjust the location of the boundary with a silence detection heuristic
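  The notes do not spell out the heuristic itself; a plausible sketch, assuming the decoded boundary is simply snapped to the nearest low-energy frame (the window length and energy threshold are made-up values):

  ```python
  import numpy as np

  def adjust_silence_boundary(audio: np.ndarray, boundary: int,
                              window: int = 256, threshold: float = 0.005) -> int:
      """Snap a decoded silence boundary to the nearest sample whose smoothed,
      normalized energy falls below a threshold (hypothetical parameter values)."""
      energy = audio.astype(np.float64) ** 2
      energy /= energy.max() + 1e-12                       # normalize to [0, 1]
      kernel = np.ones(window) / window
      smoothed = np.convolve(energy, kernel, mode="same")  # moving-average smoothing
      quiet = np.flatnonzero(smoothed < threshold)         # candidate silence positions
      if quiet.size == 0:
          return boundary                                  # no silence found, keep the original
      return int(quiet[np.argmin(np.abs(quiet - boundary))])
  ```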
- Instead of predicting a continuous-valued duration, we formulate duration prediction as a sequence labeling problem
- We discretize the phoneme duration into log-scaled buckets, and assign to each input phoneme the bucket label corresponding to its duration
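  A minimal sketch of such log-scaled bucketing (the bucket count and duration range are assumptions, not the paper's values):

  ```python
  import numpy as np

  def duration_to_bucket(durations_ms: np.ndarray, n_buckets: int = 32,
                         min_ms: float = 5.0, max_ms: float = 500.0) -> np.ndarray:
      """Map continuous phoneme durations (ms) to discrete log-scaled bucket labels."""
      # Bucket edges are spaced evenly in log space, so short durations get finer resolution.
      edges = np.geomspace(min_ms, max_ms, n_buckets + 1)
      # Digitize against the interior edges to get 0-based labels in [0, n_buckets - 1].
      return np.digitize(durations_ms, edges[1:-1])

  # Example: 12 ms and 180 ms land in different (log-scaled) buckets.
  print(duration_to_bucket(np.array([12.0, 180.0])))
  ```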
- In order to synthesize speech from multiple speakers, we augment each of our models with a single low-dimensional speaker embedding vector per speaker
- We use speaker embeddings to produce recurrent neural network (RNN) initial states, nonlinearity biases, and multiplicative gating factors, used throughout the network
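  A minimal sketch of this conditioning (PyTorch assumed; the dimensions, the tanh on the initial state, and the sigmoid gating are assumptions about how such conditioning could be wired up, not the paper's exact choices):

  ```python
  import torch
  import torch.nn as nn

  class SpeakerConditionedGRU(nn.Module):
      """GRU whose initial state and output gating are derived from a speaker embedding."""
      def __init__(self, num_speakers: int, embed_dim: int = 16,
                   hidden: int = 256, in_dim: int = 80):
          super().__init__()
          self.speaker_embed = nn.Embedding(num_speakers, embed_dim)  # one low-dim vector per speaker
          self.to_init_state = nn.Linear(embed_dim, hidden)           # embedding -> RNN initial state
          self.to_gate = nn.Linear(embed_dim, hidden)                 # embedding -> multiplicative gate
          self.gru = nn.GRU(in_dim, hidden, batch_first=True)

      def forward(self, x, speaker_ids):
          # x: (batch, time, in_dim); speaker_ids: (batch,)
          e = self.speaker_embed(speaker_ids)                          # (batch, embed_dim)
          h0 = torch.tanh(self.to_init_state(e)).unsqueeze(0)          # (1, batch, hidden)
          out, _ = self.gru(x, h0)
          gate = torch.sigmoid(self.to_gate(e)).unsqueeze(1)           # (batch, 1, hidden)
          return out * gate                                            # speaker-dependent scaling
  ```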
- The Tacotron character-to-spectrogram architecture consists of (a rough CBHG sketch follows this list)
- a convolution-bank-highway-GRU (CBHG) encoder
- an attentional decoder
- a CBHG post-processing network
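  A rough skeleton of a CBHG module (simplified: the dimensions, kernel-size range, and number of highway layers are placeholders, and some details of the original Tacotron block are omitted):

  ```python
  import torch
  import torch.nn as nn

  class Highway(nn.Module):
      """Highway layer: gated mix of a transformed input and the input itself."""
      def __init__(self, dim: int):
          super().__init__()
          self.h = nn.Linear(dim, dim)
          self.t = nn.Linear(dim, dim)

      def forward(self, x):
          gate = torch.sigmoid(self.t(x))
          return gate * torch.relu(self.h(x)) + (1.0 - gate) * x

  class CBHG(nn.Module):
      """Conv Bank + Highway + bidirectional GRU, as used in the Tacotron encoder/post-net."""
      def __init__(self, dim: int = 128, max_kernel: int = 8, n_highway: int = 4):
          super().__init__()
          # Bank of 1-D convolutions with kernel sizes 1..max_kernel (n-gram-like context).
          self.bank = nn.ModuleList(
              nn.Conv1d(dim, dim, k, padding=k // 2) for k in range(1, max_kernel + 1)
          )
          self.proj = nn.Conv1d(max_kernel * dim, dim, 3, padding=1)  # project bank output back to dim
          self.highways = nn.Sequential(*[Highway(dim) for _ in range(n_highway)])
          self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)

      def forward(self, x):
          # x: (batch, time, dim)
          t = x.size(1)
          y = x.transpose(1, 2)                                       # (batch, dim, time) for Conv1d
          bank_out = torch.cat([torch.relu(conv(y))[:, :, :t] for conv in self.bank], dim=1)
          y = self.proj(bank_out).transpose(1, 2) + x                 # residual connection
          y = self.highways(y)
          out, _ = self.gru(y)                                        # (batch, time, 2 * dim)
          return out
  ```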
- Arik, Sercan, et al. "Deep Voice 2: Multi-Speaker Neural Text-to-Speech." arXiv preprint arXiv:1705.08947 (2017).