Sercan Arik - Deep Voice 2: Multi-Speaker Neural Text-to-Speech (2017)
Created: June 1, 2017 / Updated: November 2, 2024 / Status: finished / 2 min read (~268 words)
- Which building blocks of the Deep Voice (1) model can be trained on and shared across all speakers' voices?
- Which part of the Deep Voice pipeline is unique to each speaker (their speech signature/fingerprint)?
- Multi-speaker support is added by augmenting the existing model with an embedding vector which represents a speaker
- One major difference between Deep Voice 2 and Deep Voice 1 is the separation of the phoneme duration and frequency models
- The major architecture changes in Deep Voice 2 are the addition of batch normalization and residual connections in the convolutional layers
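  A minimal sketch of what such a convolutional layer could look like (PyTorch is assumed here; the channel count and kernel width are placeholders, not the paper's values):

  ```python
  import torch
  import torch.nn as nn

  class ResidualConvBlock(nn.Module):
      """1-D convolution followed by batch normalization, with a residual (skip) connection."""
      def __init__(self, channels: int, kernel_size: int = 5):
          super().__init__()
          self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
          self.bn = nn.BatchNorm1d(channels)
          self.relu = nn.ReLU()

      def forward(self, x):
          # x: (batch, channels, time); the skip connection adds the input back after conv + BN.
          return self.relu(self.bn(self.conv(x)) + x)
  ```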
- We introduce a small post-processing step to correct segmentation mistakes for boundaries between silence phonemes and other phonemes: whenever the segmentation model decodes a silence boundary, we adjust the location of the boundary with a silence detection heuristic
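  The notes do not spell out the heuristic itself; a plausible sketch, assuming the decoded boundary is simply snapped to the nearest low-energy frame (the window length and energy threshold are made-up values):

  ```python
  import numpy as np

  def adjust_silence_boundary(audio: np.ndarray, boundary: int,
                              window: int = 256, threshold: float = 0.005) -> int:
      """Snap a decoded silence boundary to the nearest sample whose smoothed,
      normalized energy falls below a threshold (hypothetical parameter values)."""
      energy = audio.astype(np.float64) ** 2
      energy /= energy.max() + 1e-12                       # normalize to [0, 1]
      kernel = np.ones(window) / window
      smoothed = np.convolve(energy, kernel, mode="same")  # moving-average smoothing
      quiet = np.flatnonzero(smoothed < threshold)         # candidate silence positions
      if quiet.size == 0:
          return boundary                                  # no silence found, keep the original
      return int(quiet[np.argmin(np.abs(quiet - boundary))])
  ```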
- Instead of predicting a continuous-valued duration, we formulate duration prediction as a sequence labeling problem
- We discretize the phoneme duration into log-scaled buckets, and assign to each input phoneme the bucket label corresponding to its duration
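  A minimal sketch of such log-scaled bucketing (the bucket count and duration range are assumptions, not the paper's values):

  ```python
  import numpy as np

  def duration_to_bucket(durations_ms: np.ndarray, n_buckets: int = 32,
                         min_ms: float = 5.0, max_ms: float = 500.0) -> np.ndarray:
      """Map continuous phoneme durations (ms) to discrete log-scaled bucket labels."""
      # Bucket edges are spaced evenly in log space, so short durations get finer resolution.
      edges = np.geomspace(min_ms, max_ms, n_buckets + 1)
      # Digitize against the interior edges to get 0-based labels in [0, n_buckets - 1].
      return np.digitize(durations_ms, edges[1:-1])

  # Example: 12 ms and 180 ms land in different (log-scaled) buckets.
  print(duration_to_bucket(np.array([12.0, 180.0])))
  ```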
- In order to synthesize speech from multiple speakers, we augment each of our models with a single low-dimensional speaker embedding vector per speaker
- We use speaker embeddings to produce recurrent neural network (RNN) initial states, nonlinearity biases, and multiplicative gating factors, used throughout the network
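  A minimal sketch of this conditioning (PyTorch assumed; the dimensions, the tanh on the initial state, and the sigmoid gating are assumptions about how such conditioning could be wired up, not the paper's exact choices):

  ```python
  import torch
  import torch.nn as nn

  class SpeakerConditionedGRU(nn.Module):
      """GRU whose initial state and output gating are derived from a speaker embedding."""
      def __init__(self, num_speakers: int, embed_dim: int = 16,
                   hidden: int = 256, in_dim: int = 80):
          super().__init__()
          self.speaker_embed = nn.Embedding(num_speakers, embed_dim)  # one low-dim vector per speaker
          self.to_init_state = nn.Linear(embed_dim, hidden)           # embedding -> RNN initial state
          self.to_gate = nn.Linear(embed_dim, hidden)                 # embedding -> multiplicative gate
          self.gru = nn.GRU(in_dim, hidden, batch_first=True)

      def forward(self, x, speaker_ids):
          # x: (batch, time, in_dim); speaker_ids: (batch,)
          e = self.speaker_embed(speaker_ids)                          # (batch, embed_dim)
          h0 = torch.tanh(self.to_init_state(e)).unsqueeze(0)          # (1, batch, hidden)
          out, _ = self.gru(x, h0)
          gate = torch.sigmoid(self.to_gate(e)).unsqueeze(1)           # (batch, 1, hidden)
          return out * gate                                            # speaker-dependent scaling
  ```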
- The Tacotron character-to-spectrogram architecture consists of (a rough CBHG sketch follows this list)
- a convolution-bank-highway-GRU (CBHG) encoder
- an attentional decoder
- a CBHG post-processing network
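  A rough skeleton of a CBHG module (simplified: the dimensions, kernel-size range, and number of highway layers are placeholders, and some details of the original Tacotron block are omitted):

  ```python
  import torch
  import torch.nn as nn

  class Highway(nn.Module):
      """Highway layer: gated mix of a transformed input and the input itself."""
      def __init__(self, dim: int):
          super().__init__()
          self.h = nn.Linear(dim, dim)
          self.t = nn.Linear(dim, dim)

      def forward(self, x):
          gate = torch.sigmoid(self.t(x))
          return gate * torch.relu(self.h(x)) + (1.0 - gate) * x

  class CBHG(nn.Module):
      """Conv Bank + Highway + bidirectional GRU, as used in the Tacotron encoder/post-net."""
      def __init__(self, dim: int = 128, max_kernel: int = 8, n_highway: int = 4):
          super().__init__()
          # Bank of 1-D convolutions with kernel sizes 1..max_kernel (n-gram-like context).
          self.bank = nn.ModuleList(
              nn.Conv1d(dim, dim, k, padding=k // 2) for k in range(1, max_kernel + 1)
          )
          self.proj = nn.Conv1d(max_kernel * dim, dim, 3, padding=1)  # project bank output back to dim
          self.highways = nn.Sequential(*[Highway(dim) for _ in range(n_highway)])
          self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)

      def forward(self, x):
          # x: (batch, time, dim)
          t = x.size(1)
          y = x.transpose(1, 2)                                       # (batch, dim, time) for Conv1d
          bank_out = torch.cat([torch.relu(conv(y))[:, :, :t] for conv in self.bank], dim=1)
          y = self.proj(bank_out).transpose(1, 2) + x                 # residual connection
          y = self.highways(y)
          out, _ = self.gru(y)                                        # (batch, time, 2 * dim)
          return out
  ```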
- Arik, Sercan, et al. "Deep Voice 2: Multi-Speaker Neural Text-to-Speech." arXiv preprint arXiv:1705.08947 (2017).