Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation

Anonymous Author

Abstract. Controllable TTS models with natural language prompts often lack the ability for fine-grained control and face a scarcity of high-quality data. We propose a two-stage style-controllable TTS system with language models, utilizing a quantized masked-autoencoded style-rich representation as an intermediary. In the first stage, an autoregressive transformer is used for the conditional generation of these style-rich tokens from text and control signals. The second stage generates codec tokens from both text and sampled style-rich tokens. Experiments show that training the first-stage model on extensive datasets enhances the content robustness of the two-stage model as well as control capabilities over multiple attributes. By selectively combining discrete labels and speaker embeddings, we explore fully controlling the speaker’s timbre and other stylistic information, and adjusting attributes like emotion for a specified speaker. Audio samples are available at https://style-ar-tts.github.io.

Model Overview

Our controllable TTS system consists of two major stages, with discrete style-rich tokens as the intermediate representation. This style representation comes from a transformer-based MAE, illustrated in figure (a), which learns to capture style information in speech, including speaker timbre, prosody, and acoustic environment, through a mask-reconstruction paradigm. The style-rich tokens of a speech clip are extracted with the style encoder of the pre-trained MAE, followed by a separately trained residual vector quantizer (RVQ). The two stages of TTS are (1) style-rich token (ST) generation, which generates style-rich tokens conditioned on content phonemes and style-controlling signals, including discrete labels and/or continuous speaker embeddings; and (2) codec token (CT) generation, which generates codec tokens conditioned on content phonemes and style-rich tokens, where the style-rich tokens are either extracted from ground-truth speech or predicted by the first stage. The codec decoder then reconstructs the waveform from the generated codec tokens. Each stage relies on a decoder-only transformer for LM-style generation, as illustrated in figure (b).
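The data flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ControlSignals`, `synthesize`, `st_lm`, `ct_lm`, and `codec_decode` are hypothetical, and the three toy functions only stand in for the stage-1 language model, the stage-2 language model, and the codec decoder so that the example runs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class ControlSignals:
    """Hypothetical container for the style-controlling inputs of stage 1."""
    labels: Dict[str, int] = field(default_factory=dict)   # discrete labels, e.g. {"age": 3}
    speaker_embedding: Optional[Sequence[float]] = None    # continuous, optional


def synthesize(
    phonemes: Sequence[str],
    controls: ControlSignals,
    st_lm: Callable[[Sequence[str], ControlSignals], List[int]],
    ct_lm: Callable[[Sequence[str], List[int]], List[int]],
    codec_decode: Callable[[List[int]], List[float]],
    gt_style_tokens: Optional[List[int]] = None,
) -> List[float]:
    # Stage 1 (ST generation): predict style-rich tokens from phonemes and
    # control signals, unless ground-truth style-rich tokens are supplied.
    if gt_style_tokens is None:
        style_tokens = st_lm(phonemes, controls)
    else:
        style_tokens = gt_style_tokens
    # Stage 2 (CT generation): predict codec tokens from phonemes and style tokens.
    codec_tokens = ct_lm(phonemes, style_tokens)
    # The codec decoder turns codec tokens back into a waveform.
    return codec_decode(codec_tokens)


# Toy stand-ins (assumptions for illustration only; the real stages are
# decoder-only transformers and a neural codec decoder).
def toy_st_lm(phonemes, controls):
    return [sum(controls.labels.values()) % 8 for _ in phonemes]


def toy_ct_lm(phonemes, style_tokens):
    return [len(p) + s for p, s in zip(phonemes, style_tokens)]


def toy_codec_decode(codec_tokens):
    return [t / 10.0 for t in codec_tokens]


wave = synthesize(
    ["HH", "AH", "L", "OW"],
    ControlSignals(labels={"age": 3, "snr": 8}),
    toy_st_lm, toy_ct_lm, toy_codec_decode,
)
print(wave)
```

Passing `gt_style_tokens` skips stage 1, which corresponds to the setting where style-rich tokens are extracted from ground-truth speech rather than predicted.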

Transcript	Ground Truth	GT. + codec	Acoustic LM + GT. style-rich tokens	YourTTS	XTTS-V2

LibriTTS

The lower of the three is Gilchrist, a fine scholar and athlete, plays in the Rugby team and the cricket team for the college, and got his Blue for the hurdles and the long jump.

Nothing happened, however, to interfere with the successful running of the station, and for twenty years thereafter the same two dynamos continued to furnish light in Sunbury.

In the palace yard stood two soldiers with shining helmets, and with muskets over their shoulders; and when Anders came, both the muskets were levelled at him.

Gigaspeech

thank you, said alice, feeling very glad that the figure was over.

he was here to present from his latest book of photography and text, blind spot,

but we've got extra information in there to make it a lot easier for keyboard-only users and screen reader users to use.

then i add the handlers for managing changes and adding new items on the inputs and outputs,

you can save your ether for reuse although it should be stored over sodium to destroy any contaminants.

DailyTalk

Is it possible to change to another room?

Yeah? Umm can I take a look at the Sirs you carry?

Transcript	Male	Female

But this subject will be more properly discussed when we treat of the different races of mankind.

The bell rang. The master marked the sums and cuts to be done for the next lesson and went out.

The objection of course presents itself that expenditure on women's dress and household paraphernalia is an obvious exception to this rule; but it will appear in the sequel that this exception is much more obvious than substantial.

Transcript	Emotion Labels	Result
Oh! it is better to live on the sea and let other men raise your crops and cook your meals.	A=2, V=2, D=3 (Depressed)
A=4, V=4, D=5 (Neutral)
A=6, V=7, D=7 (Happy)
A=7, V=2, D=8 (Angry)
The whores would be just coming out of their houses making ready for the night, yawning lazily after their sleep and settling the hairpins in their clusters of hair. He would pass by them calmly waiting for a sudden movement of his own will or a sudden call to his sin loving soul from their soft perfumed flesh.	A=3, V=2, D=4 (Negative)
A=3, V=6, D=4 (Leisurely)
A=5, V=7, D=6 (Delighted)
A=6, V=2, D=7 (Angry)
The day of the entertainment was as sunny and mild as heart could desire.	A=2, V=2, D=3 (Depressed)
A=4, V=3, D=5 (Neutral)
A=6, V=7, D=7 (Happy)
A=7, V=2, D=8 (Angry)
Every strong impression which you make upon his perceptive powers must have a very lasting influence, and even the impression itself may, in some cases, be forever indelible.	A=3, V=3, D=4 (Calm)
A=7, V=4, D=8 (Emotional)

Transcript	Age Label	Result
Let us have compassion on the chastised.	1 (5-14 years old)
3 (25-34 years old)
5 (45-54 years old)
7 (65-74 years old)
And God lighted a fire in the second orbit from the earth which is called the sun, to give light over the whole heaven, and to teach intelligent beings that knowledge of number which is derived from the revolution of the same.	2 (15-24 years old)
4 (35-44 years old)
6 (55-64 years old)
"But what is the delicate mission?" I asked.	2 (15-24 years old)
4 (35-44 years old)
6 (55-64 years old)
8 (75-84 years old)

Transcript	Pitch Mean Label	Result
I must by no means show in such company as was now present the strong feeling which pervaded my own mind.	1 (low for male)
2 (medium-low for male)
3 (medium-high for male)
4 (high for male)
But let us turn more particularly to the history of the Church itself.	2 (medium-low for male)
3 (medium-high for male)
4 (high for male)
I was not a bit afraid of being found out.	2 (very low for female)
3 (low for female)
4 (medium-low for female)
5 (medium-high for female)
6 (high for female)
7 (very high for female)
"I've been bothered already over your election campaign," resumed the manager, arranging his papers in a bored manner.	2 (very low for female)
3 (low for female)
4 (medium-low for female)
5 (medium-high for female)
6 (high for female)
7 (very high for female)

Transcript	SNR Label	Result
"But what is the delicate mission?" I asked.	1 (very noisy)
3 (noisy)
4 (a bit noisy)
8 (clean)
But let us turn more particularly to the history of the Church itself.	1 (very noisy)
3 (noisy)
4 (a bit noisy)
8 (clean)

Transcript	C50 Label	Result
The utility of consumption as an evidence of wealth is to be classed as a derivative growth.	1 (strong reverberation)
2 (normal reverberation)
4 (slight reverberation)
8 (no reverberation)
"I've been bothered already over your election campaign," resumed the manager, arranging his papers in a bored manner.	1 (strong reverberation)
3 (medium reverberation)
7 (no reverberation)

Transcript	Speaker Reference Audio	Emotion Labels	Result
The lately roaring winds are hushed into a dead calm; nature seems to breathe no more, and to be sinking into the stillness of death.		A=2, V=3, D=3 (Negative)
A=4, V=3, D=5 (Normal)
A=5, V=7, D=6 (Joyful)
A=6, V=2, D=7 (Angry)
The infamy of this bargain had such an influence on the Scottish parliament, that they once voted that the king should be protected, and his liberty insisted on.		A=2, V=2, D=3 (Depressed)
A=4, V=3, D=5 (Normal)
A=5, V=7, D=6 (Delighted)
A=6, V=2, D=7 (Angry)
MILLIMETER Roughly one twenty fifth of an inch		A=2, V=3, D=3 (Negative)
A=4, V=5, D=5 (Normal)
A=5, V=2, D=6 (Offended)
A=5, V=7, D=6 (Delighted)
Good gracious, mr Holmes, you are surely not going to leave me in this abrupt fashion! You don't seem to realize the position.		A=3, V=4, D=4 (Bored)
A=5, V=4, D=6 (Normal)
A=6, V=4, D=7 (Emotional)

Transcript	Result	Analysis
Yes, call me by my pet name! let me hear The name I used to run at, when a child, From innocent play, and leave the cowslips plied, To glance up in some face that proved me dear With the look of its eyes.		Control signal conflicts (gender=female, pitch mean=1 (very low)). The timbre sounds masculine, which means the gender label doesn't work, and there are errors in the content.
There were some really curious pieces of mediaeval domestic architecture.		Control signal conflicts (gender=male, pitch mean=6 (high)). The timbre is low, which means the pitch mean label doesn't work.
MILLIMETER Roughly one twenty fifth of an inch.		Control signal conflicts (age=6 (old), pitch mean=7 (very high)). The timbre is high and sounds like a young girl, which means the age label doesn't work.
Every strong impression which you make upon his perceptive powers must have a very lasting influence, and even the impression itself may, in some cases, be forever indelible.		Control signal conflicts (A=4, D=7). The speech is supposed to convey a relatively calm emotion with arousal=4, yet the result carries strong emotion. This may be due to the strong correlation between arousal and dominance modeled by the emotion labeling tool.
Two hours afterwards a terrible shock awoke me.		Extreme labels with very little presence in the training data (SNR = 0, indicating extremely high noise) fail to produce clear and meaningful speech.
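
The failure cases above suggest that some label combinations are worth flagging before synthesis. The sketch below is one hypothetical way to do that and is not part of the described system: `find_conflicts`, `OBSERVED_PITCH_RANGE`, and the `max_ad_gap` threshold are assumptions. The pitch ranges are simply the label values that appear in the pitch tables on this page, and the arousal/dominance threshold is chosen only so that the A=4, D=7 case is flagged.

```python
from typing import List, Optional

# Pitch-mean label ranges that appear in the demo tables on this page.
# Treating them as valid ranges is an assumption, not a documented constraint.
OBSERVED_PITCH_RANGE = {"male": (1, 4), "female": (2, 7)}


def find_conflicts(
    gender: Optional[str] = None,
    pitch_mean: Optional[int] = None,
    arousal: Optional[int] = None,
    dominance: Optional[int] = None,
    snr: Optional[int] = None,
    max_ad_gap: int = 2,  # hypothetical threshold, not taken from the source
) -> List[str]:
    """Return warnings for label combinations resembling the failure cases."""
    warnings: List[str] = []
    # Gender vs. pitch mean: flag pitch labels outside the range shown for that gender.
    if gender in OBSERVED_PITCH_RANGE and pitch_mean is not None:
        low, high = OBSERVED_PITCH_RANGE[gender]
        if not low <= pitch_mean <= high:
            warnings.append(
                f"pitch mean={pitch_mean} is outside the range shown for gender={gender}"
            )
    # Arousal vs. dominance: the two appear strongly correlated, so flag large gaps.
    if arousal is not None and dominance is not None:
        if abs(arousal - dominance) > max_ad_gap:
            warnings.append(f"arousal={arousal} and dominance={dominance} are far apart")
    # Extreme SNR label with very little presence in the training data.
    if snr == 0:
        warnings.append("SNR=0 is an extreme label with little training data")
    return warnings


print(find_conflicts(gender="female", pitch_mean=1))
print(find_conflicts(arousal=4, dominance=7))
```

The age versus pitch conflict is left out because the page gives no range that would support a rule for it.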
