Controllable Text-to-Speech Synthesis with Masked-Autoencoded
Style-Rich Representation

Anonymous Author

Abstract. Controllable TTS models with natural language prompts often lack the ability for fine-grained control and face a scarcity of high-quality data. We propose a two-stage style-controllable TTS system with language models, utilizing a quantized masked-autoencoded style-rich representation as an intermediary. In the first stage, an autoregressive transformer is used for the conditional generation of these style-rich tokens from text and control signals. The second stage generates codec tokens from both text and sampled style-rich tokens. Experiments show that training the first-stage model on extensive datasets enhances the content robustness of the two-stage model as well as control capabilities over multiple attributes. By selectively combining discrete labels and speaker embeddings, we explore fully controlling the speaker’s timbre and other stylistic information, and adjusting attributes like emotion for a specified speaker. Audio samples are available at https://style-ar-tts.github.io.

Model Overview



Our controllable TTS system consists of two major stages, with discrete style-rich tokens as the intermediate representation. This style representation comes from a transformer-based masked autoencoder (MAE), illustrated in figure (a), which learns to capture style information in speech, including speaker timbre, prosody, and acoustic environment, through a mask-reconstruction paradigm. The style-rich tokens of a speech clip are extracted by the style encoder of the pre-trained MAE, followed by a separately trained residual vector quantizer (RVQ). The two stages of TTS are (1) style-rich token (ST) generation, which generates style-rich tokens conditioned on content phonemes and style control signals, including discrete labels and/or continuous speaker embeddings; and (2) codec token (CT) generation, which generates codec tokens conditioned on content phonemes and style-rich tokens, where the style-rich tokens are either extracted from ground-truth speech or predicted by the first stage. The codec decoder then reconstructs the waveform from the generated codec tokens. Each stage relies on a decoder-only transformer for LM-style generation, as illustrated in figure (b).
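
The data flow described above can be sketched roughly as follows. This is only an illustration of how the two stages connect, not the authors' implementation: the class `TwoStageTTS`, its field names, and the three `toy_*` functions are hypothetical stand-ins for the ST model, the CT model, and the codec decoder, and the toy functions return made-up integers rather than real tokens.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

Tokens = List[int]


@dataclass
class TwoStageTTS:
    """Hypothetical wrapper showing how the two stages are chained."""
    st_model: Callable[[Sequence[str], Dict[str, object], Optional[Sequence[float]]], Tokens]
    ct_model: Callable[[Sequence[str], Tokens], Tokens]
    codec_decoder: Callable[[Tokens], List[float]]

    def synthesize(
        self,
        phonemes: Sequence[str],
        labels: Optional[Dict[str, object]] = None,
        speaker_embedding: Optional[Sequence[float]] = None,
        gt_style_tokens: Optional[Tokens] = None,
    ) -> List[float]:
        # Stage 1: style-rich tokens are either taken from ground-truth
        # speech or predicted from phonemes and control signals.
        if gt_style_tokens is not None:
            style_tokens = list(gt_style_tokens)
        else:
            style_tokens = self.st_model(phonemes, labels or {}, speaker_embedding)
        # Stage 2: codec tokens conditioned on phonemes and style-rich tokens.
        codec_tokens = self.ct_model(phonemes, style_tokens)
        # The codec decoder reconstructs the waveform from codec tokens.
        return self.codec_decoder(codec_tokens)


# Toy stand-ins (assumptions for illustration only; real models are
# autoregressive transformers and a neural codec decoder).
def toy_st_model(phonemes, labels, speaker_embedding):
    return [len(labels)] * len(phonemes)


def toy_ct_model(phonemes, style_tokens):
    return [s + 1 for s in style_tokens]


def toy_codec_decoder(codec_tokens):
    return [float(t) for t in codec_tokens]


tts = TwoStageTTS(toy_st_model, toy_ct_model, toy_codec_decoder)
# Controlled synthesis: style-rich tokens are predicted by stage 1.
wave_pred = tts.synthesize(["HH", "AH"], labels={"gender": "female", "age": 3})
# Style reconstruction: style-rich tokens come from ground-truth speech.
wave_gt = tts.synthesize(["HH", "AH"], gt_style_tokens=[7, 9])
```

The two calls correspond to the two usage modes on this page: label-controlled synthesis, and reconstruction from ground-truth style-rich tokens.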

Table of Contents

  • Reconstruct speech style from GT style-rich tokens
  • Controllable TTS with discrete labels
  • Controlling emotion with a reference speaker
  • Bad cases
    Reconstruct speech style from GT style-rich tokens

    In this section, we provide speech reconstructed from ground-truth (GT) style-rich tokens, alongside the ground-truth speech, codec-compressed speech, and zero-shot TTS results.

    Transcript | Ground Truth | GT. + codec | Acoustic LM + GT. style-rich tokens | YourTTS | XTTS-V2
    LibriTTS
    The lower of the three is Gilchrist, a fine scholar and athlete, plays in the Rugby team and the cricket team for the college, and got his Blue for the hurdles and the long jump.
    Nothing happened, however, to interfere with the successful running of the station, and for twenty years thereafter the same two dynamos continued to furnish light in Sunbury.
    In the palace yard stood two soldiers with shining helmets, and with muskets over their shoulders; and when Anders came, both the muskets were levelled at him.
    Gigaspeech
    thank you, said alice, feeling very glad that the figure was over.
    he was here to present from his latest book of photography and text, blind spot,
    but we've got extra information in there to make it a lot easier for keyboard-only users and screen reader users to use.
    then i add the handlers for managing changes and adding new items on the inputs and outputs,
    you can save your ether for reuse although it should be stored over sodium to destroy any contaminants.
    DailyTalk
    Is it possible to change to another room?
    Yeah? Umm can I take a look at the Sirs you carry?

     

    Controllable TTS with discrete labels

    In this section, we provide results of controlling speech attributes with discrete labels. We change specific attribute labels starting from ground-truth label combinations, and on some samples also use the pitch MLP predictors to avoid conflicts between control signals.

    Gender

    CFG=2
    Transcript Male Female
    But this subject will be more properly discussed when we treat of the different races of mankind.
    The bell rang. The master marked the sums and cuts to be done for the next lesson and went out.
    The objection of course presents itself that expenditure on women's dress and household paraphernalia is an obvious exception to this rule; but it will appear in the sequel that this exception is much more obvious than substantial.

     

    Emotion

    CFG=3
    Transcript Emotion Labels Result
    Oh! it is better to live on the sea and let other men raise your crops and cook your meals. A=2, V=2, D=3 (Depressed)
    A=4, V=4, D=5 (Neutral)
    A=6, V=7, D=7 (Happy)
    A=7, V=2, D=8 (Angry)
    The whores would be just coming out of their houses making ready for the night, yawning lazily after their sleep and settling the hairpins in their clusters of hair. He would pass by them calmly waiting for a sudden movement of his own will or a sudden call to his sin loving soul from their soft perfumed flesh. A=3, V=2, D=4 (Negative)
    A=3, V=6, D=4 (Leisurely)
    A=5, V=7, D=6 (Delighted)
    A=6, V=2, D=7 (Angry)
    The day of the entertainment was as sunny and mild as heart could desire. A=2, V=2, D=3 (Depressed)
    A=4, V=3, D=5 (Neutral)
    A=6, V=7, D=7 (Happy)
    A=7, V=2, D=8 (Angry)
    Every strong impression which you make upon his perceptive powers must have a very lasting influence, and even the impression itself may, in some cases, be forever indelible. A=3, V=3, D=4 (Calm)
    A=7, V=4, D=8 (Emotional)

     

    Age

    CFG=3
    Transcript Age Label Result
    Let us have compassion on the chastised. 1 (5-14 years old)
    3 (25-34 years old)
    5 (45-54 years old)
    7 (65-74 years old)
    And God lighted a fire in the second orbit from the earth which is called the sun, to give light over the whole heaven, and to teach intelligent beings that knowledge of number which is derived from the revolution of the same. 2 (15-24 years old)
    4 (35-44 years old)
    6 (55-64 years old)
    "But what is the delicate mission?" I asked. 2 (15-24 years old)
    4 (35-44 years old)
    6 (55-64 years old)
    8 (75-84 years old)
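
Judging from the examples above, each age label appears to cover a ten-year bin beginning at age 5. The sketch below encodes that reading. It is an inference from the labels 1-8 shown on this page, not a definition given by the authors, and `age_label_to_range` is a hypothetical helper name.

```python
from typing import Tuple


def age_label_to_range(label: int) -> Tuple[int, int]:
    """Map an age label (1-8, as shown on this page) to an inclusive age range."""
    if not 1 <= label <= 8:
        # Labels outside 1-8 are not shown on this page, so their meaning is unknown.
        raise ValueError(f"age label {label} is outside the demonstrated range 1-8")
    low = 10 * label - 5
    return low, low + 9


print(age_label_to_range(1))  # (5, 14)
print(age_label_to_range(3))  # (25, 34)
print(age_label_to_range(8))  # (75, 84)
```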

     

    Pitch Mean

    CFG=2
    Transcript Pitch Mean Label Result
    I must by no means show in such company as was now present the strong feeling which pervaded my own mind. 1 (low for male)
    2 (medium-low for male)
    3 (medium-high for male)
    4 (high for male)
    But let us turn more particularly to the history of the Church itself. 2 (medium-low for male)
    3 (medium-high for male)
    4 (high for male)
    I was not a bit afraid of being found out. 2 (very low for female)
    3 (low for female)
    4 (medium-low for female)
    5 (medium-high for female)
    6 (high for female)
    7 (very high for female)
    "I've been bothered already over your election campaign," resumed the manager, arranging his papers in a bored manner. 2 (very low for female)
    3 (low for female)
    4 (medium-low for female)
    5 (medium-high for female)
    6 (high for female)
    7 (very high for female)

     

    SNR (Signal-to-Noise Ratio)

    CFG=3

    Note that the label value is not the SNR value.

    Transcript SNR Label Result
    "But what is the delicate mission?" I asked. 1 (very noisy)
    3 (noisy)
    4 (a bit noisy)
    8 (clean)
    But let us turn more particularly to the history of the Church itself. 1 (very noisy)
    3 (noisy)
    4 (a bit noisy)
    8 (clean)

     

    C50 (Reverberation)

    CFG=2

    Note that the label value is not the C50 value.

    Transcript C50 Label Result
    The utility of consumption as an evidence of wealth is to be classed as a derivative growth. 1 (strong reverberation)
    2 (normal reverberation)
    4 (slight reverberation)
    8 (no reverberation)
    "I've been bothered already over your election campaign," resumed the manager, arranging his papers in a bored manner. 1 (strong reverberation)
    3 (medium reverberation)
    7 (no reverberation)

     

    Controlling emotion with a reference speaker

    In this section, we provide results of combining discrete labels and speaker embeddings. We change the emotion labels in the ground-truth label combinations, and on some samples use the pitch MLP predictors to avoid conflicts between control signals.

    CFG=3

    Transcript | Speaker Reference Audio | Emotion Labels | Result
    The lately roaring winds are hushed into a dead calm; nature seems to breathe no more, and to be sinking into the stillness of death. A=2, V=3, D=3 (Negative)
    A=4, V=3, D=5 (Normal)
    A=5, V=7, D=6 (Joyful)
    A=6, V=2, D=7 (Angry)
    The infamy of this bargain had such an influence on the Scottish parliament, that they once voted that the king should be protected, and his liberty insisted on. A=2, V=2, D=3 (Depressed)
    A=4, V=3, D=5 (Normal)
    A=5, V=7, D=6 (Delighted)
    A=6, V=2, D=7 (Angry)
    MILLIMETER Roughly one twenty fifth of an inch A=2, V=3, D=3 (Negative)
    A=4, V=5, D=5 (Normal)
    A=5, V=2, D=6 (Offended)
    A=5, V=7, D=6 (Delighted)
    Good gracious, mr Holmes, you are surely not going to leave me in this abrupt fashion! You don't seem to realize the position. A=3, V=4, D=4 (Bored)
    A=5, V=4, D=6 (Normal)
    A=6, V=4, D=7 (Emotional)

     

    Bad cases

    In this section, we provide some samples with poor quality or unsuccessful control, along with our analysis of the causes.

    Transcript Result Analysis
    Yes, call me by my pet name! let me hear The name I used to run at, when a child, From innocent play, and leave the cowslips plied, To glance up in some face that proved me dear With the look of its eyes. Control signal conflicts (gender=female, pitch mean=1 (very low)). The timbre sounds masculine, which means the gender label doesn't work, and there are errors in the content.
    There were some really curious pieces of mediaeval domestic architecture. Control signal conflicts (gender=male, pitch mean=6 (high)). The timbre is low, which means the pitch mean label doesn't work.
    MILLIMETER Roughly one twenty fifth of an inch. Control signal conflicts (age=6 (old), pitch mean=7 (very high)). The timbre is high and sounds like a young girl, which means the age label doesn't work.
    Every strong impression which you make upon his perceptive powers must have a very lasting influence, and even the impression itself may, in some cases, be forever indelible. Control signal conflicts (A=4, D=7). The speech is supposed to convey a relatively calm emotion with arousal=4, yet the result turns out to have strong emotion. This may be due to the strong correlation between arousal and dominance modeled by the emotion labeling tool.
    Two hours afterwards a terrible shock awoke me. Extreme labels with very little presence in the training data (SNR = 0, indicating extremely high noise) fail to produce clear and meaningful speech.
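
The first two bad cases pair a gender label with a pitch mean label outside the range demonstrated for that gender in the Pitch Mean section (1-4 for male, 2-7 for female). A simple pre-check along these lines could flag such combinations before synthesis. This is only an illustrative sketch: the ranges are read off the examples on this page rather than taken from a specification, `find_pitch_conflict` is a hypothetical helper, and it is not the pitch MLP predictor mentioned earlier.

```python
from typing import Dict, Optional

# Pitch-mean labels demonstrated per gender on this page.
# Assumption: these are treated as the plausible ranges for each gender.
DEMONSTRATED_PITCH_LABELS: Dict[str, range] = {
    "male": range(1, 5),    # labels 1-4
    "female": range(2, 8),  # labels 2-7
}


def find_pitch_conflict(gender: str, pitch_mean: int) -> Optional[str]:
    """Return a warning string if the label pair looks conflicting, else None."""
    allowed = DEMONSTRATED_PITCH_LABELS.get(gender)
    if allowed is None:
        return None  # unknown gender label: nothing to check against
    if pitch_mean in allowed:
        return None
    return (
        f"gender={gender} with pitch mean={pitch_mean} falls outside "
        f"the demonstrated labels {allowed.start}-{allowed.stop - 1}"
    )


# The two gender/pitch bad cases listed above are both flagged.
print(find_pitch_conflict("female", 1))
print(find_pitch_conflict("male", 6))
```

A check like this only covers the gender and pitch pairing. The age and pitch conflict and the arousal and dominance conflict listed above would need their own rules.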