Myrsini Christidou*, Alexandra Vioni*, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Panos Kakoulidis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris and Pirros Tsiakoulis (*Equal contribution)
Abstract: This paper presents a method for phoneme-level prosody control of F0 and duration in a multispeaker text-to-speech setup, based on prosodic clustering. An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder. Several improvements over the basic single-speaker method are proposed that increase the prosodic control range and coverage. More specifically, we employ data augmentation, F0 normalization, balanced clustering for duration, and speaker-independent prosodic clustering. These modifications enable fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. The model is also fine-tuned to unseen speakers with limited amounts of data, and it is shown to maintain its prosody control capabilities, verifying that the speaker-independent prosodic clustering is effective. Experimental results verify that the model maintains high output speech quality and that the proposed method allows efficient prosody control within each speaker’s range despite the variability that a multispeaker setting introduces.
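To make the ideas of F0 normalization, balanced clustering and speaker-independent labels more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that F0 is normalized per speaker with a z-score and that "balanced" means clusters of roughly equal size obtained by rank-based binning; the abstract does not specify either detail. The function names, the toy data and the cluster counts are all hypothetical, and the same binning is reused for F0 only to keep the example short.

```python
from statistics import mean, pstdev


def normalize_f0_per_speaker(f0_by_speaker):
    """Z-score each speaker's phoneme-level F0 values using that speaker's
    own mean and standard deviation (assumed normalization scheme)."""
    normalized = {}
    for speaker, values in f0_by_speaker.items():
        mu = mean(values)
        sigma = pstdev(values) or 1.0  # guard against a constant F0 track
        normalized[speaker] = [(v - mu) / sigma for v in values]
    return normalized


def balanced_cluster_labels(values, n_clusters):
    """Assign cluster labels so that every cluster holds roughly the same
    number of samples (rank-based binning, an assumed reading of 'balanced')."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_clusters // len(values)
    return labels


# Toy data (hypothetical): two speakers with very different F0 ranges in Hz.
f0_by_speaker = {
    "spk_low": [90.0, 100.0, 110.0, 120.0],
    "spk_high": [190.0, 200.0, 210.0, 220.0],
}

# After per-speaker normalization both speakers share one scale, so a single
# set of clusters can be computed on the pooled values.
norm = normalize_f0_per_speaker(f0_by_speaker)
pooled = [v for spk in sorted(norm) for v in norm[spk]]
f0_labels = balanced_cluster_labels(pooled, n_clusters=2)

# Phoneme durations in seconds (hypothetical), binned into equal-size clusters.
durations = [0.05, 0.30, 0.08, 0.12, 0.07, 0.20]
dur_labels = balanced_cluster_labels(durations, n_clusters=3)
```

In this toy setting the two speakers receive identical F0 labels even though their raw F0 ranges do not overlap, which is the intuition behind clustering in a speaker-independent space: a given label then denotes a position within each speaker's own range rather than an absolute pitch.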