Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control

Myrsini Christidou*, Alexandra Vioni*, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Panos Kakoulidis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris and Pirros Tsiakoulis (*equal contribution)

Abstract: This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup, which is based on prosodic clustering. An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder. Several improvements over the basic single-speaker method are proposed that increase the prosodic control range and coverage. More specifically, we employ data augmentation, F0 normalization, balanced clustering for duration, and speaker-independent prosodic clustering. These modifications enable fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. The model is also fine-tuned to unseen speakers with limited amounts of data and it is shown to maintain its prosody control capabilities, verifying that the speaker-independent prosodic clustering is effective. Experimental results verify that the model maintains high output speech quality and that the proposed method allows efficient prosody control within each speaker’s range despite the variability that a multispeaker setting introduces.
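The clustering steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: per-speaker z-scoring as the F0 normalization, plain 1-D k-means for the speaker-independent F0 clusters, and equal-frequency bins for the balanced duration clusters are all assumptions made here for illustration, and every function name is hypothetical.

```python
from statistics import mean, pstdev


def normalize_f0(f0_by_speaker):
    """Z-score each speaker's phoneme-level F0 values (assumed normalization)."""
    normalized = {}
    for speaker, values in f0_by_speaker.items():
        mu = mean(values)
        sigma = pstdev(values) or 1.0  # guard against a constant-F0 speaker
        normalized[speaker] = [(v - mu) / sigma for v in values]
    return normalized


def kmeans_1d(values, k, iterations=50):
    """Plain 1-D k-means over pooled values; returns centroids sorted ascending."""
    ordered = sorted(values)
    # Initialise the centroids at evenly spaced quantiles of the data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            buckets[nearest].append(v)
        # An empty bucket keeps its previous centroid.
        centroids = [mean(b) if b else c for b, c in zip(buckets, centroids)]
    return sorted(centroids)


def assign_label(value, centroids):
    """Index of the nearest centroid; a higher index means a higher F0 cluster."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def balanced_boundaries(values, k):
    """Equal-frequency bin edges, so each of the k bins holds a similar count."""
    ordered = sorted(values)
    return [ordered[i * len(ordered) // k] for i in range(1, k)]


def balanced_label(value, boundaries):
    """Number of boundaries at or below the value, used as the bin label."""
    return sum(value >= b for b in boundaries)
```

Because the F0 values are normalized per speaker before pooling, a low-pitched and a high-pitched speaker share the same set of ordered labels, which is the property that speaker-independent clustering relies on in this sketch.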

1) F0 modification based on offset from ground truth labels
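As a rough illustration of what an offset from the ground truth labels means, the sketch below shifts every phoneme's cluster label by a fixed number of steps. The function name, the number of clusters, and the clipping at the lowest and highest cluster are assumptions for illustration only; the page does not state how out-of-range labels are handled.

```python
def shift_labels(labels, offset, num_clusters):
    """Shift each ordered cluster label by `offset`, clipped to the valid range.

    Clipping to [0, num_clusters - 1] is an assumption of this sketch.
    """
    return [min(max(label + offset, 0), num_clusters - 1) for label in labels]


# Hypothetical ground-truth labels for a short utterance, with 15 clusters assumed.
ground_truth = [0, 5, 14]
raised = shift_labels(ground_truth, 3, 15)     # [3, 8, 14]
lowered = shift_labels(ground_truth, -11, 15)  # [0, 0, 3]
```

The same operation applies to duration labels, since both label sets are ordered from lowest to highest cluster in this sketch.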

Multispeaker/speaker adaptation same voice comparison

“His goodness of heart and simplicity of character were irresistible.”

Ground Truth

cathy-multi: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

cathy-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

“She needed no one to defend her: his humbled pride was her surest protection.”

Ground Truth

cathy-multi: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

cathy-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

Speaker adaptation

“She needed no one to defend her: his humbled pride was her surest protection.”

Ground Truth

obama

jsj

martha

obama-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

jsj-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

lj-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

martha-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

“Experience had bred no fancies in him that could raise the phantasm of appetite.”

obama-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

jsj-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

lj-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

martha-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

2) Duration modification based on offset from ground truth labels

Multispeaker/speaker adaptation same voice comparison

“He looked straight into her eyes with his shy grey glance.”

Ground Truth

cathy-multi: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

cathy-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

“Very soon Sara left her reflections and turned to her with a new question.”

Ground Truth

cathy-multi: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

cathy-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

Speaker adaptation

“Very soon Sara left her reflections and turned to her with a new question.”

Ground Truth

obama

jsj

martha

obama-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

jsj-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

lj-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

martha-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

“Experience had bred no fancies in him that could raise the phantasm of appetite.”

obama-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

jsj-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

lj-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

martha-adapt: -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, +10, +11

3) F0 single cluster for all phonemes

Multispeaker/speaker adaptation same voice comparison

“Harney was still unaware of her presence.”

Ground Truth

cathy-multi

cathy-adapt

“Charity had always suspected that the shunned Julia's fate might have its compensations.”

Ground Truth

cathy-multi

cathy-adapt

Speaker adaptation

“Charity had always suspected that the shunned Julia's fate might have its compensations.”

Ground Truth

obama

jsj

martha

obama-adapt

jsj-adapt

lj-adapt

martha-adapt

“His understanding was good, and his education had given it solid improvement.”

obama-adapt

jsj-adapt

lj-adapt

martha-adapt

4) Duration single cluster for all phonemes

Multispeaker/speaker adaptation same voice comparison

“That famous ring that pricked its owner when he forgot duty and followed desire.”

Ground Truth

cathy-multi

cathy-adapt

“My father has that effect on nearly every one, he informed her.”

Ground Truth

cathy-multi

cathy-adapt

Speaker adaptation

“My father has that effect on nearly every one, he informed her.”

Ground Truth

obama

jsj

martha

obama-adapt

jsj-adapt

lj-adapt

martha-adapt

“She reached the brick temple, unlocked the door and entered into the glacial twilight.”

obama-adapt

jsj-adapt

lj-adapt

martha-adapt

5) Single word augmentation

Multispeaker/speaker adaptation same voice comparison

“Captain Wentworth was acknowledged again by each, by Elizabeth more graciously than before.”

Ground Truth

cathy-multi

Dur

cathy-adapt

Dur

“Captain Wentworth was acknowledged again by each, by Elizabeth more graciously than before.”

Ground Truth

cathy-multi

Dur

cathy-adapt

Dur

Speaker adaptation

“My father has that effect on nearly every one, he informed her.”

Ground Truth

obama

jsj

martha

obama-adapt

Dur

jsj-adapt

Dur

lj-adapt

Dur

martha-adapt

Dur

“Harney was still unaware of her presence.”

obama-adapt

Dur

jsj-adapt

Dur

lj-adapt

Dur

martha-adapt

Dur

6) Single phoneme augmentation (SAMPA representation)

Multispeaker/speaker adaptation same voice comparison

“My father (f A: D @) has that effect on nearly every one, he informed her.”

Ground Truth

cathy-multi

Dur

cathy-adapt

Dur

“Very (v e r I) soon Sara left her reflections and turned to her with a new question.”

Ground Truth

cathy-multi

Dur

cathy-adapt

Dur

Speaker adaptation

“The portraits themselves (D @ m s e l v z) seemed to be staring in astonishment.”

Ground Truth

obama

jsj

martha

obama-adapt

Dur

jsj-adapt

Dur

lj-adapt

Dur

martha-adapt

Dur

“Captain Wentworth (w e n t w 3: T) was acknowledged again by each, by Elizabeth more graciously than before.”

obama-adapt

Dur

jsj-adapt

Dur

lj-adapt

Dur

martha-adapt

Dur