Karaoker: Alignment-free singing voice synthesis with speech training data

Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris

Abstract: Existing singing voice synthesis (SVS) models are usually trained on singing data and depend on either error-prone time alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data and requires no time alignments. Karaoker synthesizes singing voice following a multi-dimensional template extracted from a source waveform of an unseen speaker/singer. A single deep convolutional encoder jointly conditions the model on continuous data including pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves. We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result. Besides multi-tasking, we also employ a Wasserstein GAN training scheme as well as new losses on the acoustic model's output to further refine the quality of the model.
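To make the idea of a multi-dimensional template more concrete, the sketch below computes two of the continuous features named in the abstract, frame-wise intensity (the RMS label in the Controllability section) and an octave index derived from pitch (the OCT label). It is a minimal illustration under stated assumptions and not the authors' extraction pipeline: the function names `frame_rms` and `octave_index`, the frame and hop sizes, and the 55 Hz reference frequency are all hypothetical choices, and the pitch contour is assumed to come from an external F0 tracker with 0 marking unvoiced frames.

```python
import numpy as np


def frame_rms(signal, frame_len=1024, hop=256):
    """Frame-wise root-mean-square energy of a mono waveform.

    frame_len and hop are illustrative defaults, not values from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        # Zero-pad very short inputs so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    return np.sqrt(np.mean(frames ** 2, axis=1))


def octave_index(f0_hz, ref_hz=55.0):
    """Integer octave of each F0 value relative to ref_hz.

    Unvoiced frames (f0 <= 0) are marked with -1. Voiced frames below
    ref_hz are clipped to octave 0. ref_hz is an assumed reference.
    """
    f0 = np.asarray(f0_hz, dtype=np.float64)
    octaves = np.full(f0.shape, -1, dtype=np.int64)
    voiced = f0 > 0
    raw = np.floor(np.log2(f0[voiced] / ref_hz)).astype(np.int64)
    octaves[voiced] = np.maximum(raw, 0)
    return octaves


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 220.0 * t)  # one second of a 220 Hz sine
    print(frame_rms(tone)[:3])
    print(octave_index([0.0, 110.0, 220.0, 440.0]))
```

In a setup like the one the abstract describes, contours of this kind would be stacked per frame and passed to the conditioning encoder; how Karaoker actually normalizes and aligns them is not stated on this page.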

1.Singing Voice Synthesis

Twinkle, twinkle, little star. How I wonder what you are! Up above the world so high. Like a diamond in the sky. Twinkle, twinkle, little star. How I wonder what you are! When the blazing sun is gone. When he nothing shines upon. Then you show your little light. Twinkle, twinkle, all the night. Twinkle, twinkle, little star. How I wonder what you are!

Reference

vctks5

Ground Truth

libri6025

Ground Truth

libri6072

Ground Truth

Edelweiss, edelweiss, every morning you greet me. Small and white, clean and bright. You look happy to meet me.

Reference

vctks5

Ground Truth

vctkp308

Ground Truth

libri6284

Ground Truth

Life is a mystery, everyone must stand alone, I hear you call my name.

Reference

vctks5

Ground Truth

libri5890

Ground Truth

p264

Ground Truth

Jingle bells, jingle bells, Jingle all the way. Oh what fun it is to ride, In a one-horse open sleigh, hey!

Reference

vctks5

Ground Truth

p362

Ground Truth

p276

Ground Truth

Well, you only need the light when it's burning low. Only miss the sun when it starts to snow. Only know you love her when you let her go. Only know you've been high when you're feeling low. Only hate the road when you're missing home. Only know you love her when you let her go. And you let her go.

Reference

vctkp276

Ground Truth

As I walk through the valley of the shadow of death. I take a look at my life, and realize there's nothin' left. 'Cause I've been blastin' and laughin' so long. That even my momma thinks that my mind is gone. But I ain't never crossed a man that didn't deserve it. Me be treated like a punk, you know that's unheard of. You better watch how you talkin' and where you walkin'. Or you and your homies might be lined in chalk.

Reference

vctkp277

Ground Truth

I've heard there was a secret chord, That David played, and it pleased the Lord, But you don't really care for music, do you?

Reference

libri6088

Ground Truth

Sometimes you make me blue. Sometimes I feel good. At times I feel used. Lovin' you darlin'. Makes me so confused.

Reference

libri6281

Ground Truth

2.Style Transfer

3.Cross-Lingual SVS

4.Controllability

Zero input

Happy birthday to you, happy birthday to you, happy birthday dear master, happy birthday to you.

per feature

CPP

HNR

OCT

RMS

20 %

per feature

CPP

HNR

OCT

RMS

50 %

per feature

CPP

HNR

OCT

RMS

60 %

per feature

CPP

HNR

OCT

RMS

80 %

per feature

CPP

HNR

OCT

RMS

5.Comparing with SOTA

Edelweiss, edelweiss, every morning you greet me. Small and white, clean and bright. You look happy to meet me.

Reference

LJSpeech (250k - no GAN)

Mellotron

Ground Truth

And my destination makes it worth the while. Pushing through the darkness still another mile. I believe in angels. Something good in everything I see. I believe in angels. When I know the time is right for me. I'll cross the stream, I have a dream. I'll cross the stream, I have a dream.

Reference

LJSpeech (250k - no GAN)

Mellotron

Ground Truth

Reference

LJSpeech (250k - no GAN)

Mellotron

Ground Truth

I can fly. I'm proud that I can fly. To give the best of mine, Till the end of the time. Believe me I can fly. I'm proud that I can fly. To give the best of mine, The heaven in the sky.

Reference

LJSpeech (250k - no GAN)

Mellotron

Ground Truth

I'm just a little bit caught in the middle, Life is a maze, and love is a riddle, I don't know where to go, Can't do it alone, I've tried, and I don't know why.

Reference

LJSpeech (250k - no GAN)

Mellotron

Ground Truth