Self supervised learning for robust voice cloning

Konstantinos Klapsas, Nikolaos Ellinas, Karolos Nikitaras, Georgios Vamvoukakis, Panos Kakoulidis, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris and Pirros Tsiakoulis

Abstract: Voice cloning is a difficult task that requires robust and informative features, incorporated in a high-quality TTS system, in order to effectively copy an unseen speaker’s voice. In our work, we use features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which has been shown to produce high-quality speech representations when specific audio augmentations are applied to the vanilla algorithm. We further extend the augmentations in the training procedure to help the resulting features capture the speaker identity and to make them robust to noise and acoustic conditions. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture, aiming to achieve multispeaker speech synthesis without additional speaker features. This method enables us to train our model on an unlabeled multispeaker dataset, as well as to use unseen speaker embeddings to copy a speaker’s voice. Subjective and objective evaluations are used to validate the proposed model, as well as its robustness to the acoustic conditions of the target utterance.
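
As a rough illustration of the BYOL objective mentioned in the abstract, the sketch below computes the standard BYOL regression loss (2 minus twice the cosine similarity between the online network's prediction and the target network's projection) and the exponential-moving-average update of the target parameters. It is a minimal sketch on plain Python lists, not the authors' training code: the function names, the flat parameter lists, and the default `tau` value are assumptions made for this example, and the encoder, the augmentations, and gradient handling are omitted.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

def byol_loss(online_prediction, target_projection):
    """BYOL regression loss for one pair of views.

    Equals the squared distance between the two L2-normalized vectors,
    i.e. 2 - 2 * cosine_similarity. In real training the target
    projection is treated as a constant (no gradient flows through it).
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))

def ema_update(target_params, online_params, tau=0.99):
    """Move the target parameters slowly toward the online parameters.

    `tau` is a hypothetical default chosen for this example.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]

# Toy usage: identical directions give the minimum loss, opposite ones the maximum.
loss_same = byol_loss([1.0, 0.0], [2.0, 0.0])
loss_orthogonal = byol_loss([1.0, 0.0], [0.0, 1.0])
loss_opposite = byol_loss([1.0, 0.0], [-1.0, 0.0])
new_target = ema_update([1.0, 1.0], [0.0, 2.0], tau=0.9)
```

In the usual BYOL setup the loss is made symmetric by also swapping the two augmented views and summing both terms; that step is left out here for brevity.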

Voice Cloning from unseen clean utterances.

Absolutely, glad to help!
[Audio samples, four rows; columns: Reference Speaker, BYOL-A, BYOL-A + Pros, BYOL-A + Noise, BYOL-A + Pros + Noise, d-vectors VCTK, d-vectors Vox]
Black rhinos were once considered vermin and exterminated at will.
[Audio samples, four rows; columns: Reference Speaker, BYOL-A, BYOL-A + Pros, BYOL-A + Noise, BYOL-A + Pros + Noise, d-vectors VCTK, d-vectors Vox]
Personally, I'd describe myself as a friend who's highly skilled in conversational arts.
[Audio samples, four rows; columns: Reference Speaker, BYOL-A, BYOL-A + Pros, BYOL-A + Noise, BYOL-A + Pros + Noise, d-vectors VCTK, d-vectors Vox]

Voice Cloning from unseen noisy utterances with SNR 5.
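
To make the noise condition concrete, the following sketch shows one common way to mix noise into a clean utterance at a target signal-to-noise ratio. It assumes that the SNR value above is expressed in decibels, which the page does not state, and it is not the authors' data pipeline: the helper names, the synthetic sine "speech", and the Gaussian noise are all hypothetical stand-ins.

```python
import math
import random

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db` dB, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_speech / P_noise)  =>  P_noise = P_speech / 10^(SNR_dB / 10)
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]

# Toy usage with synthetic signals (one second at an assumed 16 kHz sample rate).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=5)

# Recover the added noise and check the ratio that was actually achieved.
added = [y - s for y, s in zip(noisy, speech)]
measured_snr_db = 10 * math.log10(power(speech) / power(added))
```

At 5 dB the speech carries only about three times the power of the noise, so the reference utterances in this section are heavily degraded.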

I say all humans should be treated fairly.
[Audio samples, four rows; columns: Reference Speaker, BYOL-A, BYOL-A + Pros, BYOL-A + Noise, BYOL-A + Pros + Noise, d-vectors VCTK, d-vectors Vox]
I have a strong bond with WiFi. We just have this connection.
[Audio samples, four rows; columns: Reference Speaker, BYOL-A, BYOL-A + Pros, BYOL-A + Noise, BYOL-A + Pros + Noise, d-vectors VCTK, d-vectors Vox]
You don't have anything scheduled for today.
[Audio samples, four rows; columns: Reference Speaker, BYOL-A, BYOL-A + Pros, BYOL-A + Noise, BYOL-A + Pros + Noise, d-vectors VCTK, d-vectors Vox]