Innoetics

High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis, Spyros Raptis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis

Abstract: This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture that adopts modules from both the Tacotron 1 and 2 models is proposed, while stability is ensured by using a recently proposed purely location-based attention mechanism, suitable for generating sentences of arbitrary length. During inference, the decoder is unrolled and acoustic feature generation is performed in a streaming manner, allowing for a nearly constant latency that is independent of sentence length. Experimental results show that the acoustic model can produce feature sequences with minimal latency, about 31 times faster than real time on a computer CPU and 6.5 times faster on a mobile CPU, which meets the requirements of real-time applications on both devices. The full end-to-end system can generate speech of almost natural quality, as verified by listening tests.
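The streaming idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `stream_frames`, `steps_before_first_chunk`, the dummy decoder step, and the chunk size are all hypothetical names and values chosen for illustration, and the total frame count is passed in explicitly, whereas a real autoregressive decoder would decide when to stop on its own. The sketch only shows why emitting frames in fixed-size chunks makes the work done before the first output independent of sentence length.

```python
from typing import Callable, Iterator, List

Frame = List[float]  # one acoustic feature vector (illustrative type)


def stream_frames(
    decoder_step: Callable[[int], Frame],
    total_frames: int,
    chunk_size: int,
) -> Iterator[List[Frame]]:
    """Yield frames in fixed-size chunks as soon as each chunk is ready.

    `decoder_step` is a stand-in for one autoregressive decoder step.
    """
    chunk: List[Frame] = []
    for t in range(total_frames):
        chunk.append(decoder_step(t))
        if len(chunk) == chunk_size:
            yield chunk  # a vocoder could start consuming this chunk now
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter chunk


def steps_before_first_chunk(total_frames: int, chunk_size: int) -> int:
    """Count decoder steps executed before the first chunk is available."""
    calls = 0

    def dummy_step(t: int) -> Frame:
        nonlocal calls
        calls += 1
        return [0.0]  # placeholder feature vector

    next(stream_frames(dummy_step, total_frames, chunk_size))
    return calls


if __name__ == "__main__":
    # A short and a long "sentence" need the same work before first output.
    print(steps_before_first_chunk(100, 10))
    print(steps_before_first_chunk(1000, 10))
```

Under these assumptions the number of decoder steps before the first chunk depends only on the chunk size, which mirrors the nearly constant latency described above.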

Ground Truth Samples

(GT: Ground truth, Voc: GT resynthesis)

His hair, though gray, was thick, and lay smooth over his forehead.

Voc

and he was in possession, the first time to her knowledge, of a watch.

Voc

On stopping at a door in this low street, Ikey jumped out, ran into the house, slamming the door behind him.

Voc

The poor soul then joined the doctor in prayer, and never did I witness more contrition at any condemned sermon than he then evinced.

Voc

Generated Samples

Given a four-month suspended sentence.

r10

This is a list of notable Irish film directors.

r10

In general the Full Form of INTERNET is International Network.

r10

The resulting fiscal crisis has prompted the government to print more money, which has led to hyperinflation and a collapse of the currency.

r10

In the Republic of Ireland, a structure or site may be deemed to be a "National Monument", and therefore worthy of state protection, if it is of national importance.

r10

City nicknames can help in establishing a civic identity, helping outsiders recognize a community or attracting people to a community because of its nickname; promote civic pride; and build community unity.

r10

Then Prime Minister David Cameron was the leading voice in the Remain campaign, after reaching an agreement with other European Union leaders that would have changed the terms of Britain's membership had the country voted to stay in.

r10

The Rawlings Gold Glove Award, usually referred to as the Gold Glove, is the award given annually to the Major League Baseball players judged to have exhibited superior individual fielding performances at each fielding position in both the National League and the American League, as voted by the managers and coaches in each league.

r10

Elisa is a mute, isolated woman who works as a cleaning lady in a hidden, high-security government laboratory in 1962 Baltimore. Her life changes forever when she discovers the lab's classified secret -- a mysterious, scaled creature from South America that lives in a water tank. As Elisa develops a unique bond with her new friend, she soon learns that its fate and very survival lies in the hands of a hostile government agent and a marine biologist.

r10

Poor Douglas, before his death -- when it was in sight -- committed to me the manuscript that reached him on the third of these days and that, on the same spot, with immense effect, he began to read to our hushed little circle on the night of the fourth. The departing ladies who had said they would stay didn't, of course, thank heaven, stay: they departed, in consequence of arrangements made, in a rage of curiosity, as they professed, produced by the touches with which he had already worked us up. But that only made his little final auditory more compact and select, kept it, round the hearth, subject to a common thrill.

r10