The dataset is available for download on Zenodo: https://zenodo.org/record/7119399

SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

Georgia Maniati, Alexandra Vioni, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, June Sig Sung, Gunu Jho, Aimilios Chalamandaris and Pirros Tsiakoulis

Abstract: In this work, we present the SOMOS dataset, the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset which is a common benchmark for building neural acoustic models and vocoders. Utterances are generated from 200 different TTS systems including a variety of vanilla neural acoustic models as well as models which allow prosodic variations. An LPCNet vocoder is used for all systems, so that the variations in the final samples depend only on the acoustic models. The synthesized utterances provide a balanced and adequate domain, length and phoneme coverage. MOS naturalness evaluations are collected via crowdsourcing on Amazon Mechanical Turk. We present in detail the design of the SOMOS dataset, as well as provide baseline results by training and evaluating state-of-the-art MOS prediction models, while we show the problems that these models face when assigned to evaluate TTS samples.
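As a rough illustration of what a MOS dataset aggregates, the sketch below averages individual listener ratings into per-utterance and per-system scores. The ratings, the utterance names and the 1-to-5 scale are illustrative assumptions, not values or formats taken from SOMOS, and averaging utterance means into a system score is only one common convention.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical crowdsourced ratings: (system_id, utterance_id, score).
# The 1-5 scale and all values here are assumptions for illustration.
ratings = [
    ("001", "utt_a", 4), ("001", "utt_a", 3), ("001", "utt_b", 5),
    ("002", "utt_c", 2), ("002", "utt_c", 3), ("002", "utt_d", 2),
]

def utterance_mos(ratings):
    """Average all listener scores given to each (system, utterance) pair."""
    scores = defaultdict(list)
    for system_id, utt_id, score in ratings:
        scores[(system_id, utt_id)].append(score)
    return {key: mean(vals) for key, vals in scores.items()}

def system_mos(utt_mos):
    """Average the utterance-level MOS values of each system."""
    per_system = defaultdict(list)
    for (system_id, _utt_id), value in utt_mos.items():
        per_system[system_id].append(value)
    return {sid: mean(vals) for sid, vals in per_system.items()}

print(system_mos(utterance_mos(ratings)))
```

A MOS prediction model would be trained to estimate such scores directly from the audio, without collecting listener ratings.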

Below are 20 of the 2,000 sentences included in the dataset, together with their corresponding speech samples. Each sentence is uttered by 10 of the 200 TTS systems (numbered 001 to 200), while the LJ Speech sentences are additionally uttered by the natural LJ Speech voice (denoted as system 000). The variation across the speech samples reflects typical problems of modern acoustic models in prosody, rhythm, stress, pauses and pronunciation. The F0 contours of the speech samples are plotted for the first 400 frames.
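The layout described above can be checked with a small sketch: 2,000 sentences times 10 TTS systems gives the 20K synthetic utterances, and each sample lists 10 system IDs between 001 and 200, plus 000 when a natural recording exists. The dictionary keys `sample_1` and `sample_12` are hypothetical labels; the ID lists are copied from two of the samples on this page.

```python
NATURAL_SYSTEM = "000"  # the natural LJ Speech voice

# System IDs as listed on this page; the keys are hypothetical labels.
samples = {
    "sample_1": ["006", "010", "039", "051", "055",
                 "067", "104", "155", "166", "189"],
    "sample_12": ["000", "024", "044", "055", "059", "110",
                  "126", "141", "172", "177", "181"],
}

def tts_systems(system_ids):
    """Return the distinct synthetic systems, excluding the natural voice."""
    return sorted({s for s in system_ids if s != NATURAL_SYSTEM})

def is_well_formed(system_ids, expected=10):
    """Check for exactly `expected` TTS systems with IDs in 001..200."""
    systems = tts_systems(system_ids)
    return len(systems) == expected and all(1 <= int(s) <= 200 for s in systems)

# 2,000 sentences, each synthesized by 10 systems.
total_synthetic = 2000 * 10

for name, ids in samples.items():
    print(name, is_well_formed(ids), NATURAL_SYSTEM in ids)
print(total_synthetic)
```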


That torturing jingle departed out of my brain, and a grateful sense of rest and peace descended upon me.

006

010

039

051

055

067

104

155

166

189


He could not be accountable for his children's want of spirits, or for her want of enjoyment in his company.

002

009

038

068

079

094

106

150

192

193


He still has a choice.

029

100

110

120

149

155

156

160

171

198


What's the name of the bar?

013

017

040

077

082

088

089

096

158

192


The restaurant called The Deep Sea Takeaway has good food quality with excellent service, and is located in Leith.

032

072

077

112

131

144

172

179

180

196


Do you ever feel angry about housework but say nothing?

021

028

069

121

123

144

145

147

168

178


Mr President, you have given me a keynote to use.

020

031

053

055

070

085

129

130

161

179


Neville, the government official who instigated the policy.

009

033

036

037

073

086

108

115

118

133


I've always loved modern art and I adore surrealism.

001

016

021

028

038

111

128

133

141

179


But that depends on the point of view.

004

030

051

077

094

106

108

135

155

186


This time there was no doubt.

021

037

041

084

086

087

101

112

118

148


Most of the other Secret Service agents in the motorcade had drawn their sidearms.

000

024

044

055

059

110

126

141

172

177

181


Having determined that missus Paine was a responsible and reliable citizen, Hosty interviewed her on november first.

000

048

075

091

093

107

120

126

129

161

166


The discounting , say fast-food operators , occurs on a scale and with a frequency they haven't seen before .

003

005

016

068

090

125

148

160

174

177


Because these freshmen placed far more emphasis on their partisan role -- spreading the Reagan revolution -- in national policy making , they were more vulnerable to defeat .

024