The dataset is available for download on Zenodo: https://zenodo.org/record/7119399

SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

Georgia Maniati, Alexandra Vioni, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, June Sig Sung, Gunu Jho, Aimilios Chalamandaris and Pirros Tsiakoulis

Abstract: In this work, we present the SOMOS dataset, the first large-scale mean opinion scores (MOS) dataset consisting solely of neural text-to-speech (TTS) samples. It can be used to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances of the LJ Speech voice; LJ Speech is a public domain speech dataset that is a common benchmark for building neural acoustic models and vocoders. Utterances are generated by 200 different TTS systems, including a variety of vanilla neural acoustic models as well as models that allow prosodic variation. An LPCNet vocoder is used for all systems, so that the variations in the final samples depend only on the acoustic models. The synthesized utterances provide balanced and adequate coverage of domains, lengths and phonemes. MOS naturalness evaluations are collected via crowdsourcing on Amazon Mechanical Turk. We present the design of the SOMOS dataset in detail and provide baseline results by training and evaluating state-of-the-art MOS prediction models, showing the problems that these models face when tasked with evaluating TTS samples.
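A system's MOS is the arithmetic mean of the naturalness ratings collected for it. The following is a minimal sketch of that aggregation, assuming the crowdsourced ratings are available as (system ID, score) pairs on a 1 to 5 scale. The function name `mos_per_system` and this input format are hypothetical and are not the file layout of the SOMOS release on Zenodo.

```python
from collections import defaultdict
from statistics import mean


def mos_per_system(ratings):
    """Average crowdsourced naturalness ratings per TTS system.

    Hypothetical input: an iterable of (system_id, score) pairs, where
    score is assumed to be an integer on a 1-5 scale.
    """
    scores = defaultdict(list)
    for system_id, score in ratings:
        # Reject ratings outside the assumed 1-5 scale.
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        scores[system_id].append(score)
    # MOS: arithmetic mean of all ratings collected for each system.
    return {system: mean(values) for system, values in scores.items()}


# Toy ratings for the natural voice (000) and one TTS system (006).
example = [("000", 5), ("000", 4), ("006", 3), ("006", 4), ("006", 2)]
print(mos_per_system(example))
```

Grouping before averaging keeps every system's ratings together, so systems with different numbers of ratings are each averaged over their own ratings only.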

Below are 20 sample sentences out of the 2,000 sentences included in the dataset, together with their corresponding speech samples. Each sentence is uttered by 10 TTS systems drawn from systems 001 to 200, while the LJ Speech sentences are additionally uttered by the natural LJ Speech voice (denoted as system 000). The variation across the speech samples reflects problems typical of modern acoustic models, in areas such as prosody, rhythm, stress, pauses and pronunciation. The F0 contours of the speech samples are illustrated for the first 400 frames.

That torturing jingle departed out of my brain, and a grateful sense of rest and peace descended upon me.
Systems: 006, 010, 039, 051, 055, 067, 104, 155, 166, 189
Sentence ID: booksent_2012_0037


He could not be accountable for his children's want of spirits, or for her want of enjoyment in his company.
Systems: 002, 009, 038, 068, 079, 094, 106, 150, 192, 193
Sentence ID: booksent_2013_0013


He still has a choice.
Systems: 029, 100, 110, 120, 149, 155, 156, 160, 171, 198
Sentence ID: broadcast_2010_0046


What's the name of the bar?
Systems: 013, 017, 040, 077, 082, 088, 089, 096, 158, 192
Sentence ID: conv_2008_0073


The restaurant called The Deep Sea Takeaway has good food quality with excellent service, and is located in Leith.
Systems: 032, 072, 077, 112, 131, 144, 172, 179, 180, 196
Sentence ID: conv_2009_0033-2


Do you ever feel angry about housework but say nothing?
Systems: 021, 028, 069, 121, 123, 144, 145, 147, 168, 178
Sentence ID: general_0009


Mr President, you have given me a keynote to use.
Systems: 020, 031, 053, 055, 070, 085, 129, 130, 161, 179
Sentence ID: general_0064


Neville, the government official who instigated the policy.
Systems: 009, 033, 036, 037, 073, 086, 108, 115, 118, 133
Sentence ID: news_2013_0050


I've always loved modern art and I adore surrealism.
Systems: 001, 016, 021, 028, 038, 111, 128, 133, 141, 179
Sentence ID: news_2009_0131


But that depends on the point of view.
Systems: 004, 030, 051, 077, 094, 106, 108, 135, 155, 186
Sentence ID: novel_2011_0095


This time there was no doubt.
Systems: 021, 037, 041, 084, 086, 087, 101, 112, 118, 148
Sentence ID: novel_2010_0052


Most of the other Secret Service agents in the motorcade had drawn their sidearms.
Systems: 000, 024, 044, 055, 059, 110, 126, 141, 172, 177, 181
Sentence ID: LJ030-0231


Having determined that missus Paine was a responsible and reliable citizen, Hosty interviewed her on november first.
Systems: 000, 048, 075, 091, 093, 107, 120, 126, 129, 161, 166
Sentence ID: LJ047-0153


The discounting , say fast-food operators , occurs on a scale and with a frequency they haven't seen before .
Systems: 003, 005, 016, 068, 090, 125, 148, 160, 174, 177
Sentence ID: reportorial_2011_0156


Because these freshmen placed far more emphasis on their partisan role -- spreading the Reagan revolution -- in national policy making , they were more vulnerable to defeat .
Systems: 024, 038, 056, 068, 075, 078, 131, 133, 171, 195
Sentence ID: reportorial_2011_0497


A gale is a strong wind, typically used as a descriptor in nautical contexts.
Systems: 048, 052, 082, 111, 121, 135, 141, 147, 160, 174
Sentence ID: wiki_0004


The blogosphere is made up of all blogs and their interconnections.
Systems: 020, 053, 056, 077, 084, 144, 162, 181, 187, 189
Sentence ID: wiki_0089


The odious serials mistook the interactive easel.
Systems: 013, 031, 039, 070, 073, 082, 096, 131, 191, 199
Sentence ID: sus_2008_0200


The area traveled above the true trip.
Systems: 030, 033, 058, 062, 067, 101, 107, 122, 191, 192
Sentence ID: sus_2011_0456


The air that traveled questions the wines.
Systems: 007, 048, 049, 063, 077, 118, 146, 159, 169, 190
Sentence ID: sus_2013_0107