Importing a large annotated database of CC0 speech data in Swedish?

PeterKz · May 27, 2019, 1:06pm

A couple of years ago, speech data from a bankrupt speech research company was made available by the National Library of Norway. It contains audio files and annotated text files from a multitude of speakers and is CC0 licensed. It is close to 100 GB of data. Would it be possible to import this data into Common Voice to improve Swedish ASR?

The data is freely available at the national library here.

Similar data exists for Norwegian and Danish. An overview (in Norwegian) is available in this document.

nukeador · May 27, 2019, 1:07pm

I suspect this will require some work to adapt it to our dataset needs.

@kdavis @lsaunders can better comment on this one.

kdavis · May 28, 2019, 6:28am

@PeterKz Nice find! I’ll take a look at the data sets, but it looks promising!