Improving Slovenian language model

JakaBac · September 25, 2019, 9:15pm

Hello,

I would like to help improve the Slovenian language model.

Currently I am negotiating to obtain speech data from Slovenian TV stations.
The speech data would consist of journalists’ voiceovers and text transcriptions of the stories.
The majority of the recordings are made in a studio environment (clean, no background noise), and ideally I will have access to the 3 “main” accents.

The problem is that I will certainly not be able to release the data under CC (or any other free license). And I currently don’t have access to a decent HW setup so training would take forever.

Also, the data would need to be preprocessed before training, since each recording would consist of a whole story (multiple sentences). I saw on your blog that you also gathered data from TV and radio stations, so I imagine that you have had to do something similar already.

I would like to know whether some kind of agreement could be made with Mozilla for such a data set, or whether this would not be possible.

reyxuan · September 26, 2019, 6:33am

And I currently don’t have access to a decent HW setup so training would take forever.

You can try Google Colab. Save the progress in Drive and repeat each time the session closes.
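
A minimal sketch of that save-and-resume pattern, assuming the training state can be reduced to a small JSON-serializable dictionary. The names `save_checkpoint`, `load_checkpoint`, `train`, and the `latest.json` file name are hypothetical and do not come from any particular training tool. In Colab, `checkpoint_dir` would point to a folder on Drive after mounting it with `google.colab.drive.mount("/content/drive")`; here it is just an ordinary directory so the sketch runs anywhere.

```python
import json
import os
import tempfile


def save_checkpoint(state, checkpoint_dir):
    """Atomically write the training state to checkpoint_dir/latest.json."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, "latest.json")
    # Write to a temporary file first so an interrupted session
    # cannot leave a half-written checkpoint behind.
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, final_path)
    return final_path


def load_checkpoint(checkpoint_dir):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    path = os.path.join(checkpoint_dir, "latest.json")
    if not os.path.exists(path):
        return {"epoch": 0}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def train(checkpoint_dir, total_epochs):
    """Placeholder training loop that resumes from the last saved epoch."""
    state = load_checkpoint(checkpoint_dir)
    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of real training would go here (assumption) ...
        state["epoch"] = epoch + 1
        save_checkpoint(state, checkpoint_dir)
    return state
```

Calling `train` again after a session closes picks up at the stored epoch instead of starting over. A real setup would also need to save the model weights and optimizer state, which this sketch leaves out.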

kdavis · September 26, 2019, 7:44am

We have previously licensed data from broadcasters. But we’d have to license the data directly from the broadcaster. We couldn’t do so through you, @JakaBac.

However, the big problem I see in this case is that the data is not aligned. To align the data you need a rudimentary STT engine in Slovenian. We don’t have such an engine. However, given enough aligned data we could create one.

JakaBac · September 26, 2019, 8:07am

I didn’t mean that licensing would go through me. I just wanted to know if Mozilla is open to doing something like this.
Could you please let me know what would be needed from your side for the licensing procedure, so I can talk to the relevant people here? We can also take this part of the conversation offline. Just let me know how to proceed.

Today I got some sample data from one of the broadcasters. I will post more details later.

For the rudimentary STT engine, would it be enough that the data consists of whole stories aligned with the text, or would it be necessary to split the audio into distinct sentences? Also, please let me know how much “enough aligned data” is.

kdavis · September 26, 2019, 8:59am

We’re open to doing something like this.

From our side we’d need to know the broadcasters involved and be introduced to them so we can figure out the details of a license.

For the rudimentary STT engine, we’d need alignment at the sentence level for about 500 to 1000 hours of speech.
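
To make that requirement concrete, here is a rough sketch of how sentence-level alignment could be represented and tallied against the 500-hour lower bound. The `Segment` structure and both helper functions are assumptions made for illustration, not a format used by any specific tool.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned sentence: where it starts and ends in the audio, plus its text."""
    start_s: float
    end_s: float
    text: str


def total_hours(segments):
    """Sum the durations of all aligned sentences, in hours."""
    return sum(seg.end_s - seg.start_s for seg in segments) / 3600.0


def hours_remaining(segments, target_hours=500.0):
    """Hours of aligned speech still missing to reach the target (never negative)."""
    return max(0.0, target_hours - total_hours(segments))
```

Each story would contribute one `Segment` per sentence, so the gaps between sentences do not count toward the total.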