Hi! I’d like to add data for Norwegian speech recognition, specifically to help with recognising dialects (how people actually speak).
I’ve been collecting sentences and have 500 so far.
Sentences will need to be standardized and written in one form. That will be a huge problem.
Infinitiv a/e
Even the Norwegian translation of voice.mozilla.org currently mixes two conventions for one of the simplest rules: it has both “å lage” and “å laga” for “to make”.
- I want to use a-infinitiv (“å laga”), because it is further from Danish-Norwegian and will therefore be simpler to disambiguate.
- A downside is that most school textbooks and the state use e-infinitiv, as far as I’ve seen.
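To make the a-infinitiv choice concrete, here is a minimal sketch of how collected sentences could be checked and rewritten before import. The mapping, the function names, and the example sentence are my own assumptions for illustration; only the pair “å lage” / “å laga” comes from this post, and a real list would have to be built by hand or from a dictionary.

```python
import re

# Hypothetical mapping from e-infinitiv to a-infinitiv.
# Only the pair mentioned in this post is included.
E_TO_A = {"lage": "laga"}

# The infinitive marker "å" (or "Å" at the start of a sentence) followed by one word.
INFINITIVE = re.compile(r"\b([åÅ]) (\w+)")


def to_a_infinitive(sentence, mapping=E_TO_A):
    """Rewrite known e-infinitives after "å" to the a-infinitiv form."""
    def swap(match):
        marker, verb = match.group(1), match.group(2)
        # Unknown verbs are left exactly as written.
        return marker + " " + mapping.get(verb.lower(), verb)
    return INFINITIVE.sub(swap, sentence)


def find_e_infinitives(sentences, mapping=E_TO_A):
    """Return the sentences that would change, so a human can review them."""
    return [s for s in sentences if to_a_infinitive(s, mapping) != s]
```

Calling `find_e_infinitives` on the collected sentences would give a review list rather than changing anything silently, which seems safer while the rule is still under discussion.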
Different words for the same thing
“To become” can be either “å verta” or “å bli”. People say it differently depending on their dialect. Both are allowed in writing (sadly). There are two ways to go here:
- Require that written text always uses one of them, and let people say whatever they say in their dialect.
- Allow both in the text, and have speakers read it as written.
The problem with number 2 is that it will be unclear to the speaker when they should speak as written and when they should not. As an example, “to see” is written “å sjå”. But if we have a rule of “say it like it’s written” for “å verta” and “å bli” (and similar pairs), it will be hard for a speaker who says “å se” in her dialect to know that she should not say “å sjå” in this particular case. She would have to know that “å se” is not a valid way to write “to see”.
The problem with number 1 is that we’d need to decide on one form. Also, saying “jei blir gla’” and getting out “eg vert glad” is a very long stretch between “what you hear” and “what you get”. However, English has a lot of these oddities, so the model might be able to learn it.
Marking of dialect in profile
Before we actually open up to recording, we need to have dialect marking in the profile. It also needs to be in the exported data.
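As a rough illustration of what I mean by having the dialect in the exported data, here is a small sketch. The field names, the `Clip` class, the file name, and the dialect label are all hypothetical; I don’t know what the real profile or export format looks like, so treat this as one possible shape, not a description of the current one.

```python
import csv
import io
from dataclasses import dataclass, asdict


@dataclass
class Clip:
    path: str       # hypothetical: file name of the recording
    sentence: str   # the written sentence the speaker was shown
    dialect: str    # hypothetical: dialect label taken from the speaker's profile


def export_tsv(clips):
    """Write clips as tab-separated text with the dialect as its own column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["path", "sentence", "dialect"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for clip in clips:
        writer.writerow(asdict(clip))
    return buffer.getvalue()


print(export_tsv([Clip("clip1.mp3", "Eg vert glad.", "vestnorsk")]))
```

The point is only that the label travels with every clip, so whoever trains on the data can group or filter recordings by dialect.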
What do you think? Do any other languages have similar problems? My impression is that most countries have a more widely accepted standard language. Foreigners will often treat “Standard austnorsk” as the equivalent for Norwegian, but that’s really not true.