Add non-native field

Codigo_Logo_Programacao_e_Inteligencia_Artificial · April 10, 2019, 1:07pm

I was thinking about how a non-native field could be important for certain tasks. Since these datasets will certainly be used for tasks other than SR, the non-native field could be used for pronunciation evaluation.

nukeador · April 10, 2019, 1:07pm

I’m currently in conversations with the team on how to improve the accent lists. There are definitely better approaches that could be helpful for everyone.

Thanks for your feedback, I’ll also add it as an item to discuss and consider.

Cheers.

Michael_Maggs · April 10, 2019, 1:24pm

Not sure exactly what ‘non-native’ would mean over and above, say, a ‘German accent’ in English. One interesting option might be a ‘fluent’ tag. There’s a huge difference between someone who speaks English fluently with a German accent, and a German speaker whose knowledge of English is fairly basic. Unfortunately, many of our current readers in English are by no means fluent in the language, which causes all sorts of mispronunciations. But you may be able to identify some of those by looking at their reading accuracy (based on validations).

As an aside, I wonder if the ML team might investigate whether they could get more robust results by excluding some of the less accurate speakers entirely?
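
To make the idea concrete, here is a minimal sketch of filtering by per-speaker reading accuracy. It assumes each clip record carries `client_id`, `up_votes` and `down_votes` fields; the 0.8 threshold and the helper names `speaker_accuracy` and `filter_clips` are hypothetical, chosen only for illustration, and this is not how the ML team actually processes the data.

```python
from collections import defaultdict


def speaker_accuracy(clips):
    """Return {client_id: share of validation votes that were up-votes}."""
    up = defaultdict(int)
    total = defaultdict(int)
    for clip in clips:
        up[clip["client_id"]] += clip["up_votes"]
        total[clip["client_id"]] += clip["up_votes"] + clip["down_votes"]
    # Speakers with no votes yet are left out rather than scored as 0.
    return {cid: up[cid] / total[cid] for cid in total if total[cid] > 0}


def filter_clips(clips, min_accuracy=0.8):
    """Keep clips from speakers at or above min_accuracy (assumed threshold).

    Speakers without any votes have no score and are dropped as well.
    """
    accuracy = speaker_accuracy(clips)
    return [c for c in clips if accuracy.get(c["client_id"], 0.0) >= min_accuracy]


# Made-up records for illustration only.
clips = [
    {"client_id": "a", "path": "a1.mp3", "up_votes": 2, "down_votes": 0},
    {"client_id": "a", "path": "a2.mp3", "up_votes": 3, "down_votes": 0},
    {"client_id": "b", "path": "b1.mp3", "up_votes": 1, "down_votes": 2},
    {"client_id": "b", "path": "b2.mp3", "up_votes": 0, "down_votes": 2},
]

kept = filter_clips(clips)
print([c["path"] for c in kept])
```

A low score here only says that validators rejected a speaker's clips; it cannot tell a non-fluent reader apart from, say, a noisy microphone, so it is at best a rough proxy for fluency.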

JAGulin · May 9, 2019, 12:03pm

We can’t be sure how the data will be useful in the future, but I thought about this when registering. As far as I saw there were plenty of variants of English to select from, but none that I felt matched myself.

“German accent” would be a very descriptive setting, but I think having one “catch all” for “non-native” is better than nothing (supposedly meaning “I didn’t bother to fill in”). The settings also had a “native language” setting so inferring the type of accent from that may be possible. On the other hand I have English in “other language” so perhaps that’s also enough to infer “non-native” for me.
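
A rough sketch of that inference, assuming a profile is a plain dictionary with hypothetical `native_language` and `other_languages` keys (the real profile schema may differ):

```python
def infer_nativeness(profile, language="en"):
    """Guess whether a speaker is native in `language` from profile settings.

    `native_language` and `other_languages` are assumed field names,
    not the actual Common Voice profile schema.
    """
    if profile.get("native_language") == language:
        return "native"
    if language in profile.get("other_languages", []):
        return "non-native"
    # Nothing filled in for this language: do not guess.
    return "unknown"


# Made-up profiles for illustration only.
print(infer_nativeness({"native_language": "sv", "other_languages": ["en"]}))
print(infer_nativeness({"native_language": "en"}))
print(infer_nativeness({}))
```

When the result is "non-native", the `native_language` value could also serve as a hint for the type of accent.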

nukeador · May 9, 2019, 4:00pm

Quick update: I have a meeting tomorrow with the team to talk about this and expect a topic here early next week with the proposal for feedback.

Codigo_Logo_Programacao_e_Inteligencia_Artificial · May 9, 2019, 6:34pm

@nukeador Great! I was thinking about this for a while, and a dataset with only native speakers would be very useful for pronunciation evaluation research. I would consider myself a non-native speaker, since English is my second language.