About the new English Sentences

Codigo_Logo_Programacao_e_Inteligencia_Artificial · April 24, 2019, 3:52pm

I really like that we have new sentences now. I’m seeing a lot of new proper names, which I, as a non-native speaker, struggle a little bit to pronounce, but that’s OK because it’ll improve my English. I have doubts about abbreviations such as “Inc”: should I say just “Inc” or “in case”? I said it “inc” just in case, pun intended.

nukeador · April 24, 2019, 12:36pm

Do you have an example where “Inc” is included?

Codigo_Logo_Programacao_e_Inteligencia_Artificial · April 24, 2019, 12:44pm

It was in the name of a company; I don’t recall much of it now.

dabinat · April 24, 2019, 3:12pm

I’ve seen it a couple of times. It’s from the wiki sentences.

nukeador · April 24, 2019, 3:19pm

OK, I understand that the code didn’t recognize that as an acronym (no upper case) or an abbreviation (missing dot).
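
To illustrate why a bare “Inc” could slip through, here is a minimal sketch of the kind of check described above. It is not the actual Common Voice import code; the function names and the exact rules (all upper case means acronym, trailing dot means abbreviation) are assumptions made for illustration.

```python
import re


def looks_like_acronym(token: str) -> bool:
    # Assumed rule: two or more letters, all upper case, e.g. "NASA".
    return len(token) >= 2 and token.isalpha() and token.isupper()


def looks_like_abbreviation(token: str) -> bool:
    # Assumed rule: letters followed by a trailing dot, e.g. "Inc." or "Blvd."
    return re.fullmatch(r"[A-Za-z]+\.", token) is not None


def is_flagged(token: str) -> bool:
    # A token is flagged if either hypothetical rule matches.
    return looks_like_acronym(token) or looks_like_abbreviation(token)
```

Under these assumed rules, `is_flagged("Inc")` is false because the token is neither all upper case nor dot-terminated, while `is_flagged("Inc.")` is true. A real check would also need to tell abbreviations apart from ordinary words at the end of a sentence, which this sketch does not attempt.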

joshua.landau.ws · April 24, 2019, 3:48pm

Some sentences are ridiculous. Here are some.

The characters for Kyoto are 京都 and Osaka’s are 大阪.

The township is in Schuylkill Valley School District.

Abolhasan Saba, Esmaeil Ghahremani and Ali-Naqi Vaziri were among his students.

Bundesliga club Kaiserslautern on a one-year contract.

It is also close to the Naskapi reserved land of Kawawachikamach.

Niche words are important, but “niche” is a niche word, not “Kawawachikamach”. Surely there’s a better way to get dictionary coverage that doesn’t involve foreign place names and twenty-letter scientific Latin terms.

dabinat · April 24, 2019, 3:48pm

I posted an issue about this here so feel free to chime in with any comments: https://github.com/mozilla/voice-web/issues/1958

1.5 million sentences were imported so even if only 3% are bad, that’s quite a lot (about 45,000 sentences). I’m working on a script to filter out most of them but some manual validation will still be needed.

nukeador · April 24, 2019, 3:48pm

We agree. Thanks for flagging; we are looking into improving and fixing this.

Codigo_Logo_Programacao_e_Inteligencia_Artificial · April 24, 2019, 3:48pm

Hi, it’s me again. I’ve come across some Japanese words in this sentence: “The characters for Kyoto are 京都 and Osaka’s are 大阪.” When I don’t know how to pronounce a word I look it up with Google Translate, but I think this could lead to some mismatches. Was the inclusion of non-English words planned? Is there a plan to collect voice data for other languages?

nukeador · April 24, 2019, 3:49pm

Merging all messages about this in this topic.

Yes, @gregor is looking into it, for now please skip these sentences.

Thanks!

Codigo_Logo_Programacao_e_Inteligencia_Artificial · April 24, 2019, 3:51pm

I’ll do that, Roger that!

Codigo_Logo_Programacao_e_Inteligencia_Artificial · April 24, 2019, 4:09pm

I’m recording some clips, and sometimes there are confusing sentences. I think an option to disable these sentences for recording would be welcome.
Example: [‘The original center ran perpendicular to W. Club Blvd.’]
I can’t record new clips without having to record this sentence, which is why I think this option is needed. I just recorded a silent clip.

nukeador · April 26, 2019, 1:30pm

I want to let you know that, thanks to @dabinat’s great work, we have filtered out a lot of sentences with issues (in the end they were just around 8%).

These changes should be reflected in the next deployment.

Thanks everyone for your valuable feedback to improve the project.

Cheers.

Codigo_Logo_Programacao_e_Inteligencia_Artificial · April 26, 2019, 7:04pm

We also thank the Common Voice team for this awesome project!

JAGulin · May 10, 2019, 7:26am

@Codigo_Logo_Programacao_e_Inteligencia_Artificial
Just in case it wasn’t mentioned before, there is a “skip” button which will give you a new phrase to record (so you still have 5). Isn’t it available to you?

As for “Inc”, it means “incorporated” when it refers to a company.
https://en.wikipedia.org/wiki/Inc.
The intention may have been to remove these strings, but when it’s a standard part of a company name like “Acme Inc” I’d say it could be read as “ink”. The safe bet is to skip, though.

Codigo_Logo_Programacao_e_Inteligencia_Artificial · May 10, 2019, 7:42am

Yeah I’m seeing this button now. Ok got it.

Michael_Maggs · May 22, 2019, 12:50pm

While it’s fantastic to have so many English sentences from Wikipedia, we shouldn’t assume that everything should come from there. WP sentences are typically straight facts, which are often boring to record and review. They frequently include really obscure non-English proper names (such as villages in Russia) that aren’t at all useful for the dataset and are exceptionally hard for volunteers to read. And they mostly lack the proper names that we do need, such as common English-language given names.

If nobody objects, I’ll restart uploading sentences from interesting public domain books, with personal names replaced by a script to increase name diversity.
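
As a rough illustration of the name-replacement idea, here is a minimal sketch. It is not the script referred to above; the function name, the cycling strategy, and the example names are all assumptions.

```python
import itertools
import re


def replace_names(sentences, original_names, replacement_names):
    """Swap each occurrence of a known personal name for the next name in a pool."""
    if not original_names:
        # Nothing to replace; avoid building an empty pattern.
        return list(sentences)
    # Cycle through the pool so replacements are spread across many names.
    pool = itertools.cycle(replacement_names)
    # Match any of the original names as a whole word.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in original_names) + r")\b"
    )
    return [pattern.sub(lambda match: next(pool), s) for s in sentences]
```

For example, `replace_names(["Alice saw Bob.", "Bob waved."], ["Alice", "Bob"], ["Maria", "Omar", "Priya"])` gives `["Maria saw Omar.", "Priya waved."]`. The same character can end up with different names in different sentences, which should not matter when sentences are recorded in isolation.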

nukeador · May 22, 2019, 12:52pm

Do you have ideas on how we can optimize the Wikipedia extraction to avoid this issue?

dabinat · May 23, 2019, 4:00am

I have been trying to filter out letter sequences that don’t tend to occur often in English. For example, there are no English words (that I know of) that contain the letter sequences “uuk” or “ijp”, so it filters sentences with words containing these letter sequences.

You can find my script here: https://github.com/dabinat/cvtools/blob/master/sentence_validator.py

(I also have a PR awaiting approval with these changes: https://github.com/mozilla/voice-web/pull/2040 )
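
For readers who want the gist without opening the links, the letter-sequence filtering described above can be sketched roughly as follows. This is not the code from `sentence_validator.py`; the function names are hypothetical, and the blocklist contains only the two sequences mentioned in the post.

```python
import re

# Assumed blocklist: only the two sequences named in the post.
# A real list would be much longer.
RARE_SEQUENCES = ("uuk", "ijp")


def has_rare_sequence(sentence: str) -> bool:
    # Split into lower-cased words made of letters only, so a sequence
    # cannot match across a word boundary.
    words = re.findall(r"[^\W\d_]+", sentence.lower())
    return any(seq in word for word in words for seq in RARE_SEQUENCES)


def filter_sentences(sentences):
    # Keep only sentences with no word containing a blocklisted sequence.
    return [s for s in sentences if not has_rare_sequence(s)]
```

A sentence containing a word such as “Rijpfjorden” would be dropped because of “ijp”, while ordinary English sentences pass through. A substring blocklist like this is cheap to run over 1.5 million sentences, but it only catches the sequences someone has thought to list.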

dabinat · May 23, 2019, 4:03am

Yes, the Wikipedia stuff has improved word coverage a lot but it’s all past-tense, third-person fact description. We still need other sources for diversity.

I’m happy to review any sentences you upload.