Grammatically poor sample sentences

rogersto · March 5, 2019, 12:02pm

Is it not possible to proof-read the sample sentences for grammar, punctuation and spelling? It must cause confusion and unwanted inconsistencies in the spoken submissions. For example, did you really want to hear that Leonardo Di Caprio is staring in a new film, or starring? It makes a difference. I find significant errors in about one in every 10-15 sentence sets, and trivial errors in about every third set.

Codigo_Logo_Programacao_e_Inteligencia_Artificial · March 3, 2019, 2:40pm

I’ll give my take on this: as far as I know, the goal is to train an acoustic model, so spelling wouldn’t be the problem; the language model can handle that.

xorgy · March 3, 2019, 3:42pm

I would say it still matters. A lot of the speakers (myself included) run into situations where a lack of punctuation or a typo changes our inflection and/or the word being transcribed. I can’t imagine that this is good for the model, even if it is just the acoustic part.

A lot of the errors are “homophones”, but to the keen ear, there are few true homophones in English, nor in Japanese as far as I have seen.

Codigo_Logo_Programacao_e_Inteligencia_Artificial · March 3, 2019, 4:42pm

I’m not talking about homophones. I reject a recording when someone reads “gonna” or “wanna” where they should have said “going to”/“want to”.

xorgy · March 3, 2019, 4:46pm

I’m specifically talking about issues like “here” in place of “hear”, which I have seen in the text.

Codigo_Logo_Programacao_e_Inteligencia_Artificial · March 3, 2019, 4:50pm

Got it, well I don’t know about this one.

dabinat · March 3, 2019, 5:25pm

We now have the Sentence Collector which provides a way to submit new sentences and have them validated. However, it has only been live for the past few weeks and many of the sentences pre-date that. In fact, English currently has a big backlog so you are reviewing sentences recorded around a year ago.

I believe the eventual plan is to resubmit the existing sentences through the Sentence Collector so they can be validated and fixed as needed.

nukeador · March 5, 2019, 12:04pm

We want to run an automated clean-up on the existing sentences. We haven’t yet established a final process to solve the issue described here; the current one is to request the removal of these sentences and then re-submit them, corrected, to the Sentence Collector tool.

Any ideas to improve this process are welcome!

rogersto · March 5, 2019, 5:02pm

How many are we talking about? A good proof-reader should be able to check and edit at quite a speed; much quicker than it takes to speak them.

rogersto · March 5, 2019, 5:05pm

‘Hear’ and ‘here’ might be a distracting error in print, but would not necessarily affect the spoken words. Leonardo Di Caprio ‘staring’ in a film would, however.

ajay.dixon · March 6, 2019, 2:38pm

In my area, there’s a big difference between ‘hear’ (heer) and ‘here’ (hee-ya).

dabinat · March 6, 2019, 3:03pm

Even if it doesn’t make a difference to pronunciation, sometimes speakers pause or stumble when they encounter a mistake in the text.

jf99 · March 7, 2019, 3:34pm

We need a tool that lets you:

search in all submitted sentences of a specified language
submit a corrected version of that sentence
provide a justification for the correction (e.g. a link to a dictionary)
review the corrections that others made (ideally allowing a discussion between corrector and reviewer)

I already collected dozens of mistakes in the German corpus. The longer we wait the more of them make it into it.

nukeador · March 7, 2019, 4:23pm

I don’t expect we will be able to build a tool that does what you describe in the short term (we have other priorities and not a lot of resources). That’s why the current proposal, to at least ensure no bad sentences end up in the dataset, is what I described:

Request removal from the sentences list.
Correct the sentences and submit them to the Sentence Collector so they end up on the site in their correct form.

Existing sentences that can be identified automatically as “wrong” will be removed by our cleaning scripts.

In any case your feedback is valuable and we can incorporate it into the list of things we would like to have in the future.

jf99 · March 7, 2019, 4:51pm

The lack of resources is understandable. Where and how do I request the removal of several sentences? Shall I make a PR on GitHub?

nukeador · March 7, 2019, 4:54pm

Yes please, thanks for your understanding.

rogersto · March 7, 2019, 7:24pm

Suggestion: since the volunteer readers will all be seeing these sentences and having to inspect them attentively (we hope) before reading them aloud, could we, in addition to the button saying ‘Skip’, provide a button for ‘Error in this sentence?’ That should act as a filter to collect as many as possible, narrowing the task for the second-level reviewers. It won’t be infallible, but it could be a big (and inexpensive) step in the right direction.

rogersto · March 7, 2019, 7:23pm

Incidentally, you speak of needing to cite a reference to justify corrections. That sounds needlessly complex to me, and would be very difficult in the case of punctuation and some of the grammar errors. You’ll find plenty of people who can identify that ‘should’ needs to be replaced with ‘will’ but far fewer who will be able to identify it as an inappropriate use of the future conditional subjunctive (for example). Why do you need that anyway?

dabinat · March 7, 2019, 7:26pm

Couldn’t it just be a system where people vote on whether they agree with the correction? That’s much simpler.

nukeador · March 7, 2019, 7:53pm

I suggest we don’t go too deep into solution ideation, since at the end of the day it’s not something we will be able to change today, and sometimes it tends to be an endless conversation about personal preferences.

I think it’s better to focus on describing the problem clearly so we can come back here for reference when we have time to start thinking about a proper solution.