Is it not possible to proof-read the sample sentences for grammar, punctuation and spelling? Errors must cause confusion and unwanted inconsistencies in the spoken submissions. For example, did you really want to hear that Leonardo DiCaprio is staring in a new film, or starring? It makes a difference. I find significant errors in about one in every 10 to 15 sentence sets, and trivial errors in about every third set.
I'll give my take on this: as far as I know, the goal is to train an acoustic model, so spelling wouldn't be the problem; the language model can handle that.
I would say it still matters. A lot of the speakers (myself included) run into situations where a lack of punctuation or a typo changes our inflection and/or the word being transcribed. I can't imagine that this is good for the model, even if it is just the acoustic part.
A lot of the errors are "homophones", but to the keen ear there are few true homophones in English, or in Japanese as far as I have seen.
I'm not talking about homophones. I reject a recording when someone reads "gonna" or "wanna" when they should have said "going to" or "want to".
I'm specifically talking about issues like "here" in place of "hear", which I have seen in the text.
Got it. Well, I don't know about this one.
We now have the Sentence Collector which provides a way to submit new sentences and have them validated. However, it has only been live for the past few weeks and many of the sentences pre-date that. In fact, English currently has a big backlog so you are reviewing sentences recorded around a year ago.
I believe the eventual plan is to resubmit the existing sentences through the Sentence Collector so they can be validated and fixed as needed.
We want to run an automated clean-up on the existing sentences. We haven't yet established a final process to solve the issue described here; the current one is to request the removal of these sentences and then re-submit the corrected versions to the Sentence Collector.
Any ideas to improve this process are welcome!
How many are we talking about? A good proof-reader should be able to check and edit at quite a speed; much quicker than it takes to speak them.
"Hear" and "here" might be a distracting error in print, but would not necessarily affect the spoken words. Leonardo DiCaprio "staring" in a film would, however.
In my area, there's a big difference between "hear" (heer) and "here" (hee-ya).
Even if it doesn't make a difference to pronunciation, sometimes speakers pause or stumble when they encounter a mistake in the text.
We need a tool that lets you:
- search in all submitted sentences of a specified language
- submit a corrected version of that sentence
- provide a justification for the correction (e.g. a link to a dictionary)
- review the corrections that others made (ideally allowing a discussion between corrector and reviewer)
I have already collected dozens of mistakes in the German corpus. The longer we wait, the more of them make it into the dataset.
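To make the list above a bit more concrete, here is a minimal sketch of how such a correction record and its review could be modeled. Nothing here reflects an existing Common Voice or Sentence Collector API: the `Correction` class, the `search` and `is_accepted` helpers, and the threshold of two approvals are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Correction:
    """One proposed fix for a sentence (hypothetical structure)."""
    original: str
    corrected: str
    justification: str = ""  # e.g. a link to a dictionary entry; may be empty
    # Each review is a (reviewer, approved, comment) tuple.
    reviews: list = field(default_factory=list)


def search(sentences, query):
    """Case-insensitive substring search over one language's sentences."""
    q = query.lower()
    return [s for s in sentences if q in s.lower()]


def is_accepted(correction, min_approvals=2):
    """Accept once enough reviewers approve and approvals outnumber rejections.

    The threshold is an assumption, not a project rule.
    """
    approvals = sum(1 for _, approved, _ in correction.reviews if approved)
    rejections = len(correction.reviews) - approvals
    return approvals >= min_approvals and approvals > rejections


if __name__ == "__main__":
    corpus = ["Did you here that?", "I hear you."]
    for hit in search(corpus, "here"):
        fix = Correction(original=hit, corrected=hit.replace("here", "hear"))
        fix.reviews.append(("reviewer_a", True, "clear typo"))
        fix.reviews.append(("reviewer_b", True, ""))
        print(fix.corrected, is_accepted(fix))
```

The comment field on each review is where a discussion between corrector and reviewer could start; a real tool would need persistence and per-language storage on top of this.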
I don't expect we will be able to build the tool you describe in the short term (we have other priorities and not a lot of resources). That's why the current proposal, to at least ensure no bad sentences end up in the dataset, is what I described:
- Request removal from the sentences list.
- Correct the sentences and submit them to the Sentence Collector so they end up on the site in their correct form.
Existing sentences that can be identified automatically as "wrong" will be removed by our cleaning scripts.
In any case your feedback is valuable and we can incorporate it into the list of things we would like to have in the future.
The lack of resources is understandable. Where and how do I request the removal of several sentences? Shall I make a PR on GitHub?
Yes please, thanks for your understanding.
Suggestion: since the volunteer readers will all be seeing these sentences and having to inspect them attentively (we hope) before reading them aloud, could we, in addition to the button saying "Skip", provide a button for "Error in this sentence?" That should act as a filter to collect as many errors as possible, narrowing the task for the second-level reviewers. It won't be infallible, but it could be a big (and inexpensive) step in the right direction.
Incidentally, you speak of needing to cite a reference to justify corrections. That sounds needlessly complex to me, and would be very difficult in the case of punctuation and some of the grammar errors. You'll find plenty of people who can identify that "should" needs to be replaced with "will" but far fewer who will be able to identify it as an inappropriate use of the future conditional subjunctive (for example). Why do you need that anyway?
Couldn't it just be a system where people vote on whether they agree with the correction? That's much simpler.
I suggest we don't go too deep into solution ideation, since at the end of the day it's not something we will be able to change today, and it tends to become an endless conversation about personal preferences.
I think it's better to focus on describing the problem clearly so we can come back here for reference when we have time to start thinking about a proper solution.