In quite a few languages, validation seems to lag behind recording: far more clips are recorded than are validated. This is probably also because each recording needs two validations but only one reading.
Why not use the deepspeech project to help validate the recordings? Each clip could be automatically transcribed by deepspeech. If the transcription matches the sentence text, it counts as one validation; if it does not, the clip would still need both human validation checks. This could reduce the strain on validators and focus their effort on exactly those clips that are hard for deepspeech to recognize.
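To sketch what the matching step could look like: the helper below normalizes both the expected sentence and the STT output (lowercasing, stripping punctuation) before comparing, so trivial formatting differences don't cause false rejections. The `transcribe` call is a hypothetical placeholder for wherever the transcript comes from (e.g. a deepspeech model's `stt()` method); only the comparison logic is shown concretely.

```python
import re


def normalize(text: str) -> list[str]:
    # Lowercase and drop punctuation so "Hello, world!" and
    # "hello world" compare equal; split into word tokens.
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def auto_validate(expected_sentence: str, stt_transcript: str) -> bool:
    # Count as one automatic "yes" vote only on an exact match
    # after normalization; anything else falls back to human review.
    return normalize(expected_sentence) == normalize(stt_transcript)


# Hypothetical integration point (not runnable here):
# model = deepspeech.Model("model.pbmm")
# transcript = model.stt(audio_buffer)
# if auto_validate(sentence_text, transcript):
#     record_automatic_validation(clip)
```

A clip like "Hello, world!" transcribed as "hello world" would pass, while "hello word" would be routed to the normal two-person review queue.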