In quite a few languages, validation seems to lag behind recording: far more clips are recorded than are validated. This is probably also because each recording needs two validations but only one reading.
Why not use the deepspeech project to help validate the recordings? Each clip could be automatically transcribed by deepspeech. If the transcription matches the sentence text, it counts as one validation; if it does not, the clip would still need both human validation checks. This could reduce the strain on validators and focus their effort on exactly those clips that are hard for deepspeech to recognize.
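To sketch what the matching step could look like: the helper below normalizes both the expected sentence and the STT output (lowercasing, stripping punctuation) before comparing, so trivial formatting differences don't cause false rejections. The `transcribe` call is a hypothetical placeholder for wherever the transcript comes from (e.g. a deepspeech model's `stt()` method); only the comparison logic is shown concretely.

```python
import re


def normalize(text: str) -> list[str]:
    # Lowercase and drop punctuation so "Hello, world!" and
    # "hello world" compare equal; split into word tokens.
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def auto_validate(expected_sentence: str, stt_transcript: str) -> bool:
    # Count as one automatic "yes" vote only on an exact match
    # after normalization; anything else falls back to human review.
    return normalize(expected_sentence) == normalize(stt_transcript)


# Hypothetical integration point (not runnable here):
# model = deepspeech.Model("model.pbmm")
# transcript = model.stt(audio_buffer)
# if auto_validate(sentence_text, transcript):
#     record_automatic_validation(clip)
```

A clip like "Hello, world!" transcribed as "hello world" would pass, while "hello word" would be routed to the normal two-person review queue.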