I’ve been thinking for a bit about ways that the clips presented for review could be pre-screened, so that either the good or the bad could be separated out before they’re actually sent to people to listen to (e.g. the syllable check idea mentioned in passing here, which I’ve yet to get round to looking into further!)
It seems like a script that was recently posted in the Mozilla TTS repo might be useful for exactly this kind of pre-screening - it analyses signal-to-noise ratio (“SNR”): https://github.com/mozilla/TTS/blob/master/dataset_analysis/CheckDatasetSNR.ipynb
If that could be automated, it could identify the worst clips, which wouldn’t be worth sending for human review.
I’m thinking this would pick up cases where the mic was barely working. It’s still not quite as good as giving someone more immediate feedback, as mentioned here, but it would at least save needless review.
Even if it wasn’t used as an absolute decider of quality (i.e. to say that a recording definitely wasn’t of use), it might still add value by letting the priority of review be determined: it makes sense to first focus on the cases which don’t have a relatively bad SNR, and then come back to the more borderline cases later. If used this way, it wouldn’t necessarily need to run interactively - it could be batched to run over a load of clips and prioritise them.
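To make the batch-prioritisation idea concrete, here’s a rough sketch of what it might look like. Note this uses a crude energy-based SNR estimate (quietest frames as noise floor, loudest as speech) rather than the WADA-SNR method the linked notebook uses, and the `prioritise` function and its threshold are hypothetical names/values just for illustration:

```python
import numpy as np

def estimate_snr_db(signal, frame_len=1024):
    # Crude energy-based SNR estimate: treat the quietest 20% of frames
    # as the noise floor and the loudest 20% as speech. A rough proxy
    # only - not the WADA-SNR algorithm used in the notebook above.
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.sort(np.mean(frames ** 2, axis=1))
    k = max(1, n_frames // 5)
    noise = np.mean(energies[:k])
    speech = np.mean(energies[-k:])
    return 10 * np.log10(speech / max(noise, 1e-12))

def prioritise(clips, snr_threshold_db=15.0):
    # clips: list of (name, samples) pairs. Sort best-first; anything
    # below the threshold is deferred as "borderline" rather than
    # discarded outright, per the idea above.
    scored = [(name, estimate_snr_db(sig)) for name, sig in clips]
    scored.sort(key=lambda item: item[1], reverse=True)
    review_first = [c for c in scored if c[1] >= snr_threshold_db]
    review_later = [c for c in scored if c[1] < snr_threshold_db]
    return review_first, review_later
```

A batch job could run this over a night’s worth of clips and feed the `review_first` list to reviewers, with `review_later` held back.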
Any thoughts?