It’s explained in more detail in the README, but the basic idea is that it looks for large differences between the number of words it expects and the number of words it receives back from DeepSpeech.
So if it’s expecting 5 words but receives 10 back, that could indicate that the user repeated the sentence or that someone else talked during the recording. If it’s expecting 12 and receives 3, that could indicate a truncated recording, excessive background noise, or a recording that’s too quiet.
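The word-count check described above can be sketched roughly like this. The 1.5× ratio threshold and the function name are my own illustrative choices, not the tool's actual values:

```python
def flag_word_count_mismatch(expected: str, received: str, ratio: float = 1.5) -> bool:
    """Flag a clip when expected and received word counts diverge too much.

    `ratio` is a hypothetical cutoff chosen for illustration; the real
    tool may use a different threshold or heuristic.
    """
    n_expected = len(expected.split())
    n_received = len(received.split())
    if n_received == 0:
        # Nothing came back: likely truncated or too quiet to transcribe.
        return True
    longer = max(n_expected, n_received)
    shorter = min(n_expected, n_received)
    return longer / shorter > ratio

# Expecting 5 words but receiving 10 (a possible repetition) gets flagged;
# a matching count does not.
print(flag_word_count_mismatch("one two three four five",
                               "one two three four five one two three four five"))
print(flag_word_count_mismatch("one two three", "one two three"))
```

A flagged clip is only a candidate for review, not an automatic rejection, since a count mismatch can also come from DeepSpeech mishearing an otherwise valid recording.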
It’s worth mentioning that the only automated process is identifying potential problem clips; I am still manually reviewing them before adding them to the bad clip CSV. I use the same criteria I would if I were validating through the website, and so far I have identified robotic/filtered voices, recordings that are too quiet, incorrect transcripts, truncated recordings, clipped audio, repetitions, and noise drowning out words. The CSV contains the original expected transcript to make it easy for others to check my work.