Deepspeech-tools - Scripts to help manage datasets

I developed some tools to help with dataset-related tasks I needed, such as merging transcript CSVs together, and I’m releasing them on GitHub in case they’re useful to anyone else.


I added a new script, clip_stats.py. This is most useful for taking a directory or CSV and calculating the total duration of the files inside it.
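For anyone wondering what that calculation involves, here is a minimal sketch of summing clip durations for a directory. It is not the code from clip_stats.py: it assumes the clips are uncompressed WAV files, and the function names `wav_duration_seconds` and `total_duration_seconds` are made up for illustration.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of one WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        # Duration is the number of audio frames divided by the sample rate.
        return wav.getnframes() / wav.getframerate()


def total_duration_seconds(directory):
    """Sum the durations of every .wav file directly inside `directory`."""
    # Assumption: clips are WAV files; other formats would need a decoder.
    return sum(wav_duration_seconds(p) for p in Path(directory).glob("*.wav"))
```

Calling `total_duration_seconds("clips")` on a hypothetical `clips` folder returns the total in seconds, so dividing by 3600 gives hours.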

Thanks for that initiative! @dabinat Do you think there’s a case for opening a PR? If it’s useful to you, it might be useful to others… @reuben what’s your take?

Two new scripts:

csv_purge - Takes a list of filenames and removes them from a CSV. Useful if you remove clips from the Common Voice CSVs and need to sync that between dataset releases.
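As a rough illustration of the idea (not the actual csv_purge code), the sketch below copies a CSV while skipping rows for unwanted clips. It assumes the DeepSpeech-style column name `wav_filename` and matches on the file's base name; `purge_csv` and its parameters are hypothetical.

```python
import csv
import os


def purge_csv(src_path, dst_path, filenames, column="wav_filename"):
    """Copy src_path to dst_path, skipping rows whose clip is in `filenames`.

    Returns the number of rows removed.
    """
    # Assumption: clips are identified by base name, so directory
    # prefixes in the CSV or in the removal list are ignored.
    unwanted = {os.path.basename(name) for name in filenames}
    removed = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if os.path.basename(row[column]) in unwanted:
                removed += 1
            else:
                writer.writerow(row)
    return removed
```

Writing to a separate output file keeps the original CSV intact, which makes it easy to re-run the same removal list against a newer dataset release.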

transcript_check - This compares the expected word count and actual transcribed word count of a clip. The idea is that a big difference is indicative of a potentially bad clip that warrants further investigation.

For example, if it’s expecting five words and it gets ten back, that might be indicative of the user repeating the sentence or background voices. If it’s expecting twelve words and it only gets three, that could be indicative of a partial transcript, excessive background noise or a recording that’s too quiet.
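The comparison itself can be sketched in a few lines. This is only an illustration of the word-count check described above, not the transcript_check script: the function names and the 0.5 threshold are assumptions, and the transcribed text would come from running inference on the clip.

```python
def word_count_difference(expected_transcript, actual_transcript):
    """Return the relative difference in word count.

    Positive means more words were transcribed than expected,
    negative means fewer.
    """
    expected = len(expected_transcript.split())
    actual = len(actual_transcript.split())
    # Guard against an empty expected transcript.
    return (actual - expected) / max(expected, 1)


def is_suspicious(expected_transcript, actual_transcript, threshold=0.5):
    """Flag a clip whose word count is off by more than `threshold`.

    The default threshold of 0.5 is an arbitrary value for illustration.
    """
    difference = word_count_difference(expected_transcript, actual_transcript)
    return abs(difference) > threshold
```

With these definitions, five expected words and ten transcribed words give a difference of 1.0, and twelve expected with three transcribed give -0.75, so both cases would be flagged for review.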

I have already used it to identify 40 or so clips that shouldn’t have made it through validation. I’m working on building up a list so they can be removed from the dataset, but that’s likely to be an ongoing project because inference is slow.

@lissyx I’d prefer to keep them independent in the short term, as it makes development easier at my end, but I’d be happy to submit a PR once they stabilize.

Maybe worth mentioning in the docs at least?