Deepspeech-tools - Scripts to help manage datasets

dabinat · July 28, 2019, 5:52pm

I developed some tools to help with dataset-related tasks I needed, such as merging transcript CSVs, and I’m releasing them on GitHub in case they’re useful to anyone else.

dabinat · July 30, 2019, 8:18am

I added a new script, clip_stats.py. This is most useful for taking a directory or CSV and calculating the total duration of the files inside it.
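
For anyone curious how a total-duration calculation like this can work, here is a minimal sketch for the directory case. It is not the code from clip_stats.py: the function names are made up for illustration, and it assumes the clips are uncompressed WAV files that Python's standard `wave` module can read.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a single WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()


def total_duration_seconds(directory):
    """Sum the durations of every .wav file directly inside `directory`.

    Hypothetical helper for illustration; it does not recurse into
    subdirectories and ignores non-WAV files.
    """
    return sum(wav_duration_seconds(p) for p in sorted(Path(directory).glob("*.wav")))
```

Calling `total_duration_seconds("clips/")` would return the combined length in seconds; dividing by 3600 gives hours, which is usually the more useful unit for dataset sizes.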

lissyx · August 1, 2019, 1:26pm

Thanks for that initiative! @dabinat Do you think there’s a case for opening a PR? If it’s useful to you, it might be useful to others… @reuben what’s your take?

dabinat · August 5, 2019, 10:46pm

Two new scripts:

csv_purge - Takes a list of filenames and removes them from a CSV. Useful if you remove clips from the Common Voice CSVs and need to sync that between dataset releases.
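
A rough sketch of the purge idea, for illustration only. This is not the actual csv_purge code: the function name is hypothetical, and it assumes the CSV has a header row with a `wav_filename` column (as in DeepSpeech training CSVs) and that the entries in the removal list match that column exactly.

```python
import csv


def purge_rows(csv_in, csv_out, filenames_to_remove, column="wav_filename"):
    """Copy csv_in to csv_out, skipping rows whose `column` value is listed.

    Returns the number of rows removed. The column name is an assumption;
    pass a different one if your CSV uses another header.
    """
    remove = set(filenames_to_remove)  # set gives O(1) membership checks
    removed = 0
    with open(csv_in, newline="", encoding="utf-8") as src, \
         open(csv_out, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[column] in remove:
                removed += 1
            else:
                writer.writerow(row)
    return removed
```

Writing to a separate output file rather than editing in place keeps the original CSV intact, so the same removal list can be reapplied to the next dataset release.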

transcript_check - This compares the expected word count and actual transcribed word count of a clip. The idea is that a big difference is indicative of a potentially bad clip that warrants further investigation.

For example, if it’s expecting five words and it gets ten back, that might be indicative of the user repeating the sentence or background voices. If it’s expecting twelve words and it only gets three, that could be indicative of a partial transcript, excessive background noise or a recording that’s too quiet.
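
The comparison itself can be sketched in a few lines. This is only an illustration of the idea, not the transcript_check code: the function name and the 50% tolerance are assumptions, and the transcribed text is taken as a plain string from whatever inference step produced it.

```python
def word_count_mismatch(expected, transcribed, tolerance=0.5):
    """Return True if the word counts differ by more than `tolerance`.

    `tolerance` is relative to the expected word count, so 0.5 flags a clip
    when the transcript has over 50% more or fewer words than expected.
    The default value is an arbitrary choice for illustration.
    """
    expected_words = len(expected.split())
    actual_words = len(transcribed.split())
    if expected_words == 0:
        # Nothing was expected, so any transcribed word is suspicious.
        return actual_words > 0
    relative_diff = abs(actual_words - expected_words) / expected_words
    return relative_diff > tolerance
```

With these settings, a five-word sentence transcribed as ten words is flagged, as is a twelve-word sentence that comes back as three. A flagged clip is only a candidate for review, since a noisy model can also miscount words on a perfectly good recording.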

I have already used it to identify 40 or so clips that shouldn’t have made it through validation. I’m working on building up a list so they can be removed from the dataset, but that’s likely to be an ongoing project due to the speed of inference.

dabinat · August 5, 2019, 10:49pm

@lissyx I’d prefer to keep them independent in the short term as it makes development easier at my end but I’d be happy to submit a PR once they stabilize.

lissyx · August 6, 2019, 8:24am

Maybe worth mentioning in the docs at least?