Would I be right in thinking that the publicly captured Common Voice data will at some point be used to train models in Mozilla’s DeepSpeech library?
I’ve been able to get Common Voice working locally myself and just recently managed to get the basic training example in DeepSpeech running successfully (on a GPU, to boot), so I was thinking I’d take a look at how to wrangle the Common Voice data into the right form to use with DeepSpeech for training.
Is there a plan to do this kind of thing within the Common Voice or DeepSpeech repos (or perhaps neither)?
My guess (optimistically!) is that this may not be too hard, but I thought I’d see whether it was on the cards or even already under way.
BTW: what I’m suggesting is basically as described here:
So it seems like it’s a matter of getting the data out of my S3 bucket, downloading it locally, and then generating a CSV listing the files and their corresponding transcript text.
We absolutely plan to use the Common Voice data with Mozilla’s DeepSpeech engine. Our goal is to release the first version of this data by the end of the year, in a format that makes it easy to import into projects like DeepSpeech.
While this is certainly in the cards, we haven’t started this process yet. Perhaps we can enlist your help once we pick up this work in earnest (probably in the November timeframe)?
That’s great, I would be delighted to help if I can @mhenretty
With a slightly hacky combo of the AWS CLI and adaptations of the existing import and run scripts, I’ve managed to put together something that did the trick. Of course something more polished and straight-through would be better, but it’s a start!
That walks your local bucket folder, pairing up the Common Voice transcripts and mp3 files, cleaning up the text of the former and converting the latter into .wav files in a data folder, then creates a .csv file for each of training, dev and test (in that same data folder).
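For anyone wanting to try the same thing, here is a rough sketch of the pairing-and-CSV step. To be clear, this is my own illustration, not the actual script: the function names, the clean-up rules and the 80/10/10 split are all assumptions, and the mp3 → .wav conversion is left out (a tool like sox or ffmpeg would handle that). The three-column CSV layout (wav_filename, wav_filesize, transcript) is the one DeepSpeech’s importers use.

```python
import csv
import os
import random

# Column layout expected by DeepSpeech's importer CSVs.
DEEPSPEECH_FIELDS = ["wav_filename", "wav_filesize", "transcript"]


def clean_transcript(text):
    # Minimal clean-up sketch: lowercase and drop characters outside a
    # simple a-z / apostrophe / space alphabet. The real script's rules
    # may differ.
    allowed = set("abcdefghijklmnopqrstuvwxyz' ")
    return "".join(c for c in text.lower() if c in allowed).strip()


def build_csvs(data_dir, train_frac=0.8, dev_frac=0.1, seed=0):
    """Pair up .txt transcripts with .wav files in data_dir and write
    train/dev/test CSVs there. Assumes the mp3 -> wav conversion has
    already been done, so each "clip.txt" sits next to a "clip.wav"."""
    rows = []
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".txt"):
            continue
        wav = os.path.join(data_dir, name[:-4] + ".wav")
        if not os.path.exists(wav):
            continue  # orphan transcript with no audio - skip it
        with open(os.path.join(data_dir, name)) as f:
            transcript = clean_transcript(f.read())
        rows.append([wav, os.path.getsize(wav), transcript])

    # Shuffle deterministically, then slice into the three splits.
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_dev = int(len(rows) * dev_frac)
    splits = {
        "train": rows[:n_train],
        "dev": rows[n_train:n_train + n_dev],
        "test": rows[n_train + n_dev:],
    }
    for split, split_rows in splits.items():
        with open(os.path.join(data_dir, split + ".csv"), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(DEEPSPEECH_FIELDS)
            writer.writerows(split_rows)
    return {k: len(v) for k, v in splits.items()}
```

The resulting train.csv, dev.csv and test.csv paths are what you’d then pass to DeepSpeech’s training run.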
NB: one problem with my bucket is a handful of transcript files without corresponding .mp3 files. I should clean them up properly, but for now I just delete those transcripts after I sync.
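That post-sync clean-up could be sketched as a small helper like the one below. Again this is hypothetical (my own function name and dry_run idea, not from the script), but a dry-run mode is handy when you’re about to delete files in bulk.

```python
import glob
import os


def remove_orphan_transcripts(data_dir, dry_run=True):
    """Find .txt transcripts in data_dir with no matching .mp3 beside them.

    With dry_run=True this only reports the orphans; pass dry_run=False
    to actually delete them. Returns the list of orphan paths either way.
    """
    orphans = []
    for txt in glob.glob(os.path.join(data_dir, "*.txt")):
        mp3 = txt[:-4] + ".mp3"
        if not os.path.exists(mp3):
            orphans.append(txt)
            if not dry_run:
                os.remove(txt)
    return sorted(orphans)
```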
So far I’m getting fairly good results, but I need to create more Common Voice recordings (I’ve done about 1,800 or so), and I’ve no doubt got lots to learn about how best to tweak the DeepSpeech settings.
I hope that helps. It’s a start, but there’s a lot that could be improved (easily!). Big thanks to the Mozilla teams for making both Common Voice and DeepSpeech so awesome!!
This is an amazing start @nmstoker!!! You’ve really given us a leg up when we start our integration (which we will be working on in November). Thank you for this!!!
One thing I would say to people reading this and looking for ways to train DeepSpeech is to look into using the built-in mechanisms to train the model. The bin/librivox script, for example, will fetch 55GB of audio and transcriptions from a variety of audiobooks and train the model using that. There is also a bin/voxforge script that will download about 6GB of audio data and train the model on that.