Would I be right in thinking that the publicly captured Common Voice data will at some point be used to train models in Mozilla’s DeepSpeech library?
I’ve been able to get Common Voice working locally myself and just recently managed to get the basic training example in DeepSpeech running successfully (on a GPU, to boot), so I was thinking I’d take a look at how to wrangle the Common Voice data into the right form to use with DeepSpeech for training.
Is there a plan to do this kind of thing within the Common Voice or DeepSpeech repos (or perhaps neither)?
My guess (optimistically!) is that this may not be too hard, but I thought I’d see whether it was on the cards or even already under way.
BTW: what I’m suggesting is basically as described here:
So it seems like it’s a matter of getting the data out of my S3 bucket, downloading it locally, and then generating a CSV listing the files and their corresponding transcript text.
We absolutely plan to use the Common Voice data with Mozilla’s DeepSpeech engine. Our goal is to release the first version of this data by the end of the year, in a format that makes it easy to import into projects like DeepSpeech.
While this is certainly in the cards, we haven’t started this process yet. Perhaps we can enlist your help once we pick up this work in earnest (probably in the November timeframe)?
That’s great, I would be delighted to help if I can @mhenretty
With a slightly hacky combo of the AWS CLI and adaptations of the existing import and run scripts, I’ve managed to put together something that did the trick. Of course something more polished and straight-through would be better, but it’s a start!
That walks your local bucket folder, pairing up the Common Voice transcripts and mp3 files, cleaning up the text of the former and converting the latter into .wav files in a data folder, then creates a .csv file for each of training, dev and test (in that same data folder).
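For anyone wanting to try the same thing, here is a rough sketch of the pairing-and-CSV step. To be clear, this is my own illustration, not the actual script: the function names, the clean-up rules and the 80/10/10 split are all assumptions, and the mp3 → .wav conversion is left out (a tool like sox or ffmpeg would handle that). The three-column CSV layout (wav_filename, wav_filesize, transcript) is the one DeepSpeech’s importers use.

```python
import csv
import os
import random

# Column layout expected by DeepSpeech's importer CSVs.
DEEPSPEECH_FIELDS = ["wav_filename", "wav_filesize", "transcript"]


def clean_transcript(text):
    # Minimal clean-up sketch: lowercase and drop characters outside a
    # simple a-z / apostrophe / space alphabet. The real script's rules
    # may differ.
    allowed = set("abcdefghijklmnopqrstuvwxyz' ")
    return "".join(c for c in text.lower() if c in allowed).strip()


def build_csvs(data_dir, train_frac=0.8, dev_frac=0.1, seed=0):
    """Pair up .txt transcripts with .wav files in data_dir and write
    train/dev/test CSVs there. Assumes the mp3 -> wav conversion has
    already been done, so each "clip.txt" sits next to a "clip.wav"."""
    rows = []
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".txt"):
            continue
        wav = os.path.join(data_dir, name[:-4] + ".wav")
        if not os.path.exists(wav):
            continue  # orphan transcript with no audio - skip it
        with open(os.path.join(data_dir, name)) as f:
            transcript = clean_transcript(f.read())
        rows.append([wav, os.path.getsize(wav), transcript])

    # Shuffle deterministically, then slice into the three splits.
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_dev = int(len(rows) * dev_frac)
    splits = {
        "train": rows[:n_train],
        "dev": rows[n_train:n_train + n_dev],
        "test": rows[n_train + n_dev:],
    }
    for split, split_rows in splits.items():
        with open(os.path.join(data_dir, split + ".csv"), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(DEEPSPEECH_FIELDS)
            writer.writerows(split_rows)
    return {k: len(v) for k, v in splits.items()}
```

The resulting train.csv, dev.csv and test.csv paths are what you’d then pass to DeepSpeech’s training run.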
NB: one problem with my bucket is a handful of transcript files without corresponding .mp3 files. I should clean them up properly, but for now I just delete those transcripts after I sync.
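That post-sync clean-up could be sketched as a small helper like the one below. Again this is hypothetical (my own function name and dry_run idea, not from the script), but a dry-run mode is handy when you’re about to delete files in bulk.

```python
import glob
import os


def remove_orphan_transcripts(data_dir, dry_run=True):
    """Find .txt transcripts in data_dir with no matching .mp3 beside them.

    With dry_run=True this only reports the orphans; pass dry_run=False
    to actually delete them. Returns the list of orphan paths either way.
    """
    orphans = []
    for txt in glob.glob(os.path.join(data_dir, "*.txt")):
        mp3 = txt[:-4] + ".mp3"
        if not os.path.exists(mp3):
            orphans.append(txt)
            if not dry_run:
                os.remove(txt)
    return sorted(orphans)
```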
So far I’m getting fairly good results, but I need to create more Common Voice recordings (I’ve done about 1,800 or so), and I’ve no doubt got lots to learn about how best to tweak the DeepSpeech settings.
I hope that helps. It’s a start, but there’s a lot that could be improved (easily!). Big thanks to the Mozilla teams for making both Common Voice and DeepSpeech so awesome!!
This is an amazing start @nmstoker!!! You’ve really given us a leg up when we start our integration (which we will be working on in November). Thank you for this!!!
One thing I would say to people reading this and looking for ways to train DeepSpeech is to look into using the built-in mechanisms to train the model. The bin/librivox script, for example, will fetch 55GB of audio and transcriptions from a variety of audiobooks and train the model using that. There is also a bin/voxforge script that will download about 6GB of audio data and train the model on that.