I am running a research project which requires transcoding audio from telephone calls, and I have two main questions:
1st, would training a model on GSM- and µ-law-encoded audio actually improve performance? We’ve had mixed results with the pretrained model that ships with DeepSpeech, but we’re also on a time crunch and need to make the most of our time.
2nd, if I wanted to train on the CommonVoice audio with half of it transcoded to GSM and µ-law, what would be the best way to do that? Can I do it through the import_cv2.py script? We’d like to avoid transcoding to GSM/µ-law WAV and then converting back to MP3.
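In case it helps clarify what I mean: for the µ-law half, my rough fallback plan (if import_cv2.py can’t do this directly, which I’m not sure about) is to apply G.711 µ-law companding to the 16-bit PCM samples in Python after the importer has already produced WAV clips, so nothing ever has to go back to MP3. This is just a sketch of standard G.711 µ-law encode/decode, not anything DeepSpeech-specific; for the GSM half I’d presumably still have to shell out to sox or ffmpeg, since there’s no codec in the standard library:

```python
# Standard G.711 mu-law companding applied to 16-bit PCM samples.
# Encoding then decoding simulates the lossy telephone codec in place,
# so augmented clips stay as WAV/PCM the whole way through.

BIAS = 0x84    # 132, standard G.711 bias
CLIP = 32635   # clip level before companding

def linear_to_ulaw(sample: int) -> int:
    """Compress one signed 16-bit PCM sample to an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    if sample < 0:
        sample = -sample
    if sample > CLIP:
        sample = CLIP
    sample += BIAS
    # Find the segment (exponent) of the biased sample.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (sample & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (sample >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def ulaw_to_linear(ulaw_byte: int) -> int:
    """Expand one 8-bit mu-law byte back to a signed 16-bit PCM sample."""
    u = ~ulaw_byte & 0xFF
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -sample if sign else sample

def simulate_ulaw_pass(samples):
    """Round-trip a list of PCM samples through the mu-law codec."""
    return [ulaw_to_linear(linear_to_ulaw(s)) for s in samples]
```

The idea would be to read each WAV with the `wave` module, run the samples through `simulate_ulaw_pass`, and write them back out as 16-bit PCM, so the training pipeline never sees a different container format. (Again, this assumes companding in place is an acceptable stand-in for a real encode/decode cycle.)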
I apologise for any mistakes or oversights in my post; I’m not primarily an ML person, so this work is a little outside my wheelhouse.
Thanks for your help!