We are evaluating DeepSpeech for a call-center project. We have a lot of audio to transcribe; it is relatively poor quality, but our accuracy requirements aren't high. "Poor quality" means 8 kHz recordings, often compressed with G.729, G.711, or Speex. A WER of 30%, or even 40-50%, would be acceptable for this application: open-vocabulary American English conversation.
Out-of-the-box results with 0.3.0 seem OK. Results on upsampled audio aren't good or even useful yet, maybe 60%-ish WER, but the pipeline works, and we think that with some effort we could reach useful accuracy. Inference performance seems good on GPUs. We use our own noise-robust VAD to prepare input segments, roughly as sketched below.
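To be concrete, our evaluation flow is roughly the following. This is a minimal sketch, not our production code: it assumes ffmpeg on PATH, the 0.3.x `deepspeech` CLI, and the pretrained model files from the 0.3.0 release, and all paths are placeholders.

```python
# Sketch of the evaluation flow (our assumptions: ffmpeg installed, DeepSpeech
# 0.3.x CLI available, 0.3.0 pretrained model files; file paths are placeholders).
import subprocess

def transcribe_segment(segment_8k_wav: str) -> str:
    """Upsample one 8 kHz VAD segment to 16 kHz mono PCM and run the pretrained model."""
    upsampled = segment_8k_wav.replace(".wav", "_16k.wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", segment_8k_wav, "-ar", "16000", "-ac", "1",
         "-acodec", "pcm_s16le", upsampled],
        check=True,
    )
    result = subprocess.run(
        ["deepspeech", "--model", "models/output_graph.pbmm",
         "--alphabet", "models/alphabet.txt",
         "--lm", "models/lm.binary", "--trie", "models/trie",
         "--audio", upsampled],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```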
Our thinking is to start from the pretrained models (preferably the 0.5 noise-robust models) and retrain on data in our domain. That data could be CommonVoice samples downsampled/transcoded/upsampled to simulate 8 kHz compressed audio (roughly as sketched below), perhaps with noise augmentation, and/or human-transcribed samples of our own audio.
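For the CommonVoice transcoding, the kind of simulation we have in mind is roughly this. It's a sketch under our own assumptions, limited to G.711: as far as we know ffmpeg has no G.729 encoder and only encodes Speex when built with libspeex, so those codecs would need an external tool.

```python
# Rough sketch (our assumption, not a tested pipeline): simulate 8 kHz G.711
# telephone audio from 16 kHz CommonVoice clips, then bring it back to 16 kHz
# PCM so it matches the sample rate the pretrained models expect.
import os
import subprocess

def simulate_telephone(src_wav: str, dst_wav: str) -> None:
    """Downsample to 8 kHz, round-trip through G.711 mu-law, upsample to 16 kHz."""
    tmp = dst_wav + ".g711.wav"
    # 16 kHz source -> 8 kHz mono mu-law (G.711)
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_wav, "-ar", "8000", "-ac", "1",
         "-acodec", "pcm_mulaw", tmp],
        check=True,
    )
    # Decode and upsample back to 16 kHz signed 16-bit PCM for training
    subprocess.run(
        ["ffmpeg", "-y", "-i", tmp, "-ar", "16000", "-acodec", "pcm_s16le", dst_wav],
        check=True,
    )
    os.remove(tmp)

if __name__ == "__main__":
    simulate_telephone("commonvoice_clip.wav", "commonvoice_clip_8k_sim.wav")
```

Noise augmentation would be layered on top of this, either before or after the codec round-trip.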
How well do you think such retraining would work with varying amounts of data (say 100, 200, or 500 hours)? How would the result compare to training from scratch in our domain? Does anyone have references for comparable retraining projects? We're trying to get a sense of the effort involved and the prospects for success.