Hi,
I trained a DeepSpeech model for Arabic. I generated the audio corpus with Google TTS (https://github.com/pndurette/gTTS): I collected a text corpus of 22,583 text files, each containing a short text, then used a small script to convert them to .mp3 with gTTS. I then converted the .mp3 files to mono 16 kHz .wav (most of the resulting files are under 30 seconds, though some reach up to about 90 seconds).
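For reference, the mp3-to-wav conversion step can be sketched with ffmpeg as below (the helper names are my own; DeepSpeech expects 16 kHz mono 16-bit PCM):

```python
import subprocess
from pathlib import Path

def to_wav_cmd(mp3_path, wav_path):
    """Build the ffmpeg command that converts an .mp3 to 16 kHz mono 16-bit PCM .wav."""
    return [
        "ffmpeg", "-y",
        "-i", str(mp3_path),
        "-ar", "16000",            # resample to 16 kHz
        "-ac", "1",                # downmix to mono
        "-acodec", "pcm_s16le",    # 16-bit signed little-endian PCM
        str(wav_path),
    ]

def convert(mp3_path):
    """Convert one .mp3 file to a .wav with the same stem (requires ffmpeg on PATH)."""
    wav_path = Path(mp3_path).with_suffix(".wav")
    subprocess.run(to_wav_cmd(mp3_path, wav_path), check=True)
    return wav_path
```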
Training was successful, with a WER of 6.7828%.
The big problem is at inference time: 100% WER!
My inference method was to obtain .mp3 audio from YouTube videos, convert it to .wav, and then run inference on it.
I suspect the problem comes from the fact that my audio corpus was generated by a single speaker (the female Arabic voice of Google Translate).
My questions are: should I repeat training with more than one speaker? What is the recommended number of speakers I should use?
Also, can you tell me how many speakers were used for your pre-trained model?
Thanks in advance!