Hi,
I trained a DeepSpeech model for Arabic. I generated the audio corpus with Google TTS (https://github.com/pndurette/gTTS): I collected a text corpus of 22,583 text files, each containing a short text, then used a small script to convert them to .mp3 with gTTS. I then converted the .mp3 files to mono 16 kHz .wav (most of the resulting files are under 30 seconds, though some reach up to about 90 seconds).
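For reference, the mp3-to-wav conversion step can be sketched with ffmpeg as below (the helper names are my own; DeepSpeech expects 16 kHz mono 16-bit PCM):

```python
import subprocess
from pathlib import Path

def to_wav_cmd(mp3_path, wav_path):
    """Build the ffmpeg command that converts an .mp3 to 16 kHz mono 16-bit PCM .wav."""
    return [
        "ffmpeg", "-y",
        "-i", str(mp3_path),
        "-ar", "16000",            # resample to 16 kHz
        "-ac", "1",                # downmix to mono
        "-acodec", "pcm_s16le",    # 16-bit signed little-endian PCM
        str(wav_path),
    ]

def convert(mp3_path):
    """Convert one .mp3 file to a .wav with the same stem (requires ffmpeg on PATH)."""
    wav_path = Path(mp3_path).with_suffix(".wav")
    subprocess.run(to_wav_cmd(mp3_path, wav_path), check=True)
    return wav_path
```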
Training was successful, with a WER of 6.7828%.
The big problem is at inference time: 100% WER!
My inference method was to obtain .mp3 audio from YouTube videos, convert it to .wav, and then run inference on it.
I suspect the problem comes from the fact that my audio corpus was generated by a single speaker (the female Arabic voice of Google Translate).
My questions are: should I repeat training with more than one speaker? What is the recommended number of speakers I should use?
Also, can you tell me how many speakers were used for your pre-trained model?
Thanks in advance!