I am trying to use DeepSpeech to perform speech recognition on my Android device. Since the default model is too big and a bit slow, I wanted to optimize the model for my specific use-case, so I tried to build my own tflite model, lm.binary and trie. I have a very limited dataset (around 30 words). I am trying to understand how DeepSpeech works in practice and have a couple of questions related to this.
While training the model:
My training data consists only of single-word utterances, i.e. each .wav file contains exactly one word.
- How do I decide what hidden-layer size (`--n_hidden`) is sufficient for my use-case? I tried values of 150 and 250, which gave me more or less the same accuracy.
- How many epochs should I run to sort of overfit my model on these specific 30 words? I tried 30 epochs with a learning rate of 0.0001, but the WER doesn't drop below 0.51. If I reduce the learning rate further, will I be able to reach a WER of 0.1 or less?
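To track that WER number myself while experimenting, I wrote a small standalone check (a minimal sketch, not part of the DeepSpeech API; it is just word-level Levenshtein distance divided by reference length, which is how WER is usually defined):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("yes no", "yes"))  # 0.5: one of two reference words was dropped
```

This makes it easy to see, per test file, whether the 0.51 WER comes from wrong words or from dropped words.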
- How many utterances of each word do I need to train such a model? Presently, I have around 400-500 utterances of each word and am using a 65-20-15 train-dev-test split. Is this sufficient? How do I know how many utterances are needed for different use cases?
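For reference, this is roughly how I produce the 65-20-15 split (a sketch with placeholder file names; the real CSVs of course also carry the transcript and file-size columns DeepSpeech expects):

```python
import random

def split_utterances(files, seed=0):
    """Shuffle one word's utterance files and split them 65/20/15
    into (train, dev, test). Applied per word, so every word is
    represented in each split."""
    rng = random.Random(seed)
    files = files[:]          # don't mutate the caller's list
    rng.shuffle(files)
    n_train = int(len(files) * 0.65)
    n_dev = int(len(files) * 0.20)
    return (files[:n_train],
            files[n_train:n_train + n_dev],
            files[n_train + n_dev:])

# placeholder names standing in for ~400 recordings of one word
utts = [f"word1_{i:03d}.wav" for i in range(400)]
train, dev, test = split_utterances(utts)
print(len(train), len(dev), len(test))  # 260 80 60
```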
While building the LM (lm.binary) file:
- What corpus should I use to generate the language model? I first tried a corpus consisting of all 30 words, one per line. With that LM, inference always returned a single word, even when the .wav file contained multiple words. I then rebuilt the LM with a corpus whose lines contained multiple words (from my 30-word dictionary), but it still recognized only single words. The same LM file gave multiple words (working fine and as expected) when I used the default tflite model of DeepSpeech 0.5.1. Should I also make the training data such that multiple words occur in each .wav?
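For the second attempt, this is how I generated the multi-word corpus (a sketch with a placeholder vocabulary; the idea is that an n-gram tool such as KenLM's lmplz needs to see word-to-word transitions on a line to assign multi-word sequences any probability):

```python
import random

# Placeholder standing in for my actual 30-word vocabulary.
VOCAB = ["yes", "no", "left", "right", "stop"]

def make_corpus_lines(n_lines, min_len=1, max_len=5, seed=0):
    """Build corpus lines of 1..max_len random vocabulary words,
    so the n-gram LM observes both single words and sequences."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n_lines):
        k = rng.randint(min_len, max_len)
        lines.append(" ".join(rng.choice(VOCAB) for _ in range(k)))
    return lines

lines = make_corpus_lines(1000)
# with open("corpus.txt", "w") as f:
#     f.write("\n".join(lines))
```

I then fed corpus.txt through the usual KenLM steps (lmplz to get an ARPA file, build_binary to get lm.binary) and generate_trie for the trie, but the single-word-only behavior persisted with my own model.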