I trained DeepSpeech v4 on a portion of the data from http://openslr.org/53 (33 hours in total, which I know is not nearly enough for a good model). Training produced relatively good loss values on the validation and test sets, but the WER on the test set is very high.
The training parameters were:
N_hidden = 2048 (I wanted to overfit to get an overview)
Dropout = 0.2
Learning Rate = 0.0001
Epoch = 50 (Early Stop triggered after 15 epochs)
Beam width = 1024
Alphabet Size = 62
Train/Dev/Test Ratio = 80/10/10
** All other parameters were left at their default values
Model performance:
Train Loss = 13
Dev Loss = 24
Test Loss = 24
Test WER = 0.75
Test Edit Distance = 0.36
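For context on the gap between the two metrics above, here is a minimal sketch of how WER (word-level) and normalized edit distance (character-level) relate, using a plain Levenshtein distance rather than the DeepSpeech evaluation code:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    # Word error rate: edit distance over word tokens, normalized by reference length.
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / len(r)

def cer(ref, hyp):
    # Character-level normalized edit distance.
    return levenshtein(ref, hyp) / len(ref)

ref = "the quick brown fox"
hyp = "the quik brown fox"
print(wer(ref, hyp))  # 0.25 -- one word of four is wrong
print(cer(ref, hyp))  # ~0.05 -- but only one character of nineteen
```

A single wrong character counts as a whole wrong word, which is why WER can sit far above character edit distance, as in the 0.75 vs 0.36 numbers here.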
The dataset was split by speaker: each speaker's utterances were distributed 80/10/10 among the train, dev, and test sets.
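The per-speaker split described above can be sketched as follows (the `(speaker_id, utterance)` tuple format and function name are illustrative, not from any DeepSpeech tooling):

```python
import random
from collections import defaultdict

def split_by_speaker(utterances, ratios=(0.8, 0.1, 0.1), seed=0):
    """Distribute each speaker's utterances 80/10/10 across train/dev/test.

    `utterances` is a list of (speaker_id, utterance) pairs; every speaker
    contributes to all three sets, so the split is speaker-matched.
    """
    by_speaker = defaultdict(list)
    for spk, utt in utterances:
        by_speaker[spk].append(utt)

    rng = random.Random(seed)
    train, dev, test = [], [], []
    for utts in by_speaker.values():
        rng.shuffle(utts)
        n_train = int(len(utts) * ratios[0])
        n_dev = int(len(utts) * ratios[1])
        train += utts[:n_train]
        dev += utts[n_train:n_train + n_dev]
        test += utts[n_train + n_dev:]
    return train, dev, test
```

Note that with this scheme every test speaker is also heard during training, so the test set is acoustically matched to training, which usually makes the reported WER optimistic rather than pessimistic.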
An experiment on another dataset, http://openslr.org/37/, gave a much lower WER on the test set, even though it contains only 9 hours of speech. Using pitch/tempo/speed augmentation I was able to reach a minimum of 0.24 WER and 0.08 edit distance on the test set. All parameters were the same except N_hidden, which was 1024 for this experiment.
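For reference, the simplest of the augmentations mentioned above (speed perturbation) can be sketched with plain resampling; real pipelines usually use sox or librosa, and this toy version is only meant to illustrate the idea:

```python
import numpy as np

def speed_perturb(samples, factor):
    """Speed perturbation by linear-interpolation resampling.

    factor > 1 speeds the clip up (shorter output), factor < 1 slows it
    down. Note that naive resampling shifts pitch along with speed (like
    sox's `speed` effect); tempo-only or pitch-only changes need a phase
    vocoder instead.
    """
    n_out = int(round(len(samples) / factor))
    # Positions in the original signal that each output sample reads from.
    idx = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(idx, np.arange(len(samples)), samples)

# Typical augmentation sweeps a few factors around 1.0.
y = np.sin(np.linspace(0, 2 * np.pi, 16000))  # 1 s dummy clip at 16 kHz
for f in (0.9, 1.0, 1.1):
    print(f, len(speed_perturb(y, f)))
```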
So, what may be causing the high WER on the test set? What else could I investigate, other than increasing the dataset size (training is prohibitively time-consuming on a larger dataset)? Thanks!