“The First Noble Truth of machine learning: tuning hyperparameters is painful.” -Me
That said, from this bit “…after 6 epochs val loss stops going down and it starts going up…” it sounds like the learning rate is too high.
As to your query: yes, the hyperparameters do need to be changed for different data sets, and finding the correct values is more of an art than a science. For example, we initially found the dropout values by brute force, doing a binary search against a test dataset.
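To make the "binary search on dropout" concrete, here is a minimal sketch. `evaluate` is a hypothetical stand-in for "train on the test set and return the loss", not a real DeepSpeech API, and the search is strictly a ternary-style interval search (you need two probe points per step to know which half to discard):

```python
# Hypothetical sketch of searching for a good dropout value.
# `evaluate(dropout)` is an assumed stand-in for "train with this
# dropout, return the resulting loss" -- NOT a real DeepSpeech call.

def tune_dropout(evaluate, lo=0.0, hi=0.5, iters=6):
    """Narrow [lo, hi] toward the dropout with the lowest loss,
    assuming loss is roughly unimodal in the dropout rate."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0          # two interior probe points
        b = hi - (hi - lo) / 3.0
        if evaluate(a) < evaluate(b):     # keep the better-scoring half
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0

# Toy stand-in objective with its minimum near dropout = 0.2:
best = tune_dropout(lambda d: (d - 0.2) ** 2)
```

Each iteration discards a third of the interval, so six iterations narrow a [0.0, 0.5] range to within a few hundredths of the best value, each at the cost of two training runs.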
For tuning the hyperparameters, what I suggest is creating a subset of your data set; the --limit_train, --limit_dev, and --limit_test command-line parameters are useful for this. The training subset should be around 16k samples, so it’s large enough to give a good statistical representation of the full data set.
Using this subset you can then tune the various hyperparameters through trial and error relatively rapidly, as the data set is not too large. And because the subset is still statistically representative, the values you find will also work for the full data set.
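The trial-and-error loop over the subset can be sketched as a simple grid search. `train_and_eval` is a hypothetical stand-in for "train on the ~16k-sample subset and return the validation loss", and the parameter grids are placeholders:

```python
# Hypothetical sketch of subset-based trial-and-error tuning.
# `train_and_eval(lr, dropout)` is an assumed stand-in for training
# on the small subset -- it is not a real DeepSpeech function.
import itertools

def grid_search(train_and_eval, learning_rates, dropouts):
    """Try every (learning_rate, dropout) pair on the subset and
    return the combination with the lowest validation loss."""
    best_params, best_loss = None, float("inf")
    for lr, dr in itertools.product(learning_rates, dropouts):
        loss = train_and_eval(lr, dr)
        if loss < best_loss:
            best_params, best_loss = (lr, dr), loss
    return best_params, best_loss

# Toy objective whose minimum sits at lr=1e-4, dropout=0.2:
params, loss = grid_search(
    lambda lr, dr: abs(lr - 1e-4) * 1e4 + (dr - 0.2) ** 2,
    learning_rates=[1e-3, 1e-4, 1e-5],
    dropouts=[0.1, 0.2, 0.3],
)
```

Because each trial only trains on the small subset, even an exhaustive grid like this stays affordable; the winning values are then used for the full-size run.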
As to the relation between dropout and data set size: they’re relatively independent. Dropout usually needs to be decreased when the training audio is noisy, as noisy audio essentially self-regularizes, so there’s less need for dropout regularization. On the flip side, dropout can be increased for clean audio.
As for the loss history of the release model, here it is…
Validation of Epoch 0 - loss: 47.713472
Training of Epoch 0 - loss: 76.803943
Training of Epoch 1 - loss: 49.826667
Validation of Epoch 1 - loss: 27.493896
Validation of Epoch 2 - loss: 27.717039
Training of Epoch 2 - loss: 40.688553
Training of Epoch 3 - loss: 35.456641
Validation of Epoch 3 - loss: 25.176263
Validation of Epoch 4 - loss: 22.647187
Checking for early stopping (last 4 steps) validation loss: 22.647187, with standard deviation: 1.148756 and mean: 26.795733
Training of Epoch 4 - loss: 31.873167
Training of Epoch 5 - loss: 29.195960
Checking for early stopping (last 4 steps) validation loss: 22.647187, with standard deviation: 1.148756 and mean: 26.795733
Validation of Epoch 5 - loss: 21.736987
Validation of Epoch 6 - loss: 20.460873
Checking for early stopping (last 4 steps) validation loss: 20.460873, with standard deviation: 1.455003 and mean: 23.186812
Training of Epoch 6 - loss: 27.093540
Training of Epoch 7 - loss: 25.271599
Checking for early stopping (last 4 steps) validation loss: 20.460873, with standard deviation: 1.455003 and mean: 23.186812
Validation of Epoch 7 - loss: 20.202375
Validation of Epoch 8 - loss: 19.448440
Checking for early stopping (last 4 steps) validation loss: 19.448440, with standard deviation: 0.670847 and mean: 20.800078
Training of Epoch 8 - loss: 23.773587
Training of Epoch 9 - loss: 22.447867
Checking for early stopping (last 4 steps) validation loss: 19.448440, with standard deviation: 0.670847 and mean: 20.800078
Validation of Epoch 9 - loss: 18.980904
Validation of Epoch 10 - loss: 18.989760
Checking for early stopping (last 4 steps) validation loss: 18.989760, with standard deviation: 0.503212 and mean: 19.543907
Training of Epoch 10 - loss: 21.306929
Training of Epoch 11 - loss: 20.299765
Checking for early stopping (last 4 steps) validation loss: 18.989760, with standard deviation: 0.503212 and mean: 19.543907
Validation of Epoch 11 - loss: 19.041425
Validation of Epoch 12 - loss: 18.623053
Checking for early stopping (last 4 steps) validation loss: 18.623053, with standard deviation: 0.026689 and mean: 19.004030
Early stop triggered as (for last 4 steps) validation loss: 18.623053 with standard deviation: 0.026689 and mean: 19.004030
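Reading the log, the early-stop check appears to watch the last 4 validation losses and compare the latest one against their mean and standard deviation, stopping once the losses have plateaued. Here is a rough reconstruction of that idea; the window size (4) matches the log, but the threshold value and the exact comparison are my assumptions, not DeepSpeech’s actual code:

```python
# Rough reconstruction of the plateau-style early-stop check suggested
# by the log above. Window size 4 matches the log; the std threshold
# and the comparison rule are guesses, not DeepSpeech's real logic.
from statistics import mean, pstdev

def should_stop(val_losses, window=4, std_threshold=0.05):
    """Stop when the last `window` validation losses have flattened out:
    their spread is tiny and the newest loss is no longer clearly
    better than the recent mean."""
    if len(val_losses) < window:
        return False
    recent = val_losses[-window:]
    std = pstdev(recent)
    return std < std_threshold and recent[-1] > mean(recent) - std

print(should_stop([47.7, 27.5, 25.2, 22.6]))       # falling -> False
print(should_stop([19.00, 19.01, 18.99, 19.00]))   # flat -> True
```

The effect matches what the log shows: checks with a standard deviation around 0.5–1.5 let training continue, while the final check, with a standard deviation of ~0.027, triggers the stop.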