Where can I find documentation on checkpoints and how they work?
Also, new checkpoints are created all the time and consume a lot of space. Is there any way to clean the checkpoints folder?
You can configure certain aspects of checkpointing:
# Checkpointing
tf.app.flags.DEFINE_string('checkpoint_dir', '', 'directory in which checkpoints are stored - defaults to directory "deepspeech/checkpoints" within user\'s data home specified by the XDG Directory Specification')
tf.app.flags.DEFINE_integer('checkpoint_secs', 600, 'checkpoint saving interval in seconds')
tf.app.flags.DEFINE_integer('max_to_keep', 5, 'number of checkpoint files to keep - default value is 5')
The folder will never grow too large, since the max_to_keep parameter keeps only the last 5 checkpoints and erases the older ones.
If you use the checkpoint_dir flag, you will load the most recent checkpoint in that directory, not the best one, THE LAST ONE. This is important because early_stop is not synchronized with it.
checkpoint_secs defines how often your weights are saved.
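If you ever need to prune an oversized checkpoint folder by hand (for example, one left over from runs before max_to_keep took effect), here is a minimal sketch. It assumes the usual TensorFlow naming pattern of a shared prefix like model.ckpt-<step> followed by .index/.meta/.data-* suffixes; the function name and the "keep newest by mtime" policy are my own choices, not part of DeepSpeech.

```python
import glob
import os


def prune_checkpoints(ckpt_dir, keep=5):
    """Keep only the `keep` most recent checkpoints (by file mtime),
    deleting all files belonging to the older ones."""
    # Group files by checkpoint prefix, e.g. "model.ckpt-1200"
    # (strips the trailing .index / .meta / .data-... suffix).
    prefixes = {}
    for path in glob.glob(os.path.join(ckpt_dir, "*.ckpt-*")):
        prefix = path.rsplit(".", 1)[0]
        prefixes.setdefault(prefix, []).append(path)

    # Order prefixes newest-first by the most recent mtime among their files.
    ordered = sorted(
        prefixes,
        key=lambda p: max(os.path.getmtime(f) for f in prefixes[p]),
        reverse=True,
    )

    # Delete every file of every checkpoint past the `keep` newest.
    for prefix in ordered[keep:]:
        for f in prefixes[prefix]:
            os.remove(f)
    return ordered[:keep]
```

Normally you should not need this at all: max_to_keep already rotates checkpoints for you, so manual pruning is only for cleaning up old directories.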
Sir, I am training the pre-trained model on my own data and am facing a problem with checkpoints: for different datasets, training resumes at different epoch numbers.
For example, with a small dataset of 30 wav files (each 7-8 seconds long), split in a 70-20-10 ratio across train-dev-test folders, it resumes from epoch 10276:
Output
Use standard file utilities to get mtimes.
I STARTING Optimization
I Training epoch 10276…
But with a larger dataset of 180 wav files (each 10 seconds long), the epoch starts from 1832:
Output
Use standard file utilities to get mtimes.
I STARTING Optimization
I Training epoch 1832…
We also tried a third dataset, and there the epoch starts from 45214.
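One plausible explanation (my assumption, not something confirmed in the DeepSpeech documentation): if the reported epoch is derived by dividing the global step restored from the checkpoint by the number of batches per epoch in the *current* dataset, then the same checkpoint will show a different epoch number for every dataset size — smaller datasets mean fewer steps per epoch, hence a larger inferred epoch. A minimal sketch of that arithmetic, with an entirely made-up global step:

```python
def inferred_epoch(global_step, num_train_files, batch_size=1):
    """Epoch number as derived from a restored global step,
    assuming epoch = global_step // steps_per_epoch."""
    steps_per_epoch = max(1, num_train_files // batch_size)
    return global_step // steps_per_epoch


# The same restored step looks like a very different epoch
# depending on how many training samples the new dataset has.
step = 215000  # hypothetical global step stored in the checkpoint
small = inferred_epoch(step, num_train_files=21)   # 70% of 30 files
large = inferred_epoch(step, num_train_files=126)  # 70% of 180 files
```

Under this assumption the huge epoch numbers would not indicate a bug in your data, just that the pre-trained checkpoint's global step is being reinterpreted against your much smaller datasets.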
Something similar has happened to me, but I have no idea why; maybe @reuben can tell us something about it.
How did you overcome this problem then?
I did not; for the time being I am training from scratch.
I plan to carry out more transfer-learning experiments by the end of this week; I will let you know if I find anything conclusive.
Anyhow, I think you are making some mistake: they said they trained the network for 30 epochs, so I cannot understand how you get 1832.
Am I supposed to provide the entire training dataset at once, or can I divide the dataset into parts and train for a few epochs (say, 10 or until convergence) on each part?