Where can I find documentation on checkpoints and how they work?
Also, new checkpoints are created all the time and consume a lot of space. Is there any way to clean the checkpoints folder?
You can configure certain aspects of checkpointing:
# Checkpointing
tf.app.flags.DEFINE_string('checkpoint_dir', '', 'directory in which checkpoints are stored - defaults to directory "deepspeech/checkpoints" within user\'s data home specified by the XDG Directory Specification')
tf.app.flags.DEFINE_integer('checkpoint_secs', 600, 'checkpoint saving interval in seconds')
tf.app.flags.DEFINE_integer('max_to_keep', 5, 'number of checkpoint files to keep - default value is 5')
The folder will never grow too large, since the max_to_keep parameter keeps only the last 5 checkpoints and erases the older ones.
If you use the checkpoint_dir flag, you will load the most recent checkpoint in that directory, not the best one, THE LAST ONE. This is important because early_stop is not synchronized with it.
checkpoint_secs defines how often your weights are saved.
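If you ever need to prune an oversized checkpoint folder by hand (for example, one left over from runs before max_to_keep took effect), here is a minimal sketch. It assumes the usual TensorFlow naming pattern of a shared prefix like model.ckpt-<step> followed by .index/.meta/.data-* suffixes; the function name and the "keep newest by mtime" policy are my own choices, not part of DeepSpeech.

```python
import glob
import os


def prune_checkpoints(ckpt_dir, keep=5):
    """Keep only the `keep` most recent checkpoints (by file mtime),
    deleting all files belonging to the older ones."""
    # Group files by checkpoint prefix, e.g. "model.ckpt-1200"
    # (strips the trailing .index / .meta / .data-... suffix).
    prefixes = {}
    for path in glob.glob(os.path.join(ckpt_dir, "*.ckpt-*")):
        prefix = path.rsplit(".", 1)[0]
        prefixes.setdefault(prefix, []).append(path)

    # Order prefixes newest-first by the most recent mtime among their files.
    ordered = sorted(
        prefixes,
        key=lambda p: max(os.path.getmtime(f) for f in prefixes[p]),
        reverse=True,
    )

    # Delete every file of every checkpoint past the `keep` newest.
    for prefix in ordered[keep:]:
        for f in prefixes[prefix]:
            os.remove(f)
    return ordered[:keep]
```

Normally you should not need this at all: max_to_keep already rotates checkpoints for you, so manual pruning is only for cleaning up old directories.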
Sir, I am training the pre-trained model on my own data and am facing a problem with checkpoints: for different datasets, training resumes at different epoch numbers.
For example, with a small dataset of 30 wav files (each 7-8 seconds long), split in a 70-20-10 ratio across train-dev-test folders, it resumes from epoch 10276:
Output
Use standard file utilities to get mtimes.
I STARTING Optimization
I Training epoch 10276…
But with a larger dataset of 180 wav files (each 10 seconds long), the epoch starts from 1832:
Output
Use standard file utilities to get mtimes.
I STARTING Optimization
I Training epoch 1832…
We also tried a third dataset, and there the epoch starts from 45214.
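One plausible explanation (my assumption, not something confirmed in the DeepSpeech documentation): if the reported epoch is derived by dividing the global step restored from the checkpoint by the number of batches per epoch in the *current* dataset, then the same checkpoint will show a different epoch number for every dataset size — smaller datasets mean fewer steps per epoch, hence a larger inferred epoch. A minimal sketch of that arithmetic, with an entirely made-up global step:

```python
def inferred_epoch(global_step, num_train_files, batch_size=1):
    """Epoch number as derived from a restored global step,
    assuming epoch = global_step // steps_per_epoch."""
    steps_per_epoch = max(1, num_train_files // batch_size)
    return global_step // steps_per_epoch


# The same restored step looks like a very different epoch
# depending on how many training samples the new dataset has.
step = 215000  # hypothetical global step stored in the checkpoint
small = inferred_epoch(step, num_train_files=21)   # 70% of 30 files
large = inferred_epoch(step, num_train_files=126)  # 70% of 180 files
```

Under this assumption the huge epoch numbers would not indicate a bug in your data, just that the pre-trained checkpoint's global step is being reinterpreted against your much smaller datasets.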
Something similar has happened to me, but I have no idea why; maybe @reuben can tell us something about it.
How did you overcome this problem then?
I did not; for the time being I am training from scratch.
I plan to carry out more transfer-learning experiments by the end of this week; I will let you know if I find anything conclusive.
Anyhow, I think you are making some mistake: they said they trained the network for 30 epochs, so I cannot understand how you get 1832.
Am I supposed to provide the entire training dataset at once, or can I divide the dataset into parts and train for a few epochs (say, 10 or until convergence) on each part?