Hi Everyone,
If I want to continue training a model and restore the checkpoints from a previously backed-up location, do I have to copy back ALL the files, or will a smaller subset do the job? If not all of them, which ones are required?
I am training my model (using version 051) in small batches of input data due to OOM issues. I have already trained on the Mozilla Voice corpus. Using the resulting checkpoints directory, I want to continue training on content extracted from certain PDF files, one file at a time. Because I run into OOM failures, I keep backups of the checkpoints directory, restore a backup to the working checkpoint directory, and resume training. But the backup checkpoints directory keeps growing over time.
This is now a problem, as it is reaching sizes of 30 GB and more.
For example, my backup checkpoint directory contains the following files:
best_dev-143.data-00000-of-00001
best_dev_checkpoint
train-110.index
train-121.index
train-132.index
train-143.index
train-99.index
best_dev-143.index
checkpoint
train-110.meta
train-121.meta
train-132.meta
train-143.meta
train-99.meta
best_dev-143.meta
train-110.data-00000-of-00001
train-121.data-00000-of-00001
train-132.data-00000-of-00001
train-143.data-00000-of-00001
train-99.data-00000-of-00001
Which files should I copy into the working checkpoint directory to resume training successfully? Is there any reason to retain the rest of the files?
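To make the question concrete, here is a rough Python sketch of the subset I have in mind: the two state files (`checkpoint` and `best_dev_checkpoint`) plus the `.index`, `.meta` and `.data-*` files of the highest-numbered `train-N` and `best_dev-N` prefixes. The helper names (`group_by_prefix`, `latest_subset`) are my own and not part of any library, and whether this subset is actually enough to resume training is an assumption, which is exactly what I am asking about.

```python
import re
from collections import defaultdict

# Hypothetical helpers for illustration only; not a DeepSpeech or TensorFlow API.

# State files that (I assume) point at the most recent checkpoints.
STATE_FILES = {"checkpoint", "best_dev_checkpoint"}

# Matches e.g. "train-143.index" or "best_dev-143.data-00000-of-00001".
PATTERN = re.compile(
    r"^(?P<prefix>(?:train|best_dev)-\d+)\.(?:index|meta|data-\d+-of-\d+)$"
)


def group_by_prefix(names):
    """Group checkpoint file names by their prefix, e.g. 'train-143'."""
    groups = defaultdict(list)
    for name in names:
        match = PATTERN.match(name)
        if match:
            groups[match.group("prefix")].append(name)
    return dict(groups)


def latest_subset(names):
    """Return the state files plus the files of the newest train/best_dev step."""
    groups = group_by_prefix(names)
    keep = [name for name in names if name in STATE_FILES]
    for kind in ("train", "best_dev"):
        steps = [
            int(prefix.split("-")[-1])
            for prefix in groups
            if prefix.startswith(kind + "-")
        ]
        if steps:
            keep.extend(groups[f"{kind}-{max(steps)}"])
    return sorted(keep)
```

Called on the listing above, this would select only the `train-143.*` and `best_dev-143.*` files together with `checkpoint` and `best_dev_checkpoint`, leaving out the `train-99` through `train-132` files.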
Hope you can guide me on this.
Regards,
Rohit