Training Checkpointing Data.Loss.Error

JohnWayne · September 23, 2019, 1:38pm

Hi All

Hope you are well

I am currently attempting to train the model on Common Voice. I have run into an issue after trying to resume training: it looks like the checkpoint has been corrupted.

Code:
./DeepSpeech.py --train_files /media/sf_en/clips/train.csv --dev_files /media/sf_en/clips/dev.csv --test_files /media/sf_en/clips/test.csv --checkpoint_dir

Output:

```
  File "/home/chabani/tmp/deepspeech-train-venv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/chabani/tmp/deepspeech-train-venv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/chabani/tmp/deepspeech-train-venv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 1933602204 vs. calculated on the restored bytes 887113119
	 [[{{node save/RestoreV2}}]]
```

I wanted to ask whether I could delete the latest checkpoint and resume from an earlier one?

lissyx · September 23, 2019, 2:32pm

Have a look in the checkpoint directory: there is a `checkpoint` file that should refer to the last 5 checkpoints, so you can likely do that here.
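For anyone who would rather script the rollback than edit the file by hand, here is a rough sketch. It assumes the `checkpoint` index is the usual TensorFlow plain-text file, with one `model_checkpoint_path: "..."` line naming the checkpoint to resume from and several `all_model_checkpoint_paths: "..."` lines listed oldest to newest. The function name `roll_back_checkpoint` and the `checkpoint.bak` backup name are made up for this example; they are not part of DeepSpeech or TensorFlow.

```python
import re
import shutil
from pathlib import Path

# Matches lines such as: all_model_checkpoint_paths: "train-1234"
_ENTRY = re.compile(
    r'^(model_checkpoint_path|all_model_checkpoint_paths):\s*"(.*)"\s*$'
)


def roll_back_checkpoint(checkpoint_dir):
    """Rewrite the `checkpoint` index so it points at the previous checkpoint.

    Hypothetical helper: it only edits the index file and leaves the
    checkpoint data files themselves untouched.
    Returns the checkpoint path that would be resumed from afterwards.
    """
    index = Path(checkpoint_dir) / "checkpoint"
    latest = None
    paths = []
    for line in index.read_text().splitlines():
        match = _ENTRY.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key == "model_checkpoint_path":
            latest = value
        else:
            paths.append(value)

    # Drop the entry currently marked as latest (the suspected corrupt one).
    # Assumption: the remaining entries are ordered oldest to newest.
    kept = [p for p in paths if p != latest]
    if not kept:
        raise ValueError("no earlier checkpoint is listed in the index file")

    # Keep a copy of the original index so it can be restored by hand.
    shutil.copy2(index, index.with_name("checkpoint.bak"))

    lines = ['model_checkpoint_path: "%s"' % kept[-1]]
    lines += ['all_model_checkpoint_paths: "%s"' % p for p in kept]
    index.write_text("\n".join(lines) + "\n")
    return kept[-1]
```

Call it with the same directory you pass to `--checkpoint_dir`, then rerun the training command. The files belonging to the bad checkpoint stay on disk, so you can inspect or delete them separately once training resumes cleanly.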