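There have been questions about evaluating performance and monitoring progress during training. The changes below print the elapsed time and the average batch loss every 100 steps.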
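At around line 1600 (depending on version), add the two lines marked "added" after the call to COORD.get_job(). They use the datetime module, so add "import datetime" at the top of the file if it is not already imported.

    # Get the first job
    job = COORD.get_job()
    batch_time_start = datetime.datetime.now()  # added: start of the current 100-step window
    summed_batch_loss = 0.0                     # added: running loss total for the window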
At around line 1640 (again depending on version), add the lines shown below the call to session.run:
    # Compute the batch
    _, current_step, batch_loss, batch_report = session.run([train_op, global_step, loss, report_params], **extra_params)
    summed_batch_loss += batch_loss
    mod_current_step_100 = current_step % 100
    if mod_current_step_100 == 0:
        # Despite its name, batch_time_stop holds the elapsed time (a timedelta) for the window
        batch_time_stop = datetime.datetime.now() - batch_time_start
        delta_seconds = batch_time_stop.total_seconds()
        average_batch_loss = summed_batch_loss / 100.
        print('current_step = %d time_delta = %f avg batch_loss = %f ' %
              (current_step, delta_seconds, average_batch_loss))
        # Reset the accumulators for the next 100 steps
        summed_batch_loss = 0.0
        batch_time_start = datetime.datetime.now()
    # DL loss debug
    # print(' current_step = %d, batch_loss = %f ' % (current_step, batch_loss))
The commented-out line was part of some debugging I had to do. As far as I can tell, passing a comma-separated list of input training files results in the batch loss occasionally coming back as inf.
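Since a single inf in the sum makes the whole 100-step average inf (or nan), one possible variation is to keep non-finite losses out of the running total and count them separately. The following is only a minimal standalone sketch of that idea, not part of the training script: the LossMonitor class, its method names, and the non_finite field in the output are all hypothetical, and it divides by the number of finite losses actually seen rather than by a fixed 100.

```python
import math
import time


class LossMonitor:
    """Hypothetical helper: average loss and elapsed time per window of steps."""

    def __init__(self, window=100, clock=time.monotonic):
        self.window = window   # report every `window` steps
        self.clock = clock     # injectable clock, which makes testing easier
        self._reset()

    def _reset(self):
        # Start a new window: restart the timer and clear the accumulators.
        self.start = self.clock()
        self.summed_loss = 0.0
        self.finite_count = 0
        self.non_finite_count = 0

    def update(self, current_step, batch_loss):
        """Record one batch; return a report string at a window boundary, else None."""
        loss = float(batch_loss)
        if math.isfinite(loss):
            self.summed_loss += loss
            self.finite_count += 1
        else:
            # Assumption: inf/nan batches are counted but excluded from the average.
            self.non_finite_count += 1

        if current_step % self.window != 0:
            return None

        elapsed = self.clock() - self.start
        if self.finite_count:
            average = self.summed_loss / self.finite_count
        else:
            average = float('nan')  # no finite loss was seen in this window
        report = ('current_step = %d time_delta = %f avg batch_loss = %f non_finite = %d'
                  % (current_step, elapsed, average, self.non_finite_count))
        self._reset()
        return report
```

In the training loop this would amount to creating monitor = LossMonitor() next to COORD.get_job(), then calling report = monitor.update(current_step, batch_loss) right after session.run and printing report whenever it is not None. Whether skipping the inf batches is appropriate depends on why they occur, so the non_finite count is printed rather than hidden.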