Printing periodic progress during training

david.levinthal1 · February 22, 2018, 10:18pm

There have been questions about evaluating performance and monitoring progress.
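At around line 1600 (depending on version), add the two lines marked with arrows. Both assume the datetime module is already imported in the script:

    # Get the first job
    job = COORD.get_job()
    batch_time_start = datetime.datetime.now()  # <---- added
    summed_batch_loss = 0.0                     # <---- added
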
At around line 1640, add the following lines after the call to session.run (indent them to match the surrounding loop):

    # Compute the batch
    _, current_step, batch_loss, batch_report = session.run([train_op, global_step, loss, report_params], **extra_params)

    # Accumulate the loss and report the average every 100 steps
    summed_batch_loss += batch_loss
    if current_step % 100 == 0:
        elapsed = datetime.datetime.now() - batch_time_start
        delta_seconds = elapsed.total_seconds()
        average_batch_loss = summed_batch_loss / 100.0
        print('current_step = %d time_delta = %f avg batch_loss = %f' %
              (current_step, delta_seconds, average_batch_loss))
        # Start a new 100-step window
        summed_batch_loss = 0.0
        batch_time_start = datetime.datetime.now()

    # DL loss debug
    # print(' current_step = %d, batch_loss = %f ' % (current_step, batch_loss))

The commented-out line was part of some debugging I had to do. As far as I can tell, passing a comma-separated list of input training files results in the batch loss occasionally coming back as inf.
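Since a single inf batch loss makes the whole 100-step sum (and so the printed average) inf, it can help to count non-finite losses separately. The following is only a sketch of that idea, not part of the training script: ProgressMeter, report_every, and clock are hypothetical names made up for illustration, and unlike the snippet above it divides by the number of finite steps in the window rather than by a fixed 100.

```python
import math
import time


class ProgressMeter(object):
    """Hypothetical helper: averages the loss over a window of steps."""

    def __init__(self, report_every=100, clock=time.time):
        self.report_every = report_every
        self.clock = clock  # injectable so the timing can be tested
        self._reset()

    def _reset(self):
        # Start a new reporting window.
        self.window_start = self.clock()
        self.summed_loss = 0.0
        self.finite_steps = 0
        self.non_finite_steps = 0

    def update(self, current_step, batch_loss):
        """Record one step; return a report string on every
        report_every-th step, otherwise None."""
        batch_loss = float(batch_loss)
        if math.isinf(batch_loss) or math.isnan(batch_loss):
            # Keep inf/nan out of the sum so the average stays readable.
            self.non_finite_steps += 1
        else:
            self.summed_loss += batch_loss
            self.finite_steps += 1

        if current_step % self.report_every != 0:
            return None

        delta_seconds = self.clock() - self.window_start
        if self.finite_steps:
            average = self.summed_loss / self.finite_steps
        else:
            average = float('nan')
        report = ('current_step = %d time_delta = %f avg batch_loss = %f non_finite = %d' %
                  (current_step, delta_seconds, average, self.non_finite_steps))
        self._reset()
        return report
```

In the training loop this would be created once before the loop (meter = ProgressMeter()) and called right after session.run with report = meter.update(current_step, batch_loss), printing the report whenever it is not None.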

david.levinthal1 · February 22, 2018, 10:19pm

sorry about the formatting…cut and pasted from a vi session…seems “#” causes issues