@deepthi.karkada, I was also looking forward to using Horovod for distributed training. Because of the Coordinator code, it didn't let me set `-np` to more than 1; I kept getting an "Address already in use" error. However, when I ran with `-np 1` under `mpirun`, I observed that all my GPUs were being used and a corresponding number of processes was being created.
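For reference, a typical multi-process Horovod launch over Open MPI looks roughly like this (the process count and script name are placeholders, not the exact command I used):

```shell
# Launch 4 training processes on one node, one per GPU.
# -bind-to none / -map-by slot avoid MPI pinning each process to a single core;
# -x forwards the listed environment variables to every worker.
mpirun -np 4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    python DeepSpeech.py
```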
Care to share your PR with the changes to the Coordinator code? The one I did was just a small change that doesn't modify the Coordinator code:
```python
def main(_):
    hvd.init()

    # Pin each process to a single GPU, chosen by its local rank
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    initialize_globals()

    # Create the optimizer with the worker count, then wrap it so
    # gradients are averaged across workers
    opt = create_optimizer(hvd.size())
    opt = hvd.DistributedOptimizer(opt)

    # Broadcast initial variable states from rank 0 to all other processes
    hooks = [
        hvd.BroadcastGlobalVariablesHook(0)
    ]

    print("hvd_Size: " + str(hvd.size()))
    print("hvd_Rank: " + str(hvd.rank()))

    # Only the chief (rank 0) writes checkpoints
    FLAGS.checkpoint_dir = FLAGS.checkpoint_dir if hvd.rank() == 0 else None

    train(opt, hooks)

    # Are we the main process? (chief = Horovod rank 0)
    is_chief = hvd.rank() == 0
    if is_chief:
        # Doing solo/post-processing work just on the main process...
        # Exporting the model
        if FLAGS.export_dir:
            export()
        if len(FLAGS.one_shot_infer):
            do_single_file_inference(FLAGS.one_shot_infer)

    # Stopping the coordinator
    COORD.stop()
```
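As a side note, the reason `FLAGS.checkpoint_dir` is set to `None` on non-zero ranks is to keep workers from writing checkpoints concurrently with the chief. That gating logic can be sketched in isolation (`checkpoint_dir_for_rank` is a hypothetical helper for illustration, not part of the code above):

```python
def checkpoint_dir_for_rank(rank, checkpoint_dir):
    # Only the chief (rank 0) keeps a checkpoint directory;
    # every other worker gets None so it skips checkpointing.
    return checkpoint_dir if rank == 0 else None

print(checkpoint_dir_for_rank(0, "/tmp/ckpt"))  # /tmp/ckpt
print(checkpoint_dir_for_rank(3, "/tmp/ckpt"))  # None
```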