I have 2 GPUs (GeForce 1080 Ti) in one box. I tried to run DeepSpeech on roughly 20 minutes of data (1178 files, each about 1 second long; 800 in the train set, the rest in the dev set). Running on a single GPU works fine, but on multiple GPUs the loss is always NaN and I get the error `Nan in summary histogram`. These are the hyperparameters I used for both the single- and multi-GPU runs:
--n_hidden 494
--learning_rate 0.0001
--train_batch_size 4
--dev_batch_size 2
TensorFlow version: 1.12.0
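For context, the training invocation looked roughly like this (the CSV and checkpoint paths below are placeholders, not my actual directories; the flags are the ones listed above):

```shell
python DeepSpeech.py \
  --train_files data/train.csv \
  --dev_files data/dev.csv \
  --checkpoint_dir checkpoints/ \
  --n_hidden 494 \
  --learning_rate 0.0001 \
  --train_batch_size 4 \
  --dev_batch_size 2
```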
Running DeepSpeech.py on a single GPU works fine:
Training of Epoch 0 - loss: 4.271649
Training of Epoch 1 - loss: 0.046320
Training of Epoch 2 - loss: 0.010699
The same run on 2 GPUs gives this:
Training of Epoch 0 - loss: nan
Training of Epoch 1 - loss: nan
At the end I get: `Nan in summary histogram for: b6_0`
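One hypothesis I considered (not confirmed): since each clip is only about 1 second long, CTC loss becomes infinite (and gradients NaN) for any clip whose transcript has more characters than there are feature time steps. A quick sanity check along those lines, assuming a 20 ms feature step (the step size is an assumption about the preprocessing, and `suspicious_clips` is just a helper I wrote for this check):

```python
# Flag clips whose transcripts are longer than the number of feature
# frames, which makes CTC loss infinite. Assumes a 20 ms feature step.
def suspicious_clips(clips, step_ms=20):
    bad = []
    for name, duration_s, transcript in clips:
        n_frames = int(duration_s * 1000 / step_ms)  # frames per clip
        if len(transcript) > n_frames:
            bad.append(name)
    return bad

clips = [("a.wav", 1.0, "hello"), ("b.wav", 1.0, "x" * 60)]
# A 1 s clip yields 50 frames, so the 60-character transcript is flagged.
print(suspicious_clips(clips))
```

In my case this did not obviously explain why the single-GPU run succeeds while the multi-GPU run does not, so I suspect something else is going on.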
I also tried using run-cluster.sh. After the pre-processing, it hangs at this:
[worker 1] Instructions for updating:
[worker 1] To construct input pipelines, use the tf.data module.
[worker 1] W Parameter --validation_step needs to be >0 for early stopping to work
Please let me know if I am missing something.