I have 2 GPUs (GeForce 1080 Ti) in one box. I tried to run DeepSpeech on roughly 20 minutes of data (1178 files, each about 1 second long; 800 in the train set, the rest in the dev set). Running on a single GPU works fine, but on multiple GPUs the loss is always NaN and I get the error `Nan in summary histogram`. These are the hyperparameters I used for both the single- and multi-GPU runs:
--n_hidden 494
--learning_rate 0.0001
--train_batch_size 4
--dev_batch_size 2
TensorFlow version: 1.12.0
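For context, the training invocation looked roughly like this (the CSV and checkpoint paths below are placeholders, not my actual directories; the flags are the ones listed above):

```shell
python DeepSpeech.py \
  --train_files data/train.csv \
  --dev_files data/dev.csv \
  --checkpoint_dir checkpoints/ \
  --n_hidden 494 \
  --learning_rate 0.0001 \
  --train_batch_size 4 \
  --dev_batch_size 2
```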
Running DeepSpeech.py on a single GPU works fine:
Training of Epoch 0 - loss: 4.271649
Training of Epoch 1 - loss: 0.046320
Training of Epoch 2 - loss: 0.010699
The same run on 2 GPUs gives this:
Training of Epoch 0 - loss: nan
Training of Epoch 1 - loss: nan
At the end I get: `Nan in summary histogram for: b6_0`
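One hypothesis I considered (not confirmed): since each clip is only about 1 second long, CTC loss becomes infinite (and gradients NaN) for any clip whose transcript has more characters than there are feature time steps. A quick sanity check along those lines, assuming a 20 ms feature step (the step size is an assumption about the preprocessing, and `suspicious_clips` is just a helper I wrote for this check):

```python
# Flag clips whose transcripts are longer than the number of feature
# frames, which makes CTC loss infinite. Assumes a 20 ms feature step.
def suspicious_clips(clips, step_ms=20):
    bad = []
    for name, duration_s, transcript in clips:
        n_frames = int(duration_s * 1000 / step_ms)  # frames per clip
        if len(transcript) > n_frames:
            bad.append(name)
    return bad

clips = [("a.wav", 1.0, "hello"), ("b.wav", 1.0, "x" * 60)]
# A 1 s clip yields 50 frames, so the 60-character transcript is flagged.
print(suspicious_clips(clips))
```

In my case this did not obviously explain why the single-GPU run succeeds while the multi-GPU run does not, so I suspect something else is going on.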
I also tried using run-cluster.sh. After the pre-processing, it hangs at this:
[worker 1] Instructions for updating:
[worker 1] To construct input pipelines, use the tf.data module.
[worker 1] W Parameter --validation_step needs to be >0 for early stopping to work
Please let me know if I am missing something.