Hi,
I am trying to use the run-cluster.sh script provided in DeepSpeech’s code for distributed training.
My workstation has two NVIDIA 1080 GPUs in PCIe slots.
I run the following command:
./run-cluster.sh 1:2:1 --train_files /docker_files/voxforge/voxforge-train.csv --dev_files /docker_files/voxforge/voxforge-dev.csv --test_files /docker_files/voxforge/voxforge-test.csv --checkpoint_dir /docker_files/checkpoints_cv_mozilla/ --epoch -3 --n_hidden 2048
DeepSpeech’s trained checkpoints (downloaded from GitHub) are in /docker_files/checkpoints_cv_mozilla/. I have one parameter server and two workers, each using one GPU.
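To make the 1:2:1 argument explicit, here is a minimal Python sketch of how it maps onto the setup described above. It assumes the three fields mean parameter servers : workers : GPUs per worker, which matches the description in this post. The function names and the GPU assignment scheme are hypothetical illustrations, not taken from run-cluster.sh.

```python
def parse_cluster_arg(spec):
    """Split a 'p:w:g' string into (ps_count, worker_count, gpus_per_worker).

    Assumption: p = parameter servers, w = workers, g = GPUs per worker.
    """
    parts = spec.split(":")
    if len(parts) != 3:
        raise ValueError("expected 'p:w:g', got %r" % spec)
    ps_count, worker_count, gpus_per_worker = (int(p) for p in parts)
    return ps_count, worker_count, gpus_per_worker


def assign_gpus(worker_count, gpus_per_worker):
    """Give each worker its own consecutive block of GPU indices.

    This is an illustrative scheme only; the real script may assign
    devices differently.
    """
    return {
        w: list(range(w * gpus_per_worker, (w + 1) * gpus_per_worker))
        for w in range(worker_count)
    }


if __name__ == "__main__":
    ps, workers, gpus = parse_cluster_arg("1:2:1")
    print("parameter servers:", ps)
    print("workers:", workers)
    print("GPU indices per worker:", assign_gpus(workers, gpus))
```

Under that assumption, 1:2:1 needs two GPUs in total, which is what this workstation has.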
After I run the above command, the processes get stuck at a certain point. Here is the output:
[worker 0] Preprocessing done
[worker 0] ('Preprocessing', ['/docker_files/voxforge/voxforge-dev.csv'])
[worker 1] Preprocessing done
[worker 1] ('Preprocessing', ['/docker_files/voxforge/voxforge-dev.csv'])
[worker 0] Preprocessing done
[worker 0] ('Preprocessing', ['/docker_files/voxforge/voxforge-test.csv'])
[worker 1] Preprocessing done
[worker 1] ('Preprocessing', ['/docker_files/voxforge/voxforge-test.csv'])
[worker 0] Preprocessing done
[worker 0] WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py:335: init (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
[worker 0] Instructions for updating:
[worker 0] To construct input pipelines, use the `tf.data` module.
[worker 1] Preprocessing done
[worker 1] WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py:335: init (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
[worker 1] Instructions for updating:
[worker 1] To construct input pipelines, use the `tf.data` module.
The processes take up all the memory on both GPUs, but GPU utilization (the Volatile GPU-Util column in nvidia-smi) always stays at 0 to 2%, so training does not seem to have started.
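One way to confirm this symptom over time is to poll nvidia-smi and compare memory use against utilization. The sketch below is a minimal Python example under stated assumptions: it relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi, the helper names are hypothetical, and the 5% idle threshold is an arbitrary choice. Note that TensorFlow by default reserves almost all GPU memory as soon as a session starts, so full memory alone does not show that training is running.

```python
import subprocess

# Query fields: GPU index, utilization in percent, memory used/total in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Parse the CSV lines produced by QUERY into a list of dicts."""
    stats = []
    for line in text.strip().splitlines():
        index, util, used, total = [field.strip() for field in line.split(",")]
        stats.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (requires an NVIDIA driver)."""
    return parse_gpu_stats(subprocess.check_output(QUERY).decode())


def looks_idle(stats, util_threshold=5):
    """True if every GPU is at or below the utilization threshold (arbitrary 5%)."""
    return all(s["util_pct"] <= util_threshold for s in stats)


if __name__ == "__main__":
    # Made-up sample text in the expected format, not a real measurement.
    sample = "0, 1, 7900, 8119\n1, 0, 7900, 8119\n"
    print(looks_idle(parse_gpu_stats(sample)))
```

Calling `read_gpu_stats()` every few seconds while the workers run would show whether utilization ever rises above the idle threshold.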
Is there anything I’m missing in this process?