I’m running a DeepSpeech training job on an AWS p3.8xlarge instance with four GPUs.
A few minutes after starting up, the log looks like the following. As you can see, all jobs are on worker: 0… is that right? I’m guessing not…
Any suggestions on how to fix this would be very gratefully received.
Using nvidia-smi I can see that data is being sent to all four GPUs. Most of the time they all sit at 0% Volatile GPU-Util, but every ten seconds or so all four GPUs jump to ~97% usage for one brief instant and then drop back to zero.
I think this means that my training is I/O bound. That is, training doesn’t appear to be limited by GPU compute speed but rather by the problem of marshalling data to and from the GPUs.
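As a rough back-of-the-envelope check of that hypothesis (a sketch only; the one-second burst length and ten-second period are my reading of the nvidia-smi pattern, not measured values):

```python
# Rough estimate of effective GPU utilization from the bursty pattern:
# GPUs spike to ~97% for roughly one second, then sit idle until the
# next burst ~10 seconds later. Both durations are assumptions.
busy_seconds = 1.0      # assumed length of each ~97% burst
period_seconds = 10.0   # assumed time between bursts ("every ten seconds or so")
peak_util = 0.97        # utilization during the burst, from nvidia-smi

effective_util = peak_util * busy_seconds / period_seconds
print(f"effective utilization ≈ {effective_util:.0%}")
```

On those assumed numbers the GPUs are only doing useful work about 10% of the time, which is why this smells like an input-pipeline bottleneck rather than a compute one.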
Maybe that’s why all the jobs are on worker: 0, or maybe it’s the other way around, and increasing the number of workers (somehow) might get more data to the GPUs? If so, any suggestions as to how one might do this or debug things?
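For anyone else puzzling over this, the general idea behind more feed workers is a producer/consumer queue: several threads prepare batches in the background while the GPU consumes them, so the GPU rarely waits on I/O. A minimal sketch in plain Python (this is not DeepSpeech’s actual feeder code; the 50 ms load delay, worker count of 4, and batch names are made up for illustration):

```python
import queue
import threading
import time

BATCHES = 8
batch_queue = queue.Queue(maxsize=4)  # bounded so producers can't run ahead unbounded

def load_batch(i):
    """Stand-in for slow disk reads / preprocessing (assumed 50 ms per batch)."""
    time.sleep(0.05)
    return f"batch-{i}"

def producer(indices):
    # Each feeder thread loads its share of batches and queues them.
    for i in indices:
        batch_queue.put(load_batch(i))

# Four producers load in parallel: while the "GPU" consumes one batch,
# the next few are already being prepared in the background.
workers = [threading.Thread(target=producer, args=(range(w, BATCHES, 4),))
           for w in range(4)]
for t in workers:
    t.start()

consumed = []
for _ in range(BATCHES):
    consumed.append(batch_queue.get())  # the training step would run here

for t in workers:
    t.join()

print(f"consumed {len(consumed)} batches")
```

With a single producer the consumer stalls for the full load time on every batch; with four it usually finds a batch already waiting. That overlap is the same reason more feed workers (or prefetching in the input pipeline) could smooth out those 0%-utilization gaps.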
Are there other ways to increase the GPU utilization?
Here’s the relevant part of the training log:
D Finished batch step 204.
D Sending Job (ID: 205, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 206, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
D Finished batch step 205.
D Sending Job (ID: 206, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 207, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
D Finished batch step 206.
D Sending Job (ID: 207, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 208, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
D Finished batch step 207.
D Sending Job (ID: 208, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 209, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
D Finished batch step 208.
D Sending Job (ID: 209, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 210, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
D Finished batch step 209.
D Sending Job (ID: 210, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 211, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
Here’s what nvidia-smi looks like during that brief instant. The rest of the time it’s the same but GPU-Util is at 0%…
Every 1.0s: nvidia-smi Fri May 25 03:56:55 2018
Fri May 25 03:56:55 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:1B.0 Off | 0 |
| N/A 52C P0 65W / 300W | 15426MiB / 16152MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:1C.0 Off | 0 |
| N/A 63C P0 78W / 300W | 15426MiB / 16152MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:1D.0 Off | 0 |
| N/A 61C P0 73W / 300W | 15426MiB / 16152MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:1E.0 Off | 0 |
| N/A 50C P0 69W / 300W | 15426MiB / 16152MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 127768 C python 15408MiB |
| 1 127768 C python 15408MiB |
| 2 127768 C python 15408MiB |
| 3 127768 C python 15408MiB |
+-----------------------------------------------------------------------------+
Any suggestions or tips very gratefully received. Thanks!