I’m running a DeepSpeech training job on an AWS p3.8xlarge instance with four GPUs.
A few minutes after starting up, the log looks like the following. As you can see, all jobs are on worker: 0… is that right? I’m guessing not…
Any suggestions on how to fix this would be very gratefully received.
Using nvidia-smi I can see that data is being sent to all four GPUs. Most of the time they all sit at 0% Volatile GPU-Util, but every ten seconds or so all four GPUs jump to ~97% usage for one brief instant and then drop back to zero.
I think this means that my training is I/O bound. That is, training doesn’t appear to be limited by GPU compute speed but rather by the problem of marshalling data to and from the GPUs.
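As a rough back-of-the-envelope check of that hypothesis (a sketch only; the one-second burst length and ten-second period are my reading of the nvidia-smi pattern, not measured values):

```python
# Rough estimate of effective GPU utilization from the bursty pattern:
# GPUs spike to ~97% for roughly one second, then sit idle until the
# next burst ~10 seconds later. Both durations are assumptions.
busy_seconds = 1.0      # assumed length of each ~97% burst
period_seconds = 10.0   # assumed time between bursts ("every ten seconds or so")
peak_util = 0.97        # utilization during the burst, from nvidia-smi

effective_util = peak_util * busy_seconds / period_seconds
print(f"effective utilization ≈ {effective_util:.0%}")
```

On those assumed numbers the GPUs are only doing useful work about 10% of the time, which is why this smells like an input-pipeline bottleneck rather than a compute one.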
Maybe that’s why all the jobs are on worker: 0, or maybe it’s the other way around, and increasing the number of workers (somehow) might get more data to the GPUs? If so, any suggestions as to how one might do this or debug things?
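For anyone else puzzling over this, the general idea behind more feed workers is a producer/consumer queue: several threads prepare batches in the background while the GPU consumes them, so the GPU rarely waits on I/O. A minimal sketch in plain Python (this is not DeepSpeech’s actual feeder code; the 50 ms load delay, worker count of 4, and batch names are made up for illustration):

```python
import queue
import threading
import time

BATCHES = 8
batch_queue = queue.Queue(maxsize=4)  # bounded so producers can't run ahead unbounded

def load_batch(i):
    """Stand-in for slow disk reads / preprocessing (assumed 50 ms per batch)."""
    time.sleep(0.05)
    return f"batch-{i}"

def producer(indices):
    # Each feeder thread loads its share of batches and queues them.
    for i in indices:
        batch_queue.put(load_batch(i))

# Four producers load in parallel: while the "GPU" consumes one batch,
# the next few are already being prepared in the background.
workers = [threading.Thread(target=producer, args=(range(w, BATCHES, 4),))
           for w in range(4)]
for t in workers:
    t.start()

consumed = []
for _ in range(BATCHES):
    consumed.append(batch_queue.get())  # the training step would run here

for t in workers:
    t.join()

print(f"consumed {len(consumed)} batches")
```

With a single producer the consumer stalls for the full load time on every batch; with four it usually finds a batch already waiting. That overlap is the same reason more feed workers (or prefetching in the input pipeline) could smooth out those 0%-utilization gaps.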
Are there other ways to increase the GPU utilization?
Here’s the relevant part of the training log:
D Finished batch step 204.
D Sending Job (ID: 205, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 206, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
D Finished batch step 205.
D Sending Job (ID: 206, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 207, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
D Finished batch step 206.
D Sending Job (ID: 207, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 208, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
D Finished batch step 207.
D Sending Job (ID: 208, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 209, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
D Finished batch step 208.
D Sending Job (ID: 209, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 210, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
D Finished batch step 209.
D Sending Job (ID: 210, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 211, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
Here’s what nvidia-smi looks like during that brief instant. The rest of the time it’s the same but GPU-Util is at 0%…
Every 1.0s: nvidia-smi Fri May 25 03:56:55 2018
Fri May 25 03:56:55 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:1B.0 Off | 0 |
| N/A 52C P0 65W / 300W | 15426MiB / 16152MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:1C.0 Off | 0 |
| N/A 63C P0 78W / 300W | 15426MiB / 16152MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:1D.0 Off | 0 |
| N/A 61C P0 73W / 300W | 15426MiB / 16152MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:1E.0 Off | 0 |
| N/A 50C P0 69W / 300W | 15426MiB / 16152MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 127768 C python 15408MiB |
| 1 127768 C python 15408MiB |
| 2 127768 C python 15408MiB |
| 3 127768 C python 15408MiB |
+-----------------------------------------------------------------------------+
Any suggestions or tips very gratefully received. Thanks!