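During training, the training or validation phase sometimes takes far too long, at random points.
For example, at epoch 14 below, validation takes about 5 minutes, which is normal. But sometimes the same step takes much longer, up to a few hours, on the same dataset, and it happens randomly (in the log below, training took 1:16:39 at epoch 14 but 7:51:04 at epoch 15, and validation took 0:51:09 at epoch 15). While this is happening, the GPUs are idle and the CPU stats look weird (I have attached the “top” and “nvidia-smi” output).
Is there any way I can solve this problem?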
Epoch 14 | Training | Elapsed Time: 1:16:39 | Steps: 3864 | Loss: 54.739649
Epoch 14 | Validation | Elapsed Time: 0:04:55 | Steps: 322 | Loss: 53.396248 | Dataset: /home/ubuntu/src/deepspeech/latest/DeepSpeech/data/csv/dev_1.csv
I Saved new best validating model with loss 53.396248 to: /home/ubuntu/src/deepspeech/latest/DeepSpeech/data/checkpoint/best_dev-70036
Epoch 15 | Training | Elapsed Time: 7:51:04 | Steps: 3864 | Loss: 53.578589
Epoch 15 | Validation | Elapsed Time: 0:51:09 | Steps: 322 | Loss: 52.377501 | Dataset: /home/ubuntu/src/deepspeech/latest/DeepSpeech/data/csv/dev_1.csv
I Saved new best validating model with loss 52.377501 to: /home/ubuntu/src/deepspeech/latest/DeepSpeech/data/checkpoint/best_dev-73900
Epoch 16 | Training | Elapsed Time: 4:45:37 | Steps: 3864 | Loss: 52.526960
Epoch 16 | Validation | Elapsed Time: 0:04:56 | Steps: 322 | Loss: 51.434787 | Dataset: /home/ubuntu/src/deepspeech/latest/DeepSpeech/data/csv/dev_1.csv
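To put numbers on the slowdown, the progress lines above can be reduced to seconds per step. This is only a minimal sketch: it assumes the lines keep exactly the format shown here, and the function name and regular expression are my own, not part of DeepSpeech.

```python
import re
from datetime import timedelta

# Progress lines copied from the log above (Dataset suffix dropped for brevity).
LOG = """\
Epoch 14 | Training | Elapsed Time: 1:16:39 | Steps: 3864 | Loss: 54.739649
Epoch 14 | Validation | Elapsed Time: 0:04:55 | Steps: 322 | Loss: 53.396248
Epoch 15 | Training | Elapsed Time: 7:51:04 | Steps: 3864 | Loss: 53.578589
Epoch 15 | Validation | Elapsed Time: 0:51:09 | Steps: 322 | Loss: 52.377501
Epoch 16 | Training | Elapsed Time: 4:45:37 | Steps: 3864 | Loss: 52.526960
Epoch 16 | Validation | Elapsed Time: 0:04:56 | Steps: 322 | Loss: 51.434787
"""

# Hypothetical pattern matching the progress-line format seen in this log.
PATTERN = re.compile(
    r"Epoch (\d+) \| (Training|Validation) \| "
    r"Elapsed Time: (\d+):(\d+):(\d+) \| Steps: (\d+)"
)


def seconds_per_step(log_text):
    """Return {(epoch, phase): seconds per step} for each progress line."""
    result = {}
    for m in PATTERN.finditer(log_text):
        epoch, phase = int(m.group(1)), m.group(2)
        hours, minutes, seconds = (int(m.group(i)) for i in (3, 4, 5))
        elapsed = timedelta(
            hours=hours, minutes=minutes, seconds=seconds
        ).total_seconds()
        result[(epoch, phase)] = elapsed / int(m.group(6))
    return result


rates = seconds_per_step(LOG)
for (epoch, phase), rate in sorted(rates.items()):
    # Compare each phase against its fastest occurrence in the log.
    baseline = min(r for (_, p), r in rates.items() if p == phase)
    print(f"epoch {epoch} {phase:<10} {rate:6.2f} s/step "
          f"({rate / baseline:4.1f}x the fastest)")
```

The step counts are identical across epochs (3864 and 322), so any difference in seconds per step comes from the run itself and not from the amount of data.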
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   46C    P8    13W / 250W |  10915MiB / 11177MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   43C    P8    13W / 250W |  10915MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     14857      C   python                                     10889MiB |
|    1     14857      C   python                                     10889MiB |
+-----------------------------------------------------------------------------+

%Cpu0 : 0.0 us, 78.3 sy, 0.0 ni, 21.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 99.0 sy, 0.0 ni, 1.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us,100.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us,100.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 8.0 us, 75.9 sy, 0.0 ni, 16.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.0 us, 99.7 sy, 0.0 ni, 0.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 8.3 us, 87.4 sy, 0.0 ni, 4.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 0.0 us, 30.7 sy, 0.0 ni, 69.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8 : 8.4 us, 20.1 sy, 0.0 ni, 71.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 0.3 us, 8.0 sy, 0.0 ni, 91.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 : 0.0 us,100.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 0.0 us, 8.0 sy, 0.0 ni, 92.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32874304 total,  1048340 free, 18044372 used, 13781592 buff/cache
KiB Swap: 33406972 total, 31324732 free,  2082240 used.  5838024 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14857 ubuntu    20   0 76.650g 0.024t 8.631g S 830.6 79.7   4605:44 DeepSpeech.py

CPU INFO: Intel® Core™ i7-8700 CPU @ 3.20GHz
MEMORY: 32G
GPU: GeForce 1080Ti x2