Hi there,
I’m running DeepSpeech v0.4.1 on a single machine with 11GB of VRAM, using TensorFlow r1.12 as recommended.
When I train the model, I get a ResourceExhaustedError like the one below:
2019-02-16 23:06:33.533108: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[40752,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
What’s interesting about this error is that it only seems to happen after the model has completed an epoch, evaluated on the dev set, and then finished a few batches of the following epoch. So my logs look something like this:
...
D Starting batch...
D Finished batch step 13489.
D Sending Job (ID: 6801, worker: 0, epoch: 0, set_name: train)...
I Training of Epoch 0 - loss: 17.559973
D Computing Job (ID: 6803, worker: 0, epoch: 0, set_name: dev)...
D Starting batch...
D Finished batch step 13489.
100% (6800 of 6800) |####################| Elapsed Time: 1:12:49 Time: 1:12:49
...
D Starting batch...
D Finished batch step 13489.
D Sending Job (ID: 6859, worker: 0, epoch: 0, set_name: dev)...
I Validation of Epoch 0 - loss: 14.614131
100% (57 of 57) |########################| Elapsed Time: 0:00:08 Time: 0:00:08
...
D Starting batch...
D Finished batch step 13494.
D Sending Job (ID: 6865, worker: 0, epoch: 1, set_name: train)...
D Computing Job (ID: 6866, worker: 0, epoch: 1, set_name: train)
...
D Starting batch...
D Finished batch step 13495.
0% (6 of 6800) | | Elapsed Time: 0:00:11 ETA: 4:46:442019-02-16 23:06:33.531305: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 318.38MiB. Current allocation summary follows.
...
2019-02-16 23:06:33.533108: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[40752,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
D Sending Job (ID: 6866, worker: 0, epoch: 1, set_name: train)...
D Computing Job (ID: 6867, worker: 0, epoch: 1, set_name: train)...
D Starting batch...
E OOM when allocating tensor with shape[40752,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Indeed, if I use the checkpoints to resume training after the crash, the TensorBoard logs show that the epoch was recorded as valid, and training continues without issue until it crashes again.
This seems like the kind of error that’s specific to my setup, which I don’t expect you guys are experiencing. But it might have a simple solution in DeepSpeech.py. Maybe the GPU memory isn’t being released after each epoch for some reason? I’m not really sure.
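For what it’s worth, one thing I’ve seen suggested for fragmentation-related OOMs on TF 1.x is enabling incremental GPU allocation via `allow_growth`. This is just a sketch of where that setting would go, assuming it isn’t already set in the session config that DeepSpeech.py builds (I haven’t verified whether it is):

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory incrementally instead of
# reserving the whole card up front; this can change how the bfc
# allocator fragments memory across epochs.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
```

No idea if this actually helps here, since the crash happens mid-training rather than at startup, but it’s cheap to try.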
Anyway, if there’s a way to keep it running, that would be great. Otherwise I might have to hack around it by relaunching the training from bash and expecting it to crash every time, since lowering the batch size isn’t something I want to do given how much it slows down training.
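In case it’s useful to anyone, a minimal sketch of that relaunch-on-crash workaround, assuming DeepSpeech.py picks up its checkpoint directory on restart (which the behaviour above suggests it does); the command and restart cap are placeholders:

```python
import subprocess

def train_until_done(cmd, max_restarts=100):
    """Re-launch the training command until it exits cleanly.

    Assumes the command resumes from checkpoints on restart, so each
    relaunch continues from where the OOM crash left off. Returns the
    number of restarts that were needed.
    """
    for attempt in range(max_restarts):
        ret = subprocess.call(cmd)
        if ret == 0:
            return attempt
    raise RuntimeError("still crashing after %d restarts" % max_restarts)

# Example (hypothetical command line):
# train_until_done(["python", "DeepSpeech.py", "--checkpoint_dir", "ckpt/"])
```

Ugly, but it would at least keep the run going unattended overnight.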
Anyway, if anyone has thoughts I’d be happy to hear them. Cheers.