Hi there,
I’m running DeepSpeech v0.4.1 on a single machine with 11GB of VRAM, using TensorFlow r1.12 as recommended.
When I train the model, I get a ResourceExhaustedError like the one below:
2019-02-16 23:06:33.533108: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[40752,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
What’s interesting about this error is that it only seems to happen after the model has completed an epoch, evaluated on the dev set, and then finished a few batches of the following epoch. So my logs look something like this:
...
D Starting batch...
D Finished batch step 13489.
D Sending Job (ID: 6801, worker: 0, epoch: 0, set_name: train)...
I Training of Epoch 0 - loss: 17.559973
D Computing Job (ID: 6803, worker: 0, epoch: 0, set_name: dev)...
D Starting batch...
D Finished batch step 13489.
100% (6800 of 6800) |####################| Elapsed Time: 1:12:49 Time: 1:12:49
...
D Starting batch...
D Finished batch step 13489.
D Sending Job (ID: 6859, worker: 0, epoch: 0, set_name: dev)...
I Validation of Epoch 0 - loss: 14.614131
100% (57 of 57) |########################| Elapsed Time: 0:00:08 Time: 0:00:08
...
D Starting batch...
D Finished batch step 13494.
D Sending Job (ID: 6865, worker: 0, epoch: 1, set_name: train)...
D Computing Job (ID: 6866, worker: 0, epoch: 1, set_name: train)
...
D Starting batch...
D Finished batch step 13495.
0% (6 of 6800) | | Elapsed Time: 0:00:11 ETA: 4:46:442019-02-16 23:06:33.531305: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 318.38MiB. Current allocation summary follows.
...
2019-02-16 23:06:33.533108: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[40752,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
D Sending Job (ID: 6866, worker: 0, epoch: 1, set_name: train)...
D Computing Job (ID: 6867, worker: 0, epoch: 1, set_name: train)...
D Starting batch...
E OOM when allocating tensor with shape[40752,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Indeed, if I use the checkpoints to resume training after the crash, the TensorBoard logs show that the epoch was recorded as valid, and training continues without issue until it crashes again.
This seems like the kind of error that’s specific to my setup, which I don’t expect you guys are experiencing. But it might have a simple solution in DeepSpeech.py. Maybe the GPU memory isn’t being released after each epoch for some reason? I’m not really sure.
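For what it’s worth, one thing I’ve seen suggested for fragmentation-related OOMs on TF 1.x is enabling incremental GPU allocation via `allow_growth`. This is just a sketch of where that setting would go, assuming it isn’t already set in the session config that DeepSpeech.py builds (I haven’t verified whether it is):

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory incrementally instead of
# reserving the whole card up front; this can change how the bfc
# allocator fragments memory across epochs.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
```

No idea if this actually helps here, since the crash happens mid-training rather than at startup, but it’s cheap to try.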
Anyway, if there’s a way to keep it running, that would be great. Otherwise I might have to hack around it by relaunching the training from bash and expecting it to crash every time, since lowering the batch size isn’t something I want to do given how much it slows down training.
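In case it’s useful to anyone, a minimal sketch of that relaunch-on-crash workaround, assuming DeepSpeech.py picks up its checkpoint directory on restart (which the behaviour above suggests it does); the command and restart cap are placeholders:

```python
import subprocess

def train_until_done(cmd, max_restarts=100):
    """Re-launch the training command until it exits cleanly.

    Assumes the command resumes from checkpoints on restart, so each
    relaunch continues from where the OOM crash left off. Returns the
    number of restarts that were needed.
    """
    for attempt in range(max_restarts):
        ret = subprocess.call(cmd)
        if ret == 0:
            return attempt
    raise RuntimeError("still crashing after %d restarts" % max_restarts)

# Example (hypothetical command line):
# train_until_done(["python", "DeepSpeech.py", "--checkpoint_dir", "ckpt/"])
```

Ugly, but it would at least keep the run going unattended overnight.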
Anyway, if anyone has thoughts I’d be happy to hear them. Cheers.