CUDA_ERROR_OUT_OF_MEMORY with deepspeech-gpu

I have been running deepspeech-gpu inference inside Docker containers. I am trying to run around 30 containers on one EC2 instance, which has a Tesla K80 GPU with 12 GB of memory. The containers run for a bit, then I start to get CUDA memory errors: CUDA_ERROR_OUT_OF_MEMORY. My question is: do you think this is a problem with CUDA where, after the model is loaded, the memory is not being released, or is it something else?

Also, each container has around 360 20-second .wav files that I am transcribing. I am using a for loop and calling the CLI via subprocess:

Using deepspeech-gpu==0.4.1 and deepspeech-0.4.1-models.tar.gz

import subprocess

# Launch the DeepSpeech CLI; "exec" replaces the shell, so killing the process stops deepspeech itself.
# The shell redirects output to a file, so the PIPE on stdout receives nothing.
deepSpeechResults = subprocess.Popen(
    "exec deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt"
    " --lm models/lm.binary --trie models/trie --audio " + audioLocation + " > " + savelocation,
    stdout=subprocess.PIPE,
    shell=True,
)

try:
    deepSpeechResults.wait(timeout=30)
except subprocess.TimeoutExpired:
    deepSpeechResults.kill()

30 containers in parallel, on a 12 GB GPU?

Correct. I just assumed it would be able to handle the inference because I was able to get around 30 containers running in parallel using just the DeepSpeech CPU package on a 16-core processor.

Memory sharing from containers on a GPU might be more complex. Try adding them one by one. I’m pretty sure the model will use way more memory than you expect :slight_smile:
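One rough way to do that (a sketch only, not from this thread: the image name and sleep interval are hypothetical placeholders, and it assumes nvidia-docker 2 and nvidia-smi on the host) is to start the containers one at a time and watch GPU memory between launches:

import subprocess
import time

IMAGE = "my-deepspeech-gpu-image"  # hypothetical image name; substitute your own

for i in range(30):
    # nvidia-docker 2 style GPU passthrough, detached container
    subprocess.Popen(["docker", "run", "--runtime=nvidia", "-d", IMAGE])
    time.sleep(60)  # give the model time to load before checking memory
    used = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"]
    ).decode().strip()
    print("containers:", i + 1, "GPU memory used:", used)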

Okay, I will try that, thanks! Also, in terms of memory sharing and CPU vs. GPU usage for 30+ containers in parallel, would you recommend not using a GPU for this?

I have no experience wrt. multiple instances on the same GPU, so I cannot help.

Our clients haven’t really been optimized for this type of use case. TensorFlow will try to use all of the GPU memory available in the system by default, so running multiple instances side by side is bound to fail. You can configure it to use a share of the available memory by modifying the parameters used when creating the session (requires modifying our code). The cleaner solution is to implement a server-oriented client that does batching and is focused on high throughput on large GPUs rather than low latency on small CPUs (the goal of our current clients).
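For reference, this is roughly what capping the memory share looks like in the TensorFlow 1.x Python API. The DeepSpeech native client creates its session in C++, so the actual change would go in that code, but the option is the same idea (the 0.3 fraction below is just an illustrative value):

import tensorflow as tf

# Instead of the default "grab all GPU memory", cap this process at ~30% of it.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3)
# Alternatively, allow_growth=True makes TensorFlow allocate memory on demand.
config = tf.ConfigProto(gpu_options=gpu_options)
sess = tf.Session(config=config)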

I actually was able to get ~50 containers running in parallel with DeepSpeech on a c5n.9xlarge, which has 36 vCPUs, with pretty decent inference time. In terms of time/cost, based on the testing I have done, I think using CPUs is the best option for large-scale parallel inference.

Yep, if using our unmodified clients is needed then I’d definitely go with the CPU package rather than the GPU one.
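For anyone following along, switching is just a matter of which wheel you install; the versions below are pinned to match the 0.4.1 model used above:

pip install deepspeech==0.4.1      # CPU package
pip install deepspeech-gpu==0.4.1  # GPU package (what was used initially)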


If anyone wants to see how I set up the GPU with Docker and docker-machine, I wrote a little article on it: https://medium.com/@tbobik91/creating-a-docker-image-and-aws-ec2-ami-with-nvidia-cuda-9-0-959d57e5849