Splitting the GPU memory to run multiple workers on a DeepSpeech server

Hello, I am wondering if it’s possible to split the GPU memory up into fractions so that I can run multiple DeepSpeech instances.

In TensorFlow you can configure the session object to use only a fraction of the available memory. Example here. Is there any way to configure the same thing in DeepSpeech?
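Something along these lines is what I mean (just a sketch; the 0.3 fraction is an arbitrary example value):

```python
import tensorflow as tf

# Let this process claim only a fraction of the GPU's memory,
# leaving the rest for other workers (0.3 is an arbitrary example).
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3

with tf.Session(config=config) as session:
    # ... build the graph and run inference with this session ...
    pass
```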

For reference, I am using the Python DeepSpeech client.

Currently we don’t have any option to do so.

However, it looks like one could do so with a few code changes, using the technique you reference[1] along with, in the case of multiple GPUs on a single machine, CUDA_VISIBLE_DEVICES[2].
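For the multi-GPU case, the idea would be to make each worker see only one device before TensorFlow initializes, roughly like this (a sketch; the GPU index is just an example):

```python
import os

# Expose only one physical GPU to this worker; this has to happen
# before TensorFlow / the DeepSpeech client initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # device index chosen for illustration

# ... import and use the DeepSpeech client after this point ...
```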

@kdavis thanks for the tip. I was looking through the code, and it seems like I would have to make this change in DeepSpeech.py, but that’s more for training purposes. Is there a way to do that in the Python native client? It seems like maybe it would be in deepspeech.h or deepspeech.cc.

@LearnedVector My guess, though I’ve not tried, is that you’d have to modify the SessionOptions[1] passed to NewSession[2] to include GPU options.

@LearnedVector Were you able to split the GPU memory for running inference using the DeepSpeech client?

FWIW, I tested changing that GPU limit in deepspeech.cc and it works fine.


Why does it not work [here](https://github.com/mozilla/DeepSpeech/blob/ae146d06199280758cb34acb3496c0ec5d303ad6/DeepSpeech.py#L1758)?
def do_single_file_inference(input_file_path):
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.4
    with tf.Session(config=config) as session:

After modifying it with the above four lines, it still uses all of the GPU memory when running with --one_shot_infer. Why?

No idea, maybe some TensorFlow bug? Are you also using --notrain --notest?

Yes, I am using --train False --test False, but it is still using the whole GPU memory…

Hi, could you please elaborate? What did you do exactly?