After rebuilding and trying to run two parallel processes, we notice that one of the running processes still tries to allocate all of the available GPU memory, which means we still hit the same out-of-memory error:
2019-11-28 10:53:55.315564: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-11-28 10:53:55.550252: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 13.69G (14699583744 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.551025: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 12.32G (13229624320 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.551784: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 11.09G (11906661376 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.552518: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 9.98G (10715995136 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.553244: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 8.98G (9644395520 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.553949: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 8.08G (8679955456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.554668: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 7.28G (7811959808 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.555398: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 6.55G (7030763520 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.556143: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 5.89G (6327687168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.556854: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 5.30G (5694918144 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.557579: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 4.77G (5125426176 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.558281: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 4.30G (4612883456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.559010: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 3.87G (4151595008 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.559719: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 3.48G (3736435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.560427: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 3.13G (3362791936 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.561154: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.82G (3026512640 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.561890: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.54G (2723861248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.562617: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.28G (2451474944 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.563371: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.05G (2206327296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.564074: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.85G (1985694464 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.564774: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.66G (1787124992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.565476: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.50G (1608412416 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.566201: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.35G (1447571200 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.566917: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.21G (1302814208 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.567654: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.09G (1172532736 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.568357: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1006.39M (1055279616 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.569082: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 905.75M (949751808 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.569801: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 815.18M (854776576 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.570519: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 733.66M (769298944 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:57.587464: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-11-28 10:53:57.823504: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-11-28 10:53:57.853637: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.856427: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.858252: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.859887: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.861685: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.863990: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.864661: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.866457: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.868445: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.870251: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.995128: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.995181: W tensorflow/stream_executor/stream.cc:2130] attempting to perform BLAS operation using StreamExecutor without BLAS support
Error running session: Internal: Blas GEMM launch failed : a.shape=(16, 494), b.shape=(494, 2048), m=16, n=2048, k=494
[[{{node MatMul}}]]
[[{{node logits}}]]
2019-11-28 10:53:57.995617: I tensorflow/stream_executor/stream.cc:2079] [stream=0x12a00170,impl=0x772ad70] did not wait for [stream=0x772ac90,impl=0x772a920]
2019-11-28 10:53:57.995622: I tensorflow/stream_executor/stream.cc:2079] [stream=0x125d9ae0,impl=0x12a2d610] did not wait for [stream=0x772ac90,impl=0x772a920]
2019-11-28 10:53:57.995700: I tensorflow/stream_executor/stream.cc:5027] [stream=0x12a00170,impl=0x772ad70] did not memcpy host-to-device; source: 0x178cbb00
2019-11-28 10:53:57.995713: I tensorflow/stream_executor/stream.cc:5014] [stream=0x125d9ae0,impl=0x12a2d610] did not memcpy device-to-host; source: 0x7fa6de002500
2019-11-28 10:53:57.995741: F tensorflow/core/common_runtime/gpu/gpu_util.cc:339] CPU->GPU Memcpy failed
2019-11-28 10:53:57.997924: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.997954: W tensorflow/stream_executor/stream.cc:2130] attempting to perform BLAS operation using StreamExecutor without BLAS support
2019-11-28 10:53:57.997983: I tensorflow/stream_executor/stream.cc:2079] [stream=0x12153fc0,impl=0x12154060] did not wait for [stream=0x10e21be0,impl=0x68b1600]
2019-11-28 10:53:57.998011: I tensorflow/stream_executor/stream.cc:5014] [stream=0x12153fc0,impl=0x12154060] did not memcpy device-to-host; source: 0x7fd539457400
2019-11-28 10:53:57.998142: F tensorflow/core/common_runtime/gpu/gpu_util.cc:292] GPU->CPU Memcpy failed
Sorry for the huge block of error messages, but I thought it would be relevant. Do you have any insight into why this is still happening despite rebuilding with the changes to the TensorFlow config?
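
For reference, the kind of config change I mean is roughly the following (a minimal sketch assuming a TF 1.x `tf.Session`-based setup; the exact memory fraction and call sites in our code differ, and the options shown are the standard `tf.ConfigProto` GPU options):

```python
import tensorflow as tf

# Keep TensorFlow from grabbing the whole card so two processes can share one GPU.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                      # grow allocations on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.45   # cap each process below half the card (example value)

with tf.Session(config=config) as sess:
    # ... load the graph and run inference as before ...
    pass
```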