native_client built from source for CPU only uses one thread

I'm using the deepspeech executable inside the native_client folder to run inference with the release models and audio examples. It takes 13 seconds on my dual-CPU, 12-core, 24-thread machine and uses only one thread. Is there a compile option or a runtime option to enable threading?

Can you document how you built it? Our binaries do leverage threading, as far as I can verify.

I built with the bazel command as the guide shows; I did add -msse4.1 and -msse4.2 to the flags it asks for, but that didn't help. Do I need to build with MKL or OpenCL, or is there an option in the deepspeech build (the one that just uses make deepspeech)? The resulting binary also crashes when running with models/trie.

Reading /home/jacobmh/deepspeech/audio/2830-3980-0043.wav
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
terminate called after throwing an instance of 'lm::FormatLoadException'
what(): native_client/kenlm/lm/read_arpa.cc:65 in void lm::ReadARPACounts(util::FilePiece&, std::vector&) threw FormatLoadException.
first non-empty line was "RIFF�
Aborted (core dumped)

No need for MKL or OpenCL. There is no specific option; -lpthread should be there already. I verified again, and I see threads being spawned and running here. Your crash is because you are not passing the arguments properly: you are mixing up the order of the arguments.
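The "RIFF" in the KenLM exception is the tell: a WAV file starts with a RIFF header, so the language-model loader was handed the audio file. A minimal sketch of the diagnosis (paths are illustrative, following the release-bundle layout used in this thread; the exact positional order depends on your client version, so check the usage string the binary prints):

```shell
# Paths as laid out in the release bundle used in this thread:
MODEL=models/output_graph.pb
ALPHABET=models/alphabet.txt
TRIE=models/trie                    # KenLM trie; must go in its own slot
AUDIO=audio/2830-3980-0043.wav      # WAV = RIFF container

# KenLM aborting with:  first non-empty line was "RIFF...
# means the WAV file was fed to the language-model loader, i.e. the
# positional arguments were out of order. Check the usage string your
# deepspeech binary prints; the LM-less invocation shown later in this
# thread is:
#   ./deepspeech "$MODEL" "$ALPHABET" "$AUDIO" -t
echo "model=$MODEL audio=$AUDIO"
```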

Now working with multi-threading as well. Might I suggest adding a note to the README saying that PREFIX=/usr/local sudo make install needs to be repeated for each build if you run from the native_client directory after having installed once, not just for the binaries. Thanks very much for your help; the ldd command was the thing that solved it. Do you think it's worth it, performance-wise, to rebuild for CUDA 9.1 over CUDA 8.0?
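For anyone hitting the same stale-library problem, the reinstall-and-verify loop can be sketched as follows (a sketch, assuming you are in the native_client directory; it only reports what it would do when the build artifacts are absent):

```shell
# Run from native_client/ after each rebuild. If the build artifacts are
# present, reinstall and verify; otherwise just report and skip.
if [ -f Makefile ] && [ -x ./deepspeech ]; then
  # Reinstall so the binary stops resolving a stale copy of the
  # libraries under /usr/local:
  PREFIX=/usr/local sudo make install
  sudo ldconfig    # refresh the dynamic linker cache
  # Confirm which shared objects the executable actually loads; paths
  # pointing at an old install mean the step above is still needed:
  ldd ./deepspeech
  status=reinstalled
else
  echo "not in native_client (no Makefile / deepspeech binary); skipping"
  status=skipped
fi
```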

About threading, copying my answer from the other thread:

Regarding the threads, the default build should enable pthread. Here is a more verbose log from the TaskCluster CPU-only build, on my laptop:

$ TF_CPP_MIN_VLOG_LEVEL=2 ./deepspeech ../models/output_graph.pbmm ../models/alphabet.txt ../audio/ -t 2>&1 | grep -i thread
2018-03-18 00:19:21.785609: I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 4
2018-03-18 00:19:21.785870: I tensorflow/core/common_runtime/direct_session.cc:82] Direct session inter op parallelism threads: 4
[...]

And here on my desktop:

$ TF_CPP_MIN_VLOG_LEVEL=2 ./deepspeech models/output_graph.pb models/alphabet.txt audio/ -t 2>&1 | grep -i thread
2018-03-18 00:23:05.969842: I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 32
2018-03-18 00:23:05.971011: I tensorflow/core/common_runtime/direct_session.cc:83] Direct session inter op parallelism threads: 32
[...]

So it creates threads.

For CUDA 9.1 vs 8.0, I don't really have an opinion. It might depend on your GPU; if it's recent, you might benefit. You should benchmark it.
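To settle the CUDA question empirically, timing the same inference under each build is enough. A minimal sketch, reusing the file layout and the -t flag from the invocations above (it skips itself if the binary is not on hand):

```shell
run_once() {
  # Same model and audio for both builds, with the -t flag used in the
  # invocations earlier in this thread:
  ./deepspeech models/output_graph.pb models/alphabet.txt \
      audio/2830-3980-0043.wav -t
}

if [ -x ./deepspeech ]; then
  # Discard the first run: it pays one-off initialization costs.
  run_once >/dev/null 2>&1
  for i in 1 2 3; do
    time run_once
  done
  bench=done
else
  echo "deepspeech binary not found; run this from native_client/"
  bench=skipped
fi
```

Build once against CUDA 8.0 and once against 9.1, run this in each tree, and compare the wall-clock times.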