DeepSpeech TensorFlow does not use all CPU cores

I can see that the CPU-only versions/releases of DeepSpeech for Raspberry Pi 3 or ARM64 utilize only one CPU core for performing the inference, while parallel execution across cores would improve the inference time. Has anyone seen this behavior, and is there a way to configure TensorFlow to utilize all the CPU cores for inference?

All the verifications I could pursue on this confirmed that TensorFlow was properly using all cores. Can you provide more detail on what makes you think that is not the case?

In particular, running with the environment variable TF_CPP_MIN_VLOG_LEVEL=2 should give indications of the inter-op and intra-op parallelism being used.

@sranjeet.visteon

$ TF_CPP_MIN_VLOG_LEVEL=2 ./deepspeech --model output_graph.pbmm --alphabet alphabet.txt --audio LDC93S1.wav 2>&1 | grep -i parallelism
2018-10-23 16:41:24.924421: I tensorflow/core/common_runtime/local_device.cc:41] Local device intra op parallelism threads: 4
2018-10-23 16:41:24.925181: I tensorflow/core/common_runtime/process_util.cc:82] Direct session inter op parallelism threads: 4 

A parallel htop also shows multiple threads running and taking CPU.
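
For reference, the two thread pools those log lines refer to can be sized explicitly. A minimal sketch, assuming the TF1-era Python API; the deepspeech binary here is C++, so this only illustrates the knobs, not how the native client actually sets them:

import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=4,  # threads used inside a single op (e.g. one matmul)
    inter_op_parallelism_threads=4,  # independent ops that may run concurrently
)
sess = tf.Session(config=config)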

@lissyx Below is the output from my Jetson TX2 HW

(venv_0.3.0) nvidia@tegra-ubuntu:~/deepspeech/native_client.arm64.cpu.linux$ TF_CPP_MIN_VLOG_LEVEL=2 ./deepspeech --model ./…/models/output_graph.pbmm --alphabet ./…/models/alphabet.txt --audio ./…/wav 2>&1 | grep -i parallelism
2018-10-23 16:52:26.319769: I tensorflow/core/common_runtime/local_device.cc:41] Local device intra op parallelism threads: 6
2018-10-23 16:52:26.320450: I tensorflow/core/common_runtime/process_util.cc:82] Direct session inter op parallelism threads: 6

It shows that 6 threads are available, and running htop in parallel I can see all the CPUs being used, but only one CPU is heavily utilized at >70% while the rest are mostly <20% utilized. The average usage across all CPUs is ~25%. Is this expected behavior?

This is more of a TensorFlow-level question, but it does confirm that parallelism is being triggered. Honestly, I think it’s mostly the same level of usage we see on other hardware. However, if you are running on a Jetson, you should rather look into cross-compiling for your system with CUDA and leveraging the GPU.

@lissyx, thanks. Yes, this is more of a TensorFlow question, but I wanted to hear from other DeepSpeech users about my observation. On an RPi3 I found a similar concern: utilization is always <=100% across cores while 400% is available to be utilized.

I have a use case to run DeepSpeech on other hardware where only the CPU is available, which is why I am trying it without enabling the GPU on the Jetson TX2.

But all of that actually depends on how much parallelism there is in the model. Just because there are 4 CPUs does not mean we can make use of all of them.
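
To illustrate that point (this is not DeepSpeech code): a recurrent model is a chain of time steps where each step depends on the previous one, so the main place extra cores can help is inside each step’s matrix multiply, never across steps:

import numpy as np

W = np.random.randn(2048, 2048).astype(np.float32)
h = np.zeros(2048, dtype=np.float32)
for t in range(100):    # step t+1 needs h from step t, so steps cannot run in parallel
    h = np.tanh(W @ h)  # only this matrix multiply can be split across cores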

Well, performance on the RPi3 and that kind of board is currently not as good as we’d like; they are there for demo purposes.

Running a (slightly modified) version of the model under the TFLite benchmark_model tool yields better performance on a Pixel 2 device (~2.40B FLOPs per sec) versus RPi3 or LePotato boards (~490M FLOPs per sec).

plain-vs-tflite.zip (249.7 KB)
This contains SVG graphs of the model, the plain version and the TFLite one, the latter modified so it can be ingested by toco and run on the TFLite engine.

@lissyx, is there a plan to release a model and implementation of DeepSpeech based on TFLite any time soon? As I mentioned before, we have a use case to run on a CPU-only system, and the TFLite version might be a better option to evaluate.

Yes, there’s an issue opened on GitHub. There’s an (outdated) WIP TFLite branch; I should send a PR with my fixes. It does run in the end, at least a 2048-wide model trained for one epoch on LDC93S1. We still have work to do, but it’s starting to work. Except when you want NNAPI on Android, but that’s another story :slight_smile:
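
For anyone wanting to experiment once a .tflite export is available, a hypothetical sketch using the TFLite Python interpreter (the model filename and the zero-filled input are placeholders, not artifacts of the WIP branch, and tf.lite.Interpreter requires a reasonably recent TensorFlow):

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="output_graph.tflite")  # placeholder name
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Feed zeros with whatever shape/dtype the model declares, just to exercise it.
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)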