Prebuilt DeepSpeech binary for TensorFlow Lite model on Raspberry Pi 3?

You should not hope to use OpenCL on the RPi. I worked on that for weeks last year to assess its status, and while the driver was (and still is) under active development, our model was too complicated for it, and neither the maintainer nor I could find time to start working on the blocking items.

I would really like you to share more context, because I'm still not able to reproduce. This is on an RPi4, freshly reinstalled, with the ICE Tower fan + heat spreader:

pi@raspberrypi:~/ds $ for f in audio/*.wav; do echo $f; mediainfo $f | grep Duration; done;
audio/2830-3980-0043.wav
Duration                                 : 1 s 975 ms
Duration                                 : 1 s 975 ms
audio/4507-16021-0012.wav
Duration                                 : 2 s 735 ms
Duration                                 : 2 s 735 ms
audio/8455-210777-0068.wav
Duration                                 : 2 s 590 ms
Duration                                 : 2 s 590 ms
pi@raspberrypi:~/ds $ ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.5-59-ga8a7af05
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=3.24553
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=2.38253
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=3.23032

So it's consistent with the previous builds I did. Can you give more context on what you do, @dr0ptp4kt? How do you build / measure?

Hi @lissyx! When I refer to inferencing, I'm talking about the inference-specific portion of the run with client.py.

I noticed that the GitHub link I posted looked like a simple fork link, but here's the specific README.md that shows what I did:

https://github.com/dr0ptp4kt/DeepSpeech/blob/tflite-rpi-3and4-compat/rpi3and4/README.md

Even from a warmed-up filesystem cache, the LM takes 1.28 s to load on this 4 GB RAM Pi 4, so when that's subtracted from the total run it makes a significant percentage-wise difference. In an end-user application context, what I'd do is have the LM pre-injected before the intake of voice data, so that the only thing the client has to do is the inferencing. Of course a 1.8 GB LM isn't going to fit into RAM on a device with 1 GB of RAM, so there I think the only good option is to fiddle with the size (and therefore quality) of the LM, TRIE, and .tflite model files as appropriate to the use case.
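
For instance, a crude way to do that pre-injection on a box with enough RAM is just to warm the page cache before voice intake starts (a sketch, not what the client does today; the paths assume the 0.5.1 model layout used elsewhere in this thread):

# pull the LM and trie into the page cache ahead of time, so a later deepspeech
# run mostly pays for inference rather than for reading the LM off the SD card
cat deepspeech-0.5.1-models/lm.binary > /dev/null &
cat deepspeech-0.5.1-models/trie > /dev/null &
wait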

I'm not telling you anything new here, but it's of course also possible to offload error correction to the retrieval system. In my Wikipedia use case, for lower-RAM scenarios I might be content to forgo or dramatically shrink the LM and TRIE, increase the size of the .tflite for greater precision (because there would still be RAM available), and use some sort of optimized, forgiving topic-embedding / fuzzy-matching scheme in the retrieval system, effectively moving part of the problem to the later stage. It's of course possible to move those improvements into the audio detection run with DeepSpeech itself, but in the context of this binary, it's about managing the RAM in stages so that the LM and TRIE don't spill over and page to disk.

Anyway, it looks like your run and my run are pretty close in terms of general speed: processing takes roughly as long as the length of the clip (and the inference-specific part seems to take less time).
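
To put a number on that, the real-time factor for a single clip can be worked out from the figures above, e.g. for 4507-16021-0012.wav (2 s 735 ms of audio, cpu_time_overall=3.24553 in the run you pasted):

# real-time factor = processing time / audio duration
awk 'BEGIN { print 3.24553 / 2.735 }'   # ~1.19, i.e. slightly slower than real time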

For your product roadmap, is the hope to be as fast as the incoming audio for real-time processing, or something of that nature? How much optimization do you want? I'm really interested in helping with that (through raw algorithms and smart hacks on the LM / TRIE / .tflite) or even with build system stuff if you're open to it - but I also know you need to manage the product roadmap, so I don't want to be too imposing!

Keep up the great work! If it would work for you, I'd be happy to discuss on video (or Freenode if you prefer).

I'm running without the LM.

Ok, can you try with the deepspeech C++ binary and the -t command line argument?

Those are mmap()'d, so it's not really a big issue.

What do you mean?

Here's what I'm seeing with -t. Funny I missed the flag earlier :stuck_out_tongue:

Using the LM:

$ ./deepspeech --model deepspeech-0.5.1-models/output_graph.tflite --alphabet deepspeech-0.5.1-models/alphabet.txt --lm deepspeech-0.5.1-models/lm.binary --trie deepspeech-0.5.1-models/trie --audio arctic_a0024.wav -t
TensorFlow: v1.13.1-13-g174b4760eb
DeepSpeech: v0.5.1-0-g4b29b78
it was my reports from the north which chiefly induced people to buy
cpu_time_overall=3.25151

Not using the LM:

$ ./deepspeech --model deepspeech-0.5.1-models/output_graph.tflite --alphabet deepspeech-0.5.1-models/alphabet.txt --audio arctic_a0024.wav -t
TensorFlow: v1.13.1-13-g174b4760eb
DeepSpeech: v0.5.1-0-g4b29b78
it was my reports from the northwhich chiefly induced people to buy
cpu_time_overall=6.95059

So part of the speed definitely comes from actually using the LM.

I agree with you that the mmap'ing of the .tflite diminishes the negative effect of disk reads. As for the LM, it's definitely faster when injected into RAM. Are you sure that's being consumed in an mmap'd fashion? I know it should be possible to mmap the read, of course, but that thing seems to take some 40 s on the initial run - longer than I would expect if it were doing filesystem segment seeks in an mmap fashion. Maybe the 40 s on the first read is just because the client fully consumes the file, whereas it could be made to only consume the pointer… I haven't dug into that part of the code beyond a quick scan. Gotta run, but I'm interested to hear if you have tips.
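
One way to check that from the outside (a rough sketch; it assumes pmap from procps is installed, the paths are illustrative, and the directory run keeps the process alive long enough to inspect):

./deepspeech --model output_graph.tflite --alphabet alphabet.txt --lm lm.binary --trie trie --audio audio/ -t &
DS_PID=$!
sleep 5                                              # give it time to open the model and LM files
pmap -x $DS_PID | grep -E 'output_graph|lm.binary'   # file-backed mappings show up here if they are mmap'd
wait $DS_PID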

For the product roadmap, I mainly just wanted to ensure that if I post patches, they'd be valuable to you and the general user base of DeepSpeech. I know it's an open source project and I'm free to fork, but I was hoping that if there are problems that can be solved in a mutually beneficial way, I could work on those. I reckon the last thing you need is patches that aren't aligned with where you're taking this software. Specifically, I was wondering how much optimization you want in this RPi4 context; if it would be helpful, I'd be inclined to post patches to bring the optimization to the level you're hoping for. As for the build system, I'd also be happy to help with build scripts and that sort of thing (e.g., Bazel stuff, cutting differently sized versions of models, etc.) - I'm not sure whether you'd need me to get shell access and do the requisite paperwork for that, or whether that's off limits or just not helpful - I can appreciate just how hard build and deploy pipelines are. I realize TaskCluster sort of runs on arbitrary systems, but it's also the case that I don't have a multi-GPU machine, so much of the full build pipeline, and the assumptions of things even as simple as keystores, sort of breaks down on my local dev rig.

I'm unsure exactly what you are suggesting here. You reported much faster inference than what we can achieve, so I'm trying to understand. Getting TFLite to run is not a problem; I've been experimenting with that for months now, so I know how to do it.

Maybe, but that's not really what we are concerned about for now.

Could you please reproduce that with current master and using our audio files?

Also, could you please document what your base system is? Raspbian? What's your PSU?

I'll reproduce with the current master and share. Just a heads up, I may be a bit occupied over the next several days.

It's stock Raspbian for the Raspberry Pi 4, using the official Raspberry Pi USB-C power adapter. I have a metal heatsink on the CPU and the modem (the heatsink for the modem doesn't fit on the GPU or I'd put it there!) but no other thermal stuff going on, no overclocking. Pretty normal setup.
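
For what it's worth, under-voltage or thermal throttling can be checked on Raspbian with the stock vcgencmd tool (just a sanity check; I'm not suggesting it's happening here):

vcgencmd measure_temp    # current SoC temperature
vcgencmd get_throttled   # throttled=0x0 means no under-voltage or thermal throttling has been flagged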

@lissyx I found some time tonight to build this against master using Bazel 0.24.1. I used the multistrap conf file for Stretch instead of Buster just to hold that variable constant. FWIW, I had actually produced a Buster build ~2 weeks ago for v0.5.1 and saw similar performance between the Stretch and Buster builds at that time.

I just re-used the .tflite and lm.binary from the v0.5.1 model archive, as you might infer from the directory names in the second run below. On my Mac I generated a new trie file for v4 instead of v3, then SCP'd it over to the Pi for use in the second run below (the faster one that uses the LM and TRIE).
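
For reference, regenerating the trie looks roughly like this with the generate_trie tool that ships alongside the native client (a sketch from memory; the paths are placeholders and the exact arguments may differ between releases):

./generate_trie alphabet.txt lm.binary trie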

Anyway, here's what I'm seeing.

Without LM and TRIE. Quite similar in speed to the results you pasted in.

pi@pi4:~/ds60 $ ./deepspeech --model output_graph.tflite --alphabet alphabet.txt --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-0-gccf1b2e
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=3.16360
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=3.08172
> audio//arctic_a0024.wav
it was my reports from the northwhich chiefly induced people to buy
cpu_time_overall=4.89014
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=2.30461

With LM and TRIE. Faster. The nice news here is that for arctic_a0024.wav it appears to be even faster than what I got with the v0.5.1 TFLite build.

~/ds60 $ ./deepspeech --model output_graph.tflite --alphabet alphabet.txt --lm ~/ds/deepspeech-0.5.1-models/lm.binary --trie trie --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-0-gccf1b2e
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=2.10253
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=2.07445
> audio//arctic_a0024.wav
it was my reports from the north which chiefly induced people to buy
cpu_time_overall=3.17514
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=1.55579

Well, if you run on an RPi4 you have Buster, so it'd be more correct to use Buster when building, even though it should not make any difference.

All in all, we have the same setup, at least.

I'm starting to wonder whether we have regressed the way we measure time.

Ok, adding the LM, I'm getting similar results:

pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.5-59-ga8a7af05
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=3.24919
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=2.37936
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=3.20311

real    0m8.877s
user    0m8.781s
sys     0m0.091s
pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ --lm models/lm.binary --trie models/trie -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.5-59-ga8a7af05
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=2.14202
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=1.59947
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=2.15120

real    0m6.810s
user    0m5.890s
sys     0m0.920s
pi@raspberrypi:~/ds $ 

Okay, I know what's happening. When we don't load an LM, we don't set an external scorer, and thus this code is not executed: https://github.com/mozilla/DeepSpeech/blob/ccf1b2e73ed161525a289ecf8d4e7beac9adad88/native_client/ctcdecode/ctc_beam_search_decoder.cpp#L39-L44

The DS_SpeechToText* implementation relies, under the hood, on the Streaming API, which means StreamingState::processBatch() gets called a few times. That in turn calls decoder_state_.next(): https://github.com/mozilla/DeepSpeech/blob/ccf1b2e73ed161525a289ecf8d4e7beac9adad88/native_client/deepspeech.cc#L253-L255

Obviously, with the LM and the trie, the beam search is faster.

This is confirmed when checking execution time around .next().

Here, with the LM:

pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ --lm models/lm.binary --trie models/trie --extended -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-5-g5845505
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
ds_createstream_time=0.00002
decoder_state_time=0.00501
decoder_state_time=0.00368
decoder_state_time=0.02743                   
decoder_state_time=0.04064     
decoder_state_time=0.02312                                                   
decoder_state_time=0.03033
decoder_state_time=0.04554
decoder_state_time=0.01943
decoder_state_time=0.01434
ds_create_time=1.59941 ds_finish_time=2.11518
why should one halt on the way
cpu_time_overall=2.11524 cpu_time_decoding=0.08359 cpu_time_decodeall=0.08360
> audio//2830-3980-0043.wav
ds_createstream_time=0.00001
decoder_state_time=0.00450
decoder_state_time=0.00636
decoder_state_time=0.01696
decoder_state_time=0.03406
decoder_state_time=0.02439
decoder_state_time=0.00506
decoder_state_time=0.00079
ds_create_time=1.09895 ds_finish_time=1.58255
experienced proof less
cpu_time_overall=1.58259 cpu_time_decoding=0.07956 cpu_time_decodeall=0.07957
> audio//8455-210777-0068.wav
ds_createstream_time=0.00001
decoder_state_time=0.00414
decoder_state_time=0.00451
decoder_state_time=0.02432
decoder_state_time=0.05007
decoder_state_time=0.04753
decoder_state_time=0.03541
decoder_state_time=0.04675
decoder_state_time=0.00631
decoder_state_time=0.00084
ds_create_time=1.62787 ds_finish_time=2.12236
your power is sufficient i said
cpu_time_overall=2.12242 cpu_time_decoding=0.08898 cpu_time_decodeall=0.08899

real    0m6.732s
user    0m5.850s
sys     0m0.882s

And without the LM:

pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ --extended -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-5-g5845505
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
ds_createstream_time=0.00008
decoder_state_time=0.09759
decoder_state_time=0.11414
decoder_state_time=0.13162
decoder_state_time=0.15993
decoder_state_time=0.16111
decoder_state_time=0.14577
decoder_state_time=0.16027
decoder_state_time=0.16672
decoder_state_time=0.09356
ds_create_time=2.42510 ds_finish_time=3.16191
why should one halt on the way
cpu_time_overall=3.16199 cpu_time_decoding=0.07662 cpu_time_decodeall=0.07662
> audio//2830-3980-0043.wav
ds_createstream_time=0.00001
decoder_state_time=0.09842
decoder_state_time=0.11592
decoder_state_time=0.13760
decoder_state_time=0.16061
decoder_state_time=0.15751
decoder_state_time=0.14751
decoder_state_time=0.02784
ds_create_time=1.68183 ds_finish_time=2.32753
experienced proof less
cpu_time_overall=2.32759 cpu_time_decoding=0.07175 cpu_time_decodeall=0.07176
> audio//8455-210777-0068.wav
ds_createstream_time=0.00001
decoder_state_time=0.10290
decoder_state_time=0.11246
decoder_state_time=0.13637
decoder_state_time=0.15481
decoder_state_time=0.16675
decoder_state_time=0.18728
decoder_state_time=0.19037
decoder_state_time=0.18308
decoder_state_time=0.01142
ds_create_time=2.46809 ds_finish_time=3.14513
your power is sufficient i said
cpu_time_overall=3.14518 cpu_time_decoding=0.08011 cpu_time_decodeall=0.08012

real    0m8.674s
user    0m8.594s
sys     0m0.081s

This accounts for the difference in execution time.

@dr0ptp4kt Thanks for sharing your experiments. I've gotten into the bad habit of testing without the LM, and when I first tested on the RPi4 I didn't pay enough attention to how much of an impact the LM would have. Without your feedback, I would not have checked deeper. I think we're going to switch to the TFLite runtime for the v0.6 ARMv7 builds for RPi3/RPi4, and thus tell people that they should be able to get faster-than-real-time performance on the RPi4.
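
As a rough sanity check using the numbers above, e.g. 4507-16021-0012.wav (2 s 735 ms of audio, cpu_time_overall=2.14202 with the LM and trie):

awk 'BEGIN { print 2.14202 / 2.735 }'   # ~0.78, i.e. comfortably faster than real time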

From what I could test on the AArch64 boards we have (LePotato, S905X, https://libre.computer/products/boards/aml-s905x-cc/), the situation still holds: the SoC is not powerful enough for that kind of performance.

I did investigate that, and found that we could tune the way the LM is loaded. Current master has a PR that improves things. On the RPi4, the latency improves nicely. On the Android devices I could test, the difference isn't even noticeable :-).