You should not hope to use OpenCL on the RPi. I spent weeks testing the status of that last year, and while the driver was (and still is) under active development, our model was too complicated for it, and neither the maintainer nor I could find time to start working on the blocking items.
I would really like you to share more context, because I'm still not able to reproduce. This is on an RPi4, reinstalled just now, with the ice tower fan + heatspreader:
pi@raspberrypi:~/ds $ for f in audio/*.wav; do echo $f; mediainfo $f | grep Duration; done;
audio/2830-3980-0043.wav
Duration : 1 s 975 ms
Duration : 1 s 975 ms
audio/4507-16021-0012.wav
Duration : 2 s 735 ms
Duration : 2 s 735 ms
audio/8455-210777-0068.wav
Duration : 2 s 590 ms
Duration : 2 s 590 ms
pi@raspberrypi:~/ds $ ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.5-59-ga8a7af05
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=3.24553
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=2.38253
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=3.23032
So it's consistent with the previous builds I did. @dr0ptp4kt, can you give more context on what you are doing? How do you build / measure?
Hi @lissyx! When I'm referring to inferencing, I'm talking about the inferencing-specific portion of the run with client.py.
I noticed that the GitHub link I posted looked like a simple fork link, but here's the specific README.md that shows what I did:
https://github.com/dr0ptp4kt/DeepSpeech/blob/tflite-rpi-3and4-compat/rpi3and4/README.md
The LM, even from a warmed-up filesystem cache, is taking 1.28 s to load on this 4 GB RAM Pi 4. So when that's subtracted from the total run, it makes a significant percentage-wise difference. In an end-user application context, what I'd do is have the LM pre-loaded before the intake of voice data, so that the only thing the client has to do is the inferencing. Of course, a 1.8 GB LM isn't going to fit into RAM on a device with 1 GB of RAM, so there I think the only good option is to adjust the size (and therefore quality) of the LM, TRIE, and .tflite model files as appropriate to the use case.
I'm not telling you anything new here, but it's also of course possible to offload error correction to the retrieval system. In my Wikipedia use case, for lower-RAM scenarios I might be content to forgo or dramatically shrink the LM and TRIE, increase the size of the .tflite model for greater precision (because there would still be RAM available), and use some sort of optimized, forgiving topic-embedding / fuzzy-matching scheme in the retrieval system, effectively moving part of the problem to a later stage. It's of course possible to move those improvements into the speech recognition run with DeepSpeech itself, but in the context of this binary, it's about managing the RAM in stages so that the LM and TRIE don't spill over and page to disk.
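To make the staging concrete, here is a minimal sketch of the structure I have in mind: pay the model + LM load cost once, before any voice data arrives, and time only the per-utterance inference. This is not DeepSpeech's actual client code; load_model_and_lm() and infer_utterance() are hypothetical stand-ins.

#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real model/LM creation and STT calls --
// the structure is what matters here, not the exact DeepSpeech API.
struct Engine { /* model + scorer handles would live here */ };

Engine load_model_and_lm(const std::string&, const std::string&, const std::string&) {
  return Engine{};  // real code would create the model and enable the LM/trie here
}

std::string infer_utterance(Engine&, const std::vector<short>&, int /*sample_rate*/) {
  return "recognized text";  // real code would run the inference here
}

int main() {
  using clock = std::chrono::steady_clock;

  // Pay the model + LM load cost once, before any voice data arrives.
  auto t0 = clock::now();
  Engine engine = load_model_and_lm("output_graph.tflite", "lm.binary", "trie");
  auto t1 = clock::now();
  std::printf("load: %.3f s\n", std::chrono::duration<double>(t1 - t0).count());

  // Each utterance then only pays the inference cost.
  std::vector<short> samples(16000 * 2, 0);  // pretend: 2 s of 16 kHz audio
  auto t2 = clock::now();
  std::string text = infer_utterance(engine, samples, 16000);
  auto t3 = clock::now();
  std::printf("inference: %.3f s -> %s\n",
              std::chrono::duration<double>(t3 - t2).count(), text.c_str());
  return 0;
}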
Anyway, it looks like your run and my run are pretty close in terms of general speed - it's really close to taking about the same time to process as the length of the clip (and the inference-specific part seems to take less time).
For your product roadmap, is the hope to be as fast as the incoming audio for realtime processing, or something of that nature? How much optimization do you want? I'm really interested in helping with that (through raw algorithms and smart hacks on the LM / TRIE / .tflite) or even with build system stuff if you're open to it - but I also know you need to manage the product roadmap, so I don't want to be too imposing!
Keep up the great work! If it would work for you, I'd be happy to discuss on video (or Freenode if you prefer).
I'm running without the LM.
Ok, can you try with the deepspeech C++ binary and the -t command line argument?
Those are mmap()'d, so it's not really a big issue.
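For reference, here is a minimal POSIX sketch of what mmap()'d loading means in practice (this is not the actual DeepSpeech/TFLite loading code): the file is mapped read-only, pages are faulted in lazily on first access, and the kernel can drop them again under memory pressure, so "loading" the file costs almost nothing up front.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
  if (argc < 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }

  struct stat st;
  if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

  // Map the whole file read-only. This returns almost immediately:
  // no data is read yet, pages are faulted in on first access.
  void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  if (base == MAP_FAILED) { perror("mmap"); return 1; }

  // Touching a byte is what actually triggers I/O (or a page-cache hit).
  volatile unsigned char first = static_cast<const unsigned char*>(base)[0];
  std::printf("mapped %lld bytes, first byte = 0x%02x\n",
              static_cast<long long>(st.st_size), first);

  munmap(base, st.st_size);
  close(fd);
  return 0;
}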
What do you mean?
Here's what I'm seeing with -t. Funny I missed the flag earlier.
Using the LM:
$ ./deepspeech --model deepspeech-0.5.1-models/output_graph.tflite --alphabet deepspeech-0.5.1-models/alphabet.txt --lm deepspeech-0.5.1-models/lm.binary --trie deepspeech-0.5.1-models/trie --audio arctic_a0024.wav -t
TensorFlow: v1.13.1-13-g174b4760eb
DeepSpeech: v0.5.1-0-g4b29b78
it was my reports from the north which chiefly induced people to buy
cpu_time_overall=3.25151
Not using the LM:
$ ./deepspeech --model deepspeech-0.5.1-models/output_graph.tflite --alphabet deepspeech-0.5.1-models/alphabet.txt --audio arctic_a0024.wav -t
TensorFlow: v1.13.1-13-g174b4760eb
DeepSpeech: v0.5.1-0-g4b29b78
it was my reports from the northwhich chiefly induced people to buy
cpu_time_overall=6.95059
So part of the speed is definitely coming from actual use of the LM.
I agree with you that the mmap'ing of the .tflite diminishes the negative effect of disk reads. As for the LM, it's definitely faster when loaded into RAM. Are you sure it's being consumed in an mmap'd fashion? I know it should be possible to mmap-read it, of course, but that thing seems to take some 40 s on the initial run - longer than I would expect if it were doing filesystem segment seeks in an mmap fashion. Maybe the 40 s on the first read is just because the client fully consumes the file when it could be made to only consume the pointer… I haven't dug into that part of the code beyond a quick scan. Gotta run, but I'm interested to hear if you have tips.
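One way to check whether that 40 s first run is just a cold page cache: mincore() reports which pages of a mapping are currently resident. This is a hedged sketch of my own (not anything from the DeepSpeech code) - running it against lm.binary before and after a deepspeech run and comparing the resident percentage would tell us whether the file is being mapped lazily or read wholesale.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  if (argc < 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

  int fd = open(argv[1], O_RDONLY);
  struct stat st;
  if (fd < 0 || fstat(fd, &st) != 0) { perror("open/fstat"); return 1; }

  void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  if (base == MAP_FAILED) { perror("mmap"); return 1; }

  long page = sysconf(_SC_PAGESIZE);
  size_t pages = (static_cast<size_t>(st.st_size) + page - 1) / page;
  std::vector<unsigned char> vec(pages);

  // mincore() fills one byte per page; the low bit says whether the page
  // is currently resident in the page cache.
  if (mincore(base, st.st_size, vec.data()) != 0) { perror("mincore"); return 1; }

  size_t resident = 0;
  for (unsigned char v : vec) resident += v & 1;
  std::printf("%zu / %zu pages resident (%.1f%%)\n",
              resident, pages, 100.0 * resident / pages);

  munmap(base, st.st_size);
  close(fd);
  return 0;
}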
For the product roadmap, I mainly just wanted to ensure that if I were to post patches, they'd be valuable to you and the general user base of DeepSpeech. I know it's an open source project and I'm free to fork, but I was hoping to work on problems that are mutually beneficial. I reckon the last thing you need is patches that aren't aligned with where you're taking this software. Specifically, I was wondering how much optimization you want in this RPi4 context; if it would be helpful, I'd be inclined to post patches to reach the level of optimization you're hoping for. As for the build system, I'd also be happy to help with build scripts and that sort of thing (e.g., Bazel stuff, cutting differently sized versions of models, etc.) - not sure whether you'd need me to get shell access and do the requisite paperwork for that, or whether that's off limits or just not helpful - I can appreciate just how hard build and deploy pipelines are. I realize Taskcluster sort of runs on arbitrary systems, but it's also the case that I don't have a multi-GPU machine, so much of the full build pipeline's capabilities, and its assumptions about things even as simple as keystores, sort of break down on my local dev rig.
I'm unsure exactly what you are suggesting here. You reported much faster inference than what we can achieve, so I'm trying to understand. Getting TFLite to run is not a problem; I've been experimenting with that for months now, so I know how to do it.
Maybe, but that's not really what we are concerned about for now.
Could you please reproduce that with current master and using our audio files?
Also, could you please document your base system? Raspbian? What's your PSU?
I'll reproduce with the current master and share. I may be a bit occupied over the next several days, just a heads-up.
It's stock Raspbian for the Raspberry Pi 4, using the official Raspberry Pi USB-C power adapter. I have a metal heatsink on the CPU and modem (the modem heatsink doesn't fit on the GPU or I'd put it there!) but no other thermal measures and no overclocking. Pretty normal setup.
@lissyx I found some time tonight to build this against master using Bazel 0.24.1. I used the multistrap conf file for Stretch instead of Buster, just to hold that variable constant. FWIW, I had actually produced a Buster build ~2 weeks ago for v0.5.1 and saw similar performance between the Stretch and Buster builds at that time.
I just re-used the .tflite and lm.binary from the v0.5.1 model archive, as you might infer from the directory names in the second run below. From my Mac I generated a new trie file (v4 instead of v3), then SCP'd it over to the Pi for use in the second run below (the faster one that uses the LM and TRIE).
Anyway, here's what I'm seeing.
Without LM and TRIE: quite similar in speed to the results you pasted in.
pi@pi4:~/ds60 $ ./deepspeech --model output_graph.tflite --alphabet alphabet.txt --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-0-gccf1b2e
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=3.16360
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=3.08172
> audio//arctic_a0024.wav
it was my reports from the northwhich chiefly induced people to buy
cpu_time_overall=4.89014
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=2.30461
With LM and TRIE: faster. The nice news here is that for arctic_a0024.wav it appears to be even faster than what I got with the v0.5.1 TFLite build.
~/ds60 $ ./deepspeech --model output_graph.tflite --alphabet alphabet.txt --lm ~/ds/deepspeech-0.5.1-models/lm.binary --trie trie --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-0-gccf1b2e
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=2.10253
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=2.07445
> audio//arctic_a0024.wav
it was my reports from the north which chiefly induced people to buy
cpu_time_overall=3.17514
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=1.55579
Well, if you run on an RPi4, you have Buster. It'd be more correct to also use Buster when building, even though it should not make any difference.
All in all, we have the same setup, at least.
I'm starting to wonder whether we have regressed the way we measure time.
Ok, adding the LM, I'm getting similar results:
pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.5-59-ga8a7af05
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=3.24919
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=2.37936
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=3.20311
real 0m8.877s
user 0m8.781s
sys 0m0.091s
pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ --lm models/lm.binary --trie models/trie -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.5-59-ga8a7af05
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=2.14202
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=1.59947
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=2.15120
real 0m6.810s
user 0m5.890s
sys 0m0.920s
pi@raspberrypi:~/ds $
Okay, I know what's happening. When we don't load an LM, we don't set an external scorer, and thus this is not executed: https://github.com/mozilla/DeepSpeech/blob/ccf1b2e73ed161525a289ecf8d4e7beac9adad88/native_client/ctcdecode/ctc_beam_search_decoder.cpp#L39-L44
The DS_SpeechToText* implementation will, under the hood, rely on the Streaming API, which means StreamingState::processBatch() gets computed a few times. That will call decoder_state_.next(): https://github.com/mozilla/DeepSpeech/blob/ccf1b2e73ed161525a289ecf8d4e7beac9adad88/native_client/deepspeech.cc#L253-L255
Obviously, with the LM and the trie, the beam search is faster.
This is confirmed when checking execution time around .next().
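The decoder_state_time lines in the output below come from instrumentation of this kind - a hedged sketch rather than the actual patch, with a stand-in for decoder_state_.next(); the exact placement inside StreamingState::processBatch() is my assumption:

#include <chrono>
#include <cstdio>

// Stand-in for DecoderState::next(); in the real code this consumes one
// batch of acoustic-model outputs and advances the beam search.
static void decoder_next_batch() {
  volatile double x = 0;
  for (int i = 0; i < 1000000; ++i) x += i * 1e-9;  // simulate work
}

int main() {
  using clock = std::chrono::steady_clock;
  // Same pattern as the decoder_state_time lines: time each decoder step
  // and print the elapsed seconds.
  for (int batch = 0; batch < 3; ++batch) {
    auto t0 = clock::now();
    decoder_next_batch();
    auto t1 = clock::now();
    std::printf("decoder_state_time=%.5f\n",
                std::chrono::duration<double>(t1 - t0).count());
  }
  return 0;
}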
Here, with the LM:
pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ --lm models/lm.binary --trie models/trie --extended -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-5-g5845505
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
ds_createstream_time=0.00002
decoder_state_time=0.00501
decoder_state_time=0.00368
decoder_state_time=0.02743
decoder_state_time=0.04064
decoder_state_time=0.02312
decoder_state_time=0.03033
decoder_state_time=0.04554
decoder_state_time=0.01943
decoder_state_time=0.01434
ds_create_time=1.59941 ds_finish_time=2.11518
why should one halt on the way
cpu_time_overall=2.11524 cpu_time_decoding=0.08359 cpu_time_decodeall=0.08360
> audio//2830-3980-0043.wav
ds_createstream_time=0.00001
decoder_state_time=0.00450
decoder_state_time=0.00636
decoder_state_time=0.01696
decoder_state_time=0.03406
decoder_state_time=0.02439
decoder_state_time=0.00506
decoder_state_time=0.00079
ds_create_time=1.09895 ds_finish_time=1.58255
experienced proof less
cpu_time_overall=1.58259 cpu_time_decoding=0.07956 cpu_time_decodeall=0.07957
> audio//8455-210777-0068.wav
ds_createstream_time=0.00001
decoder_state_time=0.00414
decoder_state_time=0.00451
decoder_state_time=0.02432
decoder_state_time=0.05007
decoder_state_time=0.04753
decoder_state_time=0.03541
decoder_state_time=0.04675
decoder_state_time=0.00631
decoder_state_time=0.00084
ds_create_time=1.62787 ds_finish_time=2.12236
your power is sufficient i said
cpu_time_overall=2.12242 cpu_time_decoding=0.08898 cpu_time_decodeall=0.08899
real 0m6.732s
user 0m5.850s
sys 0m0.882s
And without the LM:
pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ --extended -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-5-g5845505
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
ds_createstream_time=0.00008
decoder_state_time=0.09759
decoder_state_time=0.11414
decoder_state_time=0.13162
decoder_state_time=0.15993
decoder_state_time=0.16111
decoder_state_time=0.14577
decoder_state_time=0.16027
decoder_state_time=0.16672
decoder_state_time=0.09356
ds_create_time=2.42510 ds_finish_time=3.16191
why should one halt on the way
cpu_time_overall=3.16199 cpu_time_decoding=0.07662 cpu_time_decodeall=0.07662
> audio//2830-3980-0043.wav
ds_createstream_time=0.00001
decoder_state_time=0.09842
decoder_state_time=0.11592
decoder_state_time=0.13760
decoder_state_time=0.16061
decoder_state_time=0.15751
decoder_state_time=0.14751
decoder_state_time=0.02784
ds_create_time=1.68183 ds_finish_time=2.32753
experienced proof less
cpu_time_overall=2.32759 cpu_time_decoding=0.07175 cpu_time_decodeall=0.07176
> audio//8455-210777-0068.wav
ds_createstream_time=0.00001
decoder_state_time=0.10290
decoder_state_time=0.11246
decoder_state_time=0.13637
decoder_state_time=0.15481
decoder_state_time=0.16675
decoder_state_time=0.18728
decoder_state_time=0.19037
decoder_state_time=0.18308
decoder_state_time=0.01142
ds_create_time=2.46809 ds_finish_time=3.14513
your power is sufficient i said
cpu_time_overall=3.14518 cpu_time_decoding=0.08011 cpu_time_decodeall=0.08012
real 0m8.674s
user 0m8.594s
sys 0m0.081s
This does account for the difference in execution time.
@dr0ptp4kt Thanks for sharing your experiments. I've gotten into the bad habit of testing without the LM, and when I first tested on the RPi4 I did not pay enough attention to how much of an impact the LM would have. Without your feedback, I would not have dug deeper. I think we're going to switch to the TFLite runtime for the v0.6 ARMv7 builds for RPi3/RPi4, and thus tell people that they should be able to get faster-than-realtime performance on the RPi4.
As far as I could test on the Aarch64 boards we have (LePotato, S905X, https://libre.computer/products/boards/aml-s905x-cc/), the situation is unchanged: that SoC is not powerful enough to reach this kind of performance.
I did investigate that and found that we could tune the way the LM is loaded. Current master has a PR that improves things. On the RPi4, the latency improves nicely. On Android devices, the difference is barely visible, as far as I could test :-).
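If the tuning in question is about KenLM's load method (an assumption on my part, not confirmed above), the knob looks roughly like this: KenLM's Config lets the caller choose between reading the whole binary up front and lazily mmap'ing it, so the up-front load latency mostly disappears. A minimal sketch, assuming a KenLM checkout on the include path:

#include "lm/model.hh"
#include "util/mmap.hh"

#include <cstdio>
#include <memory>

int main(int argc, char** argv) {
  if (argc < 2) { std::fprintf(stderr, "usage: %s <lm.binary>\n", argv[0]); return 1; }

  lm::ngram::Config config;
  // util::LAZY maps the file and lets pages fault in on demand, so the
  // up-front "load" cost mostly disappears; the default load method
  // pulls the whole file in before returning.
  config.load_method = util::LAZY;

  std::unique_ptr<lm::base::Model> model(lm::ngram::LoadVirtual(argv[1], config));
  std::printf("order: %u\n", static_cast<unsigned>(model->Order()));
  return 0;
}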