How to use the pretrained TFLite model?

Hi,
How should I modify client.py to support the pretrained TFLite model?

I tried without changing anything and I got:

Loading model from file models/deepspeech-0.6.0-models/output_graph.tflite
TensorFlow: v1.14.0-21-ge77504ac6b
DeepSpeech: v0.6.0-0-g6d43e21
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-12-05 09:29:49.068415: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Data loss: Can't parse models/deepspeech-0.6.0-models/output_graph.tflite as binary proto
Traceback (most recent call last):
  File "deepspeech_client.py", line 149, in <module>
    main()
  File "deepspeech_client.py", line 113, in main
    ds = Model(args.model, args.beam_width)
  File "C:\Work\Analytics\git\PacMoose\cocktails\.venv38\lib\site-packages\deepspeech\__init__.py", line 42, in __init__
    raise RuntimeError("CreateModel failed with error code {}".format(status))
RuntimeError: CreateModel failed with error code 12293

I’m running DeepSpeech 0.6.0 / Python 3.8 [MSC v.1916 64 bit (AMD64)] on win32

Thanks !

Ah, I just saw TFLite is not supported on Windows… I’ll try on Linux.
Sorry about that!

It is supported on Windows too, just not in the default package we upload to PyPI. I’m working on a way to make it easier to try. For now you can grab the Python wheel file directly from our build infrastructure: https://community-tc.services.mozilla.com/tasks/DwHim8KIQniEb9UZetiJMQ

On the right side, click “Artifacts”, copy the URL of the Python package matching your version, and pip install it. For Python 3.8 on Windows, it’ll be:

pip install https://community-tc.services.mozilla.com/api/queue/v1/task/DwHim8KIQniEb9UZetiJMQ/runs/0/artifacts/public/deepspeech-0.6.0-cp38-cp38-win_amd64.whl
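
Once that wheel is installed, the client shouldn’t need changes beyond pointing --model at the .tflite file; the Python API is the same as for the .pbmm graph. Here’s a minimal sketch of the 0.6.0 API, assuming a 16-bit, 16 kHz, mono WAV file (the paths are just examples; BEAM_WIDTH, LM_ALPHA and LM_BETA are the defaults from the 0.6.0 client):

import wave
import numpy as np
from deepspeech import Model

# Example paths; adjust to wherever you extracted the 0.6.0 model package.
MODEL_PATH = "models/deepspeech-0.6.0-models/output_graph.tflite"
LM_PATH = "models/deepspeech-0.6.0-models/lm.binary"
TRIE_PATH = "models/deepspeech-0.6.0-models/trie"
BEAM_WIDTH = 500
LM_ALPHA = 0.75
LM_BETA = 1.85

# The same constructor handles .pbmm and .tflite when the package
# is built with TFLite support.
ds = Model(MODEL_PATH, BEAM_WIDTH)
ds.enableDecoderWithLM(LM_PATH, TRIE_PATH, LM_ALPHA, LM_BETA)

# Read the audio into a 16-bit int buffer, as the model expects.
with wave.open("audio.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(ds.stt(audio))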

Haha, great!
And it works on Win10.

Interesting results:

testing with the TFLite model
Loading model from file models/deepspeech-0.6.0-models/output_graph.tflite
TensorFlow: v1.14.0-21-ge77504ac6b
DeepSpeech: v0.6.0-0-g6d43e21
INFO: Initialized TensorFlow Lite runtime.
Loaded model in 0.0477s.
Loading language model from files models/deepspeech-0.6.0-models/lm.binary models/deepspeech-0.6.0-models/trie
Loaded language model in 0.064s.
Running inference.
i like to make it or a story
Inference took 3.429s for 2.113s audio file.

testing with the TF model
Loading model from file models/deepspeech-0.6.0-models/output_graph.pbmm
TensorFlow: v1.14.0-21-ge77504ac6b
DeepSpeech: v0.6.0-0-g6d43e21
2019-12-05 09:29:46.644788: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Loaded model in 0.0193s.
Loading language model from files models/deepspeech-0.6.0-models/lm.binary models/deepspeech-0.6.0-models/trie
Loaded language model in 0.0221s.
Running inference.
i would like to make a dark and stormy
Inference took 1.589s for 2.113s audio file.

Results on an audio file saying “i would like to make a dark and stormy”:

  • Inference is more than twice as slow with the TFLite model (3.429 s vs. 1.589 s)
  • The inference result is perfect with TF, but not as good with TFLite

Thanks for reporting the results. That is not expected; it shouldn’t be slower. I’ll try to reproduce.

Can you share complete command lines as well as details on your hardware/setup?

Can you reproduce over several sequential runs?

I can confirm that the v0.6 TFLite model is slower and less accurate than the speeds advertised in the press release a couple of days ago. It is even slightly slower and less accurate than the v0.5.1 TFLite model. This is on an Android device.

It is taking almost 3 seconds for cold startup plus inference, and then 2 seconds on subsequent calls, for a 1-second clip of audio. Granted, we are trying to get this running on a slower device than a Pixel 2, but it is still disappointing given the claims around the RPi4.

I am going to try and get the project up into a repo for y’all to look at.

This is very device-dependent. Can you share more information about your device?

Definitely nothing fancy. ARMv8 arch.

What are the stats/performance expected on a Pixel 2?

We’re roughly twice as fast as real time on Snapdragon 835 devices, i.e. the Pixel 2, last I checked (this afternoon, basically).

I can see:

Chipset: Exynos 7570
Processor speed: 1.4 GHz Quad Core

That’s https://en.wikichip.org/wiki/samsung/exynos/7570, a Cortex-A53. That’s indeed closer to the RPi3, and the figures you report match what we experienced on the RPi3.

The Snapdragon 835 is Cortex-A73-based.

Just ran this on a Pixel 2 XL, which has the same chipset as the Pixel 2. Getting an inference time of 350 ms after startup on a 950 ms clip of audio. Thank you.

Is there any way to reduce the latency on slower devices? The way I see it, I have a few options: prune the language model to a smaller vocabulary, or reduce the number of hidden nodes and train a smaller graph from scratch. Is there anything else to try?

I want to avoid training from scratch if I can, just because it will take a significant amount of time to tune that parameter so as to maintain accuracy while optimizing latency. And I doubt I can reduce the number of hidden nodes from 2048 all the way down to, say, 64 or 128 and have any hope of keeping a good WER, right?

Definitely not 64 or 128, but Iara Health (the company I mentioned in the blog post) is using a width of 700 for their model and getting good accuracy. Their use case has a constrained vocabulary.

The language model might help, but most of the complexity is in the LSTM layers. Reducing the number of hidden nodes will definitely help, but at the expense of training from scratch.

There’s basically a quadratic relationship between temporal complexity and the n_hidden value.
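
A rough back-of-the-envelope illustration of that quadratic relationship (counting only the LSTM matrix multiplies, so this is an approximation):

# LSTM cost per timestep grows roughly with n_hidden squared, so the
# cost relative to the default width of 2048 is (width / 2048) ** 2.
for width in (2048, 700, 128):
    rel = (width / 2048) ** 2
    print(f"n_hidden={width}: ~{rel:.3f}x the per-step matmul cost of 2048")

# Prints ~1.000x for 2048, ~0.117x for 700 (about 8.5x cheaper),
# and ~0.004x for 128.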

Has there been any observation of increased false-positive rates as the vocabulary size decreases? We need about 30 continuous commands, with the ability to grow later. If we constrain the vocabulary to those 30 and someone says a word that isn’t in there, will it try to find a bucket for it, or will it throw the word away? Obviously, if the word sounds similar, I would expect it to be classified as such.

This is why it would be great to have the entire language there and then use post-processing to check what was said.

And, just to confirm, I would have to avoid using the LibriSpeech dataset if I were to train from scratch on a constrained vocabulary, correct? Can the model ignore words in the training data as noise if they do not appear in the vocabulary?

Or I would just write a script that generates all possible combinations of the commands we need and build the language model from that.
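
For what it’s worth, here’s a minimal sketch of such a generator; the command templates and slot values are made-up examples. The resulting commands.txt would then go through KenLM (lmplz / build_binary) and generate_trie to produce lm.binary and the trie:

# Hypothetical example: enumerate every phrase the constrained-command
# language model should cover, one utterance per line.
import itertools

ACTIONS = ["make", "mix", "pour"]
DRINKS = ["a dark and stormy", "a mojito", "an old fashioned"]
TEMPLATES = [
    "i would like to {action} {drink}",
    "please {action} {drink}",
    "{action} {drink}",
]

with open("commands.txt", "w") as f:
    for template, action, drink in itertools.product(TEMPLATES, ACTIONS, DRINKS):
        f.write(template.format(action=action, drink=drink) + "\n")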

That is what I thought. Looks like I will be exploring that path or possibly going the route of creating a simpler LSTM-CTC network to recognize just those commands as individual units.

What is the expected WER degradation for the TFLite model in ideal conditions?

On the test set, when it was enabled, the WER went from ~8.2% to ~10.3%. But real use might yield different values; there are a lot of variables in play…

That is fair. Thank you.