Feature Request: Spotting Keywords

I’ve been following the impressive work done on this project. What I would like to see is the ability to detect keywords - e.g. wake words and short commands. The detectable keywords should be configurable, and the STT engine should constrain its results to match ONLY those keywords - as in, everything looks like a nail if you’re a hammer. In other words - if the configured keyword were “house”, even recognized words such as “mouse” could be matched, up to a certain (maybe even configurable) threshold.
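For illustration, the thresholded “house”/“mouse” matching described above could be sketched with a plain edit-distance check. This is just a sketch of the idea; the `match_keyword` helper and the 0.8 threshold are hypothetical, not part of any existing DeepSpeech API:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def match_keyword(word, keywords, threshold=0.8):
    """Return the configured keyword closest to `word`, or None when
    the similarity (1 - distance / max length) is below the threshold."""
    best, best_sim = None, 0.0
    for kw in keywords:
        dist = edit_distance(word.lower(), kw.lower())
        sim = 1.0 - dist / max(len(word), len(kw))
        if sim > best_sim:
            best, best_sim = kw, sim
    return best if best_sim >= threshold else None

print(match_keyword("mouse", ["house", "kitchen"]))  # "mouse" vs "house": similarity 0.8
```

So a recognized “mouse” would snap to the configured keyword “house” (one substitution out of five characters), while unrelated words fall below the threshold and are rejected.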

Thanks for your consideration.

Have you tried a dedicated language model with only those keywords? We did some experiments using that, in several contexts, and it proved to work quite well with the released English model.

@lissyx one of the key requirements for wake word or keyword spotting is that the solution has to be always listening, and so it has to be computationally efficient, with a reduced footprint in terms of flash and RAM usage. Can a DeepSpeech acoustic model and a language model trained only to recognize the keyword result in a low footprint?

Take a look at https://github.com/MycroftAI/mycroft-precise

Well … a dedicated language model might work, but the idea is to change the keywords “on the fly”, depending on the environment - e.g. in a home automation system where the operator installs and names a new switch.

@lissyx, @reuben Coming back to this topic, I was experimenting with the DeepSpeech model for wake word detection by reducing the number of hidden units from 2048 to 256 during training, using a custom dataset with samples from various speakers for 10 short words (max of 10 characters), with approx. 1.5K samples per word.

Though the CPU load for inference dropped to an eighth of the original DeepSpeech model’s, this was still not efficient enough for a keyword system. Do you have any suggestions on modifying the training parameters, such as further reducing the number of hidden units, to create an even smaller network that only needs to predict 10 to 20 characters?
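To get a feel for why shrinking n_hidden helps so much, here is a back-of-the-envelope parameter count for a DeepSpeech-like stack (three dense layers, one unidirectional LSTM, one dense layer, an output layer). The layer layout and sizes are my assumption for illustration, not the exact released architecture:

```python
def approx_params(n_hidden, n_input=494, n_labels=29):
    """Very rough parameter count for a DeepSpeech-style network.

    Assumed (hypothetical) layout: dense(in->h), dense(h->h), dense(h->h),
    LSTM(h), dense(h->h), dense(h->labels); biases ignored for simplicity.
    """
    dense = n_input * n_hidden + 3 * n_hidden * n_hidden + n_labels * n_hidden
    lstm = 4 * (2 * n_hidden) * n_hidden  # 4 gates, input + recurrent weights
    return dense + lstm

big, small = approx_params(2048), approx_params(256)
print(f"{big/1e6:.1f}M -> {small/1e6:.1f}M params ({big/small:.0f}x fewer)")
```

Under these assumptions the weight count shrinks far more than 8x, which hints that the remaining inference cost is not dominated by the matrix multiplies alone (feature extraction and decoding are per-frame costs that do not scale down with n_hidden).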

That seems quite radical. I could build something quite efficient on Android with the default English model and just a dedicated language model with command words. Could you document exactly what your constraints are?

@lissyx the specific concern with using the pre-trained model as-is is the high CPU usage: since this model is expected to be always active, deciphering each word being spoken, running it with heavy CPU usage is not affordable.

For example, I was running the TFLite version of the pre-trained model on Qualcomm 820 hardware to infer speech continuously, and it takes almost 100% of a CPU core (which is close to 2.0 GHz of processing power).

This is why I was trying to reduce the number of hidden units, as this lowers the complexity of the model, and I was thinking it would be efficient for decoding a single word. Do you have suggestions on reducing the model size for a keyword detection process, to make it more CPU efficient?

You may want to take a look at https://www.tensorflow.org/tutorials/sequences/audio_recognition

It should take no more than one core, though. That is still high if you run continuous recognition, but if you use VAD and the streaming API, you won’t be using 100% of one core 100% of the time.
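As an illustration of the VAD-gating idea, here is a minimal sketch that uses a simple energy threshold as a stand-in for a real VAD such as webrtcvad; the frame format, the threshold value, and the notion of a downstream recognizer are all hypothetical:

```python
def frame_energy(frame):
    """Mean squared amplitude of one audio frame (a list of PCM samples)."""
    return sum(s * s for s in frame) / len(frame)

def gated_frames(frames, threshold=1e4):
    """Yield only frames whose energy exceeds the threshold, so the
    (expensive) recognizer never runs on silence."""
    for frame in frames:
        if frame_energy(frame) >= threshold:
            yield frame

# Synthetic input: three silent frames and two "voiced" frames of 160 samples.
silence = [0] * 160
voiced = [500] * 160  # energy 250000, well above the threshold
frames = [silence, voiced, silence, voiced, silence]

active = list(gated_frames(frames))
print(len(active), "of", len(frames), "frames sent to the recognizer")
```

With the streaming API, only the gated frames would be fed into the stream, and the stream would be finalized when a voiced run ends, instead of decoding every incoming frame.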

No, because we are not working on that, so we can’t give more insight than “adapt and test”. You may want to check what @elpimous_robot did with his robot, though that was very early DeepSpeech and the model was a bit different.

And FYI, this is exactly the setup in mozillaspeechlibrary; I have a PR open against that Android component to add DeepSpeech there.

My observation is that the STT industry (particularly commercial offerings) seems to have split into short utterance-based recognition (< 30 s, typically 2-5 s) and arbitrary-length dictation/transcription. They’re being solved in different ways, and long-form is much harder.

For example, see Picovoice, which its author says is ultra high performance, embeddable, and offline-capable; it could be a nice complement to DeepSpeech since it is optimised for short utterances (less than 30 s).