Custom LM causes terrible false positive rate

Hello!

We have a specific use case where the DeepSpeech tflite model will be used on an Android device and needs to recognize about 30 commands. I successfully created an lm.binary and trie file using the tools in the repo and KenLM. This decreased our WER by a lot, but I am noticing some funky behavior when I pass the model audio containing a sentence made up of out-of-vocabulary (OOV) words. Instead of just ignoring the words and treating them as noise, as the restricted vocabulary would suggest, it tries to force the audio into one of those 30 command buckets, causing false positives.

Is there a way to retrieve a confidence score, or to make the model more robust to this? Or is there something I could try when generating the LM and trie?

Any help will be greatly appreciated!

Commands used:

./lmplz -o 3 < corpus.txt > lm.arpa --discount_fallback
./build_binary lm.arpa lm.binary
./generate_trie ../alphabet.txt lm.binary trie

Thanks!

Maybe there is some optimization when creating the LM file that puts more emphasis on <unk>? I thought --interpolate_unigrams 0 would help with that, but I saw no difference!

Those commands do not match what we document for producing the language model. Can you verify after using the proper ones?

Yes, have a look at the Metadata part in the API
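
In the Python bindings that looks roughly like the sketch below. The constructor and decoder arguments have changed between releases, so treat this as a rough outline and check the API reference for your version rather than copying it verbatim:

import numpy as np
import wave
from deepspeech import Model

with wave.open('sample.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)  # 16 kHz, 16-bit mono PCM

ds = Model('output_graph.tflite', 500)                       # second argument is the beam width in recent releases
ds.enableDecoderWithLM('lm.binary', 'trie', 0.75, 1.85)      # lm_alpha, lm_beta; exact signature varies by version
metadata = ds.sttWithMetadata(audio)                         # same decoding as stt(), plus per-character metadata
print(metadata.confidence)                                   # rough overall score for the transcript
print(''.join(item.character for item in metadata.items))   # the transcript itself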

This is something we have already experimented with successfully, although with a few more commands than that.

I’d really like to see the outcome with proper LM generation parameters.

Is what I used not correct? Any tips on that?

I am not sure where that is documented. I see this file here: https://github.com/mozilla/DeepSpeech/blob/d925e6b5fc186f3524e7c03d6eacf440d5366262/data/lm/generate_lm.py

But that includes pruning and a huge dataset. We have a small number of command phrases where some phrases are 1 word long, so we do not want to filter or prune.

I see that I changed the order from what is listed here: https://github.com/mozilla/DeepSpeech/blob/ea8e4637d34fcdbd2d0d77d821208e2e1012a59c/native_client/kenlm/README.md#estimation

I will try with an order of five and see what happens.

Also, for the smaller corpus, I am getting an error unless I use the discount fallback flag. Does this flag change the nature of the solution?

Yes, but look at the build_binary call, there’s some quantization and trie format specified.
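
From memory it is something along these lines (check data/lm/generate_lm.py for the exact flags):

./build_binary -a 255 -q 8 trie lm.arpa lm.binary

-q 8 quantizes the probabilities to 8 bits, -a 255 lets KenLM pick pointer-compression settings that minimize memory, and trie selects the compact trie data structure instead of the default probing one.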

I can’t really say much about the behavior of the flag, but I have hit the same limitation as well, although it did not seem to have the same impact as what you describe.

So I did it again with the updated build_binary command and rebuilt the trie and lm.binary files. This language is focused mainly on numbers, along with some other commands. As a false-positive smoke test, I pass the model samples from LibriSpeech just to see what comes out. Instead of ignoring the OOV words, it tries to force them into one of the buckets.

e.g. these are the transcripts it produces, which are false positives:
two nine one two 
scan one one ten one
five ten four seven
three one

This part of the error is interesting when --discount_fallback is omitted:

To override this error for e.g. a class-based model, rerun with --discount_fallback

Any idea what a class-based model is?

Did you tune the LM hyperparameters alpha and beta?

I have not. I see in the repo these comments:

# The alpha hyperparameter of the CTC decoder. Language Model weight
LM_ALPHA = 0.75

# The beta hyperparameter of the CTC decoder. Word insertion bonus.
LM_BETA = 1.85

Do you have a recommendation of which would be best to focus on?

Both. Do a grid search, or a random search, and plot the error surface to see in what direction things are improving.
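
As a minimal sketch of what I mean, assuming a score_params helper (hypothetical: you would implement it to decode your held-out command clips, plus some OOV clips, with the given weights and return the error you want to minimize):

import itertools
import numpy as np

def score_params(lm_alpha, lm_beta):
    # Hypothetical helper: run the evaluation clips through the decoder with
    # these weights and return e.g. WER plus false-positive rate.
    # Placeholder value so the sketch runs as-is.
    return 0.0

alphas = np.linspace(0.1, 3.0, 15)   # ranges are only a starting point around the repo defaults
betas = np.linspace(0.1, 3.0, 15)
grid = [(a, b, score_params(a, b)) for a, b in itertools.product(alphas, betas)]
best_alpha, best_beta, best_score = min(grid, key=lambda g: g[2])
print('best lm_alpha=%.2f lm_beta=%.2f score=%.3f' % (best_alpha, best_beta, best_score))

Reshaping the scores into a 15x15 array and plotting them gives you the error surface.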

That is running now. Will post results when completed. Thanks!

Update:
Some preliminary results show that, to jointly optimize for model execution time, a low false-positive rate, and the lowest WER using a genetic algorithm (GA), lm_alpha needs to be greater than lm_beta and beam_width needs to be less than 50. Going to let the GA run for a while and try to get some hard numbers for my use case.


Is it worth maybe looking into tuning the full model on small set of Librispeech samples but modify all words to for a few epochs and then train on small set of samples from our own audio that uses the commands? Not sure if there is much more we can optimize for on the LM side.

I don’t understand your question.

small set of Librispeech samples but modify all words to for a few epochs

What does this mean?

and then train on small set of samples from our own audio that uses the commands

You can also try fine tuning the model on just these samples. If it’s very few samples (<100) you can even do it on a laptop. Other people who reported fine tuning experiments here on the forum recommended using a much lower learning rate when doing that. We use 1e-4 for our models, maybe try 1e-6 and 1e-7, see which one works best.
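
Roughly something like this, pointing --checkpoint_dir at the released checkpoint; flag names have shifted between versions, so check the training flags for the release you are on:

python3 DeepSpeech.py --train_files train.csv --dev_files dev.csv --test_files test.csv \
    --checkpoint_dir /path/to/released/checkpoint --epochs 3 --learning_rate 0.000001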

I am sorry. I forgot a word.
What if I modified the Librispeech dataset so that all words are mapped to <unk> for, say, 100 samples, and then trained on my training set?

And good to know. We may try that then. We do not have a lot of data.

So the key takeaway for us was that post-processing the decoder output becomes much more effective once the beam width is reduced. It is hard to notice a material difference from lm_alpha and lm_beta as beam_width is decreased. For our use case, a beam_width of 5 with lm_alpha and lm_beta of 0.001 lets the decoder spit out nonsense when OOV words are spoken, which we can easily filter out. Sort of hacky, but effective. Thanks for all the help!

Any tips for building that dataset? The audio files are all WAV files at 16-bit depth with a 16 kHz sample rate. Can I just build a train, dev, and validation set of files that list the absolute path to each clip mapped to its transcript? Is there an example anywhere of this format? Thanks!
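
To be concrete, I am picturing something like this, guessing at the header from the importer scripts in the repo (one row per clip, size in bytes, paths and sizes made up):

wav_filename,wav_filesize,transcript
/data/commands/clip_0001.wav,63244,scan one one ten one
/data/commands/clip_0002.wav,48102,five ten four seven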