How to restrict the output to the language model only?

tarekeldeeb · July 22, 2018, 11:54am

Hello Community.

I have trained an Arabic model and managed to get a WER of around 0.35. My dataset is still small (~40 hours), and I’m working on collecting more data and on data augmentation.

Meanwhile, I see some strange output text.

Two correct words with no space between them
Very weird letters that look like nothing, rubbish!

As far as I understand, the language model is used for beam search and defines the output text. Why is the output not restricted to vocabulary from the LM? Is there a switch for that?

jageshmaharjan · July 24, 2018, 4:29am

+1, I was in the same boat, but I was training on a Chinese-language dataset. What resolved it for me (this might not be applicable to your language) was separating each character with a space and building the language model with 4-grams. Still, it's strange to get no space between words, despite having the space character in alphabet.txt and, of course, in the vocabulary used to build the language model.
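The character-separation step described above could be sketched roughly as follows. This is a minimal illustration, not the poster's actual script: the function name `space_separate_characters` and the sample sentences are hypothetical, and the resulting lines would still need to be passed to whatever tool builds the 4-gram model.

```python
def space_separate_characters(line: str) -> str:
    """Put a single space between every non-whitespace character.

    Existing whitespace is dropped first, so each character ends up
    as its own "word" from the language model's point of view.
    """
    return " ".join(ch for ch in line if not ch.isspace())


# Hypothetical sample corpus, used only for illustration.
corpus = ["我爱北京", "你好 世界"]
prepared = [space_separate_characters(line) for line in corpus]

for line in prepared:
    print(line)
```

With text prepared this way, a 4-gram model effectively covers sequences of four characters rather than four words.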

kdavis · July 25, 2018, 7:38am

@tarekeldeeb

Did you create your own Arabic language model?
Did you create your own trie?
Have you adjusted lm_weight?
Have you adjusted word_count_weight?
Have you adjusted valid_word_count_weight?

In particular the trie indicates which words are valid and valid_word_count_weight indicates the relative weight given to the trie results.

tarekeldeeb · July 25, 2018, 2:07pm

@kdavis

Yes, I created my own language model and made sure that all spoken words are included.
Yes, I created my own trie.
No, I did not adjust any weights. I think they are built and included automatically when building the 4-gram model.

I can see clear text in the .arpa file, as expected. In the .trie file I see a pattern like:

-1
21
2633
-4.63527
-1
-1
-1
-1
-1
-1
-1
-1
-1
1
14574
-4.63527
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1

What have I missed?

lissyx · July 25, 2018, 2:01pm

I think you are just hitting https://github.com/mozilla/DeepSpeech/issues/1156

kdavis · July 25, 2018, 2:12pm

The trie determines what is “in vocabulary” and the valid_word_count_weight determines how much importance should be given to the trie’s opinion on what is in and what is not in vocabulary.

So, in particular, increasing the value of valid_word_count_weight should decrease the occurrence of out-of-vocabulary words, as defined by the trie.

The weights I mentioned (lm_weight, word_count_weight, and valid_word_count_weight) are external to the language model and are not part of the language model weights, which, as you mention, are built automatically.
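As a rough illustration of how these three external weights can influence which candidate wins, here is a toy scoring sketch. It assumes a simple linear combination of an acoustic log-probability, a language model log-probability, a word count, and a count of words found in the vocabulary. That formula, the function name `candidate_score`, and all the numbers are assumptions made for illustration, and the real decoder may combine these terms differently.

```python
def candidate_score(acoustic_logp, lm_logp, words, vocabulary,
                    lm_weight, word_count_weight, valid_word_count_weight):
    """Toy score for one candidate transcription (illustrative only)."""
    word_count = len(words)
    # Words the vocabulary (standing in for the trie) accepts as valid.
    valid_word_count = sum(1 for w in words if w in vocabulary)
    return (acoustic_logp
            + lm_weight * lm_logp
            + word_count_weight * word_count
            + valid_word_count_weight * valid_word_count)


vocabulary = {"hello", "world"}

# Hypothetical candidates: the second contains an out-of-vocabulary
# word but has a better acoustic score.
in_vocab = {"words": ["hello", "world"], "acoustic_logp": -5.0, "lm_logp": -4.0}
with_oov = {"words": ["hello", "wrld"], "acoustic_logp": -3.0, "lm_logp": -4.5}


def best(valid_word_count_weight):
    """Return the words of the higher-scoring candidate."""
    scored = [
        (candidate_score(c["acoustic_logp"], c["lm_logp"], c["words"],
                         vocabulary, lm_weight=1.0, word_count_weight=1.0,
                         valid_word_count_weight=valid_word_count_weight),
         c["words"])
        for c in (in_vocab, with_oov)
    ]
    return max(scored)[1]


print(best(0.0))  # the out-of-vocabulary candidate wins
print(best(2.0))  # the in-vocabulary candidate wins
```

In this toy setup, raising valid_word_count_weight is what tips the balance toward the candidate made up entirely of in-vocabulary words.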

I hope that’s a bit clearer?

tarekeldeeb · July 29, 2018, 4:39pm

Yes, thanks a lot.

I have started playing around with those weights.

Regards,