Adding custom words to language model

yv001 · January 31, 2019, 10:53am

I’d like to extend the language model with custom words that are missing from the default one or appear in it only rarely. For example, imagine that “mozilla”, “google” and “ibm” appear in many of the samples I’d like to transcribe.

The question is how much text I’d need to add on top of the text used for the default deepspeech language model.

I thought the following method might work, but I’d like to see if someone has a better way of doing that:

Collect a sample text containing the custom words and estimate the probability of each custom word in that sample, e.g.
sample_mozilla_probability = #mozilla / #sample_words
I would like my custom words to have the same probability in the final lm training set (final = deepspeech text + new text) as they have in my custom sample, so I calculate the number of occurrences needed in the new text as

mozilla_needed_count = sample_mozilla_probability * #words_deepspeech_text
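
As a rough sketch of the arithmetic above, assuming plain lowercase whitespace tokenization (the function name `needed_count` and the example numbers are hypothetical, and the real deepspeech text may be normalized differently):

```python
from collections import Counter


def needed_count(sample_text: str, word: str, base_word_count: int) -> int:
    """Estimate how many occurrences of `word` the new text should contain.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    tokens = sample_text.lower().split()
    if not tokens:
        raise ValueError("sample text is empty")
    # sample_mozilla_probability = #mozilla / #sample_words
    probability = Counter(tokens)[word.lower()] / len(tokens)
    # mozilla_needed_count = sample_mozilla_probability * #words_deepspeech_text
    return round(probability * base_word_count)


# Hypothetical example: 2 of 8 sample words are "mozilla", base text has 1000 words.
sample = "mozilla builds firefox and mozilla supports open source"
print(needed_count(sample, "mozilla", 1000))
```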

So to get my final language model training set, I can take my sample phrases containing mozilla, repeat them enough times to reach mozilla_needed_count, and append them to the original deepspeech text.

This ignores the fact that the word mozilla may already be in the deepspeech text, but accounting for that should be as easy as counting #mozilla in the deepspeech text and subtracting it from mozilla_needed_count.
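
The repetition step, including the correction for occurrences already present in the base text, might look like the following minimal sketch. The helper name `repeat_phrases` and the example phrases are hypothetical, and tokenization is again assumed to be a lowercase whitespace split:

```python
import math


def repeat_phrases(phrases, word, needed_count, base_text=""):
    """Return the sample phrases repeated until `word` reaches `needed_count`.

    Occurrences of `word` already in `base_text` are subtracted first.
    """
    word = word.lower()
    existing = base_text.lower().split().count(word)
    # Occurrences of the word gained by appending the phrase list once.
    per_pass = sum(p.lower().split().count(word) for p in phrases)
    if per_pass == 0:
        raise ValueError("none of the phrases contain the word")
    missing = max(0, needed_count - existing)
    repeats = math.ceil(missing / per_pass)
    return phrases * repeats


# Hypothetical example: 7 occurrences wanted, 1 already in the base text.
phrases = ["mozilla makes firefox", "i use mozilla tools"]
extra_lines = repeat_phrases(phrases, "mozilla", 7, base_text="mozilla is here")
print(len(extra_lines))  # 6 lines, i.e. the two phrases repeated 3 times
```

Note that appending text also increases the total word count, so the probability in the final training set only approximates the sample probability.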

Does that sound reasonable?

lissyx · March 4, 2019, 12:33pm

Have you tried just adding those words to the existing language model, as a baseline?