Adding new sentences to the existing language model

shan18 · December 2, 2019, 10:33am

I want to train DeepSpeech on domain-specific data, and to get good results at inference I want to add new sentences to the existing language model.

According to the documentation, the LibriSpeech normalized LM training text was used to create the current language model. After downloading the file librispeech-lm-norm.txt, I saw that it contains sentences like:

...
A A A A A BOVE SECOND SINGER DIMINUENDO
A A A A A MEN
A A A A A Y
A A A A AHOWOOH
A A A A ALL ABOARD
...

These sentences do not make any sense. Can anyone please help me understand the format of this data?

If I want to create a custom domain-specific language model, or add sentences to the existing one, can I add the sentences from my data directly, or do I have to convert my data into some other format (as shown above) before building a language model from it?

lissyx · December 2, 2019, 12:43pm

You can add them directly. I’m not sure I understand why you think there is a specific format; how we rebuild the language model is documented in data/lm/.
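As a rough illustration of "adding them directly", here is a minimal Python sketch that appends domain sentences to a plain-text corpus, one sentence per line. The helper names (`normalize_sentence`, `append_sentences`) are hypothetical, and the normalization (letters and apostrophes only, upper case to match the LibriSpeech sample above) is an assumption; the casing and character set should match whatever alphabet your acoustic model uses.

```python
import re


def normalize_sentence(text, uppercase=True):
    """Crude normalization (an assumption, not the official recipe):
    keep letters, apostrophes and spaces, and collapse whitespace.
    Digits are simply dropped here; in practice numbers should be
    spelled out before this step."""
    text = text.upper() if uppercase else text.lower()
    # Replace every run of characters other than letters, apostrophes
    # and plain spaces with a single space.
    text = re.sub(r"[^A-Za-z' ]+", " ", text)
    return " ".join(text.split())


def append_sentences(corpus_path, sentences, uppercase=True):
    """Append normalized, non-empty sentences to the corpus file,
    one per line, and return how many lines were written."""
    lines = [normalize_sentence(s, uppercase) for s in sentences]
    lines = [line for line in lines if line]
    with open(corpus_path, "a", encoding="utf-8") as corpus:
        for line in lines:
            corpus.write(line + "\n")
    return len(lines)
```

The resulting text file can then be used as input when rebuilding the language model as described in data/lm/.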

shan18 · December 2, 2019, 12:58pm

@lissyx, I thought there was a specific format because of the sentences in LibriSpeech’s normalized LM training text (a few examples are shown above).

If you read the corpus used to build the language model (i.e. the LibriSpeech LM corpus specified in data/lm/README), you’ll see that the sentences there are not proper English sentences. This made me wonder why the data is in such a format.

lissyx · December 2, 2019, 1:33pm

This is just how they built their corpus, but KenLM does not require any specific format, so you can use that text as is.

shan18 · December 2, 2019, 4:54pm

Thanks for the help.

I just have one more question. If we feed in a corpus whose sentences don’t make sense, wouldn’t that also produce a poor language model?

lissyx · December 2, 2019, 4:57pm

That also depends on how many such sentences there are. The language model built from this corpus has improved quality a lot, but we are still working on improving it.
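To give a feel for why the amount matters, here is a toy Python sketch using raw bigram counts. It is only an illustration: the sentences are made up, and KenLM applies smoothing on top of n-gram counts rather than the plain relative frequency computed here.

```python
from collections import Counter


def bigram_counts(lines):
    """Count adjacent word pairs over all lines of a corpus."""
    counts = Counter()
    for line in lines:
        words = line.split()
        counts.update(zip(words, words[1:]))
    return counts


def relative_frequency(counts, first, second):
    """Unsmoothed estimate of P(second | first) from raw bigram counts."""
    total = sum(c for (w1, _), c in counts.items() if w1 == first)
    return counts[(first, second)] / total if total else 0.0


# Made-up data: many sensible lines plus a handful of odd ones.
sensible = ["ALL ABOARD THE TRAIN"] * 1000
odd = ["ALL A A A A"] * 5

counts = bigram_counts(sensible + odd)
# 1000 of the 1005 bigrams starting with ALL are (ALL, ABOARD),
# so the estimate stays at about 0.995.
print(relative_frequency(counts, "ALL", "ABOARD"))
```

A few odd lines barely move the estimate; they would only dominate if they made up a large share of the corpus.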

shan18 · December 2, 2019, 5:53pm

Thanks @lissyx for the help.