I want to train DeepSpeech on domain-specific data, and to get good results at inference time I want to add new sentences to the existing language model.
According to the documentation, the present language model was created from the LibriSpeech normalized LM training text. After downloading the file librispeech-lm-norm.txt, I saw that it contains sentences like:
...
A A A A A BOVE SECOND SINGER DIMINUENDO
A A A A A MEN
A A A A A Y
A A A A AHOWOOH
A A A A ALL ABOARD
...
These sentences do not make any sense to me. Can anyone please help me understand the format of this data?
If I want to create a custom domain-specific language model, or add sentences to the existing language model, can I add the sentences from my data directly, or do I have to convert my data into some other format (as shown above) before building a language model from it?
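To make the question concrete, here is a rough sketch of the conversion I have in mind. It assumes the format is simply one sentence per line, uppercased, with everything except A-Z, apostrophes, and spaces removed. That is only my guess from looking at the sample lines above, not something I found in the documentation, and the function names are my own.

```python
import re


def normalize_sentence(text: str) -> str:
    """Uppercase a sentence and keep only A-Z, apostrophes, and single spaces.

    Assumption: this mirrors how the lines in librispeech-lm-norm.txt look;
    it is not taken from the DeepSpeech or LibriSpeech tooling.
    """
    text = text.upper()
    # Replace every run of characters outside the assumed alphabet with a space.
    text = re.sub(r"[^A-Z' ]+", " ", text)
    # Collapse repeated whitespace and trim the ends.
    return " ".join(text.split())


def normalize_corpus(lines):
    """Normalize each line and drop the ones that end up empty."""
    normalized = []
    for line in lines:
        sentence = normalize_sentence(line)
        if sentence:
            normalized.append(sentence)
    return normalized


if __name__ == "__main__":
    for sentence in normalize_corpus(["All aboard!", "   ", "Second singer, diminuendo."]):
        print(sentence)
```

If this is the right idea, I would write the normalized lines to a text file and append them to the LibriSpeech text before rebuilding the language model. Note that this sketch drops digits and non-ASCII letters, which may or may not be what is wanted.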