I want to train a custom ASR model for English. However, the model will be tested on sentences containing a lot of domain-specific vocabulary, which it might not have seen before. I have a large text corpus specific to the domain I am interested in. In this scenario, which of the following would be the best way to improve the model’s performance?
- I have converted the domain-specific text corpus to audio using TTS services, so I have audio (the voice is quite robotic, since it is TTS-generated) and its transcripts. The total duration of this audio is about 60 hours. Would it be a good idea to fine-tune the released model on these audio-transcript pairs, that is, train on this data with the released model as the initial checkpoint?
- Just train a language model on this domain-specific corpus and use it at inference time.
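
To make the second option concrete, here is a minimal sketch of what I understand by "using a language model at inference time": rescoring the acoustic model's n-best hypotheses with a domain LM. It is only a toy illustration, not any toolkit's actual API. The function names (`train_bigram_lm`, `lm_logprob`, `rescore`), the `lm_weight` value, the example sentences and the acoustic scores are all made up, and a real setup would presumably use a proper n-gram or neural LM inside the decoder rather than this add-one smoothed bigram model.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def lm_logprob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence."""
    vocab_size = len(unigrams) + 1  # one extra slot for unseen words
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )

def rescore(nbest, unigrams, bigrams, lm_weight=0.5):
    """Return the (text, acoustic_score) pair with the best combined score."""
    return max(
        nbest,
        key=lambda h: h[1] + lm_weight * lm_logprob(h[0], unigrams, bigrams),
    )

# Placeholder domain corpus (made up for illustration).
domain_corpus = [
    "the patient shows signs of tachycardia",
    "tachycardia was treated with beta blockers",
]
unigrams, bigrams = train_bigram_lm(domain_corpus)

# Hypothetical n-best list: (text, acoustic log-score). The acoustic model
# slightly prefers the wrong split of the unseen domain word.
nbest = [
    ("the patient shows signs of tacky cardia", -4.0),
    ("the patient shows signs of tachycardia", -4.5),
]
best_text, _ = rescore(nbest, unigrams, bigrams)
print(best_text)
```

In this toy case the domain LM outweighs the small acoustic preference for the wrong hypothesis, which is the effect I would hope to get from the second option.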
I am new to this field, so any help or suggestions would be highly appreciated.
Thanks.