Hey, I’ve seen that Mozilla has some data augmentation methods, mostly using Gaussian filters and other audio-enhancement techniques.
Have there been any thoughts about data augmentation using a text-to-speech (TTS) pipeline?
I was looking at the paper "Almost Unsupervised Text to Speech and Automatic Speech Recognition". I think that, for technical reasons, its approach wouldn’t work with DeepSpeech, but some of the references in the paper seem to describe a setup more similar to DeepSpeech’s, one that could leverage a TTS model to generate audio files.
My use case is going to involve a lot of domain-specific jargon and acronyms, so I wanted to know whether there are any options for feeding in a list of words to bootstrap the system.
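To make the idea concrete, here is a rough sketch of what I have in mind. It is not a working pipeline: `placeholder_tts` is a hypothetical stand-in that just emits silence where a real TTS model would go, and the function names, carrier templates, and file names are my own assumptions. The only part taken from DeepSpeech is the training CSV layout (`wav_filename`, `wav_filesize`, `transcript`) and the 16 kHz mono 16-bit WAV format.

```python
import csv
import os
import wave
from typing import Callable, Iterable, List

SAMPLE_RATE = 16000  # 16 kHz mono, 16-bit PCM


def placeholder_tts(text: str) -> bytes:
    """Hypothetical stand-in for a real TTS model.

    Returns 16-bit PCM silence, 0.1 s per character of input text.
    Swap this for a call into an actual TTS system.
    """
    n_samples = (SAMPLE_RATE // 10) * max(len(text), 1)
    return b"\x00\x00" * n_samples


def build_sentences(terms: Iterable[str], templates: Iterable[str]) -> List[str]:
    """Drop each jargon term into each carrier template."""
    templates = list(templates)
    return [t.format(term=term) for term in terms for t in templates]


def write_wav(path: str, pcm: bytes) -> None:
    """Write raw 16-bit mono PCM to a WAV file at SAMPLE_RATE."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm)


def make_dataset(
    terms: Iterable[str],
    templates: Iterable[str],
    out_dir: str,
    synthesize: Callable[[str], bytes] = placeholder_tts,
) -> str:
    """Synthesize one clip per sentence and write a DeepSpeech-style CSV."""
    os.makedirs(out_dir, exist_ok=True)
    csv_path = os.path.join(out_dir, "synthetic.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for i, sentence in enumerate(build_sentences(terms, templates)):
            wav_path = os.path.join(out_dir, f"synth_{i:05d}.wav")
            write_wav(wav_path, synthesize(sentence))
            # Lowercased here on the assumption that the alphabet is lowercase.
            writer.writerow([wav_path, os.path.getsize(wav_path), sentence.lower()])
    return csv_path
```

Something like `make_dataset(["grpc", "kubectl"], ["please run {term}", "{term}"], "synthetic_data")` would then give a CSV that could be mixed into the regular training set. Acronyms would presumably still need to be spelled out the way they are pronounced before they go into the transcript column.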
Additionally, I’m interested to hear whether anyone has set up a voice-to-voice preprocessing step, such as the one described in Google’s Parrotron.