Data Augmentation using a Text to Speech Pipeline

Hey, I’ve seen that Mozilla has some data augmentation methods, mostly using Gaussian filters and other audio enhancement techniques.

Has there been any thought given to data augmentation using a Text to Speech pipeline?

I was looking at Almost Unsupervised Text to Speech and Automatic Speech Recognition. For technical reasons I don’t think that approach would work with DeepSpeech directly, but the paper references some work with a setup closer to DeepSpeech that could leverage a TTS model to generate training audio.

My use case is going to have a lot of domain-specific jargon and acronyms, so I wanted to know whether there are any options to feed in a list of words to bootstrap the system.

I’m additionally interested to see whether anyone has set up a voice-to-voice preprocessing step, such as the one described in Google’s Parrotron.

Not that we know of. I guess there might be some legal issues here; other TTS services may forbid doing this.

It’s quite possible that a better-built language model would address your needs better in this case.


You could use Mozilla’s TTS repo to do this though, right?

Using TTS for data augmentation is (I assume) less than ideal as training data. But it might be a good enough band-aid to bootstrap toward a better-trained model once you have real data.

Just to make sure I understand: would tweaking and changing the language model be the best use of limited resources in the bootstrap scenario described above? I had figured that data augmentation using TTS for novel utterances would be a good use of resources, but I’m open-minded about any approach.

Thanks for the quick response!

I defer to @erogol regarding the status of this use case.

If I understand your use case correctly, I think the answer is yes. You may well have to redo an LM from scratch: the current one is huge and partly based on LibriSpeech, which draws on books from around the 1800s, so the English itself may not be a great fit. But adding your domain-specific jargon to the language model is clearly the best way forward; we have verified this ourselves, as have other contributors.

It might not be a quick and simple solution, but it is definitely more reliable and easier than the TTS-based data augmentation you had in mind.
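
For anyone who wants to try that route, here is a minimal sketch of rebuilding a domain-specific language model with KenLM. It assumes the KenLM binaries (lmplz, build_binary) are installed and on PATH, and that you have a plain-text corpus of in-domain sentences; file names and the n-gram order are placeholders, and the final scorer/trie packaging step depends on your DeepSpeech version.

```python
# Sketch: build a small domain-specific KenLM model from a text corpus of
# in-domain sentences (jargon and acronyms written the way people say them).
# Assumes the KenLM binaries (lmplz, build_binary) are installed and on PATH;
# the paths and 5-gram order below are illustrative, not DeepSpeech defaults.
import subprocess
from pathlib import Path

CORPUS = Path("domain_corpus.txt")   # one normalized, lower-cased sentence per line
ARPA = Path("domain_lm.arpa")
BINARY = Path("domain_lm.binary")

# Train an n-gram model with KenLM's lmplz (reads the corpus from stdin,
# writes the ARPA model to stdout). --discount_fallback helps on small corpora.
with CORPUS.open("rb") as src, ARPA.open("wb") as dst:
    subprocess.run(["lmplz", "--order", "5", "--discount_fallback"],
                   stdin=src, stdout=dst, check=True)

# Convert the ARPA file to KenLM's binary format for faster loading.
subprocess.run(["build_binary", str(ARPA), str(BINARY)], check=True)

print(f"Wrote {BINARY}; package it into a DeepSpeech scorer/trie with the "
      "script that matches your DeepSpeech version.")
```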


Yes, we can do TTS-to-ASR, but it needs a good multi-speaker TTS model so that you can generate enough variety in your artificial dataset. So far I have not been able to work much on the multi-speaker case. There are some models I trained, but I did not try them for ASR.

I also know people have used the same idea under the name of “cyclic consistency”.
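
As a rough sketch of the TTS-to-ASR idea being discussed: synthesize each in-domain sentence with several speaker voices and write out a DeepSpeech-style training CSV (wav_filename, wav_filesize, transcript). The synthesize() function below is a placeholder for whatever multi-speaker TTS model you end up using (for example one of the Mozilla TTS models); it is assumed to return 16 kHz mono 16-bit samples.

```python
# Sketch: generate a synthetic DeepSpeech dataset from a multi-speaker TTS.
# synthesize() is a placeholder for your TTS model; it is assumed to return
# 16 kHz mono PCM samples as a NumPy int16 array.
import csv
import wave
from pathlib import Path

import numpy as np

SENTENCES = ["the jargon heavy sentence one", "acronym a b c in context"]
SPEAKER_IDS = [0, 1, 2]          # assumes a multi-speaker model with integer speaker IDs
OUT_DIR = Path("synthetic_data")
OUT_DIR.mkdir(exist_ok=True)

def synthesize(text: str, speaker_id: int) -> np.ndarray:
    """Placeholder: call your TTS model here and return int16 samples at 16 kHz."""
    raise NotImplementedError

with (OUT_DIR / "synthetic_train.csv").open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for i, text in enumerate(SENTENCES):
        for spk in SPEAKER_IDS:
            samples = synthesize(text, spk)
            wav_path = OUT_DIR / f"utt_{i:05d}_spk{spk}.wav"
            with wave.open(str(wav_path), "wb") as w:
                w.setnchannels(1)
                w.setsampwidth(2)      # 16-bit PCM
                w.setframerate(16000)  # DeepSpeech expects 16 kHz mono
                w.writeframes(samples.tobytes())
            writer.writerow([str(wav_path), wav_path.stat().st_size, text])
```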


Sounds good, I think the information as presented works for my purposes.

In case anybody else ends up here: it looks like NVIDIA has successfully implemented data augmentation via TTS as part of their OpenSeq2Seq repo. Not sure whether there would be any license issues between Apache 2.0 and MPL 2.0, but see below:

https://nvidia.github.io/OpenSeq2Seq/html/speech-recognition/synthetic_dataset.html#training-with-synthetic-data
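
If you go down that road, mixing a synthetic CSV with real data at training time might look roughly like the sketch below. The paths are placeholders, and the flags follow the DeepSpeech training script as I understand it, so check them against your version.

```python
# Sketch: launch DeepSpeech training on a mix of real and synthetic CSVs.
# Paths are placeholders; --train_files takes a comma-separated list in the
# DeepSpeech training script, but verify the flags for your version.
import subprocess

subprocess.run([
    "python", "DeepSpeech.py",
    "--train_files", "real_train.csv,synthetic_data/synthetic_train.csv",
    "--dev_files", "real_dev.csv",    # keep dev/test on real speech only
    "--test_files", "real_test.csv",
    "--epochs", "30",
], check=True)
```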