I’m trying to get to the point where I can point DSAlign at a new dataset and automatically get new training examples. So far, I’m finding DSAlign performs poorly unless DeepSpeech already produces very good transcripts, so it’s almost as if DSAlign is only useful for those who already have a great working model. This chicken-and-egg problem is especially frustrating because producing training data by hand is soul-destroying. I was hoping DSAlign could make the process more or less semi-supervised, but it seems that unless I hand-transcribe about 7 hours per speaker, DeepSpeech doesn’t work well enough for DSAlign to align effectively. Am I missing something, or is this a fair assessment?
I’d be interested in trying forced alignment solutions that don’t depend on ASR, to bootstrap the process for a low-resource language.