I’m considering a several-month learning project: using DeepSpeech to determine where in a transcript a given piece of speech/audio is located (essentially forced alignment, as far as I can tell).
The idea is to eventually build a teleprompter out of it that will automatically follow along with you.
I’ve read quite a few DL books, but this would be my first hands-on project, so I’m wondering (a) whether this is feasible as a starting project, and (b) what kind of network structure I should be thinking about.
I’m thinking of combining the unidirectional DeepSpeech RNN with a one-dimensional CNN trained on the transcript, but I have no idea whether that’s even remotely the right direction.
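To make the goal concrete: independent of the network architecture, the teleprompter behaviour I’m after is roughly "given the words recognized so far, find the current position in the known transcript." Here's a naive non-neural sketch of just that matching step (all function names here are my own invention, and the fuzzy matching is just stdlib `difflib`, not anything from DeepSpeech):

```python
import difflib

def follow_position(transcript_words, recognized_words, window=8):
    """Hypothetical sketch: locate the transcript index whose window of
    words best matches the tail of the recognizer's output so far."""
    tail = recognized_words[-window:]
    if not tail:
        return 0
    best_idx, best_score = 0, 0.0
    for i in range(len(transcript_words) - len(tail) + 1):
        cand = transcript_words[i:i + len(tail)]
        score = difflib.SequenceMatcher(
            None, " ".join(cand), " ".join(tail)).ratio()
        if score > best_score:
            best_idx, best_score = i, score
    # Index of the next word the speaker should be on.
    return best_idx + len(tail)

transcript = "the quick brown fox jumps over the lazy dog".split()
heard = "the quick brown fox".split()
print(follow_position(transcript, heard, window=4))  # → 4
```

A real version would presumably work on the acoustic model's per-frame character probabilities (CTC alignment) rather than on finished word strings, but this is the target behaviour in miniature.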