Using DeepSpeech to determine location of speech in transcription

I’m considering a several-month learning project: figuring out how to use DeepSpeech to determine the location of speech/audio within a transcription.

The idea is to eventually build a teleprompter out of it that automatically follows along with you as you speak.
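To make the goal concrete, here’s a toy sketch of the follow-along logic I have in mind, separate from the recognition itself (the `recognized_words` stream is purely hypothetical here and would come from whatever the speech model emits):

```python
# Toy sketch of the follow-along logic, independent of how the words
# are recognised. `recognized_words` stands in for a live recognition
# stream and is purely hypothetical.
def follow_script(script_words, recognized_words, window=5):
    """Advance a cursor through the script as recognised words arrive."""
    cursor = 0
    for word in recognized_words:
        # Look a few words ahead so a single misrecognition doesn't stall us.
        for offset in range(window):
            idx = cursor + offset
            if idx < len(script_words) and script_words[idx].lower() == word.lower():
                cursor = idx + 1
                break
        yield cursor  # current position in the script, for the display to track

script = "the quick brown fox jumps over the lazy dog".split()
spoken = ["the", "quick", "brown", "fox", "umm", "jumps"]
for pos in follow_script(script, spoken):
    print(pos, script[:pos])
```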

I’ve read quite a few DL books, but this would be my first hands-on project, and I’m wondering a) whether this is feasible as a first project and b) roughly what kind of network structure I should consider.

I’m thinking of combining the unidirectional DeepSpeech RNN with a one-dimensional CNN trained on the transcript, but I have no idea whether that’s even remotely in the right direction.

Sounds like a fun project.

If you search for “timestamp” here, you should find a few threads where people have discussed obtaining the timing (i.e., the location, as you put it).
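Depending on the version you end up on, per-token timing may also be exposed directly through the metadata API. A minimal sketch, assuming a build where `sttWithMetadata` returns per-character timing in the 0.7+ metadata shape (the model and audio file names below are placeholders):

```python
# Sketch: pull per-token timing out of DeepSpeech metadata.
# Assumes a build exposing sttWithMetadata with the 0.7+ metadata
# shape; older releases needed the fork linked below.
import wave
import numpy as np
from deepspeech import Model

model = Model('deepspeech-0.9.3-models.pbmm')  # placeholder model file

with wave.open('speech.wav', 'rb') as wav:     # 16 kHz mono 16-bit expected
    audio = np.frombuffer(wav.readframes(wav.getnframes()), np.int16)

metadata = model.sttWithMetadata(audio, 1)     # 1 = best transcript only
for token in metadata.transcripts[0].tokens:
    # token.text is a single character; token.start_time is seconds
    # from the start of the audio.
    print(f'{token.start_time:6.2f}s  {token.text!r}')
```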

There was an issue raised for it with a link to a fork that exposes timing here:

I hope that helps. Best of luck with the project!