Extract timing of phonemes and words from attention map

Hey all!

Thankfully, I have been able to get the pre-trained model up and running and producing great synthesized speech.

Some context: I want to animate a face / mouth to speak while the synthesized audio is playing. In order to do this, I need the start and stop times of each phoneme in the synthesized speech.

I am wondering if it is possible to use the attention map to extract the timings of the synthesized words. Once I have those, I would like to extract the timings of each phoneme…

I know I could use an acoustic model to calculate these timings, but that seems like overkill; I would rather analyze the attention map, since that is already available in the TTS library.

I originally posted on the GitHub, and erogol suggested looking at the attention maps. So I'm also wondering: is there a way to get the image / data structure that contains the attention map of a synthesized phrase, and analyze it to get the proper timings? A rough sketch of what I mean is below.
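For concreteness, here is the kind of analysis I'm imagining: take the argmax over the encoder axis for each decoder frame, group frames by input symbol, and convert frame counts to seconds. This is just a sketch, assuming the alignment comes out as a NumPy array of shape `(decoder_frames, encoder_steps)`; the `hop_length` and `sample_rate` defaults below are common Tacotron values and would really come from the audio config:

```python
import numpy as np

def alignment_to_timings(alignment, symbols, hop_length=256, sample_rate=22050):
    """Estimate (start, end) times in seconds for each input symbol.

    alignment: (decoder_frames, encoder_steps) attention weights.
    symbols:   the encoder input sequence, one phoneme/character per step.
    hop_length / sample_rate are assumed defaults; use your config's values.
    """
    frame_dur = hop_length / sample_rate      # seconds of audio per mel frame
    best = alignment.argmax(axis=1)           # most-attended symbol per frame

    timings = []
    for i, sym in enumerate(symbols):
        frames = np.where(best == i)[0]       # decoder frames assigned to symbol i
        if frames.size == 0:                  # symbol never dominated any frame
            timings.append((sym, None, None))
        else:
            timings.append((sym, frames.min() * frame_dur,
                            (frames.max() + 1) * frame_dur))
    return timings
```

Word timings would then just span from the first to the last phoneme of each word.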

Thanks for any help! :smile:

It is not an easy problem. You can get some insights from this paper: "Phonemic-level Duration Control Using Attention Alignment for Natural Speech Synthesis".