No alignment without linear spectrograms

I need help understanding something.

I removed the linear-spectrogram part of the loss function, along with the postnet that generates it. I didn’t need the linear spectrogram for the vocoder, and removing that part of the loss saves a LOT of GPU memory during training. However, the reduced model doesn’t produce reasonable attention even after 50K steps, whereas the full model produced reasonable attention after only a few thousand steps.
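For context, a minimal sketch of what changes when the linear term is dropped, assuming the common setup where the total loss is simply the sum of the mel and linear terms (the variable names and values below are illustrative, not taken from the actual code):

```python
# Hypothetical per-step loss values (illustrative only).
mel_loss = 0.6
linear_loss = 0.5

full_loss = mel_loss + linear_loss   # original objective: mel + linear
reduced_loss = mel_loss              # after removing the postnet branch

# The gradient signal reaching the shared encoder/attention shrinks by
# roughly this factor, which can make the old learning rate too small.
scale_change = reduced_loss / full_loss
print(scale_change)
```

The linear-spectrogram branch also acts as an extra supervision signal backpropagating through the shared decoder states, so removing it changes more than just memory usage.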

Why isn’t the mel spectrogram part of the loss enough to train the attention?

Could you check the gradient norms between these two runs? Maybe the scale of the loss has changed, so it requires a new learning rate. If you have TensorBoard logs, you could also share them.
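To compare the runs, one option is to log the global gradient norm each step. A minimal stdlib sketch of the quantity to log (in PyTorch you would iterate over `model.parameters()` and their `.grad` tensors instead; the flat lists here are a stand-in):

```python
import math

def global_grad_norm(grads):
    """Global L2 norm over all parameter gradients.

    `grads` is a list of per-parameter gradients, each given here as a
    flat list of floats (a stand-in for gradient tensors).
    """
    return math.sqrt(sum(g * g for grad in grads for g in grad))

# Hypothetical gradients from two parameter tensors.
grads = [[3.0, 4.0], [0.0]]
print(global_grad_norm(grads))  # → 5.0
```

Logging this value to TensorBoard for both runs would show directly whether the reduced model sees systematically smaller gradients and therefore needs a larger learning rate.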