Query regarding post processing

I’m back to this topic :smiley:

Please read https://arxiv.org/pdf/1703.10135.pdf (3.3 Decoder); the decoder section will help you understand what’s going on.
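For a rough sense of what the reduction factor r from that section means for step count, here is a toy illustration (the spectrogram length is made up, nothing implementation-specific):

```python
import math

# Toy illustration of the reduction factor r (Tacotron sec. 3.3): the
# decoder emits r mel frames per step, so the number of decoder/attention
# steps shrinks by a factor of r. The frame count below is made up.
n_mel_frames = 800

for r in (1, 2, 5, 7):
    steps = math.ceil(n_mel_frames / r)
    print(f"r={r}: {steps} decoder steps, attention computed {steps} times")
```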

From what I noticed when trying to train with a lower batch_size: if you see a good alignment and then it breaks, it almost surely won’t align back.

Same here. From my experience testing TTS and different Tacotron versions, I think it’s better to throw away data than to lower the batch size. With TTS it’s really easy to find a good balance using the max length.

For Tacotron2 (not TTS), what I did was sort the transcripts by length in a text editor and remove the longest ones manually; most of the time just a few very long sentences ruin the whole thing.
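If you’d rather script it than do it by hand, something like this works (a rough sketch assuming an LJSpeech-style pipe-separated metadata file; the path, column layout and 200-char cutoff are just examples):

```python
# Drop the longest transcripts instead of lowering batch_size.
# Assumes "id|transcript|..." lines; adjust the split/index to your metadata.
MAX_CHARS = 200

with open("metadata.csv", encoding="utf-8") as f:
    rows = [line.rstrip("\n") for line in f if line.strip()]

kept = [r for r in rows if len(r.split("|")[-1]) <= MAX_CHARS]
print(f"kept {len(kept)} / {len(rows)} lines")

with open("metadata_filtered.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(kept) + "\n")
```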


@erogol Hello, is it OK to share my tests here on the forum even if they are not 100% related to Mozilla TTS, just TTS in general?

FYI, I think I’ve solved the issue: Tacotron2 was using a “target mel scale”. I removed that scale clipping and now it looks promising.

With just 5k steps the attention looks good, and so does the audio. My previous attempts required at least 60k steps before the alignment started to appear.

10k step audios:
10k.zip (317,3 KB)

Good to see you back here!

I did read that; I was wondering if someone could shed some light on the values and their direct implications for memory, speed and alignment time in this implementation. (If anyone has logged that.)

I’ve not removed the sentences, but I have decreased the max seq len to 200; still not able to run r=1 at batch_size 32 though.

Hope that’s a yes. I’d love to see what you’re working on and how it’s working out for you.

I think the only way right now is to lower the max seq len; there’s an issue about OOM: https://github.com/mozilla/TTS/issues/183.

How’s your length distribution? If you go lower than 200, will you lose a lot of data?
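Something like this quick histogram over the metadata shows how much you’d lose at each cutoff (a sketch, again assuming pipe-separated transcripts; the bin width is arbitrary):

```python
from collections import Counter

# Bucket transcript lengths into 50-char bins to eyeball the distribution
# before choosing max_seq_len. File layout and bin width are assumptions.
with open("metadata.csv", encoding="utf-8") as f:
    lengths = [len(line.split("|")[-1].strip()) for line in f if line.strip()]

bins = Counter(min(l // 50 * 50, 300) for l in lengths)
for lo in sorted(bins):
    label = "300+" if lo == 300 else f"{lo}-{lo + 49}"
    print(f"{label} chars: {bins[lo]} sentences")
```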

 r=1;  batch_size=32;
| > Number of instances : 9489

with a max len of 200,
| > Num. instances discarded by max-min seq limits: 684
-OOM

with a max len of 150,
| > Num. instances discarded by max-min seq limits: 1610
-OOM

with a max len of 100,
| > Num. instances discarded by max-min seq limits: 3591
-OOM

with a max len of 50,
| > Num. instances discarded by max-min seq limits: 7338
works, but I’ve lost about 80% of my data.

I think I am going to have to look into the dynamic batch size hack.
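In case it helps, here is a rough sketch of one way such a hack could look, using a PyTorch batch sampler that packs fewer long sentences per batch (this is not the TTS repo’s implementation, just an illustration; the character budget is a guess you’d tune to your GPU):

```python
import random
from torch.utils.data import Sampler

class DynamicBatchSampler(Sampler):
    """Pack indices into batches capped by total characters, so the
    longest sentences end up in smaller batches (illustrative only)."""

    def __init__(self, lengths, max_chars_per_batch=3200):
        order = sorted(range(len(lengths)), key=lengths.__getitem__)
        self.batches, batch, total = [], [], 0
        for i in order:
            if batch and total + lengths[i] > max_chars_per_batch:
                self.batches.append(batch)
                batch, total = [], 0
            batch.append(i)
            total += lengths[i]
        if batch:
            self.batches.append(batch)

    def __iter__(self):
        random.shuffle(self.batches)  # batches stay homogeneous, order is random
        yield from self.batches

    def __len__(self):
        return len(self.batches)

# Usage sketch: DataLoader(dataset, batch_sampler=DynamicBatchSampler(lengths))
```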

For reference:
Using GPUs 0, 1, 2 at max len 50 (will go as high as I can):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:18:00.0 Off |                  N/A |
| 56%   77C    P2   105W / 250W |   8813MiB / 12066MiB |     67%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             Off  | 00000000:3B:00.0 Off |                  N/A |
| 59%   82C    P2   131W / 250W |   5823MiB / 12066MiB |     58%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:86:00.0 Off |                  N/A |
| 51%   83C    P2    91W / 250W |   5981MiB / 12196MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:AF:00.0  On |                  N/A |
| 32%   51C    P5    26W / 250W |   1095MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I sort of recall using 100/200 with a K80 11GB on a single GPU; then when I tried dual GPU I had to lower the max length a bit. Do you get the same results using a single GPU?

Feel free to share anything related.

What do you mean btw by “target mel scale”?

Thanks.

I’ve removed everything that touches the prediction and now it’s working fine. As for T2_output_range, I think it’s OK to call it the “output/target scale”, or am I wrong?
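To make it concrete, the kind of thing I mean is roughly this (names and values are illustrative, not the exact hparams of any particular fork):

```python
import numpy as np

# Illustrative only: several Tacotron2 forks rescale the target mels into
# a fixed symmetric range and clip them there before training. Removing
# that clip is the change I referred to above. Values here are made up.
OUTPUT_RANGE = (-4.0, 4.0)

def scale_and_clip(mel_db, min_db=-100.0, max_db=0.0):
    lo, hi = OUTPUT_RANGE
    mel = (mel_db - min_db) / (max_db - min_db)  # normalize dB to [0, 1]
    mel = lo + mel * (hi - lo)                   # map into the target range
    return np.clip(mel, lo, hi)                  # the clipping step in question
```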

I see gaps in the alignment; I saw the same gaps while I was training TTS, so I guess they are data related. As for the audios, I don’t hear a significant improvement from 10k to 25k, and I don’t hear expressive speech on questions and special characters either. I think that’s related to the speaker: the source voice is very flat. I’m cutting a more expressive female speech to adapt using the trained model; hopefully the issue is not LPCNet being unable to be expressive over the Tacotron predictions.

Examples:
25k.zip (960,4 KB)

I haven’t tried yet; I am going to let this model train on 2000 sentences and see what r=1 actually gives me in terms of quality of the generated audio (because in section 3.3 of the paper they only discuss the major pros of having r>1, not what the tradeoffs are, if any).

I’ve trained it for around 30k steps and the quality is much better than what I had at r=2, but not better than WaveRNN. I have to figure out some way to make it hog less VRAM so that I can actually train Taco2 on my entire dataset (followed by another WaveRNN training session).