Information on training and inferring audio file length

meghagowda5193 · August 2, 2018, 10:18am

Hi,

Is it compulsory for the training and inference audio files to be 5 seconds long?
I ask because I have a large amount of training data: audio files (each longer than 30 seconds) with their respective transcripts. If I can’t use this data as it is for training, then I need to chunk the audio files (which I can do easily with a Python script), but I am finding it difficult to chunk the transcripts to match the chunked audio files. I am doing it manually for now, but is there any way to automate it?
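
For reference, the audio-chunking step can be sketched roughly as follows. This is only a minimal sketch, assuming uncompressed WAV input and fixed-length chunks; the function name `split_wav` and the output naming scheme are hypothetical. It cuts at fixed time offsets, so it may split words in half, and it does nothing about splitting the transcripts.

```python
import wave


def split_wav(src_path, out_prefix, chunk_seconds=5.0):
    """Split a WAV file into consecutive chunks of at most chunk_seconds each.

    Hypothetical helper: cuts at fixed offsets, ignoring word boundaries.
    Returns the list of chunk file paths that were written.
    """
    out_paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the source file
            out_path = f"{out_prefix}_{index:04d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

The last chunk is simply shorter than the others when the file length is not a multiple of `chunk_seconds`.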

Any suggestions?

Thank you:)

kdavis · August 3, 2018, 9:27am

@meghagowda5193 It is not compulsory for training and inference audio files to be 5 seconds long. With the current architecture, shorter files just lessen the memory pressure on the GPU, if you are training on a GPU.

However, 30 sec may be too much for your GPU’s memory. You simply have to try. To give you a feel for scale, we had batches of size 12 or so using audio with lengths around 5 sec on an 11 GB GPU. So, on an 11 GB GPU we can fit about 60 sec of audio per batch with the current model when training.

Your mileage may vary.
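
The arithmetic above can be sketched as a rough rule of thumb. This assumes memory use scales linearly with the total seconds of audio in a batch, which is only an approximation; the 60 sec budget is the figure quoted above for an 11 GB GPU, and `max_batch_size` is a hypothetical helper, not part of any library.

```python
def max_batch_size(clip_seconds, budget_seconds=60.0):
    """Largest batch whose total audio duration fits within budget_seconds.

    Rough estimate only: assumes memory grows linearly with total audio length.
    """
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    return int(budget_seconds // clip_seconds)


for clip in (5, 10, 30):
    print(f"{clip} sec clips -> batch size of about {max_batch_size(clip)}")
```

Under this assumption, 30 sec clips would leave room for a batch size of about 2, but the only reliable check is to run a training step and see whether it fits.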

meghagowda5193 · August 3, 2018, 12:21pm

@kdavis

Assuming the same GPU setup and a batch size of 2, can I use my 30 sec audio files?

Thanks.

kdavis · August 3, 2018, 12:33pm

You can try.

However, there are other issues to consider, for example the finite horizon an RNN operates under, i.e. the forward or backward RNN of the BRNN may not be able to transfer information across the entire 30 seconds of audio, and performance may decrease as a result. But it’s worth a try.

Let us know how it works out!

tan_oscar · August 11, 2018, 12:48pm

With the new streaming model, are we still restricted to around 5 seconds for training data?

kdavis · August 15, 2018, 3:14pm

As before, there is no hard limit, but increasing the length of the training data snippets will increase memory pressure on your GPU and may lead to problems where the RNN is unable to transfer information across the entire N seconds of audio.