Preprocessing audio

Hello!

I am forced to use 8000 Hz / mono audio (phone calls). I know DeepSpeech works best at 16000 Hz, so my questions are:

Does DeepSpeech (v0.4.1-0-g0e40db6) upsample my training material from 8000 Hz to 16000 Hz? And what about the dev and test material, does it upsample those from 8000 Hz to 16000 Hz as well?

Has anyone studied how badly labelled audio data affects the results? Of course it will have an effect, but let's say 51% of my data is labelled correctly and the rest is gibberish, wrong words, etc. Do you think it might still do the job if I have enough of it and over 50% is OK…?

I am doing some semi-automatic labelling of the audio, and that method is what produces those figures. Doing the same job manually would be expensive and -very- time-consuming, as you know…

What are we talking about here, training? Inference? If you're training a new model from scratch, you can do it at 8 kHz.

Sorry, I was talking about training.

@pete Then you should be able to train with that 8 kHz data. I'm sure somebody has already shared such feedback on the forum. There may be adjustments to make to a few hyper-parameters, but it should work.

Regarding labelling, I fear that having 49% broken labels might be a big blocking point. @kdavis do you have an opinion?

@pete I would suggest upsampling the data that has correct labels and fine-tuning an existing model. Then upsample and evaluate the whole set; you can remove the sorting code and save the results to a file. That way you will see which audio files score 0% CER, and then you can fine-tune again. The results depend on how many correct labels you start with; for example, 5k+ clips of 5–7 s is working for me.
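To make the upsampling step concrete, here is a minimal sketch that doubles an 8 kHz signal to 16 kHz by linear interpolation on raw sample values. This is only an illustration of what resampling does; in practice you would use a dedicated tool such as sox or ffmpeg rather than rolling your own.

```python
# Sketch only: upsample 8 kHz PCM samples to 16 kHz by inserting
# the linear midpoint between each pair of neighbouring samples.
# Real pipelines should use sox/ffmpeg, which also low-pass filter.

def upsample_2x(samples):
    """Double the sample rate via linear interpolation."""
    out = []
    for i, s in enumerate(samples):
        out.append(s)
        if i + 1 < len(samples):
            out.append((s + samples[i + 1]) / 2.0)  # midpoint
        else:
            out.append(s)  # repeat the final sample
    return out

print(upsample_2x([0, 10, 20]))  # [0, 5.0, 10, 15.0, 20, 20]
```

Note that naive interpolation cannot recover frequency content above 4 kHz that was never captured at 8 kHz; it only makes the file compatible with a 16 kHz pipeline.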

If you don’t have a good amount of correctly labelled data, you can start by scoring the whole set with an existing model and sorting by CER, keeping the clips with CER close to 0%.
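The scoring-and-filtering step could be sketched like this: compute the character error rate between each clip's (possibly noisy) label and the model transcript, and keep only the near-0% ones. The transcripts here are given directly; in a real run they would come from the existing model.

```python
# Sketch: filter a dataset by CER between label and model transcript.

def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(label, transcript):
    """Character error rate: edits normalised by label length."""
    return edit_distance(label, transcript) / max(len(label), 1)

samples = [("hello world", "hello world"),    # matches: CER 0.0
           ("hello world", "hello wooden")]   # noisy label
kept = [(lab, hyp) for lab, hyp in samples if cer(lab, hyp) <= 0.05]
print(len(kept))  # 1
```

The 0.05 threshold is an illustrative choice; tighten it toward 0 to keep only clips the model transcribes exactly.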

Please don’t use any labels from the set you are evaluating to train a new LM when scoring the dataset.

Hello, and thanks for your reply!

So, I take my existing model, which was trained on about 80 hours of manually labelled data, BUT it's from a different “domain” than the one I am going to train. Let's say those 80 hours are about boats and I am trying to train a model about cars.

So I use that model (trained on the 80 hours) to evaluate my semi-automatically labelled data about cars, pick only the clips closest to 0% CER, train with them on top of the 80 hours, and repeat this process, hoping to end up with a model that understands talk about cars… just to put it simply.
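The bootstrapping loop described above could be sketched as follows. Here `transcribe` is a stand-in for a real DeepSpeech inference call, and the `cer` function is a toy equality check rather than a real edit-distance computation, so the selection logic can run on its own.

```python
# Sketch of the self-labelling bootstrap: score clips with a model
# trained on out-of-domain data, keep the near-0% CER clips, fine-tune,
# repeat. `transcribe` stubs out the actual DeepSpeech call.

def cer(label, transcript):
    # Toy stand-in: a real version would use Levenshtein distance.
    return 0.0 if label == transcript else 1.0

def select_clips(dataset, transcribe, threshold=0.05):
    """Keep clips whose semi-automatic label agrees with the model."""
    return [(audio, label) for audio, label in dataset
            if cer(label, transcribe(audio)) <= threshold]

# Toy run: this "model" recognises only one phrase correctly.
dataset = [("clip1.wav", "start the car"),
           ("clip2.wav", "open the boat")]
transcribe = lambda audio: "start the car"
print(select_clips(dataset, transcribe))
# [('clip1.wav', 'start the car')]
```

Each fine-tuning round should make the model transcribe more of the in-domain clips correctly, so the selected subset grows between iterations.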

Hello,

If you are using English data, try the first evaluation with an existing trained model; if you are not using English, I'm afraid 80 hours may not do the trick.

Yes, you can also play around and use only the clips that scored 0% CER.

I did something similar with data labelled by Windows speech recognition; at some point the correct ones start to pop out.