Some words are getting skipped in whole sentences

Hello @reuben @lissyx

I trained my model on audio datasets of the words zero to nine collected from 116 people, and it is doing a very good job even in noise; I haven't received any wrong inference.
The only problem is that single words from zero to nine work well, but when somebody says their phone number, like “nine five four two three two eight nine seven zero”, it skips most of the words. It doesn't give a wrong inference, but it drops most of the words, e.g. it will infer it as “nine two eight seven”.
I have included the numbers zero to nine in the language model as well, like this:

zero
one
two
three
four
five
six
seven
eight
nine

What should I do to get better prediction of word sequences?

I guess it’s hard to be conclusive, and it depends heavily on your training data. Since you don’t give many details, I’m assuming your model is unable to learn more than one word.

As @lissyx suspects, it is very likely related to your training data.
Your details on the training data indicate that you may have trained only on audio clips with a single digit in them.
If so, you could randomly concatenate the audio clips and thus train on sequences of variable length.
The annotations would then be the spoken numbers in the combined audio.
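A minimal sketch of that idea in Python, assuming mono WAV clips that all share the same sample rate and format, a dict mapping each clip path to its spoken digit, and the DeepSpeech-style training CSV columns (wav_filename, wav_filesize, transcript). This is illustrative, not the poster's actual pipeline:

```python
# Sketch: build variable-length digit sequences from single-digit clips.
# Assumes all clips have identical WAV parameters (rate, width, channels).
import csv
import os
import random
import wave

def concat_wavs(clip_paths, out_path):
    """Concatenate several WAV clips with identical parameters into one file."""
    data = b""
    params = None
    for path in clip_paths:
        with wave.open(path, "rb") as w:
            if params is None:
                params = w.getparams()
            data += w.readframes(w.getnframes())
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        out.writeframes(data)

def build_training_csv(clips, out_dir, csv_path, n_samples=1000):
    """clips: dict {wav_path: spoken word}. Writes concatenated WAVs and a CSV."""
    os.makedirs(out_dir, exist_ok=True)
    paths = list(clips.keys())
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for i in range(n_samples):
            seq = random.choices(paths, k=random.randint(2, 10))  # variable length
            out_wav = os.path.join(out_dir, f"seq_{i:05d}.wav")
            concat_wavs(seq, out_wav)
            transcript = " ".join(clips[p] for p in seq)
            writer.writerow([out_wav, os.path.getsize(out_wav), transcript])
```

The transcript for each generated clip is simply the digits in the order they were concatenated, which gives the model sequences of varying length without recording anyone's real phone number.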

I have trained on people speaking only a single word (like zero, one, two, up to nine) at a time.
The reason is that I don't know in which sequence people are going to say the numbers during inference, as everyone's phone number is different.
So it won't be correct to train the model on everyone saying their own phone number, because it will be different every time.
If I am training it the wrong way, what would be the right approach so that I can get correct inference when people say their mobile number?

The problem is that I want the model to correctly infer people saying their phone number.
If I train it on sequences of numbers, won't it give results according to the sequences it was trained on?
For example, if I have never trained it on a sequence like “nine three four two …” and people say exactly that, won't it predict something else, according to what it was trained on?

What you describe is basically the same problem as in ASR in general: you cannot train on every possible sequence of letters/words someone is going to say.
So the purpose of the model is to handle exactly this problem.
Intuitively, the CTC loss takes care of matching the sequence of labels to the audio representation, so your model learns to recognize the start and end of numbers.
The RNN allows you to model the sequence-to-sequence behavior, mapping an arbitrary input length to an arbitrary output length.
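For reference (background, not something from this thread), the CTC objective can be written roughly as follows, where the blank/collapse mapping is what lets the model align unseen digit orderings to the audio:

```latex
% CTC sums over all frame-level alignments \pi (including blanks) that
% collapse to the target label sequence y under the mapping \mathcal{B}:
p(\mathbf{y} \mid \mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{t=1}^{T} p(\pi_t \mid \mathbf{x}),
\qquad
\mathcal{L}_{\mathrm{CTC}} = -\log p(\mathbf{y} \mid \mathbf{x})
```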

@sanjay.pandey can you say which version of DeepSpeech you are using? For instance, are you using the current HEAD of the master branch, or an earlier version such as v0.4.2?

I ask because we are also seeing whole words get dropped, but (we think, maybe) only after we upgraded our DeepSpeech to the HEAD of master.

I think I am using v0.4.2; I am not sure, but I cloned from the current GitHub.

Okay, so as per my understanding, I should train the model on people saying 10 digits, i.e. their phone numbers, and then during inference, even if people say other sequences of numbers, the model will still predict them?

Exactly.

It is not quite obvious how you intend to train, but I'd imagine you're best off training your system to learn just the ten possible digit characters (0, 1, …, 9) instead of words (zero, one, …, nine), which are composed of individual letters (a, b, …, z).
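Concretely, going the digit-character route would mean an alphabet file along these lines (a sketch only, not from this thread; one symbol per line, with a leading space entry if you want spaces between digits in the output, and the exact alphabet.txt conventions depend on your DeepSpeech version):

```
# Sketch of a digit-only alphabet.txt (assumption, not the poster's file).
# The first non-comment line below is a single space character.
 
0
1
2
3
4
5
6
7
8
9
```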

Also, I don't think it is necessary that they say exactly ten digits; rather, just compose random sequences of random lengths from the single-number recordings and train the system on that.
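On the language-model side (the single-word list quoted in the question may not score longer digit sequences well), one option is to generate a text corpus of random digit sequences and rebuild the LM from it. A rough sketch, where the vocabulary, corpus size, and sequence lengths are all assumptions to adapt to your use case:

```python
# Sketch: generate a text corpus of random digit sequences for LM training.
import random

WORDS = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]

with open("lm_corpus.txt", "w") as f:
    for _ in range(100_000):
        length = random.randint(1, 10)  # from single digits up to phone numbers
        f.write(" ".join(random.choices(WORDS, k=length)) + "\n")
```

You would then feed lm_corpus.txt to KenLM (lmplz, then build_binary) and regenerate the scorer artifacts your DeepSpeech version expects, so the LM actually assigns probability to multi-digit utterances.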

But as lissyx already pointed out, it depends a lot on your intentions, training data, and so on.

Good luck!