I have a web application which continuously streams audio from the browser to a node.js server. It’s basically the same as the official example (VAD, ffmpeg), with the difference that the audio comes from the browser instead of an RTMP server.
My implementation worked fine in 0.5.1, so the origin of the stream shouldn’t be the problem. Now in 0.6.0 I constantly get the letter "i" from inference. Just that single letter. As soon as I say a word it is inferred correctly, and whole sentences are too, but after I finish the sentence it goes back to producing "i".
It doesn’t do so in absolute silence, but a little noise from the movement of my mouse over the desk is enough to trigger it.
In 0.5.1 it constantly produced empty strings (""), which back then I believed was fine, but now silence is inferred as "i" instead of the empty string.
I played around with the VAD configuration (mode and debounce time), but noise still produces "i".
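To make clear what I mean by debounce time, here is a minimal sketch of the kind of logic involved, not my exact code. The helper name `createVadDebouncer` and the options `startMs` and `silenceMs` are made up for illustration; `isVoiced` stands for whatever per-chunk decision the VAD returns.

```javascript
// Hypothetical helper: names and default values are illustrative only.
// Turns per-chunk VAD decisions into debounced "start" / "end" events.
function createVadDebouncer({ startMs = 60, silenceMs = 500 } = {}) {
  let speaking = false;
  let voicedMs = 0;   // consecutive voiced audio seen while idle
  let unvoicedMs = 0; // consecutive silence seen while speaking

  // isVoiced: VAD decision for this chunk; chunkMs: chunk duration in ms.
  // Returns "start", "end", or null when nothing changes.
  return function update(isVoiced, chunkMs) {
    if (!speaking) {
      // A single unvoiced chunk resets the run, so short clicks are ignored.
      voicedMs = isVoiced ? voicedMs + chunkMs : 0;
      if (voicedMs >= startMs) {
        speaking = true;
        unvoicedMs = 0;
        return "start";
      }
      return null;
    }
    unvoicedMs = isVoiced ? 0 : unvoicedMs + chunkMs;
    if (unvoicedMs >= silenceMs) {
      speaking = false;
      voicedMs = 0;
      return "end";
    }
    return null;
  };
}
```

With something like this, audio would only be fed to the model between a "start" and an "end" event, so a brief noise shorter than `startMs` should not open a stream at all.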
My thinking:
Apparently the 0.6.0 model now thinks that a snippet of silence is more likely "i" than "". But why does VAD trigger at all? It shouldn’t in the first place if I don’t say anything. So we’re back to configuring VAD correctly, but changing the parameters doesn’t help much (though NORMAL results in fewer "i"s than AGGRESSIVE). Also, in other old examples, like this live-demo, I can see "" being produced continuously as well, so I think this is how it’s supposed to work (producing an empty string for silence).
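One workaround I could imagine is a crude energy gate in front of the VAD, so that faint noise never reaches it. This is only a sketch under assumptions: it expects chunks as Node.js Buffers of 16-bit signed little-endian mono PCM, and the function names and the threshold of 0.01 are my own guesses, not anything taken from the example.

```javascript
// Root-mean-square level of a 16-bit little-endian PCM chunk,
// normalised to the range 0..1. Assumes mono audio in a Node.js Buffer.
function rmsOfPcm16(chunk) {
  const samples = Math.floor(chunk.length / 2);
  if (samples === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples; i++) {
    const s = chunk.readInt16LE(i * 2) / 32768;
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / samples);
}

// Hypothetical gate: the threshold is a guess and would need tuning
// against the actual microphone and room noise.
function isLoudEnough(chunk, threshold = 0.01) {
  return rmsOfPcm16(chunk) >= threshold;
}
```

Chunks for which `isLoudEnough` returns false could be treated as silence before the VAD is even consulted. That would only hide the symptom though; it would not explain why the model maps near-silence to "i".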
Does this have something to do with the new model, or what’s your best guess?