Pretrained Model cannot provide accurate English words

ziqianzhu727 · November 7, 2018, 8:32pm

I am using retrained DeepSpeech 0.3 to run inference on recorded calls. However, it always gives me output like this:

"we for we look on a spirsofalaea a i go to a wouldtilemottofolespurtthou oh we o o n to compare now or bocortoby i for both fought no one to go and be a tale to mocometothecousonanofalinbuta a "

I tried adjusting the model arguments LM_WEIGHT and VALID_WORD_COUNT_WEIGHT, changing the audio sampling rate and format, and chunking the audio into shorter lengths. None of these helped.

Any advice on this? Thanks,

lissyx · November 7, 2018, 8:56pm

Can you give us some context? How did you retrain, what is the source of the data, and what version of the inference code do you use?

ziqianzhu727 · November 7, 2018, 9:16pm

Hi lissyx, thanks for the help. I haven’t retrained the model, as I have no labelled data available. The data comes from recorded calls at a call centre. I am using Python code directly to do the inference:

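from deepspeech import Model
import scipy.io.wavfile as wav

# Positional arguments: n_features=26, n_context=9, beam_width=500
ds = Model('Test/output_graph.pb', 26, 9, 'Test/alphabet.txt', 500)
# Last two arguments: LM_WEIGHT=1.5, VALID_WORD_COUNT_WEIGHT=2.1
ds.enableDecoderWithLM('Test/alphabet.txt', 'Test/lm.binary', 'models/trie', 1.5, 2.1)
fs, audio = wav.read('short_test.wav')
processed_data = ds.stt(audio, fs)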

lissyx · November 7, 2018, 9:17pm

Any details? Like format and sampling rate? Are the people speaking with a native accent?

lissyx · November 7, 2018, 9:20pm

The whole output would help as well, to check the version of libdeepspeech.so you are using.

ziqianzhu727 · November 7, 2018, 9:25pm

The accent is native, but there are places where the speaker tries to correct himself, and there is a little background noise. Format: I tried 8-bit, 16-bit and 32-bit. Sampling rate: I tried 8000 Hz, 16000 Hz and 32000 Hz. I also tried the Google API, which worked very well: it dropped the corrections and kept only the words that make sense.

lissyx · November 7, 2018, 9:33pm

Can you give the source format? Conversions can add artifacts that mess up the data.

ziqianzhu727 · November 7, 2018, 9:38pm

The source is 32000 Hz, 32-bit.

lissyx · November 7, 2018, 9:39pm

And how many channels?

ziqianzhu727 · November 7, 2018, 9:42pm

Oh, I forgot to mention that. It is mono.

lissyx · November 7, 2018, 9:45pm

Ok. I’m still waiting on the exact version of libdeepspeech.so you are using …

lissyx · November 7, 2018, 9:47pm

Is it possible for you to share some of the recordings?

ziqianzhu727 · November 7, 2018, 10:00pm

I cannot share them. I am not sure of the version of libdeepspeech.so. The DeepSpeech version is 0.3.0.

lissyx · November 7, 2018, 10:02pm

It should be printed on the output when you run it …

ziqianzhu727 · November 8, 2018, 12:10am

Hi lissyx, thanks. I didn’t find the version in the output. But as you said, the sampling rate matters here. The way I changed the sampling rate was not right. Now I use pydub to change the sampling rate, which gives more reasonable output.
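
Since the input format turned out to matter, here is a minimal sketch for checking what a WAV file actually contains before passing it to stt(). It uses only Python's standard `wave` module. The target format (16 kHz, 16-bit, mono) is an assumption based on what the released English DeepSpeech models were trained on, and the helper names `describe_wav` and `format_mismatches` are hypothetical, not part of DeepSpeech.

```python
import wave

# Assumed target format: 16 kHz, 16-bit, mono PCM (the format the released
# English DeepSpeech models were trained on). Adjust if your model differs.
EXPECTED = {"rate": 16000, "sample_width_bytes": 2, "channels": 1}


def describe_wav(path):
    """Return sample rate, sample width in bytes, and channel count of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        return {
            "rate": wf.getframerate(),
            "sample_width_bytes": wf.getsampwidth(),
            "channels": wf.getnchannels(),
        }


def format_mismatches(path, expected=EXPECTED):
    """List every property of the file that differs from the expected format."""
    actual = describe_wav(path)
    return [
        f"{key}: got {actual[key]}, expected {want}"
        for key, want in expected.items()
        if actual[key] != want
    ]
```

Calling `format_mismatches('short_test.wav')` returns an empty list when the file already matches the assumed format. Note that the `wave` module only reads PCM data, so a 32-bit float WAV would need a different reader.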

reuben · November 8, 2018, 1:25am

You should also try the new decoder in master; it should alleviate these problems. There are instructions here: https://github.com/mozilla/DeepSpeech/issues/1156#issuecomment-434351398

lissyx · November 8, 2018, 5:01am

Why don’t you share the whole output? The version should be printed by this call: https://github.com/mozilla/DeepSpeech/blob/master/native_client/deepspeech.cc#L360

derekpankaew · November 8, 2018, 8:23am

How did you change the sampling rate? I’m experiencing a similar issue, using ffmpeg to resample. From what I can tell, pydub uses ffmpeg to do its resampling:

http://github.com/jiaaro/pydub/blob/master/API.markdown#audiosegmentexport

Can you share any more details about what was the wrong way and the right way to resample the audio? And how big a difference did it make? (Did it eliminate the issue completely?)
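
For reference, a minimal sketch of an ffmpeg-based conversion driven from Python. The thread does not show the exact command anyone used, so this is only one plausible invocation: it assumes `ffmpeg` is on the PATH and that the target is 16 kHz, 16-bit, mono PCM WAV. The helper names `build_ffmpeg_cmd` and `convert` are hypothetical.

```python
import subprocess


def build_ffmpeg_cmd(src, dst, rate=16000, channels=1):
    """Build an ffmpeg command that converts src to 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                    # overwrite dst if it already exists
        "-i", src,               # input file
        "-ar", str(rate),        # output sample rate in Hz
        "-ac", str(channels),    # output channel count
        "-acodec", "pcm_s16le",  # 16-bit signed little-endian PCM
        dst,
    ]


def convert(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```

Keeping the command construction separate from the call makes it easy to print and inspect the exact arguments before running them, which helps when comparing against what another tool does.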

ziqianzhu727 · November 8, 2018, 5:05pm

Yes. At first I used audio software (Audacity) to adjust the sampling rate; later I installed pydub with ffmpeg to do it. What I found is that it works better for some long words. It does not completely solve the issue, as there is still work to do, such as removing the noise.