Built native client from scratch, accuracy issues

Hi!

My hardware doesn’t support AVX2 (this machine is on Ivy Bridge), so I spent the day building native-client from source, including TensorFlow.

The process was a bit arduous, but I ended up getting everything working. When I run the binary with the pretrained model, I now get some output without anything throwing an error.

The problem is that the output is laughably bad. It’s not even close. I think I’m doing something wrong.

For example, one input I’ve been using is a .wav (single channel, 16-bit) of me saying “testing testing, 123”. The output from the pretrained model is just “oo”.

Another file, much clearer, in the same format: “hi, I’m Amy, one of the available high quality text to speech voices, select download now to install my voice”.

A smattering of outputs I got from DeepSpeech on that file while messing around with .wav encoding parameters:

“har omm one omhumho wommen”

“har o awi won o veembo hoa homten ta”

“a am won a vemhomable han wontontun”

I think I’m encoding the files incorrectly, or something. I’m not sure what else to do to get some clean output.

Thanks for testing :). First, we switched requirements from AVX2/FMA to AVX (the binaries are not yet published on the PyPI and npm registries, but you can download them from taskcluster: https://tools.taskcluster.net/index/artifacts/project.deepspeech.deepspeech.native_client.master/cpu).

Can you make sure your WAV file is 16-bit and 16 kHz? You can check that with mediainfo.
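If you don’t have mediainfo handy, a quick Python check shows the same information (just a sketch using the stdlib wave module; “test.wav” is a placeholder path):

```python
import wave

# Print the parameters the model cares about: channels, bit depth, sample rate.
with wave.open("test.wav", "rb") as w:
    print("channels:   ", w.getnchannels())      # should be 1 (mono)
    print("bit depth:  ", w.getsampwidth() * 8)  # should be 16
    print("sample rate:", w.getframerate())      # should be 16000
```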

Ah, yeah, I was using 44 kHz audio. I resampled my test file (same speech, still very clear) to 16 kHz @ 16-bit, and the output was:

“howar oire mi wonee vo virmabl ho orntenm”

Are there any other encoding gotchas I should look out for?
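For reference, the conversion I mean is roughly this kind of thing (a minimal Python sketch using the stdlib wave and audioop modules, so it assumes an older Python where audioop still exists; the file names are placeholders):

```python
import audioop
import wave

def convert_to_16k_mono(src_path, dst_path):
    """Convert a PCM WAV file to 16 kHz, 16-bit, mono."""
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        samp_width = src.getsampwidth()   # bytes per sample
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Downmix stereo to mono if needed.
    if n_channels == 2:
        frames = audioop.tomono(frames, samp_width, 0.5, 0.5)

    # Convert sample width to 16-bit (2 bytes) if needed.
    if samp_width != 2:
        frames = audioop.lin2lin(frames, samp_width, 2)
        samp_width = 2

    # Resample to 16 kHz.
    if frame_rate != 16000:
        frames, _ = audioop.ratecv(frames, samp_width, 1, frame_rate, 16000, None)

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(16000)
        dst.writeframes(frames)

convert_to_16k_mono("amy_44k.wav", "amy_16k.wav")
```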

Ah, and I just ran the precompiled binary. The output was much more accurate!

Thanks for helping me out. I’m not sure exactly what happened, but I’m glad I was able to get some decent output.
