Please keep in mind this is an old and contributed tutorial, a lot has moved since. I don’t want to dismiss @elpimous_robot contribution, it is great
How do you check that ?
Which is not surprising, since LibriSpeech is based on old books.
I’m testing with LibriSpeech dev-clean, so it’s the same old books. To calculate WER, I’m using jiwer.
I’m tracking each sample and then averaging the clean_wer values, roughly like the sketch below.
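(A minimal sketch, not my exact code. Only jiwer and the clean_wer name are the real pieces; the dataset iterable and the model.stt() call are placeholders.)

import jiwer

results = []
for sample in dev_clean_samples:            # placeholder: (audio, transcript) pairs from dev-clean
    hypothesis = model.stt(sample.audio)    # placeholder for however the transcript is produced
    results.append({
        "ground_truth": sample.transcript,
        "hypothesis": hypothesis,
        "clean_wer": jiwer.wer(sample.transcript, hypothesis),  # per-sample WER via jiwer
    })

average_clean_wer = sum(r["clean_wer"] for r in results) / len(results)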
They are using a different method for evaluation. Ours is consistent with others, but I don’t remember the specifics. Maybe @reuben remembers?
Lissyx, thanks my friend😉
So it seems I’m simply calculating WER differently - is that right? https://github.com/mozilla/DeepSpeech/blob/daa6167829e7eee45f22ef21f81b24d36b664f7a/util/evaluate_tools.py#L19 seems to have a function to evaluate. But is there some clean interface?
That’s about right, you can also look at how it is used in evaluate.py. Regarding a clean interface, it’s not really meant to be exposed, so I don’t think we can guarantee that …
The only thing that would explain the inaccuracy would be my German accent. I have an easy-to-set-up example project here which uses Angular & Node.js to record and transcribe audio. It would help me a great deal if you could see for yourself and confirm or deny my experience with the accuracy.
Well, that’s not a small difference. As documented, the current pre-trained model was mostly trained on American English accents, so it’s expected to be of lower quality with other accents.
FTR, being French, I’m also suffering from that …
Around 10,000 hours of speech data is required to create a high-quality STT model; the current model has a fraction of this. It is also not very robust to noise.
These issues will be solved over time with more data, but the current model should not be considered production-ready.
The model does achieve a <10% WER on the LibriSpeech clean test set - the key word there being “clean”. It is not a test of noisy environments or accent diversity.
I am currently using the dev-clean set, so I should have similar results. As for measuring WER, I am now doing:
import editdistance  # pip package providing a fast Levenshtein distance over sequences

def word_error_rate(self, ground_truth, hypothesis):
    # Word-level Levenshtein distance between reference and hypothesis,
    # normalised by the number of reference words.
    ground_truth_words = ground_truth.split(' ')
    hypothesis_words = hypothesis.split(' ')
    levenshtein_word_distance = editdistance.eval(ground_truth_words, hypothesis_words)
    wer = levenshtein_word_distance / len(ground_truth_words)
    return wer
Where editdistance uses a word-level Levenshtein distance. I am now getting an average WER of ~17%. What am I doing wrong?
Is there anything in particular that you would point out as a change? I also started off with that tutorial, so I’m wondering what I might need to revise.
Sorry, I have no time to review that.
Fair. The only thing I can think of is that some of the hyperparameters he suggests might be out of date, but apart from that I can’t see anything that stands out.
See util/evaluate_tools.py, in particular calculate_report.
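For what it’s worth, one difference to be aware of is aggregation: averaging per-utterance WERs weights a two-word utterance the same as a forty-word one, while a corpus-level WER divides the total edit distance by the total number of reference words. A minimal sketch of the corpus-level variant, reusing the same editdistance helper (illustration only, not the actual code behind calculate_report):

import editdistance

def corpus_wer(pairs):
    # pairs: iterable of (ground_truth, hypothesis) transcript strings.
    # Corpus-level WER: total word-level edit distance divided by the
    # total number of reference words, rather than a mean of per-sample WERs.
    total_distance = 0
    total_words = 0
    for ground_truth, hypothesis in pairs:
        ref_words = ground_truth.split()
        total_distance += editdistance.eval(ref_words, hypothesis.split())
        total_words += len(ref_words)
    return total_distance / total_words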