Please keep in mind this is an old and contributed tutorial, a lot has moved since. I don’t want to dismiss @elpimous_robot contribution, it is great
How do you check that ?
Which is not surprising, since LibriSpeech is based on old books.
I’m testing with LibriSpeech dev-clean, so it’s the same old books. To calculate WER, I’m using jiwer.
I’m tracking each sample and then averaging the clean_wer values, roughly like the sketch below.
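(A minimal sketch, not my exact code. Only jiwer and the clean_wer name are the real pieces; the dataset iterable and the model.stt() call are placeholders.)

import jiwer

results = []
for sample in dev_clean_samples:            # placeholder: (audio, transcript) pairs from dev-clean
    hypothesis = model.stt(sample.audio)    # placeholder for however the transcript is produced
    results.append({
        "ground_truth": sample.transcript,
        "hypothesis": hypothesis,
        "clean_wer": jiwer.wer(sample.transcript, hypothesis),  # per-sample WER via jiwer
    })

average_clean_wer = sum(r["clean_wer"] for r in results) / len(results)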
They are using a different method for evaluation. Ours is consistent with others, but I don’t remember the specifics. Maybe @reuben remembers?
Lissyx, thanks my friend😉
So it seems I’m simply calculating WER differently - is that right? https://github.com/mozilla/DeepSpeech/blob/daa6167829e7eee45f22ef21f81b24d36b664f7a/util/evaluate_tools.py#L19 seems to have a function to evaluate. But is there some clean interface?
That’s about right, you can also look at how it is used in evaluate.py. Regarding a clean interface, it’s not really meant to be exposed, so I don’t think we can guarantee that …
The only thing that would explain the inaccuracy would be my German accent. I have an easy-to-set-up example project here which uses Angular & Node.js to record and transcribe audio. It would help me a great deal if you could see for yourself and confirm or deny my experience with the accuracy.
Well, that’s not a small difference. As documented, the current pre-trained model was mostly trained on American English accents, so it’s expected to be of lower quality with other accents.
FTR, being French, I’m also suffering from that …
Around 10,000 hours of speech data is required to create a high-quality STT model; the current model has a fraction of this. It is also not very robust to noise.
These issues will be solved over time with more data, but the current model should not be considered production-ready.
The model does achieve a <10% WER on the LibriSpeech clean test set - the key word there being “clean”. It is not a test of noisy environments or accent diversity.
I am currently using the dev-clean set, so I should have similar results. As for measuring WER, I am now doing:
import editdistance  # pip package providing a fast Levenshtein distance over sequences

def word_error_rate(self, ground_truth, hypothesis):
    # Word-level Levenshtein distance between reference and hypothesis,
    # normalised by the number of reference words.
    ground_truth_words = ground_truth.split(' ')
    hypothesis_words = hypothesis.split(' ')
    levenshtein_word_distance = editdistance.eval(ground_truth_words, hypothesis_words)
    wer = levenshtein_word_distance / len(ground_truth_words)
    return wer
Where editdistance uses a word-level Levenshtein distance. I am now getting an average WER of ~17%. What am I doing wrong?
Is there anything in particular that you would point out as a change? I also started off with that tutorial, so I’m wondering what I might need to revise.
Sorry, I have no time to review that.
Fair. The only thing I can think of is that some of the hyperparameters he suggests might be out of date, but apart from that I can’t see anything that stands out.
See util/evaluate_tools.py, in particular calculate_report.
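For what it’s worth, one difference to be aware of is aggregation: averaging per-utterance WERs weights a two-word utterance the same as a forty-word one, while a corpus-level WER divides the total edit distance by the total number of reference words. A minimal sketch of the corpus-level variant, reusing the same editdistance helper (illustration only, not the actual code behind calculate_report):

import editdistance

def corpus_wer(pairs):
    # pairs: iterable of (ground_truth, hypothesis) transcript strings.
    # Corpus-level WER: total word-level edit distance divided by the
    # total number of reference words, rather than a mean of per-sample WERs.
    total_distance = 0
    total_words = 0
    for ground_truth, hypothesis in pairs:
        ref_words = ground_truth.split()
        total_distance += editdistance.eval(ref_words, hypothesis.split())
        total_words += len(ref_words)
    return total_distance / total_words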