I’d like to help. My main job over the last year or so has been technical marketplace intelligence in NLP, including speech-to-text. There doesn’t seem to be anywhere near as much community interest in competing on such performance indicators for speech-to-text as there is for text-based conversational AI (e.g. SQuAD, CoQA, etc.), but maybe there will be in future. One issue we ran into: for arbitrary-length files, WER’s cumulative error rate and apparently arbitrary sentence-chunking methods made it impractical, in our research anyway. So we’ve instead been evaluating with difflib, which produces 100% reliable insert / delete / omission counts for any two texts, as long as you preprocess each word/token onto its own line.
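Roughly, the idea looks like this (a minimal sketch, not our exact pipeline; the function name and the simple lowercase/split normalisation are just for illustration):

```python
import difflib

def count_word_edits(reference: str, hypothesis: str) -> dict:
    """Word-level diff between a reference transcript and an ASR
    hypothesis, counting insertions, deletions, and substitutions.
    Treating each word as one sequence element is equivalent to
    putting each token on its own line before diffing."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    counts = {"equal": 0, "insert": 0, "delete": 0, "replace": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            counts["equal"] += i2 - i1
        elif tag == "insert":        # words in the hypothesis only
            counts["insert"] += j2 - j1
        elif tag == "delete":        # words omitted from the hypothesis
            counts["delete"] += i2 - i1
        elif tag == "replace":       # substituted spans
            counts["replace"] += max(i2 - i1, j2 - j1)
    return counts

counts = count_word_edits("the cat sat on the mat", "the cat sit on mat")
# one substitution (sat/sit), one omission (the), no insertions
```

Because difflib aligns the two whole sequences at once, the counts don’t depend on how the audio was chunked into sentences, which is what made it attractive for long files.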
This isn’t a problem right now for DeepSpeech, given its “sentence-length” / “whole recording visibility” constraint (in practice, you need to break an audio recording into 5-7 s chunks), but it will be an issue for dictation scenarios. Although, with the streaming work, maybe this has been eliminated; please school me if I’m out of date.