I’d like to help. My main job over the last year or so has been technical marketplace intelligence in NLP, including speech-to-text. There doesn’t seem to be anywhere near as much community interest in competing on such performance indicators for speech-to-text as there is for text-based conversational AI (e.g. SQuAD, CoQA, etc.), but maybe there will be in future. One issue we ran into: for arbitrary-length files, WER’s cumulative error rate and apparently arbitrary sentence-chunking methods made it impractical, in our research anyway. So we’ve instead been evaluating with difflib, which produces 100% reliable insert / delete / omission counts for any two texts, as long as you preprocess each word/token onto its own line.
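Roughly, the idea looks like this (a minimal sketch, not our exact pipeline; the function name and the simple lowercase/split normalisation are just for illustration):

```python
import difflib

def count_word_edits(reference: str, hypothesis: str) -> dict:
    """Word-level diff between a reference transcript and an ASR
    hypothesis, counting insertions, deletions, and substitutions.
    Treating each word as one sequence element is equivalent to
    putting each token on its own line before diffing."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    counts = {"equal": 0, "insert": 0, "delete": 0, "replace": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            counts["equal"] += i2 - i1
        elif tag == "insert":        # words in the hypothesis only
            counts["insert"] += j2 - j1
        elif tag == "delete":        # words omitted from the hypothesis
            counts["delete"] += i2 - i1
        elif tag == "replace":       # substituted spans
            counts["replace"] += max(i2 - i1, j2 - j1)
    return counts

counts = count_word_edits("the cat sat on the mat", "the cat sit on mat")
# one substitution (sat/sit), one omission (the), no insertions
```

Because difflib aligns the two whole sequences at once, the counts don’t depend on how the audio was chunked into sentences, which is what made it attractive for long files.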
This isn’t a problem right now for DeepSpeech, given its “sentence-length” / “whole recording visibility” constraint (in practice, you need to break an audio recording into 5-7 s chunks), but it will be an issue for dictation scenarios. Although, with the streaming work, maybe this has been eliminated; please school me if I’m out of date.