My System’s Core Specs:
Core i7-6800K (6 cores, 12 threads)
16 GB of RAM
Nvidia GTX 1070
I carried out some tests on continuous streaming without closing the stream. Each audio file was divided into 20 ms chunks, and the chunks were then run through a loop for speech detection (VAD) and inference; a rough sketch of that loop is shown below. The results are for DeepSpeech GPU decoding, as it was faster.
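The sketch is illustrative rather than my exact code: it assumes the DeepSpeech 0.7+ Python API and webrtcvad, and the model, scorer, and audio file names plus the VAD aggressiveness are placeholders.

```python
# Minimal sketch: 20 ms chunks fed through VAD and a single open DeepSpeech stream.
# Model/scorer/audio paths and the VAD aggressiveness are placeholders.
import wave
import numpy as np
import webrtcvad
from deepspeech import Model

SAMPLE_RATE = 16000                       # DeepSpeech expects 16 kHz, 16-bit mono PCM
CHUNK_MS = 20
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000

model = Model("deepspeech-models.pbmm")
model.enableExternalScorer("custom_lm.scorer")   # custom KenLM-based scorer
vad = webrtcvad.Vad(2)                           # aggressiveness 0-3

stream = model.createStream()                    # continuous stream, never closed in the loop
with wave.open("file1.wav", "rb") as wav:
    while True:
        frame = wav.readframes(CHUNK_SAMPLES)
        if len(frame) < CHUNK_SAMPLES * 2:       # 2 bytes per 16-bit sample
            break
        if vad.is_speech(frame, SAMPLE_RATE):
            stream.feedAudioContent(np.frombuffer(frame, dtype=np.int16))
            print(stream.intermediateDecode())   # inference printed out as it comes

print(stream.finishStream())
```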
The first two files use a custom LM generated using KenLM. Since the new LM is much smaller, inference is faster. File 2 below is File 1 concatenated four times. Both File 1 and File 2 are relatively clean, recorded directly into a mic in a quiet environment.
File 1
Original Audio: 1:49 (109 seconds)
Processed in:
50.66 seconds (Continuous streaming with inference being printed out)
50.88 seconds (Continuous streaming, with last 30 characters of stream printed out)
44.37 seconds (3 second stream batches and additional VAD to close streams)
File 2
Original Audio: 9:09 (549 seconds)
Processed in:
397 seconds (3 second stream batches and additional VAD to close streams)
453 seconds (Continuous streaming)
File 3 is an audio file extracted from a short film on YouTube. As such, it contains all the sound effects and music, and can be considered noisy. The stock LM supplied with DeepSpeech was used for this decoding:
File 3
Original Audio: 8:49 (529 seconds)
Processed in:
573 seconds (Continuous streaming)
567 seconds (Continuous streaming, printing only last 30 characters for each print out)
423 seconds (3 second stream batches and additional VAD to close streams).
My Takeaways
Inference time increases the longer you keep a stream open. Given a sufficiently long stream, your inference will become slower than real time. So you need a strategy for closing the continuous stream regularly.
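One such strategy, sketched under the same assumptions as above (placeholder thresholds and a made-up helper name, not my exact implementation): cap each stream at a few seconds of fed audio and also close it at a VAD silence boundary, then open a fresh stream.

```python
# Sketch: close/reopen the stream at a silence boundary or a hard length cap,
# so no single stream grows long enough to fall behind real time.
import numpy as np

MAX_STREAM_MS = 3000                  # hard cap per stream (placeholder)
SILENCE_FRAMES_TO_CLOSE = 15          # 15 * 20 ms = 300 ms of silence (placeholder)

def transcribe_chunks(model, vad, frames, sample_rate=16000, chunk_ms=20):
    """frames: iterable of 20 ms chunks of raw 16-bit mono PCM bytes."""
    stream = model.createStream()
    fed_ms = 0
    silent_frames = 0
    for frame in frames:
        if vad.is_speech(frame, sample_rate):
            silent_frames = 0
            stream.feedAudioContent(np.frombuffer(frame, dtype=np.int16))
            fed_ms += chunk_ms
        else:
            silent_frames += 1
        if fed_ms >= MAX_STREAM_MS or (fed_ms > 0 and silent_frames >= SILENCE_FRAMES_TO_CLOSE):
            yield stream.finishStream()      # close the stream regularly...
            stream = model.createStream()    # ...and start a fresh one
            fed_ms = 0
            silent_frames = 0
    if fed_ms > 0:
        yield stream.finishStream()
```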
Yes. That is correct.
The slow decode I described is a byproduct of the way I had implemented the audio buffer.
An Extreme Example:
You maintain a 3 second audio buffer which is sent to DeepSpeech for inference. New audio is appended every 200 milliseconds, and the entire 3 second buffer (containing only 200 ms of new data) is re-sent for inference each time. Even if DeepSpeech can process audio in half real time, that is roughly 1.5 seconds of compute for every 200 ms of new audio, so the inference now falls well behind real time.
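Putting illustrative numbers on that:

```python
# Illustrative arithmetic for the example above.
buffer_s = 3.0      # audio re-sent to DeepSpeech on every update
hop_s = 0.2         # new audio added per update (200 ms)
speed = 0.5         # inference time as a fraction of audio length ("half the time")

compute_per_update = buffer_s * speed          # 1.5 s of inference work ...
budget_per_update = hop_s                      # ... for only 0.2 s of new audio
print(compute_per_update / budget_per_update)  # 7.5x slower than real time
```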
Of course, depending on the hardware, you could tune the buffer parameters to get the inferences down to real time. But in my limited testing, I found that a 2-3 second buffer was generally the best for getting good inference results.
One way to address this may be to maintain a ring buffer for the transcripts that you get back from inference (a rough sketch is below). But I have seen that there are slight differences between inferences from continuous streams and standalone inferences; the continuous stream outputs slightly better results in my limited experimentation.
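As a sketch of the idea only (the class name and segment limit are placeholders, not code from my setup):

```python
# Sketch of a transcript ring buffer: keep only the last N transcript segments
# returned by inference and join them for display. Names/limits are placeholders.
from collections import deque

class TranscriptRing:
    def __init__(self, max_segments=8):
        # deque with maxlen silently drops the oldest segment when full
        self.segments = deque(maxlen=max_segments)

    def add(self, text):
        if text:
            self.segments.append(text)

    def current(self):
        return " ".join(self.segments)

ring = TranscriptRing()
for result in ("hello world", "this is a test"):   # e.g. values from finishStream()
    ring.add(result)
print(ring.current())                              # -> "hello world this is a test"
```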
Live captioning. I am working off a script, and relevant chunks of dialogue from that script need to be printed out as soon as they are identified in inference, so I need the latency to be as low as possible (think of a speech given live on stage). If each segment sent for inference is 3-4 seconds long, the inferred phrase makes a lot more sense and is easier to match to the script. If I instead get inferences for, say, 500 ms of data and concatenate them to produce phrases, they are nowhere near as good as the longer chunks and the matching becomes much harder.
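To illustrate why the longer phrases are easier to line up, here is a rough sketch of one way the matching could be done with difflib; the script lines, the cutoff, and the helper name are placeholders, not my actual matching code.

```python
# Sketch: fuzzy-match an inferred phrase against known script lines.
# Script lines, the cutoff, and the helper name are placeholders.
from difflib import SequenceMatcher

script_lines = [
    "to be or not to be that is the question",
    "all the world is a stage and all the men and women merely players",
]

def best_match(phrase, lines, cutoff=0.6):
    scored = [(SequenceMatcher(None, phrase.lower(), line).ratio(), line) for line in lines]
    score, line = max(scored)
    return line if score >= cutoff else None

# A 3-4 second phrase matches cleanly; a 500 ms fragment like "to be or" is far more ambiguous.
print(best_match("to be or not to be that is the", script_lines))
```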
I can't rely on direct inference alone, since the results aren't accurate enough (even with a custom LM). Even Google's Live Transcribe app on Android makes mistakes, so I guess we aren't there yet. Not 100%, anyway.