Continuous Streaming

The use case:
We have a steady audio stream (such as a live broadcast).
We need to decode the stream as it comes.
At any point, keep no more than 3-4 seconds of context in memory.
Continue decoding until end of stream.

My Approach So Far:
I have been able to get a streaming version working with an audio file, where I feed the model 20 ms audio chunks in a loop. If the sentence is finished (detected with VAD), I end the stream and start a new one. If I don’t do this, the decoded sentence keeps getting bigger and bigger.
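Roughly, the loop looks like this (a simplified sketch assuming the DeepSpeech 0.7+ Python API and webrtcvad; the model/scorer paths, the VAD aggressiveness, and the 0.5 second silence threshold are placeholders):

```python
import numpy as np
import deepspeech
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 20
FRAME_BYTES = int(SAMPLE_RATE * FRAME_MS / 1000) * 2  # 20 ms of 16-bit mono PCM

model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")       # placeholder path
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")   # or a custom KenLM scorer
vad = webrtcvad.Vad(2)                                         # aggressiveness 0-3

def transcribe(frames):
    """frames: iterable of 20 ms raw PCM byte strings (FRAME_BYTES each).
    Feed each frame to the stream; once VAD reports ~0.5 s of silence,
    treat the sentence as finished, close the stream and start a new one."""
    stream = model.createStream()
    silence_ms = 0
    for frame in frames:
        stream.feedAudioContent(np.frombuffer(frame, dtype=np.int16))
        silence_ms = 0 if vad.is_speech(frame, SAMPLE_RATE) else silence_ms + FRAME_MS
        if silence_ms >= 500:
            yield stream.finishStream()
            stream = model.createStream()
            silence_ms = 0
    yield stream.finishStream()
```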

However, I want to modify it so that it works like a FIFO buffer, so that at any time the context is no more than, say, 3-4 seconds long.

Does DeepSpeech offer such functionality?

If not, any thoughts on how one would go about implementing this using DeepSpeech’s streaming functionality? I previously tried implementing such a buffer myself and sending the audio stream along to DeepSpeech for inference. But since the model had to decode the entire audio segment every time, the performance was sub-realtime if I increased the buffer size beyond a certain point.
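For reference, the buffered approach I tried looked roughly like this (a sketch; the 3 second buffer length and 200 ms hop are the parameters I was tuning). The problem is that every hop re-decodes the entire buffer from scratch with the non-streaming API, even though only 200 ms of it is new:

```python
from collections import deque
import numpy as np

SAMPLE_RATE = 16000
BUFFER_SECONDS = 3

# ring buffer holding the most recent ~3 s of 16-bit samples
ring = deque(maxlen=SAMPLE_RATE * BUFFER_SECONDS)

def on_new_audio(model, samples):
    """samples: np.int16 array with ~200 ms of freshly captured audio."""
    ring.extend(samples)
    # non-streaming inference over the whole buffer on every hop:
    # ~3 s of audio is decoded again just to gain 200 ms of new context
    return model.stt(np.array(ring, dtype=np.int16))
```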

It’s not getting bigger and bigger, it’s giving you the sentence for the whole stream. If your whole stream grows, the sentence grows as well.

You mean, a way to enforce a ring-buffer? We don’t have that. It should likely be done on your side, at least for now.

I’m not sure I get your point. If you pass 10 secs of audio, it should take < 10 secs to decode, depending on your hardware of course.

What was the point beyond which performance was impacted, and by how much? What’s your hardware?

I’m curious to get the rationale / use-case for that.

My System’s Core Specs:
Core i7-6800K (6 cores, 12 threads)
16 GB of RAM
Nvidia GTX 1070

I carried out some tests on continuous streaming without closing the stream. Each audio file was divided into 20 ms chunks, which were then run through a loop for speech recognition and VAD. The results below are for DeepSpeech GPU decoding, as it was faster:

The first two files use a custom LM generated with KenLM. Since this LM is much smaller, inference is faster. File 2 below is File 1 concatenated four times. Both File 1 and File 2 are relatively clean, recorded directly into a mic in a quiet environment.

File 1
Original Audio: 1:49 (109 seconds)

Processed in:
50.66 seconds (Continuous streaming, with the inference being printed out)
50.88 seconds (Continuous streaming, with only the last 30 characters of the transcript printed out)
44.37 seconds (3 second stream batches and additional VAD to close streams)

File 2
Original Audio: 9:09 (549 seconds)

Processed in:
397 seconds (3 second stream batches and speech detection to close streams)
453 seconds (Continuous streaming)

File 3 is an audio file extracted from a short film on YouTube. As such, it contains all the sound and music effects and can be considered noisy. DeepSpeech’s supplied LM was used for this decoding:

File 3
Original Audio: 8:49 (529 seconds)

Processed in:
573 seconds (Continuous streaming)
567 seconds (Continuous streaming, printing only the last 30 characters for each print-out)
423 seconds (3 second stream batches and additional VAD to close streams)

My Takeaways

Inference time increases the longer you keep a stream open. Given a sufficiently long stream, your inference times will become slower than real time. So you need a strategy to close the continuous stream regularly.
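One way to do this, roughly what the "3 second stream batches" runs above do (a sketch reusing the placeholder model from the earlier snippet; in the actual tests I also closed streams on VAD silence, which is omitted here):

```python
import numpy as np

FRAME_MS = 20  # chunk size fed per loop iteration

def transcribe_in_batches(model, frames, batch_seconds=3):
    """Reset the stream every batch_seconds so the open context stays bounded."""
    stream = model.createStream()
    fed_ms = 0
    for frame in frames:                      # 20 ms raw PCM frames, as before
        stream.feedAudioContent(np.frombuffer(frame, dtype=np.int16))
        fed_ms += FRAME_MS
        if fed_ms >= batch_seconds * 1000:
            yield stream.finishStream()       # emit this batch's transcript
            stream = model.createStream()     # fresh context for the next batch
            fed_ms = 0
    yield stream.finishStream()
```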

Yes. That is correct.
The slow decode I described is a byproduct of the way I had implemented the audio buffer.
An Extreme Example:
You maintain a 3 second audio buffer which is sent to DeepSpeech for inference. New audio is added every 200 milliseconds, and the entire 3 second buffer (with only 200 ms of new data) gets decoded each time. Even if DeepSpeech decodes that buffer in half its duration, inference is now much slower than real time.
Of course, depending on the hardware, you could tune the buffer parameters to get inference down to real time. But in my limited testing, I found that a 2-3 second buffer was generally needed to get good inference results.
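To make the arithmetic explicit (a trivial calculation; the function and its argument names are just illustrative):

```python
def realtime_factor(buffer_s, hop_s, decode_speed):
    """(compute time per hop) / (new audio per hop).
    decode_speed = seconds of audio decoded per second of compute."""
    return (buffer_s / decode_speed) / hop_s

# 3 s buffer, 200 ms of new audio per hop, model decoding at 2x real time:
print(realtime_factor(3.0, 0.2, 2.0))   # 7.5 -> 7.5x slower than real time
```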
One way to address this may be to maintain a ring buffer for the transcripts that you get back from inference. But I have seen slight differences between inferences from continuous streams and standalone inferences; the continuous stream gives slightly better results in my limited experimentation.
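By a transcript ring buffer I mean something like this (a sketch; keeping the last four segments is an arbitrary choice):

```python
from collections import deque

# keep only the most recent transcript segments instead of the raw audio
recent_segments = deque(maxlen=4)

def on_segment_finished(text):
    """Called with the transcript each time a stream is closed."""
    recent_segments.append(text)
    return " ".join(recent_segments)   # rolling context for display / matching
```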

Live captioning. I am working off a script, and relevant chunks of dialogue from that script need to be printed out as soon as they are identified in inference. I therefore need the latency to be as low as possible (think of a speech given live on stage). If each segment sent for inference is 3-4 seconds long, the inferred phrase makes a lot more sense and is easier to match to the script. If I just get inferences for, say, 500 ms of data and concatenate them to produce phrases, they are nowhere near as good as the longer chunks and the matching becomes much harder.
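The matching step itself is along these lines (a sketch using difflib; the 0.6 cutoff is a guess and the script lines are made up):

```python
import difflib

# script lines, normalized to lower case without punctuation (illustrative content)
script_lines = [
    "ladies and gentlemen welcome to the evening show",
    "thank you all for being here tonight",
]

def match_to_script(inferred_phrase, cutoff=0.6):
    """Return the closest script line to the inferred phrase, if any is close enough."""
    matches = difflib.get_close_matches(inferred_phrase, script_lines, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```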

I can’t rely on direct inference alone since the results aren’t accurate (even with a custom LM). Even Google’s Live Transcribe app on Android makes mistakes, so I guess we aren’t there yet. Not 100%, anyway.