Hello,
I am not sure how to properly contribute this to the GitHub repo. I know the FAQ mentions that people would like to see whether DeepSpeech can be used without having to save audio as a .wav file first.
Well, in a nutshell (and according to client.py), the Model just needs the audio to be a flattened NumPy array of 16-bit samples. Another Python package, SpeechRecognition, has built-in support for capturing an in-memory AudioData object from an audio source (microphone, .wav file, etc.).
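For example, turning a captured AudioData into the buffer DeepSpeech expects only takes a couple of lines. This is just a minimal sketch (the helper name is mine, for illustration): get_raw_data() with convert_rate/convert_width is SpeechRecognition's API, and the 16 kHz / 16-bit values are what the released model expects.

import numpy as np
import speech_recognition as sr

def audiodata_to_buffer(audio: sr.AudioData) -> np.ndarray:
    # Ask SpeechRecognition for raw PCM resampled to 16 kHz / 16-bit,
    # matching what the released DeepSpeech model was trained on.
    raw = audio.get_raw_data(convert_rate=16000, convert_width=2)
    # Flatten the samples into the 1-D int16 array that Model.stt() accepts.
    return np.frombuffer(raw, dtype=np.int16)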
Anyway, long story short, here is the code I run that lets me use DeepSpeech without having to create a .wav file. It assumes you already have a built and trained model; for this snippet I just used the pre-built model files that are included with the release.
I hope something like this can be officially incorporated into the project.
from deepspeech import Model
import numpy as np
import speech_recognition as sr
# Model and decoder parameters (same values as client.py)
sample_rate = 16000
beam_width = 500
lm_alpha = 0.75
lm_beta = 1.85
n_features = 26
n_context = 9

# Paths to the pre-built model files
model_name = "output_graph.pbmm"
alphabet = "alphabet.txt"
language_model = "lm.binary"
trie = "trie"
audio_file = "demo.wav"

if __name__ == '__main__':
    # Load the acoustic model and enable the language model decoder
    ds = Model(model_name, n_features, n_context, alphabet, beam_width)
    ds.enableDecoderWithLM(alphabet, language_model, trie, lm_alpha, lm_beta)

    # Capture audio from the microphone into an in-memory AudioData object
    r = sr.Recognizer()
    with sr.Microphone(sample_rate=sample_rate) as source:
        print("Say Something")
        audio = r.listen(source)

    # Convert the raw frames into the flattened int16 buffer the model expects
    fs = audio.sample_rate
    audio = np.frombuffer(audio.frame_data, np.int16)

    # Equivalent file-based path (requires `import wave`), kept for reference:
    #fin = wave.open(audio_file, 'rb')
    #fs = fin.getframerate()
    #print("Framerate: ", fs)
    #audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
    #audio_length = fin.getnframes() * (1/sample_rate)
    #fin.close()

    print("Running inference on the captured audio")
    print(ds.stt(audio, fs))
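
As a side note, the commented-out wave block isn't really needed either: SpeechRecognition can read an existing .wav file into the same kind of AudioData object, so the microphone and file cases share one code path. A quick sketch, assuming demo.wav is a 16-bit PCM .wav and reusing the ds model loaded above:

r = sr.Recognizer()
with sr.AudioFile(audio_file) as source:
    audio = r.record(source)  # read the whole file into an AudioData object
fs = audio.sample_rate
audio_buffer = np.frombuffer(audio.frame_data, np.int16)
print(ds.stt(audio_buffer, fs))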