Hi there,
I’m currently working with movie clip data and would like to convert the audio dialogue to text. All clips are in English, so I’m hoping I can just use the pre-trained DeepSpeech model.
However, I’m currently getting “gibberish” when I run my audio clips through the model. My concern is that the downsampling I’ve used causes information loss or corruption (even though the clip sounds fine when played back).
Can someone suggest the correct way to downsample, please? I’m guessing someone must have solved this already, but I couldn’t find anything conclusive on the forums.
This is my workflow:
- Pre-trained model is 0.5.0.
- Extract the audio from the video file using ffmpeg:
ffmpeg -i original.avi -ab 160k -ac 1 -ar 16000 -vn audio.wav
The clips are at 44.1 kHz before extraction and 16 kHz after.
- Run inference on the file using:
deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt --lm models/lm.binary --trie models/trie --audio sox_out.wav
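For reference, this is the quick sanity check I run on a converted file before inference (a minimal sketch using only Python’s stdlib wave module; the expected 16 kHz / mono / 16-bit values are my assumption of what the pre-trained 0.5.0 model wants, and check_wav is just a name I made up):

```python
import wave

def check_wav(path, expected_rate=16000, expected_channels=1, expected_width=2):
    """Report whether a WAV file matches the format I believe the
    pre-trained DeepSpeech model expects: 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        width = w.getsampwidth()  # bytes per sample: 2 == 16-bit
    ok = (rate == expected_rate
          and channels == expected_channels
          and width == expected_width)
    return ok, {"rate": rate, "channels": channels, "sample_width_bytes": width}
```

e.g. check_wav("sox_out.wav") returns (True, {...}) only when the file is already in the target format, so a mismatch shows up before the model produces gibberish.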
This is the soxi output of the audio file before downsampling:
Input File : 'audio.wav'
Channels : 2
Sample Rate : 44100
Precision : 16-bit
Duration : 00:01:48.51 = 4785408 samples = 8138.45 CDDA sectors
File Size : 19.1M
Bit Rate : 1.41M
Sample Encoding: 16-bit Signed Integer PCM
and this is the soxi output after downsampling:
Input File : 'audio_down.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:01:48.51 = 1736202 samples ~ 8138.45 CDDA sectors
File Size : 3.47M
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
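As a sanity check on the numbers above, the sample counts imply the same duration before and after resampling (samples / sample rate), so no audio appears to have been dropped:

```python
# Duration implied by each soxi report above: samples / sample_rate.
before = 4785408 / 44100  # original 44.1 kHz file
after = 1736202 / 16000   # resampled 16 kHz file
print(before, after)      # both ~108.51 s, matching soxi's 00:01:48.51
```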
Thanks for your help!