Hi, I am trying to use on-the-fly data augmentation on the audio files.
My first idea was to insert augmentation operations right before extracting features from the audio (i.e., right here https://github.com/mozilla/DeepSpeech/blob/master/util/audio.py#L67).
For this, I’ve used a Python wrapper for SoX (https://github.com/carlthome/python-audio-effects), and it worked very well. The code looks like this:
import random

from pysndfx import AudioEffectsChain

# wav (scipy.io.wavfile) and audioToInputVector are already imported/defined in util/audio.py


def audiofile_to_input_vector(audio_filename, numcep, numcontext):
    r"""
    Given a WAV audio file at ``audio_filename``, calculates ``numcep`` MFCC features
    at every 0.01s time step with a window length of 0.025s. Appends ``numcontext``
    context frames to the left and right of each time step, and returns this data
    in a numpy array.
    """
    # Load wav file
    fs, audio = wav.read(audio_filename)

    # With 90% probability, perturb the tempo by a random factor in [0.8, 1.2]
    aug_fx = AudioEffectsChain()
    if random.random() < 0.9:
        aug_fx.tempo(random.uniform(0.8, 1.2))
    aug_out = aug_fx(audio)

    return audioToInputVector(aug_out, fs, numcep, numcontext)
The accuracy improved by more than 5%.
Unfortunately, training is roughly 10x slower. I noticed that during training the GPU sometimes sits idle while the CPU is at nearly 100%. My guess is that this wrapper is the bottleneck (it only builds a SoX command that is then run as a separate process), and that it does not make use of DeepSpeech's parallel threads.
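For reference, a rough standalone timing check like the one below (where 'sample.wav' and the loop count are just placeholders) should show how much time each augmentation call spends just building and running the SoX process:

import random
import time

import scipy.io.wavfile as wav
from pysndfx import AudioEffectsChain

fs, audio = wav.read('sample.wav')  # placeholder file

start = time.time()
for _ in range(20):
    fx = AudioEffectsChain().tempo(random.uniform(0.8, 1.2))
    fx(audio)  # each call builds a sox command and runs it in a separate process
elapsed = time.time() - start
print('average augmentation time per file: %.3fs' % (elapsed / 20))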
Do you have any idea how I can solve this?