I have some audio data formatted for training DeepSpeech. After formatting some new raw data, I compared the spectrograms, and the differences are concerning.
It looks like an unknown layer of audio manipulation was applied when the original training data was formatted.
I pulled the same clip from the raw audio used for training, formatted it the same way, and generated spectrograms of both - here are the differences:
[Spectrogram of the formatted raw data]
[Spectrogram of the training audio]
To my (human) ear, the clips behind both spectrograms sound the same.
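For what it's worth, here is a minimal sketch of how a difference like this can be quantified numerically rather than by eye. It uses `scipy` and synthetic signals standing in for the actual clips (the "mild clipping" step is just a hypothetical example of an inaudible manipulation, not a claim about what happened to the training data):

```python
import numpy as np
from scipy.signal import spectrogram

def spec_db(x, sr, nperseg=512):
    """Magnitude spectrogram in dB (log power), similar to what
    a speech front end computes before feature extraction."""
    f, t, Sxx = spectrogram(x, fs=sr, nperseg=nperseg)
    return f, t, 10 * np.log10(Sxx + 1e-10)  # epsilon avoids log(0)

sr = 16000  # DeepSpeech expects 16 kHz mono audio
dur = 1.0
t = np.arange(int(sr * dur)) / sr
clean = np.sin(2 * np.pi * 440 * t)

# Hypothetical "unknown manipulation": mild clipping that is
# barely audible but still shifts energy into harmonics.
processed = np.clip(clean, -0.95, 0.95)

_, _, S1 = spec_db(clean, sr)
_, _, S2 = spec_db(processed, sr)

# Mean absolute per-bin difference in dB between the two spectrograms.
diff_db = np.abs(S1 - S2).mean()
print(f"mean |dB difference| per bin: {diff_db:.2f}")
```

Running the same comparison on the real clips (loaded with `scipy.io.wavfile.read`, for example) would give a rough number for how far apart the two pipelines actually are, which is easier to reason about than visually comparing plots.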
I am wondering whether, and how much, these spectrogram differences will affect inference accuracy.
Thanks