Hi all, I’m building a training dataset of roughly 10,000 audio clips (4-5 seconds each) with transcriptions for transfer learning. Many of my clips are cut off mid-word, but the text transcripts contain the full word.
Example:
Audio: “sunny today with a chance of showers late in the aftern”
Text: “sunny today with a chance of showers late in the afternoon”
I may just flag these clips and see how they affect training, but since I’m already putting time and effort into preparing the data, I’m wondering whether others here have experience with truncated training audio. Which of these approaches has worked best for you? (I sketch how I’d generate the labels for the first three options after the list.)
- training on the truncated audio with the full word text
- training on the truncated audio with partial-word text matching the truncated audio as closely as possible
- training on the truncated audio with no text for truncated words
- discarding all clips with truncated audio and training only on clips with clean word boundaries
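For concreteness, here’s a minimal sketch of how I’d generate the labels for the first three options, assuming I already know which clips end mid-word. The function name and the character-halving heuristic are just placeholders; phone-level timings from a forced aligner would give a much better partial spelling:

```python
def make_label_variants(full_text: str, last_word_truncated: bool) -> dict:
    """Build the candidate transcripts for one clip, covering the first
    three options above. `full_text` is the clip's full-word transcript."""
    words = full_text.split()
    if not last_word_truncated or not words:
        return {"full": full_text, "partial": full_text, "dropped": full_text}
    last = words[-1]
    # Option 1: keep the full word even though the audio cuts it off.
    full = full_text
    # Option 2: keep a rough partial spelling of the cut word. Halving
    # the characters is a crude stand-in for real phone timings.
    partial = " ".join(words[:-1] + [last[: max(1, len(last) // 2)]])
    # Option 3: drop the truncated word entirely.
    dropped = " ".join(words[:-1])
    return {"full": full, "partial": partial, "dropped": dropped}


print(make_label_variants(
    "sunny today with a chance of showers late in the afternoon", True))
# {'full': '... the afternoon', 'partial': '... the afte', 'dropped': '... the'}
```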
I’m setting aside one possible alternative, re-cutting the audio files, because even the best cutting approaches I’ve tried still truncate words fairly often.
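For the flagging itself, this is roughly what I have in mind, assuming I can get word-level timestamps for the *source* recordings from some forced aligner (the `(word, start, end)` tuple format here is just an assumption, not any particular aligner’s output):

```python
def words_in_clip(word_times, clip_start: float, clip_end: float):
    """Given (word, start_sec, end_sec) timestamps for the source
    recording, return the words that fall fully inside one clip plus a
    flag for whether any word straddles a cut point."""
    inside, truncated = [], False
    for word, start, end in word_times:
        if end <= clip_start or start >= clip_end:
            continue  # word lies entirely outside the clip
        if start >= clip_start and end <= clip_end:
            inside.append(word)  # word fully inside the clip
        else:
            truncated = True  # word crosses a clip boundary
    return inside, truncated
```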
Thank you!