Truncated vs. whole words in training files

Hi all, I’m building a training data set of roughly 10,000 audio clips, each 4-5 seconds long, with transcriptions for transfer learning. Many of my clips are cut across word boundaries, but the text transcripts contain the full words.

Example:
Audio: “sunny today with a chance of showers late in the aftern”
Text: “sunny today with a chance of showers late in the afternoon”

I may just flag these clips and see how they affect training, but since I’m putting time and effort into preparing the training data, I wonder whether others here have experience with truncated training data. Which of these approaches has worked best for you?

  1. training on the truncated audio with the full word text
  2. training on the truncated audio with partial word text to match the truncated audio as best as possible
  3. training on the truncated audio with no text for truncated words
  4. discarding all clips with truncated audio and training only on clips with clean word boundaries

I’m ignoring one possible alternative, re-cutting the audio files, because even with the best segmentation approaches I’ve tried, truncated words remain a common issue.
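
For context, here’s a rough sketch of how I could turn a flagged manifest into options 3 and 4 (the CSV columns and the `truncated` flag are just assumptions about my own pipeline, not a standard format):

```python
import csv

# Hypothetical manifest layout (an assumption, not a standard format):
# one row per clip with columns audio_path, transcript, truncated,
# where truncated is "start", "end", or empty for clean clips.
def load_manifest(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def apply_option_3(rows):
    """Option 3: keep truncated clips but drop the cut-off word from the text."""
    out = []
    for row in rows:
        words = row["transcript"].split()
        if row["truncated"] == "end" and len(words) > 1:
            words = words[:-1]  # "...showers late in the" (drops "afternoon")
        elif row["truncated"] == "start" and len(words) > 1:
            words = words[1:]
        out.append({**row, "transcript": " ".join(words)})
    return out

def apply_option_4(rows):
    """Option 4: discard every clip that contains a truncated word."""
    return [row for row in rows if not row["truncated"]]
```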

Thank you!

I think something is missing here :slight_smile:

Thanks, I’ve edited it!

I don’t think 1 or 3 can give sound results, because the audio would be matched to text that doesn’t actually represent it. Can you describe 2 in more detail? I’d go for 4 for now.

Do you have figures on the number of clips affected? You say “many”, but that’s not very specific.

I find that about 5 percent of my samples have noticeably truncated words, i.e. at least one full syllable of a boundary word is audible but the word is cut off. I may try training both ways; if I do, I’ll report back on how it went.
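
If I do run that comparison, it would look roughly like this, reusing the hypothetical flagged manifest from my sketch above:

```python
def summarize_and_split(rows):
    """Count flagged clips and build the two training sets to compare:
    (a) all clips with full-word transcripts, (b) clean word boundaries only."""
    flagged = [r for r in rows if r["truncated"]]
    clean = [r for r in rows if not r["truncated"]]
    pct = 100.0 * len(flagged) / max(len(rows), 1)
    print(f"{len(flagged)} of {len(rows)} clips flagged ({pct:.1f}%)")
    return rows, clean  # fine-tune one model on each set and compare dev-set WER
```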

That’d be interesting, but IMHO, if dropping 5% of the clips avoids degrading your quality, it’s a good deal.
