Total number of audio files used to train DeepSpeech 0.5.1

In the release notes of DeepSpeech 0.5.1, it is mentioned that the model was trained for 467356 steps, or 75 epochs. Going by the training batch size of 24:

steps_per_epoch = 467356 / 75 ≅ 6231
total_number_of_samples = 6231 * 24 = 149544

Did the model achieve a word error rate of 8.22% by training only on 149544 audio samples? And what was the maximum audio length (in seconds) considered while processing the dataset?

wc -l fisher-train.csv librivox-train-{clean,other}-???.csv swb-train.csv

1041522 fisher-train.csv
28540 librivox-train-clean-100.csv
104015 librivox-train-clean-360.csv
148689 librivox-train-other-500.csv
124708 swb-train.csv
1447474 total

Subtract one from each file for the CSV header, which gives a total of 1447469 samples. Maximum audio duration was clipped at 10 seconds.
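For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It only restates the `wc -l` numbers above; the names `line_counts` and `count_samples` are made up for illustration and are not part of DeepSpeech.

```python
# Line counts copied from the `wc -l` output above.
# Each count includes one CSV header line.
line_counts = {
    "fisher-train.csv": 1041522,
    "librivox-train-clean-100.csv": 28540,
    "librivox-train-clean-360.csv": 104015,
    "librivox-train-other-500.csv": 148689,
    "swb-train.csv": 124708,
}

HEADER_LINES_PER_FILE = 1  # assumption: exactly one header row per CSV


def count_samples(counts):
    """Sum the data rows across all files, excluding the header lines."""
    return sum(n - HEADER_LINES_PER_FILE for n in counts.values())


total_samples = count_samples(line_counts)
print(total_samples)  # 1447469
```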

– reuben
