Hi, I’ve begun transfer-learning/fine-tuning on a small set of audio clips (approximately 4.5 minutes in total). The set is small because I’m currently working on a CPU-only machine; I’m looking to get a GPU soon. In the meantime, I’m experimenting with the hyperparameters to get the model to learn new domain-specific names and words.
The hyperparameters in my fine-tuning script are:
--checkpoint_dir '/home/…/deepspeech-0.5.1-checkpoint'
--train_files '/home/…/train.csv'
--dev_files '/home/…/dev.csv'
--test_files '/home/…/test.csv'
--epochs 10
--dropout_rate 0.15
--learning_rate 0.0001
--train_batch_size 7
--dev_batch_size 1
--test_batch_size 1
--use_seq_length False
--export_dir '/home/…/EXPORT'
--show_progressbar True
--report_count 100
--early_stop True
--lm_alpha 0.75
--lm_beta 1.85
"$@"
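In case it matters, the CSVs passed to --train_files / --dev_files / --test_files follow the usual DeepSpeech three-column layout (wav_filename, wav_filesize in bytes, transcript); the paths and sentences below are just made-up placeholders, not my actual data:

wav_filename,wav_filesize,transcript
/home/user/clips/clip_001.wav,288044,the quick brown fox jumps over the lazy dog
/home/user/clips/clip_002.wav,510044,please route the call to the operations desk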
The one wav file in the ‘test’ folder is 17 seconds long and sampled at 16 kHz. The ‘train’ folder has 7 clips of varying length (9 to 81 seconds), also at 16 kHz. I read in other threads that best practice is to use sentence-length clips under 10 seconds. I’ll be compiling those soon, but for now I just wanted to confirm that the pipeline runs end to end.
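Once I re-cut the audio, my rough plan is the sketch below (plain Python with the standard wave module; the file paths and the 10-second cap are my own assumptions, not something from the DeepSpeech docs). It checks that a WAV is 16 kHz and slices anything longer than the cap into fixed-size segments:

import os
import wave

MAX_SECONDS = 10  # upper bound on clip length, per the advice in other threads

def split_wav(path, out_dir, max_seconds=MAX_SECONDS):
    # Split a WAV into segments no longer than max_seconds each.
    os.makedirs(out_dir, exist_ok=True)
    with wave.open(path, "rb") as src:
        rate = src.getframerate()
        assert rate == 16000, f"{path}: expected 16 kHz, got {rate} Hz"
        params = src.getparams()
        frames_per_segment = rate * max_seconds
        base = os.path.splitext(os.path.basename(path))[0]
        index = 0
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break
            out_path = os.path.join(out_dir, f"{base}_{index:03d}.wav")
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            index += 1

split_wav("/home/user/clips/long_take.wav", "/home/user/clips/segments")

This cuts at fixed offsets, so in practice I would cut at sentence boundaries instead and update the CSV transcripts to match each segment.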
At the end of the fine-tuning run from the 0.5.1 checkpoint, the results come out to:
WER: 0.983051, CER: 0.910448, loss: 958.222900
res: “the tootootootootoot tootootootootoot”
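For my own understanding, I read WER and CER as word-level and character-level edit distance divided by the reference length; a minimal sketch of that calculation (not the exact code DeepSpeech uses):

def edit_distance(ref, hyp):
    # Levenshtein distance between two sequences, single-row dynamic programming.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (r != h))
    return row[-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)

So a WER near 1.0 means almost every reference word had to be substituted, inserted, or deleted, which matches the garbled transcription above.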
I know I haven’t fed the model clips of the recommended length, but does that alone explain why the result is so far off?
I may be missing details needed to explain my case; if so, please feel free to ask for more.