First off, huge thanks to the Mozilla team for this work. If there’s any information we can provide that would be of use to you, just let us know.
We have recently trained DeepSpeech on Te Reo Māori (the Māori language, spoken by the indigenous people of New Zealand).
We had a collection of 1,300 speakers and over 193,000 recordings totaling over 300 hours of recorded audio.
Our text corpus is still small, around 10-20 MB at the moment (increasing it is the first priority for the next leg of the project).
Using this data, we achieved a 14.0% word error rate on a test set consisting of 13% of all the recorded audio (~27,000 recordings).
We also split the held-out set by speaker and sentence overlap: DeepSpeech achieved a 13.8% word error rate on sentences not included in training but spoken by speakers who were in the training set, and 6.2% on sentences that were in the training data but spoken by speakers who were not.
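For anyone unfamiliar with the metric, word error rate is just the word-level Levenshtein (edit) distance between the reference transcript and the recognizer's output, divided by the number of reference words. A minimal sketch (not DeepSpeech's own implementation, and the example phrases are made up for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four reference words:
print(wer("kei te pehea koe", "kei te pea koe"))  # → 0.25
```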
Our model was trained on AWS using a p3.8xlarge instance in 18 hours, with the following hyperparameters:
lm_weight = 2.00
epoch = 1
learning_rate = 0.0001
max_to_keep = 3
display_step = 0
validation_step = 0
dropout_rate = 0.30
default_stddev = 0.046875
early_stop = 1
earlystop_nsteps = 10
log_level = 0
summary_secs = 120
fulltrace = 1
limit_train = 0
limit_dev = 0
limit_test = 0
valid_word_count_weight = 1
checkpoint_secs = 600
train_batch_size = 16
dev_batch_size = 16
test_batch_size = 16
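For reference, these names map onto DeepSpeech's command-line flags, so a run can be assembled along these lines. This is a sketch: the data CSV paths and checkpoint directory are placeholders, and flag names should be checked against the flags accepted by the DeepSpeech version you are running.

```python
# Build a DeepSpeech.py training command from the hyperparameters above.
# Paths below are placeholders, not our actual data locations.
hyperparameters = {
    "lm_weight": 2.00,
    "epoch": 1,
    "learning_rate": 0.0001,
    "dropout_rate": 0.30,
    "default_stddev": 0.046875,
    "early_stop": 1,
    "earlystop_nsteps": 10,
    "train_batch_size": 16,
    "dev_batch_size": 16,
    "test_batch_size": 16,
    "checkpoint_secs": 600,
}

cmd = ["python", "DeepSpeech.py"]
for name, value in hyperparameters.items():
    cmd += [f"--{name}", str(value)]
cmd += [
    "--train_files", "data/train.csv",
    "--dev_files", "data/dev.csv",
    "--test_files", "data/test.csv",
    "--checkpoint_dir", "checkpoints/",
]

print(" ".join(cmd))
# Then launch it, e.g. with subprocess.run(cmd, check=True)
```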
Everything was run inside Docker images, and we used nvidia-docker to deploy the model on GPU machines. The hyperparameters were logged by our continuous integration system, Gorbachev.
We also have a more detailed report on the project, and might be able to share it (after some minor editing) if anyone else is interested.