Memory usage during training

I’m trying to train a custom model using current master (0.5.0-alpha.7, TensorFlow 1.13.1) with my own mixture of LibriSpeech, Common Voice and some private data. Training seems to proceed normally, but memory usage keeps creeping up until the entire 32 GB of RAM is exhausted and I have to cancel the training manually (it takes around 2 hours to use up all available memory; I don’t have a GPU OOM issue). I am wondering whether this is the correct behaviour, because it looks like a memory leak. Has anyone experienced a similar issue? Another thing I noticed is that the rate of step updates also goes down as training progresses.

Below is my training configuration:

python3 -u DeepSpeech.py \
  --train_files train-random.csv \
  --dev_files dev-random.csv \
  --test_files test-random.csv \
  --train_batch_size 32 \
  --dev_batch_size 32 \
  --test_batch_size 32 \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --dropout_rate 0.15 \
  --lm_binary_path /srv/ml_datasets/speech_data/language_corpora/lm.binary \
  --lm_trie_path /srv/ml_datasets/speech_data/language_corpora/trie \
  --epochs 50 \
  --checkpoint_dir checkpoint \
  "$@"

I did rebuild the lm.binary and trie with my dataset.

Well, we don’t see this issue, but I personally have more memory. How much RAM and swap do you have, and how much is available? There could be memory leaks in our code or in dependencies as well.

I have 32 GB of RAM and 2 GB of swap (default Ubuntu installation). Memory usage starts at around 4.5 GB (no other process running apart from a system monitor and the terminal running DeepSpeech) and gradually fills up the RAM and then spills onto swap within ~2 hours. I can increase the swap size, but that only works around the problem if there really is an issue.
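For reference, on Ubuntu a larger swap file can be added along these lines; the file name and size here are just an example, adjust them to your machine:

sudo fallocate -l 32G /swapfile2   # allocate a 32 GB file (use dd if fallocate is unavailable)
sudo chmod 600 /swapfile2          # restrict permissions as required for swap
sudo mkswap /swapfile2             # format the file as swap
sudo swapon /swapfile2             # enable it immediately
# add "/swapfile2 none swap sw 0 0" to /etc/fstab to keep it across reboots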

You say you have Common Voice and others; it might just be that you need more memory?

My dataset mix is dominated by LibriSpeech; the private data is less than two hours (I had trouble using the 0.4.1 checkpoint for fine-tuning, the loss goes to NaN as soon as training starts). What I don’t understand (without knowing how DeepSpeech training works) is what is holding on to all that memory and not releasing it. MFCCs? Accumulated logits?

It might not be that; it may simply require more memory than you have. Can you elaborate on the size of the LibriSpeech and Common Voice sets you are using?

I have 328,642 entries in my training.csv. Assuming an average of 10 seconds per audio clip, that’s roughly 912 hours of audio. At the moment I only have a small dev machine that takes a maximum of 32 GB of RAM.

Update: I’ve increased the swap file size to 32 GB and set the feature_cache flag. I think all preprocessed features are now saved to disk, and memory usage no longer increases.
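For anyone else hitting this, the change amounts to appending one flag to the training command above; the cache path below is only a placeholder, and the exact flag semantics may differ between versions:

  --feature_cache /path/to/feature_cache \   # write preprocessed features to this path on disk instead of keeping them in RAM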


I have a similar issue: loss goes to NaN in a couple of steps.

D Checking for early stopping (last 4 steps) validation loss: nan, with standard deviation: nan and mean: nan
Epoch 4 |   Training | Elapsed Time: 0:01:45 | Steps: 57 | Loss: nan
Epoch 4 | Validation | Elapsed Time: 0:00:12 | Steps: 28 | Loss: nan | Dataset: ./data/CV/ur/clips/dev.csv
D Checking for early stopping (last 4 steps) validation loss: nan, with standard deviation: nan and mean: nan

@lissyx What could be the reason?

How about you give us more context? Training dataset, parameters.

I’m training for Urdu, with nearly 30 hours of read audio data.
The training parameters are:

sudo python3 -u DeepSpeech.py \
  --train_files ./data/CV/ur/clips/train.csv \
  --dev_files ./data/CV/ur/clips/dev.csv \
  --test_files ./data/CV/ur/clips/test.csv \
  --train_batch_size 128 \
  --dev_batch_size 32 \
  --test_batch_size 32 \
  --n_hidden 20 \
  --epoch 1 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.001 \
  --report_count 100 \
  --log_level 0 \
  --max_to_keep 20 \
  --export_dir ./data/CV/ur/model_export/ \
  --export_language ur \
  --checkpoint_dir ./data/CV/ur/checkpoint/ \
  --alphabet_config_path ./data/lm/alphabet.txt \
  --lm_binary_path ./data/lm/lm.binary \
  --lm_trie_path ./data/lm/trie \
  "$@"

Also, most of the time the system freezes when training is started.

You should try some transfer learning with that amount of data. Also, when training on so little data, it’s not really surprising that the loss behaves like that.

That depends on your hardware …

@lissyx Can I do transfer learning for Urdu from a model trained in another language? If so, kindly point me to the documentation for transfer learning in DeepSpeech.

There isn’t any ASR model available for Urdu, hence this effort. I understand the data is minimal; I will try to augment it.
What is the minimum number of hours of data needed for DeepSpeech?

To get something viable, > 1k hours

Well, I’m not sure if there’s any way to transliterate Urdu into an English-compatible alphabet. If so, you can reuse checkpoints from English as well as @josh_meyer’s transfer-learning branch.

Transliteration is possible, but the English alphabet isn’t enough, as Urdu has more letters and phonemes than English. Usually case-sensitive letters and two-letter combinations are used to fully transliterate Urdu into an English-compatible alphabet.
What are your suggestions now?

Have you tried being less aggressive with the learning rate? Also, I think trying to source new datasets and/or promoting Common Voice contributions in Urdu (adding new sentences to read, getting more people to record) would be a better use of your time at that point.
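As an illustration only (the right value is something to tune on your data), “less aggressive” would mean dropping the learning rate by an order of magnitude or so:

  --learning_rate 0.0001 \   # instead of 0.001 in the command above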

I will try it.

That’s next on my to-do list. I will start a new thread when I begin in a week or so.
Any reference points or a checklist before I start would be helpful.