Hi there,
I’ve run into an odd problem. I’m guessing there’s a simple solution and I’m just doing something silly.
I’ve prepared the command below to train a DeepSpeech model, starting from an existing collection of checkpoints that I created in a previous training run.
python3 DeepSpeech/DeepSpeech.py \
--train_files '/work/cook-island-maori/models/cim_model/train/mi_train.csv' \
--dev_files '/work/cook-island-maori/models/cim_model/train/mi_dev.csv' \
--test_files '/work/cook-island-maori/models/cim_model/train/mi_test.csv' \
--alphabet_config_path '/work/cook-island-maori/models/cim_model/train/alphabet.txt' \
--lm_binary_path '/work/cook-island-maori/models/cim_model/lm/lm.pointers.binary' \
--lm_trie_path '/work/cook-island-maori/models/cim_model/lm/pointers.trie' \
--lm_weight 1.75 \
--epoch 200 \
--train_batch_size 16 \
--dev_batch_size 16 \
--test_batch_size 16 \
--learning_rate 0.00005 \
--max_to_keep 10 \
--display_step 0 \
--validation_step 1 \
--dropout_rate 0.30 \
--default_stddev 0.046875 \
--checkpoint_dir /work/cook-island-maori/models/cim_model/checkpoints \
--decoder_library_path /work/cook-island-maori/models/cim_model/native_client/libctc_decoder_with_kenlm.so \
--log_level 0 \
--summary_dir /work/cook-island-maori/models/cim_model/summaries \
--summary_secs 120 \
--wer_log_pattern "GLOBAL LOG: logwer('cim_model', %s, %s, %f)" \
--fulltrace 1 \
--limit_train 0 \
--limit_dev 0 \
--limit_test 0 \
--valid_word_count_weight 1 \
--export_dir /work/cook-island-maori/models/cim_model/export \
--checkpoint_secs 600
This is a loose attempt at transferring the model from one language to another with a similar alphabet.
Running the command produces the following logs:
2018-11-02 04:07:05.468907: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-02 04:07:05.558123: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-02 04:07:05.558569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-02 04:07:05.558607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-11-02 04:07:05.865882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-02 04:07:05.865959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-11-02 04:07:05.865974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-11-02 04:07:05.866288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-11-02 04:07:08.899032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-11-02 04:07:08.899122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-02 04:07:08.899136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-11-02 04:07:08.899145: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-11-02 04:07:08.899293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
After this point it idles and stops doing anything: CPU usage sits at 0, and so does I/O. I’m a little lost at the moment, so I should probably step away from the problem for a while.
If I come up with a solution, I’ll try to remember to write it up here later.