ISSUE
Cannot train DeepSpeech on an RTX 2070. It looks like TensorFlow 1.13 is not working with this newer graphics card: cuDNN fails to initialize as soon as training starts.
ERROR
Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Failed to get convolution algorithm. This is probably because cuDNN failed to initialize
ACTIONS
- TensorFlow 1.13 was compiled and built from source; the issue persists.
- Added extra configuration at config.py line 63 (a standalone sketch of these settings follows below):
  c.session_config.gpu_options.per_process_gpu_memory_fraction = 0.6
  c.session_config.gpu_options.allow_growth = True
  These two settings did not resolve the issue.
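For reference, here is a minimal standalone sketch (my own, not part of DeepSpeech, assuming stock TensorFlow 1.13 APIs) that applies the same two GPU options through tf.ConfigProto and runs a single small convolution, so cuDNN handle creation can be tested in isolation from the training script:

# Minimal standalone check (not part of DeepSpeech): create a session with the
# same GPU options I added in config.py and run one small convolution, so that
# cuDNN handle creation can be exercised outside the training script.
import numpy as np
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.6
config.gpu_options.allow_growth = True

# Tiny NHWC input and a 3x3 identity-like filter, just enough to force a
# cuDNN convolution on the GPU.
x = tf.placeholder(tf.float32, shape=[1, 16, 16, 1])
w = tf.constant(np.ones([3, 3, 1, 1], dtype=np.float32))
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')

with tf.Session(config=config) as sess:
    out = sess.run(y, feed_dict={x: np.ones([1, 16, 16, 1], dtype=np.float32)})
    print('conv2d OK, output shape:', out.shape)

If this small script also fails with CUDNN_STATUS_INTERNAL_ERROR, the problem would seem to be in the TensorFlow/CUDA/cuDNN setup itself rather than anything DeepSpeech-specific.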
INFO
Using the latest version of DeepSpeech (v0.5.0-alpha) with an NVIDIA RTX 2070, CUDA 10.0, cuDNN 7.5, and tensorflow-gpu 1.13.1.
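As a quick environment sanity check (my own snippet, not from the DeepSpeech docs), the following confirms that the installed wheel is the GPU build and that TensorFlow can see the card:

# Environment sanity check: confirm the GPU build of TensorFlow and that the
# RTX 2070 is visible. (My own snippet, not part of DeepSpeech.)
import tensorflow as tf
print('TF version:', tf.VERSION)                      # expect 1.13.1
print('Built with CUDA:', tf.test.is_built_with_cuda())
print('GPU device:', tf.test.gpu_device_name())       # expect /device:GPU:0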
LOG
root@953d2eb1cfea:/DeepSpeech-root/DeepSpeech# ./run-ldc93s1.sh
+ [ ! -f DeepSpeech.py ]
+ [ ! -f data/ldc93s1/ldc93s1.csv ]
+ [ -d ]
+ python -c from xdg import BaseDirectory as xdg; print(xdg.save_data_path("deepspeech/ldc93s1"))
+ checkpoint_dir=/root/.local/share/deepspeech/ldc93s1
+ export CUDA_VISIBLE_DEVICES=0
+ python -u DeepSpeech.py --train_files data/ldc93s1/ldc93s1.csv --test_files data/ldc93s1/ldc93s1.csv --train_batch_size 1 --test_batch_size 1 --n_hidden 100 --epochs 200 --log_level 0
2019-05-14 11:58:21.309114: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-14 11:58:21.424454: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-14 11:58:21.425146: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x33890f0 executing computations on platform CUDA. Devices:
2019-05-14 11:58:21.425163: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2019-05-14 11:58:21.427067: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3491870000 Hz
2019-05-14 11:58:21.427566: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3c633e0 executing computations on platform Host. Devices:
2019-05-14 11:58:21.427587: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-14 11:58:21.427976: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:02:00.0
totalMemory: 7.76GiB freeMemory: 7.39GiB
2019-05-14 11:58:21.427994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-14 11:58:21.428764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-14 11:58:21.428779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-05-14 11:58:21.428786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-05-14 11:58:21.429141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 7185 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use tf.py_function, which takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors means `tf.py_function`s can use accelerators such as GPUs as well as being differentiable using a gradient tape.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-05-14 11:58:22.280196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-14 11:58:22.280233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-14 11:58:22.280241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-05-14 11:58:22.280247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-05-14 11:58:22.280595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7185 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)
D Session opened.
I Initializing variables...
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2019-05-14 11:58:23.028141: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-05-14 11:58:24.275096: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-05-14 11:58:24.289759: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node tower_0/conv1d/Conv2D}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "DeepSpeech.py", line 829, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "DeepSpeech.py", line 813, in main
train()
File "DeepSpeech.py", line 510, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File "DeepSpeech.py", line 483, in run_set
feed_dict=feed_dict)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]
Caused by op 'tower_0/conv1d/Conv2D', defined at:
File "DeepSpeech.py", line 829, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "DeepSpeech.py", line 813, in main
train()
File "DeepSpeech.py", line 400, in train
gradients, loss = get_tower_results(iterator, optimizer, dropout_rates)
File "DeepSpeech.py", line 253, in get_tower_results
avg_loss = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File "DeepSpeech.py", line 186, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse)
File "DeepSpeech.py", line 119, in create_model
batch_x = create_overlapping_windows(batch_x)
File "DeepSpeech.py", line 56, in create_overlapping_windows
batch_x = tf.nn.conv1d(batch_x, eye_filter, stride=1, padding='SAME')
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 3482, in conv1d
data_format=data_format)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]
root@953d2eb1cfea:/DeepSpeech-root/DeepSpeech#
Can this issue be resolved? Any help appreciated. Thanks.