python3 -u ./DeepSpeech.py \
  --train_files /home/sky-ai/xwt/DeepSpeech/data/train/train.csv \
  --dev_files /home/sky-ai/xwt/DeepSpeech/data/cv/cv.csv \
  --test_files /home/sky/sky-ai/xwt/DeepSpeech/data/test/test.csv \
  --train_batch_size 24 \
  --dev_batch_size 15 \
  --test_batch_size 20 \
  --epoch 20 \
  --display_step 1 \
  --validation_step 1 \
  --dropout_rate 0.30 \
  --default_stddev 0.046875 \
  --learning_rate 0.0001 \
  --log_level 0
2019-03-08 16:25:17.113383: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-03-08 16:25:17.213002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:17:00.0
totalMemory: 10.92GiB freeMemory: 10.77GiB
2019-03-08 16:25:17.274068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:65:00.0
totalMemory: 10.92GiB freeMemory: 10.57GiB
2019-03-08 16:25:17.274882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-03-08 16:25:17.936644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-08 16:25:17.936675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2019-03-08 16:25:17.936680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y
2019-03-08 16:25:17.936683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N
2019-03-08 16:25:17.937213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 10419 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
2019-03-08 16:25:17.937478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 10226 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
D Starting coordinator...
D Coordinator started. Thread id 140459615708928
Preprocessing ['/home/sky-ai/xwt/DeepSpeech/data/train/train.csv']
Preprocessing done
Preprocessing ['/home/sky-ai/xwt/DeepSpeech/data/cv/cv.csv']
Preprocessing done
W Parameter --validation_step needs to be >0 for early stopping to work
2019-03-08 16:26:26.961329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-03-08 16:26:26.961413: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-08 16:26:26.961419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2019-03-08 16:26:26.961424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y
2019-03-08 16:26:26.961428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N
2019-03-08 16:26:26.961956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10419 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
2019-03-08 16:26:26.962068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10226 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
2019-03-08 16:26:28.980208: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980297: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592
2019-03-08 16:26:28.980344: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 7730940928 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980353: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 7730940928
2019-03-08 16:26:28.980392: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 6957846528 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980402: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 6957846528
2019-03-08 16:26:28.980433: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 6262061568 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980440: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 6262061568
2019-03-08 16:26:28.980464: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 5635855360 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980471: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 5635855360
2019-03-08 16:26:28.980494: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 5072269824 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980501: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 5072269824
2019-03-08 16:26:28.980526: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 4565042688 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980532: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 4565042688
2019-03-08 16:26:28.980551: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 4108538368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980556: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 4108538368
2019-03-08 16:26:28.980572: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 3697684480 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980577: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 3697684480
2019-03-08 16:26:28.980602: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980607: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592
2019-03-08 16:26:38.980783: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:38.980838: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592
2019-03-08 16:26:38.980875: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:38.980886: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592
2019-03-08 16:26:38.980901: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (cuda_host_bfc) ran out of memory trying to allocate 3.33GiB. Current allocation summary follows.
2019-03-08 16:26:38.980918: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256): Total Chunks: 3, Chunks in use: 3. 768B allocated for chunks. 768B in use in bin. 12B client-requested in use in bin.
2019-03-08 16:26:38.980931: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.980942: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1024): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.980954: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.980964: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.980979: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8192): Total Chunks: 12, Chunks in use: 12. 96.0KiB allocated for chunks. 96.0KiB in use in bin. 96.0KiB client-requested in use in bin.
2019-03-08 16:26:38.980990: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16384): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981003: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (32768): Total Chunks: 3, Chunks in use: 3. 96.0KiB allocated for chunks. 96.0KiB in use in bin. 96.0KiB client-requested in use in bin.
2019-03-08 16:26:38.981014: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981025: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (131072): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981036: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981049: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (524288): Total Chunks: 1, Chunks in use: 0. 831.2KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981061: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1048576): Total Chunks: 3, Chunks in use: 3. 5.00MiB allocated for chunks. 5.00MiB in use in bin. 5.00MiB client-requested in use in bin.
2019-03-08 16:26:38.981075: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2097152): Total Chunks: 3, Chunks in use: 3. 11.58MiB allocated for chunks. 11.58MiB in use in bin. 11.58MiB client-requested in use in bin.
2019-03-08 16:26:38.981086: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4194304): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981097: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8388608): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981110: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16777216): Total Chunks: 9, Chunks in use: 9. 144.00MiB allocated for chunks. 144.00MiB in use in bin. 144.00MiB client-requested in use in bin.
2019-03-08 16:26:38.981122: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981132: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (67108864): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981146: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (134217728): Total Chunks: 4, Chunks in use: 3. 524.24MiB allocated for chunks. 384.00MiB in use in bin. 384.00MiB client-requested in use in bin.
2019-03-08 16:26:38.981158: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (268435456): Total Chunks: 3, Chunks in use: 2. 7.33GiB allocated for chunks. 6.66GiB in use in bin. 6.66GiB client-requested in use in bin.
2019-03-08 16:26:38.981171: I tensorflow/core/common_runtime/bfc_allocator.cc:613] Bin for 3.33GiB was 256.00MiB, Chunk State:
2019-03-08 16:26:38.981186: I tensorflow/core/common_runtime/bfc_allocator.cc:619] Size: 684.82MiB | Requested Size: 0B | in_use: 0, prev: Size: 3.33GiB | Requested Size: 3.33GiB | in_use: 1
2019-03-08 16:26:38.981198: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fb956000000 of size 3576881152
2019-03-08 16:26:38.981207: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Free at 0x7fba2b32e000 of size 718086144
2019-03-08 16:26:38.981215: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fba74000000 of size 3576881152
2019-03-08 16:26:38.981224: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb4932e000 of size 134217728
2019-03-08 16:26:38.981232: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb5132e000 of size 134217728
2019-03-08 16:26:38.981240: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb5932e000 of size 134217728
2019-03-08 16:26:38.981248: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb6132e000 of size 1746688
2019-03-08 16:26:38.981256: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb614d8700 of size 1746688
2019-03-08 16:26:38.981264: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb61682e00 of size 1746688
2019-03-08 16:26:38.981273: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb6182d500 of size 4046848
2019-03-08 16:26:38.981281: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb61c09500 of size 4046848
2019-03-08 16:26:38.981289: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb61fe5500 of size 4046848
2019-03-08 16:26:38.981297: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb623c1500 of size 16777216
2019-03-08 16:26:38.981305: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb633c1500 of size 16777216
2019-03-08 16:26:38.981313: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb643c1500 of size 16777216
2019-03-08 16:26:38.981321: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb653c1500 of size 16777216
2019-03-08 16:26:38.981329: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb663c1500 of size 16777216
2019-03-08 16:26:38.981337: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb673c1500 of size 16777216
2019-03-08 16:26:38.981345: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb683c1500 of size 16777216
2019-03-08 16:26:38.981353: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb693c1500 of size 16777216
2019-03-08 16:26:38.981361: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb6a3c1500 of size 16777216
2019-03-08 16:26:38.981369: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Free at 0x7fbb6b3c1500 of size 147057408
2019-03-08 16:26:38.981378: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14800000 of size 8192
2019-03-08 16:26:38.981386: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14802000 of size 256
2019-03-08 16:26:38.981394: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14802100 of size 8192
2019-03-08 16:26:38.981402: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14804100 of size 8192
2019-03-08 16:26:38.981410: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14806100 of size 8192
2019-03-08 16:26:38.981418: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14808100 of size 8192
2019-03-08 16:26:38.981426: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf1480a100 of size 8192
2019-03-08 16:26:38.981434: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf1480c100 of size 8192
2019-03-08 16:26:38.981442: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf1480e100 of size 8192
2019-03-08 16:26:38.981450: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14810100 of size 8192
2019-03-08 16:26:38.981458: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14812100 of size 8192
2019-03-08 16:26:38.981466: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14814100 of size 8192
2019-03-08 16:26:38.981474: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14816100 of size 8192
2019-03-08 16:26:38.981485: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14818100 of size 256
2019-03-08 16:26:38.981493: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14818200 of size 256
2019-03-08 16:26:38.981501: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14818300 of size 32768
2019-03-08 16:26:38.981509: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14820300 of size 32768
2019-03-08 16:26:38.981517: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14828300 of size 32768
2019-03-08 16:26:38.981525: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Free at 0x7fbf14830300 of size 851200
2019-03-08 16:26:38.981533: I tensorflow/core/common_runtime/bfc_allocator.cc:638] Summary of in-use Chunks by size:
2019-03-08 16:26:38.981543: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 256 totalling 768B
2019-03-08 16:26:38.981553: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 12 Chunks of size 8192 totalling 96.0KiB
2019-03-08 16:26:38.981562: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 32768 totalling 96.0KiB
2019-03-08 16:26:38.981571: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 1746688 totalling 5.00MiB
2019-03-08 16:26:38.981581: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 4046848 totalling 11.58MiB
2019-03-08 16:26:38.981590: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 9 Chunks of size 16777216 totalling 144.00MiB
2019-03-08 16:26:38.981599: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 134217728 totalling 384.00MiB
2019-03-08 16:26:38.981608: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 2 Chunks of size 3576881152 totalling 6.66GiB
2019-03-08 16:26:38.981617: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 7.19GiB
2019-03-08 16:26:38.981629: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit: 68719476736
InUse: 7724988416
MaxInUse: 7724988416
NumAllocs: 38
MaxAllocSize: 3576881152
2019-03-08 16:26:38.981646: W tensorflow/core/common_runtime/bfc_allocator.cc:271] _______*********
2019-03-08 16:26:38.982719: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Resource exhausted: OOM when allocating tensor with shape[2048,436631] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc
Traceback (most recent call last):
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2048,436631] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc
[[{{node save_1/RestoreV2_1}} = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, …, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_1/Const_0_0, save_1/RestoreV2_1/tensor_names, save_1/RestoreV2_1/shape_and_slices)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node save_1/RestoreV2_1/_43}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_48_save_1/RestoreV2_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "./DeepSpeech.py", line 964, in <module>
    tf.app.run(main)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./DeepSpeech.py", line 916, in main
    train()
  File "./DeepSpeech.py", line 549, in train
    config=Config.session_config) as session:
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
    init_fn=self._scaffold.init_fn)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/session_manager.py", line 288, in prepare_session
    config=config)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/session_manager.py", line 218, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1546, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2048,436631] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc
[[node save_1/RestoreV2_1 (defined at ./DeepSpeech.py:549) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, …, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_1/Const_0_0, save_1/RestoreV2_1/tensor_names, save_1/RestoreV2_1/shape_and_slices)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node save_1/RestoreV2_1/_43}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_48_save_1/RestoreV2_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'save_1/RestoreV2_1', defined at:
  File "./DeepSpeech.py", line 964, in <module>
    tf.app.run(main)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./DeepSpeech.py", line 916, in main
    train()
  File "./DeepSpeech.py", line 549, in train
    config=Config.session_config) as session:
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
    self._scaffold.finalize()
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 213, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 886, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1102, in __init__
    self.build()
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 789, in _build_internal
    restore_sequentially, reshape)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 459, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 406, in _AddRestoreOps
    restore_sequentially)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 862, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1466, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2048,436631] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc
[[node save_1/RestoreV2_1 (defined at ./DeepSpeech.py:549) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, …, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_1/Const_0_0, save_1/RestoreV2_1/tensor_names, save_1/RestoreV2_1/shape_and_slices)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node save_1/RestoreV2_1/_43}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_48_save_1/RestoreV2_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
No matter how much I reduce the amount of training and CV data, I always hit this OOM problem; even with a batch size of 1 the error is identical. The machine has 16 GB of RAM, two GTX 1080 Ti GPUs, and an i7 CPU, and no other programs were running. From the log, the allocation that fails is pinned host memory (allocator cuda_host_bfc on /device:CPU:0) while the saver restores a checkpoint tensor of shape [2048,436631], which would explain why changing the batch size makes no difference.
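
As a rough sanity check (assuming the restored tensor is float32, 4 bytes per element), the size of that single tensor matches both the 3.33GiB request and the 3576881152-byte chunks in the allocator dump:

# Back-of-the-envelope check, assuming float32 (4 bytes per element).
rows, cols = 2048, 436631      # shape from the OOM message
size_bytes = rows * cols * 4   # = 3576881152, the chunk size in the dump
print(size_bytes / 2**30)      # ~3.33 GiB

Following the hint in the log, I also plan to re-run with report_tensor_allocations_upon_oom enabled. A minimal sketch of how that option is passed in TF 1.x (assuming a plain session handle named sess and a fetch named train_op, which are my placeholders; DeepSpeech's MonitoredTrainingSession would need the option threaded through its own run call):

import tensorflow as tf

# Ask TensorFlow to list the allocated tensors if an OOM occurs (TF 1.x).
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
sess.run(train_op, options=run_options)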