Every time I train my model, there is a final testing step which outputs WER.
During training, "watch -n 0.5 nvidia-smi" shows that my GPU is being used. But during the final testing, only the CPU is used. As a result, testing takes almost as long as the whole training process, which is very painful.
This happens right after the "FINISHED Optimization" message, between "Starting batch…" and "Finished batch step", so I suppose it is not caused by some other CPU-heavy task. What's more, the CPU load is even lower than during GPU training, since only a single thread seems to be used.
I am using the following command:
CUDA_VISIBLE_DEVICES=0 LD_LIBRARY_PATH=native_client/ python -u DeepSpeech.py \
    --log_level 1 \
    --train_files data/train/list.csv \
    --dev_files data/dev/list.csv \
    --test_files data/test/list.csv \
    --checkpoint_dir ${out_dir}/checkpoints \
    --summary_dir ${out_dir}/tensor_board \
    --alphabet_config_path data/alphabet.txt \
    --lm_binary_path data/LE_MONDE_full.utf8.binary_lm \
    --lm_trie_path data/LE_MONDE_full.utf8.deep_speech_trie \
    --use_seq_length False \
    --validation_step 2 \
    --n_hidden 300 \
    --train_batch_size 100 \
    --dev_batch_size 100 \
    --test_batch_size 100 \
    --epoch 40
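In case it matters: my understanding of the CUDA_VISIBLE_DEVICES=0 prefix is that it only restricts which GPUs the CUDA runtime exposes to the process, it does not force any op onto the GPU. A minimal sketch of that interpretation (the `visible_gpus` helper is just mine for illustration, not part of DeepSpeech or TensorFlow):

```python
import os

# For illustration only: set the variable the way the command line above does.
# In the real run it is set by the shell, not in Python.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

def visible_gpus():
    """Return the GPU indices CUDA will expose to the process.

    An empty variable (CUDA_VISIBLE_DEVICES="") would hide all GPUs,
    which is one classic way a job silently falls back to the CPU.
    """
    value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(i) for i in value.split(",") if i.strip().isdigit()]

print(visible_gpus())  # → [0]
```

So the variable itself looks fine here; as far as I can tell, whether the test ops actually land on the GPU is decided by TensorFlow's device placement, not by this prefix.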
I am new to TensorFlow, so I don’t know what could cause this behavior…