Help with hardware for training with DS

With less than 1,000 hours of data, will two GPUs produce a better model than one GPU?

I have two GPUs available: a GTX 1080 and a GTX 1080 Ti. Will the 1080's 8 GB of memory limit the model I can build compared to using just the Ti?

My preference is to build a simpler machine with a smaller case and motherboard, so that if I need to move overseas I can take it with me more easily than a normal ATX build that can fit multiple GPUs. Any guidance would be appreciated.

The number of GPUs has no impact on the quality of the model, only on training time.

That’s more of a TensorFlow question, and I don’t have an answer to that.
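To illustrate the first point (why GPU count affects speed but not quality): multi-GPU training in TensorFlow is typically synchronous data parallelism, where each GPU computes gradients on its own slice of the batch and the gradients are averaged. That averaged update is mathematically identical to one GPU processing the whole batch. Here is a minimal NumPy sketch with a toy linear model (illustrative only, not DeepSpeech code):

```python
import numpy as np

# Toy linear model: loss = mean((X @ w - y)**2).
# Synchronous data parallelism averages per-GPU gradients, which is
# the same update as one GPU seeing the whole batch, so the trained
# model (and hence its quality) is unchanged.

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # one "global" batch of 8 samples
y = rng.normal(size=8)
w = np.zeros(3)

def grad(Xb, yb, w):
    # Gradient of mean squared error with respect to w.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Single GPU: gradient over the full batch.
g_single = grad(X, y, w)

# Two "GPUs": split the batch, compute gradients independently, average.
g0 = grad(X[:4], y[:4], w)
g1 = grad(X[4:], y[4:], w)
g_multi = (g0 + g1) / 2

print(np.allclose(g_single, g_multi))  # True
```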


Thank you, lissyx. You guys have done a great job optimising training; it's already super fast with my data, but I wish I knew how much relative improvement in training time I'd see with multiple GPUs versus just one.

Your question above was about quality of the model, not training time.

Assuming you have enough data and no other limiting factors (CPU, RAM, disks), adding equivalent GPUs should reduce training time roughly proportionally; i.e., two 1080 Tis will finish in more or less half the time of one 1080 Ti.
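As a rough way to think about that scaling, here is a sketch under assumptions; the 10% synchronisation overhead below is an illustrative placeholder, not a measured figure:

```python
# Hypothetical helper: ideal (near-linear) multi-GPU scaling estimate.
# Real speedup is usually somewhat lower due to gradient sync overhead.
def estimated_time(single_gpu_hours: float, n_gpus: int,
                   overhead: float = 0.1) -> float:
    """Estimated wall-clock hours with n identical GPUs.

    overhead: assumed fraction of time lost to synchronisation.
    """
    return single_gpu_hours / n_gpus * (1 + overhead)

print(estimated_time(12.0, 2))  # ~6.6 h, versus 12 h on one GPU
```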

Yes, when I made the thread I was mostly concerned with quality, but based on your first reply I saw the tradeoff was mostly about efficiency, so I asked about the relative benefit in training time. Thank you for your help. I'll try to make a decision based on that information.

It would be nice to find published information on training time, e.g., with one 1080, 100 epochs took this many seconds; with a 1080 Ti, it took this many seconds; and then the same for multiple GPUs. I'll see what Google turns up, but if someone wants to share a rough ballpark for, say, 100 or 1,000 hours of data, that would be wonderful.

Except this is dependent on many parameters: GPUs, memory subsystem, width of the model, number of epochs, amount of data. At some point we did share that, and people kept asking "OK, you give this figure, but what about my case?", so there was no real value in it. For reference: 2× 2080 Ti on a Threadripper 1950X, with 650 h of French, for 15-20 epochs, takes ~6 h.

Maybe this helps:

https://discourse.mozilla.org/t/relation-of-v100-cpu-memory-for-a-training-vm/48177
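For a back-of-envelope feel, the ballpark quoted above (2× 2080 Ti, 650 h of French, 15-20 epochs in ~6 h) works out to roughly 21 minutes per epoch, i.e., throughput of nearly 1,900× realtime. A tiny Python check of that arithmetic:

```python
# Back-of-envelope from the quoted ballpark (2x 2080 Ti, 650 h of
# French audio, 15-20 epochs in ~6 h total). Illustrative only.
hours_of_audio = 650
epochs = (15 + 20) / 2          # midpoint of the quoted range
total_hours = 6

per_epoch_min = total_hours * 60 / epochs
realtime_factor = hours_of_audio * epochs / total_hours

print(f"~{per_epoch_min:.0f} min per epoch")           # ~21 min
print(f"~{realtime_factor:.0f}x realtime throughput")  # ~1896x
```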