Hello everyone!
I have access to a server with a V100 GPU, and I'm training a model there with a batch size of 32 for training and 16 for testing. Unfortunately, the GPU is not fully utilized: on average the load is 30-40%, it occasionally spikes to 85-90% for a short time, and then drops back to ~14%. My question: if I raise the batch size to, say, 48 (at 32 it uses 11 GB of GPU RAM), will the GPU load go up (and training speed with it) WITHOUT hurting the final model's quality?
After testing on a single V100 I want to move to distributed training. Should I just use distributed.py instead of train.py with the same config as for single-GPU training? And should I keep the same batch size?
Thanks a lot!