Training a new model with wav files of different lengths

I am trying to train a new model. My data set contains two kinds of wav files: some contain just a single spoken word and are about 1 second long, and the others are 3-4 seconds long.

My first question is about batch size. My data set is very small. I know that a larger batch size speeds up training, but if I use a smaller batch size than the ones used for the release models, will my model learn better?

My second question is whether I should mix wav files of different durations (~1 second and 3-4 seconds) at all. My main goal is for the model to recognize one particular word, and the 1-second files all contain that same word spoken by different speakers.
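
To make the two groups of clips concrete, the files could be separated by duration with something like the sketch below. It uses only Python's standard `wave` module and assumes uncompressed PCM wav files. The function names and the 2-second threshold are assumptions chosen for illustration and are not part of any training toolkit.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM wav file in seconds."""
    with wave.open(path, "rb") as wf:
        # Duration = number of audio frames / frames per second.
        return wf.getnframes() / wf.getframerate()


def split_by_duration(paths, threshold_seconds=2.0):
    """Split paths into (short, long) lists around an assumed threshold."""
    short_clips, long_clips = [], []
    for path in paths:
        if wav_duration_seconds(path) < threshold_seconds:
            short_clips.append(path)
        else:
            long_clips.append(path)
    return short_clips, long_clips
```

Calling `split_by_duration` on the list of training files would return the ~1 second single-word clips and the 3-4 second clips as two separate lists, which makes it easy to check how many of each kind the data set actually has.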