4-Speaker Dataset - Training a Context-Based Speech Recognition Model

Training a speech recognition model on a dataset of context-specific phrases with a limited vocabulary.

4 speakers: UK Male, US Male, UK Female, US Female.
800 hours total, ~360,000 8-second phrases.
200 hours/speaker, ~90,000 8-second phrases.
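(Sanity check: 800 h × 3,600 s/h ÷ 8 s = 360,000 clips total, i.e. 90,000 per speaker.)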

Does anyone have any idea how well DeepSpeech would train on this dataset for accurate recognition of US and UK speakers within the context scope?
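For reference, DeepSpeech trains from CSV manifests with `wav_filename`, `wav_filesize`, and `transcript` columns, so the prep I have in mind is roughly the sketch below (the side-by-side `.txt` transcript layout and paths are just assumptions about my own data):

```python
import csv
import pathlib

def write_manifest(clip_dir: str, out_csv: str) -> None:
    """Build a DeepSpeech training CSV from a folder of WAV clips,
    assuming each clip has a matching .txt transcript beside it."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav in sorted(pathlib.Path(clip_dir).glob("*.wav")):
            transcript = wav.with_suffix(".txt").read_text().strip()
            writer.writerow([str(wav), wav.stat().st_size, transcript])

write_manifest("clips/uk_male", "train_uk_male.csv")
```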

I'm also curious about the effect of splitting each speaker's clips into batches and shifting the tone/pitch of the audio within each batch to simulate more speakers (see the sketch after the breakdown below).

4 batches per speaker, each with a different pitch shift:
16 ‘speakers’ @ 50 hours/speaker.
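A rough sketch of that batching idea, assuming 16 kHz mono WAV clips and using librosa for the pitch shift (the directory names and semitone steps are placeholders I made up, not a recommendation):

```python
import pathlib
import librosa
import soundfile as sf

SR = 16_000                       # DeepSpeech expects 16 kHz mono audio
PITCH_STEPS = [-2, -1, 1, 2]      # hypothetical semitone shifts, one per batch

def pitch_batch_speaker(src_dir: str, dst_dir: str) -> None:
    """Split one speaker's clips round-robin into len(PITCH_STEPS) batches,
    pitch-shifting each batch by a different amount to fake a new voice."""
    clips = sorted(pathlib.Path(src_dir).glob("*.wav"))
    for i, wav in enumerate(clips):
        step = PITCH_STEPS[i % len(PITCH_STEPS)]   # batch assignment
        out_dir = pathlib.Path(dst_dir) / f"pitch{step:+d}"
        out_dir.mkdir(parents=True, exist_ok=True)
        y, _ = librosa.load(wav, sr=SR)
        shifted = librosa.effects.pitch_shift(y, sr=SR, n_steps=step)
        sf.write(out_dir / wav.name, shifted, SR)

pitch_batch_speaker("clips/uk_male", "augmented/uk_male")
```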

What do you think would work best, if at all?

Thanks!