I am interested in building an English ASR system.
I tried Mozilla's pretrained DeepSpeech models on CPU. I converted all audio to mono 16 kHz .wav and used py-webrtcvad to split long recordings into small chunks (as suggested in this forum). This worked very well on the LibriSpeech test-clean corpus and fairly well on conversational audio with US accents, but it fails on UK or Indian accents.
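A minimal sketch of the chunking step described above, assuming 16-bit mono 16 kHz PCM input. The helper names (read_pcm, split_on_silence, make_webrtc_detector) and the max_silence_frames threshold are illustrative assumptions, not part of py-webrtcvad; the only library call relied on is webrtcvad.Vad(mode).is_speech(frame, sample_rate), which accepts 10, 20 or 30 ms frames.

```python
import wave
from typing import Callable, List

SAMPLE_RATE = 16000       # Hz, matches the 16 kHz mono WAV input
FRAME_MS = 30             # WebRTC VAD accepts 10, 20 or 30 ms frames
BYTES_PER_SAMPLE = 2      # 16-bit PCM
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE


def read_pcm(path: str) -> bytes:
    """Read a mono 16-bit 16 kHz WAV file and return its raw PCM bytes."""
    with wave.open(path, "rb") as wf:
        assert wf.getnchannels() == 1, "expected mono audio"
        assert wf.getsampwidth() == BYTES_PER_SAMPLE, "expected 16-bit samples"
        assert wf.getframerate() == SAMPLE_RATE, "expected 16 kHz audio"
        return wf.readframes(wf.getnframes())


def split_on_silence(
    pcm: bytes,
    is_speech: Callable[[bytes, int], bool],
    max_silence_frames: int = 10,   # assumed threshold: ~300 ms of silence
) -> List[bytes]:
    """Cut PCM audio into chunks, closing a chunk after a run of silent frames."""
    chunks: List[bytes] = []
    current = bytearray()
    trailing_silence = 0
    # Walk over complete frames only; a short tail at the end is dropped.
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        if is_speech(frame, SAMPLE_RATE):
            current.extend(frame)
            trailing_silence = 0
        elif current:
            # Keep short pauses inside the chunk so words are not clipped.
            current.extend(frame)
            trailing_silence += 1
            if trailing_silence >= max_silence_frames:
                chunks.append(bytes(current))
                current = bytearray()
                trailing_silence = 0
    if current:
        chunks.append(bytes(current))
    return chunks


def make_webrtc_detector(aggressiveness: int = 2) -> Callable[[bytes, int], bool]:
    """Return py-webrtcvad's speech detector (aggressiveness 0 to 3)."""
    import webrtcvad  # imported lazily so the splitter itself has no dependency
    return webrtcvad.Vad(aggressiveness).is_speech
```

With py-webrtcvad installed, something like split_on_silence(read_pcm("audio.wav"), make_webrtc_detector()) would return a list of PCM chunks to feed to the model one at a time ("audio.wav" is a placeholder path).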
Now I want to train it for UK English accents. I am stuck on the infrastructure required to train on, say, 1000 hours of audio in order to get good accuracy on UK accents.
What infrastructure would I need, in terms of GPUs, to train an ASR system? Is the server config below good enough?
CPU 16 cores
SSD 240 GB
GPU - 4 Titan X Pascal OR 2 GTX 1080 Ti (which one should I get?)
RAM - 32 GB
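For a rough sense of whether the 240 GB SSD is large enough, here is a back-of-the-envelope estimate of the raw audio size. It assumes the 1000 hours are stored as uncompressed 16-bit mono PCM at 16 kHz (the same format as the converted .wav files) and ignores WAV headers, transcripts, extracted features and model checkpoints; the function name raw_audio_gb is just illustrative.

```python
SAMPLE_RATE = 16000      # samples per second (16 kHz)
BYTES_PER_SAMPLE = 2     # 16-bit PCM
CHANNELS = 1             # mono


def raw_audio_gb(hours: float) -> float:
    """Size in decimal gigabytes of uncompressed PCM audio of the given length."""
    total_bytes = hours * 3600 * SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS
    return total_bytes / 1e9


print(f"{raw_audio_gb(1000):.1f} GB")  # prints "115.2 GB"
```

So the raw audio alone would take roughly half of the disk under these assumptions.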
Correct me if I am not making much sense here.