Training cluster

Dear all,

I’ve been reading that the Mozilla training cluster has two towers with 8 Titan X GPUs each.
Is there any reference available on the hardware setup of the cluster?
Did you buy it off the shelf or configure it on your own?

Best regards,
daniel

We had Exxact Corp put together the raw machines, then configured them ourselves after that.

We’ve got a head node with 100TB of disk where all our audio sits.

This is connected to 4 worker nodes via 10Gb links through a dedicated switch.

Each of the worker nodes has 8 Titan X Pascal GPUs, 128GB of RAM, and some minimal local disk space for the OS and the like.
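If you're wondering how the workers see the audio on the head node: the post above doesn't spell that out, so the sketch below is purely an assumption of an NFS-style shared mount that would fit this topology, with made-up hostnames and paths.

```
# On the head node: export the audio volume (illustrative /etc/exports entry):
# /data  10.0.0.0/24(ro,no_subtree_check)

# On each worker: mount the share over the 10Gb network
sudo mount -t nfs head-node:/data /data

# Every worker then reads training audio from the same path
ls /data/clips
```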

Jobs are scheduled via snakepit and snakepit-client.
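To give a flavour of job submission: snakepit-client provides the pit command. The sketch below is illustrative only; the cluster request syntax, the resource name, and the .compute contents are assumptions from memory, so check the snakepit and snakepit-client READMEs for the real usage.

```
# The project root carries a .compute script that tells snakepit what to run
# on the assigned worker, e.g. (hypothetical contents):
#   python -u DeepSpeech.py --train_files "$TRAINING_DATA/train.csv" ...

# Submit the current project as a job, requesting 8 GPUs on one worker
# ("titanx" is a placeholder resource name; each cluster defines its own)
pit run "deepspeech-train" [8:titanx]

# List jobs and their current states
pit status
```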

If you have more questions, feel free to ask. This was just the tl;dr version.


Thank you for the quick response!