Dataset split best practices?

@kdavis: If I have 1000h of data. What should be my ideal Train, Dev and Test size?

@agarwalaashish20 I’m not @kdavis, but the standard for ML is (80/10/10) % for (train/eval/test) for most datasets.

The standard, unfortunately, is wrong almost all of the time.

@agarwalaashish20 The answer depends on how many clips you have.

Basically, if you have N clips and you want T clips in the training set, V clips in the validation set, and E clips in the test set, then you have the obvious constraint T + V + E <= N, and you want T, V, and E to be as large as possible. In addition, you want V and E to be “statistically significant sample sizes” relative to T.

Concretely, you’d define a “statistically significant sample size” using something like a sample-size calculator with a confidence level of 99% and a margin of error of 1%. This would mean, for example, that a T of 1000000 would require that V and E both be 16369.
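For reference, here’s a small sketch of what such a calculator computes: the standard sample-size formula with a finite-population correction. The z value of 2.58 (for 99% confidence) and p = 0.5 (the most conservative assumed proportion) are the conventions common calculators use; they are assumptions here, not something specified in the thread.

```python
import math

def sample_size(population, z=2.58, margin=0.01, p=0.5):
    """Required sample size at a given confidence level and margin of error.

    z=2.58 corresponds to a 99% confidence level, margin=0.01 to a 1%
    margin of error, and p=0.5 is the most conservative proportion.
    """
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite-population correction, rounded up to a whole clip count.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(1_000_000))  # -> 16369, matching the figure above
```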

The reason for doing this is to ensure that, for example, the WER calculated on V or E closely tracks, within the 99% confidence level and 1% margin of error, the WER calculated on T, as both V and E are “statistically significant sample sizes” relative to T.
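Putting the whole procedure together: the required V and E depend on T, but T depends on V and E (via T = N - V - E), so one way to pick the split is to iterate to a fixed point. This is a sketch under the same assumed conventions (z = 2.58 for 99% confidence, 1% margin of error, p = 0.5), and it assumes N is large enough that the split is feasible; the function name and iteration scheme are mine, not from the thread.

```python
import math

def split_sizes(n_clips, z=2.58, margin=0.01):
    """Pick (T, V, E) so V = E is a statistically significant sample
    size relative to T (99% confidence, 1% margin of error).

    V and E depend on T, and T = n_clips - V - E depends on them,
    so iterate until the sizes stop changing.
    """
    t = n_clips
    for _ in range(100):  # cap iterations as a safety net
        n0 = z ** 2 * 0.25 / margin ** 2       # infinite-population size
        v = math.ceil(n0 / (1 + (n0 - 1) / t)) # finite-population correction
        new_t = n_clips - 2 * v                # V and E each take v clips
        if new_t <= 0:
            raise ValueError("too few clips for this confidence/margin")
        if new_t == t:
            return t, v, v
        t = new_t
    raise RuntimeError("did not converge")

print(split_sizes(1_000_000))  # e.g. (967280, 16360, 16360)
```

Note the resulting split is nowhere near 80/10/10: with a million clips, roughly 96.7% go to training and about 1.6% each to validation and test, which is the point of the posts above.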