@kdavis: If I have 1000h of data, what should be my ideal Train, Dev and Test size?
@agarwalaashish20 I’m not @kdavis, but the standard for ML is (80/10/10) % for (train/eval/test) for most datasets.
The standard, unfortunately, is wrong almost all of the time.
@agarwalaashish20 The answer depends on how many clips you have.
Basically, if you have N clips and you want T clips in the training set, V clips in the validation set, and E clips in the test set, then you first have the obvious constraint T + V + E <= N, where you try to have T, V, and E as large as possible. In addition, you want V and E to be “statistically significant sample sizes” relative to T.
Concretely, you’d define a “statistically significant sample size” using something like the sample size calculator with a confidence level of 99% and a margin of error of 1%. This would mean, for example, that a T of 1,000,000 would require that V and E both be 16,369.
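For reference, here’s a minimal sketch of that calculation in Python, assuming Cochran’s formula with a finite-population correction and z ≈ 2.58 for 99% confidence (the `sample_size` name is mine, not part of any library):

```python
import math

def sample_size(population, z=2.58, margin=0.01, p=0.5):
    """Cochran's sample size formula with finite-population correction.

    z=2.58 approximates a 99% confidence level, margin is the margin
    of error, and p=0.5 is the most conservative proportion estimate.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(1_000_000))  # -> 16369, matching the figure above
```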
The reason for doing this is to ensure that, for example, the WER calculated using T would closely track, as defined by the confidence level of 99% and margin of error of 1%, the WER calculated using V or E, as both V and E are “statistically significant sample sizes” relative to T.
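Putting the two constraints together, one way to turn this into a concrete split is to find the largest T whose required V and E still fit within N. This is just a sketch of that idea, reusing the hypothetical `sample_size` helper above (`split_sizes` is likewise my name, not anything from an existing toolkit):

```python
def split_sizes(total_clips):
    """Largest train size T such that T + 2 * sample_size(T) <= total_clips,
    with V = E = sample_size(T)."""
    t = total_clips
    while t > 1 and t + 2 * sample_size(t) > total_clips:
        t -= 1
    v = e = sample_size(t)
    return t, v, e

print(split_sizes(1_000_000))  # roughly T ~ 967k, V = E ~ 16.4k
```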