Quote Originally Posted by Mathman
To me, that is the crux of the matter. Your post raises two issues. First, subjectivity is not at all the same as sampling error. One can mitigate sampling error (statistical “noise”) by expanding the panel of judges, for instance. But that will not address the issue of subjectivity.

The second problem is this. Although there are well-established statistical methods for measuring the degree of agreement among judges, I do not see any reliable and robust statistical method for deciding whether the judges are in agreement with objective criteria. The only thing I can think of is to ask some super judges what they think of the judging, then ask some super-duper judges what they think of the super judges' evaluation of the judges. This kind of expert oversight, however, has nothing to do with statistics, that I can see.
Ahh...I see what you are saying. One can measure degrees of precision (agreement between judges) but not accuracy (how well the judging fits what is factually objective), because we cannot strictly determine the latter.
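
To make that precision-vs.-accuracy distinction concrete, here is a rough Python sketch (the "true" score, the shared bias, and the noise level are all invented numbers, purely for illustration). The panel's marks cluster tightly, so by the agreement-between-judges yardstick everything looks fine, yet the whole panel sits well away from the hypothetical true value:

Code:
import numpy as np

rng = np.random.default_rng(0)

true_score = 8.0     # hypothetical "objective" value (unknowable in practice)
shared_bias = -0.75  # systematic offset common to the whole panel (assumed)
noise_sd = 0.10      # small random disagreement between judges (assumed)

marks = true_score + shared_bias + rng.normal(0.0, noise_sd, size=9)

print("panel marks:       ", np.round(marks, 2))
print("spread (precision):", round(marks.std(ddof=1), 3))          # small: judges agree
print("error (accuracy):  ", round(marks.mean() - true_score, 3))  # large: panel is off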

However, I still maintain that in a judged discipline/sport, the agreement between experts is the best that we can go by, and for all intents and purposes it can be considered the "objective" standard. If one posits a theoretical "population mean of expert judgements" (or your example of super-duper judges) and sets that as the objective standard, then in theory it follows that a larger judging panel can mitigate individual judges' biases (the ultimate example of subjectivity) better than a smaller panel can.
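
As a rough sketch of that "a larger panel averages out individual quirks" point (the bias and noise sizes below are assumptions, not real data): give every judge a personal random offset around a hypothetical population-mean expert score, and compare how far the panel average wanders for a 3-judge panel versus a 9-judge panel. The typical error of the panel mean shrinks roughly like one over the square root of the panel size; note, though, that a bias shared by the whole panel would not shrink at all.

Code:
import numpy as np

rng = np.random.default_rng(1)

true_score = 8.0   # hypothetical population-mean expert score
bias_sd = 0.5      # size of each judge's personal bias (assumed)
noise_sd = 0.2     # mark-to-mark randomness per judge (assumed)
trials = 20_000    # number of simulated panels

for panel_size in (3, 9):
    # each trial: draw a fresh panel, give every judge a personal bias, average the marks
    biases = rng.normal(0.0, bias_sd, size=(trials, panel_size))
    noise = rng.normal(0.0, noise_sd, size=(trials, panel_size))
    panel_means = (true_score + biases + noise).mean(axis=1)
    print(f"panel of {panel_size}: typical error of the panel mean = "
          f"{panel_means.std(ddof=1):.3f}")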

Thus, to me, a statistical problem remains (and so I believe it is important not to let the panel of judges shrink further, to, say, 3). How can it not be so? You can apply statistical analyses to graded criteria, even if subjectivity has a role in defining part of them.
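
For example, one standard statistic for Mathman's "degree of agreement among judges" that applies directly to graded criteria is Kendall's coefficient of concordance, W. A minimal sketch, on an invented judges-by-skaters score matrix (the numbers are purely illustrative, not real protocols):

Code:
import numpy as np
from scipy.stats import rankdata

# rows = judges, columns = skaters; invented component-style scores
scores = np.array([
    [9.00, 8.25, 7.50, 8.75, 6.75],
    [8.75, 8.50, 7.25, 9.00, 7.00],
    [9.25, 8.00, 7.75, 8.50, 6.50],
])

m, n = scores.shape                              # m judges, n skaters
ranks = np.apply_along_axis(rankdata, 1, scores) # each judge's implied ranking
rank_sums = ranks.sum(axis=0)
s = ((rank_sums - rank_sums.mean()) ** 2).sum()

# Kendall's W: 1 = perfect agreement among judges, 0 = no agreement (no-ties formula)
w = 12 * s / (m**2 * (n**3 - n))
print(f"Kendall's W = {w:.3f}")

A statistic like this says nothing about whether the panel is right, only about how consistently its members grade the same performances, which is exactly the precision/accuracy split above.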