In creativity research, one method used to evaluate creative works is to go to a number of experts and have them judge the works (independently of each other) on various criteria, and then look to see whether they agree with each other about the merits (or lack thereof). Interjudge reliability does tend to be pretty high in such studies. I wonder what the reliability would be for judges' scores if we calculated it. Who's up for some SPSS fun, then?
It's very tricky. We are trying to apply statistical methods to qualities rather than to quantities. The question is -- what exactly do we want to measure?
In studies like the ones you describe, yes, we can measure the degree of agreement among the judges. And when we finish our analysis, that is what we have learned: that this panel of judges was in more substantial agreement than was that panel of judges. Or that there was substantial agreement among judges that Kozuka's skating skills were superior to Van der Perren's, but that there was less agreement on the question of whether Weir's transitions were better than Mroz's.
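For anyone who wants to try this outside SPSS, here is a rough sketch of one common agreement statistic, Kendall's coefficient of concordance (W), in plain Python. The two panels and all of their scores are made up for illustration; they are not real competition marks. The sketch also assumes no judge gives two skaters the same score, since it leaves out the usual correction for ties. W runs from 0 (no agreement on the ordering) to 1 (identical orderings).

```python
def ranks(scores):
    """Rank scores from 1 (lowest) to n (highest); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def kendalls_w(panel):
    """Kendall's W for a panel: one list of scores per judge, one score per skater.

    Uses the basic formula W = 12 * S / (m^2 * (n^3 - n)) with no tie correction.
    """
    m = len(panel)        # number of judges
    n = len(panel[0])     # number of skaters
    ranked = [ranks(judge_scores) for judge_scores in panel]
    rank_totals = [sum(judge_ranks[i] for judge_ranks in ranked) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((total - mean_total) ** 2 for total in rank_totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical component scores: three judges (rows) scoring four skaters (columns).
panel_a = [
    [7.25, 6.50, 8.00, 5.75],
    [7.00, 6.75, 8.25, 6.00],
    [7.50, 6.25, 7.75, 5.50],
]
panel_b = [
    [7.25, 6.50, 8.00, 5.75],
    [6.00, 7.75, 6.50, 7.00],
    [7.50, 8.00, 6.25, 6.75],
]

print("Panel A:", round(kendalls_w(panel_a), 3))
print("Panel B:", round(kendalls_w(panel_b), 3))
```

Note what the number does and does not tell us: a higher W for panel A only says that its judges ordered the skaters more consistently with one another than panel B's judges did.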
prettykeys said:
Other things like Skating quality, Transitions, etc. (and of course the Technical aspects) should follow objective guidelines; the subjectivity regarding the scale should be minimized (i.e. the sampling variation/error). It is very much a statistical problem if judges are to generally agree with the scale (e.g. what constitutes exceptional, superior, fair, or poor.)
To me, that is the crux of the matter. Your post raises two issues. First, subjectivity is not at all the same as sampling error. One can mitigate sampling error (statistical “noise”) by expanding the panel of judges, for instance. But that will not address the issue of subjectivity.
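A toy simulation can make the distinction concrete. It rests on a strong assumption that is mine and not part of the discussion above: that "subjectivity" can be modeled as a bias shared by every judge on the panel, while sampling error is independent random noise per judge. The "true" score, the size of the bias, and the noise level are all invented numbers.

```python
import random
from statistics import mean


def panel_mean(n_judges, true_score, shared_bias, noise_sd, rng):
    """Average mark from a panel whose judges all share one bias plus individual noise."""
    return mean(
        true_score + shared_bias + rng.gauss(0, noise_sd)
        for _ in range(n_judges)
    )


# All values below are hypothetical.
TRUE_SCORE = 7.0    # the score an ideal, criterion-perfect judge would give
SHARED_BIAS = 0.5   # assumed systematic lean common to the whole panel
NOISE_SD = 0.5      # assumed judge-to-judge random variation

rng = random.Random(0)  # fixed seed so the run is repeatable
for n_judges in (5, 50, 5000):
    average = panel_mean(n_judges, TRUE_SCORE, SHARED_BIAS, NOISE_SD, rng)
    print(n_judges, "judges: panel mean minus true score =", round(average - TRUE_SCORE, 3))
```

Under these assumptions, the gap between the panel mean and the true score settles near the shared bias as the panel grows, instead of shrinking toward zero. More judges average away the noise, but not the lean they have in common.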
The second problem is this. Although there are well-established statistical methods for measuring the degree of agreement among judges, I do not see any reliable and robust statistical method for deciding whether the judges are in agreement with objective criteria. The only thing I can think of is to ask some super judges what they think of the judging, then ask some super-duper judges what they think of the super judges' evaluation of the judges. This kind of expert oversight, however, has nothing to do with statistics, that I can see.