I'm not educated in statistics, not even a little bit. So I'm just asking questions.
It seems to me that the "measurements" that judges make -- whether on a 10-point scale or a 6/7-point scale, whether of the whole program or of individual elements or individual aspects ("components") of the whole program -- are most comparable to measurements on a visual analog scale or similar rating system, where individuals are asked to rate perceptions, etc., on a scale of 1-10 or some other range. Whether these ratings are forced into discrete steps or not would depend on the rating mechanism.
Often those scales are used for measuring perceptions that are internal to the person doing the perceiving, e.g., pain. In that case, each person would be providing numbers related to their own individual object of perception. Joe can only rate Joe's own pain and Sally can only rate Sally's own pain. So differences in the numbers they each produce would vary based not only on the accuracy of their perceptions and on how they individually use the scale to translate perceptions into numbers, but also on variations between what they are each perceiving.
But it can still be useful for investigators to find an average level of pain perceived by subjects in a study under specific conditions. So how do they work with the rating numbers to produce usable averages? Would medians or means be more appropriate?
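To make the mean-vs-median question concrete, here is a minimal sketch in Python using invented 0-10 pain ratings (nothing here comes from real study data). The mean uses every rating but gets pulled by extreme raters; the median ignores how extreme an outlier is; a trimmed mean that drops the single highest and lowest rating (roughly the "drop high and low" rule used in some judged sports) sits in between:

```python
import statistics

# Invented 0-10 pain ratings from nine subjects; the 10 is one
# extreme rater who uses the top of the scale more readily.
ratings = [3, 4, 4, 5, 5, 5, 6, 6, 10]

mean = statistics.mean(ratings)      # pulled upward by the 10
median = statistics.median(ratings)  # ignores how extreme the 10 is

# Trimmed mean: drop the single highest and lowest rating, then average.
trimmed = statistics.mean(sorted(ratings)[1:-1])

print(f"mean={mean:.2f}, median={median}, trimmed={trimmed:.2f}")
# mean=5.33, median=5, trimmed=5.00
```

Which of these is "more appropriate" depends on whether extreme ratings are treated as real information about the perception or as quirks in how a rater uses the scale, to be discounted.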
Such scales could also be used in studies such as market research, where people are asked to evaluate an external object based on their own perceptions and preferences. Depending on what's being evaluated, there might be some degree of expertise involved and required of the evaluators, or it could be purely a matter of personal preference.
With judges evaluating skating, all the judges are evaluating the same object of perception, and there is expertise expected in being able to recognize and identify levels of technical skill and adherence to criteria. But the numbers they come up with will still vary based on the accuracy of their perceptions and on how they individually use the scale to translate perceptions into numbers. There isn't a single true number that represents the true objective measurement of a fixed parameter, such as the length of a rod (to use an example Mathman invoked a few months ago).
At best there will be a consensus as to the appropriate number that the ratings of trained experts will converge on. Would that be considered the "true" score for a skating program or aspect of a program?
Edited to add, my point is that I don't think this assumption is true:
(a) There is a correct mark, independent of our efforts to measure it.
There isn't any direct association between a given level of skill and the number 7.25 other than a consensus developed by the larger pool of trained judges of which any given judging panel is a subset.
It might be possible to define objective benchmarks for 7.0 and 8.0, for example. Maybe even 7.0 and 7.5. But for all actual examples that fall somewhere in between those benchmarks, it will still be up to the individual discretion of each judge to determine whether that intermediate example is best represented by the number 7.0, 7.25, or 7.5.
For GOEs, there are much clearer benchmarks already established. And very often the GOEs for a given element are unanimous, much more often than PCS. But even so, there is often a fluctuation between, say, 0 and +1 or 0 and -1, or sometimes between -1 and +1, as different judges perceive a completed element as slightly better or worse than the norm and draw the line at different points as to when to raise or lower the GOE.
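That fluctuation can be pictured with a toy model (every number here is invented, not calibrated to any real judging data): suppose an element is truly a bit better than the norm, each judge perceives that quality with some noise, and each judge has a slightly different cutoff for when "a bit better" deserves +1:

```python
import random

random.seed(1)

TRUE_QUALITY = 0.3  # element is slightly better than the norm (0 = norm)

def judge_goe(quality):
    """One judge's GOE: noisy perception plus a judge-specific cutoff.

    Both the noise level and the cutoff range are invented for
    illustration, not calibrated to real judging.
    """
    perceived = quality + random.gauss(0, 0.2)  # perceptual noise
    cutoff = random.uniform(0.1, 0.5)           # where this judge draws the line
    if perceived > cutoff:
        return +1
    if perceived < -cutoff:
        return -1
    return 0

panel = [judge_goe(TRUE_QUALITY) for _ in range(9)]
print(panel)  # typically a mix of 0s and +1s, occasionally a -1
```

No judge is wrong in this picture; the split between 0 and +1 falls directly out of honest perceptual noise plus honest differences in where each judge draws the line.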
Obviously, the more judges contributing data to the averaging process, the more "accurate" the results would be (which is exactly the reason for the concern over using fewer judges that this thread started with). But there's no independent measurement of the numerical value of a skating program or element aside from a consensus of experts -- there's no independent way to confirm whether any panel of judges got the right answer or not.
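The standard statistical intuition behind that is: under the (strong) assumption that each judge's mark is the consensus value plus independent noise, the scatter of the panel average shrinks like one over the square root of the panel size. A quick simulation, with all parameters invented:

```python
import random
import statistics

random.seed(0)

CONSENSUS = 7.25  # the mark the full expert pool would converge on
NOISE_SD = 0.5    # per-judge scatter around it; invented for illustration
TRIALS = 10_000

def panel_mean(n_judges):
    """Average mark from one simulated panel of n_judges."""
    return statistics.mean(
        CONSENSUS + random.gauss(0, NOISE_SD) for _ in range(n_judges)
    )

for n in (3, 5, 9, 15):
    results = [panel_mean(n) for _ in range(TRIALS)]
    print(f"{n:2d} judges: panel averages scatter with sd ~ "
          f"{statistics.stdev(results):.3f}")
# Scatter falls roughly as NOISE_SD / sqrt(n): ~0.29, 0.22, 0.17, 0.13
```

The caveat is the one just made: this measures convergence toward the consensus built into the model, not toward any independently verifiable right answer, and it ignores the fact that real judges' errors are not independent of each other.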
Given that this is the case, what is the best statistical method for crunching the numbers that a judging panel comes up with?
Is using larger panels the best, or only, way to ensure more "accurate" results?
I think randomly selecting some judges' scores not to count will always hurt statistical accuracy. The justification for random selection is that it enhances judges' ability to avoid outside political influences on their judging process.
My question is whether it really does have a positive effect on that ability. If yes, then it's worth keeping for reasons extrinsic to the statistical process. If no, then it's a worthless provision.
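On the pure statistics side, the claim that random selection always hurts can be checked with the same kind of toy model (again, all numbers invented): have 12 judges mark, then count either all 12 or a random 9, and compare how much the counted average scatters:

```python
import random
import statistics

random.seed(0)

CONSENSUS = 7.25  # invented "true" consensus mark
NOISE_SD = 0.5    # invented per-judge scatter
PANEL = 12        # judges who actually mark
COUNTED = 9       # marks randomly selected to count
TRIALS = 10_000

def trial(counted):
    """Counted average when `counted` of the PANEL marks are kept at random."""
    marks = [CONSENSUS + random.gauss(0, NOISE_SD) for _ in range(PANEL)]
    return statistics.mean(random.sample(marks, counted))

full = [trial(PANEL) for _ in range(TRIALS)]
subset = [trial(COUNTED) for _ in range(TRIALS)]

print(f"all {PANEL} counted:     sd ~ {statistics.stdev(full):.3f}")
print(f"random {COUNTED} of {PANEL} counted: sd ~ {statistics.stdev(subset):.3f}")
# Under this model the random drop only adds scatter (~0.17 vs ~0.14);
# honest, independent judges gain nothing from it.
```

So within the statistics alone, the drop is a pure loss; whether it pays for itself by deterring deal-making is exactly the extrinsic question.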