# Thread: Olympic judging changes (5-judge results)

1.
Originally Posted by Mathman
Actually, I was thinking of the extreme case. Suppose you sat 9 judges, "counted" all nine, then took the median. This reduces the magic number from 5 all the way down to 1.
I don't agree. All nine marks are being used to determine the median in your example. If eight of the nine marks were randomly dropped, the one remaining mark would often be significantly different from the median of the nine. If all nine marks are different, the one chosen mark would differ from the true median 8 of 9 times. And in all cases, the median (for PCs) could differ from the mean of the distribution by up to 0.25 points, while the standard deviation of the mean for marks is typically 0.13 points.
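That 8-of-9 claim is easy to check with a quick simulation (the marks below are hypothetical, and the draw is assumed to be uniformly random):

```python
import random
import statistics

random.seed(0)

# Nine hypothetical (all distinct) PCS marks from a full panel
marks = [7.00, 7.25, 7.50, 7.75, 8.00, 8.25, 8.50, 8.75, 9.00]
true_median = statistics.median(marks)

# Randomly drop eight of the nine marks; how often does the one
# surviving mark differ from the median of all nine?
trials = 10_000
differs = sum(random.choice(marks) != true_median for _ in range(trials))
print(f"median of all nine marks: {true_median}")
print(f"surviving mark differs from it in {differs / trials:.0%} of trials")
```

With nine distinct marks, the surviving mark can only equal the median when the median itself happens to survive, i.e. 1 time in 9.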

The benefit of using a median also depends on whether you are trying to filter out random noise, systematic errors, or outliers, and on the number of samples in the distribution.

To expand on an earlier comment: designing the system around an obsession with the occasional deal-making judge is counterproductive, because the impact of day-to-day national bias, incompetence, and random differences of opinion gets ignored -- and those problems are more common (than deal making) and generally more important.

The system has to be designed to deal with all potential sources of error, not just the one that is most popular to discuss.

But in general, more judges are always better than fewer. I can't think of any error source for which having more judges is a liability. Worst case, at some point adding more judges stops improving the reliability of the results. But we are so far from having enough judges to reliably decide scores to 0.01 points that having too many judges will never be an issue.

2.
It still boils down to the 5-4 split. Want to bet the five judges that stay on are the ones that count?

No, I don't believe they will ever judge totally fairly. They will always judge by benefit of the doubt, money, and politics.

3.
I still think it is a little bit misleading to focus our mathematical wrath on the number five. It seems to me that the statistical culprit is the random draw that reduces the panel from 9 to 7. The trimming of the mean from 7 to 5 seems statistically benign to me.

In other words, I would expect the trimmed mean of random samples of size 7 to behave more like the untrimmed mean of samples of size 7 than like the untrimmed mean of samples of size 5.
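A small Monte Carlo sketch supports this expectation (assuming, purely for illustration, that judges' marks are roughly normal around 7.5 with a spread of 0.25):

```python
import random
import statistics

random.seed(1)

def trimmed_mean(xs):
    """Drop the single high and low mark, then average (the IJS-style trim)."""
    s = sorted(xs)
    return statistics.fmean(s[1:-1])

def sample_marks(n, mu=7.5, sigma=0.25):
    """Hypothetical panel: n judges' marks, roughly normal."""
    return [random.gauss(mu, sigma) for _ in range(n)]

# Spread (standard deviation) of each statistic over many simulated panels
trials = 20_000
sd_trim7 = statistics.pstdev([trimmed_mean(sample_marks(7)) for _ in range(trials)])
sd_mean7 = statistics.pstdev([statistics.fmean(sample_marks(7)) for _ in range(trials)])
sd_mean5 = statistics.pstdev([statistics.fmean(sample_marks(5)) for _ in range(trials)])
print(f"trimmed mean of 7: {sd_trim7:.4f}")
print(f"plain mean of 7:   {sd_mean7:.4f}")
print(f"plain mean of 5:   {sd_mean5:.4f}")
```

Under these assumptions the trimmed mean of 7 shows a spread only slightly above the untrimmed mean of 7, and noticeably below the untrimmed mean of 5.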

Originally Posted by gsrossano
I don't agree. All nine marks are being used to determine the median in your example.
By the same token, in the trimmed mean all seven marks play a role.

And in all cases, the median (for PCs) could differ from the mean of the distribution by up to 0.25 points, while the standard deviation of the mean for marks is typically 0.13 points.
I think this is the crux of the matter. When the median and the mean are different, which one is "right?" Are we judging quality or are we measuring quantity?

The median is our best shot at addressing the question, "what is the judgment of the most typical judge?"

In contrast, the branch of statistics that looks at means, standard deviations, etc., rests on a number of assumptions that I do not think are satisfied in the case of figure skating judging. The most important is the assumption that there is an objective quantifiable thing, external to and independent of our methods of measurement, that is "out there" waiting to be measured.

In the case of judging GOEs and PCSs, if we take the marks of each judge, add them up, and divide by n, then we have...what? At best we would have an estimate of what we would get if we added up the scores of all judges in the ISU judges' pool and divided by the number of such judges. Again, is it really the mean of all these numbers that we should be interested in estimating?

4.
Originally Posted by Mathman
In other words, I would expect the trimmed mean of random samples of size 7 to behave more like the untrimmed mean of samples of size 7 than like the untrimmed mean of samples of size 5.
Pardon me for quoting myself, but after I wrote that sentence I got to wondering whether it was really true. I had to look it up (I am far from being an expert on this subject).

Here is the formula (under a few mild conditions, but not assuming normality) for the standard error of the trimmed mean:

S.E. = s_w / [(1 - 2g) * sqrt(n)]

where s_w is the Winsorized standard deviation, g is the fraction of the data that you trim off each end, and n is the full sample size before trimming (the statistic follows the t distribution with n(1-2g) - 1 degrees of freedom).
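As a sketch of how the formula is applied (with a made-up panel of nine marks, trimming one mark from each end so g = 1/9): Winsorizing replaces each trimmed value with the nearest value that was kept, and then the formula is evaluated directly.

```python
import math
import statistics

# Hypothetical panel of n = 9 marks; trim g = 1/9 off each end
marks = [6.75, 7.00, 7.25, 7.25, 7.50, 7.50, 7.75, 8.00, 8.75]
n = len(marks)
g = 1 / n
k = round(g * n)   # number of marks clipped at each end (here 1)

# Winsorize: replace each trimmed value with the nearest kept value
s = sorted(marks)
winsorized = [s[k]] * k + s[k:n - k] + [s[n - k - 1]] * k

s_w = statistics.stdev(winsorized)           # Winsorized standard deviation
se = s_w / ((1 - 2 * g) * math.sqrt(n))      # S.E. of the trimmed mean
print(f"Winsorized s.d. = {s_w:.4f}, S.E. of trimmed mean = {se:.4f}")
```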

5.
Originally Posted by Mathman
The most important is the assumption that there is an objective quantifiable thing, external to and independent of our methods of measurement, that is "out there" waiting to be measured.
Hm!

6.
Originally Posted by Mathman
In contrast, the branch of statistics that looks at means, standard deviations, etc., rests on a number of assumptions that I do not think are satisfied in the case of figure skating judging. The most important is the assumption that there is an objective quantifiable thing, external to and independent of our methods of measurement, that is "out there" waiting to be measured.
Could not disagree more. Further, in staking out that position you are saying that there is no point in discussing the mathematics of the scores, since a skating program cannot be measured; and that there is no way to combine the evaluations of the judges, since that would involve means or medians or some such things, the laws of which, in your view, do not apply to skating marks.

If skating performances cannot be measured (and you are not the only one I have heard take that point of view) then there is no point in holding a competition, and all we should have are shows and festivals.

So rather than trying to fix/change/improve the judging system, let's save everyone a lot of grief and just end competitions and limit skating to being a source of entertainment.

7.
Originally Posted by gsrossano
Further, in staking out that position you are saying that there is no point in discussing the mathematics of the scores since a skating program cannot be measured;...
Heavens no! If I didn't get off on the mathematics of figure skating judging, I wouldn't have spent all day yesterday reading up on bootstrap methods for assessing the robustness and validity of the trimmed mean.

What I do think is this.

Ordinals are quite mathematical enough and express more honestly what is really going on in the "second mark" than do add-up-the-points methods. Under ordinal judging we can still do plenty of mathematical analyses. I do not agree that because something cannot be measured it therefore cannot be judged, or that the judging cannot be subjected to mathematical scrutiny.

What exactly are we adding up -- and taking the mean and standard deviation of -- when we come to a judgment that this skater displayed better musical interpretation than that skater?

That having been said, I still think the most piquant comment ever made about the IJS is your, "I support the CoP a full 51 per cent."

For instance, I think the CoP is a great improvement over ordinal judging for skating contests at the developmental level (gkelly taught me this on this board). Now (I hope) we have a thousand children rushing to the protocols after their competitions to see what they need to work on next, rather than a thousand kids weeping, "the judges hate me for no reason."

8.
Originally Posted by Mathman
Ordinals are quite mathematical enough and express more honestly what is really going on in the "second mark" than do add-up-the-points methods. Under ordinal judging we can still do plenty of mathematical analyses. I do not agree that because something cannot be measured it therefore cannot be judged, or that the judging cannot be subjected to mathematical scrutiny.
Well, then I guess we completely disagree on this too. I find this comment and your previous comment to be 100% at odds with each other. Even ordinals are a measurement -- a relative measurement, but still a measurement -- and have a mean, a median, a standard deviation, and an associated uncertainty in the final results.

9.
Obviously GRossano knows this, and certainly many others do, but it seems from other posts that some still think "five judges" make the decision. Maybe this is just semantics, but there are seven scoring judges, and the high and low mark for each element and each PC are dropped. So for each line item the five judges' marks that figure into the result will be different, unless there is an extreme judge that is high or low on every single line. Seven judges' marks count, and IMO this emphasis on "five judges make the decision" is misleading. "Seven, not nine, judges make the decision" would IMO be more accurate.

As GRossano points out, from a mathematical perspective, none of this changes anything as it is the number of votes cast, rather than the person casting the vote, that factors into the mathematical analysis.

10.
I think what is bugging me is this.

I think I am being a little bit lazy when I say, I know only one thing about statistics. I know that

S.E. = s/sqrt(n).

Therefore, every time I see some numbers I will compute the mean and the standard deviation and then I will apply this formula.
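For concreteness, here is that one formula applied to a hypothetical panel of seven PCS marks:

```python
import math
import statistics

# Hypothetical PCS marks from a panel of seven scoring judges
marks = [7.25, 7.50, 7.00, 7.75, 7.25, 7.50, 7.25]
n = len(marks)

mean = statistics.fmean(marks)
s = statistics.stdev(marks)   # sample standard deviation
se = s / math.sqrt(n)         # standard error of the mean
print(f"mean = {mean:.4f}, s = {s:.4f}, S.E. = {se:.4f}")
```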

But here is where I accuse myself of laziness or a faulty memory. Although I remembered the conclusion of the Central Limit Theorem, I forgot the hypotheses. Insofar as we want to apply this formula to data collected from the real world, the hypotheses are three.

(a) There is a correct mark, independent of our efforts to measure it.

Applied to figure skating scores, this means, for example, that the real and true value of the choreography of a skating program is 7.25, and if a judge gives a mark of 7.50 or 7.00, that judge has simply measured it wrong.

(b) Our measuring technique is such that the collection of all possible measurements comprises a normal distribution whose mean is the correct mark.

I do not know whether the normal distribution part is true or not for figure skating judging. The reason for using a trimmed mean (or median) is that it works better in the case where the distribution is not normal. If it is normal but with the wrong mean, that indicates a systematic error in our measurement technique.

If we are just doing a mathematical exercise and don’t really care about the true mark (or if we concede that it does not exist), then we can substitute the mean of the population of measurements for the true mark throughout.

(c) The particular sample that we have before us was chosen randomly. That is, every measurement in the population has an equal chance of being chosen for the sample.

In figure skating, this means a lot more than just, put the names of all the judges in a hat and draw some out. There are many ways in which this condition may be violated in the case of figure skating judging.

What I fear is that the IJS has rushed past the hypotheses and arrived at the conclusion with unjustified bravado and confidence.

Originally Posted by gsrossano
Even ordinals are a measurement -- a relative measurement but still a measurement -- and have a mean, a median, a standard deviation and an associated uncertainty in the final results.
I think we are using the word "measurement" differently. To me, "measuring" something means assigning a real number to it along a continuum. I do not regard saying, skater A is better than skater B, as "measuring."

I think of it as the primal "herdsmen versus farmers" war. Cowboys count, farmers measure.

11.
I'm not educated in statistics, not even a little bit. So I'm just asking questions.

It seems to me that the "measurements" that judges make -- whether on a 10-point scale or a 6/7-point scale, whether of the whole program or of individual elements or individual aspects ("components") of the whole program -- are most comparable to measurements on a visual analog scale or similar rating system, where individuals are asked to rate perceptions, etc., on a scale of 1-10 or some other range. Whether these ratings are forced into digital steps or not would depend on the rating mechanism.

Often those scales are used for measuring perceptions that are internal to the person doing the perceiving, e.g., pain. In that case, each person would be providing numbers related to their own individual object of perception. Joe can only rate Joe's own pain and Sally can only rate Sally's own pain. So differences in the numbers they each produce would vary based not only on the accuracy of their perceptions and on how they individually use the scale to translate perceptions into numbers, but also on variations between what they are each perceiving.

But it can still be useful for investigators to find an average level of pain perceived by subjects in a study under specific conditions. So how do they work with the rating numbers to produce usable averages? Would medians or means be more appropriate?

You could also use such scales for studies such as market research that would ask people to evaluate an external object based on their own perceptions and preferences. Depending on what's being evaluated, there might be some degree of expertise involved and required of the evaluators, or it could be purely a matter of personal preference.

With judges evaluating skating, all the judges are evaluating the same object of perception, and there is expertise expected in being able to recognize and identify levels of technical skill and adherence to criteria. But the numbers they come up with will still vary based on the accuracy of their perceptions and on how they individually use the scale to translate perceptions into numbers. There isn't a single true number that represents the true objective measurement of a fixed parameter, such as the length of a rod (to use an example Mathman invoked a few months ago).

At best there will be a consensus as to the appropriate number that the ratings of trained experts will converge on. Would that be considered the "true" score for a skating program or aspect of a program?

Edited to add, my point is that I don't think this assumption is true:
(a) There is a correct mark, independent of our efforts to measure it.
There isn't any direct association between a given level of skill and the number 7.25 other than a consensus developed by the larger pool of trained judges of which any given judging panel is a subset.

It might be possible to define objective benchmarks for 7.0 and 8.0, for example. Maybe even 7.0 and 7.5. But for all actual examples that fall somewhere in between those benchmarks, it will still be up to the individual discretion of each judge to determine whether that intermediate example is best represented by the number 7.0, 7.25, or 7.5.

For GOEs, there are much clearer benchmarks already established. And very often the GOEs for a given element are unanimous, much more often than PCS. But even so there is also often a fluctuation between, say, 0 and +1 or 0 and -1, or sometimes between -1 and +1, as different judges differently perceive completed elements as slightly better or worse than the norm and draw the line at different points as to when to lower or raise the GOE or not.

Obviously the more judges contributing the data to the averaging process, the more "accurate" the results would be (which is exactly the reason for the concern over using fewer judges that this thread started with). But there's no independent measurement of the numerical value of a skating program or element aside from a consensus of experts -- there's no independent way to confirm whether any panel of judges got the right answer or not.

Given that that is the case, what is the best statistical method for crunching the numbers that a judging panel comes up with?

Is using larger panels the best or only way to ensure more "accurate" results?

I think random selection of some judges' scores not to count will always hurt the statistical accuracy. The justification for random selection is that it enhances judges' ability to avoid outside political influences on their judging process.

My question is whether it really does have a positive effect on that ability. If yes, then it's worth keeping for reasons extrinsic to the statistical process. If no, then it's a worthless provision.

12.
gkelly, I'm not an expert in statistics or mathematics, but I'll try to answer as someone who does have a bit of a background in statistical methodology and some knowledge - very basic! - of creativity research (not all of it in English, so I hope this will make sense).

In many research fields, especially in the social sciences, research participants answer self-report tests (obviously there are other research methods, but surveys are cheap and easy to analyze). So you'll have several items measuring each factor you want to study, and assuming the scales are reliable and valid, you use the calculated means from each scale (factor) for further analysis. As a researcher, you have to assume that the answers given by the participants are meaningful; otherwise your scales are not valid and you may as well leave the research to those dealing in physics and the like.

Anyway, what you do with your data is treat these variables as quasi-interval, and in that case means can be used in a meaningful manner - something that can't be done with ordinal scales, for which you'll report the median or mode instead and run the appropriate non-parametric tests.

The second relevant bit of methodology is how to assess a creative product, which figure skating programs are in many ways - the only really objective parts are the base marks for the jumps; levels can differ based on a caller's perceptions. One way you can do this is to use the consensual assessment technique (CAT). In CAT, creative products are rated by several judges, and the ratings are then averaged into a global rating and analyzed for interjudge reliability. In theory, those with a good understanding of a certain field should be able to agree more or less on what's exceptional, what's good, what's bad and what's mediocre.

I took this from a paper I wrote once; I could probably expand it and cite sources, but let's not do that. For further reading, the technique comes from the work of a researcher named Teresa Amabile. Prof. Amabile is on the faculty at Harvard and her work has been cited quite a bit, so I imagine she knows what she's talking about.

CAT also has some other requirements, some of which are a part of the current judging system. Assuming this procedure is followed, you don't need a lot of judges to assess creative products. Now, I can't remember if CAT is applicable in the case of performances, but it's definitely been used in many different areas of creativity. CoP and 6.0 don't exactly follow the procedure, but they do/did follow it in part. So I'd say that assuming the judges know what they're doing and are judging in good faith, you don't need a huge panel. Of course, those can be big ifs in figure skating.
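As a supplementary sketch, the "analyzed for interjudge reliability" step can be illustrated with Cronbach's alpha, one common reliability index (the ratings below are invented, and CAT studies may use other indices, such as intraclass correlation):

```python
import statistics

# Hypothetical ratings: three judges each score five programs (rows = programs)
ratings = [
    [7.0, 7.25, 7.0],
    [8.0, 8.25, 8.5],
    [6.5, 6.75, 6.5],
    [7.5, 7.25, 7.75],
    [9.0, 8.75, 9.0],
]
k = len(ratings[0])                  # number of judges
judge_cols = list(zip(*ratings))     # one column of scores per judge

# Cronbach's alpha: compares judge-by-judge variance to the variance
# of the summed scores; values near 1 mean the judges agree closely.
item_vars = sum(statistics.pvariance(col) for col in judge_cols)
total_var = statistics.pvariance([sum(row) for row in ratings])
alpha = k / (k - 1) * (1 - item_vars / total_var)
print(f"interjudge reliability (alpha) = {alpha:.3f}")
```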

Hope that was coherent and helpful. I'll now leave the math to Mathman.

13.
Originally Posted by Buttercup
Hope that was coherent and helpful. I'll now leave the math to Mathman.
Now if we could just get Mathman to tone down his Blowhard the Bombastic act for a minute...

The last two posts, gkelly's and yours, are very coherent and helpful indeed.

14.
The fundamental premise of IJS is that the value of programs will be determined on an absolute scale in accordance with standards and requirements specified in the rules and ISU communications. It is most definitely not meant to be a relative (comparative) system relying on individual standards and priorities for each judge.

Originally Posted by Mathman
(a) There is a correct mark, independent of our efforts to measure it.

For each error in an element there is a correct reduction to the GoE, and for each positive aspect there is a correct enhancement to the GoE. For the GoEs this is pretty well defined, and any differences among the judges are (or should be) due to differences in observational skill and differences in knowledge, not to different expectations or individual standards for what is an error or a positive aspect.

For the PCs there is also a correct mark on an absolute scale, tied to the percentage of the program in which the skater executes the criteria correctly at a recognized skill level. The calibration of the judges to mark to this standard is still very weak. But that does not negate the intent of the marking scale to be an absolute standard for which there is a correct mark.

(b) Our measuring technique is such that the collection of all possible measurements comprises a normal distribution whose mean is the correct mark.
The distribution of GoEs and PCs is indeed Gaussian (at least for the last two seasons for which I have done the analysis). The mean of the marks is the best estimate of the mean of the distribution. It is not, however, the true mean of the underlying distribution. Likewise, the measured standard deviation is our best estimate of the standard deviation of the underlying distribution, but not the actual standard deviation of the distribution itself. The only way to know the true mean and standard deviation of the distribution is to take an infinite number of samples.

One can estimate the amount by which the measured mean and standard deviation depart from the true values using the measured values and the number of independent samples. One can also determine the probability that the measured mean departs from the true mean by a given amount, and the probability that the difference between two averages is real (the t-distribution noted in an earlier post).
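This distinction (that the measured mean is only an estimate of the true mean, with an uncertainty that shrinks as the number of independent marks grows) can be illustrated with a simulation under a hypothetical Gaussian judging distribution:

```python
import math
import random
import statistics

random.seed(2)

TRUE_MEAN, TRUE_SD = 7.5, 0.25   # hypothetical underlying judging distribution

def mean_of_panel(n):
    """Average the marks of a randomly drawn panel of n judges."""
    return statistics.fmean(random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n))

# Repeat the experiment many times: the panel means scatter around the
# true mean with spread TRUE_SD / sqrt(n), shrinking as panels grow.
spread7 = statistics.pstdev([mean_of_panel(7) for _ in range(10_000)])
spread70 = statistics.pstdev([mean_of_panel(70) for _ in range(10_000)])
print(f"spread of 7-judge means:  {spread7:.4f}  (theory: {TRUE_SD / math.sqrt(7):.4f})")
print(f"spread of 70-judge means: {spread70:.4f}  (theory: {TRUE_SD / math.sqrt(70):.4f})")
```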

(c) The particular sample that we have before us was chosen randomly. That is, every measurement in the population has an equal chance of being chosen for the sample.
Not exactly sure what you are trying to say here, so I am not sure if I agree or not.

I think the real requirement is that each sample has to be independent. So long as the judges don't have a prior agreement and aren't copying from each other, that requirement is met.

I think we are using the word "measurement" differently. To me, "measuring" something means assigning a real number to it along a continuum. I do not regard saying, skater A is better than skater B, as "measuring."
A continuum is not required. Most measurements today are digital, and inherently quantized (and many physical phenomena are quantized, even on a macroscopic scale).

Don't know what you mean by a "real" number. There are absolute measurements and there are relative measurements. Both are measurements.

When I decide skater A is better than skater B I have taken into account all the technical content, the presentation, (everything in the IJS criteria in fact) and decided who has correctly done more of what I have been trained to be looking for. In a group of skaters I may decide A is one place better than B or two places or 5 places, and my ordinal numerically specifies how many places I think A is better than B.

Many of the recent posts are immensely entertaining, but sight of the forest is being lost for discussion of the trees, the leaves of the trees, and the pollen on the leaves of the trees.

The bottom line, regardless of the methodology used and the exact metric calculated, is that the accuracy and precision of the results are determined by the spread in the marks and the number of marks used in the calculation. No matter how you decide to crunch the numbers, the fewer the judges, the less certain and the less accurate the results. That is the forest.

Reducing the size of the panels has only made things worse. And it cannot be argued that there were more than enough judges before, such that the reduction merely brings us down to the number we really need: the previous use of 9 scoring judges was already too few. If 9 was already bad, then going to seven scoring judges is only worse.

15.
Originally Posted by gkelly
It seems to me that the "measurements" that judges make -- whether on a 10-point scale or a 6/7-point scale, whether of the whole program or of individual elements or individual aspects ("components") of the whole program -- are most comparable to measurements on a visual analog scale or similar rating system, where individuals are asked to rate perceptions, etc., on a scale of 1-10 or some other range. Whether these ratings are forced into digital steps or not would depend on the rating mechanism...
In my opinion this is correct, and a useful way to look at it.

But it can still be useful for investigators to find an average level of pain perceived by subjects in a study under specific conditions. So how do they work with the rating numbers to produce usable averages? Would medians or means be more appropriate?
I think means are almost always the right way to go when dealing with measured data, even if there is a subjective component in how the data is obtained.

The only quibble is that if it turned out that the distribution of all measurements is not "normal" (bell-shaped), then the formulas for margin of error, confidence intervals, etc., would not work as advertised. This might happen, for example, if there were some weird individuals with super-human tolerance for pain (the trimmed mean provides a partial remedy).
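A quick simulation of exactly this scenario (invented numbers: ratings roughly normal around 5, with an occasional extreme rater pinned at 10) shows the trimmed mean holding up better than the plain mean:

```python
import random
import statistics

random.seed(3)

def trimmed_mean(xs):
    """Drop the single high and low value, then average the rest."""
    s = sorted(xs)
    return statistics.fmean(s[1:-1])

# Hypothetical pain-style ratings: 9 raters, roughly normal around 5,
# but half the time one rater reports an extreme 10.
def sample(n=9):
    xs = [random.gauss(5.0, 1.0) for _ in range(n - 1)]
    xs.append(10.0 if random.random() < 0.5 else random.gauss(5.0, 1.0))
    return xs

# Compare typical error of each estimator against the true center, 5.0
trials = 5_000
mean_err = statistics.fmean(abs(statistics.fmean(sample()) - 5.0) for _ in range(trials))
trim_err = statistics.fmean(abs(trimmed_mean(sample()) - 5.0) for _ in range(trials))
print(f"typical error of plain mean:   {mean_err:.3f}")
print(f"typical error of trimmed mean: {trim_err:.3f}")
```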

With judges evaluating skating, all the judges are evaluating the same object of perception, and there is expertise expected in being able to recognize and identify levels of technical skill and adherence to criteria. But the numbers they come up with will still vary based on the accuracy of their perceptions and on how they individually use the scale to translate perceptions into numbers. There isn't a single true number that represents the true objective measurement of a fixed parameter, such as the length of a rod (to use an example Mathman invoked a few months ago).
I think this is the one thing that almost everyone will agree on.

At best there will be a consensus as to the appropriate number that the ratings of trained experts will converge on. Would that be considered the "true" score for a skating program or aspect of a program?
Yes, I think so.

Obviously the more judges contributing the data to the averaging process, the more "accurate" the results would be (which is exactly the reason for the concern over using fewer judges that this thread started with). But there's no independent measurement of the numerical value of a skating program or element aside from a consensus of experts -- there's no independent way to confirm whether any panel of judges got the right answer or not.

Given that that is the case, what is the best statistical method for crunching the numbers that a judging panel comes up with?

Is using larger panels the best or only way to ensure more "accurate" results?
Pretty much, yes, it is the only way. Here "accurate" means that the average scores of the panel match the average of the scores that would hypothetically be given by all well qualified judges.

I think random selection of some judges' scores not to count will always hurt the statistical accuracy.
I believe that is the second thing about which there is no disagreement.

The justification for random selection is that it enhances judges' ability to avoid outside political influences on their judging process.

My question is whether it really does have a positive effect on that ability. If yes, then it's worth keeping for reasons extrinsic to the statistical process. If no, then it's a worthless provision.
This, of course, is not a statistical question. My personal opinion is that it is worse than merely worthless, it is harmful to the goal of generating public trust in the integrity of the judging.

Let me make one more little comment about the following, because this is the point at which the mathematics comes in.

At best there will be a consensus as to the appropriate number that the ratings of trained experts will converge on. Would that be considered the "true" score for a skating program or aspect of a program?
Having abandoned the "measuring a steel rod" model, we now have no errors of measurement, only "sampling error" (random statistical noise) to worry about. If the average score given by a judging panel is 132.17, but the margin of error is plus or minus 10 points, then we would not have much confidence that this estimate is close to what the average of the population of all possible judges' marks would be.

But I still think we are on shaky ground. There are many reasons for variation in judges' marks besides random sampling error. If one judge gives a high mark and another judge gives a low one, is this just statistical noise that we should try to reduce or filter out? Or is this the very thing that we want to study?

