# Thread: Should the IJS use median scores instead of the trimmed mean?

1. 0

Originally Posted by Vanshilar
Uh actually, no aggregating method of any type is good in ranked voting, which is what Arrow's Impossibility Theorem (colloquially) says (I know I'm just loosely stating it). So it doesn't really matter per se, there won't be a "perfect" way to aggregate how multiple judges score the same event…
I look at this question slightly differently. Arrow’s Theorem says that it is impossible to design an ordinal system that satisfies a list of properties that are considered desirable in many settings in economics and political science. As applied to figure skating, the most important of these is independence of irrelevant alternatives: in an ideal system the entry of a new candidate who finishes a distant third should not affect the placement of the top two.

The canonical example is the 2000 U.S. Presidential election between Al Gore and George Bush. It all came down to Florida, where Gore was clinging to a small lead, let us say 1,000,000 to 999,000. Along comes liberal candidate Ralph Nader, to the left of Gore. He siphons off 2000 votes from Gore. The final tally is

Bush 999,000; Gore 998,000; Nader 2000.

Bush (after a few hanging chads and a five to four vote in the U.S. Supreme court) wins all of Florida’s electoral votes and the presidency.

This sort of thing is regarded as anti-democratic because the will of the people (they preferred Gore to Bush) has been frustrated by an “irrelevant alternative” (Nader).

In figure skating this problem came to a head at the 1997 European men’s competition, where nobody skated well and the ordinals were all over the place. The ISU rushed a new system into place (OBO), but it did not address this particular problem. (All it did was make it harder for Michelle Kwan to win the 2002 Olympics.) Here is an excellent article by Sandra Loosemore of Frogs on Ice about all this.

http://www.frogsonice.com/skateweb/o...analysis.shtml

Anyway, in figure skating the problem arises in a situation like this. Here are the ordinals given by nine judges after two skaters, A and B have gone.

Skater A: 1 1 1 1 1 2 2 2 2
Skater B: 2 2 2 2 2 1 1 1 1

Skater A is winning. She is preferred over skater B by a majority of the panel.

Now skater C goes. The new rankings are

Skater A: 1 1 1 2 2 2 2 3 3
Skater B: 2 2 2 3 3 1 1 1 1
Skater C: 3 3 3 1 1 3 3 2 2

Skater B wins with four first place ordinals, three seconds, and two thirds. Skater A must be satisfied with silver even though head-to-head she beat skater B by a score of five judges to four, and she beat Skater C by the same five-to-four margin. Skater A beat everybody (she is the “Condorcet winner”), but Skater B won the gold medal.

So the question is, is this the wrong outcome? If we do not announce any intermediate results, but wait until the end to tally all the votes, Skater B has a good claim. Her 4 firsts and 3 seconds are better than Skater A’s 3 firsts and 4 seconds, both receiving 2 thirds. The only reason Skater A is mad is that she thought she was winning before Skater C snuck in there and stole some first place ordinals from her. I don’t know that this is so terrible in figure skating, despite the fact that Prof. Arrow (a Nobel Prize winning economist) didn’t like it.
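The head-to-head tallies are easy to verify mechanically. Here is a quick Python sketch using the exact ordinals from the example above (the function name is mine, purely for illustration):

```python
from itertools import combinations

# Ordinals from the three-skater example above (one entry per judge).
ordinals = {
    "A": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "B": [2, 2, 2, 3, 3, 1, 1, 1, 1],
    "C": [3, 3, 3, 1, 1, 3, 3, 2, 2],
}

def head_to_head(x, y):
    """Count the judges who ranked skater x above skater y."""
    return sum(rx < ry for rx, ry in zip(ordinals[x], ordinals[y]))

for x, y in combinations(ordinals, 2):
    print(f"{x} beats {y}, {head_to_head(x, y)} judges to {head_to_head(y, x)}")

# A wins both of her head-to-head matchups (5-4 over B, 5-4 over C),
# so A is the Condorcet winner, yet B has the most first-place
# ordinals (four, to A's three).
```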

2. 0
Originally Posted by Mathman
However, if the question is, can we look back at the protocols of past events and see whether a minority was able to dominate the majority and whether this could have been remedied, the answer is, "no." We cannot do that because of anonymous randomized judging.
Well, you could try looking at some events where the judging wasn't anonymous and randomized, e.g., JGP or US Nationals.

Of course, there would be less incentive to cheat at those events, and much less incentive or possibility to form national blocs. But there would still be minority opinions.

3. 0
Originally Posted by gkelly
Well, you could try looking at some events where the judging wasn't anonymous and randomized, e.g., JGP or US Nationals.
!!!!!!!!!

Originally Posted by gkelly
Of course, there would be less incentive to cheat at those events, and much less incentive or possibility to form national blocs. But there would still be minority opinions.
That's OK. I am not necessarily trying to catch cheaters but rather to understand the mathematical peccadillos of the judging system.

4. 0
The question of which measure of central tendency is best depends on what kind of bias we are trying to correct for. If we are worried about a minority in collusion, then obviously the median is the best method. If a majority of the judges are in collusion, the median will lead to worse results than most other methods, but it's pretty much impossible to design a system that can handle majority collusion effectively. But if the bias isn't too extreme and is more individualistic, like a judge rewarding a +2 spin with a +3, or 8.50 Skating Skills with 9.00, while most of the other judges are fair, then maybe something like a winsorized mean would be appropriate.
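To illustrate the individualistic case, here is a small Python sketch with a made-up panel (the numbers are hypothetical, chosen only to show the mechanics): one judge inflates a single score by a full point, and we compare how the plain mean, a winsorized mean, and the median react.

```python
from statistics import mean, median

# Hypothetical panel of nine scores, sorted for convenience.
fair = [8.00, 8.00, 8.25, 8.25, 8.25, 8.50, 8.50, 8.75, 8.75]

# One judge inflates the top score by a full point (8.75 -> 9.75).
biased = fair[:-1] + [fair[-1] + 1.00]

def winsorized_mean(scores):
    """Clamp the single highest and lowest score to their neighbors."""
    s = sorted(scores)
    s[0], s[-1] = s[1], s[-2]
    return mean(s)

# The plain mean drifts by 1.00/9, about 0.11 points; the winsorized
# mean and the median do not move at all.
print(mean(biased) - mean(fair))
print(winsorized_mean(biased) - winsorized_mean(fair))
print(median(biased) - median(fair))
```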

The point is, we need to get a good handle on what kind of bias is going on in figure skating and the ISU's insistence on anonymous voting makes this really difficult to study statistically.

5. 0
I get what you're trying to do, but it seems like a waste to throw out 8 judges' scores. If the panels are randomly selected, what advantage is there in having 9 judges over 7 or even 5 if only one score is going to count?

6. 0
Just for fun, here is the example given above by Sandpiper to show how the median might not capture the judges' intent.

A: 6.00, 6.50, 7.25, 7.75, 8.25, 8.25, 8.50, 8.75, 8.75
B: 7.75, 8.00, 8.00, 8.25, 8.25, 9.00, 9.00, 9.25, 9.50

Mean (all nine judges): A, 7.78; B, 8.56

Trimmed mean (throw out highest and lowest): A, 7.89; B, 8.54

Winsorized mean (11% at each end): A, 7.83; B, 8.56

Median: A, 8.25; B, 8.25
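For anyone who wants to check these numbers, here is a minimal Python sketch. (I am reading "11% at each end" as replacing the single highest and lowest score with their nearest neighbors, which is one value out of nine at each end.)

```python
from statistics import mean, median

a = [6.00, 6.50, 7.25, 7.75, 8.25, 8.25, 8.50, 8.75, 8.75]
b = [7.75, 8.00, 8.00, 8.25, 8.25, 9.00, 9.00, 9.25, 9.50]

def trimmed_mean(scores):
    """Drop the single highest and lowest score, then average the rest."""
    s = sorted(scores)
    return mean(s[1:-1])

def winsorized_mean(scores):
    """Replace the extremes with their nearest neighbors, then average."""
    s = sorted(scores)
    s[0], s[-1] = s[1], s[-2]
    return mean(s)

for name, panel in (("A", a), ("B", b)):
    print(name,
          round(mean(panel), 2),
          round(trimmed_mean(panel), 2),
          round(winsorized_mean(panel), 2),
          median(panel))
```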

Conclusion? It seems like no matter how many examples we look at, no one wants to consider the obvious solution. Ordinal judging. Quality, not quantity. Judging, not measuring. I just can't wrap my mind around the ISU's claim that the best way to judge choreography, musical interpretation, and merit as performing art is to add up some decimal numbers.

7. 0
Originally Posted by drivingmissdaisy
I get what you're trying to do, but it seems like a waste to throw out 8 judges' scores. If the panels are randomly selected, what advantage is there in having 9 judges over 7 or even 5 if only one score is going to count?
I think that's not the right way to look at it, though. All nine judges play a role in ensuring a nice spread of values before we select the middle of the spread. If we had fewer judges, the spread of scores might be skewed to one side or the other before we begin. Each of the nine judges, in giving out his score, has an equal chance that his will turn out to be the middle value.

The advantage of the median is that it is more difficult for an individual judge, or a small cabal of judges acting in concert, to skew the results by giving exaggeratedly high scores to their favorite and exaggeratedly low scores to his rival. For means, the results can be affected even by giving little bumps of .25 or .5 here and there.

But you are right that the drawback to using the median is that too much useful information about the judges' intent is lost, as in Sandpiper's example.
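To put a number on the "small cabal" point: with a single trim at each end, two colluding judges can still drag the trimmed mean, because the trim discards only one of their two inflated scores, while the median stays put. A sketch with a hypothetical panel:

```python
from statistics import mean, median

# Hypothetical panel of nine scores; two colluding judges each bump
# their favorite by half a point (8.75 -> 9.25, twice).
fair  = [8.00, 8.00, 8.25, 8.25, 8.25, 8.50, 8.50, 8.75, 8.75]
cabal = fair[:-2] + [8.75 + 0.50, 8.75 + 0.50]

def trimmed_mean(scores):
    """Drop the single highest and lowest score, then average the rest."""
    s = sorted(scores)
    return mean(s[1:-1])

# The trim removes only one of the two inflated scores, so the
# trimmed mean still drifts upward by 0.50/7, about 0.07 points;
# the median does not move at all.
print(trimmed_mean(cabal) - trimmed_mean(fair))
print(median(cabal) - median(fair))
```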

8. 0
Originally Posted by caelum
The point is, we need to get a good handle on what kind of bias is going on in figure skating and the ISU's insistence on anonymous voting makes this really difficult to study statistically.
I think there are several different sources of bias, which may work at cross purposes, making it hard to tease them apart:

*Individual personal preference as to what are the most important criteria defining "good skating" (honest and unavoidable)

*National or regional trends in what different skating cultures consider most important (honest and unavoidable)

*Unconscious influence by factors other than the actual quality of the skating such as skate order, knowledge of past results, "buzz" about up-and-coming skaters, crowd response, etc. (honest)

*Conscious effort to tailor one's judging toward instructions or recommendations by ISU technical committees, event referees, etc. (honest only to the extent that the instructions are derived from honest attempts to clarify judging standards; dishonest if the instructions are intended or interpreted as coded attempts to reward specific skaters or skaters from specific federations)

*Individual national bias by judges toward compatriot skaters [several layers: unconscious (honest and not completely avoidable); conscious effort to (over?)compensate for known tendency toward nationalism (honest attempt to avoid the foregoing); conscious effort to help compatriot skaters place as high as possible (dishonest, whether on the judges' own initiative or under instructions from their federations)]

It seems to me that the ideal system would be designed to develop the best possible honest consensus controlling for the honest sources of bias, and then build in additional safeguards to discourage or minimize the effects of active dishonest attempts to manipulate the results.

How to do that is the problem to be solved.

Originally Posted by Mathman
I just can't wrap my mind around the ISU's claim that the best way to judge choreography, musical interpretation, and merit as performing art is to add up some decimal numbers.
I think the move toward an absolute-points system was inspired in large part by the idea that the difficulty of various jumps and many other technical elements is quantifiable.

If we grant that but disagree that holistic assessments of quality across whole programs are quantifiable in the same way, what kind of scoring could best combine assigning points for elements with comparisons of whole-program quality across the field in the current competition?

Also, what's the best way to keep track of 24 or 30 or more skaters in the same event, each of whom is better in some aspects and worse in others than each of their nearest competitors?

9. 0
Originally Posted by Mathman
Conclusion? It seems like no matter how many examples we look at, no one wants to consider the obvious solution. Ordinal judging. Quality, not quantity. Judging, not measuring. I just can't wrap my mind around the ISU's claim that the best way to judge choreography, musical interpretation, and merit as performing art is to add up some decimal numbers.
I'm fine with this obvious solution. There are aspects of IJS that I like (being able to see protocols and exactly what everyone did at the event), but I too don't think adding up numbers is the best way to evaluate performances. But it appears we're the only ones, lol.

My example was very extreme, but even if you flip some of the numbers around, it likely still comes out that more judges favoured B over A, and yet they get the same mark.

10. 0
Originally Posted by gkelly
I did some studies about identifying pairs and small groups of judges who were of a like mind, several years ago when judging was anonymous but not randomized. In those early years one could study the voting pattern of "judge #5" all the way down the competition; we just could not attach a name or nationality to that judge. Later on the ISU decided that this was not obscure enough, so they introduced randomization. (What's next? Encrypting the scores?)

Anyway, the problem is that there are many other reasons -- all the reasons that you listed that also apply to judges singly -- why two judges might give scores in tandem, perfectly innocently. I will look at some Junior Grand Prix protocols from last year and see if anything of interest pops up.

Originally Posted by gkelly
I think the move toward an absolute-points system was inspired in large part by the idea that the difficulty of various jumps and many other technical elements is quantifiable.
I guess so. But it is still frustrating. The ISU brain trust determined scientifically and objectively that a triple Salchow is 12.5% more difficult than a triple toe loop. Then a few years later they changed their mind and decided that, no, objectively and scientifically speaking, a triple Salchow is only 4.878% more difficult. It seems like there is still a lot of "I'll just put down the score that seems right to me" going on; the main difference is that now the ISU technical committee is doing it instead of the individual judges. Maybe this leads to more uniform and reliable judging, but I do not recall that this (one jump is harder than another) was a problem for judges before the CoP.

Originally Posted by gkelly
If we grant that but disagree that holistic assessments of quality across whole programs are quantifiable in the same way, what kind of scoring could best combine assigning points for elements with comparisons of whole-program quality across the field in the current competition?
I don't know. What was wrong with the old first mark, second mark idea?

Originally Posted by gkelly
Also, what's the best way to keep track of 24 or 30 or more skaters in the same event, each of whom is better in some aspects and worse in others than each of their nearest competitors?
The old method was to use marks like 5.7 as place holders and mnemonic aids. I thought it worked OK. The 8.25s could serve the same function. It is adding these scores together across different judges where the problem lies, IMHO. Your 8.25 might not mean the same thing as my 8.25.

I do not have a big ax to grind about this (all my ax-grinding posts notwithstanding.) It just rubs me the wrong way to see mathematics misused in such a way as to give the illusion of precision where none exists.

11. 0
I'm all for median scores. Diving does it (they actually remove the two highest and two lowest scores in a panel of 7), freestyle skiing does it, and I'm not sure but I think gymnastics does it too, where they get rid of the highest and lowest scores.

Not only should they have anonymous judging and use median scores, they should track which judges are continually skewing and phase them out. Once judges realize their favouritism is irrelevant, hopefully they will dole out scores that match other judges.

12. 0
I just want to say that this thread proves that perhaps the system is a bit mathematically complicated.

Sincerely,

HumanitiesWoman.

13. 0
No matter what system you choose, someone is going to be unhappy with the outcome.

Each system has its pros/cons. In theory, I think the IJS system is a bit less subjective than the older system.

14. 0
Originally Posted by Mathman
The ISU brain trust determined scientifically and objectively that a triple Salchow is 12.5% more difficult than a triple toe loop. Then a few years later they changed their mind and decided that, no, objectively and scientifically speaking, a triple Salchow is only 4.878% more difficult. It seems like there is still a lot of "I'll just put down the score that seems right to me" going on; the main difference is that now the ISU technical committee is doing it instead of the individual judges. Maybe this leads to more uniform and reliable judging, but I do not recall that this (one jump is harder than another) was a problem for judges before the CoP.
The other alternative is for each judge to decide for themselves how much each jump should be worth, which isn't preferable IMO. As a judge under 6.0 I'd have a hard time deciding whether to mark Yuna's LP or Caro's LP higher in Sochi; with IJS I don't have to decide whom I liked better, I just mark what I see based on set criteria. Of course, in any judged system a judge can manipulate marks, but I think it is an improvement when a judge is more focused on the elements than they were under 6.0.

15. 0

Originally Posted by louisa05
Sincerely,

HumanitiesWoman.

