
Should the IJS use median scores instead of the trimmed mean?

Joined
Jun 21, 2003
Just for fun, here is the example given above by Sandpiper to show how the median might not capture the judges' intent.

A: 6.00, 6.50, 7.25, 7.75, 8.25, 8.25, 8.50, 8.75, 8.75
B: 7.75, 8.00, 8.00, 8.25, 8.25, 9.00, 9.00, 9.25, 9.50

Mean (all nine judges): A, 7.78; B, 8.56

Trimmed mean (throw out highest and lowest): A, 7.89; B, 8.54

Winsorized mean (11% at each end): A, 7.83; B, 8.56

Median: A, 8.25; B, 8.25
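
For anyone who wants to check the arithmetic, here is a quick Python sketch that reproduces the figures above (assuming the 11% winsorizing simply means replacing one score at each end of the nine):

from statistics import mean, median

a = [6.00, 6.50, 7.25, 7.75, 8.25, 8.25, 8.50, 8.75, 8.75]
b = [7.75, 8.00, 8.00, 8.25, 8.25, 9.00, 9.00, 9.25, 9.50]

def trimmed(scores):
    # Drop the single highest and lowest score, then average the rest.
    s = sorted(scores)
    return mean(s[1:-1])

def winsorized(scores):
    # Replace the lowest score with its neighbor above and the highest
    # with its neighbor below (one score at each end of a nine-judge panel).
    s = sorted(scores)
    s[0], s[-1] = s[1], s[-2]
    return mean(s)

for name, scores in (("A", a), ("B", b)):
    print(name, round(mean(scores), 2), round(trimmed(scores), 2),
          round(winsorized(scores), 2), median(scores))
# A 7.78 7.89 7.83 8.25
# B 8.56 8.54 8.56 8.25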

Conclusion? It seems like no matter how many examples we look at, no one wants to consider the obvious solution. Ordinal judging. ;) Quality, not quantity. Judging, not measuring. I just can't wrap my mind around the ISU's claim that the best way to judge choreography, musical interpretation, and merit as performing art is to add up some decimal numbers.
 
Joined
Jun 21, 2003
I get what you're trying to do, but it seems like a waste to throw out 8 judges' scores. If the panels are randomly selected, what advantage is there in having 9 judges over 7 or even 5 if only one score is going to count?

I think that's not the right way to look at it, though. All nine judges play a role in ensuring a nice spread of values before we select the middle of the spread. If we had fewer judges, the spread of scores might be skewed to one side or the other before we begin. Each of the nine judges, in giving out his score, has an equal chance that his will turn out to be the middle value.

The advantage of the median is that it is more difficult for an individual judge, or a small cabal of judges acting in concert, to skew the results by giving exaggeratedly high scores to their favorite and exaggeratedly low scores to that skater's rival. With means, the results can be affected even by giving little bumps of .25 or .5 here and there.

But you are right that the drawback to using the median is that too much useful information about the judges' intent is lost, as in Sandpiper's example.
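
A toy illustration of that manipulation point, again using Sandpiper's panel A: if one judge bumps a score up by a full point, the mean moves but the median stays put.

from statistics import mean, median

panel = sorted([6.00, 6.50, 7.25, 7.75, 8.25, 8.25, 8.50, 8.75, 8.75])
bumped = panel.copy()
bumped[-1] += 1.00  # one judge inflates a top score: 8.75 -> 9.75

print(round(mean(panel), 2), round(mean(bumped), 2))  # 7.78 vs 7.89: the mean shifts
print(median(panel), median(bumped))                  # 8.25 vs 8.25: the median holds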
 

gkelly

Record Breaker
Joined
Jul 26, 2003
The point is, we need to get a good handle on what kind of bias is going on in figure skating and the ISU's insistence on anonymous voting makes this really difficult to study statistically.

I think there are several different sources of bias, which may work at cross purposes, making it hard to tease them apart:

*Individual personal preference as to what are the most important criteria defining "good skating" (honest and unavoidable)

*National or regional trends in what different skating cultures consider most important (honest and unavoidable)

*Unconscious influence by factors other than the actual quality of the skating such as skate order, knowledge of past results, "buzz" about up-and-coming skaters, crowd response, etc. (honest)

*Conscious effort to tailor one's judging toward instructions or recommendations by ISU technical committees, event referees, etc. (honest only to the extent that the instructions are derived from honest attempts to clarify judging standards; dishonest if the instructions are intended or interpreted as coded attempts to reward specific skaters or skaters from specific federations)

*Individual national bias by judges toward compatriot skaters [several layers: unconscious (honest and not completely avoidable); conscious effort to (over?)compensate for known tendency toward nationalism (honest attempt to avoid the foregoing); conscious effort to help compatriot skaters place as high as possible (dishonest, whether on the judges' own initiative or under instructions from their federations)]

*Collusion/vote trading/deal making/blocs (dishonest)

It seems to me that the ideal system would be designed to develop the best possible honest consensus controlling for the honest sources of bias, and then build in additional safeguards to discourage or minimize the effects of active dishonest attempts to manipulate the results.

How to do that is the problem to be solved.

I just can't wrap my mind around the ISU's claim that the best way to judge choreography, musical interpretation, and merit as performing art is to add up some decimal numbers.

I think the move toward an absolute-points system was inspired in large part by the idea that the difficulty of various jumps and many other technical elements is quantifiable.

If we grant that but disagree that holistic assessments of quality across whole programs are quantifiable in the same way, what kind of scoring could best combine assigning points for elements with comparisons of whole-program quality across the field in the current competition?

Also, what's the best way to keep track of 24 or 30 or more skaters in the same event, each of whom is better in some aspects and worse in others than each of their nearest competitors?
 

Sandpiper

Record Breaker
Joined
Apr 16, 2014
Conclusion? It seems like no matter how many examples we look at, no one wants to consider the obvious solution. Ordinal judging. ;) Quality, not quantity. Judging, not measuring. I just can't wrap my mind around the ISU's claim that the best way to judge choreography, musical interpretation, and merit as performing art is to add up some decimal numbers.
I'm fine with this obvious solution. There are aspects of IJS that I like (being able to see protocols and exactly what everyone did at the event), but I too don't think adding up numbers is the best way to evaluate performances. But it appears we're the only ones, lol.

My example was very extreme, but even if you flip some of the numbers around, it likely still comes out that more judges favoured B over A, and yet they get the same mark.
 
Joined
Jun 21, 2003
*Collusion/vote trading/deal making/blocs (dishonest)

I did some studies about identifying pairs and small groups of judges who were of a like mind, several years ago when judging was anonymous but not randomized. In those early years one could study the voting pattern of "judge #5" all the way down the competition; we just could not attach a name or nationality to that judge. Later on the ISU decided that this was not obscure enough, so they introduced randomization. (What's next? Encrypting the scores?)

Anyway, the problem is that there are many other reasons -- all the reasons that you listed that also apply to judges singly -- why two judges might give scores in tandem, perfectly innocently. I will look at some Junior Grand Prix protocols from last year and see if anything of interest pops up.

I think the move toward an absolute-points system was inspired in large part by the idea that the difficulty of various jumps and many other technical elements is quantifiable.

I guess so. But it is still frustrating. The ISU brain trust determined scientifically and objectively that a triple Salchow is 12.5% more difficult than a triple toe loop. Then a few years later they changed their mind and decided that, no, objectively and scientifically speaking a triple Salchow is only 4.878% more difficult. It seems like there is still a lot of "I'll just put down the score that seems right to me" going on, the main difference being that now the ISU technical committee is doing it instead of the individual judges. Maybe this leads to more uniform and reliable judging, but I do not recall that this (one jump is harder than another) was a problem for judges before the CoP.

If we grant that but disagree that holistic assessments of quality across whole programs are quantifiable in the same way, what kind of scoring could best combine assigning points for elements with comparisons of whole-program quality across the field in the current competition?

I don't know. What was wrong with the old first mark, second mark idea?

Also, what's the best way to keep track of 24 or 30 or more skaters in the same event, each of whom is better in some aspects and worse in others than each of their nearest competitors?

The old method was to use marks like 5.7 as place holders and mnemonic aids. I thought it worked OK. The 8.25s could serve the same function. It is adding these scores together across different judges where the problem lies, IMHO. Your 8.25 might not mean the same thing as my 8.25.

I do not have a big ax to grind about this (all my ax-grinding posts notwithstanding. :) ) It just rubs me the wrong way to see mathematics misused in such a way as to give the illusion of precision where none exists.
 

CanadianSkaterGuy

Record Breaker
Joined
Jan 25, 2013
I'm all for median scores. Diving does it (they actually remove the two highest and two lowest scores in a panel of 7), freestyle skiing does it, and I'm not sure but I think gymnastics does it too, where they get rid of the highest and lowest scores.

Not only should they have anonymous judging and use median scores, they should track which judges are continually skewing and phase them out. Once judges realize their favouritism is irrelevant, hopefully they will dole out scores that match other judges.
 

louisa05

Final Flight
Joined
Dec 3, 2011
I just want to say that this thread proves that perhaps the system is a bit mathematically complicated.

Sincerely,

HumanitiesWoman.
 

concorde

Medalist
Joined
Jul 29, 2013
No matter what system you choose, someone is going to be unhappy with the outcome.

Each system has its pros/cons. In theory, I think the IJS system is a bit less subjective than the older system.
 

drivingmissdaisy

Record Breaker
Joined
Feb 17, 2010
The ISU brain trust determined scientifically and objectively that a triple Salchow is 12.5% more difficult than a triple toe loop. Then a few years later they changed their mind and decided that, no, objectively and scientifically speaking a triple Salchow is only 4.878% more difficult. It seems like there is still a lot of "I'll just put down the score that seems right to me" going on, the main difference being that now the ISU technical committee is doing it instead of the individual judges. Maybe this leads to more uniform and reliable judging, but I do not recall that this (one jump is harder than another) was a problem for judges before the CoP.

The other alternative is for each judge to decide for themselves how much each jump should be worth, which isn't preferable IMO. As a judge under 6.0 I'd have a hard time deciding whether to mark Yuna's LP or Caro's LP higher in Sochi; with IJS I don't have to decide whom I liked better, I just mark what I see based on set criteria. Of course, in any judged system a judge can manipulate marks, but I think it is an improvement when a judge is more focused on the elements than they were under 6.0.
 
Joined
Jun 21, 2003
No matter what system you choose, someone is going to be unhappy with the outcome.

This is true, but I think we can say something even stronger than that. No judging system, however excellent and however detailed, can anticipate everything that can possibly happen. No matter what judging system we choose, there will always be individual competitions where we all say together, oops, the scoring system failed us on this occasion.

The other alternative is for each judge to decide for themselves how much each jump should be worth, which isn't preferable IMO.

To me, some triple toes are worth more than others, in terms of their impact as choreographic exclamation points, etc. I guess this can be addressed with GOEs and PCSs, though.
 

Sandpiper

Record Breaker
Joined
Apr 16, 2014
The other alternative is for each judge to decide for themselves how much each jump should be worth, which isn't preferable IMO. As a judge under 6.0 I'd have a hard time deciding whether to mark Yuna's LP or Caro's LP higher in Sochi; with IJS I don't have to decide whom I liked better, I just mark what I see based on set criteria. Of course, in any judged system a judge can manipulate marks, but I think it is an improvement when a judge is more focused on the elements than they were under 6.0.
But as a judge, isn't your job to decide who was better? In this case, if I were a judge under 6.0, I would say: I will place Yuna ahead, because the quality of her 3-3 combination outweighs Carolina's extra triple. Also, I enjoyed Yuna's program more.

You might disagree. Hence why there are multiple judges, and majority rules.

You do have a point about the judges needing to keep track of all 20+ contestants though. But they will have computers to see how they scored everyone that already went. ;) Also, as I've stated before, I do think it's good to have protocols for the elements everyone did.
 
Joined
Jun 21, 2003
But as a judge, isn't your job to decide who was better? In this case, if I were a judge under 6.0, I would say: I will place Yuna ahead, because the quality of her 3-3 combination outweighs Carolina's extra triple. Also, I enjoyed Yuna's program more.

That's the big riddle, right there. There might be many reasons for enjoying one program more than another, having little to do with who deserves to win an athletic competition.
 

Sam-Skwantch

“I solemnly swear I’m up to no good”
Record Breaker
Joined
Dec 29, 2013
Country
United-States
A major issue for me is simply how we get our PCS scores. How is the value derived? Is a 9.0 really supposed to be equal to a 9.0 in all other events, or is it just event specific? By that I mean: is there a specific set of requirements to achieve this score, or is it mostly awarded like 6.0, representing the skater's artistic merit in the event at hand, relative to the other skaters in that event rather than to a predetermined set of skills? This should be clearer. Is it spelled out in the ISU guidelines? Because I have not seen it.

Another question I've been wondering about is whether we should just award a certain score bonus for the last two groups. Something like a 3-point bonus on the final score for making the final group and 1.5 for the second group (free skate only). We know it already happens in every event. Why not just get it out there, identify it, and acknowledge why it is happening? I recognize there is added hype and pressure in going head to head with the best in the competition, and as such I don't even know if I'd have a problem with this. It's all these little unspoken rules and ways that the numbers are derived that make the score a :bang: for me. Let's just identify the elephant in the room already and then let the math take over from there.

Somehow to me this all feels like we are setting the price of a piece of art. We all know the cost of the frame, and it is always going to be an objective price. So maybe the frame is our TES scores. Now comes the hard part. Who wants to be the jerk who sets the price of the painting inside, has to explain that price, and is then expected to be consistent with it from dealer to dealer?
 

concorde

Medalist
Joined
Jul 29, 2013
Most high-level judges are former competitive skaters, and their personal experience also plays into how they score. As one international-level judge told me, if someone does a move well that you found particularly difficult to execute, you tend to give that skater a high mark. The judge beside you may not view the same move as particularly difficult, so that judge will not give out as high a mark.

As an ice skating parent, you quickly see how "fickle" the scoring system is. You just hope that on your child's big day, the stars align so the system works in your child's favor.
 

gkelly

Record Breaker
Joined
Jul 26, 2003
But as a judge, isn't your job to decide who was better?

Under ordinal judging, it was the job of each judge to rank the skaters as best, second best, etc.
And then the accounting algorithms (majority or OBO) combined the individual judges' rankings into a consensus for the whole panel, which often did not exactly match any individual judge's rankings.
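
(For readers who never saw it in action, here is a rough sketch of the majority-of-ordinals idea; the real rules had additional tiebreakers, and OBO worked by pairwise comparison instead, so treat this as a simplification with made-up data.)

def majority_placements(ordinals):
    # ordinals maps skater -> list of ordinals, one per judge (1 = best).
    # A skater earns the next place once a majority of judges have ranked
    # them at that place or better.
    n_judges = len(next(iter(ordinals.values())))
    majority = n_judges // 2 + 1
    remaining = dict(ordinals)
    final_order = []
    place = 1
    while remaining:
        qualified = [s for s, o in remaining.items()
                     if sum(1 for x in o if x <= place) >= majority]
        # Crude tiebreak: larger majority first, then lower sum of those ordinals.
        qualified.sort(key=lambda s: (-sum(1 for x in remaining[s] if x <= place),
                                      sum(x for x in remaining[s] if x <= place)))
        for s in qualified:
            final_order.append(s)
            del remaining[s]
        place += 1
    return final_order

# Hypothetical five-judge panel: B wins with a majority of firsts even though
# two judges preferred A.
print(majority_placements({"A": [1, 1, 2, 2, 3],
                           "B": [2, 2, 1, 1, 1],
                           "C": [3, 3, 3, 3, 2]}))  # ['B', 'A', 'C']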

That job changed with the change in scoring systems.

Now it's the job of the technical committees and system designers to publish in advance how much each technical element is worth, relative to each other and relative to the range of scores available for the program components.

It's the job of the technical panel to determine which technical skills the skater actually performed well enough to get credit for.

It's the job of the judges to determine how well the skater performed each technical element, and how well they fulfilled the program component criteria.

Unlike ordinal judging, it's not supposed to be a comparative system.

Then all those decisions are combined into an overall score for each skater, and the different total scores will result in ranked results for all the skaters. The rankings by base value alone and the rankings by GOE and PCS alone might both be very different not only from each other but also from the final overall rankings.

No single official ranks the skaters, nor does the panel as a whole -- that is not their job now.

Perhaps some judges, as well as some fans, still think in terms of ranking skaters, but that's 6.0 thinking and will diminish as older judges retire and younger ones come through the system without having devoted years to thinking in terms of comparing skaters.
 

concorde

Medalist
Joined
Jul 29, 2013
Under ordinal judging, it was the job of each judge to rank the skaters as best, second best, etc.
And then the accounting algorithms (majority or OBO) combined the individual judges' rankings into a consensus for the whole panel, which often did not exactly match any individual judge's rankings.

This was mentioned before but I wanted to highlight it. Under the ordinal system, Skater A could win even though no judge placed Skater A in first place. Strange but true. That is why I think IJS is a better system, since each component of a program is evaluated & scored, and then a final score is given to the entire program. The final score is a sum of the individual component scores.
 

CanadianSkaterGuy

Record Breaker
Joined
Jan 25, 2013
The problem with ordinals is that a judge can easily hold down a skater with the artistic marks. Under IJS, the judges have less ability to manipulate the final score. Certainly if you take out the highest and lowest judges, you mitigate any outliers. And if a judge is worried about their scores being the skew, maybe they should mark accordingly.

In an ideal world, every judge would give the same marks, without conferring with each other, because they would give what the skater deserves. In this perfect world, you don't get outliers, so why not modify the system to at least MINIMIZE the number of highs and lows.

I think it would certainly reduce the amount of corruption, because you can't really pay off judges to give higher or lower marks to fit your agenda because their marks might be thrown out anyways.

It's so stupid that GOE is a random selection too. It should be the median 7 marks out of 9. Ideally, it would be the median 5 marks out of 9, so as to really minimize the number of outliers, in case 2 judges are in cahoots together. Any basic knowledge of statistics would show that this leads to more consistent judging/results across the board.
 

Sandpiper

Record Breaker
Joined
Apr 16, 2014
As Mathman has explained, 6.0 requires a majority of votes. That's how they deal with the outliers.

Judges aren't going to all give out the same marks. Ever. No two people are going to react the same way to the same performance. That's just the way figure skating is. There is no universal truth.

I agree the "computer throws out random marks" thing is incredibly stupid though.

Under the ordinal system, Skater A could win even though no judge placed Skater A in first place.
How is the current system any different? A skater could end up with the highest panel mark for a certain component even though no single judge gave them the highest score for it, as long as their average is the highest.
 

gkelly

Record Breaker
Joined
Jul 26, 2003
It's so stupid that GOE is a random selection too.

It isn't. IIRC the random selection was eliminated ca. 2006 or 2008.

Since then, all judges' scores count throughout the competition. Only the highest and lowest scores are thrown out for each GOE and for each component, but usually even a judge who is marking especially high or low will have at least some of their scores count for each skater.

It should be the median 7 marks out of 9.

It is.

Ideally, it would be the median 5 marks out of 9, so as to really minimize the number of outliers, in case 2 judges are in cahoots together. Any basic knowledge of statistics would show that this leads to more consistent judging/results across the board.

That's the question the mathematicians are debating in this thread.
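
For concreteness, here is Sandpiper's toy data run through a trim of one score per end (the current rule described above) versus two per end (the heavier trim proposed above); whether the heavier trim actually gives better results is the open question.

from statistics import mean

def trim_mean(scores, k):
    # Average after dropping the k highest and k lowest scores.
    s = sorted(scores)
    return round(mean(s[k:len(s) - k]), 2)

a = [6.00, 6.50, 7.25, 7.75, 8.25, 8.25, 8.50, 8.75, 8.75]
b = [7.75, 8.00, 8.00, 8.25, 8.25, 9.00, 9.00, 9.25, 9.50]

for k in (1, 2):
    print(k, trim_mean(a, k), trim_mean(b, k))
# 1 7.89 8.54   (drop one at each end, as now)
# 2 8.0 8.5     (drop two at each end)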
 