
Two Olympic Judges suspended by ISU

Ares

Record Breaker
Joined
Feb 22, 2016
Country
Poland
We've covered it millions of times, and I am not usually provoked by yet another mention of Lakernik, Shekhovtsova, Sochi, etc. I do not share the conspiracy-and-scandal views that are popular with some; I think they are delusional. But I do respect people's right to their own point of view, including on the competency of Lakernik. I just cannot bear labels like "fraudster".

What kind of delusional conspiracy is it when people can count, identify, and track steps, see a striking discordance with the attributed tech level, and notice a shady call? Or when they see the very same skater skating virtually the same way with a miraculous increase in points on everything they do, while conveniently receiving no calls for the things they did wrong throughout the season? That produces a fraudulent result, so from that standpoint, calling a fraud a fraud is not egregious.

Convoluted rules, various top-down powers, and the subjective factor unfortunately make it easier to cheat, and it does still continue, even if nothing in my memory beats that one instance. You can always trust people to defend it and proclaim that they "saw everything differently because they were there, they are tech specialists, skilled judges, they were closer...".

The fact remains: they are not willing to deal with the core of their corruption, because essentially they would have to get rid of themselves.
 

cohen-esque

Final Flight
Joined
Jan 27, 2014
Actually he's right, some federations will just have better skaters. It's a confounding variable. You can't just assume that the mean aptitude of skaters from each fed is the same.
Some feds have better skaters, and you can’t assume that the mean aptitude of skaters from all countries is the same. All true.

I think this is utterly irrelevant to Miller's analysis, but I don't think they expressed what I believe they're trying to say very clearly. I think that when they say "placed their skater 1st" or "nine out of nine 1st places," they aren't referring to the judge ranking the skater in the competition. They are referring to the judge's marks for an individual skater compared to the whole panel: a "1st place out of 9" here would mean that they marked their own skater the highest of all the judges, a second place would mean they gave the second-highest marks, and so on.

Basically, find how high their scores are vs everyone else, and convert it to an ordinal value for easy math.

We should expect the judges' marks for individual skaters to be similar. For example, the skater in 5th place who scores 175 points in the FS should receive ~175 points from every single judge, with some normal coincidental variation that *is not due to judging bias.* Since scores can be very, very close in skating competitions, the total scoring deviations from a judge for an individual skater may not actually tell us anything useful. (See 2014 Worlds, Ice Dance podium.) And their scores ranked against the other judges' may not tell us anything either, since with close scores even small deviations can produce dramatic swings. If the US judge scores one USA skater one time and gives them the highest marks of the panel by 2.78 points (spread out over 12-13 elements and 5 PCS categories), is that due to bias, or just coincidence?

But if that country has more than one skater in an event, then that country's judge will be evaluating a home skater on more than one occasion, especially if that judge is on both the short and free panels and their skaters progress to the free. In this case, since without bias it should be coincidental whether they give their own skaters the second-highest or third-highest marks or whatever else of the panel, we expect their average ordinal to be around 4.5; so, for obvious reasons, let's say anything between the 4th and 5th ordinal marks among the judges is fine.

For example, say three Armenian Pairs make the FS at Worlds. There is an Armenian judge on both the SP and FS panels, so she evaluates her own Pairs six times (3 Pairs x 2 segments). The breakdown of her marks, ranked against the other judges' marks, looks like this:
Armenian Pair 1: SP 1st, FS 7th highest marks
Armenian Pair 2: SP 1st, FS 3rd highest marks
Armenian Pair 3: SP 4th, FS 6th highest marks

So, on average, her ordinal score for the Armenian teams is 3.67. We said 4th highest is fine, so she's a little high... but is she too far out of bounds? Let's check it against her scores for her non-home-country skaters... and, voila, 3.81. So her relative scores for her own skaters are really just fine, compared to how she usually scores.

Meanwhile we get the Moroccan judge, who gives his skaters an average 1.8. His average for non-Moroccan skaters is 6.78. Now, *he* is clearly biased.

In both cases, you need to establish an acceptable corridor of scores for those judges, which I will arbitrarily set at one standard deviation around the mean. Say the SD is 1.0 (cuz I'm lazy). That means a judge would come under scrutiny for:
-Likely national (or other specific) bias if their average ordinal for certain skaters only is below 3.5 (consistently high marks) or above 5.5 (consistently low marks)
-Generally incompetent judging if they score all skaters like that.
-Both bias and incompetence if it's mixed (e.g., the Moroccan judge.)

Notice that the rank of their skaters in the competition (their aptitude, in other words) doesn't matter for this method to work. It looks only at whether a judge gives relatively high (or relatively low) marks for her own skaters, compared to both herself and her peers.
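If it helps to see the mechanics, here's a rough sketch of the ordinal check (Python, with an invented panel; this is just my reading of the method, not anything from Miller):

    # Rough sketch of the ordinal method. Assumes we already have each
    # judge's total score per skater per segment; all data here is invented.

    def ordinal(panel_scores, judge):
        """Rank one judge's score against the whole panel:
        1 = this judge gave the skater the highest score of anyone."""
        ranked = sorted(panel_scores.values(), reverse=True)
        return ranked.index(panel_scores[judge]) + 1  # ties get the better rank

    # one hypothetical performance: judge -> points awarded
    performance = {"J1": 175.0, "J2": 173.5, "J3": 176.2, "J4": 174.1,
                   "J5": 172.8, "J6": 175.5, "J7": 171.9, "J8": 174.8,
                   "J9": 173.0}
    print(ordinal(performance, "J3"))   # 1 -- J3 marked this skater highest

    def mean_ordinal(performances, judge):
        """Average ordinal a judge gives over a set of performances."""
        return sum(ordinal(p, judge) for p in performances) / len(performances)

    # Flag rule from above: expected ordinal ~4.5, arbitrary SD of 1.0,
    # so anything outside the 3.5..5.5 corridor draws scrutiny.
    def suspicious(avg_ordinal):
        return avg_ordinal < 3.5 or avg_ordinal > 5.5

You'd run mean_ordinal separately over a judge's home-country performances and over everyone else's, then apply the corridor to both.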


Now, I’m not necessarily endorsing this approach. I see a *lot* of potential issues with it, particularly regarding very small relevant datasets. In real life, it would be just about useless at identifying if the Moroccan judge is biased since it’s such a small fed with few skaters. (Are there even any Moroccan judges?) It would work best for judges from successful, big feds with lots of opportunities, and over whole seasons rather than individual competitions. (It would work extremely well, though, to determine whether all the judges, together, of a certain fed tend to be more or less biased, which seems closest to Miller’s post regarding national bias, in general.)

*Or this isn’t anything at all like what Miller meant, in which case... take it as my own interesting proposal for a possible system of judging review, I guess.

**Or I’ve made some obvious glaring error and this whole thing is nonsensical. I have had a very long work week, so it is very possible. :bed:
 

narcissa

Record Breaker
Joined
Apr 1, 2014
entire post

Ah, this makes a lot of sense and is an interesting study, yes. If that is what Miller meant, I apologize (and certainly find his results very interesting...).
 

Shanshani

On the Ice
Joined
Mar 21, 2018
I'm actually currently working on a method to evaluate nationalistic judge bias based on score differentials across competitions. Basically, I take a judge and look at all their scores across all senior-level top-tier competitions in the same segment of the same field (I'm testing this right now on Senior Men's Free Skate scores). Then I find the difference between their scores and the average score the skaters received from all of the judges; call that measure the score differential. Then I split the score differentials into two groups: scores for skaters from the same country as the judge, and scores for skaters from a different country. I find the average score differential for both groups and compare them. So, for instance, if Judge A's average score differential for skaters from her country is +5, while her average score differential for skaters not from her country is -1, that means that on average she scores skaters from her country 5 points higher than the other judges do, and scores skaters not from her country one point lower.

I then perform a one-tailed t-test (a statistical test used to see if the difference between two averages is significant, i.e., not due to random chance) on the two groups. This produces a statistic called p, between 0 and 1, which measures how likely a difference this large would be to arise by random chance if the "real" averages were equal (in other words, if the judge were totally unbiased). So if p is, say, 0.05, a difference like this would occur by chance only 5% of the time for an unbiased judge, and therefore the judge is most probably not unbiased.

I've looked at 11 judges so far, and it's not pretty. All 11 judges scored their home-country skaters higher than they scored skaters not from their home country, usually between 4-6 points higher. In 9 out of 11 cases, that difference is significant, i.e., highly unlikely to be the result of an unbiased judge, using p=0.05 as the significance threshold, which is the standard threshold for scientific studies. If we lower the threshold to p=0.01, 6 judges' scoring records still show a statistically significant difference between how they score skaters from their own country versus other countries. In 2 cases (Weiguang Chen and Peggy Graham, a USA judge), the p value is so small that the program I'm using to calculate it ceases to show a number and instead just displays p < 0.00001. Lorrie Parker also clocks in at an extremely low p=0.000024. Drugs are approved on the backs of higher p values than that!

I'll do a full write up when I'm done analyzing, since I'm sure a lot of what I just said doesn't make sense to people. But right now, there's a few more judges I want to look at. I also wish I could look at things other than the Men's Free Skate, but it's very time consuming and there are reasons I think you should look at each segment/field separately, so that's probably not going to happen. But still, what I'm doing is not particularly sophisticated, and the ISU could automate it if they contracted a programmer (I don't know how to program, so I have to do a lot of this by hand). No reason not to apply basic statistical tools to examine judges' scores.
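To illustrate how automatable it is, the core comparison is only a few lines in a language like Python (a sketch with invented numbers, not my actual data):

    # Sketch of the score-differential t-test described above. A differential
    # is the judge's score minus the panel-average score for that skater.
    # All numbers below are invented for illustration.
    from scipy import stats

    home_diffs = [4.8, 6.1, 3.9, 5.5, 4.2]          # the judge's own-country skaters
    away_diffs = [-0.5, -1.2, 0.3, -2.0, -0.8,      # everyone else
                  -1.5, 0.1, -0.9]

    # One-tailed Welch t-test: are home differentials larger than away
    # differentials by more than chance alone would produce?
    t, p = stats.ttest_ind(home_diffs, away_diffs,
                           equal_var=False, alternative="greater")
    print(f"t = {t:.2f}, p = {p:.5f}")   # small p => hard to explain as chance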
 

Izabela

On the Ice
Joined
Mar 1, 2018
entire post

Hmmm, isn't this rather too neat, though? Doesn't it assume judges are too dumb to strategize how much of a "point edge" they'll give their own skaters relative to others? It also may not reflect real-life situations where judges would discriminate among their own skaters and give only 1-2 extra points (perhaps judging them more fairly) depending on their actual performance and assumed ability to podium, so there's *more* chance their score would fall around the 4th-5th highest, which in effect may offset the overscoring of the *chosen* skater they think has a better chance of winning/medalling. For example:

Judge 1 (with multiple home skaters):

Skater 1: 1st SP; 1st FS
Skater 2: 4th SP; 3rd FS
Skater 3: 2nd SP; 7th FS

Ordinals: 3.0
Ordinals for non-home skaters: 3.90 (for example)

So even if Judge 1 gave Skater 1 220 points vs. ~190 points from the other judges, the average wouldn't reflect that eyebrow-raising 30+ point difference, since Judge 1 scored their other home skaters very tightly with the other judges (again, because the judge doesn't really think Skaters 2 and 3 have a chance to podium), and thus offset the overscoring?

And wouldn't that kind of system be beneficial to big feds too, since they are likely to have more skaters competing? (I'm just following the "law of averages" here: the more data you have, the closer you'll hover to the mean.)

Let's say:

Judge 2 (with only 1 home skater):

Skater 1: 1st SP; 1st FS

Ordinals: 1.0
Ordinals for non-home skaters: 4.7

And that's it. Since Judge 2's ordinal is 1, obviously far away from 4.5, Judge 2 looks more obviously biased than Judge 1, even though Judge 2's actual scores may differ from the other judges' scores for Skater 1 by only ~10 points.

On the other hand, this may capture not overscoring but underscoring of other non-home skaters whom the judge considers actual threats to their own skaters.
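To make the masking concrete, a toy computation (Python, all numbers invented):

    # Toy illustration: averaging ordinals hides the size of the point gap.

    judge1 = [1, 1, 4, 3, 2, 7]   # three home skaters x SP/FS
    judge2 = [1, 1]               # one home skater x SP/FS

    print(sum(judge1) / len(judge1))   # 3.0 -- close to the 3.5..5.5 corridor
    print(sum(judge2) / len(judge2))   # 1.0 -- looks wildly biased

    # Yet Judge 1 may have boosted one skater by 30+ points while marking
    # the others tightly, and Judge 2's two 1st ordinals may reflect a mere
    # ~10-point edge. The ordinal average cannot tell these cases apart.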

(PLEASE SHOOT ME IF I TOTALLY DIDN'T GET YOUR LOGIC).
 

Harriet

Record Breaker
Joined
Oct 23, 2017
Country
Australia
I'll do a full write up when I'm done analyzing, since I'm sure a lot of what I just said doesn't make sense to people.

Never mind a full write-up here (though I'll definitely appreciate it when you do - I'm a textual analysis person, not stats, though I've edited enough science and marketing papers to follow the gist), I think you're pulling together enough content of substance at this point that you could write your investigation up as a full-blown journal paper and submit it for publication. There are sports science and sports ethics journals where this sort of analysis would fit right in.
 

gkelly

Record Breaker
Joined
Jul 26, 2003
If you want to look at how judges "rank" skaters, it makes more sense to look at how they rank the GOEs and PCS, compared to other skaters in the event and compare to how other judges rank the same skaters, since that's the part of the scoring they have control over.

Yes, with IJS the actual scores and the size of the gaps between skaters matter too. So figure that in as well.

But does it make sense to say that a judge "placed" a skater third if she gave him her 8th highest GOEs and PCS?

The base values of the elements are the same for all judges, and the individual judges will have only a rough idea of what they will be while they are scoring the performances.

Look at what the judges are actually doing, what they actually know in real time while awarding their scores. See if there is a pattern of overmarking or undermarking there.

Most judges are going to overmark their own skaters for reasons that they may not always be consciously aware of -- familiarity and warm feelings for skaters they've been watching since well before junior level, national/cultural preferences about what makes good skating, which of the written and unwritten criteria are most important, etc.

How much is acceptable? How much should trigger an investigation?

How can natural unconscious national bias be distinguished from conscious manipulation? Is there a point where even unconscious bias would be unacceptable?
 

Shanshani

On the Ice
Joined
Mar 21, 2018
I don't think whether it's conscious or unconscious really matters. What matters is whether the judge can grade fairly or not, not whether the judge is grading unfairly consciously or unconsciously.

And as it so happens, out of the 11 judges I've checked so far, 2 of them do not show significant evidence of nationalistic bias: Wendy Enzmann, a US judge, and Agita Abele, a Latvian judge. They both scored their home-country skaters less than 1.5 points higher than they scored other skaters, and when I calculated the probability that a completely unbiased judge might score like they did, I got a high enough probability (p=0.21 and p=0.25, respectively) that there is no conclusive evidence of bias. Both Enzmann and Abele have moderate to large judging portfolios too, so it's not for lack of available evidence that p is high. So it's definitely possible for judges to possess no nationalistic bias, or a bias small enough that it both doesn't really matter and isn't really detectable.

If you compare Enzmann's judging record with her compatriots Graham's or Parker's, it's like night and day -- Parker has a 5-point pro-US bias across all the competitions she's judged, Graham has a 9.4-point pro-US bias, and both of them have p values (i.e., the chance that they are unbiased and happened to score that way randomly) so low that if they were drugs, I would be well on my way to FDA approval. In Graham's case, the p value was so low that my computation program refused to compute the exact number. So there really is a pretty clear distinction between biased judges and unbiased judges.
 
Joined
Dec 9, 2017
…you could write your investigation up as a full-blown journal paper and submit it for publication. There are sports science and sports ethics journals where this sort of analysis would fit right in.

Do it, Shanshani! This was great to read!
 

gkelly

Record Breaker
Joined
Jul 26, 2003
I don't think whether it's conscious or unconscious really matters. What matters is whether the judge can grade fairly or not, not whether the judge is grading unfairly consciously or unconsciously.

It may not matter for the skaters' results, or for strategies to minimize the effect of such bias on the results short of removing these judges from the judging ranks.

But it does matter when people start throwing around words like "corruption" and "fraud" or even "strategizing" or "manipulating."
 
Joined
Dec 9, 2017
But it does matter when people start throwing around words like "corruption" and "fraud" or even "strategizing" or "manipulating."

This is Utopian.

It's exactly what's implied by the result of this particular investigation. People are voicing stuff like this: https://twitter.com/LynnRutherford/status/1009161346146070528

It might never be said directly, but it's implied, and it will weigh on people's minds. It may well weigh on the judges' minds when they mark Chinese skaters next season.

If you've seen people using those words here, they are just voicing what has been implied to them. They are also extremely annoyed that only the Chinese judges were suspended, when the US judges also showed bias. People who've bothered to read Shanshani's or cohen-esque's analyses will know that the US judges were biased too, but as far as "casual" fans are concerned, the labels you've pointed out will stick mainly to the Chinese judges, and I have little doubt they will weigh on the minds of the judges evaluating the Chinese skaters next season.

The ISU has indeed shown that it might be biased in picking out which countries it will single out. If the more involved fans are attaching these labels to the ISU or the US judges, well, they are not doing anything different from what a lot more will be doing with the Chinese judges. Just a pity it will not be of the same magnitude.
 

synteis

Medalist
Joined
Dec 9, 2017
Whole post.

This is fantastic. I'm a heavy user of stats in my real-life job, so this makes sense to me and is also super super cool. I don't know if you're using Excel or Matlab or R or whatnot for this, but it might also be cool to compare GOE vs PCS and see if biased judges tend to boost one or the other. But idk how your score breakdown works, so that might be a lot of extra work!
 

Eclair

Medalist
Joined
Dec 10, 2012
Most judges are going to overmark their own skaters for reasons that they may not always be consciously aware of -- familiarity and warm feelings for skaters they've been watching since well before junior level, national/cultural preferences about what makes good skating, which of the written and unwritten criteria are most important, etc.

How much is acceptable? How much should trigger an investigation?

If the warm feeling the judge is feeling leads to her scoring Vincent above Yuzuru then this should lead to an investigation.
If the familiarity that judge is feeling leads to her scoring Nathan approx. 10 points more than the average of the judges, this should lead to an investigation.
 

narcissa

Record Breaker
Joined
Apr 1, 2014
I'm actually currently working on a method to evaluate nationalistic judge bias based on score differentials across competitions. …

This is exactly what I did too, although I didn't look at individual judges -- hmm, maybe I should have. I did have to look up the names of each judge to determine what country they're from, but then I just sorted them by country. Maybe I'll go back and add individual judges, because in light of this discussion it would be very interesting indeed.

Even without looking at individual judges, the effect is pretty significant -- EVERY judge from the top 5 countries (Can, Japan, USA, Russia, China) gives their skaters significantly (p <<<< 0.05, which is what I used) more GOE and PCS per element/PCS category, and their competitors significantly (p << 0.05) lower GOE/PCS than average, except for (1) Japan, in GOE (they give competitors higher GOE than average), and (2) Russia, in PCS (not significant).

(these charts are per ELEMENT)
All skaters: https://4.bp.blogspot.com/-L0mS8QZY...FoD9VjU5PwSgCLcBGAs/s640/download+%284%29.png
Men: https://1.bp.blogspot.com/-bS7hjBtQ...IzGEpLOttGjQCLcBGAs/s640/download+%283%29.png
Women: https://4.bp.blogspot.com/-IqUAf0Kr...UIx8LZfRRDQACLcBGAs/s640/download+%282%29.png

Results are pretty consistent across the board. It's hard to say who's the worst, but China is DEFINITELY quite out there. So are some other countries.
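For anyone who wants to replicate it, the per-country comparison boils down to something like this sketch (Python/pandas; the file and column names are placeholders, not my actual pipeline):

    # Per-element GOE differentials grouped by whether the judge shares the
    # skater's federation. Column names are invented placeholders.
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("judge_scores.csv")  # hypothetical file with columns:
    # judge_country, skater_country, goe_diff (judge GOE minus panel-average GOE)

    df["same_fed"] = df["judge_country"] == df["skater_country"]

    for country, grp in df.groupby("judge_country"):
        home = grp.loc[grp["same_fed"], "goe_diff"]
        away = grp.loc[~grp["same_fed"], "goe_diff"]
        if len(home) < 2:
            continue  # too little home-skater data to test
        t, p = stats.ttest_ind(home, away, equal_var=False,
                               alternative="greater")
        print(f"{country}: home {home.mean():+.3f}, away {away.mean():+.3f}, "
              f"p = {p:.2g}")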
 

gkelly

Record Breaker
Joined
Jul 26, 2003
If the warm feeling the judge is feeling leads to her scoring Vincent above Yuzuru then this should lead to an investigation.

She didn't score Vincent above Yuzuru. Both the GOEs and the PCS she gave to Vincent were significantly lower than those she gave to Yuzuru, especially the PCS.

The difference in her actual marks was smaller than for most of the other judges, so Vincent's marks from her plus his base values ended up higher than Yuzuru's marks from her plus his base values. But none of the judges had any control over the base values. That was all down to Zhou doing harder jumps, not getting +REP on any of those jumps, etc.

Nor would any of the judges have been entirely aware of exactly how much higher one skater's base values were than another's. They don't get a score tracker. They don't have the scale of values in front of them. They don't get told what levels are called for the spins and steps (or fall deductions, though often those are obvious; or timing deductions). Nor, in free skates, can they be expected to remember the exact point totals of every skater from the short program.

With all the attention they need to give to scoring both the GOEs and the PCS, do you really think they have time to think "Quad flip is worth X points, my +2 is worth Y points, but oh there was an edge call so it was only worth Z; triple axel is worth P points, my -1 is worth Q; that spin was probably a level 4 worth R points with S points for my +1, but if it was only called as level 3 then the base value is T instead of R and my GOE is only U instead of S" etc. etc. for 12 elements, then figure out a total and remember the totals of all the previous skaters and decide how high or low to score the PCS to make sure that skater L slots in ahead of skater J and behind skater K?

If they want to boost a skater's scores, they'll just score that skater's GOEs and PCS as high as they dare. But they can't figure out exactly where that's going to place that skater in relation to another skater with a different base value -- especially one who hasn't even skated yet.

If you want to say that Judge X has scored skater A higher than skater B, look at the scores that Judge X actually gave to skaters A and B.
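A toy version of the arithmetic makes the point (grossly simplified IJS shape, invented numbers):

    # A judge can give skater A lower GOEs and PCS than skater B and still
    # see A finish ahead on her card, because base values are outside the
    # judge's control. Grossly simplified; all numbers invented.

    def segment_score(base_values, judge_goes, pcs_sum):
        # very rough shape: sum(BV + GOE) + PCS; no factors or deductions
        tech = sum(bv + goe for bv, goe in zip(base_values, judge_goes))
        return tech + pcs_sum

    # Skater A: harder content (higher base values), weaker marks from this judge
    a = segment_score([13.5, 12.5, 11.0, 8.0], [0.5, 0.0, 0.5, 0.0], pcs_sum=85.0)
    # Skater B: easier content, better marks from the same judge
    b = segment_score([9.5, 9.0, 8.0, 6.5], [1.5, 1.5, 1.0, 1.0], pcs_sum=92.0)

    print(a, b)   # 131.0 vs 130.0 -- A ahead despite lower GOEs and PCS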

If the familiarity that judge is feeling leads to her scoring Nathan approx. 10 points more than the average of the judges, this should lead to an investigation.

Yes, if the PCS and GOE scores are not only that much above the average of the other judges but also significantly above the next highest judge, that would be a reason to investigate.

But 1) don't assume the results of the investigation before it takes place, and 2) don't look at the final placements a judge "gave" to a specific skater, combined with all the scoring pieces the judge had no control over (or, in some cases, knowledge of), and assume that the judge intended to produce that result. If several skaters are close in overall scores, perhaps especially when some are significantly stronger in base value and others in GOEs and/or PCS, the IJS does not allow judges to know exactly where they are "placing" skaters. So it's meaningless to analyze or try to guess their placement intentions by assuming they were working with knowledge they didn't have.

If they're cheating, all they can do is try to help their favored skaters as much as possible. They can't control whether they're going to push that skater up to 3rd, 4th, or 5th in the pseudo-rankings based on their own scores plus tech panel calls, let alone how much effect their scores will have on the final results.

It's meaningful to say that Judge X has scored Skater A 10 points higher than the rest of the panel and that is suspicious. It is not meaningful to say that Judge X placed Skater A first, or third, or whatever if in fact Judge X's scores for Skater A are not the highest or third highest scores that s/he gave in the event.
 

cohen-esque

Final Flight
Joined
Jan 27, 2014
Hmmm, isn't this rather too neat, though? Doesn't it assume judges are too dumb to strategize how much of a "point edge" they'll give their own skaters relative to others? …
No, you’ve basically captured a lot of the things I find wrong with such a system. It makes several assumptions about the judging that I’m not sure you can justify— and without those, as gkelly must be despairing of pointing out by now, it doesn’t really make sense to consider the “ordinals” of the whole panel as opposed to their actual scores, since that’s not really how IJS works.

Its one advantage is that you don't need to deal with points at all; its major, glaring downside is also that it doesn't deal with points at all. You also need to consider that it's prone to poor results from small datasets... I didn't really have any testing procedure in mind when I was making my post, since I don't endorse this approach, but if we say n>5 ends up as the lowest limit, then for any individual competition it will only work for judges on both panels with three skaters, at least two of whom progress to the FS... There are ways to sidestep that, but it makes me nervous to accuse a judge of corruption based on, say, one single short program score falling slightly outside the normal deviation (although that is a problem with just about any method of detecting bias; I'm just too nice to the judges.)

The strength of such a system would be in establishing that national bias exists generally, on the grand scale (as in, all competitions in all disciplines over the whole season, or all competitions a particular judge has ever been in) -- and there it shouldn't favor the large feds that much, even with strategic overscoring. That's how Miller was originally using it in his post: my post is a bit of a clarification, for narcissa, of what I thought he meant, and then a brief general breakdown of the idea of evaluating the judging using scores as ordinal variables.


I am now curious how such an approach would have worked under 6.0... still small datasets, but aside from that, it's a beautifully simple method. Oh well; it's irrelevant to a discussion of IJS scoring.
 

synteis

Medalist
Joined
Dec 9, 2017
This is exactly what I did too, although I didn't look at individual judges -- hmm, maybe I should have. …

Very interesting to see the ways different countries solve the same problem. I was surprised to see that Canada alone seems to do most of its over- and under-scoring in GOEs rather than PCS, for instance, while for other countries it tends to be more in PCS. And also which countries tend to underscore versus those that overscore instead.
 

jenaj

Record Breaker
Joined
Aug 17, 2003
Country
United-States
These suspensions are a start, but they don't address what would be a much more serious problem, which would be collusion among several judges. One judge acting as an outlier is unlikely to have much effect on the final result. As we can see, Boyang Jin did not win a medal and Sui and Han did not win gold. Aren't the highest and lowest PCS scores still thrown out? Two or more judges could conceivably collude to affect results even if their scores were within the corridor. Isn't that what people thought happened in Sochi with the ladies? Also, as others have noted, the technical panel has much more power to affect the results than any individual judge.
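If the trimming works the way I remember (drop the single highest and single lowest mark for each GOE and component), a quick sketch shows why one outlier is blunted but two colluders are not (Python, invented marks):

    # Trimmed mean: drop one high and one low mark, average the rest.
    def trimmed_mean(marks):
        s = sorted(marks)
        return sum(s[1:-1]) / (len(s) - 2)

    honest = [7.25, 7.50, 7.25, 7.00, 7.50, 7.25, 7.00]

    print(trimmed_mean(honest + [7.25, 7.25]))  # 7.25  -- baseline
    print(trimmed_mean(honest + [9.75, 7.25]))  # ~7.29 -- lone outlier trimmed away
    print(trimmed_mean(honest + [9.75, 9.75]))  # ~7.64 -- second colluder survives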
 

Mamamiia

Medalist
Joined
Feb 28, 2018
Taking the men's free skate for instance (since the men's free skate is coming up a lot), I found that 75% of the time, a judge's score will fall within 5.14 points of the skater's actual score, and 95% of the time, a judge's score will fall within 8.77 points of the actual score (keep in mind, this is valid only for the men's free skate--other competitions and other segments tended to have lower deviations). So a score more than 8.77 points away from the actual score has only a 5% chance of occurring. Note that Judge Parker had four and only four instances of scores that deviated more than 8.77 points from the actual score. They are:

Nathan Chen (over-scored by 10.22 points)
Yuzuru Hanyu (under-scored by 8.92 points)
Boyang Jin (under-scored by 10.02 points)
Adam Rippon (over-scored by 10.30 points)

(She also over-scored Vincent Zhou by 7.19 points.)

Even if you ignore her Vincent Zhou score, the chances of this occurring by accident are therefore roughly:
0.05*0.05*0.05*0.05=0.00000625=0.000625% or 1 in 160,000. (In actuality, the chances of this result occurring by accident are even lower, because the chances of a 10 point over or under-score are even lower than 5%). If you add in Vincent Zhou, it falls even farther. Any system that's unable to catch that is very broken.
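The deviation check described in the quote is easy to reproduce; here's a sketch (Python/numpy, using randomly generated scores rather than real protocols):

    # Find the 75th/95th percentiles of |judge score - panel average| across a
    # segment, then count how often each judge lands outside the 95% corridor.
    # The data here is randomly generated, not real protocol data.
    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.normal(170, 4, size=(24, 9))      # 24 skaters x 9 judges
    dev = scores - scores.mean(axis=1, keepdims=True)

    p75, p95 = np.percentile(np.abs(dev), [75, 95])
    print(f"75% of marks within {p75:.2f} points, 95% within {p95:.2f}")

    outliers = (np.abs(dev) > p95).sum(axis=0)     # per-judge outlier counts
    print(outliers)   # a judge with several hits deserves a closer look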

Out of curiosity, I took a look at both the US and Chinese judges' scores for the top 6 with a similar approach. Because the Chinese judge (Chen) judged both the short and the free, I also added up the total scores for her. In total, Chen scored Jin 31.65 points above average, Fernandez 18.82 below average, and Uno 13.78 below average :palmf:. The Chinese judge's scores showed 2-3 times more bias than the US judge's. :(


Skater      US judge (FS)   Chinese judge (FS)   Chinese judge (total score)
Hanyu       -8.92           +2.18                +2.04
Uno         +0.71           -11.09               -13.78
Fernandez   +2.04           -14.99               -18.82
Jin         -10.02          +22.18               +31.65
Chen        +10.22          -2.98                -3.45
Zhou        +7.19           -2.31                -0.9

I feel bad not only for the skaters who were underscored but also for those who were overscored. It is not their fault, yet it can be an embarrassment they have to deal with in the figure-skating community. There is no better way for judges to show their respect for skaters than giving fair marks.
 