Clearly biased: 20 judges' score histories

Mathman · Jun 23, 2018

Shanshani said:
3. Median score might be interesting, and in theory it would be more protective from multiple extreme values than trimmed mean (what we have now), which means it might be somewhat effective against bloc judging, actually.

I believe that people who have studied this have pretty much concluded that the results turn out the same. The ISU has tried more severe trimming (drop the two highest and two lowest) also without much effect. What the trimming accomplishes is not so much to protect against bias as to correct for accidental keyboarding errors, where a judge accidentally enters 0.25 when he intended 8.25.
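To make the keyboarding-error point concrete, here is a minimal sketch comparing the plain mean, the trimmed mean (drop one high and one low mark) and the median. The panel marks are made up for illustration, and `trimmed_mean` is a hypothetical helper, not the ISU's actual software.

```python
from statistics import mean, median

def trimmed_mean(marks, trim=1):
    """Average the marks after dropping the `trim` highest and `trim` lowest."""
    ordered = sorted(marks)
    return mean(ordered[trim:len(ordered) - trim])

# Hypothetical nine-judge panel for one component (illustrative numbers only).
panel = [8.25, 8.50, 8.25, 8.00, 8.75, 8.50, 8.25, 8.00, 8.50]

# Same panel, but the first judge types 0.25 instead of 8.25.
typo_panel = panel.copy()
typo_panel[0] = 0.25

for label, marks in (("clean", panel), ("typo", typo_panel)):
    print(label,
          round(mean(marks), 3),          # plain mean: dragged down by the typo
          round(trimmed_mean(marks), 3),  # trimmed mean: the typo is discarded
          median(marks))                  # median: unaffected here
```

With these made-up marks the plain mean falls by almost a full point, while the trimmed mean and the median barely move, which is the protection described above.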

chopinskate · Jun 23, 2018

Mathman said:
I believe that people who have studied this have pretty much concluded that the results turn out the same. The ISU has tried more severe trimming (drop the two highest and two lowest) also without much effect. What the trimming accomplishes is not so much to protect against bias as to correct for accidental keyboarding errors, where a judge accidentally enters 0.25 when he intended 8.25.

Won't it even change the corridor for marking, and how the judges behave because of it? It won't protect against the "unwritten" rules of scoring like doing more quads, etc., but is there genuinely no change in the corridor either?

cohen-esque · Jun 23, 2018

chopinskate said:
Won't it even change the corridor for marking, and how the judges behave because of it? It won't protect against the "unwritten" rules of scoring like doing more quads, etc., but is there genuinely no change in the corridor either?

The corridor— at least, the official ISU one— takes into account the average marks of all judges on the panel, so the trimmed mean has no effect on that one way or the other.

Shanshani · Jun 23, 2018

Mathman said:
I believe that people who have studied this have pretty much concluded that the results turn out the same. The ISU has tried more severe trimming (drop the two highest and two lowest) also without much effect. What the trimming accomplishes is not so much to protect against bias as to correct for accidental keyboarding errors, where a judge accidentally enters 0.25 when he intended 8.25.

That doesn't surprise me much either. Trimmed mean rarely changes results already, and score deviations are normally distributed so there's no reason to think that the median would stray much from the mean typically.

nocturnalis · Jun 23, 2018

This is great, but to get an accurate measurement of bias, individual performances have to be analyzed. Sometimes judging panels all overlook flaws of prominent skaters.

Sam-Skwantch · Jun 23, 2018

cohen-esque said:
The corridor— at least, the official ISU one— takes into account the average marks of all judges on the panel, so the trimmed mean has no effect on that one way or the other.

I like the idea of the trimmed mean for the purpose of producing official scores. It makes sense, but when evaluating a judge’s performance the only thing I’m interested in is a +- corridor between the way a judge marks their own federation’s skaters and their nearest opponent. If they are the highest for their own and the lowest for their direct competitors then a flag raises in my mind. What that specific corridor should be is unclear to me and would be better addressed through an individual review and comparison to past marks.

Even if there is no action taken in regards to punishment I’d like to see the judge asked to submit a detailed reasoning as to their reasons for falling outside of the corridor. It’s entirely possible they have a valid reasoning....even if I disagree with them.

Ziotic · Jun 23, 2018

chopinskate said:
It is supposed to. And perhaps you would do that. But I'm not sure if it stands true for everyone.

I try to score skaters sometimes (like in worlds 2018 when there was a fight whether Wakaba or Kaetlyn should have won), and the scores turn out looking pretty fair. The skaters I don't care for can beat the ones I like based on what they put on the ice. Completely depends on the performance. I'm not sure if I'd expect every judge to do this. It's a genuine question, and I don't think Ziotic is saying it's bias that is "acceptable", simply that it exists.

Exactly. It exists; we are humans, therefore it’s impossible to be unbiased. Our human experiences make us biased.

Unless we start judging by metrics that are measurable. An example would be the speed of rotation on a spin, you can’t argue over who was faster but you can argue over what position is best.

People’s bias will determine what position they find “best” and judge accordingly.

For this reason scoring will ALWAYS be debated because no one will agree 100%.

Those things that different people consider “best” are usually related to regional preferences. That being said, some judges are blatantly cooking for the home team, such as the marks from the US judge for Vincent.

chopinskate · Jun 23, 2018

cohen-esque said:
The corridor— at least, the official ISU one— takes into account the average marks of all judges on the panel, so the trimmed mean has no effect on that one way or the other.

OK, genuine question that seems like a natural follow-up: How do we fix all of this, apart from the obvious answer of "make the judges not be biased and follow the rules"?

shiroKJ · Jun 23, 2018

chopinskate said:
OK, genuine question that seems like a natural follow-up: How do we fix all of this, apart from the obvious answer of "make the judges not be biased and follow the rules"?

As long as it's a judged sport with human judges on the panel there will always be some sort of instinctive bias. The only thing that can be done is to make sure the judges are judging without intentional bias, which is really the problem here IMO.

chopinskate · Jun 23, 2018

shiroKJ said:
As long as it's a judged sport with human judges on the panel there will always be some sort of instinctive bias. The only thing that can be done is to make sure the judges are judging without intentional bias, which is really the problem here IMO.

True, but I was asking with the assumption that this won't happen (and what I'd meant by bias in the first place).

Mathman · Jun 23, 2018

Sam-Skwantch said:
I like the idea of the trimmed mean for the purpose of producing official scores. It makes sense, but when evaluating a judge’s performance the only thing I’m interested in is a +- corridor between the way a judge marks their own federation’s skaters and their nearest opponent. If they are the highest for their own and the lowest for their direct competitors then a flag raises in my mind. What that specific corridor should be is unclear to me and would be better addressed through an individual review and comparison to past marks.

Even if there is no action taken in regards to punishment I’d like to see the judge asked to submit a detailed reasoning as to their reasons for falling outside of the corridor. It’s entirely possible they have a valid reasoning....even if I disagree with them.

The "corridor" as defined by the ISU rules is so wide that it seems impossible that any judge would ever get caught. (But a few do each year, especially in ice dance at lower than championship levels.) The problem with having a narrower corridor is that this would encourage judges to score not what they saw but rather what they expect the other judges to score. If I am not mistaken, for the purpose of applying "assessments" and "anomalies" the corridor also includes the scores of the referee and of any members of the judges' oversight committee that are in attendance.

The judges who do get called on the carpet have the opportunity to explain their scores to the judges' oversight committee, including presenting video evidence. The judges and referee also discuss the scores among themselves on site right after the event. These discussions are not made public.

By the way, if you want to do statistics with trimmed means, be sure to use the full sample size (for example, 9 judges), rather than the size of the trimmed sample (7 judges count) in computing standard errors, etc.
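One way to read this advice is the Winsorized-variance estimator of the trimmed mean's standard error (in the style of Tukey and McLaughlin), which divides by the full panel size n rather than the number of marks that survive trimming. The sketch below assumes that estimator; the post does not name one, and `trimmed_mean_se` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean

def trimmed_mean_se(marks, trim=1):
    """Standard error of a trimmed mean via the Winsorized variance.

    Assumption: the Winsorized-variance estimator is what is wanted here.
    Note that n below is the FULL panel size (e.g. 9), not the 7 marks
    that remain after trimming.
    """
    n = len(marks)
    if n - 2 * trim < 2:
        raise ValueError("too few marks left after trimming")
    ordered = sorted(marks)
    low, high = ordered[trim], ordered[n - trim - 1]
    # Winsorize: pull the trimmed marks in to the nearest kept mark.
    winsorized = [min(max(x, low), high) for x in ordered]
    w_mean = mean(winsorized)
    w_var = sum((x - w_mean) ** 2 for x in winsorized) / (n - 1)
    kept_fraction = 1 - 2 * trim / n
    return sqrt(w_var) / (kept_fraction * sqrt(n))

# Illustrative marks only.
print(trimmed_mean_se([8.25, 8.50, 8.25, 8.00, 8.75, 8.50, 8.25, 8.00, 8.50]))
```

Simply taking the standard deviation of the 7 kept marks and dividing by the square root of 7 would understate the uncertainty, because the trimming has already removed the most spread-out values.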

Andrea82 · Jun 24, 2018

Mathman said:
The "corridor" as defined by the ISU rules is so wide that it seems impossible that any judge would ever get caught. (But a few do each year, especially in ice dance at lower than championship levels.) The problem with having a narrower corridor is that this would encourage judges to score not what they saw but rather what they expect the other judges to score. If I am not mistaken, for the purpose of applying "assessments" and "anomalies" the corridor also includes the scores of the referee and of any members of the judges' oversight committee that are in attendance.

The judges who do get called on the carpet have the opportunity to explain their scores to the judges' oversight committee, including presenting video evidence. The judges and referee also discuss the scores among themselves on site right after the event. These discussions are not made public.

I think the referee's marks are omitted when they calculate the mean to determine the corridor.
From Communication 2098: "For each element performed the computer calculates the average GOE of all the Judges. The GOE’s awarded by the Referee are NOT used in this calculation." There is the same line also for Components.

(I guess we may have the situation where it's the Referee who is at risk to fall out from the corridor!)

In the referee report, there is a section where he/she can elaborate if he/she supports the scores of judges highlighted as outside the range.

As you said, there is the round table discussion among the panel at the end of the event, so they can discuss the scores, and the judge who ended up outside the corridor may explain his reasoning to the referee and the other judges.

Format for Referee Report:
https://www.isu.org/inside-single-p...reports/268-referee-report-2016-17-sandp/file
https://www.isu.org/inside-single-p...-reports/12223-id-referee-report-2017-18/file

Attachment to Referee Report to single out judges who made serious errors according to the Referee. It's only for Senior Bs without an Official Assessment Commission on site.
https://www.isu.org/inside-single-p...es-report-for-international-competitions/file

Format for Technical Controller report: https://www.isu.org/inside-single-p...-technical-controller-report-2017-18-s-p/file
https://www.isu.org/inside-single-p...4-id-technical-controller-report-2017-18/file

Guideline for Judges meeting (taking place before start of competition): https://www.isu.org/inside-single-p...ports/264-guideline-judges-meeting-sandp/file
https://www.isu.org/inside-single-p...ports/263-guideline-judges-meeting-dance/file

Guidelines for Round Table discussion: https://www.isu.org/inside-single-p...rts/265-guideline-round-table-discussion/file

OS · Jun 24, 2018

Fantastic work, thanks for looking into it.

Many of these biases only manifest when a federation has a strong contender in the race. I doubt China would have shown that degree of bias had BYJ not been a contender. That is likely why some marks from Israel and the US show more bias this year: their men were stronger this year. If Nathan/Vincent or Alexei Bychenko had not been in with a shout of a medal, we might have seen less bias from their judges, for example.

The thing is... ISU is acting ridiculous. They have actively created a judging system set to encourage federations to 'support' their skater, and then decide to selectively punish the judges for doing too well 'supporting' their own skaters once in a blue moon. So this system will only ever work for federation judges who can work together and who don't step outside their corridor, likely due to not having a contender in the race, which rarely happens. With the geopolitical climate of today, this system heavily favours Europeans, due to historical traditions and social politics in the region to do with the sport, more than Asia or N. America (especially with Patrick Chan no longer competing, and apparently the US and Canada are no longer bros). The pattern will be different in each discipline, depending on the makeup of the contenders rather than on specific federation trends. For example in Dance, I'd easily expect European judges' marks to be quite different for France vs Canada.

This further convinces me it is a terrible idea to allow judges to 'reward' their own skaters' scores, and that somehow only federations with contenders in the race are allowed to judge, instead of treating all judges equally regardless of federation, big or small. Every judge should be looked upon on their own merit, to build experience, establish credibility and deserve a chance to qualify to judge at the biggest and most prestigious competitions. No matter which federation you come from... it should be on the judge's own individual merit, not the federation's. If judges are not allowed to reward their own skaters' scores, and get rewarded for impartiality so they can qualify for the biggest and most prestigious competitions, we may see a different pattern of judging emerge.

The corridor is so easily manipulated any way as long as you place 2 judges who have a high probability of marking certain ways, designed to push the higher corridor to help 'preferable' skaters, a lower corridor for rivals. Such a strategy can work so well over a season in suppressing momentum as well as to snowball momentum.

dante · Jun 24, 2018

By the way, why judges' nationality is no longer indicated in the official results, like here?

gkelly · Jun 24, 2018

That was true even before anonymous judging and before IJS.

See, e.g., results from 2002 Olympics (scroll down for the judges' list).

My understanding is that the idea was that ISU judges at ISU championships were representing the ISU, not their individual federations.

Of course just saying so doesn't automatically make them impartial. But that was the idea. Or the ideal.

Andrea82 · Jun 24, 2018

dante said:
By the way, why judges' nationality is no longer indicated in the official results, like here?

At Olympics and ISU Championships (Worlds, Euros and 4 CCs) they put the "ISU". At GPs and Senior Bs, they put the country name next to the judges' names.

I thought it was related to the qualification the judges must have to serve in various type of competitions. However, after some checks, it is not the case. At GP Final they put the country next to the names even if the judges must be ISU Judge to be on the GP Final panel.

There are 2 types of judges able to serve in International competitions:

ISU judges
International judges

First you become an "International judge". Then with the required experience and having passed a further exam you can become an "ISU judge"

Olympics, Nebelhorn when serving as Olympic qualification, Worlds, Euros, 4 Continents, Junior Worlds and GP Final:
Referee, Technical Controller, Technical Specialists, Judges, Data Operator, Replay Operator must have the "ISU" qualification

Grand Prix (except Final): Judges can be "International" while Referee, Controller, Specialists and Data/Replay Operators must be "ISU"

Junior Grand Prix (except Final): Judges and Referee can be "International"; Controller, Specialists and Data/Replay Operators must be "ISU"

Senior Bs: all can be "International"

eppen · Jun 24, 2018

This was a great bit of research and goes to show what does seem to be an unfortunate common practice with a long tradition indeed - there is research on Cold War era bloc voting. And Sonja Henie apparently got a lot of Norwegian judges on her side in competitions back in the day. Seriously wonder if that element can ever be erased from judging?

But would there be a way to figure out how much those high/low scores change the final score a skater gets? I mean, if you have your own judge scoring you high and then your biggest rival's judge scoring you low, then those cancel each other out? I think that already in the period before it was possible to follow who scores what, it was shown statistically that if you have a judge of your own on the panel, your scores were likely to be higher.
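One rough way to put a number on this is a leave-one-out comparison: compute the panel's trimmed mean with every judge, then again without the judge in question, and look at the difference. The sketch below does that for a single mark with made-up numbers. `judge_effect` is a hypothetical helper, and a real analysis would have to repeat it for every element and component and apply the official factors.

```python
from statistics import mean

def trimmed_mean(marks, trim=1):
    """Average after dropping the `trim` highest and `trim` lowest marks."""
    ordered = sorted(marks)
    return mean(ordered[trim:len(ordered) - trim])

def judge_effect(marks, judge_index, trim=1):
    """Panel trimmed mean with the judge minus the trimmed mean without them."""
    others = marks[:judge_index] + marks[judge_index + 1:]
    return trimmed_mean(marks, trim) - trimmed_mean(others, trim)

# Illustrative panel: index 0 is the skater's own judge (high),
# index 1 is the main rival's judge (low).
panel = [9.50, 7.00, 8.25, 8.50, 8.25, 8.00, 8.50, 8.25, 8.50]

print(round(judge_effect(panel, 0), 4))  # home judge: small positive shift
print(round(judge_effect(panel, 1), 4))  # rival's judge: small negative shift
```

In this toy panel both outlying marks are thrown away by the trimming, so each judge shifts the result only indirectly, by deciding which of the other marks gets dropped. The two shifts point in opposite directions but are not guaranteed to cancel exactly.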

E

Mathman · Jun 24, 2018

eppen said:
Seriously wonder if that element can ever be erased from judging?

No. The ISU is set up and organized that way.

Each ISU member (i.e., national federation) sends its team to a competition. The team consists of skaters, judges, officials, coaches, doctors, etc. Each team is expected to work together to maximize the chances of winning as many medals as possible.

chopinskate · Jun 24, 2018

Mathman said:
No. The ISU is set up and organized that way.

Each ISU member (i.e., national federation) sends its team to a competition. The team consists of skaters, judges, officials, coaches, doctors, etc. Each team is expected to work together to maximize the chances of winning as many medals as possible.

That's... depressing.

Miller · Jun 25, 2018

Shanshani said:
Why the men’s free skate?

Like I mentioned before, the data that I used to produce these calculations comes exclusively from the men’s free skates for Olys, Worlds, 4CC/Euros, and Grand Prix competitions. There were a few reasons I decided to restrict my data set in this way. First, because of different PCS factoring and different required elements between the short and the free program and between, say, men’s and ladies’, score differentials are not comparable across different competition segments and different fields. The men’s free skate tends to have relatively large variations in scores relative to other fields and segments, as there are simply more available GOE and PCS points. Therefore, it wouldn’t make sense to put men’s free skate data in the same data set as the ladies’ free skate, or the men’s short.

Because of how time consuming it was to do the analysis for a single segment of a single field, I did not look at other segments or other fields, nor did I look at B level competitions. Sorry, even I don’t have that much time on my hands. However, I would like to note that the time consuming aspects of running the analysis can be automated, so the ISU could still run the same analysis as long as they were willing to pay the one time cost of hiring a programmer. I don’t know how to program, so I have to do it by hand.

As for why the men’s free, and not the ladies’ free, or the pairs’ short, or whatever, that’s just a matter of personal preference. The men’s event is my favorite event, so I’m more interested in data about it than data from other events. *shrug* Hey, if you want to see what the other events looks like, you can run this analysis yourself. It’s worth noting that almost all judges judge for more than one field, and if they’re biased judging for one field, they’re probably biased judging for other fields.

I was wondering if it might be possible to include other fields or even the Men's SP by doing some sort of factoring. That way you could increase sample sizes and maybe identify other judges that might be biased but where the sample size is too small in one individual segment of one particular field e.g. Lorrie Parker judged 15 American men in the LP over the course of the 2 years in question on skatingscores.com (probably enough of a sample size), but also judged 22 other Americans i.e. men in the SP, plus Ladies and Pairs (I can already tell you she's just as biased in those fields as she is in Mens LP, just looking at how high she scored American skaters relative to other judges on the panel).

However with some other judges you won't have as many as 15 cases, but may have that or more if you include other fields.

For example, the factoring for Mens LP is 2.0 for PCS, but 1.0 in the SP. Is it possible simply to double the differences in the SP, to get something that could be included as well? Obviously TES won't be exactly half what you get in the LP due to the different elements, but it won't be far off, plus PCS will be exactly half. Would it be possible to include something like this, or does it have to be 'exactly perfect' i.e. doing it segment by segment.

Also if you did this you could include Ladies SP, Pairs SP, Dance SD, multiply differences by 2.5, Ladies LP and Pairs LP multiply by 1.25, and Free Dance multiply by 1.667.
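The rescaling idea can be sketched as a simple lookup table. The multipliers are the ones proposed in this post (everything scaled to the Men's LP); the segment keys and the `rescale` helper are made-up names for illustration. As noted above, the scaling is exact only for PCS and approximate for TES.

```python
# Multipliers proposed above, scaling each segment to the Men's LP.
# The dictionary keys are hypothetical labels, not official ISU codes.
SCALE_TO_MENS_LP = {
    "men_lp": 1.0,
    "men_sp": 2.0,
    "ladies_sp": 2.5,
    "pairs_sp": 2.5,
    "dance_sd": 2.5,
    "ladies_lp": 1.25,
    "pairs_lp": 1.25,
    "dance_fd": 1.667,
}

def rescale(differences):
    """Convert (segment, score difference) pairs to Men's LP-equivalent points."""
    return [(segment, diff * SCALE_TO_MENS_LP[segment])
            for segment, diff in differences]

# Illustrative differences between one judge's total and the panel average.
sample = [("men_sp", 1.5), ("ladies_lp", 2.0), ("men_lp", 3.0)]
print(rescale(sample))
```

Pooling the rescaled differences into one data set assumes that the spread of judges' marks scales in the same proportion as the factors, which would need checking against real data.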

Would be interesting to know if this were possible, or if not, is it possible to combine different p's? E.g. p for the Men's SP might be x, based on a sample size of y, while for the LP it might be a, based on a sample size of b. Can you then get a combined p from the values of x and a, with a sample size of y + b?
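Combining p-values from separate segments is a standard problem. One textbook option is Stouffer's weighted Z method, sketched below with weights equal to the square root of each sample size. This is an assumption about what would suit the analysis, not the method used in the original post. It also assumes the p-values are one-sided, lie strictly between 0 and 1, and come from independent samples, which is questionable when the same skaters appear in both the SP and the LP.

```python
from math import sqrt
from statistics import NormalDist

def combine_p_values(p_values, sample_sizes):
    """Weighted Stouffer combination of one-sided p-values.

    Each p-value is converted to a z-score, the z-scores are averaged with
    weights sqrt(n), and the result is converted back to a p-value.
    """
    if len(p_values) != len(sample_sizes):
        raise ValueError("need one sample size per p-value")
    std_normal = NormalDist()
    weights = [sqrt(n) for n in sample_sizes]
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    combined_z = (sum(w * z for w, z in zip(weights, z_scores))
                  / sqrt(sum(w * w for w in weights)))
    return 1.0 - std_normal.cdf(combined_z)

# Illustrative numbers: p = 0.08 from 15 LP scores, p = 0.11 from 22 others.
print(combine_p_values([0.08, 0.11], [15, 22]))
```

Two individually borderline results that point the same way combine into a smaller p, which is the gain from pooling segments that this post is after.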
