Scoring bias at the national level | Page 4 | Golden Skate

Sometimes 'tis better to bear the ills we have than fly to others that we know not of.

If the proposal is to have 9 sitting judges, with each skater marked by a slightly different sub-panel of 8, I think that this presents problems, too.
For instance, an over-lenient but unbiased judge could end up giving high marks to everyone except the skater from a particular country, skewing the results. It could, in fact, provide extra incentive for a judge deliberately to lowball other skaters.

One version of the early IJS had, as I recall, 14 sitting judges, and featured a random and secret draw that selected which 9 marks would count and which 5 judges (without being told) would just be sitting there like fools. (This scheme was eventually laughed off the stage.)
You need the same panel for an event as every judge scores differently, so it's difficult to not have judges from that country. Maybe not have judges from the top 5 countries to avoid placement issues? I don't know, still a reach. All I'm saying is ice dance judging is much worse than this example (look through the last 4 years of Christopher Buchanan on skatingscores, or the Italian judges with G/F, or the US judges with C/B).

My proposal would be for skaters who don't have a judge from their country on the panel to be judged exactly as now, i.e. you take the middle 7 out of 9 for each individual GOE/PCS mark. Where they do have one, you leave that judge's marks out and take the middle 6 out of the remaining 8. An alternative would be to have an extra judge whose marks only come into it when a skater is performing who has a judge from their country on the normal panel. To my mind this extra judge would be no different from the other judges on the panel, in that they are independent of the skater, so their marks would be randomly the same, higher or lower, just as those of the rest of the panel are.
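A rough sketch of the arithmetic behind this proposal, assuming the "middle 7 out of 9" means dropping the single highest and single lowest mark and averaging the rest. The function names, the marks and the country codes are all made up for illustration; real protocols also apply factors and rounding that are left out here.

```python
def trimmed_mean(marks):
    """Average the marks after dropping the single highest and single lowest."""
    if len(marks) < 3:
        raise ValueError("need at least 3 marks to trim one from each end")
    kept = sorted(marks)[1:-1]
    return sum(kept) / len(kept)


def panel_mark(marks, judge_countries, skater_country):
    """Leave out any judge who shares the skater's country, then trim and average."""
    eligible = [m for m, c in zip(marks, judge_countries) if c != skater_country]
    return trimmed_mean(eligible)


# Nine illustrative marks for one component; the country codes are hypothetical.
marks = [9.25, 9.00, 9.25, 9.50, 8.25, 9.00, 9.00, 9.50, 8.75]
countries = ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG", "HHH", "III"]

no_conflict = panel_mark(marks, countries, "ZZZ")    # middle 7 of 9
with_conflict = panel_mark(marks, countries, "DDD")  # middle 6 of the remaining 8
print(round(no_conflict, 3), round(with_conflict, 3))
```

The gap between the two printed values gives a feel for how much the result moves purely because the panel changed, which is the uncertainty the proposal would have to be weighed against.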

However the panel wouldn't be the same for each skater whichever way you look at it, plus at smaller events you really start getting a minimal number of judges judging, e.g. where there's a 5-judge event you're only taking 2 out of 4 judges' marks when you don't include those of a judge whose skater is skating. Perhaps we're only talking about 9-judge events, i.e. GP and above, where it is feasible (for Challengers it would be 4 out of 6 remaining rather than 5 out of 7 - I don't think with the cost implications you'd be talking substitute judges at events below GP level).

However you would have to tell me mathematically whether what I am saying is a goer i.e. is the element of uncertainty in the results by effectively having a different panel for different skaters better or worse than that by which an individual judge can influence the results without being picked up on it - based on the case of Mr Williams in this thread it would appear that about a point (or 10 in my favourite metric of 'effective score') across the SP and LP combined is the most that a judge can affect the score without being picked up.

I would assume that for 5-judge events the mathematics would defeat the option of removing the judge, but for 9-judge ones I don't know - over to you!
 
Actually, cookie cutter judging should be the aim.

...Again, going back to other judged sports I do watch (diving, moguls) there is very little variation between judges. I don't see as many wuzrobbed... Athletes are fine with their scores (most often). Fewer scandals. Are there judges kicked out for nationalist bias? Not that I have witnessed...
That being the case, I think that you should be quite happy with figure skating judging. :)

I was playing around with some statistical tests on the PCSs at the recent Grand Prix Final. Here is Kaori Sakamoto. The rows are the three components, Composition, Musical Interpretation and Skating Skills. The columns are the 9 judges.

9.25, 9.00, 9.25, 9.50, 8.25, 9.00, 9.00, 9.50, 8.75
9.00, 8.75, 9.00, 9.25, 9.25, 9.25, 9.25, 9.00, 9.25
9.25, 9.00, 9.25, 9.25, 9.25, 9.25, 9.00, 9.25, 9.50

The classical way to analyze such data, for instance by a 2-way ANOVA (Analysis of Variance) test, is as follows. Here are 27 numbers. They are not all the same. Why not?

Well, there are three possible reasons: differences among judges, differences among the three components, and “neither of the above.” Neither of the above is called sampling (or measurement) error. If I want to know how tall my top hat is, I measure it: 25.004 cm. Just to make sure, I measure it again: 24.996 cm. So I take the average, 25 cm, as my best guess as to the true height, and the extra +/- .004 is “sampling error.” The default assumption in inferential statistics is that if you measure something 100 times, quite naturally you’ll get 100 different answers, depending on the subtlety of your measuring devices.

So then there are various mathematical formulas for calculating the total amount of variation in these 27 numbers, and for splitting out the part that is correlated with different judges and the part that is correlated with different components. The rest is written off as sampling error. If the judges’ part is too big, or if the components’ part is too big, or both, then the results are regarded as “statistically significant.” If not, then “eh?”
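For anyone who wants to see the splitting-out done, here is a minimal sketch of a two-way ANOVA without replication on the 27 marks quoted above, in plain Python. The function name and the dictionary keys are my own choices, and turning the F ratios into p-values needs an F table or a statistics package, which is left out.

```python
# The 3 x 9 table of marks quoted above: rows are components, columns are judges.
scores = [
    [9.25, 9.00, 9.25, 9.50, 8.25, 9.00, 9.00, 9.50, 8.75],
    [9.00, 8.75, 9.00, 9.25, 9.25, 9.25, 9.25, 9.00, 9.25],
    [9.25, 9.00, 9.25, 9.25, 9.25, 9.25, 9.00, 9.25, 9.50],
]


def two_way_anova(table):
    """Two-way ANOVA without replication (one observation per cell)."""
    r, c = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (r * c)
    row_means = [sum(row) / c for row in table]
    col_means = [sum(row[j] for row in table) / r for j in range(c)]

    # Total variation, then the parts tied to components (rows) and judges (columns).
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_rows = c * sum((m - grand) ** 2 for m in row_means)
    ss_cols = r * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols  # the "neither of the above" part

    df_rows, df_cols = r - 1, c - 1
    df_error = df_rows * df_cols
    ms_error = ss_error / df_error
    return {
        "grand_mean": grand,
        "ss": {"components": ss_rows, "judges": ss_cols, "error": ss_error, "total": ss_total},
        "df": {"components": df_rows, "judges": df_cols, "error": df_error},
        "F_components": (ss_rows / df_rows) / ms_error,
        "F_judges": (ss_cols / df_cols) / ms_error,
    }


result = two_way_anova(scores)
for key, value in result.items():
    print(key, value)
```

Each F ratio is then compared with the critical value for its degrees of freedom; a ratio well above that value is what gets called "statistically significant."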

But here I think that not all that much is gained by such exercises. Just eyeballing the 27 numbers, it is obvious that they are all about the same: about 9.25 or a tad less. Only one number out of the 27 is even as much as 1 point off, that oddity being the 8.25 in composition given by judge #5. (But that judge gave 9.25 to each of the other two components. It would be interesting to ask that judge why Kaori's interpretation of the music was better than her choreography.)

So if the goal is to make figure skating more like moguls, where all the expert judges agree, I think you’ve got your druthers. :)

[Historical sidebar: The rows and columns classically are designated Treatments and Blocks. The reason for this is that this particular family of statistical tests was developed in an agricultural setting: which type of fertilizer is best. So you divide your farm into a bunch of small plots (blocks) and treat each of them with various fertilizers (treatments), and see what the yield is. Maybe one plot of land is just naturally more fertile than another regardless of the treatment. Maybe one brand of fertilizer is better than others for all “blocks.” And maybe the experiment does not support any conclusion at all except that Mother Nature is capricious.]
 
One skater isn't a relevant sample as you should know and especially when she is the world champion. GPF has six skaters... that's really the last place I would look for judges' variance. Try ISU championships. The reality is that this thread exists because the ISU keeps issuing warnings to judges for being out of line. IT DOES happen.

BTW: funny and interesting FACT since you brought up GPF: two judges invited to the GPF, one from China and one from Georgia, had been suspended by the ISU previously for nationalistic bias... yet they were allowed to judge again... at one of the most popular events of the season.
 
However you would have to tell me mathematically whether what I am saying is a goer i.e. is the element of uncertainty in the results by effectively having a different panel for different skaters better or worse than that by which an individual judge can influence the results without being picked up on it...
I think that this is not a mathematical question, but rather a question about figure skating. My intuition is that your proposal is "better rather than worse" compared to the possibility, even the ubiquity, of national bias.

However, I still don't think it would work out. In American football, for instance, the NFL would never consider shuffling in different referees depending on which team had the ball. I think that this would just be admitting that we have a problem with bad calls and we can't do anything about it. For figure skating, the IOC may even have some rules about the selection of judges in the judged sports that it oversees. (?)

(College football does have its version of home cooking. The Rose Bowl game is next Monday between Michigan of the Big Ten conference (comprising 14 teams :laugh: ) and Alabama of the SEC. The two conferences will be invited to send representatives to serve on the officiating crew. Are they ever tempted to "be true to your conference?")
 
One skater isn't a relevant sample as you should know and especially when she is the world champion.
Statisticians can do useful things even with small samples. In fact there is a whole sub-discipline of Small Sample Theory. :bow:

The part about Sakamoto being the world champion, that is kind of the point that I am trying to make. The question of whether these numbers came from a world champion figure skater or from measuring the height of my top hat, that is not within the perimeter of concern to statisticians. The only thing that counts is, here are a bunch of numbers. What can we say about them without any further input?

This is both the power and the limitation of mathematics. (That's what I think, anyway.)
 
The thing is that you cannot take one example of a skater's scores (at one event) and say "see, everything's fine".
I will come up with a hundred examples from the past 5 years to show the opposite if it's needed. (I don't think it is.)

But I don't fully agree with the corridor thing in the first place. I do agree that similar scores are desirable, but only if they do actually stem from clear rules not from all judges having the same biases...
 
The thing is that you cannot take one example of a skater's scores (at one event) and say "see, everything's fine".
I will come up with a hundred examples from the past 5 years to show the opposite if it's needed. (I don't think it is.)
indeed :)
But I don't fully agree with the corridor thing in the first place. I do agree that similar scores are desirable, but only if they do actually stem from clear rules not from all judges having the same biases...
This is important. When I wish for similar scores across the panel, it's exactly because I believe that judges, if they used the rules at their disposal and the training they go through, instead of personal/cultural preferences and nationalistic biases, would come up with similar results, making the sport better understood by the general public. The whole "niche" sport label given to figure skating is not because the sport is unattractive; it's, in my opinion, due to the simple fact that most fans do not understand how the sport is judged, plus the many scandals over the years. People do not care for things they cannot compute easily and/or if they feel that things are rigged. I don't understand all of the particularities of some other judged sports, but when I see the scores lining up pretty much in harmony, and the lack of protest/scandals/wuzrobbed moments, I trust the process. So in that sense, though I know way less about diving, when I watch it, I believe that the right winners are on the podium and it makes it so much more enjoyable.
 
BTW: funny and interesting FACT since you brought up GPF: two judges invited to the GPF, one from China and one from Georgia, had been suspended by the ISU previously for nationalistic bias... yet they were allowed to judge again... at one of the most popular events of the season.
Let me try again. This information is about the real world. It is not about statistics. There is nothing in the numbers from the Grand Prix Final that tells us that a judge from Georgia was suspended by the ISU. This is information of a non-statistical nature. This information cannot possibly be obtained by any statistical means.
 
Not sure I understand what you mean. The two judges previously suspended behaved well at the GPF. Good for them. At the same time, they didn't have a horse in the race.
 
Not sure I understand what you mean.
I guess I don't understand either. I thought I was drawing a distinction between what we can learn from lists of numbers alone and what requires input outside these lists if we want to achieve a fuller picture.

IceWhite said:
The thing is that you cannot take one example of a skater's scores (at one event) and say "see, everything's fine".

My intention was to say the opposite. Show me every number that ever existed (after we first try to understand the sense in which numbers can be said to "exist" :) ) -- but we cannot conclude anything from this about "fineness." Fineness is not a mathematical concept. That's all.
 
Statisticians can do useful things even with small samples. In fact there is a whole sub-discipline of Small Sample Theory. :bow:

The part about Sakamoto being the world champion, that is kind of the point that I am trying to make. The question of whether these numbers came from a world champion figure skater or from measuring the height of my top hat, that is not within the perimeter of concern to statisticians. The only thing that counts is, here are a bunch of numbers. What can we say about them without any further input?

This is both the power and the limitation of mathematics. (That's what I think, anyway.)
Yet the Small Sample Theory, AFAIK, applies to samples equal to or smaller than 30 cases (and even this number is rejected as too small by some researchers). I never heard about Student's t or any similar statistical tool applied to a single case study, though. If you have, please let us know, I will very gladly broaden my horizons :). But for all I know now, a single case study is... right, just that, a case study. It can bring interesting material for further discussion and analysis, yet I think we all know it does not allow for any kind of statistical generalization of results to a larger population. For this you would need a sample bigger than 1, for sure. :)
 
I guess I don't understand either. I thought I was drawing a distinction between what we can learn from lists of numbers alone and what requires input outside these lists if we want to achieve a fuller picture.

My intention was to say the opposite. Show me every number that ever existed (after we first try to understand the sense in which numbers can be said to "exist" :) ) -- but we cannot conclude anything from this about "fineness." Fineness is not a mathematical concept. That's all.

Then what's your point?
 
For this you would need a sample bigger than 1, for sure. :)
I think that it depends on what we are trying to measure. We might, for instance, be interested in the population of all possible individual program component marks that all possible well-qualified judges might give to this performance. In that case the sample size is 27.

The n = 30 rule (yes, 50 is better) is for estimating population parameters from a distribution that is assumed to be approximately normal. The interesting thing about the t-distribution (for estimating the population mean from the sample mean, for instance, especially for finite populations) is that the population of sample means of fixed sample size is approximately normal (Central Limit Theorem) even if the underlying population is not. That's why n = 30 is usually good enough for this kind of test.
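A small simulation makes the Central Limit Theorem point concrete. This is only a sketch: the exponential population (which is strongly skewed), the seed, and the number of trials are arbitrary choices of mine, not anything taken from skating data.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable


def sample_means(n, trials):
    """Means of `trials` samples of size n drawn from an exponential population (mean 1)."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(trials)]


means = sample_means(n=30, trials=2000)
centre = statistics.fmean(means)
spread = statistics.stdev(means)

# For a normal distribution roughly 68% of values lie within one standard deviation.
within_one_sd = sum(abs(m - centre) <= spread for m in means) / len(means)

print("mean of sample means:", round(centre, 3))
print("spread of sample means:", round(spread, 3))
print("theory predicts a spread of about:", round(1 / 30 ** 0.5, 3))
print("share within one standard deviation:", round(within_one_sd, 3))
```

Even though individual draws from this population are far from bell-shaped, the sample means cluster around 1 with a spread close to 1/sqrt(30), which is the behaviour the n = 30 rule of thumb leans on.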

In the analysis of variance example that I tried to explain above in words without any mathematical formulas, the distribution of the statistic under study is the ratio of two chi-square variables, each divided by its degrees of freedom -- i.e., the F distribution. This gives useful information even with small degrees of freedom (with 3 components and 9 judges in that example, that is 2 and 8 for the two factors and 16 for the error term).

OK, that's all I have to say, except to add -- don't mind me. I myself am not a statistician. My areas of interest (unlike statistics) are quite far removed from anything of any conceivable practical use. Back to the topic of biased and incompetent judging.
 
That's right. Yet all the 27 measurements refer to the same case (1 skater at 1 competition). To use statistics on them, you need to change the number of cases under study from 1 to 27. Doing so allows you to use statistics, but it changes the definition of the population the stats apply to (it becomes the population of scores obtained by Kaori Sakamoto at such and such a comp, and not general FS scores of all the skaters at all the comps). So while the change allows for a thorough and deep analysis of this one case, you cannot extrapolate the results to make any conclusions or predictions as to the general population of skater scores, i.e. different sets of data referring to different cases, or even of the same skater, or of different skaters at the same competition etc. etc. That's what I said. This is an analysis of a single case. It says nothing about other potential cases and does not allow for any broader conclusions, really.
 
I think you are talking too much about statistical methods, but the process of finding some bias in judging really begins with the judges' meeting after the competition. If the referee of the competition finds some unusual things in the scoring sheets, (s)he needs to write this in the report, because that's the job of the referee: to write the report about the whole competition, including the judges who were part of it, and send it to the ISU... During the skating the referee writes down components for every skater in a wider range, for example Levito's SS should be from 7.5 to 8.5, CO from 7.75 to 9.00 and PE from 7.5 to 8.75. If some judge scored, let's say, her SS 8.75 or PE 7.25, that judge needs to defend that score in the judges' meeting. If the explanation doesn't sound logical to the referee and the panel of judges, then the ISU will react, based on the report of the referee, and use other measures including statistics, and charge the judge after it...
And to edit - in this particular situation, the explanation "I'm the best judge, you should not ask me to explain my decisions" is not a valid explanation.
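As a sketch of the check described in that example, assuming the referee's ranges are simply compared with each judge's marks: the three ranges are the ones given in the post, while the function name and the judge's CO mark of 8.50 are invented for illustration.

```python
# Ranges the referee noted for the skater, taken from the example in the post.
referee_ranges = {"SS": (7.50, 8.50), "CO": (7.75, 9.00), "PE": (7.50, 8.75)}


def marks_to_defend(judge_marks, ranges):
    """Return the components where a judge's mark falls outside the referee's range."""
    flagged = {}
    for component, mark in judge_marks.items():
        low, high = ranges[component]
        if not low <= mark <= high:
            flagged[component] = mark
    return flagged


# One hypothetical judge: SS above the range, PE below it, CO inside it.
judge = {"SS": 8.75, "CO": 8.50, "PE": 7.25}
print(marks_to_defend(judge, referee_ranges))
```

Whatever comes back is the list of marks that judge would be asked to explain at the meeting.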
 
X's SS should be.... says who? based on what?
This is where it really begins, IMHO.
 
Referee of the competition, based on the practice of judging figure skating competitions... It's not that hard to make a corridor for a skater's components, given that the range inside the corridor is that big (at least a one-point range in every component and sometimes closer to two), which doesn't help the skater to win or lose the competition... And the judges outside of the corridor have a chance to defend their opinion... but this judge obviously did not do so in the right manner... I mean, we can debate why some skater's component should be 8 instead of 7, but to debate how for the same skater the same component needs to be much higher than 8 or lower than 7 is really not a possibility on a judging scale from 1 to 10 (and for the skaters on the GP it is really a scale from 5 to 10); that means that for some personal and not objective reason we were watching two different things. Which is what bias, national or not, means after all...
 
I beg to differ on your opinion that the corridor does not help anyone to win or lose the competition. Of course, it does as most judges would rather avoid the necessity to justify their scores being too far off so they will simply comply with the corridor, whether they agree with it or not. Now, you say the corridor is based on the previous scoring practice. But these very scores mostly complied with the previously set corridor....
It further complicates things... and, IMHO, it also forms a basic methodological, or cognitive, error - you justify a fact by a factor depending on what you justify by this factor...
 
...It's not like that hard to make a corridor for a skater's components, given the fact that a range inside the corridor is that big (at least one point range in every component and sometimes even closer to two)...
I am not sure how you are using the word "corridor" here. In the ISU documents that define this term and illustrate the definition with examples, the corridor for each skater is defined by the marks that the judges actually give to that skater for that particular performance (a "case study" if you will. ;) ) "Assessments" are then made when scores fall outside this range.
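To show the difference from a referee's pre-set range, here is a rough sketch of a corridor built from the panel's own marks. It assumes the corridor is centred on the panel average and uses a made-up tolerance of 0.75; the actual ISU thresholds and procedure are in the Communication and are not reproduced here. The marks are the nine composition marks quoted earlier in the thread.

```python
def outside_corridor(marks, tolerance):
    """Return 1-based judge numbers whose mark is more than `tolerance` from the panel average."""
    average = sum(marks) / len(marks)
    return [i for i, m in enumerate(marks, start=1) if abs(m - average) > tolerance]


# Composition marks from the panel quoted earlier; the tolerance is hypothetical.
composition = [9.25, 9.00, 9.25, 9.50, 8.25, 9.00, 9.00, 9.50, 8.75]
print(outside_corridor(composition, tolerance=0.75))
```

The point of the sketch is that nothing outside the protocol goes in: the corridor moves with whatever the panel actually gave.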

It may be easy for a referee to make up a "corridor" (in the ordinary sense of this word, not what the ISU judging rules specify) out of thin air, but ... are you sure that referees actually do this instead of following the ISU's own rules?

Here is the relevant ISU Communication, courtesy of Andrea82 in post #7 above.

 