I’m not sure that this approach is completely sound... what about negative national bias against other countries?

I'm actually currently working on a method to evaluate nationalistic judge bias based on score differentials across competitions. Basically, I take a judge and look at all their scores for all senior-level top-tier competitions in the same segment of the same field (I'm testing this right now on Senior Men's Free Skate scores). Then I find the difference between their score and the average score the skater received from all of the judges--let's call that measure the score differential. Then I split the score differentials into two groups: scores for skaters from the same country as the judge, and scores for skaters from a different country.
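To make the bookkeeping concrete, here's a minimal Python sketch of the differential step. All the data, countries, and numbers below are made up for illustration; the actual dataset and score format aren't specified in the post.

```python
from statistics import mean

# Hypothetical data: (skater_country, this_judge_score, panel_scores)
# per performance, where panel_scores are every judge's score for that skater.
performances = [
    ("USA", 92.0, [88.0, 87.5, 89.0, 90.0, 88.5]),
    ("USA", 85.5, [80.0, 81.5, 79.0, 82.0, 80.5]),
    ("FRA", 78.0, [80.0, 79.5, 81.0, 80.5, 79.0]),
    ("JPN", 84.0, [85.0, 84.5, 86.0, 85.5, 84.0]),
]

judge_country = "USA"

same_country, other_country = [], []
for skater_country, judge_score, panel_scores in performances:
    # Score differential: this judge's score minus the panel average.
    differential = judge_score - mean(panel_scores)
    if skater_country == judge_country:
        same_country.append(differential)
    else:
        other_country.append(differential)
```

In this toy data the judge runs positive on home skaters and negative on everyone else, which is exactly the split the method is designed to surface.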
For example, say the USA judge exhibits pretty much no bias towards her own skaters and her scores are well within the average for other skaters as a group. But she had a terrible date with Didier 20 years ago and has ruthlessly lowballed French skaters ever since. Your method would say she's golden, because the French skaters' scoring deviation from her would get washed out in her average deviation for all the skaters in the second group.
For a potential real-life example, the French judge’s scores for Savchenko/Massot at the Olympics look suspiciously like retaliation at first glance.
I find the average score differential for both groups and compare them. So for instance, if the average score differential for Judge A for skaters from her country is +5, while the average score differential for her scores for skaters not from her country is -1, that means that on average, she scores skaters from her country 5 points higher than other judges do, and scores skaters not from her country one point lower. I then perform a one-tailed t-test (that's a statistical test used to see whether the difference between two averages is significant, i.e. not due to random chance) on the two groups of data.
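The post doesn't say which t-test variant is used; assuming the unequal-variance (Welch) form, the core computation looks like this sketch, with made-up differential numbers:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for testing whether mean(a) > mean(b),
    plus the Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical score differentials (judge's score minus panel average):
same_country = [4.8, 5.5, 3.9, 6.1, 4.7]       # home-country skaters
other_country = [-1.2, 0.3, -0.8, -1.5, 0.2]   # everyone else

t, df = welch_t(same_country, other_country)
```

A large positive t here means the home-country differentials sit well above the others, which is the one-tailed question the method asks. In practice you'd reach for a library routine (e.g. `scipy.stats.ttest_ind` with `equal_var=False`) rather than hand-rolling it.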
Why not a two-tailed test? It seems like we’d want to account for the deviations in both directions.
This produces a statistic called the p-value, a number between 0 and 1 that measures how likely a difference at least as large as the observed one would be to occur by random chance if the "real" averages were equal (in other words, if the judge were totally unbiased). So if p is, say, 0.05, that means a difference that big would occur by chance only 5% of the time if the judge were unbiased, and therefore the judge is most probably not unbiased.
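For intuition about where the p-value comes from: the t statistic gets converted to an upper-tail probability. Here's a sketch using a normal approximation (real t-test software uses the exact t distribution, which has slightly heavier tails at small sample sizes, so this is only illustrative):

```python
import math

def one_sided_p(z):
    """Upper-tail probability P(Z >= z) under a standard normal,
    a rough stand-in for the t distribution when samples are large."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# A statistic of about 1.64 sits right at the p = 0.05 cutoff under this
# approximation; bigger statistics push p toward zero very quickly.
```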
I've looked at 11 judges so far, and it's not pretty. All 11 judges scored their home country skaters higher than they scored skaters not from their home country, usually between 4-6 points higher. In 9 out of 11 cases, that difference is significant, i.e. highly unlikely to be the result of an unbiased judge using p=0.05 as the significance threshold, which is the standard significance threshold for scientific studies. If we lower the threshold to p=0.01, 6 judges' scoring records still show a statistically significant difference between how they score skaters from other countries versus their own.
I do think that 0.05 is a bit of a high standard for judges, when you consider all the myriad ways they could end up with significant deviation in the *total* score: especially under the new percentage-based +/-5 GOE system, an eyebrow-raising differential might come from something fairly innocuous, like consistently lowballing their own skater on low-value spins but giving higher GOE to all the high-value jumps. It's a conspicuous pattern that's probably not just coincidental, but is it national bias? (And considering the trimmed mean, I think we can risk raising the threshold for Type 1 errors slightly.) I'd go with 0.10, although one could arbitrarily set it to something like 0.075 if we don't want to give quite so much leeway.
:eeking:In 2 cases (Weiguang Chen and Peggy Graham, a USA judge), the p value is so small that the program I'm using to calculate it ceases to show a number and instead just displays p < 0.00001. Lorrie Parker also clocks in at an extremely low p=0.000024. Drugs are approved on the backs of higher p values than that!