Tl;dr: 10/20 judges I examined are almost certainly biased, and 6 more are highly likely to be. Top 5 worst offenders by obviousness: Weiguang Chen (China), Anna Kantor (Israel), Peggy Graham (US), Lorrie Parker (US), Saodat Numanova (Uzbekistan). Judges from all major federations show strong evidence of nationalistic bias.
There’s been a lot of discussion of judging bias recently, in particular which judges may or may not be biased and how the ISU seems to be very inconsistent about catching it. In order to ground this discussion in more solid evidence, and also to demonstrate that there are better methods of catching judging bias than whatever the ISU is using now, I’ve devised a statistical method for detecting judging bias which I hope to show is both powerful and fair.
I ran the method on scores from the men’s free skate portion of all high-level senior international competitions (ie. Olys, Worlds, 4CC, Euros, Grand Prix Series) in the past two seasons for 20 different judges, to determine each judge’s level of bias in these competitions and the strength of the evidence that the bias exists. For each judge, I combined all the scores they gave out in the men’s free skate portion of each of the competitions they scored. I looked at two statistics. The first is the difference between how much a judge overscored their own skaters relative to the skater’s average score and how much they over/underscored other skaters (let’s call this the point differential, or PD). You can think of the PD as the average number of bonus points skaters from the judge’s own country get from that judge compared to skaters from other countries. The second statistic is the probability that a score differential at least the size of the actual PD would have occurred by accident if the judge is unbiased. This is called p. The lower p is, the stronger the evidence that the judge is biased (for context, p = 0.01 is already considered really low). Here are my results, in order of strongest evidence to weakest, broken into three groups.
Results
Bias highly probable to virtually certain (p < 0.01):
Weiguang Chen (China) - PD: 10.1, p < 0.00001
Anna Kantor (Israel) - PD: 9.87, p < 0.00001
Peggy Graham (USA) - PD: 9.38, p < 0.00001
Lorrie Parker (USA) - PD: 5.12, p = 0.000024
Saodat Numanova (Uzbekistan) - PD: 5.76, p = 0.00010
Olga Kozhemiakina (Russia) - PD: 5.19, p = 0.00020
Tatiana Sharkina (Russia) - PD: 4.56, p = 0.000275
Jeff Lukasik (Canada) - PD: 9.56, p = 0.00082
Albert Zaydman (Israel) - PD: 4.83, p = 0.0043
Sakae Yamamoto (Japan) - PD: 4.47, p = 0.0044
Yes, a full half of the judges I checked are pretty obviously biased.
Bias quite probable (0.01 < p < 0.05):
Masako Kubota (Japan) - PD: 3.67, p = 0.012
Daniel Delfa (Spain) - PD: 5.86, p = 0.012
Yuri Guskov (Kazakhstan) - PD: 4.36, p = 0.027
Janice Hunter (Canada) - PD: 3.63, p = 0.029
Igor Obraztsov (Russia) - PD: 3.04, p = 0.038
Nobuhiko Yoshioka (Japan) - PD: 4.74, p = 0.046
p < 0.05 is the typical standard for statistical significance, for instance in scientific studies. 16/20 judges show statistically significant evidence of bias.
No strong evidence of bias (p > 0.05):
Na Young Ahn (South Korea) - PD: 2.48, p = 0.13
Wendy Enzmann (USA) - PD: 1.75, p = 0.17
Sung Hee Koh (South Korea) - PD: 1.5, p = 0.20
Agita Abele (Latvia) - PD: 1.4, p = 0.22
So, as you can see, only 4 out of the 20 judges I looked at show no significant evidence of bias. And *every* judge I looked at scored their home skaters higher than they scored other skaters—the only question is whether there is enough evidence to rule out chance as the explanation for that difference. But in any case, we can see already that my way of testing for judging bias is much more effective than the ISU’s way.
You can see the full results and calculations in a spreadsheet which is located here. The spreadsheet also has the judges ordered by PD, if you would like to see the judges in order of who gave the most bonus points. I also recorded some extra statistics, like the number of home country or foreign skates each judge scored, and the 95% confidence interval—you can kind of think of this as giving the likely possible range of PDs a judge’s scores will settle around if they judge a ton more skates and don’t change their behavior. For judges who didn’t have as many scores, the range is quite large, because we don’t have that much information about how the judge tends to score. However, for many of those judges, their bias is so large it’s still detectable even without much data. Yeah...
You can also see judges’ individual score deviations for most of the men’s free skates from these past two seasons. Except for the Olys one, which I took from a previous spreadsheet, these aren’t color coded or anything, so they’re a bit hard to read, but if anyone wants to zoom in on specific instances of suspicious judging, have at it.
Methodology
Ok, these results are interesting, but you might be wondering how I calculated the p values, what p even really is, why I only looked at men’s free skate scores, etc. So I’m going to spell out exactly what I did, why I did it, and how the whole p value thing works.
Gathering Data
The first step in the analysis was to gather all the data. Fortunately skatingscores.com has all the judges’ scores laid out in an easily accessible manner. For each judge that I examined, I determined which competitions they judged men’s free skates for in the past two seasons, and then imported all of the judges’ scores for all of the skaters from those competitions into an Excel sheet. I then computed the mean (ie. average) score for each skater in a column next to the judges’ individual scores.
Note that the mean score is different from the skater’s actual score—the mean score is simply the number you get when you add up all the judges' scores and divide by the number of judges, whereas the actual score is calculated in a much more convoluted manner. I used the mean score instead of the actual score in my calculations for simplicity, though I don’t think the results would have changed much if I had used the actual score. If anything, using the mean score makes it slightly easier for judging bias to hide.
Underneath my table of judges’ scores, I created another table, which computes the difference between the score each judge gave and the mean score the skater received. So for instance, if a judge scored a skate at 195 points, and the skater averaged 193 points across all 9 judges, then the cell underneath that judge in that skater’s row would read 2. Let’s call this the score deviation.
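This step is easy to automate, by the way. I did it by hand in Excel, but for anyone who wants to script it, a rough Python sketch of the score deviation calculation (with made-up scores, not real data) might look something like this:

```python
import pandas as pd

# Made-up scores, purely for illustration: one row per skater, one column per judge.
scores = pd.DataFrame({
    "skater": ["Skater A", "Skater B", "Skater C"],
    "J1": [195.0, 180.5, 160.2],
    "J2": [193.1, 182.0, 158.7],
    "J3": [191.0, 179.3, 161.5],
})

judge_cols = ["J1", "J2", "J3"]

# Mean (average) score for each skater across the panel.
scores["mean_score"] = scores[judge_cols].mean(axis=1)

# Score deviation: each judge's score minus that skater's mean score.
deviations = scores[judge_cols].sub(scores["mean_score"], axis=0)
print(deviations.round(2))
```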
After I computed the differences for all of the competitions that were part of my data set, I had to combine all of the scoring data for each judge. In a separate sheet, I collected the score deviations for every score that each of the judges I examined gave across all the competitions where they scored the men’s free skate. For each judge, I split these score deviations into two groups: score deviations for skaters from the same country and score deviations for skaters from different countries. For instance, the first judge I examined, Lorrie Parker, a US judge, would have all of her score deviations for US skaters like Nathan Chen and Adam Rippon listed under one column, and all of her score deviations for non-US skaters like Yuzuru Hanyu and Mikhail Kolyada listed under another column. All of the analysis that I did is based on comparing these two groups of data.
I computed the mean for these two groups, producing two numbers: the judge's average score deviation for home country skaters and average score deviation for other countries’ skaters across all the competitions I examined. In other words, the first number gives the average number of points the judge overscored her own skaters relative to the mean score of all the judges, and the second number gives the average number of points the judge over/underscored other skaters (while no judges underscored their own skaters on average, some judges are just nice and tend to score everyone highly). I recorded these two numbers in the results page, and a third number, the difference between the two averages. This is the point differential (PD) statistic mentioned earlier, and it represents how many points higher a judge tends to score her own skaters than she tends to score other skaters.
Because PD is the difference between two averages, it’s more useful to look at than just the home skater average. Some judges are just stricter or more forgiving in general, which means that just looking at the home average can be a bit misleading. For instance, if you only looked at Canadian judge Lukasik’s home skater score deviation average, you might find the fact that he scored Canadian skaters an average of 4.36 points higher than the other judges concerning, but not as egregious as some other judges. However, when you notice the fact that he *also* scored non-home skaters 5.2 points lower than other judges, the full extent of his bias (9.56 extra points for Canadian skaters over non-Canadian skaters) becomes more apparent.
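For completeness, here’s what the home/foreign split and the PD calculation could look like if scripted, again in Python and again with made-up numbers rather than real scores:

```python
# Made-up per-judge data: (score deviation, skater's federation) pairs.
judge_country = "USA"
deviations = [
    (4.2, "USA"), (6.1, "USA"), (3.5, "USA"),                   # home skaters
    (-0.8, "JPN"), (1.0, "RUS"), (0.2, "ESP"), (-1.3, "CHN"),   # everyone else
]

home = [d for d, fed in deviations if fed == judge_country]
foreign = [d for d, fed in deviations if fed != judge_country]

home_avg = sum(home) / len(home)           # avg over/underscoring of own skaters
foreign_avg = sum(foreign) / len(foreign)  # avg over/underscoring of other skaters
pd_stat = home_avg - foreign_avg           # the point differential (PD)

print(f"home avg = {home_avg:.2f}, foreign avg = {foreign_avg:.2f}, PD = {pd_stat:.2f}")
```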
p value and significance testing
After I computed these basic statistics, I turned to computing a statistic that is far less transparent to most people: p. What is p? Well, like I said earlier, in this context, it’s the probability (from 0 to 1, so a p of 0.5 is the same as 50%) that a difference between the home skater and foreign skater averages at least as large as the observed PD could occur through random chance, if the judge does not systematically score one group of skaters (her home skaters) higher than the other (foreign skaters), relative to the other judges’ scores. To put it in simpler terms, it’s the probability that an unbiased judge would wind up with the same or higher PD through luck.
In order to understand this better, let’s imagine a hypothetical, totally unbiased judge (difficult hypothetical, I know). If this judge scored thousands and thousands of men’s free skates and thousands and thousands of skaters, we would expect that there would be no difference in how that judge scores skaters from her home country and foreign skaters. She might be a nice judge or a mean judge (so her average score deviations could be high or low), but either way, the difference between her scores and other judges’ scores should be the same on average regardless of whether the skaters are from her country. In other words, she should have a PD of zero.
Of course, real judges don’t score thousands of men’s free skates—since judging stopped being anonymized, they might at best have scored 5 or 6. Thus a real, unbiased judge is likely to have a PD that isn’t zero, just through random chance. The higher the PD, the less likely it is that they are truly unbiased, but it’s possible for a judge to have a PD of, say, 3, without being truly biased—it’s just luck that they wound up overscoring their own skaters relative to other skaters by 3 points. The question therefore is: how do we know if the point differential reflects a real tendency to score home skaters higher, or just dumb luck?
That’s where p comes in. p tells us *exactly* how unlikely it is for the PD to be that high if the judge is truly unbiased (ie. does not tend to score one group of skaters higher than another group). Thus if p is very, very low, we can be quite sure that the judge does not score home skaters and foreign skaters the same.
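If the abstract definition isn’t clicking, here’s a brute-force way to see what p is getting at. This is a shuffle-the-labels (permutation-style) illustration, not the test I actually used (that’s described below), and the numbers are made up: if the judge is unbiased, the home/foreign labels on her score deviations are interchangeable, so we can shuffle them thousands of times and count how often a PD at least as big as the real one shows up by pure luck.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up score deviations, purely for illustration.
home = np.array([4.0, 6.5, 3.2, 5.8])
foreign = np.array([-0.5, 1.2, 0.3, -1.1, 0.8, -0.2])
observed_pd = home.mean() - foreign.mean()

pooled = np.concatenate([home, foreign])
n_home = len(home)
n_shuffles = 100_000
count = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)  # shuffle the labels: the first n_home entries play "home"
    if pooled[:n_home].mean() - pooled[n_home:].mean() >= observed_pd:
        count += 1

# Fraction of shuffles where an unbiased judge got a PD this big by luck alone.
print(f"observed PD = {observed_pd:.2f}, shuffle-based p ≈ {count / n_shuffles:.4f}")
```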
This p statistic, by the way, is used all the time in science and other fields where it’s useful to determine whether a difference between two sets of data is due to chance or some other factor. Read more about it here. In fact, there are established norms for when we consider p values low enough to reject the idea that the groups being compared (in this case, a judge’s score deviations for home skaters versus foreign skaters) have the same underlying mean (in this case, the same tendency to over- or under-mark skaters, regardless of country of origin). Typically, this threshold is p < 0.05, though depending on how confident you need to be, sometimes the threshold is lower. As you can see, 80% of the judges I looked at had scores that met this threshold. You might have heard the phrase “statistically significant”—this phrase refers precisely to whether the p value for your comparison falls below whatever p threshold you set, again typically p < 0.05. So, for 16/20 judges I looked at, I found a statistically significant difference between how they mark home country skaters versus foreign skaters.
How did I actually calculate p? Well, to be honest, the nitty-gritty details of how p is calculated are too complicated to get into here—you’d need to take a college-level statistics class to get into that. But we don’t really need to know all that information—in fact, I don’t know it off the top of my head either. For the actual calculation, all I needed to do was feed the data—the two groups of score deviations—into a calculator built to compute p values (here is the calculator I used). The main thing that you need to know is which test I used (there are different tests for determining significance depending on what kind of data you have and how it’s organized).
I used a one-tailed T-test. Why a T-test? T-tests are used for comparing means of two data sets and seeing if they differ significantly from each other—basically all the stuff I was describing. I was comparing the mean score deviation for home country skaters versus foreign skaters for each judge. Why one-tailed? The choice between one-tailed versus two-tailed depends on what you’re looking for: are you looking for differences in means in both directions, or only one direction? For my purposes, I was not concerned with cases where judges significantly underscore their own skaters relative to foreign skaters. First, I haven’t even run into a case where a judge underscored their own skaters relative to foreign skaters at all, let alone significantly. Second, if that ever occurs, it would not really be of interest to us, because we’re only interested in discovering pro-home country bias. If a judge is biased against their home country, well, that would be kind of weird, but I’m not really sure there would be a case for disciplining that judge (plus their fed probably wouldn’t let them judge any more anyway). (But if you really want to quibble about one tail vs two, you can easily re-run using a two tailed test. Or just double the p values.)
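For reference, the same kind of one-tailed two-sample t-test can be run in a couple of lines of Python instead of an online calculator. This sketch uses made-up numbers and Welch’s version of the test (no equal-variance assumption), which may differ slightly from whatever the calculator assumes, so treat it as illustrative rather than a replica of my exact setup:

```python
from scipy import stats

# Made-up score deviations for one judge, purely for illustration.
home_devs = [4.0, 6.5, 3.2, 5.8]
foreign_devs = [-0.5, 1.2, 0.3, -1.1, 0.8, -0.2]

# One-tailed two-sample t-test: is the home mean significantly GREATER than the
# foreign mean? equal_var=False gives Welch's test (no equal-variance assumption).
t_stat, p_value = stats.ttest_ind(home_devs, foreign_devs,
                                  equal_var=False, alternative="greater")

pd_stat = sum(home_devs) / len(home_devs) - sum(foreign_devs) / len(foreign_devs)
print(f"PD = {pd_stat:.2f}, t = {t_stat:.2f}, one-tailed p = {p_value:.4f}")
```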
Why the men’s free skate?
Like I mentioned before, the data that I used to produce these calculations comes exclusively from the men’s free skates at Olys, Worlds, 4CC/Euros, and Grand Prix competitions. There were a few reasons I decided to restrict my data set in this way. First, because of different PCS factoring and different required elements between the short and the free program, and between, say, men’s and ladies’, score differentials are not comparable across different competition segments and different fields. The men’s free skate tends to have larger variations in scores than other fields and segments, as there are simply more GOE and PCS points available. Therefore, it wouldn’t make sense to put men’s free skate data in the same data set as the ladies’ free skate, or the men’s short.
Because of how time-consuming it was to do the analysis for a single segment of a single field, I did not look at other segments or other fields, nor did I look at B level competitions. Sorry, even I don’t have that much time on my hands. However, I would like to note that the time-consuming aspects of running the analysis can be automated, so the ISU could still run the same analysis as long as they were willing to pay the one-time cost of hiring a programmer. I don’t know how to program, so I had to do it by hand.
As for why the men’s free, and not the ladies’ free, or the pairs’ short, or whatever, that’s just a matter of personal preference. The men’s event is my favorite event, so I’m more interested in data about it than data from other events. *shrug* Hey, if you want to see what the other events look like, you can run this analysis yourself. It’s worth noting that almost all judges judge more than one field, and if they’re biased judging one field, they’re probably biased judging other fields.
Why these twenty judges
Admittedly, this was where I was most ISU-like, in that the process by which I chose which judges to examine was a bit unsystematic and arbitrary. For some judges, I simply wanted to know if that judge in particular was biased (for instance, Lorrie Parker, the first judge I looked at, was chosen in this way). But largely, I picked judges whose names show up a lot on men’s free skate judging panels, and who had scored their own skaters at least 3 times. If a judge hasn’t scored many skates, then it becomes difficult to conduct statistical analysis on them. For instance, I wanted to look at Chinese judges other than Weiguang Chen (particularly the other suspended judge, Feng Huang), but none of the other Chinese judges who’ve judged the men’s free skate judged enough of their own skaters to be included (Chen hogged nearly all of the Chinese men’s judging). However, because I didn’t go through every judging panel and count exactly how many times each judge judged (again, too time-consuming), it’s possible I missed some judges who’ve judged enough to be analyzed. I think I got all or almost all of them though.
Limitations
There are some limitations to the form of analysis I’ve conducted here. First of all, this method only catches cases where judges hold up skaters from their own country, and is not suited to catching bloc-judging (ie., judging where judges from multiple nationalities collude to help skaters of a given nationality). I think it could possibly be altered to test for bloc-judging (or at least, you could test for whether say, a Kazakh judge scored Russian skaters higher than other skaters, for example) but you’d have to come up with a hypothesis for what the blocs in question are ahead of time.
Second, my analysis is built around the question of whether we can reject the idea that a judge scores home and foreign skaters the same. While this provides pretty strong evidence for nationalistic bias (ie. scoring a skater higher because they are from your home country), it does not rule out other explanations for systematic differences between a judge’s scores for home versus foreign skaters (ie. the judge is biased in favor of certain skaters for other reasons and some of those skaters just happen to be from their home country). This is especially true of cases where a judge is scoring only one skater from their home country. It’s possible that they simply like that skater more than other judges do, for innocent reasons. For instance, Numanova, the Uzbek judge, shows a highly, highly significant pattern of scoring Uzbek skaters more favorably than other skaters. However, though she scored an Uzbek skater multiple times, there was only one Uzbek skater whom she scored: Misha Ge. It’s possible that she’s scoring him higher not because he is Uzbek, but because she believes that the level of artistry he shows in his skates should be reflected with higher PCS numbers than the other judges award him. It’s entirely possible for judges to have preferences/biases in favor of individual skaters for reasons pertaining to the individual and not to nationality—for example, Israeli judge Anna Kantor consistently over-scores Yuzuru Hanyu relative to other judges, and does so despite her tendency to underscore in general, presumably because she just really likes his skating. My analytical methods cannot distinguish between personal bias and nationalistic bias when a judge is only scoring one or two skaters—it’s up to you to decide which is the more likely explanation. (The flip side of this is that the results are more robust for judges from large federations that field multiple skaters at any given competition.)
Additionally, while my analysis can give strong evidence for the existence of nationalistic bias, it cannot tell you why that bias exists. It’s possible that some sources of nationalistic bias are innocent—for instance, perhaps judges simply tend to connect better with the programs of skaters from the same country because of shared cultural norms and expectations, and thus tend to award higher PCS scores to those programs. It cannot prove that a given judge or federation is corrupt and trying to fix the results of competitions to benefit their own skaters—it’s entirely possible that a judge is well meaning, but subconsciously grades their own skaters more forgivingly than other skaters. All it can show is that a given judge marks home country skaters higher, not the reason why.
Also, because we’re doing so many comparisons, the methodology is unfortunately likely to flag someone, at some point, who is innocent of significant bias, at least at the higher p-values. There is, after all, a non-zero probability that an unbiased judge randomly winds up with the PD that they did, and the chance that we’ll encounter such a person goes up the more people we test. To put a rough number on it: testing 20 judges at the 0.05 threshold, we’d expect about 20 × 0.05 = 1 false positive even if every single judge were unbiased. That’s still really unlikely for those p < 0.0001 folks though.
On a different note, some judges seem more markedly biased at certain competitions than at others. Lorrie Parker, for instance, was highly biased at the Olympics, but scored in a much more reasonable and not markedly biased way at CoC 2017 and Rostelecom 2017. To some extent, judging in an unbiased way in some competitions can cover for bias in other competitions. Aggregating the competition data therefore may sometimes hide the extent to which a judge is biased at a particular competition, though it’s usually not enough to hide the fact that they are biased entirely. Weiguang Chen was similar—her 2016 Grand Prix scores are much more reasonable than her 2018 Olys scores. Therefore, it might be useful to run this same analysis method, but for individual competitions. I haven’t done that here, but it should be easy enough to do.
Similarly, because I put all foreign skaters in a single bucket for each judge, the extent to which judges are biased against the foreign skaters who compete directly with their own skaters is also masked. The fact that Chen scores foreign skaters 1.45 points lower across all competitions relative to other judges doesn’t really capture, for instance, how she underscored Shoma Uno by 11.09 points and Javier Fernandez by 14.99 points. Nor does the fact that Lorrie Parker actually scores non-US skaters 0.53 points higher than other judges across all competitions capture the fact that she underscored Yuzuru Hanyu by 8.92 points and Boyang Jin by 10.02 points at the Olympics. So the magnitude of the bias that biased judges bring to the table is in practice likely significantly higher than the PD numbers given. Unfortunately, I couldn’t think of a way to measure this direct competitor effect without making contentious calls over who might be considered a “direct competitor” of various skaters.
Recommendations
To be honest, I think everyone in the p < 0.01 category should be warned, at the very least. I also think that the ISU should switch to a method similar or identical to mine for detecting nationalistic bias. As we can see, this is a much more effective way of catching biased judges than the score corridor method they have now, which gives judges so much leeway that they can be obviously, provably biased without being scrutinized for it. Though the method I employed here may look complicated, it’s actually not, and there is no reason the ISU could not employ it themselves. They could even contract a programmer to make a program that computes all of these statistics and flags suspicious judges automatically—after constructing the methodology, nothing I did requires any human judgment whatsoever; it’s entirely a mechanical application of formulas to numbers. You can even use the method on individual competitions, provided that there are multiple skaters from the judge’s country. For instance, when I ran the numbers solely on Lorrie Parker’s Olys men’s singles judging, I found that PD = 9.62 and p = 0.00057 for that one competition (!).
There are, however, a few drawbacks. First, it may be possible to game this method by judging in an unbiased (or even anti-biased) way at relatively unimportant competitions while still being biased at important competitions (but I think any method of catching bias will be prone to this problem). Second, this method may incentivize bad-faith judges to artificially inflate the scores of foreign skaters who do not directly compete with skaters from the judge’s country. This will lower their PD and consequently make bias harder to detect. However, the more judges do this, the less effective it will be. Interestingly, if this occurs, it may have the happy side effect of counteracting reputational score inflation (if it exists), since it will likely be lower-ranked skaters who get inflated, thus narrowing the gap between lower-ranked and higher-ranked skaters.
Disclosures, aka the part where I say what my biases are before you accuse me of having them
Since a lot of this analysis and the data entry that supports it was done by hand, there may be some mistakes in the data. I don’t think that these mistakes, if they exist, will significantly affect the conclusions, especially for judges nowhere near the significance thresholds, but please let me know if you find any. Also, I’m not a professional statistician by any means, so if I have misapplied or mis-explained a statistical concept, or if you simply have a suggestion for improving the methodology, please let me know as well. (And, as someone in the thread about Chinese judges being suspended asked, I don’t work for NASA. I’m not really using anything beyond statistics 101 concepts here, honestly.)
Also, if you read my posts on the Chinese judge thread, please revise your memory of my thoughts to what I’ve written here. Some of the stuff I said was, in hindsight, probably not a good application of statistics, and other stuff I said was when I was in the middle of working on this project, and I have since revised my methodology or fixed some errors, resulting in different numbers than the ones I posted there.
I totally started on this project because I wanted to prove that Lorrie Parker is a biased, biased judge. Also, yes, I’m totally a die-hard Yuzuru Hanyu fan. I also like Boyang Jin and Mikhail Kolyada. BUT I don’t think any of that affected my analysis—the exact same methodology was applied to every judge I looked at, and I didn’t shy away from looking at Japanese judges. Also, of the top 3 most ridiculously biased judges I found, one is biased in favor of Boyang Jin (and for some reason doesn’t exhibit any tendency to underscore Yuzu despite underscoring his other competitors) and one appears to be biased in favor of Yuzuru (this didn’t factor into the analysis in any significant way and is unrelated to nationalistic bias—I just happened to notice it). Turns out you can have favorites and still be objective.