Clearly biased: 20 judges' score histories | Golden Skate

# Clearly biased: 20 judges' score histories

#### Shanshani

On the Ice
Tl;dr: 10/20 judges I examined are almost certainly biased, and 6 more are highly likely to be. Top 5 worst offenders by obviousness: Weiguang Chen (China), Anna Kantor (Israel), Peggy Graham (US), Lorrie Parker (US), Saodat Numanova (Uzbekistan). Judges from all major federations show strong evidence of nationalistic bias.

There’s been a lot of discussion of judging bias recently, in particular which judges may or may not be biased and how the ISU seems to be very inconsistent about catching it. In order to ground this discussion in more solid evidence, and also to demonstrate that there are better methods of catching judging bias than whatever the ISU is using now, I’ve devised a statistical method for detecting judging bias which I hope to show is both powerful and fair.

I ran the method on scores for the men’s free skate portion of all high level senior international competitions (ie. Olys, Worlds, 4CC, Euros, Grand Prix Series) in the past two seasons for 20 different judges to determine each judge’s level of bias in these competitions and the strength of the evidence for the existence of bias. For each judge, I combined all the scores they gave out in the men’s free skate portion of each of the competitions they scored. I looked at two statistics. The first is the difference between how much a judge overscored their own skaters relative to each skater’s average score and how much they over/underscored other skaters; let’s call this the point differential (PD). You can think of the PD as the average amount of bonus points skaters from the same country get from that judge versus skaters from other countries. The second statistic is the probability that a point differential at least the size of the actual PD would have occurred by accident if the judge is unbiased. This is called p. The lower p is, the stronger the evidence that the judge is biased (for context, p = 0.01 is already considered really low). Here are my results, in order of strongest evidence to weakest, broken into three groups.

Results

Bias highly probable to virtually certain (p < 0.01) :

Weiguang Chen (China) - PD: 10.1, p < 0.00001
Anna Kantor (Israel) - PD: 9.87, p < 0.00001
Peggy Graham (USA) - PD: 9.38, p < 0.00001
Lorrie Parker (USA) - PD: 5.12, p = 0.000024
Saodat Numanova (Uzbekistan) - PD: 5.76, p = 0.00010
Olga Kozhemiakina (Russia) - PD: 5.19, p = 0.00020
Tatiana Sharkina (Russia) - PD: 4.56, p = 0.000275
Jeff Lukasik (Canada) - PD: 9.56, p = 0.00082
Albert Zaydman (Israel) - PD: 4.83, p = 0.0043
Sakae Yamamoto (Japan) - PD: 4.47, p = 0.0044

Yes, a full half of the judges I checked are pretty obviously biased.

Bias quite probable (0.01 < p < 0.05):

Masako Kubota (Japan) - PD: 3.67, p = 0.012
Daniel Delfa (Spain) - PD: 5.86, p = 0.012
Yuri Guskov (Kazakhstan) - PD: 4.36, p = 0.027
Janice Hunter (Canada) - PD: 3.63, p = 0.029
Igor Obraztsov (Russia) - PD: 3.04, p = 0.038
Nobuhiko Yoshioka (Japan) - PD: 4.74, p = 0.046

p < 0.05 is the typical standard for statistical significance, for instance in scientific studies. 16/20 judges show statistically significant evidence of bias.

No strong evidence of bias (p > 0.05):

Na Young Ahn (South Korea) - PD: 2.48, p = 0.13
Wendy Enzmann (USA) - PD: 1.75, p = 0.17
Sung Hee Koh (South Korea) - PD: 1.5, p = 0.20
Agita Abele (Latvia) - PD: 1.4, p = 0.22

So, as you can see, only 4 out of the 20 judges I looked at show no significant evidence of bias. And *every* judge I looked at scored their home skaters higher than they scored other skaters; the only question is whether there was enough evidence that the difference was due to bias rather than chance. But in any case, we can see already that my way of testing for judging bias is much more effective than the ISU’s way.

You can see the full results and calculations in a spreadsheet which is located here. The spreadsheet also has the judges ordered by PD, if you would like to see the judges in order of who gave the most bonus points. I also recorded some extra statistics, like the number of home country or foreign skates each judge scored, and the 95% confidence interval—you can kind of think of this as giving the likely possible range of PDs a judge’s scores will settle around if they judge a ton more skates and don’t change their behavior. For judges who didn’t have as many scores, the range is quite large, because we don’t have that much information about how the judge tends to score. However, for many of those judges, their bias is so large it’s still detectable even without much data. Yeah...

You can also see judges’ individual score deviations for most of the men’s free skates these past two seasons. Except for the Olys one which I took from a previous spreadsheet, these aren’t color coded or anything, so they’re a bit hard to read, but if anyone wants to zoom in on specific instances of suspicious judging, have at it.

Methodology

Ok, these results are interesting, but you might be wondering how I calculated the p values, what p even is really, why I only looked at Men’s free skate scores, etc. So I’m going to spell out exactly what I did, why I did it, and how the whole p value thing works here.

Gathering Data

The first step in the analysis was to gather all the data. Fortunately skatingscores.com has all the judges’ scores laid out in an easily accessible manner. For each judge that I examined, I determined which competitions they judged men’s free skates for in the past two seasons, and then imported all of the judges’ scores for all of the skaters from those competitions into an Excel sheet. I then computed the mean (ie. average) score for each skater in a column next to the judges’ individual scores.

Note that the mean score is different from the skater’s actual score—the mean score is simply the number you get when you add up all the judges' scores and divide by the number of judges, whereas the actual score is calculated in a much more convoluted manner. I used the mean score instead of the actual score in my calculations for simplicity, though I don’t think the results would have changed much if I had used the actual score. If anything, using the mean score makes it slightly easier for judging bias to hide.

Underneath my table of judges’ scores, I created another table, which computes the difference between the score each judge gave and the mean score the skater received. So for instance, if a judge scored a skate at 195 points, and the skater averaged 193 points between all 9 judges, then the cell underneath the judge in that skater’s row would read 2. Let’s call this the score deviation.
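The score deviation step is simple enough to sketch in a few lines of Python. This is only an illustration of the arithmetic described here, not the spreadsheet itself: the function name `score_deviations` and the panel scores are made up, chosen so that the panel mean is 193 and the first judge's 195 gives a deviation of 2.

```python
from statistics import mean

def score_deviations(panel_scores):
    """Return each judge's score minus the panel mean for one skate."""
    panel_mean = mean(panel_scores)
    return [round(score - panel_mean, 2) for score in panel_scores]

# Hypothetical nine-judge panel for a single free skate (illustrative numbers).
panel = [195, 193, 192, 194, 191, 193, 194, 192, 193]
deviations = score_deviations(panel)
print(deviations[0])  # first judge's deviation from the panel mean
```

By construction the deviations for one skate always sum to zero, so a judge can only be "high" on a skater relative to the rest of the panel.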

After I computed the differences for all of the competitions that were part of my data set, I had to combine all of the scoring data for each judge. In a separate sheet, I collected all the score deviations for each score that each of the judges I examined gave across all the competitions they scored the men’s free skate for. For each judge, I split these score deviations into two groups: score deviations for skaters from the same country and score deviations for skaters from different countries. For instance, the first judge I examined, Lorrie Parker, a US judge, would have all of her score deviations for US skaters like Nathan Chen and Adam Rippon listed under one column, and all her score deviations for non-US skaters like Yuzuru Hanyu and Mikhail Kolyada listed under another column. All of the analysis that I did is based on comparing these two groups of data.

I computed the mean for these two groups, producing two numbers: the judge's average score deviation for home country skaters and average score deviation for other countries’ skaters across all the competitions I examined. In other words, the first number gives the average number of points the judge overscored her own skaters relative to the mean score of all the judges, and the second number gives the average number of points the judge over/underscored other skaters (while no judges underscored their own skaters on average, some judges are just nice and tend to score everyone highly). I recorded these two numbers in the results page, and a third number, the difference between the two averages. This is the point differential (PD) statistic mentioned earlier, and it represents how many points higher a judge tends to score her own skaters than she tends to score other skaters.

Because PD is the difference between two averages, it’s more useful to look at than just the home skater average. Some judges are just stricter or more forgiving in general, which means that just looking at the home average can be a bit misleading. For instance, if you only looked at Canadian judge Lukasik’s home skater score deviation average, you might find the fact that he scored Canadian skaters an average of 4.36 points higher than the other judges concerning, but not as egregious as some other judges. However, when you notice the fact that he *also* scored non-home skaters 5.2 points lower than other judges, the full extent of his bias (9.56 extra points for Canadian skaters over non-Canadian skaters) becomes more apparent.
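Here is a minimal sketch of the home/foreign split and the PD calculation. The record layout (`(skater_country, deviation)` pairs), the function name `point_differential`, and the deviations themselves are assumptions for illustration; they are not Lukasik's real data. The last line just redoes the Lukasik arithmetic from the figures quoted in the post.

```python
from statistics import mean

def point_differential(records, judge_country):
    """Split (skater_country, deviation) pairs into home and foreign, then compare means."""
    home = [dev for country, dev in records if country == judge_country]
    foreign = [dev for country, dev in records if country != judge_country]
    home_avg, foreign_avg = mean(home), mean(foreign)
    return home_avg, foreign_avg, home_avg - foreign_avg

# Made-up deviations for a hypothetical Canadian judge.
records = [("CAN", 3.0), ("CAN", 5.0), ("CAN", 4.0),
           ("JPN", -1.0), ("USA", 1.0), ("RUS", 0.5), ("ESP", -0.5)]
home_avg, foreign_avg, pd_value = point_differential(records, "CAN")
print(home_avg, foreign_avg, pd_value)

# The Lukasik figures quoted in the post combine the same way: 4.36 - (-5.2).
lukasik_pd = round(4.36 - (-5.2), 2)
print(lukasik_pd)
```

Subtracting the foreign average is what separates a judge who is generous to everyone from one who is generous only to home skaters.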

p value and significance testing

After I computed these basic statistics, I turned to computing a statistic that is far less transparent to most people: p. What is p? Well, like I said earlier, in this context, it’s the probability (from 0 to 1, so a p of 0.5 is the same as 50%) that a difference between the home skater and foreign skater averages (in other words, the PD) could occur through random chance, if the judge does not systematically score one group of skaters (her home skaters) higher than the other (foreign skaters), relative to the other judges’ scores. To put it in simpler terms, it’s the probability that an unbiased judge would wind up with the same or higher PD through luck.

In order to understand this better, let’s imagine a hypothetical totally unbiased judge (difficult hypothetical, I know). If this judge scored thousands and thousands of men’s free skates and thousands and thousands of skaters, we would expect that there would be no difference in how that judge scores skaters from her home country and foreign skaters. She might be a nice judge or a mean judge (so her average score deviations could be high or low), but either way, the difference between her scores and other judges’ scores should be the same on average regardless of whether the skaters are from her country. In other words, she should have a PD of zero.

Of course, real judges don’t score thousands of men’s free skates; since judging stopped being anonymized, they might at best have scored 5 or 6. Thus a real, unbiased judge is likely to have a PD that isn’t zero, just through random chance. The higher the PD, the less likely that they are truly unbiased, but it’s possible for a judge to have a PD of, say, 3, without that judge being truly biased; it’s just luck that they wound up overscoring their own skaters relative to other skaters by 3 points. The question therefore is, how do we know if the point differential reflects a real tendency to score home skaters higher, or just dumb luck?

That’s where p comes in. p tells us *exactly* how unlikely it is for the PD to be that high if the judge is truly unbiased (ie. does not tend to score one group of skaters higher than another group). Thus if p is very, very low, we can be quite sure that the judge does not score home skaters and foreign skaters the same.

This p statistic, by the way, is used all the time in science and other fields where it might be useful to be able to determine if a difference between two sets of data is due to chance or some other factor. Read more about it here. In fact, there are established norms for when we consider p values low enough to reject the idea that the groups being compared (in this case, a judge’s score deviations for home skaters versus foreign skaters) have the same underlying mean (in this case, same tendency to over or under-mark skaters, regardless of country of origin) for whatever is being measured. Typically, this threshold is p < 0.05, though depending on how confident you need to be, sometimes the threshold is lower. As you can see, 80% of the judges I looked at had scores that met this threshold. You might have heard the phrase “statistically significant”; this phrase refers precisely to whether the p value for your comparison falls below whatever p threshold you set, again typically p < 0.05. So, for 16/20 judges I looked at, I found a statistically significant difference between how they mark home country skaters versus foreign skaters.

How did I actually calculate p? Well, to be honest, the nitty-gritty details of how p is calculated are too complicated to get into here—you’d need to take a college level statistics class to get into that. But we don’t really need to know all that information—in fact, I don’t know off the top of my head either. For the actual calculation bit, all I needed to do was feed the data—the two groups of score deviations—into a calculator built to compute p values (here is the calculator I used). The main thing that you need to know is which test I used (there are different tests for determining significance depending on what kind of data you have and how it’s organized).

I used a one-tailed T-test. Why a T-test? T-tests are used for comparing means of two data sets and seeing if they differ significantly from each other—basically all the stuff I was describing. I was comparing the mean score deviation for home country skaters versus foreign skaters for each judge. Why one-tailed? The choice between one-tailed versus two-tailed depends on what you’re looking for: are you looking for differences in means in both directions, or only one direction? For my purposes, I was not concerned with cases where judges significantly underscore their own skaters relative to foreign skaters. First, I haven’t even run into a case where a judge underscored their own skaters relative to foreign skaters at all, let alone significantly. Second, if that ever occurs, it would not really be of interest to us, because we’re only interested in discovering pro-home country bias. If a judge is biased against their home country, well, that would be kind of weird, but I’m not really sure there would be a case for disciplining that judge (plus their fed probably wouldn’t let them judge any more anyway). (But if you really want to quibble about one tail vs two, you can easily re-run using a two tailed test. Or just double the p values.)
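For anyone who wants to see roughly what such a calculator does, here is a sketch of a one-tailed two-sample t-test using only the Python standard library. Several things here are assumptions: the post does not say whether the calculator pooled the variances, so this version assumes the classic pooled-variance (Student) form; the function names are made up; the tail probability is obtained by numerically integrating the t density rather than by whatever the calculator used; and the two lists of deviations are invented for illustration.

```python
import math
from statistics import mean, variance

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(t, df, steps=2000):
    """P(T >= t), using Simpson's rule on the density between 0 and |t|."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 - area if t >= 0 else 0.5 + area

def one_tailed_t_test(home, foreign):
    """Pooled-variance t-test; alternative: home mean is greater than foreign mean."""
    n1, n2 = len(home), len(foreign)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(home) + (n2 - 1) * variance(foreign)) / df
    t = (mean(home) - mean(foreign)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df, upper_tail(t, df)

# Invented score deviations for one hypothetical judge.
home = [6.1, 4.0, 5.5, 3.2]
foreign = [0.5, -1.2, 1.0, -0.3, 0.8, -2.0]
t, df, p_one_tailed = one_tailed_t_test(home, foreign)
p_two_tailed = min(1.0, 2 * p_one_tailed)  # the "just double it" rule when t > 0
print(round(t, 3), df, p_one_tailed, p_two_tailed)
```

Each group needs at least two scores, otherwise the sample variance is undefined. In practice a library routine such as SciPy's `scipy.stats.ttest_ind` (with `alternative="greater"` in recent versions) would replace the hand-rolled tail integral.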

Why the men’s free skate?

Like I mentioned before, the data that I used to produce these calculations comes exclusively from the men’s free skates for Olys, Worlds, 4CC/Euros, and Grand Prix competitions. There were a few reasons I decided to restrict my data set in this way. First, because of different PCS factoring and different required elements between the short and the free program and between, say, men’s and ladies’, score differentials are not comparable across different competition segments and different fields. The men’s free skate tends to have relatively large variations in scores relative to other fields and segments, as there are simply more available GOE and PCS points. Therefore, it wouldn’t make sense to put men’s free skate data in the same data set as the ladies’ free skate, or the men’s short.

Because of how time consuming it was to do the analysis for a single segment of a single field, I did not look at other segments or other fields, nor did I look at B level competitions. Sorry, even I don’t have that much time on my hands. However, I would like to note that the time consuming aspects of running the analysis can be automated, so the ISU could still run the same analysis as long as they were willing to pay the one time cost of hiring a programmer. I don’t know how to program, so I have to do it by hand.

As for why the men’s free, and not the ladies’ free, or the pairs’ short, or whatever, that’s just a matter of personal preference. The men’s event is my favorite event, so I’m more interested in data about it than data from other events. *shrug* Hey, if you want to see what the other events look like, you can run this analysis yourself. It’s worth noting that almost all judges judge for more than one field, and if they’re biased judging for one field, they’re probably biased judging for other fields.

Why these twenty judges

Admittedly, this was where I was most ISU-like, in that the process by which I chose which judges to examine was a bit unsystematic and arbitrary. For some judges, I simply wanted to know if that judge in particular was biased (for instance, Lorrie Parker, the first judge I looked at, was chosen in this way). But largely, I picked judges if I saw that their names show up a lot in men’s free skate judging panels, and if I saw that they scored their own skaters at least 3 times. If a judge hasn’t scored many skates, then it becomes difficult to conduct statistical analysis on them. For instance, I wanted to look at Chinese judges other than Weiguang Chen (particularly the other suspended judge, Feng Huang), but none of the other Chinese judges who’ve judged the men’s free skate judged enough of their own skaters to be included (Chen hogged nearly all of the Chinese men’s judging). However, because I didn’t go through every judging panel and identify exactly how many times each judge judged (again, too time-consuming), it’s possible I might have missed some judges who’ve judged enough to be analyzed. I think I got all or almost all of them though.

Limitations

There are some limitations to the form of analysis I’ve conducted here. First of all, this method only catches cases where judges hold up skaters from their own country, and is not suited to catching bloc-judging (ie., judging where judges from multiple nationalities collude to help skaters of a given nationality). I think it could possibly be altered to test for bloc-judging (or at least, you could test for whether say, a Kazakh judge scored Russian skaters higher than other skaters, for example) but you’d have to come up with a hypothesis for what the blocs in question are ahead of time.

Second, my analysis is built around the question of whether we can reject the idea that a judge scores home and foreign skaters the same. While this provides pretty strong evidence for nationalistic bias (ie. scoring a skater higher because they are from your home country), it does not rule out other explanations for systematic differences between a judge’s scores for home versus foreign skaters (e.g. the judge is biased in favor of certain skaters for other reasons and some of those skaters just happen to be from their home country). This is especially true of cases where a judge is scoring only one skater from their home country. It’s possible that they simply like that skater more than other judges do for innocent reasons. For instance, Numanova, the Uzbek judge, shows a highly, highly significant pattern of scoring Uzbek skaters more favorably than other skaters. However, though she scored an Uzbek skater multiple times, there was only one Uzbek skater whom she scored: Misha Ge. It’s possible that she’s scoring him higher not because he is Uzbek, but because she believes that the level of artistry he shows in his skates should be reflected with higher PCS numbers than the other judges award him. It’s entirely possible for judges to have preferences/biases in favor of individual skaters for reasons pertaining to the individual and not to nationality. For example, Israeli judge Anna Kantor consistently over-scores Yuzuru Hanyu relative to other judges, AND does so despite her tendency to underscore in general, presumably because she just really likes his skating. My analytical methods cannot distinguish between personal bias and nationalistic bias when a judge is only scoring one or two skaters; it’s up to you to decide which is the more likely explanation. (The flip side of this is that the results are more robust for judges from large federations that field multiple skaters at any given competition.)

Additionally, while my analysis can give strong evidence for the existence of nationalistic bias, it cannot tell you why that bias exists. It’s possible that some sources of nationalistic bias are innocent. For instance, perhaps judges simply tend to connect better with the programs of skaters from the same country because of shared cultural norms and expectations, and thus tend to award higher PCS scores to those programs. My analysis cannot prove that a given judge or federation is corrupt and trying to fix the results of competitions to benefit their own skaters; it’s entirely possible that a judge is well meaning, but subconsciously grades their own skaters more forgivingly than other skaters. All it can show is that a given judge marks home country skaters higher, not the reason why.

Also, because we’re doing so many comparisons, the methodology is unfortunately likely to flag someone at some point that is innocent of significant bias, at least for some of the higher p-values. There is, after all, a non-zero probability that they are unbiased and randomly wound up with the PD that they did, and the chance that we’ll encounter such a person goes up the more people we test. That’s still really unlikely for those p < 0.0001 folks though.
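To put rough numbers on that, here is a small sketch of how often at least one truly unbiased judge would get flagged by luck. It assumes the 20 tests are independent of each other, which is a simplification (judges who sit on the same panels share data), and the function name is made up.

```python
def chance_of_false_flag(n_judges, alpha):
    """Probability that at least one of n unbiased judges lands below alpha by luck."""
    return 1 - (1 - alpha) ** n_judges

n_judges = 20
for alpha in (0.05, 0.01, 0.0001):
    expected_flags = n_judges * alpha  # average number of unbiased judges flagged
    print(alpha, round(expected_flags, 4),
          round(chance_of_false_flag(n_judges, alpha), 4))
```

If all 20 judges were unbiased, about one of them would be expected to fall under p < 0.05 anyway, which is why the judges near that threshold deserve more caution than the ones far below it.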

On a different note, some judges seem more markedly biased at certain competitions than at others. Lorrie Parker, for instance, was highly biased at the Olympics, but scored in a much more reasonable and not markedly biased way at CoC 2017 and Rostelecom 2017. To some extent, judging in an unbiased way in some competitions can cover for bias in other competitions. Aggregating the competition data therefore may sometimes hide the extent to which a judge is biased at a particular competition, though it’s usually not enough to hide the fact that they are biased entirely. Weiguang Chen was similar—her 2016 Grand Prix scores are much more reasonable than her 2018 Olys scores. Therefore, it might be useful to run this same analysis method, but for individual competitions. I haven’t done that here, but it should be easy enough to do.

Similarly, because I put all foreign skaters in a single bucket for each judge, the extent to which judges are biased against foreign skaters their own skaters are in direct competition with is also masked. The fact that Chen scores foreign skaters 1.45 points lower across all competitions relative to other judges doesn’t really capture, for instance, how she underscored Shoma Uno by 11.09 points and Javier Fernandez by 14.99 points. Nor does the fact that Lorrie Parker actually scores non-US skaters 0.53 points higher than other judges across all competitions capture the fact that she underscored Yuzuru Hanyu by 8.92 points and Boyang Jin by 10.02 points in the Olympics. So in reality, the magnitude of the bias that biased judges bring to the table is likely significantly higher than the PD numbers given. Unfortunately, I couldn’t think of a way to measure this direct competitor effect without making contentious calls over who might be considered a “direct competitor” of various skaters.

Recommendations

To be honest, I think everyone in the p < 0.01 category should be warned, at the very least. Second, I think that the ISU should switch to a method similar or identical to mine for detecting nationalistic bias. As we can see, this is a much more effective way of catching biased judges than the score corridor method they have now, which gives judges so much leeway that they can be obviously, provably biased without being scrutinized for it. Though the method I employed here may look complicated, it’s actually not, and there is no reason the ISU could not employ it themselves. They could even contract a programmer who can make a program to compute all of these things automatically and automatically flag suspicious judges. After constructing the methodology, nothing I did requires any human judgment whatsoever, and is entirely a mechanical application of formulas to numbers. You can even use the method on individual competitions, provided that there are multiple skaters from the judge’s country. For instance, when I ran the numbers solely on Lorrie Parker’s Olys men’s singles event judging, I found that PD = 9.62 and p = 0.00057 for that one competition (!).

There are, however, a few drawbacks. First, it may be possible to game this method by judging in an unbiased (or even anti-biased) way at relatively unimportant competitions while still being biased at important competitions (but I think any method of catching bias will be prone to this problem). Second, this method may incentivize bad faith judges to artificially inflate the scores of foreign skaters who do not directly compete with skaters from the judge’s country. This will lower their PD score and consequently make bias harder to detect. However, the more judges do this, the less effective it will be. Interestingly, if this occurs, it may have the happy side effect of counter-acting reputational score inflation (if it exists), since it will likely be lower ranked skaters who are inflated, thus narrowing the gap between lower ranked skaters and high ranked skaters.

Disclosures, aka the part where I say what my biases are before you accuse me of having them

Since a lot of this analysis and the data entry that supports it was done by hand, there may be some mistakes in the data. I don’t think that these mistakes, if they exist, will significantly affect the conclusions, especially for judges nowhere near significance thresholds, but please let me know if you find any. Also, I’m not a professional statistician by any means, so if I have misapplied or mis-explained a statistical concept, or if you simply have a suggestion for improving the methodology, please let me know as well. (And, as someone in the thread about Chinese judges being suspended asked, I don’t work for NASA. I’m not really using anything beyond statistics 101 concepts here, honestly.)

Also, if you read my posts on the Chinese judge thread, please revise your memory of my thoughts to what I’ve written here. Some of the stuff I said was, in hindsight, probably not a good application of statistics, and other stuff I said was when I was in the middle of working on this project, and I have since revised my methodology or fixed some errors, resulting in different numbers than the ones I posted there.

I totally started on this project because I wanted to prove that Lorrie Parker is a biased, biased judge. Also, yes, I’m totally a die-hard Yuzuru Hanyu fan. I also like Boyang Jin and Mikhail Kolyada. BUT I don’t think any of that affected my analysis—the exact same methodology was applied to every judge I looked at, and I didn’t shy away from looking at Japanese judges. Also, of the top 3 most ridiculously biased judges I found, one is biased in favor of Boyang Jin (and for some reason doesn’t exhibit any tendency to underscore Yuzu despite underscoring his other competitors) and one appears to be biased in favor of Yuzuru (this didn’t factor into the analysis in any significant way and is unrelated to nationalistic bias—I just happened to notice it). Turns out you can have favorites and still be objective.

#### chopinskate

Thanks for this. Interesting results.

I totally started on this project because I wanted to prove that Lorrie Parker is a biased, biased judge. Also, yes, I’m totally a die-hard Yuzuru Hanyu fan. I also like Boyang Jin and Mikhail Kolyada. BUT I don’t think any of that affected my analysis—the exact same methodology was applied to every judge I looked at, and I didn’t shy away from looking at Japanese judges. Also, of the top 3 most ridiculously biased judges I found, one is biased in favor of Boyang Jin (and for some reason doesn’t exhibit any tendency to underscore Yuzu despite underscoring his other competitors) and one appears to be biased in favor of Yuzuru (this didn’t factor into the analysis in any significant way and is unrelated to nationalistic bias—I just happened to notice it). Turns out you can have favorites and still be objective.

:agree:

#### yume

##### 🍉
Record Breaker
Thanks for this analysis.

#### Shanshani

On the Ice
I used to have other hobbies...

Anyway, I wrote *a lot* of stuff, some of which is actually quite technical, so if anyone needs clarifications, feel free to ask.

#### Miller

Final Flight
Looking at Lorrie Parker's (sorry to pick on her, but she seems the 'judge of the day', plus there are other question marks about her Olympics judging) history on skatingscores.com http://skatingscores.com/official/usa/lorrie_parker/ I found that she judged American skaters on 37 occasions across the 2 years in question. I did not look at how much she might have favoured them by, or how she might have marked other countries' skaters. This is just a snapshot of how she placed the American skaters relative to the other judges i.e. did she place them 1st out of the 9 judges, 2nd out of the 9 etc.

This is what I got - 1st place positions 20, 2nd 4, 3rd 9, 4th 3, 6th (Nathan Chen 6th in Men's SP at 2017 4CC) 1. Average position 1.97, vs random, totally unbiased judging 5.00.

Could this act as a quick 'snapshot' as to how biased a judge might be for further investigation e.g. if the average position they place their skaters across the board is lower than say, 3.00, then maybe investigate further, and especially if the number of occasions they judged their skaters is quite large - 37 would seem more than enough.

Something like this could work quite quickly - it obviously didn't take me too long to look at the stuff on skatingscores and come up with the above. To program it should be really easy also, no need to worry about things like standard deviations and stuff (sorry to all you mathematicians out there).
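A rough sketch of that placement snapshot might look like the following. The function names and the tie handling (tied scores share the better placement) are assumptions; the tally is the one quoted in this post.

```python
def placement(panel_scores, judge_index):
    """Position of one judge's score within the panel: 1 = highest score given."""
    mine = panel_scores[judge_index]
    return 1 + sum(1 for score in panel_scores if score > mine)

def average_placement(counts):
    """Mean placement from a {placement: number_of_occasions} tally."""
    occasions = sum(counts.values())
    return sum(rank * n for rank, n in counts.items()) / occasions

# Tally quoted in the post for Lorrie Parker's marks on US skaters.
parker_tally = {1: 20, 2: 4, 3: 9, 4: 3, 6: 1}
print(sum(parker_tally.values()), round(average_placement(parker_tally), 2))

# With no ties, an unbiased judge on a nine-judge panel averages (1 + 9) / 2.
unbiased_average = (1 + 9) / 2
```

This only uses the ordering of the judges, so it ignores how many points the home skater gained; it works as a quick filter before a fuller test rather than a replacement for one.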

#### dante

##### a dark lord
Final Flight
Wow. I wish all journalism was more like that - educational and (at least honestly trying to be) unbiased.

I've always lacked understanding of statistics, so now I'm regretting that I ignored the thread on the judges' suspension - this quarrel is going to be as educational for me as the one about WADA's scandal.

I will later give your post another, more careful read; sorry if some of my questions are already covered by it:
1. Say, a judge took part only in one competition and gave his compatriot 1 bonus point (PD=1). What's his p-value?
2. Which value should be taken into account when selecting the judges to warn or reprimand? Is it p*PD? Or should we better find out a PD for which the p-value would be equal to 0.5 (or 0.05)?
3. Would it be less prone to judges' bias to score the median GOEs and PCSs, i.e. to drop the 8 extreme values instead of 2?

Similarly, because I put all foreign skaters in a single bucket for each judge, the extent to which judges are biased against foreign skaters their own skaters are in direct competition with is also masked.

You can use the difference in ISU ranking as a factor. Or better calculate your own ranking for the last 12 months, because the standings change rapidly in this sport and the per-season values will be inaccurate.

#### shyne

Final Flight
The sample size is probably small for those without a significant p

#### moriel

Record Breaker
Wow. I wish all journalism was more like that - educational and (at least honestly trying to be) unbiased.

I've always lacked understanding of statistics, so now I'm regretting that I ignored the thread on the judges' suspension - this quarrel is going to be as educational for me as the one about WADA's scandal.

I will later give your post another, more careful read; sorry if some of my questions are already covered by it:
1. Say, a judge took part only in one competition and gave his compatriot 1 bonus point (PD=1). What's his p-value?
2. Which value should be taken into account when selecting the judges to warn or reprimand? Is it p*PD? Or should we better find out a PD for which the p-value would be equal to 0.5 (or 0.05)?
3. Would it be less prone to judges' bias to score the median GOEs and PCSs, i.e. to drop the 8 extreme values instead of 2?

You can use the difference in ISU ranking as a factor. Or better, calculate your own ranking for the last 12 months, because the standings change rapidly in this sport and the per-season values will be inaccurate.

1. This is hard to tell. For the test used, the p-value depends on variance. What does this mean? Imagine a judge giving PDs such as -2, 2, 3, -1, -3, 1 and so on. On average there is no bias, but the interval for the PDs is quite large. On the other hand, you could have a judge with PDs such as 1, 1.1, 0.8, 1.3, 0.9 etc., which are all contained in a small interval. Notice that in the first case a PD of 1 would be perfectly fine, because this judge just has a huge PD variation, so we can think that in one competition he gives a PD of 1, but in another one it may just as well be -1, depending on the skaters' performance or something like that. In the second case, the judge is clearly biased if he gives a PD of 1, because he consistently has a PD of about 1. Looking at his PDs, you cannot think "nah, this is fine, maybe his home skater just performed very well and next competition there will be a PD of -1".
Not 100% sure of the methodology used by the OP, but usually the t-test uses the sample variance, which means the p-value depends on all the PDs of that judge.
Now, if a judge judged just one competition, there is no way for us to determine this variance for the specific judge. The only way to evaluate it would be to use some sort of statistics from the other judges - for example, study their PDs, and determine that your typical judge usually gives PDs within, say, ±2 of his average PD. If his average PD is 0, he will give PDs between -2 and +2. If his average PD is 5, he will give PDs between 3 and 7. Then we could use this info to decide if the judge that took part in only one competition has national bias.
Another way to avoid this issue is, rather than pooling all the foreign skaters in a competition together, to calculate a PD for each of them individually, so you have a large number of PDs in one competition.

2. This is a philosophical question. For example, you can say that any bias is bad. So, if a judge has a PD of 0.01 consistently in all competitions, this judge should be reprimanded, despite his bias being small. This would mean that all the judges with a low p-value would be reprimanded, regardless of their average PD. You can also think that a bias of 0.01 does not matter, so one would need to have a low p-value and a high PD to be reprimanded.
I personally really dislike this method for reprimanding, though, since it requires a very large history on each judge, I would say at least 5 competitions to get any meaningful results. Now, there is this thing: judges are less likely to show bias when their skaters are nowhere near the top. For example, in the data analysed, one can claim that it's not that South Korean judges are unbiased, it's just that they had no real reason to overscore their own skaters, because it does not really matter if their skater is 15th or 17th. National bias tends to be larger in the top 10 or so, where medals and Worlds/Euros/4CC spots are at stake.
For example, take the Olympics FS. The Russian judge gave a score of 161.33 to Medvedeva (4.68 points above her final score), 160.31 to Zagitova (3.66 points above her final score), but 135.01 to Sotskova (0.77 points above her final score).
p * PD is a bad measure to reprimand a judge, because high PD is bad, and small p is bad, so they sort of cancel each other when you multiply.

3. Yep, it would in theory, since the median is less sensitive to outliers than the mean. The thing is, in practice it will make no difference, because the trimmed mean is effective enough against national bias, so that is already mostly filtered, and it will not save us from stuff like bloc voting.
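
To make point 1 a bit more concrete, here is a rough sketch of how the spread of a judge's PDs changes the strength of the evidence. It computes a one-sample t statistic by hand (mean PD divided by its standard error). The helper name `t_statistic` and both PD lists are made up for illustration, and this is not necessarily the exact test the OP ran.

```python
from math import sqrt
from statistics import fmean, stdev

def t_statistic(pds):
    """One-sample t statistic for the hypothesis 'mean PD is zero' (hypothetical helper)."""
    standard_error = stdev(pds) / sqrt(len(pds))
    return fmean(pds) / standard_error

# Made-up PDs: both judges give their own skaters about +1 on average.
noisy_judge = [-1.0, 3.0, 4.0, 0.0, -2.0, 2.0]   # wide spread around +1
steady_judge = [1.0, 1.1, 0.8, 1.3, 0.9]          # narrow spread around +1

print("noisy judge t: ", round(t_statistic(noisy_judge), 2))
print("steady judge t:", round(t_statistic(steady_judge), 2))
```

The noisy judge's statistic comes out near 1 and the steady judge's above 10, so the same average PD is weak evidence in one case and strong evidence in the other.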

#### moriel

Record Breaker
@Shanshani, I am wondering if it is possible to add the sample size for each judge.

Also, as a side note, if you analyzed only those 20, the Bonferroni correction would be p < 0.0005 for bias highly certain and 0.0005 < p < 0.0025 for bias quite probable, meaning far fewer judges in those groups.
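
For anyone wondering where those numbers come from, Bonferroni just divides each cut-off by the number of tests. A minimal sketch, assuming the original group cut-offs were 0.01 and 0.05 (which is what the corrected values imply); the function name is my own:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

N_JUDGES = 20  # number of judges analyzed in the original post

for alpha in (0.01, 0.05):
    print(f"alpha = {alpha}: corrected threshold = {bonferroni_threshold(alpha, N_JUDGES):.4f}")
```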

#### Shanshani

On the Ice
I'm busy right now so I'll respond to the rest of the questions people have brought up later, but just a few quick notes:

Sample sizes are available in the excel sheet under N home and N non home.

I am not sure correcting for multiple comparisons is required, because each judge's scores are analyzed as a separate set of data. I'm not comparing the same two groups on multiple measures. The hypotheses being tested are independent hypotheses about whether a particular judge is biased. Plus, isn't Bonferroni correction usually seen as excessively conservative? But I'll think about this more.

#### dante

##### a dark lord
Final Flight
p * PD is a bad measure to reprimand a judge, because high PD is bad, and small p is bad, so they sort of cancel each other when you multiply.

Yeah, I mean PD/p, not PD*p.

This is a philosophical question. For example, you can say that any bias is bad. So, if a judge has a PD of 0.01 consistently in all competitions, this judge should be reprimanded, despite his bias being small. This would mean that all the judges with a low p-value would be reprimanded, regardless of their average PD. You can also think that a bias of 0.01 does not matter, so one would need to have a low p-value and a high PD to be reprimanded.

My question is, can we have a single scalar value that would adequately represent the degree of bias of a judge? Let's call it bias, B.

Now, if a judge judged just one competition, there is no way for us to determine this variance for the specific judge.

If a judge judged just one competition and gave his compatriot +30 points, he should have a high B that would give the ISU a formal reason to take measures.

#### Shanshani

On the Ice
@Shanshani, I am wondering if it is possible to add the sample size for each judge.

Also, as a side note, if you analyzed only those 20, the Bonferroni correction would be p < 0.0005 for bias highly certain and 0.0005 < p < 0.0025 for bias quite probable, meaning far fewer judges in those groups.

Ok, thought about this more. I am not a statistician, but I don't think correcting for multiple comparisons is necessary here, and certainly not Bonferroni correction. Bonferroni correction, as I understand it, is generally too conservative, which means it makes it too hard to catch real effects (though apparently not so hard that nobody gets caught, since some judges still clear even the extremely low significance thresholds it sets), and there are other ways to correct for multiple comparisons that are less conservative. But I don't think it's actually necessary to correct at all. You're right that I would have to correct for multiple comparisons if I wanted to be as sure as I could that I wouldn't catch a non-biased judge by accident (I alluded to this in one of the points under limitations). For instance, supposing I had a high number of p=0.05 judges, if I used p=0.05 as my threshold for flagging judges, chances are at least one of the judges I flagged is actually innocent. But I don't think we should be overly worried about that type of error, to the point where we're substantially cutting into our ability to detect real home country effects (in other words, I think correcting for multiple comparisons results in too many type II errors in an overzealous effort to prevent type I errors). If this method occasionally catches an innocent judge, that can be handled through a punishment system that is forgiving of "first time" offenses--eg. a warning if the evidence for bias reaches a particular significance threshold rather than an automatic suspension. If it's a huge concern, just set the p bar lower--you'll notice that in "recommendations", I effectively set it at p=0.01 rather than p=0.05. (I also think there's value in the clarity provided by setting one, unmoving p-threshold).

#### moriel

Record Breaker
I'm busy right now so I'll respond to the rest of the questions people have brought up later, but just a few quick notes:

Sample sizes are available in the excel sheet under N home and N non home.

I am not sure correcting for multiple comparisons is required, because each judge's scores are analyzed as a separate set of data. I'm not comparing the same two groups on multiple measures. The hypotheses being tested are independent hypotheses about whether a particular judge is biased. Plus, isn't bonferroni correction usually seen as excessively conservative? But I'll think about this more.

I dunno, I am kind of split here. Because, for example, the judges are judging the same competitions and all, so probably better be conservative.
On the other hand, no matter what metric we take, national bias exists.

#### Shanshani

On the Ice
On the topic of what PD and p should ring alarm bells:

So first let me clarify the difference between these two measures. PD is a measurement of how much, on average, being from the same country as a judge increased your score relative to how that judge grades skaters from a different country for the competitions that were included in the data set. In simple terms, this is the number of points the judge is biased by on average.

p, on the other hand, is a measure of how confident we can be that the PD we found was arrived at because the judge in fact grades home country skaters and foreign country skaters differently. In crude, oversimplified terms, p tells us how sure we can be that the judge is biased: the smaller p is, the surer we are.

These two measures are related. In general, the larger PD is, the smaller p is--this makes sense because the more bonus points we see a judge giving skaters from their home country, the more likely it is that they really are biased. However, they're not perfectly related, because there's a third factor which affects p besides PD. That's the sample size. With a larger sample size, it's possible to be more confident of the existence of bias, even when PD is not very large. If a judge gives home skaters an extra 2 points in comparison to foreign skaters in one or two competitions, well, it's possible that the judge doesn't really grade home and foreign skaters differently--it was just random chance. But if the judge judges a hundred competitions and we see that the PD is still 2, well it's a lot more likely then that a real, if modestly sized, bias exists.
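
Here is a small sketch of the sample size effect. To keep it free of extra libraries it uses a permutation test rather than the t-test from the original analysis: shuffle the home/foreign labels many times and count how often chance alone produces a PD at least as big as the observed one. All the numbers are made up, and `permutation_p` is just a name I picked.

```python
import random
from statistics import fmean

def permutation_p(home, foreign, n_iter=5000, seed=0):
    """One-sided permutation p-value for PD = mean(home) - mean(foreign)."""
    rng = random.Random(seed)
    observed = fmean(home) - fmean(foreign)
    pooled = list(home) + list(foreign)
    n_home = len(home)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # randomly reassign the home/foreign labels
        diff = fmean(pooled[:n_home]) - fmean(pooled[n_home:])
        if diff >= observed - 1e-9:  # small tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_iter + 1)

# Made-up deviations from the panel average; PD is 2.0 in both data sets.
home_small = [3.0, 1.0]
foreign_small = [1.5, -2.0, 0.5, -1.0, 2.0, -1.0]
home_large = home_small * 10        # same pattern, ten times as many skates
foreign_large = foreign_small * 10

p_small = permutation_p(home_small, foreign_small)
p_large = permutation_p(home_large, foreign_large)
print("PD = 2 with  8 skates: p =", round(p_small, 4))
print("PD = 2 with 80 skates: p =", round(p_large, 4))
```

With these numbers the small sample should land somewhere around p = 0.1, while the identical PD over ten times as many skates comes out far below 0.01.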

Now, as for when a judge should be punished or warned, I think you need to look at both factors. First, if p is not sufficiently small (<0.05 or <0.01 depending on how careful you want to be), then you don't really have good evidence of bias, even if the PD is 4 points or whatever. This is because the higher the p value, the higher the chance that the judge doesn't really grade home and foreign skaters differently, and whatever PD they have is just an accident. However, just because p is small, doesn't necessarily mean you should punish them. This gets into the philosophical issue moriel mentioned earlier--what level of bias are we actually willing to tolerate? Just because something is statistically significant, doesn't make it practically so. For instance, if a judge consistently gives home skaters 1 extra bonus point, well that 1 extra point from one judge probably isn't going to affect the competition very much, and may be a result of innocent bias (eg. shared culture leading to greater appreciation of program). So I think that it's fine to ignore small but statistically significant biases. This wasn't really an issue in my analysis, however. In general, sample sizes were pretty small, so the effect of the bias has to be pretty large to be detected in the first place, which meant that the cases of very tiny p values were accompanied by fairly hefty PDs.

I think that there should be two thresholds: a p threshold (to ensure that there's enough evidence of bias) and a PD threshold (to ensure that we aren't catching out judges who've judged a lot on nationalistic biases that are practically insignificant/not a big deal). I would set the p threshold at something like 0.01 for now (since punishing the guys closer to 0.05 seems like less of a priority, plus there are questions here about what tradeoff we should be making between failing to catch biased judges and accidentally catching innocent judges). I don't know what PD threshold I'd set--I'd have to look into the data further.
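
In code, the two-threshold rule would look something like the sketch below. The p threshold of 0.01 is the one suggested above; the PD threshold of 3.0 is purely a placeholder, since no value has been chosen, and the three judges are invented.

```python
from dataclasses import dataclass

P_THRESHOLD = 0.01   # suggested above
PD_THRESHOLD = 3.0   # hypothetical placeholder; no value has been settled on

@dataclass
class JudgeResult:
    name: str
    pd: float  # average home-skater bonus, in points
    p: float   # p-value for that bonus

def should_flag(result, p_max=P_THRESHOLD, pd_min=PD_THRESHOLD):
    """Flag only when the evidence is strong AND the bias is large enough to matter."""
    return result.p < p_max and result.pd >= pd_min

# Invented examples, not real judges.
results = [
    JudgeResult("Judge A", pd=6.0, p=0.002),  # large bias, strong evidence
    JudgeResult("Judge B", pd=4.0, p=0.20),   # large PD, but could easily be chance
    JudgeResult("Judge C", pd=0.8, p=0.001),  # well-evidenced but tiny bias
]

flagged = [r.name for r in results if should_flag(r)]
print(flagged)
```

Only Judge A gets flagged here: B fails the evidence test and C fails the size test.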

There are a few caveats in particular relating to a PD threshold, that I alluded to in the limitations section. The main caveat is that the PD statistic actually likely under-states the magnitude of judges' biases in important match ups, because highly practically significant cases of judging bias against competitor skaters get mixed in with cases where the judge doesn't really have an incentive to be too biased, and higher stakes competitions, where judges have more incentive to be biased, get mixed with lower stakes competitions where judges don't have as much reason to judge in a biased manner. On paper, 4 points might not look like a big deal given that it's just one judge in one competition, but that's averaged out across all of the competitions that the judge graded--the judge might be much more biased in one particular instance. Lorrie Parker, for instance, has a PD of 5.12, which you might think is not such a big deal, but that bias translated into a 20 point difference between how she scored Nathan Chen relative to other judges and how she scored Boyang Jin (10 points higher than average for Nathan, 10 points lower than average for Boyang). Similarly, Weiguang Chen's 10 point home skater average bias turned into a 37 point difference between how she scored Boyang Jin (22 points above average) and how she scored Javier Fernandez (15 points below). So PDs that may look modest on paper may turn into judging behavior that is potentially highly consequential (especially if the biases don't more-or-less neatly cancel out, as happened at the Olympics).

#### Ziotic

Medalist
I dunno, I am kind of split here. Because, for example, the judges are judging the same competitions and all, so probably better be conservative.
On the other hand, no matter what metric we take, national bias exists.

I agree that bias exists. But I don’t know that it’s always nationalistic at its core. I’m North American, and when I’m looking at the top skaters I tend to prefer North American and Asian skaters.

I almost always enjoy their skating more than European skaters.

That’s not to say I can’t appreciate say T/M but I much prefer the way S/H skate. Similarly I would choose to watch Wakaba over Alina any day of the week.

If I were a judge, those preferences would be reflected. Although technically proficient, Alina to me is not as good of a jumper as Kaori. I would score Kaori higher in GOE for most jumping elements compared to what she currently gets and Alina lower.

Everyone here argues about “fair” scoring but unfortunately no matter what, scoring will always have a reflection of personal preference and not everyone will agree.

#### moriel

Record Breaker
Yeah, I mean PD/p, not PD*p.

My question is, can we have a single scalar value that would adequately represent the degree of bias of a judge? Let's call it bias, B.

If a judge judged just one competition and gave his compatriot +30 points, he should have a high B that would give the ISU a formal reason to take measures.

We could, yes, set a bias threshold for any generic judge, based on how other judges behave.
The basic idea is to find some acceptable interval for things that happen by chance, and use it to identify things that are very unlikely to happen by chance (which means purposeful over/underscoring). For example (all numbers made up), by studying the whole judge pool we can find that an average judge's PD varies, let's say, from 0 to 4, with mean 2. The average judge is biased, but that is not important: for now we focus on the width of the interval. We can assume (yep, debatable) that any judge will have a PD interval of similar width, just with a different center.
So we can think like this: an unbiased judge, then, will have PDs somewhere in the interval between -2 and 2.
Now, let's take judge A. If he gives +1 point to his compatriot, we cannot really claim this judge is biased, because that is something an unbiased judge may do. Now, if judge A gives +5 points to his compatriot, this score is far outside the PD interval for an unbiased judge. Since it is something extremely unlikely to happen if the judge is unbiased, we can conclude that it is very likely that this judge has national bias.

So, for my example with made up numbers, we could set the bias threshold at 2 or 3 for example.
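
A rough sketch of that idea, with an invented PD history for three judges: measure how far each judge's PDs typically stray from that judge's own average, pool those deviations, and treat anything beyond k standard deviations of zero as outside what an unbiased judge would plausibly do. The names and the choice of k = 2 are assumptions for illustration.

```python
from statistics import fmean, pstdev

def pooled_half_width(pd_history, k=2.0):
    """Half-width of the 'chance' interval: k standard deviations of the
    PDs around each judge's own mean, pooled over all judges."""
    residuals = []
    for pds in pd_history.values():
        centre = fmean(pds)
        residuals.extend(x - centre for x in pds)
    return k * pstdev(residuals)

def outside_interval(pd, half_width):
    """True if a single PD is further from zero than chance plausibly allows."""
    return abs(pd) > half_width

# Invented PD histories (one number per competition).
pd_history = {
    "J1": [0.0, 2.0, 4.0],
    "J2": [1.0, 3.0, 2.0],
    "J3": [-1.0, 1.0, 0.0],
}

half_width = pooled_half_width(pd_history)
print("unbiased interval: +/-", round(half_width, 2))
print("+1 flagged?", outside_interval(1.0, half_width))
print("+5 flagged?", outside_interval(5.0, half_width))
```

With these made-up numbers the interval is a bit over ±2, so a +1 passes and a +5 gets flagged, same as in the example above.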

This is the statistical approach, but in practice it is bad, because the threshold depends on the judges, and it punishes not large bias but judges who are outside the corridor. If all judges are always spot on, the one who got a 0.1 PD will get reprimanded, because his PD is relatively large.

The doable way is to just decide which PD is good and which is bad. For example, we can use the following logic: there are 7 judges whose scores factor into the final score. This means that a +1 by one judge means +0.14 on the final score. This is actually nothing, because honestly, rounding errors in the FS can add up to a higher value, I think. So a PD of 1 is not really a big deal. Then discuss it and set the threshold at, let's say, 3, because I think that this is where it starts impacting the score significantly, and reprimand all judges with a PD above 3.
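
The arithmetic, as a tiny sketch. It treats the final score as a plain average of 7 counted judges and assumes the generous judge's score is not the one that gets trimmed, so take it as an upper bound; the function name is made up.

```python
COUNTED_JUDGES = 7  # scores that factor into the final score, as stated above

def shift_in_final_score(bonus_from_one_judge, counted=COUNTED_JUDGES):
    """How much one judge's bonus moves the averaged score (simplified)."""
    return bonus_from_one_judge / counted

for bonus in (1.0, 3.0, 5.0):
    print(f"+{bonus} from one judge -> +{shift_in_final_score(bonus):.2f} on the final score")
```

So a PD of 1 moves the result by about 0.14 and a PD of 3 by about 0.43.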

Overall, I am not a fan of this approach because my fear is that this will force yet more corridor judging. Rather than setting thresholds and reprimanding, I would drastically change the scoring system instead, forcing higher accountability, and then track and reprimand based on that.
For example, currently, the judges award GOE as a number, but they could be forced to simply check what applies to a certain element. Right now, a judge can give a +3 or a +2 or whatever, and that is it, and if you ask to explain, he can just say "well, the jump looked very good, blablabla" and it is really hard to prove anything. Now, if you have judges checking bullets, you can actually point out stuff like "ohh, so Osmond's jumps are not big enough and do not have good flow in and out?".

#### Shanshani

On the Ice
Wow. I wish all journalism was more like that - educational and (at least honestly trying to be) unbiased.

I've always lacked understanding of statistics, so now I'm regretting that I ignored the thread on the judges' suspension - this quarrel is going to be as educational for me as the one about WADA's scandal.

I will later give your post another, more careful read; sorry if some of my questions are already covered by it:
1. Say, a judge took part only in one competition and gave his compatriot 1 bonus point (PD=1). What's his p-value?
2. Which value should be taken into account when selecting the judges to warn or reprimand? Is it p*PD? Or should we better find out a PD for which the p-value would be equal to 0.5 (or 0.05)?
3. Would it be less prone to judges' bias to score the median GOEs and PCSs, i.e. to drop the 8 extreme values instead of 2?

You can use the difference in ISU ranking as a factor. Or better, calculate your own ranking for the last 12 months, because the standings change rapidly in this sport and the per-season values will be inaccurate.

moriel already answered these questions pretty well, but I thought I would too since maybe hearing multiple perspectives might make the issues easier to understand.

1. You need at least two data points in each group (home country skates and foreign skates) in order to run a t-test, so the answer is that you can't come up with a p score for this scenario. But let's say we have 2 home skate scores, both of which are slightly above average. The p value would depend on a lot of factors like how the judge scored the other skaters, whether there was a lot of variation in how much his scores were above/below average and so on. Let's say that our judge was exactly on the dot for foreign skaters, neither over nor underscoring them relative to other judges on average. Let's also say that there's an average amount of variation in his scores--for some he overscored by a few points, others he underscored by a few points. In this kind of scenario, the p value would be pretty high, because it's highly likely that an unbiased judge might by complete accident slightly overscore his compatriots by a small margin.

2. I answered this in the above post, but just to be clear--both should be taken into account separately. I don't think multiplying them together or dividing or whatever produces a meaningful statistic, and they measure different things. PD measures degree of bias, p measures confidence that bias exists (speaking loosely--you can read the OP in order to learn what p technically measures). I would say that in order to punish judges, you need both clear evidence that bias exists (so, low p) as well as a degree of bias which is practically significant (we could ignore some low PDs even if p were low).

3. Median score might be interesting, and in theory it would be more protective from multiple extreme values than trimmed mean (what we have now), which means it might be somewhat effective against bloc judging, actually. But I haven't really fiddled around with it enough to say whether it would really change things that much. Eyeballing my data sheets, trimmed mean versus simple mean actually rarely affects placements, and usually when it does it's for placements that people don't tend to be too concerned about (eg. 19th versus 20th).
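
A quick sketch of point 3, with an invented nine-judge panel where three judges score as a bloc. `trimmed_mean` drops only the single highest and lowest mark, which is my simplified reading of the current system; the marks themselves are made up.

```python
from statistics import fmean, median

def trimmed_mean(scores):
    """Drop the single highest and lowest score, average the rest."""
    ordered = sorted(scores)
    return fmean(ordered[1:-1])

# Invented marks: six independent judges plus a bloc of three inflating together.
honest = [8.0, 8.0, 8.25, 8.25, 8.5, 8.5]
bloc = [9.5, 9.5, 9.5]
panel = honest + bloc

print("honest judges only:", round(fmean(honest), 3))
print("simple mean:       ", round(fmean(panel), 3))
print("trimmed mean:      ", round(trimmed_mean(panel), 3))
print("median:            ", median(panel))
```

Trimming removes only one of the three inflated marks, so the trimmed mean barely improves on the simple mean, while the median moves less. It still ends up above what the six independent judges gave, which fits "somewhat effective" rather than a cure.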

#### chopinskate

Median score might be interesting, and in theory it would be more protective from multiple extreme values than trimmed mean (what we have now), which means it might be somewhat effective against bloc judging, actually. But I haven't really fiddled around with it enough to say whether it would really change things that much.

It might change the corridor (and how the judges score because of it). However, I don't think it would eliminate biases like the "quad bonus" PCS.

I do think median might be a step in the right direction.

#### Sam-Skwantch

##### “I solemnly swear I’m up to no good”
Record Breaker
I agree that bias exists. But I don’t know that it’s always nationalistic at its core. I’m North American, and when I’m looking at the top skaters I tend to prefer North American and Asian skaters.

I almost always enjoy their skating more than European skaters.

That’s not to say I can’t appreciate say T/M but I much prefer the way S/H skate. Similarly I would choose to watch Wakaba over Alina any day of the week.

If I were a judge, those preferences would be reflected. Although technically proficient, Alina to me is not as good of a jumper as Kaori. I would score Kaori higher in GOE for most jumping elements compared to what she currently gets and Alina lower.

Everyone here argues about “fair” scoring but unfortunately no matter what, scoring will always have a reflection of personal preference and not everyone will agree.

This whole post puzzles me. I’ve never cared much where people are from and could never make any declaration that I have any preference toward any region. It might seem strange to you but I find skaters I like from all parts of the world and in spite of minor stylistic differences would score them on equal footing.....even if it didn’t suit my personal taste. That’s the way professional judging is supposed to work IMO. It’s not supposed to be an opportunity for a judge to impose their preferences on the world. I’m not sure if Alina should score higher or not but to even generally suggest that seems like an improper way to assess scores. Judges are supposed to look at the jump and judge the jump in the moment..not relate it to and compare it with other skaters. It should reflect the effort of the day and who knows what scenario could transpire to shape the results. IMO a good judge will be surprised from time to time at their marks and not operate under the type of bias you think is normal and acceptable.

#### chopinskate

That’s the way professional judging is supposed to work IMO.

It is supposed to. And perhaps you would do that. But I'm not sure if it stands true for everyone.

I try to score skaters sometimes (like at Worlds 2018, when there was a fight over whether Wakaba or Kaetlyn should have won), and the scores turn out looking pretty fair. The skaters I don't care for can beat the ones I like based on what they put on the ice. Completely depends on the performance. I'm not sure if I'd expect every judge to do this. It's a genuine question, and I don't think Ziotic is saying it's bias that is "acceptable", simply that it exists.
