This thread is for anyone who wants to understand some of the measures that statisticians use to draw conclusions about the merits and demerits of different judging systems. This is easy. Don't be intimidated if math isn't your favorite subject. It is my job to make it your favorite subject, LOL.
I will start with this question. Under the ordinal system, how can we decide whether the judges are in substantial agreement in their ordinal placements, or whether the agreement is so bad that we begin to suspect that something is fishy?
I will use the data from the Ladies skate at the recent International Skating Challenge to show how to do this. Again, this is easy!
      RUS  GER  CAN  USA  JPN
MK     1    1    2    1    2
SA     4    2    1    2    1
SC     2    3    4    5    3
JR     5    4    3    4    4
JK     3    5    5    3    5
AP     7    6    6    6    6
ES     6    7    7    7    7
As we see by reading down the columns, the German judge was the only one that "got it right," in the sense of matching the majority in every placement. Let's see how far off the Russian judge is from the German judge.
For each skater, take the difference (d) between the placements given by the two judges. Square this difference, then add them all up.
RUS  GER   d   d-squared
 1    1    0       0
 4    2    2       4
 2    3    1       1
 5    4    1       1
 3    5    2       4
 7    6    1       1
 6    7    1       1
                 ----
          sum:    12
Now run this sum of 12 through the following formula, where n is the number of skaters (here n = 7):
r' = 1- [6(sum of the d-squareds)]/[(n-1)n(n+1)]
r' = 1 - (6*12)/(6*7*8)
r' = 1 - 72/336 ≈ .79
We say informally that there is roughly a 79% correlation between the two rankings.
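If you would rather let a computer do the arithmetic, here is a quick sketch in Python. The two lists are just the RUS and GER columns from the table at the top; the exact value is 0.7857..., i.e. about 79%.

```python
# Spearman's rank correlation: r' = 1 - 6*sum(d^2) / ((n-1)*n*(n+1))

def spearman(ranks_a, ranks_b):
    """Spearman's rank correlation for two lists of untied ranks."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / ((n - 1) * n * (n + 1))

rus = [1, 4, 2, 5, 3, 7, 6]  # RUS column, top to bottom
ger = [1, 2, 3, 4, 5, 6, 7]  # GER column

print(round(spearman(rus, ger), 2))  # 0.79
```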
Note: This statistic r' is called the "rank correlation coefficient." It was invented by Charles Spearman (1863-1945), and its distribution was worked out by "Student" (the nom de plume of W. S. Gosset, 1876-1937). It is a surrogate for the more common correlation coefficient r, used for continuous data. To test r' for significance, it is converted to a statistic that follows Student's t distribution with n-2 degrees of freedom.
OK, forget that. Here is the main point. There is a 79% correlation between the rankings of the Russian judge and the German judge. So what? Well, the closer this statistic is to 100%, the more the two judges agree. A correlation of 0 means there is no relationship at all between the two rankings, as if the judges weren't even watching the same event. So 79% is "not too good, not too bad."
Now we must quantify what "not too good or bad" means. Let's suppose that whatever we want to say about these judges, we agree to hold our tongue and not say anything at all unless we can be 95% certain that we are right. In that case, there is a "critical value" that we must beat before we can say that there is a "statistically significant correlation" between the two judges. In our case, the critical value turns out to be C = .67.
So, bottom line: if we get a correlation bigger than .67, we can be 95% sure that there is at least some genuine agreement between the two judges. A correlation of less than .67 means we cannot rule out that the apparent agreement is just chance.
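Where does the .67 come from? It falls out of the Student's t approximation mentioned in the note above: inverting t = r * sqrt((n-2)/(1-r^2)) gives the critical correlation C = t / sqrt(t^2 + n - 2), where t is the one-sided 95% point of the t distribution with n-2 = 5 degrees of freedom (about 2.015, from any t table). A quick sketch in Python:

```python
import math

n = 7      # number of skaters
t = 2.015  # one-sided 95% point of Student's t with n-2 = 5 df (from a t table)

# Invert t = r * sqrt((n-2)/(1-r^2)) to get the critical correlation:
critical_r = t / math.sqrt(t ** 2 + n - 2)
print(round(critical_r, 2))  # 0.67
```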
Here are the correlations of all the judges, paired two by two. Remember, if we beat .67 correlation, that's good.
RUS vs. CAN: r' = .57 (Oops. We cannot be 95% sure that these two judges are even watching the same competition.)
RUS vs. USA: r' = .71
RUS vs. JPN: r' = .68
GER vs. CAN: r' = .93 (That's good.)
GER vs. USA: r' = .86
GER vs. JPN: r' = .96 (Very good match.)
CAN vs. USA: r' = .86
CAN vs. JPN: r' = .96 (It looks like Germany, Canada and Japan are pretty much on the same page.)
USA vs. JPN: r' = .82
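If you'd rather not grind through all ten pairs by hand, here is a sketch that loops over every pair of judges (the columns are copied from the table at the top, and the spearman function is the same formula as before):

```python
from itertools import combinations

def spearman(ranks_a, ranks_b):
    """Spearman's rank correlation for two lists of untied ranks."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / ((n - 1) * n * (n + 1))

judges = {
    "RUS": [1, 4, 2, 5, 3, 7, 6],
    "GER": [1, 2, 3, 4, 5, 6, 7],
    "CAN": [2, 1, 4, 3, 5, 6, 7],
    "USA": [1, 2, 5, 4, 3, 6, 7],
    "JPN": [2, 1, 3, 4, 5, 6, 7],
}

# Print r' for every pair of judges:
for (name_a, a), (name_b, b) in combinations(judges.items(), 2):
    print(f"{name_a} vs. {name_b}: r' = {spearman(a, b):.2f}")
```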
So there you are.
Wait, one more thing. In the cases where judges appear to disagree, where do these differences come from? Well...
CAN matched the majority rankings for each skater, except that she (Casey Kelly) put Arakawa ahead of Kwan and Jennifer Robinson ahead of Cohen.
JPN matched the majority in every ranking, except that he (Tomiko Yamada) put Shizuka Arakawa ahead of Kwan.
GER had no horse in the race, and she (Sissy Krick) matched the majority without exception.
RUS (Tatiana Danilenko) was the only judge to put Sokolova ahead of McDonough, and was also the only judge to put Cohen ahead of Arakawa.