I love the CoP
OK, here is my first shot at analyzing the degree of agreement and disagreement among the judges. One thing is clear. If the ISU follows through on its promise to give a thorough analysis of these data for all competitions at the end of the season, any judge who is way out of line with his/her colleagues will stick out like a sore thumb.
This study is a two-way Analysis of Variance (ANOVA), a statistical procedure for computing how much of the variation in the scores is due to the judges and how much is due to the skaters. I did this for the ladies short program, analyzing the technical marks and the program elements separately. If there is any interest I can do some of the other competitions, too. Here is a summary of the results.
The main statistic to compute is called a Sum of Squares. The Total Sum of Squares, SS(Total), is the total amount of variation in the entire data set. The portions of that variation due to the skaters, SS(Skaters), and to the judges, SS(Judges), are then compared to the total. The residual variation, SS(Random), is the part due to random fluctuations (noise).
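For anyone who wants to see the arithmetic, here is a minimal sketch of the sum-of-squares decomposition. The score matrix is entirely made up (a toy 3-skater, 4-judge panel), not the actual competition data:

```python
# Hypothetical 3-skater x 4-judge score matrix (illustration only;
# not the actual competition data)
scores = [
    [5.8, 5.9, 5.7, 5.8],
    [5.5, 5.4, 5.6, 5.5],
    [5.0, 5.1, 4.9, 5.0],
]
n_skaters = len(scores)
n_judges = len(scores[0])

grand_mean = sum(sum(row) for row in scores) / (n_skaters * n_judges)
skater_means = [sum(row) / n_judges for row in scores]
judge_means = [sum(row[j] for row in scores) / n_skaters
               for j in range(n_judges)]

# Total variation of every mark around the grand mean
ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
# Variation explained by differences among the skaters
ss_skaters = n_judges * sum((m - grand_mean) ** 2 for m in skater_means)
# Variation explained by systematic differences among the judges
ss_judges = n_skaters * sum((m - grand_mean) ** 2 for m in judge_means)
# Residual "random" variation (noise)
ss_random = ss_total - ss_skaters - ss_judges
```

The three pieces add up to SS(Total) by construction, which is why the percentages in the tables below sum to 100%.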
Ladies Short Program – Technical Marks
SS(Skaters) = 1875.0 (90%)
SS(Judges) = 73.2 (3%)
SS(Random) = 143.2 (7%)
SS(Total) = 2092.4
Ladies Short Program – Program Elements
SS(Skaters) = 967.6 (61%)
SS(Judges) = 74.8 (5%)
SS(Random) = 541.6 (34%)
SS(Total) = 1584.0
The first set of data shows that almost all of the variation in scores was due to the differences in the performances of the skaters (no surprise).
In both cases there appears to be substantial agreement among the judges.
The larger "random" variation in the program elements, compared to the technical marks, reflects the fact that those marks are more subjective. The fact that there was much more random variation than variation among the judges means that the judges gave differing scores for particular elements, but that in the total mark for each skater there was not much difference of opinion among them. In other words, the judges might disagree here and there, but they are not systematically over- or undermarking any particular skater.
If you want to get a little more nerdy: the next step is to convert these sums of squares into the standard "F" statistic. When this was done for this test, all of the differences turned out to be statistically significant at the .05 level (although just barely, in the case of the differences among the judges in the second set of data). This means that the differences among the judges, although slight, are nevertheless large enough to be called real differences, not just random statistical fluctuations.
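For the extra-nerdy, the conversion works by dividing each sum of squares by its degrees of freedom to get a mean square, then taking the ratio of each effect's mean square to the residual mean square. A minimal sketch, using made-up sums of squares from a hypothetical 3-skater, 4-judge panel (not the actual competition numbers):

```python
# Hypothetical panel: 3 skaters, 4 judges, with toy sums of squares
# (illustration only; not the actual competition data)
n_skaters, n_judges = 3, 4
ss_skaters, ss_judges, ss_random = 1.3067, 0.0067, 0.0533

# Degrees of freedom for a two-way ANOVA without replication
df_skaters = n_skaters - 1            # 2
df_judges = n_judges - 1              # 3
df_random = df_skaters * df_judges    # 6

# A mean square is a sum of squares per degree of freedom
ms_skaters = ss_skaters / df_skaters
ms_judges = ss_judges / df_judges
ms_random = ss_random / df_random

# The F statistic compares each effect's mean square to the noise level
f_skaters = ms_skaters / ms_random
f_judges = ms_judges / ms_random
# Each F value is then compared to the critical value from an F table
# (here F(2, 6) and F(3, 6) at the .05 level) to decide significance
```

An F ratio well above the table's critical value means the effect is too large to be dismissed as noise; an F ratio near 1 means the effect is about the size of the random fluctuations.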