1.
I love the CoP

OK, here is my first shot at analyzing the degree of agreement and disagreement among the judges. One thing is clear. If the ISU follows through on its promise to give a thorough analysis of these data for all competitions at the end of the season, any judge who is way out of line with his/her colleagues will stick out like a sore thumb.

This study uses a two-way Analysis of Variance (ANOVA), a statistical procedure for computing how much of the variation is among the judges and how much is among the skaters. I did this for the ladies short program, for the technical marks and the program elements separately. If there is any interest I can do some of the other competitions, too. Here is a summary of the results.

The main statistic to compute is called the Sum of Squares. The Total Sum of Squares, SST, is the total amount of variation in the entire data set. The Sums of Squares due to the Skaters, SS(Skaters), and to the Judges, SS(Judges), are then compared to the total. The residual variation is the part due to random fluctuations (noise).
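For the nerds who want to see the arithmetic: here is a sketch of the decomposition described above, on a small made-up score matrix. The scores, and the panel of 3 skaters and 4 judges, are invented for illustration; the real numbers would come from the published protocols.

```python
# Two-way ANOVA sums of squares for a score table
# (rows = skaters, columns = judges). Scores below are hypothetical.

def anova_sums_of_squares(scores):
    """Partition total variation into skater, judge, and residual parts."""
    n_skaters = len(scores)
    n_judges = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n_skaters * n_judges)

    skater_means = [sum(row) / n_judges for row in scores]
    judge_means = [sum(scores[i][j] for i in range(n_skaters)) / n_skaters
                   for j in range(n_judges)]

    # SST: total squared deviation from the grand mean
    ss_total = sum((scores[i][j] - grand_mean) ** 2
                   for i in range(n_skaters) for j in range(n_judges))
    # SS(Skaters): judges-per-skater times squared deviation of skater means
    ss_skaters = n_judges * sum((m - grand_mean) ** 2 for m in skater_means)
    # SS(Judges): skaters-per-judge times squared deviation of judge means
    ss_judges = n_skaters * sum((m - grand_mean) ** 2 for m in judge_means)
    # SS(Random): whatever is left over (the residual, SSE)
    ss_random = ss_total - ss_skaters - ss_judges
    return ss_total, ss_skaters, ss_judges, ss_random

# Hypothetical marks for 3 skaters from 4 judges:
scores = [[5.8, 5.9, 5.7, 5.8],
          [5.2, 5.3, 5.1, 5.2],
          [4.9, 5.0, 4.8, 5.1]]
sst, ssk, ssj, sse = anova_sums_of_squares(scores)
```

Notice that SST always splits exactly into the three pieces, which is why you can report each as a percentage of the total.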

Ladies Short Program - Technical Marks

SS(Skaters) = 1875.0 (90%)
SS(Judges) = 73.2 (3%)
SS(Random) = 143.2 (7%)

SS(Total) = 2092.4

Ladies Short Program - Program Elements

SS(Skaters) = 967.6 (61%)
SS(Judges) = 74.8 (5%)
SS(Random) = 541.6 (34%)

SS(Total) = 1584.0

The first set of data shows that almost all of the variation in scores was due to the differences in the performances of the skaters (no surprise).

In both cases there appears to be substantial agreement among the judges.

The large "random" variation in the Program Elements compared to the Technical Marks reflects the fact that these marks are more subjective. The fact that there was much more random variation than variation among the judges means that the judges gave differing scores for particular elements, but in the total mark for each skater, there was not much difference of opinion among the judges. In other words, the judges might disagree but they are not systematically over- or undermarking any particular skater.

If you want to get a little more nerdy: the next step is to convert these Sum of Squares numbers into the standard variable of the "F" statistic. When this is done, for this test, all of the differences turned out to be "statistically significant at the .05 level of significance" (although just barely, in the case of the differences among the judges in the second set of data). This means that the differences among the judges, although slight, are nevertheless large enough to be called real differences, not just random statistical fluctuations.
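The conversion from Sums of Squares to F statistics can be sketched as follows. The SS values are the ones quoted above for the technical marks, but the table dimensions (9 judges, 30 skaters) are assumptions here; the actual panel and field sizes determine the degrees of freedom and hence the critical values.

```python
# Converting sums of squares to F statistics.
# Panel and field sizes are assumed for illustration.
n_skaters, n_judges = 30, 9

df_skaters = n_skaters - 1
df_judges = n_judges - 1
df_error = df_skaters * df_judges   # residual degrees of freedom

ss_skaters, ss_judges, ss_error = 1875.0, 73.2, 143.2

ms_skaters = ss_skaters / df_skaters  # mean square = SS / degrees of freedom
ms_judges = ss_judges / df_judges
ms_error = ss_error / df_error

f_skaters = ms_skaters / ms_error   # compare to F(df_skaters, df_error) critical value
f_judges = ms_judges / ms_error     # compare to F(df_judges, df_error) critical value
```

Each F ratio is then compared to the critical value of the F distribution at the .05 level; ratios larger than the critical value are the "statistically significant" differences.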

Mathman

2.
MM, maybe I am rather dense, but could you explain the "random" part again, please? I am well familiar with the "noise" concept, but I do not fully understand what it means in this context.

Thank you.

3.
So what will happen if a judge has > 0.5? Hands being slapped? So what will be the next development? EVIL THOUGHTS If I am a judge and I want to avoid the hand slapping, I will find my judging buddies and do some Toe Tapping BLOCK JUDGING

Mathman, since you love CoP so much, can you figure out a way for Fumie and Irina to place Au and Ag or Ag and Au at worlds?

4.
Ptichka --

The short answer is that it means "none of the above." That is, after you subtract off the part of the total variation that is associated with differences in the final marks given by the judges, and also the part of the total variation that is associated with the final marks received by the skaters, SS(Random) is whatever is left unaccounted for.

A better answer is that this is the part of the total variation due to sampling error. In fact, if you look in a statistics book you will see what I called SS(Random) written as SSE (the sum of the squares due to "error").

(Aside: History. The conventional names are SS(Tr), SSB and SSE, instead of SS(skaters), SS(judges) and SS(Random). Tr stands for "treatments" and B stands for "blocks." This statistical test was developed in an agricultural setting, where people wanted to measure how crops responded to different fertilizer treatments applied to different types of soil.)

I don't like to use the word "error" because it sounds like you are making a mistake. "Sampling error" refers to the fact that when you take a small sample to estimate a statistical feature of a whole population (in this case the mean), you are likely to be off by a little bit. Like when you say, I took an exit poll of 500 voters. The percentage of people who voted "yes" was 62% with a standard error of +/- 2.2%. The 2.2% reflects the fact that you can't be certain that the sample exactly mirrors the whole population from which it is drawn.

In our case, the "population" is the set of all possible scores that these judges might give to these skaters if they skated the same performances over and over millions of times. The actual scores that we see comprise a "sample" of this (hypothetical) population. If we did the whole contest over again, with exactly the same performances, we might see a slight variation in scores from the time before, even though the judges and the performances haven't changed. This is the "random" factor.
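The "do the contest over millions of times" idea is easy to simulate. Here is a toy version: the same 9-judge panel scoring the same performance on five hypothetical reruns, where each mark wobbles a little around an invented "true" score (the 5.7 base and the 0.1 noise level are made up for illustration).

```python
import random

# Toy illustration of sampling error: the same performance, rescored
# repeatedly with a small random wobble in each judge's mark.
random.seed(1)

true_score = 5.7
rerun_means = []
for _ in range(5):                     # five hypothetical reruns of the contest
    marks = [true_score + random.gauss(0, 0.1) for _ in range(9)]  # 9 judges
    rerun_means.append(sum(marks) / len(marks))
```

The five averages come out slightly different each time, even though the "performance" never changed; that spread is the SS(Random) part of the variation.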

RTureck --

Actually, it is pretty easy to spot both a single rogue judge who needs to have his hand slapped and a bloc of toe-tappers. In the latter case the suspected bloc will have less variation than expected within the bloc, compared to the total variation among all judges. Both the ISU and its critics are going to have a field day with all this!
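A rough sketch of that bloc-spotting idea: compare the variation of marks within a suspected bloc to the variation across the whole panel. The marks and the 0.25 threshold below are invented for illustration; a real test would use a proper F ratio and repeat the comparison across many skaters and events.

```python
# Spotting a suspiciously tight bloc: its within-bloc variance should be
# much smaller than the variance across the whole panel. Marks are invented.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

panel = [5.2, 5.8, 5.3, 5.9, 5.25, 5.7, 5.3, 5.6, 5.28]   # one skater's marks
bloc = [panel[0], panel[2], panel[4], panel[6], panel[8]]  # suspected judges

# Arbitrary illustrative threshold: flag a bloc whose variance is under a
# quarter of the panel-wide variance.
suspiciously_tight = variance(bloc) < 0.25 * variance(panel)
```

In this made-up example the five suspected judges cluster between 5.2 and 5.3 while the rest of the panel spreads up to 5.9, so the flag trips.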

Mathman

5.
Mathman,
FINALLY, without the alphabet blocks 6.0 system, WE AGREE on the statistics! Woo hoo! Woo hoo! Statisticians party in the Celebrations folder! Not really. Why? If you put all the statisticians around the world end to end, it would be a good idea. J/K But I really do agree with Mathman, who finally got the idea of applied statistics.
Rgirl

6.
Mathman - I am overwhelmed with this statistical summary of the CoP - not that I understand it - but I just trust you to know what you are figuring out. Do you think it is too late for an old skater to go back and take Statistics courses beyond 101?

"Actually, it is pretty easy to spot both a single rogue judge who needs to have his hand slapped and a bloc of toe-tappers. In the latter case the suspected bloc will have less variation than expected within the bloc, compared to the total variation among all judges. Both the ISU and its critics are going to have a field day with all this!"

The above looks good, MM. What about bloc judging? Say there are 3 or 4 judges with the same cultural background, and they all seem to be scoring the same way, and given the other judges are not scoring their way, will they get a hand slap?

Joe

7.
Wow, Mathman. Glad you proved it statistically. I'm too lazy to dig up my college book right now to fully comprehend it. But overall I agree with what you said.

Do you think you could use this method to analyze all GP events where CoP is used? Then you would be able to say which competitions were judged fairly, based on your statistics?

8.
FINALLY, without the alphabet blocks 6.0 system, WE AGREE on the statistics...But I really do agree with Mathman, who finally got the idea of applied statistics. -- Rgirl
Just remember which one of us is MATHman, R., LOL. Just because a dance piece that you choreographed is playing to the crowned heads of India...(Did you guys know that? Y-a-ay for Rgirl!) ...

But seriously, our arguments last year weren't about the 6.0 system per se, they were about the interim system. That system gave us as little data as possible and led to questions about whether it was possible to make guesses about what the missing data might have been.

Now, in contrast, we have an embarrassment of riches. We have data coming out of our ears. I love it.

The above looks good, MM. What about bloc judging? Say there are 3 or 4 judges with the same cultural background, and they all seem to be scoring the same way, and given the other judges are not scoring their way, will they get a hand slap? -- Joe
That's the big question, Joe. The CoP makes it possible to make statements like: "I am 99% sure that judges 3, 8, and 11 voted together significantly more often than you would expect, based on random chance." What the ISU decides to do with these conclusions is anybody's guess at this point.

Since outright and confessed cheaters get only a couple of years suspension, I doubt that anything much will come of it. That's the missing piece of the puzzle -- resolution on the part of the ISU to punish dishonesty when they find it. It is hard to give out harsh punishments because in the final analysis, everything is in the hands of the Federations -- who appoint and give oversight to the judges in the first place.
Do you think you could use this method to analyze all GP events where CoP is used? Then you would be able to say which competitions were judged fairly, based on your statistics? -- MZheng
The test that I used can only say, yes or no, whether the judges were in substantial agreement or whether there were significant differences in the scoring. Even if there are significant differences, it doesn't necessarily mean that anyone is cheating.

But there are lots of other tests that can be done. The important thing is that all the data is before us, albeit anonymously.

Mathman

9.
So, Mathman,

What about the 5 out of 14 judges scores being used that was mentioned in the other thread?

Is that a problem mathematically with arriving at a representative outcome?

10.
Norlite, that was the very question that Rgirl and I argued so much about last year. I presented what I thought was the World's All-Time Most Convincing Argument that this was fine and that mathematically it didn't matter if those other five (which might, after all, make the contest go the other way if a different five were chosen to be eliminated) were discarded.

I convinced myself of this, but I didn't convince Rgirl, and towards the end, I was weakening. (Don't tell Rgirl this.)

I will rethink the argument as it applies specifically to the CoP. Certainly the effect is mitigated by the cropped averaging, versus the ordinal placements of the old system.

Mathman

11.
Thanks,

This is slightly different from last year, in that if the panel is 14, 5 are thrown out off the top (leaving 9, I think), and then out of that 9 the two highest and two lowest are discarded, ending up with 5 out of 14. ????

Don't know how much of a difference that makes.
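The selection rule described in that post can be sketched directly. The marks below are invented, and the "5 thrown out off the top" step is assumed here to be a random draw (as under the anonymous-panel rules); if the ISU selects them some other way, only that one line changes.

```python
import random

# Sketch of the 5-of-14 rule: discard 5 judges at random, then trim the
# 2 highest and 2 lowest of the remaining 9, and average the 5 survivors.

def cop_trimmed_mean(marks, rng):
    assert len(marks) == 14
    kept = rng.sample(marks, 9)       # random draw of 9 of the 14 (assumption)
    kept.sort()
    middle = kept[2:-2]               # drop the 2 highest and 2 lowest
    return sum(middle) / len(middle)  # average of the surviving 5 marks

rng = random.Random(0)
marks = [5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6,
         5.7, 5.8, 5.9, 6.0, 4.8, 4.9, 5.05]
result = cop_trimmed_mean(marks, rng)
```

Because of the trimming, a single extreme mark can never reach the average; to move the result, a bloc has to land inside the middle five.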

12.
It won't be too difficult to get 5-7 toe tappers; then the 2-4 non-tappers will look like they are the rogues.

BTW, I have no problem with the term sampling error.

13.
Dropping 5 out of 14 for Olys and Worlds, or dropping 3 out of 10 for GPs, helps to get rid of bloc judging, imo.

Mathman, does that figure statistically?

Joe

