Nebelhorn Results

Norlite · Sep 9, 2003

hockeyfan228 said:
I think that the ISU will need to come up with a consistent standard for scoring the over-/under-rotated jumps. It's possible that some judges are taking it out of the GOE (double whammy), and some are expecting the devaluation by the caller -- i.e. from 3L to 2L -- to take care of it. Also, there seem to be variations in the CCoSp scoring more than some other spins. For the program elements, while most judges are relatively consistent among elements, they can be pretty far apart from each other, by 3 or more points on occasion. Sounds like more seminars to me.

I totally agree, hockeyfan. I was looking earlier at Corwin's flip. It seems that a couple of judges were harsh with a -3 on a jump *already* downgraded to a 2F. I would think the -2 would have been sufficient, and I could live with the -1. It is so hard to say since this competition was not seen. But my understanding is the -3 would be used for a fall, or *major* problems with edge at takeoff or landing, and rotation, and step out, together. The guide says 3 or more problems of -1.
Maybe she did, who knows??

But, for the most part, they do seem mostly consistent.

Mathman- don't forget in your correlations, the 2 highest and 2 lowest scores are thrown out.

Ptichka · Sep 9, 2003

Norlite, what you say makes sense. I mean, it would make sense if it were a -3 on the 3F, but not on a downgraded jump. Does seem like judges and caller were not fully in sync on this.

Norlite · Sep 9, 2003

Yes, as it would seem. But, again, we did not see this.

We will be able to go at it full throttle after Skate America and Skate Canada :laugh:

Mathman · Sep 9, 2003

Joe, here's the link. Click on "Detailed scores" on the right.

http://www.isufs.org/events/nt03/index.htm

Joesitz (RIP) · Sep 10, 2003

Mathman - Thanks. I took a glance but I'm in a hurry on some personal matters so I couldn't study it. The quick glance did show that the names of the judges for each skater are still secret. Oh well,

Joe

Mathman · Sep 10, 2003

I love the CoP

OK, here is my first shot at analyzing the degree of agreement and disagreement among the judges. One thing is clear. If the ISU follows through on its promise to give a thorough analysis of these data for all competitions at the end of the season, any judge who is way out of line with his/her colleagues will stick out like a sore thumb.

This study is a factor analysis using a two-way Analysis of Variance (ANOVA). This is a statistical procedure for computing how much variation there is among the judges and how much variation there is among the skaters. I did this for the ladies short program, for the technical scores and the program elements separately. If there is any interest I can do some of the other competitions, too. Here is a summary of the results.

The main statistic to compute is called the Sum of Squares. The Total Sum of Squares, SST, is the total amount of variation in the entire data field. The Sum of Squares due respectively to the Skaters, SS(Skaters), and to the Judges, SS(Judges), is then compared to the total. The residual variation is the part due to random fluctuations (noise).
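For anyone who wants to try the decomposition described above, here is a minimal Python sketch of the sums of squares for a two-way layout with one mark per skater per judge. The function name, the dictionary keys and the small table of marks are made up for illustration; they are not the Nebelhorn scores.

```python
def two_way_sums_of_squares(scores):
    """scores[i][j] = mark given to skater i by judge j (hypothetical layout)."""
    n_skaters = len(scores)
    n_judges = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n_skaters * n_judges)

    # Mean mark received by each skater, and mean mark given by each judge.
    skater_means = [sum(row) / n_judges for row in scores]
    judge_means = [
        sum(scores[i][j] for i in range(n_skaters)) / n_skaters
        for j in range(n_judges)
    ]

    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_skaters = n_judges * sum((m - grand_mean) ** 2 for m in skater_means)
    ss_judges = n_skaters * sum((m - grand_mean) ** 2 for m in judge_means)
    # Whatever is left over is the residual ("random") part.
    ss_random = ss_total - ss_skaters - ss_judges
    return {
        "skaters": ss_skaters,
        "judges": ss_judges,
        "random": ss_random,
        "total": ss_total,
    }


# Made-up marks: 3 skaters (rows) scored by 3 judges (columns).
example_scores = [
    [30.0, 31.0, 29.5],
    [25.0, 26.5, 25.5],
    [20.0, 21.0, 19.0],
]
ss = two_way_sums_of_squares(example_scores)
for name, value in ss.items():
    print(f"SS({name}) = {value:.2f} ({100 * value / ss['total']:.0f}%)")
```

The percentages printed are each component's share of the total, which is how the figures in parentheses in a summary like this one are usually read.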

Ladies Short Program - Technical Marks

SS(Skaters) = 1875.0 (90%)
SS(Judges) = 73.2 (3%)
SS(Random) = 143.2 (7%)

SS(Total) = 2092.4

Ladies Short Program - Program Elements

SS(Skaters) = 967.6 (61%)
SS(Judges) = 74.8 (5%)
SS(Random) = 541.6 (34%)

SS(Total) = 1584.0

The first set of data show that almost all of the variation in scores was due to the differences in the performances of the skaters (no surprise).

In both cases there appears to be substantial agreement among the judges.

The large "random" variation in the Program Elements compared to the Technical Marks reflects the fact that these marks are more subjective. The fact that there was much more random variation than variation among the judges means that the judges gave differing scores for particular elements, but in the total mark for each skater, there was not much difference of opinion among the judges. In other words, the judges might disagree but they are not systematically over- or undermarking any particular skater.

If you want to get a little more nerdy: the next step is to convert these Sum of Squares numbers into the standard variable of the "F" statistic. When this is done, for this test, all of the differences turned out to be "statistically significant at the .05 level of significance" (although just barely, in the case of the differences among the judges in the second set of data). This means that the differences among the judges, although slight, are nevertheless large enough to be called real differences, not just random statistical fluctuations.
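As a rough sketch of that conversion: in a two-way layout with one mark per skater per judge, each sum of squares is divided by its degrees of freedom (skaters minus 1, judges minus 1, and the product of the two for the residual), and the F ratio is the resulting mean square divided by the residual mean square. The function name is made up, and the panel sizes in the example call are placeholders, since the number of skaters and judges is not stated in this post.

```python
def f_ratios(ss_skaters, ss_judges, ss_random, n_skaters, n_judges):
    """F ratios for a two-way ANOVA with one observation per cell."""
    df_skaters = n_skaters - 1
    df_judges = n_judges - 1
    df_random = df_skaters * df_judges

    # Mean square = sum of squares / degrees of freedom.
    ms_random = ss_random / df_random
    return {
        "F_skaters": (ss_skaters / df_skaters) / ms_random,
        "F_judges": (ss_judges / df_judges) / ms_random,
        "df": (df_skaters, df_judges, df_random),
    }


# Sums of squares from the Program Elements summary above.
# ASSUMPTION: 12 skaters and 10 judges are placeholder counts, not from the results.
result = f_ratios(967.6, 74.8, 541.6, n_skaters=12, n_judges=10)
print(result)
```

Each F value is then compared with the critical value of the F distribution for the matching degrees of freedom at the chosen significance level (for example from a table in a statistics book).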

Mathman

Ptichka · Sep 10, 2003

MM, maybe I am rather dense, but could you explain the "random" part again please? I am well familiar with the "noise" concept, but I do not fully understand what it means in this context.

Thank you.

rtureck · Sep 10, 2003

So what will happen if a judge has > 0.5? Hands being slapped? So what will be the next development? EVIL THOUGHTS If I am a judge and I want to avoid the hand slapping, I will find my judging buddies and do some Toe Tapping BLOCK JUDGING :laugh:

Mathman, since you love CoP so much, can you figure out a way for Fumie and Irina to place Au and AG or AG and AU at worlds?

Mathman · Sep 10, 2003

Ptichka --

The short answer is that it means "none of the above." That is, after you subtract off the part of the total variation that is associated with differences in the final marks given by the judges, and also the part of the total variation that is associated with the final marks received by the skaters, SS(Random) is whatever is left unaccounted for.

A better answer is that this is the part of the total variation due to sampling error. In fact, if you look in a statistics book you will see what I called SS(Random) written as SSE (the sum of the squares due to "error").

(Aside: History. The conventional names are SS(Tr), SSB and SSE, instead of SS(skaters), SS(judges) and SS(Random). Tr stands for "treatments" and B stands for "blocks." This statistical test was developed in an agricultural setting, where people wanted to measure how crops responded to different fertilizer treatments applied to different types of soil.)

I don't like to use the word "error" because it sounds like you are making a mistake. "Sampling error" refers to the fact that when you take a small sample to estimate a statistical feature of a whole population (in this case the mean), you are likely to be off by a little bit. Like when you say, I took an exit poll of 500 voters. The percentage of people who voted "yes" was 62% with a standard error of +/- 2.2%. The 2.2% reflects the fact that you can't be certain that the sample exactly mirrors the whole population from which it is drawn.
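The 2.2% in the exit-poll example can be checked with the usual standard-error formula for a sample proportion, sqrt(p(1 - p)/n). A quick sketch in Python (the function name is just for illustration):

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a sample proportion p from a simple random sample of size n."""
    return math.sqrt(p * (1 - p) / n)


# Exit poll from the example: 500 voters, 62% voted "yes".
se = proportion_standard_error(0.62, 500)
print(f"{100 * se:.1f}%")  # prints 2.2%
```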

In our case, the "population" is the set of all possible scores that these judges might give to these skaters if they skated the same performances over and over millions of times. The actual scores that we see comprise a "sample" of this (hypothetical) population. If we did the whole contest over again, with exactly the same performances, we might see a slight variation in scores from the time before, even though the judges and the performances haven't changed. This is the "random" factor.

RTureck --

Actually, it is pretty easy to spot both a single rogue judge who needs to have his hand slapped and a bloc of toe-tappers. In the latter case the suspected bloc will have less variation than expected within the bloc, compared to the total variation among all judges. Both the ISU and its critics are going to have a field day with all this!
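One simple way to picture the bloc check is to compare, skater by skater, how much the suspected judges' marks vary among themselves with how much the whole panel's marks vary. The sketch below is only an illustration of that idea with made-up marks and a made-up function name. It is not a significance test, and nothing here says this is how the ISU would do it.

```python
from statistics import variance


def bloc_spread_ratio(scores, bloc):
    """Ratio of within-bloc variance to whole-panel variance, summed over skaters.

    scores[i][j] = mark for skater i from judge j; bloc = judge column indices.
    Assumes the panel does not agree perfectly on every skater (else division by zero).
    """
    within = sum(variance([row[j] for j in bloc]) for row in scores)
    overall = sum(variance(row) for row in scores)
    return within / overall


# Made-up marks: 2 skaters, 5 judges; judges 0, 1 and 2 always give identical marks.
example_marks = [
    [5.0, 5.0, 5.0, 7.0, 3.0],
    [6.0, 6.0, 6.0, 4.0, 8.0],
]
ratio = bloc_spread_ratio(example_marks, bloc=[0, 1, 2])
print(ratio)
```

A ratio well below 1 means the suspected judges agree with each other more closely than the panel as a whole does. Whether it is lower than chance alone would produce is a separate question that needs a proper test.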

Mathman

Rgirl · Sep 10, 2003

Mathman,
FINALLY, without the alphabet blocks 6.0 system, WE AGREE on the statistics! Woo hoo! Woo hoo! Statisticians party in the Celebrations folder! Not really. Why? If you put all the statisticians around the world end to end, it would be a good idea. J/K. But I really do agree with Mathman, who finally got the idea of applied statistics.

Rgirl

Joesitz (RIP) · Sep 10, 2003

Mathman - I am overwhelmed with this statistical summary of the CoP - not that I understand it - but I just trust you to know what you are figuring out. Do you think it is too late for an old skater to go back and take statistics courses beyond 101?

"Actually, it is pretty easy to spot both a single rogue judge who needs to have his hand slapped and a bloc of toe-tappers. In the latter case the suspected bloc will have less variation than expected within the bloc, compared to the total variation among all judges. Both the ISU and its critics are going to have a field day with all this!"

The above looks good, MM. What about bloc judging? Say there are 3 or 4 judges with the same cultural background, and they all seem to be scoring the same way, and given the other judges are not scoring their way, will they get a hand slap?

In general, I'm feeling better about this COP.

Joe

mzheng · Sep 10, 2003

Wow, Mathman. Glad you proved it statistically. I'm too lazy to dig up my college book right now to fully comprehend it. But overall I agree with what you said.

Do you think you could use this method to analyze all GP events where CoP is used? Then would you be able to say which competition is judged fairly based on your statistical data?

Mathman · Sep 10, 2003

FINALLY, without the alphabet blocks 6.0 system, WE AGREE on the statistics...But I really do agree with Mathman, who finally got the idea of applied statistics. -- Rgirl

Just remember which one of us is MATHman, R., LOL. Just because a dance piece that you choreographed is playing to the crowned heads of India...(Did you guys know that? Y-a-ay for Rgirl!) ...

But seriously, our arguments last year weren't about the 6.0 system per se, they were about the interim system. That system gave us as little data as possible and led to questions about whether it was possible to make guesses about what the missing data might have been.

Now, in contrast, we have an embarrassment of riches. We have data coming out of our ears. I love it.

The above looks good, MM. What about bloc judging? Say there are 3 or 4 judges with the same cultural background, and they all seem to be scoring the same way, and given the other judges are not scoring their way, will they get a hand slap? -- Joe

That's the big question, Joe. The CoP makes it possible to make statements like: "I am 99% sure that judges 3, 8, and 11 voted together significantly more often than you would expect, based on random chance." What the ISU decides to do with these conclusions is anybody's guess at this point.

Since outright and confessed cheaters get only a couple of years suspension, I doubt that anything much will come of it. That's the missing piece of the puzzle -- resolution on the part of the ISU to punish dishonesty when they find it. It is hard to give out harsh punishments because in the final analysis, everything is in the hands of the Federations -- who appoint and give oversight to the judges in the first place.

Do you think you could use this method to analyze all GP events where CoP is used? Then would you be able to say which competition is judged fairly based on your statistical data? -- MZheng

The test that I used can only say, yes or no, were the judges in substantial agreement, or were there significant differences in the scoring. Even if there are significant differences, it doesn't necessarily mean that anyone is cheating.

But there are lots of other tests that can be done. The important thing is that all the data is before us, albeit anonymously.

Mathman

Norlite · Sep 10, 2003

So, Mathman,

What about only 5 out of 14 judges' scores being used, as mentioned in the other thread?

Is that a problem mathematically with arriving at a representative outcome?

Mathman · Sep 10, 2003

Norlite, that was the very question that Rgirl and I argued so much about last year. I presented what I thought was the World's All-Time Most Convincing Argument that this was fine and that mathematically it didn't matter if those other five (which might, after all, make the contest go the other way if a different five were chosen to be eliminated) were discarded.

I convinced myself of this, but I didn't convince Rgirl, and towards the end, I was weakening. (Don't tell Rgirl this. :laugh:)

I will rethink the argument as it applies specifically to the CoP. Certainly the effect is mitigated by the cropped averaging, versus the ordinal placements of the old system.

Mathman

Norlite · Sep 10, 2003

Thanks,

This is slightly different than last year, being that if the panel is 14, 5 are thrown out off the top (= 9, I think :laugh:), and then out of that 9, the two highest and two lowest are discarded, ending up with 5 out of 14. ????

Don't know how much of a difference that makes.
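To make the arithmetic concrete, here is a small Python sketch of the procedure as described in this post: start with 14 marks, drop 5 at random, then drop the two highest and two lowest of the remaining 9 and average what is left. The function name, its parameters and the marks are invented for illustration, and the procedure follows the description above rather than any official ISU rule text.

```python
import random


def counted_average(marks, n_random_drop=5, n_trim=2, rng=None):
    """Randomly drop some marks, trim the extremes, and average the rest."""
    if rng is None:
        rng = random.Random()
    # Step 1: keep a random subset of the panel (e.g. 9 of 14).
    kept = sorted(rng.sample(marks, len(marks) - n_random_drop))
    # Step 2: discard the n_trim lowest and n_trim highest of those.
    counted = kept[n_trim:len(kept) - n_trim]
    return sum(counted) / len(counted), counted


# Made-up marks from a 14-judge panel for one skater.
panel_marks = [5.25, 5.5, 5.0, 6.0, 5.75, 5.5, 4.75, 5.25, 6.25, 5.5, 5.0, 5.75, 5.5, 5.25]
average, counted = counted_average(panel_marks, rng=random.Random(0))
print(len(counted), "marks counted, average", average)
```

Running it with different seeds shows how much the final average moves depending on which 5 judges happen to be dropped, which is the question being asked here.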

rtureck · Sep 11, 2003

It won't be too difficult to get 5-7 toe tappers, then the 2-4 non tappers will look like they are rogue :laugh:

BTW, I have no problem with the term sampling error.

Joesitz (RIP) · Sep 11, 2003

Dropping 5 out of 14 for Olys and Worlds or dropping 3 out of 10 for GPs helps to get rid of bloc judging, imo.

Mathman, does that figure statistically?

Joe
