Statistics tutorial (2)

Joined
Jun 21, 2003
This thread is for anyone who wants to understand some of the measures that statisticians use to draw conclusions about the merits and demerits of different judging systems. This is easy. Don't be intimidated if math isn't your favorite subject. It is my job to make it your favorite subject, LOL.

I will start with this question. Under the ordinal system, how can we decide whether the judges are in substantial agreement in their ordinal placements, or whether the agreement is so bad that we begin to suspect that something is fishy?

I will use the data from the Ladies skate at the recent International Skating Challenge to show how to do this. Again, this is easy!

Skater   RUS   GER   CAN   USA   JPN
MK        1     1     2     1     2
SA        4     2     1     2     1
SC        2     3     4     5     3
JR        5     4     3     4     4
JK        3     5     5     3     5
AP        7     6     6     6     6
ES        6     7     7     7     7

If this table looks all bunched together, here is a prettier one:

http://www.usfsa.org/events_results/results/200304/intl-challenge/ladies.htm

As we see by reading down the columns, the German judge was the only one who "got it right," in the sense of matching the majority in every placement. Let's see how far off the Russian judge is from the German judge.

For each skater, take the difference (d) between the placements given by the two judges. Square each difference, then add the squares up.

RUS   GER   d   d-squared
 1     1    0       0
 4     2    2       4
 2     3    1       1
 5     4    1       1
 3     5    2       4
 7     6    1       1
 6     7    1       1
              Total: 12

Now run this 12 through the following formula, where n is the number of skaters:

r' = 1 - [6(sum of the d-squareds)]/[(n-1)n(n+1)]

r' = 1 - (6*12)/(6*7*8) = 1 - 72/336

r' = .79

We say informally that there is a 79% correlation between the two rankings.
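If you would rather let a computer do the arithmetic, here is how the same calculation might look in Python. This is just a sketch of the formula above; the rank lists are the RUS and GER columns of the table.

def rank_correlation(ranks_a, ranks_b):
    # Spearman's r' = 1 - 6*(sum of the d-squareds) / ((n-1)n(n+1))
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / ((n - 1) * n * (n + 1))

rus = [1, 4, 2, 5, 3, 7, 6]  # Russian judge, skaters in table order
ger = [1, 2, 3, 4, 5, 6, 7]  # German judge

print(round(rank_correlation(rus, ger), 2))  # prints 0.79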

Note: This statistic r' is called the "rank correlation coefficient." It was invented by Charles Spearman (1863-1945), and its distribution was worked out by "Student" (the nom de plume of W. S. Gosset, 1876-1937). It is a surrogate for the more common correlation coefficient r, used for continuous data. Under the null hypothesis of no association, the transformed statistic t = r'*squareroot((n-2)/(1-r'^2)) approximately follows Student's t distribution with n-2 degrees of freedom.

So what?

OK, forget that. Here is the main point. So there is a 79% correlation between the rankings of the Russian judge and the German judge -- so what? Well, the closer this statistic is to 100%, the more the two judges agree. A correlation of 0 means they weren't even watching the same event. So 79% is "not too good, not too bad."

Now we must quantify what "not too good or bad" means. Let's suppose that whatever we want to say about these judges, we agree to hold our tongue and not say anything at all unless we can be 95% certain that we are right. In that case, there is a "critical value" that we must beat before we can say that there is a "statistically significant correlation" between the two judges. In our case, the critical value turns out to be C = .67.

So, bottom line, if we get a correlation bigger than .67, that means that we can be 95% sure that there is at least some agreement between the judges, but a correlation of less than .67 means that we cannot be sure of anything, really.
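Where does that .67 come from? Here is one way to reproduce it, assuming (as in the note above) that the critical value comes from Student's t distribution with n-2 degrees of freedom: solving t = r'*squareroot((n-2)/(1-r'^2)) for r' gives r' = t/squareroot(t^2 + n - 2). A sketch, assuming SciPy is installed:

from math import sqrt
from scipy.stats import t

n = 7                           # number of skaters
t_crit = t.ppf(0.95, df=n - 2)  # one-sided 95% point of t with 5 df, about 2.015
r_crit = t_crit / sqrt(t_crit**2 + n - 2)
print(round(r_crit, 2))         # prints 0.67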

Here are the correlations of all the judges, paired two by two. Remember, if we beat .67 correlation, that's good.

RUS vs. CAN: r' = .57 (Oops. We cannot be 95% sure that these two judges are even watching the same competition.)

RUS vs. USA: r' = .71

RUS vs. JPN: r' = .68

GER vs. CAN: r' = .93 (That's good.)

GER vs. USA: r' = .86

GER vs. JPN: r' = .96 (Very good match.)

CAN vs. USA: r' = .86

CAN vs. JPN: r' = .96 (It looks like Germany, Canada and Japan are pretty much on the same page.)

USA vs. JPN: r' = .82

So there you are.:)

Home cooking?

Wait, one more thing. In the cases where judges appear to disagree, where do these differences come from? Well...

CAN matched the majority rankings for each skater, except that she (Casey Kelly) put Arakawa ahead of Kwan and Jennifer Robinson ahead of Cohen.

JPN matched the majority in every ranking, except that he (Tomiko Yamada) put Shizuka Arakawa ahead of Kwan.

GER had no horse in the race, and she (Sissy Krick) matched the majority without exception.

RUS (Tatiana Danilenko) was the only judge to put Sokolova ahead of McDonough, and was also the only judge to put Cohen ahead of Arakawa.

Hmm.

Mathman
 

rtureck

Final Flight
Joined
Jul 26, 2003
Mathman said:

Now run this 12 through the following formula, where n is the number of skaters:

r' = 1- 6*(sum of the d-squareds)/(n-1)(n)(n+1)

r' = 1 - 6*12/6*7*8

r' = .79



Tutorial 2, was there a tutorial 1?

I am slow, but I try to muddle through. It helps me at least if you state the equation as

r' = 1 - [6(sum of the d-squareds)]/(n-1)(n)(n+1)

that is, add an extra bracket, and leave out the *, because the * may be confused with "to the power of" instead of "times." I know you are using proper math symbols, but I am slow. I know technically you can state it as

r' = 1 - 6(sum of the d-squareds)/(n-1)(n)(n+1)

and that is still correct.

BTW, should it always be 1-6 or 1-n, and in this case n = 6?

Don't be intimidated if math isn't your favorite subject. It is my job to make it your favorite subject, LOL.

I am exceedingly intimidated, but now that I have muddled through it, what is the reward? A box of popcorn? Candies? Don't tell me self-satisfaction, that does not work for me.
 
Joined
Jun 21, 2003
Hi, RTureck. I edited my equation as you suggested. Thanks for the feedback.

The 6 is always 6. It is the 6 in the formula for the sum of the squares of the first n integers

1^2 + 2^2 + 3^2 + ... + n^2 = n(n+1)(2n+1)/6.
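Here is a quick numerical check of that identity in Python, for any n you care to try:

n = 7
lhs = sum(k**2 for k in range(1, n + 1))  # 1^2 + 2^2 + ... + n^2
rhs = n * (n + 1) * (2 * n + 1) // 6
print(lhs, rhs)  # prints 140 140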

MM
 
Joined
Aug 3, 2003
Ah, Mathman, this brings back such fond memories. Seriously, you know I love this stuff, and you're right, it IS easy. Besides, in the end, even if the only thing one gets is the "Home Cookin'" section (great title), it makes me miss the COP already. I can understand Dick Button preferring the 6.0 system--it's the system he learned, competed under, and has commented on for going on 50 years--but when I see the stats for a competition like this using the 6.0 system, I find the COP more statistically accurate right now, with the prospect of even greater statistical accuracy once they fix a few things, than the 6.0 system could ever be. I know I'm slightly off topic, but then looking at and understanding statistics is, IMO, the most important way of evaluating which judging method most accurately rewards the best skating performances with the top placements. Thanks again, Master Mathman.
Rgirl
 
Joined
Jun 21, 2003
Rgirl said:
"[W]hen I see the stats for a competition like this using the 6.0 system, I find the COP is more statistically accurate right now and has the probability of gaining even greater statistical accuracy once they fix a few things than the 6.0 system ever could." -- Rgirl
We'll see, R. IMO the CoP has its statistical peccadilloes, too. If you want the CoP to look good, analyze the ordinal system. If you want the ordinal system to look good, analyze the CoP.

My only beef is, if anyone wants to use statistics to criticize one system or the other, he or she has to do it by the numbers. Of all mathematical disciplines, statistics is the most driven by actual empirical data. For over a year now, long before we had a single datum to judge by, experts in statistics have been telling us what was going to happen under CoP judging.

OK.

We’ll see.

Mathman
 
Joined
Jul 11, 2003
Mathman -

You certainly showed the national bias where applicable among the judges for the 6.0 system.

I'd like to see you run through another competition using the comparisons of the judges in the CoP system. But then in the CoP, will we know who the judges are?

Joe
 
Joined
Aug 3, 2003
Mathman said:
We'll see, R. IMO the CoP has its statistical peccadilloes, too. If you want the CoP to look good, analyze the ordinal system. If you want the ordinal system to look good, analyze the CoP.

My only beef is, if anyone wants to use statistics to criticize one system or the other, he or she has to do it by the numbers. Of all mathematical disciplines, statistics is the most driven by actual empirical data. For over a year now, long before we had a single datum to judge by, experts in statistics have been telling us what was going to happen under CoP judging.

OK.

We’ll see.

Mathman
Very true, Mathman, and a point I do try to keep in mind--wait till we've had at least two years of actual data. The ordinal system just always seemed to make it so easy for judges to justify almost any placement of a given skater, especially if the skaters near the top skated about the same in terms of jumps and falls. I think of the events where the ordinals ranged from 8 to 1 for a skater. It too often seemed as if the judges didn't even need to cheat, that they could give placements completely out of line with what would have been an accurate reflection of what the skater did on the ice and could justify it, if they were ever asked to justify it, by subjective generalities, i.e., the choreography was poor, there was no variation in speed, the skater wasn't musical, the jumps didn't seem secure (even if the skater landed everything clean), or the opposite if they gave a skater a high placement for a poor skate. At least with the COP, most of the error is on the page. With the ordinal system, aside from the statistical error, which was not acknowledged much less addressed, the infinite opportunities for error were all inside each judge's head. Not that there still isn't error inside the judges' heads with the COP; it's just that a whole lot more of the error is there in the Detailed Results to be analyzed by skaters, fans, judges, federations, the ISU, anybody.

Just to pick one example, after the '94 Olympics, several of the judges were interviewed and said they did not see Oksana two-foot her landings on her 3L/2t combo and on another jump, I think it was either her 3L alone or her 3flip. Anyway, the point is they said they didn't see the two-foots, and if they had, they would not have given her the first-place ordinal. We never would have known about this particular error, at least as it was reported by these judges, had someone not shown them the video and pointed it out. It didn't change anything in terms of the results, but we can't learn from mistakes (which are not the same as error, which perhaps is something you might discuss) unless we know about them, and we can't adjust the system to try to minimize error unless we know about it. I know that's just one example, but I wonder how often things like this happened, as well as how often judges saw or did not see according to their national, cultural, personal, or skating style biases.

To me, at least with the COP there is an assigned value for each element in the Technical score and required deductions if an element is missed or performed incorrectly. Certainly I've seen problems with various aspects of the current COP: with implementation, such as the caller being incorrect; with inconsistency in assigning levels for spins, spiral sequences, and footwork; and with various other things that have been discussed in detail elsewhere. But in the COP, the fact that there are numbers for each element, as well as the component scores, that can be analyzed is exactly what I see as one of the strongest arguments for it. When the ice dance team of Delobel and Schoenfelder fell on their lift near the end of their free dance at Trophee Lalique and not only did not receive any deduction, but received additions to the base mark of +2 from seven of the 11 judges, at least you could see what seemed to me to be a clear glitch in either the judging or the system in actual numbers. I don't want to get too detailed since this is not the thread in which to do so, but given that error is inherent in any statistical process, I like being able to see it in numbers rather than having it locked inside a judge's head, where even the judge may not be aware of it. I'm not saying there aren't benefits to the ordinal system, and I would never say the COP as it stands now does not have serious flaws. What I have had with the COP in the events I've seen using it thus far is a comfort level most of the time with the placements; that is, the order of placements, both in each phase of the competition and in the final, seems to accurately reflect the way the skaters performed. Also, in the times when the placements have seemed wrong, I've been able to see in the detailed results where the skaters seemed to either get additional credit when they should not have; should have received deductions and didn't; received what I thought were unfairly low or high Component scores in one or more components; or went wrong in any one of various other ways. The numbers were there to be seen and compared to what I saw the skaters do, and that's what I find to be the COP's biggest advantage over the ordinal system.

ITA that if you want the COP to look good, analyze the ordinal system, and if you want the ordinal system to look good, analyze the COP. After all, what student of statistics doesn't have a copy of the never-out-of-print "How to Lie with Statistics" (and they don't mean in bed, although I'm sure some judges have done that too:p). But watching the Campbell's and the IFSC use the ordinal system, especially the former live, it struck me that each score has to encompass so many variables, not only in what the skater is doing on the ice but also in terms of skate order, and all those variables are only ever in the judges' minds. For all we know, under the ordinal system some judges may mark a skater based only on jumps, or because they hate the music of a certain composer. Those are extreme examples, but the point is, we don't know and can never know. Even if the judges say what variables they used to score a skater, we don't know if they did. Then those scores are really just used as a way for judges to keep track of how they would place the skaters after they've seen them all skate, so the 5.8s and 5.9s for one judge might be 5.5s and 5.6s for another. I know they have guidelines, but again, we just don't know and can never know.

Okay, I've already said more than I meant to say and said I would say and I don't want to take this thread in the wrong direction. So please, hit us wit more 'o dem numbers, baby! I got a good feelin' about 216 and 12,960,000--and I bet you know why;)
Rgirl
P.S. Re "For over a year now, long before we had a single datum to judge by, experts in statistics have been telling us what was going to happen under CoP judging." What are some of the things experts have been telling us was going to happen under COP judging? If this isn't the right thread, you can answer me under the "More Questions About the COP" thread. Just curious.
 

berthes ghost

Final Flight
Joined
Jul 30, 2003
I can understand Dick Button preferring the 6.0 system--it's the system he learned, competed under, and has commented on for going on 50 years--

Gosh, you talk as if the system used at this year's worlds is the exact same system used back in 1948. It isn't. The scoring system has changed many times over the years, and people like Dick have always kept abreast; it's their job.
 
Joined
Aug 3, 2003
berthes ghost said:
Gosh, you talk as if the system used at this year's worlds is the exact same system used back in 1948. It isn't. The scoring system has changed many times over the years, and people like Dick have always kept abreast; it's their job.
Very true, Berthes Ghost, that the scoring system has been tweaked many times along the way. I should have been clearer with my language, but it is still the 6.0 system and that's what Dick said he preferred. However, it was a quick, general remark so I myself wouldn't read too much into it. He may have meant a specific aspect of the ordinal system, like when they used to know who the judges were:)
Rgirl
 
Joined
Jun 21, 2003
I don't think that Dick Button is against the CoP particularly. He is against secret judging.

Mathman
 
Joined
Jun 21, 2003
(What I wrote while waiting for my car to get through its 120,000 mile checkup this afternoon. -- If you're so smart, Mathman, why can't you afford a better car?)

Statistics 101, part 3. Testing the CoP.

The only thing we really ask of a judging system is that "the right person wins." If that is too vague a goal, then we would settle for some assurance at least of consistency in judging: if lots of judges scored the performances over and over, the results would be more or less the same most of the time. In using language like this we are tacitly assuming that the judges' scores somehow represent a sample drawn from the population of all the marks that might have been given by all possible well-qualified, impartial and honest judges, conscientiously following the guidelines of the CoP.

Suppose the total scores look like this (for simplicity I will assume that there are only five judges, and I will set aside for the moment the effects of the random draw and the trimmed mean, as well as the cumulative effects of adding up many component scores, considerations of national chauvinism, etc.):

Buttle: 180, 190, 185, 185, 185; average: 185
Goebel: 185, 190, 180, 180, 185; average: 184

Buttle wins, 185 to 184.

But since this was a close contest, can we be confident that Buttle's skate really was better, and would have been certified so by the majority of all judging panels that might have scored this event?

First, to investigate this question as serious scientists rather than as fans of one skater or another, we must agree not to go popping off before we know what we are talking about. To ensure this, let us stipulate the following:

IF WE CANNOT BE 95% SURE, THEN WE WILL REMAIN SILENT

Next, we pretend for the sake of the argument that we want the CoP to work. That is, we will test the hypothesis that Buttle's performance really was better. To do this, we set up a straw man, called the Null Hypothesis: The two performances were exactly the same. We don't believe this, however, and we want to show that the null hypothesis is wrong.

Null Hypothesis: The two performances were exactly the same.
Our Hypothesis: No, Buttle's was better and so the CoP worked.

To test our hypothesis, we compute a statistic called the variance. This is a measure of how far each individual score is from the average. For Buttle the average was 185 and the individual scores are

180 (off by 5), 190 (off by 5), 185 (off by 0), 185 (off by 0), 185 (off by 0).

Take the sum of the squares of these differences and divide by n-1, where n is the sample size. That's the variance. For Buttle,

v = (5x5 + 5x5 + 0x0 + 0x0 + 0x0)/4 = 50/4 = 12.5

For Goebel,

v = (1x1 + 6x6 + 4x4 + 4x4 + 1x1)/4 = 70/4 = 17.5

(Remark: The standard deviation is the square root of the variance.)
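Here is the same variance calculation as a minimal Python sketch, using the made-up score lists above:

def sample_variance(scores):
    # Sum of squared deviations from the average, divided by n-1.
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / (n - 1)

buttle = [180, 190, 185, 185, 185]
goebel = [185, 190, 180, 180, 185]
print(sample_variance(buttle))  # prints 12.5
print(sample_variance(goebel))  # prints 17.5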

Now we need a standard unit-free measure of the difference between Buttle's and Goebel's average scores. The formula is

T = (x1-x2)squareroot(n)/squareroot(v1 + v2)

T = (185-184)squareroot(5)/squareroot(12.5 + 17.5)

T = 0.4

(In techspeak, Buttle's average score is 0.4 standard errors bigger than Goebel's.)

But in order to be 95% sure of anything we must beat a certain "critical value." The exact number that we have to beat varies slightly with the sample size and other variables, but as a rule of thumb most of the critical values are around 2. So if this T score is bigger than 2, then we can be 95% confident that Buttle's performance really was better. If T is not bigger than 2, then we cannot be 95% sure, so we must remain silent.

Conclusion: Our T score of 0.4 is not bigger than 2.0. Therefore we are not 95% sure (of anything).
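For what it is worth, SciPy's two-sample t test reaches the same verdict. This is a sketch, assuming SciPy is installed; with equal sample sizes, ttest_ind reproduces the T formula above:

from scipy.stats import t, ttest_ind

buttle = [180, 190, 185, 185, 185]
goebel = [185, 190, 180, 180, 185]

T, p = ttest_ind(buttle, goebel)
print(round(T, 2))                   # prints 0.41, nowhere near the critical value
print(round(t.ppf(0.975, df=8), 2))  # prints 2.31, the "around 2" critical value
print(round(p, 2))                   # prints 0.69, so we are far from 95% sure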

So, bottom line. We tried our hardest to say something good about the CoP, but we could not be sure.

Mathman

PS. Some pet peeves.

(a) If you have taken Stat 101 in the last 20 years, you have probably used a textbook that made use of the "p-value" in hypothesis testing. The authors of these texts, IMO, are often somewhat lazy in explaining why it is cheating to use the sample data to determine the p-value, rather than (correctly) to set the level of significance first, then take the sample. I will explain more about this if there is any interest.

(b) Most statistics texts say that we "accept" or "reject" the null hypothesis. This language is quite misleading. We never accept the null hypothesis -- it's just that we cannot be 95% sure that it's wrong.

(c) In a one-tailed hypothesis test, some texts say "x1 is less than or equal to x2" instead of "x1 = x2" for the Null Hypothesis. While satisfactory English, IMO this creates confusion by hiding the mathematical assumptions on which the calculations actually rest.

AND ONE MORE THING, LOL.

(d) Here's a good way to tell if your statistics text is any good or not. If it refers to the "Chi square" distribution, it's crap. It should be "Chi squared." (Want some barbeque ribs?)
 

rtureck

Final Flight
Joined
Jul 26, 2003
Mathman said:
T = (x1-x2)SQRT(n)/SQRT(v1 + v2)

T = (185-184)SQRT(5)/SQRT(12.5 + 17.5)

T = 0.4

I am slow and this is so intimidating, so what is SQRT? Can you use

√5

instead? Or spell the whole thing out, thanks. It took me a long time to figure that out. I had problems with 4th grade math, and you are talking about null hypotheses? But thanks, I am learning :)


So, bottom line. We tried our hardest to say something good about the CoP, but we could not be sure.

Can you figure out Shizuka's CoP scores in Skate Canada?
 
Joined
Jun 21, 2003
OK on "squareroot."

Yes, these calculations are based on the assumption of the Null Hypothesis. The logic is, we are against the Null Hypothesis. So we are giving the Null Hypothesis enough rope to hang itself. That is, we assume the Null Hypothesis to be true and then show that this assumption leads to a conclusion that is probably (with 95% probability) wrong. It is the probabilistic version of reductio ad absurdum.

I'll look at the Skate Canada scores. But I won't say anything unless I am 95% sure.

MM
 
Joined
Jul 11, 2003
Mathman said:
"Next, we pretend for the sake of the argument that we want the CoP to work."

That is like the Warren Commission, which set forth that the single-bullet theory was correct and collected evidence to support that theory. Any opposing evidence was discounted.

Am I correct?

Joe
 
Joined
Jun 21, 2003
Well, it's kind of the backward version of that. We assume the opposite of what we want. Then we try to collect as much evidence as we can to show that our assumption is wrong.

In this case (Buttle versus Goebel), we were not able to collect enough evidence one way or the other to come to a conclusion.

Mathman
 

Doggygirl

Record Breaker
Joined
Dec 18, 2003
Hi Mathman...

LOL, just wanted to thank you for the grim reminder that I nearly flunked Statistics which was a required course for me back in the college days. :eek: So just like back then, I think I'll have a beer and let you do the math.

Seriously, this is very interesting and to the degree you're willing to provide us with analysis on various events, I'm all "eyes."

DG
 
Joined
Aug 3, 2003
Re your most recent stat tutorial--APPLAUSE, APPLAUSE! BRAVO! Author! Author!

Your car should get a 120,000 mile check-up more often, lol. Seriously, your stat tutorials are all good.

Thanks for bringing up how statistics have changed over the last 20 years. I had my undergrad and grad courses in stats in '84-'85. After (not during:)) our "discussions" lol on stats last year, I emailed my graduate stat professor. Although you and I didn't get into p values and some of the things you mentioned here, we did get into some of them. Indeed, Professor Statman told me that things had changed in statistics and detailed many of the things you brought up here. However, Prof. Statman is still a True Score believer (sing that to the music of "Daydream Believer";)), though at the time I emailed him the COP had not been used, only set out in the first Communications. Obviously he said he could not comment on it since he had not looked into it nor did he have the data to make an assessment.

As for Pet Peeve (b) "Most statistics texts say that we "accept" or "reject" the null hypothesis. This language is quite misleading. We never accept the null hypothesis -- it's just that we cannot be 95% sure that it's wrong." I learned "support" or "reject" the null hypothesis. Do you feel that the only proper language is "we can or cannot be 95% sure that the null is wrong" or would "support" or "reject" be generally satisfactory to you? However, I do feel that the "95% sure" way of stating it is the most accurate.

Also, although I think it was great to use an individual example for explaining whether we can be 95% sure that the null is wrong, do you think that in order to evaluate the COP system overall, and the ordinal system as well, results for a large sample in which both the COP and ordinal systems were used would give a more accurate indication of the statistical accuracy of each system? In other words, whether COP or ordinal, an individual case of two skaters who competed against each other can give information about those skaters in those events, but can it give information about the accuracy of the system as a whole?

Great stuff, Mathman. BTW, I think your way of giving definitions in context is far better than my idea of just doing definitions separately. But then that's why you're Professor Mathman and Author Rgirl, lol.
Rgirl
 
Joined
Jun 21, 2003
Hi, Doggygirl (no, not you, Rgirl, there really is a "Doggygirl" on the board now!), I see you're up to 21 posts already! Just wait till I finish working through the CoP, then you can go back to your statistics class and get an A. LOL.

(Aside -- Here's how you can tell whether your statistics teacher regards him/herself as a mathematician as opposed to a teacher or a person interested in applications. Mathematicians never say "stat" when they mean statistics, or "math" when they mean mathematics, LOL.)

RGirl, no, I don't like the language "these data 'support' the null hypothesis," and this for two reasons. First, it is not true. Even in a close contest, even in a contest that is "too close to call," like Buttle 185, Goebel 184, the data still "support," however tenuously, the belief that Buttle's performance was a tiny bit better than Goebel's.

Consider, for example, a presidential election. You have a pre-election poll that says 51% for G.W. Bush and 49% for H. Clinton, so you say that we are not sure that Bush is really ahead of Clinton, so it's "too close to call." But the null hypothesis says, e.g., that if 30,958,324 people vote for Bush, then exactly 30,958,324 people will vote for Clinton. If only 30,958,323 people voted for Clinton, then the null hypothesis is wrong and the alternative hypothesis (more people support Bush) is right. It is virtually impossible for the null hypothesis to be true no matter how close the sample proportions are. So I think it is not good language at all to say that the results of the poll, or of any poll, no matter how close, "support" the null hypothesis.

But there is a worse problem with this language. What question are we asking? When we give the answer, what is the subject of the sentence and what is the predicate?

The question is, Are we 95% sure that Buttle's performance really was better? The subject is "we," the predicate is "are sure" and the answer is either yes or no.

To me, this is so clean. If in fact our test shows that we can be only 94.99% sure, the answer is still no.

BTW, you can modify the null hypothesis to something like: The amount by which Buttle's "true score" exceeds Goebel's "true score" is 0.5 points. Then you can investigate whether the sample data allows you to be 95% confident that the difference is actually bigger than that.
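Here is a sketch of that modified test in Python, reusing the T formula from the tutorial above (the 0.5-point gap in the null hypothesis is just an example value):

from math import sqrt

n = 5                     # judges per skater
observed_gap = 185 - 184  # difference of the average scores
hypothesized_gap = 0.5    # the gap asserted by the modified null hypothesis
v1, v2 = 12.5, 17.5       # the sample variances computed earlier

T = (observed_gap - hypothesized_gap) * sqrt(n) / sqrt(v1 + v2)
print(round(T, 2))        # prints 0.2 -- still far below the critical value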

Prof. Statman might also ask why I did an "independent sample" test rather than a "paired sample" test for these data, since it is the same judges marking both skaters. I went back and forth on that point. Maybe I was wrong about that. (This is a question of choosing the best mathematical model in the context of the real world problem.)

About the True Score. This is what is "true" about the true score: If your purpose in taking a sample in the first place is to estimate the mean of a population, then there is nothing wrong with calling the mean of the population the "true mean." I guess ... Oh, hell, yes there is. What's "true" about it? Why not call it what it is, the "population mean?" Why invoke TRUTH, JUSTICE AND THE AMERICAN WAY?
Rgirl said:
"Do you think that in order to evaluate the COP system overall, and the ordinal system as well, results for a large sample in which both the COP and ordinal systems were used would give a more accurate indication of the statistical accuracy of each system? In other words, whether COP or ordinal, an individual case of two skaters who competed against each other can give information about those skaters in those events, but can it give information about the accuracy of the system as a whole?"
What I really think is that all these mathematical questions are overwhelmed by the real problems in figure skating judging: politics, deal-making and national federations pushing their own agendas. I am doing this just for fun, but I don't think that it amounts to a hill of beans.

Not that this will prevent the publication of tons of statistical analyses of the CoP in the coming year. Unfortunately, for most of the scholarly analysis I have seen, all you have to do is read the name of the author and you know what the conclusion will be.

Mathman

PS. RGirl, you asked earlier about what I meant when I talked about articles published last year which made "predictions" about the CoP. Predictions isn't really the right word. What I meant was, people published papers which gave theoretical mathematical reasons why ordinal-based systems or point-total systems, the median or the mean, etc., would turn out to be the more "robust" and reliable in the context of figure skating judging.

These articles, in my opinion, were quite delightfully data-free.
 
Joined
Aug 3, 2003
Thanks, Mathman. Not only do I agree with you on the language "Are we 95% certain? Yes or no?" but so does Prof. Statman (I gave him that name because I'm too lazy to write Prof. Statisticsman, MATHman, lol). That was one of the things he said had changed over the last 20 years.

His belief in the "true score" has more to do with timed tests such as the 100-meter dash and as I said, when I emailed him the COP Communique hadn't been released yet.

As for me, although I agree 100% (okay, 99%) that the biggest problems facing figure skating judging are the judges themselves, I think that is true in any judged sport, and that going too far in the other direction (as in, just make sure the judges aren't cheating and know what they're doing, and pretty much any judging system will work about as well as another -- not that you said that, but I'm 95% certain you would;)) is just as bad as publishing articles with nonexistent data. I still think statistics are important in minimizing the effects of the inevitable cheating, bias, poor judging, mistakes, and everything else that goes into having humans as judges. I agree that the ISU is using hocus-pocus statistics to try to make it seem like the COP is "cheat proof," but even with the best crackdown on cheating judges, no system is "human proof." That's why I still would like to see a statistical comparison between the ordinal system and the COP, although you're right that the anti-COP group will manipulate the statistics to make the ordinal system look best and vice versa. However, since the politics seem to dictate that the COP is going to be ratified, I'd like it to be as statistically robust as possible. Also, at least we can try to make the COP as accurate as possible in rewarding the skaters who skate the best according to the standards established for figure skating, since getting anywhere on cheating judges, federation influence, etc. is going to take many, many years even if every single member of the ISU were full-bore behind it.

I know, it's our same old argument. I say push for designing the best statistical system for judging we can while we wait for a change in the ISU's attitude towards cheating, etc.--which might never come--and you say, well, you can speak for yourself of course. Anyway, the issue for me is finding the right balance between designing a judging system that statistically reduces the influence of cheating or bias and designing a system that finds and gets rid of the cheating and biased judges. They have to deal with it in diving, gymnastics, synchronized swimming, and other sports, but IMO the challenge with figure skating is its complexity relative to those other sports; that is, there are a lot of things to evaluate at once and many different ways to be "excellent," especially when it comes to the component elements and presentation.

BTW, if by some magic means, we woke up tomorrow and the ISU had a very tight, very harsh system of dealing with cheating and biased judges, do you have a preference for any particular kind of judging system (it doesn't have to be COP or ordinal, it could be something else too)? Or would you lose all interest in judging if there were no cheating or bias;)?
Rgirl
 