1.

## Statistics tutorial (2)

This thread is for anyone who wants to understand some of the measures that statisticians use to draw conclusions about the merits and demerits of different judging systems. This is easy. Don't be intimidated if math isn't your favorite subject. It is my job to make it your favorite subject, LOL.

I will start with this question. Under the ordinal system, how can we decide whether the judges are in substantial agreement in their ordinal placements, or whether the agreement is so bad that we begin to suspect that something is fishy?

I will use the data from the Ladies skate at the recent International Skating Challenge to show how to do this. Again, this is easy!

     RUS  GER  CAN  USA  JPN
MK    1    1    2    1    2
SA    4    2    1    2    1
SC    2    3    4    5    3
JR    5    4    3    4    4
JK    3    5    5    3    5
AP    7    6    6    6    6
ES    6    7    7    7    7


As we see by reading down the columns, the German judge was the only one that "got it right," in the sense of matching the majority in every placement. Let's see how far off the Russian judge is from the German judge.

For each skater, take the difference (d) between the placements given by the two judges. Square this difference, then add them all up.

RUS  GER   d   d-squared
 1    1    0       0
 4    2    2       4
 2    3    1       1
 5    4    1       1
 3    5    2       4
 7    6    1       1
 6    7    1       1
              Total: 12

Now run this 12 through the following formula, where n is the number of skaters:

r' = 1- [6(sum of the d-squareds)]/[(n-1)n(n+1)]

r' = 1 - (6*12)/(6*7*8) = 1 - 72/336

r' = .79

We say informally that there is a 79% correlation between the two rankings.

Note: This statistic r' is called the "rank correlation coefficient." It was invented by Charles Spearman (1863-1945), and its distribution was worked out by "Student" (the nom de plume of W. S. Gosset, 1876-1937). It is a surrogate for the more common correlation coefficient r, which is used for continuous data. The significance of Spearman's r' can be tested with Student's t distribution with n-2 degrees of freedom.

So what?

OK, forget that. Here is the main point. There is a 79% correlation between the rankings of the Russian judge and the German judge, so what? Well, the closer this statistic is to 100%, the more the two judges agree. A correlation of 0 means they weren't even watching the same event. So 79% is "not too good, not too bad."
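That calculation is easy to check by machine. Here is a minimal Python sketch of the same computation, with the RUS and GER ordinals copied from the table above:

```python
# Spearman's rank correlation for the RUS and GER ordinals (no ties).
rus = [1, 4, 2, 5, 3, 7, 6]
ger = [1, 2, 3, 4, 5, 6, 7]

def spearman(a, b):
    """r' = 1 - 6*sum(d^2) / [(n-1)*n*(n+1)], for rankings with no ties."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / ((n - 1) * n * (n + 1))

print(spearman(rus, ger))  # 0.7857... (about .79)
```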

Now we must quantify what "not too good or bad" means. Let's suppose that whatever we want to say about these judges, we agree to hold our tongue and not say anything at all unless we can be 95% certain that we are right. In that case, there is a "critical value" that we must beat before we can say that there is a "statistically significant correlation" between the two judges. In our case, the critical value turns out to be C = .67.

So, bottom line, if we get a correlation bigger than .67, that means that we can be 95% sure that there is at least some agreement between the judges, but a correlation of less than .67 means that we cannot be sure of anything, really.
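Where does the .67 come from? Under the t approximation mentioned in the note above, the critical correlation can be recovered by inverting the t statistic. A sketch, assuming the standard one-tailed 5% critical value of 2.015 for Student's t with 5 degrees of freedom (a t-table value, not something in the original post):

```python
import math

# r' is tested via t = r' * sqrt((n-2) / (1 - r'^2)) with n-2 degrees of
# freedom; solving for r' at t = t_c gives r_c = t_c / sqrt(t_c^2 + n - 2).
n = 7
t_c = 2.015  # one-tailed 5% critical t with 5 df (from a t-table)
r_c = t_c / math.sqrt(t_c ** 2 + n - 2)
print(round(r_c, 2))  # 0.67
```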

Here are the correlations of all the judges, paired two by two. Remember, if we beat .67 correlation, that's good.

RUS vs. CAN: r' = .57 (Oops. We cannot be 95% sure that these two judges are even watching the same competition.)

RUS vs. USA: r' = .71

RUS vs. JPN: r' = .68

GER vs. CAN: r' = .93 (That's good.)

GER vs. USA: r' = .86

GER vs. JPN: r' = .96 (Very good match.)

CAN vs. USA: r' = .86

CAN vs. JPN: r' = .96 (It looks like Germany, Canada and Japan are pretty much on the same page.)

USA vs. JPN: r' = .82
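The whole pairwise list above can be generated in one loop. A sketch in Python, with all five judges' ordinals copied from the table at the top of the post:

```python
from itertools import combinations

# Each judge's ordinals, read down the columns of the placements table.
ordinals = {
    "RUS": [1, 4, 2, 5, 3, 7, 6],
    "GER": [1, 2, 3, 4, 5, 6, 7],
    "CAN": [2, 1, 4, 3, 5, 6, 7],
    "USA": [1, 2, 5, 4, 3, 6, 7],
    "JPN": [2, 1, 3, 4, 5, 6, 7],
}

def spearman(a, b):
    """Rank correlation for rankings with no ties."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / ((n - 1) * n * (n + 1))

for j1, j2 in combinations(ordinals, 2):
    print(f"{j1} vs. {j2}: r' = {spearman(ordinals[j1], ordinals[j2]):.2f}")
```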

So there you are.

Home cooking?

Wait, one more thing. In the cases where judges appear to disagree, where do these differences come from? Well...

CAN matched the majority rankings for each skater, except that she (Casey Kelly) put Jennifer Robinson ahead of Cohen.

JPN matched the majority in every ranking, except that he (Tomiko Yamada) put Shizuka Arakawa ahead of Kwan.

GER had no horse in the race, and she (Sissy Krick) matched the majority without exception.

RUS (Tatiana Danilenko) was the only judge to put Sokolova ahead of McDonough, and was also the only judge to put Cohen ahead of Arakawa.

Hmm.

Mathman

2.

## Re: Statistics tutorial (2)

Originally posted by Mathman
Now run this 12 through the following formula, where n is the number of skaters:

r' = 1- 6*(sum of the d-squareds)/(n-1)(n)(n+1)

r' = 1 - 6*12/6*7*8

r' = .79

Tutorial 2, was there a tutorial 1?

I am slow, but I try to muddle through. It helps me at least if you state the equation as

r' = 1 - [6(sum of the d-squareds)]/(n-1)(n)(n+1)

that is, add an extra bracket, and leave out the *, because the * may be confused with "power of" instead of "product of." I know you are using proper math symbols, but I am slow. I know technically you can state it as

r' = 1 - 6(sum of the d-squareds)/(n-1)(n)(n+1)

and that is still correct.

BTW, should it always be 1-6 or 1-n, and in this case n = 6?

Don't be intimidated if math isn't your favorite subject. It is my job to make it your favorite subject, LOL.
I am exceedingly intimidated, but now that I have muddled through it, what is the reward, a box of popcorn? Candies? Don't tell me self-satisfaction, that does not work for me.

3.
Hi, RTureck. I edited my equation as you suggested. Thanks for the feedback.

The 6 is always 6. It is the 6 in the formula for the sum of the squares of the first n integers

1^2 + 2^2 + 3^2 + ... + n^2 = n(n+1)(2n+1)/6.
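That identity is easy to verify by brute force for small n. A quick Python check:

```python
# Check 1^2 + 2^2 + ... + n^2 == n(n+1)(2n+1)/6 for the first few n.
for n in range(1, 20):
    assert sum(k * k for k in range(1, n + 1)) == n * (n + 1) * (2 * n + 1) // 6
print("identity holds for n = 1..19")
```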

MM

4.
Ah, Mathman, this brings back such fond memories. Seriously, you know I love this stuff, and you're right, it IS easy. Besides, in the end, even if the only thing one gets is the "Home Cookin'" section (great title), it makes me miss the COP already. I can understand Dick Button preferring the 6.0 system--it's the system he learned, competed under, and has commented on for going on 50 years--but when I see the stats for a competition like this using the 6.0 system, I find the COP is more statistically accurate right now, and it has the potential for even greater statistical accuracy once they fix a few things, than the 6.0 system ever could achieve. I know I'm slightly off topic, but then looking at and understanding statistics is, IMO, the most important way of evaluating which judging method most accurately rewards the best skating performances with the top placements. Thanks again, Master Mathman.
Rgirl

5.
Originally posted by Rgirl
"[W]hen I see the stats for a competition like this using the 6.0 system, I find the COP is more statistically accurate right now and has the probability of gaining even greater statistical accuracy once they fix a few things than the 6.0 system ever could." -- Rgirl
We'll see, R. IMO the CoP has its statistical peccadilloes, too. If you want the CoP to look good, analyze the ordinal system. If you want the ordinal system to look good, analyze the CoP.

My only beef is, if anyone wants to use statistics to criticize one system or the other, he or she has to do it by the numbers. Of all mathematical disciplines, statistics is the most driven by actual empirical data. For over a year now, long before we had a single datum to judge by, experts in statistics have been telling us what was going to happen under CoP judging.

OK.

We’ll see.

Mathman

6.
Mathman -

You certainly showed the national bias where applicable among the judges for the 6.0 system.

I'd like to see you run through another competition using the comparisons of the judges in the CoP system. But then in the CoP, will we know who the judges are?

Joe

7.
Originally posted by Mathman
We'll see, R. IMO the CoP has its statistical peccadilloes, too. If you want the CoP to look good, analyze the ordinal system. If you want the ordinal system to look good, analyze the CoP.

My only beef is, if anyone wants to use statistics to criticize one system or the other, he or she has to do it by the numbers. Of all mathematical disciplines, statistics is the most driven by actual empirical data. For over a year now, long before we had a single datum to judge by, experts in statistics have been telling us what was going to happen under CoP judging.

OK.

We’ll see.

Mathman

ITA that if you want the COP to look good, analyze the ordinal system, and if you want the ordinal system to look good, analyze the COP. After all, what student of statistics doesn't have a copy of the never-out-of-print "How to Lie with Statistics" (and they don't mean in bed, although I'm sure some judges have done that too). But watching the Campbell's and the IFSC use the ordinal system, especially the former live, I see that each score has to encompass so many variables, not only with what the skater is doing on the ice but also in terms of skate order, and all those variables are only ever in the judges' minds. For all we know, under the ordinal system, some judges may mark a skater based only on jumps or because they hate the music of a certain composer. Those are extreme examples, but the point is, we don't know and can never know. Even if the judges say what variables they used to score a skater, we don't know if they did. Then those scores are really just used as a way for judges to keep track of how they would place the skaters after they've seen them all skate, so the 5.8s and 5.9s for one judge might be 5.5s and 5.6s for another. I know they have guidelines, but again, we just don't know and can never know.

Okay, I've already said more than I meant to say and said I would say and I don't want to take this thread in the wrong direction. So please, hit us wit more 'o dem numbers, baby! I got a good feelin' about 216 and 12,960,000--and I bet you know why
Rgirl
P.S. Re "For over a year now, long before we had a single datum to judge by, experts in statistics have been telling us what was going to happen under CoP judging." What are some of the things experts have been telling us was going to happen under COP judging? If this isn't the right thread, you can answer me under the "More Questions About the COP" thread. Just curious.

8.
I can understand Dick Button preferring the 6.0 system--it's the system he learned, competed to, and has commented on for going on 50 years-
Gosh, you talk as if the system used at this year's worlds is the exact same system used back in 1948. It isn't. The scoring system has changed many times over the years and people like Dick have always kept abreast; it's their job.

9.
Originally posted by berthes ghost
Gosh, you talk as if the system used at this year's worlds is the exact same system used back in 1948. It isn't. The scoring system has changed many times over the years and people like Dick have always kept abreast; it's their job.
Very true, Berthes Ghost, that the scoring system has been tweaked many times along the way. I should have been clearer with my language, but it is still the 6.0 system and that's what Dick said he preferred. However, it was a quick, general remark so I myself wouldn't read too much into it. He may have meant a specific aspect of the ordinal system, like when they used to know who the judges were
Rgirl

10.
I don't think that Dick Button is against the CoP particularly. He is against secret judging.

Mathman

11.
Hooray for Dick. I'm against secret judging. I want to see those bias scores.

Joe

12.
(What I wrote while waiting for my car to get through its 120,000 mile checkup this afternoon. -- If you're so smart, Mathman, why can't you afford a better car?)

Statistics 101, part 3. Testing the CoP.

The only thing we really ask of a judging system is that "the right person wins." If that is too vague a goal, then we would settle for some assurance at least of consistency in judging: if lots of judges scored the performances over and over, the results would be more or less the same most of the time. In using language like this we are tacitly assuming that the judges' scores somehow represent a sample drawn from the population of all the marks that might have been given by all possible well-qualified, impartial and honest judges, conscientiously following the guidelines of the CoP.

Suppose the total scores look like this (for simplicity I will assume that there are only five judges, and I will set aside for the moment the effects of the random draw and the trimmed mean, as well as cumulative effects of adding up many component scores and considerations of national chauvinism, etc.):

Buttle: 180, 190, 185, 185, 185; average: 185
Goebel: 185, 190, 180, 180, 185; average: 184

Buttle wins, 185 to 184.

But since this was a close contest, can we be confident that Buttle's skate really was better, and would have been certified so by the majority of all judging panels that might have scored this event?

First, to investigate this question as serious scientists rather than as fans of one skater or another, we must agree not to go popping off before we know what we are talking about. To ensure this, let us stipulate the following:

IF WE CANNOT BE 95% SURE, THEN WE WILL REMAIN SILENT

Next, we pretend for the sake of the argument that we want the CoP to work. That is, we will test the hypothesis that Buttle's performance really was better. To do this, we set up a straw man, called the Null Hypothesis: The two performances were exactly the same. We don't believe this, however, and we want to show that the null hypothesis is wrong.

Null Hypothesis: The two performances were exactly the same.
Our Hypothesis: No, Buttle's was better and so the CoP worked.

To test our hypothesis, we compute a statistic called the variance. This is a measure of how far each individual score is from the average. For Buttle the average was 185 and the individual scores are

180 (off by 5), 190 (off by 5), 185 (off by 0), 185 (off by 0), 185 (off by 0).

Take the sum of the squares of these differences and divide by n-1, where n is the sample size. That's the variance. For Buttle,

v = (5x5 + 5x5 + 0x0 + 0x0 + 0x0)/4 = 12.5

For Goebel,

v = (1x1 + 6x6 + 4x4 + 4x4 + 1x1)/4 = 17.5

(Remark: The standard deviation is the square root of the variance.)
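The variance computation above can be sketched in a few lines of Python (sample variance, dividing by n - 1, exactly as in the post):

```python
# Sample variance: average squared deviation from the mean, divided by n - 1.
def variance(scores):
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / (n - 1)

buttle = [180, 190, 185, 185, 185]
goebel = [185, 190, 180, 180, 185]
print(variance(buttle))  # 12.5
print(variance(goebel))  # 17.5
```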

Now we need a standard unit-free measure of the difference between Buttle's and Goebel's average scores. The formula is

T = (x1-x2)squareroot(n)/squareroot(v1 + v2)

T = (185-184)squareroot(5)/squareroot(12.5 + 17.5)

T = 0.4

(In techspeak, Buttle's average score is 0.4 standard errors bigger than Goebel's.)

But in order to be 95% sure of anything we must beat a certain "critical value." The exact number that we have to beat varies slightly with the sample size and other variables, but as a rule of thumb most of the critical values are around 2. So if this T score is bigger than 2, then we can be 95% confident that Buttle's performance really was better. If T is not bigger than 2, then we cannot be 95% sure, so we must remain silent.

Conclusion: Our T score of 0.4 is not bigger than 2.0. Therefore we are not 95% sure (of anything).
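Putting the pieces together, here is a Python sketch of the T computation and the 95% decision rule. The rule-of-thumb critical value of 2 is from the paragraph above; an exact test would look up the t-table value for the appropriate degrees of freedom.

```python
import math

# Two-sample T statistic for equal panel sizes:
# T = (mean1 - mean2) * sqrt(n) / sqrt(v1 + v2)
def variance(scores):
    n = len(scores)
    m = sum(scores) / n
    return sum((x - m) ** 2 for x in scores) / (n - 1)

buttle = [180, 190, 185, 185, 185]
goebel = [185, 190, 180, 180, 185]
n = len(buttle)
diff = sum(buttle) / n - sum(goebel) / n
T = diff * math.sqrt(n) / math.sqrt(variance(buttle) + variance(goebel))

print(round(T, 2))  # 0.41 (about 0.4)
print("95% sure Buttle's skate was better" if T > 2 else "cannot be 95% sure")
```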

So, bottom line. We tried our hardest to say something good about the CoP, but we could not be sure.

Mathman

PS. Some pet peeves.

(a) If you have taken Stat 101 in the last 20 years, you have probably used a textbook that made use of the "p-value" in hypothesis testing. The authors of these texts, IMO, are often somewhat lazy in explaining why it is cheating to use the sample data to determine the p-value, rather than (correctly) to set the level of significance first, then take the sample. I will explain more about this if there is any interest.

(b) Most statistics texts say that we "accept" or "reject" the null hypothesis. This language is quite misleading. We never accept the null hypothesis -- it's just that we cannot be 95% sure that it's wrong.

(c) In one-tailed hypothesis tests, some texts say "x1 is less than or equal to x2" instead of "x1 = x2" for the Null Hypothesis. While satisfactory English, IMO this creates confusion by hiding the mathematical assumptions on which the calculations actually rest.

AND ONE MORE THING, LOL.

(d) Here's a good way to tell if your statistics text is any good or not. If it refers to the "Chi square" distribution, it's crap. It should be "Chi squared." (Want some barbeque ribs?)

13.
Originally posted by Mathman

T = (x1-x2)SQRT(n)/SQRT(v1 + v2)

T = (185-184)SQRT(5)/SQRT(12.5 + 17.5)

T = 0.4
I am slow and this is so intimidating, so what is SQRT? Can you use

√5

instead? Or spell the whole thing out, thanks. It took me a long time to figure that out. I had problems with 4th grade math, and you are talking about null hypothesis? But thanks, I am learning.

So, bottom line. We tried our hardest to say something good about the CoP, but we could not be sure.
Can you figure out Shizuka's CoP scores in Skate Canada?

14.
OK on "squareroot."

Yes, these calculations are based on the assumption of the Null Hypothesis. The logic is, we are against the Null Hypothesis. So we are giving the Null Hypothesis enough rope to hang himself. That is, we assume the Null Hypothesis to be true and then prove that this assumption leads to a conclusion that is probably (with 95% probability) wrong. It is the probabilistic version of the reduction to absurdity.

I'll look at the Skate Canada scores. But I won't say anything unless I am 95% sure.

MM

15.
Quote:
Next, we pretend for the sake of the argument that we want the CoP to work.

That is like the Warren Commission which set forth that the single bullet theory was correct and collected evidence to support that theory. Any opposing evidence was discounted.

Am I correct?

Joe

