# Thread: How can we measure the degree of agreement between two judges?

1. 0

## How can we measure the degree of agreement between two judges?

Under the CoP the judges’ protocol sheets contain so much data that it is hard to know how to get started in addressing the question of whether two judges see things pretty much the same way or not.

When judging is not anonymous, there is an easy way to do it. Convert each judge’s scores to ordinals. Then compute something called the “Spearman rank correlation coefficient.”

Here is how to do it.

| Skater    | RUS | FRA | difference | d² |
|-----------|-----|-----|------------|----|
| Yagudin   | 1   | 2   | 1          | 1  |
| Plushenko | 2   | 3   | 1          | 1  |
| Goebel    | 3   | 1   | 2          | 4  |
| Honda     | 4   | 4   | 0          | 0  |
| **Total** |     |     |            | **6** |

Now compute r = 1 - 6 × (sum of the squared differences) / (n × (n² - 1)), where n is the number of skaters:

r = 1 - (6×6)/(4×15) = 1 - 0.6 = 0.4

This is a 40% correlation.
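The arithmetic can be sanity-checked with a few lines (just a sketch; for untied ranks, `scipy.stats.spearmanr` computes the same coefficient):

```python
# Spearman rank correlation between two judges' ordinals:
#   rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
def spearman_rho(ranks_a, ranks_b):
    n = len(ranks_a)
    d_sq = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

# Ordinals from the table: Yagudin, Plushenko, Goebel, Honda
rus = [1, 2, 3, 4]
fra = [2, 3, 1, 4]
print(spearman_rho(rus, fra))  # 0.4
```

Identical rankings give rho = 1, perfectly reversed rankings give rho = -1.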

2. 0
I know this isn't the point of the thread, but are those ordinals actually real? I don't think it's ever happened that a judge scored Goebel over both Yagudin and Plushenko at a competition.

We should look at a CoP competition before the protocols became randomized for each judge and see how it compares to 6.0 judging.

EDIT - 2006 Olympics might be a great one to examine because of how all over the place the competition was.

3. 0
Originally Posted by Mathman

| Skater    | RUS | FRA | difference | d² |
|-----------|-----|-----|------------|----|
| Yagudin   | 1   | 2   | 1          | 1  |
| Plushenko | 2   | 3   | 1          | 1  |
| Goebel    | 3   | 1   | 2          | 4  |
| Honda     | 4   | 4   | 0          | 0  |
:sheesh:

4. 0
Calculating correlations of the GOEs/PCS would be enough, but the randomized order spoils it.
For example, one judge gives: 0 1 1 1 2 1 2 1
Another: 1 1 1 0 1 0 2 0

Pearson correlation is 0.394, medium.

In the PCS, those judges have a 0.845 correlation.
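For what it’s worth, the 0.394 figure can be reproduced directly (a sketch; `numpy.corrcoef` gives the same number):

```python
import math

# Pearson correlation between two judges' GOE sequences.
def pearson(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

judge_a = [0, 1, 1, 1, 2, 1, 2, 1]
judge_b = [1, 1, 1, 0, 1, 0, 2, 0]
print(round(pearson(judge_a, judge_b), 3))  # 0.394
```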

5. 0
Slightly different, but isn't there a way of calculating the divergence of a single sample (i.e. one judge's scores) from the mean score of all the judges...like, the variance? You can do that for each of the technical elements and program component scores. I'm not sure if this is appropriate, though (is it reasonable to assume that scores should follow a normal distribution about the mean?)

6. 0
Originally Posted by prettykeys

Slightly different, but isn't there a way of calculating the divergence of a single sample (i.e. one judge's scores) from the mean score of all the judges...like, the variance? You can do that for each of the technical elements and program component scores.
Yeah, it is about searching for a judge who marks very differently from the rest. I guess flagging anyone who falls outside a standard deviation is enough.

I'm not sure if this is appropriate, though (is it reasonable to assume that scores should follow a normal distribution about the mean?)
Yes, it's reasonable.

7. 0
Originally Posted by prettykeys
Slightly different, but isn't there a way of calculating the divergence of a single sample (i.e. one judge's scores) from the mean score of all the judges...like, the variance? You can do that for each of the technical elements and program component scores. I'm not sure if this is appropriate, though (is it reasonable to assume that scores should follow a normal distribution about the mean?)
This would be fine. The usual rule of thumb is something like: more than three standard deviations from the mean is out of bounds.
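That kind of check is easy to automate. Here is a sketch with invented panel scores; I compare each judge against the mean and standard deviation of the *other* judges (leave-one-out), because on a small panel a single wild mark inflates a standard deviation computed from a sample that includes it, and can then never reach three sigmas of its own yardstick:

```python
import math

# Leave-one-out outlier check: compare each judge's score with the
# mean and standard deviation of the other judges' scores.
def flag_outliers(scores, cutoff=3.0):
    flagged = []
    for i, s in enumerate(scores):
        rest = scores[:i] + scores[i + 1:]
        mean = sum(rest) / len(rest)
        sd = math.sqrt(sum((r - mean) ** 2 for r in rest) / len(rest))
        if sd > 0 and abs(s - mean) / sd > cutoff:
            flagged.append(i)
    return flagged

# Made-up panel of nine component scores; judge 6 is way low.
panel = [7.25, 7.50, 7.25, 7.00, 7.50, 7.25, 4.00, 7.25, 7.50]
print(flag_outliers(panel))  # [6]
```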

However, the ISU’s own system for evaluating judges is somewhat different. It uses just the sum of the absolute values of the individual scores, rather than the square root of the sum of the squares.

I don’t know why they decided to do it that way. The sum of the squares is prettier (in the sense of being compatible with distance formulas in large-dimensional Euclidean spaces), it lends itself better to mathematical manipulation (the absolute value is not differentiable), and it gives relatively greater weight to those scores that are way out of line instead of just a little.

Perhaps the concern was a possible loss of robustness if you use the more common standard deviation as your measure of variation. (This would be the case, for instance, if they -- like you -- had suspicions about whether the underlying distribution is symmetric or not.)

Anyway, here is how the ISU identifies “possible anomalies” in judges’ scores. A new communication (#1631) about this came out in July. Scroll down to section E, page 5.

http://www.isu.org/vsite/vnavsite/pa...v-list,00.html

For GOEs, for instance, it goes like this.

For each skater, for each element, calculate the mean score from all sitting judges, plus the referee counted twice (so this is not the same as the trimmed mean that we see in the protocols). Then for each judge, calculate the absolute value of the difference between that judge’s score and the average.

Add these differences up. If the sum exceeds the number of elements being judged, then that judge is “outside the corridor,” and this “anomaly” comes to the attention of the judges’ oversight procedure.

(The little chart about pluses and minuses that accompanies this explanation in the Communication is kind of a red herring. I assume this information is kept so they can tell whether a judge is consistently favoring/dumping on a skater, or whether the judge is giving some marks way too high and others way too low for the same skater.)
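The corridor calculation described above can be sketched as follows (the judge names and GOE values are invented for illustration; this just paraphrases the procedure, not the ISU’s own code):

```python
# Corridor check: per element, take the panel mean with the referee
# counted twice; per judge, sum the absolute deviations from those
# means; a sum exceeding the number of elements is an "anomaly."
def corridor_anomalies(judge_goes, referee_goes):
    """judge_goes: dict mapping judge name -> list of GOEs, one per element."""
    n_elements = len(referee_goes)
    means = []
    for e in range(n_elements):
        scores = [goes[e] for goes in judge_goes.values()]
        scores += [referee_goes[e], referee_goes[e]]  # referee counted twice
        means.append(sum(scores) / len(scores))
    anomalies = []
    for judge, goes in judge_goes.items():
        total = sum(abs(g - m) for g, m in zip(goes, means))
        if total > n_elements:  # outside the corridor
            anomalies.append(judge)
    return anomalies

judges = {
    "J1": [1, 1, 0, 2, 1, 0, 1, 1],
    "J2": [0, 1, 1, 1, 1, 0, 2, 1],
    "J3": [-2, -1, -2, 0, -1, -2, -1, -2],  # consistently harsh
}
referee = [1, 1, 0, 1, 1, 0, 1, 1]
print(corridor_anomalies(judges, referee))  # ['J3']
```

Note that, unlike a squared-deviation measure, this flags a judge who is off by a little on every element just as readily as one who is wildly off on a single element.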

8. 0
Sadly I can't understand well what you're talking about, as I'm a totally non-mathematical person.

But do you know this Japanese site?
http://fssim.sakura.ne.jp/

For example, this is an analysis of OG Men's competition.
http://fssim.sakura.ne.jp/200910/200...couverMen.html

The owner of the site calculated the deviation of each judge's marks. (According to Mathman, the ISU no longer uses the deviation to evaluate its judges. Right? I'm such a mathematical fool.)

I'm very satisfied with the final placement of the event. So basically I don't have any complaints on it.
But looking into those scores, especially the FS scores, my unscientific brain can't help but wonder if the judges might have tried to adjust the result and succeeded.
As you know, there were two judges who gave Plushenko very low marks (147.83/151.23) and two who gave him very high ones (180.03/179.03). Actually Evan had a low one (157.53) and a high one (180.43) as well. Probably they were a U.S. judge and a Russian one, and that's a rather ordinary thing to happen in competition. But how about two and two? Didn't they try to make it even?

I know it's off topic, but may I say this here? Being a huge fan of Daisuke, the J5 for his FS annoys me a lot. He/she gave him only 141.88 in TSS and 62.38 in TES. (He/she gave -3 GOE on the 3Lz+2T. Did Daisuke fall on the jump??? He/she gave only +1 GOE on the fabulous Lv4 CiSt as well.) I can't find a lower TES until Florent Amodio's J5 and J9. I hope the J5 has already gotten a yellow card from the ISU.
