
How can we measure the degree of agreement between two judges?

Mathman

Joined
Jun 21, 2003
Under the CoP the judges’ protocol sheets contain so much data that it is hard to know how to get started in addressing the question of whether two judges see things pretty much the same way or not.

When judging is not anonymous, there is an easy way to do it. Convert each judge’s scores to ordinals, then compute something called the “Spearman rank correlation coefficient.”

Here is how to do it.

Skater          RUS   FRA   difference   d squared
Yagudin          1     2        1            1
Plushenko        2     3        1            1
Goebel           3     1        2            4
Honda            4     4        0            0

Total (sum of d squared)                     6

Now compute r = 1 - 6 x (sum of the squared differences) / (n x (n^2 - 1))

= 1 - (6x6)/(4x15) = 1 - 0.6 = 0.4

This is a 40% correlation. :)
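For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The ordinals are just the made-up ones from the table above, and this version of the formula assumes there are no ties:

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation: r = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    Assumes the two lists are ordinals for the same skaters, with no ties."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Ordinals from the table above: Yagudin, Plushenko, Goebel, Honda
rus = [1, 2, 3, 4]
fra = [2, 3, 1, 4]
print(spearman(rus, fra))  # 0.4
```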
 

Blades of Passion

Skating is Art, if you let it be
Record Breaker
Joined
Sep 14, 2008
Country
France
I know this isn't the point of the thread, but are those ordinals actually real? I don't think it's ever happened that a judge scored Goebel over both Yagudin and Plushenko at a competition.

We should look at a CoP competition before the protocols became randomized for each judge and see how it compares to 6.0 judging.

EDIT - 2006 Olympics might be a great one to examine because of how all over the place the competition was.
 

seniorita

Record Breaker
Joined
Jun 3, 2008
Yagudin          1     2        1            1
Plushenko        2     3        1            1
Goebel           3     1        2            4
Honda            4     4        0            0
:unsure::sheesh::scowl:
 

Daniel5555

On the Ice
Joined
Jan 27, 2009
Calculating the correlations of the GOEs/PCSs would be enough, but the random order spoils it.
For example, one judge gives: 0 1 1 1 2 1 2 1
Another gives: 1 1 1 0 1 0 2 0

Pearson correlation is 0.394, medium.

In PCS those judges have 0.845 correlation :)
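If anyone wants to reproduce those numbers, here is a minimal Python sketch of the Pearson calculation, using the two GOE lists above:

```python
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists of marks."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

judge_1 = [0, 1, 1, 1, 2, 1, 2, 1]
judge_2 = [1, 1, 1, 0, 1, 0, 2, 0]
print(round(pearson(judge_1, judge_2), 3))  # 0.394
```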
 

prettykeys

Medalist
Joined
Oct 19, 2009
Slightly different, but isn't there a way of calculating the divergence of a single sample (i.e. one judge's scores) from the mean score of all the judges...like, the variance? You can do that for each of the technical elements and program component scores. I'm not sure if this is appropriate, though (is it reasonable to assume that scores should follow a normal distribution about the mean?)
 

Daniel5555

On the Ice
Joined
Jan 27, 2009
prettykeys said:
Slightly different, but isn't there a way of calculating the divergence of a single sample (i.e. one judge's scores) from the mean score of all the judges...like, the variance? You can do that for each of the technical elements and program component scores.
Yeah, the idea is to search for a judge who marks very differently from the rest. I guess flagging anyone who is more than a standard deviation or so out from the mean is enough.

I'm not sure if this is appropriate, though (is it reasonable to assume that scores should follow a normal distribution about the mean?)
Yes, it's reasonable.
 
Mathman

Joined
Jun 21, 2003
prettykeys said:
Slightly different, but isn't there a way of calculating the divergence of a single sample (i.e. one judge's scores) from the mean score of all the judges...like, the variance? You can do that for each of the technical elements and program component scores. I'm not sure if this is appropriate, though (is it reasonable to assume that scores should follow a normal distribution about the mean?)

This would be fine. The rule of thumb is usually something like, more than three standard deviations from the mean is out of bounds.
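As a rough sketch of that idea in Python (the panel of GOEs below is invented, and the three-standard-deviation cutoff is just the rule of thumb above; the example uses a tighter cutoff because the made-up panel is so small):

```python
import statistics

def flag_outliers(panel_scores, k=3.0):
    """Return the indices of judges whose score for one element lies more than
    k standard deviations from the panel mean for that element."""
    mean = statistics.mean(panel_scores)
    sd = statistics.pstdev(panel_scores)
    if sd == 0:
        return []
    return [i for i, score in enumerate(panel_scores) if abs(score - mean) > k * sd]

# Hypothetical GOEs from a nine-judge panel for a single element.
goes = [1, 1, 0, 1, 2, 1, 1, -3, 1]
print(flag_outliers(goes, k=2))  # [7] -- the judge who gave -3
```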

However, the ISU’s own system for evaluating judges is somewhat different. It uses the sum of the absolute values of the individual deviations (each judge’s score minus the panel average), rather than the square root of the sum of their squares.

I don’t know why they decided to do it that way. The sum of the squares is prettier (in the sense of being compatible with distance formulas in high-dimensional Euclidean spaces), it lends itself better to mathematical manipulation (the absolute value is not differentiable), and it gives relatively greater weight to scores that are way out of line rather than to those that are only a little off.

Perhaps the concern was a possible loss of robustness if you use the more common standard deviation as your measure of variation. (This would be the case, for instance, if they -- like you -- had suspicions about whether the underlying distribution is symmetric or not.)

Anyway, here is how the ISU identifies “possible anomalies” in judges’ scores. A new communication (#1631) about this came out in July. Scroll down to section E, page 5.

http://www.isu.org/vsite/vnavsite/page/directory/0,10853,4844-130127-131435-nav-list,00.html

For GOEs, for instance, it goes like this.

For each skater, for each element, calculate the mean score from all sitting judges, plus the referee counted twice (so this is not the same as the trimmed mean that we see in the protocols). Then for each judge, calculate the absolute value of the difference between that judge’s score and the average.

Add these differences up. If the sum exceeds the number of elements being judged, then that judge is “outside the corridor,” and this “anomaly” comes to the attention of the judges’ oversight procedure.

(The little chart about pluses and minuses that accompanies this explanation in the Communication is kind of a red herring. I assume this information is kept so they can tell whether a judge is consistently favoring/dumping on a skater, or whether the judge is giving some marks way too high and others way too low for the same skater.)
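Here is a rough sketch, in Python, of that GOE check as I read the description above. The panel, the referee’s marks, and the GOEs are all invented; the reference average counts the referee twice, and a judge is flagged when the sum of absolute differences exceeds the number of elements:

```python
def corridor_sum(judge_goes, panel_goes, referee_goes):
    """For one judge: sum over the elements of |judge's GOE - reference average|,
    where the reference average is all judges' GOEs plus the referee's GOE
    counted twice."""
    total = 0.0
    for elem, judge_score in enumerate(judge_goes):
        scores = [goes[elem] for goes in panel_goes]
        reference = (sum(scores) + 2 * referee_goes[elem]) / (len(scores) + 2)
        total += abs(judge_score - reference)
    return total

# Hypothetical panel: three judges and a referee scoring four elements for one skater.
panel = [[1, 1, 0, 2],
         [0, 1, 1, 2],
         [3, -2, 2, -1]]   # this judge is deliberately out of line
referee = [1, 1, 0, 2]

for i, judge in enumerate(panel):
    total = corridor_sum(judge, panel, referee)
    verdict = "possible anomaly" if total > len(judge) else "inside the corridor"
    print(i, round(total, 2), verdict)
```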
 

carignan

Rinkside
Joined
Apr 2, 2010
Sadly I can't understand well what you're talking about, as I'm a totally non-mathematical person.

But do you know this Japanese site?
http://fssim.sakura.ne.jp/

For example, this is an analysis of OG Men's competition.
http://fssim.sakura.ne.jp/200910/200910VancouverMen.html

The owner of the site calculated the deviation of each judge's marks. (According to Mathman, the ISU no longer uses the deviation to evaluate their judges. Right? I'm such a fool at math.)

I'm very satisfied with the final placement of the event, so basically I don't have any complaints about it.
But looking into those scores, especially the FS scores, my unscientific brain can't help but wonder if the judges tried to adjust the result and succeeded.
As you know, there were two judges who gave Plushenko very low marks (147.83/151.23) and two who gave him very high ones (180.03/179.03). Actually Evan had a low one (157.53) and a high one (180.43) as well. Probably those were a U.S. judge and a Russian, and it's a rather ordinary thing to happen in competition. But how about two and two? Didn't they try to make it even?

I know it's off topic, but may I say this here? Being a huge fan of Daisuke, the J5 for his FS annoys me a lot. He/she gave him only 141.88 in TSS and 62.38 in TES. (He/she gave -3 GOE on the 3Lz+2T. Did Daisuke fall on the jump??? He/she gave only +1 GOE on the fabulous Lv4 CiSt as well.) I can't find a lower TES until Florent Amodio's J5 and J9. I hope the J5 has already gotten a yellow card from the ISU.
 

carignan

Rinkside
Joined
Apr 2, 2010
seniorita said:
so a judge gave Plushenko 98.10 and another gave Lysacek 97.75 and someone else Dai 102.15? All sound
Yah, 102.15! Hilarious! :laugh: I think it's the same judge who gave 96.1 to Plushenko, 97.75 to Evan, and 102.15 to Daisuke. Even as a Daisuke fan, it's just too much. For justice's sake, the judge was omitted. :p I guess she was a Japanese judge. What a brave / patriotic woman. :unsure:
 

seniorita

Record Breaker
Joined
Jun 3, 2008
Can I ask something? In each column, aren't those the same judge's marks? I know two judges are out for the whole competition, but it seems like their marks are published randomly? I thought it was anonymous, but that each column represents the same judge. Was I wrong? :confused:

I'm looking at the sheet now; what are the red-marked numbers supposed to be?
 

carignan

Rinkside
Joined
Apr 2, 2010
seniorita said:
I thought it was anonymous, but that each column represents the same judge.
No, the ISU now shuffles the judges. (I hate this shuffling!!) J1 for Plushenko and J1 for Evan aren't the same judge. We can't see who is who now.
The owner of the site found out by calculation which two judges were eliminated and shows them in red.
 

seniorita

Record Breaker
Joined
Jun 3, 2008
carignan said:
No, the ISU now shuffles the judges. (I hate this shuffling!!) J1 for Plushenko and J1 for Evan aren't the same judge. We can't see who is who now.
The owner of the site found out by calculation which two judges were eliminated and shows them in red.

Thank you. I thought so at first, but then when I take the high and low out, the average of the rest is not what it should be... I'm sure I'm doing something idiotic :mad:
 

carignan

Rinkside
Joined
Apr 2, 2010
seniorita said:
Thank you. I thought so at first, but then when I take the high and low out, the average of the rest is not what it should be... I'm sure I'm doing something idiotic
I'm afraid I might be wrong... but I think the actual score is calculated element by element, so you can't reproduce it from each judge's total score.
In this table, scoreA is the score in the ISU's protocol. ScoreB is the average without the random judge draw (just remove the highest and the lowest and take the average of the rest).
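A tiny invented example of why the element-by-element calculation matters: when the highest and lowest marks are trimmed element by element, different judges can get dropped on different elements, so the result need not match trimming the judges' totals. (This only illustrates the arithmetic point, not the exact ISU procedure.)

```python
def trimmed_mean(scores):
    """Drop the single highest and single lowest score, average the rest."""
    kept = sorted(scores)[1:-1]
    return sum(kept) / len(kept)

# Hypothetical panel: three judges scoring two elements.
judges = [[3, 0],
          [0, 0],
          [1, 1]]

# Element by element (roughly how a protocol score is built up): trim per element, then sum.
per_element = sum(trimmed_mean([j[e] for j in judges]) for e in range(2))

# Shortcut that does NOT reproduce it: trim the judges' totals.
per_total = trimmed_mean([sum(j) for j in judges])

print(per_element, per_total)  # 1.0 vs 2.0
```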
 

seniorita

Record Breaker
Joined
Jun 3, 2008
OK, never mind, I got it afterwards :eek:
There used to be a thread here explaining the CoP calculation very well, 1-2 years back; now I can't find it, and I don't remember the title...
 