
The Judging Controversy Thread

Nadya

On the Ice
Joined
Mar 22, 2004
You really have no idea of the importance and the nature of 'filial piety' to one's family, country, society, and future generations, which requires great degrees of selflessness and sacrifice in a Confucianism-ingrained country like Korea? For her to tell herself not to care is practically denying herself as a Korean.

(But hey, I am sure you will find some way of nitpicking the way I say things instead of focusing on what I am trying to say. It has been your tactic of argument all along: poking holes through semantics instead of trying to understand what the posters are trying to express. This is, after all, an international forum where English might not be everyone's first language.)
Look, Korea has no monopoly on filial piety or conservative mores. I am a child of a conservative society married to someone from a mega-conservative society. I understand that dynamic very well. And I think you are choosing to misunderstand what I am saying. I'm not saying Kim SHOULD defy them or not care. I am saying that what Kim does, Kim chooses to do. It's a choice; it's always a choice if you strip all the crap away. If she doesn't want to skate but chooses to skate to be a good Korean, then being a good Korean is more important to her than not skating. It's always a choice. No one said choices are easy or free. There are no victims. We are all agents of our destiny.
 

Nadya

On the Ice
Joined
Mar 22, 2004
The problem with all these statistical analyses is that they assume that judges' scores are just randomly thrown out there, as if we were taking a random sample from a large population. In other words, what we study in Stat 101. This isn't the case, so…

For instance, the study of the Italian physicist is based primarily on the assumption that a skater's PCSs should not go up very much from one competition to another. Well, sometimes they do. Statisticians don't like it, but there you are.

That is why 6.0 was the more honest system. In 6.0 the unit of study is the judge. In IJS the unit of study is the CoP point. Only that's kind of irrelevant and it still comes down to the individual judge and the tech panel. Why pretend otherwise?
Here's a revolutionary idea on the judging overhaul. Keep the IJS, keep the anonymity, keep the number of judges, don't exclude anyone from the panel on the basis of having a native skater in the competition. But when you tote up the scores for a skater from country X, exclude the scores given by the judge from country X from the total. They have to discard some scores anyway, so why not discard the scores from the judge who has the most interest in how a particular skater does? For skaters who don't have a judge from their country on the panel, just discard the scores from one randomly selected judge. The computer knows the identity of the judges, so it should be easy to know what to discard. Voila, all accusations of bias gone.
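To make the idea concrete, here is a rough sketch of how such a tally could work. Everything in it (the function name, the panel, the marks) is invented for illustration; it is not the ISU's software, just the logic described above.

```python
# Hypothetical sketch of the "drop the compatriot judge's score" tally.
# All names and numbers are invented for illustration.
import random

def tally(scores, judge_countries, skater_country):
    """scores[i] is judge i's mark; judge_countries[i] is that judge's federation."""
    if skater_country in judge_countries:
        # Discard the mark from the judge who shares the skater's federation.
        kept = [s for s, c in zip(scores, judge_countries) if c != skater_country]
    else:
        # No compatriot on the panel: discard one randomly chosen judge's mark
        # instead, so every skater loses exactly one score.
        drop = random.randrange(len(scores))
        kept = [s for i, s in enumerate(scores) if i != drop]
    return sum(kept) / len(kept)

# Example: nine judges, one of them from the skater's own federation.
panel = ["RUS", "USA", "JPN", "KOR", "ITA", "FRA", "GER", "CAN", "CHN"]
marks = [8.75, 8.25, 8.50, 8.25, 8.00, 8.25, 8.50, 8.25, 8.00]
print(tally(marks, panel, "RUS"))   # averages the eight non-Russian marks
```

The same exclusion could be run per mark (each GOE and component) rather than on totals; the point is only that the computer can do it automatically because it knows who gave what.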
 

capcomeback

On the Ice
Joined
Feb 23, 2014
Here's a revolutionary idea on the judging overhaul. Keep the IJS, keep the anonymity, keep the number of judges, don't exclude anyone from the panel on the basis of having a native skater in the competition. But when you tote up the scores for a skater from country X, exclude the scores given by the judge from country X from the total. They have to discard some scores anyway, so why not discard the scores from the judge who has the most interest in how a particular skater does? For skaters who don't have a judge from their country on the panel, just discard the scores from one randomly selected judge. The computer knows the identity of the judges, so it should be easy to know what to discard. Voila, all accusations of bias gone.

Not bad, but anonymous judging has to go. Keeping scoring in the dark only creates greater opportunity for corruption (whether bribery, vote trading, voting blocs etc.). This also does not address the issues involving the Technical Panel.
 

Nadya

On the Ice
Joined
Mar 22, 2004
Not bad, but anonymous judging has to go. Keeping scoring in the dark only creates greater opportunity for corruption (whether bribery, vote trading, voting blocs etc.). This also does not address the issues involving the Technical Panel.
With respect, I don't think this will help because bribery, vote trading and voting blocs all existed and blossomed under 6.0, where all the judges were clearly identifiable.
 

capcomeback

On the Ice
Joined
Feb 23, 2014
With respect, I don't think this will help because bribery, vote trading and voting blocs all existed and blossomed under 6.0, where all the judges were clearly identifiable.

True, but there has been a lot more scrutiny since 2002 than there was in the past, too. Judges would have to be accountable for their marks (which they should be anyway).
 
Joined
Jun 21, 2003
...If there's a significant change in a skater's PCS compared with how much other skaters' PCS changed, then either the skater suddenly improved in long-term abilities such as artistry etc., or the judges were biased toward/against some skaters.

I think this conclusion confuses what statisticians mean by “bias” with what figure skating fans mean when they use this word.

Plus, in hypothesis testing we have to be careful not to put the cart before the horse. Any collection of data will contain statistical oddities. Indeed, it would be statistically odd if it did not. It is a little bit like cheating to perform the experiment/take the sample first and then say, oh, look at this statistical anomaly.

Another difficulty is that his whole exercise is based on assumptions that IMHO require scrutiny before we apply them to IJS scores and the actual human judging process. For example, we are silently postulating: that there is a “true” mean score for a skater’s performance in the category of Interpretation, say; that this mean can be approximated by the average of the judges’ scores, perhaps over several competitions; that it is reasonable to expect that the actual judge’s scores should be more or less normally distributed about this mean with a standard deviation that can be estimated by sample standard deviations; that a skater’s performance in a given event is like picking a performance at random out of a hat containing all the performances that this skater might do; etc.
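For what it is worth, the postulates can be made concrete with a tiny simulation. This is only a sketch of the model that is being silently assumed (a made-up "true" score and made-up scatter), not a claim about how judges actually behave:

```python
# Minimal sketch of the measurement model the analysis assumes:
# each judge's mark = "true" score + independent, roughly normal noise.
# All numbers are invented for illustration.
import random
import statistics

TRUE_SCORE = 8.6      # postulated "true" Interpretation score for one performance
NOISE_SD = 0.35       # postulated judge-to-judge scatter

random.seed(1)
panel = [round(random.gauss(TRUE_SCORE, NOISE_SD) * 4) / 4   # marks come in 0.25 steps
         for _ in range(9)]

print(panel)
print("panel mean:", round(statistics.mean(panel), 2))    # estimate of the "true" score
print("sample sd :", round(statistics.stdev(panel), 2))   # estimate of the scatter
```

Whether real panels behave like nine noisy rulers measuring the same rod is exactly the question.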
 

Vanshilar

On the Ice
Joined
Feb 24, 2014
I think this conclusion confuses what statisticians mean by “bias” with what figure skating fans mean when they use this word.

Plus, in hypothesis testing we have to be careful not to put the cart before the horse. Any collection of data will contain statistical oddities. Indeed, it would be statistically odd if it did not. It is a little bit like cheating to perform the experiment/take the sample first and then say, oh, look at this statistical anomaly.

Another difficulty is that his whole exercise is based on assumptions that IMHO require scrutiny before we apply them to IJS scores and the actual human judging process. For example, we are silently postulating: that there is a “true” mean score for a skater’s performance in the category of Interpretation, say; that this mean can be approximated by the average of the judges’ scores, perhaps over several competitions; that it is reasonable to expect that the actual judge’s scores should be more or less normally distributed about this mean with a standard deviation that can be estimated by sample standard deviations; that a skater’s performance in a given event is like picking a performance at random out of a hat containing all the performances that this skater might do; etc.

Eh if it wasn't clear, I was using bias in the non-statistical sense, i.e. favoritism, politics, corruption, whatever you want to call it.

Yes any collection of data will have statistical oddities (in fact this is one way to determine if data was actually collected or made up "by hand"). On the other hand, he didn't do the analysis in a vacuum. Yes, any skater could "stick out" when looking at just one analysis. But if the same skaters keep sticking out as outliers, that's when the data is indicative of something suspicious, or at least should be looked at more closely. I'm not sure what you mean by cheating by doing the experiment first and then looking at the statistical anomaly. We're essentially looking at historical data, not lab-controlled experimental data.
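To spell out the kind of check I mean, here is a rough sketch of flagging an unusually large PCS change between two events relative to how the rest of the field moved. The numbers are invented and this is not the physicist's actual code or data, just the shape of the comparison:

```python
# Sketch: flag skaters whose PCS change between two events sticks out
# relative to the rest of the field. Hypothetical numbers only.
import statistics

pcs_event_a = {"A": 62.0, "B": 66.5, "C": 61.0, "D": 58.5,
               "E": 67.0, "F": 59.5, "G": 63.0, "H": 60.0}
pcs_event_b = {"A": 63.0, "B": 67.0, "C": 61.5, "D": 66.5,
               "E": 67.0, "F": 61.0, "G": 63.5, "H": 60.5}

changes = {s: pcs_event_b[s] - pcs_event_a[s] for s in pcs_event_a}
mu = statistics.mean(changes.values())
sd = statistics.stdev(changes.values())

for skater, delta in changes.items():
    z = (delta - mu) / sd
    flag = "  <-- sticks out" if abs(z) > 2 else ""
    print(f"{skater}: change {delta:+.1f}, z = {z:+.2f}{flag}")
```

One anomaly in one event means little; the same skater showing up like this across several analyses is what I mean by "keeps sticking out."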

Regarding the underlying postulates, do you think any of them are invalid in this situation? For example, should the judging as a measurement method for the skater's interpretation be modeled in a different way? Is there a different interpretation of "judging" that you want to attach to the model? All models are wrong; but some are useful.
 
Joined
Jun 21, 2003
Regarding the underlying postulates, do you think any of them are invalid in this situation? For example, should the judging as a measurement method for the skater's interpretation be modeled in a different way? Is there a different interpretation of "judging" that you want to attach to the model? All models are wrong; but some are useful.

Well, this is what I think. There is a difference between judging and measuring. The kind of statistical analysis under view applies to data that you get by taking measurements. You take many measurements of the length of a steel rod, then you take the mean of those measurements, compute the standard deviation (hoping it is not too big), and there you are.

The IJS claims that assigning scores to various features of a figure skating performance is like that. The ISU wants us to believe that it is possible to make judging criteria so precise and to train judges so well that when judges do their thing they are measuring the merit of the program.

In 6.0 judges did not measure anything. Instead they judged whether one performance was more meritorious than another. If we want to carry out statistical analyses (I do! I do! :) ), the unit of study is the ranking of the skaters by each judge. We can still do all kinds of cool statistical tests, trying to identify national bias, favoritism, collusion, incompetence, etc., but we must use non-parametric methods rather than the type of statistics appropriate to measurements (i.e., all those formulas that have sigma over the square root of n in them ;) ).
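Here is the flavor of what I mean, in a throwaway sketch. The rankings are invented, and with real ordinals one would want a proper non-parametric test (and some care with ties), but the unit of study is each judge's ordering, not the points:

```python
# Sketch of a rank-based check: how closely one judge's ordering of the
# skaters agrees with the panel's consensus ordering. Hypothetical rankings;
# the simple Spearman formula below assumes no ties.

def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation between two rankings without ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

panel_order = [1, 2, 3, 4, 5, 6]   # consensus placement of six skaters
judge_order = [2, 1, 3, 4, 6, 5]   # one judge's placement of the same skaters

print(round(spearman_rho(panel_order, judge_order), 3))   # 0.886
# A judge whose value sits far below everyone else's, event after event,
# would be worth a closer look.
```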

Maybe I just like this type of mathematics better. (Note I did not say, I enjoy this type of mathematics more – that would be measuring. ;) )
 

capcomeback

On the Ice
Joined
Feb 23, 2014
There is some irony in this, as what the "study" was doing was "measuring" the judges (or at least how they judge skaters). :biggrin:

Granted, while not perfect, the general principles seem interesting.
 

Miss Ice

Let the sky fall~
Medalist
Joined
Apr 16, 2006
If anyone wants to talk about the "real possibility" of such a huge jump in PCS in 4 weeks, please watch the comparison between the Euro LP and the Sochi LP, and see if you can still make the same statement: https://www.youtube.com/watch?v=sMZuoi_Y4Cw

I mean, they are completely identical, except for two things. The first one has one extra double toe on one of the jumps and a messed-up combination. The second one has an OK combination (still a flutz and a toe-axel), no extra 2T, and a two-foot landing. Those are the only differences between the programs.
 

drivingmissdaisy

Record Breaker
Joined
Feb 17, 2010
Not bad, but anonymous judging has to go. Keeping scoring in the dark only creates greater opportunity for corruption (whether bribery, vote trading, voting blocs etc.). This also does not address the issues involving the Technical Panel.

I presume the point of anonymous judging was to keep the focus on the skating and not the composition of the panel, but even when the composition of the panel is acceptable we still see some wacky scores. The technical panel is a more difficult issue, because I do like that they come from countries outside of those represented by the top 5 skaters in the world. The only reason there was a Russian was because both Adelina and Liza underperformed at Worlds last year. The technical calls are always going to be disagreed upon; skaters like Mao and Mirai, who are often close to <, might get completely different calls based on how strict the caller is. The footwork issue needs to be addressed because it seems like that is a move that may be difficult for the panel to accurately assess in the short time they are given.
 
Joined
Jun 21, 2003
I presume the point of anonymous judging was to keep the focus on the skating and not the composition of the panel...

No, anonymous judging was put in place in 2002 as part of the "interim system" before the IJS was installed in 2003, in order to ensure that a Salt Lake City debacle could not happen again. With anonymous judging no one could confront the French judge about cheating, all judging controversies could be shrugged off, and the ISU is home free.
 

Anna K.

Medalist
Joined
Feb 22, 2014
Country
Latvia
No, anonymous judging was put in place in 2002 as part of the "interim system" before the IJS was installed in 2003, in order to ensure that a Salt Lake City debacle could not happen again. With anonymous judging no one could confront the French judge about cheating, all judging controversies could be shrugged off, and the ISU is home free.

Which was a genius solution :yes:
 

Julie K.

Rinkside
Joined
Mar 6, 2014
In some competitions, relying on only the middle 50% of the scores, after excluding the highest and lowest ones, might reduce the judging controversy a lot. For example, with nine judges, take the five scores in the middle, which would make it very inconvenient for a judge to favor or oppose a certain skater out of bias.
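A rough sketch of what I mean, with made-up marks (the exact trimming rule would of course be up to the ISU):

```python
# Sketch of the "keep only the middle scores" idea. Hypothetical marks.

def trimmed_mean(scores, keep=5):
    """Average the `keep` middle scores after sorting, discarding the extremes."""
    s = sorted(scores)
    cut = (len(s) - keep) // 2
    middle = s[cut:cut + keep]
    return sum(middle) / len(middle)

marks = [9.25, 8.50, 8.75, 8.25, 8.75, 8.50, 7.75, 8.50, 8.75]  # nine judges
print(trimmed_mean(marks, keep=5))   # 8.6, the mean of the five middle marks
```

With nine judges this keeps exactly the five middle marks, so a single inflated or deflated score never reaches the total.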
 

Vanshilar

On the Ice
Joined
Feb 24, 2014
Well, this is what I think. There is a difference between judging and measuring. The kind of statistical analysis under view applies to data that you get by taking measurements. You take many measurements of the length of a steel rod, then you take the mean of those measurements, compute the standard deviation (hoping it is not too big), and there you are.

Actually, judging is a type of measuring (or measuring is a type of judging, whichever definition you hold to be more expansive). Either term refers to comparing some entity relative to some other predefined metric. In measuring a steel rod, it's comparing the steel rod to a predefined unit of length. In figure skating, it's comparing a skater's performance to a predefined set of rules about how points are given out. The points are then summed up for each skater, and each skater is ranked according to the total. Some of the criteria for giving out points may not be quantified (e.g. the height of a jump, while it can be objectively measured, is a subjective estimate under the current judging method), but just because a measurement is subjective or not quantified does not make it arbitrary or capricious.

The points are not just ordinal but also cardinal -- they carry information about not only "whether or not" but also "how much". For example, it would be difficult to say whether or not a 3rd place routine in one competition is better than a 1st place routine in another competition with just the ranking alone as your information. But assuming that the points were given out in the same way in both competitions, you could say which was better, in the sense of, gained more points under the predefined rules. The problem is that judging may not be the same in different competitions, and indeed may not be the same for different skaters within a single competition, but at that point you have a calibration problem -- whether or not the measurement system is actually conforming to the agreed-upon metric.

I think you may be thinking of "judging" as deciding which is better or worse, i.e. eliciting preferences such as "I judge that vanilla is better than chocolate" or "I like brunettes more than blondes". In which case yes, preferences can only be ranked. But that's not what's happening here. The judges do not (ostensibly) directly decide which skater was better or worse; rather, that's what the rules do (or more precisely, how the rules are used). The rules specifically assign points to the different moves that a skater may do. We then interpret the points to rank who was better (i.e. skater X had more points than skater Y, thus skater X's performance was better than skater Y's). The judges are evaluating the skaters relative to those rules. The judges do not say that an axel is better than a lutz. They are (again, ostensibly) comparing a skater's jump to the definition of an axel, as well as comparing the jump's metrics (height, etc.) to predefined criteria for assessing GOEs. The judges do not (again, ostensibly) simply say "I like skater X better than Y, therefore skater X gets more points".

Of course, they're not supposed to, and so when there are differences between the given scores and the established rules, then people will suspect that the judges actually did go by "I like skater X more" or other personal reasons instead for giving out points and thus there will be controversy -- because the judges are effectively inserting their own preferences into the ranking system.

Because the judges are comparing each skater's performance to defined rules, and in fact do so in a quantitative way, it does make sense to use parametric methods to analyze the judging. The judges themselves are not defining value, but measuring it (more precisely, measuring a quantity to which we then ascribe value, not the value itself), and it's perfectly valid to consider them as measurement devices, with all the statistical implications. Now you can certainly claim things like the judges' scores aren't actually normally distributed and such, but that's a well-known modeling issue; very few things (or at least, things that we want to measure, such as weight, height, etc.) are actually normally distributed, if for no other reason than that they can only have positive values while any normal distribution has a finite probability of a negative value. Instead we take that as an acceptable modeling error and move on; this is why I said "All models are wrong; but some are useful" about your postulates. If I were bored I could probably argue that the judges are effectively measuring in a binomial sense (i.e. yes/no for each of the elements to receive points under the rules) and thus it makes sense that the scores should be approximately normally distributed via the central limit theorem, but I know that theorem is overly abused and I don't want to go through the assumptions (such as independence, which may not hold here).
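Since I brought it up, here is the toy version of that argument, with made-up numbers and the very independence assumption I just said may not hold:

```python
# Toy illustration of the "sum of many yes/no checks" argument:
# if a program were scored as many independent pass/fail criteria,
# the totals would come out approximately normal. Invented numbers;
# independence is assumed, which real judging may well violate.
import random
import statistics

random.seed(0)
N_BULLETS = 40    # pretend a program is scored as 40 independent yes/no criteria
P_HIT = 0.7       # chance each criterion is judged as met

totals = [sum(random.random() < P_HIT for _ in range(N_BULLETS))
          for _ in range(10_000)]

print("mean :", round(statistics.mean(totals), 2))   # about N_BULLETS * P_HIT = 28
print("stdev:", round(statistics.stdev(totals), 2))  # about sqrt(40*0.7*0.3) = 2.9
# Plot a histogram of `totals` and it looks close to a bell curve, which is
# the central limit theorem at work.
```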
 

Anna K.

Medalist
Joined
Feb 22, 2014
Country
Latvia
^
I like when you guys do it :yes:
I'd join you myself but gkelly just posted something for me on the Improvement Thread!
 
Joined
Jun 21, 2003
In figure skating, it's comparing a skater's performance to a predefined set of rules about how points are given out.

That is the part that the ISU has not yet convinced me of.

Do the judges, in awarding GOEs, really count up the bullets listed in the rule book, or do they just say, wow, that was great, +2. That wasn't bad, +1. Eh, but no errors. 0. (This skater deserves to win, +3.)

Same with program component scores. To me, it looks more like the judges are saying, I gave the last skater 8.50 and this skater was a little better, so I'll give her 8.75.

For skating skills, that is. Then I will repeat that mark three more times for Performance/Execution, Choreography, and Interpretation, give .25 less for Transitions, and call it a day. "Transitions" is almost always the lowest of the components. I think it is because this is the only one of the five components that actually can be measured. Patrick Chan got lower marks in Transitions for his Olympic free skate than he got in Choreography and Interpretation. Yet in real life Chan's transitions are exceptional and his choreography and interpretation are ordinary.

Try this. Take each skater's SS score, multiply by 5, and compare to the total PCS. I think we would be hard pressed to find evidence that P&E, CH, and INT are actually being independently measured against a well-defined standard.

Just a little bench experiment, from Worlds ladies LP:

Asada 72.88 versus 72.76
Lipnitskaia 68.32 versus 68.39
Pogorilaya 64.32 versus 64.05
Wagner 65.68 versus 65.88
Edmunds 60.00 versus 60.52
Kostner 74.32 versus 73.78
Gold 66.56 versus 66.69
Suzuki 67.12 versus 67.13
Park 56.32 versus 55.30
Murakami 60.88 versus 59.89
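(If anyone wants to reproduce that little bench test, here is a throwaway sketch using the pairs above; the first value in each pair is the SS-based figure as listed, the second is the actual factored PCS total.)

```python
# Reproduces the comparison above: SS-based figure vs. actual factored PCS.
pairs = {
    "Asada":       (72.88, 72.76),
    "Lipnitskaia": (68.32, 68.39),
    "Pogorilaya":  (64.32, 64.05),
    "Wagner":      (65.68, 65.88),
    "Edmunds":     (60.00, 60.52),
    "Kostner":     (74.32, 73.78),
    "Gold":        (66.56, 66.69),
    "Suzuki":      (67.12, 67.13),
    "Park":        (56.32, 55.30),
    "Murakami":    (60.88, 59.89),
}
for name, (ss_based, pcs_total) in pairs.items():
    print(f"{name:12s} diff = {ss_based - pcs_total:+.2f}")
# Mostly a few tenths of a point, never much more than a point.
```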

Because the judges are comparing each skater's performance to defined rules, and in fact do so in a quantitative way,…

That is the claim that I have trouble swallowing. I do not think this claim is supported by the actual data. I believe that if we look at the data itself, apart from ISU propaganda, we would discover that the judges are doing nothing of the kind.

Just my opinion, as always. :)
 

ManyCairns

Medalist
Joined
Mar 12, 2007
Country
United-States
^But isn't it the point that assessing conformance to the bulleted lists of standards is what the judges should be doing (or that is what is claimed, anyway), and that these stats are compelling, measurable evidence that they don't do that?

And not only that they don't, but that they really, really don't in some competitions, hence the results the Italian statistician came up with?

Maybe we should repeat his entire analysis for a few other competitions and see if only certain skaters are scored so far outside the parameters -- or if that only happened to that large a degree at the Olympics. In other words, I know the statistician looked at change across several competitions for each skater, but let's look at, say, Worlds 2013 in the same way and see if any skaters had big changes in statistical values in that competition compared to previous ones.

Speaking only of the logic of the thing of course -- no way I have the skills to do it myself!
 
Joined
Jun 21, 2003
^But isn't it the point that assessing conformance to the bulleted lists of standards is what the judges should be doing (or that is what is claimed, anyway), and that these stats are compelling, measurable evidence that they don't do that?

And not only that they don't, but that they really, really don't in some competitions, hence the results the Italian statistician came up with?

Yes, I agree with that. I guess I am just frustrated that we cannot zero in on particular judges because of anonymity.
 