I'm not educated in statistics, not even a little bit. So I'm just asking questions.
It seems to me that the "measurements" that judges make -- whether on a 10-point scale or a 6/7-point scale, whether of the whole program or of individual elements or individual aspects ("components") of the whole program -- are most comparable to measurements on a visual analog scale or similar rating system, where individuals are asked to rate perceptions, etc., on a scale of 1-10 or some other range. Whether these ratings are forced into discrete steps or not would depend on the rating mechanism.
Often those scales are used for measuring perceptions that are internal to the person doing the perceiving, e.g., pain. In that case, each person would be providing numbers related to their own individual object of perception. Joe can only rate Joe's own pain and Sally can only rate Sally's own pain. So differences in the numbers they each produce would vary based not only on the accuracy of their perceptions and on how they individually use the scale to translate perceptions into numbers, but also on variations between what they are each perceiving.
But it can still be useful for investigators to find an average level of pain perceived by subjects in a study under specific conditions. So how do they work with the rating numbers to produce usable averages? Would medians or means be more appropriate?
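To make the mean-vs-median question concrete, here is a minimal sketch in Python using invented 0-10 pain ratings (nothing here comes from real study data). The mean uses every rating but gets pulled by extreme raters; the median ignores how extreme an outlier is; a trimmed mean that drops the single highest and lowest rating (roughly the "drop high and low" rule used in some judged sports) sits in between:

```python
import statistics

# Invented 0-10 pain ratings from nine subjects; the 10 is one
# extreme rater who uses the top of the scale more readily.
ratings = [3, 4, 4, 5, 5, 5, 6, 6, 10]

mean = statistics.mean(ratings)      # pulled upward by the 10
median = statistics.median(ratings)  # ignores how extreme the 10 is

# Trimmed mean: drop the single highest and lowest rating, then average.
trimmed = statistics.mean(sorted(ratings)[1:-1])

print(f"mean={mean:.2f}, median={median}, trimmed={trimmed:.2f}")
# mean=5.33, median=5, trimmed=5.00
```

Which of these is "more appropriate" depends on whether extreme ratings are treated as real information about the perception or as quirks in how a rater uses the scale, to be discounted.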
Such scales could also be used in studies such as market research, where people are asked to evaluate an external object based on their own perceptions and preferences. Depending on what's being evaluated, there might be some degree of expertise involved and required of the evaluators, or it could be purely a matter of personal preference.
With judges evaluating skating, all the judges are evaluating the same object of perception, and there is expertise expected in being able to recognize and identify levels of technical skill and adherence to criteria. But the numbers they come up with will still vary based on the accuracy of their perceptions and on how they individually use the scale to translate perceptions into numbers. There isn't a single true number that represents the true objective measurement of a fixed parameter, such as the length of a rod (to use an example Mathman invoked a few months ago).
At best there will be a consensus as to the appropriate number that the ratings of trained experts will converge on. Would that be considered the "true" score for a skating program or aspect of a program?
Edited to add, my point is that I don't think this assumption is true:
(a) There is a correct mark, independent of our efforts to measure it.
There isn't any direct association between a given level of skill and the number 7.25 other than a consensus developed by the larger pool of trained judges of which any given judging panel is a subset.
It might be possible to define objective benchmarks for 7.0 and 8.0, for example. Maybe even 7.0 and 7.5. But for all actual examples that fall somewhere in between those benchmarks, it will still be up to the individual discretion of each judge to determine whether that intermediate example is best represented by the number 7.0, 7.25, or 7.5.
For GOEs, there are much clearer benchmarks already established. And very often the GOEs for a given element are unanimous, much more often than PCS. But even so, there is often a fluctuation between, say, 0 and +1 or 0 and -1, or sometimes between -1 and +1, as different judges perceive a completed element as slightly better or worse than the norm and draw the line at different points as to when to raise or lower the GOE.
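That fluctuation can be pictured with a toy model (every number here is invented, not calibrated to any real judging data): suppose an element is truly a bit better than the norm, each judge perceives that quality with some noise, and each judge has a slightly different cutoff for when "a bit better" deserves +1:

```python
import random

random.seed(1)

TRUE_QUALITY = 0.3  # element is slightly better than the norm (0 = norm)

def judge_goe(quality):
    """One judge's GOE: noisy perception plus a judge-specific cutoff.

    Both the noise level and the cutoff range are invented for
    illustration, not calibrated to real judging.
    """
    perceived = quality + random.gauss(0, 0.2)  # perceptual noise
    cutoff = random.uniform(0.1, 0.5)           # where this judge draws the line
    if perceived > cutoff:
        return +1
    if perceived < -cutoff:
        return -1
    return 0

panel = [judge_goe(TRUE_QUALITY) for _ in range(9)]
print(panel)  # typically a mix of 0s and +1s, occasionally a -1
```

No judge is wrong in this picture; the split between 0 and +1 falls directly out of honest perceptual noise plus honest differences in where each judge draws the line.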
Obviously, the more judges contributing data to the averaging process, the more "accurate" the results would be (which is exactly the reason for the concern over using fewer judges that this thread started with). But there's no independent measurement of the numerical value of a skating program or element aside from a consensus of experts -- there's no independent way to confirm whether any panel of judges got the right answer or not.
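The standard statistical intuition behind that is: under the (strong) assumption that each judge's mark is the consensus value plus independent noise, the scatter of the panel average shrinks like one over the square root of the panel size. A quick simulation, with all parameters invented:

```python
import random
import statistics

random.seed(0)

CONSENSUS = 7.25  # the mark the full expert pool would converge on
NOISE_SD = 0.5    # per-judge scatter around it; invented for illustration
TRIALS = 10_000

def panel_mean(n_judges):
    """Average mark from one simulated panel of n_judges."""
    return statistics.mean(
        CONSENSUS + random.gauss(0, NOISE_SD) for _ in range(n_judges)
    )

for n in (3, 5, 9, 15):
    results = [panel_mean(n) for _ in range(TRIALS)]
    print(f"{n:2d} judges: panel averages scatter with sd ~ "
          f"{statistics.stdev(results):.3f}")
# Scatter falls roughly as NOISE_SD / sqrt(n): ~0.29, 0.22, 0.17, 0.13
```

The caveat is the one just made: this measures convergence toward the consensus built into the model, not toward any independently verifiable right answer, and it ignores the fact that real judges' errors are not independent of each other.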
Given that this is the case, what is the best statistical method for crunching the numbers that a judging panel comes up with?
Is using larger panels the best, or only, way to ensure more "accurate" results?
I think randomly selecting some judges' scores not to count will always hurt statistical accuracy. The justification for random selection is that it enhances judges' ability to avoid outside political influences on their judging process.
My question is whether it really does have a positive effect on that ability. If yes, then it's worth keeping for reasons extrinsic to the statistical process. If no, then it's a worthless provision.
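On the pure statistics side, the claim that random selection always hurts can be checked with the same kind of toy model (again, all numbers invented): have 12 judges mark, then count either all 12 or a random 9, and compare how much the counted average scatters:

```python
import random
import statistics

random.seed(0)

CONSENSUS = 7.25  # invented "true" consensus mark
NOISE_SD = 0.5    # invented per-judge scatter
PANEL = 12        # judges who actually mark
COUNTED = 9       # marks randomly selected to count
TRIALS = 10_000

def trial(counted):
    """Counted average when `counted` of the PANEL marks are kept at random."""
    marks = [CONSENSUS + random.gauss(0, NOISE_SD) for _ in range(PANEL)]
    return statistics.mean(random.sample(marks, counted))

full = [trial(PANEL) for _ in range(TRIALS)]
subset = [trial(COUNTED) for _ in range(TRIALS)]

print(f"all {PANEL} counted:     sd ~ {statistics.stdev(full):.3f}")
print(f"random {COUNTED} of {PANEL} counted: sd ~ {statistics.stdev(subset):.3f}")
# Under this model the random drop only adds scatter (~0.17 vs ~0.14);
# honest, independent judges gain nothing from it.
```

So within the statistics alone, the drop is a pure loss; whether it pays for itself by deterring deal-making is exactly the extrinsic question.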