
Bring back the 6.0 mark

hockeyfan228

Record Breaker
Joined
Jul 26, 2003
In the latest Blades on Ice, Don Laws was interviewed about the CoP in his role as co-chair of the ISU Coaches Committee. The question was, "They still won't be showing which judges are giving what points per element?" Laws' answer was, "At this point it will still be random but they are still considering that, (and it) might not be a permanent situation. And that is perfectly okay for you to print that they are reconsidering a change here with random judging, but whether they will actually make a change, we don't know yet." So it may be possible yet that they decide it's a better trade-off to make the names of the judges public than to keep the judges from getting pressured by their federations.

It seems to me that there are now lots of stats for the ISU to use to track potential bias, incompetence, and lack of clarity. For example, they can figure out that a particular judge has no real idea of how to judge spins, but is fine with jumps, or vice versa. So the judge can be sent to spinning seminars or jumping seminars or footwork seminars, etc. They can also interview the judges and ask for justification based on the fairly clear descriptions. They may find that there are inconsistencies in the descriptions, and can clarify them.

One of the great things about this is that they don't have to prove intent. It won't matter. If the judge doesn't improve, they can de-certify the judge. There will be plenty of numbers to look at, and computers are really wonderful at crunching huge amounts of data, and there are plenty of algorithms that have been written to track trends.
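Just to make the idea concrete, here's a rough sketch of the kind of number-crunching I mean -- the judge names, marks, and the flagging threshold are all made up for illustration, and this isn't anything the ISU actually runs:

```python
# Rough sketch: flag judges whose marks drift consistently from the panel mean.
# Judge names, marks, and the 0.75-point threshold are made up for illustration.
from statistics import mean

# marks[judge] = one hypothetical mark per element
marks = {
    "Judge A": [1.0, 0.0, 1.0, 2.0, 1.0],
    "Judge B": [1.0, 0.0, 0.0, 2.0, 1.0],
    "Judge C": [2.0, 2.0, 3.0, 3.0, 2.0],   # consistently above the others
}

num_elements = len(next(iter(marks.values())))
panel_mean = [mean(m[i] for m in marks.values()) for i in range(num_elements)]

for judge, m in marks.items():
    bias = mean(m[i] - panel_mean[i] for i in range(num_elements))
    flag = "  <-- send to a seminar?" if abs(bias) > 0.75 else ""
    print(f"{judge}: average deviation from panel mean {bias:+.2f}{flag}")
```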

I have less of a problem with secret judging, if the ISU is able to police itself. CoP may give them the tools to do it. Now they just need the will.
 

Norlite

On the Ice
Joined
Jul 26, 2003
I'm very pressed for time this morning, but I would love to answer mpal's post. Maybe late tonight.

I did want to add one thing to hockeyfan's post re: secret judging. I had a few questions for Joe Inman and he has been gracious enough to answer each and every one. One was about the "secrecy" of judging and the skaters. He told me that he is still very happy to discuss their marks with the skaters who ask after competitions, and he does. He said he did this even after Nebelhorn with one or two skaters who asked, and he has spoken with the Euro judges and some don't have a problem with it either. I'm sure some do; we humans are all different after all, with different thoughts and opinions. Everyone talks about "secret judging" like it's some big cloak-and-dagger thing out of a James Bond movie. I prefer to think of it as a tool in place to make each individual judge feel as comfortable as we possibly can.
 

rain

Record Breaker
Joined
Jul 29, 2003
New system

I haven't seen the new system in action yet, so I'm not ready to condemn it, but there are a few questions.

Like you, I'm not a particular fan of one competitor being able to gain an insurmountable lead over even their closest rival, so even before the event is finished the outcome is determined. Wasn't that one of the things people hated about figures? It takes the excitement out of it.

Also, I don't see that this is going to solve the major problem of corrupt judges. There seems to still be room to manipulate the scores.


Further, I'm not sure I love the idea of it hurting skaters who try new, harder elements but don't succeed. To a certain extent, yeah, I don't like to see a splatfest, but I understand that to push the sport technically, the skaters need to try new things, and I worry that the new system encourages them to play it safe instead. Perhaps I'm worried for nothing, and it will simply encourage skaters - particularly in the men's competition - to actually have a program that is more than a string of jumps crammed into the allotted time.


Time will tell.
 
Joined
Jul 11, 2003
Regarding the comparison of the Random Drawing and the elimination of highs and lows: both are mechanisms to protect the anonymity of the judges. I would also like a good answer to mpal's question.

My only answer to that question is that it prevents an outburst of protest from the media, fans, and skaters such as happened in SLC. Again, if there are other reasons for keeping the anonymity, I would certainly like to read them.

Using the Random Drawing as was done at DC Worlds left some fans frustrated, particularly in the Dance competition. Who were the other judges whose marks were eliminated, and would a different random drawing have given a different result? A reasonable question, but there is no reasonable answer.

Using the Highs and Lows Elimination, which we've seen in some contests, there is less chance that a 'group' of judges could push a skater's ordinal scores. Please, I am not suggesting that this happens in all competitions, but it is a possibility, if not by design then by cultural subjectivity.

Neither system will allow for the display of a judge's score for an individual contestant. That's unfortunate for fans whose interest in figure skating goes beyond the results. Think baseball without all those statistics. Meaningless? Most probably, but oh, so fascinating to the fans.

I believe we will just have to suck up the results without knowing which judges were not in the random drawing. Speedy is on top of his game!:cry:

Joe
 

mzheng

Record Breaker
Joined
Jan 16, 2005
hockeyfan228 said:

I have less of a problem with secret judging, if the ISU is able to police itself. CoP may give them the tools to do it. Now they just need the will.

I agree with that. But from what I heard, under last season's interim system NO ONE knows who gave what marks. It was all secret, not even subject to the ISU's own 'police' review. I think the current breakdown given to the public by the ISU is nice and good enough to educate fans. I really don't think the general public needs to know which judge gave what mark for which element, as long as there is some third party or ISU officer to inspect the judges' marks.
 

hockeyfan228

Record Breaker
Joined
Jul 26, 2003
Re: New system

rain said:
Like you, I'm not a particular fan of one competitor being able to gain an insurmountable lead over even their closest rival, so even before the event is finished the outcome is determined. Wasn't that one of the things people hated about figures?

I don't think that was the primary reason for hating figures. First, figures were not telegenic, which was the stated excuse. Second, they seemed to the average viewer to have nothing to do with skating itself. Third, even if the viewer knew that they formed the basis of correct technique, why were they judged instead of how they were applied in the final result? That's like taking a ballerina's barre exercises or class center work into consideration when watching a performance of Swan Lake. Fourth, the judging standards were even more disparate than in judging free skating, and since no one was watching, and conventional wisdom said that the judges used them to pre-determine the outcome, they were immediately suspect. Fifth, because all scores were cumulative, without throwing out any highs and lows, a single judge could have a big impact.

Now, if a skater has an insurmountable lead after the SP or CD/OD, the networks can put together clips or commentary to explain exactly why that is the case, e.g. side by side shots of the combination jumps. And, in the majority of competitions, where the differences in ability are relatively small, the skater(s) with the "insurmountable" leads still have to deliver something to get some points; if they wipe out who knows what might happen, or how many competitors might "go for broke," which could be exciting in itself. The skills in the SP are similar to the skills in the LP. The only equivalent to school figures is the CD in Dance, and even there, the skaters are dancing. (It might be the only place where they are "dancing." :)) Also, when Petrenko and Boitano and Browning messed up in the short program in Lillehammer, in practice, the results were just as pre-ordained as when a skater had an insurmountable lead in figures. Bobek did move from 6th after the SP to 3rd in 1997 Nationals to capture the last place on the US squad, but that might happen more when the competition is under a single federation and less in top-level international competition.
 

Joined
Aug 3, 2003
More Statistical Mindbenders

Ah, Mathman,
Now that you've put some of your statistical analyses of the COP on the forum, here comes Rgirl to annoy you with her evil "applied statistics." Bwa-ha-ha-ha!

Actually, you and I are getting closer to agreeing. But as always, I must take issue with your interpretation of the "Right Score." The term used in applied statistics is "True Score," but that's not the point in this case. The point here is how does one use statistics in a judged situation to ensure each competitor (skater) receives AND IS JUDGED BY the closest thing to his/her True Score.

I think you do a bit of biased disservice to the True Score when you say if God were a figure skating judge, the True Score would be what S/He gives a skater's performance. It's not some abstract or as you put it "cosmic" thing. It's a testable score. It's just not an absolute score, like Pi. True Score is defined as "The score in whose determination there are no errors of measurement." The concept behind True Score is that if we could somehow accurately measure the skater's performance value over and over again, the mean of the skater's distribution of scores would be called the True Score. However, as you and I agree, in the real world of measurement there is always error, therefore we factor error into our calculations and interpretations of measurements. For those not familiar with statistics, the standard deviation of the distribution of scores is called the standard error of the mean, or just Standard Error (SE). SE, also often called the "standard deviation," means the standard deviation of the sampling distribution of Means. (BTW, "mean" as in "average" is not capitalized. I'm capping it here just to help clarify what is mumbo jumbo enough already:))

Of course it is impossible to judge a skater's performance enough times with enough different judges to find a mean of the distribution to identify the skater's True Score. It's also unnecessary because we can estimate the standard deviation of repeated measurements and use a confidence interval approach to make a probability statement about the skater's True Score. A confidence interval is just the range of scores that lie within the SE. For example, if the Mean (average) score +/- SE is 68.6 +/- .59, the confidence interval would be 68.01 to 69.19. In doing measurements relative to a population, eg, height in inches of 24-year-old men, we can calculate the degree of confidence depending on the size of the sample relative to the size of the true population. If we have a large enough sample size, we could say that we are 95% certain that our population mean height falls between 68.01 and 69.19. If we use an even larger sample, which would likely get us a larger SE, eg, 1.24, we could say that we are 99% certain that our population mean falls between 67.36 and 69.84.
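For anyone who wants to see the arithmetic, here's a tiny sketch of that interval using the made-up height numbers above:

```python
# Minimal sketch of a mean +/- SE interval, using the made-up height numbers above.
mean_height = 68.6   # sample mean, inches
se = 0.59            # standard error of the mean

print(f"interval: {mean_height - se:.2f} to {mean_height + se:.2f}")   # 68.01 to 69.19
```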

Of course it all gets a lot more messy when the measuring device is not a calibrated tape with inches marked on it but a human being who is assigning scores based on his/her interpretation of how well a skater executes certain moves according to the judge's aesthetic values or how well a jump is executed according to a set of standards. One judge may not care if he hears a skater's blades as long as the skater has good speed whereas another judge may care a lot. And how fast is "good speed?" (I still like my idea of measuring skaters' speeds several times during their programs with a radar gun, but I digress.) One judge may emphasize what the skater does "from the blades down" so to speak, that is, the skater's speed, edging, centering on spins, run-out on landings, etc. Another may emphasize what the skater does "above the blades," ie, line, movement flow, posture, jump height, difficult spin positions. Ideally, a good figure skating judge should take it all into account, but because they are human such biases are bound to exist. And let's not even get started on cultural preferences. So with figure skating, even more so than in gymnastics, IMHO, there are many, many areas for well-educated, reasonable, and fair judges to disagree. And THAT'S why I think it's so important that the best statistical methods possible, within reason, be used to judge figure skating.

Anyway, here's an example of how to test for the True Score in subjective judging situations. (For one thing, the 6.0 method made testing for this well nigh impossible because it had such big chunks of numbers. The COP makes it a lot more refined.) Anyway, let's take Michelle as the skater and have her skate "Aranjuez" 10 times over a period of five weeks when she is in peak condition. (We'll leave out the factor of competition for the time being.) We take a random selection of a dozen ISU judges as the panel for EACH of the 10 times Michelle skates "Aranjuez." In other words, Michelle will be judged by a different panel of 12 judges every time she skates in our attempt to establish her True Score.

Now, I won't assign values to the individual COP elements of her program (I'm not THAT possessed) but I will make up mean scores assigned by each judging panel. We won't throw out the high and low in this case, although it would be perfectly fine if we did and probably more accurate but it would require more pretend calculation from me and after all I'm just trying to demonstrate a point.

So let's say that the total mean scores, including the standard error, from each panel of judges for each of Michelle's performances of "Aranjuez" turns out like this:
1. 121.36 +/- 4.23
2. 123.43 +/- 5.67
3. 120.75 +/- 3.92
4. 124.55 +/- 4.87
5. 125.69 +/- 6.02
6. 122.26 +/- 5.33
7. 124.87 +/- 3.13
8. 118.98 +/- 4.84
9. 121.68 +/- 3.76
10. 122.39 +/- 5.29

If we calculate the Mean and SE of all these scores, we get a score of 122.60 +/-4.71. This means that 120 randomly selected judges watched Michelle perform "Aranjuez" 10 times over five weeks (two performances a week) and the average score plus or minus the standard error is 122.60 +/- 4.71. If we got into levels of significance and more sophisticated statistical analyses than is reasonable to do here, at least for me, we could say that we are X% certain that this is Michelle's True Score for her skating of "Aranjuez."
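Here's a quick sketch reproducing that arithmetic -- the mean of the ten panel means, and (as I did above) simply averaging the reported SEs to get the 4.71; a proper pooled standard error would be computed differently, but the point is only to show where the numbers come from:

```python
from statistics import mean

# The ten made-up panel means and standard errors from the list above
panel_means = [121.36, 123.43, 120.75, 124.55, 125.69,
               122.26, 124.87, 118.98, 121.68, 122.39]
panel_ses   = [4.23, 5.67, 3.92, 4.87, 6.02, 5.33, 3.13, 4.84, 3.76, 5.29]

print(f"{mean(panel_means):.2f} +/- {mean(panel_ses):.2f}")   # 122.60 +/- 4.71
```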

So True Score is something we can calculate, but in a judged sport such as figure skating it is not feasible to assess a skater's True Score on an individual basis. However, the larger our sample of judges, the more likely it is that the average of the judges' scores will reflect the skater's True Score. I have no problem with throwing out the high and the low, but I would be more inclined to use all the other scores rather than winnowing it down to five. IMO, the probability of error is higher using the scores of five judges than it is with the scores of 12. The smaller the sample out of the population of all figure skating judges, to me, the greater the error. I realize this is supposed to reduce the possibility of collusion and cheating, but I think it does so at the expense of statistical accuracy. I think the way to reduce collusion and cheating is for the ISU to crack down on overly biased, cheating, and/or colluding judges with severe penalties. Unfortunately, the judging system seems doomed to corruption as long as Speedy is in charge. Until heads roll, I guess we'll have to give up statistical accuracy in the hope that this random selection business helps to ensure fairness.

Nothing-to-do-With-Reality Question: What would you say if the SE were determined for the skaters' scores under the COP and the scores for the top two skaters were very close and their SEs significantly overlapped?

BTW, regarding some of the results of last year, you said:
"If the former, the reason that it bothers so many people is that the five who were randomly eliminated might have swung the contest the other way. Suppose the 9 "real" judges split 5-4 in favor of skater A, while the 5 "dummy" judges (the ones who were eliminated at the outset) all favored skater B. Then there is a sense in some people's mind that skater B was the real winner, 9 to 5, and was deprived of her rightful reward by an unlucky role of the dice. It is widely believed that this happened to Sasha Cohen against Viktoria Volchkova in the Cup of Russia last year, for instance.

This objection is not quite mathematically sound, but has strong emotional appeal, especially to fans of skater B."

I've always understood that the "dummy" judges scores never counted, as if they never existed, but the problem I have is in fact a mathematical one. First of all, the problem of the total scores of the judges seeming not to reflect who actually won not only happened at COR, but also at NHK. What bothered me was, even though those judges' scores were never meant to count, again, as if they'd never existed, in fact they did exist and the viewers saw them. Thus when one skater's raw scores seemed so clearly ahead of another skater's, but the skater with the lower overall scores won, my question was, "How valid are those nine preselected scores? If they don't reflect what a greater number of judges saw, are they fair? Also, how does this reflect on the reliability of ISU judging?" Forgetting the "dummy" judges idea for a moment, hypothetically speaking, if 9 judges out of a panel of 14 put Skater A in 1st place (as it was under the 6.0 system) and 5 judges put Skater B in 1st place, but then because of the preselected random selection, 1st place was given to Skater B, that means 64% of a 14-judge panel scored Skater A incorrectly. The judges did not know if their scores would count or not, so we're assuming they are judging as if their scores would count. I understand how from a purely mathematics POV it just plain does not matter what the 5 nonselected judges do because it's as if they were never there. But given that we did see all the raw scores, what it makes me question is how RELIABLE the smaller groups of judges' scores are going to be in accurately scoring the skaters. If 36% of ISU judges score the skater with weaker skating skills as being better than the skater 64% of the judges scored as better, then what is the reliability of ISU judging as a whole?
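To put one rough number on the scenario I just described (purely illustrative, and only under the simplified picture where 9 of 14 judges favor Skater A, 5 favor Skater B, and 9 judges are drawn at random): the draw can only flip the result if all 5 of B's judges happen to be among the 9 selected.

```python
from math import comb

# Simplified picture: 9 of 14 judges favor Skater A, 5 favor Skater B,
# and 9 of the 14 are drawn at random to form the counting panel.
# B can only carry the drawn panel (5 of 9) if all 5 B-judges are selected.
p_flip = comb(9, 4) / comb(14, 9)   # the other 4 seats come from A's 9 judges
print(f"chance the draw flips a 9-5 split: {p_flip:.3f}")   # about 0.063
```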

I'm hoping that the far more detailed analysis afforded by the COP, plus the referee, the computer analysis, and the whole new system in general will help improve the judging and that it was the limitation of the 6.0 system that made the results of several competitions last year questionable, not the ineptitude of a minority, yet a significant one, of the judges. The reason I bring this up is that for any measuring device, you have to determine both its validity and reliability. How do we know that the ISU judges are valid and reliable in measuring which skater is better than another?
Rgirl
 
Joined
Jun 21, 2003
Well, I guess I need to shake the cobwebs out of my brain (aka, the steel trap) on this issue. Let's see here...
Rgirl said:
Ah, Mathman,
Now that you've put some of your statistical analyses of the COP on the forum, here come Rgirl to annoy you with her evil "applied statistics." Bwa-ha-ha-ha!
The way I look at it, there is no such mathematical discipline as "applied statistics." There is statistics. After you study this topic, then you can apply it in various settings.
Actually, you and I are getting closer to agreeing.
As far as I can tell, your post agrees 100% with what I said in the last 5 paragraphs of my post to Norlite, part1, modulo a couple of inconsequential details, noted below.
But as always, I must take issue with your interpretation of the "Right Score." The term used in applied statistics is "True Score," but that's not the point in this case.
Indeed. The point is not what we call it, but rather that we agree as to what it means. The True Score is the mean of a set of numbers (called the "population" -- unfortunately so, since this word makes it sound like we are talking about people rather than numbers. Similarly, "true" carries unfortunate linguistic connotations, IMHO.) Anyway, the thing that we do have to agree on is just what that population is.
The point here is how does one use statistics in a judged situation to ensure each competitor (skater) receives AND IS JUDGED BY the closest thing to his/her True Score.
This is a metamathematical question. In this exchange we are discussing one way of posing the problem statistically. Only after the problem has been well-formed can we turn to statistical techniques to try to solve it.
I think you do a bit of biased disservice to the True Score when you say if God were a figure skating judge, the True Score would be what S/He gives a skater's performance. It's not some abstract or as you put it "cosmic" thing. It's a testable score. It's just not an absolute score, like Pi.
The business about God was to remind us that statistical methods have their limitations. There are other possible models of absolute truth. However, they are not subject to statistical analysis, so let's put all that aside and get on with the statistics.
True Score is defined as "The score in whose determination there are no errors of measurement." The concept behind True Score is that if we could somehow accurately measure the skater's performance value over and over again, the mean of the skater's distribution of scores would be called the True Score. However, as you and I agree, in the real world of measurement there is always error, therefore we factor error into our calculations and interpretations of measurements.
I would like to strangle the author of your textbook. Statements like "The True Score is defined to be the score in whose determination there are no errors of measurement" introduce a totally unnecessary confusion between just plain old measuring something wrong (get a better ruler) and the kind of error that is amenable to statistical analysis (sampling error -- i.e., estimating the likelihood that the mean of a sample is different from the mean of the population from which it is chosen).
For those not familiar with statistics, the standard deviation of the distribution of scores is called the standard error of the mean, or just Standard Error (SE). SE, also often called the "standard deviation," means the standard deviation of the sampling distribution of means.
The second sentence is correct, although it is better to word it like this: The standard error is the standard deviation of the distribution of sample means. (BTW, all of the samples must be of the same size to treat this idea mathematically.) The first sentence needs some clarification. The Standard error, S.E., is not the standard deviation of the scores of the judges -- it is this standard deviation divided by the square root of the size of the judging panel.

BTW, that is all you need to know to make the argument against reducing the judging panel from 14 to 9 -- it increases the standard error by a factor of sqrt(14)/sqrt(9) = 1.25 -- a 25% increase, if you buy into this model.
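A two-line check of that factor, assuming (per this model) that the judges' standard deviation s is the same for either panel size:

```python
from math import sqrt

s = 1.0                    # judges' standard deviation (any units; it cancels in the ratio)
se_14 = s / sqrt(14)       # standard error with a 14-judge panel
se_9  = s / sqrt(9)        # standard error with a 9-judge panel
print(f"SE ratio, 9 judges vs 14: {se_9 / se_14:.2f}")   # about 1.25
```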
Of course it is impossible to judge a skater's performance enough times with enough different judges to find a mean of the distribution to identify the skater's True Score. It's also unnecessary because we can estimate the standard deviation of repeated measurements and use a confidence interval approach to make a probability statement about the skater's True Score. A confidence interval is just the range of scores that lie within the SE. For example, if the Mean (average) score +/- SE is 68.6 +/- .59, the confidence interval would be 68.01 to 69.19.
To all of you who have read this far :laugh: , what Rgirl means is that we computed the mean m and the standard deviation s of the sample, then we estimated the Standard Error by the formula s/sqrt(n) to come up with the .59. The .59 is not the standard deviation of the actual data. (Just wanted to clear that up, LOL.)
In doing measurements relative to a population, eg, height in inches of 24-year-old men, we can calculate the degree of confidence depending on the size of the sample relative to the size of the true population. If we have a large enough sample size, we could say that we are 95% certain that our population mean height falls between 68.01 and 69.19.
Hmm. We have already taken the sample size into account when we computed the standard error. We are only 68% sure that the mean height falls into this range (+/- one S.E. gives a 68% confidence interval).
If we use an even larger sample, which would likely get us a larger SE, eg, 1.24, we could say that we are 99% certain that our population mean falls between 67.36 and 69.84.
If we took a larger sample size we would get a smaller Standard Error (else why would we do it?). The result would be a narrower interval about which we could be 68% sure that it captures the true mean.

If you want to be 95% sure you have to expand the confidence interval to +/- 1.96 Standard Errors. To get a 99% confidence interval for the true mean you must expand the interval to +/- 2.58 S.E.s. (Even more if the sample is so small that you have to use the Student's t distribution instead of the normal (Gaussian) distribution of sample means.)

(Aside to the myriad readers of this post: For a fuller explanation, look in your statistics book under Statistical Inference -> Confidence Intervals for the Mean -> Central Limit Theorem.)
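And a small sketch tying those multipliers together, with made-up numbers (sample mean 68.6, sample standard deviation chosen so that s/sqrt(n) comes out to about .59 with n = 12); strictly speaking, with a sample this small you would use Student's t rather than these z values, but it shows the mechanics:

```python
from math import sqrt

def confidence_interval(sample_mean, s, n, z):
    """sample_mean +/- z * s/sqrt(n), per the formula in the discussion."""
    se = s / sqrt(n)
    return sample_mean - z * se, sample_mean + z * se

# Made-up numbers: sample mean 68.6, sample standard deviation 2.04, n = 12,
# so s/sqrt(n) is about 0.59.  z of 1.00, 1.96, 2.58 give roughly 68%, 95%,
# and 99% intervals in the large-sample (normal) case.
for z in (1.00, 1.96, 2.58):
    low, high = confidence_interval(68.6, 2.04, 12, z)
    print(f"z = {z:.2f}: {low:.2f} to {high:.2f}")
```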
Of course it all gets a lot more messy when the measuring device is not a calibrated tape with inches marked on it but a human being who is assigning scores based on his/her interpretation of how well a skater executes certain moves according to the judge's aesthetic values or how well a jump is executed according to a set of standards. One judge may not care if he hears a skater's blades as long as the skater has good speed whereas another judge may care a lot. And how fast is "good speed?" (I still like my idea of measuring skaters' speeds several times during their programs with a radar gun, but I digress.) One judge may emphasize what the skater does "from the blades down" so to speak, that is, the skater's speed, edging, centering on spins, run-out on landings, etc. Another may emphasize what the skater does "above the blades," ie, line, movement flow, posture, jump height, difficult spin positions. Ideally, a good figure skating judge should take it all into account, but because they are human such biases are bound to exist. And let's not even get started on cultural preferences. So with figure skating, even more so than in gymnastics, IMHO, there are many, many areas for well-educated, reasonable, and fair judges to disagree. And THAT'S why I think it's so important that the best statistical methods possible, within reason, be used to judge figure skating.
I agree with this conclusion. Who wouldn't?
Anyway, here's an example of how to test for the True Score in subjective judging situations. (For one thing, the 6.0 method made testing for this well nigh impossible because it had such big chunks of numbers. The COP makes it a lot more refined.) Anyway, let's take Michelle as the skater and have her skate "Aranjuez" 10 times over a period of five weeks when she is in peak condition. (We'll leave out the factor of competition for the time being.) We take a random selection of a dozen ISU judges as the panel for EACH of the 10 times Michelle skates "Aranjuez." In other words, Michelle will be judged by a different panel of 12 judges every time she skates in our attempt to establish her True Score.

Now, I won't assign values to the individual COP elements of her program (I'm not THAT possessed) but I will make up mean scores assigned by each judging panel. We won't throw out the high and low in this case, although it would be perfectly fine if we did and probably more accurate but it would require more pretend calculation from me and after all I'm just trying to demonstrate a point.

So let's say that the total mean scores, including the standard error, from each panel of judges for each of Michelle's performances of "Aranjuez" turns out like this:
1. 121.36 +/- 4.23
2. 123.43 +/- 5.67
3. 120.75 +/- 3.92
4. 124.55 +/- 4.87
5. 125.69 +/- 6.02
6. 122.26 +/- 5.33
7. 124.87 +/- 3.13
8. 118.98 +/- 4.84
9. 121.68 +/- 3.76
10. 122.39 +/- 5.29

If we calculate the Mean and SE of all these scores, we get a score of 122.60 +/-4.71. This means that 120 randomly selected judges watched Michelle perform "Aranjuez" 10 times over five weeks (two performances a week) and the average score plus or minus the standard error is 122.60 +/- 4.71. If we got into levels of significance and more sophisticated statistical analyses than is reasonable to do here, at least for me, we could say that we are X% certain that this is Michelle's True Score for her skating of "Aranjuez."
Rgirl, my best buddy on this board, ignore everything else I've written. But here you have made an absolute mistake, period. We do not seek a "True Score" for all the times that Michelle might skate this program. We seek a "True Score" for the one particular performance that is being judged. Now the confusion about just what is the "population" whose mean we are calling the True Score has come home to roost. That population is the set of all scores that all possible qualified judges would give to that performance.

The best available estimate of that mean (the True Score for that performance) is obtained by selecting the largest possible panel of judges and using all of their marks. If you had 10,000 judges, the standard error would be so small (because you get to divide the standard deviation by the square root of n to determine the standard error) that you could be pretty sure that we have done everything statistically possible to ensure that justice was done -- everything, that is, except to increase the judging panel to 10,001.

That is the statistical argument in favor of using the marks of all 14 judges.
So True Score is something we can calculate, but in a judged sport such as figure skating it is not feasible to assess a skater's True Score on an individual basis. However, the larger our sample of judges, the more likely it is that the average of the judges' scores will reflect the skater's True Score.
I couldn't agree more. Not only does statistical sampling theory support this, but so does common sense, provided we are in agreement as to exactly what we mean by the True Mean. Obviously, a poll is going to be more accurate if you have a large sample.

The conclusion of this argument is that we ought to increase the panel of judges to, say, 25. Maybe the WSF will propose this. Why not? We don't pay the judges anyway; just extend the row by a few seats.

--------------------------------------------------------------------------

Well, th-th-th-that's all for now. I'm about halfway through your post, Rgirl, LOL. Putting in all of these [ / B's ] and [Quote's] is a [ Quote/"B"]. Just one little PS for now. This is not the only mathematical model for figure skating judging.

Mathman :\
 
Joined
Jun 21, 2003
OT -- Let me explain one more thing. The reason that I am going ballistic on you about this "applied statistics" thing is that this happens to be one of my pet peeves. If you go to any major University, such as the one where I teach, you will find a course in "biostatistics" offered by the biology department, a course in "medical statistics" offered by the medical school, a course in "engineering statistics" offered by the college of engineering, a course in "business statistics" offered by the business school, a course in "sports statistics" offered by the physical education department, etc.

What's wrong with this picture: you want to learn mathematics, so naturally you go to a biologist, an MD, an engineer, a businessperson and a gym teacher?

What am I, chopped liver?

MM ;)
 
Joined
Aug 3, 2003
Chopped Liver,
You say tomato, I say tomahto...

Hey, a couple of pieces of rye bread and we got a sandwich.

En garde! I know it drives you crazy when I talk about applied statistics, but I was taught my statistics by both a pure math statistician and also by what was at least called an applied statistician. Personally, I think it's a difference of focus. Do you consider the pure math no matter what you are investigating or does what you are investigating affect how you interpret (not apply) the math? I think a better word for "applied statistics" would be "interpretive statistics." Anyway, more semantics.

The reason every major university has subsets of statistics courses such as you mentioned is because there are problems unique to each subject. All I can say (without doing another giant post) is that things that were black and white when I was learning the pure math of statistics often became gray when in other classes, I had to apply them to real world situations and studies.

Too tired to go point by point through our disagreements, but if you want to strangle the author of my statistics textbook(s), you will have to grapple with about 15 of them, because the info I took was from about five different texts with at least three authors each.

As for what you called the one, two, and three standard deviations being 68%, 95%, and 99%--that's not what I was talking about at that point. I was going by a .05 level of significance for the 95% confidence interval and a .01 level of significance for the 99% confidence interval.

Also, Mathman, the best chopped liver on this forum, I used the example of how we MIGHT calculate Michelle's True Score (it's an accepted term) over 10 performances as an example of how the True Score is a real, calculable number and not the score "God" would pick for Michelle. Of course I am completely and totally aware that we are not the least interested in what any skater does over a series of performances, that we are only interested in getting the most accurate score for each skater for a given competition. It was just a method of explanation for those who aren't familiar with any of this, if they cared to read it. Geez, Mathman, what am I? Dry pastrami?

Gotta sleep, then I'll come back for more of "The Neverending Argument Between Mathman and Rgirl (or Rgirl and Mathman) on Statistics, Judging, and Figure Skating," which I say we should ultimately publish as a paper in some cheesy journal for the ISU to read--under pseudonyms, natch:)
Rgirl
 
Joined
Jun 21, 2003
Rgirl, Rgirl. Why do you do me this way? I'm nice, really I am, and now you are forcing me to be mean.

Do you consider the pure math no matter what you are investigating or does what you are investigating affect how you interpret (not apply) the math?
The former. See example below.
The reason every major university has subsets of statistics courses such as you mentioned is because there are problems unique to each subject.
Not so. The problems are all the same. Different departments want to get into the act because of turf considerations.
All I can say (without doing another giant post) is that things that were black and white when I was learning the pure math of statistics often became gray when in other classes, I had to apply them to real world situations and studies.
In mathematics, things are black and white. That is almost the defining characteristic of mathematics. The statistics do not become grey. Maybe some other aspects of the problem do, such as deciding exactly what statistical question you want to ask.
...if you want to strangle the author of my statistics textbook(s), you will have to grapple with about 15 of them, because the info I took was from about five different texts with at least three authors each.
Sad, but true.
As for what you called the one, two, and three standard deviations being 68%, 95%, and 99%--that's not what I was talking about at that point. I was going by a .05 level of significance for the 95% confidence interval and a .01 level of significance for the 99% confidence interval.
I understand what you were going for. Now in turn I ask you to understand what I was saying: you did it wrong.

The critical value (assuming a large enough sample size) for the .025 level of significance corresponding to the 95% confidence interval is z = 1.96. For the .005 level of significance corresponding to a 99% confidence interval it is z = 2.58. The formula for the confidence interval is

True Mean = Sample mean +/- z*s/sqrt(n), where s is the standard deviation of the sample.
I used the example of how we MIGHT calculate Michelle's True Score (it's an accepted term) over 10 performances as an example of how the True Score is a real, calcuable number and not the score "God" would pick for Michelle....It was just a method of explanation for those who aren't familiar with any of this, if they cared to read it.
A confusing method of explanation, then, IMHO, since you explained how we might estimate a number in which we have no interest in the middle of a discussion of a very closely related number that we do care about -- the right score for Michelle's performance.

I think you do a bit of biased disservice to the True Score when you say if God were a figure skating judge, the True Score would be what S/He gives a skater's performance....The True Score is defined as "The score in whose determination there are no errors of measurement."
Oh, I get it. The True Score is "the score in whose determination there are no errors of measurement." In other words, the measurement that God would make.

Now I will give an example of what I am talking about.

Here are three numbers: 131.20, 145.70, 138.90. Find a 95% confidence interval for the true mean.

Answer: The sample mean = 138.60, the sample standard deviation is s = 7.25, the Standard Error of the sampling mean is S.E. = 4.19, the critical value for the Student's t distribution with 2 degrees of freedom is t(.025) = 4.303. And the confidence interval is

True Mean = 138.6 +/- 18.0
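A quick check of those numbers (the 4.303 critical value is the one quoted above for Student's t with 2 degrees of freedom):

```python
from math import sqrt
from statistics import mean, stdev

scores = [131.20, 145.70, 138.90]   # the three judges' scores above
m = mean(scores)                    # 138.60
s = stdev(scores)                   # sample standard deviation, about 7.25
se = s / sqrt(len(scores))          # about 4.19
t_crit = 4.303                      # t(.025), 2 degrees of freedom, as quoted above
print(f"95% CI: {m:.1f} +/- {t_crit * se:.1f}")   # 138.6 +/- 18.0
```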

What do these numbers represent? Maybe they are a patient's systolic blood pressure reading taken at three random times. In that case, we are studying "medical statistics."

Maybe they represent the value of hog belly futures on the NYSE. Now we are learning "business statistics."

Maybe they are the score that three randomly selected judges gave to Michelle's performance of Aranjuez at worlds. Aha! Figure skating statistics!

Mathman
 
Joined
Aug 3, 2003
Mathman,
I promise to keep this short. Scout's honor. Okay, I was a Bluebird.

Originally posted by Mathman:
"In mathematics, things are black and white."
ITA.
"That is almost the defining characteristic of mathematics."
ITA.
"The statistics do not become grey."
ITA.
"Maybe some other aspects of the problem do, such as deciding exactly what statistical question you want to ask."
YES! And there's the rub! When I was taking advanced statistics in preparation for doing my master's thesis almost 20 years ago, the University of Utah had all the master's and PhD candidates for math, physics, chemistry, engineering, etc. take one adv. stat course and all the advanced degree candidates in fields such as health, anatomy, sociology, political science, etc. take another one. I was dating a physical chemist at the time so we compared notes. We all learned the same equations, statistical concepts, etc. Where things changed was in terms of application. Measuring people is different than measuring forces. You may disagree, but it just is. Especially when you're using one set of people to measure other people. The numbers are the same, of course, but the things you're measuring and why you want to measure them are different. That's how I see statistics as applied to figure skating. You've got people measuring the activities of other people and for this there are considerations that go beyond the black and white of numbers, IMO.

As I've said before, for me, the question is how best to use statistics to get the most valid and reliable score for a given performance of a given skater. I think the COP is a step in the right direction. I think winnowing down a panel of 14 judges to five is a step in the wrong direction. Throwing out the high and low is fine; it's a standard and accepted statistical method. But as you've said and demonstrated mathematically, the smaller the sample of judges, the greater the error.

Finally, I don't think you're being mean at all. You come from a background where one cannot argue the rightness or wrongness of an equation. Either it works or it doesn't. I come from a background where almost everything was variable and thus open to argument. Anyway, there was a guy in my adv stat class who had a bachelor's in math, a master's in anatomy, and was going for his PhD in biomechanics. Every time the professor would point out a gray area--not with the equations for the statistics, but with possible problems one might encounter in applying certain statistical methods--the guy would go nuts. I promised to keep this short, so I'll just say that the professor was unflappable, exceedingly brilliant, and would address the guy's challenges calmly and patiently--as long as he didn't take up too much class time. I don't think the guy EVER accepted the gray zones, but he could never outreason the professor. He was educated in absolutes.

So show my calculations to be wrong, criticize my "teaching" methods, take my stats apart--that's what these exchanges are for. But I maintain that there is more to consider in terms of figure skating judging than sampling theory--for one thing, the validity and reliability of the judges' scores. (There are other things too, but keeping it short.) If a measure is not valid and reliable, what good is it? It's like weighing people on multiple uncalibrated scales. My question is, with or without the COP, have figure skating judges ever been tested for the validity and reliability of the scores they give skaters? I don't know, but this seems to me to be an important question. Another question is this: Although the score the skater gets is the mean of the randomly selected judges minus the high and low, there is no standard error. Maybe this is just theoretical, but if SE were included in the scoring means, what if the two top skaters were not only very close in terms of their mean scores but also had overlapping SEs? I know they would still give the gold to the skater with the highest score, but if the SEs significantly overlap, doesn't that discount the difference at least statistically? Finally, I remember once you gave an equation for the mean and I asked about the SE. You responded either that the equation accounted for error or that this was an equation for an absolute mean that had no error. It was after COR last year. Do you remember that? Is there such an equation for a mean with no error or the error accounted for by the equation?
R+/-girl
(Not very short, sorry:eek:)
 
Joined
Jun 21, 2003
Rgirl, obviously it's great that you had an excellent course in statistics from a professor who knew his stuff and was a good teacher. I myself have taught "business statistics" for business schools and "engineering statistics" at colleges of engineering. (I have also taught both "business calculus" and "engineering calculus," etc.) It was my experience that, while the mathematics was the same, the students were more interested in the subject when the techniques were introduced in contexts that were relevant to the students' major fields.

(Hmm. Pretend that I had the energy to go back and rewrite that paragraph in a way that didn't sound so schmucky and condescending.)

About using statistical methods to study people, yeah, I agree, the wonder is that we can say anything sensible at all.
As I've said before, for me, the question is how best to use statistics to get the most valid and reliable score for a given performance of a given skater. I think the COP is a step in the right direction. I think winnowing down a panel of 14 judges to five is a step in the wrong direction.
I'm too lazy to search, but I think this is the first time that the statistical concepts of validity and reliability have been mentioned on this thread. (See below.)

I think the CoP is a step in the right direction, too. I guess. I think that we have yet to come to a consensus as to what direction we are trying to go in. The whole judging reform push had nothing to do with accuracy in judging, it had to do with the ISU wanting to save face after the Olympic pairs scandal. So I don't think that we are all on the same page yet about what we want the CoP to accomplish. If the goal is to confuse matters so much that we can never catch Mr. Cinquanta with his trousers lowered, then winnowing the panel from 14 to 9 is a giant step in the "right direction."

Remember that the old method only had 9 judges in the first place. The Interim thing and the CoP said, let's expand this to 14, then cut it back down to 9. So that's a wash, statistically speaking. Mathematically, a step in the "right direction" would be to expand the panel to 14 for real (under the CoP it would be OK to have an even number of judges), or better yet, to 25.
But I maintain that there is more to consider in terms of figure skating judging than sampling theory--for one thing, the validity and reliability of the judges' scores. (There are other things too, but keeping it short.) If a measure is not valid and reliable, what good is it? It's like weighing people on multiple uncalibrated scales. My question is, with or without the COP, have figure skating judges ever been tested for the validity and reliability of the scores they give skaters?
Now yer talkin'! I doubt that any serious reliability and validity studies have been done, but I might be wrong about that. Somewhere in the certification process for judges, as they qualify to judge at higher and higher levels, there must be some kind of data on consistency at least. This would make a tremendously interesting study. I am sure that lots of people have thought about how to design such a study.

I think a study of this type would be more complicated for the CoP than for an ordinal based system. With ordinals, it would be perfectly OK for judge A to give Vika a 5.1 and AP a 5.0, while judge B gave Vika a 5.9 and AP a 5.6. The ordinals are in perfect agreement.
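To make that concrete, a toy sketch with those two hypothetical judges' marks:

```python
# Two hypothetical judges give different raw marks but identical ordinals.
marks_a = {"Vika": 5.1, "AP": 5.0}
marks_b = {"Vika": 5.9, "AP": 5.6}

def ordinals(marks):
    # rank skaters 1, 2, ... by descending mark
    ranked = sorted(marks, key=marks.get, reverse=True)
    return {skater: place for place, skater in enumerate(ranked, start=1)}

print(ordinals(marks_a))   # {'Vika': 1, 'AP': 2}
print(ordinals(marks_b))   # {'Vika': 1, 'AP': 2} -- perfect agreement on ordinals
```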
Maybe this is just theoretical, but if SE were included in the scoring means, what if the two top skaters were not only very close in terms of their mean scores but also had overlapping SEs? I know they would still give the gold to the skater with the highest score, but if the SEs significantly overlap, doesn't that discount the difference at least statistically?
Yes, this would mean that there is a non-negligible probability that our sampling procedure to determine the higher of the two "True Means" failed and we gave the gold to the wrong person. I do not see any way to eliminate this possibility in a close contest (that's what "close" means), by any technique, statistical or otherwise. But at least we would know whether we have a legitimate argument when we start shouting about who was robbed.
Finally, I remember once you gave an equation for the mean and I asked about the SE. You responded either that the equation accounted for error or that this was an equation for an absolute mean that had no error. It was after COR last year. Do you remember that? Is there such an equation for a mean with no error or the error accounted for by the equation?
My memory is a little fuzzy about that, but I think I was talking about the option of using a different mathematical model altogether. In sampling theory, there is no possibility of eliminating sampling error since there is no way to guarantee that the sample perfectly mirrors the population from which it is drawn.

The alternative way of looking at it, if you don't like sampling theory, is this. Once the judges are seated, whatever sleight of hand has gone on before (choosing 9 out of 14, e.g.), those 9 judges become the whole universe. The idea that they are also part of a pool of many judges ceases to be relevant. To win the contest, the skater must win the favor of those 9 judges and nobody else. Under the old system, if 5 judges voted for you -- that's it. Period. End of discussion. There is no possibility of "statistical error." 5 is 5, it is not 5 +/- .001.

In other words, the votes of the judges are not a "poll," they are the election itself. That's why there was no "sampling error."

IIRC, your challenge to this view was based on the contention that, because of the pruning from 14 down to 9, 5 out of 9 (I won!) didn't necessarily mean 5 out of 9 after all, it might have meant, say, 6 out of 14 (I lost). Thus our judging system failed to produce the correct result.

The numbers that anti-ISU forces bruit about, when they say, "so much per cent of the time this system will produce the wrong result," are the results of computing the probability of this "failure" occurring. I computed these probabilities (and posted them once), but I lost interest in that analysis because I thought that whole way of looking at it was wrong.

Now that debate doesn't matter any more because under the CoP we can, if we want to, revert to the sampling theory model. This is because the statistic that we want to study -- the mean of the total score given (theoretically) by all possible qualified judges for that particular performance -- is a well-defined statistical object.

But I am still undecided whether to think of the final 9 judge panel as a sample from this infinite population ("all possible judges"), or whether the scores of the 9 judges are the population we are studying. The first case has a "standard error," the second does not.

Mathman
 
Joined
Aug 3, 2003
Originally Posted by Mathman:
I'm too lazy to search, but I think this is the first time that the statistical concepts of validity and reliability have been mentioned on this thread.
You can stay in your La-Z-Boy drinkin' brewskies, Mathman;) You won't have to search. I can address this issue quite easily. In the last paragraph of Rgirl's post of 9-12-03 she writes:
I'm hoping that the far more detailed analysis afforded by the COP, plus the referee, the computer analysis, and the whole new system in general will help improve the judging... The reason I bring this up is that for any measuring device, you have to determine both its validity and reliability. How do we know that the ISU judges are valid and reliable in measuring which skater is better than another?
So in Rgirl's post of 9-19-03, it was the second time she mentioned validity and reliability. Not that it matters. I just wanted to be right about ONE mathematical thing on this thread, even if it is just the difference between first and second, lol.
Rgirl
 
Joined
Aug 3, 2003
I'd expect you to search all the way back to last year, Mathman, and through every word of that 1,916-word post, just so I could be right about one mathematical thing. It's the only mathematically correct thing to do, lol.
R791812
 