1. 0

Statistical error in judging

Time to try this again.

There seems still to be a lot of confusion about the possibility that with the interim system of choosing nine judges out of fourteen, maybe the "wrong" person will win through "statistical error." This is not correct, even though a lot of people (some of whom claim some mathematical training and so ought to know better) are coming up with statements like, "In a 5-4 split the wrong person is given the victory 25% of the time," etc.

The problem is not with the business of selecting 9 "real" judges out of a pool of 14. Any system of selecting judges does this -- you have a pool of possible judges, then you select some of them actually to judge the contest. The problem is with the secrecy in the process.

To convince yourself of this, imagine that we use the same 9 out of 14 process, but that everything is done out in the open. Fifteen minutes before the event begins, the computer selects nine of the judges to be the real judges. Everybody knows who the nine are. The other 5 can either sit there and go through the motions just for fun, or they can repair to the nearest sports bar and watch on TV over a couple of cold ones.

Sure, a skater might complain, "Gee, I wish the computer had made a different choice -- I would have a better chance to win if I had a different judging panel." But this complaint is equally valid no matter how or when the selection of judges is made.

Mathman

2. 0

Statistical errors, part 2

I am putting this in a second post because it is a little more technical and maybe no one wants to read it.

What is "statistical error"?

In the context of the interim judging system, we are talking about "sampling error." This occurs when we use a sample to make predictions about properties of the entire population from which the sample is drawn. Like a political poll. We poll 1000 voters. 600 of them say they support George Bush. So we estimate that approximately 60% of <em>everybody in the country, not just those 1000</em> feel this way. "Sampling error" refers to the possibility that we might get a different answer if we actually asked everybody. There are very well understood mathematical formulas for calculating the probabilities that the sample percentage will be close to the true percentage.
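Those formulas are standard. As a quick sketch of the poll example above (600 of 1000 sampled voters saying yes), the standard error of a sample proportion is sqrt(p(1-p)/n), and 1.96 standard errors gives the familiar 95% margin of error:

```python
import math

# The poll example above: 600 of 1000 sampled voters support the candidate.
p_hat, n = 0.6, 1000
se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
moe = 1.96 * se                           # half-width of a 95% confidence interval
print(round(100 * moe, 1))                # 3.0 (percentage points)
```

That "plus or minus about 3 points" is exactly the sampling error the pollsters report.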

To apply this to the ISU interim voting system, suppose that all 14 judges submitted their marks <em>and then</em> nine sets of marks were chosen at random to count. In this case we have chosen a sample of nine from the population of fourteen. Questions about whether the nine accurately reflect the choices of all fourteen are highly appropriate. Sometimes (and we can calculate how often to expect this), the wrong skater will win through "sampling error" -- the majority of the sample (the 9) supports skater A, while the majority of the population (all 14) supports skater B.
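To put a number on that hypothetical (marks drawn at random *after* all 14 judges submit), here is a little Python sketch. The 8-6 split is invented for illustration; it enumerates every possible 9-mark sample and counts how often the sample majority flips away from the full-panel majority:

```python
import itertools
import math

# Hypothetical split: 8 of the 14 judges prefer skater A, 6 prefer skater B,
# so A is the "true" winner by majority of the full panel of 14.
votes = ["A"] * 8 + ["B"] * 6

total = math.comb(14, 9)                       # 2002 possible 9-mark samples
wrong = sum(1 for panel in itertools.combinations(votes, 9)
            if panel.count("B") >= 5)          # samples where B's side wins

print(wrong, total, round(wrong / total, 3))   # 476 2002 0.238
```

So in this made-up 8-6 case, the sample majority disagrees with the population majority about 24% of the time. Again, this only applies to the marks-drawn-afterward scenario, not to what the ISU actually does.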

But this isn't how it's done. Instead, the nine "real judges" (not their marks) are selected before the event begins. These nine are now the "population" under study. The marks of all nine members of this population are counted. This is a "sample of size nine selected from a population of size nine." That is, <em>you have polled the entire population</em>, so there is no possibility of "sampling error." It is the election itself, not a poll.

What about the other five who have been excluded from the population before the election begins? Speedy has made fools of them. Because of their ignorance (and ours) they can't go off to the sports bar after all. They have to go through the motions. But this is because they don't know any better, not because of any question relevant to statistics.

Anyway, the moral of the story is this. If we want to protest the interim judging system we should concentrate on its cloak and dagger secrecy, rather than be drawn into fruitless arguments about statistics. The interim system makes it easier to cheat and harder to catch the cheaters, and that's the bottom line.

Mathman

3. 0

Re: Statistical errors, part 2

That's interesting...never looked at it that way.

4. 0

Re: Statistical errors, part 2

Mathman
This is coming close to something that I wondered about in the ice dance competition. After the OD, L/A were first with a 5/4 split; B/K were second. After the FD, B/K won overall with a 5/4 split. For the purposes of this hypothetical point, let us say that the FD split in fact was 5/4 for B/K. (I am not sure exactly what the split was.) Within statistical uncertainty, does this mean (in our hypothetical case) that the overall win would be within random chance and could have gone either way, depending on which marks/judges the computer chose?

The way I read your second post, the answer would be no...the nine judges that counted for each event were picked before the events began and the other five did not know they were superfluous. Everyone judged "as they saw fit" and only the nine marks of the predetermined judges were reported. Thus the random chance alluded to above is not in play.

Have I got this? :rolleyes: If not, please try it again, I don't mean to be thick.

5. 0

Re: Statistical errors, part 2

Yes, that is exactly right. Your second paragraph is well stated.

However, it's kind of a subtle point. I have learned not to argue with people who don't see it that way. Just say, the real problem is with the secrecy and let it go at that.:lol:

Mathman

6. 0

Re: Statistical error in judging

Mathman--I wrote this in the "Worlds" folder before I read your post here. I agree that we are talking about sampling error, but it is error nonetheless. Also, I do think there are valid arguments against the "random selection error is meaningless" position. I've left my original post in the Worlds folder, but I've reposted it here. I agree, we will have to disagree--not about the sampling error (that it is), but about what it means. We've argued this before and I understand we will never agree, but in the interest of different points of view, I'll post this anyway.

Okay, you say, but that's like saying practice counts. It's like saying that 5 of the 14 judges are only practice judges because 5 are never meant to count, only Speedy just doesn't say so. This is true. However, please indulge me and read on.

I say there will still be and always has been an error rate when the judges are not anonymous, subject to significant punishment for cheating, and the random selection method is no longer in place. In any case, even if the anonymity is dropped, I still think the random selection system has an unacceptably high error built into the system. I do not think it is just a red herring, although I also think it's only one of a mess of problems. I think skaters who should have won have always lost competitions, or gotten third instead of second or whatever, at least some of the time; it's just that with this system, ironically, Speedy has made that error rate official and statistical.

Before the random selection system and anonymity, you could know who gave which skaters what scores, but as we saw in SLC and other competitions, judges could still collude to make sure a certain skater or team got or did not get a certain place. With the old system, you had to hope somebody would squeal on the other judges or else 'fess up as being part of it. Judges could easily sway a competition with perfectly acceptable and defensible scores.

Now the computer tells all viewers the raw scores, and if there is a significant skew of the average scores, viewers and skaters for the first time can see what judges have previously done under the table or by inherent bias. We see, for example, that Skater A's average scores are 5.92 and Skater B's average scores are 5.85, yet Skater B wins. Something is fishy, we say, but we cannot say for sure because we can't see the ordinals nor can we know who the judges are and who assigned which scores.

Before the random selection system, viewers could only say, "Something is fishy" based on what they thought of the skating vs. the judges' placements. Now Speedy has given us five raw scores to at least partially quantify our feelings of fishiness, if and when it happens. However, according to what I've read, the statisticians who are reviewing all this for the ISU have access to all the ordinals and scores of all 14 judges. Anyway, Speedy's random selection system just makes certain errors in judging easier to see and more quantifiable.

Even when we knew who the judges were, there were still those who were completely out of whack with the majority, such as the judge at the Olympics who put Sarah Hughes in 10th place after her SP and 4th after her LP. The differences now are that at least the statisticians have a way to quantify when the skater with the most first place votes is not awarded first place--in other words, the statisticians can see when the final placement does not align with the majority of scores of all 14 judges, which should be the best indicator of the true score. We viewers can only suspect based on the raw scores.

The nine randomly selected scores should reflect the same outcome as if they were selected from all 14 judges--or 50 official judges or the whole world of official judges. Whether it's nine precompetition randomly selected judges or 14 officially paneled judges whose scores may or may not count toward the final outcome, the judges are supposed to represent all expert figure skating judges in the world. Since we can't get them all, we take 7, 9, 14, 25, or x number of judges and say, "You represent all judges," and then use statistical methods to try to override bias, human error, and genuine difference of opinion.

It's particularly this last point--using statistics to override bias, human error, and difference of opinion--that I think is the basis of our difference of opinion on the meaning of the error rate in the random selection system. The scores of the five nonselected judges are supposed to serve as a way to evaluate whether the selected judges are being fair and accurate. So from my POV, their scores do, or should, have meaning, even if in this year it is only to show that the random selection process is unacceptably flawed. The five nonselected judges in a way should serve as a comparison panel, just as if a non-ISU organization selected nine expert FS judges for each competition in a study to determine if the ISU judges' scores correlated with those of non-ISU judges.

I can see a place for a panel of x number of judges where the high and low scores are thrown out, or where the scores that deviate most from the mean are thrown out, but I agree with the statisticians who feel there is an error rate in the random selection system and that it is unacceptably high. I also agree that anonymous judging and lack of accountability (which can still happen if we know who the judges are; it's been happening for decades) are increasing the error rate; we just don't know by how much, because we can't measure something that's kept secret.

We may have to agree to disagree on the point about the random selection and error rates, but I do agree that secrecy and lack of accountability are major parts of what is wrong with FS judging. I just don't think they are the only things wrong. I also think the random selection process is wrong, as are several other things about the current judging system. I don't think that just getting rid of the secrecy will fix things. There was no secrecy in FS judging for decades and that led to SLC, plus all the problem competitions before SLC. I think it will take a combination of knowing who the judges are; professional, paid judges; balanced panels in terms of nationality; accountability for scores; comparative judging panels (ie, judging panels who score the same events and whose scores are then compared to those of the actual judges); strict and significant punishment for judges caught cheating; changes to the way competitions are set up (ie, Q rounds and how skaters are assigned to them vs. something other than a Q round as it now stands, and how much weight certain parts of the competition and certain elements are given); and better statistical methods in determining which scores are used and how placements are determined, in order to clean up judging in figure skating. I think there is a lot of trial and error and analysis yet to be done before we get a system that is acceptably fair to the athletes given the human involvement.

In summary: Sampling error, we agree. The meaning of this error, we disagree.
Rgirl

7. 0

Re: Statistical error in judging

Now cut that out, Rgirl. You know you can't teach an old dog new tricks.

And besides, how do you expect me to impress people with my wisdom when you keep doing me this way.

In mathematics, it's not possible for two people of full understanding to disagree. This isn't political science, or ethics, or aesthetics, for goodness sake. In those fields there is plenty of room for disagreement because the criteria for truth are themselves up for grabs. Not so in good old math.

So if there is a genuine difference of opinion it must be on the metamathematical level. You (arguing as a Neo-Platonist) are presenting the view that when a skater performs there is such a thing as a “Right Mark” for that skate, existing up there in the world of ideals -- or at least, that there is a “right mark” (small r) representing the theoretical average of the marks that would be given by all qualified and impartial judges, should it be practical to solicit and tally them.

I (a Logical Positivist) presented the view that the right mark is the mark you get -- being the only mark that actually exists in the real world, it is by definition and default the right one.

Cf. Candide’s argument that this is the best of all possible worlds: it’s the only one.

Suppose that you have won me over on this point. In that case, the only consideration of statistical merit is the size of the judging panel. We are using the marks of the judging panel as a sample to try to predict the true marks given by the hypothetical population of all qualified judges. The size of the probable error varies inversely with the square root of the sample size, and nothing else really matters except a guarantee of randomness in the selection of the panel (each qualified judge has an equal chance of being selected -- no funny business about restricting the number of judges from each "bloc").

So if you quadrupled the number of judges, you would cut the sampling error in half.
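A quick simulation bears out the square-root law. The population mean and spread here are invented; the point is only the ratio between the sampling errors for panels of 9 and 36:

```python
import random
import statistics

random.seed(0)

def sampling_error(n, trials=20000):
    # Draw many hypothetical panels of n judges, each mark from a made-up
    # population (mean 5.5, sd 0.2), and measure how much the panel's
    # average mark wobbles from one panel to the next.
    means = [statistics.mean(random.gauss(5.5, 0.2) for _ in range(n))
             for _ in range(trials)]
    return statistics.stdev(means)

e9, e36 = sampling_error(9), sampling_error(36)
print(round(e9 / e36, 1))   # about 2.0: quadruple the panel, halve the error
```

Nothing figure-skating-specific in there; it's just the statistics of averages.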

From this perspective, the old system of 9 judges is no better and no worse than the new system of 9 real judges and 5 pretend ones. If I understand your argument correctly, the 9 out of 14 may be somewhat to be preferred because the other 5 can alert us to the fact that maybe we got a bad sample. Like Doris P. mentions, maybe some of the 9 were way down at the end of the bench and couldn't get a good view of all the flutzing going on.

It would be better to include the votes of all 14. This is your example of the basketball free throw contest. <strong>Well of course it would be better to count all 14.</strong> This is just common sense. That's not what we are arguing about (we are arguing about 9 real, 0 pretend versus 9 real, 5 pretend). The only reason for not counting all 14 is to preserve secrecy. That's why I keep saying that secrecy is the enemy. If we didn't have to be secret we could count all 14 free throws, which is OBVIOUSLY the right thing to do.

It would be even better to have 25 judges. 36 would be better yet. 36 real and 36 pretend would be the same as 36 real and no pretend. But if you had 72 people sitting there, it would be best of all to tally all 72. And then to call a few more people at random (qualified judges, not Pi) on the phone so that their votes could count, too. The more people we include in our sample, the smaller the expected sampling error.

8. 0

Re: Statistical error in judging

I think I wrote this on another thread. (I have had computer problems after being away for 10 days.) After viewing the 14 judges' marks, I took it on myself to drop the top 3 marks and the lowest 2 marks and viewed the 9 central marks. I found, for the most part, that there was no more than a .2 point difference. Granted, .2 is enough to make a champion, but it is at least a more definitive result than watching 4.0 to 5.5 marks on either end.
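Joe's trimming procedure might look like this in Python (the marks here are invented, just to show the mechanics):

```python
# Invented marks for one skater from all 14 judges.
marks = [5.9, 5.8, 5.8, 5.7, 5.7, 5.7, 5.6, 5.6, 5.5, 5.5, 5.4, 5.3, 4.9, 4.3]

central = sorted(marks, reverse=True)[3:-2]      # drop the top 3 and bottom 2
print(len(central))                              # 9
print(round(max(central) - min(central), 1))     # 0.4 -- spread of the middle marks
```

The extreme marks (the 4.3 here) never touch the result, which is the appeal of the approach.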

Joe

9. 0

Re: Statistical error in judging

Mathman,
Wait, wait, I know you don't really feel this way but I just want to indulge the moment:
<blockquote><strong><em>Quote:</em></strong><hr>Now I am starting to think that you are right about this.[/quote] If I weren't having computer problems (seems like we are all having computer problems since just before Worlds--hmm...maybe it's an infectious sekret computer) I'd put that sentence in big red letters.

Anyway, I see your point about why you feel secrecy is at the heart of the issue. But I still stand by my point that we had openness before and it was just as bad as this. Of course 9,000 judges would reduce the error, whether it be sampling error with the random selection method based on the idea that the score you get is the true score or be it standard error using the mean (average) score of all 9,000 scores submitted based on the idea that the true score is never attainable. Duhhh. But of course we can never get 9,000 judges just as we can never study an entire population to determine if blue-eyed skaters jump higher than brown-eyed skaters. We can only get samples. The questions for me are (maybe not you), "(a) How do we get a reasonably unbiased sample of judges, given that we are limited to, say, a panel of 14 at most? And (b) how do we get that sample of judges to most accurately represent the scores that a very large and ideally sampled group of judges would give?"

First of all, about the "true score" thing: Even though the true score is in the great realm of ideas, what we want to do with statistics, IMO, is approximate that true score as closely as possible. If a skater receives 5.9 from all 14 judges for either technical or presentation, I'd say that 5.9 is a pretty accurate representation of that skater's true score. If a skater receives a range of 4.3 up to 5.9, with a median score of 5.5, I'd say the 4.3 should not be counted because it deviates too much from the median.

You also said, "Suppose that you have won me over on this point. In that case, the only consideration of statistical merit is the size of the judging panel." Not in my opinion. IMO, there are lots of ways to alter the way statistics are used in figure skating judging that have almost nothing to do with the size of the judging panel, aside from needing at least a reasonable sample of judges.

Also, I do not think it is necessarily better to count all 14 judges' scores or even best to count all 9,000 scores of 9,000 judges. Why? Because of deviant scores, ie, the judge who gives a skater a 4.3 when 13 other judges score him in a range from 5.5 to 5.9. Here's just one possible scenario. I'm not saying it's the best or even a decent way to do things, it's just an example of using things other than the size of the judging panel. Let's stick with the panel of 14 for the time being. Out of the 14 scores, first I'd find the mode (the mode is the score that occurs most often, in case anyone other than Mathman and Rgirl is reading these:lol: ) for both the technical and presentation evaluations for each skater. Then I'd throw out the scores with the greatest deviation from the mode, eg, if the mode score is 5.5, I'd throw out the, for example, 4.3, 4.7, and 5.9. Then I'd take the mean of the remaining scores. Thus in the Rgirl System Version 1.1, each skater would receive only one score, like in gymnastics.
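A rough sketch of that scheme in Python (the marks are invented, and the choice to keep 11 of the 14 marks is my own assumption; the post doesn't fix a cutoff):

```python
import statistics

def rgirl_score(marks, keep=11):
    # Sketch of the scheme described above: find the mode, discard the marks
    # farthest from it, and average what remains into a single score.
    # `keep` (how many marks survive the cut) is an assumed parameter.
    mode = statistics.mode(marks)
    kept = sorted(marks, key=lambda m: abs(m - mode))[:keep]
    return round(statistics.mean(kept), 2)

# 14 invented marks; mode is 5.5, so the 5.9, 4.7, and 4.3 get thrown out.
marks = [5.5, 5.5, 5.5, 5.5, 5.6, 5.6, 5.4, 5.4, 5.3, 5.7, 5.8, 4.7, 4.3, 5.9]
print(rgirl_score(marks))   # 5.53
```

One score per skater, gymnastics-style, with the outliers gone before the averaging.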

Yes, I would have the names, scores, and ordinals of every judge out in the open, so the full range of scores would be available for public, skater, and federation scrutiny. Also, there would be some kind of system for dealing with those judges whose scores are continually thrown out for having the greatest deviation from the mode. They wouldn't be fired or punished (unless they are found to be cheating), just asked to justify their scores. Nothing wrong with always being odd man out, just be able to back it up. Maybe those judges are deducting for things other judges SHOULD be deducting for, like flutzing. I think both the extreme and the average scores can be potentially enlightening. Even the average and median scores can be error ridden if they are the result of collusion. That's the problem I have with your "the true score is the score you get." If that score is from a judge or group of judges who are unacceptably biased or cheating, how can that score be an accurate reflection of what the skater did?

Of course if a judge whose scores are routinely extreme cannot justify them or if it's a US judge who is always out of whack with Russian skaters or vice versa, suspend him and send him back to judge training with no pay.

What it comes down to for me is, given the limits of the real world and the bias inherent in human behavior: what is the statistical method by which skaters will receive the score, from the most judges on a given panel, that most closely reflects the true merits of their skating? How about "the score that most closely reflects the true merits of their skating" instead of "true score"? Although I like true score just fine. (My best stat book got lost during my last move, otherwise this explanation wouldn't be so lame.)

There was some other stuff, but lucky for all of us that the repeat of "Six Feet Under" is on so I can stop:smokin:
Rgirl

[Edited because in the shower this morning I realized I used and defined "median" in the original version as the score that occurs most often, when in fact that is the "mode." The median is the score that exactly divides the upper half of the distribution from the lower half. I was sure I'd be busted. Nothing like a stupid mistake to wreck your program. Now I know how Sasha felt when she fell on her spin. RG]

10. 0

Re: Statistical error in judging

<blockquote><strong><em>Quote:</em></strong><hr>...in case anyone other than Mathman and Rgirl are reading these."[/quote]Are you offering any odds on that?

MM

11. 0

Re: Statistical error in judging

Interesting discussion. I was just talking to my father last night (we are both software engineers), and he suggested another way to fight corruption on judging.

After every competition, a computer could analyze the judges' scores and note which judges' marks were significantly different from the rest of the judges' marks. If the computer has in its database the information on how the judges judged previous events, by the end of the season you could see which judges are way off. It is irrelevant whether they are off because they are corrupt or incompetent. You could come up with an accumulated percentage of the difference from "average" that would disqualify a judge from judging international competitions.
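The bookkeeping for that is simple. Here is a minimal sketch (judge names, marks, and the 0.3 threshold are all invented) that accumulates each judge's deviation from the panel mean across events and flags the season-long outliers:

```python
from collections import defaultdict
import statistics

# season maps judge name -> list of per-event absolute deviations.
season = defaultdict(list)

def record_event(marks):
    # marks: {judge_name: mark} for one event.
    panel_mean = statistics.mean(marks.values())
    for judge, mark in marks.items():
        season[judge].append(abs(mark - panel_mean))

# Two invented events; J4 is consistently far below the panel.
record_event({"J1": 5.8, "J2": 5.7, "J3": 5.8, "J4": 4.9})
record_event({"J1": 5.5, "J2": 5.6, "J3": 5.5, "J4": 4.8})

THRESHOLD = 0.3   # arbitrary cutoff, for illustration only
flagged = [j for j, devs in season.items()
           if statistics.mean(devs) > THRESHOLD]
print(flagged)    # ['J4']
```

As the post says, the computer doesn't care whether J4 is corrupt or incompetent; the pattern alone is enough to pull him aside.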

Of course, such a system would only really make sense once FS goes to a merit-based system of marks.

12. 0

Re: Statistical error in judging

Ptichka, I'm not an expert on this, but I think that they already do have some sort of procedure in place like you suggest. Under the old system, all of the judges report to the referee and if he or she sees something out of line with the majority, the judge must submit a written report justifying the marks that he or she gave. Presumably judges that are often off base for no good reason aren't invited back.

In the interim system, this evaluation will take place at the end of the season for all competitions together, instead of after each event separately.

Now. Rgirl. Why didn't you tell me that you were talking about APPLIED statistics. Does that mean it has something to do with that illogical and terrifying place, the real world? Truth be told, if it's not about my two little fantasy worlds, Mathematics and Planet Kwan, I'm not much interested.

Still, a couple of points.

1. About excluding extreme scores. Under the old (and also the interim, I think) system, after the first skater skates the scores are tallied and the median score is announced to the judging panel. The judges are then supposed to compare their scores with the median and judge the rest of the contest accordingly.

So for instance if you give the first skater a mark of 4.9 and the median is 5.3, then you are supposed to say to yourself, well, I was too tough on this skater by about 4 tenths, so to be fair I have to scale all of my scores down about the same. That's why it sometimes happened (under the old system) that one judge stood out like a sore thumb, giving <em>everybody</em> scores that are way too low. The judge is not being mean, he or she is trying to be fair.

This is not necessarily to be lamented, because the only thing that counts is the ordinals. If a judge got off on the wrong foot and scores everybody 4 tenths below what he or she expects the median for that skater to be, that will not affect which skater he or she puts first, second, and so on. Michelle would still be world champion under the OBO system with 8 first place ordinals, even if a judge scored her performance at 4.5, as long as that judge was consistently stingy to all skaters.
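A tiny Python check of that point (skater names and marks invented): shifting all of one judge's marks down by a constant leaves that judge's ordinals untouched.

```python
# Invented marks; the "stingy" judge scores every skater 0.4 lower.
marks = {"Michelle": 5.9, "Irina": 5.8, "Sasha": 5.6}
stingy = {s: m - 0.4 for s, m in marks.items()}

def rank(d):
    # Skaters ordered best-first by this judge's marks, i.e. the ordinals.
    return sorted(d, key=d.get, reverse=True)

print(rank(marks) == rank(stingy))   # True -- same ordinals either way
```

Since only the ordinals count, the consistently low judge does no harm at all.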

2. Aside (in case anyone is reading this besides Rgirl, Ptichka and me). OBO means one by one. It means that the winner is determined strictly by counting ordinals. If you get 5 first place ordinals out of 9, you win, period.

Skater A: 1 1 1 1 1 3 5 10 15
Skater B: 2 2 2 2 2 1 1 1 1

Skater A wins.

After first place is determined, then second place is determined similarly with the first place winner out. Then third, and you go down the line, "one by one." Total scores, average scores, median scores, outliers, inliers -- none of that counts, only ordinals. Here's another example:

A: 1 1 1 1 3 3 3 3 3
B: 2 2 2 2 2 5 5 5 5

B wins. Here's how. First count the first place ordinals. No one has a majority, so we go on to round 2. In round 2 we count both first and second place ordinals. A has 4, B has 5. B wins, despite having no first place ordinals at all, and despite the fact that every judge but one preferred A.

Note that judge number 5 decides the whole contest. Switch his or her votes around and the other person wins. This is what happened to Michelle versus Irina at Salt Lake City, where judge number five was the American judge, who gave the gold medal to Sarah by ranking Irina second ahead of Michelle in the free skate.

Note also that this system (both the old and the interim) DOES throw out ORDINALS that are way off. In the first example the 10 and the 15 don't "count" in the sense that you can change them to any value whatever without affecting the result.

So, bottom line: any scoring system that makes any use of the scores at all -- means, medians, whatever, except insofar as they translate into ordinals -- is already a radical change.
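The round-counting described above can be sketched in Python. This is a simplified two-skater version: round r counts how many judges placed a skater at position r or better, and the first skater to reach a majority wins (ties between skaters who reach a majority in the same round are broken only by the larger count here, which glosses over the full OBO tie-break rules):

```python
def obo_winner(ordinals):
    # ordinals: {skater: list of ordinals from the same judges, in order}.
    n_judges = len(next(iter(ordinals.values())))
    majority = n_judges // 2 + 1
    worst = max(max(v) for v in ordinals.values())
    for r in range(1, worst + 1):
        # Count, for each skater, the judges placing them at r or better.
        counts = {s: sum(1 for o in v if o <= r) for s, v in ordinals.items()}
        leaders = [s for s, c in counts.items() if c >= majority]
        if leaders:
            return max(leaders, key=lambda s: counts[s])
    return None

# Mathman's first example: A wins outright with 5 first place ordinals.
print(obo_winner({"A": [1, 1, 1, 1, 1, 3, 5, 10, 15],
                  "B": [2, 2, 2, 2, 2, 1, 1, 1, 1]}))   # A

# The second example: B wins in round 2 with 5 ordinals of 2 or better.
print(obo_winner({"A": [1, 1, 1, 1, 3, 3, 3, 3, 3],
                  "B": [2, 2, 2, 2, 2, 5, 5, 5, 5]}))   # B
```

Note how the 10 and the 15 in the first example never affect anything: only whether each ordinal clears the round's threshold matters.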

Mathman

13. 0

Re: Statistical error in judging

Am late for an appointment but had to change something (see edit to recent post) but am glad Ptichka weighed in. I'm not so sure that the ISU ever made a serious effort to keep track of judges' scores in an attempt to identify rogue judges. Mathman, you are absolutely right about the ordinals of course, but like I said, "Six Feet Under" was on and that trumps all discussions on figure skating. One thing though: just because a judge marks a skater or several skaters unusually low does not mean he marks all skaters low. That's at the heart of manipulating the outcome. Using an "us vs. them" example: score your skaters high, their skaters low, and try to look reasonable on those you don't care about. Hence the ordinals for your skaters are higher than those for their skaters. When I'm not late for the dentist I'll come back for ordinals.
Rgirl

14. 0

Re: Statistical error in judging

Here they are:

1. Rgirl
2. Rgirl's dentist

About that edit: Proving yet again what a gentleman I am.

Anyway, thanks for helping me clarify my own thoughts about this. Here is the new Eternal Truth. (Forget all those old eternal truths that I said yesterday.)

1. The old system versus the interim system.

Twiddling the statistical methodology -- 9 judges versus 14, judges that count versus judges that don't, proper use of the mean, median and mode, outliers, range and standard deviation -- these considerations don't really make much difference.

In any system one can construct hypothetical (or even actual) competitions where the method of determining the winner didn't work out very well, and any system can be manipulated by cheaters, crooks and blackguards.

The only substantive difference between the two systems is that under the interim system the judges vote in secret and under the old system the voting is done in public view.

2. The old and interim systems, on the one hand, versus the proposed points-per-element system and the Rgirl system, on the other.

The distinction is between systems in which the winners are decided by ordinal placement and systems in which the winner is decided by some sort of totaling or averaging of points.

I suppose that the argument in favor of ordinals must go something like this. It is easy for a judge to say, of these three skaters I liked this one the best, that one next, and the other one third. It is not so easy to try to come up with that mythical, ghostly, exactly right 5.6-to-end-all-5.6s. Or to decide whether Michelle switching over to the flat at the very last second on her triple Lutz means she should get only a +1 bonus point instead of a +2 on that element.

Also, the OBO ordinal system makes it harder for a conspiratorial minority, or an individual patriot, to sway the outcome (although your idea of discarding scores that are too far from the median also accomplishes that goal).

Mathman

15. 0

Re: Last Words, Plus or Minus

Mathman,
There is no "Rgirl's system." As I said in my post, it was just an example of how using different statistical methodology might affect the scores. I haven't studied enough applied statistics or FS judging to propose a system. But then, neither has Speedy but that doesn't stop him:lol:
<blockquote><strong><em>Quote:</em></strong><hr>Twiddling the statistical methodology -- 9 judges versus 14, judges that count versus judges that don't, proper use of the mean, median and mode, outliers, range and standard deviation -- these considerations don't really make much difference.[/quote] I realize we disagree, but why the denigrating language?
<blockquote><strong><em>Quote:</em></strong><hr>Also, the OBO ordinal system makes it harder for a conspiratorial minority, or an individual patriot, to sway the outcome (although your idea of discarding scores that are too far from the median also accomplishes that goal).[/quote] Okay, quid pro quo: I thought "twiddling the statistical methodology" didn't really make much difference?

Look, it goes without saying that I agree that the secrecy should go; it's just that I disagree that merely making the judges' scores and ordinals public will "fix" things. As I've said before, everything was public before and we ended up with the SLC scandal and a number of other lesser known incidents of judges' cheating. And who knows how many times skaters lost out on medals they should have won because of biased judging? I'm saying that I think there are better statistical methodologies than those being used at present that will minimize the inherent bias in judging. IMO, it will take a combination of open judging and better statistical methods applied to the scoring, plus several other changes in both the judging and the way competitions are organized, all of which I've noted before, in order for figure skaters to be awarded scores that best reflect the merits of their skating.

Beyond this, I think you and I will have to agree to disagree--respectfully of course:D
Rgirl EDB
