Statistical error in judging

mathman444 · Mar 30, 2003

Statistical error in judging

Time to try this again.

There seems still to be a lot of confusion about the possibility that with the interim system of choosing nine judges out of fourteen, maybe the "wrong" person will win through "statistical error." This is not correct, even though a lot of people (some of whom claim some mathematical training and so ought to know better) are coming up with statements like, "In a 5-4 split the wrong person is given the victory 25% of the time," etc.

The problem is not with the business of selecting 9 "real" judges out of a pool of 14. Any system of selecting judges does this -- you have a pool of possible judges, then you select some of them actually to judge the contest. The problem is with the secrecy in the process.

To convince yourself of this, imagine that we use the same 9 out of 14 process, but that everything is done out in the open. Fifteen minutes before the event begins, the computer selects nine of the judges to be the real judges. Everybody knows who the nine are. The other 5 can either sit there and go through the motions just for fun, or they can repair to the nearest sports bar and watch on TV over a couple of cold ones.

Sure, a skater might complain, "Gee, I wish the computer had made a different choice -- I would have a better chance to win if I had a different judging panel." But this complaint is equally valid no matter how or when the selection of judges is made.

Mathman

mathman444 · Mar 30, 2003

Statistical errors, part 2

I am putting this in a second post because it is a little more technical and maybe no one wants to read it.

What is "statistical error?"

In the context of the interim judging system, we are talking about "sampling error." This occurs when we use a sample to make predictions about properties of the entire population from which the sample is drawn. Like a political poll. We poll 1000 voters. 600 of them say they support George Bush. So we estimate that approximately 60% of everybody in the country, not just those 1000, feel this way. "Sampling error" refers to the possibility that we might get a different answer if we actually asked everybody. There are very well understood mathematical formulas for calculating the probability that the sample percentage will be close to the true percentage.
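As a rough illustration of those formulas, here is a minimal sketch of the usual normal-approximation margin of error for the poll above. The 1.96 multiplier (a 95% confidence level) and the function name are assumptions added for illustration; the post does not specify them.

```python
import math

def margin_of_error(successes: int, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a sample proportion.

    Uses the normal approximation; z = 1.96 is an assumed 95% level.
    """
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return z * standard_error

# The poll from the post: 600 of 1000 voters.
moe = margin_of_error(600, 1000)
print(f"sample share 60%, margin of error about {moe:.3f}")
```

Under these assumptions the margin comes out at roughly three percentage points either way.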

To apply this to the ISU interim voting system, suppose that all 14 judges submitted their marks and then nine sets of marks were chosen at random to count. In this case we have chosen a sample of nine from the population of fourteen. Questions about whether the nine accurately reflect the choices of all fourteen are highly appropriate. Sometimes (and we can calculate how often to expect this), the wrong skater will win through "sampling error" -- the majority of the sample (the 9) supports skater A, while the majority of the population (all 14) supports skater B.
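The "how often" in this hypothetical can be worked out with a hypergeometric count. The sketch below assumes a simplified two-skater contest in which every judge puts either A or B first; the function name and the range of splits tried are illustrative choices, not anything from the ISU rules.

```python
from math import comb

def prob_sample_majority(votes_for_a: int, population: int = 14, sample: int = 9) -> float:
    """Chance that skater A holds a majority of a random sample of marks,
    given that A holds `votes_for_a` of the `population` marks."""
    needed = sample // 2 + 1                      # 5 of 9
    votes_for_b = population - votes_for_a
    favourable = sum(
        comb(votes_for_a, k) * comb(votes_for_b, sample - k)
        for k in range(needed, min(votes_for_a, sample) + 1)
    )
    return favourable / comb(population, sample)

# A is the minority choice of all 14 judges when A has 5 or 6 votes.
for a in range(5, 8):
    print(a, "of 14 ->", round(prob_sample_majority(a), 4))
```

Under these assumptions, a skater preferred by only 6 of the 14 would win the sampled vote a little under a quarter of the time, and one preferred by only 5 about 6% of the time.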

But this isn't how it's done. Instead, the nine "real judges" (not their marks) are selected before the event begins. These nine are now the "population" under study. The marks of all nine members of this population are counted. This is a "sample of size nine selected from a population of size nine." That is, you have polled the entire population, so there is no possibility of "sampling error." It is the election itself, not a poll.
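One way to see the "sample of nine from a population of nine" point numerically is the finite population correction, the standard factor by which sampling without replacement shrinks the standard error. This is a textbook formula brought in for illustration; the post itself does not mention it.

```python
import math

def finite_population_correction(population: int, sample: int) -> float:
    """Factor applied to the standard error when sampling without replacement."""
    return math.sqrt((population - sample) / (population - 1))

# Marks of 9 judges drawn from all 14: some sampling error remains.
print(finite_population_correction(14, 9))
# All 9 members of a population of 9 counted: the factor is zero.
print(finite_population_correction(9, 9))
```

When the whole population is counted the factor is zero, which is the sense in which there is nothing left to call sampling error.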

What about the other five who have been excluded from the population before the election begins? Speedy has made fools of them. Because of their ignorance (and ours) they can't go off to the sports bar after all. They have to go through the motions. But this is because they don't know any better, not because of any question relevant to statistics.

Anyway, the moral of the story is this. If we want to protest the interim judging system we should concentrate on its cloak and dagger secrecy, rather than be drawn into fruitless arguments about statistics. The interim system makes it easier to cheat and harder to catch the cheaters, and that's the bottom line.

Mathman

RED DOG45 · Mar 30, 2003

Re: Statistical errors, part 2

That's interesting...never looked at it that way.

southwestwind · Mar 30, 2003

Re: Statistical errors, part 2

Mathman
This is coming close to something that I wondered about in the ice dance competition. After the OD, L/A were first with a 5/4 split, B/K were second. After the FD, B/K won overall with a 5/4 split. For the purposes of this hypothetical point, let us say that the FD split in fact was 5/4 for B/K. (I am not sure exactly what the split was). Within statistical uncertainty, does this mean (in our hypothetical case), that the overall win would be within random chance and could have gone either way, depending on which marks/judges the computer chose?

The way I read your second post the answer would be no...the nine judges that counted for each event were picked before the events began and the other five did not know they were superfluous. Everyone judged "as they saw fit" and only the nine marks of the predetermined judges were reported. Thus random chance alluded to above is not in play.

Have I got this?

If not, please try it again, I don't mean to be thick.

mathman444 · Mar 30, 2003

Re: Statistical errors, part 2

Yes, that is exactly right. Your second paragraph is well stated.

However, it's kind of a subtle point. I have learned not to argue with people who don't see it that way. Just say, the real problem is with the secrecy and let it go at that.:lol:

Mathman

rgirl181 · Mar 31, 2003

Re: Statistical error in judging

Mathman--I wrote this in the "Worlds" folder before I read your post here. I agree that we are talking about sampling error, but it is error nonetheless. Also, I do think there are valid arguments against the "random selection error is meaningless" position. I've left my original post in the Worlds folder, but I've reposted it here. I agree, we will have to disagree--not about the sampling error, that it is, but about what it means. We've argued this before and I understand we will never agree, but in the interest of different points of view, I'll post this anyway.

Okay Mathman, I'll put my head on the chopping block again

I agree that secrecy is a large part of what's wrong with the present system of FS judging, but it is not the only thing wrong. As always, I see your point about there being only nine real judges whether they are chosen six seconds or six months before the event. OTOH, I also think a certain amount of this argument is about semantics and the difference between statistics and the meaning of statistics. The way I see it, there are 14 judges judging the event and all the scores of all 14 judges are shown to skaters and viewers. Dick and Peggy and my cat Pi are not trained as judges, are not brought to the event, and are not asked to officially evaluate the skaters. Also, the difference between selecting the judges six months vs. six seconds ahead of time is that with the latter system, any one of the 14 judges who has been asked to show up is a potential real judge. What any panel of judges is doing in any situation where you cannot evaluate who will win or lose based on a quantifiable measure like time, tasks completed (as in golf or basketball), height, etc. is to serve as surrogate quantifiers; they are evaluating the athletes based on an agreed upon set of criteria and are selected based on their (supposed) expertise. It's often said that many posters at GS know enough about figure skating to be judges, but as far as I know, none of us has been through the judges' training system nor do we have experience in judging. You can't get rid of cheating and oddball points of view, which is why you try to build in statistical safety measures so that the proverbial true score, which should be the one that the majority of expert judges agree or come close to agreeing upon, is the one that gets assigned. The only way to determine how accurately the judges are assigning "true" scores is by looking at large numbers of judges and their scores of various skating performances over time.
From what I've read, that's what the statisticians for the ISU are trying to do and have been trying to do. While I agree that the secrecy in terms of who the judges are is one of the worst aspects of this judging system, I think it's only one of several. Even if we knew who all the judges were in the present random selection system there would still be, IMO, a significant and unacceptable error rate. In any group of scores there is an error rate, whether it's from nine judges selected six months before the competition and whose names we all know or from the anonymous judges in the current system. The error rate is the degree to which the scores deviate from what the "true" score would be. The "true" score is never attainable in the real world, which is why you always have an error rate in athletic competitions or whenever one compares scores. I realize it is different in pure mathematics.

For me, as I've said before, if 5 out of 14 trained and officially assigned (I know, the computer "officially assigns" only nine judges and does so before the competition--that's really the heart of our argument) figure skating judges put Skater A in first place and 9 of those 14 judges put Skater A in second place, statistically speaking, Skater A's true finish, to the best we can determine, is second place. But if the computer selects the 5 judges who put Skater A in first and 4 of the judges who put Skater A in second for the final nine scores, thus putting Skater A in first place, then there is an error and a significant one. True it is sampling error, but the point IMO is what does this error mean? To me it's no different than if we say we're going to let you have 14 trials at making a basket from the free-throw line, but before you even start, we are randomly going to count only nine of the 14 trials towards your final score. If you make 9 baskets and miss 5, but the computer had preselected only 4 of the trials where you made the basket and the 5 you missed, then that is an unacceptable error. It does not reflect the person's true ability at shooting baskets. We go further when we bring in a competitor, Shooter B. Shooter B misses 9 baskets and makes 5 but the random selection went in his favor. Thus Shooter B, with only 5 true baskets, wins over Shooter A, who has 9 true baskets. The relation between the true score (the number of baskets each shooter actually made) and the number of baskets that counted toward the final decision is a significantly inaccurate representation of what actually happened. The way I see it, the judges are substitutes for baskets made or fastest times. Granted, they are a highly flawed substitute--kind of like working with a bad stopwatch--but it's the only system that sports like gymnastics, diving, and many others have.
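To put a number on how often this worst case could occur, one can simply enumerate every possible nine-judge panel. This is a brute-force sketch under the scenario's own assumptions (5 of 14 judges put Skater A first, the other 9 put A second behind a single rival, and all panels are equally likely); the judge indices are arbitrary labels.

```python
from itertools import combinations

# Hypothetical panel matching the example: judges 0-4 place Skater A first,
# judges 5-13 place Skater A second.
firsts_for_a = [True] * 5 + [False] * 9

# Every way the computer could pick 9 counting judges out of 14.
panels = list(combinations(range(14), 9))

# A takes first place only if at least 5 of the 9 counting judges had A first.
a_wins = sum(
    1 for panel in panels
    if sum(firsts_for_a[judge] for judge in panel) >= 5
)
print(a_wins, "of", len(panels), "panels:", round(a_wins / len(panels), 4))
```

Under those assumptions only the panels containing all five of A's supporters flip the result, which works out to 126 of the 2002 possible panels, or a bit over 6%.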

Okay, you say, but that's like saying practice counts. It's like saying that 5 of the 14 judges are only practice judges because 5 are never meant to count, only Speedy just doesn't say so. This is true. However, please indulge me and read on. I say there will still be and always has been an error rate when the judges are not anonymous, subject to significant punishment for cheating, and the random selection method is no longer in place. In any case, even if the anonymity is dropped, I still think the random selection system has an unacceptably high error built in to the system. I do not think it is just a red herring, although I also think it's only one of a mess of problems. I think skaters who should have won have always lost competitions, or gotten third instead of second or whatever, at least some of the time, it's just that with this system, ironically, Speedy has made that error rate official and statistical. Before the random selection system and anonymity, you could know who gave which skaters what scores, but as we saw in SLC and other competitions, judges could still collude to make sure a certain skater or team got or did not get a certain place. With the old system, you had to hope somebody would squeal on the other judges or else 'fess up as being part of it. Judges could easily sway a competition with perfectly acceptable and defensible scores. Now the computer tells all viewers the raw scores and if there is a significant skew of the average scores, viewers and skaters for the first time can see what judges have previously done under the table or by inherent bias. We see, for example, that Skater A's average scores are 5.92 and Skater B's average scores are 5.85, yet Skater B wins. Something is fishy, we say, but we cannot say for sure because we can't see the ordinals nor can we know who the judges are and who assigned which scores. 
Before the random selection system, viewers could only say, "Something is fishy" based on what they thought of the skating vs. the judges' placements. Now Speedy has given us five raw scores to at least partially quantify our feelings of fishiness, if and when it happens. However, according to what I've read, the statisticians who are reviewing all this for the ISU have access to all the ordinals and scores of all 14 judges. Anyway, Speedy's random selection system just makes certain errors in judging easier to see and more quantifiable. Even when we knew who the judges were, there were still those who were completely out of whack with the majority, such as the judge at the Olympics who put Sarah Hughes in 10th place after her SP and 4th after her LP. The difference now is that at least the statisticians have a way to quantify when the skater with the most first place votes is not awarded first place--in other words, the statisticians can see when the final placement does not align with the majority of scores of all 14 judges, which should be the best indicator of the true score. We viewers can only suspect based on the raw scores. The nine randomly selected scores should reflect the same outcome as if they were selected from all 14 judges--or 50 official judges or the whole world of official judges. Whether it's nine precompetition randomly selected judges or 14 officially paneled judges whose scores may or may not count toward the final outcome, the judges are supposed to represent all expert figure skating judges in the world. Since we can't get them all, we take 7, 9, 14, 25, or x number of judges and say, "You represent all judges" and then use statistical methods to try to override bias, human error, and genuine difference of opinion.
It's particularly this last point--using statistics to override bias, human error, and difference of opinion--that I think is the basis of our difference of opinion on the meaning of the error rate in the random selection system. The scores of the five nonselected judges are supposed to serve as a way to evaluate whether the selected judges are being fair and accurate. So from my POV, their scores do, or should, have meaning, even if in this year it is only to show that the random selection process is unacceptably flawed. The five nonselected judges in a way should serve as a comparison panel, just as if a nonISU organization selected nine expert FS judges for each competition in a study to determine if the ISU judges' scores correlated with those of nonISU judges.

I can see a place for a panel of x number of judges where the high and low scores are thrown out or the scores that deviate most from the mean are thrown out, but I agree with the statisticians who feel there is an error rate in the random selection system and that it is unacceptably high. I also agree that anonymous judging and lack of accountability (which can still happen if we know who the judges are; it's been happening for decades) are increasing the error rate, we just don't know by how much because we can't measure something that's kept secret.

We may have to agree to disagree on the point about the random selection and error rates, but I do agree that secrecy and lack of accountability are major parts of what is wrong with FS judging. I just don't think they are the only things wrong. I also think the random selection process is wrong, as are several other things about the current judging system. I don't think that just getting rid of the secrecy will fix things. There was no secrecy in FS judging for decades and that led to SLC, plus all the problem competitions before SLC. I think it will take a combination of knowing who the judges are; professional, paid judges; balanced panels in terms of nationality; accountability for scores; comparative judging panels (ie, judging panels who score the same events and whose scores are then compared to those of the actual judges); strict and significant punishment for judges caught cheating; changes to the way competitions are set up (ie, Q rounds and how skaters are assigned to them vs. something other than a Q round as it now stands; and how much weight certain parts of the competition and certain elements are given); and better statistical methods in determining which scores are used and how placements are determined in order to clean up judging in figure skating. I think there is a lot of trial and error and analysis yet to be done before we get a system that is acceptably fair to the athletes given the human involvement.

In summary: Sampling error, we agree. The meaning of this error, we disagree.
Rgirl

mathman444 · Mar 31, 2003

Re: Statistical error in judging

Now cut that out, Rgirl. You know you can't teach an old dog new tricks.

And besides, how do you expect me to impress people with my wisdom when you keep doing me this way?

Now I am starting to think that you are right about this.

In mathematics, it's not possible for two people of full understanding to disagree. This isn't political science, or ethics, or aesthetics, for goodness sake. In those fields there is plenty of room for disagreement because the criteria for truth are themselves up for grabs. Not so in good old math.

So if there is a genuine difference of opinion it must be on the metamathematical level. You (arguing as a Neo-Platonist) are presenting the view that when a skater performs there is such a thing as a “Right Mark” for that skate, existing up there in the world of ideals -- or at least, that there is a “right mark” (small r) representing the theoretical average of the marks that would be given by all qualified and impartial judges, should it be practical to solicit and tally them.

I (a Logical Positivist) presented the view that the right mark is the mark you get -- being the only mark that actually exists in the real world, it is by definition and default the right one.

Cf. Pangloss's argument in Candide that this is the best of all possible worlds: it's the only one.

Suppose that you have won me over on this point. In that case, the only consideration of statistical merit is the size of the judging panel. We are using the marks of the judging panel as a sample to try to predict the true marks given by the hypothetical population of all qualified judges. The size of the probable error varies inversely with the square root of the sample size, and nothing else really matters except a guarantee of randomness in the selection of the panel (each qualified judge has an equal chance of being selected -- no funny business about restricting the number of judges from each "bloc").

So if you quadrupled the number of judges, you would cut the sampling error in half.
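A quick numerical check of the square-root rule, as a sketch only: the spread of individual judges' marks (0.2 here) is a made-up figure chosen for illustration, and the function name is hypothetical.

```python
import math

def standard_error(judge_sd: float, panel_size: int) -> float:
    """Standard error of the panel's mean mark for a given panel size."""
    return judge_sd / math.sqrt(panel_size)

ASSUMED_SD = 0.2  # hypothetical spread of individual judges' marks

for n in (9, 14, 36):
    print(n, "judges ->", round(standard_error(ASSUMED_SD, n), 4))

# Quadrupling the panel from 9 to 36 halves the standard error.
ratio = standard_error(ASSUMED_SD, 9) / standard_error(ASSUMED_SD, 36)
print("ratio 9 vs 36:", ratio)
```

Whatever spread is assumed, the ratio between a panel of 9 and a panel of 36 stays at 2, since the spread cancels out.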

From this perspective, the old system of 9 judges is no better and no worse than the new system of 9 real judges and 5 pretend ones. If I understand your argument correctly, the 9 out of 14 may be somewhat to be preferred because the other 5 can alert us to the fact that maybe we got a bad sample. Like Doris P. mentions, maybe some of the 9 were way down at the end of the bench and couldn't get a good view of all the flutzing going on.

It would be better to include the votes of all 14. This is your example of the basketball free throw contest. Well of course it would be better to count all 14. This is just common sense. That's not what we are arguing about (we are arguing about 9 real, 0 pretend versus 9 real, 5 pretend). The only reason for not counting all 14 is to preserve secrecy. That's why I keep saying that secrecy is the enemy. If we didn't have to be secret we could count all 14 free throws, which is OBVIOUSLY the right thing to do.

It would be even better to have 25 judges. 36 would be better yet. 36 real and 36 pretend would be the same as 36 real and no pretend. But if you had 72 people sitting there, it would be best of all to tally all 72. And then to call a few more people at random (qualified judges, not Pi) on the phone so that their votes could count, too. The more people we include in our sample, the smaller the expected sampling error.

Your buddy, Mathman

Joesitz · Apr 1, 2003

Re: Statistical error in judging

I think I wrote this on another thread. (I have had computer problems after being away for 10 days.) After viewing the 14 judges' marks, I took it upon myself to drop the top 3 marks and the lowest 2 marks and looked at the 9 central marks. I found that, for the most part, they differed by no more than .2 point. Albeit .2 is enough to make a champion, it is at least a more definitive result than watching 4.0 to 5.5 marks on either end.
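The trimming described here is easy to reproduce. The sketch below uses an invented set of 14 marks (not real marks from any event) and a hypothetical helper name, purely to show the mechanics of dropping the top 3 and bottom 2.

```python
def central_marks(marks, drop_high=3, drop_low=2):
    """Return the marks left after dropping the highest and lowest few."""
    ordered = sorted(marks)
    return ordered[drop_low:len(ordered) - drop_high]

# Hypothetical marks from 14 judges for one skater.
marks = [4.0, 5.1, 5.3, 5.3, 5.4, 5.4, 5.4, 5.5, 5.5, 5.5, 5.5, 5.6, 5.7, 5.8]

core = central_marks(marks)
spread = max(core) - min(core)
print(len(core), "central marks, spread", round(spread, 2))
print("full spread", round(max(marks) - min(marks), 2))
```

With these made-up marks the nine central marks span about .2, against 1.8 for the full set.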

Joe

rgirl181 · Apr 1, 2003

Re: Statistical error in judging

Mathman,
Wait, wait, I know you don't really feel this way but I just want to indulge the moment:
Quote: "Now I am starting to think that you are right about this."
If I weren't having computer problems (seems like we are all having computer problems since just before Worlds--hmm...maybe it's an infectious sekret computer) I'd put that sentence in big red letters.

First, please read the part in my post again where I say, "I agree that we are talking about sampling error." Actually, I cut out a part (trying to be pithy) about mathematical statistics vs. applied statistics, but just as well since you provide a much better basis for discussing this than what I said would have. Everything I learned about statistics, aside from the math basics, was from people who used them in studies of real world events, especially athletics: comparison of reaction time to final outcomes in the 100 meter dash; the ratio of quadriceps to hamstring isokinetic strength in female gymnasts vs. basketball players; diving judging; sports injuries; all that kind of stuff. Everything was of course taught in terms of the math but the emphasis was on what statistical method was best used in a given situation. That's my bias. In my graduate stat classes, we had arguments galore, which were often initiated and always encouraged by the professors. Not about the math--that's a glorious given--but about methodology--that's a b****.

Anyway, I see your point about why you feel secrecy is at the heart of the issue. But I still stand by my point that we had openness before and it was just as bad as this. Of course 9,000 judges would reduce the error, whether it be sampling error with the random selection method based on the idea that the score you get is the true score or be it standard error using the mean (average) score of all 9,000 scores submitted based on the idea that the true score is never attainable. Duhhh. But of course we can never get 9,000 judges just as we can never study an entire population to determine if blue-eyed skaters jump higher than brown-eyed skaters. We can only get samples. The questions for me are (maybe not you), "(a) How do we get a reasonably unbiased sample of judges, given that we are limited to, say, a panel of 14 at most? And (b) how do we get that sample of judges to most accurately represent the scores that a very large and ideally sampled group of judges would give?"

First of all, about the "true score" thing: Even though the true score is in the great realm of ideas, what we want to do with statistics, IMO, is approximate that true score as closely as possible. If a skater receives 5.9 from all 14 judges for either technical or presentation, I'd say that 5.9 is a pretty accurate representation of that skater's true score. If a skater receives a range of 4.3 up to 5.9, with a median score of 5.5, I'd say the 4.3 should not be counted because it deviates too much from the median.

You also said, "Suppose that you have won me over on this point. In that case, the only consideration of statistical merit is the size of the judging panel." Not in my opinion. IMO, there are lots of ways to alter the way statistics are used in figure skating judging that have almost nothing to do with the size of the judging panel, aside from needing at least a reasonable sample of judges.

Also, I do not think it is necessarily better to count all 14 judges' scores or even best to count all 9,000 scores of 9,000 judges. Why? Because of deviant scores, ie, the judge who gives a skater a 4.3 when 13 other judges score him in a range from 5.5 to 5.9. Here's just one possible scenario. I'm not saying it's the best or even a decent way to do things, it's just an example of using things other than the size of the judging panel. Let's stick with the panel of 14 for the time being. Out of the 14 scores, first I'd find the mode (the mode is the score that occurs most often, in case anyone other than Mathman and Rgirl are reading these :lol: ) for both the technical and presentation evaluations for each skater. Then I'd throw out the scores with the greatest deviation from the mode, eg, if the mode score is 5.5, I'd throw out, for example, the 4.3, 4.7, and 5.9. Then I'd take the mean of the remaining scores. Thus in the Rgirl System Version 1.1, each skater would receive only one score, like in gymnastics.
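For anyone who wants to try the scheme, here is a minimal sketch of it. Several details are assumptions filled in for illustration: the number of marks thrown out is fixed at three (as in the example), distance is measured from the mode, a tie for the mode is broken by whichever mark appears first, and the 14 marks are invented.

```python
from collections import Counter
from statistics import mean

def rgirl_score(marks, n_drop=3):
    """Mode-anchored trimmed mean: drop the n_drop marks farthest from the mode."""
    # Assumption: on a tie for most common mark, the first one seen is used.
    mode_mark, _ = Counter(marks).most_common(1)[0]
    by_closeness = sorted(marks, key=lambda m: abs(m - mode_mark))
    kept = by_closeness[:len(marks) - n_drop]
    return mode_mark, mean(kept)

# Hypothetical marks from 14 judges, including the 4.3, 4.7 and 5.9 of the example.
marks = [5.5, 5.5, 5.5, 5.5, 5.4, 5.6, 5.4, 5.6, 5.3, 5.7, 5.5, 4.3, 4.7, 5.9]

mode_mark, score = rgirl_score(marks)
print("mode", mode_mark, "single score", round(score, 2))
print("plain mean of all 14", round(mean(marks), 2))
```

With these invented marks the 4.3, 4.7, and 5.9 are the three discarded, and the single score lands on the mode rather than being dragged down by the two low marks.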

Yes, I would have the names, scores, and ordinals of every judge out in the open, so the full range of scores would be available for public, skater, and federation scrutiny. Also, there would be some kind of system for dealing with those judges whose scores are continually thrown out for having the greatest deviation from the mode. They wouldn't be fired or punished (unless they are found to be cheating), just asked to justify their scores. Nothing wrong with always being odd man out, just be able to back it up. Maybe those judges are deducting for things other judges SHOULD be deducting for, like flutzing. I think both the extreme and the average scores can be potentially enlightening. Even the average and median scores can be error ridden if they are the result of collusion. That's the problem I have with your "the true score is the score you get." If that score is from a judge or group of judges who are unacceptably biased or cheating, how can that score be an accurate reflection of what the skater did?

Of course if a judge whose scores are routinely extreme cannot justify them or if it's a US judge who is always out of whack with Russian skaters or vice versa, suspend him and send him back to judge training with no pay.

What it comes down to for me is, given the limits of the real world and the bias inherent in human behavior, what is the statistical method by which skaters will receive the score from the most judges on a given panel that most closely reflects the true merits of their skating. How is "the score that most closely reflects the true merits of their skating" instead of "true score"? Although I like true score just fine. (My best stat book got lost during my last move, otherwise this explanation wouldn't be so lame.)

There was some other stuff, but lucky for all of us that the repeat of "Six Feet Under" is on so I can stop:smokin:
Your Buddy in Figure Skating,
Adversary in Applied Statistics,
Rgirl

[Edited because in the shower this morning I realized I used and defined "median" in the original version as the score that occurs most often when in fact that is the "mode." The median is the score that exactly divides the upper half of the distribution from the lower half. I was sure I'd be busted. Nothing like a stupid mistake to wreck your program. Now I know how Sasha felt when she fell on her spin. RG]

mathman444 · Apr 2, 2003

Re: Statistical error in judging

Now I am beginning to think that you are right about this.

Quote: "...in case anyone other than Mathman and Rgirl are reading these."
Are you offering any odds on that?

MM

Ptichka · Apr 2, 2003

Re: Statistical error in judging

Interesting discussion. I was just talking to my father last night (we are both software engineers), and he suggested another way to fight corruption on judging.

After every competition, a computer could analyze the judges' scores and note which judges' marks were significantly different from the rest of the judges' marks. If the computer has in its database the information on how the judges judged previous events, by the end of the season you could see which judges are way off. It is irrelevant whether they are off because they are corrupt or incompetent. You could come up with an accumulated percentage of the difference from "average" that would throw the judge out of judging international competitions.

Of course, such a system would only really make sense once FS goes to the merit-based system of marks.
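A bare-bones sketch of what such a tracker might look like follows. Everything specific in it is an assumption for illustration: it uses each judge's average absolute distance from the panel mean as the "difference from average", the cutoff of 0.4 is arbitrary, and the judge labels and marks are invented.

```python
from collections import defaultdict
from statistics import mean

def flag_outlier_judges(events, threshold):
    """events: list of dicts mapping judge label -> mark for one performance.

    Returns each judge's season-long mean absolute deviation from the
    panel average, plus the judges whose figure exceeds the threshold.
    """
    deviations = defaultdict(list)
    for marks in events:
        panel_average = mean(marks.values())
        for judge, mark in marks.items():
            deviations[judge].append(abs(mark - panel_average))
    season = {judge: mean(devs) for judge, devs in deviations.items()}
    flagged = sorted(judge for judge, dev in season.items() if dev > threshold)
    return season, flagged

# Hypothetical marks from four judges across three performances.
events = [
    {"J1": 5.5, "J2": 5.6, "J3": 5.4, "J4": 4.6},
    {"J1": 5.8, "J2": 5.7, "J3": 5.8, "J4": 5.0},
    {"J1": 5.2, "J2": 5.3, "J3": 5.2, "J4": 5.9},
]

season, flagged = flag_outlier_judges(events, threshold=0.4)
for judge, dev in sorted(season.items()):
    print(judge, round(dev, 3))
print("flagged:", flagged)
```

One caveat with this simple version is that an extreme mark also shifts the panel average it is compared against; comparing each judge with the median, or with the average of the other judges, would be a reasonable variation.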

mathman444 · Apr 2, 2003

Re: Statistical error in judging

Ptichka, I'm not an expert on this, but I think that they already do have some sort of procedure in place like you suggest. Under the old system, all of the judges report to the referee and if he or she sees something out of line with the majority, the judge must submit a written report justifying the marks that he or she gave. Presumably judges that are often off base for no good reason aren't invited back.

In the interim system, this evaluation will take place at the end of the season for all competitions together, instead of after each event separately.

Now. Rgirl. Why didn't you tell me that you were talking about APPLIED statistics. Does that mean it has something to do with that illogical and terrifying place, the real world? Truth be told, if it's not about my two little fantasy worlds, Mathematics and Planet Kwan, I'm not much interested.

Still, a couple of points.

1. About excluding extreme scores. Under the old (and also the interim, I think) system, after the first skater skates the scores are tallied and the median score is announced to the judging panel. The judges are then supposed to compare their scores with the median and judge the rest of the contest accordingly.

So for instance if you give the first skater a mark of 4.9 and the median is 5.3, then you are supposed to say to yourself, well, I was too tough on this skater by about 4 tenths, so to be fair I have to scale all of my scores down about the same. That's why it sometimes happened (under the old system) that one judge stood out like a sore thumb, giving everybody scores that are way too low. The judge is not being mean, he or she is trying to be fair.

This is not necessarily to be lamented, because the only thing that counts is the ordinals. If a judge got off on the wrong foot and scores everybody 4 tenths below what he or she expects the median for that skater to be, that will not affect which skater he or she puts first, second, and so on. Michelle would still be world champion under the OBO system with 8 first place ordinals, even if a judge scored her performance at 4.5, as long as that judge was consistently stingy to all skaters.
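The claim that a consistently stingy judge produces the same ordinals is simple to check. In this sketch the skater labels and marks are invented, ties between marks are ignored, and the 0.4 offset mirrors the example above.

```python
def ordinals(marks):
    """Rank skaters by one judge's marks: the highest mark gets ordinal 1.

    Assumes no two skaters receive the same mark from this judge.
    """
    ranked = sorted(marks, key=marks.get, reverse=True)
    return {skater: place for place, skater in enumerate(ranked, start=1)}

# Hypothetical marks from a judge who is in line with the median...
median_aligned = {"A": 5.8, "B": 5.6, "C": 5.3}
# ...and from one who is four tenths lower on every skater.
stingy = {skater: round(mark - 0.4, 1) for skater, mark in median_aligned.items()}

print(ordinals(median_aligned))
print(ordinals(stingy))
```

Both judges produce the same placements, so under an ordinal-only rule the lower marks change nothing.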

2. Aside (in case anyone is reading this besides Rgirl, Ptichka and me). OBO means one by one. It means that the winner is determined strictly by counting ordinals. If you get 5 first place ordinals out of 9, you win, period.

Skater A: 1 1 1 1 1 3 5 10 15
Skater B: 2 2 2 2 2 1 1 1 1

Skater A wins.

After first place is determined, then second place is determined similarly with the first place winner out. Then third, and you go down the line, "one by one." Total scores, average scores, median scores, outliers, inliers -- none of that counts, only ordinals. Here's another example:

A: 1 1 1 1 3 3 3 3 3
B: 2 2 2 2 2 5 5 5 5

B wins. Here's how. First count the first place ordinals. No one has a majority, so we go on to round 2. In round 2 we count both first and second place ordinals. A has 4, B has 5. B wins, despite having no first place ordinals at all, and despite the fact that every judge but one preferred A.

Note that judge number 5 decides the whole contest. Switch his or her votes around and the other person wins. This is what happened to Michelle versus Irina at Salt Lake City, where judge number five was the American judge, who gave the gold medal to Sarah by ranking Irina second ahead of Michelle in the free skate.

Note also that this system (both the old and the interim) DOES throw out ORDINALS that are way off. In the first example the 10 and the 15 don't "count" in the sense that you can change them to any value whatever without affecting the result.
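
Here is a rough Python sketch of the counting described in the two examples above. The function name `obo_winner` is made up for illustration, only skaters A and B are tracked (the rest of the field is left out, as in the examples), and the real tie-breaking rules are not modelled, so treat it as an illustration of the majority idea rather than the official procedure.

```python
def obo_winner(ordinals):
    """Return (skater, round) for the first skater to reach a majority.

    ordinals maps each skater to a list of ordinals, one per judge.
    In round k we count ordinals of k or better.
    """
    judges = len(next(iter(ordinals.values())))
    majority = judges // 2 + 1
    worst = max(max(places) for places in ordinals.values())
    for k in range(1, worst + 1):
        counts = {s: sum(1 for p in places if p <= k)
                  for s, places in ordinals.items()}
        leaders = {s: c for s, c in counts.items() if c >= majority}
        if leaders:
            best = max(leaders.values())
            winners = [s for s, c in leaders.items() if c == best]
            if len(winners) > 1:
                raise ValueError("tie-break rules are not modelled in this sketch")
            return winners[0], k
    raise ValueError("no majority found")


first = {"A": [1, 1, 1, 1, 1, 3, 5, 10, 15],
         "B": [2, 2, 2, 2, 2, 1, 1, 1, 1]}
second = {"A": [1, 1, 1, 1, 3, 3, 3, 3, 3],
          "B": [2, 2, 2, 2, 2, 5, 5, 5, 5]}

print(obo_winner(first))   # A, on first place ordinals alone
print(obo_winner(second))  # B, in round 2

# Judge number 5 switches A and B in the second example.
switched = {"A": [1, 1, 1, 1, 2, 3, 3, 3, 3],
            "B": [2, 2, 2, 2, 3, 5, 5, 5, 5]}
print(obo_winner(switched))  # A, in round 2

# Replace the 10 and the 15 in the first example with other values.
tweaked = {"A": [1, 1, 1, 1, 1, 3, 5, 2, 9],
           "B": [2, 2, 2, 2, 2, 1, 1, 1, 1]}
print(obo_winner(tweaked))  # still A, in round 1
```

The last two cases check the claims in the text: one judge's switch flips the second example, and the far-out ordinals in the first example can be changed without moving the result.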

So, bottom line: any scoring system that makes any use of the scores at all -- means, medians, whatever, except insofar as they translate into ordinals -- is already a radical change.

Mathman

rgirl181 · Apr 3, 2003

Re: Statistical error in judging

Am late for an appointment but had to change something (see edit to recent post) but am glad Ptichka weighed in. I'm not so sure that the ISU ever made a serious effort to keep track of judges' scores in an attempt to identify rogue judges. Mathman, you are absolutely right about the ordinals of course, but like I said, "Six Feet Under" was on and that trumps all discussions on figure skating. One thing though, just because a judge marks a skater or several skaters unusually low does not mean he marks all skaters low. That's at the heart of manipulating the outcome. Using an "us vs. them" example, score your skaters high, their skaters low, and try to look reasonable on those you don't care about. Hence the ordinals for your skaters are higher than those for their skaters. When I'm not late for the dentist I'll come back for ordinals.
Rgirl

mathman444 · Apr 3, 2003

Re: Statistical error in judging

Here they are:

1. Rgirl
2. Rgirl's dentist

About that edit: Proving yet again what a gentleman I am.

Anyway, thanks for helping me clarify my own thoughts about this. Here is the new Eternal Truth. (Forget all those old eternal truths that I said yesterday.)

1. The old system versus the interim system.

Twiddling the statistical methodology -- 9 judges versus 14, judges that count versus judges that don't, proper use of the mean, median and mode, outliers, range and standard deviation -- these considerations don't really make much difference.

In any system one can construct hypothetical (or even actual) competitions where the method of determining the winner didn't work out very well, and any system can be manipulated by cheaters, crooks and blackguards.

The only substantive difference between the two systems is that under the interim system the judges vote in secret and under the old system the voting is done in public view.

2. The old and interim systems, on the one hand, versus the proposed points-per-element system and the Rgirl system, on the other.

The distinction is between systems in which the winners are decided by ordinal placement and systems in which the winner is decided by some sort of totaling or averaging of points.

I suppose that the argument in favor of ordinals must go something like this. It is easy for a judge to say, of these three skaters I liked this one the best, that one next, and the other one third. It is not so easy to try to come up with that mythical, ghostly, exactly right 5.6-to-end-all-5.6's. Or to decide whether Michelle switching over to the flat at the very last second on her triple Lutz means she should get only a +1 bonus point instead of a +2 on that element.

Also, the OBO ordinal system makes it harder for a conspiratorial minority, or an individual patriot, to sway the outcome (although your idea of discarding scores that are too far from the median also accomplishes that goal).

Mathman

rgirl181 · Apr 7, 2003

Re: Last Words, Plus or Minus

Mathman,
There is no "Rgirl's system." As I said in my post, it was just an example of how using different statistical methodology might affect the scores. I haven't studied enough applied statistics or FS judging to propose a system. But then, neither has Speedy but that doesn't stop him :lol:
Quote: "Twiddling the statistical methodology -- 9 judges versus 14, judges that count versus judges that don't, proper use of the mean, median and mode, outliers, range and standard deviation -- these considerations don't really make much difference."

I realize we disagree, but why the denigrating language?

Quote: "Also, the OBO ordinal system makes it harder for a conspiratorial minority, or an individual patriot, to sway the outcome (although your idea of discarding scores that are too far from the median also accomplishes that goal)."

Okay, quid pro quo: I thought "twiddling the statistical methodology" didn't really make much difference?

Look, it goes without saying that I agree that the secrecy should go, it's just that I disagree that just making the judges' scores and ordinals public will "fix" things. As I've said before, everything was public before and we ended up with the SLC scandal and a number of other lesser known incidents of judges' cheating. And who knows how many times skaters lost out on medals they should have won because of biased judging? I'm saying that I think there are better statistical methodologies than those being used at present that will minimize the inherent bias in judging. IMO, it will take a combination of open judging and better statistical methods applied to the scoring plus several other changes in both the judging and the way competitions are organized, all of which I've noted before, in order for figure skaters to be awarded scores that best reflect the merits of their skating.

Beyond this, I think you and I will have to agree to disagree--respectfully, of course.

Rgirl EDB

mathman444 · Apr 7, 2003

Re: Last Words, Plus or Minus

I didn't mean for my language to be denigrating. It was my intent to say that the difference between the old and the interim systems, from a statistical point of view, was small and unimportant. (Oh wait, is that what "denigrating" means?).

I agree that neither of them answers well to the charge of permitting judging bias.

About the quid pro quo, my next point was that, in contrast, the difference between an ordinal based system and a points based system was not mere twiddling but really had content, whether for better or worse.

Never mind. The real reason that I am responding here is to ask what EDB means.

Respectfully and disagreeably yours,

Mathman (anxiously awaiting his next writing lesson).

rgirl181 · Apr 8, 2003

Re: Last Equations, Plus or Minus

Come now, Mathman, everybody knows Rgirl is an EDB

Just as everyone knows that the equation for this thread can be expressed as: RG<~~>MM = A2D/R
Rgirl MBEDB

mathman444 · Apr 8, 2003

Re: Last Equations, Plus or Minus

Let's see. The B part is obvious. Still working on the rest.

3axel · Apr 9, 2003

Whaaaat???

You can slice it and dice it any way you want, but if the computer's picks include the 9 most likely to favor skater A and leave out the 5 most likely to favor skater B then the computer will likely skew the result, no matter how much better skater B may perform. This would make all your arguments add up to meaningless technoblab.

3axel

mathman444 · Apr 9, 2003

Re: Whaaaat???

No, no, no, 3Axel! (Rgirl, I blame you for this!) That is a logical fallacy. A seductive one to be sure, but a logical fallacy nonetheless. Think about it.

What if the draw were done not by computer but by pulling names out of a hat?

What if we pulled names out of a hat 6 months before the event?

It would STILL be the case that the random draw might select from the pool 9 judges that were favorable to skater A, and not select those judges that were more favorable to skater B.
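
To put a number on "might," here is a small Python sketch. It assumes, purely for illustration, a pool of 14 in which 6 judges favor skater A and 8 favor skater B, and it asks how often a random panel of 9 ends up with a majority for A. The function name `prob_panel_majority` and the 6-to-8 split are invented for this example.

```python
from math import comb


def prob_panel_majority(pool_for_a, pool_size=14, panel_size=9):
    """Chance that a randomly drawn panel has a majority favoring skater A.

    pool_for_a is the (hypothetical) number of judges in the pool who
    favor A; everyone else is assumed to favor B.
    """
    need = panel_size // 2 + 1
    total = comb(pool_size, panel_size)
    favorable = sum(
        comb(pool_for_a, k) * comb(pool_size - pool_for_a, panel_size - k)
        for k in range(need, min(pool_for_a, panel_size) + 1)
    )
    return favorable / total


# Hypothetical pool: 6 of the 14 favor A, 8 favor B.
print(round(prob_panel_majority(6), 3))
```

Notice that nothing in the calculation refers to when the draw happens or whether it is done by computer or out of a hat. The number depends only on the pool and the panel size.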

You are playing into Speedy's hands if you dwell on this argument and let him thereby obscure the real issue, which is secrecy.

Mathman
