Code of Points vs. Ordinals

February 28, 2004

This third and final article on the ISU's new Code of Points (Part I and Part II) consists of three fairly separate parts. The first describes an attempt at a direct comparison of the Code of Points and the more familiar ordinal scoring system at the recent Skate America and Skate Canada competitions. The second grows out of the first and examines the notion of the judges' anonymity under the new scoring system: is it real, an illusion, or meaningless? Finally, the third tries to sum up all of the preceding work to address the question of whether the Code of Points is or can be an improvement over the older system.

1. Shadow judges at Skate America and Skate Canada

When the ISU introduced its new Code of Points (CoP) scoring system at the Grand Prix series this season, it decided to use it as the only scoring system for those events rather than attempting to run both the CoP and the older, ordinal system simultaneously to allow a comparison of the two to be made. This was, perhaps, a wise decision on the ISU's part, since any such direct comparisons could only represent lose-lose situations. That is, if the two systems agreed, why switch? And if they didn't, how could it be shown that one was better than the other? But for skating fans, this was also an unfortunate decision, since it deprived them of just this key opportunity to make such comparisons.

In an attempt to remedy this situation, I asked for volunteers from among the members of one of the skating fan lists to serve as shadow judges for the Skate America and Skate Canada competitions, to compare their judgments with those of the CoP. Unfortunately, although skate fans love to argue, they seem more reluctant to actually put their reputations where their mouths are, and only three individuals responded to this request with usable protocols (my thanks to Louis, David, and Iris).
While this number is far smaller than I had hoped, these judges appeared quite consistent in their evaluations and may in fact represent a sample large enough to allow some tentative conclusions to be drawn from their ratings. One of these judges attended both the Skate America and Skate Canada competitions, and provided scores for the full range of the Men's, Ladies', and Pairs' events. Another attended only Skate America, providing a full range of scores for all events. The third was limited to CTV's television coverage of the Skate Canada events, which included only about half the competitors: generally, all the Canadian entrants plus the three or four highest scoring among the remainder.

As a result, the analyses below are limited to the Ladies', Men's, and Pairs' events at the two competitions, and further limited to about half the field at Skate Canada. Due to these limitations, it seemed best to confine detailed discussion of the results to the top three competitors in each event, with only some general comments applied to the rest of the field. Moreover, detailed statistical analyses could be undertaken only for the pooled competitors - combining Men, Ladies, and Pairs for 34 cases in all at Skate America, 17 at Skate Canada - in the Short and Free programs.

However, these marks yielded quite consistent results at both competitions. At Skate America, for example, in all four comparisons (technical scores and presentation marks for the Short and Free programs) the scores of the two ordinal judges agreed with each other more closely than they did with those of the CoP. In three out of four cases, the percentage of overlap values (technically, the squared correlation coefficient) ranged from about 50% to 80%, only slightly lower than what one might have found at any competition using highly trained international judges.
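For readers unfamiliar with the "percentage of overlap" figure, the sketch below shows one way a squared correlation coefficient between two judges' marks could be computed. It is an illustration only: the function name `squared_correlation` is hypothetical, and the marks in the example are invented, not taken from any of the shadow judges' protocols.

```python
def squared_correlation(xs, ys):
    """Return r**2, the squared Pearson correlation of two equal-length mark lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two marks")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("each judge's marks must vary for r**2 to be defined")
    return (sxy * sxy) / (sxx * syy)

# Invented marks from two hypothetical judges for the same five skaters.
judge_a = [5.2, 5.4, 4.8, 5.6, 5.0]
judge_b = [5.1, 5.5, 4.9, 5.7, 4.8]
print(f"overlap: {100 * squared_correlation(judge_a, judge_b):.0f}%")
```

A value near 100% means that one judge's marks are almost perfectly predictable from the other's; a value near 0% means they are unrelated.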
The one exception was that of the presentation marks in the Short program, where the agreement rate for the two ordinal judges was just under 60%, but their agreement with the CoP fell to 27% and 41% - better than one might expect more than once in 1000 tries if the numbers were random, but far below what one would expect of trained judges. Similarly, for the two ordinal judges, agreement between the technical and presentation scores assigned by any one judge ranged from about 65% to 95%, as it tends to do in most competitions, while for the CoP it fell to 52% in the Free programs and an appalling 21% in the Short programs. Overall then, the two ordinal judges appeared to agree with each other about as often as could be expected, while the CoP agreed with either of them somewhat less often, and with itself, almost not at all. In more detail now, the tables below present the results for the top three finishers in each of the three disciplines at Skate America.
Here all three judges agree on the top three placements in the Ladies' Free event, and the differences between the CoP and ordinal judges in the Ladies' Short could reflect nothing more than national bias on the part of the ordinal judges. A similar case could be made for their preference of Michael Weiss over Takeshi Honda in the Men's Free program, while the results in the Men's Short show all three judges disagreeing equally. But the national bias argument runs into some trouble with Scott Smith, placed third by the CoP judges, but no better than fifth by the ordinal judges. In Pairs, finally, there are some differences again between the CoP and the ordinal judges, but they appear no greater than what one might expect among any three judges at any other competition.

Parallel investigations to these were conducted for the 17 entrants scored by the two ordinal judges at Skate Canada, with similar, but somewhat more interesting results. Again, the agreement percentages among the judges for the technical and presentation scores were well within normal bounds, but were consistently higher for the two ordinal judges with each other than for either with the CoP. And again, the technical-performance agreement values within each of the judges were satisfactorily high for the two ordinal judges but fell to 40% for the CoP in the Free programs - statistically significant, but well below expectation.

The following three tables again give the rankings of the top three finishers in each of these six events, although here it should be borne in mind that the placements are based on only this sample of five or six contestants. Thus, the rankings for CoP reflect not the competitors' actual placements, but only their relative placements within this set of competitors.
On the surface, here, these data appear a shade better than those at Skate America, with all three judges in full agreement on the top three finishers in the Ladies' Short and Free programs, as well as the Men's Free. Pairs do less well, with both ordinal judges giving Langlois and Archetto first place, while the CoP ranks them third, in the Short program. One ordinal judge ranks Zagorska and Siudek far lower than the other two in the Free program, but even this does not appear much more of a discrepancy than one might encounter in typical ordinal judging situations. But it is the Men's Free program that appears most interesting, particularly with regard to the evaluation of Emmanuel Sandhu, who was actually placed last by one judge, among the seven competitors considered in these analyses, although the CoP judges placed him at fourth. This discrepancy appeared wide enough to warrant further inspection. At the competition itself, Sandhu placed fifth in the Free Program, behind Plushenko, Buttle, Kevin van der Perren (in third) and Honda, but this showing appeared to be based primarily on his presentation marks, which were higher than van der Perren's, rather than his technical scores, which fell below those of Chengjiang Li, who finished in eighth place. But the finer detail of his technical scores went well beyond this: on five of the 13 elements for which Sandhu was scored, judges could not agree on whether the element was adequately (0 to +3) or inadequately (-1 to -3) performed. At the most extreme, two judges awarded one of his spins +2, while two others thought it rated -1. As indicated in my earlier analyses of the Skate America and Skate Canada competitions, such discrepancies have not been terribly rare at these events, but typically they occur only once or twice in any one skater's performance and the range is usually no greater than +1 to -1, with only one judge out of line. 
When they become as frequent and extreme as in Sandhu's case here, they seem to point to a more serious problem. But what can account for this amount of disagreement among judges? Is there something that makes Sandhu different from other skaters? Arguably, he may be more artistic than most others, as a result of his lengthy ballet training, and may also be somewhat quirkier in his programming. If one were to compare him to other skaters, he would clearly seem to be more like Toller Cranston, say, or many of the second-string French skaters of recent years, Laurent Tobel, Vincent Restencourt, or Stannick Jeanette, than like a Todd Eldredge, Michael Weiss, Takeshi Honda or Timothy Goebel.

To further check the possibility that quirkiness affects CoP judgments as much as it presumably did those made under the ordinal systems, I looked at the marks for Stannick Jeanette, who also competed at Skate Canada, finishing ninth of twelve. In his Free program these showed the same sort of disagreement as to whether an element was well performed or not, in fully seven of the 14 moves for which he was scored. As in Sandhu's case, it was the spins, rather than the jumps, that seemed to create the greatest difficulty, with all four being scored in this discrepant manner, and one of these again showing a range of -1 to +2.

Unbelievably, Jeanette's marks in the Short program appear even worse. Although only three of the eight elements were discrepantly scored, those for one (a circular step sequence) ranged from -2 to +2, while those for another (a double Lutz) managed to cover virtually the full scale, from -3 to +2! Neither of these appeared to be score-punching errors, since in both cases the out-of-line judges tended to mark Jeanette either much higher or much lower than the other judges did.
Moreover, the judge assigning Jeanette the low marks for the technical elements also gave him no performance marks above 5.50 in the Short program (while none of the other judges assigned marks below 5.50, and in fact only two of those 50 marks were as low as 5.50).

And this, it seems to me, strikes as close to the heart of the CoP system as any of the difficulties so far encountered. The CoP was introduced and widely sold, after all, as a system that would finally do away with the pesky "subjectivity" of judges' ratings, by confining them to the mere description, rather than evaluation, of at least the technical aspects of skaters' performances. The ordinal system's use of global ratings, it was argued, allowed all those nasty forms of bias - national, stylistic, pressured, or bartered - to emerge; but both the tightened requirements and the guaranteed anonymity of the CoP system would put an end to all that. But apparently, it doesn't.

2. How anonymous are "anonymous" judges?

This awareness then led me to a slightly different line of analysis, examining the anonymity provisions of the CoP, which guarantee, the ISU has repeatedly told us, that there will be no more biased judging, since no one will be able to tell which of the many judges on a panel (11 at the Grand Prix) were chosen for the smaller group (7 at the Grand Prix) whose marks were actually scored. And if, the ISU insisted, the pressuring elements cannot know which judges were actually counted, they cannot know whether the pressured judges did as they were supposed to or not, and thus their pressure is meaningless. Unfortunately, however, this argument makes no sense at all in the first place - as I'll show in a little while - and is also quite untrue, in the second. This last point is easily demonstrated, though it will take a few moments to work out in detail.
It goes like this: At Skate Canada, 7 out of the 11 judges were actually used in calculating each skater's GoE and their five performance scores, and of those seven the high and low scores for each element were thrown out. Trying to identify which of those eleven judges were actually used would appear to be a formidable task: the number of different sets of a given size (called "k" in most formulas) that can be selected from a larger number of elements (called "n") is given by the deceptive little formula n!/(k! x (n-k)!), where the exclamation point means the product of all the integers up to whatever number precedes it.

Check it out: how many different pairs can you draw from four cards? n! is 4 x 3 x 2 x 1, while both k! and (n-k)! are 2 x 1. One of those 2 x 1's will cancel out the same pair in the numerator, so we're left with 4 x 3 divided by 2 x 1, or 6. Now count the actual number, calling the cards A, B, C, and D: AB, AC, AD, BC, BD, CD = 6.

But for the Grand Prix, the numbers are much larger. 11! can be rewritten as 11 x 10 x 9 x 8 x 7!, and this last will cancel out against the same figure in the denominator, so that 11!/(7! x 4!) turns into 11 x 10 x 9 x 8, divided by 4 x 3 x 2, or with the 4 and 2 canceling out the 8, 11 x 10 x 9 divided by 3, for a total of 330 possible combinations.

Sounds like a lot, doesn't it? But it isn't really, for two reasons. The first is that anyone with as much programming skill as we all possessed 20 years ago, before there were canned statistical programs that did everything at the click of an icon, could write a program that would test all possible combinations of these judges, discounting highs and lows, and compare the means they produced with the numbers given by the ISU, thus easily identifying which judges were or were not counted for that particular skater.
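The brute-force check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ISU's actual computation: the function names `trimmed_sum` and `candidate_panels` are hypothetical, the marks are invented rather than taken from any protocol, and the sketch compares trimmed sums (which is equivalent to comparing trimmed means for a fixed panel size).

```python
from itertools import combinations
from math import comb

def trimmed_sum(marks):
    """Sum of the marks after discarding one highest and one lowest mark."""
    return sum(marks) - max(marks) - min(marks)

def candidate_panels(marks_by_element, published_sums, panel_size=7, tol=1e-9):
    """Return every panel (tuple of 0-based judge indices) whose trimmed sums
    reproduce the published figure for every scored element."""
    n_judges = len(marks_by_element[0])
    matches = []
    for panel in combinations(range(n_judges), panel_size):
        if all(
            abs(trimmed_sum([row[j] for j in panel]) - target) <= tol
            for row, target in zip(marks_by_element, published_sums)
        ):
            matches.append(panel)
    return matches

print(comb(4, 2), comb(11, 7))  # 6 330

# Invented marks: one row per element, one column per judge (11 judges).
hypothetical_marks = [
    [0.0, -1.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0, 1.0],
    [-1.0, -1.0, 0.0, 0.0, -2.0, -1.0, 0.0, -1.0, -1.0, 0.0, -1.0],
    [0.5, 0.0, 0.5, 1.0, 0.0, -0.3, 0.5, 0.0, 0.5, -0.3, 0.0],
]
# Pretend these seven judges were the ones the computer drew.
secret_panel = (1, 3, 4, 6, 7, 9, 10)
published = [trimmed_sum([row[j] for j in secret_panel]) for row in hypothetical_marks]

matches = candidate_panels(hypothetical_marks, published)
print(len(matches), secret_panel in matches)
```

With only a few elements, several panels may reproduce the published figures; each additional scored element tends to rule out more of the 330 candidates.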
The second is that even without these programming skills (which I have unfortunately long unlearned) it's no particular problem to figure out which were the "real" judges. Consider Jeanette's case: Table 1 gives the data for his short program as presented in the ISU's official protocols, available from their home page.
The names down the left identify the eight moves that were scored in this performance, and the five "presentation" scores that the judges award (on a ten-point scale) after the performance is completed. The remaining numbers are the +3 to -3 scores the judges punch in following each move. Unfortunately, these are not the real numbers used by the computer, however, since those are weighted differently for each of the jumps, spins, and steps, depending on their complexity. Thus, in order to follow what the computer is doing, we need to replace those numbers with their weighted equivalents, to get the data shown in Table 2.
This table also gives the sum (which is the average multiplied by five) of the GoE scores given by the judges, rather than their average, to simplify calculations, which are now more or less simple enough to do in your head. Look, for example, at the second row (labeled "3A" for triple axel): Since the scores of the selected judges have to add up to -4.00, we know that this can only be achieved if one of the two scores of 0.00 is included among the five, with the other four judges contributing -1.00 each. But since 0.00 is also the high score in this set, two scores of 0.00 will have to show up among the seven judges, since if only one did, it would be the high score that was thrown out. Thus, we know that both judge4 and judge10 have to be among the seven selected judges. Similar reasoning applied to the sixth row (FSSp2) tells us both judge2 and judge7 need to be included.

With four judges now identified, we need only find three more among the remaining seven. This reduces the number of possibilities from 330 to 7!/(4! x 3!) or 7 x 6 x 5 divided by 3 x 2, which works out to 35. But there's more: row three (CSSp2) tells us that we need to have either judge1 or judge5, but not both, and this brings us down to 20 possibilities. And that's few enough that one can just run them off and check them out by hand. Which then leads to the conclusion that the set of seven was composed precisely of judges 2, 4, 5, 7, 8, 10, and 11. Check them out in Table 2 if you don't want to believe me.

This doesn't, of course, tell us who the judges are, but even someone with no knowledge of the judges other than what is publicly available can make some informed guesses. For example, the statistical analyses of these data showed that two of these seven judges (numbers 4 and 8) each gave significantly higher ratings to Jeanette than did the other 10 members of the panel, and one (number 2) significantly lower.
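The narrowing from 330 to 35 to 20 candidate panels can be verified by simple enumeration. The sketch below uses only the constraints stated in the text (judges 2, 4, 7, and 10 must be included; exactly one of judges 1 and 5); the variable names are illustrative, and no marks from the protocol are needed for the count itself.

```python
from itertools import combinations

judges = range(1, 12)            # judges numbered 1 through 11
must_include = {2, 4, 7, 10}     # forced by the 3A and FSSp2 rows

# Every possible panel of seven drawn from the eleven judges.
all_panels = [set(p) for p in combinations(judges, 7)]

# Keep only panels that contain all four judges already identified.
with_known = [p for p in all_panels if must_include <= p]

# The CSSp2 row requires judge 1 or judge 5, but not both.
exactly_one = [p for p in with_known if len(p & {1, 5}) == 1]

print(len(all_panels), len(with_known), len(exactly_one))  # 330 35 20
```

The panel named in the text, judges 2, 4, 5, 7, 8, 10, and 11, is one of the 20 survivors; picking it out from the rest still requires checking each against the remaining rows of the table.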
In fact, on the five "subjective" scores, this judge gave Jeanette the lowest mark of any judge on four of the five elements, while judge number 4 gave him the highest marks of any judge on four of the five. Since the ISU protocols also identify the eleven judges who served as the total pool, one might guess that Judge 2 is probably Jan Hoffman, the former World champion from the former East Germany, whose skating career was marked by an almost robotic attention to correctness: someone like Jeanette must offend him in a way that is almost metaphysical in its impact. And if you want a likely candidate for someone to give Jeanette routinely high marks, the one French judge on the panel sounds like a good place to look.

Of course, none of this helps us identify the other judges, nor - it should be noted - can we assume that Judge 2 of the eleven judging Jeanette will be Judge 2 of the set judging any other skater: he or she will be on that panel, and used by the computer, but not necessarily identified in the protocols in that position. So it looks like anonymity might still be maintained: although given both the quick computer program, and the knowledge of judges' behavior and idiosyncrasies possessed by, say, Didier Gailhaguet, I doubt any judge is very safe.

But the more interesting question is, so what? Who, other than ISU President Ottavio Cinquanta, is really concerned about keeping judges anonymous, and - come to that - what makes Cinquanta think anonymity fixes anything? Cinquanta has argued repeatedly over the past year that anonymity will make "cheating" impossible, since the forces pressuring judges to vote in certain ways will not be able to tell how any judge voted because they will all be anonymous, and furthermore will not be certain that their pressure works because they cannot know which judges the computer will select as the sample that is actually counted. Both of these arguments appear, however, to be specious nonsense.

"Forces pressuring judges to vote in certain ways" need only look at the ISU's protocols to see if any judges voted as they wanted. If none did, they'll know who to blame. Meaning that anonymity does not protect the pressured judge from retribution. And the mere fact that the judge I have fixed (if I were in the business of fixing events) may not be drawn for the final panel wouldn't seem to make much difference either. I have everything to gain and nothing to lose by bribing or coercing him anyway, and I have the simple check above to make sure he behaves the way I wish: why in the world should I not coerce him? And finally, of course, as the data indicate, "biased" voting, as measured simply by the tendency of certain judges to rate certain skaters higher or lower than the rest of the panel, at statistically significant levels, still exists anyway, despite Cinquanta's promises.

3. So what does it all mean?

Before I began this three-part analysis of the CoP I was, if not technically open-minded about it, thoroughly of two minds. On the one hand, the shift to a means-based rather than ranks-based system, the attempt to actually quantify elements of skating performance, and the development of a system that would allow results of one competition to be compared directly to those of another seemed to hold a great deal of potential for improving both the judges' performance and our understanding of what the judges' marks actually meant. But on the other hand, I was concerned that the goal of quantifying the elements might prove too daunting, and that it might lead skaters even further toward developing programs that focused on point-values rather than any artistic elements they might have. Moreover, those changes in procedure that had nothing to do with judgment itself (anonymous judges, random selection, trimming of means) seemed neither necessary nor desirable. My first pass at the Nebelhorn data did little to resolve my questions in either direction.
The major finding - that judges were apparently totally unable to distinguish among the five parts of the presentation scores - was certainly serious, but seemed easily attributable to their unfamiliarity with this way of looking at performances, and perhaps their inability to think their way through these scores while still learning to use the entirely different system of the touch-screen technical scores. Other inconsistencies seemed similarly to be readily attributable to the judges' lack of familiarity and comfort with a new system. Neither of these improved much at Skate America and Skate Canada, nor would one have expected to see much improvement in that short time span.

But one by one other problems began to emerge, and to find confirmation in the more anecdotal instances arising at the other Grand Prix events. Some of these, such as the not-fully-thought-out systems of weighting the elements for their degree of difficulty, have already been acknowledged by the ISU, which promises to present revised schedules of marks in the near future. Similarly, the ISU has also promised changes to the key role of the Technical Specialist, who determines the level of difficulty of the move, to make these marks more responsive to what the skater attempted, rather than what he or she achieved. But it is still unclear whether this will be sufficient to resolve issues such as how to call a jump that is (or - worse yet - may have been) a half-revolution shy of completion or cheated on take-off or landing. Moreover, the larger questions of the balance of marks between jumps and other elements (which currently heavily favors the jumps) or of the tendency of the scoring system as a whole to turn Free programs into longer versions of Short programs do not appear to have drawn much attention at all.

Other problems in the system as it emerged at these two competitions appeared more serious, however.
In particular, the inconsistencies in the weighting (that is, the actual effect they had on the total scores) of the various elements of skaters' performances in Men's, Ladies', and Pairs' events indicated clearly that what must ultimately be the goal of any judging system - to bring about agreement among the judges - had not only not been improved by use of the CoP, but in fact appeared to have worsened in comparison to the ordinal systems.

Indeed, it became apparent that one by one, all the claims that the ISU had made in support of this new system were unrealized. The promise that skaters would have direct control over their marks, by way of standardized values for the individual elements, fell under the finding that often these count for virtually nothing in determining the skaters' final marks. The promise that judges would become more accurate in their work was - if accomplished at all - achieved only by shifting the inaccuracies to the Technical Specialist, who seems little better at that job than any judge might have been, but does not have a majority to back him up, as the ordinal system required judges to have. And finally, the promise that judges would rid themselves of bias turns out to be just as chimerical as any cynic might have predicted.

In my progress through these three investigations, then, I've moved from a position of willingness to entertain the CoP favorably, to one of finding corrigible flaws, to my present position, which is that there is little here that can be defended. The same goes, of course, for all those other problems (anonymity, random selection, trimming of means) that could never have been defended in the first place.
But I'm still not ready to let go of its potential values: the objectivity of both the new scoring system and of the resulting scores, which not only allow us to compare skaters at different events, but also tell us a great deal more about what skaters are actually doing and how well they are doing it, than the ordinal systems ever could. It is just that objectivity also that makes it possible to undertake the analyses I've done here on as small a set of events as I have used. Under ordinal systems, a year or so of competitions would have been required to provide enough data to make meaningful conclusions possible. Thus it seems somewhat ironic that one of the CoP's strongest virtues should so readily be turned into a major source of criticism.

As a result, the best solution I can see for the Code of Points now is for the ISU to do what it should have done in the first place: withdraw this program from public display for another three or four years, to enable judges to be properly trained to use it, to iron out the problems in the weighting of the elements, to figure out some way to double-check the accuracy of the caller's calls (an appeals system might be the simplest solution, allowing skaters who felt they had been unfairly rated a chance to plead their case before a wider panel, immediately following the competition, quickly enough to allow changes to be announced before the evening is over), and to examine more closely the issue of how much it tends to strait-jacket skaters into point-gaining cookie-cutter routines.

Beyond this, the best solution I can see for the world of figure skating as a whole goes somewhat beyond these questions. Right now, that is, the skating world has been stampeded into taking a position on CoP, and that has boiled down to the question of CoP vs. ordinal systems. But is that really the issue?
If we are, for whatever reason, concerned about the ability of our present scoring systems to do justice to the performances presented by various skaters, we should be allowed to explore other alternatives than just the Code of Points. For one simple example, I have for several years been arguing and demonstrating that simply de-ordinalizing our present systems by using the means of the raw technical and performance scores to determine standings would improve the quality of judging overall (see the first of the papers at www.dirk.s5.com). It may be that no more is needed. At another extreme, it may be that something much like the CoP, but offering correctives for the various problems outlined above - particularly those of the tendency of any such system to turn long programs into longer versions of short programs and those of the tendency for skaters to skate for points rather than for artistry - can be put together. An initial step in addressing this possibility would clearly be that of wide consultation among members of the skating community - a step that has so far been absent.

So that what becomes apparent at this point is, with equal irony, the impression that analyses such as the ones I have been conducting are only of limited use in addressing the question we would like to address: is the CoP an improvement or a step backward from the ordinal systems we have become accustomed to? The question calls for the sort of value judgment that goes beyond statistical analysis.

Pushing on it a little, then, let's assume that all the problems of the CoP outlined above can be fixed: how does it stack up against either the ordinal system, or any of the improvements in the ordinal system that have been suggested - meaning, primarily, doing away with ordinals and using only the averages of the base-6 scores now assigned? And at that limit, I can find several sticky instances.
To pick just one, ballet dancers are trained to jump and spin (turn) in either direction and off either foot. I do not know of any skater who is or was truly ambipedulous (to coin a word) or indifferent to whether he or she turned clockwise or counterclockwise, although a few (Robin Cousins and Gary Beacom come to mind) have come close to this ability. Unlike Evgeny Plushenko's startling Biellmann spins - which are anatomically impossible for most males - the ability to perform these moves is a matter of training and talent and requires no unique genetic endowment. Hence skaters who can perform such moves should be awarded extra credit for them. The fact that no provisions for such extra credit exist now (other than in the subjective scores assigned after the performance, only one of which - "Skating Skills" - would be relevant here) is no particular problem at present, since no skaters do these moves. What is more of a problem, however, is that it seems difficult to incorporate them in any way into the present scheme.

Give it a try: Let's take direction of rotation. Right now, clockwise or counterclockwise makes no difference whatsoever in the score a jump receives, and cannot even be recorded. So we could, I suppose, introduce a special rule that says any jump rotating counter to the skater's normal rotation is worth, say, double the points. But what if the skater does both equally often, but does the clockwise somewhat better than the counterclockwise: which set gets doubled? The one that counts for less? But how do you determine which that is before the skater has performed? None of this is a problem for the ordinals, or any simple subjectively-based system at all: you just give them what you think they've earned.
Of course, this loses you both the apparent precision of the "objective" marks the CoP allows as well as the direct comparison of scores from one event to another they should allow, but this loss may seem small compared to the risk of simply not being able to reward performances that cry for specific rewards.

Within a few months or years, however, the choice between the CoP and any alternative system will be made. If the CoP wins out, the changes it will require will be sufficiently extreme and expensive to make it very difficult to abandon that system, even if it turns out a failure in any or all of the respects outlined above. It will, then, not only dramatically alter the world of figure skating as we know it now, but will continue to alter it far into the future, at the very least by imposing the sort of indirect limitations that an incomplete scoring system will introduce. One can only hope that members of the ISU, in voting on this scheme, will be aware of these ramifications.