Analyzing the Code of Points
Skate America and Skate Canada
January 28, 2004

Introduction

The most significant of the innovations introduced into the judging of figure skating competitions by the ISU's new Code of Points system (CoP) is its treatment of the technical aspects of competitors' programs. Rather than being awarded a single, global mark on a scale of 0 - 6, skaters' programs are now broken down into individual elements. Each element is assigned a specific "base" point value, and then rated in terms of its Grade of Execution (GoE) on a seven-point scale (usually described as ranging from +3 to -3, although the actual values differ for each skating element). Only after this work is completed do judges return to their more familiar task of assigning subjective "presentation" marks, which are now broken down into the five following elements: skating skills, transitions, performance/execution, choreography, and interpretation.

Most notable about this scheme is that a portion of the final mark awarded each performance is now more or less under the direct control of the competitors themselves. They are free to choose elements of as great or small a degree of difficulty as they wish: the greater the difficulty, of course, the higher the value the element receives. The questions of how large that portion is, and which elements contribute to it most meaningfully, are the central concerns of this analysis.

In order to answer them, the most effective tool is a statistical procedure called multiple regression. Multiple regression is a fairly complex set of analyses which assess the importance or weight of each measure (or score) in a set of measures in determining the value (or score) of some related measure (termed the "criterion" by statisticians). For example, if we are concerned with the causes of lung cancer or heart disease, we might look at such measures (scores) as family history, general health, smoking, and the like.
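To make the idea concrete, here is a minimal sketch in Python of one common way to put numbers on such weights: enter the measures one at a time and record how much of the criterion's variance each one adds. The data are synthetic and invented purely for illustration, the names `r_squared` and `incremental_r_squared` are hypothetical helpers, and this is not necessarily the exact procedure used for the figures reported in this article.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def incremental_r_squared(predictors, y):
    """Enter predictors in the given order and report the R^2 each one adds."""
    gains = {}
    used = []
    previous = 0.0
    for name, column in predictors.items():
        used.append(column)
        current = r_squared(np.column_stack(used), y)
        gains[name] = current - previous
        previous = current
    return gains

# Synthetic data (assumption: invented numbers, not real health or skating records).
rng = np.random.default_rng(0)
n = 40
history = rng.normal(size=n)
health = rng.normal(size=n)
smoking = rng.normal(size=n)
risk = 0.5 * history + 0.2 * health + 0.8 * smoking + rng.normal(scale=0.5, size=n)

predictors = {"family_history": history, "general_health": health, "smoking": smoking}
gains = incremental_r_squared(predictors, risk)
for name, gain in gains.items():
    print(f"{name}: adds {100 * gain:.1f}% of the variance in the criterion")
```

Note that the share credited to each measure in this kind of sketch depends on the order of entry whenever the measures are correlated with one another, so the order needs to be chosen on substantive grounds.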
Multiple regression allows us to tell just how important each of these may be, or how much of a contribution it makes to the development of the cancer or heart problem. Because of its wide application to almost any problem in the social and biological sciences, multiple regression is among the most commonly used of all statistical procedures.

Who's in charge - the skaters or the judges?

As a first step of this analysis of the CoP at Skate America and Skate Canada, I followed the outline given above, estimating first the contribution to the final standing made by the "base" marks, then the additional contribution made by the GoE marks, and finally the additional contribution of the judges' "presentation" scores. However, since both my earlier analysis of the Nebelhorn data and Sandra Loosemore's independent analyses of the judges' ratings had indicated that these were marred by heavy "halo" effects (where judges tended overwhelmingly to mark all competitors equally high or low on all five elements), I used only the average of these five numbers as estimates of the "presentation" mark. The graph below summarizes these values for both the Short and Free programs for all competitors at the Ladies', Men's, and Pairs' events at those two competitions.
Here, each bar represents one of the competitions, and within each, the segments (reading from the bottom up) represent the weight or importance (in percent) of the base values, GoE scores and presentation marks in determining the final rank of each competitor. The numbers inside each segment are percentages, and they do not quite add up to 100 in each case for two reasons: (1) for the Ladies' and Men's events, the 10% deduction for "sequences" of jumps was not included in the scores I tallied; and (2) for technical reasons, I used the final rank achieved by each competitor, rather than the total score, as criterion. This resulted in a little bit of slippage.

Several aspects of this graph appear fairly striking. First of all, at least one-third of the final mark for each event appears to be under the skaters' control, with the weight of this base mark going up to more than two-thirds for the Ladies' and Pairs' Free programs. Second, while all three weights appear similar across the Short and Free programs for the Men, they vary markedly for the Ladies and Pairs. In the case of the Ladies, the importance of the base mark virtually doubles from the Short to the Free program, while that of the GoE score is cut in half. Much the same happens, even more dramatically, in the Pairs events. Finally, the judges' "performance" scores do not contribute much more than one quarter of the final mark, falling below 10% in half the cases.

Given this much variation across the events, it's fair to ask for explanations - and a few can be offered. For example, the more rigid structure of the Short programs suggests that the competitors themselves have fewer choices in what elements to execute than they do in the Free programs. Hence, the relative importance of the base marks will be less in the Short program.
On the other hand, since men's programs tend to focus more on jumps than those of the women and pairs, the base marks (which vary more for jumps than for the other elements) in the Short program are more important for the Men's events than for the Ladies' and Pairs' events. However, the base marks tend to be roughly the same in both the Short and Long programs within the Men's competitions. The situation in the Pairs competitions is somewhat more confusing, with the base mark accounting for only one-tenth of the final score in the Short program, but almost three-quarters in the Free. This could be due to any one of a number of "accidental" factors -- such as the restricted range of elements available in the Short program, or a more restricted range of competitors entering the field in Pairs. Or, it could point to more serious problems in the identification and evaluation of the various elements.

Nevertheless, these data can provide a rough answer to our initial question: Who determines the marks? In the Short program, about one-third to one-half of the final mark appears more or less under the direct control of the skater, whereas another quarter to third comes from the judges' evaluation of how well the chosen elements are performed. A final 10% or so appears to come from the judges' overall summaries. In the Free program, about half of the final mark seems to be under the skaters' control and another quarter or so depends upon the judges' evaluation of the elements, and again, something like 10% comes from their overall summaries.

What matters most - jumps, spins, or steps?

Since the CoP scoring system divides the various elements of the Men's and Ladies' events into the categories of jumps, spins, and steps (with comparable groupings in the Pairs events), it seems reasonable to ask which of these three may be the most important in determining a competitor's final score, and whether they have the same effects in each of these three disciplines.
The next graph summarizes those numbers, which turn out to be fairly straightforward. In both the Short and Free programs, for all three events, it is the jumps that indeed matter most, accounting for two-thirds to four-fifths of the final score in all six groups. Spins account for about 15% of the final mark in all three Short programs, but seem to matter only for the men in the Free programs, while step sequences have only small effects (or none at all) throughout. Finally, the judges' evaluations, in their "presentation" (pres) marks, have variable effects, ranging from none at all in the Pairs' Free program to close to 20% in the Ladies' Short. Since all the numbers here, however, are somewhat preliminary and drawn from fairly small samples, it would be premature to attach much significance to any numbers or differences that amount to less than 20%.

Jump fests or tea parties?

If jumps matter most, in both the Short and Free programs, what matters most - their difficulty or their execution? This seems to have been one of the most widely discussed questions raised by the introduction of CoP. On one hand, many fans have suggested that the heavy weights assigned to jumps in this system would lead skaters to attempt jumps beyond their ability, since even a failed jump at a higher level would be worth more than a successful jump at a lower level. On the other hand, others have argued that the new system favors quality and presentation skills and would inhibit skaters from stretching to their fullest capacities.

Unfortunately, the numbers here do very little to clarify this issue, since they vary so widely as to make any general summary appear suspect. For example, in five of the six events, the base score for Jumps accounts for at least 25% (Pairs' Free) and up to 50% (Men's Short, Ladies' Free) of the competitors' final ranks. In the Pairs' Short program, it accounts for exactly 0%.
On the other hand, the GoE for jumps appears to have fairly consistent effects, accounting for one-fifth to one-third of the final scores of each of the groups. All the remaining scores -- the base rates and GoEs for spins and steps -- appear to be scattered more or less haphazardly across the six groups. Thus, the only consistent finding here is that jumps, either base rates or GoEs, seem more important than the other elements (which is what we began with).

Boys vs. girls

Since the ISU has consistently emphasized that the CoP would provide scores that are directly comparable across all skaters and all time, it seems fair to compare the scores of male and female skaters to determine whether one gender is "better" than another at any aspect of the skating performance. Of course, this question is rather simple-minded when put that way: not only do we all know that men consistently perform more and bigger jumps than women, but the actual scoring regulations adopted along with the CoP specifically allow the men more jumps than they do the women. So let's rephrase the question: Other than in jumping ability, are there any consistent differences between the genders?

The short answer to that question is "no", although we may have thought that women could be seen as better at spins or spirals than men. Indeed, the only gender difference (other than the superiority of men to women in jumping ability) to emerge from these data is the finding that for women, final scores are positively related to the number of triple jumps and the number of combination jumps performed, and negatively related to the number of double or single jumps. For men, no such relationships apply. However, this too appears quite consistent with what we have always known about figure skating.
Some tentative conclusions

In summarizing all the preceding work and trying to draw some general conclusions, it should be borne in mind that we are dealing with a fairly small sample of data, as well as a rather restricted, very early phase of experience with the CoP. Within these restrictions, however, it is also apparent that some of the findings appear to be quite robust, so that we may expect to see them repeated on larger data sets. Similarly, taking a broader, bird's-eye view allows us to see the forest instead of just the individual trees, and this may lead to a picture that looks quite different from what has been reported before. For convenience, the discussion below is keyed to the familiar model that goes from the base values of the elements to the judges' evaluations of those technical elements and finally to the judges' evaluation of the five "presentation" aspects of skaters' performances.
If there is a problem here, the most likely suspect would seem to be the technical specialist, or "spotter", who identifies the elements as they occur. As many critics have already pointed out, this can often be an arbitrary decision, which has, unfortunately, very extensive effects. While it is difficult to argue with these specialists' decisions on any but a "my opinion vs. their opinion" basis, Sandra Loosemore has been noting more compelling problems. Tracking individual skaters across their two or three appearances at the Grand Prix events, she has noted a number of cases where the same move in a skater's program has been scored in different ways by different specialists. For example, Michael Weiss's combination spin and both step sequences were called at level 2 by the specialist at Skate America, but only level 1 at Lalique. Similarly, Shizuka Arakawa's layback spin earned level 2 at both Skate America and Skate Canada, but only level 1 at Lalique.

This seems particularly unfortunate because the main rationale for the introduction of the CoP has been that of providing a "better" and "more objective" scoring system than the older ordinal systems. While the present data do not allow any judgment of the relative merit of these two systems to be made, it is difficult to see how the present problems could represent an improvement over anything at all.
Here, two potential sources of difficulty arise. The first of these is the use of trimmed means in combining the evaluations of the selected judges. While it often seems that trimmed means (throwing out the high and low marks for each competitor) should provide cleaner results by eliminating the effect of extreme outliers, this argument unfortunately assumes that there will always be extreme outliers. But this has never been documented, and indeed, the reverse seems to be true. More importantly, there are real statistical problems associated with the use of trimmed means, and the desired effect of doing away with outliers can be more easily and more cleanly achieved by the use of median scores, which are unaffected by outliers, rather than averages.

The second problem is that there is a great deal of variability across the judges in evaluating these elements: one would expect, for example, that all judges could agree at least as far as calling a move successful or unsuccessful. But even this low level of agreement is fairly commonly not met, with ratings from judges often ranging into both the plus and minus sides of the scale. For example, in the Ladies competition at Skate America alone, one can find Shizuka Arakawa receiving scores ranging from -1 to +2 for her triple Lutz-double Toe combination, and Jenny Kirk getting three +1's, two -1's and one -2 for her triple Flip in the Short program. The Free program shows four more instances of scores ranging from +1 to -2, or +2 to -1, in addition to nine instances of scores ranging from +1 to -1. Most of these discrepancies appear to occur in the evaluations of jumps, which seems reasonable, since cheated jumps can perhaps account for some of the disagreement. It's also unfortunate, since jumps are what counts most.
Worse yet, most of these discrepancies appear to occur (in all events) among the top-rated skaters rather than those at the middle or bottom of the range, suggesting that national bias may still be shading these evaluations.
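The two points above, trimmed means versus medians and judges disagreeing on the sign of a grade, can be illustrated with a minimal sketch in Python. It uses the six GoE grades quoted above for Jenny Kirk's triple Flip; the second panel of marks is invented. The helper names are hypothetical, and the sketch works on the raw +3 to -3 grades rather than the element-specific point values the CoP actually applies, so it is a simplification rather than the ISU's calculation.

```python
from statistics import mean, median

def trimmed_mean(marks):
    """Drop the single highest and single lowest mark, then average the rest."""
    if len(marks) < 3:
        raise ValueError("need at least three marks to trim")
    ordered = sorted(marks)
    return mean(ordered[1:-1])

def judges_disagree_on_sign(marks):
    """True when the same element received both positive and negative grades."""
    return min(marks) < 0 < max(marks)

# GoE grades quoted in the text for Jenny Kirk's triple Flip (Short program).
kirk_flip = [1, 1, 1, -1, -1, -2]

# Invented panel with a single extreme mark, for comparison only.
one_outlier = [1, 1, 1, 1, -3]

for label, marks in [("Kirk triple Flip", kirk_flip), ("one outlier", one_outlier)]:
    print(label)
    print("  plain mean:   ", round(mean(marks), 2))
    print("  trimmed mean: ", trimmed_mean(marks))
    print("  median:       ", median(marks))
    print("  split on sign:", judges_disagree_on_sign(marks))
```

In the invented panel, trimming and the median both neutralize the single outlier. In the Kirk example there is no lone outlier to remove: the panel is simply split, which neither trimming nor a median can repair.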
In my discussions above, I have not presented much evidence to support this conclusion, primarily because it would have been quite redundant. Beyond this, however, Sandra Loosemore has been making the point much more forcibly in her analyses of the numbers actually assigned by each judge to each skater. Loosemore points out that if the judges were using these scales properly, there would be very little variability across the judges rating any one of these five elements for any given skater (that is, the judges would agree with each other) and that there would be far more variability across the five elements for any one judge rating that skater (that is, judges would see the elements as distinct from one another). In fact, exactly the reverse has happened: of the 586 performances she has evaluated at the six Grand Prix events, only one (1) has shown the expected pattern, while 585 have shown the reverse. It is, indeed, difficult to see how things could be much worse.

Further criticisms could easily be adduced. For example, in other contexts Loosemore has pointed to such recorded scores as values of .50 and .25 for some of the performance elements at the Grand Prix, where most are in the range of 4.00 to 8.50. Clearly, these are simple errors in keypunching on the part of some judges, but they appear to stand, either because nobody has figured out how to fix them or because nobody cares, possibly because they were made by judges not included among those whose scores were actually used. Similarly, Loosemore has noted that some of the competitors at the Grand Prix events received scores as low as 2.00 to 2.75 in some of these presentation evaluations. This reasonably raises the question of how much room there will be left at the bottom of this range for novice or junior skaters, if even senior international competitors can be scored this low.
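The comparison Loosemore describes, spread across judges versus spread across the five elements, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: both mark sheets are invented, `halo_pattern` is a hypothetical helper, and using average standard deviations is my choice of statistic, not necessarily the one used in her analyses.

```python
import numpy as np

def halo_pattern(marks):
    """Return True when judges differ from one another more than each judge
    differentiates among the five presentation elements.

    `marks` has one row per judge and one column per element."""
    marks = np.asarray(marks, dtype=float)
    between_judges = marks.std(axis=0, ddof=1).mean()    # spread across judges, per element
    within_judge = marks.std(axis=1, ddof=1).mean()      # spread across elements, per judge
    return bool(between_judges > within_judge)

# Invented sheet: each judge gives nearly the same mark to all five elements,
# but the judges sit at very different overall levels (the "halo" pattern).
halo_marks = [
    [5.50, 5.25, 5.50, 5.50, 5.25],
    [6.75, 6.50, 6.75, 6.75, 6.50],
    [4.75, 4.75, 5.00, 4.75, 4.75],
]

# Invented sheet: judges agree closely with one another while clearly
# distinguishing the five elements (the pattern one would expect).
distinct_marks = [
    [7.00, 5.00, 6.00, 4.50, 5.50],
    [7.25, 5.00, 6.25, 4.50, 5.50],
    [7.00, 5.25, 6.00, 4.75, 5.75],
]

print("halo sheet shows halo pattern:    ", halo_pattern(halo_marks))
print("distinct sheet shows halo pattern:", halo_pattern(distinct_marks))
```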
As most people are aware, the CoP was thrown together rather hurriedly in a fairly blatant attempt by the ISU to salvage a reputation badly damaged by the Salt Lake City Olympic scandal. There now appear to be very many indications to suggest that this rush to implementation was a badly thought out maneuver for a system that may possess some redeeming features, but at present, seems to avoid major embarrassment only by virtue of the fact that the judges appear to be ignoring it as best they can.