Nebelhorn's New Numbers
An Introductory Look at the Code of Points (CoP)
October 17, 2003

Introduction
The first public test of the ISU's new Code of Points (CoP) scoring system, at the recent Nebelhorn Trophy competition in Germany, has finally given the skating world an opportunity to assess this revision of the former, simple six-point scoring system. Although a detailed analysis of this system will not be possible for many years, the Nebelhorn Trophy competition is sufficient to address a few of the most obvious questions.
The CoP scoring criteria are available at the ISU website. The massive 159-page document details the point values of every imaginable jump, spin, footwork sequence, lift, throw, spiral, etc. However, for those of us who did not attend or see the Nebelhorn competition, it's difficult to judge whether this system actually does what its proponents claim it will: provide a better scoring mechanism and reduce cheating among the judges. All we can do now is look at some of the numbers Nebelhorn has generated and ask a few questions about how they seem to be functioning in practice.

Question 1: Do we really need all these numbers?
The summary results of the Nebelhorn Trophy competition, posted at the ISU's website, list one technical score and five further scores, identified as Skating Skills, Transitions, Performance, Choreography, and Interpretation. Since they are not identified by any single summary name, we may as well go on calling them "presentation" marks. From this, it might appear that the presentation aspects of any skating program are worth five times as much as the technical, but the numbers are so weighted that a glance at the actual scores will indicate that they come out about 50-50; that is, the sum of the presentation scores for any skating performance is roughly in the same ballpark as the sum of the technical scores, whose components make up the bulk of that 159-page scoring criteria book.
This 50-50 weighting assumes that the presentation scores are each independent of all the others, which on the face of it appears unlikely. Consider a common situation quite analogous to these scores: an educational abilities test. Assume it has a mathematics and an English (or language) section, the former made up of tests of algebra, plane geometry, trigonometry, and calculus, the latter made up of tests of spelling, grammar and punctuation, usage, and expression. If each is scored on a scale of 0 to 100, and the scores for any student are added, it might appear that each element is as important as any other in determining the final score. But in actual fact, the math tests are largely independent of each other, in the sense that a student who is very good at, for example, geometry, might know nothing of, for example, calculus. The language tests, on the other hand, will be quite interdependent, with students who are very good at any one of them far more likely to be good at the others as well. Because of this overlap in the language scores, one or more of them could be 'thrown out' without seriously affecting the relative ordering of students from best to worst that the total scores produce. In an extreme case, for example, consider what would happen if one of the language tests consisted of the single item: spell your own name. Presumably, everyone would get it right and it would contribute absolutely nothing to our knowledge of the students' intelligence or abilities.
This, if we read "technical scores" for "math" and "presentation scores" for "English", is exactly the situation that we find at Nebelhorn. Only one (and it does not matter which one) of the five presentation scores contributes meaningfully to the total score any performance receives, while the other four (despite their elaborate criteria and complex weighting systems) might just as well not have been collected.
An example below shows the relevant numbers from the Pairs Short and Free Programs at Nebelhorn. The first of each pair of numbers gives the actual score (the sum of the technical and all five presentation scores) that each performance received. The second number gives the estimated score using only one (I chose "Performance" since it had the most familiar name) of those scores, multiplied by five. The two final columns give the totals of either the actual or estimated scores. (To save space, the competitors are identified only by the first three letters of the female and male partner.)
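The estimate described above can be sketched in a few lines of Python. The numbers here are made up purely for illustration and are not the Nebelhorn results, the team labels are hypothetical, and the sketch assumes one reading of the text: that the estimated score is the technical score plus five times the single "Performance" mark.

```python
# Hypothetical scores for illustration only; these are NOT the Nebelhorn results.
# Each entry: (team, technical score, [Skating Skills, Transitions,
#              Performance, Choreography, Interpretation])
scores = [
    ("AAA/BBB", 30.0, [6.0, 5.5, 6.0, 5.75, 6.0]),
    ("CCC/DDD", 28.0, [5.5, 5.25, 5.5, 5.5, 5.25]),
    ("EEE/FFF", 29.0, [5.0, 4.75, 5.0, 5.0, 4.75]),
    ("GGG/HHH", 24.0, [4.5, 4.5, 4.25, 4.5, 4.5]),
]

PERFORMANCE = 2  # position of the "Performance" mark in each list


def actual_total(technical, components):
    """Technical score plus all five presentation marks."""
    return technical + sum(components)


def estimated_total(technical, components, keep=PERFORMANCE):
    """Technical score plus one presentation mark counted five times
    (an assumed reading of the article's 'multiplied by five')."""
    return technical + 5 * components[keep]


def ranking(totals):
    """Team names ordered from highest total to lowest."""
    return [team for team, _ in sorted(totals, key=lambda p: p[1], reverse=True)]


actual = [(team, actual_total(t, c)) for team, t, c in scores]
estimated = [(team, estimated_total(t, c)) for team, t, c in scores]

# True when dropping four of the five marks leaves the order unchanged.
print(ranking(actual) == ranking(estimated))
```

If the five marks move together, as the article argues they do, the two rankings will rarely differ.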
The differences between the actual and estimated scores throughout this table are minimal and, in only one case in the Short Program, lead to even the slightest change in the rank order of the competitors. Similar results were obtained for the men's and ladies' events.
To illustrate these findings more clearly, consider the Men's Short program. Nicholas Young, who came in first, received the highest scores of any competitor for each of the five elements. Alexei Vasilevski, in fourth place, received the second highest on four of the five, while Nicholas LaRoche, in third place, received the third highest on four of the five, switching order with Vasilevski on the fifth. Only one of the fifteen scores assigned to these three competitors fell below 6.00, while none of the 70 scores assigned to the remaining competitors reached 6.00. Not only is the consistency of these numbers striking in itself, it is made even more dramatic by the fact that second-place Scott Smith gets consistently lower scores on every one of these elements. This suggests that a global "I like this skater" factor, which we would have immediately suspected under the ordinal system, cannot account for it. Thus, if the ISU had hoped the CoP system would break down the old presentation marks into more distinct elements, these early data from Nebelhorn give no indication that this system has allowed the judges to get that message.
Admittedly, part of the problem here lies in the fact that the summary scores we are dealing with are not only averaged across judges, but actually averaged across a sample of seven judges drawn at random and anonymously from a pool of eleven, with the high and low score for each skater removed. Analyses of the protocols of individual judges, were that possible, would probably indicate more disagreement and more targeted use of the different scoring categories.
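A minimal sketch of the selection and trimming procedure just described may make its effect easier to see. It assumes that the high and low marks are dropped from the seven drawn judges rather than from the full pool of eleven, which the article does not specify; the function name and the marks are hypothetical.

```python
import random


def panel_score(marks, panel_size=7, rng=random):
    """Average one skater's marks after a random draw and a trim.

    Assumed order of operations: draw `panel_size` judges at random,
    drop the single highest and single lowest mark, average the rest.
    """
    panel = rng.sample(marks, panel_size)  # random draw, e.g. 7 of 11
    trimmed = sorted(panel)[1:-1]          # discard the low and the high mark
    return sum(trimmed) / len(trimmed)


# Hypothetical marks from eleven judges for one component; one judge
# (4.0) disagrees sharply with the rest.
marks = [5.25, 5.5, 5.5, 5.75, 5.75, 5.75, 6.0, 6.0, 6.25, 6.5, 4.0]

print(panel_score(marks, rng=random.Random(2003)))
```

Whenever the dissenting 4.0 is drawn it is also the lowest mark on the panel, so it is always discarded and never reaches the published average.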
Unfortunately, as long as this selection and, in particular, these trimming procedures are used, even the best efforts by the judges will simply be wasted in the final crunch.

Question 2: Are they consistent?
Although fans love to argue about the minutiae of skaters' placements at any event, many people agree that the judges have been doing a pretty good job of identifying the bad, average, good, and very good performances at any competition. This "pretty good job" is reflected in several different ways in the statistics that can be computed from the judges' scores. For example, since the technical aspects of skaters' performances count more heavily in the short program, and presentation aspects in the long, one would expect that the relative weights actually given to these scores by the judges would reflect those ratios. Thus, the technical scores should contribute more to the overall score in the short program, the presentation scores in the long. Ideally, in fact, the relative contributions would fall somewhere around a two-thirds to one-third split, reflecting the actual weighting given to those scores in the determination of the final standings.
Now let's look at some of the numbers derived from the Nebelhorn data and compare them with similar numbers derived from another international competition. I used data from the 2001 World's because they happened to have been in my computer with most of the numbers needed here already calculated. However, presumably any other competition would have served as well. The first two columns of numbers of the following table give the actual weights assigned to the presentation and technical scores by the judges at the 2001 World's, while the third column gives the weight actually assigned to "presentation" at Nebelhorn, all as proportions of the total. (Thus, the weights for the technical scores at Nebelhorn can be found simply by subtracting each of these numbers from 1.000.)
Note that these are not the weights assigned by the scoring procedures, but rather were derived from the sort of statistical analysis that would have assigned a weight of zero to the "spell your name" question on the English test described in the example above.
Disclaimer: Technically, these numbers are the ratios of the beta weights in the multiple regression equations using final rank order as criterion and raw technical and presentation scores as predictors in each competition. In some cases, particularly when the number of competitors is small, this can lead to anomalous results, as happened here in the case of the Pairs Short program at Nebelhorn: with only eight competitors, technical scores were essentially perfectly correlated with final placement, so that no room was left for presentation scores to add to the equation. This sort of problem is far more prevalent in the dance scores, since the judges' agreement is generally much higher in dance than in the other disciplines. For this reason those scores have not been included in these analyses.
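For readers who want to see what such a calculation looks like, here is a minimal sketch using the standard two-predictor formulas for standardized regression (beta) weights. How the two betas are combined into a single "weight" is an assumption on my part (each beta's absolute value as a share of the two together), and the function names and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def beta_weights(criterion, technical, presentation):
    """Standardized regression weights for two predictors."""
    r_yt = pearson(criterion, technical)
    r_yp = pearson(criterion, presentation)
    r_tp = pearson(technical, presentation)
    denom = 1 - r_tp ** 2  # zero if the two predictors are perfectly correlated
    beta_t = (r_yt - r_yp * r_tp) / denom
    beta_p = (r_yp - r_yt * r_tp) / denom
    return beta_t, beta_p


def presentation_share(criterion, technical, presentation):
    """Presentation's share of the two beta weights (assumed definition)."""
    beta_t, beta_p = beta_weights(criterion, technical, presentation)
    return abs(beta_p) / (abs(beta_t) + abs(beta_p))


# Hypothetical data: final placements and raw scores for six skaters.
placement = [1, 2, 3, 4, 5, 6]
technical = [5.6, 5.5, 5.1, 5.2, 4.8, 4.5]
presentation = [5.7, 5.4, 5.5, 5.0, 4.9, 4.6]

print(round(presentation_share(placement, technical, presentation), 3))
```

When the technical scores alone reproduce the placements exactly, the presentation beta in this sketch comes out at zero, which is the kind of anomaly the disclaimer describes.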
With that disclaimer out of the way, we can now look at the numbers themselves. In the first two columns, it is apparent that the judges at the World's are indeed assigning weights more or less as we would expect, with presentation marks contributing somewhere in the neighborhood of 67% (two-thirds) to the final results in the Free programs, and fairly near 33% in the Short programs, although things get a little rocky in the Pairs Short program. However, no such give and take applies at Nebelhorn, where presentation accounts for less than 50% regardless of event or discipline and actually counts for less in the Free program than in the Short for both the Men's and the Ladies' competitions. Thus, it seems that the numbers actually assigned by the judges at Nebelhorn are not terribly consistent with the aspects of the skaters' abilities that the Short and Free programs are supposed to assess.

Question 3: Do they identify good and poor performers?
Just as most tests of English and math will, in practice, show a tendency for students who do well in one subject to also do well in the other (because some students happen to be brighter than others), so will technical and presentation scores tend to show a certain amount of overlap, merely because some skaters are simply better or more talented than others. Just how large this overlap should be is difficult to determine. My best guess would be to find what it is in general, and then accept that as what it most likely should be. In the recent years over which I've looked at this statistic, it tends to be very high (about 95% or more) for Dance, slightly lower (near 90%) for Ladies' events, variably lower (about 70% to 90%) for Pairs, and lowest (65% to 80%) for Men's events. This rank-ordering seems to make sense, in that it conforms quite well to a simple measure of 'danger' moves in each discipline. These overlap percentages for both the 2001 World's and Nebelhorn are shown in the next table.
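The article does not spell out how the overlap percentage is computed. Since the discussion that follows speaks of correlations, one plausible reading is the squared Pearson correlation between technical and presentation scores, expressed as a percentage. The sketch below works under that assumption only, with hypothetical scores.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def overlap_percent(technical, presentation):
    """Shared variance between the two sets of scores, as a percentage.

    Assumption: 'overlap' means the squared correlation; the article
    may instead have used the correlation itself.
    """
    return 100 * pearson(technical, presentation) ** 2


# Hypothetical scores for six skaters in one event.
technical = [5.6, 5.5, 5.1, 5.2, 4.8, 4.5]
presentation = [5.7, 5.4, 5.5, 5.0, 4.9, 4.6]

print(round(overlap_percent(technical, presentation), 1))
```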
It is clear that the numbers at the World's are uniformly high here, and consistent with those I have seen over the years at every event. However, they are markedly lower at Nebelhorn, except in the case of the Pairs Short program where, as we have already seen, the numbers are highly suspect. It is possible, of course, to give these lower numbers at Nebelhorn a positive spin, since they seem to indicate that presentation scores under CoP are far less tightly tied to technical scores than they used to be. Indeed, skating fans have complained about the overlap in technical and presentation scores ever since their earlier complaints led to the elimination of compulsory figures (paradoxically, on grounds that the results in this event failed to agree closely enough with those of the long and, later, short programs).
But my best guess now is that Nebelhorn's numbers are too low. In part, this is because the number of competitors in each event (eight in Pairs, no more than 16 in the other disciplines) was considerably lower than at the World's, and smaller fields will produce higher correlations even when the actual effect is of the same size. This means that the numbers given in the table would be reduced by about one-fifth if transplanted to the larger fields found at the World's or Olympics. And when this overlap falls below one-third, we are dealing with little better than chance relationships. Beyond this, the Nebelhorn data also fail to show anything like the reasonable progression of men's-pairs-ladies which is typical for these overlap statistics. Indeed, there does not appear to be any sort of consistent pattern to them at all. Putting all of that together, my best guess is that the present numbers are considerably lower than they should be, and testify more to the judges' difficulties with the new scoring system than to any virtues of that system itself. But only time will tell.

Conclusion?
The simplest conclusion to be drawn from all these analyses is that Nebelhorn's numbers don't seem to make much sense on even the most elementary levels. According to the data, these judges not only could not distinguish between technical and presentation programs meaningfully, but apparently could no longer even recognize basic skating talent with much reliability. Of course, this may not really be the fault of the Nebelhorn judges who were, after all, either the same as those at the 2001 World's or drawn from the same pool of international judges. Rather, it may be the scoring system that is to blame here: for making things unnecessarily difficult for the judges, for absolving them of responsibility to anyone but the referees and the ISU's own monitoring panels, and for vitiating their evaluations through the use of randomizing, mean-trimming, and averaging procedures which eventually reach results that no one may have anticipated. It will be interesting to see whether the difficulties exposed by the Nebelhorn statistics will continue to be borne out as more data become available after the coming Grand Prix events.