One specialist counts the rotations, another looks for edge changes. It means only one person is making the decision on a particular technical aspect, and therefore subjectivity plays a significant role. Is it more reliable than seven judges giving out assessments based on impression? I'm not sure. I'd need reliability studies to prove it. Let's think about this question: Which is more accurate: (1) a report based on one person who watches an object for 7 minutes, or (2) a report based on seven people who each watch an object for only 1 minute?
I'm not sure how that last question applies. It would depend on what the object is, I guess, and how it changes over time.
Anyway, what might be interesting would be to have a number of skaters perform a variety of spins, both simple and complex, and compare two or more of the following methods of evaluating them:
1) 9 judges each give one score (on a scale of 1-10? 1-6? with or without decimals?) for each spin based on overall impression
2) Skaters perform a set of three specified spins -- e.g., a flying spin, a layback, a combination spin with one change of foot and all three basic positions -- connected by whatever skating moves they like; judges give one score on a scale of 1-6 with decimals for each skater's set of three spins. Or two scores for the whole set of spins: one for average difficulty and technical quality of the three spins and the transitions directly in and out of them, another score for artistic impression of the spins themselves and of the connecting moves and performance as a whole
3) 1 official is assigned to identify kinds of spins (that's a flying sitspin, that's a combination spin with change of foot, etc.), and each kind of spin has a base mark. 9 judges each give a grade of execution from -5 to +5 according to specific guidelines. The middle score (5 or 0) means the spin meets the requirements for that kind of spin with acceptable quality. Lower scores reflect varying numbers or severity of mistakes or weaknesses according to specific rules. Higher scores reflect up to 3 levels of better quality and/or up to 3 areas of added difficulty at the discretion of each judge -- there would be a published list of examples, but if a skater gets creative with a brand new variation, each judge can decide for him/herself whether it looks difficult enough to reward with a point for difficulty. Difficulty points, positive quality points, and negative quality points can be added and subtracted up to the maximum and minimum and applied to the base value. A -5 for severe or multiple errors would subtract the full base value of the spin, to end up with no points
4) One group of officials (technical panel) is assigned to identify the element and, as a group, to determine which features from a predefined list were attempted and whether the skater executed each feature well enough to get credit for it, yes or no; yes for 2, 3, or 4 features earns higher base marks; a second group of judges assigns grades of execution -3 to +3 based on a list of errors and a list of positive qualities
5) The members of a group of technical judges each independently identify the elements and independently assign difficulty levels based on which features each one of them can recognize in real time; the computer takes the majority identification and averages the levels, but in certain ambiguous situations a conference review will occur afterward; a separate group of judges assigns grades of execution, positive and negative, as under 3), and they also take an additional deduction if the skater falls on the spin, so a spin that was bad in general and also earned a fall deduction could end up earning negative total points
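To make the arithmetic in method 3) a bit more concrete, here is a rough sketch of how it might work. The only fixed point taken from the description is that a -5 wipes out the full base value. Everything else is an assumption for illustration: that each grade step is worth 20% of the base value in both directions, that the nine judges' grades are simply averaged, and the function names and the base value of 3.0, which are made up.

```python
def net_goe(difficulty: int, quality: int, errors: int) -> int:
    """One judge's grade of execution under method 3).

    Up to 3 points for added difficulty and up to 3 for better quality,
    minus points for errors, clamped to the -5..+5 range.
    """
    if not (0 <= difficulty <= 3 and 0 <= quality <= 3 and errors >= 0):
        raise ValueError("difficulty and quality must be 0..3, errors >= 0")
    return max(-5, min(5, difficulty + quality - errors))


def spin_points(base_value: float, goes: list[int]) -> float:
    """Apply the panel's grades to the base value.

    ASSUMPTION: the grades are averaged, and each grade step is worth
    20% of the base value, so that -5 removes the whole base value.
    """
    if not goes:
        raise ValueError("need at least one judge")
    mean_goe = sum(goes) / len(goes)
    return base_value * (1 + mean_goe / 5)


if __name__ == "__main__":
    # Hypothetical spin with a made-up base value of 3.0, nine judges.
    panel = [net_goe(1, 2, 0)] * 5 + [net_goe(0, 1, 1)] * 4
    print(spin_points(3.0, panel))
```

With a different step size or a trimmed mean instead of a plain average, the totals would shift, but the structure (one official sets the base, the panel moves it up or down) stays the same.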
Choose at least two of the methods above. Instruct the skaters and officials about the rules of each method. Hold the test competitions.
Now, how do you measure which method is more reliable?
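One possible starting point, and only one of several, would be to look at how closely the judges on a panel agree with each other under each method. The sketch below computes the average pairwise Pearson correlation between judges' scores across the same set of skaters. The function names and the idea of using this particular statistic are my own assumptions here, not an established protocol for skating, and the panels in the usage example are invented numbers, not real results.

```python
from itertools import combinations
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two judges' scores for the same skaters."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length score lists with 2+ skaters")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("a judge gave every skater the same score")
    return cov / sqrt(vx * vy)


def mean_pairwise_agreement(scores_by_judge: list[list[float]]) -> float:
    """Average correlation over every pair of judges on the panel.

    scores_by_judge[j][s] is judge j's score for skater s.
    Values near 1 mean the judges order and space the skaters alike.
    """
    pairs = list(combinations(scores_by_judge, 2))
    if not pairs:
        raise ValueError("need at least two judges")
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Invented scores: 3 judges x 4 skaters under two hypothetical methods.
    method_a = [[5.0, 4.0, 3.0, 4.5], [5.2, 4.1, 2.9, 4.4], [4.8, 3.9, 3.2, 4.6]]
    method_b = [[5.0, 4.0, 3.0, 4.5], [3.5, 5.0, 4.0, 3.0], [4.0, 3.0, 5.0, 4.5]]
    print(mean_pairwise_agreement(method_a))
    print(mean_pairwise_agreement(method_b))
```

This only captures agreement within one panel on one occasion. It says nothing about whether a second panel, or the same panel on another day, would reach the same results, which is arguably closer to what "reliable" means in the question above.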