Unless there is reason to hope that the underlying distribution is symmetric...
A bell-like distribution is assumed when all judges are trained and calibrated to rate skaters based on uniform, specified criteria or rubrics. The training typically includes an assessment of rater reliability. The judges must be able to produce ratings within the acceptable range in order to pass the certification qualification.
the only methods that seem to give robust results are bootstrapping (iterative) techniques
After correcting extreme scores to be within the acceptable range (within one "level" from the mean or one full point in a 10-point scale or 10 points in a 100-point scale), Skater D, your favorite, won under CoP.
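For anyone who wants to see what "correcting extreme scores" means in practice, here is a minimal Python sketch. It is only an illustration: the function name `clamp_to_corridor`, the `width` parameter and the five-judge panel are made up, and I am assuming the mean is taken over the raw panel with the extreme score still included.

```python
from statistics import mean

def clamp_to_corridor(scores, width=1.0):
    """Pull any score farther than `width` from the panel mean back to the corridor edge.

    Assumption: the mean is computed on the raw scores, extremes included.
    """
    m = mean(scores)
    low, high = m - width, m + width
    return [min(max(s, low), high) for s in scores]

# Hypothetical panel on a 10-point scale (made-up numbers, not real results).
panel = [7.0, 7.0, 7.5, 7.0, 9.0]
corrected = clamp_to_corridor(panel)
print(corrected)       # [7.0, 7.0, 7.5, 7.0, 8.5]
print(len(corrected))  # 5
```

The raw mean here is 7.5, so the 9.0 is pulled back to 8.5 and nothing is thrown away. The corrected panel still has five scores.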
if there are five scores to begin with and two are discarded, the sample size is still 5, not 3
Yes. My post #118 was embarrassing. I was half dreaming, half awake, writing it in a rush with my husband waiting anxiously to go out together. I have edited my post #118. Better now?
But for interpretation, performance/execution, etc., there can be a greater difference of opinion
I'm not sure about that. The acceptable discrepancy for a judge's score in the Chopin Piano Competition (a competition mainly about interpretation, execution, etc.) is within one level (10 points on a 100-point scale) or even less (5 points in the finals). Your statement might not have empirical support. Raters of abstract skills (writing, for example) are able to reduce their biases (e.g., toward certain criteria) and become highly consistent after rater training (http://www.ijlt.ir/portal/files/401-2011-01-01.pdf).
Figure skating is a judged sport, not a measured sport. The CoP tries to force a square peg (measurement) into a round hole (judgement). Skatinginbc is an expert at doing just that.
To make sure we don't lose each other in the semantics, I would like to clarify some definitions. By measurement I mean "the assignment of numerals to objects or events according to some rule" (Stanley Smith Stevens, 1946, http://en.wikipedia.org/wiki/Psychometrics). Under this definition, assigning either rankings or points to skaters' performances based on specific rules or criteria is an act of measuring, and the rankings or points derived from such assessments are measurements. If you prefer calling it judgment, then let's call it judgment, or assessment or evaluation or appraisal or whatever pleases you. The measuring process can be either a norm-referenced assessment (comparing skaters against each other) or a criterion-referenced assessment (judging a performance against an explicit set of mastery standards or levels). Either way, it can be rated (ranked or scored) holistically or analytically.
Holistic vs analytic assessment:
Is a quick, overall evaluation needed? Yes. It is difficult for the judges to focus on tiny details of so many elements and still be able to evaluate the overall performance, interpretation and choreography simultaneously.
Does the performance mean more than the sum of its parts? Yes, at least in the eyes of the fans.
Can a skating performance meaningfully be broken down into distinct dimensions? Yes. Big tricks (jumps and spins) are distinct from the rest. I'm not sure about footwork and transitions though.
Is it necessary to provide formative feedback on the dimension in question? Probably. Although providing feedback to the skaters is not mandatory, people would like to know why certain skaters receive higher/lower scores.
This is my position:
Skating is a sport, so criterion-referenced assessments should be employed. Skating is more than the sum of its parts, so holistic assessments should be used, except that the big tricks should be separated from the rest as a distinct category (i.e., technical element scores). Therefore, skating in my opinion should adopt a holistic, criterion-referenced measure (the vicious word that Mathman hates).
If skating were solely a performance art, not a sport, I wouldn't mind if the last stage of the competition (say, containing only the final 6) were rated with a ranking order. Assigning rankings to a long list of competitors is not a good idea in my mind. The differences in the middle part of the bell curve are relatively small, full of flip-flops, tie-breakers and all that. A point system can do a better job in that respect.
One of the criticisms of the IJS is that it promotes "corridor judging." No one wants to get an anomaly, so you tend to give the score that you think the rest of the judges will give, rather than "voting your conscience."
"Corridor judging" is in fact desired, perfectly in line with the main goal of judge training, namely, to reduce rater biases and differences in severity (too stringent or too lenient). I think what you meant was "reputation judging": a judge assigns scores based on the skaters' past results or reputation to avoid an extreme score that might result in a disciplinary action. Well, is there any research study suggesting that reputation judging is more rampant under CoP than under 6.0? My impression is the opposite: Margarita Drobiazko & Povilas Vanagas wrote a letter to Cinquanta in Feb 2002 complaining that the judges relied on reputation rather than the actual performances. That was under the 6.0 system. How do you explain Javier Fernández's sudden jump in PCSs from 2011 Worlds to 2011 Skate Canada with your criticism of "corridor judging"? Was the outcome based on reputation or past results? Or was it because most judges recognized Javier's improved quality?
This is Javier's FP Transition score in 2011 Worlds:
5.75 6.00 6.25 6.00 6.50 6.00 6.50 6.00 6.00 (Mean = 6.11) (Note the amazing consistency among judges. No score was more than one point from the mean.)
This is Javier's FP Transition score in 2011 Skate Canada:
8.50 8.00 8.75 8.00 8.00 7.25 7.50 7.00 6.75 (Mean = 7.75) (Note the amazing consistency among judges. No score was more than one point from the mean.)
So they were "corridor judging" by your definition. Was that a bad thing or something good that should be promoted? It was clearly not based on the past results or reputation. The fact that so many judges dared to assign a score one point higher than the past results proved the merit of CoP (criterion-referenced measure).
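If you want to check my arithmetic on the two Transition panels, a few lines of Python will do it. This is just a sketch: the helper name `panel_summary` and the `corridor` parameter are my own, and I am treating "within one point of the mean" as inclusive of exactly one point.

```python
from statistics import mean

# Javier's FP Transition scores as quoted in this post.
worlds_2011 = [5.75, 6.00, 6.25, 6.00, 6.50, 6.00, 6.50, 6.00, 6.00]
skate_canada_2011 = [8.50, 8.00, 8.75, 8.00, 8.00, 7.25, 7.50, 7.00, 6.75]

def panel_summary(scores, corridor=1.0):
    """Return (mean, largest absolute deviation from the mean, all within corridor?)."""
    m = mean(scores)
    max_dev = max(abs(s - m) for s in scores)
    # Assumption: a score exactly `corridor` away from the mean still counts as inside.
    return round(m, 2), round(max_dev, 2), max_dev <= corridor

print(panel_summary(worlds_2011))        # (6.11, 0.39, True)
print(panel_summary(skate_canada_2011))  # (7.75, 1.0, True)
```

Note that at Skate Canada the highest and lowest scores (8.75 and 6.75) sit exactly one point from the mean, so that panel stays inside the corridor only because the boundary is counted as acceptable.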
But for PCS the anomalies subtract. If you are too high on interpretation but too low on choreography, that counts as 0 anomalies.
What does that tell us? It means that the design was an "analytic holistic assessment", a global synthetic judgement with specified categories to ensure that no particular aspects of the performance are overly valued by some judges and ignored by others. It is similar in design to most piano competitions where categorical aspects are specified. For instance, the David Lang Piano Competition (http://www.redpoppymusic.com/rules/official-rules.pdf) uses:
● Interpretation of the score to the Work (20 points maximum)
● Musicianship (20 points maximum)
● Vitality of performance (20 points maximum)
● Originality of performance (20 points maximum)
● Evaluation of the performance as a whole (20 points maximum)
I cannot find the scoring criteria for the Chopin Piano Competition on the internet, but as far as I can remember, it consists of 4 or 5 categories (probably 4, because my memory tells me 25 points maximum for each category). Remember that in a previous post I pointed out that the raw score given by a piano judge will be adjusted if it is off by more than 10 points? That is based on the total score, not on the individual category scores. The categories serve mainly as a reminder for the judges so that they can apply similar criteria in their judging. A high inter-category correlation is, however, expected, given that those skills are highly integrated and it is NOT very meaningful to clearly separate them.
If that is indeed also the principle behind CoP, the criticism about judges' similar PCSs across all categories becomes meaningless.