Well, this is what I think. There is a difference between judging and measuring. The kind of statistical analysis under view applies to data that you get by taking measurements. You take many measurements of the length of a steel rod, then you take the mean of those measurements, compute the standard deviation (hoping it is not too big), and there you are.
Actually, judging is a type of measuring (or measuring is a type of judging, whichever definition you hold to be more expansive). Both terms refer to comparing some entity against a predefined metric. In measuring a steel rod, it's comparing the steel rod to a predefined unit of length. In figure skating, it's comparing a skater's performance to a predefined set of rules about how points are given out. The points are then summed up for each skater, and the skaters are ranked according to their totals. How the points are given out may not be fully quantified (e.g. the height of a jump, while it can be objectively measured, is a subjective estimate under the current judging method), but just because a measurement is subjective or not quantified does not make it arbitrary or capricious.
The points are not just ordinal but also cardinal: they carry information about not only "whether or not" but also "how much". For example, it would be difficult to say whether a 3rd place routine in one competition is better than a 1st place routine in another competition with the rankings alone as your information. But assuming that the points were given out in the same way in both competitions, you could say which was better, in the sense of having gained more points under the predefined rules. The problem is that judging may not be the same in different competitions, and indeed may not be the same for different skaters within a single competition, but at that point you have a calibration problem: whether or not the measurement system is actually conforming to the agreed-upon metric.
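To make the ordinal vs. cardinal point concrete, here's a small Python sketch. The skater names and point totals are invented purely for illustration, and the comparison at the end only means anything under the assumption stated above, that both panels gave out points in the same way.

```python
# Hypothetical total scores; the numbers are invented for illustration only.
competition_a = {"skater_1": 212.4, "skater_2": 205.1, "skater_3": 198.7}
competition_b = {"skater_4": 190.2, "skater_5": 185.0, "skater_6": 171.3}


def ranking(scores):
    """Return skaters ordered from highest to lowest total (ordinal information)."""
    return sorted(scores, key=scores.get, reverse=True)


third_in_a = ranking(competition_a)[2]
first_in_b = ranking(competition_b)[0]

# Ranks alone (3rd vs. 1st) say nothing across events. The point totals can be
# compared, but ONLY if both panels applied the rules the same way.
margin = competition_a[third_in_a] - competition_b[first_in_b]
print(f"{third_in_a} (3rd in A) vs {first_in_b} (1st in B): margin {margin:+.1f} points")
```

Here the 3rd place routine in A outscores the 1st place routine in B, which is something the rankings by themselves could never tell you.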
I think you may be thinking of "judging" as deciding which is better or worse, i.e. eliciting preferences such as "I judge that vanilla is better than chocolate" or "I like brunettes more than blondes". In that case, yes, preferences can only be ranked. But that's not what's happening here. The judges do not (ostensibly) directly decide which skater was better or worse; rather, that's what the rules do (or more precisely, how the rules are used). The rules specifically assign points to the different moves that a skater may do.
We then interpret the points to rank who was better (i.e. skater X had more points than skater Y, thus skater X's performance was better than skater Y's). The judges are evaluating the skaters relative to those rules. The judges do not say that an axel is better than a lutz. They are (again, ostensibly) comparing a skater's jump to the definition of an axel as well as comparing the jump's metrics (height, etc.) to predefined criteria for assessing GOEs etc. The judges do not (again, ostensibly) simply say "I like skater X better than Y, therefore skater X gets more points".
Of course, they're not supposed to, and so when there are differences between the given scores and the established rules, people will suspect that the judges actually did go by "I like skater X more" or other personal reasons when giving out points, and thus there will be controversy, because the judges are effectively inserting their own preferences into the ranking system.
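As a rough illustration of how one might look for that, here is a Python sketch that treats each judge as a measuring instrument and compares their scores to the panel mean. The panel, the scores, and the helper names (`residuals`, `offset`) are all made up for this example; it is not how any federation actually audits judging, and with a panel this small a large residual proves nothing on its own.

```python
from statistics import mean, stdev

# Hypothetical panel: scores[judge][skater]; all values invented.
scores = {
    "judge_A": {"X": 8.0, "Y": 7.0, "Z": 6.0},
    "judge_B": {"X": 8.5, "Y": 7.5, "Z": 6.5},
    "judge_C": {"X": 7.5, "Y": 8.5, "Z": 5.5},
}
skaters = ["X", "Y", "Z"]

# Panel mean and spread per skater, as with repeated measurements of a rod.
panel_mean = {s: mean(j[s] for j in scores.values()) for s in skaters}
panel_sd = {s: stdev(j[s] for j in scores.values()) for s in skaters}


def residuals(judge):
    """Per-skater deviation of one judge from the panel mean."""
    return {s: scores[judge][s] - panel_mean[s] for s in skaters}


def offset(judge):
    """Average deviation: a constant offset looks like a calibration issue."""
    return mean(residuals(judge).values())


for judge in scores:
    res = residuals(judge)
    worst = max(res, key=lambda s: abs(res[s]))
    print(f"{judge}: offset {offset(judge):+.2f}, largest residual on {worst} ({res[worst]:+.2f})")
```

A judge who is uniformly generous or harsh shows up as an offset, which is the calibration problem. A judge who departs from the panel on one particular skater shows up as a large residual for that skater, which is the pattern people tend to read as personal preference.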
Because the judges are comparing each skater's performance to defined rules, and in fact do so in a quantitative way, it does make sense to use parametric methods to analyze the judging. The judges themselves are not defining value but measuring it (more precisely, measuring a quantity to which we then ascribe value, not value itself), and it's perfectly valid to consider them as measurement devices with all the statistical implications. Now you can certainly claim things like the judges' scores aren't actually normally distributed and such, but that's a well-known modeling issue; very few things (or at least, things that we want to measure such as weight, height, etc.) are actually normally distributed, if for no other reason than that they can only have positive values while any normal distribution assigns a nonzero probability to negative values. Instead we take that as an acceptable modeling error and move on; this is why I said "All models are wrong, but some are useful" to your postulates. If I were bored I could probably argue that the judges are effectively measuring in a binomial sense (i.e. yes/no for each of the elements to receive points under the rules) and thus it makes sense that the scores should be approximately normally distributed via the central limit theorem, but I know that theorem is badly overused and I don't want to go through the assumptions (such as independence, which may not hold here).
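For what it's worth, the binomial argument can be sketched numerically in Python. This assumes exactly the thing I said may not hold: that a score is the count of independent yes/no criteria, each met with the same probability. The values `n = 50` and `p = 0.7` are arbitrary choices for illustration, not anything taken from the actual judging rules.

```python
from math import comb, erfc, sqrt


def binom_cdf(k, n, p):
    """Exact P(S <= k) for S ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_cdf(x, mu, sigma):
    """Normal CDF written with erfc so that far-tail values do not round to zero."""
    return 0.5 * erfc(-(x - mu) / (sigma * sqrt(2)))


# ASSUMPTION: 50 independent yes/no criteria, each met with probability 0.7.
n, p = 50, 0.7
mu, sigma = n * p, sqrt(n * p * (1 - p))

# Largest gap between the exact binomial CDF and the normal approximation
# (using the usual +0.5 continuity correction).
max_gap = max(
    abs(binom_cdf(k, n, p) - normal_cdf(k + 0.5, mu, sigma)) for k in range(n + 1)
)

# Probability the fitted normal assigns to impossible negative scores.
p_negative = normal_cdf(0, mu, sigma)

print(f"max CDF gap: {max_gap:.4f}")
print(f"P(score < 0) under the normal model: {p_negative:.3e}")
```

Under these assumptions the normal model tracks the exact distribution closely, and the probability it puts on negative scores is nonzero but negligibly small, which is the "acceptable modeling error" I mean. If the criteria are correlated (e.g. a judge's overall impression leaks into every element), the approximation can be much worse than this sketch suggests.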